linux-kernel.vger.kernel.org archive mirror
* [take11 0/3] kevent: Generic event handling mechanism.
       [not found] <12345678912345.GA1898@2ka.mipt.ru>
@ 2006-08-17  7:43 ` Evgeniy Polyakov
  2006-08-17  7:43   ` [take11 1/3] kevent: Core files Evgeniy Polyakov
  2006-08-21 10:19 ` [take12 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-17  7:43 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


Generic event handling mechanism.

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added a header shared between userspace and kernelspace instead of embedding everything in one
 * core restructuring to remove forward declarations
 * some whitespace and coding style cleanups
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use nopage() method to dynamically substitute pages
	- allocate new page for events only when newly added kevent requires it
	- do not use ugly index dereferencing, use structure instead
	- reduced amount of data in the ring (id and flags), 
		maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect whether an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * fixed lockdep warnings - all storage locks are initialized in the same function, so lockdep was taught
	to differentiate between the various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in mmapped buffer implementation which would end up in wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure; instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added an optimisation aimed at the case when a lot of kevents are copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
			unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent 
	initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match
	kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>




* [take11 3/3] kevent: Timer notifications.
  2006-08-17  7:43     ` [take11 2/3] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-08-17  7:43       ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-17  7:43 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig



Timer notifications.

Timer notifications can be used for fine-grained per-process time 
management; interval timers are inconvenient to use and limited in 
their capabilities.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..5217cd1
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,107 @@
+/*
+ * 	kevent_timer.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct timer_list	ktimer;
+	struct kevent_storage	ktimer_storage;
+};
+
+static void kevent_timer_func(unsigned long data)
+{
+	struct kevent *k = (struct kevent *)data;
+	struct timer_list *t = k->st->origin;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+	
+	mod_timer(&t->ktimer, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+
+	return 0;
+
+err_out_st_fini:	
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	del_timer_sync(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = (__u32)jiffies;
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback, 
+		.enqueue = &kevent_timer_enqueue, 
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);



* [take11 1/3] kevent: Core files.
  2006-08-17  7:43 ` [take11 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-08-17  7:43   ` Evgeniy Polyakov
  2006-08-17  7:43     ` [take11 2/3] kevent: poll/select() notifications Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-17  7:43 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


Core files.

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
 	.quad sys_tee
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl
 ia32_syscall_end:		
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_kevent_get_events	318
+#define __NR_kevent_ctl		319
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 320
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..eef9709
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,174 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MAX_EVENTS	4096
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time a new event has been caught. */
+/* @enqueue is called each time a new event is queued. */
+/* @dequeue is called each time an event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's queue. */
+	struct list_head	kevent_entry;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * poll()/select() storage keeps a list of wait_queue_t containers
+	 * here, one for each poll_wait() call made from ->poll().
+	 */
+	void			*priv;
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_user
+{
+	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t 		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+	
+	unsigned int		pages_in_use;
+	/* Array of pages forming mapped ring buffer */
+	unsigned long		*pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n", 
+			__func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)		({ (void) u;})
+#define kevent_stat_init(u)		({ (void) u;})
+#define kevent_stat_im(u)		({ (void) u;})
+#define kevent_stat_wait(u)		({ (void) u;})
+#define kevent_stat_total(u)		({ (void) u;})
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..8609910 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, 
+		unsigned int timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..a756e85
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,31 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables event queue mechanism.
+	  It can be used as replacement for poll()/select(), AIO callback 
+	  invocations, advanced timer notifications and other kernel 
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	default n
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes total number of kevent, number of kevents 
+	  which are ready immediately at insertion time and number of kevents 
+	  which were removed through readiness completion. 
+	  It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows using the kevent subsystem for poll()/select()
+	  notifications.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..ab6bca0
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,3 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..e16e1fa
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,240 @@
+/*
+ * 	kevent.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -EINVAL;
+
+	if (!k->callbacks.enqueue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+	
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -EINVAL;
+	
+	if (!k->callbacks.dequeue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return 0;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(struct kevent_callbacks *cb, int pos)
+{
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+	kevent_registered_callbacks[pos] = *cb;
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If this fails, the kevent must not be used, since kevent_enqueue() will
+ * refuse to add it into the origin's queue and will set the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (k->event.type >= KEVENT_MAX)
+		return -EINVAL;
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (!k->callbacks.callback) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue. 
+ * It does not decrease origin's reference counter in any way,
+ * and must be called before the counter is dropped, so the storage is still valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem = 0;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0) {
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	} else if (ret < 0) {
+		k->event.ret_flags |= KEVENT_RET_BROKEN;
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	}
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	if (!ret)
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+		
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+	
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(k, &st->list, storage_entry) {
+		if (ready_callback)
+			(*ready_callback)(k);
+
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	}
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken, which will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..2ced76f
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,983 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+static int kevent_get_sb(struct file_system_type *fs_type, 
+		int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	/* So original magic... */
+	return get_sb_pseudo(fs_type, kevent_name, NULL, 0xbcdbcdul, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+	.name		= kevent_name,
+	.get_sb		= kevent_get_sb,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *kevent_mnt;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM 
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+	
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+/*
+ * Note that kevents do not exactly fill the page (each mukevent is 40 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change size of struct ukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		index;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+	struct kevent_mring *ring;
+
+	ring = (struct kevent_mring *)u->pring[0];
+	ring->index = num;
+}
+
+static inline void kevent_user_ring_inc(struct kevent_user *u)
+{
+	struct kevent_mring *ring;
+
+	ring = (struct kevent_mring *)u->pring[0];
+	ring->index++;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+	struct kevent_mring *ring;
+	unsigned int idx;
+
+	ring = (struct kevent_mring *)u->pring[0];
+
+	idx = (ring->index + 1) / KEVENTS_ON_PAGE;
+	if (idx >= u->pages_in_use) {
+		u->pring[idx] = __get_free_page(GFP_KERNEL);
+		if (!u->pring[idx])
+			return -ENOMEM;
+		u->pages_in_use++;
+	}
+	return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = (struct kevent_mring *)k->user->pring[0];
+	
+	pidx = ring->index/KEVENTS_ON_PAGE;
+	off = ring->index%KEVENTS_ON_PAGE;
+
+	copy_ring = (struct kevent_mring *)k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->index >= KEVENT_MAX_EVENTS)
+		ring->index = 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace could get them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int pnum;
+
+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+	u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	u->pring[0] = __get_free_page(GFP_KERNEL);
+	if (!u->pring[0])
+		goto err_out_free;
+
+	u->pages_in_use = 1;
+	kevent_user_ring_set(u, 0);
+
+	return 0;
+
+err_out_free:
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+	
+	for (i = 0; i < u->pages_in_use; ++i)
+		free_page(u->pring[i]);
+
+	kfree(u->pring);
+}
+
+
+/*
+ * Allocate new kevent userspace control entry.
+ */
+static struct kevent_user *kevent_user_alloc(void)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return NULL;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+		INIT_LIST_HEAD(&u->kevent_list[i]);
+	
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (kevent_user_ring_init(u)) {
+		kfree(u);
+		u = NULL;
+	}
+
+	return u;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = kevent_user_alloc();
+	
+	if (!u)
+		return -ENOMEM;
+
+	file->private_data = u;
+	
+	return 0;
+}
+
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+	struct kevent_user *u = vma->vm_file->private_data;
+	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
+	if (off >= u->pages_in_use)
+		goto err_out_sigbus;
+
+	return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+	return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+	.nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start;
+	struct kevent_user *u = file->private_data;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_ops = &kevent_user_vm_ops;
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page((void *)u->pring[0])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * The entry is freed in an RCU callback; it has been dequeued from all
+ * lists at this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removal - it dequeues the kevent from the storage list
+ * if requested, removes it from the ready list, drops the userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from the user's list of all events,
+ * dequeue it from its storage and decrease the user's reference counter,
+ * since this kevent no longer exists. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search a hash bucket for the kevent matching the given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk, 
+		struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+	
+	list_for_each_entry(k, head, kevent_entry) {
+		spin_lock(&k->ulock);
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] && 
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			ret = k;
+			spin_unlock(&k->ulock);
+			break;
+		}
+		spin_unlock(&k->ulock);
+	}
+
+	return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	int err = -ENODEV;
+	unsigned long flags;
+	
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added to or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+		list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+			kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read all ukevents from userspace and modify the appropriate kevents.
+ * If the provided number of ukevents is larger than the threshold, it is
+ * faster to allocate room for all of them and copy them in one shot than
+ * to copy and process them one-by-one.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+	
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read all ukevents from userspace and remove the appropriate kevents.
+ * If the provided number of ukevents is larger than the threshold, it is
+ * faster to allocate room for all of them and copy them in one shot than
+ * to copy and process them one-by-one.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+	
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+	
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+	unsigned long flags;
+	unsigned int hash = kevent_user_hash(&k->event);
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+	k->flags |= KEVENT_USER;
+	u->kevent_num++;
+	kevent_user_get(u);
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues a kevent; it returns a negative value
+ * on error, a positive value if the kevent is ready immediately and zero
+ * if the kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	if (kevent_user_ring_grow(u)) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	kevent_user_enqueue(u, k);
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+	} else {
+		kevent_user_ring_inc(u);
+	}
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	}
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them into the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the
+ * number of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+		goto out_remove;
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u, 
+		unsigned int min_nr, unsigned int max_nr, unsigned int timeout, 
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait, 
+			u->ready_num >= min_nr, msecs_to_jiffies(timeout));
+	}
+	
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent), 
+					&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+
+/*
+ * Userspace control block creation and initialization.
+ */
+static int kevent_ctl_init(void)
+{
+	struct kevent_user *u;
+	struct file *file;
+	int fd, ret;
+
+	fd = get_unused_fd();
+	if (fd < 0)
+		return fd;
+
+	file = get_empty_filp();
+	if (!file) {
+		ret = -ENFILE;
+		goto out_put_fd;
+	}
+
+	u = kevent_user_alloc();
+	if (unlikely(!u)) {
+		ret = -ENOMEM;
+		goto out_put_file;
+	}
+
+	file->f_op = &kevent_user_fops;
+	file->f_vfsmnt = mntget(kevent_mnt);
+	file->f_dentry = dget(kevent_mnt->mnt_root);
+	file->f_mapping = file->f_dentry->d_inode->i_mapping;
+	file->f_mode = FMODE_READ;
+	file->f_flags = O_RDONLY;
+	file->private_data = u;
+	
+	fd_install(fd, file);
+
+	return fd;
+
+out_put_file:
+	put_filp(file);
+out_put_fd:
+	put_unused_fd(fd);
+	return ret;
+}
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u || num > KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in milliseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		unsigned int timeout, void __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	if (cmd == KEVENT_CTL_INIT)
+		return kevent_ctl_init();
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * filesystem to get control file descriptors from.
+ */
+static int __devinit kevent_user_init(void)
+{
+	int err = 0;
+	
+	kevent_cache = kmem_cache_create("kevent_cache", 
+			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+	err = register_filesystem(&kevent_fs_type);
+	if (err)
+		panic("%s: failed to register filesystem: err=%d.\n",
+			       kevent_name, err);
+
+	kevent_mnt = kern_mount(&kevent_fs_type);
+	if (IS_ERR(kevent_mnt))
+		panic("%s: failed to mount filesystem: err=%ld.\n", 
+				kevent_name, PTR_ERR(kevent_mnt));
+	
+	err = misc_register(&kevent_miscdev);
+	if (err) {
+		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+		goto err_out_exit;
+	}
+
+	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+	return 0;
+
+err_out_exit:
+	mntput(kevent_mnt);
+	unregister_filesystem(&kevent_fs_type);
+
+	return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+	misc_deregister(&kevent_miscdev);
+	mntput(kevent_mnt);
+	unregister_filesystem(&kevent_fs_type);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8d3769b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take11 2/3] kevent: poll/select() notifications.
  2006-08-17  7:43   ` [take11 1/3] kevent: Core files Evgeniy Polyakov
@ 2006-08-17  7:43     ` Evgeniy Polyakov
  2006-08-17  7:43       ` [take11 3/3] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-17  7:43 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


poll/select() notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the caller's internal state machine, but through a
process wakeup).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..75a75d1
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,221 @@
+/*
+ * 	kevent_poll.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct 	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait, 
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont = 
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	u32 revents;
+
+	revents = file->f_op->poll(file, NULL);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, 
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k = 
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+		
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+	
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *w, *n;
+	unsigned long flags;
+
+	kevent_storage_dequeue(k->st, k);
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+		list_del(&w->container_entry);
+		remove_wait_queue(w->whead, &w->wait);
+		kmem_cache_free(kevent_poll_container_cache, w);
+	}
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+	
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+	k->priv = NULL;
+	
+	fput(file);
+
+	return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	unsigned int revents = file->f_op->poll(file, NULL);
+	return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+	struct kevent_callbacks pc = {
+		.callback = &kevent_poll_callback,
+		.enqueue = &kevent_poll_enqueue,
+		.dequeue = &kevent_poll_dequeue};
+
+	kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", 
+			sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+	if (!kevent_poll_container_cache) {
+		printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+		return -ENOMEM;
+	}
+	
+	kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", 
+			sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+	if (!kevent_poll_priv_cache) {
+		printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+		kmem_cache_destroy(kevent_poll_container_cache);
+		kevent_poll_container_cache = NULL;
+		return -ENOMEM;
+	}
+	
+	kevent_add_callbacks(&pc, KEVENT_POLL);
+
+	printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+	return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+	lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+	kmem_cache_destroy(kevent_poll_priv_cache);
+	kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take12 0/3] kevent: Generic event handling mechanism.
       [not found] <12345678912345.GA1898@2ka.mipt.ru>
  2006-08-17  7:43 ` [take11 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-08-21 10:19 ` Evgeniy Polyakov
  2006-08-21 10:19   ` [take12 1/3] kevent: Core files Evgeniy Polyakov
                     ` (3 more replies)
  2006-08-23 11:24 ` [take13 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 10:19 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


Generic event handling mechanism.

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before the main loop, which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added a header shared between userspace and kernelspace instead of embedding the definitions in one file
 * core restructuring to remove forward declarations
 * some whitespace/coding style cleanup
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use nopage() method to dynamically substitute pages
	- allocate new page for events only when a newly added kevent requires it
	- do not use ugly index dereferencing, use structure instead
	- reduced amount of data in the ring (id and flags), 
		maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect whether an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught
	to differentiate between various cases
 * remove kevent from storage if is marked as broken after callback
 * fixed a typo in the mmapped buffer implementation which would end up in wrong index calculation 

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added optimisation, which is aimed to help when a lot of kevents are being copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
			unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent 
	initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match
	kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>



^ permalink raw reply	[flat|nested] 143+ messages in thread

* [take12 2/3] kevent: poll/select() notifications.
  2006-08-21 10:19   ` [take12 1/3] kevent: Core files Evgeniy Polyakov
@ 2006-08-21 10:19     ` Evgeniy Polyakov
  2006-08-21 10:19       ` [take12 3/3] kevent: Timer notifications Evgeniy Polyakov
  2006-08-23  8:51     ` [take12 1/3] kevent: Core files Eric Dumazet
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 10:19 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


poll/select() notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the caller's internal state machine, but through a
process wakeup).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..76b3039 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -698,6 +699,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..75a75d1
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,221 @@
+/*
+ * 	kevent_poll.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct 	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait, 
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont = 
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	u32 revents;
+
+	revents = file->f_op->poll(file, NULL);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, 
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k = 
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+		
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+	
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *w, *n;
+	unsigned long flags;
+
+	kevent_storage_dequeue(k->st, k);
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+		list_del(&w->container_entry);
+		remove_wait_queue(w->whead, &w->wait);
+		kmem_cache_free(kevent_poll_container_cache, w);
+	}
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+	
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+	k->priv = NULL;
+	
+	fput(file);
+
+	return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	unsigned int revents = file->f_op->poll(file, NULL);
+	return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+	struct kevent_callbacks pc = {
+		.callback = &kevent_poll_callback,
+		.enqueue = &kevent_poll_enqueue,
+		.dequeue = &kevent_poll_dequeue};
+
+	kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", 
+			sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+	if (!kevent_poll_container_cache) {
+		printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+		return -ENOMEM;
+	}
+	
+	kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", 
+			sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+	if (!kevent_poll_priv_cache) {
+		printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+		kmem_cache_destroy(kevent_poll_container_cache);
+		kevent_poll_container_cache = NULL;
+		return -ENOMEM;
+	}
+	
+	kevent_add_callbacks(&pc, KEVENT_POLL);
+
+	printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+	return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+	lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+	kmem_cache_destroy(kevent_poll_priv_cache);
+	kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take12 1/3] kevent: Core files.
  2006-08-21 10:19 ` [take12 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-08-21 10:19   ` Evgeniy Polyakov
  2006-08-21 10:19     ` [take12 2/3] kevent: poll/select() notifications Evgeniy Polyakov
  2006-08-23  8:51     ` [take12 1/3] kevent: Core files Eric Dumazet
  2006-08-22  7:00   ` [take12 0/3] kevent: Generic event handling mechanism Nicholas Miell
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 10:19 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


Core files.

This patch includes core kevent files:
 - userspace control interface
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
 	.quad sys_tee
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl
 ia32_syscall_end:		
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_kevent_get_events	318
+#define __NR_kevent_ctl		319
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 320
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..eef9709
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,174 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MAX_EVENTS	4096
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's queue. */
+	struct list_head	kevent_entry;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * The poll()/select() storage keeps a list of wait_queue_t containers
+	 * here, one for each poll_wait() call made from the file's ->poll().
+	 */
+	void			*priv;
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_user
+{
+	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t 		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+	
+	unsigned int		pages_in_use;
+	/* Array of pages forming mapped ring buffer */
+	unsigned long		*pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n", 
+			__func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)		({ (void) u;})
+#define kevent_stat_init(u)		({ (void) u;})
+#define kevent_stat_im(u)		({ (void) u;})
+#define kevent_stat_wait(u)		({ (void) u;})
+#define kevent_stat_total(u)		({ (void) u;})
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..8609910 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, 
+		unsigned int timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..4282793
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,136 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT	0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * The number of per-event sets must match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	__u32		raw[2];
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user. */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define	KEVENT_CTL_ADD 		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+#define	KEVENT_CTL_INIT		3
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..a756e85
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,31 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism.
+	  It can be used as a replacement for poll()/select(), AIO callback
+	  invocation, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistics"
+	depends on KEVENT
+	default n
+	help
+	  This option turns kevent_user statistics collection on.
+	  The data includes the total number of kevents, the number of kevents
+	  which are ready immediately at insertion time and the number of
+	  kevents which were removed through readiness completion.
+	  It is printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows using the kevent subsystem for poll()/select()
+	  notifications.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..ab6bca0
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,3 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..2872aa2
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,238 @@
+/*
+ * 	kevent.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -EINVAL;
+
+	if (!k->callbacks.enqueue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -EINVAL;
+
+	if (!k->callbacks.dequeue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return 0;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(struct kevent_callbacks *cb, int pos)
+{
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+	kevent_registered_callbacks[pos] = *cb;
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before an event is added to some origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used: kevent_enqueue() would fail
+ * to add it to the origin's queue and would set the KEVENT_RET_BROKEN
+ * flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (k->event.type >= KEVENT_MAX)
+		return -EINVAL;
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (!k->callbacks.callback) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease the origin's reference counter in any way
+ * and must be called before that counter is dropped, so the storage
+ * itself is still valid. It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken, which removes them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point
+ * (the socket, for example, has already been removed from the file table).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..6e1bf3a
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,986 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+static int kevent_get_sb(struct file_system_type *fs_type, 
+		int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	/* So original magic... */
+	return get_sb_pseudo(fs_type, kevent_name, NULL, 0xbcdbcdul, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+	.name		= kevent_name,
+	.get_sb		= kevent_get_sb,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *kevent_mnt;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM 
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+	
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+/*
+ * Note that mukevents do not exactly fill a page (each mukevent is 12 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		index;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+	struct kevent_mring *ring;
+
+	ring = (struct kevent_mring *)u->pring[0];
+	ring->index = num;
+}
+
+static inline void kevent_user_ring_inc(struct kevent_user *u)
+{
+	struct kevent_mring *ring;
+
+	ring = (struct kevent_mring *)u->pring[0];
+	ring->index++;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+	struct kevent_mring *ring;
+	unsigned int idx;
+
+	ring = (struct kevent_mring *)u->pring[0];
+
+	idx = (ring->index + 1) / KEVENTS_ON_PAGE;
+	if (idx >= u->pages_in_use) {
+		u->pring[idx] = __get_free_page(GFP_KERNEL);
+		if (!u->pring[idx])
+			return -ENOMEM;
+		u->pages_in_use++;
+	}
+	return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = (struct kevent_mring *)k->user->pring[0];
+	
+	pidx = ring->index/KEVENTS_ON_PAGE;
+	off = ring->index%KEVENTS_ON_PAGE;
+
+	copy_ring = (struct kevent_mring *)k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->index >= KEVENT_MAX_EVENTS)
+		ring->index = 0;
+}
+
+/*
+ * Initialize the mmap ring buffer.
+ * It stores ready kevents, so userspace can fetch them directly instead
+ * of using a syscall. Essentially, the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int pnum;
+
+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+	u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	u->pring[0] = __get_free_page(GFP_KERNEL);
+	if (!u->pring[0])
+		goto err_out_free;
+
+	u->pages_in_use = 1;
+	kevent_user_ring_set(u, 0);
+
+	return 0;
+
+err_out_free:
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+	
+	for (i = 0; i < u->pages_in_use; ++i)
+		free_page(u->pring[i]);
+
+	kfree(u->pring);
+}
+
+
+/*
+ * Allocate new kevent userspace control entry.
+ */
+static struct kevent_user *kevent_user_alloc(void)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return NULL;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+		INIT_LIST_HEAD(&u->kevent_list[i]);
+	
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (kevent_user_ring_init(u)) {
+		kfree(u);
+		u = NULL;
+	}
+
+	return u;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = kevent_user_alloc();
+	
+	if (!u)
+		return -ENOMEM;
+
+	file->private_data = u;
+	
+	return 0;
+}
+
+
+/*
+ * Kevent userspace control block reference counting.
+ * It is set to 1 at creation time; when the corresponding kevent file
+ * descriptor is closed, the reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+	struct kevent_user *u = vma->vm_file->private_data;
+	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
+	if (off >= u->pages_in_use)
+		goto err_out_sigbus;
+
+	return virt_to_page((void *)u->pring[off]);
+
+err_out_sigbus:
+	return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+	.nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start;
+	struct kevent_user *u = file->private_data;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_ops = &kevent_user_vm_ops;
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page((void *)u->pring[0])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * The entry is freed in the RCU callback; it has been dequeued from
+ * all lists by this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove the kevent from the user's list of all events,
+ * dequeue it from its storage and decrease the user's reference counter,
+ * since this kevent no longer exists; that is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search a kevent inside hash bucket for given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk, 
+		struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+	
+	list_for_each_entry(k, head, kevent_entry) {
+		spin_lock(&k->ulock);
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] && 
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			ret = k;
+			spin_unlock(&k->ulock);
+			break;
+		}
+		spin_unlock(&k->ulock);
+	}
+
+	return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	int err = -ENODEV;
+	unsigned long flags;
+	
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+		list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+			kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * and processing them one by one.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+	
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * and processing them one by one.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+	
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+	
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue the kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+	unsigned long flags;
+	unsigned int hash = kevent_user_hash(&k->event);
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+	k->flags |= KEVENT_USER;
+	u->kevent_num++;
+	kevent_user_get(u);
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	if (kevent_user_ring_grow(u)) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	kevent_user_enqueue(u, k);
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	kevent_user_ring_inc(u);
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them into the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the
+ * number of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+		goto out_remove;
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u, 
+		unsigned int min_nr, unsigned int max_nr, unsigned int timeout, 
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait, 
+			u->ready_num >= min_nr, msecs_to_jiffies(timeout));
+	}
+	
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent), 
+					&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+
+/*
+ * Userspace control block creation and initialization.
+ */
+static int kevent_ctl_init(void)
+{
+	struct kevent_user *u;
+	struct file *file;
+	int fd, ret;
+
+	fd = get_unused_fd();
+	if (fd < 0)
+		return fd;
+
+	file = get_empty_filp();
+	if (!file) {
+		ret = -ENFILE;
+		goto out_put_fd;
+	}
+
+	u = kevent_user_alloc();
+	if (unlikely(!u)) {
+		ret = -ENOMEM;
+		goto out_put_file;
+	}
+
+	file->f_op = &kevent_user_fops;
+	file->f_vfsmnt = mntget(kevent_mnt);
+	file->f_dentry = dget(kevent_mnt->mnt_root);
+	file->f_mapping = file->f_dentry->d_inode->i_mapping;
+	file->f_mode = FMODE_READ;
+	file->f_flags = O_RDONLY;
+	file->private_data = u;
+	
+	fd_install(fd, file);
+
+	return fd;
+
+out_put_file:
+	put_filp(file);
+out_put_fd:
+	put_unused_fd(fd);
+	return ret;
+}
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u || num > KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in milliseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		unsigned int timeout, void __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	if (cmd == KEVENT_CTL_INIT)
+		return kevent_ctl_init();
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * filesystem to get control file descriptors from.
+ */
+static int __devinit kevent_user_init(void)
+{
+	int err = 0;
+	
+	kevent_cache = kmem_cache_create("kevent_cache", 
+			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+	err = register_filesystem(&kevent_fs_type);
+	if (err)
+		panic("%s: failed to register filesystem: err=%d.\n",
+			       kevent_name, err);
+
+	kevent_mnt = kern_mount(&kevent_fs_type);
+	if (IS_ERR(kevent_mnt))
+		panic("%s: failed to mount filesystem: err=%ld.\n", 
+				kevent_name, PTR_ERR(kevent_mnt));
+	
+	err = misc_register(&kevent_miscdev);
+	if (err) {
+		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+		goto err_out_exit;
+	}
+
+	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+	return 0;
+
+err_out_exit:
+	mntput(kevent_mnt);
+	unregister_filesystem(&kevent_fs_type);
+
+	return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+	misc_deregister(&kevent_miscdev);
+	mntput(kevent_mnt);
+	unregister_filesystem(&kevent_fs_type);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8d3769b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take12 3/3] kevent: Timer notifications.
  2006-08-21 10:19     ` [take12 2/3] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-08-21 10:19       ` Evgeniy Polyakov
  2006-08-21 11:12         ` Christoph Hellwig
  2006-08-21 12:37         ` [take12 4/3] kevent: Comment cleanup Evgeniy Polyakov
  0 siblings, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 10:19 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig



Timer notifications.

Timer notifications can be used for fine-grained per-process time 
management, since interval timers are very inconvenient to use 
and are limited in number.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..5217cd1
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,107 @@
+/*
+ * 	kevent_timer.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct timer_list	ktimer;
+	struct kevent_storage	ktimer_storage;
+};
+
+static void kevent_timer_func(unsigned long data)
+{
+	struct kevent *k = (struct kevent *)data;
+	struct timer_list *t = k->st->origin;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+	
+	mod_timer(&t->ktimer, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+
+	return 0;
+
+err_out_st_fini:	
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	del_timer_sync(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = (__u32)jiffies;
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback, 
+		.enqueue = &kevent_timer_enqueue, 
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);



* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 10:19       ` [take12 3/3] kevent: Timer notifications Evgeniy Polyakov
@ 2006-08-21 11:12         ` Christoph Hellwig
  2006-08-21 11:18           ` Evgeniy Polyakov
  2006-08-21 12:09           ` Evgeniy Polyakov
  2006-08-21 12:37         ` [take12 4/3] kevent: Comment cleanup Evgeniy Polyakov
  1 sibling, 2 replies; 143+ messages in thread
From: Christoph Hellwig @ 2006-08-21 11:12 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, tglx

On Mon, Aug 21, 2006 at 02:19:49PM +0400, Evgeniy Polyakov wrote:
> 
> 
> Timer notifications.
> 
> Timer notifications can be used for fine grained per-process time 
> management, since interval timers are very inconvenient to use, 
> and they are limited.

Shouldn't this at least use an hrtimer?

> new file mode 100644
> index 0000000..5217cd1
> --- /dev/null
> +++ b/kernel/kevent/kevent_timer.c
> @@ -0,0 +1,107 @@
> +/*
> + * 	kevent_timer.c

You still include those silly filenames on top of the file comments...

> +static struct lock_class_key kevent_timer_key;
> +
> +static int kevent_timer_enqueue(struct kevent *k)
> +{
> +	int err;
> +	struct kevent_timer *t;
> +
> +	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
> +	if (!t)
> +		return -ENOMEM;
> +
> +	setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
> +
> +	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
> +	if (err)
> +		goto err_out_free;
> +	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);

When looking at the kevent_storage_init callers, most need to do
those lockdep_set_class calls.  Shouldn't kevent_storage_init just
get a "struct lock_class_key *" argument?

> +static int kevent_timer_callback(struct kevent *k)
> +{
> +	k->event.ret_data[0] = (__u32)jiffies;

This is returned to userspace, isn't it?  Raw jiffies should never be
user-visible.  Please convert this to a unit that actually makes sense
for userspace (probably nanoseconds).

> +static int __init kevent_init_timer(void)
> +{
> +	struct kevent_callbacks tc = {
> +		.callback = &kevent_timer_callback, 
> +		.enqueue = &kevent_timer_enqueue, 
> +		.dequeue = &kevent_timer_dequeue};

I think this should be static, and the normal style to write it would be:

static struct kevent_callbacks tc = {
	.callback	= kevent_timer_callback,
	.enqueue	= kevent_timer_enqueue,
	.dequeue	= kevent_timer_dequeue,
};

Also, please consider marking all the kevent_callbacks structs const
to avoid false cacheline sharing and accidental modification, similar
to what we did to various other operation vectors.



* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 11:12         ` Christoph Hellwig
@ 2006-08-21 11:18           ` Evgeniy Polyakov
  2006-08-21 11:27             ` Arjan van de Ven
  2006-08-21 14:25             ` Thomas Gleixner
  2006-08-21 12:09           ` Evgeniy Polyakov
  1 sibling, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 11:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, tglx

On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> On Mon, Aug 21, 2006 at 02:19:49PM +0400, Evgeniy Polyakov wrote:
> > 
> > 
> > Timer notifications.
> > 
> > Timer notifications can be used for fine grained per-process time 
> > management, since interval timers are very inconvenient to use, 
> > and they are limited.
> 
> Shouldn't this at least use an hrtimer?

Not every machine has them, and taking into account the possibility that
userspace can be scheduled away, it would be overkill.

> > new file mode 100644
> > index 0000000..5217cd1
> > --- /dev/null
> > +++ b/kernel/kevent/kevent_timer.c
> > @@ -0,0 +1,107 @@
> > +/*
> > + * 	kevent_timer.c
> 
> You still include those silly filenames on top of the file comments...

Sorry, it looks like I updated not every file.

> > +static struct lock_class_key kevent_timer_key;
> > +
> > +static int kevent_timer_enqueue(struct kevent *k)
> > +{
> > +	int err;
> > +	struct kevent_timer *t;
> > +
> > +	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
> > +	if (!t)
> > +		return -ENOMEM;
> > +
> > +	setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
> > +
> > +	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
> > +	if (err)
> > +		goto err_out_free;
> > +	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
> 
> When looking at the kevent_storage_init callers most need to do
> those lockdep_set_class calls.  Shouldn't kevent_storage_init just
> get a "struct lock_class_key *" argument?

It will not work, since the inode is used for both socket and inode
notifications (to save some space in struct sock); lockdep initialization
is performed at the highest level, so I put it on its own.

> > +static int kevent_timer_callback(struct kevent *k)
> > +{
> > +	k->event.ret_data[0] = (__u32)jiffies;
> 
> This is returned to userspace, isn't it?  raw jiffies should never be
> user-visible.  Please convert this to a unit that actually makes sense
> for userspace (probably nanoseconds)

It is just there to show something: my userspace application prints it to
stdout to show the kernelspace time difference between events.
Andrew pointed at it too; I think it is easier to just remove this line.

> > +static int __init kevent_init_timer(void)
> > +{
> > +	struct kevent_callbacks tc = {
> > +		.callback = &kevent_timer_callback, 
> > +		.enqueue = &kevent_timer_enqueue, 
> > +		.dequeue = &kevent_timer_dequeue};
> 
> I think this should be static, and the normal style to write it would be:
> 
> static struct kevent_callbacks tc = {
> 	.callback	= kevent_timer_callback,
> 	.enqueue	= kevent_timer_enqueue,
> 	.dequeue	= kevent_timer_dequeue,
> };
> 
> Also, please consider marking all the kevent_callbacks structs const
> to avoid false cacheline sharing and accidental modification, similar
> to what we did to various other operation vectors.

Ok.

-- 
	Evgeniy Polyakov


* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 11:18           ` Evgeniy Polyakov
@ 2006-08-21 11:27             ` Arjan van de Ven
  2006-08-21 11:59               ` Evgeniy Polyakov
  2006-08-21 14:25             ` Thomas Gleixner
  1 sibling, 1 reply; 143+ messages in thread
From: Arjan van de Ven @ 2006-08-21 11:27 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, tglx

On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> ]> > +	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
> > 
> > When looking at the kevent_storage_init callers most need to do
> > those lockdep_set_class calls.  Shouldn't kevent_storage_init just
> > get a "struct lock_class_key *" argument?
> 
> It will not work, since inode is used for both socket and inode
> notifications (to save some space in struct sock), lockdep initalization
> is performed on the highest level, so I put it alone.

Call me a cynic, but I'm always a bit sceptical about needing lockdep
annotations like this... Can you explain why you need it in this case,
including the proof that it's safe?




* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 11:27             ` Arjan van de Ven
@ 2006-08-21 11:59               ` Evgeniy Polyakov
  2006-08-21 12:13                 ` Arjan van de Ven
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 11:59 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, tglx

On Mon, Aug 21, 2006 at 01:27:22PM +0200, Arjan van de Ven (arjan@infradead.org) wrote:
> On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> > ]> > +	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
> > > 
> > > When looking at the kevent_storage_init callers most need to do
> > > those lockdep_set_class calls.  Shouldn't kevent_storage_init just
> > > get a "struct lock_class_key *" argument?
> > 
> > It will not work, since inode is used for both socket and inode
> > notifications (to save some space in struct sock), lockdep initialization
> > is performed on the highest level, so I put it alone.
> 
> Call me a cynic, but I'm always a bit sceptical about needing lockdep
> annotations like this... Can you explain why you need it in this case,
> including the proof that it's safe?

Ok, again :)
Kevent uses the notion of a storage of kevents without any special knowledge
of what it is (inode, socket, file, timer - anything), so its
initialization function among other things calls spin_lock_init().
Lockdep inserts a static variable just before the real spinlock
initialization, and since all locks are initialized in the same place,
all of them get the same static magic.
Later those locks are used in different contexts (for example inode
notifications only in process context, but sockets can be called from BH
context); since lockdep thinks they are the same, it screams.
Obviously the same inode can not be used for sockets and files, so I
added the above lockdep initialization.
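The scheme described above can be sketched roughly as follows (an abbreviated
sketch based on this thread, not the actual patch; the structures are reduced
to the minimum needed to show the lockdep annotation):

```c
#include <linux/spinlock.h>
#include <linux/lockdep.h>

/* Abbreviated: the real kevent_storage carries more state. */
struct kevent_storage {
	spinlock_t lock;
};

/* Generic init: every origin (inode, socket, file, timer) goes through
 * the same spin_lock_init(), so lockdep's static key is shared and all
 * storage locks end up in one lock class by default. */
static void kevent_storage_init(struct kevent_storage *st)
{
	spin_lock_init(&st->lock);
}

/* An origin whose lock is taken from a different context (e.g. BH for
 * sockets vs. process context for inodes) re-annotates its lock with a
 * class of its own, so lockdep does not conflate the two usages. */
static struct lock_class_key kevent_timer_key;

static void kevent_timer_storage_init(struct kevent_storage *st)
{
	kevent_storage_init(st);
	lockdep_set_class(&st->lock, &kevent_timer_key);
}
```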

-- 
	Evgeniy Polyakov


* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 11:12         ` Christoph Hellwig
  2006-08-21 11:18           ` Evgeniy Polyakov
@ 2006-08-21 12:09           ` Evgeniy Polyakov
  2006-08-22  4:36             ` Andrew Morton
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 12:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, tglx

On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > +static int __init kevent_init_timer(void)
> > +{
> > +	struct kevent_callbacks tc = {
> > +		.callback = &kevent_timer_callback, 
> > +		.enqueue = &kevent_timer_enqueue, 
> > +		.dequeue = &kevent_timer_dequeue};
> 
> I think this should be static, and the normal style to write it would be:
> 
> static struct kevent_callbacks tc = {
> 	.callback	= kevent_timer_callback,
> 	.enqueue	= kevent_timer_enqueue,
> 	.dequeue	= kevent_timer_dequeue,
> };
> 
> also please consider makring all the kevent_callbacks structs const
> to avoid false cacheline sharing and accidental modification, similar
> to what we did to various other operation vectors.

Actually I do not think it should be static, since it is only used for
initialization and its members are copied into the main structure.

-- 
	Evgeniy Polyakov


* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 11:59               ` Evgeniy Polyakov
@ 2006-08-21 12:13                 ` Arjan van de Ven
  2006-08-21 12:25                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Arjan van de Ven @ 2006-08-21 12:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, tglx

On Mon, 2006-08-21 at 15:59 +0400, Evgeniy Polyakov wrote:
> On Mon, Aug 21, 2006 at 01:27:22PM +0200, Arjan van de Ven (arjan@infradead.org) wrote:
> > On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> > > > > +	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
> > > > 
> > > > When looking at the kevent_storage_init callers most need to do
> > > > those lockdep_set_class calls.  Shouldn't kevent_storage_init just
> > > > get a "struct lock_class_key *" argument?
> > > 
> > > It will not work, since inode is used for both socket and inode
> > > notifications (to save some space in struct sock), lockdep initialization
> > > is performed on the highest level, so I put it alone.
> > 
> > Call me a cynic, but I'm always a bit sceptical about needing lockdep
> > annotations like this... Can you explain why you need it in this case,
> > including the proof that it's safe?
> 
> Ok, again :)
> Kevent uses the notion of a storage of kevents without any special knowledge
> of what it is (inode, socket, file, timer - anything), so its
> initialization function among other things calls spin_lock_init().
> Lockdep inserts a static variable just before the real spinlock
> initialization, and since all locks are initialized in the same place,
> all of them get the same static magic.
> Later those locks are used in different contexts (for example inode
> notifications only in process context, but sockets can be called from BH
> context); since lockdep thinks they are the same, it screams.
> Obviously the same inode can not be used for sockets and files, so I
> added the above lockdep initialization.

ok... but since kevent doesn't know what is in it, wouldn't the locking
rules need to be such that it can deal with the "worst case" event? Eg
do you really have both no knowledge of what is inside, and specific
locking implementations for the different types of content??? That
sounds rather error prone.....
(if you had consistent locking rules lockdep would be perfectly fine
with that)


-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com



* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 12:13                 ` Arjan van de Ven
@ 2006-08-21 12:25                   ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 12:25 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, tglx

On Mon, Aug 21, 2006 at 02:13:49PM +0200, Arjan van de Ven (arjan@infradead.org) wrote:
> > > Call me a cynic, but I'm always a bit sceptical about needing lockdep
> > > annotations like this... Can you explain why you need it in this case,
> > > including the proof that it's safe?
> > 
> > Ok, again :)
> > Kevent uses the notion of a storage of kevents without any special knowledge
> > of what it is (inode, socket, file, timer - anything), so its
> > initialization function among other things calls spin_lock_init().
> > Lockdep inserts a static variable just before the real spinlock
> > initialization, and since all locks are initialized in the same place,
> > all of them get the same static magic.
> > Later those locks are used in different contexts (for example inode
> > notifications only in process context, but sockets can be called from BH
> > context); since lockdep thinks they are the same, it screams.
> > Obviously the same inode can not be used for sockets and files, so I
> > added the above lockdep initialization.
> 
> ok... but since kevent doesn't know what is in it, wouldn't the locking
> rules need to be such that it can deal with the "worst case" event? Eg
> do you really have both no knowledge of what is inside, and specific
> locking implementations for the different types of content??? That
> sounds rather error prone.....
> (if you had consistent locking rules lockdep would be perfectly fine
> with that)

It is tricky - currently storage list traversal from each origin is
protected by RCU, i.e. that path only takes a lock when a kevent is
really ready (to put it into the ready list) and to check the kevent's
flags (actually even that could be replaced by stricter callback
return values).
No existing origin calls kevent_storage_ready() from a context where it
can be reentered (which in turn guarantees that the locks will not be
reentered), so in theory those lockdep tricks are not needed right now
(they were added before I moved list traversal to RCU).

In the current code storage->lock is safe and correct from lockdep's point
of view (since different origins are never crossed), so I have not
changed the reinitialization.
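The locking model described here can be sketched roughly as follows (a
kernel-style sketch: kevent_event_occurred() and kevent_move_to_ready_list()
are hypothetical helper names standing in for the real code, and the list
field names are guesses):

```c
#include <linux/rcupdate.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Storage list traversal is lockless under RCU; storage->lock is taken
 * only for the short sections that mutate shared state: moving a ready
 * kevent onto the ready list and updating its flags. */
static void kevent_storage_ready_sketch(struct kevent_storage *st)
{
	struct kevent *k;

	rcu_read_lock();
	list_for_each_entry_rcu(k, &st->kevent_list, storage_entry) {
		if (kevent_event_occurred(k)) {		/* hypothetical helper */
			spin_lock(&st->lock);		/* only when really ready */
			kevent_move_to_ready_list(k);	/* hypothetical helper */
			spin_unlock(&st->lock);
		}
	}
	rcu_read_unlock();
}
```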

> -- 
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com

-- 
	Evgeniy Polyakov


* [take12 4/3] kevent: Comment cleanup.
  2006-08-21 10:19       ` [take12 3/3] kevent: Timer notifications Evgeniy Polyakov
  2006-08-21 11:12         ` Christoph Hellwig
@ 2006-08-21 12:37         ` Evgeniy Polyakov
  1 sibling, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-21 12:37 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig

Remove file name from comments.

dda1ae6fe306b485a91ebb5873eeee4bba06aebf
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
index 2872aa2..02ecf30 100644
--- a/kernel/kevent/kevent.c
+++ b/kernel/kevent/kevent.c
@@ -1,6 +1,4 @@
 /*
- * 	kevent.c
- * 
  * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
  * All rights reserved.
  * 
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
index 75a75d1..0233a4d 100644
--- a/kernel/kevent/kevent_poll.c
+++ b/kernel/kevent/kevent_poll.c
@@ -1,6 +1,4 @@
 /*
- * 	kevent_poll.c
- * 
  * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
  * All rights reserved.
  * 
diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
index 5217cd1..08dfc55 100644
--- a/kernel/kevent/kevent_timer.c
+++ b/kernel/kevent/kevent_timer.c
@@ -1,6 +1,4 @@
 /*
- * 	kevent_timer.c
- * 
  * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
  * All rights reserved.
  * 

-- 
	Evgeniy Polyakov


* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 11:18           ` Evgeniy Polyakov
  2006-08-21 11:27             ` Arjan van de Ven
@ 2006-08-21 14:25             ` Thomas Gleixner
  2006-08-22 18:25               ` Evgeniy Polyakov
  1 sibling, 1 reply; 143+ messages in thread
From: Thomas Gleixner @ 2006-08-21 14:25 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown

On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > On Mon, Aug 21, 2006 at 02:19:49PM +0400, Evgeniy Polyakov wrote:
> > > 
> > > 
> > > Timer notifications.
> > > 
> > > Timer notifications can be used for fine grained per-process time 
> > > management, since interval timers are very inconvenient to use, 
> > > and they are limited.
>
> > Shouldn't this at leat use a hrtimer?
> 
> Not every machine has them 

Every machine has hrtimers - not necessarily with high resolution timer
support, but the core code is there in any case and it is designed to
provide fine grained timers. 

In the case of high resolution timer support one would expect that the "fine
grained" timer event is actually fine grained.
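The hrtimer-based version tglx suggests might look roughly like this (a
sketch against the current hrtimer API - the 2006 callback prototype differed
slightly - and the one-second period is an arbitrary example):

```c
#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer kevent_hrtimer;

/* Returning HRTIMER_RESTART after pushing the expiry forward gives the
 * periodic re-arm behaviour the posted kevent timer implements by hand
 * with add_timer()/mod_timer(). */
static enum hrtimer_restart kevent_hrtimer_fn(struct hrtimer *timer)
{
	/* ... mark the corresponding kevent storage as ready here ... */
	hrtimer_forward_now(timer, ktime_set(1, 0));	/* re-arm: 1s period */
	return HRTIMER_RESTART;
}

static void kevent_hrtimer_setup(void)
{
	hrtimer_init(&kevent_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	kevent_hrtimer.function = kevent_hrtimer_fn;
	hrtimer_start(&kevent_hrtimer, ktime_set(1, 0), HRTIMER_MODE_REL);
}
```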

> and taking into account the possibility that
> userspace can be scheduled away, it will be overkill.

If you think your argument through, then everything which is fine grained
or highly responsive should be removed from userspace access for the very
same reason. Please look at the existing users of the hrtimer subsystem
- all of them are exposed to userspace.

	tglx




* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 12:09           ` Evgeniy Polyakov
@ 2006-08-22  4:36             ` Andrew Morton
  2006-08-22  5:48               ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Andrew Morton @ 2006-08-22  4:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev,
	Zach Brown, tglx

On Mon, 21 Aug 2006 16:09:34 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > > +static int __init kevent_init_timer(void)
> > > +{
> > > +	struct kevent_callbacks tc = {
> > > +		.callback = &kevent_timer_callback, 
> > > +		.enqueue = &kevent_timer_enqueue, 
> > > +		.dequeue = &kevent_timer_dequeue};
> > 
> > I think this should be static, and the normal style to write it would be:
> > 
> > static struct kevent_callbacks tc = {
> > 	.callback	= kevent_timer_callback,
> > 	.enqueue	= kevent_timer_enqueue,
> > 	.dequeue	= kevent_timer_dequeue,
> > };
> > 
> > also please consider makring all the kevent_callbacks structs const
> > to avoid false cacheline sharing and accidental modification, similar
> > to what we did to various other operation vectors.
> 
> Actually I do not think it should be static, since it is only used for
> initialization and its members are copied into the main structure.
> 

It should be static __initdata a) so we don't need to construct it at
runtime and b) so it gets dropped from memory after initcalls have run.

(But given that kevent_init_timer() also gets dropped from memory after initcalls
it hardly matters).


* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-22  4:36             ` Andrew Morton
@ 2006-08-22  5:48               ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22  5:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev,
	Zach Brown, tglx

On Mon, Aug 21, 2006 at 09:36:50PM -0700, Andrew Morton (akpm@osdl.org) wrote:
> On Mon, 21 Aug 2006 16:09:34 +0400
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > > > +static int __init kevent_init_timer(void)
> > > > +{
> > > > +	struct kevent_callbacks tc = {
> > > > +		.callback = &kevent_timer_callback, 
> > > > +		.enqueue = &kevent_timer_enqueue, 
> > > > +		.dequeue = &kevent_timer_dequeue};
> > > 
> > > I think this should be static, and the normal style to write it would be:
> > > 
> > > static struct kevent_callbacks tc = {
> > > 	.callback	= kevent_timer_callback,
> > > 	.enqueue	= kevent_timer_enqueue,
> > > 	.dequeue	= kevent_timer_dequeue,
> > > };
> > > 
> > > also please consider makring all the kevent_callbacks structs const
> > > to avoid false cacheline sharing and accidental modification, similar
> > > to what we did to various other operation vectors.
> > 
> > Actually I do not think it should be static, since it is only used for
> > initialization and its members are copied into the main structure.
> > 
> 
> It should be static __initdata a) so we don't need to construct it at
> runtime and b) so it gets dropped from memory after initcalls have run.
> 
> (But given that kevent_init_timer() also gets dropped from memory after initcalls
> it hardly matters).

That's what I'm talking about.

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-21 10:19 ` [take12 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
  2006-08-21 10:19   ` [take12 1/3] kevent: Core files Evgeniy Polyakov
@ 2006-08-22  7:00   ` Nicholas Miell
  2006-08-22  7:24     ` Evgeniy Polyakov
  2006-08-22 11:54   ` [PATCH] kevent_user: remove non-chardev interface Christoph Hellwig
  2006-08-22 11:55   ` [PATCH] kevent_user: use struct kevent_mring for the page ring Christoph Hellwig
  3 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22  7:00 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> Generic event handling mechanism.

Since this is the sixth[1] event notification system that's getting
added to the kernel, could somebody please convince me that the
userspace API is right this time? (Evidently, the others weren't and are
now just backward compatibility bloat.)

Just looking at the proposed kevent API, it appears that the timer event
queuing mechanism can't be used for the queuing of POSIX.1b interval
timer events (i.e. via a SIGEV_KEVENT notification value in a struct
sigevent) because (being a very thin veneer over the internal kernel
timer system) you can't specify a clockid, the time value doesn't have
the flexibility of a struct itimerspec (no re-arm timeout or absolute
times), and there's no way to alter, disable or query a pending timer or
query a timer overrun count.

Overall, kevent timers appear to be inconvenient to use and limited
compared to POSIX interval timers (excepting the fact you can read their
expiry events out of a queue, of course).



[1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?

-- 
Nicholas Miell <nmiell@comcast.net>



* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  7:00   ` [take12 0/3] kevent: Generic event handling mechanism Nicholas Miell
@ 2006-08-22  7:24     ` Evgeniy Polyakov
  2006-08-22  8:17       ` Nicholas Miell
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22  7:24 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > Generic event handling mechanism.
> 
> Since this is the sixth[1] event notification system that's getting
> added to the kernel, could somebody please convince me that the
> userspace API is right this time? (Evidently, the others weren't and are
> now just backward compatibility bloat.)
> 
> Just looking at the proposed kevent API, it appears that the timer event
> queuing mechanism can't be used for the queuing of POSIX.1b interval
> timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> sigevent) because (being a very thin veneer over the internal kernel
> timer system) you can't specify a clockid, the time value doesn't have
> the flexibility of a struct itimerspec (no re-arm timeout or absolute
> times), and there's no way to alter, disable or query a pending timer or
> query a timer overrun count.
> 
> Overall, kevent timers appear to be inconvenient to use and limited
> compared to POSIX interval timers (excepting the fact you can read their
> expiry events out of a queue, of course).
 
Kevent timers are just a trivial kevent user.
But even as-is it is not that bad a solution.
I, as a user, do not want to know which timer is used - I only need to
get some signal when the interval has completed; especially I do not want to
have trouble when the timer with a given clockid has disappeared.
A kevent timer can be trivially rearmed (actually it is always rearmed
unless the one-shot flag is set).
Of course it can be disabled by removing the requested kevent.
I can add the possibility to alter the timeout without removing the kevent
if there is a strong requirement for that.

Timer notifications were not designed from a committee point of view, where
theoretical discussions end up in multi-megabyte documentation, 99.9% of
which can not be used without major brain surgery.

I just implemented what I use, if you want more - say what you need.
 
> [1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?

Let me guess - kevent, which can do all of the above and a lot of other things?
And you forgot the netlink-based notifiers - netlink, rtnetlink,
genetlink, connector and tons of accounting applications based on them,
kobject, kobject_uevent.
There are also filesystem-based ones - sysfs, procfs, debugfs, relayfs.

> -- 
> Nicholas Miell <nmiell@comcast.net>

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  7:24     ` Evgeniy Polyakov
@ 2006-08-22  8:17       ` Nicholas Miell
  2006-08-22  8:23         ` David Miller
                           ` (2 more replies)
  0 siblings, 3 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22  8:17 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Tue, 2006-08-22 at 11:24 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > > Generic event handling mechanism.
> > 
> > Since this is the sixth[1] event notification system that's getting
> > added to the kernel, could somebody please convince me that the
> > userspace API is right this time? (Evidently, the others weren't and are
> > now just backward compatibility bloat.)
> > 
> > Just looking at the proposed kevent API, it appears that the timer event
> > queuing mechanism can't be used for the queuing of POSIX.1b interval
> > timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> > sigevent) because (being a very thin veneer over the internal kernel
> > timer system) you can't specify a clockid, the time value doesn't have
> > the flexibility of a struct itimerspec (no re-arm timeout or absolute
> > times), and there's no way to alter, disable or query a pending timer or
> > query a timer overrun count.
> > 
> > Overall, kevent timers appear to be inconvenient to use and limited
> > compared to POSIX interval timers (excepting the fact you can read their
> > expiry events out of a queue, of course).
>  
> Kevent timers are just a trivial kevent user.
> But even as-is it is not that bad a solution.
> I, as a user, do not want to know which timer is used - I only need to
> get some signal when the interval has completed; especially I do not want to
> have trouble when the timer with a given clockid has disappeared.
> A kevent timer can be trivially rearmed (actually it is always rearmed
> unless the one-shot flag is set).
> Of course it can be disabled by removing the requested kevent.
> I can add the possibility to alter the timeout without removing the kevent
> if there is a strong requirement for that.
> 

Is any of this documented anywhere? I'd think that any new userspace
interfaces should have man pages explaining their use and some example
code before getting merged into the kernel to shake out any interface
problems.


> Timer notifications were not designed from a committee point of view, where
> theoretical discussions end up in multi-megabyte documentation, 99.9% of
> which can not be used without major brain surgery.

Do you have any technical objections to the POSIX.1b interval timer
design to back up your insults?

> I just implemented what I use, if you want more - say what you need.

I don't know what I need, I just know what POSIX already has, and your
extensions don't appear to be compatible with that model and
deliberately designing something that has no hope of ever getting into
the POSIX standard or serving as the basis for whatever comes out of the
standard committee seems rather stupid. (Especially considering that
Linux's only viable competitor has already shipped a unified event
queuing API that does fit into the existing POSIX design.)

Ulrich Drepper is probably better able to speak on this than I am,
considering that he's involved with POSIX and is probably going to be
involved in the Linux libc work, whatever it may be.

>  
> > [1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?
> 
> > Let me guess - kevent, which can do all of the above and a lot of other things?
> > And you forgot the netlink-based notifiers - netlink, rtnetlink,
> > genetlink, connector and tons of accounting applications based on them,
> > kobject, kobject_uevent.
> > There are also filesystem-based ones - sysfs, procfs, debugfs, relayfs.

OK, so with literally a dozen different interfaces to queue events to
userspace, all of which are apparently inadequate and in need of
replacement by kevent, don't you want to slow down a bit and make sure
that the kevent API is correct before it becomes permanent and then just
has to be replaced *again* ?


-- 
Nicholas Miell <nmiell@comcast.net>



* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  8:17       ` Nicholas Miell
@ 2006-08-22  8:23         ` David Miller
  2006-08-22  8:59           ` Nicholas Miell
  2006-08-22  8:37         ` Evgeniy Polyakov
       [not found]         ` <b3f268590608220957g43a16d6bmde8a542f8ad8710b@mail.gmail.com>
  2 siblings, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-22  8:23 UTC (permalink / raw)
  To: nmiell; +Cc: johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

From: Nicholas Miell <nmiell@comcast.net>
Date: Tue, 22 Aug 2006 01:17:52 -0700

> Is any of this documented anywhere? I'd think that any new userspace
> interfaces should have man pages explaining their use and some example
> code before getting merged into the kernel to shake out any interface
> problems.

Get real.

Nobody made this requirement for things like splice() et al.

I think people are being mostly very unreasonable in the
demands they are making upon Evgeniy.  It will only serve
to discourage the one person who is doing work to solve
these problems.



* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  8:17       ` Nicholas Miell
  2006-08-22  8:23         ` David Miller
@ 2006-08-22  8:37         ` Evgeniy Polyakov
  2006-08-22  9:29           ` Nicholas Miell
       [not found]         ` <b3f268590608220957g43a16d6bmde8a542f8ad8710b@mail.gmail.com>
  2 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22  8:37 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Tue, Aug 22, 2006 at 01:17:52AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> On Tue, 2006-08-22 at 11:24 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > > > Generic event handling mechanism.
> > > 
> > > Since this is the sixth[1] event notification system that's getting
> > > added to the kernel, could somebody please convince me that the
> > > userspace API is right this time? (Evidently, the others weren't and are
> > > now just backward compatibility bloat.)
> > > 
> > > Just looking at the proposed kevent API, it appears that the timer event
> > > queuing mechanism can't be used for the queuing of POSIX.1b interval
> > > timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> > > sigevent) because (being a very thin veneer over the internal kernel
> > > timer system) you can't specify a clockid, the time value doesn't have
> > > the flexibility of a struct itimerspec (no re-arm timeout or absolute
> > > times), and there's no way to alter, disable or query a pending timer or
> > > query a timer overrun count.
> > > 
> > > Overall, kevent timers appear to be inconvenient to use and limited
> > > compared to POSIX interval timers (excepting the fact you can read their
> > > expiry events out of a queue, of course).
> >  
> > Kevent timers are just a trivial kevent user.
> > But even as-is it is not that bad a solution.
> > I, as a user, do not want to know which timer is used - I only need to
> > get some signal when the interval has completed; especially I do not want to
> > have trouble when the timer with a given clockid has disappeared.
> > A kevent timer can be trivially rearmed (actually it is always rearmed
> > unless the one-shot flag is set).
> > Of course it can be disabled by removing the requested kevent.
> > I can add the possibility to alter the timeout without removing the kevent
> > if there is a strong requirement for that.
> > 
> 
> Is any of this documented anywhere? I'd think that any new userspace
> interfaces should have man pages explaining their use and some example
> code before getting merged into the kernel to shake out any interface
> problems.

There are two excellent articles on lwn.net.
 
> > Timer notifications were not designed from a committee point of view, where
> > theoretical discussions end up in multi-megabyte documentation, 99.9% of
> > which can not be used without major brain surgery.
> 
> Do you have any technical objections to the POSIX.1b interval timer
> design to back up your insults?

POSIX timers can have any design, but do not force others to use the
same.

> > I just implemented what I use, if you want more - say what you need.
> 
> I don't know what I need, I just know what POSIX already has, and your

And I do know what I need, that is why I do it.

> extensions don't appear to be compatible with that model and
> deliberately designing something that has no hope of ever getting into
> the POSIX standard or serving as the basis for whatever comes out of the
> standard committee seems rather stupid. (Especially considering that
> Linux's only viable competitor has already shipped a unified event
> queuing API that does fit into the existing POSIX design.)

I think I even know what it is :)

> Ulrich Drepper is probably better able to speak on this than I am,
> considering that he's involved with POSIX and is probably going to be
> involved in the Linux libc work, whatever it may be.

Feel free to use POSIX timers, but do not force others to it too.

> >  
> > > [1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?
> > 
> > > Let me guess - kevent, which can do all of the above and a lot of other things?
> > > And you forgot the netlink-based notifiers - netlink, rtnetlink,
> > > genetlink, connector and tons of accounting applications based on them,
> > > kobject, kobject_uevent.
> > > There are also filesystem-based ones - sysfs, procfs, debugfs, relayfs.
> 
> OK, so with literally a dozen different interfaces to queue events to
> userspace, all of which are apparently inadequate and in need of
> replacement by kevent, don't you want to slow down a bit and make sure
> that the kevent API is correct before it becomes permanent and then just
> has to be replaced *again* ?

I only thought that license issues remained unresolved in that
linux-kernel@ flood, but no, I was wrong :)

I will ask just one question, do _you_ propose anything here?
 
> -- 
> Nicholas Miell <nmiell@comcast.net>

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  8:23         ` David Miller
@ 2006-08-22  8:59           ` Nicholas Miell
  2006-08-22 14:59             ` James Morris
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22  8:59 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

On Tue, 2006-08-22 at 01:23 -0700, David Miller wrote:
> From: Nicholas Miell <nmiell@comcast.net>
> Date: Tue, 22 Aug 2006 01:17:52 -0700
> 
> > Is any of this documented anywhere? I'd think that any new userspace
> > interfaces should have man pages explaining their use and some example
> > code before getting merged into the kernel to shake out any interface
> > problems.
> 
> Get real.
> 
> Nobody made this requirement for things like splice() et al.
> 
> I think people are being mostly very unreasonable in the
> demands they are making upon Evgeniy.  It will only serve
> to discourage the one person who is doing work to solve
> these problems.

splice() is a single synchronous function call, and its signature still
managed to change wildly during its development.

In this brave new world of always stable kernel development, the time a
new interface has for public testing before a new kernel release is
drastically shorter than in the old unstable development series, and if
nobody is documenting how this stuff is supposed to work and
demonstrating how it will be used, then mistakes are bound to slip
through.

-- 
Nicholas Miell <nmiell@comcast.net>



* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  8:37         ` Evgeniy Polyakov
@ 2006-08-22  9:29           ` Nicholas Miell
  2006-08-22 10:03             ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22  9:29 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Tue, 2006-08-22 at 12:37 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 01:17:52AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > On Tue, 2006-08-22 at 11:24 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > > On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > > > > Generic event handling mechanism.
> > > > 
> > > > Since this is the sixth[1] event notification system that's getting
> > > > added to the kernel, could somebody please convince me that the
> > > > userspace API is right this time? (Evidently, the others weren't and are
> > > > now just backward compatibility bloat.)
> > > > 
> > > > Just looking at the proposed kevent API, it appears that the timer event
> > > > queuing mechanism can't be used for the queuing of POSIX.1b interval
> > > > timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> > > > sigevent) because (being a very thin veneer over the internal kernel
> > > > timer system) you can't specify a clockid, the time value doesn't have
> > > > the flexibility of a struct itimerspec (no re-arm timeout or absolute
> > > > times), and there's no way to alter, disable or query a pending timer or
> > > > query a timer overrun count.
> > > > 
> > > > Overall, kevent timers appear to be inconvenient to use and limited
> > > > compared to POSIX interval timers (excepting the fact you can read their
> > > > expiry events out of a queue, of course).
> > >  
> > > Kevent timers are just a trivial kevent user.
> > > But even as is, it is not that bad a solution.
> > > I, as a user, do not want to know which timer is used - I only need to
> > > get some signal when the interval completes; especially, I do not want to
> > > have trouble when a timer with the given clockid has disappeared.
> > > A kevent timer can be trivially rearmed (actually it is always rearmed
> > > until the one-shot flag is set).
> > > Of course it can be disabled by removing the requested kevent.
> > > I can add the possibility to alter the timeout without removing the kevent
> > > if there is a strong requirement for that.
> > > 
> > 
> > Is any of this documented anywhere? I'd think that any new userspace
> > interfaces should have man pages explaining their use and some example
> > code before getting merged into the kernel to shake out any interface
> > problems.
> 
> There are two excellent articles on lwn.net

Google knows of one and it doesn't actually explain how to use kevents.


> > 
> > OK, so with literally a dozen different interfaces to queue events to
> > userspace, all of which are apparently inadequate and in need of
> > replacement by kevent, don't you want to slow down a bit and make sure
> > that the kevent API is correct before it becomes permanent and then just
> > has to be replaced *again* ?
> 
> I only thought that license issues remained unresolved in that
> linux-kernel@ flood, but no, I was wrong :)
> 
> I will ask just one question, do _you_ propose anything here?
>  

struct sigevent sigev = {
	.sigev_notify = SIGEV_KEVENT,
	.sigev_kevent_fd = kev_fd,
	.sigev_value.sival_ptr = &MyCookie
};

struct itimerspec its = {
	.it_value = { ... },
	.it_interval = { ... }
};

struct timespec timeout = { .. };

struct ukevent events[max];

timer_t timer;

timer_create(CLOCK_MONOTONIC, &sigev, &timer);
timer_settime(timer, 0, &its, NULL);

/* ... */

kevent_get_events(kev_fd, min, max, &timeout, events, 0);



Which isn't all that different from what Ulrich Drepper suggested and
Solaris does right now. (timer_create would probably end up calling
kevent_ctl itself, but it obviously can't do that unless kevents
actually support real interval timers).

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  9:29           ` Nicholas Miell
@ 2006-08-22 10:03             ` Evgeniy Polyakov
  2006-08-22 19:57               ` Nicholas Miell
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 10:03 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Tue, Aug 22, 2006 at 02:29:48AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > Is any of this documented anywhere? I'd think that any new userspace
> > > interfaces should have man pages explaining their use and some example
> > > code before getting merged into the kernel to shake out any interface
> > > problems.
> > 
> > There are two excellent articles on lwn.net
> 
> Google knows of one and it doesn't actually explain how to use kevents.

http://lwn.net/Articles/192964/
http://lwn.net/Articles/172844/

In the thread there were enough links to the homepage where you can find
several examples of how to use kevents (timers among others) with the
old interfaces and the new ones.

> > I will ask just one question, do _you_ propose anything here?
> >  
> 
> struct sigevent sigev = {
> 	.sigev_notify = SIGEV_KEVENT,
> 	.sigev_kevent_fd = kev_fd,
> 	.sigev_value.sival_ptr = &MyCookie
> };
> 
> struct itimerspec its = {
> 	.it_value = { ... },
> 	.it_interval = { ... }
> };
> 
> struct timespec timeout = { .. };
> 
> struct ukevent events[max];
> 
> timer_t timer;
> 
> timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> timer_settime(timer, 0, &its, NULL);
> 
> /* ... */
> 
> kevent_get_events(kev_fd, min, max, &timeout, events, 0);
> 
> 
> 
> Which isn't all that different from what Ulrich Drepper suggested and
> Solaris does right now. (timer_create would probably end up calling
> kevent_ctl itself, but it obviously can't do that unless kevents
> actually support real interval timers).

Ugh, rt-signals... Their problems forced me not to implement an
"interrupt"-like mechanism for kevents in addition to dequeueing.

Anyway, it seems you did not read the whole thread, the homepage, lwn and
the userspace examples, so you do not understand what kevents are.

They are userspace requests which are returned back when they are ready.
It means that userspace must provide something to the kernel and ask it to
notify when that "something" is ready. For example, it can provide a
timeout value and ask the kernel to fire a timer with it and inform
userspace when the timeout has expired.
It does not matter which timer is used there - feel free to use a
high-resolution one, a usual timer, a busyloop or anything else. The main
issue is that the userspace request must be completed.

What you are trying to do is to put kevents under the POSIX API.
That means that those kevents can not be read using
kevent_get_events(), basically because there are no user-known kevents,
i.e. the user has not requested a timer, so it should not receive its
notifications (otherwise it would receive everything requested by other
threads; and there are other issues, e.g. how to differentiate a timer
request made by timer_create(), which is not supposed to be caught by
kevent_get_events()).

You could implement POSIX timers _fully_ on top of kevents, i.e. both
create and read. For example, network AIO is implemented in that way -
there are system calls aio_send()/aio_recv() and aio_sendfile() which
create a kevent internally and then get its readiness notifications over
a provided callback, process data and finally remove the kevent.
So POSIX timers could create a timer kevent, wait until it is ready, and
in the completion callback call the signal delivery mechanism...

But there is no reading mechanism in POSIX timers (I mean not reading
pending timeout values or remaining time); they use signals for
completion delivery... So where do you want to plug kevent's
userspace interface in there?

What you are trying to achieve is not POSIX timers in any way; you want a
completely new mechanism which has an API similar to POSIX, and I give it to
you (well, with an API which can be used not only with timers, but with any
other type of notification you like).
You need clockid_t? Put it in raw.id[0] and make the kevent_timer_enqueue()
callback select a different type of timer.
What else?
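A rough user-space sketch of that raw.id[0] idea; the structure layout and constant below (all prefixed hyp_/HYP_) are illustrative stand-ins, not the actual include/linux/ukevent.h definitions:

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: a trimmed-down stand-in for the kevent request
 * structure discussed in the thread (two raw id words, a type, and
 * request flags).  The real layout in include/linux/ukevent.h differs. */
struct hyp_ukevent_id {
	unsigned int raw[2];
};

struct hyp_ukevent {
	struct hyp_ukevent_id id;	/* user-chosen identifier words */
	unsigned int type;		/* would be KEVENT_TIMER in the real header */
	unsigned int req_flags;		/* e.g. one-shot vs. rearming */
};

#define HYP_KEVENT_TIMER	3	/* placeholder value */

/* Build a timer request that carries a clockid in raw.id[0], as
 * suggested above: the in-kernel enqueue callback could then select
 * the clock source from this word. */
static struct hyp_ukevent hyp_make_timer_request(unsigned int clockid,
						 unsigned int timeout_msecs)
{
	struct hyp_ukevent e;

	memset(&e, 0, sizeof(e));
	e.type = HYP_KEVENT_TIMER;
	e.id.raw[0] = clockid;		/* clock selector */
	e.id.raw[1] = timeout_msecs;	/* timeout payload */
	return e;
}
```

The point is only that the existing id words leave room for a clock selector without any change to the request format.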

> -- 
> Nicholas Miell <nmiell@comcast.net>

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* [PATCH] kevent_user: remove non-chardev interface
  2006-08-21 10:19 ` [take12 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
  2006-08-21 10:19   ` [take12 1/3] kevent: Core files Evgeniy Polyakov
  2006-08-22  7:00   ` [take12 0/3] kevent: Generic event handling mechanism Nicholas Miell
@ 2006-08-22 11:54   ` Christoph Hellwig
  2006-08-22 12:17     ` Evgeniy Polyakov
  2006-08-22 11:55   ` [PATCH] kevent_user: use struct kevent_mring for the page ring Christoph Hellwig
  3 siblings, 1 reply; 143+ messages in thread
From: Christoph Hellwig @ 2006-08-22 11:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown

Currently a user can create user kevents in two ways:

 a) simply open() the kevent chardevice
 b) use sys_kevent_ctl with the KEVENT_CTL_INIT cmd type

both are equally easy to use for the user, but to support type b) a lot
of code in kernelspace is required.  Remove type b) to save lots of code
without loss of functionality.


 include/linux/ukevent.h     |    1
 kernel/kevent/kevent_user.c |   99 +-------------------------------------------
 2 files changed, 4 insertions(+), 96 deletions(-)

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/kernel/kevent/kevent_user.c
===================================================================
--- linux-2.6.orig/kernel/kevent/kevent_user.c	2006-08-22 13:26:25.000000000 +0200
+++ linux-2.6/kernel/kevent/kevent_user.c	2006-08-22 13:46:08.000000000 +0200
@@ -36,20 +36,6 @@
 static char kevent_name[] = "kevent";
 static kmem_cache_t *kevent_cache;
 
-static int kevent_get_sb(struct file_system_type *fs_type, 
-		int flags, const char *dev_name, void *data, struct vfsmount *mnt)
-{
-	/* So original magic... */
-	return get_sb_pseudo(fs_type, kevent_name, NULL, 0xbcdbcdul, mnt);
-}
-
-static struct file_system_type kevent_fs_type = {
-	.name		= kevent_name,
-	.get_sb		= kevent_get_sb,
-	.kill_sb	= kill_anon_super,
-};
-
-static struct vfsmount *kevent_mnt;
 
 /*
  * kevents are pollable, return POLLIN and POLLRDNORM 
@@ -178,17 +164,14 @@
 }
 
 
-/*
- * Allocate new kevent userspace control entry.
- */
-static struct kevent_user *kevent_user_alloc(void)
+static int kevent_user_open(struct inode *inode, struct file *file)
 {
 	struct kevent_user *u;
 	int i;
 
 	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
 	if (!u)
-		return NULL;
+		return -ENOMEM;
 
 	INIT_LIST_HEAD(&u->ready_list);
 	spin_lock_init(&u->ready_lock);
@@ -202,23 +185,12 @@
 
 	atomic_set(&u->refcnt, 1);
 
-	if (kevent_user_ring_init(u)) {
+	if (unlikely(kevent_user_ring_init(u))) {
 		kfree(u);
-		u = NULL;
-	}
-
-	return u;
-}
-
-static int kevent_user_open(struct inode *inode, struct file *file)
-{
-	struct kevent_user *u = kevent_user_alloc();
-	
-	if (!u)
 		return -ENOMEM;
+	}
 
 	file->private_data = u;
-	
 	return 0;
 }
 
@@ -807,51 +779,6 @@
 	.fops = &kevent_user_fops,
 };
 
-
-/*
- * Userspace control block creation and initialization.
- */
-static int kevent_ctl_init(void)
-{
-	struct kevent_user *u;
-	struct file *file;
-	int fd, ret;
-
-	fd = get_unused_fd();
-	if (fd < 0)
-		return fd;
-
-	file = get_empty_filp();
-	if (!file) {
-		ret = -ENFILE;
-		goto out_put_fd;
-	}
-
-	u = kevent_user_alloc();
-	if (unlikely(!u)) {
-		ret = -ENOMEM;
-		goto out_put_file;
-	}
-
-	file->f_op = &kevent_user_fops;
-	file->f_vfsmnt = mntget(kevent_mnt);
-	file->f_dentry = dget(kevent_mnt->mnt_root);
-	file->f_mapping = file->f_dentry->d_inode->i_mapping;
-	file->f_mode = FMODE_READ;
-	file->f_flags = O_RDONLY;
-	file->private_data = u;
-	
-	fd_install(fd, file);
-
-	return fd;
-
-out_put_file:
-	put_filp(file);
-out_put_fd:
-	put_unused_fd(fd);
-	return ret;
-}
-
 static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
 {
 	int err;
@@ -920,9 +847,6 @@
 	int err = -EINVAL;
 	struct file *file;
 
-	if (cmd == KEVENT_CTL_INIT)
-		return kevent_ctl_init();
-
 	file = fget(fd);
 	if (!file)
 		return -ENODEV;
@@ -948,16 +872,6 @@
 	kevent_cache = kmem_cache_create("kevent_cache", 
 			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
 
-	err = register_filesystem(&kevent_fs_type);
-	if (err)
-		panic("%s: failed to register filesystem: err=%d.\n",
-			       kevent_name, err);
-
-	kevent_mnt = kern_mount(&kevent_fs_type);
-	if (IS_ERR(kevent_mnt))
-		panic("%s: failed to mount silesystem: err=%ld.\n", 
-				kevent_name, PTR_ERR(kevent_mnt));
-	
 	err = misc_register(&kevent_miscdev);
 	if (err) {
 		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
@@ -969,17 +883,12 @@
 	return 0;
 
 err_out_exit:
-	mntput(kevent_mnt);
-	unregister_filesystem(&kevent_fs_type);
-
 	return err;
 }
 
 static void __devexit kevent_user_fini(void)
 {
 	misc_deregister(&kevent_miscdev);
-	mntput(kevent_mnt);
-	unregister_filesystem(&kevent_fs_type);
 }
 
 module_init(kevent_user_init);
Index: linux-2.6/include/linux/ukevent.h
===================================================================
--- linux-2.6.orig/include/linux/ukevent.h	2006-08-22 12:10:24.000000000 +0200
+++ linux-2.6/include/linux/ukevent.h	2006-08-22 13:48:05.000000000 +0200
@@ -131,6 +131,5 @@
 #define	KEVENT_CTL_ADD 		0
 #define	KEVENT_CTL_REMOVE	1
 #define	KEVENT_CTL_MODIFY	2
-#define	KEVENT_CTL_INIT		3
 
 #endif /* __UKEVENT_H */

^ permalink raw reply	[flat|nested] 143+ messages in thread

* [PATCH] kevent_user: use struct kevent_mring for the page ring
  2006-08-21 10:19 ` [take12 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
                     ` (2 preceding siblings ...)
  2006-08-22 11:54   ` [PATCH] kevent_user: remove non-chardev interface Christoph Hellwig
@ 2006-08-22 11:55   ` Christoph Hellwig
  2006-08-22 12:20     ` Evgeniy Polyakov
  3 siblings, 1 reply; 143+ messages in thread
From: Christoph Hellwig @ 2006-08-22 11:55 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown

Currently struct kevent_user.pring is an array of unsigned long, but it's used
as an array of pointers to struct kevent_mring everywhere in the code.

Switch it to use the real type and cast the return value from __get_free_page /
argument to free_page.


 include/linux/kevent.h      |    2 +-
 kernel/kevent/kevent_user.c |   31 +++++++++++--------------------
  2 files changed, 12 insertions(+), 21 deletions(-)

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/linux/kevent.h
===================================================================
--- linux-2.6.orig/include/linux/kevent.h	2006-08-22 12:10:24.000000000 +0200
+++ linux-2.6/include/linux/kevent.h	2006-08-22 13:30:48.000000000 +0200
@@ -105,7 +105,7 @@
 	
 	unsigned int		pages_in_use;
 	/* Array of pages forming mapped ring buffer */
-	unsigned long		*pring;
+	struct kevent_mring	**pring;
 
 #ifdef CONFIG_KEVENT_USER_STAT
 	unsigned long		im_num;
Index: linux-2.6/kernel/kevent/kevent_user.c
===================================================================
--- linux-2.6.orig/kernel/kevent/kevent_user.c	2006-08-22 13:28:06.000000000 +0200
+++ linux-2.6/kernel/kevent/kevent_user.c	2006-08-22 13:43:46.000000000 +0200
@@ -69,30 +69,21 @@
 
 static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
 {
-	struct kevent_mring *ring;
-
-	ring = (struct kevent_mring *)u->pring[0];
-	ring->index = num;
+	u->pring[0]->index = num;
 }
 
 static inline void kevent_user_ring_inc(struct kevent_user *u)
 {
-	struct kevent_mring *ring;
-
-	ring = (struct kevent_mring *)u->pring[0];
-	ring->index++;
+	u->pring[0]->index++;
 }
 
 static int kevent_user_ring_grow(struct kevent_user *u)
 {
-	struct kevent_mring *ring;
 	unsigned int idx;
 
-	ring = (struct kevent_mring *)u->pring[0];
-
-	idx = (ring->index + 1) / KEVENTS_ON_PAGE;
+	idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
 	if (idx >= u->pages_in_use) {
-		u->pring[idx] = __get_free_page(GFP_KERNEL);
+		u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
 		if (!u->pring[idx])
 			return -ENOMEM;
 		u->pages_in_use++;
@@ -108,12 +99,12 @@
 	unsigned int pidx, off;
 	struct kevent_mring *ring, *copy_ring;
 
-	ring = (struct kevent_mring *)k->user->pring[0];
-	
+	ring = k->user->pring[0];
+
 	pidx = ring->index/KEVENTS_ON_PAGE;
 	off = ring->index%KEVENTS_ON_PAGE;
 
-	copy_ring = (struct kevent_mring *)k->user->pring[pidx];
+	copy_ring = k->user->pring[pidx];
 
 	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
 	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
@@ -134,11 +125,11 @@
 
 	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
 
-	u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+	u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
 	if (!u->pring)
 		return -ENOMEM;
 
-	u->pring[0] = __get_free_page(GFP_KERNEL);
+	u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
 	if (!u->pring[0])
 		goto err_out_free;
 
@@ -158,7 +149,7 @@
 	int i;
 	
 	for (i = 0; i < u->pages_in_use; ++i)
-		free_page(u->pring[i]);
+		free_page((unsigned long)u->pring[i]);
 
 	kfree(u->pring);
 }
@@ -254,7 +245,7 @@
 	vma->vm_flags |= VM_RESERVED;
 	vma->vm_file = file;
 
-	if (vm_insert_page(vma, start, virt_to_page((void *)u->pring[0])))
+	if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
 		return -EFAULT;
 
 	return 0;

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH] kevent_user: remove non-chardev interface
  2006-08-22 11:54   ` [PATCH] kevent_user: remove non-chardev interface Christoph Hellwig
@ 2006-08-22 12:17     ` Evgeniy Polyakov
  2006-08-22 12:27       ` Christoph Hellwig
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 12:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown

On Tue, Aug 22, 2006 at 12:54:59PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> Currently a user can create user kevents in two ways:
> 
>  a) simply open() the kevent chardevice
>  b) use sys_kevent_ctl with the KEVENT_CTL_INIT cmd type
> 
> both are equally easy to use for the user, but to support type b) a lot
> of code in kernelspace is required.  Remove type b) to save lots of code
> without loss of functionality.

I personally do not have objections against it, but it introduces
additional complexity - one needs to open /dev/kevent and then perform
syscalls on top of the returned file descriptor.

If there are no objections, I will apply this patch.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH] kevent_user: use struct kevent_mring for the page ring
  2006-08-22 11:55   ` [PATCH] kevent_user: use struct kevent_mring for the page ring Christoph Hellwig
@ 2006-08-22 12:20     ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 12:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown

On Tue, Aug 22, 2006 at 12:55:04PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> Currently struct kevent_user.pring is an array of unsigned long, but it's used
> as an array of pointers to struct kevent_mring everywhere in the code.
> 
> Switch it to use the real type and cast the return value from __get_free_page /
> argument to free_page.
> 
> 
>  include/linux/kevent.h      |    2 +-
>  kernel/kevent/kevent_user.c |   31 +++++++++++--------------------
>   2 files changed, 12 insertions(+), 21 deletions(-)
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I will apply this patch with small minor nit below.

>  	if (idx >= u->pages_in_use) {
> -		u->pring[idx] = __get_free_page(GFP_KERNEL);
> +		u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);

Better cast it directly to (struct kevent_mring *).


If there are no objections about the syscall-based initialization,
I will release a new patchset tomorrow with your changes.

Thank you, Christoph.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH] kevent_user: remove non-chardev interface
  2006-08-22 12:17     ` Evgeniy Polyakov
@ 2006-08-22 12:27       ` Christoph Hellwig
  2006-08-22 12:39         ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Christoph Hellwig @ 2006-08-22 12:27 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown

On Tue, Aug 22, 2006 at 04:17:10PM +0400, Evgeniy Polyakov wrote:
> I personally do not have objections against it, but it introduces
> additional complexies - one needs to open /dev/kevent and then perform
> syscalls on top of returuned file descriptor.

it disallows

int fd = sys_kevent_ctl(<random>, KEVENT_CTL_INIT, <random>, <random>);

in favour of only

int fd = open("/dev/kevent", O_SOMETHING);

which doesn't seem like a problem, especially as I really badly hope
no one will use the syscalls but some library instead.

In addition to that I'm researching whether there's a better way to
implement the other functionality instead of the two syscalls.  But I'd
rather let code speak, so wait for some patches from me on that.


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH] kevent_user: remove non-chardev interface
  2006-08-22 12:27       ` Christoph Hellwig
@ 2006-08-22 12:39         ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 12:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown

On Tue, Aug 22, 2006 at 01:27:31PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> On Tue, Aug 22, 2006 at 04:17:10PM +0400, Evgeniy Polyakov wrote:
> > I personally do not have objections against it, but it introduces
> > additional complexity - one needs to open /dev/kevent and then perform
> > syscalls on top of the returned file descriptor.
> 
> it disalllows
> 
> int fd = sys_kevent_ctl(<random>, KEVENT_CTL_INIT, <random>, <random>);
> 
> in favour of only
> 
> int fd = open("/dev/kevent", O_SOMETHING);
> 
> which doesn't seem like a problem, especially as I really badly hope
> no one will use the syscalls but some library instead.

Yep, it is exactly the open/kevent_ctl approach above that I'm talking about.
I still have a system which has an ioctl()-based kevent setup, and it
works - I really do not want to raise another flamewar about which
approach is better. If no one complains before tomorrow I will commit
it.

> In addition to that I'm researching whether there's a better way to
> implement the other functionality instead of the two syscalls.  But I'd
> rather let code speak, so wait for some patches from me on that.

There were implementations with pure ioctl() and with one syscall for all
operations (with a control block embedded in it); all were rejected in
favour of two syscalls, so I'm waiting for your patches.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22  8:59           ` Nicholas Miell
@ 2006-08-22 14:59             ` James Morris
  2006-08-22 20:00               ` Nicholas Miell
  0 siblings, 1 reply; 143+ messages in thread
From: James Morris @ 2006-08-22 14:59 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: David Miller, johnpol, linux-kernel, drepper, akpm, netdev,
	zach.brown, hch

On Tue, 22 Aug 2006, Nicholas Miell wrote:

> In this brave new world of always stable kernel development, the time a
> new interface has for public testing before a new kernel release is
> drastically shorter than the old unstable development series, and if
> nobody is documenting how this stuff is supposed to work and
> demonstrating how it will be used, then mistakes are bound to slip
> through.

Feel free to provide the documentation.  Perhaps even as much as you've
written so far in these emails would be enough.



- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
       [not found]         ` <b3f268590608220957g43a16d6bmde8a542f8ad8710b@mail.gmail.com>
@ 2006-08-22 17:09           ` Jari Sundell
  2006-08-22 18:01           ` Evgeniy Polyakov
  1 sibling, 0 replies; 143+ messages in thread
From: Jari Sundell @ 2006-08-22 17:09 UTC (permalink / raw)
  To: netdev, lkml

Not to mention the name used causes (at least for me) some confusion with
BSD's kqueue implementation. Skimming over the patches, it actually
looks somewhat like kqueue with the more interesting features removed,
like the ability to pass filter changes simultaneously with
polling.

Maybe this is a topic that will singe my fur, but what is wrong with
the kqueue API? Will I really have to implement support for yet
another event API in my program?

Rakshasa

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
       [not found]         ` <b3f268590608220957g43a16d6bmde8a542f8ad8710b@mail.gmail.com>
  2006-08-22 17:09           ` Jari Sundell
@ 2006-08-22 18:01           ` Evgeniy Polyakov
  2006-08-22 19:14             ` Jari Sundell
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 18:01 UTC (permalink / raw)
  To: Jari Sundell
  Cc: Nicholas Miell, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig

On Tue, Aug 22, 2006 at 06:57:05PM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> On 8/22/06, Nicholas Miell <nmiell@comcast.net> wrote:
> >
> >
> >OK, so with literally a dozen different interfaces to queue events to
> >userspace, all of which are apparently inadequate and in need of
> >replacement by kevent, don't you want to slow down a bit and make sure
> >that the kevent API is correct before it becomes permanent and then just
> >has to be replaced *again* ?
> >
> 
> Not to mention the name used causes (at least me) some confusion with BSD's
> kqueue implementation. Skimming over the patches it actually looks somewhat
> like kqueue with the more interesting features removed, like the ability to
> pass the filter changes simultaneously with polling.

I do not understand - what do you mean?
It is obviously allowed to poll and change kevents at the same time.

> Maybe this is a topic that will singe my fur, but what is wrong with the
> kqueue API? Will I really have to implement support for yet another event
> API in my program.

Why did I not implement it like Solaris did?
Or like FreeBSD did?
It was designed with the features mentioned on the AIO homepage in mind,
not to be compatible with some other implementation.
And why should it be?

> Rakshasa

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 3/3] kevent: Timer notifications.
  2006-08-21 14:25             ` Thomas Gleixner
@ 2006-08-22 18:25               ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 18:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown

On Mon, Aug 21, 2006 at 04:25:49PM +0200, Thomas Gleixner (tglx@linutronix.de) wrote:
> > Not every machine has them
> 
> Every machine has hrtimers - not necessarily with high resolution timer
> support, but the core code is there in any case and it is designed to
> provide fine grained timers. 
> 
> In case of high resolution time support one would expect that the "fine
> grained" timer event is actually fine grained.

Ok, I should reformulate: currently not every machine has support
in the kernel. Obviously each machine has a clock which runs faster than
jiffies.
And as a side note - kevents were created half a year ago; there were
no hrtimers in the kernel at that time. Btw, does the kernel have a
high-resolution clock engine in already?

> > and, taking into account the possibility that
> > userspace can be scheduled away, it will be overkill.
> 
> If you think out your argument then everything which is fine grained or
> high responsive should be removed from userspace access for the very
> same reason. Please look at the existing users of the hrtimer subsystem
> - all of them are exposed to userspace.

Taking into account that a system call takes more than 100 nsec, and that
one should create a kevent and then read it (with at least three
reschedulings - after two syscalls and a wake up), it is not exactly the
best way to obtain nanosecond resolution. And even one usec is good enough
for userspace, and I can create an interface through kevents, but let's be
realistic - if we still can not agree on other issues, should we do it
right now? I would like the kevent core's issues to be resolved and
everyone to become happy with it before adding new kevent users.

If everyone says "yes, replace the usual timers with high-resolution ones",
then ok, I will schedule it for the next patchset.

> 	tglx
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 18:01           ` Evgeniy Polyakov
@ 2006-08-22 19:14             ` Jari Sundell
  2006-08-22 19:47               ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-22 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nicholas Miell, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig

On 8/22/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > Not to mention the name used causes (at least me) some confusion with BSD's
> > kqueue implementation. Skimming over the patches it actually looks somewhat
> > like kqueue with the more interesting features removed, like the ability to
> > pass the filter changes simultaneously with polling.
>
> I do not understand, what do you mean?
> It is obviously allowed to poll and change kevents at the same time.

Changing kevents is done with a separate system call from polling,
afaics, thus every change requires a context switch. This is in contrast
to BSD's kqueue, which allows user-space to pass the changes when
kevent() (polling) is called.

It may also choose to update the filters immediately with the same call.

> > Maybe this is a topic that will singe my fur, but what is wrong with the
> > kqueue API? Will I really have to implement support for yet another event
> > API in my program.
>
> Why did I not implemented it like Solaris did?
> Or FreeBSD did?
> It was designed with features mention on AIO homepage in mind, but not
> to be compatible with some other implementation.
> And why should it be?

If it can be, why should it not be? At least, if you reinvent the
wheel, its advantages should be obvious.

Considering that kqueue is available on more popular OSes like Darwin,
it would ease portability greatly if there were a shared event API.
That is, unless you think there's something fundamentally wrong with
their design.

Your interface:

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+               unsigned int timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);

BSD's kqueue:

struct kevent {
  uintptr_t ident;        /* identifier for this event */
  short     filter;       /* filter for event */
  u_short   flags;        /* action flags for kqueue */
  u_int     fflags;       /* filter flag value */
  intptr_t  data;         /* filter data value */
  void      *udata;       /* opaque user data identifier */
};

int kevent(int kq, const struct kevent *changelist, int nchanges,
struct kevent *eventlist, int nevents, const struct timespec
*timeout);

The only thing missing in BSD's kevent is the min/max parameters; the
various filters in kevent_get_events either have kqueue equivalents or
could be added as extensions. (I didn't look too carefully through
them.)

On the other hand, your API lacks the ability to pass changes when
polling, as mentioned above. It would also be preferable if the timeout
parameter were either a timespec or a timeval.

Rakshasa

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 19:14             ` Jari Sundell
@ 2006-08-22 19:47               ` Evgeniy Polyakov
  2006-08-22 22:51                 ` Jari Sundell
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 19:47 UTC (permalink / raw)
  To: Jari Sundell
  Cc: Nicholas Miell, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig

On Tue, Aug 22, 2006 at 09:14:30PM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> Changing kevents are done with a separate system call from polling
> afaics, thus every change requires a context switch. This in contrast
> to BSD's kqueue which allows user-space to pass the changes when
> kevent (polling) is called.
> 
> It may also choose to update the filters immediately with the same call.

The word "polling" really confused me here, but now I understand you.
Such an approach actually has unresolved issues - consider, for
example, a situation when all provided events are ready immediately - what
should be returned (as far as I recall they are always added into the kqueue in
the BSDs before being checked, so old events will be returned
first)? And currently ready events can be read through the mapped buffer
without any syscall at all.
Also, a Linux syscall is much cheaper than a BSD one.
Considering those issues (especially the mapped buffer), the combined
call is really not worth the interface complexity.

> >> Maybe this is a topic that will singe my fur, but what is wrong with the
> >> kqueue API? Will I really have to implement support for yet another event
> >> API in my program.
> >
> >Why did I not implemented it like Solaris did?
> >Or FreeBSD did?
> >It was designed with features mention on AIO homepage in mind, but not
> >to be compatible with some other implementation.
> >And why should it be?
> 
> If it can be, why should it not be? At least, if you reinvent the
> wheel its advantages should be obvious.
> 
> Considering that kqueue is available on more popular OSes like darwin
> it would ease portability greatly if there was a shared event API.
> That is, unless you think there's something fundamentally wrong with
> their design.

First of all, the structure types are completely different.
The design of the in-kernel part is very different too.

> Your interface:
> 
> +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min,
> unsigned int max,
> +               unsigned int timeout, void __user *buf, unsigned flags);
> +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned
> int num, void __user *buf);
> 
> BSD's kqueue:
> 
> struct kevent {
>  uintptr_t ident;        /* identifier for this event */
>  short     filter;       /* filter for event */
>  u_short   flags;        /* action flags for kqueue */
>  u_int     fflags;       /* filter flag value */
>  intptr_t  data;         /* filter data value */
>  void      *udata;       /* opaque user data identifier */
> };


From your description there is a serious problem with arches which
support different pointer widths. I do not have the sources of any BSD
right now, but if it is really like you've described, it can not be used
in Linux at all.

> int kevent(int kq, const struct kevent *changelist, int nchanges,
> struct kevent *eventlist, int nevents, const struct timespec
> *timeout);
> 
> The only thing missing in BSD's kevent is the min/max parameters, the
> various filters in kevent_get_events either have equivalent filters or
> could be added as extensions. (I didn't look too carefully through
> them)
> 
> On the other hand, your API lacks the ability to pass changes when
> polling, as mentioned above. It would be preferable if the timeout
> parameter was either timespec or timeval.

No way - timespec uses long, which has different sizes on 32- and 64-bit ABIs.

> Rakshasa

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 10:03             ` Evgeniy Polyakov
@ 2006-08-22 19:57               ` Nicholas Miell
  2006-08-22 20:16                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22 19:57 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 02:29:48AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > > Is any of this documented anywhere? I'd think that any new userspace
> > > > interfaces should have man pages explaining their use and some example
> > > > code before getting merged into the kernel to shake out any interface
> > > > problems.
> > > 
> > > There are two excellent articles on lwn.net
> > 
> > Google knows of one and it doesn't actually explain how to use kevents.
> 
> http://lwn.net/Articles/192964/
> http://lwn.net/Articles/172844/
> 
> In the thread there were enough links to homepage where you can find
> several examples of how to use kevents (and timers among others) with
> old interfaces and new ones.
> 

Oh, I found both of those. Neither of them told me what values I could
use in a struct kevent_user_control or what they meant or what any of
the fields in a struct ukevent or struct kevent_id meant or what I'm
supposed to pass in kevent_get_event's "void* buf", or many other things
that I don't remember now. 

In short, I'm stuck trying to reverse engineer from the source what the
API is supposed to be (which might not even be what is actually
implemented due to the as of yet unfound bug).

Of course, since you already know how all this stuff is supposed to
work, you could maybe write it down somewhere?


> > > I will ask just one question, do _you_ propose anything here?
> > >  
> > 
> > struct sigevent sigev = {
> > 	.sigev_notify = SIGEV_KEVENT,
> > 	.sigev_kevent_fd = kev_fd,
> > 	.sigev_value.sival_ptr = &MyCookie
> > };
> > 
> > struct itimerspec its = {
> > 	.it_value = { ... },
> > 	.it_interval = { ... }
> > };
> > 
> > struct timespec timeout = { .. };
> > 
> > struct ukevent events[max];
> > 
> > timer_t timer;
> > 
> > timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> > timer_settime(timer, 0, &its, NULL);
> > 
> > /* ... */
> > 
> > kevent_get_events(kev_fd, min, max, &timeout, events, 0);
> > 
> > 
> > 
> > Which isn't all that different from what Ulrich Drepper suggested and
> > Solaris does right now. (timer_create would probably end up calling
> > kevent_ctl itself, but it obviously can't do that unless kevents
> > actually support real interval timers).
> 
> Ugh, rtsignals... Their's problems forced me to not implement
> "interrupt"-like mechanism for kevents in addition to dequeueing.
> 
> Anyway, it seems you did not read the whole thread, homepage, lwn and
> userpsace examples, so you do not understand what kevents are.
> 
> They are userspace requests which are returned back when they are ready.
> It means that userspace must provide something to kernel and ask it to
> notify when that "something" is ready. For example it can provide a
> timeout value and ask kernel to fire a timer with it and inform
> userspace when timeout has expired.
> It does not matter what timer is used there - feel free to use
> high-resolution one, usual timer, busyloop or anything else. Main issue 
> that userspace request must be completed.
> 
> What you are trying to do is to put kevents under POSIX API.
> That means that those kevents can not be read using
> kevent_get_events(), basicaly because there are no user-known kevents,
> i.e. user has not requested timer, so it should not receive it's
> notifications (otherwise it will receive everything requested by other
> threads and other issues, i.e. how to differentiate timer request made
> by timer_create(), which is not supposed to be caught by
> kevent_get_events()).
> 

I have no idea what you're trying to say here. I've created a timer,
specified which kevent queue I want its expiry notification delivered
to, and armed it. Where have I not specified enough information to
request the reception of timer notifications?

Also, differentiating timers made by timer_create() that aren't supposed
to deliver events via kevent_get_events() is easy -- their .sigev_notify
isn't SIGEV_KEVENT.

> You could implement POSIX timer _fully_ on top of kevents, i.e. both
> create and read, for example network AIO is implemented in that way -
> there is a system calls aio_send()/aio_recv() and aio_sendfile() which
> create kevent internally and then get it's readiness notifications over
> provided callback, process data and finally remove kevent,
> so POSIX timers could create timer kevent, wait until it is ready, in
> completeness callback it would call signal delivering mechanism...
> 

Yes, but that would be stupid. The kernel already has a fully functional
POSIX timer implementation, so throwing it out to reimplement it using
kevents would be a waste of effort, especially considering that your
kevent timers can't fully express a POSIX interval timer.

Now, if there were some way for me to ask that an interval timer queue
its expiry notices into a kevent queue, that would combine the best of
both worlds.

> But there are no reading mechanism in POSIX timers (I mean not reading
> pending timeout values or remaining time), they use signals for 
> completeness delivering... So where do you want to put kevent's
> userspace there?
> 

The goal of this proposal is to extend sigevent completions to include
kevent queues along with signals and created threads, exactly because
thread creation is too heavy and signals are a pain to use.

> What you are trying to achive is not POSIX timers in any way, you want
> completely new machanism which has similar to POSIX API, and I give it to
> you (well, with API which can be used not only with timers, but with any 
> other type of notifications you like). 
> You need clockid_t? Put it in raw.id[0] and make kevent_timer_enqueue()
> callback select different type of timers.
> What else?

No, it's still POSIX timers -- the vast majority of the API is the same,
they just report their completion differently.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 14:59             ` James Morris
@ 2006-08-22 20:00               ` Nicholas Miell
  2006-08-22 20:36                 ` David Miller
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22 20:00 UTC (permalink / raw)
  To: James Morris
  Cc: David Miller, johnpol, linux-kernel, drepper, akpm, netdev,
	zach.brown, hch

On Tue, 2006-08-22 at 10:59 -0400, James Morris wrote:
> On Tue, 22 Aug 2006, Nicholas Miell wrote:
> 
> > In this brave new world of always stable kernel development, the time a
> > new interface has for public testing before a new kernel release is
> > drastically shorter than the old unstable development series, and if
> > nobody is documenting how this stuff is supposed to work and
> > demonstrating how it will be used, then mistakes are bound to slip
> > through.
> 
> Feel free to provide the documentation.  Perhaps, even as much as you've 
> written so far in these emails would be enough.
> 

I'm not the one proposing the new (potentially wrong) interface. The
onus isn't on me.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 19:57               ` Nicholas Miell
@ 2006-08-22 20:16                 ` Evgeniy Polyakov
  2006-08-22 21:13                   ` Nicholas Miell
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-22 20:16 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 22, 2006 at 02:29:48AM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > > > Is any of this documented anywhere? I'd think that any new userspace
> > > > > interfaces should have man pages explaining their use and some example
> > > > > code before getting merged into the kernel to shake out any interface
> > > > > problems.
> > > > 
> > > > There are two excellent articles on lwn.net
> > > 
> > > Google knows of one and it doesn't actually explain how to use kevents.
> > 
> > http://lwn.net/Articles/192964/
> > http://lwn.net/Articles/172844/
> > 
> > In the thread there were enough links to homepage where you can find
> > several examples of how to use kevents (and timers among others) with
> > old interfaces and new ones.
> > 
> 
> Oh, I found both of those. Neither of them told me what values I could
> use in a struct kevent_user_control or what they meant or what any of
> the fields in a struct ukevent or struct kevent_id meant or what I'm
> supposed to pass in kevent_get_event's "void* buf", or many other things
> that I don't remember now. 

Well, I think LWN has very good explaination of what all parameters
mean, but it is possible that there can be some white areas.
No one forbids to look into userspace examples, link to them was posted
a lot of times.

> In short, I'm stuck trying to reverse engineer from the source what the
> API is supposed to be (which might not even be what is actually
> implemented due to the as of yet unfound bug).
> 
> Of course, since you already know how all this stuff is supposed to
> work, you could maybe write it down somewhere?

I will write documentation, but as you can see, some interfaces have
changed.

> 
> > > > I will ask just one question, do _you_ propose anything here?
> > > >  
> > > 
> > > struct sigevent sigev = {
> > > 	.sigev_notify = SIGEV_KEVENT,
> > > 	.sigev_kevent_fd = kev_fd,
> > > 	.sigev_value.sival_ptr = &MyCookie
> > > };
> > > 
> > > struct itimerspec its = {
> > > 	.it_value = { ... },
> > > 	.it_interval = { ... }
> > > };
> > > 
> > > struct timespec timeout = { .. };
> > > 
> > > struct ukevent events[max];
> > > 
> > > timer_t timer;
> > > 
> > > timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> > > timer_settime(timer, 0, &its, NULL);
> > > 
> > > /* ... */
> > > 
> > > kevent_get_events(kev_fd, min, max, &timeout, events, 0);
> > > 
> > > 
> > > 
> > > Which isn't all that different from what Ulrich Drepper suggested and
> > > Solaris does right now. (timer_create would probably end up calling
> > > kevent_ctl itself, but it obviously can't do that unless kevents
> > > actually support real interval timers).
> > 
> > Ugh, rtsignals... Their's problems forced me to not implement
> > "interrupt"-like mechanism for kevents in addition to dequeueing.
> > 
> > Anyway, it seems you did not read the whole thread, homepage, lwn and
> > userpsace examples, so you do not understand what kevents are.
> > 
> > They are userspace requests which are returned back when they are ready.
> > It means that userspace must provide something to kernel and ask it to
> > notify when that "something" is ready. For example it can provide a
> > timeout value and ask kernel to fire a timer with it and inform
> > userspace when timeout has expired.
> > It does not matter what timer is used there - feel free to use
> > high-resolution one, usual timer, busyloop or anything else. Main issue 
> > that userspace request must be completed.
> > 
> > What you are trying to do is to put kevents under POSIX API.
> > That means that those kevents can not be read using
> > kevent_get_events(), basicaly because there are no user-known kevents,
> > i.e. user has not requested timer, so it should not receive it's
> > notifications (otherwise it will receive everything requested by other
> > threads and other issues, i.e. how to differentiate timer request made
> > by timer_create(), which is not supposed to be caught by
> > kevent_get_events()).
> > 
> 
> I have no idea what you're trying to say here. I've created a timer,
> specified which kevent queue I want it's expiry notification delivered
> to, and armed it. Where have I not specified enough information to
> request the reception of timer notifications?

You can do it with kevent timer notifications. Easily.
I've even attached a simple program for that.

> Also, differentiating timers made by timer_create() that aren't supposed
> to deliver events via kevent_get_events() is easy -- their .sigev_notify
> isn't SIGEV_KEVENT.

What should be returned to the user? What should be placed into the
user's data and id fields? How can the user determine after which
initial value a given event fires?
Finally, if you think that kevents should use a different API for
different events, think about the complicated userspace code which must
know tons of syscalls for the same task.

> > You could implement POSIX timer _fully_ on top of kevents, i.e. both
> > create and read, for example network AIO is implemented in that way -
> > there is a system calls aio_send()/aio_recv() and aio_sendfile() which
> > create kevent internally and then get it's readiness notifications over
> > provided callback, process data and finally remove kevent,
> > so POSIX timers could create timer kevent, wait until it is ready, in
> > completeness callback it would call signal delivering mechanism...
> > 
> 
> Yes, but that would be stupid. The kernel already has a fully functional
> POSIX timer implementation, so throwing it out to reimplement it using
> kevents would be a waste of effort, especially considering that your
> kevent timers can't fully express a POSIX interval timer.
> 
> Now, if there were some way for me to ask that an interval timer queue
> it's expiry notices into a kevent queue, that would combine the best of
> both worlds.

Just use kevents directly, without POSIX timers at all.
It is possible to add high-resolution timers there.

> > But there are no reading mechanism in POSIX timers (I mean not reading
> > pending timeout values or remaining time), they use signals for 
> > completeness delivering... So where do you want to put kevent's
> > userspace there?
> > 
> 
> The goal of this proposal is to extend sigevent completions to include
> kevent queues along with signals and created threads, exactly because
> thread creation is too heavy and signals are a pain to use.

What you propose is a completely new mechanism - it is already implemented
inside kevent timer notifications, except that the API does not match the
POSIX one.

> > What you are trying to achive is not POSIX timers in any way, you want
> > completely new machanism which has similar to POSIX API, and I give it to
> > you (well, with API which can be used not only with timers, but with any 
> > other type of notifications you like). 
> > You need clockid_t? Put it in raw.id[0] and make kevent_timer_enqueue()
> > callback select different type of timers.
> > What else?
> 
> No, it's still POSIX timers -- the vast majority of the API is the same,
> they just report their completion differently.

POSIX timers are just an API over in-kernel timers.
Kevent provides a different and much more convenient API (since you want
to use kevent's queue rather than signals), so where are those similar
things?
How will you change a timer from a POSIX syscall, when it completely does
not know about kevents? There will be races, so you need to change the
POSIX API (its internal part which works with timers).
If you are going to change the internal part of the POSIX timer
implementation and add a new syscall into your userspace program, you can
just switch to the new API entirely, since right now I do not see any
major problems in the kevent timer implementation (except that it does
not use high-res timers, which were not included in the kernel when
kevents were created).

> -- 
> Nicholas Miell <nmiell@comcast.net>

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 20:00               ` Nicholas Miell
@ 2006-08-22 20:36                 ` David Miller
  2006-08-22 21:13                   ` Nicholas Miell
  0 siblings, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-22 20:36 UTC (permalink / raw)
  To: nmiell
  Cc: jmorris, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

From: Nicholas Miell <nmiell@comcast.net>
Date: Tue, 22 Aug 2006 13:00:23 -0700

> I'm not the one proposing the new (potentially wrong) interface. The
> onus isn't on me.

You can't demand a volunteer to do work, period.

If it matters to you, you have the option of doing the work.
Otherwise you can't complain.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 20:16                 ` Evgeniy Polyakov
@ 2006-08-22 21:13                   ` Nicholas Miell
  2006-08-22 21:37                     ` Randy.Dunlap
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22 21:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > Of course, since you already know how all this stuff is supposed to
> > work, you could maybe write it down somewhere?
> 
> I will write documantation, but as you can see some interfaces are
> changed.

Thanks; rapidly changing interfaces need good documentation even more
than stable interfaces simply because reverse engineering the intended
API from a changing implementation becomes even more difficult.

> > > > > I will ask just one question, do _you_ propose anything here?
> > > > 
> > > > struct sigevent sigev = {
> > > > 	.sigev_notify = SIGEV_KEVENT,
> > > > 	.sigev_kevent_fd = kev_fd,
> > > > 	.sigev_value.sival_ptr = &MyCookie
> > > > };
> > > > 
> > > > struct itimerspec its = {
> > > > 	.it_value = { ... },
> > > > 	.it_interval = { ... }
> > > > };
> > > > 
> > > > struct timespec timeout = { .. };
> > > > 
> > > > struct ukevent events[max];
> > > > 
> > > > timer_t timer;
> > > > 
> > > > timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> > > > timer_settime(timer, 0, &its, NULL);
> > > > 
> > > > /* ... */
> > > > 
> > > > kevent_get_events(kev_fd, min, max, &timeout, events, 0);
> > > > 
> > > > 
> > > > 
> > > > Which isn't all that different from what Ulrich Drepper suggested and
> > > > Solaris does right now. (timer_create would probably end up calling
> > > > kevent_ctl itself, but it obviously can't do that unless kevents
> > > > actually support real interval timers).
> > > 
> > > Ugh, rtsignals... Their's problems forced me to not implement
> > > "interrupt"-like mechanism for kevents in addition to dequeueing.
> > > 
> > > Anyway, it seems you did not read the whole thread, homepage, lwn and
> > > userpsace examples, so you do not understand what kevents are.
> > > 
> > > They are userspace requests which are returned back when they are ready.
> > > It means that userspace must provide something to kernel and ask it to
> > > notify when that "something" is ready. For example it can provide a
> > > timeout value and ask kernel to fire a timer with it and inform
> > > userspace when timeout has expired.
> > > It does not matter what timer is used there - feel free to use
> > > high-resolution one, usual timer, busyloop or anything else. Main issue 
> > > that userspace request must be completed.
> > > 
> > > What you are trying to do is to put kevents under POSIX API.
> > > That means that those kevents can not be read using
> > > kevent_get_events(), basicaly because there are no user-known kevents,
> > > i.e. user has not requested timer, so it should not receive it's
> > > notifications (otherwise it will receive everything requested by other
> > > threads and other issues, i.e. how to differentiate timer request made
> > > by timer_create(), which is not supposed to be caught by
> > > kevent_get_events()). 
> > 
> > I have no idea what you're trying to say here. I've created a timer,
> > specified which kevent queue I want it's expiry notification delivered
> > to, and armed it. Where have I not specified enough information to
> > request the reception of timer notifications?
> 
> You can do it with kevent timer notifications. Easily.
> I've even attached simple program for that.

You forgot to attach the program.

> > Also, differentiating timers made by timer_create() that aren't supposed
> > to deliver events via kevent_get_events() is easy -- their .sigev_notify
> > isn't SIGEV_KEVENT.
> 
> What should be returned to user? 
> What should be placed into user's data, into id? 

The cookie I passed in -- in this example, it was &MyCookie.

> How user can determine that given event fires after which
> initial value?

I don't know what this means.

> Finally, if you think that kevents should use different API for
> different events, think about complicated userspace code which must know
> tons of syscalls for the same task.

I don't think cramming everything together into the same syscall is any
better. In fact, a series of discrete, easy-to-understand function calls
is a hell of a lot easier to deal with than a single call that takes an
array of large multi-purpose structures, especially when most of those
function calls have standard specified behavior.

In fact, I doubt anything will *ever* use kevents directly -- it's
either going to be something like libevent which wraps this stuff
portably or the app's own portability layer or GLib's event loop or
something else that abstracts away the fact that nobody can agree on
what the primitives for a unified event loop should be. There's nothing
like another layer of indirection to solve your problems.

> > > You could implement POSIX timer _fully_ on top of kevents, i.e. both
> > > create and read, for example network AIO is implemented in that way -
> > > there is a system calls aio_send()/aio_recv() and aio_sendfile() which
> > > create kevent internally and then get it's readiness notifications over
> > > provided callback, process data and finally remove kevent,
> > > so POSIX timers could create timer kevent, wait until it is ready, in
> > > completeness callback it would call signal delivering mechanism...
> > 
> > Yes, but that would be stupid. The kernel already has a fully functional
> > POSIX timer implementation, so throwing it out to reimplement it using
> > kevents would be a waste of effort, especially considering that your
> > kevent timers can't fully express a POSIX interval timer.
> > 
> > Now, if there were some way for me to ask that an interval timer queue
> > it's expiry notices into a kevent queue, that would combine the best of
> > both worlds.
> 
> Just use kevents directly without POSIX timers at all.
> It is possible to add there high-resolution timers.

So the existing kevent API is currently incomplete?

> > > But there are no reading mechanism in POSIX timers (I mean not reading
> > > pending timeout values or remaining time), they use signals for 
> > > completeness delivering... So where do you want to put kevent's
> > > userspace there?
> > 
> > The goal of this proposal is to extend sigevent completions to include
> > kevent queues along with signals and created threads, exactly because
> > thread creation is too heavy and signals are a pain to use.
> 
> What you propose is completely new mechanism - it is implemented inside
> kevent timer notifications expect that API does not match POSIX one.

A completely new delivery mechanism, yes. The rest of the API for timer
creation, arming, query, destruction, etc. remains the same.

This is opposed to the completely new mechanism for delivery, creation,
arming, query, destruction, etc. that is the currently proposed kevents
timer interface.

> > > What you are trying to achive is not POSIX timers in any way, you want
> > > completely new machanism which has similar to POSIX API, and I give it to
> > > you (well, with API which can be used not only with timers, but with any 
> > > other type of notifications you like). 
> > > You need clockid_t? Put it in raw.id[0] and make kevent_timer_enqueue()
> > > callback select different type of timers.
> > > What else?
> > 
> > No, it's still POSIX timers -- the vast majority of the API is the same,
> > they just report their completion differently.
> 
> POSIX timers are just a API over in-kernel timers.
> Kevent provides different and much more convenient API (since you want
> to not use signals but kevent's queue), so where are those similar
> things?

I don't think you have established the kevent timer API's convenience
yet (beyond the fact that it doesn't use signals, which everybody
wants).

> How will you change a timer from POSIX syscall, when it completely does
> not know about kevents, there will be racess, so you need to
> change POSIX API (it's internal part which works with timers).

Yes, the kernel's POSIX timer implementation will need to be altered so
that it can queue timer completion events to a kevent queue.

> If you are going to change internal part of POSIX timers implementatin
> and add new syscall into your userspace program, you can just switch to
> the new API entirely, since right now I do not see any major problems in
> kevent timer implementation (expect that it does not use high-res
> timers, which were not included into kernel when kevents were created).

Switching an entire program away from POSIX interval timers would be
more work than the modifications necessary to switch only its timer
delivery mechanism, especially when the new timer system doesn't have
documentation and isn't as functional as the old.

And of course you don't see any major problems in the kevent timer
implementation -- if you did, you would have fixed them already.
However, that doesn't mean that they don't exist.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 20:36                 ` David Miller
@ 2006-08-22 21:13                   ` Nicholas Miell
  2006-08-22 21:25                     ` David Miller
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22 21:13 UTC (permalink / raw)
  To: David Miller
  Cc: jmorris, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

On Tue, 2006-08-22 at 13:36 -0700, David Miller wrote:
> From: Nicholas Miell <nmiell@comcast.net>
> Date: Tue, 22 Aug 2006 13:00:23 -0700
> 
> > I'm not the one proposing the new (potentially wrong) interface. The
> > onus isn't on me.
> 
> You can't demand a volunteer to do work, period.
> 
> If it matters to you, you have the option of doing the work.
> Otherwise you can't complain.

So if a volunteer does bad work, I'm obligated to accept it just because
I haven't done better?

Alternately, if a volunteer does bad work, must it be merged into the
kernel because there isn't a better implementation? (I believe that
was tried at least once with devfs.)

And how is the quality of the work to be judged if the work isn't
commented, documented and explained, especially the userland-visible
parts that *cannot* *ever* *be* *changed* *or* *removed* once they're in
a stable kernel release?

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 21:13                   ` Nicholas Miell
@ 2006-08-22 21:25                     ` David Miller
  2006-08-22 22:58                       ` Nicholas Miell
  0 siblings, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-22 21:25 UTC (permalink / raw)
  To: nmiell
  Cc: jmorris, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

From: Nicholas Miell <nmiell@comcast.net>
Date: Tue, 22 Aug 2006 14:13:40 -0700

> And how is the quality of the work to be judged if the work isn't
> commented, documented and explained, especially the userland-visible
> parts that *cannot* *ever* *be* *changed* *or* *removed* once they're in
> a stable kernel release?

Are you even willing to look at the collection of example applications
Evgeniy wrote against this API?

That is the true test of a set of interfaces, what happens when you
try to actually use them in real programs.

Everything else is fluff, including standards and "documentation".

He even bothered to benchmark things, and post associated graphs and
performance analysis during the course of development.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 21:13                   ` Nicholas Miell
@ 2006-08-22 21:37                     ` Randy.Dunlap
  2006-08-22 22:01                       ` Andrew Morton
  2006-08-22 22:58                       ` Nicholas Miell
  0 siblings, 2 replies; 143+ messages in thread
From: Randy.Dunlap @ 2006-08-22 21:37 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Evgeniy Polyakov, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig

On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:

> On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > Of course, since you already know how all this stuff is supposed to
> > > work, you could maybe write it down somewhere?
> > 
> > I will write documantation, but as you can see some interfaces are
> > changed.
> 
> Thanks; rapidly changing interfaces need good documentation even more
> than stable interfaces simply because reverse engineering the intended
> API from a changing implementation becomes even more difficult.

OK, I don't quite get it.
Can you be precise about what you would like?

a.  good documentation
b.  a POSIX API
c.  a Windows-compatible API
d.  other?

and we won't make you use any of this code.

---
~Randy

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 21:37                     ` Randy.Dunlap
@ 2006-08-22 22:01                       ` Andrew Morton
  2006-08-22 22:17                         ` David Miller
  2006-08-22 22:58                       ` Nicholas Miell
  1 sibling, 1 reply; 143+ messages in thread
From: Andrew Morton @ 2006-08-22 22:01 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: Nicholas Miell, Evgeniy Polyakov, lkml, David Miller,
	Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig

On Tue, 22 Aug 2006 14:37:47 -0700
"Randy.Dunlap" <rdunlap@xenotime.net> wrote:

> On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:
> 
> > On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > > Of course, since you already know how all this stuff is supposed to
> > > > work, you could maybe write it down somewhere?
> > > 
> > > I will write documantation, but as you can see some interfaces are
> > > changed.
> > 
> > Thanks; rapidly changing interfaces need good documentation even more
> > than stable interfaces simply because reverse engineering the intended
> > API from a changing implementation becomes even more difficult.
> 
> OK, I don't quite get it.
> Can you be precise about what you would like?
> 
> a.  good documentation
> b.  a POSIX API
> c.  a Windows-compatible API
> d.  other?
> 
> and we won't make you use any of this code.
> 

Today seems to be beat-up-Nick day?

This is a major, major new addition to the kernel API.  It's a big deal. 
Getting it documented prior to committing ourselves is a useful part of the
review process.  It certainly can't hurt, and it might help.  It is a
little too soon to spend too much time on that though.  (It's actually
_better_ if someone other than the developer writes the documentation,
too).


And the "why not emulate kqueue" question strikes me as an excellent one. 
Presumably a lot of developer thought and in-field experience has gone into
kqueue.  It would benefit us to use that knowledge as much as we can.

I mean, if there's nothing wrong with kqueue then let's minimise app
developer pain and copy it exactly.  If there _is_ something wrong with
kqueue then let us identify those weaknesses and then diverge.  Doing
something which looks the same and works the same and does the same thing
but has a different API doesn't benefit anyone.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 22:01                       ` Andrew Morton
@ 2006-08-22 22:17                         ` David Miller
  2006-08-22 23:35                           ` Andrew Morton
  0 siblings, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-22 22:17 UTC (permalink / raw)
  To: akpm
  Cc: rdunlap, nmiell, johnpol, linux-kernel, drepper, netdev, zach.brown, hch

From: Andrew Morton <akpm@osdl.org>
Date: Tue, 22 Aug 2006 15:01:44 -0700

> If there _is_ something wrong with kqueue then let us identify those
> weaknesses and then diverge.

Evgeniy already enumerated this, both on his web site and in the
current thread.

Contrary to what some people seem to imply, Evgeniy did research all the
other implementations of event queueing out there, including kqueue.
He took the best of that survey, adding some of his own ideas,
and that's what kevent is.  It's not like he's some kind of
charlatan and made arbitrary decisions in his design without any
regard for what's out there already.

Again, the proof is in the pudding, he wrote applications against his
interfaces and tested them.  That's what people need to really do if
they want to judge his interface, try to write programs against it and
report back any problems they run into.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 19:47               ` Evgeniy Polyakov
@ 2006-08-22 22:51                 ` Jari Sundell
  2006-08-22 23:11                   ` Alexey Kuznetsov
  0 siblings, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-22 22:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nicholas Miell, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig

On 8/22/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Word "polling" really confuses me here, but now I understand you.
> Such approach actually has unresolved issues - consider for
> example a situation when all provided events are ready immediately - what
> should be returned (as far as I recall they are always added into kqueue in
> BSDs before started to be checked, so old events will be returned
> first)? And currently ready events can be read through mapped buffer
> without any syscall at all.
> And Linux syscall is much cheaper than BSD's one.
> Consider (especially the mapped buffer) these issues; it really does not
> cost interface complexity.

There's no reason I can see that kqueue's kevent should not be able to
check an mmapped buffer as in your implementation, after having passed
any filter changes to the kernel.

I'm not sure if I read you correctly, but the situation where all
events are ready immediately is not a problem. Only the delta is
passed with the kevent call, so old events will still be first in the
queue. And as long as the user doesn't randomize the order of the
changelist and passes the changedlist with each kevent call, the
resulting order in which changes are received will be no different
from using individual system calls.

If there's some very specific reason the user needs to retain the
order in which events happen in the interval between adding it to the
changelist and calling kevent, he may decide to call kevent
immediately without asking for any events.

> First of all, there are completely different types.
> Design of the in-kernel part is very different too.

The question I'm asking is not whether kqueue can fit this
implementation, but rather whether it is possible to make the
implementation fit kqueue. I can't really see any fundamental
differences, merely implementation details. Maybe I'm just unfamiliar
with the requirements.

> > BSD's kqueue:
> >
> > struct kevent {
> >  uintptr_t ident;        /* identifier for this event */
> >  short     filter;       /* filter for event */
> >  u_short   flags;        /* action flags for kqueue */
> >  u_int     fflags;       /* filter flag value */
> >  intptr_t  data;         /* filter data value */
> >  void      *udata;       /* opaque user data identifier */
> > };
>
>
> From your description there is a serious problem with arches which
> support different widths of the pointer. I do not have sources of any BSD
> right now, but if it is really like you've described, it can not be used
> in Linux at all.

Are you referring to udata or data? I'll assume the latter as the
former is more of a restriction on user-space. intptr_t is required to
be safely convertible to a void*, so I don't see what the problem
would be.

> No way - timespec uses long.

I must have missed that discussion. Please enlighten me in what regard
using an opaque type with lower resolution is preferable to a type
defined in POSIX for this sort of purpose. Considering the extra code
I need to write to properly handle having just ms resolution, it
better be something fundamentally broken. ;)

Rakshasa

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 21:37                     ` Randy.Dunlap
  2006-08-22 22:01                       ` Andrew Morton
@ 2006-08-22 22:58                       ` Nicholas Miell
  2006-08-22 23:06                         ` David Miller
  2006-08-22 23:22                         ` [take12 0/3] kevent: Generic event handling mechanism Randy.Dunlap
  1 sibling, 2 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22 22:58 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: Evgeniy Polyakov, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig

On Tue, 2006-08-22 at 14:37 -0700, Randy.Dunlap wrote:
> On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:
> 
> > On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > > Of course, since you already know how all this stuff is supposed to
> > > > work, you could maybe write it down somewhere?
> > > 
> > > I will write documantation, but as you can see some interfaces are
> > > changed.
> > 
> > Thanks; rapidly changing interfaces need good documentation even more
> > than stable interfaces simply because reverse engineering the intended
> > API from a changing implementation becomes even more difficult.
> 
> OK, I don't quite get it.
> Can you be precise about what you would like?
> 
> a.  good documentation
> b.  a POSIX API
> c.  a Windows-compatible API
> d.  other?
> 
> and we won't make you use any of this code.

I want something that I can be confident won't be replaced again in two
years because nobody noticed problems with the old API design or they're
just feeling very NIH with their snazzy new feature.

Maybe then we won't end up with another in the { signal/sigaction,
waitpid/wait4, select/pselect, poll/ppoll, msgrcv, mq_receive,
io_getevents, aio_suspend/aio_return, epoll_wait, inotify read,
kevent_get_events } collection -- or do you like having a maze of
twisted interfaces, all subtly different and none supporting the
complete feature set?

Good documentation giving enough detail to judge the design and an API
that fits with the current POSIX API (at least, the parts that everybody
agrees don't suck) goes a long way toward assuaging my fears that this
won't just be another waste of effort, doomed to be replaced by the Next
Great Thing (We Really Mean It This Time!) in unified event loop API
design or whatever other interface somebody happens to be working on.

---

This is made extraordinarily difficult by the fact kernel people don't
even agree themselves on what APIs should look like anyway and Linus
won't take a stand on the issue -- people with influence are
simultaneously arguing things like:

- ioctls are bad because they aren't typesafe and you should use
syscalls instead because they are typesafe

- ioctls are good, because they're much easier to add than syscalls,
type safety can be supplied by the library wrapper, and syscalls are a
(relatively) scarce resource, harder to wire up in the first place, and
are more difficult to make optional or remove entirely if you decide
they were a stupid idea.

- multiplexors are bad because they're too complex or not typesafe

- multiplexors are good because they save syscall slots or ioctl numbers
and the library wrapper provides the typesafety anyway.

- instead of syscalls or ioctls, you should create a whole new
filesystem that has a bunch of magic files that you read from and write
to in order to talk to the kernel

- filesystem interfaces are bad, because they take more effort to
write than a syscall or an ioctl and nobody seems to know how to maintain
and evolve a filesystem-based ABI or make them easy to use outside of a
fragile shell script (see: sysfs)

- that everything in those custom filesystems should be ASCII strings and
nobody needs an actual grammar describing how to parse them, we can just
break userspace whenever we feel like it

- that everything in those custom filesystems should be C structs, and
screw the shell scripts

- new filesystem metadata should be exposed by:
	- xattrs
	- ioctls
	- new syscalls
		or
	- named streams/forks/not-xattrs
  and three out of four of these suggestions are completely wrong for
  some critical reason

- meanwhile, the networking folks are doing everything via AF_NETLINK
sockets instead of syscalls or ioctl or whatever, I guess because the
network stack is what's most familiar to them

- and there are the usual arguments about typedefs versus bare struct
names, #defines versus enums, returning 0 on success vs. 0 on failure,
and lots of other piddly stupid stuff that somebody just needs to say
"this is how it's done and no arguing" about.

Honestly, somebody with enough clout to make it stick needs to write out
a spec describing what new kernel interfaces should look like and how
they should fit in with existing interfaces.

It'd probably make Evgeniy's life easier if you could just point at the
interface guidelines and say "you did this wrong" instead of random
people telling him to change his design and random other people telling
him to change it back.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 21:25                     ` David Miller
@ 2006-08-22 22:58                       ` Nicholas Miell
  2006-08-22 23:46                         ` Ulrich Drepper
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-22 22:58 UTC (permalink / raw)
  To: David Miller
  Cc: jmorris, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

On Tue, 2006-08-22 at 14:25 -0700, David Miller wrote:
> From: Nicholas Miell <nmiell@comcast.net>
> Date: Tue, 22 Aug 2006 14:13:40 -0700
> 
> > And how is the quality of the work to be judged if the work isn't
> > commented, documented and explained, especially the userland-visible
> > parts that *cannot* *ever* *be* *changed* *or* *removed* once they're in
> > a stable kernel release?
> 
> Are you even willing to look at the collection of example applications
> Evgeniy wrote against this API?
> 
> That is the true test of a set of interfaces, what happens when you
> try to actually use them in real programs.
> 
> Everything else is fluff, including standards and "documentation".
> 
> He even bothered to benchmark things, and post assosciated graphs and
> performance analysis during the course of development.

I wasn't aware that any of these existed, he didn't mention them in this
patch series. Having now looked, all I've managed to find are a series
of simple example apps that no longer work because of API changes.

Also, if you've been paying attention, you'll note that I've never
criticized the performance or quality of the underlying kevent
implementation -- as best I can tell, aside from some lockdep complaints
(which, afaik, are the result of lockdep's limitations rather than
problems with kevent), the internals of kevent are excellent.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 22:58                       ` Nicholas Miell
@ 2006-08-22 23:06                         ` David Miller
  2006-08-23  1:36                           ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Nicholas Miell
  2006-08-22 23:22                         ` [take12 0/3] kevent: Generic event handling mechanism Randy.Dunlap
  1 sibling, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-22 23:06 UTC (permalink / raw)
  To: nmiell
  Cc: rdunlap, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

From: Nicholas Miell <nmiell@comcast.net>
Date: Tue, 22 Aug 2006 15:58:12 -0700

> Honestly, somebody with enough clout to make it stick needs to write out
> a spec describing what new kernel interfaces should look like and how
> they should fit in with existing interfaces.

With the time you spent writing this long email alone you could have
worked on either documenting Evgeniy's interfaces or trying to write
test applications against kevent to validate how useful the interfaces
are and if there are any problems with them.

You choose to rant and complain instead of participate.

Therefore, many of us cannot take you seriously.


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 22:51                 ` Jari Sundell
@ 2006-08-22 23:11                   ` Alexey Kuznetsov
  2006-08-23  0:28                     ` Jari Sundell
  0 siblings, 1 reply; 143+ messages in thread
From: Alexey Kuznetsov @ 2006-08-22 23:11 UTC (permalink / raw)
  To: Jari Sundell
  Cc: Evgeniy Polyakov, Nicholas Miell, lkml, David Miller,
	Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig

Hello!

> >No way - timespec uses long.
> 
> I must have missed that discussion. Please enlighten me in what regard
> using an opaque type with lower resolution is preferable to a type
> defined in POSIX for this sort of purpose.

Let me explain, as a person who made this mistake and deeply
regrets it.

F.e. in this case you just cannot use kevents in 32bit application
on x86_64, unless you add the whole translation layer inside kevent core.
Even when you deal with plain syscall, translation is a big pain,
but when you use mmapped buffer, it can be simply impossible.

F.e. my mistake was "unsigned long" in struct tpacket_hdr in linux/if_packet.h.
It makes use of the mmapped packet socket essentially impossible for 32-bit
applications on 64-bit archs.

Alexey

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 22:58                       ` Nicholas Miell
  2006-08-22 23:06                         ` David Miller
@ 2006-08-22 23:22                         ` Randy.Dunlap
  1 sibling, 0 replies; 143+ messages in thread
From: Randy.Dunlap @ 2006-08-22 23:22 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Evgeniy Polyakov, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig

On Tue, 22 Aug 2006 15:58:12 -0700 Nicholas Miell wrote:

> On Tue, 2006-08-22 at 14:37 -0700, Randy.Dunlap wrote:
> > On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:
> > 
> > > On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > > > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > > > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > > > Of course, since you already know how all this stuff is supposed to
> > > > > work, you could maybe write it down somewhere?
> > > > 
> > > > I will write documantation, but as you can see some interfaces are
> > > > changed.
> > > 
> > > Thanks; rapidly changing interfaces need good documentation even more
> > > than stable interfaces simply because reverse engineering the intended
> > > API from a changing implementation becomes even more difficult.
> > 
> > OK, I don't quite get it.
> > Can you be precise about what you would like?
> > 
> > a.  good documentation
> > b.  a POSIX API
> > c.  a Windows-compatible API
> > d.  other?
> > 
> > and we won't make you use any of this code.
> 
> I want something that I can be confident won't be replaced again in two
> years because nobody noticed problems with the old API design or they're
> just feeling very NIH with their snazzy new feature.
> 
> Maybe then we won't end up with another in the { signal/sigaction,
> waitpid/wait4, select/pselect, poll/ppol,  msgrcv, mq_receive,
> io_getevents, aio_suspend/aio_return, epoll_wait, inotify read,
> kevent_get_events } collection -- or do you like having a maze of
> twisted interfaces, all subtly different and none supporting the
> complete feature set?
> 
> Good documentation giving enough detail to judge the design and an API
> that fits with the current POSIX API (at least, the parts that everybody
> agrees don't suck) goes a long way toward assuaging my fears that this
> won't just be another waste of effort, doomed to be replaced by the Next
> Great Thing (We Really Mean It This Time!) in unified event loop API
> design or whatever other interface somebody happens to be working on.
> 
> ---

OK, thank you for elaborating.

I suppose that I am more <choose one> {cynical, sarcastic,
practical, pragmatic}.  I don't have a crystal ball for 2 years
out and I don't know anyone who does.

IMO we do the best that we can given some human constraints
and probably some marketplace constraints (like ship something
instead of playing with it for 5 years before shipping it).


> This is made extraordinarily difficult by the fact kernel people don't
> even agree themselves on what APIs should look like anyway and Linus
> won't take a stand on the issue -- people with influence are
> simultaneously arguing things like:
> 
> - ioctls are bad because they aren't typesafe and you should use
> syscalls instead because they are typesafe
> 
> - ioctls are good, because they're much easier to add than syscalls,
> type safety can be supplied by the library wrapper, and syscalls are a
> (relatively) scarce resource, harder to wire up in the first place, and
> are more difficult to make optional or remove entirely if you decide
> they were a stupid idea.

Yes, I was recently part of that argument in Ottawa.

> - multiplexors are bad because they're too complex or not typesafe
> 
> - multiplexors are good because they save syscall slots or ioctl numbers
> and the library wrapper provides the typesafety anyway.

Multiplexors have already lost AFAIK.  Unless someone changes their
mind.  Which happens and will continue to happen.

> - instead of syscalls or ioctls, you should create a whole new
> filesystem that has a bunch of magic files that you read from and write
> to in order to talk to the kernel

Yep.  Some people like that one.  Not everyone.

> - filesystem interfaces are bad, because they're take more effort to
> write than a syscall or a ioctl and nobody seems to know how to maintain
> and evolve a filesystem-based ABI or make them easy to use outside of a
> fragile shell script (see: sysfs)

Ack.

> - that everything in those custom filesystems should ASCII strings and
> nobody needs an actual grammar describing how to parse them, we can just
> break userspace whenever we feel like it

sysfs requires one value per file.  Little parsing required.
But I don't know how to capture atomic values from N files with sysfs.

> - that everything in those custom filesystems should be C structs, and
> screw the shell scripts

Hm, I don't recall that one.

> - new filesystem metadata should be exposed by:
> 	- xattrs
> 	- ioctls
> 	- new syscalls
> 		or
> 	- named streams/forks/not-xattrs
>   and three out of four of these suggestions are completely wrong for
>   some critical reason
> 
> - meanwhile, the networking folks are doing everything via AF_NETLINK
> sockets instead of syscalls or ioctl or whatever, I guess because the
> network stack is what's most familiar to them

I sympathize with you on that one.  I don't care for netlink much
either.  To me it's still an ioctl, even though it is routable
and extensible.

> - and there's the usual arguments about typedefs verses bare struct
> names, #defines verses enums, returning 0 on success vs. 0 on failure,
> and lots of other piddly stupid stuff that somebody just needs to say
> "this is how it's done and no arguing" about.

That's all part of the open-source development process.  Sorry
you dislike it.

> Honestly, somebody with enough clout to make it stick needs to write out
> a spec describing what new kernel interfaces should look like and how
> they should fit in with existing interfaces.

I only know 2 people who could make it stick.

> It'd probably make Evgeniy's life easier if you could just point at the
> interface guidelines and say "you did this wrong" instead of random
> people telling him to change his design and random other people telling
> him to change it back.

I agree.  We went thru some of this at the kernel summit.
The ioctls dissent topic didn't really have many dissenters.
(That was a surprise to me.)

Anyway, thanks again for the additional details.

---
~Randy

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 22:17                         ` David Miller
@ 2006-08-22 23:35                           ` Andrew Morton
  0 siblings, 0 replies; 143+ messages in thread
From: Andrew Morton @ 2006-08-22 23:35 UTC (permalink / raw)
  To: David Miller
  Cc: rdunlap, nmiell, johnpol, linux-kernel, drepper, netdev, zach.brown, hch

On Tue, 22 Aug 2006 15:17:47 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@osdl.org>
> Date: Tue, 22 Aug 2006 15:01:44 -0700
> 
> > If there _is_ something wrong with kqueue then let us identify those
> > weaknesses and then diverge.
> 
> Evgeniy already enumerated this, both on his web site and in the
> current thread.

<googles, spends a few minutes clicking around on
http://tservice.net.ru/~s0mbre/, fails.  Looks in changelogs, also fails>

Best I can find is
http://tservice.net.ru/~s0mbre/blog/devel/kevent/index.html, and that
doesn't cover these things.

At some stage we're going to need to tell Linus (for example) what we've
done and why we did it.  I don't know how to do that.


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 22:58                       ` Nicholas Miell
@ 2006-08-22 23:46                         ` Ulrich Drepper
  2006-08-23  1:51                           ` Nicholas Miell
  2006-08-23  6:54                           ` Evgeniy Polyakov
  0 siblings, 2 replies; 143+ messages in thread
From: Ulrich Drepper @ 2006-08-22 23:46 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: David Miller, jmorris, johnpol, linux-kernel, akpm, netdev,
	zach.brown, hch


I so far also haven't taken the time to look exactly at the interface.
I plan to do it asap since this is IMO our big chance to get it right.
I want to have a unifying interface which can handle all the different
events we need and which come up today and tomorrow.  We have to be able
to handle not only file descriptors and AIO but also timers, signals,
message queues (OK, they are file descriptors but let's make it
official), and futexes.  I'm probably missing one thing or another.

DaveM says there are example programs for the current interfaces.  I
must admit I haven't seen those either.  So if possible, point the world
to them again.  If you do that now I'll review everything and write up
my recommendations re the interface before Monday.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 23:11                   ` Alexey Kuznetsov
@ 2006-08-23  0:28                     ` Jari Sundell
  2006-08-23  0:32                       ` David Miller
  0 siblings, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-23  0:28 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Evgeniy Polyakov, Nicholas Miell, lkml, David Miller,
	Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig

On 8/23/06, Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
> Let me explain, as a person who did this mistake and deeply
> regrets about this.
>
> F.e. in this case you just cannot use kevents in 32bit application
> on x86_64, unless you add the whole translation layer inside kevent core.
> Even when you deal with plain syscall, translation is a big pain,
> but when you use mmapped buffer, it can be simply impossible.
>
> F.e. my mistake was "unsigned long" in struct tpacket_hdr in linux/if_packet.h.
> It makes use of mmapped packet socket essentially impossible by 32bit
> applications on 64bit archs.

There are system calls that take timespec, so I assume the magic is
already available for handling the timeout argument of kevent.
Although I'm not entirely sure about the kqueue timer interface, there
isn't any reason timespec would need to be written to the mmapped
buffer for the rest.

AFAICS, only struct ukevent is visible to the user, same would go for
kqueue's struct kevent.

Rakshasa

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  0:28                     ` Jari Sundell
@ 2006-08-23  0:32                       ` David Miller
  2006-08-23  0:43                         ` Jari Sundell
  0 siblings, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-23  0:32 UTC (permalink / raw)
  To: sundell.software
  Cc: kuznet, johnpol, nmiell, linux-kernel, drepper, akpm, netdev,
	zach.brown, hch

From: "Jari Sundell" <sundell.software@gmail.com>
Date: Wed, 23 Aug 2006 02:28:32 +0200

> There are system calls that take timespec, so I assume the magic is
> already available for handling the timeout argument of kevent.

System calls are one thing, they can be translated for these
kinds of situations.  But this doesn't help, and nothing at
all can be done, for datastructures exposed to userspace via
mmap()'d buffers, which is what kevent will be doing.

This is what Alexey is trying to explain to you.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  0:32                       ` David Miller
@ 2006-08-23  0:43                         ` Jari Sundell
  2006-08-23  6:56                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-23  0:43 UTC (permalink / raw)
  To: David Miller
  Cc: kuznet, johnpol, nmiell, linux-kernel, drepper, akpm, netdev,
	zach.brown, hch

On 8/23/06, David Miller <davem@davemloft.net> wrote:
> > There are system calls that take timespec, so I assume the magic is
> > already available for handling the timeout argument of kevent.
>
> System calls are one thing, they can be translated for these
> kinds of situations.  But this doesn't help, and nothing at
> all can be done, for datastructures exposed to userspace via
> mmap()'d buffers, which is what kevent will be doing.
>
> This is what Alexey is trying to explain to you.

Actually, I didn't miss that, it is an orthogonal issue. A timespec
timeout parameter for the syscall does not imply the use of timespec
in any timer event, etc. Nor is there any timespec timer in kqueue's
struct kevent, which is the only (interface related) thing that will
be exposed.

Rakshasa

^ permalink raw reply	[flat|nested] 143+ messages in thread

* The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.)
  2006-08-22 23:06                         ` David Miller
@ 2006-08-23  1:36                           ` Nicholas Miell
  2006-08-23  2:01                             ` The Proposed Linux kevent API Howard Chu
                                               ` (3 more replies)
  0 siblings, 4 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-23  1:36 UTC (permalink / raw)
  To: David Miller
  Cc: rdunlap, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

On Tue, 2006-08-22 at 16:06 -0700, David Miller wrote:
> With the time you spent writing this long email alone you could have
> worked on either documenting Evgeniy's interfaces or trying to write
> test applications against kevent to validate how useful the interfaces
> are and if there are any problems with them.
> 
> You choose to rant and complain instead of participate.
> 
> Therefore, many of us cannot take you seriously. 

== The Proposed Linux kevent API == 

The proposed Linux kevent API is a new unified event handling
interface, similar in spirit to Windows completion ports and Solaris
completion ports and similar in fact to the FreeBSD/OS X kqueue
interface.

Using a single kernel call, a thread can wait for all possible event
types that the kernel can generate, instead of past interfaces that
only allow you to wait for specific subsets of events (e.g. POSIX
sigevent completions are limited only to AIO completion, timer expiry,
and the arrival of new messages to a message queue, while epoll_wait
is just a more efficient method of doing a traditional Unix select or
poll).

Instead of evolving the struct sigevent notification methods to allow
you to continue using standard POSIX interfaces like lio_listio(),
mq_notify() or timer_create() while queuing completion notifications
to a kevent completion queue (much the way the Solaris port API is
designed, or the API proposed by Ulrich Drepper in "The
Need for Asynchronous, Zero-Copy Network I/O" as found at
http://people.redhat.com/drepper/newni.pdf ), kevent chooses to
follow the FreeBSD route and introduce an entirely new and
incompatible method of requesting and reporting event notifications
(while also managing to be incompatible with FreeBSD's kqueue).

This is done through the introduction of two new syscalls and a
variety of supporting datatypes. The first function, kevent_ctl(), is
used to create and manipulate kevent queues, while the second,
kevent_get_events(), is used to wait for new events.


They operate as follows:

int kevent_ctl(int fd, unsigned int cmd, unsigned int num, void *arg);

fd is the file descriptor referring to the kevent queue to
manipulate. It is ignored if the cmd parameter is KEVENT_CTL_INIT.

cmd is the requested operation. It can be one of the following:

	KEVENT_CTL_INIT - create a new kevent queue and return its file
		descriptor. The fd, num, and arg parameters are ignored.

	KEVENT_CTL_ADD, KEVENT_CTL_MODIFY, KEVENT_CTL_REMOVE - add new,
		modify existing, or remove existing event notification
		requests.

num is the number of struct ukevent in the array pointed to by arg.

arg is an array of struct ukevent. Why it is of type void* and not 
	struct ukevent* is a mystery.

When called, kevent_ctl will carry out the operation specified in the
cmd parameter.


int kevent_get_events(int ctl_fd, unsigned int min_nr,
		unsigned int max_nr, unsigned int timeout,
		void *buf, unsigned flags)

ctl_fd is the file descriptor referring to the kevent queue.

min_nr is the minimum number of completed events that
       kevent_get_events will block waiting for.

max_nr is the number of struct ukevent in buf.

timeout is the number of milliseconds to wait before returning less
	than min_nr events. If this is -1, I *think* it'll wait
	indefinitely, but I'm not sure that msecs_to_jiffies(-1) ends
	up being MAX_SCHEDULE_TIMEOUT

buf is a pointer to an array of struct ukevent. Why it is of type void*
    and not struct ukevent* is a mystery.

flags is unused.

When called, kevent_get_events will wait timeout milliseconds for at
least min_nr completed events, copying completed struct ukevents to
buf and deleting any KEVENT_REQ_ONESHOT event requests.


The bulk of the interface is entirely done through the ukevent struct.
It is used to add event requests, modify existing event requests,
specify which event requests to remove, and return completed events.

struct ukevent contains the following members:

struct kevent_id id
       This is described as containing the "socket number, file
       descriptor and so on", which I take to mean it contains an fd,
       however for some mysterious reason struct kevent_id contains
       __u32 raw[2] and (for KEVENT_POLL events) the actual fd is
       placed in raw[0] and raw[1] is never mentioned except to
       faithfully copy it around.

       For KEVENT_TIMER events, raw[0] contains a relative time in
       milliseconds and raw[1] is still not used.

       Why the struct member is called "raw" remains a mystery.

__u32 type
      The actual event type, either KEVENT_POLL for fd polling or
      KEVENT_TIMER for timers.

__u32 event
      For events of type KEVENT_POLL, event contains the polling flags
      of interest (i.e. POLLIN, POLLPRI, POLLOUT, POLLERR, POLLHUP,
      POLLNVAL).

      For events of type KEVENT_TIMER, event is ignored.

__u32 req_flags
      Per-event request flags. Currently, this may be 0 or
      KEVENT_REQ_ONESHOT to specify that the event be removed after it
      is fired.

__u32 ret_flags
      Per-event return flags. This may be 0 or a combination of
      KEVENT_RET_DONE if the event has completed or
      KEVENT_RET_BROKEN if "the event is broken", which I take to mean
      any sort of error condition. DONE|BROKEN is a valid state, but I
      don't really know what it means.

__u32 ret_data[2]
      Event return data. This is unused by KEVENT_POLL events, while
      KEVENT_TIMER inexplicably places jiffies in ret_data[0]. If the
      event is broken, an error code is placed in ret_data[1].

union { __u32 user[2]; void *ptr; }
      An anonymous union (which is a fairly recent C addition)
      containing data saved for the user and otherwise ignored by the
      kernel.

For KEVENT_CTL_ADD, all fields relevant to the event type must be
filled (id, type, possibly event, req_flags). After kevent_ctl(...,
KEVENT_CTL_ADD, ...) returns each struct's ret_flags should be
checked to see if the event is already broken or done.

For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields
must be set and an existing kevent request must have matching id and
user fields. If a match is found, req_flags and event are replaced
with the newly supplied values. If a match can't be found, the passed
in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
always set.

For KEVENT_CTL_REMOVE, the id and user fields must be set and an
existing kevent request must have matching id and user fields. If a
match is found, the kevent request is removed. If a match can't be
found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN
set. KEVENT_RET_DONE is always set.

For kevent_get_events, the entire structure is returned with ret_data[0]
modified to contain jiffies for KEVENT_TIMER events.

--

Having looked all this over to figure out what it actually does, I can
make the following comments:

- there's a distinct lack of any sort of commenting beyond brief
descriptions of what the occasional function is supposed to do

- the kevent interface is all the horror of the BSD kqueue interface,
but with no compatibility with the BSD kqueue interface.

- lots of parameters from userspace go unsanitized, although I'm not
sure if this will actually cause problems. At the very least, there
should be checks for unknown flags and use of reserved fields, lest
somebody start using them for their own purposes and then their app
breaks when a newer version of the kernel starts using them itself.

- timeouts are specified as int instead of struct timespec.

- kevent_ctl() and kevent_get_events() take void* for no discernible
reason.

- KEVENT_POLL is less functional than epoll (no return of which events
were actually signalled) and KEVENT_TIMER isn't as flexible as POSIX
interval timers (no clocks, only millisecond resolution, timers don't
have separate start and interval values).

- kevent_get_events() looks a whole lot like io_getevents() and a kevent
fd looks a whole lot like an io_context_t.

- struct ukevent has problems/inconsistencies -- id is wrapped in its
own member struct, while user and ret_data aren't; id's single member is
named raw which does nothing to describe its purpose; the user data is
an anonymous union, which was a compiler extension at the time (only
standardized later, in C11) and might not be widely supported;
req_flags and ret_flags are more difficult to visually distinguish than
they have to be; and every member name could use a ukev_ prefix.

--

P.S.

Dear DaveM,

	Go fuck yourself.

Love,
	Nicholas

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 23:46                         ` Ulrich Drepper
@ 2006-08-23  1:51                           ` Nicholas Miell
  2006-08-23  6:54                           ` Evgeniy Polyakov
  1 sibling, 0 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-23  1:51 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: David Miller, jmorris, johnpol, linux-kernel, akpm, netdev,
	zach.brown, hch

On Tue, 2006-08-22 at 16:46 -0700, Ulrich Drepper wrote:
> I so far also haven't taken the time to look exactly at the interface.
> I plan to do it asap since this is IMO our big chance to get it right.
> I want to have a unifying interface which can handle all the different
> events we need and which come up today and tomorrow.  We have to be able
> to handle not only file descriptors and AIO but also timers, signals,
> message queues (OK, they are file descriptors but let's make it
> official), futexes.  I'm probably missing the one or the other thing now.

Are you sure about signals? I thought about that, but they generally
fall into two categories: signals that have to be signals (i.e. SIGILL,
SIGABRT, SIGFPE, SIGSEGV, etc.) and signals that should be replaced by a
queued event notification (SIGALRM, SIGRTMIN-SIGRTMAX).

Of course, that leaves things like SIGTERM, SIGINT, SIGQUIT, etc. so,
uh, nevermind then. Signal redirection to event queues is definitely
needed.

> DaveM says there are example programs for the current interfaces.  I
> must admit I haven't seen those either.  So if possible, point the world
> to them again.  If you do that now I'll review everything and write up
> my recommendations re the interface before Monday.

There's a handful of little test apps at
http://tservice.net.ru/~s0mbre/archive/kevent/ , but they don't work
with the current iteration of the interface. I don't know if there are
others somewhere else.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: The Proposed Linux kevent API
  2006-08-23  1:36                           ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Nicholas Miell
@ 2006-08-23  2:01                             ` Howard Chu
  2006-08-23  3:31                             ` David Miller
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 143+ messages in thread
From: Howard Chu @ 2006-08-23  2:01 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: David Miller, rdunlap, johnpol, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

Nicholas Miell wrote:
> Having looked all this over to figure out what it actually does, I can
> make the following comments:
>
> - there's a distinct lack of any sort of commenting beyond brief
> descriptions of what the occasional function is supposed to do
>
> - the kevent interface is all the horror of the BSD kqueue interface,
> but with no compatibility with the BSD kqueue interface.
>
> - lots of parameters from userspace go unsanitized, although I'm not
> sure if this will actually cause problems. At the very least, there
> should be checks for unknown flags and use of reserved fields, lest
> somebody start using them for their own purposes and then their app
> breaks when a newer version of the kernel starts using them itself.
>   


Which reminds me, why go through the trouble of copying the structs back
and forth between userspace and kernel space? Why not map the struct
array and leave it in place, as I proposed back here?
http://groups.google.com/group/linux.kernel/browse_frm/thread/57847cfedb61bdd5/8d02afa60a8f83af?lnk=gst&q=equeue&rnum=1#8d02afa60a8f83af

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: The Proposed Linux kevent API
  2006-08-23  1:36                           ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Nicholas Miell
  2006-08-23  2:01                             ` The Proposed Linux kevent API Howard Chu
@ 2006-08-23  3:31                             ` David Miller
  2006-08-23  3:47                               ` Nicholas Miell
  2006-08-23  6:22                             ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Evgeniy Polyakov
  2006-08-23 18:24                             ` The Proposed Linux kevent API Stephen Hemminger
  3 siblings, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-23  3:31 UTC (permalink / raw)
  To: nmiell
  Cc: rdunlap, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

From: Nicholas Miell <nmiell@comcast.net>
Date: Tue, 22 Aug 2006 18:36:07 -0700

> Dear DaveM,
> 
> 	Go fuck yourself.

I guess this is the bit that's supposed to make me take you seriously
:-)

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: The Proposed Linux kevent API
  2006-08-23  3:31                             ` David Miller
@ 2006-08-23  3:47                               ` Nicholas Miell
  2006-08-23  4:23                                 ` Nicholas Miell
  0 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-23  3:47 UTC (permalink / raw)
  To: David Miller
  Cc: rdunlap, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

On Tue, 2006-08-22 at 20:31 -0700, David Miller wrote:
> From: Nicholas Miell <nmiell@comcast.net>
> Date: Tue, 22 Aug 2006 18:36:07 -0700
> 
> > Dear DaveM,
> > 
> > 	Go fuck yourself.
> 
> I guess this is the bit that's supposed to make me take you seriously
> :-)

Of course. ^_^

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: The Proposed Linux kevent API
  2006-08-23  3:47                               ` Nicholas Miell
@ 2006-08-23  4:23                                 ` Nicholas Miell
  0 siblings, 0 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-23  4:23 UTC (permalink / raw)
  To: David Miller
  Cc: rdunlap, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

On Tue, 2006-08-22 at 20:47 -0700, Nicholas Miell wrote:
> On Tue, 2006-08-22 at 20:31 -0700, David Miller wrote:
> > From: Nicholas Miell <nmiell@comcast.net>
> > Date: Tue, 22 Aug 2006 18:36:07 -0700
> > 
> > > Dear DaveM,
> > > 
> > > 	Go fuck yourself.
> > 
> > I guess this is the bit that's supposed to make me take you seriously
> > :-)
> 
> Of course. ^_^
> 

Note that when I made this suggestion, I was not literally instructing
you to perform sexual acts upon yourself, especially if such a thing
would be illegal in your jurisdiction (although, IIRC, you moved to
Seattle recently and I'm pretty sure we allow that kind of thing here,
but we don't generally talk about it in public). So, my apologies to
you, Dave, for making such metaphorical instructions.

However, your choice to characterize my technical criticism as "rants"
and "complaints" and your continuous variations on "let's see you do
something better" as if it were a valid response to my objections did
get on my nerves and made it very hard for me to take you seriously. 

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.)
  2006-08-23  1:36                           ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Nicholas Miell
  2006-08-23  2:01                             ` The Proposed Linux kevent API Howard Chu
  2006-08-23  3:31                             ` David Miller
@ 2006-08-23  6:22                             ` Evgeniy Polyakov
  2006-08-23  8:01                               ` Nicholas Miell
  2006-08-23 18:24                             ` The Proposed Linux kevent API Stephen Hemminger
  3 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  6:22 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: David Miller, rdunlap, linux-kernel, drepper, akpm, netdev,
	zach.brown, hch

On Tue, Aug 22, 2006 at 06:36:07PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> == The Proposed Linux kevent API == 
> 
> The proposed Linux kevent API is a new unified event handling
> interface, similar in spirit to Windows completion ports and Solaris
> completion ports and similar in fact to the FreeBSD/OS X kqueue
> interface.
> 
> Using a single kernel call, a thread can wait for all possible event
> types that the kernel can generate, instead of past interfaces that
> only allow you to wait for specific subsets of events (e.g. POSIX
> sigevent completions are limited only to AIO completion, timer expiry,
> and the arrival of new messages to a message queue, while epoll_wait
> is just a more efficient method of doing a traditional Unix select or
> poll).
> 
> Instead of evolving the struct sigevent notification methods to allow
> you to continue using standard POSIX interfaces like lio_listio(),
> mq_notify() or timer_create() while queuing completion notifications
> to a kevent completion queue (much the way the Solaris port API is
> designed, or the API proposed by Ulrich Drepper in "The
> Need for Asynchronous, Zero-Copy Network I/O" as found at
> http://people.redhat.com/drepper/newni.pdf ), kevent chooses to
> follow the FreeBSD route and introduce an entirely new and
> incompatible method of requesting and reporting event notifications
> (while also managing to be incompatible with FreeBSD's kqueue).
> 
> This is done through the introduction of two new syscalls and a
> variety of supporting datatypes. The first function, kevent_ctl(), is
> used to create and manipulate kevent queues, while the second,
> kevent_get_events(), is used to wait for new events.
> 
> 
> They operate as follows:
> 
> int kevent_ctl(int fd, unsigned int cmd, unsigned int num, void *arg);
> 
> fd is the file descriptor referring to the kevent queue to
> manipulate. It is ignored if the cmd parameter is KEVENT_CTL_INIT.
> 
> cmd is the requested operation. It can be one of the following:
> 
> 	KEVENT_CTL_INIT - create a new kevent queue and return its file
> 		descriptor. The fd, num, and arg parameters are ignored.
> 
> 	KEVENT_CTL_ADD, KEVENT_CTL_MODIFY, KEVENT_CTL_REMOVE - add new,
> 		modify existing, or remove existing event notification
> 		requests.
> 
> num is the number of struct ukevent in the array pointed to by arg
> 
> arg is an array of struct ukevent. Why it is of type void* and not 
> 	struct ukevent* is a mystery.
> 
> When called, kevent_ctl will carry out the operation specified in the
> cmd parameter.
> 
> 
> int kevent_get_events(int ctl_fd, unsigned int min_nr,
> 		unsigned int max_nr, unsigned int timeout,
> 		void *buf, unsigned flags)
> 
> ctl_fd is the file descriptor referring to the kevent queue.
> 
> min_nr is the minimum number of completed events that
>        kevent_get_events will block waiting for.
> 
> max_nr is the number of struct ukevent in buf.
> 
> timeout is the number of milliseconds to wait before returning less
> 	than min_nr events. If this is -1, I *think* it'll wait
> 	indefinitely, but I'm not sure that msecs_to_jiffies(-1) ends
> 	up being MAX_SCHEDULE_TIMEOUT

You forget the case of a nonblocking file descriptor.
Here is the comment from the code:

 * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
 * In blocking mode it waits until timeout or if at least @min_nr events are ready.

> buf is a pointer to an array of struct ukevent. Why it is of type void*
>     and not struct ukevent* is a mystery.
> 
> flags is unused.
> 
> When called, kevent_get_events will wait timeout milliseconds for at
> least min_nr completed events, copying completed struct ukevents to
> buf and deleting any KEVENT_REQ_ONESHOT event requests.
> 
> 
> The bulk of the interface is entirely done through the ukevent struct.
> It is used to add event requests, modify existing event requests,
> specify which event requests to remove, and return completed events.
> 
> struct ukevent contains the following members:
> 
> struct kevent_id id
>        This is described as containing the "socket number, file
>        descriptor and so on", which I take to mean it contains an fd,
>        however for some mysterious reason struct kevent_id contains
>        __u32 raw[2] and (for KEVENT_POLL events) the actual fd is
>        placed in raw[0] and raw[1] is never mentioned except to
>        faithfully copy it around.
> 
>        For KEVENT_TIMER events, raw[0] contains a relative time in
>        milliseconds and raw[1] is still not used.
> 
>        Why the struct member is called "raw" remains a mystery.

If you followed the previous patchsets you would find that there were
network AIO, fs IO and fs-inotify-like notifications.
Some of them use those fields.
I made it two u32 numbers so they can be union'ed with a pointer, the
way the user data is. That pointer should be obtained through Ulrich's
dma_alloc() and friends.

> __u32 type
>       The actual event type, either KEVENT_POLL for fd polling or
>       KEVENT_TIMER for timers.
> 
> __u32 event
>       For events of type KEVENT_POLL, event contains the polling flags
>       of interest (i.e. POLLIN, POLLPRI, POLLOUT, POLLERR, POLLHUP,
>       POLLNVAL).
> 
>       For events of type KEVENT_TIMER, event is ignored.
> 
> __u32 req_flags
>       Per-event request flags. Currently, this may be 0 or
>       KEVENT_REQ_ONESHOT to specify that the event be removed after it
>       is fired.
> 
> __u32 ret_flags
>       Per-event return flags. This may be 0 or a combination of
>       KEVENT_RET_DONE if the event has completed or
>       KEVENT_RET_BROKEN if "the event is broken", which I take to mean
>       any sort of error condition. DONE|BROKEN is a valid state, but I
>       don't really know what it means.

DONE means that event processing is completed and the event can be read
back to userspace; if in addition it contains BROKEN, the kevent is
broken.

> __u32 ret_data[2]
>       Event return data. This is unused by KEVENT_POLL events, while
>       KEVENT_TIMER inexplicably places jiffies in ret_data[0]. If the
>       event is broken, an error code is placed in ret_data[1].

Each kevent user can place any hints it wants there; for example,
network socket notifications store the length of the accept queue there,
and so on. On an error condition the error code is placed there too.

> union { __u32 user[2]; void *ptr; }
>       An anonymous union (which is a fairly recent C addition)
>       containing data saved for the user and otherwise ignored by the
>       kernel.
> 
> For KEVENT_CTL_ADD, all fields relevant to the event type must be
> filled (id, type, possibly event, req_flags). After kevent_ctl(...,
> KEVENT_CTL_ADD, ...) returns each struct's ret_flags should be
> checked to see if the event is already broken or done.
> 
> For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields
> must be set and an existing kevent request must have matching id and
> user fields. If a match is found, req_flags and event are replaced
> with the newly supplied values. If a match can't be found, the passed
> in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
> always set.

DONE means that the user's request is completed, i.e. it was copied
from userspace, examined, analyzed and processed in some way.

> For KEVENT_CTL_REMOVE, the id and user fields must be set and an
> existing kevent request must have matching id and user fields. If a
> match is found, the kevent request is removed. If a match can't be
> found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN
> set. KEVENT_RET_DONE is always set.
> 
> For kevent_get_events, the entire structure is returned with ret_data[0]
> modified to contain jiffies for KEVENT_TIMER events.

ret_data can contain any hint the kernel wants to put there.
It can also be 0.

> --
> 
> Having looked all this over to figure out what it actually does, I can
> make the following comments:
> 
> - there's a distinct lack of any sort of commenting beyond brief
> descriptions of what the occasional function is supposed to do
> 
> - the kevent interface is all the horror of the BSD kqueue interface,
> but with no compatibility with the BSD kqueue interface.
> 
> - lots of parameters from userspace go unsanitized, although I'm not
> sure if this will actually cause problems. At the very least, there
> should be checks for unknown flags and use of reserved fields, lest
> somebody start using them for their own purposes and then their app
> breaks when a newer version of the kernel starts using them itself.

All parameters which are not checked are not used.
If the user puts his own flags where he is not allowed to (like
ret_flags), he creates problems for himself. No one complains when an
arbitrary number is used as a file descriptor and write() fails.

> - timeouts are specified as int instead of struct timespec.

timespec uses long, which is wrong.
I can put there any other structure which has strict types - no longs,
that's the rule, no matter whether there are wrappers in the per-arch
syscall code.
poll always used milliseconds and everyone is happy.

> - kevent_ctl() and kevent_get_events() take void* for no discernible
> reason.

Because the interfaces changed - they used a control block before, and
now they do not. There is an opinion from Christoph that a syscall there
is wrong too and that it would be better to use ioctls(), so I am not
changing it right now, since it can be changed in the future (again).

> - KEVENT_POLL is less functional than epoll (no return of which events
> were actually signalled) and KEVENT_TIMER isn't as flexible as POSIX
> interval timers (no clocks, only millisecond resolution, timers don't
> have separate start and interval values).

That's nonsense - kevent returns fired events, while the POSIX timer API
can only handle timers. When you can put network AIO into the timer API,
call me and I will buy you a t-shirt.
Your reading of "separate start and interval values" is not correct,
please see how both timers work.

The only correct point is that it only supports millisecond resolution -
I have used poll for quite a while and it is a really good interface, so
this was copied from there.

> - kevent_get_events() looks a whole lot like io_getevents() and a kevent
> fd looks a whole lot like an io_context_t.
> 
> - struct ukevent has problems/inconsistencies -- id is wrapped in its
> own member struct, while user and ret_data aren't; id's single member is
> named raw which does nothing to describe its purpose; the user data is
> an anonymous union, which was a compiler extension at the time (only
> standardized later, in C11) and might not be widely supported;
> req_flags and ret_flags are more difficult to visually distinguish than
> they have to be; and every member name could use a ukev_ prefix.

I described what id is and why it is placed into u32[2] - it must be
union'ed with a pointer when such an interface is created.
How would you describe an id for inode notification together with a
user pointer?

As you can see, there are no problems with understanding how it works -
I'm sure it did not take you too much time; I think writing the previous
messages took much longer.

Now my point:
1. Unified interface - since there are many types of different event
mechanisms (already implemented, not theoretical handwaving), I created
a unified interface which does not know what kind of event is provided;
it just routes the event into the appropriate storage and starts
processing. Anyone who thinks that kevents must have a separate
interface for each type of event just does not see how many types there
are. It is simple to wrap it into epoll and POSIX timers, but there are
quite a few others - inotify, socket notifications, various AIO
implementations. Who will create a new API for them?
If you think that kevents are going to be used through a wrapper
library, implement there any interface you like. If you do not, consider
how many syscalls would be required, when in the end the same function
will be called.

2. Wrong documentation and examples.
Over the last two weeks the interface was changed at least three (!)
times. Do you really think that I have some slaves in the cellar?
When the interface is ready I will write docs and update the examples.
But even with the old applications, it is _really_ trivial to understand
which parameter is used where, especially with the excellent LWN
articles.


And actually I do not see this process coming to an end -

	NO FSCKING ONE knows what we want!

	So I will say, as the author, what _I_ want.

Until there is a strong objection to the API, nothing will be changed.

Something will be changed only when there are several people who ack
the change.

This can end up with the merge being declined - I do not care; I hack
not for an entry in MAINTAINERS but because I like the process, and I
can easily carry it as external patches.

Nick, you want a POSIX timers API? OK, I can change it if several core
developers ack this. If they do not, I will not even discuss it.
You can implement it as an addon, no problem.

Dixi.

> --
> 
> P.S.
> 
> Dear DaveM,
> 
> 	Go fuck yourself.
> 
> Love,
> 	Nicholas

In a decent society you would have your nose broken...
But in a virtual one you just cannot be considered a serious person.

> -- 
> Nicholas Miell <nmiell@comcast.net>

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-22 23:46                         ` Ulrich Drepper
  2006-08-23  1:51                           ` Nicholas Miell
@ 2006-08-23  6:54                           ` Evgeniy Polyakov
  1 sibling, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  6:54 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Nicholas Miell, David Miller, jmorris, linux-kernel, akpm,
	netdev, zach.brown, hch

[-- Attachment #1: Type: text/plain, Size: 578 bytes --]

On Tue, Aug 22, 2006 at 04:46:19PM -0700, Ulrich Drepper (drepper@redhat.com) wrote:
> DaveM says there are example programs for the current interfaces.  I
> must admit I haven't seen those either.  So if possible, point the world
> to them again.  If you do that now I'll review everything and write up
> my recommendations re the interface before Monday.

Attached is typical usage for inode and timer events.
Network AIO was implemented as separate syscalls.

> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> 



-- 
	Evgeniy Polyakov

[-- Attachment #2: evtest.c --]
[-- Type: text/plain, Size: 4109 bytes --]

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/time.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#include <linux/unistd.h>
#include <linux/types.h>
#include <linux/ukevent.h>

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4);\
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5);\
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5,type6,arg6) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5, type6 arg6) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5, arg6);\
}

_syscall4(int, kevent_ctl, int, arg1, unsigned int, argv2, unsigned int, argv3, void *, argv4);
_syscall6(int, kevent_get_events, int, arg1, unsigned int, argv2, unsigned int, argv3, unsigned int, argv4, void *, argv5, unsigned, arg6);

#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog_err(f, a...) ulog(f ": %s [%d].\n", ##a, strerror(errno), errno)

static void usage(char *p)
{
	ulog("Usage: %s -t type -e event -o oneshot -p path -n wait_num -h\n", p);
}

static int get_id(int type, char *path)
{
	int ret = -1;

	switch (type) {
		case KEVENT_TIMER:
			ret = 3000;
			break;
		case KEVENT_INODE:
			ret = open(path, O_RDONLY);
			break;
	}

	return ret;
}

int main(int argc, char *argv[])
{
	int ch, fd, err, type, event, oneshot, i, num, wait_num;
	char *path;
	char buf[4096];
	struct ukevent *uk;
	struct timeval tm1, tm2;

	path = NULL;
	type = event = -1;
	oneshot = 0;
	wait_num = 10;

	while ((ch = getopt(argc, argv, "p:t:e:o:n:h")) > 0) {
		switch (ch) {
			case 'n':
				wait_num = atoi(optarg);
				break;
			case 'p':
				path = optarg;
				break;
			case 't':
				type = atoi(optarg);
				break;
			case 'e':
				event = atoi(optarg);
				break;
			case 'o':
				oneshot = atoi(optarg);
				break;
			default:
				usage(argv[0]);
				return -1;
		}
	}

	if (event == -1 || type == -1 || (type == KEVENT_INODE && !path)) {
		ulog("You need at least -t -e parameters and -p for inode notifications.\n");
		usage(argv[0]);
		return -1;
	}
	
	fd = kevent_ctl(0, KEVENT_CTL_INIT, 1, NULL);
	if (fd == -1) {
		ulog_err("Failed to create kevent control block");
		return -1;
	}

	memset(buf, 0, sizeof(buf));
	
	gettimeofday(&tm1, NULL);

	num = 1;
	for (i=0; i<num; ++i) {
		uk = (struct ukevent *)buf;
		uk->event = event;
		uk->type = type;
		if (oneshot)
			uk->req_flags |= KEVENT_REQ_ONESHOT;
		uk->user[0] = i;
		uk->id.raw[0] = get_id(uk->type, path);
		/* get_id() returns -1 when open() fails for inode events. */
		if ((int)uk->id.raw[0] == -1) {
			ulog_err("Failed to get event id: type=%d", uk->type);
			close(fd);
			return -1;
		}

		err = kevent_ctl(fd, KEVENT_CTL_ADD, 1, uk);
		if (err < 0) {
			ulog_err("Failed to perform control operation: type=%d, event=%d, oneshot=%d", type, event, oneshot);
			close(fd);
			return err;
		}
		ulog("%s: err: %d.\n", __func__, err);
		if (err) {
			ulog("%d: ret_flags: 0x%x, ret_data: %u %d.\n", i, uk->ret_flags, uk->ret_data[0], (int)uk->ret_data[1]);
		}
	}
	
	gettimeofday(&tm2, NULL);

	ulog("%08ld.%08ld: Load: diff=%ld usecs.\n", 
			tm2.tv_sec, tm2.tv_usec, ((tm2.tv_sec - tm1.tv_sec)*1000000 + (tm2.tv_usec - tm1.tv_usec))/num);

	while (1) {
		gettimeofday(&tm1, NULL);
		
		err = kevent_get_events(fd, 1, wait_num, 3000, buf, 0);
		if (err < 0) {
			ulog_err("Failed to perform control operation: type=%d, event=%d, oneshot=%d", type, event, oneshot);
			close(fd);
			return err;
		}
		
		gettimeofday(&tm2, NULL);

		ulog("%08ld.%08ld: Wait: num=%d, diff=%ld usec.\n", 
				tm2.tv_sec, tm2.tv_usec,
				err,
				((tm2.tv_sec - tm1.tv_sec)*1000000 + (tm2.tv_usec - tm1.tv_usec))/(err?err:1));
		uk = (struct ukevent *)buf;
		for (i=0; i<(signed)err; ++i) {
			ulog("%08x: %08x.%08x - %08x.%08x\n", 
					uk[i].user[0],
					uk[i].id.raw[0], uk[i].id.raw[1],
					uk[i].ret_data[0], uk[i].ret_data[1]);
		}
	}

	close(fd);
	return 0;
}


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  0:43                         ` Jari Sundell
@ 2006-08-23  6:56                           ` Evgeniy Polyakov
  2006-08-23  7:07                             ` Andrew Morton
  2006-08-23  8:22                             ` Jari Sundell
  0 siblings, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  6:56 UTC (permalink / raw)
  To: Jari Sundell
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> Actually, I didn't miss that, it is an orthogonal issue. A timespec
> timeout parameter for the syscall does not imply the use of timespec
> in any timer event, etc. Nor is there any timespec timer in kqueue's
> struct kevent, which is the only (interface related) thing that will
> be exposed.

A void * in a structure exported to userspace is forbidden.
A long in a syscall requires a wrapper in per-arch code (although that
workaround _is_ there, it does not mean that a broken interface should
be used).
poll uses milliseconds - it is perfectly ok.

> Rakshasa

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  6:56                           ` Evgeniy Polyakov
@ 2006-08-23  7:07                             ` Andrew Morton
  2006-08-23  7:10                               ` Evgeniy Polyakov
                                                 ` (3 more replies)
  2006-08-23  8:22                             ` Jari Sundell
  1 sibling, 4 replies; 143+ messages in thread
From: Andrew Morton @ 2006-08-23  7:07 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jari Sundell, David Miller, kuznet, nmiell, linux-kernel,
	drepper, netdev, zach.brown, hch

On Wed, 23 Aug 2006 10:56:59 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > timeout parameter for the syscall does not imply the use of timespec
> > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > struct kevent, which is the only (interface related) thing that will
> > be exposed.
> 
> A void * in a structure exported to userspace is forbidden.
> A long in a syscall requires a wrapper in per-arch code (although that
> workaround _is_ there, it does not mean that a broken interface should
> be used).
> poll uses milliseconds - it is perfectly ok.

I wonder whether designing-in a millisecond granularity is the right thing
to do.  If in a few years the kernel is running tickless with high-res clock
interrupt sources, that might look a bit lumpy.

Switching it to a __u64 nanosecond counter would be basically free on
64-bit machines, and not very expensive on 32-bit, no?


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  7:07                             ` Andrew Morton
@ 2006-08-23  7:10                               ` Evgeniy Polyakov
  2006-08-23  9:58                                 ` Andi Kleen
  2006-08-23  7:35                               ` David Miller
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  7:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jari Sundell, David Miller, kuznet, nmiell, linux-kernel,
	drepper, netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 12:07:58AM -0700, Andrew Morton (akpm@osdl.org) wrote:
> On Wed, 23 Aug 2006 10:56:59 +0400
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> > > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > > timeout parameter for the syscall does not imply the use of timespec
> > > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > > struct kevent, which is the only (interface related) thing that will
> > > be exposed.
> > 
> > A void * in a structure exported to userspace is forbidden.
> > A long in a syscall requires a wrapper in per-arch code (although that
> > workaround _is_ there, it does not mean that a broken interface should
> > be used).
> > poll uses milliseconds - it is perfectly ok.
> 
> I wonder whether designing-in a millisecond granularity is the right thing
> to do.  If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
> 
> Switching it to a __u64 nanosecond counter would be basically free on
> 64-bit machines, and not very expensive on 32-bit, no?

Let's then place there a structure with 64-bit seconds and nanoseconds,
similar to timespec, but without longs.
What do you think?

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  7:07                             ` Andrew Morton
  2006-08-23  7:10                               ` Evgeniy Polyakov
@ 2006-08-23  7:35                               ` David Miller
  2006-08-23  8:18                                 ` Nicholas Miell
  2006-08-23  7:43                               ` Ian McDonald
  2006-08-23  7:50                               ` Evgeniy Polyakov
  3 siblings, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-23  7:35 UTC (permalink / raw)
  To: akpm
  Cc: johnpol, sundell.software, kuznet, nmiell, linux-kernel, drepper,
	netdev, zach.brown, hch

From: Andrew Morton <akpm@osdl.org>
Date: Wed, 23 Aug 2006 00:07:58 -0700

> I wonder whether designing-in a millisecond granularity is the right thing
> to do.  If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
> 
> Switching it to a __u64 nanosecond counter would be basically free on
> 64-bit machines, and not very expensive on 32-bit, no?

If it ends up in a structure we'll need to use the "aligned_u64" type
in order to avoid problems with 32-bit x86 binaries running on 64-bit
kernels.


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  7:07                             ` Andrew Morton
  2006-08-23  7:10                               ` Evgeniy Polyakov
  2006-08-23  7:35                               ` David Miller
@ 2006-08-23  7:43                               ` Ian McDonald
  2006-08-23  7:50                               ` Evgeniy Polyakov
  3 siblings, 0 replies; 143+ messages in thread
From: Ian McDonald @ 2006-08-23  7:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Evgeniy Polyakov, Jari Sundell, David Miller, kuznet, nmiell,
	linux-kernel, drepper, netdev, zach.brown, hch

> I wonder whether designing-in a millisecond granularity is the right thing
> to do.  If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
>
I'd second that - when working on DCCP I've done a lot of the work in
microseconds, and it made quite a difference compared to milliseconds
because of its design.

I haven't followed kevents in great detail but it sounds like
something that could be useful for me with higher resolution timers
than milliseconds.
-- 
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  7:07                             ` Andrew Morton
                                                 ` (2 preceding siblings ...)
  2006-08-23  7:43                               ` Ian McDonald
@ 2006-08-23  7:50                               ` Evgeniy Polyakov
  2006-08-23 16:09                                 ` Andrew Morton
  3 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  7:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jari Sundell, David Miller, kuznet, nmiell, linux-kernel,
	drepper, netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 12:07:58AM -0700, Andrew Morton (akpm@osdl.org) wrote:
> On Wed, 23 Aug 2006 10:56:59 +0400
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> > > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > > timeout parameter for the syscall does not imply the use of timespec
> > > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > > struct kevent, which is the only (interface related) thing that will
> > > be exposed.
> > 
> > A void * in a structure exported to userspace is forbidden.
> > A long in a syscall requires a wrapper in per-arch code (although that
> > workaround _is_ there, it does not mean that a broken interface should
> > be used).
> > poll uses milliseconds - it is perfectly ok.
> 
> I wonder whether designing-in a millisecond granularity is the right thing
> to do.  If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
> 
> Switching it to a __u64 nanosecond counter would be basically free on
> 64-bit machines, and not very expensive on 32-bit, no?

I can put nanoseconds as the timer interval too (with aligned_u64 as David
mentioned), and use it for the timeout value too - a 64-bit nanosecond
counter ends up with 58 years, probably enough.
Structures with u64 are really not such a good idea.

-- 
	Evgeniy Polyakov


* Re: The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.)
  2006-08-23  6:22                             ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Evgeniy Polyakov
@ 2006-08-23  8:01                               ` Nicholas Miell
  0 siblings, 0 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-23  8:01 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, rdunlap, linux-kernel, drepper, akpm, netdev,
	zach.brown, hch

On Wed, 2006-08-23 at 10:22 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 06:36:07PM -0700, Nicholas Miell (nmiell@comcast.net) wrote:
> > int kevent_get_events(int ctl_fd, unsigned int min_nr,
> > 		unsigned int max_nr, unsigned int timeout,
> > 		void *buf, unsigned flags)
> > 
> > ctl_fd is the file descriptor referring to the kevent queue.
> > 
> > min_nr is the minimum number of completed events that
> >        kevent_get_events will block waiting for.
> > 
> > max_nr is the number of struct ukevent in buf.
> > 
> > timeout is the number of milliseconds to wait before returning less
> > 	than min_nr events. If this is -1, I *think* it'll wait
> > 	indefinitely, but I'm not sure that msecs_to_jiffies(-1) ends
> > 	up being MAX_SCHEDULE_TIMEOUT
> 
> You forget the case for non-blocked file descriptor.
> Here is comment from the code:
> 
>  * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
>  * In blocking mode it waits until timeout or if at least @min_nr events are ready.

I missed that, but why bother with O_NONBLOCK? It appears to make the
timeout parameter completely unnecessary, which means you could just
make timeout = 0 give you the nonblocking behavior and non-zero the
blocking behavior (leaving -1 as wait forever).

> > buf is a pointer to an array of struct ukevent. Why it is of type void*
> >     and not struct ukevent* is a mystery.
> > 
> > flags is unused.
> > 
> > When called, kevent_get_events will wait timeout milliseconds for at
> > least min_nr completed events, copying completed struct ukevents to
> > buf and deleting any KEVENT_REQ_ONESHOT event requests.
> > 
> > 
> > The bulk of the interface is entirely done through the ukevent struct.
> > It is used to add event requests, modify existing event requests,
> > specify which event requests to remove, and return completed events.
> > 
> > struct ukevent contains the following members:
> > 
> > struct kevent_id id
> >        This is described as containing the "socket number, file
> >        descriptor and so on", which I take to mean it contains an fd,
> >        however for some mysterious reason struct kevent_id contains
> >        __u32 raw[2] and (for KEVENT_POLL events) the actual fd is
> >        placed in raw[0] and raw[1] is never mentioned except to
> >        faithfully copy it around.
> > 
> >        For KEVENT_TIMER events, raw[0] contains a relative time in
> >        milliseconds and raw[1] is still not used.
> > 
> >        Why the struct member is called "raw" remains a mystery.
> 
> If you followed previous patchsets you could find that there were
> network AIO, fs IO and fs-inotify-like notifications.
> Some of them use those fields.
> I got two u32 numbers to be "union"ed with a pointer, like user data is.
> That pointer should be obtained through Ulrich's dma_alloc() and
> friends.
> 
> > __u32 type
> >       The actual event type, either KEVENT_POLL for fd polling or
> >       KEVENT_TIMER for timers.
> > 
> > __u32 event
> >       For events of type KEVENT_POLL, event contains the polling flags
> >       of interest (i.e. POLLIN, POLLPRI, POLLOUT, POLLERR, POLLHUP,
> >       POLLNVAL).
> > 
> >       For events of type KEVENT_TIMER, event is ignored.
> > 
> > __u32 req_flags
> >       Per-event request flags. Currently, this may be 0 or
> >       KEVENT_REQ_ONESHOT to specify that the event be removed after it
> >       is fired.
> > 
> > __u32 ret_flags
> >       Per-event return flags. This may be 0 or a combination of
> >       KEVENT_RET_DONE if the event has completed or
> >       KEVENT_RET_BROKEN if "the event is broken", which I take to mean
> >       any sort of error condition. DONE|BROKEN is a valid state, but I
> >       don't really know what it means.
> 
> DONE means that event processing is completed and the event can be read
> back to userspace; if in addition it contains BROKEN, it means that the
> kevent is broken.

So KEVENT_RET_DONE is purely an internal thing? And what does
KEVENT_RET_BROKEN mean, exactly?

> > __u32 ret_data[2]
> >       Event return data. This is unused by KEVENT_POLL events, while
> >       KEVENT_TIMER inexplicably places jiffies in ret_data[0]. If the
> >       event is broken, an error code is placed in ret_data[1].
> 
> Each kevent user can place here any hints it wants; for example network
> socket notifications place there the length of the accept queue, and so on.

I didn't document what it could theoretically be used for, just what it
is actually used for.

> In an error condition the error is placed there too.
> 
> > union { __u32 user[2]; void *ptr; }
> >       An anonymous union (which is a fairly recent C addition)
> >       containing data saved for the user and otherwise ignored by the
> >       kernel.
> > 
> > For KEVENT_CTL_ADD, all fields relevant to the event type must be
> > filled (id, type, possibly event, req_flags). After kevent_ctl(...,
> > KEVENT_CTL_ADD, ...) returns each struct's ret_flags should be
> > checked to see if the event is already broken or done.
> > 
> > For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields
> > must be set and an existing kevent request must have matching id and
> > user fields. If a match is found, req_flags and event are replaced
> > with the newly supplied values. If a match can't be found, the passed
> > in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
> > always set.
> 
> DONE means that the user's request is completed.
> I.e. it was copied from userspace, watched, analyzed and somehow
> processed.

KEVENT_RET_DONE certainly looks like a purely internal flag, which
shouldn't be exported to userspace. (Except for its usage with
KEVENT_CTL_ADD, I guess.)

> > For KEVENT_CTL_REMOVE, the id and user fields must be set and an
> > existing kevent request must have matching id and user fields. If a
> > match is found, the kevent request is removed. If a match can't be
> > found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN
> > set. KEVENT_RET_DONE is always set.
> > 
> > For kevent_get_events, the entire structure is returned with ret_data[0]
> > modified to contain jiffies for KEVENT_TIMER events.
> 
> ret_data can contain any hint kernel wants to put there.
> It can contain 0.

Again, I didn't document theoretical usage, just what's actually done.

> 
> > --
> > 
> > Having looked all this over to figure out what it actually does, I can
> > make the following comments:
> > 
> > - there's a distinct lack of any sort of commenting beyond brief
> > descriptions of what the occasional function is supposed to do
> > 
> > - the kevent interface is all the horror of the BSD kqueue interface,
> > but with no compatibility with the BSD kqueue interface.
> > 
> > - lots of parameters from userspace go unsanitized, although I'm not
> > sure if this will actually cause problems. At the very least, there
> > should be checks for unknown flags and use of reserved fields, lest
> > somebody start using them for their own purposes and then their app
> > breaks when a newer version of the kernel starts using them itself.
> 
> All parameters which are not checked are not used.

Not used currently.

> If the user puts his own flags where he is not allowed to (like ret_flags),
> he creates problems for himself. No one complains when an arbitrary number
> is placed into a file descriptor and write() fails.

So prevent the user from causing future problems -- reject all invalid
uses.

> > - timeouts are specified as int instead of struct timespec.
> 
> timespec uses long, which is wrong.
> I can put there any other structure which has strict types - no longs,
> that's the rule, no matter if there are wrappers in the per-arch syscall
> code.

I don't understand this -- you're saying that you can't use a long
because of compat tasks on 64-bit architectures?

> poll always used milliseconds and all are happy.
> 
> > - kevent_ctl() and kevent_get_events() take void* for no discernible
> > reason.
> 
> Because the interfaces changed - they used a control block before, and now
> they do not. There is an opinion from Christoph that a syscall there is
> wrong too and that it is better to use ioctls(), so I am not changing it
> right now, since it can be changed in the future (again).

OK, that makes sense, but it still has to be fixed assuming this is the
final form of the interface.

> > - KEVENT_POLL is less functional than epoll (no return of which events
> > were actually signalled) and KEVENT_TIMER isn't as flexible as POSIX
> > interval timers (no clocks, only millisecond resolution, timers don't
> > have separate start and interval values).
> 
> That's nonsense - kevent returns fired events,

Yes, but why did the event fire? poll/epoll return the
POLLIN/POLLPRI/POLLOUT/POLLERR/etc. bitmask when they return events.

>  The POSIX timer API can only
> use timers. When you can put network AIO into a timer API, call me and I
> will buy you a t-shirt.

I was talking about POSIX timers versus KEVENT_TIMER in isolation.
Ignoring the event delivery mechanism, POSIX timers are more capable
than kevent timers.

> Your meaning of "separate start and interval values" is not correct,
> please see how both timers work.

With POSIX timers, I can create a timer that starts periodically firing
at some point in the future, where the period isn't equal to the
difference between now and its first expiry (i.e. 10 seconds from now,
start firing every 2 seconds). I don't think I can do this using
KEVENT_TIMER.

> 
> The only thing correct is that it only supports millisecond resolution -
> I have used poll for quite a while and it is a really good interface, so
> it was copied from there.
> 
> > - kevent_get_events() looks a whole lot like io_getevents() and a kevent
> > fd looks a whole lot like an io_context_t.
> > 
> > - struct ukevent has problems/inconsistencies -- id is wrapped in its
> > own member struct, while user and ret_data aren't; id's single member is
> > named raw, which does nothing to describe its purpose; the user data is
> > an anonymous union, which is C99-only and might not be widely supported;
> > req_flags and ret_flags are more difficult to visually distinguish than
> > they have to be; and every member name could use a ukev_ prefix.
> 
> I described what id is and why it is placed into u32[2] - it must be
> union'ed with a pointer, when such an interface is created.
> How can you describe an id for inode notification and a user pointer?
> 
> As you can see there are no problems with understanding how it works -
> I'm sure it did not take you too much time; I think writing the previous
> messages took much longer.

The previous message which DaveM was kind enough to characterize as a
rant took 10 minutes. The kevent docs took ~2 hours. You're welcome.

> 
> Now my point:
> 1. unified interface - since there are many types of different event
> mechanisms (already implemented, not theoretical handwaving), I
> created a unified interface which does not know what kind of event is
> provided; it just routes it into the appropriate storage and starts
> processing. Anyone who thinks that kevents must have a separate interface
> for each type of event simply does not see how many types there are.
> It is simple to wrap it into epoll and POSIX timers, but there are quite
> a few others - inotify, socket notifications, various AIO
> implementations. Who will create a new API for them?

Ah, now there's your problem. The joy of this is that you don't have to
implement any of it. All you really need to do is implement the
userspace interface for creating kevent queues and dequeuing events and
the kernel interface for adding events to the queue and provide a
general enough event struct. (All you really need is the event type,
event specific data (a uintptr_t would probably be sufficient), and a
user cookie pointer)

Once you've done that, you can say "Hey, POSIX timers people, you should
make a way for people to queue timer completion events" and "Hey, AIO
people, you should make a way for people to queue AIO completion events"
and wait for them to do the work. (And if the AIO people complain that
they already have a way for people to queue AIO completions, well, then
you really need to justify why yours is better.)

In fact, I'm beginning to wonder if kevent_ctl() should exist at all --
its only real use in this scenario would be for the odds-and-ends that
don't already have mechanisms for specifying how event completions
should be handled, in which case it becomes much more like
port_associate() and port_disassociate().

> If you think that kevents are going to be used through a wrapper library -
> implement there any interface you like. If you do not, consider how many
> syscalls are required; in the end the same function will be called.
> 
> 2. Wrong documentation and examples.
> For the last two weeks the interface was changed at least three (!) times.
> Do you really think that I have some slaves in the cellar?
> When the interface is ready I will write docs and update the examples.
> But even with the old applications, it is _really_ trivial to understand
> what parameter is used and where, especially with the excellent LWN
> articles.

The LWN articles weren't really that excellent; more of a pair of
cursory overviews, really. And yes, I'm pretty sure it is understood
that you don't have infinite time to work on this.

> 
> And actually I do not see that this process is coming to its end -
> 
> 	NO FSCKING ONE knows what we want!
> 
> 	So I will say as author what _I_ want.
> 
> Until there is a strong objection to the API, nothing will be changed.
> 
> Something will be changed only when there are several people who ack
> the change.
> 
> This can end up with the merge being declined - I do not care; I hack not for
> the entry in MAINTAINERS, but because I like the process,
> and I can use it with external patches easily.
> 
> Nick, you want a POSIX timers API? Ok, I can change it, if several core
> developers ack this. If they do not, I will not even discuss it.
> You can implement it as an addon, no problem.
> 
> Dixi.
> 
> > --
> > 
> > P.S.
> > 
> > Dear DaveM,
> > 
> > 	Go fuck yourself.
> > 
> > Love,
> > 	Nicholas
> 
> In a decent society you would have your nose broken...
> But in a virtual one you just cannot be considered a serious person.

I'm fairly certain DaveM is still the reigning champion of the
F-bomb-to-lines-of-kernel-code ratio, although I think Rusty may be
catching up. 

Wait, no, somebody in MIPS land (I think maybe Christoph Hellwig) really
really hates the IOC3 very very much.

-- 
Nicholas Miell <nmiell@comcast.net>



* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  7:35                               ` David Miller
@ 2006-08-23  8:18                                 ` Nicholas Miell
  0 siblings, 0 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-23  8:18 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, johnpol, sundell.software, kuznet, linux-kernel, drepper,
	netdev, zach.brown, hch

On Wed, 2006-08-23 at 00:35 -0700, David Miller wrote:
> From: Andrew Morton <akpm@osdl.org>
> Date: Wed, 23 Aug 2006 00:07:58 -0700
> 
> > I wonder whether designing-in a millisecond granularity is the right thing
> > to do.  If in a few years the kernel is running tickless with high-res clock
> > interrupt sources, that might look a bit lumpy.
> > 
> > Switching it to a __u64 nanosecond counter would be basically free on
> > 64-bit machines, and not very expensive on 32-bit, no?
> 
> If it ends up in a structure we'll need to use the "aligned_u64" type
> in order to avoid problems with 32-bit x86 binaries running on 64-bit
> kernels.

Perhaps

struct timespec64
{
	uint64_t tv_sec __attribute__((aligned(8)));
	uint32_t tv_nsec;
};

with a snide remark about gcc in the comments?

-- 
Nicholas Miell <nmiell@comcast.net>



* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  6:56                           ` Evgeniy Polyakov
  2006-08-23  7:07                             ` Andrew Morton
@ 2006-08-23  8:22                             ` Jari Sundell
  2006-08-23  8:39                               ` Evgeniy Polyakov
  1 sibling, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-23  8:22 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> void * in structure exported to userspace is forbidden.

Only void * I'm seeing belongs to the user, (udata) perhaps you are
talking of something different?

> long in syscall requires wrapper in per-arch code (although that
> workaround _is_ there, it does not mean that broken interface should
> be used).
> poll uses milliseconds - it is perfectly ok.

The kernel is there to hide those ugly implementation details from the
user, so I don't care that much about a workaround being required in
some cases. More important, IMHO, is consistency with the POSIX system
calls.

I guess as long as you use usec, at least it won't be a pain to use.

Rakshasa


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  8:22                             ` Jari Sundell
@ 2006-08-23  8:39                               ` Evgeniy Polyakov
  2006-08-23  9:49                                 ` Jari Sundell
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  8:39 UTC (permalink / raw)
  To: Jari Sundell
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 10:22:06AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >void * in structure exported to userspace is forbidden.
> 
> Only void * I'm seeing belongs to the user, (udata) perhaps you are
> talking of something different?

Yes, exactly about it.

I put union {
	u32 a[2];
	void *b;
}
especially to eliminate that problem.

And I'm not that sure about stuff like uptr_t or how they call pointers
in userspace and kernelspace.

> >long in syscall requires wrapper in per-arch code (although that
> >workaround _is_ there, it does not mean that broken interface should
> >be used).
> >poll uses milliseconds - it is perfectly ok.
> 
> The kernel is there to hide those ugly implementation details from the
> user, so I don't care that much about a workaround being required in
> some cases. More important, IMHO is consistency with the POSIX system
> calls.
> 
> I guess as long as you use usec, at least it won't be a pain to use.

Andrew suggested using nanoseconds there, in a u64 variable.
I think it is ok.
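For scale, the u64-nanosecond choice is comfortable: a 32-bit millisecond counter wraps in under 50 days, while 64 bits of nanoseconds last centuries. A quick back-of-the-envelope check (nothing kevent-specific):

```c
/* Days until a 32-bit millisecond counter wraps: 2^32 ms ~= 49.7 days. */
static double ms_wrap_days(void)
{
	return 4294967296.0 / 1000.0 / 86400.0;
}

/* Years until a 64-bit nanosecond counter wraps: 2^64 ns ~= 584 years. */
static double ns_wrap_years(void)
{
	return 18446744073709551616.0 / 1e9 / 86400.0 / 365.25;
}
```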

> Rakshasa

-- 
	Evgeniy Polyakov


* Re: [take12 1/3] kevent: Core files.
  2006-08-21 10:19   ` [take12 1/3] kevent: Core files Evgeniy Polyakov
  2006-08-21 10:19     ` [take12 2/3] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-08-23  8:51     ` Eric Dumazet
  2006-08-23  9:18       ` Evgeniy Polyakov
  1 sibling, 1 reply; 143+ messages in thread
From: Eric Dumazet @ 2006-08-23  8:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

Hello Evgeniy

I have one comment/suggestion (minor detail, your work is very good)

I suggest to add one item in kevent_registered_callbacks[], so that 
kevent_registered_callbacks[KEVENT_MAX] is valid and can act as a fallback.

In kevent_add_callbacks() you could replace any NULL pointers with
kevent_break() in
kevent_registered_callbacks[pos].{callback, enqueue, dequeue}
like:

+int kevent_add_callbacks(const struct kevent_callbacks *cb, unsigned int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+	p = &kevent_registered_callbacks[pos];
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+	printk(KERN_INFO "KEVENT: Added callbacks for type %u.\n", pos);
+	return 0;
+}

(I also added a const qualifier to the first function argument, and made pos 
an unsigned int so that the "if (pos >= KEVENT_MAX)" test catches 'negative' values)

Then you change kevent_break() to return -EINVAL instead of 0.

+int kevent_break(struct kevent *k)
+{
+       unsigned long flags;
+
+       spin_lock_irqsave(&k->ulock, flags);
+       k->event.ret_flags |= KEVENT_RET_BROKEN;
+       spin_unlock_irqrestore(&k->ulock, flags);
+       return -EINVAL;
+}

Then avoid the tests in kevent_enqueue()

+int kevent_enqueue(struct kevent *k)
+{
+       return k->callbacks.enqueue(k);
+}

And avoid the tests in  kevent_dequeue()

+int kevent_dequeue(struct kevent *k)
+{
+       return k->callbacks.dequeue(k);
+}

And change kevent_init() to

+int kevent_init(struct kevent *k)
+{
+       spin_lock_init(&k->ulock);
+       k->flags = 0;
+
+       if (unlikely(k->event.type >= KEVENT_MAX))
+               k->event.type = KEVENT_MAX;
+
+       k->callbacks = kevent_registered_callbacks[k->event.type];
+       if (unlikely(k->callbacks.callback == kevent_break))
+               return kevent_break(k);
+
+       return 0;
+}
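Eric's suggestion boils down to a sentinel-slot pattern: reserve one extra table entry holding an always-failing callback, clamp out-of-range types to it, and the hot dispatch paths lose their NULL checks. A standalone sketch of the idea (the names and the -22/-EINVAL value are illustrative, not the kevent ones):

```c
enum { TYPE_MAX = 2 };

typedef int (*handler_t)(int arg);

/* Sentinel handler: any broken or unregistered type ends up here. */
static int handler_break(int arg) { (void)arg; return -22; /* -EINVAL */ }
static int handler_echo(int arg)  { return arg; }

/* One extra slot at [TYPE_MAX] is the always-valid fallback, so the
 * dispatch path never has to test for NULL. */
static handler_t handlers[TYPE_MAX + 1] = {
	[0]        = handler_echo,
	[1]        = handler_break,   /* nothing registered for type 1 */
	[TYPE_MAX] = handler_break,
};

static int register_handler(handler_t h, unsigned int pos)
{
	if (pos >= TYPE_MAX)          /* unsigned pos catches "negative" values */
		return -22;
	handlers[pos] = h ? h : handler_break;
	return 0;
}

static int dispatch(unsigned int type, int arg)
{
	if (type >= TYPE_MAX)
		type = TYPE_MAX;      /* clamp to the sentinel slot */
	return handlers[type](arg);   /* no NULL check needed */
}
```

The cost is one extra table entry; the benefit is that every caller of `dispatch()` sees a total function.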



Eric Dumazet


* Re: [take12 1/3] kevent: Core files.
  2006-08-23  8:51     ` [take12 1/3] kevent: Core files Eric Dumazet
@ 2006-08-23  9:18       ` Evgeniy Polyakov
  2006-08-23  9:23         ` Eric Dumazet
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  9:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Wed, Aug 23, 2006 at 10:51:36AM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
> Hello Evgeniy

Hi Eric.
 
> I have one comment/suggestion (minor detail, your work is very good)
> 
> I suggest to add one item in kevent_registered_callbacks[], so that 
> kevent_registered_callbacks[KEVENT_MAX] is valid and can act as a fallback.

Sounds good, could you please send a patch that applies, with a proper
Signed-off-by line?

> Eric Dumazet

-- 
	Evgeniy Polyakov


* Re: [take12 1/3] kevent: Core files.
  2006-08-23  9:18       ` Evgeniy Polyakov
@ 2006-08-23  9:23         ` Eric Dumazet
  2006-08-23  9:29           ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Eric Dumazet @ 2006-08-23  9:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Wednesday 23 August 2006 11:18, Evgeniy Polyakov wrote:
> On Wed, Aug 23, 2006 at 10:51:36AM +0200, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > Hello Evgeniy
>
> Hi Eric.
>
> > I have one comment/suggestion (minor detail, your work is very good)
> >
> > I suggest to add one item in kevent_registered_callbacks[], so that
> > kevent_registered_callbacks[KEVENT_MAX] is valid and can act as a
> > fallback.
>
> Sounds good, could you please send appliable patch with proper
> signed-off line?

Unfortunately not at this moment; I'm quite busy at work, and my boss will kill 
me :(
If you find this useful, please add it to your next patch submission, or forget 
it.

Thank you
Eric



* Re: [take12 1/3] kevent: Core files.
  2006-08-23  9:23         ` Eric Dumazet
@ 2006-08-23  9:29           ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23  9:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Wed, Aug 23, 2006 at 11:23:52AM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > Sounds good, could you please send a patch that applies, with a proper
> > Signed-off-by line?
> 
> Unfortunately not at this moment, I'm quite busy at work, my boss will kill 
> me :( .
> If you find this good, please add it to your next patch submission or forget 
> it. 

Ok, I will try to assemble it from the pieces in this e-mail.

> Thank you
> Eric

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  8:39                               ` Evgeniy Polyakov
@ 2006-08-23  9:49                                 ` Jari Sundell
  2006-08-23 10:20                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-23  9:49 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Wed, Aug 23, 2006 at 10:22:06AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> > On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > >void * in structure exported to userspace is forbidden.
> >
> > Only void * I'm seeing belongs to the user, (udata) perhaps you are
> > talking of something different?
>
> Yes, exactly about it.
>
> I put union {
>         u32 a[2];
>         void *b;
> }
> especially to eliminate that problem.

It's just random data of a known maximum size appended to the struct,
I'm sure you can find a clean way to handle it. If you mangle the
first variable name in your union, you'll end up with something that
should be usable instead of udata.

> And I'm not that sure about stuff like uptr_t or how they call pointers
> in userspace and kernelspace.

Well, I can't find any use of pointers in your struct ukevent, nor in
any of the kqueue events in my man page. So if this is a deficit it
applies to both, I guess?

> ukevent is aligned to 8 bytes already (its size selected to be 40 bytes),
> so it should not be a problem.
>
> > Eric

Even if it is so, wouldn't it be better to be explicit about it?

Rakshasa


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  7:10                               ` Evgeniy Polyakov
@ 2006-08-23  9:58                                 ` Andi Kleen
  2006-08-23 10:03                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Andi Kleen @ 2006-08-23  9:58 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jari Sundell, David Miller, kuznet, nmiell, linux-kernel,
	drepper, netdev, zach.brown, hch

Evgeniy Polyakov <johnpol@2ka.mipt.ru> writes:
> 
> Let's then place there a structure with 64bit seconds and nanoseconds,
> similar to timespec, but without longs there.

You need 64bit (or at least more than 32bit) for the seconds,
otherwise you add a y2038 problem which would be sad in new code.
Remember you might be still alive then ;-)

Ok one could argue that on 32bit architectures 2038 is so deeply
embedded that it doesn't make much difference, but I still
think it would be better not to re-add it to new interfaces there.

64bit longs on 32bit is fine, as long as you use aligned_u64,
never long long or u64 (which has varying alignment between i386 and x86-64)
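The trap Andi describes can be made concrete: on i386 a bare `u64` member is 4-byte aligned, while x86-64 aligns it to 8, shifting every following offset. An explicitly aligned type removes the ambiguity (a sketch; `my_aligned_u64` and `event_fixed` are illustrative names — the kernel's type is `aligned_u64`):

```c
#include <stddef.h>

typedef unsigned long long my_aligned_u64 __attribute__((aligned(8)));

struct event_fixed {
	unsigned int type;        /* 4 bytes, then padding to 8 */
	my_aligned_u64 timeout;   /* always at offset 8, on every ABI */
};

_Static_assert(offsetof(struct event_fixed, timeout) == 8,
	       "offset is ABI-independent");
/* With a bare 'unsigned long long' member instead, i386 would place
 * timeout at offset 4 and x86-64 at offset 8 -- two incompatible
 * layouts for the same structure definition. */
```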

-Andi


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  9:58                                 ` Andi Kleen
@ 2006-08-23 10:03                                   ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 10:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jari Sundell, David Miller, kuznet, nmiell, linux-kernel,
	drepper, netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 11:58:20AM +0200, Andi Kleen (ak@suse.de) wrote:
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> writes:
> > 
> > Let's then place there a structure with 64bit seconds and nanoseconds,
> > similar to timespec, but without longs there.
> 
> You need 64bit (or at least more than 32bit) for the seconds,
> otherwise you add a y2038 problem which would be sad in new code.
> Remember you might be still alive then ;-)

I hope so :)

> Ok one could argue that on 32bit architectures 2038 is so deeply
> embedded that it doesn't make much difference, but I still
> think it would be better not to re-add it to new interfaces there.
> 
> 64bit longs on 32bit is fine, as long as you use aligned_u64,
> never long long or u64 (which has varying alignment between i386 and x86-64)

Btw, aligned_u64 is not exported to userspace.
I committed a change with __u64 nanoseconds, without any structures.
Do we really need a structure?

> -Andi

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  9:49                                 ` Jari Sundell
@ 2006-08-23 10:20                                   ` Evgeniy Polyakov
  2006-08-23 10:34                                     ` Jari Sundell
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 10:20 UTC (permalink / raw)
  To: Jari Sundell
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 11:49:22AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> >> Only void * I'm seeing belongs to the user, (udata) perhaps you are
> >> talking of something different?
> >
> >Yes, exactly about it.
> >
> >I put union {
> >        u32 a[2];
> >        void *b;
> >}
> >especially to eliminate that problem.
> 
> It's just random data of a known maximum size appended to the struct,
> I'm sure you can find a clean way to handle it. If you mangle the
> first variable name in your union, you'll end up with something that
> should be usable instead of udata.

If there is a usual pointer there, the size of the whole structure will be
different in kernel and userspace.

> >And I'm not that sure about stuff like uptr_t or how they call pointers
> >in userspace and kernelspace.
> 
> Well, I can't find any use of pointers in your struct ukevent, nor in
> any of the kqueue events in my man page. So if this is a deficit it
> applies to both, I guess?

No, it will change the size of the structure between kernelspace and userspace,
so they just cannot communicate.
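The size concern is mechanical: a bare pointer member is 4 bytes for a 32-bit process and 8 for a 64-bit kernel, so the two sides would disagree on the structure size. Padding the pointer inside a union with `u32[2]` pins the member at 8 bytes on both sides (a minimal sketch of the trick, not the actual ukevent definition):

```c
#include <stdint.h>

/* The union occupies 8 bytes whether void* is 4 or 8 bytes wide,
 * because the uint32_t[2] arm fixes the minimum size. */
union udata {
	uint32_t raw[2];
	void *ptr;
};

_Static_assert(sizeof(union udata) == 8,
	       "same size for 32-bit and 64-bit ABIs");
```

Note the alignment of the union can still differ (4 on i386, 8 on x86-64), which is why the surrounding structure also needs explicit alignment or careful member ordering.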

> >ukevent is aligned to 8 bytes already (its size selected to be 40 bytes),
> >so it should not be a problem.
> >
> >> Eric
> 
> Even if it is so, wouldn't it be better to be explicit about it?

Ok, I will add a comment about it.

> Rakshasa

-- 
	Evgeniy Polyakov


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23 10:20                                   ` Evgeniy Polyakov
@ 2006-08-23 10:34                                     ` Jari Sundell
  2006-08-23 10:51                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-23 10:34 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> No, it will change sizes of the structure in kernelspace and userspace,
> so they just can not communicate.

struct kevent {
  uintptr_t ident;        /* identifier for this event */
  short     filter;       /* filter for event */
  u_short   flags;        /* action flags for kqueue */
  u_int     fflags;       /* filter flag value */

  union {
    u32       _data_padding[2];
    intptr_t  data;         /* filter data value */
  };

  union {
    u32       _udata_padding[2];
    void      *udata;       /* opaque user data identifier */
  };
};

I'm not missing anything obvious here, I hope.

Rakshasa


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23 10:34                                     ` Jari Sundell
@ 2006-08-23 10:51                                       ` Evgeniy Polyakov
  2006-08-23 12:55                                         ` Jari Sundell
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 10:51 UTC (permalink / raw)
  To: Jari Sundell
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 12:34:25PM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >
> >No, it will change sizes of the structure in kernelspace and userspace,
> >so they just can not communicate.
> 
> struct kevent {
>  uintptr_t ident;        /* identifier for this event */
>  short     filter;       /* filter for event */
>  u_short   flags;        /* action flags for kqueue */
>  u_int     fflags;       /* filter flag value */
> 
>  union {
>    u32       _data_padding[2];
>    intptr_t  data;         /* filter data value */
>  };

As Eric pointed out, it must be aligned.

>  union {
>    u32       _udata_padding[2];
>    void      *udata;       /* opaque user data identifier */
>  };
> };
> 
> I'm not missing anything obvious here, I hope.

We still do not know what uintptr_t is, and it looks like it is a pointer,
which is forbidden. Those numbers are not enough to implement network AIO.
And it is actually not compatible with kqueue already, so you will need to
write your own parser to convert your parameters into the above structure.

> Rakshasa

-- 
	Evgeniy Polyakov


* [take13 0/3] kevent: Generic event handling mechanism.
       [not found] <12345678912345.GA1898@2ka.mipt.ru>
  2006-08-17  7:43 ` [take11 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
  2006-08-21 10:19 ` [take12 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-08-23 11:24 ` Evgeniy Polyakov
  2006-08-23 11:24   ` [take13 1/3] kevent: Core files Evgeniy Polyakov
       [not found]   ` <Pine.LNX.4.63.0608231313370.8007@alpha.polcom.net>
  2006-08-25  9:54 ` [take14 " Evgeniy Polyakov
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 11:24 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


Generic event handling mechanism.

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before the main loop, which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added a header shared between userspace and kernelspace instead of embedding them in one
 * core restructuring to remove forward declarations
 * some whitespace / coding style cleanups
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use nopage() method to dynamically substitute pages
	- allocate new page for events only when a newly added kevent requires it
	- do not use ugly index dereferencing, use a structure instead
	- reduced amount of data in the ring (id and flags),
		maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect whether an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep complaints - all storage locks are initialized in the same function, so lockdep was taught
	to differentiate between the various cases
 * remove kevent from storage if is marked as broken after callback
 * fixed a typo in the mmapped buffer implementation which would result in a wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added optimisation, which is aimed to help when a lot of kevents are being copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
			unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent 
	initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match
	kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>




* [take13 3/3] kevent: Timer notifications.
  2006-08-23 11:24     ` [take13 2/3] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-08-23 11:24       ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 11:24 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig



Timer notifications.

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and limited in number.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..1c8ffb2
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,105 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct timer_list	ktimer;
+	struct kevent_storage	ktimer_storage;
+};
+
+static void kevent_timer_func(unsigned long data)
+{
+	struct kevent *k = (struct kevent *)data;
+	struct timer_list *t = k->st->origin;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+	
+	mod_timer(&t->ktimer, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+
+	return 0;
+
+err_out_st_fini:	
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	del_timer_sync(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback, 
+		.enqueue = &kevent_timer_enqueue, 
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);



* [take13 2/3] kevent: poll/select() notifications.
  2006-08-23 11:24   ` [take13 1/3] kevent: Core files Evgeniy Polyakov
@ 2006-08-23 11:24     ` Evgeniy Polyakov
  2006-08-23 11:24       ` [take13 3/3] kevent: Timer notifications Evgeniy Polyakov
  2006-08-23 12:51     ` [take13 1/3] kevent: Core files Eric Dumazet
  2006-08-24 20:03     ` Christoph Hellwig
  2 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 11:24 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


poll/select() notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the caller's internal state machine, but through
a process wakeup).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..76b3039 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -698,6 +699,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..b051784
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct 	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait, 
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont = 
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	u32 revents;
+
+	revents = file->f_op->poll(file, NULL);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, 
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k = 
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+		
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+	
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *w, *n;
+	unsigned long flags;
+
+	kevent_storage_dequeue(k->st, k);
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+		list_del(&w->container_entry);
+		remove_wait_queue(w->whead, &w->wait);
+		kmem_cache_free(kevent_poll_container_cache, w);
+	}
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+	
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+	k->priv = NULL;
+	
+	fput(file);
+
+	return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	unsigned int revents = file->f_op->poll(file, NULL);
+
+	k->event.ret_data[0] = revents & k->event.event;
+
+	return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+	struct kevent_callbacks pc = {
+		.callback = &kevent_poll_callback,
+		.enqueue = &kevent_poll_enqueue,
+		.dequeue = &kevent_poll_dequeue};
+
+	kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", 
+			sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+	if (!kevent_poll_container_cache) {
+		printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+		return -ENOMEM;
+	}
+	
+	kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", 
+			sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+	if (!kevent_poll_priv_cache) {
+		printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+		kmem_cache_destroy(kevent_poll_container_cache);
+		kevent_poll_container_cache = NULL;
+		return -ENOMEM;
+	}
+	
+	kevent_add_callbacks(&pc, KEVENT_POLL);
+
+	printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+	return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+	lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+	kmem_cache_destroy(kevent_poll_priv_cache);
+	kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take13 1/3] kevent: Core files.
  2006-08-23 11:24 ` [take13 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-08-23 11:24   ` Evgeniy Polyakov
  2006-08-23 11:24     ` [take13 2/3] kevent: poll/select() notifications Evgeniy Polyakov
                       ` (2 more replies)
       [not found]   ` <Pine.LNX.4.63.0608231313370.8007@alpha.polcom.net>
  1 sibling, 3 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 11:24 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig


Core files.

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
 	.quad sys_tee
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl
 ia32_syscall_end:		
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_kevent_get_events	318
+#define __NR_kevent_ctl		319
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 320
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..fa282ac
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,173 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/*
+ * @callback is called each time a new event has been caught.
+ * @enqueue is called each time a new event is queued.
+ * @dequeue is called each time an event is dequeued.
+ */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's queue. */
+	struct list_head	kevent_entry;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * The poll()/select() storage keeps a list of wait_queue_t
+	 * containers here, one for each poll_wait() call made from
+	 * the file's ->poll() method.
+	 */
+	void			*priv;
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_user
+{
+	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t 		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+	
+	unsigned int		pages_in_use;
+	/* Array of pages forming mapped ring buffer */
+	struct kevent_mring	**pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n", 
+			__func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)		({ (void) u;})
+#define kevent_stat_init(u)		({ (void) u;})
+#define kevent_stat_im(u)		({ (void) u;})
+#define kevent_stat_wait(u)		({ (void) u;})
+#define kevent_stat_total(u)		({ (void) u;})
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..366234e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, 
+		__u64 timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..3397955
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,155 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT	0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * The number of per-type event sets must match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user. 
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_EVENTS	4096
+
+/*
+ * Note that mukevents do not exactly fill the page (sizeof(struct mukevent)
+ * does not evenly divide PAGE_SIZE), so we reuse 4 bytes at the beginning
+ * of the first page to store the ring index.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		index;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+#define	KEVENT_CTL_ADD 		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..a756e85
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,31 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the kernel event queue mechanism.
+	  It can be used as a replacement for poll()/select(), for AIO
+	  callback invocations, advanced timer notifications and other
+	  kernel object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistics"
+	depends on KEVENT
+	default N
+	help
+	  This option turns on kevent_user statistics collection.
+	  Collected data includes the total number of kevents, the number
+	  of kevents which were ready immediately at insertion time, and
+	  the number of kevents which were removed through readiness
+	  completion. It is printed each time the control kevent
+	  descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows timers to be used through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows the kevent subsystem to be used for
+	  poll()/select() notifications.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..ab6bca0
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,3 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..39dd8c3
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,226 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into the appropriate origin's queue.
+ * Returns a positive value if the event is ready immediately,
+ * a negative value in case of error, and zero if the event has been queued.
+ * The ->enqueue() callback must increase the origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove the event from the appropriate queue.
+ * The ->dequeue() callback must decrease the origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+	
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+	
+	p = &kevent_registered_callbacks[pos];
+	
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before an event is added into some origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, since kevent_enqueue() would
+ * then fail to add the kevent into the origin's queue and would set the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue a kevent from the origin's queue.
+ * It does not decrease the origin's reference counter in any way;
+ * it must be called before that counter is dropped, so the storage
+ * itself is still valid. It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call the kevent's ready callback and put it onto the ready queue
+ * if needed. If the kevent is marked one-shot, remove it from the
+ * storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if the kevent is ready (by invoking its callback) and
+ * requeue/remove it if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity is noticed in the origin (socket, inode...).
+ */
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken, which will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point
+ * (a socket, for example, has already been removed from the file table).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..b8e0a56
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,869 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+
+/*
+ * Kevent descriptors are pollable: return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+	
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+	u->pring[0]->index = num;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+	unsigned int idx;
+
+	idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
+	if (idx >= u->pages_in_use) {
+		u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[idx])
+			return -ENOMEM;
+		u->pages_in_use++;
+	}
+	return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	pidx = ring->index/KEVENTS_ON_PAGE;
+	off = ring->index%KEVENTS_ON_PAGE;
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->index >= KEVENT_MAX_EVENTS)
+		ring->index = 0;
+}
+
+/*
+ * Initialize the mmap ring buffer.
+ * It stores ready kevents, so userspace can fetch them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int pnum;
+
+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+	u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+	if (!u->pring[0])
+		goto err_out_free;
+
+	u->pages_in_use = 1;
+	kevent_user_ring_set(u, 0);
+
+	return 0;
+
+err_out_free:
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+	
+	for (i = 0; i < u->pages_in_use; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+		INIT_LIST_HEAD(&u->kevent_list[i]);
+	
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+
+/*
+ * Kevent userspace control block reference counting.
+ * The counter is set to 1 at creation time; when the corresponding
+ * kevent file descriptor is closed, it is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+	struct kevent_user *u = vma->vm_file->private_data;
+	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
+	if (off >= u->pages_in_use)
+		goto err_out_sigbus;
+
+	return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+	return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+	.nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start;
+	struct kevent_user *u = file->private_data;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_ops = &kevent_user_vm_ops;
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * The entry is freed in an RCU callback; it has been dequeued from
+ * all lists by that point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removal: dequeue the kevent from the storage list
+ * if requested, remove it from the ready list, drop the userspace
+ * control block reference counter and schedule kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove the kevent from all lists and free it.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove the kevent from the user's list of all events,
+ * dequeue it from its storage and decrease the user's reference counter,
+ * since this kevent no longer exists. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search the given hash bucket for a kevent matching the provided ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk, 
+		struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+	
+	list_for_each_entry(k, head, kevent_entry) {
+		spin_lock(&k->ulock);
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] && 
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			ret = k;
+			spin_unlock(&k->ulock);
+			break;
+		}
+		spin_unlock(&k->ulock);
+	}
+
+	return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	int err = -ENODEV;
+	unsigned long flags;
+	
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detach the userspace control block from the file descriptor
+ * and decrease its reference counter.
+ * No new kevents can be added to or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+		list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+			kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read all ukevents from userspace and modify the matching kevents.
+ * If the provided number of ukevents exceeds the threshold, it is
+ * faster to allocate room for all of them and copy them in one shot
+ * than to copy and process them one-by-one.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+	
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read all ukevents from userspace and remove the matching kevents.
+ * If the provided number of ukevents exceeds the threshold, it is
+ * faster to allocate room for all of them and copy them in one shot
+ * than to copy and process them one-by-one.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+	
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+	
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue the kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+	unsigned long flags;
+	unsigned int hash = kevent_user_hash(&k->event);
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+	k->flags |= KEVENT_USER;
+	u->kevent_num++;
+	kevent_user_get(u);
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	if (kevent_user_ring_grow(u)) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	kevent_user_enqueue(u, k);
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them to the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+		goto out_remove;
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u, 
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout, 
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait, 
+			u->ready_num >= min_nr, 
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+	
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent), 
+					&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u || num > KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, void __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * filesystem to get control file descriptors from.
+ */
+static int __init kevent_user_init(void)
+{
+	int err = 0;
+	
+	kevent_cache = kmem_cache_create("kevent_cache", 
+			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+	err = misc_register(&kevent_miscdev);
+	if (err) {
+		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+		goto err_out_exit;
+	}
+
+	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+static void __exit kevent_user_fini(void)
+{
+	misc_deregister(&kevent_miscdev);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8d3769b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-23 11:24   ` [take13 1/3] kevent: Core files Evgeniy Polyakov
  2006-08-23 11:24     ` [take13 2/3] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-08-23 12:51     ` Eric Dumazet
       [not found]       ` <20060823132753.GB29056@2ka.mipt.ru>
  2006-08-24 20:03     ` Christoph Hellwig
  2 siblings, 1 reply; 143+ messages in thread
From: Eric Dumazet @ 2006-08-23 12:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

Again, Evgeniy, I really begin to like kevent :)

On Wednesday 23 August 2006 13:24, Evgeniy Polyakov wrote:
+struct kevent
+{
+       /* Used for kevent freeing.*/
+       struct rcu_head         rcu_head;
+       struct ukevent          event;
+       /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+       spinlock_t              ulock;
+
+       /* Entry of user's queue. */
+       struct list_head        kevent_entry;
+       /* Entry of origin's queue. */
+       struct list_head        storage_entry;
+       /* Entry of user's ready. */
+       struct list_head        ready_entry;
+
+       u32                     flags;
+
+       /* User who requested this kevent. */
+       struct kevent_user      *user;
+       /* Kevent container. */
+       struct kevent_storage   *st;
+
+       struct kevent_callbacks callbacks;
+
+       /* Private data for different storages.
+        * poll()/select() storage has a list of wait_queue_t containers
+        * for each ->poll() { poll_wait() } here.
+        */
+       void                    *priv;
+};

I wonder if you can reorder the fields in this structure so that 'read mostly'
fields are grouped together, maybe in a different cache line.
This should help reduce false sharing on SMP.
Read-mostly fields are (but you know better than me): callbacks, rcu_head,
priv, user, event, ...


+#define KEVENT_MAX_EVENTS      4096

Could you please tell me (forgive me if you already clarified this point)
what happens if the number of queued events reaches this value?


+int kevent_init(struct kevent *k)
+{
+       spin_lock_init(&k->ulock);
+       k->flags = 0;
+
+       if (unlikely(k->event.type >= KEVENT_MAX))
+               return kevent_break(k);
+

As long as you are sure we cannot call kevent_enqueue()/kevent_dequeue() after
a failed kevent_init(), it should be fine.

+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+       struct kevent_callbacks *p;
+       
+       if (pos >= KEVENT_MAX)
+               return -EINVAL;

If a negative pos is used here we might crash. KEVENT_MAX is signed too, so
the compare is done on signed values.
If we consider callers always give a sane value, the test can be suppressed.
If we consider callers may be wrong, then we must do a correct test.
If you don't want to change the function prototype, then change the test to:

if ((unsigned)pos >= KEVENT_MAX)
              return -EINVAL;

Some people on lkml will prefer:
if (pos < 0 || pos >= KEVENT_MAX)
	return -EINVAL;
or
#define KEVENT_MAX 6U /* unsigned constant */

+static kmem_cache_t *kevent_cache;

You probably want to add __read_mostly here to avoid false sharing.

+static kmem_cache_t *kevent_cache __read_mostly;

Same for other caches :
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;


About the hash table :

+struct kevent_user
+{
+       struct list_head        kevent_list[KEVENT_HASH_MASK+1];
+       spinlock_t              kevent_lock;

epoll used to use a hash table too (its size was configurable at init time),
and was converted to an RB-tree for good reasons... (to avoid a user
allocating a big hash table in pinned memory and causing a DoS).
Are you sure a process handling one million sockets will succeed using kevent
instead of epoll?

Do you have a pointer to sample source code using the mmap()/kevent interface?
It's not clear to me how we can use it (and notice that once a full wrap has
occurred, a user app could miss x*KEVENT_MAX_EVENTS events?). Must we still
use a syscall to dequeue events?

In particular you state sizeof(mukevent) is 40, while it's 12:

+/*
+ * Note that kevents do not exactly fill the page (each mukevent is 40 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change the size of struct ukevent.
+ */

+struct mukevent
+{
+       struct kevent_id        id;  /* size()=8 */
+       __u32                   ret_flags; /* size()=4 */
+};



Thank you
Eric


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23 10:51                                       ` Evgeniy Polyakov
@ 2006-08-23 12:55                                         ` Jari Sundell
  2006-08-23 13:11                                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Jari Sundell @ 2006-08-23 12:55 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> We still do not know what uintptr_t is, and it looks like it is a pointer,
> which is forbidden. Those numbers are not enough to make network AIO.
> And actually is not compatible with kqueue already, so you will need to
> write your own parser to convert your parameters into above structure.

7.18.1.4 Integer types capable of holding object pointers

"1 The following type designates a signed integer type with the
property that any valid
pointer to void can be converted to this type, then converted back to
pointer to void,
and the result will compare equal to the original pointer:"

Dunno if this means that x86-64 needs yet another typedef, or if using
long for intptr_t is incorrect. But assuming a different integer type,
one known to be able to hold a pointer, were used instead of intptr_t,
would there still be any problems?

I'm unable to see anything specific about AIO in your kevent patch
that these modifications wouldn't support.

Rakshasa


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23 12:55                                         ` Jari Sundell
@ 2006-08-23 13:11                                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 13:11 UTC (permalink / raw)
  To: Jari Sundell
  Cc: David Miller, kuznet, nmiell, linux-kernel, drepper, akpm,
	netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 02:55:47PM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> On 8/23/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >We still do not know what uintptr_t is, and it looks like it is a pointer,
> >which is forbidden. Those numbers are not enough to make network AIO.
> >And actually is not compatible with kqueue already, so you will need to
> >write your own parser to convert your parameters into above structure.
> 
> 7.18.1.4 Integertypes capable of holding object pointers
> 
> "1 The following type designates a signed integer type with the
> property that any valid
> pointer to void can be converted to this type, then converted back to
> pointer to void,
> and the result will compare equal to the original pointer:"
> 
> Dunno if this means that x86-64 needs yet another typedef, or if using
> long for intptr_t is incorrect. But assuming a different integer type
> was used instead of intptr_t, that is known to be able to hold a
> pointer, would there still be any problems?

stdint.h

/* Types for `void *' pointers.  */
#if __WORDSIZE == 64
# ifndef __intptr_t_defined
typedef long int		intptr_t;
#  define __intptr_t_defined
# endif
typedef unsigned long int	uintptr_t;
#else
# ifndef __intptr_t_defined
typedef int			intptr_t;
#  define __intptr_t_defined
# endif
typedef unsigned int		uintptr_t;
#endif

which means that with a 32-bit userspace it will be only 32 bits wide.

> I'm unable to see anything specific about AIO in your kevent patch
> that these modifications wouldn't support.

I was asked to postpone the AIO stuff for now; you can find it in previous
patchsets sent a week or two ago.

> Rakshasa

-- 
	Evgeniy Polyakov


* Re: [take13 1/3] kevent: Core files.
       [not found]       ` <20060823132753.GB29056@2ka.mipt.ru>
@ 2006-08-23 13:44         ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 13:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 263 bytes --]

On Wed, Aug 23, 2006 at 05:27:53PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> One can find it in archive on homepage
> http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent 
> or attached.

Now it is really attached.

-- 
	Evgeniy Polyakov

[-- Attachment #2: evtest.c --]
[-- Type: text/plain, Size: 4911 bytes --]

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <sys/mman.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#include <linux/unistd.h>
#include <linux/types.h>

#define PAGE_SIZE	4096
#include <linux/ukevent.h>

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4);\
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5);\
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5,type6,arg6) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5, type6 arg6) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5, arg6);\
}

_syscall4(int, kevent_ctl, int, arg1, unsigned int, argv2, unsigned int, argv3, void *, argv4);
_syscall6(int, kevent_get_events, int, arg1, unsigned int, argv2, unsigned int, argv3, __u64, argv4, void *, argv5, unsigned, arg6);

#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog_err(f, a...) ulog(f ": %s [%d].\n", ##a, strerror(errno), errno)

static void usage(char *p)
{
	ulog("Usage: %s -t type -e event -o oneshot -p path -n wait_num -f kevent_file -h\n", p);
}

static int get_id(int type, char *path)
{
	int ret = -1;

	switch (type) {
		case KEVENT_TIMER:
			ret = 3000;
			break;
		case KEVENT_INODE:
			ret = open(path, O_RDONLY);
			break;
	}

	return ret;
}

static void *evtest_mmap(int fd, off_t *offset, unsigned int number)
{
	void *start, *ptr;
	off_t o = *offset;

	start = NULL;

	ptr = mmap(start, PAGE_SIZE*number, PROT_READ, MAP_SHARED, fd, o*PAGE_SIZE);
	if (ptr == MAP_FAILED) {
		ulog_err("Failed to mmap: start: %p, number: %u, offset: %lu", start, number, o);
		return NULL;
	}

	printf("mmap: ptr: %p, start: %p, number: %u, offset: %lu.\n", ptr, start, number, o);
	*offset =  o + number;
	return ptr;
}

int main(int argc, char *argv[])
{
	int ch, fd, err, type, event, oneshot, wait_num, number;
	unsigned int i, num, old_idx;
	char *path, *file;
	char buf[4096];
	struct ukevent *uk;
	struct kevent_mring *ring;
	off_t offset;

	path = NULL;
	type = event = -1;
	oneshot = 0;
	wait_num = 10;
	offset = 0;
	number = 1;
	old_idx = 0;
	file = "/dev/kevent";

	while ((ch = getopt(argc, argv, "f:p:t:e:o:n:h")) > 0) {
		switch (ch) {
			case 'f':
				file = optarg;
				break;
			case 'n':
				wait_num = atoi(optarg);
				break;
			case 'p':
				path = optarg;
				break;
			case 't':
				type = atoi(optarg);
				break;
			case 'e':
				event = atoi(optarg);
				break;
			case 'o':
				oneshot = atoi(optarg);
				break;
			default:
				usage(argv[0]);
				return -1;
		}
	}

	if (event == -1 || type == -1 || (type == KEVENT_INODE && !path)) {
		ulog("You need at least -t -e parameters and -p for inode notifications.\n");
		usage(argv[0]);
		return -1;
	}
	
	fd = open(file, O_RDWR);
	if (fd == -1) {
		ulog_err("Failed create kevent control block using file %s", file);
		return -1;
	}

	ring = evtest_mmap(fd, &offset, number);
	if (!ring)
		return -1;

	memset(buf, 0, sizeof(buf));
	
	num = 1;
	for (i=0; i<num; ++i) {
		uk = (struct ukevent *)buf;
		uk->event = event;
		uk->type = type;
		if (oneshot)
			uk->req_flags |= KEVENT_REQ_ONESHOT;
		uk->user[0] = i;
		uk->id.raw[0] = get_id(uk->type, path);

		err = kevent_ctl(fd, KEVENT_CTL_ADD, 1, uk);
		if (err < 0) {
			ulog_err("Failed to perform control operation: type=%d, event=%d, oneshot=%d", type, event, oneshot);
			close(fd);
			return err;
		}
		if (err) {
			ulog("%d: ret_flags: 0x%x, ret_data: %u %d.\n", i, uk->ret_flags, uk->ret_data[0], (int)uk->ret_data[1]);
		}
	}
	
	while (1) {
		err = kevent_get_events(fd, 1, wait_num, 10000000000ULL, buf, 0);
		if (err < 0) {
			ulog_err("Failed to perform control operation: type=%d, event=%d, oneshot=%d", type, event, oneshot);
			close(fd);
			return err;
		}

		num = ring->index;
		if (num != old_idx) {
			ulog("mmap: idx: %u, returned: %d.\n", num, err);
			while (old_idx != num) {
				if (old_idx < KEVENTS_ON_PAGE) {
					struct mukevent *m = &ring->event[old_idx];
					ulog("%08x: %08x.%08x - %08x\n",
						old_idx, m->id.raw[0], m->id.raw[1], m->ret_flags);
				} else {
					/*
					 * Mmap next page.
					 */
				}
				if (++old_idx >= KEVENT_MAX_EVENTS)
					old_idx = 0;
			}
			old_idx = num;
		}

		num = (unsigned)err;
		if (num) {
			ulog("syscall dump: %u events.\n", num);
			uk = (struct ukevent *)buf;
			for (i=0; i<num; ++i) {
				ulog("%08x: %08x.%08x - %08x.%08x\n", 
						uk[i].user[0],
						uk[i].id.raw[0], uk[i].id.raw[1],
						uk[i].ret_data[0], uk[i].ret_data[1]);
			}
		}
	}

	close(fd);
	return 0;
}


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23  7:50                               ` Evgeniy Polyakov
@ 2006-08-23 16:09                                 ` Andrew Morton
  2006-08-23 16:22                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Andrew Morton @ 2006-08-23 16:09 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jari Sundell, David Miller, kuznet, nmiell, linux-kernel,
	drepper, netdev, zach.brown, hch

On Wed, 23 Aug 2006 11:50:56 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Wed, Aug 23, 2006 at 12:07:58AM -0700, Andrew Morton (akpm@osdl.org) wrote:
> > On Wed, 23 Aug 2006 10:56:59 +0400
> > Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > 
> > > On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell (sundell.software@gmail.com) wrote:
> > > > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > > > timeout parameter for the syscall does not imply the use of timespec
> > > > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > > > struct kevent, which is the only (interface related) thing that will
> > > > be exposed.
> > > 
> > > void * in structure exported to userspace is forbidden.
> > > long in syscall requires wrapper in per-arch code (although that
> > > workaround _is_ there, it does not mean that broken interface should 
> > > be used).
> > > poll uses millisecods - it is perfectly ok.
> > 
> > I wonder whether designing-in a millisecond granularity is the right thing
> > to do.  If in a few years the kernel is running tickless with high-res clock
> > interrupt sources, that might look a bit lumpy.
> > 
> > Switching it to a __u64 nanosecond counter would be basically free on
> > 64-bit machines, and not very expensive on 32-bit, no?
> 
> I can put nanoseconds as the timer interval too (with aligned_u64 as David
> mentioned), and use it for the timeout value too - 64-bit nanoseconds ends
> up covering roughly 584 years, probably enough.
> Structures with u64 are really not so good an idea.
> 

OK.  One could do u32 seconds/u32 nsecs, but a simple aligned_u64 will be
better for 64-bit machines, and OK for 32-bit.


* Re: [take12 0/3] kevent: Generic event handling mechanism.
  2006-08-23 16:09                                 ` Andrew Morton
@ 2006-08-23 16:22                                   ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jari Sundell, David Miller, kuznet, nmiell, linux-kernel,
	drepper, netdev, zach.brown, hch

On Wed, Aug 23, 2006 at 09:09:20AM -0700, Andrew Morton (akpm@osdl.org) wrote:
> > I can put nanoseconds as the timer interval too (with aligned_u64 as David
> > mentioned), and use it for the timeout value too - 64-bit nanoseconds ends
> > up covering roughly 584 years, probably enough.
> > Structures with u64 are really not so good an idea.
> > 
> 
> OK.  One could do u32 seconds/u32 nsecs, but a simple aligned_u64 will be
> better for 64-bit machines, and OK for 32-bit.

aligned_u64 is not exported to userspace, so in the last patchset I just
use __u64 as the syscall parameter.

-- 
	Evgeniy Polyakov


* Re: The Proposed Linux kevent API
  2006-08-23  1:36                           ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Nicholas Miell
                                               ` (2 preceding siblings ...)
  2006-08-23  6:22                             ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Evgeniy Polyakov
@ 2006-08-23 18:24                             ` Stephen Hemminger
  3 siblings, 0 replies; 143+ messages in thread
From: Stephen Hemminger @ 2006-08-23 18:24 UTC (permalink / raw)
  To: David Miller
  Cc: rdunlap, johnpol, linux-kernel, drepper, akpm, netdev, zach.brown, hch

Could we try to write real documentation rather than pissy little arguing?
I set up a page to start:
	http://linux-net.osdl.org/index.php/Kevent



^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 0/3] kevent: Generic event handling mechanism.
       [not found]         ` <20060823134227.GC29056@2ka.mipt.ru>
@ 2006-08-23 18:56           ` Evgeniy Polyakov
  2006-08-23 19:42             ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 18:56 UTC (permalink / raw)
  To: Grzegorz Kulewski
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Wed, Aug 23, 2006 at 05:42:30PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Wed, Aug 23, 2006 at 03:05:15PM +0200, Grzegorz Kulewski (kangur@polcom.net) wrote:
> > >But you've raised an interesting question; I think it is a good possibility
> > >to have such a timeout per-event. Let me think a little about it.
> > >
> > >It can be done even without changing the size of the kevent structure - by
> > >reusing ret_data (since there is no need for a timeout when the event is
> > >ready, and if it is not ready and the timeout has expired, we can use a flag
> > >in ret_flags). That requires a per-event timer or a tricky state machine
> > >though. So I would ask the core developers whether we need such additional
> > >functionality?
> > 
> > Well, I will not comment on the implementation because I don't know it too 
> > well. But what is in my opinion essential is that:
> > 
> > a. The date-time of the timeout is not changed or reset if nothing (including 
> > a timeout) happens with this file descriptor. (This is to ensure we are 
> > tracking time from the last event/operation on that fd, not from the call to 
> > the wait function.)
> > 
> > b. The timeout is a date-time, not an amount of (milli)seconds, so no kernel or 
> > userspace work (like decrementing) is needed if nothing happens with this fd.
> 
> Please note that memory is limited in kernelspace, so the date-time should
> not be that heavy. It is possible to reuse 64 bits there without major
> surgery, which should be enough for a number of (...)seconds, but is not 
> enough to store a real date (although we can steal yet another 32 bits
> from ret_flags and reuse fields in req_flags).
> 
> > c. Time exceeded is an event, so userspace does not have to check all 
> > registered events/fds for timeouts (like it has to with today's event 
> > notification mechanisms).
> 
> It's not a problem.
> 
> > And it should be easy to use too... :)
> 
> It can be discussed at the very end :)

Actually, thinking some more about this issue, I've come to the conclusion
that it is not required.
The user can always create two kevents - one for the timer and one for real
data processing - and put cross-references into both, so when one of them
is ready the user can remove the other.

> > >>3. I had read this new patchset (especially the user interface part) and as I
> > >>see it, the user-visible part is monolithic. There is only one struct for all
> > >>types of events. Did you consider making one general struct (with a type
> > >>field, a reference to some event-specific struct and possibly some other
> > >>fields) and several small event-specific structs (that can be added later
> > >>as needed)? If so, why did you choose the monolithic way?
> > >
> > >Right now I do not see what benefit such extensible structures would
> > >have. If other developers think that it is worth it, it can be
> > >implemented.
> > 
> > Well, the only benefit I can see is that when somebody invents some 
> > completely new event type that requires something more than the current 
> > struct provides, it will be easy to add.
> > 
> > Also the user interface (and probably documentation) could be easier. For 
> > example, one event-specific struct per man page, and no 
> > reserved/undocumented/for-extensions-or-further-usage fields would be 
> > needed.
> 
> It can be done by selecting a special event type, which in turn will reuse
> special fields as length.
> But variable-sized members can not be put into a cache, and without
> knowledge of their size it is impossible to put them into the mapped buffer.

And thinking more about this issue, I can say that I'm against
variable-sized structures - they can not be placed into a ring buffer (at
least into a simple one), they do not allow allocation from a cache, it is
impossible to get them correctly from userspace if there is no exact
knowledge about the nature of those events, and there are a lot of other
problems.
If one strongly feels that it is required, it is possible to provide a
userspace pointer in the ukevent structure, which can then be read in the
->enqueue callback by the kernel side (there is a similar trick in network
AIO).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 0/3] kevent: Generic event handling mechanism.
  2006-08-23 18:56           ` [take13 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-08-23 19:42             ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-23 19:42 UTC (permalink / raw)
  To: Grzegorz Kulewski
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

On Wed, Aug 23, 2006 at 10:56:24PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > It can be done by selecting a special event type, which in turn will reuse
> > special fields as length.
> > But variable-sized members can not be put into a cache, and without
> > knowledge of their size it is impossible to put them into the mapped buffer.
> 
> And thinking more about this issue, I can say that I'm against
> variable-sized structures - they can not be placed into a ring buffer (at
> least into a simple one), they do not allow allocation from a cache, it is
> impossible to get them correctly from userspace if there is no exact
> knowledge about the nature of those events, and there are a lot of other
> problems.
> If one strongly feels that it is required, it is possible to provide a
> userspace pointer in the ukevent structure, which can then be read in the
> ->enqueue callback by the kernel side (there is a similar trick in network
> AIO).

I've reread my text - sorry for the tons of errors; I use an extremely slow
GPRS link, so it is almost impossible to go back and correct errors over
it. I think it is simple enough to understand what I meant :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-23 11:24   ` [take13 1/3] kevent: Core files Evgeniy Polyakov
  2006-08-23 11:24     ` [take13 2/3] kevent: poll/select() notifications Evgeniy Polyakov
  2006-08-23 12:51     ` [take13 1/3] kevent: Core files Eric Dumazet
@ 2006-08-24 20:03     ` Christoph Hellwig
  2006-08-25  5:48       ` Evgeniy Polyakov
  2 siblings, 1 reply; 143+ messages in thread
From: Christoph Hellwig @ 2006-08-24 20:03 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig

One question on the implementation of kevent_user_ctl_modify/
kevent_user_ctl_remove/kevent_user_ctl_add:  What benchmarks did you
do to add the separate 'fastpath' with the single onstack ukevent
structure if there are three or fewer events?  I can't believe this
actually helps in practice for various reasons:

 - you add quite a lot of icache footprint by duplicating all this code
 - kmalloc is really fast
 - two or three small copy_from/to_user calls are quite a bit slower
   than one that covers the size of all of them.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-24 20:03     ` Christoph Hellwig
@ 2006-08-25  5:48       ` Evgeniy Polyakov
  2006-08-25  6:20         ` Andrew Morton
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-25  5:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown

On Thu, Aug 24, 2006 at 09:03:22PM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> One question on the implementation of kevent_user_ctl_modify/
> kevent_user_ctl_remove/kevent_user_ctl_add:  What benchmarks did you
> do to add the separate 'fastpath' with the single onstack ukevent
> structure if there are three or fewer events?  I can't believe this
> actually helps in practice for various reasons:
> 
>  - you add quite a lot of icache footprint by duplicating all this code
>  - kmalloc is really fast
>  - two or three small copy_from/to_user calls are quite a bit slower
>    than one that covers the size of all of them.

kmalloc is really slow actually - it always shows up somewhere near the top 
in profiles and brings noticeable overhead (as was shown in the network tree 
allocator project, although bigger allocations were used there).
I chose 3 ukevents since they fit exactly one cache line (on my test
machine). In general I try to avoid allocation as much as possible, and
the more common usage case (for various servers) is to accept one client and
add it, instead of waiting for several of them and committing them at once.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-25  5:48       ` Evgeniy Polyakov
@ 2006-08-25  6:20         ` Andrew Morton
  2006-08-25  6:32           ` Evgeniy Polyakov
  2006-08-25  7:01           ` David Miller
  0 siblings, 2 replies; 143+ messages in thread
From: Andrew Morton @ 2006-08-25  6:20 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev,
	Zach Brown

On Fri, 25 Aug 2006 09:48:15 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> kmalloc is really slow actually - it always shows up somewhere near the top 
> in profiles and brings noticeable overhead

It shouldn't.  Please describe the workload and send the profiles.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-25  6:20         ` Andrew Morton
@ 2006-08-25  6:32           ` Evgeniy Polyakov
  2006-08-25  6:58             ` Andrew Morton
  2006-08-25  7:01           ` David Miller
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-25  6:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev,
	Zach Brown

On Thu, Aug 24, 2006 at 11:20:24PM -0700, Andrew Morton (akpm@osdl.org) wrote:
> On Fri, 25 Aug 2006 09:48:15 +0400
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > kmalloc is really slow actually - it always shows up somewhere near the top 
> > in profiles and brings noticeable overhead
> 
> It shouldn't.  Please describe the workload and send the profiles.

epoll based trivial server (accept + sendfile for the same file, about
4k), httperf with a large number of simultaneous connections. 3c59x NIC 
(with e1000 there were no ioreads and no netif_rx).
__alloc_skb calls kmem_cache_alloc() and __kmalloc().

16158     1.3681  ioread16
8073      0.6835  ioread32
3485      0.2951  irq_entries_start
3018      0.2555  _spin_lock
2103      0.1781  tcp_v4_rcv
1503      0.1273  sysenter_past_esp
1492      0.1263  netif_rx
1459      0.1235  skb_copy_bits
1422      0.1204  _spin_lock_irqsave
1145      0.0969  ip_route_input
983       0.0832  kmem_cache_free
964       0.0816  __alloc_skb
926       0.0784  common_interrupt
891       0.0754  __do_IRQ
846       0.0716  _read_lock
826       0.0699  __netif_rx_schedule
806       0.0682  __kmalloc
767       0.0649  do_tcp_sendpages
747       0.0632  __copy_to_user_ll
744       0.0630  pskb_expand_head


-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-25  6:32           ` Evgeniy Polyakov
@ 2006-08-25  6:58             ` Andrew Morton
  2006-08-25  7:20               ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Andrew Morton @ 2006-08-25  6:58 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev,
	Zach Brown

On Fri, 25 Aug 2006 10:32:38 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Thu, Aug 24, 2006 at 11:20:24PM -0700, Andrew Morton (akpm@osdl.org) wrote:
> > On Fri, 25 Aug 2006 09:48:15 +0400
> > Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > 
> > > kmalloc is really slow actually - it always shows up somewhere near the top 
> > > in profiles and brings noticeable overhead
> > 
> > It shouldn't.  Please describe the workload and send the profiles.
> 
> epoll based trivial server (accept + sendfile for the same file, about
> 4k), httperf with a large number of simultaneous connections. 3c59x NIC 
> (with e1000 there were no ioreads and no netif_rx).
> __alloc_skb calls kmem_cache_alloc() and __kmalloc().
> 
> 16158     1.3681  ioread16
> 8073      0.6835  ioread32
> 3485      0.2951  irq_entries_start
> 3018      0.2555  _spin_lock
> 2103      0.1781  tcp_v4_rcv
> 1503      0.1273  sysenter_past_esp
> 1492      0.1263  netif_rx
> 1459      0.1235  skb_copy_bits
> 1422      0.1204  _spin_lock_irqsave
> 1145      0.0969  ip_route_input
> 983       0.0832  kmem_cache_free
> 964       0.0816  __alloc_skb
> 926       0.0784  common_interrupt
> 891       0.0754  __do_IRQ
> 846       0.0716  _read_lock
> 826       0.0699  __netif_rx_schedule
> 806       0.0682  __kmalloc
> 767       0.0649  do_tcp_sendpages
> 747       0.0632  __copy_to_user_ll
> 744       0.0630  pskb_expand_head
> 

That doesn't look too bad.

What's that as a percentage of total user+system time?

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-25  6:20         ` Andrew Morton
  2006-08-25  6:32           ` Evgeniy Polyakov
@ 2006-08-25  7:01           ` David Miller
  2006-08-25  7:13             ` Andrew Morton
  1 sibling, 1 reply; 143+ messages in thread
From: David Miller @ 2006-08-25  7:01 UTC (permalink / raw)
  To: akpm; +Cc: johnpol, hch, linux-kernel, drepper, netdev, zach.brown

From: Andrew Morton <akpm@osdl.org>
Date: Thu, 24 Aug 2006 23:20:24 -0700

> On Fri, 25 Aug 2006 09:48:15 +0400
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > kmalloc is really slow actually - it always shows up somewhere near the top 
> > in profiles and brings noticeable overhead
> 
> It shouldn't.  Please describe the workload and send the profiles.

Not that I can account for the problem in this specific case, but in my
experience cutting down kmalloc() calls matters a _lot_ performance
wise.

For example, this is why we allocate TCP sockets as one huge blob
instead of 3 separate allocations (generic socket, IP socket, TCP
socket).

In fact, one of the remaining performance issues in IPSEC rule
creation is that we separately allocate hunks of memory for the rule's
encryption state, the optional hash algorithm state, etc.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-25  7:01           ` David Miller
@ 2006-08-25  7:13             ` Andrew Morton
  0 siblings, 0 replies; 143+ messages in thread
From: Andrew Morton @ 2006-08-25  7:13 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, hch, linux-kernel, drepper, netdev, zach.brown

On Fri, 25 Aug 2006 00:01:06 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@osdl.org>
> Date: Thu, 24 Aug 2006 23:20:24 -0700
> 
> > On Fri, 25 Aug 2006 09:48:15 +0400
> > Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > 
> > > kmalloc is really slow actually - it always shows up somewhere near the top 
> > > in profiles and brings noticeable overhead
> > 
> > It shouldn't.  Please describe the workload and send the profiles.
> 
> Not that I can account for the problem in this specific case, but in my
> experience cutting down kmalloc() calls matters a _lot_ performance
> wise.
> 
> For example, this is why we allocate TCP sockets as one huge blob
> instead of 3 separate allocations (generic socket, IP socket, TCP
> socket).
> 
> In fact, one of the remaining performance issues in IPSEC rule
> creation is that we separately allocate hunks of memory for the rule's
> encryption state, the optional hash algorithm state, etc.

Part of that will be cache sharing between the three structs though.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take13 1/3] kevent: Core files.
  2006-08-25  6:58             ` Andrew Morton
@ 2006-08-25  7:20               ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-25  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev,
	Zach Brown

On Thu, Aug 24, 2006 at 11:58:59PM -0700, Andrew Morton (akpm@osdl.org) wrote:
> > > > kmalloc is really slow actually - it always shows up somewhere near the top 
> > > > in profiles and brings noticeable overhead
> > > 
> > > It shouldn't.  Please describe the workload and send the profiles.
> > 
> > epoll based trivial server (accept + sendfile for the same file, about
> > 4k), httperf with a large number of simultaneous connections. 3c59x NIC 
> > (with e1000 there were no ioreads and no netif_rx).
> > __alloc_skb calls kmem_cache_alloc() and __kmalloc().
> > 
> > 16158     1.3681  ioread16
> > 8073      0.6835  ioread32
> > 3485      0.2951  irq_entries_start
> > 3018      0.2555  _spin_lock
> > 2103      0.1781  tcp_v4_rcv
> > 1503      0.1273  sysenter_past_esp
> > 1492      0.1263  netif_rx
> > 1459      0.1235  skb_copy_bits
> > 1422      0.1204  _spin_lock_irqsave
> > 1145      0.0969  ip_route_input
> > 983       0.0832  kmem_cache_free
> > 964       0.0816  __alloc_skb
> > 926       0.0784  common_interrupt
> > 891       0.0754  __do_IRQ
> > 846       0.0716  _read_lock
> > 826       0.0699  __netif_rx_schedule
> > 806       0.0682  __kmalloc
> > 767       0.0649  do_tcp_sendpages
> > 747       0.0632  __copy_to_user_ll
> > 744       0.0630  pskb_expand_head
> > 
> 
> That doesn't look too bad.
> 
> What's that as a percentage of total user+system time?

With e1000, allocations take more time than the actual TCP processing, which
raised some suspicion for me (especially in bulk transfer).
Total time is about 7 times the system time, and user time is much less
than system time (about 20 times less, but the test duration was not too
long, so it can vary).

I do not say it is bad, but it is noticeable and should be eliminated
if there is no requirement to have it.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* [take14 0/3] kevent: Generic event handling mechanism.
       [not found] <12345678912345.GA1898@2ka.mipt.ru>
                   ` (2 preceding siblings ...)
  2006-08-23 11:24 ` [take13 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-08-25  9:54 ` Evgeniy Polyakov
  2006-08-25  9:54   ` [take14 1/3] kevent: Core files Evgeniy Polyakov
  2006-08-27 21:03   ` [take14 0/3] kevent: Generic event handling mechanism Ulrich Drepper
  2006-09-04 10:14 ` [take15 0/4] " Evgeniy Polyakov
  2006-09-06 11:55 ` [take16 " Evgeniy Polyakov
  5 siblings, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-25  9:54 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


Generic event handling mechanism.

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there are no registered callbacks for the given kevent type
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added a header shared between userspace and kernelspace instead of embedding them in one
 * core restructuring to remove forward declarations
 * some whitespace coding style cleanup
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use nopage() method to dynamically substitute pages
	- allocate new page for events only when a newly added kevent requires it
	- do not use ugly index dereferencing, use structure instead
	- reduced amount of data in the ring (id and flags), 
		maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning for detection of the fact, that entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the same function, so lockdep 
	had to be taught to differentiate between the various cases
 * remove kevent from storage if is marked as broken after callback
 * fixed a typo in the mmaped buffer implementation which would end up in wrong index calculation 

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added optimisation, which is aimed to help when a lot of kevents are being copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
			unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent 
	initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match
	kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>



^ permalink raw reply	[flat|nested] 143+ messages in thread

* [take14 3/3] kevent: Timer notifications.
  2006-08-25  9:54     ` [take14 2/3] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-08-25  9:54       ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-25  9:54 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


Timer notifications.

Timer notifications can be used for fine-grained per-process time 
management, since interval timers are very inconvenient to use 
and they are limited in number.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..b2fee61
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,105 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct timer_list	ktimer;
+	struct kevent_storage	ktimer_storage;
+};
+
+static void kevent_timer_func(unsigned long data)
+{
+	struct kevent *k = (struct kevent *)data;
+	struct timer_list *t = k->st->origin;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+
+	mod_timer(&t->ktimer, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+
+	return 0;
+
+err_out_st_fini:
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	del_timer_sync(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback,
+		.enqueue = &kevent_timer_enqueue,
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take14 1/3] kevent: Core files.
  2006-08-25  9:54 ` [take14 " Evgeniy Polyakov
@ 2006-08-25  9:54   ` Evgeniy Polyakov
  2006-08-25  9:54     ` [take14 2/3] kevent: poll/select() notifications Evgeniy Polyakov
  2006-08-27 21:03   ` [take14 0/3] kevent: Generic event handling mechanism Ulrich Drepper
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-25  9:54 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


Core files.

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
 	.quad sys_tee
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl
 ia32_syscall_end:		
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_kevent_get_events	318
+#define __NR_kevent_ctl		319
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 320
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..de33ec7
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,173 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's queue. */
+	struct list_head	kevent_entry;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * poll()/select() storage keeps a list of wait_queue_t containers
+	 * here, one for each poll_wait() call made from ->poll().
+	 */
+	void			*priv;
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_user
+{
+	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t 		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+
+	unsigned int		pages_in_use;
+	/* Array of pages forming mapped ring buffer */
+	struct kevent_mring	**pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+			__func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)		({ (void) u;})
+#define kevent_stat_init(u)		({ (void) u;})
+#define kevent_stat_im(u)		({ (void) u;})
+#define kevent_stat_wait(u)		({ (void) u;})
+#define kevent_stat_total(u)		({ (void) u;})
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..4d72286 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+		__u64 timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..f8ff3a2
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,155 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT	0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * The number of per-type event sets must exactly match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_EVENTS	4096
+
+/*
+ * Note that kevents do not exactly fill a page (each mukevent is 12 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		index;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+#define	KEVENT_CTL_ADD 		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..977699c
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,31 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism.
+	  It can be used as a replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	default N
+	help
+	  This option turns kevent_user statistics collection on.
+	  The statistics include the total number of kevents, the number of
+	  kevents which were ready immediately at insertion time and the
+	  number of kevents which were removed through readiness completion.
+	  They are printed each time the control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows using the kevent subsystem for poll()/select()
+	  notifications.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..ab6bca0
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,3 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..422f585
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+
+	p = &kevent_registered_callbacks[pos];
+
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * On failure the kevent must not be used, since kevent_enqueue() would
+ * refuse to add it to the origin's queue, setting the KEVENT_RET_BROKEN
+ * flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX ||
+			!kevent_registered_callbacks[k->event.type].callback))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if the kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken; that removes them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries may be added to the storage at this point
+ * (the socket has already been removed from the file table, for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..8e01ec3
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,863 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+	u->pring[0]->index = num;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+	unsigned int idx;
+
+	idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
+	if (idx >= u->pages_in_use) {
+		u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[idx])
+			return -ENOMEM;
+		u->pages_in_use++;
+	}
+	return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	pidx = ring->index/KEVENTS_ON_PAGE;
+	off = ring->index%KEVENTS_ON_PAGE;
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->index >= KEVENT_MAX_EVENTS)
+		ring->index = 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It stores ready kevents, so userspace can fetch them directly instead of
+ * using a syscall; essentially the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int pnum;
+
+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+	u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+	if (!u->pring[0])
+		goto err_out_free;
+
+	u->pages_in_use = 1;
+	kevent_user_ring_set(u, 0);
+
+	return 0;
+
+err_out_free:
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+
+	for (i = 0; i < u->pages_in_use; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+		INIT_LIST_HEAD(&u->kevent_list[i]);
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the corresponding kevent file
+ * descriptor is closed, the reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+	struct kevent_user *u = vma->vm_file->private_data;
+	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
+	if (off >= u->pages_in_use)
+		goto err_out_sigbus;
+
+	return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+	return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+	.nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start;
+	struct kevent_user *u = file->private_data;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_ops = &kevent_user_vm_ops;
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * The entry is freed in the RCU callback; it has been dequeued
+ * from all lists by that point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removing.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove the kevent from the user's list of all events,
+ * dequeue it from its storage and decrease the user's reference counter,
+ * since this kevent no longer exists. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search a kevent inside hash bucket for given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
+		struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+
+	list_for_each_entry(k, head, kevent_entry) {
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] &&
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			ret = k;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	int err = -ENODEV;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+		list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+			kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for all of them and copy them in one shot than to
+ * copy and process them one by one.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for all of them and copy them in one shot than to
+ * copy and process them one by one.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+	unsigned long flags;
+	unsigned int hash = kevent_user_hash(&k->event);
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+	k->flags |= KEVENT_USER;
+	u->kevent_num++;
+	kevent_user_get(u);
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	if (kevent_user_ring_grow(u)) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	kevent_user_enqueue(u, k);
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace ones provided by the user and the number
+ * of ready events is returned.
+ * User must check ret_flags field of each ukevent structure
+ * to determine if it is fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+		goto out_remove;
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= min_nr,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+					&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u || num > KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, void __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * filesystem to get control file descriptors from.
+ */
+static int __init kevent_user_init(void)
+{
+	int err = 0;
+
+	kevent_cache = kmem_cache_create("kevent_cache",
+			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+	err = misc_register(&kevent_miscdev);
+	if (err) {
+		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+		goto err_out_exit;
+	}
+
+	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+static void __exit kevent_user_fini(void)
+{
+	misc_deregister(&kevent_miscdev);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8d3769b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take14 2/3] kevent: poll/select() notifications.
  2006-08-25  9:54   ` [take14 1/3] kevent: Core files Evgeniy Polyakov
@ 2006-08-25  9:54     ` Evgeniy Polyakov
  2006-08-25  9:54       ` [take14 3/3] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-25  9:54 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


poll/select() notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similar to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..76b3039 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -698,6 +699,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..fb74e0f
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct 	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	u32 revents;
+
+	revents = file->f_op->poll(file, NULL);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *w, *n;
+	unsigned long flags;
+
+	kevent_storage_dequeue(k->st, k);
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+		list_del(&w->container_entry);
+		remove_wait_queue(w->whead, &w->wait);
+		kmem_cache_free(kevent_poll_container_cache, w);
+	}
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+	k->priv = NULL;
+
+	fput(file);
+
+	return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	unsigned int revents = file->f_op->poll(file, NULL);
+
+	k->event.ret_data[0] = revents & k->event.event;
+
+	return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+	struct kevent_callbacks pc = {
+		.callback = &kevent_poll_callback,
+		.enqueue = &kevent_poll_enqueue,
+		.dequeue = &kevent_poll_dequeue};
+
+	kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+			sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+	if (!kevent_poll_container_cache) {
+		printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+		return -ENOMEM;
+	}
+
+	kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+			sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+	if (!kevent_poll_priv_cache) {
+		printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+		kmem_cache_destroy(kevent_poll_container_cache);
+		kevent_poll_container_cache = NULL;
+		return -ENOMEM;
+	}
+
+	kevent_add_callbacks(&pc, KEVENT_POLL);
+
+	printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+	return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+	lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+	kmem_cache_destroy(kevent_poll_priv_cache);
+	kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);



* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-25  9:54 ` [take14 " Evgeniy Polyakov
  2006-08-25  9:54   ` [take14 1/3] kevent: Core files Evgeniy Polyakov
@ 2006-08-27 21:03   ` Ulrich Drepper
  2006-08-28  1:57     ` David Miller
                       ` (2 more replies)
  1 sibling, 3 replies; 143+ messages in thread
From: Ulrich Drepper @ 2006-08-27 21:03 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters

[-- Attachment #1: Type: text/plain, Size: 13862 bytes --]

[Sorry for the length, but I want to be clear.]

As promised, before Monday (at least my time), here are my thoughts on
the proposed kevent interfaces.  Not necessarily well ordered:


- one point of critique which applied to many proposals over the years:
  multiplexer syscalls are bad, really bad.  They are more complicated
  to use at userlevel and in the kernel.  We've seen more than once that
  unimplemented functions are not reported correctly with ENOSYS.  Just
  use individual syscalls.  Adding them is cheap and probably overall
  less expensive than the multiplexer.



Events to wait for are basically all those with syscalls which can
potentially block indefinitely:

- file descriptor
- POSIX message queues (these are in fact file descriptors but
  let's make it legitimate)
- timer expiration
- signals (just as sigwait, not normal delivery instead of a handler)
- futexes (needs a lot more investigation)
- SysV message queues
- SysV semaphores
- bind socket operations (Alan brought this up in a different context)
- delays (nanosleep/clock_nanosleep, could be done using timers but the
  overhead would likely be too high)
- process state change (waitpid, wait4, waitid etc)
- file locking (flock, lockf)
-

We might also want to think about

- msync/fsync: Today's wait/no-wait option doesn't allow us to work on
  other things if the sync takes time and we need a real notification
  (i.e., if no-wait cannot be used)


The reporting must of course provide the userlevel code with enough
information to identify the request.  For submitting requests we need
such identification, too, so having unique identifiers for all the
different event types is necessary.  To some extent this is what the
KEVENT_TIMER_FIRED, KEVENT_SOCKET_RECV, etc. constants do.  But they
should be more generic in their names since we need to use them also
when registering the event.  I.e., KEVENT_EVENT_TIMER or so is more
appropriate.

Often (most of the time) this ID and the actual descriptor (file
descriptor, message queue descriptor, signal number, etc) is not
sufficient.  In the POSIX API we therefore usually have a cookie value
which the userlevel code can provide and which is returned unchanged as
part of the notification.  See the sigev_value member of struct
sigevent.  I think this is the best approach: it is compact and it gives
all the flexibility needed.  Userlevel code will store a value or more
often a pointer in the cookie and can then access additional information
based of the cookie.

I know there is a controversy around using pointer-sized values in
kernel structures which are exposed to userlevel.  It should be possible
to work around this.  We can simply always use 64-bit values and when
the data structure is exposed to 32-bit userland code only the first or
second 32-bit word of the structure is exposed with the name.  The other
word is padding.  If planned in from the beginning this should not cause
any problems at all.

Looking at the current struct mukevent, I don't think it is sufficient.
We need more room for the various types of events.  And we shouldn't
prevent future innovative uses.  I suggest creating records of a fixed
size with sufficient room.  Maybe 32 bytes are sufficient but I'd leave
this open until the very end.  Members of the structure must be
- ID of the type of event; type int
- descriptor (file descriptor, SysV msg descriptor, etc.); type int
- user-provided cookie; type uint64_t
That's only 16 bytes so far but we'll likely need more for some uses.


Next, the current interfaces once again fail to learn from a mistake we
made and which got corrected for the other interfaces.  We need to be
able to change the signal mask around the delay atomically.  Just like
we have ppoll for poll, pselect for select (and hopefully soon also
epoll_pwait for epoll_wait) we need to have this feature in the new
interfaces.


I read the description Nicholas Miell produced (the example programs
aren't available, accessing the URL fails for me) and looked over the
last patch (take 14).

The biggest problem I see so far is the integration into the existing
interfaces.  kevent notification *really* should be usable as a new
sigevent type.  Whether the POSIX interfaces are liked by kernel folks
or not, they are what the majority of the userlevel programmers use.
The mechanism is easily extensible.  I've described this in my paper.  I
cannot comment on the complexity of the kernel side but I'd imagine it's
not much more difficult, just different from what is implemented now.
Let's learn for a change from the mistakes of the past.  The new and
innovative AIO interfaces never took off because their implementation
differs so much from the POSIX interfaces.  People are interested in
portable code.  So, please, let's introduce SIGEV_KEVENT.  Then we
magically get timer notification etc for free.


The ring buffer interface is not described in Nicholas' description.
I'm looking at the sources and am a bit baffled.  For instance, the
kevent_user_ring_add_event function simply adds an event without
determining whether this overwrites an undelivered entry.  One single
index into the buffer isn't sufficient for this anyway.  So let me ask
some questions:

- how is userlevel code supposed to locate events in the buffer?  We
  can maintain a separate pointer for the ring buffer (in a separate
  location, which might actually be good for CPU cache reasons).  But
  this cannot solve all problems.  E.g., if the read pointer is
  initialized to zero (as is the write pointer), the ring buffer fits N
  entries, if now N+1 entries arrive before the first event is handled
  by the userlevel code, how does the userland code know that all ring
  buffer entries are valid?  Is the code supposed to always scan the
  entire buffer?

- we need to signal the ring buffer overflow in some form to the
  userlevel code.  What proposals have been made for this?  Signals
  are the old and tried mechanism.  I.e., one would be allowed to
  associate a signal with each kevent descriptor and receive overflow
  notifications this way.  When rt signals are used we even can get
  the kevent descriptor and possibly a user cookie delivered.
  Something like this is needed in case such a kevent queue is used
  in library code where we cannot rely on being the only user for an
  event.

I must admit I haven't spent too much time thinking about the ideal ring
buffer interface.  At OLS there were quite a few people (like Zach) who
said they did.  So, let's solicit advice.  I think the kernel AIO
interface can also provide some info on what not to do.


One aspect of the interface I did think about: the delay syscall.  I
already mentioned the signal mask issue above.  The interface already
has a timeout value (good!).  But we need to specify the semantics quite
detailed to avoid problems.

What I mean by that is the problem we are facing if there is more than
one thread waiting for events.  If no event is available all threads use
the delay syscall.  If now an event becomes available, what do we do?
Do we want exactly one thread?  This is a problem.  The thread might not
be working on the event after it gets woken (e.g., because the thread
gets canceled).  The result is that there is an event available and no
other thread gets woken.  This can be avoided by requiring that if a
thread, which got woken from a delay syscall, doesn't use the event, it
has to wake another thread.  But how do we do this?

One possibility I could see is that the delay syscall returns the event
which caused the thread to be woken.  This event is _not_ also reported
in the ring buffer.  Then, if the thread does not use the event, it
simply requeues it.  This will then implicitly wake another delayed thread.

Which brings me to the second point about the current kevent_get_events
syscall.  I don't think the min_nr parameter is useful.  Probably we
should not even allow the kevent queue to be used with different max_nr
parameters in different threads.  If you'd allow this, how would the
event notification be handled?  A waiter with a smaller required number
of events would always be woken first.  I think the number of required
events should be a property of the kevent object.  Then the code would
create different kevent object if the requirement is different.  At the
very least I'd declare it an error if at any time there are two or more
threads delayed which have different requirements on the number of
events.  This could provide all the flexibility needed while preventing
some of the mistakes one can make.



In summary, I don't think we're at the point where the current
interfaces are usable.  I'd like to see them redesigned and
reimplemented.  The bad news is that I'll not be able to help with the
coding.  The somewhat good news is that I can given some more
recommendations.  In general I still think the text from my OLS paper
applies:


- one syscall to create a kevent queue.  Using a special filesystem like
  take 14 does is OK.  But how do you pass parameters like the maximum
  number of expected outstanding events?  I think a dedicated syscall is
  better.  It also works more reliably since /proc might not be yet
  mounted when the first user of the interface is started.  The result
  should be a file descriptor.  At least an object which can be handled
  like a file descriptor when it comes to transmitting it over Unix
  domain sockets.  Questions to answer: what happens if you use the
  descriptor with any other interface but the kevent interfaces (I think
  all such calls like dup, read, write, ... should fail).

  int kevent_init (int num);


- one system call to create the userlevel ring buffer.  Simply
  overloading the mmap operation for the special kevent filesystem can
  work so no separate syscall is needed in that case.  We need to
  nail down the semantics, though.  What happens if more than one mmap
  call is made?  Does only the last one count?  Does the second one
  fail?  Will mremap() work to increase/decrease the size?  Will
  mremap() be allowed to be called with MREMAP_MAYMOVE?  What if mmap()
  is called from different processes (in the POSIX sense, i.e., from
  different address spaces)?

  Either

   mmap(...)

  Or

   int kevent_map_ringbuf (int kfd, size_t num)


- one interface to set additional parameters.  This is likely mostly to
  make the interfaces safe for the future.  Perhaps the number of events
  needed per delay call should be set this way.

    int kevent_ctl (int kfd, int cmd, ...)


- one interface to shut the kevent down.  This might be overkill.  We
  should be able to use munmap() and close().  If a real interface for
  this would be created it should look like this

   int kevent_destroy (int kfd, void *ringbuf, size_t num)

  I find this rather more cumbersome.  Just use close and munmap.


- one interface to submit requests.

    int kevent_submit (int kfd, struct kevent_event *ev, int flags,
                       struct timespec *timeout)

  Maybe the flags parameter isn't needed, it's just another way to make
  sure we won't regret the design later.  If the ring buffer can fill up
  and this is detected by the kernel (unlike what happens in take 14)
  then the calling thread could be delayed indefinitely.  Maybe we even
  have a deadlock if there is only one thread.  If only a wait/no-wait
  mode is needed, then use only a flags parameter and no timeout
  parameter.

  A special variant should be if ev == NULL the call is taken as a
  request to wake one or more delayed threads.


- one interface to delay threads until the next event becomes available.
  No data is transfered along with the call.  The event data must be
  read from the ring buffer:

    int kevent_wait (int kfd, unsigned ringstate,
                     const struct timespec *timeout,
                     const sigset_t *sigmask)

  Wait-mode can be implemented by recognizing timeout==NULL.  No-wait
  mode is implemented using timeout->tv_sec==timeout->tv_nsec==0.  If
  sigmask is NULL the signal mask is not changed.

  The ringstate parameter is also not present in the take 14 proposal.
  Something like it is necessary to prevent the thread from going to
  sleep while there are events in the ring buffer.  It would be very
  wasteful if the kernel would have to keep track of outstanding
  events.  This would also mean then handling events would require
  a system call, exactly what the ring buffer approach should prevent.

  I think the sequence for waiting for an event should be like this:

    + get current ring state
    + check whether any outstanding event in ring buffer
    + if yes, copy data out of ring buffer, mark ring buffer record
      as unused (atomically).
    + if no, call kevent_wait with ring state value

  When the kernel delivers a new event it does:

    + find place to store event
    + change ring state (might be a simple counter)

  The kevent_wait implementation in the kernel would then as the first
  thing determine whether the ring state changed.  If yes, the syscall
  returns immediately with -EWOULDBLOCK.  Otherwise it is queued for
  waiting.

  With these steps and the requirement that all ring buffer entries are
  processed FIFO we can
  a) avoid syscalls for freeing ring buffer entries
  b) detect overflows in the ring buffer
  c) can maintain the read pointer at userlevel while the kernel can
     maintain the write pointer into the buffer


-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖




* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-27 21:03   ` [take14 0/3] kevent: Generic event handling mechanism Ulrich Drepper
@ 2006-08-28  1:57     ` David Miller
  2006-08-28  2:11       ` Ulrich Drepper
  2006-08-28  2:40       ` Nicholas Miell
  2006-08-28  2:59     ` Nicholas Miell
  2006-08-31  7:58     ` Evgeniy Polyakov
  2 siblings, 2 replies; 143+ messages in thread
From: David Miller @ 2006-08-28  1:57 UTC (permalink / raw)
  To: drepper
  Cc: johnpol, linux-kernel, akpm, netdev, zach.brown, hch, chase.venters

From: Ulrich Drepper <drepper@redhat.com>
Date: Sun, 27 Aug 2006 14:03:33 -0700

> The biggest problem I see so far is the integration into the existing
> interfaces.  kevent notification *really* should be usable as a new
> sigevent type.  Whether the POSIX interfaces are liked by kernel folks
> or not, they are what the majority of the userlevel programmers use.
> The mechanism is easily extensible.  I've described this in my paper.  I
> cannot comment on the complexity of the kernel side but I'd imagine it's
> not much more difficult, just different from what is implemented now.
> Let's learn for a change from the mistakes of the past.  The new and
> innovative AIO interfaces never took off because their implementation
> differs so much from the POSIX interfaces.  People are interested in
> portable code.  So, please, let's introduce SIGEV_KEVENT.  Then we
> magically get timer notification etc for free.

I have to disagree with this.

SigEvent, and signals in general, are crap.  They are complex
and userland gets it wrong more often than not.  Interfaces
for userland should be simple, signals are not simple.  A core
loop that says "give me events to process", on the other hand,
is.  And this is what is most natural for userspace.

The user can say when he wants the process events.  In fact,
ripping out the complex signal handling will be a welcome
change for most server applications.

We are going to require the use of a new interface to register
the events anyways, why keep holding onto the delivery baggage
as well when we can break free of those limitations?


* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-28  1:57     ` David Miller
@ 2006-08-28  2:11       ` Ulrich Drepper
  2006-08-28  2:40       ` Nicholas Miell
  1 sibling, 0 replies; 143+ messages in thread
From: Ulrich Drepper @ 2006-08-28  2:11 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, linux-kernel, akpm, netdev, zach.brown, hch, chase.venters

[-- Attachment #1: Type: text/plain, Size: 1050 bytes --]

David Miller wrote:
> SigEvent, and signals in general, are crap.  They are complex
> and userland gets it wrong more often than not.  Interfaces
> for userland should be simple, signals are not simple.

You miss the point.

sigevent has nothing necessarily to do with signals.  I don't want
signals.  I just want the same interface to specify the action to be used.

If I'm using

  struct sigevent sigev;
  int kfd;

  kfd = kevent_create (...);

  sigev.sigev_notify = SIGEV_KEVENT;
  sigev.sigev_kfd = kfd;
  sigev.sigev_value.sival_ptr = &some_data;


then I can use this sigev variable in an unmodified timer_create call.
The kernel would see SIGEV_KEVENT (as opposed to SIGEV_SIGNAL etc) and
**not** generate a signal but instead create the event in the kevent queue.


The proposal to use sigevent has nothing to do with signals.  It's just
about the interface and to have smooth integration with existing
functionality.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-28  1:57     ` David Miller
  2006-08-28  2:11       ` Ulrich Drepper
@ 2006-08-28  2:40       ` Nicholas Miell
  1 sibling, 0 replies; 143+ messages in thread
From: Nicholas Miell @ 2006-08-28  2:40 UTC (permalink / raw)
  To: David Miller
  Cc: drepper, johnpol, linux-kernel, akpm, netdev, zach.brown, hch,
	chase.venters

On Sun, 2006-08-27 at 18:57 -0700, David Miller wrote:
> From: Ulrich Drepper <drepper@redhat.com>
> Date: Sun, 27 Aug 2006 14:03:33 -0700
> 
> > The biggest problem I see so far is the integration into the existing
> > interfaces.  kevent notification *really* should be usable as a new
> > sigevent type.  Whether the POSIX interfaces are liked by kernel folks
> > or not, they are what the majority of the userlevel programmers use.
> > The mechanism is easily extensible.  I've described this in my paper.  I
> > cannot comment on the complexity of the kernel side but I'd imagine it's
> > not much more difficult, just different from what is implemented now.
> > Let's learn for a change from the mistakes of the past.  The new and
> > innovative AIO interfaces never took off because their implementation
> > differs so much from the POSIX interfaces.  People are interested in
> > portable code.  So, please, let's introduce SIGEV_KEVENT.  Then we
> > magically get timer notification etc for free.
> 
> I have to disagree with this.
> 
> SigEvent, and signals in general, are crap.  They are complex
> and userland gets it wrong more often than not.  Interfaces
> for userland should be simple, signals are not simple.  A core
> loop that says "give me events to process", on the other hand,
> is.  And this is what is most natural for userspace.
> 
> The user can say when he wants to process events.  In fact,
> ripping out the complex signal handling will be a welcome
> change for most server applications.
> 
> We are going to require the use of a new interface to register
> the events anyways, why keep holding onto the delivery baggage
> as well when we can break free of those limitations?

struct sigevent is the POSIX method for describing how event
notifications are delivered.

Two methods are specified in POSIX -- SIGEV_SIGNAL, which delivers a
signal to the process and SIGEV_THREAD which creates a new thread in the
process and calls a user-supplied function. In addition to these two
methods, Linux also implements SIGEV_THREAD_ID, which sends a signal to
a specific thread (this is used internally by glibc to implement
SIGEV_THREAD, but I imagine that would change on the addition of
SIGEV_KEVENT).

Ulrich is suggesting the addition of SIGEV_KEVENT, which causes the
event notification to be delivered to a specific kevent queue. This
would allow for event delivery to kevent queues from POSIX AIO
completions, POSIX message queues, POSIX timers, glibc's async name
resolution interface and anything else that might use a struct sigevent
in the future.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-27 21:03   ` [take14 0/3] kevent: Generic event handling mechanism Ulrich Drepper
  2006-08-28  1:57     ` David Miller
@ 2006-08-28  2:59     ` Nicholas Miell
  2006-08-28 11:47       ` Jari Sundell
  2006-08-31  7:58     ` Evgeniy Polyakov
  2 siblings, 1 reply; 143+ messages in thread
From: Nicholas Miell @ 2006-08-28  2:59 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Evgeniy Polyakov, lkml, David Miller, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters

On Sun, 2006-08-27 at 14:03 -0700, Ulrich Drepper wrote:

[ note: there was lots of good stuff that I cut out because it was a
long email and I'm only replying to some of its points ]

> Events to wait for are basically all those with syscalls which can
> potentially block indefinitely:
> 
> - file descriptor
> - POSIX message queues (these are in fact file descriptors but
>   let's make it legitimate)
> - timer expiration
> - signals (just as sigwait, not normal delivery instead of a handler)

For some of them (like SIGTERM), delivery to a kevent queue would
actually make sense.

> The ring buffer interface is not described in Nicholas' description.

I wasn't even aware there was a ring-buffer interface in the proposed
patches. Another reason why the onus of documenting a patch is on the
originator: the random nobody who ends up doing the documenting may
screw it up.

> Which brings me to the second point about the current kevent_get_events
> syscall.  I don't think the min_nr parameter is useful.  Probably we
> should not even allow the kevent queue to be used with different max_nr
> parameters in different thread.  If you'd allow this, how would the
> event notification be handled?  A waiter with a smaller required number
> of events would always be woken first.  I think the number of required
> events should be a property of the kevent object.  Then the code would
> create different kevent object if the requirement is different.  At the
> very least I'd declare it an error if at any time there are two or more
> threads delayed which have different requirements on the number of
> events.  This could provide all the flexibility needed while preventing
> some of the mistakes one can make.

I was thinking about this, and it's even worse in the case where a
kevent fd is shared by different processes (either by forking or by
passing it via PF_UNIX sockets).

What happens when you queue an AIO completion to a shared kevent queue?
(The AIO read only happened in one address space, or did it? What if the
read was to a shared memory region? What if the memory region is shared,
but mapped at different addresses? What if not all of the processes
involved have that AIO fd open?)

Also complicated is the case where waiting threads have different
priorities, different timeouts, and different minimum event counts --
how do you decide which thread gets events first? What if the decisions
are different depending on whether you want to maximize throughput or
interactivity?

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-28  2:59     ` Nicholas Miell
@ 2006-08-28 11:47       ` Jari Sundell
  0 siblings, 0 replies; 143+ messages in thread
From: Jari Sundell @ 2006-08-28 11:47 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Ulrich Drepper, Evgeniy Polyakov, lkml, David Miller,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
	Chase Venters

On 8/28/06, Nicholas Miell <nmiell@comcast.net> wrote:
> Also complicated is the case where waiting threads have different
> priorities, different timeouts, and different minimum event counts --
> how do you decide which thread gets events first? What if the decisions
> are different depending on whether you want to maximize throughput or
> interactivity?

BTW, what is the intended use of the min event count parameter? The
obvious reason I can see, avoiding waking up a thread too often with
few queued events, would imo be handled cleaner by just passing a
parameter telling the kernel to try to queue more events.

With a min event count you'd have to use a rather low timeout to
ensure that events get handled within a reasonable time.

Rakshasa

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-27 21:03   ` [take14 0/3] kevent: Generic event handling mechanism Ulrich Drepper
  2006-08-28  1:57     ` David Miller
  2006-08-28  2:59     ` Nicholas Miell
@ 2006-08-31  7:58     ` Evgeniy Polyakov
  2006-09-09 16:10       ` Ulrich Drepper
  2 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-08-31  7:58 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: lkml, David Miller, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters

Hello.

Sorry for the long delay - I was on a short vacation.

On Sun, Aug 27, 2006 at 02:03:33PM -0700, Ulrich Drepper (drepper@redhat.com) wrote:
> [Sorry for the length, but I want to be clear.]
> 
> As promised, before Monday (at least my time), here are my thoughts on
> the proposed kevent interfaces.  Not necessarily well ordered:
> 
> 
> - one point of critique which applied to many proposals over the years:
>   multiplexer syscalls are bad, really bad.  They are more complicated
>   to use at userlevel and in the kernel.  We've seen more than once that
>   unimplemented functions are not reported correctly with ENOSYS.  Just
>   use individual syscalls.  Adding them is cheap and probably overall
>   less expensive than the multiplexer.

Can you convince Christoph?
I do not care about interfaces, but until several people agree on it, I
will not change anything.

> Events to wait for are basically all those with syscalls which can
> potentially block indefinitely:
> 
> - file descriptor
> - POSIX message queues (these are in fact file descriptors but
>   let's make it legitimate)
> - timer expiration
> - signals (just as sigwait, not normal delivery instead of a handler)
> - futexes (needs a lot more investigation)
> - SysV message queues
> - SysV semaphores
> - bind socket operations (Alan brought this up in a different context)
> - delays (nanosleep/clock_nanosleep, could be done using timers but the
>   overhead would likely be too high)
> - process state change (waitpid, wait4, waitid etc)
> - file locking (flock, lockf)
> -

You completely miss AIO here (I am not talking about POSIX AIO).

> We might also want to think about
> 
> - msync/fsync: Today's wait/no-wait option doesn't allow us to work on
>   other things if the sync takes time and we need a real notification
>   (i.e., if no-wait cannot be used)
> 
> 
> The reporting must of course provide the userlevel code with enough
> information to identify the request.  For submitting requests we need
> such identification, too, so having unique identifiers for all the
> different event types is necessary.  To some extent this is what the
> KEVENT_TIMER_FIRED, KEVENT_SOCKET_RECV, etc constants do.  But they
> should be more generic in their names since we need to use them also
> when registering the event.  I.e., KEVENT_EVENT_TIMER or so is more
> appropriate.

There are such identifiers.
We have _two_ levels of id:
 - the event type (KEVENT_EVENT_TIMER,
KEVENT_EVENT_POLL, KEVENT_EVENT_AIO and so on, but they are spelled
without _EVENT_ inside), which is the type of origin for the given events
 - the events themselves - timer fired, data received, client accepted and
   so on.

> Often (most of the time) this ID and the actual descriptor (file
> descriptor, message queue descriptor, signal number, etc) is not
> sufficient.  In the POSIX API we therefore usually have a cookie value
> which the userlevel code can provide and which is returned unchanged as
> part of the notification.  See the sigev_value member of struct
> sigevent.  I think this is the best approach: is compact and it gives
> all the flexibility needed.  Userlevel code will store a value or more
> often a pointer in the cookie and can then access additional information
> based of the cookie.

kevents have such "cookies".

> I know there is a controversy around using pointer-sized values in
> kernel structures which are exposed to userlevel.  It should be possible
> to work around this.  We can simply always use 64-bit values and when
> the data structure is exposed to 32-bit userland code only the first or
> second 32-bit word of the structure is exposed with the name.  The other
> word is padding.  If planned in from the beginning this should not cause
> any problems at all.

I use a union of two 32-bit values and a pointer to simplify userspace.
It was planned and implemented already.

> Looking at the current struct mukevent, I don't think it is sufficient.
>  We need more room for the various types of events.  And we shouldn't
> prevent future innovative uses.  I suggest to create records of a fixed
> size with sufficient room.  Maybe 32 bytes are sufficient but I'd leave
> this open for now until the very end.  Members of the structure must be
> - ID of the type of event; type int
> - descriptor (file descriptor, SysV msg descriptors etc); type int
> - user-provided cookie; type uint64_t
> That's only 16 bytes so far but we'll likely need more for some uses.

I use only an id provided by the user there; it is not a cookie, but it was
done to make the structure as small as possible.
Think about the size of the mapped buffer when there are several kevent
queues - it is all mapped and thus pinned memory.
It can of course be extended.

> Next, the current interfaces once again fail to learn from a mistake we
> made and which got corrected for the other interfaces.  We need to be
> able to change the signal mask around the delay atomically.  Just like
> we have ppoll for poll, pselect for select (and hopefully soon also
> epoll_pwait for epoll_wait) we need to have this feature in the new
> interfaces.

We are able to change kevents atomically.
 
> I read the description Nicholas Miell produced (the example programs
> aren't available, accessing the URL fails for me) and looked over the
> last patch (take 14).
> 
> The biggest problem I see so far is the integration into the existing
> interfaces.  kevent notification *really* should be usable as a new
> sigevent type.  Whether the POSIX interfaces are liked by kernel folks
> or not, they are what the majority of the userlevel programmers use.
> The mechanism is easily extensible.  I've described this in my paper.  I
> cannot comment on the complexity of the kernel side but I'd imagine it's
> not much more difficult, just different from what is implemented now.
> Let's learn for a change from the mistakes of the past.  The new and
> innovative AIO interfaces never took off because their implementation
> differs so much from the POSIX interfaces.  People are interested in
> portable code.  So, please, let's introduce SIGEV_KEVENT.  Then we
> magically get timer notification etc for free.

Well, I rarely talk about what other people want, but if you strongly
feel that all the POSIX crap is better than an epoll-like interface, then I
cannot agree with you.

It is possible to create an additional one using any POSIX API you like,
but I strongly insist on having the possibility to use a lightweight
syscall interface too.

> The ring buffer interface is not described in Nicholas' description.
> I'm looking at the sources and am a bit baffled.  For instance, the
> kevent_user_ring_add_event function simply adds an event without
> determining whether this overwrites an undelivered entry.  One single
> index into the buffer isn't sufficient for this anyway.  So let me ask
> some questions:
> 
> - how is userlevel code supposed to locate events in the buffer?  We
>   can maintain a separate pointer for the ring buffer (in a separate
>   location, which might actually be good for CPU cache reasons).  But
>   this cannot solve all problems.  E.g., if the read pointer is
>   initialized to zero (as is the write pointer), the ring buffer fits N
>   entries, if now N+1 entries arrive before the first event is handled
>   by the userlevel code, how does the userland code know that all ring
>   buffer entries are valid?  Is the code supposed to always scan the
>   entire buffer?

The ring buffer _always_ has space for new events until the queue is filled.
So if userspace does not read its events for too long and eventually
tries to add a new one, the addition will fail early.

> - we need to signal the ring buffer overflow in some form to the
>   userlevel code.  What proposals have been made for this?  Signals
>   are the old and tried mechanism.  I.e., one would be allowed to
>   associate a signal with each kevent descriptor and receive overflow
>   notifications this way.  When rt signals are used we even can get
>   the kevent descriptor and possibly a user cookie delivered.
>   Something like this is needed in case such a kevent queue is used
>   in library code where we cannot rely on being the only user for an
>   event.

There is no overflow - I do not want to introduce more signal queue
overflow crap here.
And once again - no signals.

> I must admit I haven't spent too much time thinking about the ideal ring
> buffer interface.  At OLS there were quite a few people (like Zach) who
> said they did.  So, let's solicit advice.  I think the kernel AIO
> interface can also provide some info on what not to do.

Sure, I would like to see a different design if it is ready.

> One aspect of the interface I did think about: the delay syscall.  I
> already mentioned the signal mask issue above.  The interface already
> has a timeout value (good!).  But we need to specify the semantics quite
> detailed to avoid problems.
> 
> What I mean by that is the problem we are facing if there is more than
> one thread waiting for events.  If no event is available all threads use
> the delay syscall.  If now an event becomes available, what do we do?
> Do we want exactly one thread?  This is a problem.  The thread might not
> be working on the event after it gets woken (e.g., because the thread
> gets canceled).  The result is that there is an event available and no
> other thread gets woken.  This can be avoided by requiring that if a
> thread, which got woken from a delay syscall, doesn't use the event, it
> has to wake another thread.  But how do we do this?

I can reformulate your words in a different manner. Please correct me if
I'm wrong.

You basically want to deliver the same event to several users.
But how do you want to achieve that with network buffers, for example?
When several threads read from the same socket, they do not obtain the
same data.
So I disagree that we need to deliver some events to several threads. If
you need to wake up several threads when one network socket is ready (I
seriously doubt you do), create a per-thread kevent queue and put that
socket into each of them.
If you want to wake several threads on a timeout - create per-thread
queues and put an event in each.

The simpler the interface is, the fewer problems we will catch when some
tricky configuration is used.

> One possibility I could see is that the delay syscall returns the event
> which caused the thread to be woken.  This event is _not_ also reported
> in the ring buffer.  Then, if the thread does not use the event, it
> simply requeues it.  This will then implicitly wake another delayed thread.
> 
> Which brings me to the second point about the current kevent_get_events
> syscall.  I don't think the min_nr parameter is useful.  Probably we
> should not even allow the kevent queue to be used with different max_nr
> parameters in different thread.  If you'd allow this, how would the
> event notification be handled?  A waiter with a smaller required number
> of events would always be woken first.  I think the number of required
> events should be a property of the kevent object.  Then the code would
> create different kevent object if the requirement is different.  At the
> very least I'd declare it an error if at any time there are two or more
> threads delayed which have different requirements on the number of
> events.  This could provide all the flexibility needed while preventing
> some of the mistakes one can make.
 
min_nr is used to specify the special case "wake up when at least one event
is ready and get all ready ones".
 
> In summary, I don't think we're at the point where the current
> interfaces are usable.  I'd like to see them redesigned and
> reimplemented.  The bad news is that I'll not be able to help with the
> coding.  The somewhat good news is that I can given some more
> recommendations.  In general I still think the text from my OLS paper
> applies:

I can do it.
But I will not until other core developers ack your proposals.
As I described above I disagree with most of them.
 
> - one syscall to create a kevent queue.  Using a special filesystem like
>   take 14 does is OK.  But how do you pass parameters like the maximum
>   number of expected outstanding events?  I think a dedicated syscall is
>   better.  It also works more reliably since /proc might not be yet

There are no "expected outstanding events"; I think that can be a problem.
Currently there is an absolute maximum of events, which cannot be
increased in real time.

>   mounted when the first user of the interface is started.  The result
>   should be a file descriptor.  At least an object which can be handled
>   like a file descriptor when it comes to transmitting it over Unix
>   domain sockets.  Questions to answer: what happens if you use the
>   descriptor with any other interface but the kevent interfaces (I think
>   all such calls like dup, read, write, ... should fail).
>
>   int kevent_init (int num);

Kevent always provides a file descriptor (which is "poll"able) as the
result of either opening a special file (like in the latest patchset), or
using a special filesystem (which was removed by Christoph).

> - one system call to create the userlevel ring buffer.  Simply
>   overloading the mmap operation for the special kevent filesystem can
>   work so no separate syscall is needed in that case.  We need to
>   nail down the semantics, though.  What happens if more than one mmap
>   call is made?  Does only the last one count?  Does the second one
>   fail?  Will mremap() work to increase/decrease the size?  Will
>   mremap() be allowed to be called with MREMAP_MAYMOVE?  What if mmap()
>   is called from different processes (in the POSIX sense, i.e., from
>   different address spaces)?
> 
>   Either
> 
>    mmap(...)
> 
>   Or
> 
>    int kevent_map_ringbuf (int kfd, size_t num)

Each subsequent mmap will map the existing buffers; the first mmap can
create that buffer.
 
> - one interface to set additional parameters.  This is likely mostly to
>   make the interfaces safe for the future.  Perhaps the number of events
>   needed per delay call should be set this way.
> 
>     int kevent_ctl (int kfd, int cmd, ...)
> 
> 
> - one interface to shut the kevent down.  This might be overkill.  We
>   should be able to use munmap() and close().  If a real interface for
>   this would be created it should look like this
> 
>    int kevent_destroy (int kfd, void *ringbuf, size_t num)
> 
>   I find this rather more cumbersome.  Just use close and munmap.
> 
> 
> - one interface to submit requests.
> 
>     int kevent_submit (int kfd, struct kevent_event *ev, int flags,
>                        struct timespec *timeout)
> 
>   Maybe the flags parameter isn't needed, it's just another way to make
>   sure we won't regret the design later.  If the ring buffer can fill up
>   and this is detected by the kernel (unlike what happens in take 14)

Just to repeat - with the current buffer implementation it cannot happen -
the maximum queue length is a limit for the buffer size.

>   then the calling thread could be delayed indefinitely.  Maybe we even
>   have a deadlock if there is only one thread.  If only a wait/no-wait
>   mode is needed, then use only a flags parameter and no timeout
>   parameter.
> 
>   A special variant should be if ev == NULL the call is taken as a
>   request to wake one or more delayed threads.

Well, you propose three different syscalls for three operations. I use
one with a multiplexer. I do not have a strong opinion on how it must be
done, but I created a policy for such changes - until other developers
ack such changes, nothing will be done.
 
> - one interface to delay threads until the next event becomes available.
>   No data is transfered along with the call.  The event data must be
>   read from the ring buffer:
> 
>     int kevent_wait (int kfd, unsigned ringstate,
>                      const struct timespec *timeout,
>                      const sigset_t *sigmask)

Yes, I agree, this is a good syscall.
Except for the signals (no signals, that's the rule) and the variable-sized
timespec structure. What about putting a u64 number of nanoseconds there?

>   Wait-mode can be implemented by recognizing timeout==NULL.  no-wait
>   mode is implemented using timeout->tv_sec==timeout->tv_nsec==0.  If
>   sigset_t is NULL the signal mask is not changed.
> 
>   The ringstate parameter is also not present in the take 14 proposal.
>   Something like it is necessary to prevent the thread from going to
>   sleep while there are events in the ring buffer.  It would be very
>   wasteful if the kernel would have to keep track of outstanding
>   events.  This would also mean then handling events would require
>   a system call, exactly what the ring buffer approach should prevent.

It is possible to put there the number of the last "acked" kevent, so the
kernel will remove all events which were placed into the buffer before
and including this one.

>   I think the sequence for waiting for an event should be like this:
> 
>     + get current ring state
>     + check whether any outstanding event in ring buffer
>     + if yes, copy data out of ring buffer, mark ring buffer record
>       as unused (atomically).
>     + if no, call kevent_wait with ring state value
>
>   When the kernel delivers a new event it does:
> 
>     + find place to store event
>     + change ring state (might be a simple counter)

What about the following:
userspace:
 - check the ring index; if it differs from the one stored in userspace,
   then there are events between the old stored index and the new one
   just read.
 - copy the events
 - call kevent_wait() or another method to tell the kernel that all events
   up to the number provided in the syscall are processed, so the kernel
   can remove them and put new ones there.

kernelspace:
 - when a new kevent is added, it is guaranteed that there is a place for
   it in the kernel ring buffer
 - when an event is ready it is copied into the mapped buffer and the
   index of the "last ready" one is increased (it is a fully atomic
   operation)
 - when userspace calls kevent_wait() the kernel gets the ring index from
   the syscall, searches for all events up to the provided number and
   frees them (or rearms them)
   

Except for the kevent_wait() implementation, this is how it is implemented
right now.

>   The kevent_wait implementation in the kernel would then as the first
>   thing determine whether the ring state changed.  If yes, the syscall
>   returns immediately with -EWOULDBLOCK.  Otherwise it is queued for
>   waiting.
> 
>   With these steps and the requirement that all ring buffer entries are
>   processed FIFO we can
>   a) avoid syscalls to avoid freeing ring buffer entries
>   b) detect overflows in the ring buffer
>   c) can maintain the read pointer at userlevel while the kernel can
>      maintain the write pointer into the buffer

As shown above it is already implemented.
 
> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> 



-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take15 0/4] kevent: Generic event handling mechanism.
  2006-09-04 10:14 ` [take15 0/4] " Evgeniy Polyakov
@ 2006-09-04  9:58   ` Evgeniy Polyakov
  2006-09-04 10:14   ` [take15 1/4] kevent: Core files Evgeniy Polyakov
  2006-09-04 10:24   ` [take15 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov
  2 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-04  9:58 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters

On Mon, Sep 04, 2006 at 02:14:20PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> 
> Generic event handling mechanism.

Unfortunately bogofilter on vger.kernel.org decided that the socket and
timer notification patches are spam, so they will not be found in the
linux-kernel archive.

One can use kevent homepage instead:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent


Missed socket notifications description:
# added socket notifications (send/recv/accept). Using a trivial web
# server based on kevent and these features instead of epoll, its
# performance increased more than noticeably. More details about the
# benchmark and the server itself (evserver_kevent.c) can be found on the
# project homepage. Split patches are available in the archive
# http://tservice.net.ru/~s0mbre/archive/kevent/2.6.18.15

-- 
	Evgeniy Polyakov

-- 
VGER BF report: U 0.499757

^ permalink raw reply	[flat|nested] 143+ messages in thread

* [take15 0/4] kevent: Generic event handling mechanism.
       [not found] <12345678912345.GA1898@2ka.mipt.ru>
                   ` (3 preceding siblings ...)
  2006-08-25  9:54 ` [take14 " Evgeniy Polyakov
@ 2006-09-04 10:14 ` Evgeniy Polyakov
  2006-09-04  9:58   ` Evgeniy Polyakov
                     ` (2 more replies)
  2006-09-06 11:55 ` [take16 " Evgeniy Polyakov
  5 siblings, 3 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-04 10:14 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


Generic event handling mechanism.

Changes from 'take14' patchset:
 * added kevent_wait()
    This syscall waits until either timeout expires or at least one event
    becomes ready. It also commits that @num events from @start are processed
    by userspace and thus can be be removed or rearmed (depending on it's flags).
    It can be used for commit events read by userspace through mmap interface.
    Example userspace code (evtest.c) can be found on project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added a header shared between userspace and kernelspace instead of embedding the definitions in one
 * core restructuring to remove forward declarations
 * some whitespace / coding style cleanups
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use the nopage() method to dynamically substitute pages
	- allocate a new page for events only when a newly added kevent requires it
	- do not use ugly index dereferencing, use a structure instead
	- reduced amount of data in the ring (id and flags);
		maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect whether an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * silenced lockdep warnings: all storage locks are initialized in the same
	function, so lockdep was taught to differentiate between the various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in the mmapped buffer implementation which would end up in wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use the kevent_user_ctl structure; instead provide the needed arguments as syscall parameters
 * various indent cleanups
 * added an optimisation aimed at the case when a lot of kevents are copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
			unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent
	initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match
	kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>



-- 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* [take15 1/4] kevent: Core files.
  2006-09-04 10:14 ` [take15 0/4] " Evgeniy Polyakov
  2006-09-04  9:58   ` Evgeniy Polyakov
@ 2006-09-04 10:14   ` Evgeniy Polyakov
  2006-09-04 10:14     ` [take15 2/4] kevent: poll/select() notifications Evgeniy Polyakov
  2006-09-05 13:28     ` [take15 1/4] kevent: Core files Arnd Bergmann
  2006-09-04 10:24   ` [take15 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov
  2 siblings, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-04 10:14 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck


Core files.

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..c10698e 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,6 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl
+	.long sys_kevent_wait		/* 320 */
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..a06b76f 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -710,7 +710,10 @@ #endif
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl
+	.quad sys_kevent_wait		/* 320 */
 ia32_syscall_end:		
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..68072b5 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,13 @@ #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_kevent_get_events	318
+#define __NR_kevent_ctl		319
+#define __NR_kevent_wait	320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 321
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..ee907ad 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..67007f2
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,196 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is invoked each time a new event has been caught. */
+/* @enqueue is invoked each time a new event is queued. */
+/* @dequeue is invoked each time an event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's queue. */
+	struct list_head	kevent_entry;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * The poll()/select() storage keeps a list of wait_queue_t
+	 * containers here, one for each poll_wait() call made from ->poll().
+	 */
+	void			*priv;
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_user
+{
+	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t 		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+
+	unsigned int		pages_in_use;
+	/* Array of pages forming mapped ring buffer */
+	struct kevent_mring	**pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+			__func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)		({ (void) u;})
+#define kevent_stat_init(u)		({ (void) u;})
+#define kevent_stat_im(u)		({ (void) u;})
+#define kevent_stat_wait(u)		({ (void) u;})
+#define kevent_stat_total(u)		({ (void) u;})
+#endif
+
+#ifdef CONFIG_KEVENT_SOCKET
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..cbb1c0d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,8 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+		__u64 timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..f8ff3a2
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,155 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT	0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * The number of per-type event sets must match the number of kevent types exactly.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_EVENTS	4096
+
+/*
+ * Note that the kevents do not exactly fill a page (each mukevent is 12 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		index;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+#define	KEVENT_CTL_ADD 		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..85ad472
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,40 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the kernel event queue mechanism.
+	  It can be used as a replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	default n
+	help
+	  This option turns kevent_user statistics collection on.
+	  The statistics include the total number of kevents, the number of
+	  kevents which are ready immediately at insertion time and the
+	  number of kevents which were removed through readiness completion.
+	  They are printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows timers to be used through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows the kevent subsystem to be used for
+	  poll()/select() notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications through the KEVENT subsystem
+	  of socket operations, like new packet arrival conditions and
+	  ready-for-accept conditions.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..422f585
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into the appropriate origin's queue.
+ * Returns a positive value if the event is ready immediately,
+ * a negative value in case of error and zero if the event has been queued.
+ * The ->enqueue() callback must increase the origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove the event from the appropriate queue.
+ * The ->dequeue() callback must decrease the origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+
+	p = &kevent_registered_callbacks[pos];
+
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before the event is added into some origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, since kevent_enqueue() would
+ * fail to add it into the origin's queue, setting the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX ||
+			!kevent_registered_callbacks[k->event.type].callback))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from the ->enqueue() callback after the reference counter of the
+ * given origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue the kevent from the origin's queue.
+ * It does not decrease the origin's reference counter in any way,
+ * and must be called before that happens, so the storage itself is still valid.
+ * It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call the kevent's ready callback and queue it into the ready queue if needed.
+ * If the kevent is marked as one-shot, remove it from the storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if the kevent is ready (by invoking its callback) and requeue/remove
+ * it if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in the origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken, which will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (For example, the socket has already been removed from the file table.)
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..e9560f7
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,961 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+	u->pring[0]->index = num;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+	unsigned int idx;
+
+	idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
+	if (idx >= u->pages_in_use) {
+		u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[idx])
+			return -ENOMEM;
+		u->pages_in_use++;
+	}
+	return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	pidx = ring->index/KEVENTS_ON_PAGE;
+	off = ring->index%KEVENTS_ON_PAGE;
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->index >= KEVENT_MAX_EVENTS)
+		ring->index = 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace could get them directly instead
+ * of using syscall. Esentially syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int pnum;
+
+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+	u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+	if (!u->pring[0])
+		goto err_out_free;
+
+	u->pages_in_use = 1;
+	kevent_user_ring_set(u, 0);
+
+	return 0;
+
+err_out_free:
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+
+	for (i = 0; i < u->pages_in_use; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+		INIT_LIST_HEAD(&u->kevent_list[i]);
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the corresponding kevent file descriptor
+ * is closed, the reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+	struct kevent_user *u = vma->vm_file->private_data;
+	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
+	if (off >= u->pages_in_use)
+		goto err_out_sigbus;
+
+	return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+	return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+	.nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start;
+	struct kevent_user *u = file->private_data;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_ops = &kevent_user_vm_ops;
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline unsigned int __kevent_user_hash(struct kevent_id *id)
+{
+	return jhash_1word(id->raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return __kevent_user_hash(&uk->id);
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * The entry is freed in an RCU callback; it has been dequeued from
+ * all lists at that point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removal: dequeue the kevent from the storage list
+ * if requested, remove it from the ready list, drop the userspace
+ * control block reference counter and schedule kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removing.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from the user's list of all events,
+ * dequeue it from storage and decrease the user's reference counter,
+ * since this kevent no longer exists. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search a kevent inside hash bucket for given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
+		struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+
+	list_for_each_entry(k, head, kevent_entry) {
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] &&
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			ret = k;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	int err = -ENODEV;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+		list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+			kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+	unsigned long flags;
+	unsigned int hash = kevent_user_hash(&k->event);
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+	k->flags |= KEVENT_USER;
+	u->kevent_num++;
+	kevent_user_get(u);
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	if (kevent_user_ring_grow(u)) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	kevent_user_enqueue(u, k);
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the
+ * number of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+		goto out_remove;
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= min_nr,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+					&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u || num > KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent character device.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, void __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall waits until there is free space in the kevent queue
+ * and removes @num ready kevents starting from index @start.
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first kevent processed by userspace.
+ * @num - number of processed kevents.
+ * @timeout - number of nanoseconds to wait until there is
+ * 	free space in the kevent queue.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, found;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k, *n;
+	struct mukevent *muk;
+	unsigned int idx, off, hash;
+	unsigned long flags;
+
+	if (start + num >= KEVENT_MAX_EVENTS || 
+			start >= KEVENT_MAX_EVENTS || 
+			num >= KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	if (((start + num) / KEVENTS_ON_PAGE) >= u->pages_in_use || 
+			(start / KEVENTS_ON_PAGE) >= u->pages_in_use)
+		goto out_fput;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	while (num > 0) {
+		idx = start / KEVENTS_ON_PAGE;
+		off = start % KEVENTS_ON_PAGE;
+
+		muk = &u->pring[idx]->event[off];
+		hash = __kevent_user_hash(&muk->id);
+		found = 0;
+		list_for_each_entry_safe(k, n, &u->kevent_list[hash], kevent_entry) {
+			if ((k->event.id.raw[0] == muk->id.raw[0]) && (k->event.id.raw[1] == muk->id.raw[1])) {
+				/*
+				 * Optimization for the case when there is only one rearming
+				 * kevent and buggy userspace sets the start index to zero.
+				 */
+				if (k->flags & KEVENT_READY) {
+					spin_lock(&u->ready_lock);
+					if (k->flags & KEVENT_READY) {
+						list_del(&k->ready_entry);
+						k->flags &= ~KEVENT_READY;
+						u->ready_num--;
+					}
+					spin_unlock(&u->ready_lock);
+				}
+
+				if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+					__kevent_finish_user(k, 1);
+				found = 1;
+
+				break;
+			}
+		}
+
+		if (!found) {
+			spin_unlock_irqrestore(&u->kevent_lock, flags);
+			goto out_fput;
+		}
+
+		if (++start >= KEVENT_MAX_EVENTS)
+			start = 0;
+		num--;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= 1,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	fput(file);
+
+	return (u->ready_num >= 1) ? 0 : -EAGAIN;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the misc device from which control file descriptors are obtained.
+ */
+static int __devinit kevent_user_init(void)
+{
+	int err = 0;
+
+	kevent_cache = kmem_cache_create("kevent_cache",
+			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+	err = misc_register(&kevent_miscdev);
+	if (err) {
+		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+		goto err_out_exit;
+	}
+
+	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+	misc_deregister(&kevent_miscdev);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..564e618 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,10 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);


-- 
VGER BF report: U 0.974357

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take15 4/4] kevent: Timer notifications.
  2006-09-04 10:14       ` [take15 3/4] kevent: Socket notifications Evgeniy Polyakov
@ 2006-09-04 10:14         ` Evgeniy Polyakov
  2006-09-05 13:39           ` Arnd Bergmann
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-04 10:14 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


Timer notifications.

Timer notifications can be used for fine-grained per-process time 
management, since interval timers are very inconvenient to use 
and are limited.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..b2fee61
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,105 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct timer_list	ktimer;
+	struct kevent_storage	ktimer_storage;
+};
+
+static void kevent_timer_func(unsigned long data)
+{
+	struct kevent *k = (struct kevent *)data;
+	struct timer_list *t = k->st->origin;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+
+	mod_timer(&t->ktimer, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+
+	return 0;
+
+err_out_st_fini:
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	del_timer_sync(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback,
+		.enqueue = &kevent_timer_enqueue,
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);


-- 

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take15 3/4] kevent: Socket notifications.
  2006-09-04 10:14     ` [take15 2/4] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-09-04 10:14       ` Evgeniy Polyakov
  2006-09-04 10:14         ` [take15 4/4] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-04 10:14 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck


Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features 
instead of epoll, its performance increased noticeably.
More details about the benchmark and the server itself (evserver_kevent.c)
can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/inode.c b/fs/inode.c
index 0bf9f04..181521d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -165,12 +166,18 @@ #endif
 		}
 		memset(&inode->u, 0, sizeof(inode->u));
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..a697930 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/include/net/sock.h b/include/net/sock.h
index 324b3ea..5d71ed7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 			wake_up_interruptible(sk->sk_sleep);
+			kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 			if (!inet_csk_ack_scheduled(sk))
 				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 						          (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..5b15f22
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,142 @@
+/*
+ * 	kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+	struct sock *sk = SOCKET_I(inode)->sk;
+	int rmem;
+	
+	if (k->event.event & KEVENT_SOCKET_RECV) {
+		int ret = 0;
+		
+		if ((rmem = atomic_read(&sk->sk_rmem_alloc)) > 0 || 
+				!skb_queue_empty(&sk->sk_receive_queue))
+			ret = 1;
+		if (sk->sk_shutdown & RCV_SHUTDOWN)
+			ret = 1;
+		if (ret)
+			return ret;
+	}
+	if ((k->event.event & KEVENT_SOCKET_ACCEPT) && 
+		(!reqsk_queue_empty(&inet_csk(sk)->icsk_accept_queue) || 
+		 	reqsk_queue_len_young(&inet_csk(sk)->icsk_accept_queue))) {
+		k->event.ret_data[1] = reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
+		return 1;
+	}
+
+	return 0;
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+	struct inode *inode;
+	struct socket *sock;
+	int err = -ENODEV;
+
+	sock = sockfd_lookup(k->event.id.raw[0], &err);
+	if (!sock)
+		goto err_out_exit;
+
+	inode = igrab(SOCK_INODE(sock));
+	if (!inode)
+		goto err_out_fput;
+
+	err = kevent_storage_enqueue(&inode->st, k);
+	if (err)
+		goto err_out_iput;
+
+	err = k->callbacks.callback(k);
+	if (err)
+		goto err_out_dequeue;
+
+	sockfd_put(sock);
+	return err;
+
+err_out_dequeue:
+	kevent_storage_dequeue(k->st, k);
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	sockfd_put(sock);
+err_out_exit:
+	return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+
+	kevent_storage_dequeue(k->st, k);
+	iput(inode);
+
+	return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+	if (sk->sk_socket)
+		kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+	struct inode *inode = SOCK_INODE(sock);
+
+	lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+	if (sk->sk_socket) {
+		struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+		lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+	}
+}
+#endif
+static int __init kevent_init_socket(void)
+{
+	struct kevent_callbacks sc = {
+		.callback = &kevent_socket_callback,
+		.enqueue = &kevent_socket_enqueue,
+		.dequeue = &kevent_socket_dequeue};
+
+	return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index 51fcfbc..4f91615 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1406,6 +1406,7 @@ static void sock_def_wakeup(struct sock 
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible_all(sk->sk_sleep);
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_error_report(struct sock *sk)
@@ -1415,6 +1416,7 @@ static void sock_def_error_report(struct
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,0,POLL_ERR); 
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_readable(struct sock *sk, int len)
@@ -1424,6 +1426,7 @@ static void sock_def_readable(struct soc
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,1,POLL_IN);
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_write_space(struct sock *sk)
@@ -1443,6 +1446,7 @@ static void sock_def_write_space(struct 
 	}
 
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 }
 
 static void sock_def_destruct(struct sock *sk)
@@ -1493,6 +1497,8 @@ #endif
 	sk->sk_state		=	TCP_CLOSE;
 	sk->sk_socket		=	sock;
 
+	kevent_sk_reinit(sk);
+
 	sock_set_flag(sk, SOCK_ZAPPED);
 
 	if(sock)
@@ -1559,8 +1565,10 @@ void fastcall release_sock(struct sock *
 	if (sk->sk_backlog.tail)
 		__release_sock(sk);
 	sk->sk_lock.owner = NULL;
-	if (waitqueue_active(&sk->sk_lock.wq))
+	if (waitqueue_active(&sk->sk_lock.wq)) {
 		wake_up(&sk->sk_lock.wq);
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+	}
 	spin_unlock_bh(&sk->sk_lock.slock);
 }
 EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
 			wake_up_interruptible(sk->sk_sleep);
 		if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
 			sock_wake_async(sock, 2, POLL_OUT);
+		kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 	}
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 104af5d..14cee12 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3112,6 +3112,7 @@ static void tcp_ofo_queue(struct sock *s
 
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if(skb->h.th->fin)
 			tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4b04c3e..cda1500 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
 #include <linux/jhash.h>
 #include <linux/init.h>
 #include <linux/times.h>
+#include <linux/kevent.h>
 
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
@@ -867,6 +868,7 @@ #endif
 	   	reqsk_free(req);
 	} else {
 		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+		kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
 	}
 	return 0;
 
diff --git a/net/socket.c b/net/socket.c
index b4848ce..42e19e2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
 #include <linux/kmod.h>
 #include <linux/audit.h>
 #include <linux/wireless.h>
+#include <linux/kevent.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -526,6 +527,8 @@ static struct socket *sock_alloc(void)
 	inode->i_uid = current->fsuid;
 	inode->i_gid = current->fsgid;
 
+	kevent_socket_reinit(sock);
+
 	get_cpu_var(sockets_in_use)++;
 	put_cpu_var(sockets_in_use);
 	return sock;



^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take15 2/4] kevent: poll/select() notifications.
  2006-09-04 10:14   ` [take15 1/4] kevent: Core files Evgeniy Polyakov
@ 2006-09-04 10:14     ` Evgeniy Polyakov
  2006-09-04 10:14       ` [take15 3/4] kevent: Socket notifications Evgeniy Polyakov
  2006-09-05 13:28     ` [take15 1/4] kevent: Core files Arnd Bergmann
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-04 10:14 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


poll/select() notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..a697930 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..fb74e0f
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct 	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	u32 revents;
+
+	revents = file->f_op->poll(file, NULL);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *w, *n;
+	unsigned long flags;
+
+	kevent_storage_dequeue(k->st, k);
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+		list_del(&w->container_entry);
+		remove_wait_queue(w->whead, &w->wait);
+		kmem_cache_free(kevent_poll_container_cache, w);
+	}
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+	k->priv = NULL;
+
+	fput(file);
+
+	return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	unsigned int revents = file->f_op->poll(file, NULL);
+
+	k->event.ret_data[0] = revents & k->event.event;
+
+	return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+	struct kevent_callbacks pc = {
+		.callback = &kevent_poll_callback,
+		.enqueue = &kevent_poll_enqueue,
+		.dequeue = &kevent_poll_dequeue};
+
+	kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+			sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+	if (!kevent_poll_container_cache) {
+		printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+		return -ENOMEM;
+	}
+
+	kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+			sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+	if (!kevent_poll_priv_cache) {
+		printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+		kmem_cache_destroy(kevent_poll_container_cache);
+		kevent_poll_container_cache = NULL;
+		return -ENOMEM;
+	}
+
+	kevent_add_callbacks(&pc, KEVENT_POLL);
+
+	printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+	return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+	lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+	kmem_cache_destroy(kevent_poll_priv_cache);
+	kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);



^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [take15 0/4] kevent: Generic event handling mechanism.
  2006-09-04 10:14 ` [take15 0/4] " Evgeniy Polyakov
  2006-09-04  9:58   ` Evgeniy Polyakov
  2006-09-04 10:14   ` [take15 1/4] kevent: Core files Evgeniy Polyakov
@ 2006-09-04 10:24   ` Evgeniy Polyakov
  2 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-04 10:24 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters

On Mon, Sep 04, 2006 at 02:14:20PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> 
> Generic event handling mechanism.

I've also updated documentation at
http://linux-net.osdl.org/index.php/Kevent

-- 
	Evgeniy Polyakov


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take15 1/4] kevent: Core files.
  2006-09-04 10:14   ` [take15 1/4] kevent: Core files Evgeniy Polyakov
  2006-09-04 10:14     ` [take15 2/4] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-09-05 13:28     ` Arnd Bergmann
  2006-09-06  6:51       ` Evgeniy Polyakov
  1 sibling, 1 reply; 143+ messages in thread
From: Arnd Bergmann @ 2006-09-05 13:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck

On Monday 04 September 2006 12:14, Evgeniy Polyakov wrote:

> +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr,
> 		unsigned int max_nr, __u64 timeout, void __user *buf,
> 		unsigned flags) 
> +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num,
> 		void __user *arg) 

'void __user *arg' in both of these always points to a struct ukevent,
according to your documentation. Shouldn't it be a 
'struct ukevent __user *arg' then?

	Arnd <><

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take15 4/4] kevent: Timer notifications.
  2006-09-04 10:14         ` [take15 4/4] kevent: Timer notifications Evgeniy Polyakov
@ 2006-09-05 13:39           ` Arnd Bergmann
  2006-09-06  6:42             ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Arnd Bergmann @ 2006-09-05 13:39 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters

On Monday 04 September 2006 12:14, Evgeniy Polyakov wrote:
> Timer notifications can be used for fine grained per-process time 
> management, since interval timers are very inconvenient to use, 
> and they are limited.

I guess this must have been discussed before, but why is this
not using high-resolution timers?

Are you planning to change this?

Maybe at least mention it in the description.

	Arnd <><

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take15 4/4] kevent: Timer notifications.
  2006-09-05 13:39           ` Arnd Bergmann
@ 2006-09-06  6:42             ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06  6:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters

On Tue, Sep 05, 2006 at 03:39:57PM +0200, Arnd Bergmann (arnd.bergmann@de.ibm.com) wrote:
> On Monday 04 September 2006 12:14, Evgeniy Polyakov wrote:
> > Timer notifications can be used for fine grained per-process time 
> > management, since interval timers are very inconvenient to use, 
> > and they are limited.
> 
> I guess this must have been discussed before, but why is this
> not using high-resolution timers?
> 
> Are you planning to change this?
> 
> Maybe at least mention it in the description.

I can use them, but right now there is no strong requirement for that.
It is in TODO, but not at the top.
The most important thing right now is kevent core.

> 	Arnd <><

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take15 1/4] kevent: Core files.
  2006-09-05 13:28     ` [take15 1/4] kevent: Core files Arnd Bergmann
@ 2006-09-06  6:51       ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06  6:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck

On Tue, Sep 05, 2006 at 03:28:17PM +0200, Arnd Bergmann (arnd.bergmann@de.ibm.com) wrote:
> On Monday 04 September 2006 12:14, Evgeniy Polyakov wrote:
> 
> > +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr,
> > 		unsigned int max_nr, __u64 timeout, void __user *buf,
> > 		unsigned flags) 
> > +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num,
> > 		void __user *arg) 
> 
> 'void __user *arg' in both of these always points to a struct ukevent,
> according to your documentation. Shouldn't it be a 
> 'struct ukevent __user *arg' then?

Yep. I will update it in the next patchset.
Thank you.

> 	Arnd <><

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 143+ messages in thread

* [take16 0/4] kevent: Generic event handling mechanism.
       [not found] <12345678912345.GA1898@2ka.mipt.ru>
                   ` (4 preceding siblings ...)
  2006-09-04 10:14 ` [take15 0/4] " Evgeniy Polyakov
@ 2006-09-06 11:55 ` Evgeniy Polyakov
  2006-09-06 11:55   ` [take16 1/4] kevent: Core files Evgeniy Polyakov
  5 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06 11:55 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


Generic event handling mechanism.

Since the number of comments has dropped almost to zero, I am freezing kevent
development for some time (resending practically the same patches is not that
interesting a task) and switching to the implementation of a special tree, which
will probably be used with kevents instead of a hash table.

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer API update at
	http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has been updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
    This syscall waits until either the timeout expires or at least one event
    becomes ready. It also commits that @num events starting from @start have been
    processed by userspace and thus can be removed or rearmed (depending on their flags).
    It can be used to commit events read by userspace through the mmap interface.
    Example userspace code (evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not get lock aroung user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for the ready_callback() callback before the main loop, which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added a header shared between userspace and kernelspace instead of embedding it in one
 * core restructuring to remove forward declarations
 * some whitespace coding style cleanup
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use the nopage() method to dynamically substitute pages
	- allocate a new page for events only when a newly added kevent requires it
	- do not use ugly index dereferencing, use structure instead
	- reduced amount of data in the ring (id and flags), 
		maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning for detection of the fact, that entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught
	to differentiate between the various cases
 * remove kevent from storage if is marked as broken after callback
 * fixed a typo in the mmapped buffer implementation which would result in a wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added an optimisation aimed at helping when a lot of kevents are being copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
			unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent 
	initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match
	kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>



^ permalink raw reply	[flat|nested] 143+ messages in thread

* [take16 4/4] kevent: Timer notifications.
  2006-09-06 11:55       ` [take16 3/4] kevent: Socket notifications Evgeniy Polyakov
@ 2006-09-06 11:55         ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06 11:55 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


Timer notifications.

Timer notifications can be used for fine grained per-process time 
management, since interval timers are very inconvenient to use, 
and they are limited.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..04acc46
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,113 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct hrtimer		ktimer;
+	struct kevent_storage	ktimer_storage;
+	struct kevent		*ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+	struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+	struct kevent *k = t->ktimer_event;
+
+	kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+	hrtimer_forward(timer, timer->base->softirq_time,
+			ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+	return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+	t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+	t->ktimer.function = kevent_timer_func;
+	t->ktimer_event = k;
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+
+	printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer);
+	hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+	return 0;
+
+err_out_st_fini:
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	hrtimer_cancel(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback,
+		.enqueue = &kevent_timer_enqueue,
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take16 2/4] kevent: poll/select() notifications.
  2006-09-06 11:55   ` [take16 1/4] kevent: Core files Evgeniy Polyakov
@ 2006-09-06 11:55     ` Evgeniy Polyakov
  2006-09-06 11:55       ` [take16 3/4] kevent: Socket notifications Evgeniy Polyakov
  2006-09-06 13:40     ` [take16 1/4] kevent: Core files Chase Venters
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06 11:55 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters


poll/select() notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..a697930 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..fb74e0f
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct 	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	u32 revents;
+
+	revents = file->f_op->poll(file, NULL);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *w, *n;
+	unsigned long flags;
+
+	kevent_storage_dequeue(k->st, k);
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+		list_del(&w->container_entry);
+		remove_wait_queue(w->whead, &w->wait);
+		kmem_cache_free(kevent_poll_container_cache, w);
+	}
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+	k->priv = NULL;
+
+	fput(file);
+
+	return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	unsigned int revents = file->f_op->poll(file, NULL);
+
+	k->event.ret_data[0] = revents & k->event.event;
+
+	return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+	struct kevent_callbacks pc = {
+		.callback = &kevent_poll_callback,
+		.enqueue = &kevent_poll_enqueue,
+		.dequeue = &kevent_poll_dequeue};
+
+	kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+			sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+	if (!kevent_poll_container_cache) {
+		printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+		return -ENOMEM;
+	}
+
+	kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+			sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+	if (!kevent_poll_priv_cache) {
+		printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+		kmem_cache_destroy(kevent_poll_container_cache);
+		kevent_poll_container_cache = NULL;
+		return -ENOMEM;
+	}
+
+	kevent_add_callbacks(&pc, KEVENT_POLL);
+
+	printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+	return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+	lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+	kmem_cache_destroy(kevent_poll_priv_cache);
+	kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);



* [take16 1/4] kevent: Core files.
  2006-09-06 11:55 ` [take16 " Evgeniy Polyakov
@ 2006-09-06 11:55   ` Evgeniy Polyakov
  2006-09-06 11:55     ` [take16 2/4] kevent: poll/select() notifications Evgeniy Polyakov
  2006-09-06 13:40     ` [take16 1/4] kevent: Core files Chase Venters
  0 siblings, 2 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06 11:55 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck


Core files.

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..c10698e 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,6 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl
+	.long sys_kevent_wait		/* 320 */
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..a06b76f 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -710,7 +710,10 @@ #endif
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl
+	.quad sys_kevent_wait		/* 320 */
 ia32_syscall_end:		
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..68072b5 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,13 @@ #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_kevent_get_events	318
+#define __NR_kevent_ctl		319
+#define __NR_kevent_wait	320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 321
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..ee907ad 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..67007f2
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,196 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is invoked each time a new event is caught. */
+/* @enqueue is invoked each time a new event is queued. */
+/* @dequeue is invoked each time an event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's queue. */
+	struct list_head	kevent_entry;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * The poll()/select() storage keeps a list of wait_queue_t
+	 * containers here, one per poll_wait() call from ->poll().
+	 */
+	void			*priv;
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_user
+{
+	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t 		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+
+	unsigned int		pages_in_use;
+	/* Array of pages forming mapped ring buffer */
+	struct kevent_mring	**pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+			__func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)		({ (void) u;})
+#define kevent_stat_init(u)		({ (void) u;})
+#define kevent_stat_im(u)		({ (void) u;})
+#define kevent_stat_wait(u)		({ (void) u;})
+#define kevent_stat_total(u)		({ (void) u;})
+#endif
+
+#ifdef CONFIG_KEVENT_SOCKET
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..9d4690f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,8 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..f8ff3a2
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,155 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT	0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * The number of event sets must exactly match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. KEVENT_SOCKET_ACCEPT, KEVENT_INODE_CREATE, KEVENT_TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_EVENTS	4096
+
+/*
+ * Note that the mukevents do not exactly fill the page (each mukevent is 12 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		index;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+#define	KEVENT_CTL_ADD 		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..85ad472
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,40 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the kernel event queue mechanism.
+	  It can be used as a replacement for poll()/select(), for AIO
+	  callback invocation, advanced timer notifications and other
+	  kernel object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	default n
+	help
+	  This option turns kevent_user statistics collection on.
+	  The collected data includes the total number of kevents, the number
+	  of kevents which are ready immediately at insertion time and the
+	  number of kevents which were removed through readiness completion.
+	  It is printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows timers to be used through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows the kevent subsystem to be used for
+	  poll()/select() notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications of socket operations through
+	  the KEVENT subsystem, e.g. new packet arrival and
+	  ready-to-accept conditions.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..422f585
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into the appropriate origin's queue.
+ * Returns a positive value if this event is ready immediately,
+ * a negative value in case of error and zero if the event has been queued.
+ * The ->enqueue() callback must increase the origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove the event from the appropriate origin's queue.
+ * The ->dequeue() callback must decrease the origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+
+	p = &kevent_registered_callbacks[pos];
+
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before the event is added into any origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * On failure the kevent must not be used, since kevent_enqueue() would
+ * then fail to add it to the origin's queue, setting the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX ||
+			!kevent_registered_callbacks[k->event.type].callback))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from the origin's queue.
+ * It does not decrease the origin's reference counter in any way;
+ * it must be called before the counter is dropped, so that the storage
+ * itself is still valid. It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * it if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken, which removes them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries may be added to the storage at this point
+ * (a socket, for example, has already been removed from the file table).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..ff1a770
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,968 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+	u->pring[0]->index = num;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+	unsigned int idx;
+
+	idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
+	if (idx >= u->pages_in_use) {
+		u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[idx])
+			return -ENOMEM;
+		u->pages_in_use++;
+	}
+	return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	pidx = ring->index/KEVENTS_ON_PAGE;
+	off = ring->index%KEVENTS_ON_PAGE;
+
+	if (unlikely(pidx >= k->user->pages_in_use)) {
+		printk(KERN_ERR "%s: ring->index: %u, pidx: %u, pages_in_use: %u.\n",
+				__func__, ring->index, pidx, k->user->pages_in_use);
+		WARN_ON(1);
+		return;
+	}
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->index >= KEVENT_MAX_EVENTS)
+		ring->index = 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It stores ready kevents, so userspace can fetch them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int pnum;
+
+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+	u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+	if (!u->pring[0])
+		goto err_out_free;
+
+	u->pages_in_use = 1;
+	kevent_user_ring_set(u, 0);
+
+	return 0;
+
+err_out_free:
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+
+	for (i = 0; i < u->pages_in_use; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+		INIT_LIST_HEAD(&u->kevent_list[i]);
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * The counter is set to 1 at creation time; when the corresponding
+ * kevent file descriptor is closed, the counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+	struct kevent_user *u = vma->vm_file->private_data;
+	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
+	if (off >= u->pages_in_use)
+		goto err_out_sigbus;
+
+	return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+	return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+	.nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start;
+	struct kevent_user *u = file->private_data;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_ops = &kevent_user_vm_ops;
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline unsigned int __kevent_user_hash(struct kevent_id *id)
+{
+	return jhash_1word(id->raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return __kevent_user_hash(&uk->id);
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * The entry is freed in the RCU callback; it has been dequeued from
+ * all lists at that point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removal - dequeue the kevent from the storage list
+ * if requested, remove it from the ready list, drop the userspace
+ * control block reference counter and schedule kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from the user's list of all events,
+ * dequeue it from storage and drop the user's reference counter,
+ * since this kevent no longer exists. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_del(&k->kevent_entry);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search a kevent inside hash bucket for given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
+		struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+
+	list_for_each_entry(k, head, kevent_entry) {
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] &&
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			ret = k;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	int err = -ENODEV;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&u->kevent_list[hash], uk, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detach the userspace control block from the file descriptor
+ * and decrease its reference counter.
+ * No new kevents can be added to or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+		list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+			kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read all ukevents from userspace and modify the appropriate kevents.
+ * If the provided number of ukevents is larger than the threshold, it is
+ * faster to allocate room for them and copy them in one shot than to
+ * copy and process them one by one.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read all ukevents from userspace and remove the appropriate kevents.
+ * If the provided number of ukevents is larger than the threshold, it is
+ * faster to allocate room for them and copy them in one shot than to
+ * copy and process them one by one.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+	unsigned long flags;
+	unsigned int hash = kevent_user_hash(&k->event);
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+	k->flags |= KEVENT_USER;
+	u->kevent_num++;
+	kevent_user_get(u);
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent for both kernel and userspace users.
+ * This function allocates and queues a kevent; it returns a negative
+ * value on error, a positive value if the kevent is ready immediately
+ * and zero if the kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	if (kevent_user_ring_grow(u)) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	kevent_user_enqueue(u, k);
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them into the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the
+ * number of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether the event fired or failed.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+		goto out_remove;
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= min_nr,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+					&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u || num > KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall waits until there is free space in the kevent queue
+ * and removes the ready kevents that userspace has processed.
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first kevent processed by userspace.
+ * @num - number of processed kevents.
+ * @timeout - number of nanoseconds to wait until there is
+ * 	free space in the kevent queue.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, found;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k, *n;
+	struct mukevent *muk;
+	unsigned int idx, off, hash;
+	unsigned long flags;
+
+	if (start + num >= KEVENT_MAX_EVENTS || 
+			start >= KEVENT_MAX_EVENTS || 
+			num >= KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	if (((start + num) / KEVENTS_ON_PAGE) >= u->pages_in_use || 
+			(start / KEVENTS_ON_PAGE) >= u->pages_in_use)
+		goto out_fput;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	while (num > 0) {
+		idx = start / KEVENTS_ON_PAGE;
+		off = start % KEVENTS_ON_PAGE;
+
+		muk = &u->pring[idx]->event[off];
+		hash = __kevent_user_hash(&muk->id);
+		found = 0;
+		list_for_each_entry_safe(k, n, &u->kevent_list[hash], kevent_entry) {
+			if ((k->event.id.raw[0] == muk->id.raw[0]) && (k->event.id.raw[1] == muk->id.raw[1])) {
+				/*
+				 * Optimization for the case when there is only one rearming
+				 * kevent and buggy userspace sets the start index to zero.
+				 */
+				if (k->flags & KEVENT_READY) {
+					spin_lock(&u->ready_lock);
+					if (k->flags & KEVENT_READY) {
+						list_del(&k->ready_entry);
+						k->flags &= ~KEVENT_READY;
+						u->ready_num--;
+					}
+					spin_unlock(&u->ready_lock);
+				}
+
+				if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+					__kevent_finish_user(k, 1);
+				found = 1;
+
+				break;
+			}
+		}
+
+		if (!found) {
+			spin_unlock_irqrestore(&u->kevent_lock, flags);
+			goto out_fput;
+		}
+
+		if (++start >= KEVENT_MAX_EVENTS)
+			start = 0;
+		num--;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= 1,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	fput(file);
+
+	return (u->ready_num >= 1)?0:-EAGAIN;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall performs various control operations on the kevent
+ * queue obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the misc device that control file descriptors are obtained from.
+ */
+static int __devinit kevent_user_init(void)
+{
+	int err = 0;
+
+	kevent_cache = kmem_cache_create("kevent_cache",
+			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+	err = misc_register(&kevent_miscdev);
+	if (err) {
+		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+		goto err_out_exit;
+	}
+
+	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+	misc_deregister(&kevent_miscdev);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..564e618 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,10 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [take16 3/4] kevent: Socket notifications.
  2006-09-06 11:55     ` [take16 2/4] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-09-06 11:55       ` Evgeniy Polyakov
  2006-09-06 11:55         ` [take16 4/4] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06 11:55 UTC (permalink / raw)
  To: lkml
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck


Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features
instead of epoll, its performance increased more than noticeably.
More details about the benchmark and the server itself (evserver_kevent.c)
can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/inode.c b/fs/inode.c
index 0bf9f04..181521d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -165,12 +166,18 @@ #endif
 		}
 		memset(&inode->u, 0, sizeof(inode->u));
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..a697930 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/include/net/sock.h b/include/net/sock.h
index 324b3ea..5d71ed7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 			wake_up_interruptible(sk->sk_sleep);
+			kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 			if (!inet_csk_ack_scheduled(sk))
 				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 						          (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..5b15f22
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,142 @@
+/*
+ * 	kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+	struct sock *sk = SOCKET_I(inode)->sk;
+	int rmem;
+	
+	if (k->event.event & KEVENT_SOCKET_RECV) {
+		int ret = 0;
+		
+		if ((rmem = atomic_read(&sk->sk_rmem_alloc)) > 0 || 
+				!skb_queue_empty(&sk->sk_receive_queue))
+			ret = 1;
+		if (sk->sk_shutdown & RCV_SHUTDOWN)
+			ret = 1;
+		if (ret)
+			return ret;
+	}
+	if ((k->event.event & KEVENT_SOCKET_ACCEPT) && 
+		(!reqsk_queue_empty(&inet_csk(sk)->icsk_accept_queue) || 
+		 	reqsk_queue_len_young(&inet_csk(sk)->icsk_accept_queue))) {
+		k->event.ret_data[1] = reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
+		return 1;
+	}
+
+	return 0;
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+	struct inode *inode;
+	struct socket *sock;
+	int err = -ENODEV;
+
+	sock = sockfd_lookup(k->event.id.raw[0], &err);
+	if (!sock)
+		goto err_out_exit;
+
+	inode = igrab(SOCK_INODE(sock));
+	if (!inode)
+		goto err_out_fput;
+
+	err = kevent_storage_enqueue(&inode->st, k);
+	if (err)
+		goto err_out_iput;
+
+	err = k->callbacks.callback(k);
+	if (err)
+		goto err_out_dequeue;
+
+	sockfd_put(sock);
+	return err;
+
+err_out_dequeue:
+	kevent_storage_dequeue(k->st, k);
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	sockfd_put(sock);
+err_out_exit:
+	return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+
+	kevent_storage_dequeue(k->st, k);
+	iput(inode);
+
+	return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+	if (sk->sk_socket)
+		kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+	struct inode *inode = SOCK_INODE(sock);
+
+	lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+	if (sk->sk_socket) {
+		struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+		lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+	}
+}
+#endif
+
+static int __init kevent_init_socket(void)
+{
+	struct kevent_callbacks sc = {
+		.callback = &kevent_socket_callback,
+		.enqueue = &kevent_socket_enqueue,
+		.dequeue = &kevent_socket_dequeue};
+
+	return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index 51fcfbc..4f91615 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1406,6 +1406,7 @@ static void sock_def_wakeup(struct sock 
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible_all(sk->sk_sleep);
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_error_report(struct sock *sk)
@@ -1415,6 +1416,7 @@ static void sock_def_error_report(struct
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,0,POLL_ERR); 
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_readable(struct sock *sk, int len)
@@ -1424,6 +1426,7 @@ static void sock_def_readable(struct soc
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,1,POLL_IN);
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_write_space(struct sock *sk)
@@ -1443,6 +1446,7 @@ static void sock_def_write_space(struct 
 	}
 
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 }
 
 static void sock_def_destruct(struct sock *sk)
@@ -1493,6 +1497,8 @@ #endif
 	sk->sk_state		=	TCP_CLOSE;
 	sk->sk_socket		=	sock;
 
+	kevent_sk_reinit(sk);
+
 	sock_set_flag(sk, SOCK_ZAPPED);
 
 	if(sock)
@@ -1559,8 +1565,10 @@ void fastcall release_sock(struct sock *
 	if (sk->sk_backlog.tail)
 		__release_sock(sk);
 	sk->sk_lock.owner = NULL;
-	if (waitqueue_active(&sk->sk_lock.wq))
+	if (waitqueue_active(&sk->sk_lock.wq)) {
 		wake_up(&sk->sk_lock.wq);
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+	}
 	spin_unlock_bh(&sk->sk_lock.slock);
 }
 EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
 			wake_up_interruptible(sk->sk_sleep);
 		if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
 			sock_wake_async(sock, 2, POLL_OUT);
+		kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 	}
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 104af5d..14cee12 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3112,6 +3112,7 @@ static void tcp_ofo_queue(struct sock *s
 
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if(skb->h.th->fin)
 			tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4b04c3e..cda1500 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
 #include <linux/jhash.h>
 #include <linux/init.h>
 #include <linux/times.h>
+#include <linux/kevent.h>
 
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
@@ -867,6 +868,7 @@ #endif
 	   	reqsk_free(req);
 	} else {
 		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+		kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
 	}
 	return 0;
 
diff --git a/net/socket.c b/net/socket.c
index b4848ce..42e19e2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
 #include <linux/kmod.h>
 #include <linux/audit.h>
 #include <linux/wireless.h>
+#include <linux/kevent.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -526,6 +527,8 @@ static struct socket *sock_alloc(void)
 	inode->i_uid = current->fsuid;
 	inode->i_gid = current->fsgid;
 
+	kevent_socket_reinit(sock);
+
 	get_cpu_var(sockets_in_use)++;
 	put_cpu_var(sockets_in_use);
 	return sock;


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [take16 1/4] kevent: Core files.
  2006-09-06 11:55   ` [take16 1/4] kevent: Core files Evgeniy Polyakov
  2006-09-06 11:55     ` [take16 2/4] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-09-06 13:40     ` Chase Venters
  2006-09-06 13:54       ` Chase Venters
  2006-09-06 14:03       ` Evgeniy Polyakov
  1 sibling, 2 replies; 143+ messages in thread
From: Chase Venters @ 2006-09-06 13:40 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck

Evgeniy,
 	Sorry about the radio silence lately. Some reviewer commentary 
follows.

On Wed, 6 Sep 2006, Evgeniy Polyakov wrote:

>
> Core files.
>
> This patch includes core kevent files:
> - userspace controlling
> - kernelspace interfaces
> - initialization
> - notification state machines
>
> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
>
> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
> index dd63d47..c10698e 100644
> --- a/arch/i386/kernel/syscall_table.S
> +++ b/arch/i386/kernel/syscall_table.S
> @@ -317,3 +317,6 @@ ENTRY(sys_call_table)
> 	.long sys_tee			/* 315 */
> 	.long sys_vmsplice
> 	.long sys_move_pages
> +	.long sys_kevent_get_events
> +	.long sys_kevent_ctl
> +	.long sys_kevent_wait		/* 320 */
> diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
> index 5d4a7d1..a06b76f 100644
> --- a/arch/x86_64/ia32/ia32entry.S
> +++ b/arch/x86_64/ia32/ia32entry.S
> @@ -710,7 +710,10 @@ #endif
> 	.quad compat_sys_get_robust_list
> 	.quad sys_splice
> 	.quad sys_sync_file_range
> -	.quad sys_tee
> +	.quad sys_tee			/* 315 */
> 	.quad compat_sys_vmsplice
> 	.quad compat_sys_move_pages
> +	.quad sys_kevent_get_events
> +	.quad sys_kevent_ctl
> +	.quad sys_kevent_wait		/* 320 */
> ia32_syscall_end:
> diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
> index fc1c8dd..68072b5 100644
> --- a/include/asm-i386/unistd.h
> +++ b/include/asm-i386/unistd.h
> @@ -323,10 +323,13 @@ #define __NR_sync_file_range	314
> #define __NR_tee		315
> #define __NR_vmsplice		316
> #define __NR_move_pages		317
> +#define __NR_kevent_get_events	318
> +#define __NR_kevent_ctl		319
> +#define __NR_kevent_wait	320
>
> #ifdef __KERNEL__
>
> -#define NR_syscalls 318
> +#define NR_syscalls 321
>
> /*
>  * user-visible error numbers are in the range -1 - -128: see
> diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
> index 94387c9..ee907ad 100644
> --- a/include/asm-x86_64/unistd.h
> +++ b/include/asm-x86_64/unistd.h
> @@ -619,10 +619,16 @@ #define __NR_vmsplice		278
> __SYSCALL(__NR_vmsplice, sys_vmsplice)
> #define __NR_move_pages		279
> __SYSCALL(__NR_move_pages, sys_move_pages)
> +#define __NR_kevent_get_events	280
> +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
> +#define __NR_kevent_ctl		281
> +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
> +#define __NR_kevent_wait	282
> +__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
>
> #ifdef __KERNEL__
>
> -#define __NR_syscall_max __NR_move_pages
> +#define __NR_syscall_max __NR_kevent_wait
>
> #ifndef __NO_STUBS
>
> diff --git a/include/linux/kevent.h b/include/linux/kevent.h
> new file mode 100644
> index 0000000..67007f2
> --- /dev/null
> +++ b/include/linux/kevent.h
> @@ -0,0 +1,196 @@
> +/*
> + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> + */
> +
> +#ifndef __KEVENT_H
> +#define __KEVENT_H
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/wait.h>
> +#include <linux/net.h>
> +#include <linux/rcupdate.h>
> +#include <linux/kevent_storage.h>
> +#include <linux/ukevent.h>
> +
> +#define KEVENT_MIN_BUFFS_ALLOC	3
> +
> +struct kevent;
> +struct kevent_storage;
> +typedef int (* kevent_callback_t)(struct kevent *);
> +
> +/* @callback is called each time a new event has been caught. */
> +/* @enqueue is called each time a new event is queued. */
> +/* @dequeue is called each time an event is dequeued. */
> +
> +struct kevent_callbacks {
> +	kevent_callback_t	callback, enqueue, dequeue;
> +};
> +
> +#define KEVENT_READY		0x1
> +#define KEVENT_STORAGE		0x2
> +#define KEVENT_USER		0x4
> +
> +struct kevent
> +{
> +	/* Used for kevent freeing.*/
> +	struct rcu_head		rcu_head;
> +	struct ukevent		event;
> +	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
> +	spinlock_t		ulock;
> +
> +	/* Entry of user's queue. */
> +	struct list_head	kevent_entry;
> +	/* Entry of origin's queue. */
> +	struct list_head	storage_entry;
> +	/* Entry of user's ready list. */
> +	struct list_head	ready_entry;
> +
> +	u32			flags;
> +
> +	/* User who requested this kevent. */
> +	struct kevent_user	*user;
> +	/* Kevent container. */
> +	struct kevent_storage	*st;
> +
> +	struct kevent_callbacks	callbacks;
> +
> +	/* Private data for different storages.
> +	 * The poll()/select() storage keeps a list of wait_queue_t
> +	 * containers here, one for each poll_wait() call made from ->poll().
> +	 */
> +	void			*priv;
> +};
> +
> +#define KEVENT_HASH_MASK	0xff
> +
> +struct kevent_user
> +{

These structure names get a little dicey (kevent, kevent_user, ukevent, 
mukevent)... might there be slightly different names that could be 
selected to better distinguish the purpose of each?

> +	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
> +	spinlock_t		kevent_lock;
> +	/* Number of queued kevents. */
> +	unsigned int		kevent_num;
> +
> +	/* List of ready kevents. */
> +	struct list_head	ready_list;
> +	/* Number of ready kevents. */
> +	unsigned int		ready_num;
> +	/* Protects all manipulations with ready queue. */
> +	spinlock_t 		ready_lock;
> +
> +	/* Protects against simultaneous kevent_user control manipulations. */
> +	struct mutex		ctl_mutex;
> +	/* Wait until some events are ready. */
> +	wait_queue_head_t	wait;
> +
> +	/* Reference counter, increased for each new kevent. */
> +	atomic_t		refcnt;
> +
> +	unsigned int		pages_in_use;
> +	/* Array of pages forming mapped ring buffer */
> +	struct kevent_mring	**pring;
> +
> +#ifdef CONFIG_KEVENT_USER_STAT
> +	unsigned long		im_num;
> +	unsigned long		wait_num;
> +	unsigned long		total;
> +#endif
> +};
> +
> +int kevent_enqueue(struct kevent *k);
> +int kevent_dequeue(struct kevent *k);
> +int kevent_init(struct kevent *k);
> +void kevent_requeue(struct kevent *k);
> +int kevent_break(struct kevent *k);
> +
> +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
> +
> +void kevent_user_ring_add_event(struct kevent *k);
> +
> +void kevent_storage_ready(struct kevent_storage *st,
> +		kevent_callback_t ready_callback, u32 event);
> +int kevent_storage_init(void *origin, struct kevent_storage *st);
> +void kevent_storage_fini(struct kevent_storage *st);
> +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
> +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
> +
> +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
> +
> +#ifdef CONFIG_KEVENT_POLL
> +void kevent_poll_reinit(struct file *file);
> +#else
> +static inline void kevent_poll_reinit(struct file *file)
> +{
> +}
> +#endif
> +
> +#ifdef CONFIG_KEVENT_USER_STAT
> +static inline void kevent_stat_init(struct kevent_user *u)
> +{
> +	u->wait_num = u->im_num = u->total = 0;
> +}
> +static inline void kevent_stat_print(struct kevent_user *u)
> +{
> +	pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
> +			__func__, u, u->wait_num, u->im_num, u->total);
> +}
> +static inline void kevent_stat_im(struct kevent_user *u)
> +{
> +	u->im_num++;
> +}
> +static inline void kevent_stat_wait(struct kevent_user *u)
> +{
> +	u->wait_num++;
> +}
> +static inline void kevent_stat_total(struct kevent_user *u)
> +{
> +	u->total++;
> +}
> +#else
> +#define kevent_stat_print(u)		({ (void) u;})
> +#define kevent_stat_init(u)		({ (void) u;})
> +#define kevent_stat_im(u)		({ (void) u;})
> +#define kevent_stat_wait(u)		({ (void) u;})
> +#define kevent_stat_total(u)		({ (void) u;})
> +#endif
> +
> +#ifdef CONFIG_KEVENT_SOCKET
> +#ifdef CONFIG_LOCKDEP
> +void kevent_socket_reinit(struct socket *sock);
> +void kevent_sk_reinit(struct sock *sk);
> +#else
> +static inline void kevent_socket_reinit(struct socket *sock)
> +{
> +}
> +static inline void kevent_sk_reinit(struct sock *sk)
> +{
> +}
> +#endif
> +void kevent_socket_notify(struct sock *sock, u32 event);
> +int kevent_socket_dequeue(struct kevent *k);
> +int kevent_socket_enqueue(struct kevent *k);
> +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
> +#else
> +static inline void kevent_socket_notify(struct sock *sock, u32 event)
> +{
> +}
> +#define sock_async(__sk)	({ (void)__sk; 0; })
> +#endif
> +
> +#endif /* __KEVENT_H */
> diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
> new file mode 100644
> index 0000000..a38575d
> --- /dev/null
> +++ b/include/linux/kevent_storage.h
> @@ -0,0 +1,11 @@
> +#ifndef __KEVENT_STORAGE_H
> +#define __KEVENT_STORAGE_H
> +
> +struct kevent_storage
> +{
> +	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
> +	struct list_head	list;			/* List of queued kevents. */
> +	spinlock_t		lock;			/* Protects users queue. */
> +};
> +
> +#endif /* __KEVENT_STORAGE_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 008f04c..9d4690f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -597,4 +597,8 @@ asmlinkage long sys_get_robust_list(int
> asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
> 				    size_t len);
>
> +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
> +		__u64 timeout, struct ukevent __user *buf, unsigned flags);
> +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
> +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
> #endif
> diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
> new file mode 100644
> index 0000000..f8ff3a2
> --- /dev/null
> +++ b/include/linux/ukevent.h
> @@ -0,0 +1,155 @@
> +/*
> + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> + */
> +
> +#ifndef __UKEVENT_H
> +#define __UKEVENT_H
> +
> +/*
> + * Kevent request flags.
> + */
> +
> +/* Process this event only once and then dequeue. */
> +#define KEVENT_REQ_ONESHOT	0x1
> +
> +/*
> + * Kevent return flags.
> + */
> +/* Kevent is broken. */
> +#define KEVENT_RET_BROKEN	0x1
> +/* Kevent processing was finished successfully. */
> +#define KEVENT_RET_DONE		0x2
> +
> +/*
> + * Kevent type set.
> + */
> +#define KEVENT_SOCKET 		0
> +#define KEVENT_INODE		1
> +#define KEVENT_TIMER		2
> +#define KEVENT_POLL		3
> +#define KEVENT_NAIO		4
> +#define KEVENT_AIO		5
> +#define	KEVENT_MAX		6
> +
> +/*
> + * Per-type event sets.
> + * The number of per-type event sets must match the number of kevent types.
> + */
> +
> +/*
> + * Timer events.
> + */
> +#define	KEVENT_TIMER_FIRED	0x1
> +
> +/*
> + * Socket/network asynchronous IO events.
> + */
> +#define	KEVENT_SOCKET_RECV	0x1
> +#define	KEVENT_SOCKET_ACCEPT	0x2
> +#define	KEVENT_SOCKET_SEND	0x4
> +
> +/*
> + * Inode events.
> + */
> +#define	KEVENT_INODE_CREATE	0x1
> +#define	KEVENT_INODE_REMOVE	0x2
> +
> +/*
> + * Poll events.
> + */
> +#define	KEVENT_POLL_POLLIN	0x0001
> +#define	KEVENT_POLL_POLLPRI	0x0002
> +#define	KEVENT_POLL_POLLOUT	0x0004
> +#define	KEVENT_POLL_POLLERR	0x0008
> +#define	KEVENT_POLL_POLLHUP	0x0010
> +#define	KEVENT_POLL_POLLNVAL	0x0020
> +
> +#define	KEVENT_POLL_POLLRDNORM	0x0040
> +#define	KEVENT_POLL_POLLRDBAND	0x0080
> +#define	KEVENT_POLL_POLLWRNORM	0x0100
> +#define	KEVENT_POLL_POLLWRBAND	0x0200
> +#define	KEVENT_POLL_POLLMSG	0x0400
> +#define	KEVENT_POLL_POLLREMOVE	0x1000
> +
> +/*
> + * Asynchronous IO events.
> + */
> +#define	KEVENT_AIO_BIO		0x1
> +
> +#define KEVENT_MASK_ALL		0xffffffff
> +/* Mask of all possible event values. */
> +#define KEVENT_MASK_EMPTY	0x0
> +/* Empty mask of ready events. */
> +
> +struct kevent_id
> +{
> +	union {
> +		__u32		raw[2];
> +		__u64		raw_u64 __attribute__((aligned(8)));
> +	};
> +};
> +
> +struct ukevent
> +{
> +	/* Id of this request, e.g. socket number, file descriptor and so on... */
> +	struct kevent_id	id;
> +	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
> +	__u32			type;
> +	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
> +	__u32			event;
> +	/* Per-event request flags */
> +	__u32			req_flags;
> +	/* Per-event return flags */
> +	__u32			ret_flags;
> +	/* Event return data. Event originator fills it with anything it likes. */
> +	__u32			ret_data[2];
> +	/* User's data. It is not used, just copied to/from user.
> +	 * The whole structure is aligned to 8 bytes already, so the last union
> +	 * is aligned properly.
> +	 */
> +	union {
> +		__u32		user[2];
> +		void		*ptr;
> +	};
> +};
> +
> +struct mukevent
> +{
> +	struct kevent_id	id;
> +	__u32			ret_flags;
> +};
> +
> +#define KEVENT_MAX_EVENTS	4096
> +

This limit governs how many simultaneous kevents you can be waiting on 
at once, correct? Would it be possible to drop the hard limit and instead 
limit, say, the maximum number of kevents you can have pending in the 
mmap ring buffer? Once that number is exceeded, additional events could 
be dropped, or some magic number could be put in the kevent_mring->index 
field to let the process know that it must make another syscall to drain 
the rest of the events.
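The overflow scheme suggested above can be sketched in plain C. This is a hypothetical userspace model, not code from the patch: RING_CAPACITY and the RING_OVERFLOW sentinel are illustrative stand-ins for KEVENT_MAX_EVENTS and a magic kevent_mring->index value.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define RING_CAPACITY 8            /* hypothetical, stands in for KEVENT_MAX_EVENTS */
#define RING_OVERFLOW 0xffffffffu  /* magic index value meaning "ring overflowed" */

struct ring {
	uint32_t index;                /* next free slot, or RING_OVERFLOW */
	uint32_t event[RING_CAPACITY]; /* stand-in for struct mukevent entries */
};

/* Producer side: queue an event, or mark the ring as overflowed so the
 * consumer knows it must fall back to a syscall to drain the rest. */
static int ring_add(struct ring *r, uint32_t ev)
{
	if (r->index == RING_OVERFLOW)
		return -1;                 /* already overflowed, event not ringed */
	if (r->index >= RING_CAPACITY) {
		r->index = RING_OVERFLOW;  /* signal overflow instead of wrapping */
		return -1;
	}
	r->event[r->index++] = ev;
	return 0;
}

/* Consumer side: returns 1 if the mmap ring alone is sufficient, 0 if the
 * process must hit a syscall (e.g. sys_kevent_get_events) for the rest. */
static int ring_usable(const struct ring *r)
{
	return r->index != RING_OVERFLOW;
}
```

The point of the sentinel is that userspace never has to guess whether events were silently dropped: a single load of the index tells it whether to take the slow path.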

> +/*
> + * Note that kevents do not exactly fill the page (each mukevent is 12 bytes),
> + * so we reuse 4 bytes at the beginning of the first page to store the index.
> + * Take that into account if you want to change size of struct mukevent.
> + */
> +#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
> +struct kevent_mring
> +{
> +	unsigned int		index;
> +	struct mukevent		event[KEVENTS_ON_PAGE];
> +};
> +
> +#define	KEVENT_CTL_ADD 		0
> +#define	KEVENT_CTL_REMOVE	1
> +#define	KEVENT_CTL_MODIFY	2
> +
> +#endif /* __UKEVENT_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index a099fc6..c550fcc 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -218,6 +218,8 @@ config AUDITSYSCALL
> 	  such as SELinux.  To use audit's filesystem watch feature, please
> 	  ensure that INOTIFY is configured.
>
> +source "kernel/kevent/Kconfig"
> +
> config IKCONFIG
> 	bool "Kernel .config support"
> 	---help---
> diff --git a/kernel/Makefile b/kernel/Makefile
> index d62ec66..2d7a6dd 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
> obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
> obj-$(CONFIG_SECCOMP) += seccomp.o
> obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
> +obj-$(CONFIG_KEVENT) += kevent/
> obj-$(CONFIG_RELAY) += relay.o
> obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> obj-$(CONFIG_TASKSTATS) += taskstats.o
> diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
> new file mode 100644
> index 0000000..85ad472
> --- /dev/null
> +++ b/kernel/kevent/Kconfig
> @@ -0,0 +1,40 @@
> +config KEVENT
> +	bool "Kernel event notification mechanism"
> +	help
> +	  This option enables the kernel event queue mechanism.
> +	  It can be used as replacement for poll()/select(), AIO callback
> +	  invocations, advanced timer notifications and other kernel
> +	  object status changes.
> +
> +config KEVENT_USER_STAT
> +	bool "Kevent user statistic"
> +	depends on KEVENT
> +	default n
> +	help
> +	  This option turns kevent_user statistics collection on.
> +	  The statistics include the total number of kevents, the number of
> +	  kevents which are ready immediately at insertion time, and the
> +	  number of kevents which were removed through readiness completion.
> +	  The statistics are printed each time a control kevent descriptor
> +	  is closed.
> +
> +config KEVENT_TIMER
> +	bool "Kernel event notifications for timers"
> +	depends on KEVENT
> +	help
> +	  This option allows using timers through the KEVENT subsystem.
> +
> +config KEVENT_POLL
> +	bool "Kernel event notifications for poll()/select()"
> +	depends on KEVENT
> +	help
> +	  This option allows using the kevent subsystem for poll()/select()
> +	  notifications.
> +
> +config KEVENT_SOCKET
> +	bool "Kernel event notifications for sockets"
> +	depends on NET && KEVENT
> +	help
> +	  This option enables notifications through the KEVENT subsystem of
> +	  socket operations, like new packet reception, ready-for-accept
> +	  conditions and so on.
> +
> diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
> new file mode 100644
> index 0000000..9130cad
> --- /dev/null
> +++ b/kernel/kevent/Makefile
> @@ -0,0 +1,4 @@
> +obj-y := kevent.o kevent_user.o
> +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
> +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
> +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
> diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
> new file mode 100644
> index 0000000..422f585
> --- /dev/null
> +++ b/kernel/kevent/kevent.c
> @@ -0,0 +1,227 @@
> +/*
> + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/mempool.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/kevent.h>
> +
> +/*
> + * Attempts to add an event into the appropriate origin's queue.
> + * Returns a positive value if the event is ready immediately,
> + * a negative value in case of error, and zero if the event has been queued.
> + * The ->enqueue() callback must increase the origin's reference counter.
> + */
> +int kevent_enqueue(struct kevent *k)
> +{
> +	return k->callbacks.enqueue(k);
> +}
> +
> +/*
> + * Remove event from the appropriate queue.
> + * ->dequeue() callback must decrease origin's reference counter.
> + */
> +int kevent_dequeue(struct kevent *k)
> +{
> +	return k->callbacks.dequeue(k);
> +}
> +
> +/*
> + * Mark kevent as broken.
> + */
> +int kevent_break(struct kevent *k)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&k->ulock, flags);
> +	k->event.ret_flags |= KEVENT_RET_BROKEN;
> +	spin_unlock_irqrestore(&k->ulock, flags);
> +	return -EINVAL;
> +}
> +
> +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];

__read_mostly?

> +
> +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
> +{
> +	struct kevent_callbacks *p;
> +
> +	if (pos >= KEVENT_MAX)
> +		return -EINVAL;
> +
> +	p = &kevent_registered_callbacks[pos];
> +
> +	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
> +	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
> +	p->callback = (cb->callback) ? cb->callback : kevent_break;

Curious... why are these callbacks copied, rather than just retaining a 
pointer to a const/static "ops" structure?
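A pointer-based registry along these lines might look like the following. This is a standalone userspace sketch, not the patch's actual code; the poll ops table and its callbacks are hypothetical, and -22 stands in for -EINVAL.

```c
#include <assert.h>
#include <stddef.h>

struct kevent;
typedef int (*kevent_callback_t)(struct kevent *);

struct kevent_callbacks {
	kevent_callback_t callback, enqueue, dequeue;
};

#define KEVENT_MAX 6

/* The registry holds pointers to static ops tables instead of copies. */
static const struct kevent_callbacks *kevent_registered_callbacks[KEVENT_MAX];

struct kevent {
	unsigned int type;
	/* One pointer instead of three copied function pointers. */
	const struct kevent_callbacks *callbacks;
};

static int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
{
	if (pos >= KEVENT_MAX)
		return -22;	/* -EINVAL */
	kevent_registered_callbacks[pos] = cb;
	return 0;
}

static int kevent_init(struct kevent *k)
{
	if (k->type >= KEVENT_MAX || !kevent_registered_callbacks[k->type])
		return -22;
	k->callbacks = kevent_registered_callbacks[k->type];
	return 0;
}

/* A hypothetical storage (e.g. the poll notifier) registering its ops. */
static int poll_enqueue(struct kevent *k)  { (void)k; return 0; }
static int poll_dequeue(struct kevent *k)  { (void)k; return 0; }
static int poll_callback(struct kevent *k) { (void)k; return 1; }

static const struct kevent_callbacks kevent_poll_ops = {
	.callback = poll_callback,
	.enqueue  = poll_enqueue,
	.dequeue  = poll_dequeue,
};
```

Calls then go through k->callbacks->enqueue(k); the trade-off is one extra pointer dereference per invocation in exchange for a smaller struct kevent and a single shared, possibly const, ops table per type.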

> +
> +	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);

Is this printk() chatter necessary?

> +	return 0;
> +}
> +
> +/*
> + * Must be called before the event is added into some origin's queue.
> + * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
> + * If it fails, the kevent must not be used; kevent_enqueue() would fail to
> + * add the kevent into the origin's queue, setting the KEVENT_RET_BROKEN
> + * flag in kevent->event.ret_flags.
> + */
> +int kevent_init(struct kevent *k)
> +{
> +	spin_lock_init(&k->ulock);
> +	k->flags = 0;
> +
> +	if (unlikely(k->event.type >= KEVENT_MAX ||
> +			!kevent_registered_callbacks[k->event.type].callback))
> +		return kevent_break(k);
> +
> +	k->callbacks = kevent_registered_callbacks[k->event.type];
> +	if (unlikely(k->callbacks.callback == kevent_break))
> +		return kevent_break(k);
> +
> +	return 0;
> +}
> +
> +/*
> + * Called from ->enqueue() callback when reference counter for given
> + * origin (socket, inode...) has been increased.
> + */
> +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
> +{
> +	unsigned long flags;
> +
> +	k->st = st;
> +	spin_lock_irqsave(&st->lock, flags);
> +	list_add_tail_rcu(&k->storage_entry, &st->list);
> +	k->flags |= KEVENT_STORAGE;
> +	spin_unlock_irqrestore(&st->lock, flags);
> +	return 0;
> +}
> +
> +/*
> + * Dequeue kevent from origin's queue.
> + * It does not decrease the origin's reference counter in any way
> + * and must be called before that happens, so the storage itself is
> + * still valid. It is called from the ->dequeue() callback.
> + */
> +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&st->lock, flags);
> +	if (k->flags & KEVENT_STORAGE) {
> +		list_del_rcu(&k->storage_entry);
> +		k->flags &= ~KEVENT_STORAGE;
> +	}
> +	spin_unlock_irqrestore(&st->lock, flags);
> +}
> +
> +/*
> + * Call kevent ready callback and queue it into ready queue if needed.
> + * If kevent is marked as one-shot, then remove it from storage queue.
> + */
> +static void __kevent_requeue(struct kevent *k, u32 event)
> +{
> +	int ret, rem;
> +	unsigned long flags;
> +
> +	ret = k->callbacks.callback(k);
> +
> +	spin_lock_irqsave(&k->ulock, flags);
> +	if (ret > 0)
> +		k->event.ret_flags |= KEVENT_RET_DONE;
> +	else if (ret < 0)
> +		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
> +	else
> +		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
> +	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
> +	spin_unlock_irqrestore(&k->ulock, flags);
> +
> +	if (ret) {
> +		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
> +			list_del_rcu(&k->storage_entry);
> +			k->flags &= ~KEVENT_STORAGE;
> +		}
> +
> +		spin_lock_irqsave(&k->user->ready_lock, flags);
> +		if (!(k->flags & KEVENT_READY)) {
> +			kevent_user_ring_add_event(k);
> +			list_add_tail(&k->ready_entry, &k->user->ready_list);
> +			k->flags |= KEVENT_READY;
> +			k->user->ready_num++;
> +		}
> +		spin_unlock_irqrestore(&k->user->ready_lock, flags);
> +		wake_up(&k->user->wait);
> +	}
> +}
> +
> +/*
> + * Check if the kevent is ready (by invoking its callback) and requeue/remove
> + * it if needed.
> + */
> +void kevent_requeue(struct kevent *k)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&k->st->lock, flags);
> +	__kevent_requeue(k, 0);
> +	spin_unlock_irqrestore(&k->st->lock, flags);
> +}
> +
> +/*
> + * Called each time some activity in origin (socket, inode...) is noticed.
> + */
> +void kevent_storage_ready(struct kevent_storage *st,
> +		kevent_callback_t ready_callback, u32 event)
> +{
> +	struct kevent *k;
> +
> +	rcu_read_lock();
> +	if (ready_callback)
> +		list_for_each_entry_rcu(k, &st->list, storage_entry)
> +			(*ready_callback)(k);
> +
> +	list_for_each_entry_rcu(k, &st->list, storage_entry)
> +		if (event & k->event.event)
> +			__kevent_requeue(k, event);
> +	rcu_read_unlock();
> +}
> +
> +int kevent_storage_init(void *origin, struct kevent_storage *st)
> +{
> +	spin_lock_init(&st->lock);
> +	st->origin = origin;
> +	INIT_LIST_HEAD(&st->list);
> +	return 0;
> +}
> +
> +/*
> + * Mark all events as broken; that removes them from the storage, so the
> + * storage origin (inode, socket and so on) can be safely removed.
> + * No new entries are allowed to be added into the storage at this point
> + * (the socket has been removed from the file table by now, for example).
> + */
> +void kevent_storage_fini(struct kevent_storage *st)
> +{
> +	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
> +}
> diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
> new file mode 100644
> index 0000000..ff1a770
> --- /dev/null
> +++ b/kernel/kevent/kevent_user.c
> @@ -0,0 +1,968 @@
> +/*
> + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/mount.h>
> +#include <linux/device.h>
> +#include <linux/poll.h>
> +#include <linux/kevent.h>
> +#include <linux/jhash.h>
> +#include <linux/miscdevice.h>
> +#include <asm/io.h>
> +
> +static char kevent_name[] = "kevent";

const?

> +static kmem_cache_t *kevent_cache;
> +
> +/*
> + * kevents are pollable, return POLLIN and POLLRDNORM
> + * when there is at least one ready kevent.
> + */
> +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
> +{
> +	struct kevent_user *u = file->private_data;
> +	unsigned int mask;
> +
> +	poll_wait(file, &u->wait, wait);
> +	mask = 0;
> +
> +	if (u->ready_num)
> +		mask |= POLLIN | POLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
> +{
> +	u->pring[0]->index = num;
> +}
> +
> +static int kevent_user_ring_grow(struct kevent_user *u)
> +{
> +	unsigned int idx;
> +
> +	idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
> +	if (idx >= u->pages_in_use) {
> +		u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
> +		if (!u->pring[idx])
> +			return -ENOMEM;
> +		u->pages_in_use++;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Called under kevent_user->ready_lock, so updates are always protected.
> + */
> +void kevent_user_ring_add_event(struct kevent *k)
> +{
> +	unsigned int pidx, off;
> +	struct kevent_mring *ring, *copy_ring;
> +
> +	ring = k->user->pring[0];
> +
> +	pidx = ring->index/KEVENTS_ON_PAGE;
> +	off = ring->index%KEVENTS_ON_PAGE;
> +
> +	if (unlikely(pidx >= k->user->pages_in_use)) {
> +		printk("%s: ring->index: %u, pidx: %u, pages_in_use: %u.\n",
> +				__func__, ring->index, pidx, k->user->pages_in_use);
> +		WARN_ON(1);
> +		return;
> +	}
> +
> +	copy_ring = k->user->pring[pidx];
> +
> +	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
> +	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
> +	copy_ring->event[off].ret_flags = k->event.ret_flags;
> +
> +	if (++ring->index >= KEVENT_MAX_EVENTS)
> +		ring->index = 0;
> +}
> +
> +/*
> + * Initialize the mmap ring buffer.
> + * It stores ready kevents, so userspace can fetch them directly instead
> + * of using a syscall. Essentially the syscall becomes just a waiting point.
> + */
> +static int kevent_user_ring_init(struct kevent_user *u)
> +{
> +	int pnum;
> +
> +	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;

This calculation works with the current constants, but it comes up a page 
short if, say, KEVENT_MAX_EVENTS were 4095. It also looks incorrect 
visually since the 'sizeof(unsigned int)' is only factored in once (rather 
than once per page). I suggest a static / inline __max_kevent_pages() 
function that either does:

return KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE + 1;

or

int pnum = KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE;
if (KEVENT_MAX_EVENTS % KEVENTS_ON_PAGE)
	pnum++;
return pnum;

Both should be optimized away by the compiler and will give correct 
answers regardless of the constant values.
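The off-by-one can be checked in userspace. The sketch below is standalone and not from the patch; PAGE_SIZE and the 12-byte mukevent size are assumed per the posted code, and ALIGN mirrors the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u
#define MUKEVENT_SIZE 12u	/* sizeof(struct mukevent) in the patch */
#define KEVENTS_ON_PAGE ((PAGE_SIZE - sizeof(unsigned int)) / MUKEVENT_SIZE)

/* Mirrors the kernel's ALIGN() for power-of-two alignments. */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* The patch's calculation: the 4-byte index is factored in only once,
 * not once per page. */
static unsigned int pages_patch(unsigned int max_events)
{
	return ALIGN(max_events * MUKEVENT_SIZE + (unsigned int)sizeof(unsigned int),
		     PAGE_SIZE) / PAGE_SIZE;
}

/* The suggested __max_kevent_pages(): plain ceiling division by
 * events-per-page, correct regardless of the constant values. */
static unsigned int pages_suggested(unsigned int max_events)
{
	unsigned int pnum = max_events / KEVENTS_ON_PAGE;

	if (max_events % KEVENTS_ON_PAGE)
		pnum++;
	return pnum;
}
```

With KEVENTS_ON_PAGE = 341, both give 13 pages for 4096 events, but at 4095 events the patch's formula yields 12 pages while 13 are actually needed (12 pages hold only 12 * 341 = 4092 events).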

> +
> +	u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
> +	if (!u->pring)
> +		return -ENOMEM;
> +
> +	u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
> +	if (!u->pring[0])
> +		goto err_out_free;
> +
> +	u->pages_in_use = 1;
> +	kevent_user_ring_set(u, 0);
> +
> +	return 0;
> +
> +err_out_free:
> +	kfree(u->pring);
> +
> +	return -ENOMEM;
> +}
> +
> +static void kevent_user_ring_fini(struct kevent_user *u)
> +{
> +	int i;
> +
> +	for (i = 0; i < u->pages_in_use; ++i)
> +		free_page((unsigned long)u->pring[i]);
> +
> +	kfree(u->pring);
> +}
> +
> +static int kevent_user_open(struct inode *inode, struct file *file)
> +{
> +	struct kevent_user *u;
> +	int i;
> +
> +	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
> +	if (!u)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&u->ready_list);
> +	spin_lock_init(&u->ready_lock);
> +	kevent_stat_init(u);
> +	spin_lock_init(&u->kevent_lock);
> +	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
> +		INIT_LIST_HEAD(&u->kevent_list[i]);
> +
> +	mutex_init(&u->ctl_mutex);
> +	init_waitqueue_head(&u->wait);
> +
> +	atomic_set(&u->refcnt, 1);
> +
> +	if (unlikely(kevent_user_ring_init(u))) {
> +		kfree(u);
> +		return -ENOMEM;
> +	}
> +
> +	file->private_data = u;
> +	return 0;
> +}
> +
> +/*
> + * Kevent userspace control block reference counting.
> + * Set to 1 at creation time; when the corresponding kevent file descriptor
> + * is closed, the reference counter is decreased.
> + * When the counter hits zero the block is freed.
> + */
> +static inline void kevent_user_get(struct kevent_user *u)
> +{
> +	atomic_inc(&u->refcnt);
> +}
> +
> +static inline void kevent_user_put(struct kevent_user *u)
> +{
> +	if (atomic_dec_and_test(&u->refcnt)) {
> +		kevent_stat_print(u);
> +		kevent_user_ring_fini(u);
> +		kfree(u);
> +	}
> +}
> +
> +static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
> +{
> +	struct kevent_user *u = vma->vm_file->private_data;
> +	unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
> +
> +	if (type)
> +		*type = VM_FAULT_MINOR;
> +
> +	if (off >= u->pages_in_use)
> +		goto err_out_sigbus;
> +
> +	return virt_to_page(u->pring[off]);
> +
> +err_out_sigbus:
> +	return NOPAGE_SIGBUS;
> +}
> +
> +static struct vm_operations_struct kevent_user_vm_ops = {
> +	.nopage = &kevent_user_nopage,
> +};
> +
> +/*
> + * Mmap implementation for ring buffer, which is created as array
> + * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
> + * the first page to be mapped.
> + */
> +static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	unsigned long start = vma->vm_start;
> +	struct kevent_user *u = file->private_data;
> +
> +	if (vma->vm_flags & VM_WRITE)
> +		return -EPERM;
> +
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	vma->vm_ops = &kevent_user_vm_ops;
> +	vma->vm_flags |= VM_RESERVED;
> +	vma->vm_file = file;
> +
> +	if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static inline unsigned int __kevent_user_hash(struct kevent_id *id)
> +{
> +	return jhash_1word(id->raw[0], 0) & KEVENT_HASH_MASK;
> +}
> +
> +static inline unsigned int kevent_user_hash(struct ukevent *uk)
> +{
> +	return __kevent_user_hash(&uk->id);
> +}
> +
> +/*
> + * RCU protects storage list (kevent->storage_entry).
> + * The entry is freed in an RCU callback; it has been dequeued from all
> + * lists at that point.
> + */
> +
> +static void kevent_free_rcu(struct rcu_head *rcu)
> +{
> +	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
> +	kmem_cache_free(kevent_cache, kevent);
> +}
> +
> +/*
> + * Complete kevent removal: dequeue the kevent from the storage list
> + * if requested, remove it from the ready list, drop the userspace
> + * control block reference counter and schedule kevent freeing through RCU.
> + */
> +static void kevent_finish_user_complete(struct kevent *k, int deq)
> +{
> +	struct kevent_user *u = k->user;
> +	unsigned long flags;
> +
> +	if (deq)
> +		kevent_dequeue(k);
> +
> +	spin_lock_irqsave(&u->ready_lock, flags);
> +	if (k->flags & KEVENT_READY) {
> +		list_del(&k->ready_entry);
> +		k->flags &= ~KEVENT_READY;
> +		u->ready_num--;
> +	}
> +	spin_unlock_irqrestore(&u->ready_lock, flags);
> +
> +	kevent_user_put(u);
> +	call_rcu(&k->rcu_head, kevent_free_rcu);
> +}
> +
> +/*
> + * Remove from all lists and free kevent.
> + * Must be called under kevent_user->kevent_lock to protect
> + * kevent->kevent_entry removal.
> + */
> +static void __kevent_finish_user(struct kevent *k, int deq)
> +{
> +	struct kevent_user *u = k->user;
> +
> +	list_del(&k->kevent_entry);
> +	k->flags &= ~KEVENT_USER;
> +	u->kevent_num--;
> +	kevent_finish_user_complete(k, deq);
> +}
> +
> +/*
> + * Remove kevent from user's list of all events,
> + * dequeue it from storage and decrease user's reference counter,
> + * since this kevent does not exist anymore. That is why it is freed here.
> + */
> +static void kevent_finish_user(struct kevent *k, int deq)
> +{
> +	struct kevent_user *u = k->user;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&u->kevent_lock, flags);
> +	list_del(&k->kevent_entry);
> +	k->flags &= ~KEVENT_USER;
> +	u->kevent_num--;
> +	spin_unlock_irqrestore(&u->kevent_lock, flags);
> +	kevent_finish_user_complete(k, deq);
> +}
> +
> +/*
> + * Dequeue one entry from user's ready queue.
> + */
> +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
> +{
> +	unsigned long flags;
> +	struct kevent *k = NULL;
> +
> +	spin_lock_irqsave(&u->ready_lock, flags);
> +	if (u->ready_num && !list_empty(&u->ready_list)) {
> +		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
> +		list_del(&k->ready_entry);
> +		k->flags &= ~KEVENT_READY;
> +		u->ready_num--;
> +	}
> +	spin_unlock_irqrestore(&u->ready_lock, flags);
> +
> +	return k;
> +}
> +
> +/*
> + * Search a kevent inside hash bucket for given ukevent.
> + */
> +static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
> +		struct kevent_user *u)
> +{
> +	struct kevent *k, *ret = NULL;
> +
> +	list_for_each_entry(k, head, kevent_entry) {
> +		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
> +				k->event.id.raw[0] == uk->id.raw[0] &&
> +				k->event.id.raw[1] == uk->id.raw[1]) {
> +			ret = k;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Search and modify kevent according to provided ukevent.
> + */
> +static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
> +{
> +	struct kevent *k;
> +	unsigned int hash = kevent_user_hash(uk);
> +	int err = -ENODEV;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&u->kevent_lock, flags);
> +	k = __kevent_search(&u->kevent_list[hash], uk, u);
> +	if (k) {
> +		spin_lock(&k->ulock);
> +		k->event.event = uk->event;
> +		k->event.req_flags = uk->req_flags;
> +		k->event.ret_flags = 0;
> +		spin_unlock(&k->ulock);
> +		kevent_requeue(k);
> +		err = 0;
> +	}
> +	spin_unlock_irqrestore(&u->kevent_lock, flags);
> +
> +	return err;
> +}
> +
> +/*
> + * Remove kevent which matches provided ukevent.
> + */
> +static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
> +{
> +	int err = -ENODEV;
> +	struct kevent *k;
> +	unsigned int hash = kevent_user_hash(uk);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&u->kevent_lock, flags);
> +	k = __kevent_search(&u->kevent_list[hash], uk, u);
> +	if (k) {
> +		__kevent_finish_user(k, 1);
> +		err = 0;
> +	}
> +	spin_unlock_irqrestore(&u->kevent_lock, flags);
> +
> +	return err;
> +}
> +
> +/*
> + * Detaches the userspace control block from the file descriptor
> + * and decreases its reference counter.
> + * No new kevents can be added or removed from any list at this point.
> + */
> +static int kevent_user_release(struct inode *inode, struct file *file)
> +{
> +	struct kevent_user *u = file->private_data;
> +	struct kevent *k, *n;
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
> +		list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
> +			kevent_finish_user(k, 1);
> +	}
> +
> +	kevent_user_put(u);
> +	file->private_data = NULL;
> +
> +	return 0;
> +}
> +
> +/*
> + * Read requested number of ukevents in one shot.
> + */
> +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
> +{
> +	struct ukevent *ukev;
> +
> +	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
> +	if (!ukev)
> +		return NULL;
> +
> +	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
> +		kfree(ukev);
> +		return NULL;
> +	}
> +
> +	return ukev;
> +}
> +
> +/*
> + * Read from userspace all ukevents and modify appropriate kevents.
> + * If the provided number of ukevents is more than the threshold, it is
> + * faster to allocate room for them and copy them in one shot instead of
> + * copying them one by one before processing.
> + */
> +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
> +{
> +	int err = 0, i;
> +	struct ukevent uk;
> +
> +	mutex_lock(&u->ctl_mutex);
> +
> +	if (num > u->kevent_num) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (num > KEVENT_MIN_BUFFS_ALLOC) {
> +		struct ukevent *ukev;
> +
> +		ukev = kevent_get_user(num, arg);
> +		if (ukev) {
> +			for (i = 0; i < num; ++i) {
> +				if (kevent_modify(&ukev[i], u))
> +					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
> +				ukev[i].ret_flags |= KEVENT_RET_DONE;
> +			}
> +			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
> +				err = -EFAULT;
> +			kfree(ukev);
> +			goto out;
> +		}
> +	}
> +
> +	for (i = 0; i < num; ++i) {
> +		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		if (kevent_modify(&uk, u))
> +			uk.ret_flags |= KEVENT_RET_BROKEN;
> +		uk.ret_flags |= KEVENT_RET_DONE;
> +
> +		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		arg += sizeof(struct ukevent);
> +	}
> +out:
> +	mutex_unlock(&u->ctl_mutex);
> +
> +	return err;
> +}
> +
> +/*
> + * Read from userspace all ukevents and remove appropriate kevents.
> + * If the provided number of ukevents is more than the threshold, it is
> + * faster to allocate room for them and copy them in one shot instead of
> + * copying them one by one before processing.
> + */
> +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
> +{
> +	int err = 0, i;
> +	struct ukevent uk;
> +
> +	mutex_lock(&u->ctl_mutex);
> +
> +	if (num > u->kevent_num) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (num > KEVENT_MIN_BUFFS_ALLOC) {
> +		struct ukevent *ukev;
> +
> +		ukev = kevent_get_user(num, arg);
> +		if (ukev) {
> +			for (i = 0; i < num; ++i) {
> +				if (kevent_remove(&ukev[i], u))
> +					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
> +				ukev[i].ret_flags |= KEVENT_RET_DONE;
> +			}
> +			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
> +				err = -EFAULT;
> +			kfree(ukev);
> +			goto out;
> +		}
> +	}
> +
> +	for (i = 0; i < num; ++i) {
> +		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		if (kevent_remove(&uk, u))
> +			uk.ret_flags |= KEVENT_RET_BROKEN;
> +
> +		uk.ret_flags |= KEVENT_RET_DONE;
> +
> +		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		arg += sizeof(struct ukevent);
> +	}
> +out:
> +	mutex_unlock(&u->ctl_mutex);
> +
> +	return err;
> +}
> +
> +/*
> + * Queue the kevent into the userspace control block and increase
> + * its reference counter.
> + */
> +static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
> +{
> +	unsigned long flags;
> +	unsigned int hash = kevent_user_hash(&k->event);
> +
> +	spin_lock_irqsave(&u->kevent_lock, flags);
> +	list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
> +	k->flags |= KEVENT_USER;
> +	u->kevent_num++;
> +	kevent_user_get(u);
> +	spin_unlock_irqrestore(&u->kevent_lock, flags);
> +}
> +
> +/*
> + * Add kevent from both kernel and userspace users.
> + * This function allocates and queues a kevent; it returns a negative
> + * value on error, a positive value if the kevent is ready immediately
> + * and zero if the kevent has been queued.
> + */
> +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
> +{
> +	struct kevent *k;
> +	int err;
> +
> +	if (kevent_user_ring_grow(u)) {
> +		err = -ENOMEM;
> +		goto err_out_exit;
> +	}
> +
> +	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
> +	if (!k) {
> +		err = -ENOMEM;
> +		goto err_out_exit;
> +	}
> +
> +	memcpy(&k->event, uk, sizeof(struct ukevent));
> +	INIT_RCU_HEAD(&k->rcu_head);
> +
> +	k->event.ret_flags = 0;
> +
> +	err = kevent_init(k);
> +	if (err) {
> +		kmem_cache_free(kevent_cache, k);
> +		goto err_out_exit;
> +	}
> +	k->user = u;
> +	kevent_stat_total(u);
> +	kevent_user_enqueue(u, k);
> +
> +	err = kevent_enqueue(k);
> +	if (err) {
> +		memcpy(uk, &k->event, sizeof(struct ukevent));
> +		kevent_finish_user(k, 0);
> +		goto err_out_exit;
> +	}
> +
> +	return 0;
> +
> +err_out_exit:
> +	if (err < 0) {
> +		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
> +		uk->ret_data[1] = err;
> +	} else if (err > 0)
> +		uk->ret_flags |= KEVENT_RET_DONE;
> +	return err;
> +}
> +
> +/*
> + * Copy all ukevents from userspace, allocate kevent for each one
> + * and add them into appropriate kevent_storages,
> + * e.g. sockets, inodes and so on...
> + * Ready events will replace the ones provided by the user, and the
> + * number of ready events is returned.
> + * The user must check the ret_flags field of each ukevent structure
> + * to determine whether it is a fired or a failed event.
> + */
> +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
> +{
> +	int err, cerr = 0, knum = 0, rnum = 0, i;
> +	void __user *orig = arg;
> +	struct ukevent uk;
> +
> +	mutex_lock(&u->ctl_mutex);
> +
> +	err = -EINVAL;
> +	if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
> +		goto out_remove;
> +
> +	if (num > KEVENT_MIN_BUFFS_ALLOC) {
> +		struct ukevent *ukev;
> +
> +		ukev = kevent_get_user(num, arg);
> +		if (ukev) {
> +			for (i = 0; i < num; ++i) {
> +				err = kevent_user_add_ukevent(&ukev[i], u);
> +				if (err) {
> +					kevent_stat_im(u);
> +					if (i != rnum)
> +						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
> +					rnum++;
> +				} else
> +					knum++;
> +			}
> +			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
> +				cerr = -EFAULT;
> +			kfree(ukev);
> +			goto out_setup;
> +		}
> +	}
> +
> +	for (i = 0; i < num; ++i) {
> +		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> +			cerr = -EFAULT;
> +			break;
> +		}
> +		arg += sizeof(struct ukevent);
> +
> +		err = kevent_user_add_ukevent(&uk, u);
> +		if (err) {
> +			kevent_stat_im(u);
> +			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
> +				cerr = -EFAULT;
> +				break;
> +			}
> +			orig += sizeof(struct ukevent);
> +			rnum++;
> +		} else
> +			knum++;
> +	}
> +
> +out_setup:
> +	if (cerr < 0) {
> +		err = cerr;
> +		goto out_remove;
> +	}
> +
> +	err = rnum;
> +out_remove:
> +	mutex_unlock(&u->ctl_mutex);
> +
> +	return err;
> +}
> +
> +/*
> + * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
> + * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
> + */
> +static int kevent_user_wait(struct file *file, struct kevent_user *u,
> +		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
> +		void __user *buf)
> +{
> +	struct kevent *k;
> +	int num = 0;
> +
> +	if (!(file->f_flags & O_NONBLOCK)) {
> +		wait_event_interruptible_timeout(u->wait,
> +			u->ready_num >= min_nr,
> +			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
> +	}
> +
> +	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
> +		if (copy_to_user(buf + num*sizeof(struct ukevent),
> +					&k->event, sizeof(struct ukevent)))
> +			break;
> +
> +		/*
> +		 * If it is one-shot kevent, it has been removed already from
> +		 * origin's queue, so we can easily free it here.
> +		 */
> +		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
> +			kevent_finish_user(k, 1);
> +		++num;
> +		kevent_stat_wait(u);
> +	}
> +
> +	return num;
> +}
> +
> +static struct file_operations kevent_user_fops = {
> +	.mmap		= kevent_user_mmap,
> +	.open		= kevent_user_open,
> +	.release	= kevent_user_release,
> +	.poll		= kevent_user_poll,
> +	.owner		= THIS_MODULE,
> +};

const?

> +
> +static struct miscdevice kevent_miscdev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = kevent_name,
> +	.fops = &kevent_user_fops,
> +};

const?

> +
> +static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
> +{
> +	int err;
> +	struct kevent_user *u = file->private_data;
> +
> +	if (!u || num > KEVENT_MAX_EVENTS)
> +		return -EINVAL;
> +
> +	switch (cmd) {
> +	case KEVENT_CTL_ADD:
> +		err = kevent_user_ctl_add(u, num, arg);
> +		break;
> +	case KEVENT_CTL_REMOVE:
> +		err = kevent_user_ctl_remove(u, num, arg);
> +		break;
> +	case KEVENT_CTL_MODIFY:
> +		err = kevent_user_ctl_modify(u, num, arg);
> +		break;
> +	default:
> +		err = -EINVAL;
> +		break;
> +	}
> +
> +	return err;
> +}
> +
> +/*
> + * Used to get ready kevents from queue.
> + * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).

Isn't this obtained through open(chardev) now?

> + * @min_nr - minimum number of ready kevents.
> + * @max_nr - maximum number of ready kevents.
> + * @timeout - timeout in nanoseconds to wait until some events are ready.
> + * @buf - buffer to place ready events.
> + * @flags - unused for now (will be used for mmap implementation).

There is currently an mmap implementation. Is flags still used?

> + */
> +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
> +		__u64 timeout, struct ukevent __user *buf, unsigned flags)
> +{
> +	int err = -EINVAL;
> +	struct file *file;
> +	struct kevent_user *u;
> +
> +	file = fget(ctl_fd);
> +	if (!file)
> +		return -ENODEV;
> +
> +	if (file->f_op != &kevent_user_fops)
> +		goto out_fput;
> +	u = file->private_data;
> +
> +	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
> +out_fput:
> +	fput(file);
> +	return err;
> +}
> +
> +/*
> + * This syscall waits until there is free space in the kevent queue
> + * and removes @num ready kevents starting at index @start.
> + * @ctl_fd - kevent file descriptor.
> + * @start - start index of the kevents processed by userspace.
> + * @num - number of processed kevents.
> + * @timeout - this timeout specifies number of nanoseconds to wait until there is
> + * 	free space in kevent queue.
> + */
> +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
> +{
> +	int err = -EINVAL, found;
> +	struct file *file;
> +	struct kevent_user *u;
> +	struct kevent *k, *n;
> +	struct mukevent *muk;
> +	unsigned int idx, off, hash;
> +	unsigned long flags;
> +
> +	if (start + num >= KEVENT_MAX_EVENTS ||
> +			start >= KEVENT_MAX_EVENTS ||
> +			num >= KEVENT_MAX_EVENTS)

Since start and num are unsigned, the last two checks are redundant. If 
start or num is individually >= KEVENT_MAX_EVENTS, start + num must be.

> +		return -EINVAL;
> +
> +	file = fget(ctl_fd);
> +	if (!file)
> +		return -ENODEV;
> +
> +	if (file->f_op != &kevent_user_fops)
> +		goto out_fput;
> +	u = file->private_data;
> +
> +	if (((start + num) / KEVENTS_ON_PAGE) >= u->pages_in_use ||
> +			(start / KEVENTS_ON_PAGE) >= u->pages_in_use)
> +		goto out_fput;
> +
> +	spin_lock_irqsave(&u->kevent_lock, flags);
> +	while (num > 0) {
> +		idx = start / KEVENTS_ON_PAGE;
> +		off = start % KEVENTS_ON_PAGE;
> +
> +		muk = &u->pring[idx]->event[off];
> +		hash = __kevent_user_hash(&muk->id);
> +		found = 0;
> +		list_for_each_entry_safe(k, n, &u->kevent_list[hash], kevent_entry) {
> +			if ((k->event.id.raw[0] == muk->id.raw[0]) && (k->event.id.raw[1] == muk->id.raw[1])) {
> +				/*
> +				 * Optimization for case when there is only one rearming kevent and
> +				 * userspace is buggy enough and sets start index to zero.
> +				 */
> +				if (k->flags & KEVENT_READY) {
> +					spin_lock(&u->ready_lock);
> +					if (k->flags & KEVENT_READY) {
> +						list_del(&k->ready_entry);
> +						k->flags &= ~KEVENT_READY;
> +						u->ready_num--;
> +					}
> +					spin_unlock(&u->ready_lock);
> +				}
> +
> +				if (k->event.req_flags & KEVENT_REQ_ONESHOT)
> +					__kevent_finish_user(k, 1);
> +				found = 1;
> +
> +				break;
> +			}
> +		}
> +
> +		if (!found) {
> +			spin_unlock_irqrestore(&u->kevent_lock, flags);
> +			goto out_fput;
> +		}
> +
> +		if (++start >= KEVENT_MAX_EVENTS)
> +			start = 0;
> +		num--;
> +	}
> +	spin_unlock_irqrestore(&u->kevent_lock, flags);
> +
> +	if (!(file->f_flags & O_NONBLOCK)) {
> +		wait_event_interruptible_timeout(u->wait,
> +			u->ready_num >= 1,
> +			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
> +	}
> +
> +	fput(file);
> +
> +	return (u->ready_num >= 1)?0:-EAGAIN;
> +out_fput:
> +	fput(file);
> +	return err;
> +}
> +
> +/*
> + * This syscall is used to perform various control operations
> + * on given kevent queue, which is obtained through kevent file descriptor @fd.
> + * @cmd - type of operation.
> + * @num - number of kevents to be processed.
> + * @arg - pointer to array of struct ukevent.
> + */
> +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
> +{
> +	int err = -EINVAL;
> +	struct file *file;
> +
> +	file = fget(fd);
> +	if (!file)
> +		return -ENODEV;
> +
> +	if (file->f_op != &kevent_user_fops)
> +		goto out_fput;
> +
> +	err = kevent_ctl_process(file, cmd, num, arg);
> +
> +out_fput:
> +	fput(file);
> +	return err;
> +}
> +
> +/*
> + * Kevent subsystem initialization - create kevent cache and register
> + * filesystem to get control file descriptors from.
> + */
> +static int __devinit kevent_user_init(void)
> +{
> +	int err = 0;
> +
> +	kevent_cache = kmem_cache_create("kevent_cache",
> +			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
> +
> +	err = misc_register(&kevent_miscdev);
> +	if (err) {
> +		printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);

Are we leaving the slab cache for kevents around in this failure case for 
a reason? Why is a slab creation failure panic()-worthy when a 
chardev allocation failure is not?

> +		goto err_out_exit;
> +	}
> +
> +	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
> +
> +	return 0;
> +
> +err_out_exit:
> +	return err;
> +}
> +
> +static void __devexit kevent_user_fini(void)
> +{
> +	misc_deregister(&kevent_miscdev);
> +}
> +
> +module_init(kevent_user_init);
> +module_exit(kevent_user_fini);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 6991bec..564e618 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -122,6 +122,10 @@ cond_syscall(ppc_rtas);
> cond_syscall(sys_spu_run);
> cond_syscall(sys_spu_create);
>
> +cond_syscall(sys_kevent_get_events);
> +cond_syscall(sys_kevent_wait);
> +cond_syscall(sys_kevent_ctl);
> +
> /* mmu depending weak syscall entries */
> cond_syscall(sys_mprotect);
> cond_syscall(sys_msync);
>
>

Looking pretty good. This is my first pass of comments, and I'll probably 
have questions that follow, but I'm trying to get a really good picture of 
what is going on here for documentation purposes.

Thanks,
Chase

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [take16 1/4] kevent: Core files.
  2006-09-06 13:40     ` [take16 1/4] kevent: Core files Chase Venters
@ 2006-09-06 13:54       ` Chase Venters
  2006-09-06 14:03       ` Evgeniy Polyakov
  1 sibling, 0 replies; 143+ messages in thread
From: Chase Venters @ 2006-09-06 13:54 UTC (permalink / raw)
  To: Chase Venters
  Cc: Evgeniy Polyakov, lkml, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
	Johann Borck

On Wed, 6 Sep 2006, Chase Venters wrote:

>>  +	if (start + num >= KEVENT_MAX_EVENTS ||
>>  +			start >= KEVENT_MAX_EVENTS ||
>>  +			num >= KEVENT_MAX_EVENTS)
>
> Since start and num are unsigned, the last two checks are redundant. If start 
> or num is individually >= KEVENT_MAX_EVENTS, start + num must be.
>

Actually, my early-morning brain code optimizer is apparently broken, 
because it forgot all about integer wraparound. Disregard please.

>
> Thanks,
> Chase
>


* Re: [take16 1/4] kevent: Core files.
  2006-09-06 13:40     ` [take16 1/4] kevent: Core files Chase Venters
  2006-09-06 13:54       ` Chase Venters
@ 2006-09-06 14:03       ` Evgeniy Polyakov
  2006-09-06 14:23         ` Chase Venters
  1 sibling, 1 reply; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-06 14:03 UTC (permalink / raw)
  To: Chase Venters
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Johann Borck

On Wed, Sep 06, 2006 at 08:40:21AM -0500, Chase Venters (chase.venters@clientec.com) wrote:
> Evgeniy,
> 	Sorry about the radio silence lately. Some reviewer commentary 
> follows.


> >+struct kevent
> >+{
> >+	/* Used for kevent freeing.*/
> >+	struct rcu_head		rcu_head;
> >+	struct ukevent		event;
> >+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. 
> >*/
> >+	spinlock_t		ulock;
> >+
> >+	/* Entry of user's queue. */
> >+	struct list_head	kevent_entry;
> >+	/* Entry of origin's queue. */
> >+	struct list_head	storage_entry;
> >+	/* Entry of user's ready. */
> >+	struct list_head	ready_entry;
> >+
> >+	u32			flags;
> >+
> >+	/* User who requested this kevent. */
> >+	struct kevent_user	*user;
> >+	/* Kevent container. */
> >+	struct kevent_storage	*st;
> >+
> >+	struct kevent_callbacks	callbacks;
> >+
> >+	/* Private data for different storages.
> >+	 * poll()/select storage has a list of wait_queue_t containers
> >+	 * for each ->poll() { poll_wait()' } here.
> >+	 */
> >+	void			*priv;
> >+};
> >+
> >+#define KEVENT_HASH_MASK	0xff
> >+
> >+struct kevent_user
> >+{
> 
> These structure names get a little dicey (kevent, kevent_user, ukevent, 
> mukevent)... might there be slightly different names that could be 
> selected to better distinguish the purpose of each?

Like what?
ukevent means userspace kevent, but ukevent is much shorter.
mukevent means mapped userspace kevent; again, mukevent is much shorter.

> >+	struct list_head	kevent_list[KEVENT_HASH_MASK+1];
> >+	spinlock_t		kevent_lock;
> >+	/* Number of queued kevents. */
> >+	unsigned int		kevent_num;
> >+
> >+	/* List of ready kevents. */
> >+	struct list_head	ready_list;
> >+	/* Number of ready kevents. */
> >+	unsigned int		ready_num;
> >+	/* Protects all manipulations with ready queue. */
> >+	spinlock_t 		ready_lock;
> >+
> >+	/* Protects against simultaneous kevent_user control manipulations. 
> >*/
> >+	struct mutex		ctl_mutex;
> >+	/* Wait until some events are ready. */
> >+	wait_queue_head_t	wait;
> >+
> >+	/* Reference counter, increased for each new kevent. */
> >+	atomic_t		refcnt;
> >+
> >+	unsigned int		pages_in_use;
> >+	/* Array of pages forming mapped ring buffer */
> >+	struct kevent_mring	**pring;
> >+
> >+#ifdef CONFIG_KEVENT_USER_STAT
> >+	unsigned long		im_num;
> >+	unsigned long		wait_num;
> >+	unsigned long		total;
> >+#endif
> >+};
> >+#define KEVENT_MAX_EVENTS	4096
> >+
> 
> This limit governs how many simultaneous kevents you can be waiting on / 
> for at once, correct? Would it be possible to drop the hard limit and 
> limit instead, say, the maximum number of kevents you can have pending in 
> the mmap ring-buffer? After the number is exceeded, additional events 
> could get dropped, or some magic number could be put in the 
> kevent_mring->index field to let the process know that it must hit another 
> syscall to drain the rest of the events.

I decided to use a queue length for the mmapped buffer; using the size of
the mmapped buffer as the queue length is possible too.
But in any case it is very broken behaviour to introduce any kind of
overflow and special marking for it: rt signals already have that, and
there is no need to create an additional headache.


> >+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
> 
> __read_mostly?

Yep, I was already told that some structures can be marked as such.
Such a change is not a 100% requirement though.

> >+
> >+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
> >+{
> >+	struct kevent_callbacks *p;
> >+
> >+	if (pos >= KEVENT_MAX)
> >+		return -EINVAL;
> >+
> >+	p = &kevent_registered_callbacks[pos];
> >+
> >+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
> >+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
> >+	p->callback = (cb->callback) ? cb->callback : kevent_break;
> 
> Curious... why are these callbacks copied, rather than just retaining a 
> pointer to a const/static "ops" structure?

It simplifies the callers of those callbacks: they can just call a
function instead of dereferencing and checking various pointers.
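For documentation purposes, the fallback-filling idiom can be sketched as a self-contained userspace C program. The struct shape and the kevent_break fallback mirror the quoted patch; the stub names, return values and the small table size are invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace sketch of the fallback-filling pattern: every
 * slot in the callback table is populated, so callers never need a NULL
 * check before invoking a hook. */

struct kevent_callbacks {
	int (*enqueue)(void *k);
	int (*dequeue)(void *k);
	int (*callback)(void *k);
};

#define KEVENT_MAX 8

static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];

/* Stand-in for kevent_break(): the default hook for absent callbacks. */
static int kevent_break_stub(void *k)
{
	(void)k;
	return -1;
}

/* Mirrors kevent_add_callbacks(): missing hooks fall back to the break
 * handler instead of staying NULL. */
static int kevent_add_callbacks_stub(const struct kevent_callbacks *cb, int pos)
{
	struct kevent_callbacks *p;

	if (pos >= KEVENT_MAX)
		return -1;

	p = &kevent_registered_callbacks[pos];
	p->enqueue  = cb->enqueue  ? cb->enqueue  : kevent_break_stub;
	p->dequeue  = cb->dequeue  ? cb->dequeue  : kevent_break_stub;
	p->callback = cb->callback ? cb->callback : kevent_break_stub;
	return 0;
}
```

With this shape a caller can write `p->dequeue(k)` unconditionally; the cost is copying three pointers at registration time instead of keeping a pointer to a const ops structure.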

> >+
> >+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
> 
> Is this printk() chatter necessary?

Like any other informational printk in the kernel it is not necessary,
but it allows the user to know which kevent kernel users are enabled.

> >+static char kevent_name[] = "kevent";
> 
> const?

Yep.

> >+/*
> >+ * Initialize mmap ring buffer.
> >+ * It will store ready kevents, so userspace could get them directly 
> >instead
> >+ * of using syscall. Essentially the syscall becomes just a waiting point.
> >+ */
> >+static int kevent_user_ring_init(struct kevent_user *u)
> >+{
> >+	int pnum;
> >+
> >+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + 
> >sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
> 
> This calculation works with the current constants, but it comes up a page 
> short if, say, KEVENT_MAX_EVENTS were 4095. It also looks incorrect 
> visually since the 'sizeof(unsigned int)' is only factored in once (rather 
> than once per page). I suggest a static / inline __max_kevent_pages() 
> function that either does:
> 
> return KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE + 1;
> 
> or
> 
> int pnum = KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE;
> if (KEVENT_MAX_EVENTS % KEVENTS_ON_PAGE)
> 	pnum++;
> return pnum;
> 
> Both should be optimized away by the compiler and will give correct 
> answers regardless of the constant values.

The above pnum calculation aligns the number of mukevents to the page size
with an appropriate allowance for the (unsigned int), although it is not
stated in that comment (a clearer comment can be found around
KEVENTS_ON_PAGE).
You propose essentially the same calculation in the second case, while the
first one requires an additional page in some cases.
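The two calculations can be compared directly in userspace. The sizes below are assumptions for illustration (a 12-byte mukevent and a 4-byte index word at the start of each ring page), not values taken from the patch itself:

```c
#include <assert.h>

/* Userspace comparison of the two ring page-count formulas. */

#define RING_PAGE_SIZE	4096u
#define MUKEVENT_SIZE	12u	/* assumed sizeof(struct mukevent) */
#define INDEX_SIZE	4u	/* assumed sizeof(unsigned int) */
#define KEVENTS_ON_PAGE	((RING_PAGE_SIZE - INDEX_SIZE) / MUKEVENT_SIZE)

#define ALIGN_UP(x, a)	((((x) + (a) - 1u) / (a)) * (a))

/* The patch's calculation: total bytes plus one index word, aligned up
 * to a page.  It accounts for the index word only once overall. */
static unsigned int pnum_align(unsigned int nevents)
{
	return ALIGN_UP(nevents * MUKEVENT_SIZE + INDEX_SIZE,
			RING_PAGE_SIZE) / RING_PAGE_SIZE;
}

/* The reviewer's calculation: pages actually needed when each page
 * holds at most KEVENTS_ON_PAGE events. */
static unsigned int pnum_exact(unsigned int nevents)
{
	return nevents / KEVENTS_ON_PAGE +
		(nevents % KEVENTS_ON_PAGE ? 1u : 0u);
}
```

With these assumed sizes the formulas agree at 4096 events (13 pages each) but the ALIGN-based one comes up a page short at 4095 (12 versus 13), which matches the review comment about the per-page index word being factored in only once.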

> >+static struct file_operations kevent_user_fops = {
> >+	.mmap		= kevent_user_mmap,
> >+	.open		= kevent_user_open,
> >+	.release	= kevent_user_release,
> >+	.poll		= kevent_user_poll,
> >+	.owner		= THIS_MODULE,
> >+};
> 
> const?
> 
> >+
> >+static struct miscdevice kevent_miscdev = {
> >+	.minor = MISC_DYNAMIC_MINOR,
> >+	.name = kevent_name,
> >+	.fops = &kevent_user_fops,
> >+};
> 
> const?


Yep, both structures can be const.

> >+/*
> >+ * Used to get ready kevents from queue.
> >+ * @ctl_fd - kevent control descriptor which must be obtained through 
> >kevent_ctl(KEVENT_CTL_INIT).
> 
> Isn't this obtained through open(chardev) now?

Comment is old, you are correct.

> >+ * @min_nr - minimum number of ready kevents.
> >+ * @max_nr - maximum number of ready kevents.
> >+ * @timeout - timeout in nanoseconds to wait until some events are ready.
> >+ * @buf - buffer to place ready events.
> >+ * @flags - unused for now (will be used for mmap implementation).
> 
> There is currently an mmap implementation. Is flags still used?

It is unused, but I'm still waiting on comments about whether we need
kevent_get_events() at all: some people wanted to completely eliminate
that function in favour of total mmap domination.

> >+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned 
> >int num, __u64 timeout)
> >+{
> >+	int err = -EINVAL, found;
> >+	struct file *file;
> >+	struct kevent_user *u;
> >+	struct kevent *k, *n;
> >+	struct mukevent *muk;
> >+	unsigned int idx, off, hash;
> >+	unsigned long flags;
> >+
> >+	if (start + num >= KEVENT_MAX_EVENTS ||
> >+			start >= KEVENT_MAX_EVENTS ||
> >+			num >= KEVENT_MAX_EVENTS)
> 
> Since start and num are unsigned, the last two checks are redundant. If 
> start or num is individually >= KEVENT_MAX_EVENTS, start + num must be.

No: consider the case when start is -1U and num is 1. Without the
individual checks, start + num wraps around to zero and does not fail
the combined condition.
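The wraparound case can be demonstrated in a few lines of self-contained C. The range checks are those from the quoted syscall; the helper names are invented:

```c
#include <assert.h>

/* Why the individual range checks in sys_kevent_wait() are not
 * redundant: unsigned addition wraps around modulo 2^N, so start + num
 * can pass the combined check even when start alone is out of range. */

#define KEVENT_MAX_EVENTS 4096u

/* Only the combined check, as the review comment first suggested. */
static int range_ok_combined(unsigned int start, unsigned int num)
{
	return !(start + num >= KEVENT_MAX_EVENTS);
}

/* All three checks, as written in the patch. */
static int range_ok_full(unsigned int start, unsigned int num)
{
	return !(start + num >= KEVENT_MAX_EVENTS ||
			start >= KEVENT_MAX_EVENTS ||
			num >= KEVENT_MAX_EVENTS);
}
```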

> >+/*
> >+ * Kevent subsystem initialization - create kevent cache and register
> >+ * filesystem to get control file descriptors from.
> >+ */
> >+static int __devinit kevent_user_init(void)
> >+{
> >+	int err = 0;
> >+
> >+	kevent_cache = kmem_cache_create("kevent_cache",
> >+			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
> >+
> >+	err = misc_register(&kevent_miscdev);
> >+	if (err) {
> >+		printk(KERN_ERR "Failed to register kevent miscdev: 
> >err=%d.\n", err);
> 
> Are we leaving the slab cache for kevents around in this failure case for 
> a reason? Why is a slab creation failure panic()-worthy when a 
> chardev allocation failure is not?

I have no strong opinion on how to behave in this situation.
kevent can panic, free the cache, go into an infinite loop or screw up
the hard drive. Everything is (almost) the same.

> Looking pretty good. This is my first pass of comments, and I'll probably 
> have questions that follow, but I'm trying to get a really good picture of 
> what is going on here for documentation purposes.

Thank you, Chase.
I will definitely take your comments into account and change the related
bits.

> Thanks,
> Chase

-- 
	Evgeniy Polyakov


* Re: [take16 1/4] kevent: Core files.
  2006-09-06 14:03       ` Evgeniy Polyakov
@ 2006-09-06 14:23         ` Chase Venters
  2006-09-07  7:10           ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Chase Venters @ 2006-09-06 14:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Chase Venters, lkml, David Miller, Ulrich Drepper, Andrew Morton,
	netdev, Zach Brown, Christoph Hellwig, Johann Borck

On Wed, 6 Sep 2006, Evgeniy Polyakov wrote:

>>> +
>>> +struct kevent_user
>>> +{
>>
>> These structure names get a little dicey (kevent, kevent_user, ukevent,
>> mukevent)... might there be slightly different names that could be
>> selected to better distinguish the purpose of each?
>
> Like what?
> ukevent means userspace_kevent, but ukevent is much smaller.
> mukevent is mapped userspace kevent, mukevent is again much smaller.
>

Hmm, well, kevent_user and ukevent are perhaps the only ones I'm concerned 
about. What about calling kevent_user a kevent_queue, kevent_fd or 
kevent_set?

>
> I decided to use queue length for mmaped buffer, using size of the
> mmapped buffer as queue length is possible too.
> But in any case it is very broken behaviour to introduce any kind of
> overflow and special marking for that - rt signals already have it, no
> need to create additional headache.
>

Hmm. The concern here is pinned memory, is it not? I'm trying to think of 
the best way to avoid compile-time limits. select() has a rather 
(infamous) compile-time limit of 1,024 thanks to libc (and thanks to the 
bit vector, a glass ceiling). Now, you'd be a fool to use select() on that many
fd's in modern code meant to run on modern UNIXes. But kevent is a new 
system, the grand unified event loop that all of us userspace programmers
have been begging for for many years. Glass ceilings tend to hurt when
you run into them :)

Using the size of the memory mapped buffer as queue length sounds like a 
sane simplification.

>>> +static int kevent_user_ring_init(struct kevent_user *u)
>>> +{
>>> +	int pnum;
>>> +
>>> +	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) +
>>> sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
>>
>> This calculation works with the current constants, but it comes up a page
>> short if, say, KEVENT_MAX_EVENTS were 4095. It also looks incorrect
>> visually since the 'sizeof(unsigned int)' is only factored in once (rather
>> than once per page). I suggest a static / inline __max_kevent_pages()
>> function that either does:
>>
>> return KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE + 1;
>>
>> or
>>
>> int pnum = KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE;
>> if (KEVENT_MAX_EVENTS % KEVENTS_ON_PAGE)
>> 	pnum++;
>> return pnum;
>>
>> Both should be optimized away by the compiler and will give correct
>> answers regardless of the constant values.
>
> Above pnum calculation aligns the number of mukevents to the page size
> with an appropriate check for the (unsigned int), although it is not
> stated in that comment (a clearer comment can be found around
> KEVENTS_ON_PAGE).
> You propose essentially the same calculation in the second case, while
> the first one requires an additional page in some cases.
>

You are right about my first suggestion sometimes coming up a page extra. 
What I'm worried about is that the current ALIGN() based calculation comes 
up a page short if KEVENT_MAX_EVENTS is certain values (say 4095). This is 
because the "unsigned int index" is inside kevent_mring for every page, 
though the ALIGN() calculation just factors in room for one of them. In 
these boundary cases (KEVENT_MAX_EVENTS == 4095), your calculation thinks 
it can fit one last mukevent on a page because it didn't factor in room 
for "unsigned int index" at the start of every page; rather just for one 
page. In this case, the modulus should always come up non-zero, giving us 
the extra required page.

>
> It is unused, but I'm still waiting on comments if we need
> kevent_get_events() at all - some people wanted to completely eliminate
> that function in favour of total mmap domination.
>

Interesting idea. It would certainly simplify the interface.

>
> I have no strong opinion on how to behave in this situation.
> kevent can panic, can free cache, can go to infinite loop or screw up
> the hard drive. Everything is (almost) the same.
>

Obviously it's not a huge deal :)

If kevent is to screw up the hard drive, though, we must put in an 
exception for it to avoid my music directory.

>> Looking pretty good. This is my first pass of comments, and I'll probably
>> have questions that follow, but I'm trying to get a really good picture of
>> what is going on here for documentation purposes.
>
> Thank you, Chase.
> I will definitely take your comments into account and change the
> related bits.

Thanks again!
Chase


* Re: [take16 1/4] kevent: Core files.
  2006-09-06 14:23         ` Chase Venters
@ 2006-09-07  7:10           ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-07  7:10 UTC (permalink / raw)
  To: Chase Venters
  Cc: lkml, David Miller, Ulrich Drepper, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Johann Borck

On Wed, Sep 06, 2006 at 09:23:56AM -0500, Chase Venters (chase.venters@clientec.com) wrote:
> On Wed, 6 Sep 2006, Evgeniy Polyakov wrote:
> >>>+struct kevent_user
> >>>+{
> >>
> >>These structure names get a little dicey (kevent, kevent_user, ukevent,
> >>mukevent)... might there be slightly different names that could be
> >>selected to better distinguish the purpose of each?
> >
> >Like what?
> >ukevent means userspace_kevent, but ukevent is much smaller.
> >mukevent is mapped userspace kevent, mukevent is again much smaller.
> >
> 
> Hmm, well, kevent_user and ukevent are perhaps the only ones I'm concerned 
> about. What about calling kevent_user a kevent_queue, kevent_fd or 
> kevent_set?

kevent_user is the kernel-side representation of, guess what? Yes, a
kevent user :)

> >I decided to use queue length for mmaped buffer, using size of the
> >mmapped buffer as queue length is possible too.
> >But in any case it is very broken behaviour to introduce any kind of
> >overflow and special marking for that - rt signals already have it, no
> >need to create additional headache.
> >
> 
> Hmm. The concern here is pinned memory, is it not? I'm trying to think of 
> the best way to avoid compile-time limits. select() has a rather 
> (infamous) compile-time limit of 1,024 thanks to libc (and thanks to the 
> bit vector, a glass ceiling). Now, you'd be a fool to use select() on
> that many fd's in modern code meant to run on modern UNIXes. But kevent
> is a new system, the grand unified event loop that all of us userspace
> programmers have been begging for for many years. Glass ceilings tend
> to hurt when
> you run into them :)
> 
> Using the size of the memory mapped buffer as queue length sounds like a 
> sane simplification.

Pinned memory is not the _main_ issue in a real-world application - only
if it is some kind of DoS or really broken behaviour where tons of
event queues are created (like many epoll control descriptors).
The memory-mapped buffer may not even exist, if the application is not
going to use the mmap interface.

> >>>+static int kevent_user_ring_init(struct kevent_user *u)
> >>>+{
> >>>+	int pnum;
> >>>+
> >>>+	pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) +
> >>>sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
> >>
> >>This calculation works with the current constants, but it comes up a page
> >>short if, say, KEVENT_MAX_EVENTS were 4095. It also looks incorrect
> >>visually since the 'sizeof(unsigned int)' is only factored in once (rather
> >>than once per page). I suggest a static / inline __max_kevent_pages()
> >>function that either does:
> >>
> >>return KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE + 1;
> >>
> >>or
> >>
> >>int pnum = KEVENT_MAX_EVENTS / KEVENTS_ON_PAGE;
> >>if (KEVENT_MAX_EVENTS % KEVENTS_ON_PAGE)
> >>	pnum++;
> >>return pnum;
> >>
> >>Both should be optimized away by the compiler and will give correct
> >>answers regardless of the constant values.
> >
> >Above pnum calculation aligns the number of mukevents to the page size
> >with an appropriate check for the (unsigned int), although it is not
> >stated in that comment (a clearer comment can be found around
> >KEVENTS_ON_PAGE).
> >You propose essentially the same calculation in the second case, while
> >the first one requires an additional page in some cases.
> >
> 
> You are right about my first suggestion sometimes coming up a page extra. 
> What I'm worried about is that the current ALIGN() based calculation comes 
> up a page short if KEVENT_MAX_EVENTS is certain values (say 4095). This is 
> because the "unsigned int index" is inside kevent_mring for every page, 
> though the ALIGN() calculation just factors in room for one of them. In 
> these boundary cases (KEVENT_MAX_EVENTS == 4095), your calculation thinks 
> it can fit one last mukevent on a page because it didn't factor in room 
> for "unsigned int index" at the start of every page; rather just for one 
> page. In this case, the modulus should always come up non-zero, giving us 
> the extra required page.

The comment about KEVENTS_ON_PAGE clearly says what must be taken into
account when the size is calculated, but you are right, I should use
better macros there, which should take sizeof(struct kevent_mring) into
account. I will update it.

> >It is unused, but I'm still waiting on comments if we need
> >kevent_get_events() at all - some people wanted to completely eliminate
> >that function in favour of total mmap domination.
> >
> 
> Interesting idea. It would certainly simplify the interface.

Only for those who really want to use the additional mmap interface.

> >
> >I have no strong opinion on how to behave in this situation.
> >kevent can panic, can free cache, can go to infinite loop or screw up
> >the hard drive. Everything is (almost) the same.
> >
> 
> Obviously it's not a huge deal :)
> 
> If kevent is to screw up the hard drive, though, we must put in an 
> exception for it to avoid my music directory.

Care to send a patch for kernel command line? :)


-- 
	Evgeniy Polyakov


* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-08-31  7:58     ` Evgeniy Polyakov
@ 2006-09-09 16:10       ` Ulrich Drepper
  2006-09-11  5:42         ` Evgeniy Polyakov
  0 siblings, 1 reply; 143+ messages in thread
From: Ulrich Drepper @ 2006-09-09 16:10 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, lkml, David Miller, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters

On 8/31/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Sorry for the long delay - I was on a small vacation.

No vacation here, but travel nonetheless.

> > - one point of critique which applied to many proposals over the years:
> >   multiplexer syscalls a bad, really bad. [...]
>
> Can you convince Christoph?
> I do not care about interfaces, but until several people agree on it, I
> will not change anything.

I hope that Linus and/or Andrew simply decree that multiplexers are
bad. glibc and probably strace are the two most affected programs, so
their maintainers should have a say.  My opinion is clear.  Also, for
analysis tools the multiplexers are bad, since different numbers of
parameters are used, maybe even with different types.


> You completely miss AIO here (I talk not about POSIX AIO).

Sure, I should have mentioned it.  But I was assuming this all along.


> I use there only the id provided by the user, it is not his cookie, but
> it was done to make the structure as small as possible.
> Think about the size of the mapped buffer when there are several kevent
> queues - it is all mapped and thus pinned memory.
> It of course can be extended.

"It" being what?  The problem is that the structure of the ring buffer
elements cannot easily be changed later.  So we have to get it right
now which means being a bit pessimistic about future requirements.
Add padding, there will certainly be future uses which need more
space.


> > Next, the current interfaces once again fail to learn from a mistake we
> > made and which got corrected for the other interfaces.  We need to be
> > able to change the signal mask around the delay atomically.  Just like
> > we have ppoll for poll, pselect for select (and hopefully soon also
> > epoll_pwait for epoll_wait) we need to have this feature in the new
> > interfaces.
>
> We are able to change kevents atomically.

I don't understand.  Or you don't understand.  I was talking about
changing the signal mask atomically around the wait call.  I.e., the
call needs an additional optional parameter specifying the signal mask
to use (for the kernel: two parameters, pointer and length).  This
parameter is not available in the version of the patch I looked at and
should be added if it's still missing in the latest version of the
patch.  Again, look at the difference between poll() and ppoll() and
do the same.


> Well, I rarely talk about what other people want, but if you strongly
> feel, that all posix crap is better than epoll interface, then I can not
> agree with you.

You miss the point entirely like DaveM before you.  What I ask for is
simply a uniform and well established form to tell an interface to use
the kevent notification mechanism and not use signals etc.  Look at
the mail I sent in reply to DaveM's mail.


> It is possible to create additional one using any POSIX API you like,
> but I strongly insist on having possibility to use lightweight syscall
> interface too.

Again, missing the point.  We can without any significant change
enable POSIX interfaces and GNU extensions like the timer, AIO, the
async DNS code, etc use kevents.  For the latter, which is entirely
implemented at userlevel, we need interfaces to queue kevents from
userlevel.  I think this is already supported.  The other two
definitely benefit from using kevent notification and since they
are/will be handled in the kernel the completion events should be
queued in a kevent queue as specified in the sigevent structure passed
to the system call.


> The ring buffer _always_ has space for new events until the queue is
> filled. So if userspace does not read its events for too long and
> eventually tries to add a new one, it will fail early.

Sorry, I don't understand this at all.

If the ring buffer always has enough room then events must be
preregistered.  Is this the case?  Seems very inflexible, and how would 
this work with event sources like timers which can trigger many times?

I hope you don't mean that ring buffers probably won't overflow since
programs have to handle events fast enough.  That's not acceptable.


> There is no overflow - I do not want to introduce another signal queue
> overflow crap here.
> And once again - no signals.

Well, signals are the only asynchronous notification mechanism we
have.  But more to the point: why cannot there be overflows?


> You basically want to deliver the same event to several users.
> But how do you want to achive it with network buffers for example.
> When several threads reads from the same socket, they do not obtain the
> same data.

That's not what I am after.  I'm perfectly fine with waking only one
thread.  In fact, this is how it must be to avoid the trampling herd
effects.  But there is the problem that if the woken thread is not
working on the issue for which it was woken (e.g., if the thread got
canceled) then it must be able to wake another thread.  In effect, 
there should be a syscall which causes a given number of other waiters 
(make the number a parameter to the syscall) to be woken.  They would 
start running and if nothing is to be done go back to sleep.  The
wakeup interface is what is needed.


> min_nr is used to specify special case "wake up when at least one event
> is ready and get all ready ones".

I understand but when is this really necessary?  The nature of the
event queue will find many different types of events being reported
via them.  In such a situation a minimum count is not really useful.
I would argue this is unnecessary complexity which easily and more
flexibly can be handled at userlevel.

> There are no "expected outstanding events", I think it can be a problem.
> Currently there is absolute maximum of events, which can not be
> increased in real-time.

That is a problem.  If we succeed in having a unified event mechanism
the number of outstanding events can be unbounded, only limited by the
systems capabilities.


> Each subsequent mmap will mmap existing buffers, first one mmap can
> create that buffer.

OK, so you have magic in mmap() calls using the kevent file
descriptor?  Seems OK but I will not export this as the interface
glibc exports.  All this should be abstracted out.


> >   Maybe the flags parameter isn't needed, it's just another way to make
> >   sure we won't regret the design later.  If the ring buffer can fill up
> >   and this is detected by the kernel (unlike what happens in take 14)
>
> Just a repeat - with current buffer implementation it can not happen -
> maximum  queue length is a limit for buffer size.

How can the buffer not fill up?  Where is the information stored in
case the userlevel code did not process the ring buffer entries in
time?


> >     int kevent_wait (int kfd, unsigned ringstate,
> >                      const struct timespec *timeout,
> >                      const sigset_t *sigmask)
>
> Yes, I agree, this is good syscall.
> Except signals (no signals, that's the rule) and variable sized timespec
> structure. What about putting there u64 number of nanoseconds?

Well, I've explained it already above and repeated during the
pselect/ppoll discussions.  The sigmask parameter is not in any way
a signal that events should be sent using signals.  It is simply a way
to set the signal mask atomically around the delay to some other
value.  This is functionality which cannot be implemented at
userlevel.  Hence we now have pselect and ppoll system call.  The
kevent_wait syscall will need the same.


> What about following:
> userspace:
>  - check ring index, if it differs from stored in userspace, then there
>    are events between old stored index and new one just read.
>  - copy events
>  - call kevent_wait() or other method to show kernel that all events
>    upto provided in syscall numbers are processed, and thus kernel can
>    remove them and put there new ones.

This would require a system call to free ring buffer entries.  And
delaying the ack of an event (to avoid syscall overhead) means that
the ring buffer might fill up.

Having a userlevel-writable fields which indicated whether an entry in
the ring buffer is free would help to prevent these syscalls and allow
freeing up the entries.  These fields could be in the form of a bitmap
outside the actual ring buffer.

If a ring buffer is not wanted, then a simple writable buffer index
should be used.  This will require that all entries in the ring buffer
are processed in sequence but I don't consider this too much of a
limitation.  The kernel only ever reads this buffer index field.
Instead of making this field part of the mapping (which could be
read-only) the field index position could be passed to the kernel in
the syscall to create a kevent queue.


> kernelspace:
>  - when new kevent is added, it guarantees that there is a place for it
>    in kernel ring buffer

How?  Unless you severely want to limit the usefulness of kevents this
is not possible.  One example, already given above, are periodic
timers.


>  - when event is ready it is copied into mapped buffer and index of the
>    "last ready" is increased (it is fully atomic operation)
>  - when userspace calls kevent_wait() kernel get ring index from
>    syscall, searches for all events upto provided number and free them
>    (or rearm)

Yes, that's OK.  But in the fast path no kevent_wait syscall should be
needed.  If the index variable is exposed in the memory region
containing the ring buffer no syscall is needed in case the ring
buffer is not empty.

> As shown above it is already implemented.

How can you say that?  Just before you said the kevent_wait syscall is
not implemented.  This paragraph was all about how to use kevent_wait.
I'll have to look at the latest code to see how the _wait syscall is
now implemented.


* Re: [take14 0/3] kevent: Generic event handling mechanism.
  2006-09-09 16:10       ` Ulrich Drepper
@ 2006-09-11  5:42         ` Evgeniy Polyakov
  0 siblings, 0 replies; 143+ messages in thread
From: Evgeniy Polyakov @ 2006-09-11  5:42 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Ulrich Drepper, lkml, David Miller, Andrew Morton, netdev,
	Zach Brown, Christoph Hellwig, Chase Venters

On Sat, Sep 09, 2006 at 09:10:35AM -0700, Ulrich Drepper (drepper@gmail.com) wrote:
> >> - one point of critique which applied to many proposals over the years:
> >>   multiplexer syscalls a bad, really bad. [...]
> >
> >Can you convince Christoph?
> >I do not care about interfaces, but until several people agree on it, I
> >will not change anything.
> 
> I hope that Linus and/or Andrew simply decree that multiplexers are
> bad. glibc and probably strace are the two most affected programs so
> their maintainers should have a say.  My opinion is clear.  Also for
> analysis tools the multiplexers are bad since different numbers of
> parameters are used and maybe even with different types.

The types are exactly the same; actually the whole set of operations
multiplexed in kevents is add/remove/modify. They really look and work
very similarly, so it is not that bad to multiplex them in one syscall.
But yes, we can extend it to 3 differently named ones, which will just
end up wasting space in the syscall table.
 
> >I use there only id provided by user, it is not his cookie, but it was
> >done to make the structure as small as possible.
> >Think about size of the mapped buffer when there are several kevent
> >queues - it is all mapped and thus pinned memory.
> >It of course can be extended.
> 
> "It" being what?  The problem is that the structure of the ring buffer
> elements cannot easily be changed later.  So we have to get it right
> now which means being a bit pessimistic about future requirements.
> Add padding, there will certainly be future uses which need more
> space.

"It" was/is a whole situation about mmaped buffer - we can extend it, no
problem, what fields you think needs to be added?

> >> Next, the current interfaces once again fail to learn from a mistake we
> >> made and which got corrected for the other interfaces.  We need to be
> >> able to change the signal mask around the delay atomically.  Just like
> >> we have ppoll for poll, pselect for select (and hopefully soon also
> >> epoll_pwait for epoll_wait) we need to have this feature in the new
> >> interfaces.
> >
> >We are able to change kevents atomically.
> 
> I don't understand.  Or you don't understand.  I was talking about
> changing the signal mask atomically around the wait call.  I.e., the
> call needs an additional optional parameter specifying the signal mask
> to use (for the kernel: two parameters, pointer and length).  This
> parameter is not available in the version of the patch I looked at and
> should be added if it's still missing in the latest version of the
> patch.  Again, look at the difference between poll() and ppoll() and
> do the same.

You meant "atomically" with respect to signals, I meant about atomically
compared to simultaneous access.
Looking into ppol() I wonder what is the difference between doing the
same in userspace? There are no special locks, nothing special except
TIF_RESTORE_SIGMASK bit set, so what's the point of it not being done in
userspace?

> >Well, I rarely talk about what other people want, but if you strongly
> >feel, that all posix crap is better than epoll interface, then I can not
> >agree with you.
> 
> You miss the point entirely like DaveM before you.  What I ask for is
> simply a uniform and well established form to tell an interface to use
> the kevent notification mechanism and not use signals etc.  Look at
> the mail I sent in reply to DaveM's mail.

There is a special function in kevent which is used for adding kevents,
which can be called from everywhere (except modules, since it is not
exported right now), so one can create _any_ interface he likes.
A POSIX timer-like API is not what a lot of people want, since
epoll/poll/select is a completely different thing and exactly _that_ is
what the majority of people use. So I created a similar interface.
But there is no problem implementing any additional one, it is simple.

> >It is possible to create additional one using any POSIX API you like,
> >but I strongly insist on having possibility to use lightweight syscall
> >interface too.
> 
> Again, missing the point.  We can without any significant change
> enable POSIX interfaces and GNU extensions like the timer, AIO, the
> async DNS code, etc use kevents.  For the latter, which is entirely
> implemented at userlevel, we need interfaces to queue kevents from
> userlevel.  I think this is already supported.  The other two
> definitely benefit from using kevent notification and since they
> are/will be handled in the kernel the completion events should be
> queued in a kevent queue as specified in the sigevent structure passed
> to the system call.

I do not object to additional interfaces, no problem, the
implementation is really simple. But I strongly object to removing the
existing interface; it is there not for the furniture, but because it is
the most convenient way (in my opinion) to use the existing (supported
by kevent) event notifications. If we need additional interfaces, it is
really simple to add them, just use kevent_user_add_ukevent(), which
takes a struct ukevent, which describes the requested notification, and
a struct kevent_user, which is the queue where you want to put your
events and which will be checked when events are ready.
 
> >The ring buffer _always_ has space for new events until the queue is
> >filled. So if userspace does not read its events for too long and
> >eventually tries to add a new one, it will fail early.
> 
> Sorry, I don't understand this at all.
> 
> If the ring buffer always has enough room then events must be
> preregistered.  Is this the case?  Seems very inflexible, and how would 
> this work with event sources like timers which can trigger many times?

A ready event is only placed once into the buffer, even if the timer has
fired many times. How would it look if we put a notification there each
time new data arrived from the network, instead of marking the
KEVENT_SOCKET_RECV event as ready? It could eat the whole memory, if for
each one-byte packet we put 12 bytes of event there.
There is a limit on the maximum allowed events in one kevent queue; when
this limit is reached, no new events can be added from userspace until
all previously committed ones are removed, so that moment is used as the
limiting factor for the mapped buffer - it can grow until the maximum
allowed number of events can be placed there.
Such a method can look inconvenient, but I doubt that the buffer
overflow scenario (what happens with rt-signals) is really much nicer...

> I hope you don't mean that ring buffers probably won't overflow since
> programs have to handle events fast enough.  That's not acceptable.

:)
 
> >There is no overflow - I do not want to introduce another signal queue
> >overflow crap here.
> >And once again - no signals.
> 
> Well, signals are the only asynchronous notification mechanism we
> have.  But more to the point: why cannot there be overflows?

Kevent queue is limited (for purpose of mapped buffer), so mapped buffer
will grow until it can host maximum number of events (it is 4096 right
now), when such situation happens (i.e. queue is full), no new event can
be added, so no events can be put into the mapped buffer, and it can not
overflow.

> >You basically want to deliver the same event to several users.
> >But how do you want to achive it with network buffers for example.
> >When several threads reads from the same socket, they do not obtain the
> >same data.
> 
> That's not what I am after.  I'm perfectly fine with waking only one
> thread.  In fact, this is how it must be to avoid the trampling herd
> effects.  But there is the problem that if the woken thread is not
> working on the issue for which it was woken (e.g., if the thread got
> canceled) then it must be able to wake another thread.  In effect, 
> there should be a syscall which causes a given number of other waiters 
> (make the number a parameter to the syscall) to be woken.  They would 
> start running and if nothing is to be done go back to sleep.  The
> wakeup interface is what is needed.

You look at the problem from some strange and, it seems, wrong angle.
There is one queue of events, and that queue does not and can not know
who will read it. It just exists and hosts ready events; if there are
several threads which can access it, how can it detect which one will do
so? How will the recv() syscall wake up exactly the thread which is
supposed to receive the data, and not the one which is supposed to
print info into syslog that the data has arrived?
 
> >min_nr is used to specify special case "wake up when at least one event
> >is ready and get all ready ones".
> 
> I understand but when is this really necessary?  The nature of the
> event queue will find many different types of events being reported
> via them.  In such a situation a minimum count is not really useful.
> I would argue this is unnecessary complexity which easily and more
> flexibly can be handled at userlevel.

Consider the situation when you have a web server. A connected user does
not want to wait until 10 other users have connected (or some timeout)
before the server is awakened and starts to process them.
From the other side, consider someone who writes data asynchronously;
it is much better to wake him up when several writes are completed,
rather than each time one write is ready.

> >There are no "expected outstanding events", I think it can be a problem.
> >Currently there is absolute maximum of events, which can not be
> >increased in real-time.
> 
> That is a problem.  If we succeed in having a unified event mechanism
> the number of outstanding events can be unbounded, only limited by the
> systems capabilities.

Then I will remove the mapped buffer implementation, since unbounded
pinned memory is not what we want. Buffer overflow is not an option -
recall rt-signal overflow and recovery.
 
> >Each subsequent mmap will mmap existing buffers, first one mmap can
> >create that buffer.
> 
> OK, so you have magic in mmap() calls using the kevent file
> descriptor?  Seems OK but I will not export this as the interface
> glibc exports.  All this should be abstracted out.

Yes, I use private area created when kevent file descriptor was
allocated.

> >>   Maybe the flags parameter isn't needed, it's just another way to make
> >>   sure we won't regret the design later.  If the ring buffer can fill up
> >>   and this is detected by the kernel (unlike what happens in take 14)
> >
> >Just a repeat - with current buffer implementation it can not happen -
> >maximum  queue length is a limit for buffer size.
> 
> How can the buffer not fill up?  Where is the information stored in 
> case the userlevel code did not process the ring buffer entries in
> time?

The buffer can be filled completely, but there is no possibility of an
overflow, since the maximum number of events is a limiting factor for
the buffer size.
 
> >>     int kevent_wait (int kfd, unsigned ringstate,
> >>                      const struct timespec *timeout,
> >>                      const sigset_t *sigmask)
> >
> >Yes, I agree, this is good syscall.
> >Except signals (no signals, that's the rule) and variable sized timespec
> >structure. What about putting there u64 number of nanoseconds?
> 
> Well, I've explained it already above and repeated during the
> pselect/ppoll discussions.  The sigmask parameter is not in any way 
> a signal that events should be sent using signals.  It is simply a way 
> to set the signal mask atomically around the delay to some other
> value.  This is functionality which cannot be implemented at
> userlevel.  Hence we now have pselect and ppoll system call.  The
> kevent_wait syscall will need the same.

What I see in sys_ppoll() is just a change of the mask and a call to the
usual poll(); there are no locks and no special tricks.

> >What about following:
> >userspace:
> > - check ring index, if it differs from stored in userspace, then there
> >   are events between old stored index and new one just read.
> > - copy events
> > - call kevent_wait() or other method to show kernel that all events
> >   upto provided in syscall numbers are processed, and thus kernel can
> >   remove them and put there new ones.
> 
> This would require a system call to free ring buffer entries.  And
> delaying the ack of an event (to avoid syscall overhead) means that
> the ring buffer might fill up.
> 
> Having a userlevel-writable fields which indicated whether an entry in
> the ring buffer is free would help to prevent these syscalls and allow
> freeing up the entries.  These fields could be in the form of a bitmap
> outside the actual ring buffer.

I added kevent_wait() exactly for that.
Its parameters allow events to be completed, although it is not
possible to complete an event in the middle of the set of ready
events, only from the beginning.

> If a ring buffer is not wanted, then a simple writable buffer index
> should be used.  This will require that all entries in the ring buffer
> are processed in sequence but I don't consider this too much of a
> limitation.  The kernel only ever reads this buffer index field.
> Instead of making this field part of the mapping (which could be
> read-only) the field index position could be passed to the kernel in
> the syscall to create an kevent queue.
> 
> 
> >kernelspace:
> > - when new kevent is added, it guarantees that there is a place for it
> >   in kernel ring buffer
> 
> How?  Unless you severely want to limit the usefulness of kevents this
> is not possible.  One example, already given above, are periodic
> timers.

A periodic timer is added only once from userspace, and it is marked as
ready when it fires for the first time. If userspace misses that it
fired several times before it was read, that is userspace's problem:
I put the timer's last ready time into ret_data, so userspace can check
how many times this event would have been marked as ready.
The same applies to other events triggered in this way.

> > - when event is ready it is copied into mapped buffer and index of the
> >   "last ready" is increased (it is fully atomic operation)
> > - when userspace calls kevent_wait() kernel get ring index from
> >   syscall, searches for all events upto provided number and free them
> >   (or rearm)
> 
> Yes, that's OK.  But in the fast path no kevent_wait syscall should be
> needed.  If the index variable is exposed in the memory region
> containing the ring buffer no syscall is needed in case the ring
> buffer is not empty.

We need to inform the kernel that some events have been processed by
userspace and thus can be rearmed (i.e. marked as not ready, so that
rearming work can mark them as ready again: like received data or a
timer timeout) or freed. kevent_wait() both waits and commits (it does
not wait if events are already ready, and does not commit if the
provided number of events is zero).

Committing through a writable mapping is not the best way, I think: it
introduces a lot of problems, such as events damaged by userspace
errors, the inability to sleep on that variable, and so on.

> >As shown above it is already implemented.
> 
> How can you say that.  Just before you said the kevent_wait syscall is
> not implemented.  This paragraph was all about how to use kevent_wait.
> I'll have to look at the latest code to see how the _wait syscall is
> now implemented.

Kevent development is quite fast :)
kevent_wait() is already implemented in the take14 patchset.

-- 
	Evgeniy Polyakov



Thread overview: 143+ messages (download: mbox.gz / follow: Atom feed)
     [not found] <12345678912345.GA1898@2ka.mipt.ru>
2006-08-17  7:43 ` [take11 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-08-17  7:43   ` [take11 1/3] kevent: Core files Evgeniy Polyakov
2006-08-17  7:43     ` [take11 2/3] kevent: poll/select() notifications Evgeniy Polyakov
2006-08-17  7:43       ` [take11 3/3] kevent: Timer notifications Evgeniy Polyakov
2006-08-21 10:19 ` [take12 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-08-21 10:19   ` [take12 1/3] kevent: Core files Evgeniy Polyakov
2006-08-21 10:19     ` [take12 2/3] kevent: poll/select() notifications Evgeniy Polyakov
2006-08-21 10:19       ` [take12 3/3] kevent: Timer notifications Evgeniy Polyakov
2006-08-21 11:12         ` Christoph Hellwig
2006-08-21 11:18           ` Evgeniy Polyakov
2006-08-21 11:27             ` Arjan van de Ven
2006-08-21 11:59               ` Evgeniy Polyakov
2006-08-21 12:13                 ` Arjan van de Ven
2006-08-21 12:25                   ` Evgeniy Polyakov
2006-08-21 14:25             ` Thomas Gleixner
2006-08-22 18:25               ` Evgeniy Polyakov
2006-08-21 12:09           ` Evgeniy Polyakov
2006-08-22  4:36             ` Andrew Morton
2006-08-22  5:48               ` Evgeniy Polyakov
2006-08-21 12:37         ` [take12 4/3] kevent: Comment cleanup Evgeniy Polyakov
2006-08-23  8:51     ` [take12 1/3] kevent: Core files Eric Dumazet
2006-08-23  9:18       ` Evgeniy Polyakov
2006-08-23  9:23         ` Eric Dumazet
2006-08-23  9:29           ` Evgeniy Polyakov
2006-08-22  7:00   ` [take12 0/3] kevent: Generic event handling mechanism Nicholas Miell
2006-08-22  7:24     ` Evgeniy Polyakov
2006-08-22  8:17       ` Nicholas Miell
2006-08-22  8:23         ` David Miller
2006-08-22  8:59           ` Nicholas Miell
2006-08-22 14:59             ` James Morris
2006-08-22 20:00               ` Nicholas Miell
2006-08-22 20:36                 ` David Miller
2006-08-22 21:13                   ` Nicholas Miell
2006-08-22 21:25                     ` David Miller
2006-08-22 22:58                       ` Nicholas Miell
2006-08-22 23:46                         ` Ulrich Drepper
2006-08-23  1:51                           ` Nicholas Miell
2006-08-23  6:54                           ` Evgeniy Polyakov
2006-08-22  8:37         ` Evgeniy Polyakov
2006-08-22  9:29           ` Nicholas Miell
2006-08-22 10:03             ` Evgeniy Polyakov
2006-08-22 19:57               ` Nicholas Miell
2006-08-22 20:16                 ` Evgeniy Polyakov
2006-08-22 21:13                   ` Nicholas Miell
2006-08-22 21:37                     ` Randy.Dunlap
2006-08-22 22:01                       ` Andrew Morton
2006-08-22 22:17                         ` David Miller
2006-08-22 23:35                           ` Andrew Morton
2006-08-22 22:58                       ` Nicholas Miell
2006-08-22 23:06                         ` David Miller
2006-08-23  1:36                           ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Nicholas Miell
2006-08-23  2:01                             ` The Proposed Linux kevent API Howard Chu
2006-08-23  3:31                             ` David Miller
2006-08-23  3:47                               ` Nicholas Miell
2006-08-23  4:23                                 ` Nicholas Miell
2006-08-23  6:22                             ` The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.) Evgeniy Polyakov
2006-08-23  8:01                               ` Nicholas Miell
2006-08-23 18:24                             ` The Proposed Linux kevent API Stephen Hemminger
2006-08-22 23:22                         ` [take12 0/3] kevent: Generic event handling mechanism Randy.Dunlap
     [not found]         ` <b3f268590608220957g43a16d6bmde8a542f8ad8710b@mail.gmail.com>
2006-08-22 17:09           ` Jari Sundell
2006-08-22 18:01           ` Evgeniy Polyakov
2006-08-22 19:14             ` Jari Sundell
2006-08-22 19:47               ` Evgeniy Polyakov
2006-08-22 22:51                 ` Jari Sundell
2006-08-22 23:11                   ` Alexey Kuznetsov
2006-08-23  0:28                     ` Jari Sundell
2006-08-23  0:32                       ` David Miller
2006-08-23  0:43                         ` Jari Sundell
2006-08-23  6:56                           ` Evgeniy Polyakov
2006-08-23  7:07                             ` Andrew Morton
2006-08-23  7:10                               ` Evgeniy Polyakov
2006-08-23  9:58                                 ` Andi Kleen
2006-08-23 10:03                                   ` Evgeniy Polyakov
2006-08-23  7:35                               ` David Miller
2006-08-23  8:18                                 ` Nicholas Miell
2006-08-23  7:43                               ` Ian McDonald
2006-08-23  7:50                               ` Evgeniy Polyakov
2006-08-23 16:09                                 ` Andrew Morton
2006-08-23 16:22                                   ` Evgeniy Polyakov
2006-08-23  8:22                             ` Jari Sundell
2006-08-23  8:39                               ` Evgeniy Polyakov
2006-08-23  9:49                                 ` Jari Sundell
2006-08-23 10:20                                   ` Evgeniy Polyakov
2006-08-23 10:34                                     ` Jari Sundell
2006-08-23 10:51                                       ` Evgeniy Polyakov
2006-08-23 12:55                                         ` Jari Sundell
2006-08-23 13:11                                           ` Evgeniy Polyakov
2006-08-22 11:54   ` [PATCH] kevent_user: remove non-chardev interface Christoph Hellwig
2006-08-22 12:17     ` Evgeniy Polyakov
2006-08-22 12:27       ` Christoph Hellwig
2006-08-22 12:39         ` Evgeniy Polyakov
2006-08-22 11:55   ` [PATCH] kevent_user: use struct kevent_mring for the page ring Christoph Hellwig
2006-08-22 12:20     ` Evgeniy Polyakov
2006-08-23 11:24 ` [take13 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-08-23 11:24   ` [take13 1/3] kevent: Core files Evgeniy Polyakov
2006-08-23 11:24     ` [take13 2/3] kevent: poll/select() notifications Evgeniy Polyakov
2006-08-23 11:24       ` [take13 3/3] kevent: Timer notifications Evgeniy Polyakov
2006-08-23 12:51     ` [take13 1/3] kevent: Core files Eric Dumazet
     [not found]       ` <20060823132753.GB29056@2ka.mipt.ru>
2006-08-23 13:44         ` Evgeniy Polyakov
2006-08-24 20:03     ` Christoph Hellwig
2006-08-25  5:48       ` Evgeniy Polyakov
2006-08-25  6:20         ` Andrew Morton
2006-08-25  6:32           ` Evgeniy Polyakov
2006-08-25  6:58             ` Andrew Morton
2006-08-25  7:20               ` Evgeniy Polyakov
2006-08-25  7:01           ` David Miller
2006-08-25  7:13             ` Andrew Morton
     [not found]   ` <Pine.LNX.4.63.0608231313370.8007@alpha.polcom.net>
     [not found]     ` <20060823122509.GA5744@2ka.mipt.ru>
     [not found]       ` <Pine.LNX.4.63.0608231437170.8007@alpha.polcom.net>
     [not found]         ` <20060823134227.GC29056@2ka.mipt.ru>
2006-08-23 18:56           ` [take13 0/3] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-08-23 19:42             ` Evgeniy Polyakov
2006-08-25  9:54 ` [take14 " Evgeniy Polyakov
2006-08-25  9:54   ` [take14 1/3] kevent: Core files Evgeniy Polyakov
2006-08-25  9:54     ` [take14 2/3] kevent: poll/select() notifications Evgeniy Polyakov
2006-08-25  9:54       ` [take14 3/3] kevent: Timer notifications Evgeniy Polyakov
2006-08-27 21:03   ` [take14 0/3] kevent: Generic event handling mechanism Ulrich Drepper
2006-08-28  1:57     ` David Miller
2006-08-28  2:11       ` Ulrich Drepper
2006-08-28  2:40       ` Nicholas Miell
2006-08-28  2:59     ` Nicholas Miell
2006-08-28 11:47       ` Jari Sundell
2006-08-31  7:58     ` Evgeniy Polyakov
2006-09-09 16:10       ` Ulrich Drepper
2006-09-11  5:42         ` Evgeniy Polyakov
2006-09-04 10:14 ` [take15 0/4] " Evgeniy Polyakov
2006-09-04  9:58   ` Evgeniy Polyakov
2006-09-04 10:14   ` [take15 1/4] kevent: Core files Evgeniy Polyakov
2006-09-04 10:14     ` [take15 2/4] kevent: poll/select() notifications Evgeniy Polyakov
2006-09-04 10:14       ` [take15 3/4] kevent: Socket notifications Evgeniy Polyakov
2006-09-04 10:14         ` [take15 4/4] kevent: Timer notifications Evgeniy Polyakov
2006-09-05 13:39           ` Arnd Bergmann
2006-09-06  6:42             ` Evgeniy Polyakov
2006-09-05 13:28     ` [take15 1/4] kevent: Core files Arnd Bergmann
2006-09-06  6:51       ` Evgeniy Polyakov
2006-09-04 10:24   ` [take15 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-09-06 11:55 ` [take16 " Evgeniy Polyakov
2006-09-06 11:55   ` [take16 1/4] kevent: Core files Evgeniy Polyakov
2006-09-06 11:55     ` [take16 2/4] kevent: poll/select() notifications Evgeniy Polyakov
2006-09-06 11:55       ` [take16 3/4] kevent: Socket notifications Evgeniy Polyakov
2006-09-06 11:55         ` [take16 4/4] kevent: Timer notifications Evgeniy Polyakov
2006-09-06 13:40     ` [take16 1/4] kevent: Core files Chase Venters
2006-09-06 13:54       ` Chase Venters
2006-09-06 14:03       ` Evgeniy Polyakov
2006-09-06 14:23         ` Chase Venters
2006-09-07  7:10           ` Evgeniy Polyakov
