linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/1] Generalize poll events from eventfd
@ 2015-09-16  6:27 Damian Hobson-Garcia
  2015-09-16  6:27 ` [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag Damian Hobson-Garcia
  0 siblings, 1 reply; 13+ messages in thread
From: Damian Hobson-Garcia @ 2015-09-16  6:27 UTC (permalink / raw)
  To: viro, sustrik
  Cc: linux-kernel, linux-fsdevel, linux-api, netdev, David.Laight,
	Damian Hobson-Garcia

Using eventfd user space can generate POLLIN/POLLOUT events but some
applications may want to generate POLLPRI/POLLERR events as well.
This patch submission aims to generalize the events generated by an
eventfd. This is a resubmission of a patch from Feb 2013[1]. The original
discussion trailed off without any conclusion, but the original author
has recently confirmed[2] that this functionality is still useful, so I
volunteered to rebase and resubmit the patch for discussion.

[1] https://lkml.org/lkml/2013/2/18/147
[2] https://lkml.org/lkml/2015/7/9/153

Changes in v2
-------------

* rebased on Linux v4.3-rc1
* Move file operation implementations for EFD_MASK to a seperate structure
* Remove 'data' element from efd_mask structure
* read() is no longer supported when EFD_MASK is set (fails with EINVAL)
* eventfd_ctx_fileget() now returns EINVAL when EFD_MASK is set, eliminating
  the possibility of triggering the orginal BUG_ON() macros which have now
  been removed.

Thank you,
Damian

Martin Sustrik (1):
  eventfd: implementation of EFD_MASK flag

 fs/eventfd.c                 | 91 ++++++++++++++++++++++++++++++++++++++------
 include/linux/eventfd.h      | 16 +-------
 include/uapi/linux/eventfd.h | 40 +++++++++++++++++++
 3 files changed, 121 insertions(+), 26 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

-- 
1.9.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2015-09-16  6:27 [PATCH v2 0/1] Generalize poll events from eventfd Damian Hobson-Garcia
@ 2015-09-16  6:27 ` Damian Hobson-Garcia
  2015-09-16  6:51   ` Martin Sustrik
  0 siblings, 1 reply; 13+ messages in thread
From: Damian Hobson-Garcia @ 2015-09-16  6:27 UTC (permalink / raw)
  To: viro, sustrik
  Cc: linux-kernel, linux-fsdevel, linux-api, netdev, David.Laight,
	Damian Hobson-Garcia

From: Martin Sustrik <sustrik@250bpm.com>

When implementing network protocols in user space, one has to implement
fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll
accept only true file descriptors) and forces protocol implementers to use
various workarounds resulting in complex, non-standard and convoluted APIs.

More generally, ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes half
the way towards this goal it has follwoing shorcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. Most
    notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid
    combination for a network protocol (rx buffer is empty and tx buffer is
    full), cannot be signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above problems.

Additionally, to provide a way to associate user-space state with eventfd
object, it allows to attach user-space data to the file descriptor.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a
way as to signal no events on the file descriptor when it is polled on.
The 'initval' argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  uint32_t events;
};

The value of 'events' should be any combination of event flags as defined
by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will be signaled when polling (select, poll, epoll) on the eventfd
is done later on.

read(2):

read is not supported and will fail with EINVAL.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events
specified in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik <sustrik@250bpm.com>

[dhobsong@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
Signed-off-by: Damian Hobson-Garcia <dhobsong@igel.co.jp>
---
 fs/eventfd.c                 | 102 ++++++++++++++++++++++++++++++++++++++-----
 include/linux/eventfd.h      |  16 +------
 include/uapi/linux/eventfd.h |  39 +++++++++++++++++
 3 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8d0c0df..1a6a066 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2013  Martin Sustrik <sustrik@250bpm.com>
  *
  */
 
@@ -22,18 +23,31 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 
+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
 struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
-	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
-	 */
-	__u64 count;
+	union {
+		/*
+		 * Every time that a write(2) is performed on an eventfd, the
+		 * value of the __u64 being written is added to "count" and a
+		 * wakeup is performed on "wqh". A read(2) will return the
+		 * "count" value to userspace, and will reset "count" to zero.
+		 * The kernel side eventfd_signal() also, adds to the "count"
+		 * counter and issue a wakeup.
+		 */
+		__u64 count;
+
+		/*
+		 * When using eventfd in EFD_MASK mode this stracture stores the
+		 * current events to be signaled on the eventfd (events member)
+		 * along with opaque user-defined data (data member).
+		 */
+		struct efd_mask mask;
+	};
 	unsigned int flags;
 };
 
@@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
 	return events;
 }
 
+static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->wqh, wait);
+	return ctx->mask.events;
+}
+
 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 {
 	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
@@ -239,6 +261,14 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
 }
 
+static ssize_t eventfd_mask_read(struct file *file, char __user *buf,
+			    size_t count,
+			    loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
 			     loff_t *ppos)
 {
@@ -286,6 +316,28 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }
 
+static ssize_t eventfd_mask_write(struct file *file, const char __user *buf,
+			     size_t count,
+			     loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	struct efd_mask mask;
+
+	if (count < sizeof(mask))
+		return -EINVAL;
+	if (copy_from_user(&mask, buf, sizeof(mask)))
+		return -EFAULT;
+	if (mask.events & ~EFD_MASK_VALID_EVENTS)
+		return -EINVAL;
+	spin_lock_irq(&ctx->wqh.lock);
+	memcpy(&ctx->mask, &mask, sizeof(ctx->mask));
+	if (waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh,
+			(unsigned long)ctx->mask.events);
+	spin_unlock_irq(&ctx->wqh.lock);
+	return sizeof(ctx->mask);
+}
+
 #ifdef CONFIG_PROC_FS
 static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
@@ -296,6 +348,16 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
 		   (unsigned long long)ctx->count);
 	spin_unlock_irq(&ctx->wqh.lock);
 }
+
+static void eventfd_mask_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct eventfd_ctx *ctx = f->private_data;
+
+	spin_lock_irq(&ctx->wqh.lock);
+	seq_printf(m, "eventfd-mask: %x\n",
+		(unsigned)ctx->mask.events);
+	spin_unlock_irq(&ctx->wqh.lock);
+}
 #endif
 
 static const struct file_operations eventfd_fops = {
@@ -309,6 +371,17 @@ static const struct file_operations eventfd_fops = {
 	.llseek		= noop_llseek,
 };
 
+static const struct file_operations eventfd_mask_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= eventfd_mask_show_fdinfo,
+#endif
+	.release	= eventfd_release,
+	.poll		= eventfd_mask_poll,
+	.read		= eventfd_mask_read,
+	.write		= eventfd_mask_write,
+	.llseek		= noop_llseek,
+};
+
 /**
  * eventfd_fget - Acquire a reference of an eventfd file descriptor.
  * @fd: [in] Eventfd file descriptor.
@@ -392,6 +465,7 @@ struct file *eventfd_file_create(unsigned int count, int flags)
 {
 	struct file *file;
 	struct eventfd_ctx *ctx;
+	const struct file_operations *fops;
 
 	/* Check the EFD_* constants for consistency.  */
 	BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
@@ -406,10 +480,16 @@ struct file *eventfd_file_create(unsigned int count, int flags)
 
 	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
-	ctx->count = count;
+	if (flags & EFD_MASK) {
+		memset(&ctx->mask, 0, sizeof(ctx->mask));
+		fops = &eventfd_mask_fops;
+	} else {
+		ctx->count = count;
+		fops = &eventfd_fops;
+	}
 	ctx->flags = flags;
 
-	file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
+	file = anon_inode_getfile("[eventfd]", fops, ctx,
 				  O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
 	if (IS_ERR(file))
 		eventfd_free_ctx(ctx);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index ff0b981..87de343 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,23 +8,11 @@
 #ifndef _LINUX_EVENTFD_H
 #define _LINUX_EVENTFD_H
 
+#include <uapi/linux/eventfd.h>
+
 #include <linux/fcntl.h>
 #include <linux/wait.h>
 
-/*
- * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
- * new flags, since they might collide with O_* ones. We want
- * to re-use O_* flags that couldn't possibly have a meaning
- * from eventfd, in order to leave a free define-space for
- * shared O_* flags.
- */
-#define EFD_SEMAPHORE (1 << 0)
-#define EFD_CLOEXEC O_CLOEXEC
-#define EFD_NONBLOCK O_NONBLOCK
-
-#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
-
 struct file;
 
 #ifdef CONFIG_EVENTFD
diff --git a/include/uapi/linux/eventfd.h b/include/uapi/linux/eventfd.h
new file mode 100644
index 0000000..98ec034
--- /dev/null
+++ b/include/uapi/linux/eventfd.h
@@ -0,0 +1,39 @@
+/*
+ *  Copyright (C) 2013 Martin Sustrik <sustrik@250bpm.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_EVENTFD_H
+#define _UAPI_LINUX_EVENTFD_H
+
+/* For O_CLOEXEC */
+#include <linux/fcntl.h>
+#include <linux/types.h>
+
+/*
+ * CAREFUL: Check include/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from eventfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+
+/* Provide semaphore-like semantics for reads from the eventfd. */
+#define EFD_SEMAPHORE (1 << 0)
+/* Provide event mask semantics for the eventfd. */
+#define EFD_MASK (1 << 1)
+/*  Set the close-on-exec (FD_CLOEXEC) flag on the eventfd. */
+#define EFD_CLOEXEC O_CLOEXEC
+/*  Create the eventfd in non-blocking mode. */
+#define EFD_NONBLOCK O_NONBLOCK
+
+/* Structure to read/write to eventfd in EFD_MASK mode. */
+struct efd_mask {
+	__u32 events;
+};
+
+#endif /* _UAPI_LINUX_EVENTFD_H */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2015-09-16  6:27 ` [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag Damian Hobson-Garcia
@ 2015-09-16  6:51   ` Martin Sustrik
  2015-09-16  7:43     ` Damian Hobson-Garcia
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Sustrik @ 2015-09-16  6:51 UTC (permalink / raw)
  To: Damian Hobson-Garcia
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, netdev, David.Laight

On 2015-09-16 08:27, Damian Hobson-Garcia wrote:
> From: Martin Sustrik <sustrik@250bpm.com>
> 
> When implementing network protocols in user space, one has to implement
> fake file descriptors to represent the sockets for the protocol.
> 
> Polling on such fake file descriptors is a problem (poll/select/epoll
> accept only true file descriptors) and forces protocol implementers to 
> use
> various workarounds resulting in complex, non-standard and convoluted 
> APIs.
> 
> More generally, ability to create full-blown file descriptors for
> userspace-to-userspace signalling is missing. While eventfd(2) goes 
> half
> the way towards this goal it has follwoing shorcomings:
> 
> I.  There's no way to signal POLLPRI, POLLHUP etc.
> II. There's no way to signal arbitrary combination of POLL* flags. Most
>     notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly 
> valid
>     combination for a network protocol (rx buffer is empty and tx 
> buffer is
>     full), cannot be signaled using eventfd.
> 
> This patch implements new EFD_MASK flag which solves the above 
> problems.
> 
> Additionally, to provide a way to associate user-space state with 
> eventfd
> object, it allows to attach user-space data to the file descriptor.

The above paragraph is a leftover from the past. The functionality no 
longer exist.

> 
> The semantics of EFD_MASK are as follows:
> 
> eventfd(2):
> 
> If eventfd is created with EFD_MASK flag set, it is initialised in such 
> a
> way as to signal no events on the file descriptor when it is polled on.
> The 'initval' argument is ignored.
> 
> write(2):
> 
> User is allowed to write only buffers containing the following 
> structure:
> 
> struct efd_mask {
>   uint32_t events;
> };

Is it worth having a struct here? Why not just uint32_t?

Martin

> 
> The value of 'events' should be any combination of event flags as 
> defined
> by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
> events will be signaled when polling (select, poll, epoll) on the 
> eventfd
> is done later on.
> 
> read(2):
> 
> read is not supported and will fail with EINVAL.
> 
> select(2), poll(2) and similar:
> 
> When polling on the eventfd marked by EFD_MASK flag, all the events
> specified in last written 'events' field shall be signaled.
> 
> Signed-off-by: Martin Sustrik <sustrik@250bpm.com>
> 
> [dhobsong@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
> Signed-off-by: Damian Hobson-Garcia <dhobsong@igel.co.jp>
> ---
>  fs/eventfd.c                 | 102 
> ++++++++++++++++++++++++++++++++++++++-----
>  include/linux/eventfd.h      |  16 +------
>  include/uapi/linux/eventfd.h |  39 +++++++++++++++++
>  3 files changed, 132 insertions(+), 25 deletions(-)
>  create mode 100644 include/uapi/linux/eventfd.h
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8d0c0df..1a6a066 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -2,6 +2,7 @@
>   *  fs/eventfd.c
>   *
>   *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *  Copyright (C) 2013  Martin Sustrik <sustrik@250bpm.com>
>   *
>   */
> 
> @@ -22,18 +23,31 @@
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
> 
> +#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | 
> EFD_MASK)
> +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | 
> POLLHUP)
> +
>  struct eventfd_ctx {
>  	struct kref kref;
>  	wait_queue_head_t wqh;
> -	/*
> -	 * Every time that a write(2) is performed on an eventfd, the
> -	 * value of the __u64 being written is added to "count" and a
> -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> -	 * value to userspace, and will reset "count" to zero. The kernel
> -	 * side eventfd_signal() also, adds to the "count" counter and
> -	 * issue a wakeup.
> -	 */
> -	__u64 count;
> +	union {
> +		/*
> +		 * Every time that a write(2) is performed on an eventfd, the
> +		 * value of the __u64 being written is added to "count" and a
> +		 * wakeup is performed on "wqh". A read(2) will return the
> +		 * "count" value to userspace, and will reset "count" to zero.
> +		 * The kernel side eventfd_signal() also, adds to the "count"
> +		 * counter and issue a wakeup.
> +		 */
> +		__u64 count;
> +
> +		/*
> +		 * When using eventfd in EFD_MASK mode this stracture stores the
> +		 * current events to be signaled on the eventfd (events member)
> +		 * along with opaque user-defined data (data member).
> +		 */
> +		struct efd_mask mask;
> +	};
>  	unsigned int flags;
>  };
> 
> @@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file
> *file, poll_table *wait)
>  	return events;
>  }
> 
> +static unsigned int eventfd_mask_poll(struct file *file, poll_table 
> *wait)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +	return ctx->mask.events;
> +}
> +
>  static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
>  {
>  	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
> @@ -239,6 +261,14 @@ static ssize_t eventfd_read(struct file *file,
> char __user *buf, size_t count,
>  	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
>  }
> 
> +static ssize_t eventfd_mask_read(struct file *file, char __user *buf,
> +			    size_t count,
> +			    loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +
> +
>  static ssize_t eventfd_write(struct file *file, const char __user
> *buf, size_t count,
>  			     loff_t *ppos)
>  {
> @@ -286,6 +316,28 @@ static ssize_t eventfd_write(struct file *file,
> const char __user *buf, size_t c
>  	return res;
>  }
> 
> +static ssize_t eventfd_mask_write(struct file *file, const char __user 
> *buf,
> +			     size_t count,
> +			     loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	struct efd_mask mask;
> +
> +	if (count < sizeof(mask))
> +		return -EINVAL;
> +	if (copy_from_user(&mask, buf, sizeof(mask)))
> +		return -EFAULT;
> +	if (mask.events & ~EFD_MASK_VALID_EVENTS)
> +		return -EINVAL;
> +	spin_lock_irq(&ctx->wqh.lock);
> +	memcpy(&ctx->mask, &mask, sizeof(ctx->mask));
> +	if (waitqueue_active(&ctx->wqh))
> +		wake_up_locked_poll(&ctx->wqh,
> +			(unsigned long)ctx->mask.events);
> +	spin_unlock_irq(&ctx->wqh.lock);
> +	return sizeof(ctx->mask);
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>  {
> @@ -296,6 +348,16 @@ static void eventfd_show_fdinfo(struct seq_file
> *m, struct file *f)
>  		   (unsigned long long)ctx->count);
>  	spin_unlock_irq(&ctx->wqh.lock);
>  }
> +
> +static void eventfd_mask_show_fdinfo(struct seq_file *m, struct file 
> *f)
> +{
> +	struct eventfd_ctx *ctx = f->private_data;
> +
> +	spin_lock_irq(&ctx->wqh.lock);
> +	seq_printf(m, "eventfd-mask: %x\n",
> +		(unsigned)ctx->mask.events);
> +	spin_unlock_irq(&ctx->wqh.lock);
> +}
>  #endif
> 
>  static const struct file_operations eventfd_fops = {
> @@ -309,6 +371,17 @@ static const struct file_operations eventfd_fops = 
> {
>  	.llseek		= noop_llseek,
>  };
> 
> +static const struct file_operations eventfd_mask_fops = {
> +#ifdef CONFIG_PROC_FS
> +	.show_fdinfo	= eventfd_mask_show_fdinfo,
> +#endif
> +	.release	= eventfd_release,
> +	.poll		= eventfd_mask_poll,
> +	.read		= eventfd_mask_read,
> +	.write		= eventfd_mask_write,
> +	.llseek		= noop_llseek,
> +};
> +
>  /**
>   * eventfd_fget - Acquire a reference of an eventfd file descriptor.
>   * @fd: [in] Eventfd file descriptor.
> @@ -392,6 +465,7 @@ struct file *eventfd_file_create(unsigned int
> count, int flags)
>  {
>  	struct file *file;
>  	struct eventfd_ctx *ctx;
> +	const struct file_operations *fops;
> 
>  	/* Check the EFD_* constants for consistency.  */
>  	BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
> @@ -406,10 +480,16 @@ struct file *eventfd_file_create(unsigned int
> count, int flags)
> 
>  	kref_init(&ctx->kref);
>  	init_waitqueue_head(&ctx->wqh);
> -	ctx->count = count;
> +	if (flags & EFD_MASK) {
> +		memset(&ctx->mask, 0, sizeof(ctx->mask));
> +		fops = &eventfd_mask_fops;
> +	} else {
> +		ctx->count = count;
> +		fops = &eventfd_fops;
> +	}
>  	ctx->flags = flags;
> 
> -	file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
> +	file = anon_inode_getfile("[eventfd]", fops, ctx,
>  				  O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
>  	if (IS_ERR(file))
>  		eventfd_free_ctx(ctx);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index ff0b981..87de343 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -8,23 +8,11 @@
>  #ifndef _LINUX_EVENTFD_H
>  #define _LINUX_EVENTFD_H
> 
> +#include <uapi/linux/eventfd.h>
> +
>  #include <linux/fcntl.h>
>  #include <linux/wait.h>
> 
> -/*
> - * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
> - * new flags, since they might collide with O_* ones. We want
> - * to re-use O_* flags that couldn't possibly have a meaning
> - * from eventfd, in order to leave a free define-space for
> - * shared O_* flags.
> - */
> -#define EFD_SEMAPHORE (1 << 0)
> -#define EFD_CLOEXEC O_CLOEXEC
> -#define EFD_NONBLOCK O_NONBLOCK
> -
> -#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> -
>  struct file;
> 
>  #ifdef CONFIG_EVENTFD
> diff --git a/include/uapi/linux/eventfd.h 
> b/include/uapi/linux/eventfd.h
> new file mode 100644
> index 0000000..98ec034
> --- /dev/null
> +++ b/include/uapi/linux/eventfd.h
> @@ -0,0 +1,39 @@
> +/*
> + *  Copyright (C) 2013 Martin Sustrik <sustrik@250bpm.com>
> + *
> + *  This program is free software; you can redistribute it and/or 
> modify
> + *  it under the terms of the GNU General Public License as published 
> by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + */
> +
> +#ifndef _UAPI_LINUX_EVENTFD_H
> +#define _UAPI_LINUX_EVENTFD_H
> +
> +/* For O_CLOEXEC */
> +#include <linux/fcntl.h>
> +#include <linux/types.h>
> +
> +/*
> + * CAREFUL: Check include/asm-generic/fcntl.h when defining
> + * new flags, since they might collide with O_* ones. We want
> + * to re-use O_* flags that couldn't possibly have a meaning
> + * from eventfd, in order to leave a free define-space for
> + * shared O_* flags.
> + */
> +
> +/* Provide semaphore-like semantics for reads from the eventfd. */
> +#define EFD_SEMAPHORE (1 << 0)
> +/* Provide event mask semantics for the eventfd. */
> +#define EFD_MASK (1 << 1)
> +/*  Set the close-on-exec (FD_CLOEXEC) flag on the eventfd. */
> +#define EFD_CLOEXEC O_CLOEXEC
> +/*  Create the eventfd in non-blocking mode. */
> +#define EFD_NONBLOCK O_NONBLOCK
> +
> +/* Structure to read/write to eventfd in EFD_MASK mode. */
> +struct efd_mask {
> +	__u32 events;
> +};
> +
> +#endif /* _UAPI_LINUX_EVENTFD_H */


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2015-09-16  6:51   ` Martin Sustrik
@ 2015-09-16  7:43     ` Damian Hobson-Garcia
  0 siblings, 0 replies; 13+ messages in thread
From: Damian Hobson-Garcia @ 2015-09-16  7:43 UTC (permalink / raw)
  To: Martin Sustrik
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, netdev, David.Laight

Hi Martin,

On 2015-09-16 3:51 PM, Martin Sustrik wrote:
> On 2015-09-16 08:27, Damian Hobson-Garcia wrote:
>>
>> Additionally, to provide a way to associate user-space state with eventfd
>> object, it allows to attach user-space data to the file descriptor.
> 
> The above paragraph is a leftover from the past. The functionality no
> longer exist.
> 
Oops, I forgot to delete that part. I'll get rid of it.

>>
>> The semantics of EFD_MASK are as follows:
>>
>> eventfd(2):
>>
>> If eventfd is created with EFD_MASK flag set, it is initialised in such a
>> way as to signal no events on the file descriptor when it is polled on.
>> The 'initval' argument is ignored.
>>
>> write(2):
>>
>> User is allowed to write only buffers containing the following structure:
>>
>> struct efd_mask {
>>   uint32_t events;
>> };
> 
> Is it worth having a struct here? Why not just uint32_t?
As it stands right now, no, the struct doesn't really add anything.
uint32_t should be just fine.
> 
> Martin

Damian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-14 22:54 ` Andrew Morton
                     ` (2 preceding siblings ...)
  2013-02-18  8:54   ` Martin Sustrik
@ 2013-02-18 11:57   ` Martin Sustrik
  3 siblings, 0 replies; 13+ messages in thread
From: Martin Sustrik @ 2013-02-18 11:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Sha Zhengju, linux-fsdevel, linux-kernel, netdev,
	Michael Kerrisk, Davide Libenzi, Andy Lutomirski, Eric Wong

On 14/02/13 23:54, Andrew Morton wrote:

> This patch adds userspace interfaces which will require manpage
> updates.  Please Cc Michael and work with him on getting those changes
> completed.

Right. It adds the efd_mask structure. As far as I understand how it 
works is that the actual user-space definition of the structure should 
be provided by glibc (sys/eventfd.h), right?

If so, should the man page be updated now? If so, it may result in the 
situation where the man page describes a structure that's not available 
in the user space yet.

Is there a formal process to do this kind of kernel+glibc+docs changes?

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-14 22:54 ` Andrew Morton
  2013-02-14 23:57   ` Andy Lutomirski
  2013-02-15  3:42   ` Martin Sustrik
@ 2013-02-18  8:54   ` Martin Sustrik
  2013-02-18 11:57   ` Martin Sustrik
  3 siblings, 0 replies; 13+ messages in thread
From: Martin Sustrik @ 2013-02-18  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Sha Zhengju, linux-fsdevel, linux-kernel, netdev,
	Michael Kerrisk, Davide Libenzi, Andy Lutomirski, Eric Wong

On 14/02/13 23:54, Andrew Morton wrote:

>> +/*  On x86-64 keep the same binary layout as on i386. */
>> +#ifdef __x86_64__
>> +#define EVENTFD_MASK_PACKED __packed
>> +#else
>> +#define EVENTFD_MASK_PACKED
>> +#endif
>> +
>> +struct eventfd_mask {
>> +	__u32 events;
>> +	__u64 data;
>> +} EVENTFD_MASK_PACKED;
>
> The x86-64 specific thing is ugly.  I can find no explanation of why it
> was done, but it should go away.  You could make `events' a u64, or
> swap the order of the two fields and make the struct __packed on all
> architectures.
>
> Given that the size of the types is fixed, I see no compat issues here.

I've just copied how the definition is done for epoll_event. The comment 
there goes like this:

/*
  * On x86-64 make the 64bit structure have the same alignment as the
  * 32bit structure. This makes 32bit emulation easier.
  *
  * UML/x86_64 needs the same packing as x86_64
  */

If you still think I should remove the #ifdef, I am happy to do so.

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-15 17:32       ` Andy Lutomirski
@ 2013-02-15 18:37         ` Martin Sustrik
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Sustrik @ 2013-02-15 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Alexander Viro, Sha Zhengju, linux-fsdevel,
	linux-kernel, netdev, Michael Kerrisk, Davide Libenzi, Eric Wong

On 15/02/13 18:32, Andy Lutomirski wrote:
> On Thu, Feb 14, 2013 at 9:24 PM, Andrew Morton
> <akpm@linux-foundation.org>  wrote:
>> On Fri, 15 Feb 2013 04:42:27 +0100 Martin Sustrik<sustrik@250bpm.com>  wrote:
>>
>>>> This is a non-back-compatible userspace interface change.  A procfs
>>>> file which previously displayed
>>>>
>>>>      eventfd-count: nnnn
>>>>
>>>> can now also display
>>>>
>>>>      eventfd-mask: nnnn
>>>>
>>>> So existing userspace could misbehave.
>>>>
>>>> Please fully describe the proposed interface change in the changelog.
>>>> That description should include the full pathname of the procfs file
>>>> and example before-and-after output and a discussion of whether and why
>>>> the risk to existing userspace is acceptable.
>>>
>>> I am not sure what the policy is here. Is not printing out the state of
>>> the object acceptable way to maintain backward compatibility? If not so,
>>> does new type of object require new procfs file, which, AFAIU, is the
>>> only way to retain full backward compatibility?
>>
>> Adding a new file is the only way I can think of to preserve the API.
>> But from Andy's comment is sounds like we don't have to worry a lot
>> about back-compatibility.
>>
>
> I'm not even convinced there's an issue in the first place (other than
> the fact that use of this feature will break old criu, regardless of
> /proc changes).  The fdinfo files already vary by descriptor type.
> Anything that screws up if unexpected fields are present is already
> screwed.

Ok then. I'll leave the relevant code as is.

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-15  5:24     ` Andrew Morton
@ 2013-02-15 17:32       ` Andy Lutomirski
  2013-02-15 18:37         ` Martin Sustrik
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2013-02-15 17:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Sustrik, Alexander Viro, Sha Zhengju, linux-fsdevel,
	linux-kernel, netdev, Michael Kerrisk, Davide Libenzi, Eric Wong

On Thu, Feb 14, 2013 at 9:24 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri, 15 Feb 2013 04:42:27 +0100 Martin Sustrik <sustrik@250bpm.com> wrote:
>
>> > This is a non-back-compatible userspace interface change.  A procfs
>> > file which previously displayed
>> >
>> >     eventfd-count: nnnn
>> >
>> > can now also display
>> >
>> >     eventfd-mask: nnnn
>> >
>> > So existing userspace could misbehave.
>> >
>> > Please fully describe the proposed interface change in the changelog.
>> > That description should include the full pathname of the procfs file
>> > and example before-and-after output and a discussion of whether and why
>> > the risk to existing userspace is acceptable.
>>
>> I am not sure what the policy is here. Is not printing out the state of
>> the object acceptable way to maintain backward compatibility? If not so,
>> does new type of object require new procfs file, which, AFAIU, is the
>> only way to retain full backward compatibility?
>
> Adding a new file is the only way I can think of to preserve the API.
> But from Andy's comment is sounds like we don't have to worry a lot
> about back-compatibility.
>

I'm not even convinced there's an issue in the first place (other than
the fact that use of this feature will break old criu, regardless of
/proc changes).  The fdinfo files already vary by descriptor type.
Anything that screws up if unexpected fields are present is already
screwed.

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-15  3:42   ` Martin Sustrik
@ 2013-02-15  5:24     ` Andrew Morton
  2013-02-15 17:32       ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2013-02-15  5:24 UTC (permalink / raw)
  To: Martin Sustrik
  Cc: Alexander Viro, Sha Zhengju, linux-fsdevel, linux-kernel, netdev,
	Michael Kerrisk, Davide Libenzi, Andy Lutomirski, Eric Wong

On Fri, 15 Feb 2013 04:42:27 +0100 Martin Sustrik <sustrik@250bpm.com> wrote:

> > This is a non-back-compatible userspace interface change.  A procfs
> > file which previously displayed
> >
> > 	eventfd-count: nnnn
> >
> > can now also display
> >
> > 	eventfd-mask: nnnn
> >
> > So existing userspace could misbehave.
> >
> > Please fully describe the proposed interface change in the changelog.
> > That description should include the full pathname of the procfs file
> > and example before-and-after output and a discussion of whether and why
> > the risk to existing userspace is acceptable.
> 
> I am not sure what the policy is here. Is not printing out the state of 
> the object acceptable way to maintain backward compatibility? If not so, 
> does new type of object require new procfs file, which, AFAIU, is the 
> only way to retain full backward compatibility?

Adding a new file is the only way I can think of to preserve the API. 
But from Andy's comment is sounds like we don't have to worry a lot
about back-compatibility.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-14 22:54 ` Andrew Morton
  2013-02-14 23:57   ` Andy Lutomirski
@ 2013-02-15  3:42   ` Martin Sustrik
  2013-02-15  5:24     ` Andrew Morton
  2013-02-18  8:54   ` Martin Sustrik
  2013-02-18 11:57   ` Martin Sustrik
  3 siblings, 1 reply; 13+ messages in thread
From: Martin Sustrik @ 2013-02-15  3:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Sha Zhengju, linux-fsdevel, linux-kernel, netdev,
	Michael Kerrisk, Davide Libenzi, Andy Lutomirski, Eric Wong

Hi Andrew,

Thanks for the detailed code review! I'll have a look at all the 
problems you've pointed out, however, one quick question:

>> -	ret = seq_printf(m, "eventfd-count: %16llx\n",
>> -			 (unsigned long long)ctx->count);
>> +	if (ctx->flags&  EFD_MASK) {
>> +		ret = seq_printf(m, "eventfd-mask: %x\n",
>> +				 (unsigned)ctx->mask.events);
>> +	} else {
>> +		ret = seq_printf(m, "eventfd-count: %16llx\n",
>> +				 (unsigned long long)ctx->count);
>> +	}
>>   	spin_unlock_irq(&ctx->wqh.lock);
>
> This is a non-back-compatible userspace interface change.  A procfs
> file which previously displayed
>
> 	eventfd-count: nnnn
>
> can now also display
>
> 	eventfd-mask: nnnn
>
> So existing userspace could misbehave.
>
> Please fully describe the proposed interface change in the changelog.
> That description should include the full pathname of the procfs file
> and example before-and-after output and a discussion of whether and why
> the risk to existing userspace is acceptable.

I am not sure what the policy is here. Is not printing out the state of 
the object acceptable way to maintain backward compatibility? If not so, 
does new type of object require new procfs file, which, AFAIU, is the 
only way to retain full backward compatibility?

Martin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-14 22:54 ` Andrew Morton
@ 2013-02-14 23:57   ` Andy Lutomirski
  2013-02-15  3:42   ` Martin Sustrik
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2013-02-14 23:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Sustrik, Alexander Viro, Sha Zhengju, linux-fsdevel,
	linux-kernel, netdev, Michael Kerrisk, Davide Libenzi, Eric Wong

On Thu, Feb 14, 2013 at 2:54 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri,  8 Feb 2013 09:11:17 +0100
> Martin Sustrik <sustrik@250bpm.com> wrote:
>
>> When implementing network protocols in user space, one has to implement
>> fake user-space file descriptors to represent the sockets for the protocol.

>
>>       if (count < sizeof(ucnt))
>>               return -EINVAL;
>> @@ -293,8 +351,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>>       int ret;
>>
>>       spin_lock_irq(&ctx->wqh.lock);
>> -     ret = seq_printf(m, "eventfd-count: %16llx\n",
>> -                      (unsigned long long)ctx->count);
>> +     if (ctx->flags & EFD_MASK) {
>> +             ret = seq_printf(m, "eventfd-mask: %x\n",
>> +                              (unsigned)ctx->mask.events);
>> +     } else {
>> +             ret = seq_printf(m, "eventfd-count: %16llx\n",
>> +                              (unsigned long long)ctx->count);
>> +     }
>>       spin_unlock_irq(&ctx->wqh.lock);
>
> This is a non-back-compatible userspace interface change.  A procfs
> file which previously displayed
>
>         eventfd-count: nnnn
>
> can now also display
>
>         eventfd-mask: nnnn
>
> So existing userspace could misbehave.
>
> Please fully describe the proposed interface change in the changelog.
> That description should include the full pathname of the procfs file
> and example before-and-after output and a discussion of whether and why
> the risk to existing userspace is acceptable.

I suspect that the fdinfo stuff is only used by criu, which will need
to be updated regardless.  (If the kernel adopts the policy "don't
break criu", then that may be the end of new kernel features.)

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
  2013-02-08  8:11 Martin Sustrik
@ 2013-02-14 22:54 ` Andrew Morton
  2013-02-14 23:57   ` Andy Lutomirski
                     ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andrew Morton @ 2013-02-14 22:54 UTC (permalink / raw)
  To: Martin Sustrik
  Cc: Alexander Viro, Sha Zhengju, linux-fsdevel, linux-kernel, netdev,
	Michael Kerrisk, Davide Libenzi, Andy Lutomirski, Eric Wong

On Fri,  8 Feb 2013 09:11:17 +0100
Martin Sustrik <sustrik@250bpm.com> wrote:

> When implementing network protocols in user space, one has to implement
> fake user-space file descriptors to represent the sockets for the protocol.
> 
> While all the BSD socket API functionality for such descriptors may be faked as
> well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
> polling  (select, poll, epoll). And unfortunately, sockets that can't be polled
> on allow only for building the simplest possible applications. Basically, you
> can build a simple client, but once you want to implement a server handling
> many sockets in parallel, you are stuck.
> 
> However, to do polling, real system-level file descriptor is needed,
> not a fake one.
> 
> In theory, eventfd may be used for this purpose, except that it is well suited
> only for signaling POLLIN. With some hacking it can be also used to signal
> POLLOUT and POLLERR, but:
> 
> I.  There's no way to signal POLLPRI, POLLHUP etc.
> II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
>     !POLLIN & !POLLOUT, which is a perfectly valid combination for a network
>     protocol (rx buffer is empty and tx buffer is full), cannot be signaled
>     using current implementation of eventfd.
> 
> This patch implements new EFD_MASK flag which attempts to solve this problem.
> 
> Additionally, when implementing network protocols in user space, there's a
> need to associate user-space state with the each "socket". If eventfd object is
> used as a reference to the socket, it should be possible to associate an opaque
> pointer to user-space data with it.
> 
> The semantics of EFD_MASK are as follows:
> 
> eventfd(2):
> 
> If eventfd is created with EFD_MASK flag set, it is initialised in such a way
> as to signal no events on the file descriptor when it is polled on. 'initval'
> argument is ignored.
> 
> write(2):
> 
> User is allowed to write only buffers containing the following structure:
> 
> struct efd_mask {
>   short events;
>   union {
>     void *ptr;
>     uint32_t u32;
>     uint64_t u64;
>   };
> };
> 
> The value of 'events' should be any combination of event flags as defined by
> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
> be signaled when polling (select, poll, epoll) on the eventfd is done later on.
> ptr, u32 and u63 are opaque data that are not interpreted by eventfd object.
> 
> read(2):
> 
> User is allowed to read an efd_mask structure from the eventfd marked by
> EFD_MASK. Returned value shall be the last one written to the eventfd.
> 
> select(2), poll(2) and similar:
> 
> When polling on the eventfd marked by EFD_MASK flag, all the events specified
> in last written 'events' field shall be signaled.
> 
> ...

This patch adds userspace interfaces which will require manpage
updates.  Please Cc Michael and work with him on getting those changes
completed.

>
> +/*  On x86-64 keep the same binary layout as on i386. */
> +#ifdef __x86_64__
> +#define EVENTFD_MASK_PACKED __packed
> +#else
> +#define EVENTFD_MASK_PACKED
> +#endif
> +
> +struct eventfd_mask {
> +	__u32 events;
> +	__u64 data;
> +} EVENTFD_MASK_PACKED;

The x86-64 specific thing is ugly.  I can find no explanation of why it
was done, but it should go away.  You could make `events' a u64, or
swap the order of the two fields and make the struct __packed on all
architectures.

Given that the size of the types is fixed, I see no compat issues here.

As this struct is known by userspace, this definition should appear in
a header which is available to usersapce: include/uapi/...

>  struct eventfd_ctx {
>  	struct kref kref;
>  	wait_queue_head_t wqh;
> -	/*
> -	 * Every time that a write(2) is performed on an eventfd, the
> -	 * value of the __u64 being written is added to "count" and a
> -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> -	 * value to userspace, and will reset "count" to zero. The kernel
> -	 * side eventfd_signal() also, adds to the "count" counter and
> -	 * issue a wakeup.
> -	 */
> -	__u64 count;
> +	union {
> +		/*
> +		 * Every time that a write(2) is performed on an eventfd, the
> +		 * value of the __u64 being written is added to "count" and a
> +		 * wakeup is performed on "wqh". A read(2) will return the
> +		 * "count" value to userspace, and will reset "count" to zero.
> +		 * The kernel side eventfd_signal() also, adds to the "count"
> +		 * counter and issue a wakeup.
> +		 */
> +		__u64 count;
> +		struct eventfd_mask mask;

The nice explanation for `count' was retained, but is it appropriate
that `mask' have no explanation?

> +	};
>  	unsigned int flags;
>  };
>  
> ...
>
> @@ -230,13 +261,23 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
>  	ssize_t res;
>  	__u64 cnt;
>  
> -	if (count < sizeof(cnt))
> -		return -EINVAL;
> -	res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
> -	if (res < 0)
> +	if (ctx->flags & EFD_MASK) {
> +		spin_lock_irq(&ctx->wqh.lock);
> +		if (count < sizeof(ctx->mask))
> +			return -EINVAL;
> +		res = copy_to_user(buf, &ctx->mask, sizeof(ctx->mask)) ?
> +			-EFAULT : sizeof(ctx->mask);
> +		spin_unlock_irq(&ctx->wqh.lock);
>  		return res;

This code is crawling with bugs.

- can return with wqh.lock held -> dead kernel

- performs copy_to_user() under spinlock -> warning spew, kernel
  deadlocks.  It should go via a local temporary, as was done in
  eventfd_write().

This should have filled your screen with warnings when testing.  Either
it wasn't tested or its author forgot to read
Documentation/SubmitChecklist section 12.  Please do so ;)

(otoh maybe might_sleep and lockdep fail to detect copy_*_user under
spinlock when the copy doesn't fault.  If so, that's a big fail)

> -
> -	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
> +	} else {
> +		if (count < sizeof(cnt))
> +			return -EINVAL;
> +		res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
> +		if (res < 0)
> +			return res;
> +		return put_user(cnt, (__u64 __user *) buf) ?
> +			-EFAULT : sizeof(cnt);
> +	}
>  }
>  
>  static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> @@ -246,6 +287,23 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
>  	ssize_t res;
>  	__u64 ucnt;
>  	DECLARE_WAITQUEUE(wait, current);
> +	struct eventfd_mask mask;
> +
> +	if (ctx->flags & EFD_MASK) {
> +		if (count < sizeof(mask))
> +			return -EINVAL;
> +		if (copy_from_user(&mask, buf, sizeof(mask)))
> +			return -EFAULT;
> +		if (mask.events & ~EFD_MASK_VALID_EVENTS)
> +			return -EINVAL;
> +		spin_lock_irq(&ctx->wqh.lock);
> +		memcpy(&ctx->mask, &mask, sizeof(ctx->mask));
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked_poll(&ctx->wqh,
> +				(unsigned long)ctx->mask.events);
> +		spin_unlock_irq(&ctx->wqh.lock);
> +		return sizeof(ctx->mask);
> +	}

`mask' can be made local to this `if' block, which is nicer.

>  	if (count < sizeof(ucnt))
>  		return -EINVAL;
> @@ -293,8 +351,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>  	int ret;
>  
>  	spin_lock_irq(&ctx->wqh.lock);
> -	ret = seq_printf(m, "eventfd-count: %16llx\n",
> -			 (unsigned long long)ctx->count);
> +	if (ctx->flags & EFD_MASK) {
> +		ret = seq_printf(m, "eventfd-mask: %x\n",
> +				 (unsigned)ctx->mask.events);
> +	} else {
> +		ret = seq_printf(m, "eventfd-count: %16llx\n",
> +				 (unsigned long long)ctx->count);
> +	}
>  	spin_unlock_irq(&ctx->wqh.lock);

This is a non-back-compatible userspace interface change.  A procfs
file which previously displayed

	eventfd-count: nnnn

can now also display

	eventfd-mask: nnnn

So existing userspace could misbehave.

Please fully describe the proposed interface change in the changelog. 
That description should include the full pathname of the procfs file
and example before-and-after output and a discussion of whether and why
the risk to existing userspace is acceptable.
 
> @@ -412,7 +475,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>  
>  	kref_init(&ctx->kref);
>  	init_waitqueue_head(&ctx->wqh);
> -	ctx->count = count;
> +	if (flags & EFD_MASK) {
> +		ctx->mask.events = 0;
> +		ctx->mask.data = 0;
> +	} else {
> +		ctx->count = count;
> +	}
>  	ctx->flags = flags;
>  
>  	file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index 3c3ef19..b806d2b 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -20,11 +20,12 @@
>   * shared O_* flags.
>   */
>  #define EFD_SEMAPHORE (1 << 0)
> +#define EFD_MASK (1 << 1)

Does this addition comply with the "CAREFUL:" immediately above it?

It would be best to add code comemntary describing what this constant does.

>  #define EFD_CLOEXEC O_CLOEXEC
>  #define EFD_NONBLOCK O_NONBLOCK
>  
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
>  
>  #ifdef CONFIG_EVENTFD


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
@ 2013-02-08  8:11 Martin Sustrik
  2013-02-14 22:54 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Sustrik @ 2013-02-08  8:11 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Sha Zhengju, linux-fsdevel, linux-kernel
  Cc: netdev, Martin Sustrik

When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
polling  (select, poll, epoll). And unfortunately, sockets that can't be polled
on allow only for building the simplest possible applications. Basically, you
can build a simple client, but once you want to implement a server handling
many sockets in parallel, you are stuck.

However, to do polling, real system-level file descriptor is needed,
not a fake one.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal
POLLOUT and POLLERR, but:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
    !POLLIN & !POLLOUT, which is a perfectly valid combination for a network
    protocol (rx buffer is empty and tx buffer is full), cannot be signaled
    using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  short events;
  union {
    void *ptr;
    uint32_t u32;
    uint64_t u64;
  };
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
ptr, u32 and u63 are opaque data that are not interpreted by eventfd object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik <sustrik@250bpm.com>
---
v2 - Concerns raised by Andy Lutomirski addressed:
- uses fixed size (64 bits) opaque data in eventfd_mask instead of void*
- write returns EINVAL if invalid POLL flags are specified
---
 fs/eventfd.c            |  116 +++++++++++++++++++++++++++++++++++++----------
 include/linux/eventfd.h |    3 +-
 2 files changed, 94 insertions(+), 25 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..bf83f6d87 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2013  Martin Sustrik <sustrik@250bpm.com>
  *
  */
 
@@ -22,18 +23,35 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
+/*  On x86-64 keep the same binary layout as on i386. */
+#ifdef __x86_64__
+#define EVENTFD_MASK_PACKED __packed
+#else
+#define EVENTFD_MASK_PACKED
+#endif
+
+struct eventfd_mask {
+	__u32 events;
+	__u64 data;
+} EVENTFD_MASK_PACKED;
+
 struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
-	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
-	 */
-	__u64 count;
+	union {
+		/*
+		 * Every time that a write(2) is performed on an eventfd, the
+		 * value of the __u64 being written is added to "count" and a
+		 * wakeup is performed on "wqh". A read(2) will return the
+		 * "count" value to userspace, and will reset "count" to zero.
+		 * The kernel side eventfd_signal() also, adds to the "count"
+		 * counter and issue a wakeup.
+		 */
+		__u64 count;
+		struct eventfd_mask mask;
+	};
 	unsigned int flags;
 };
 
@@ -55,6 +73,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 {
 	unsigned long flags;
 
+	/* This function should never be used with eventfd in the mask mode. */
+	BUG_ON(ctx->flags & EFD_MASK);
+
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	if (ULLONG_MAX - ctx->count < n)
 		n = ULLONG_MAX - ctx->count;
@@ -123,12 +144,16 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
 	poll_wait(file, &ctx->wqh, wait);
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
-	if (ctx->count > 0)
-		events |= POLLIN;
-	if (ctx->count == ULLONG_MAX)
-		events |= POLLERR;
-	if (ULLONG_MAX - 1 > ctx->count)
-		events |= POLLOUT;
+	if (ctx->flags & EFD_MASK) {
+		events = ctx->mask.events;
+	} else {
+		if (ctx->count > 0)
+			events |= POLLIN;
+		if (ctx->count == ULLONG_MAX)
+			events |= POLLERR;
+		if (ULLONG_MAX - 1 > ctx->count)
+			events |= POLLOUT;
+	}
 	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
 	return events;
@@ -158,6 +183,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
 {
 	unsigned long flags;
 
+	/* This function should never be used with eventfd in the mask mode. */
+	BUG_ON(ctx->flags & EFD_MASK);
+
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	eventfd_ctx_do_read(ctx, cnt);
 	__remove_wait_queue(&ctx->wqh, wait);
@@ -188,6 +216,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 	ssize_t res;
 	DECLARE_WAITQUEUE(wait, current);
 
+	/* This function should never be used with eventfd in the mask mode. */
+	BUG_ON(ctx->flags & EFD_MASK);
+
 	spin_lock_irq(&ctx->wqh.lock);
 	*cnt = 0;
 	res = -EAGAIN;
@@ -230,13 +261,23 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 	ssize_t res;
 	__u64 cnt;
 
-	if (count < sizeof(cnt))
-		return -EINVAL;
-	res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
-	if (res < 0)
+	if (ctx->flags & EFD_MASK) {
+		spin_lock_irq(&ctx->wqh.lock);
+		if (count < sizeof(ctx->mask))
+			return -EINVAL;
+		res = copy_to_user(buf, &ctx->mask, sizeof(ctx->mask)) ?
+			-EFAULT : sizeof(ctx->mask);
+		spin_unlock_irq(&ctx->wqh.lock);
 		return res;
-
-	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
+	} else {
+		if (count < sizeof(cnt))
+			return -EINVAL;
+		res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
+		if (res < 0)
+			return res;
+		return put_user(cnt, (__u64 __user *) buf) ?
+			-EFAULT : sizeof(cnt);
+	}
 }
 
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
@@ -246,6 +287,23 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	ssize_t res;
 	__u64 ucnt;
 	DECLARE_WAITQUEUE(wait, current);
+	struct eventfd_mask mask;
+
+	if (ctx->flags & EFD_MASK) {
+		if (count < sizeof(mask))
+			return -EINVAL;
+		if (copy_from_user(&mask, buf, sizeof(mask)))
+			return -EFAULT;
+		if (mask.events & ~EFD_MASK_VALID_EVENTS)
+			return -EINVAL;
+		spin_lock_irq(&ctx->wqh.lock);
+		memcpy(&ctx->mask, &mask, sizeof(ctx->mask));
+		if (waitqueue_active(&ctx->wqh))
+			wake_up_locked_poll(&ctx->wqh,
+				(unsigned long)ctx->mask.events);
+		spin_unlock_irq(&ctx->wqh.lock);
+		return sizeof(ctx->mask);
+	}
 
 	if (count < sizeof(ucnt))
 		return -EINVAL;
@@ -293,8 +351,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)
 	int ret;
 
 	spin_lock_irq(&ctx->wqh.lock);
-	ret = seq_printf(m, "eventfd-count: %16llx\n",
-			 (unsigned long long)ctx->count);
+	if (ctx->flags & EFD_MASK) {
+		ret = seq_printf(m, "eventfd-mask: %x\n",
+				 (unsigned)ctx->mask.events);
+	} else {
+		ret = seq_printf(m, "eventfd-count: %16llx\n",
+				 (unsigned long long)ctx->count);
+	}
 	spin_unlock_irq(&ctx->wqh.lock);
 
 	return ret;
@@ -412,7 +475,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)
 
 	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
-	ctx->count = count;
+	if (flags & EFD_MASK) {
+		ctx->mask.events = 0;
+		ctx->mask.data = 0;
+	} else {
+		ctx->count = count;
+	}
 	ctx->flags = flags;
 
 	file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3c3ef19..b806d2b 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -20,11 +20,12 @@
  * shared O_* flags.
  */
 #define EFD_SEMAPHORE (1 << 0)
+#define EFD_MASK (1 << 1)
 #define EFD_CLOEXEC O_CLOEXEC
 #define EFD_NONBLOCK O_NONBLOCK
 
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
 
 #ifdef CONFIG_EVENTFD
 
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-09-16  7:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-16  6:27 [PATCH v2 0/1] Generalize poll events from eventfd Damian Hobson-Garcia
2015-09-16  6:27 ` [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag Damian Hobson-Garcia
2015-09-16  6:51   ` Martin Sustrik
2015-09-16  7:43     ` Damian Hobson-Garcia
  -- strict thread matches above, loose matches on Subject: below --
2013-02-08  8:11 Martin Sustrik
2013-02-14 22:54 ` Andrew Morton
2013-02-14 23:57   ` Andy Lutomirski
2013-02-15  3:42   ` Martin Sustrik
2013-02-15  5:24     ` Andrew Morton
2013-02-15 17:32       ` Andy Lutomirski
2013-02-15 18:37         ` Martin Sustrik
2013-02-18  8:54   ` Martin Sustrik
2013-02-18 11:57   ` Martin Sustrik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).