linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
@ 2019-05-26 14:25 Renzo Davoli
  2019-05-26 20:24 ` kbuild test robot
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Renzo Davoli @ 2019-05-26 14:25 UTC (permalink / raw)
  To: Alexander Viro, Davide Libenzi, linux-fsdevel, linux-kernel; +Cc: linux-api

This patch implements an extension of eventfd to define file descriptors 
whose I/O events can be generated at user level. These file descriptors
trigger notifications for [p]select/[p]poll/epoll.

This feature is useful for user-level implementations of network stacks
or virtual device drivers as libraries.

Development and porting of code often requires to find the way to wait for I/O
events both coming from file descriptors and generated by user-level code (e.g.
user-implemented net stacks or drivers).  While it is possible to provide a
partial support (e.g. using pipes or socketpairs), a clean and complete
solution is still missing (as far as I have seen); e.g. I have not seen any
clean way to generate EPOLLPRI, EPOLLERR, etc.

This proposal is based on a new tag for eventfd2(2): EFD_VPOLL.

This statement:
	fd = eventfd(EPOLLOUT, EFD_VPOLL | EFD_CLOEXEC);
creates a file descriptor for I/O event generation. In this case EPOLLOUT is
initially true.

Likewise all the other eventfs services, read(2) and write(2) use a 8-byte 
integer argument.

read(2) returns the current state of the pending events.

The argument of write(2) is an or-composition of a control command
(EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the
bitmap of events to be added, deleted to the current set of pending events.
EFD_VPOLL_MODEVENTS completely redefines the set of pending events.

e.g.:
	uint64_t request = EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI;
	write(fd, &request, sizeof(request);
adds EPOLLIN and EPOLLPRI to the set of pending events.

These are examples of messages asking for a feature like EFD_VPOLL:
https://stackoverflow.com/questions/909189/simulating-file-descriptor-in-user-space
https://stackoverflow.com/questions/1648147/running-a-simple-tcp-server-with-poll-how-do-i-trigger-events-artificially
... and I need it to write networking and device modules for vuos:
https://github.com/virtualsquare/vuos
(it is the new codebase of ViewOS, see www.virtualsquare.org).

EXAMPLE:
The following program creates an eventfd/EFD_VPOLL file descriptor and then forks
a child process.  While the parent waits for events using epoll_wait the child
generates a sequence of events. When the parent receives an event (or a set of events)
it prints it and disarm it.
The following shell session shows a sample run of the program:
	timeout...
	timeout...
	GOT event 1
	timeout...
	GOT event 1
	timeout...
	GOT event 3
	timeout...
	GOT event 2
	timeout...
	GOT event 4
	timeout...
	GOT event 10

Program source:
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>             /* Definition of uint64_t */

#ifndef EFD_VPOLL
#define EFD_VPOLL (1 << 1)
#define EFD_VPOLL_ADDEVENTS (1UL << 32)
#define EFD_VPOLL_DELEVENTS (2UL << 32)
#define EFD_VPOLL_MODEVENTS (3UL << 32)
#endif

#define handle_error(msg) \
	do { perror(msg); exit(EXIT_FAILURE); } while (0)

static void vpoll_ctl(int fd, uint64_t request) {
	ssize_t s;
	s = write(fd, &request, sizeof(request));
	if (s != sizeof(uint64_t))
		handle_error("write");
}

int
main(int argc, char *argv[])
{
	int efd, epollfd; 
	struct epoll_event ev;
	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI;
	ev.data.u64 = 0;

	efd = eventfd(0, EFD_VPOLL | EFD_CLOEXEC);
	if (efd == -1)
		handle_error("eventfd");
	epollfd = epoll_create1(EPOLL_CLOEXEC);
	if (efd == -1)
		handle_error("epoll_create1");
	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1) 
		handle_error("epoll_ctl");

	switch (fork()) {
		case 0:
			sleep(3);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLPRI);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLOUT);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLHUP);
			exit(EXIT_SUCCESS);
		default:
			while (1) {
				int nfds;
				nfds = epoll_wait(epollfd, &ev, 1, 1000);
				if (nfds < 0)
					handle_error("epoll_wait");
				else if (nfds == 0)
					printf("timeout...\n");
				else {
					printf("GOT event %x\n", ev.events);
					vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
					if (ev.events & EPOLLHUP)
						break;
				}
			}
		case -1:
			handle_error("fork");
	}
	close(epollfd);
	close(efd);
	return 0;
}

Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
---
 fs/eventfd.c                   | 115 +++++++++++++++++++++++++++++++--
 include/linux/eventfd.h        |   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..f83b7d02307e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -3,6 +3,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  EFD_VPOLL support: 2019 Renzo Davoli <renzo@cs.unibo.it>
  *
  */
 
@@ -30,12 +31,24 @@ struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
+	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
+	 *   Every time that a write(2) is performed on an eventfd, the
+	 *   value of the __u64 being written is added to "count" and a
+	 *   wakeup is performed on "wqh". A read(2) will return the "count"
+	 *   value to userspace, and will reset "count" to zero (or decrement
+	 *   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+	 *   side eventfd_signal() also, adds to the "count" counter and
+	 *   issue a wakeup.
+	 *
+	 * If the EFD_VPOLL flag was set at eventfd creation:
+	 *   count is the set of pending EPOLL events.
+	 *   read(2) returns the current value of count.
+	 *   The argument of write(2) is an 8-byte integer:
+	 *   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+	 *   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+	 *   events to be added, deleted to the current set of pending events.
+	 *   (i.e. which bits of "count" must be set or reset).
+	 *   EFD_VPOLL_MODEVENTS redefines the set of pending events.
 	 */
 	__u64 count;
 	unsigned int flags;
@@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }
 
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	__poll_t events = 0;
+	u64 count;
+
+	poll_wait(file, &ctx->wqh, wait);
+
+	count = READ_ONCE(ctx->count);
+
+	events = (count & EPOLLALLMASK);
+
+	return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt = 0;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	res = sizeof(ucnt);
+	ucnt = READ_ONCE(ctx->count);
+	if (put_user(ucnt, (__u64 __user *)buf))
+		return -EFAULT;
+
+	return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	__u32 events;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+		return -EFAULT;
+	spin_lock_irq(&ctx->wqh.lock);
+
+	events = ucnt & EPOLLALLMASK;
+	res = sizeof(ucnt);
+	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
+	case EFD_VPOLL_ADDEVENTS:
+		ctx->count |= events;
+		break;
+	case EFD_VPOLL_DELEVENTS:
+		ctx->count &= ~(events);
+		break;
+	case EFD_VPOLL_MODEVENTS:
+		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
+		break;
+	default:
+		res = -EINVAL;
+	}
+
+	/* wake up waiting threads */
+	if (res >= 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, res);
+
+	spin_unlock_irq(&ctx->wqh.lock);
+
+	return res;
+
+}
+
 #ifdef CONFIG_PROC_FS
 static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
@@ -319,6 +404,17 @@ static const struct file_operations eventfd_fops = {
 	.llseek		= noop_llseek,
 };
 
+static const struct file_operations eventfd_vpoll_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= eventfd_show_fdinfo,
+#endif
+	.release	= eventfd_release,
+	.poll		= eventfd_vpoll_poll,
+	.read		= eventfd_vpoll_read,
+	.write		= eventfd_vpoll_write,
+	.llseek		= noop_llseek,
+};
+
 /**
  * eventfd_fget - Acquire a reference of an eventfd file descriptor.
  * @fd: [in] Eventfd file descriptor.
@@ -391,6 +487,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
 static int do_eventfd(unsigned int count, int flags)
 {
 	struct eventfd_ctx *ctx;
+	const struct file_operations *fops = &eventfd_fops;
 	int fd;
 
 	/* Check the EFD_* constants for consistency.  */
@@ -410,7 +507,11 @@ static int do_eventfd(unsigned int count, int flags)
 	ctx->flags = flags;
 	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
 
-	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
+	if (flags & EFD_VPOLL) {
+		fops = &eventfd_vpoll_fops;
+		ctx->count &= EPOLLALLMASK;
+	}
+	fd = anon_inode_getfd("[eventfd]", fops, ctx,
 			      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
 	if (fd < 0)
 		eventfd_free_ctx(ctx);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index ffcc7724ca21..63258cf29344 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -21,11 +21,16 @@
  * shared O_* flags.
  */
 #define EFD_SEMAPHORE (1 << 0)
+#define EFD_VPOLL (1 << 1)
 #define EFD_CLOEXEC O_CLOEXEC
 #define EFD_NONBLOCK O_NONBLOCK
 
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
+
+#define EFD_VPOLL_ADDEVENTS (1UL << 32)
+#define EFD_VPOLL_DELEVENTS (2UL << 32)
+#define EFD_VPOLL_MODEVENTS (3UL << 32)
 
 struct eventfd_ctx;
 struct file;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..814de6d869c7 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -41,6 +41,8 @@
 #define EPOLLMSG	(__force __poll_t)0x00000400
 #define EPOLLRDHUP	(__force __poll_t)0x00002000
 
+#define EPOLLALLMASK	((__force __poll_t)0x0fffffff)
+
 /* Set exclusive wakeup mode for the target file descriptor */
 #define EPOLLEXCLUSIVE	((__force __poll_t)(1U << 28))
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-26 14:25 [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events Renzo Davoli
@ 2019-05-26 20:24 ` kbuild test robot
  2019-05-26 20:49 ` kbuild test robot
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: kbuild test robot @ 2019-05-26 20:24 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: kbuild-all, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api

[-- Attachment #1: Type: text/plain, Size: 3479 bytes --]

Hi Renzo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.2-rc1 next-20190524]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Renzo-Davoli/eventfd-new-tag-EFD_VPOLL-generate-epoll-events/20190527-023620
config: i386-randconfig-x011-201921 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from fs/eventfd.c:23:0:
   fs/eventfd.c: In function 'eventfd_vpoll_write':
>> include/linux/eventfd.h:31:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_ADDEVENTS (1UL << 32)
                                     ^
>> fs/eventfd.c:360:7: note: in expansion of macro 'EFD_VPOLL_ADDEVENTS'
     case EFD_VPOLL_ADDEVENTS:
          ^~~~~~~~~~~~~~~~~~~
   include/linux/eventfd.h:32:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_DELEVENTS (2UL << 32)
                                     ^
>> fs/eventfd.c:363:7: note: in expansion of macro 'EFD_VPOLL_DELEVENTS'
     case EFD_VPOLL_DELEVENTS:
          ^~~~~~~~~~~~~~~~~~~
>> fs/eventfd.c:363:2: error: duplicate case value
     case EFD_VPOLL_DELEVENTS:
     ^~~~
   fs/eventfd.c:360:2: note: previously used here
     case EFD_VPOLL_ADDEVENTS:
     ^~~~
   In file included from fs/eventfd.c:23:0:
   include/linux/eventfd.h:33:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_MODEVENTS (3UL << 32)
                                     ^
>> fs/eventfd.c:366:7: note: in expansion of macro 'EFD_VPOLL_MODEVENTS'
     case EFD_VPOLL_MODEVENTS:
          ^~~~~~~~~~~~~~~~~~~
   fs/eventfd.c:366:2: error: duplicate case value
     case EFD_VPOLL_MODEVENTS:
     ^~~~
   fs/eventfd.c:360:2: note: previously used here
     case EFD_VPOLL_ADDEVENTS:
     ^~~~

vim +363 fs/eventfd.c

   342	
   343	static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
   344			size_t count, loff_t *ppos)
   345	{
   346		struct eventfd_ctx *ctx = file->private_data;
   347		ssize_t res;
   348		__u64 ucnt;
   349		__u32 events;
   350	
   351		if (count < sizeof(ucnt))
   352			return -EINVAL;
   353		if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
   354			return -EFAULT;
   355		spin_lock_irq(&ctx->wqh.lock);
   356	
   357		events = ucnt & EPOLLALLMASK;
   358		res = sizeof(ucnt);
   359		switch (ucnt & ~((__u64)EPOLLALLMASK)) {
 > 360		case EFD_VPOLL_ADDEVENTS:
   361			ctx->count |= events;
   362			break;
 > 363		case EFD_VPOLL_DELEVENTS:
   364			ctx->count &= ~(events);
   365			break;
 > 366		case EFD_VPOLL_MODEVENTS:
   367			ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
   368			break;
   369		default:
   370			res = -EINVAL;
   371		}
   372	
   373		/* wake up waiting threads */
   374		if (res >= 0 && waitqueue_active(&ctx->wqh))
   375			wake_up_locked_poll(&ctx->wqh, res);
   376	
   377		spin_unlock_irq(&ctx->wqh.lock);
   378	
   379		return res;
   380	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36461 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-26 14:25 [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events Renzo Davoli
  2019-05-26 20:24 ` kbuild test robot
@ 2019-05-26 20:49 ` kbuild test robot
  2019-05-27  3:09 ` kbuild test robot
  2019-05-27  7:33 ` Greg KH
  3 siblings, 0 replies; 12+ messages in thread
From: kbuild test robot @ 2019-05-26 20:49 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: kbuild-all, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api

[-- Attachment #1: Type: text/plain, Size: 3414 bytes --]

Hi Renzo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.2-rc1 next-20190524]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Renzo-Davoli/eventfd-new-tag-EFD_VPOLL-generate-epoll-events/20190527-023620
config: i386-randconfig-l2-201921 (attached as .config)
compiler: gcc-5 (Debian 5.5.0-3) 5.4.1 20171010
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from fs/eventfd.c:23:0:
   fs/eventfd.c: In function 'eventfd_vpoll_write':
   include/linux/eventfd.h:31:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_ADDEVENTS (1UL << 32)
                                     ^
   fs/eventfd.c:360:7: note: in expansion of macro 'EFD_VPOLL_ADDEVENTS'
     case EFD_VPOLL_ADDEVENTS:
          ^
   include/linux/eventfd.h:32:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_DELEVENTS (2UL << 32)
                                     ^
   fs/eventfd.c:363:7: note: in expansion of macro 'EFD_VPOLL_DELEVENTS'
     case EFD_VPOLL_DELEVENTS:
          ^
   fs/eventfd.c:363:2: error: duplicate case value
     case EFD_VPOLL_DELEVENTS:
     ^
>> fs/eventfd.c:360:2: error: previously used here
     case EFD_VPOLL_ADDEVENTS:
     ^
   In file included from fs/eventfd.c:23:0:
   include/linux/eventfd.h:33:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_MODEVENTS (3UL << 32)
                                     ^
   fs/eventfd.c:366:7: note: in expansion of macro 'EFD_VPOLL_MODEVENTS'
     case EFD_VPOLL_MODEVENTS:
          ^
   fs/eventfd.c:366:2: error: duplicate case value
     case EFD_VPOLL_MODEVENTS:
     ^
>> fs/eventfd.c:360:2: error: previously used here
     case EFD_VPOLL_ADDEVENTS:
     ^

vim +360 fs/eventfd.c

   342	
   343	static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
   344			size_t count, loff_t *ppos)
   345	{
   346		struct eventfd_ctx *ctx = file->private_data;
   347		ssize_t res;
   348		__u64 ucnt;
   349		__u32 events;
   350	
   351		if (count < sizeof(ucnt))
   352			return -EINVAL;
   353		if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
   354			return -EFAULT;
   355		spin_lock_irq(&ctx->wqh.lock);
   356	
   357		events = ucnt & EPOLLALLMASK;
   358		res = sizeof(ucnt);
   359		switch (ucnt & ~((__u64)EPOLLALLMASK)) {
 > 360		case EFD_VPOLL_ADDEVENTS:
   361			ctx->count |= events;
   362			break;
 > 363		case EFD_VPOLL_DELEVENTS:
   364			ctx->count &= ~(events);
   365			break;
   366		case EFD_VPOLL_MODEVENTS:
   367			ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
   368			break;
   369		default:
   370			res = -EINVAL;
   371		}
   372	
   373		/* wake up waiting threads */
   374		if (res >= 0 && waitqueue_active(&ctx->wqh))
   375			wake_up_locked_poll(&ctx->wqh, res);
   376	
   377		spin_unlock_irq(&ctx->wqh.lock);
   378	
   379		return res;
   380	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30147 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-26 14:25 [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events Renzo Davoli
  2019-05-26 20:24 ` kbuild test robot
  2019-05-26 20:49 ` kbuild test robot
@ 2019-05-27  3:09 ` kbuild test robot
  2019-05-27  7:33 ` Greg KH
  3 siblings, 0 replies; 12+ messages in thread
From: kbuild test robot @ 2019-05-27  3:09 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: kbuild-all, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api

Hi Renzo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.2-rc2 next-20190524]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Renzo-Davoli/eventfd-new-tag-EFD_VPOLL-generate-epoll-events/20190527-023620
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-rc1-7-g2b96cd8-dirty
        make ARCH=x86_64 allmodconfig
        make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> fs/eventfd.c:321:27: sparse: sparse: restricted __poll_t degrades to integer
>> fs/eventfd.c:321:16: sparse: sparse: incorrect type in assignment (different base types) @@    expected restricted __poll_t [usertype] events @@    got poll_t [usertype] events @@
>> fs/eventfd.c:321:16: sparse:    expected restricted __poll_t [usertype] events
>> fs/eventfd.c:321:16: sparse:    got unsigned long long
   fs/eventfd.c:357:25: sparse: sparse: restricted __poll_t degrades to integer
>> fs/eventfd.c:359:27: sparse: sparse: cast from restricted __poll_t
   fs/eventfd.c:367:44: sparse: sparse: restricted __poll_t degrades to integer
>> fs/eventfd.c:375:17: sparse: sparse: cast to restricted __poll_t
>> fs/eventfd.c:512:28: sparse: sparse: invalid assignment: &=
>> fs/eventfd.c:512:28: sparse:    left side has type unsigned long long
>> fs/eventfd.c:512:28: sparse:    right side has type restricted __poll_t

vim +321 fs/eventfd.c

   310	
   311	static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
   312	{
   313		struct eventfd_ctx *ctx = file->private_data;
   314		__poll_t events = 0;
   315		u64 count;
   316	
   317		poll_wait(file, &ctx->wqh, wait);
   318	
   319		count = READ_ONCE(ctx->count);
   320	
 > 321		events = (count & EPOLLALLMASK);
   322	
   323		return events;
   324	}
   325	
   326	static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
   327			size_t count, loff_t *ppos)
   328	{
   329		struct eventfd_ctx *ctx = file->private_data;
   330		ssize_t res;
   331		__u64 ucnt = 0;
   332	
   333		if (count < sizeof(ucnt))
   334			return -EINVAL;
   335		res = sizeof(ucnt);
   336		ucnt = READ_ONCE(ctx->count);
   337		if (put_user(ucnt, (__u64 __user *)buf))
   338			return -EFAULT;
   339	
   340		return res;
   341	}
   342	
   343	static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
   344			size_t count, loff_t *ppos)
   345	{
   346		struct eventfd_ctx *ctx = file->private_data;
   347		ssize_t res;
   348		__u64 ucnt;
   349		__u32 events;
   350	
   351		if (count < sizeof(ucnt))
   352			return -EINVAL;
   353		if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
   354			return -EFAULT;
   355		spin_lock_irq(&ctx->wqh.lock);
   356	
   357		events = ucnt & EPOLLALLMASK;
   358		res = sizeof(ucnt);
 > 359		switch (ucnt & ~((__u64)EPOLLALLMASK)) {
   360		case EFD_VPOLL_ADDEVENTS:
   361			ctx->count |= events;
   362			break;
   363		case EFD_VPOLL_DELEVENTS:
   364			ctx->count &= ~(events);
   365			break;
   366		case EFD_VPOLL_MODEVENTS:
   367			ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
   368			break;
   369		default:
   370			res = -EINVAL;
   371		}
   372	
   373		/* wake up waiting threads */
   374		if (res >= 0 && waitqueue_active(&ctx->wqh))
 > 375			wake_up_locked_poll(&ctx->wqh, res);
   376	
   377		spin_unlock_irq(&ctx->wqh.lock);
   378	
   379		return res;
   380	
   381	}
   382	
   383	#ifdef CONFIG_PROC_FS
   384	static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
   385	{
   386		struct eventfd_ctx *ctx = f->private_data;
   387	
   388		spin_lock_irq(&ctx->wqh.lock);
   389		seq_printf(m, "eventfd-count: %16llx\n",
   390			   (unsigned long long)ctx->count);
   391		spin_unlock_irq(&ctx->wqh.lock);
   392		seq_printf(m, "eventfd-id: %d\n", ctx->id);
   393	}
   394	#endif
   395	
   396	static const struct file_operations eventfd_fops = {
   397	#ifdef CONFIG_PROC_FS
   398		.show_fdinfo	= eventfd_show_fdinfo,
   399	#endif
   400		.release	= eventfd_release,
   401		.poll		= eventfd_poll,
   402		.read		= eventfd_read,
   403		.write		= eventfd_write,
   404		.llseek		= noop_llseek,
   405	};
   406	
   407	static const struct file_operations eventfd_vpoll_fops = {
   408	#ifdef CONFIG_PROC_FS
   409		.show_fdinfo	= eventfd_show_fdinfo,
   410	#endif
   411		.release	= eventfd_release,
   412		.poll		= eventfd_vpoll_poll,
   413		.read		= eventfd_vpoll_read,
   414		.write		= eventfd_vpoll_write,
   415		.llseek		= noop_llseek,
   416	};
   417	
   418	/**
   419	 * eventfd_fget - Acquire a reference of an eventfd file descriptor.
   420	 * @fd: [in] Eventfd file descriptor.
   421	 *
   422	 * Returns a pointer to the eventfd file structure in case of success, or the
   423	 * following error pointer:
   424	 *
   425	 * -EBADF    : Invalid @fd file descriptor.
   426	 * -EINVAL   : The @fd file descriptor is not an eventfd file.
   427	 */
   428	struct file *eventfd_fget(int fd)
   429	{
   430		struct file *file;
   431	
   432		file = fget(fd);
   433		if (!file)
   434			return ERR_PTR(-EBADF);
   435		if (file->f_op != &eventfd_fops) {
   436			fput(file);
   437			return ERR_PTR(-EINVAL);
   438		}
   439	
   440		return file;
   441	}
   442	EXPORT_SYMBOL_GPL(eventfd_fget);
   443	
   444	/**
   445	 * eventfd_ctx_fdget - Acquires a reference to the internal eventfd context.
   446	 * @fd: [in] Eventfd file descriptor.
   447	 *
   448	 * Returns a pointer to the internal eventfd context, otherwise the error
   449	 * pointers returned by the following functions:
   450	 *
   451	 * eventfd_fget
   452	 */
   453	struct eventfd_ctx *eventfd_ctx_fdget(int fd)
   454	{
   455		struct eventfd_ctx *ctx;
   456		struct fd f = fdget(fd);
   457		if (!f.file)
   458			return ERR_PTR(-EBADF);
   459		ctx = eventfd_ctx_fileget(f.file);
   460		fdput(f);
   461		return ctx;
   462	}
   463	EXPORT_SYMBOL_GPL(eventfd_ctx_fdget);
   464	
   465	/**
   466	 * eventfd_ctx_fileget - Acquires a reference to the internal eventfd context.
   467	 * @file: [in] Eventfd file pointer.
   468	 *
   469	 * Returns a pointer to the internal eventfd context, otherwise the error
   470	 * pointer:
   471	 *
   472	 * -EINVAL   : The @fd file descriptor is not an eventfd file.
   473	 */
   474	struct eventfd_ctx *eventfd_ctx_fileget(struct file *file)
   475	{
   476		struct eventfd_ctx *ctx;
   477	
   478		if (file->f_op != &eventfd_fops)
   479			return ERR_PTR(-EINVAL);
   480	
   481		ctx = file->private_data;
   482		kref_get(&ctx->kref);
   483		return ctx;
   484	}
   485	EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
   486	
   487	static int do_eventfd(unsigned int count, int flags)
   488	{
   489		struct eventfd_ctx *ctx;
   490		const struct file_operations *fops = &eventfd_fops;
   491		int fd;
   492	
   493		/* Check the EFD_* constants for consistency.  */
   494		BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
   495		BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK);
   496	
   497		if (flags & ~EFD_FLAGS_SET)
   498			return -EINVAL;
   499	
   500		ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
   501		if (!ctx)
   502			return -ENOMEM;
   503	
   504		kref_init(&ctx->kref);
   505		init_waitqueue_head(&ctx->wqh);
   506		ctx->count = count;
   507		ctx->flags = flags;
   508		ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
   509	
   510		if (flags & EFD_VPOLL) {
   511			fops = &eventfd_vpoll_fops;
 > 512			ctx->count &= EPOLLALLMASK;
   513		}
   514		fd = anon_inode_getfd("[eventfd]", fops, ctx,
   515				      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
   516		if (fd < 0)
   517			eventfd_free_ctx(ctx);
   518	
   519		return fd;
   520	}
   521	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-26 14:25 [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events Renzo Davoli
                   ` (2 preceding siblings ...)
  2019-05-27  3:09 ` kbuild test robot
@ 2019-05-27  7:33 ` Greg KH
  2019-05-27 13:36   ` Renzo Davoli
  3 siblings, 1 reply; 12+ messages in thread
From: Greg KH @ 2019-05-27  7:33 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: Alexander Viro, Davide Libenzi, linux-fsdevel, linux-kernel, linux-api

On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
> This patch implements an extension of eventfd to define file descriptors 
> whose I/O events can be generated at user level. These file descriptors
> trigger notifications for [p]select/[p]poll/epoll.
> 
> This feature is useful for user-level implementations of network stacks
> or virtual device drivers as libraries.

How can this be used to create a "virtual device driver"?  Do you have
any examples of this new interface being used anywhere?

Also, meta-comment, you should provide some sort of test to kselftests
for your new feature so that it can actually be tested, as well as a man
page update (separately).

> Development and porting of code often requires to find the way to wait for I/O
> events both coming from file descriptors and generated by user-level code (e.g.
> user-implemented net stacks or drivers).  While it is possible to provide a
> partial support (e.g. using pipes or socketpairs), a clean and complete
> solution is still missing (as far as I have seen); e.g. I have not seen any
> clean way to generate EPOLLPRI, EPOLLERR, etc.

What's wrong with pipes or sockets for stuff like this?  Why is epoll
required?

> This proposal is based on a new tag for eventfd2(2): EFD_VPOLL.
> 
> This statement:
> 	fd = eventfd(EPOLLOUT, EFD_VPOLL | EFD_CLOEXEC);
> creates a file descriptor for I/O event generation. In this case EPOLLOUT is
> initially true.
> 
> Likewise all the other eventfs services, read(2) and write(2) use a 8-byte 
> integer argument.
> 
> read(2) returns the current state of the pending events.
> 
> The argument of write(2) is an or-composition of a control command
> (EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the
> bitmap of events to be added, deleted to the current set of pending events.
> EFD_VPOLL_MODEVENTS completely redefines the set of pending events.
> 
> e.g.:
> 	uint64_t request = EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI;
> 	write(fd, &request, sizeof(request);
> adds EPOLLIN and EPOLLPRI to the set of pending events.
> 
> These are examples of messages asking for a feature like EFD_VPOLL:
> https://stackoverflow.com/questions/909189/simulating-file-descriptor-in-user-space
> https://stackoverflow.com/questions/1648147/running-a-simple-tcp-server-with-poll-how-do-i-trigger-events-artificially
> ... and I need it to write networking and device modules for vuos:
> https://github.com/virtualsquare/vuos
> (it is the new codebase of ViewOS, see www.virtualsquare.org).
> 
> EXAMPLE:
> The following program creates an eventfd/EFD_VPOLL file descriptor and then forks
> a child process.  While the parent waits for events using epoll_wait the child
> generates a sequence of events. When the parent receives an event (or a set of events)
> it prints it and disarm it.
> The following shell session shows a sample run of the program:
> 	timeout...
> 	timeout...
> 	GOT event 1
> 	timeout...
> 	GOT event 1
> 	timeout...
> 	GOT event 3
> 	timeout...
> 	GOT event 2
> 	timeout...
> 	GOT event 4
> 	timeout...
> 	GOT event 10
> 
> Program source:
> #include <sys/eventfd.h>
> #include <sys/epoll.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <stdint.h>             /* Definition of uint64_t */
> 
> #ifndef EFD_VPOLL
> #define EFD_VPOLL (1 << 1)
> #define EFD_VPOLL_ADDEVENTS (1UL << 32)
> #define EFD_VPOLL_DELEVENTS (2UL << 32)
> #define EFD_VPOLL_MODEVENTS (3UL << 32)
> #endif
> 
> #define handle_error(msg) \
> 	do { perror(msg); exit(EXIT_FAILURE); } while (0)
> 
> static void vpoll_ctl(int fd, uint64_t request) {
> 	ssize_t s;
> 	s = write(fd, &request, sizeof(request));
> 	if (s != sizeof(uint64_t))
> 		handle_error("write");
> }
> 
> int
> main(int argc, char *argv[])
> {
> 	int efd, epollfd; 
> 	struct epoll_event ev;
> 	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI;
> 	ev.data.u64 = 0;
> 
> 	efd = eventfd(0, EFD_VPOLL | EFD_CLOEXEC);
> 	if (efd == -1)
> 		handle_error("eventfd");
> 	epollfd = epoll_create1(EPOLL_CLOEXEC);
> 	if (efd == -1)
> 		handle_error("epoll_create1");
> 	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1) 
> 		handle_error("epoll_ctl");
> 
> 	switch (fork()) {
> 		case 0:
> 			sleep(3);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLPRI);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLOUT);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLHUP);
> 			exit(EXIT_SUCCESS);
> 		default:
> 			while (1) {
> 				int nfds;
> 				nfds = epoll_wait(epollfd, &ev, 1, 1000);
> 				if (nfds < 0)
> 					handle_error("epoll_wait");
> 				else if (nfds == 0)
> 					printf("timeout...\n");
> 				else {
> 					printf("GOT event %x\n", ev.events);
> 					vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
> 					if (ev.events & EPOLLHUP)
> 						break;
> 				}
> 			}
> 		case -1:
> 			handle_error("fork");
> 	}
> 	close(epollfd);
> 	close(efd);
> 	return 0;
> }
> 
> Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
> ---
>  fs/eventfd.c                   | 115 +++++++++++++++++++++++++++++++--
>  include/linux/eventfd.h        |   7 +-
>  include/uapi/linux/eventpoll.h |   2 +
>  3 files changed, 116 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8aa0ea8c55e8..f83b7d02307e 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -3,6 +3,7 @@
>   *  fs/eventfd.c
>   *
>   *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *  EFD_VPOLL support: 2019 Renzo Davoli <renzo@cs.unibo.it>

No need for this line, that's what the git history shows.

>   *
>   */
>  
> @@ -30,12 +31,24 @@ struct eventfd_ctx {
>  	struct kref kref;
>  	wait_queue_head_t wqh;
>  	/*
> -	 * Every time that a write(2) is performed on an eventfd, the
> -	 * value of the __u64 being written is added to "count" and a
> -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> -	 * value to userspace, and will reset "count" to zero. The kernel
> -	 * side eventfd_signal() also, adds to the "count" counter and
> -	 * issue a wakeup.
> +	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
> +	 *   Every time that a write(2) is performed on an eventfd, the
> +	 *   value of the __u64 being written is added to "count" and a
> +	 *   wakeup is performed on "wqh". A read(2) will return the "count"
> +	 *   value to userspace, and will reset "count" to zero (or decrement
> +	 *   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
> +	 *   side eventfd_signal() also, adds to the "count" counter and
> +	 *   issue a wakeup.
> +	 *
> +	 * If the EFD_VPOLL flag was set at eventfd creation:
> +	 *   count is the set of pending EPOLL events.
> +	 *   read(2) returns the current value of count.
> +	 *   The argument of write(2) is an 8-byte integer:
> +	 *   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
> +	 *   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
> +	 *   events to be added, deleted to the current set of pending events.
> +	 *   (i.e. which bits of "count" must be set or reset).
> +	 *   EFD_VPOLL_MODEVENTS redefines the set of pending events.

Ugh, overloading stuff, this is increased complexity, do you _have_ to
do it this way?

>  	 */
>  	__u64 count;
>  	unsigned int flags;
> @@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
>  	return res;
>  }
>  
> +static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	__poll_t events = 0;
> +	u64 count;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +
> +	count = READ_ONCE(ctx->count);
> +
> +	events = (count & EPOLLALLMASK);

Why mask?

> +
> +	return events;
> +}
> +
> +static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
> +		size_t count, loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt = 0;
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;

What is magic about the size of a __u64 here?

> +	res = sizeof(ucnt);
> +	ucnt = READ_ONCE(ctx->count);
> +	if (put_user(ucnt, (__u64 __user *)buf))
> +		return -EFAULT;
> +
> +	return res;
> +}
> +
> +static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
> +		size_t count, loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	__u32 events;
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;

Why can it not be less than 64?

> +	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
> +		return -EFAULT;
> +	spin_lock_irq(&ctx->wqh.lock);
> +
> +	events = ucnt & EPOLLALLMASK;
> +	res = sizeof(ucnt);
> +	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
> +	case EFD_VPOLL_ADDEVENTS:
> +		ctx->count |= events;
> +		break;
> +	case EFD_VPOLL_DELEVENTS:
> +		ctx->count &= ~(events);
> +		break;
> +	case EFD_VPOLL_MODEVENTS:
> +		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
> +		break;
> +	default:
> +		res = -EINVAL;
> +	}
> +
> +	/* wake up waiting threads */
> +	if (res >= 0 && waitqueue_active(&ctx->wqh))
> +		wake_up_locked_poll(&ctx->wqh, res);

Can you call this with a spinlock held?  I really don't remember, sorry,
if so, nevermind, but you should check...

> +
> +	spin_unlock_irq(&ctx->wqh.lock);
> +
> +	return res;
> +
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>  {
> @@ -319,6 +404,17 @@ static const struct file_operations eventfd_fops = {
>  	.llseek		= noop_llseek,
>  };
>  
> +static const struct file_operations eventfd_vpoll_fops = {
> +#ifdef CONFIG_PROC_FS
> +	.show_fdinfo	= eventfd_show_fdinfo,
> +#endif
> +	.release	= eventfd_release,
> +	.poll		= eventfd_vpoll_poll,
> +	.read		= eventfd_vpoll_read,
> +	.write		= eventfd_vpoll_write,
> +	.llseek		= noop_llseek,
> +};
> +
>  /**
>   * eventfd_fget - Acquire a reference of an eventfd file descriptor.
>   * @fd: [in] Eventfd file descriptor.
> @@ -391,6 +487,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
>  static int do_eventfd(unsigned int count, int flags)
>  {
>  	struct eventfd_ctx *ctx;
> +	const struct file_operations *fops = &eventfd_fops;
>  	int fd;
>  
>  	/* Check the EFD_* constants for consistency.  */
> @@ -410,7 +507,11 @@ static int do_eventfd(unsigned int count, int flags)
>  	ctx->flags = flags;
>  	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
>  
> -	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
> +	if (flags & EFD_VPOLL) {
> +		fops = &eventfd_vpoll_fops;
> +		ctx->count &= EPOLLALLMASK;
> +	}
> +	fd = anon_inode_getfd("[eventfd]", fops, ctx,
>  			      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
>  	if (fd < 0)
>  		eventfd_free_ctx(ctx);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index ffcc7724ca21..63258cf29344 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -21,11 +21,16 @@
>   * shared O_* flags.
>   */
>  #define EFD_SEMAPHORE (1 << 0)
> +#define EFD_VPOLL (1 << 1)

BIT(1)?

>  #define EFD_CLOEXEC O_CLOEXEC
>  #define EFD_NONBLOCK O_NONBLOCK
>  
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
> +
> +#define EFD_VPOLL_ADDEVENTS (1UL << 32)
> +#define EFD_VPOLL_DELEVENTS (2UL << 32)
> +#define EFD_VPOLL_MODEVENTS (3UL << 32)

Aren't these part of the uapi?  Why are they hidden in here?

>  
>  struct eventfd_ctx;
>  struct file;
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index 8a3432d0f0dc..814de6d869c7 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -41,6 +41,8 @@
>  #define EPOLLMSG	(__force __poll_t)0x00000400
>  #define EPOLLRDHUP	(__force __poll_t)0x00002000
>  
> +#define EPOLLALLMASK	((__force __poll_t)0x0fffffff)

Why is this part of the uapi?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-27  7:33 ` Greg KH
@ 2019-05-27 13:36   ` Renzo Davoli
  2019-05-31  9:34     ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: Renzo Davoli @ 2019-05-27 13:36 UTC (permalink / raw)
  To: Greg KH
  Cc: Alexander Viro, Davide Libenzi, linux-fsdevel, linux-kernel, linux-api

On Mon, May 27, 2019 at 09:33:32AM +0200, Greg KH wrote:
> On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
> > This patch implements an extension of eventfd to define file descriptors 
> > whose I/O events can be generated at user level. These file descriptors
> > trigger notifications for [p]select/[p]poll/epoll.
> > 
> > This feature is useful for user-level implementations of network stacks
> > or virtual device drivers as libraries.
> 
> How can this be used to create a "virtual device driver"?  Do you have
> any examples of this new interface being used anywhere?

Networking programs use system calls implementing the Berkeley sockets API:
socket, accept, connect, listen, recv*, send* etc.  Programs dealing with a
device use system calls like open, read, write, ioctl etc.

When somebody wants to write a library able to behave like a network stack (say
lwipv6, picotcp) or a device, they can implement functions like my_socket,
my_accept, my_open or my_ioctl, as drop-in replacement of their system
call counterpart.  (It is also possible to use dynamic library magic to
rename/divert the system call requests to use their 'virtual'
implementation provided by the library: socket maps to my_socket, recv
to my_recv etc).

In this way portability and compatibility is easier, using a well known API
instead of inventing new ones.

Unfortunately this approach cannot be applied to
poll/select/ppoll/pselect/epoll.  These system calls can refer at the same time
to file descriptors created by 'real' system calls like socket, open, signalfd... 
and to file descriptors returned by my_open, your_socket.

> 
> Also, meta-comment, you should provide some sort of test to kselftests
> for your new feature so that it can actually be tested, as well as a man
> page update (separately).
Sure. I'll do it ASAP, let me collect suggestions first.

> 
> > Development and porting of code often requires to find the way to wait for I/O
> > events both coming from file descriptors and generated by user-level code (e.g.
> > user-implemented net stacks or drivers).  While it is possible to provide a
> > partial support (e.g. using pipes or socketpairs), a clean and complete
> > solution is still missing (as far as I have seen); e.g. I have not seen any
> > clean way to generate EPOLLPRI, EPOLLERR, etc.
> 
> What's wrong with pipes or sockets for stuff like this?  Why is epoll
> required?
Example:
suppose there is an application waiting for a TCP OOB message. It uses poll to wait 
for POLLPRI and then reads the message (e.g. by 'recv').
If I want to port that application to use a network stack implemented as a library
I have to rewrite the code about 'poll' as it is not possible to receive a POLLPRI.
From a pipe I can just receive a POLLIN, I have to encode in an external data structure
any further information.
Using EFD_VPOLL the solution is straightforward: the function mysocket (used in place
of socket to create a file descripor behaving as a 'real'socket) returns a file
descriptor created by eventfd/EFD_VPOLL, so the poll system call can be left
unmodified in the code. When the OOB message is available the library can trigger
an EPOLLPRI and the message can be received using my_recv.

> 
...omissis...
> > 
> > Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
> > ---
> >  fs/eventfd.c                   | 115 +++++++++++++++++++++++++++++++--
> >  include/linux/eventfd.h        |   7 +-
> >  include/uapi/linux/eventpoll.h |   2 +
> >  3 files changed, 116 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index 8aa0ea8c55e8..f83b7d02307e 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -3,6 +3,7 @@
> >   *  fs/eventfd.c
> >   *
> >   *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> > + *  EFD_VPOLL support: 2019 Renzo Davoli <renzo@cs.unibo.it>
> 
> No need for this line, that's what the git history shows.
okay

> 
> >   *
> >   */
> >  
> > @@ -30,12 +31,24 @@ struct eventfd_ctx {
> >  	struct kref kref;
> >  	wait_queue_head_t wqh;
> >  	/*
> > -	 * Every time that a write(2) is performed on an eventfd, the
> > -	 * value of the __u64 being written is added to "count" and a
> > -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> > -	 * value to userspace, and will reset "count" to zero. The kernel
> > -	 * side eventfd_signal() also, adds to the "count" counter and
> > -	 * issue a wakeup.
> > +	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
> > +	 *   Every time that a write(2) is performed on an eventfd, the
> > +	 *   value of the __u64 being written is added to "count" and a
> > +	 *   wakeup is performed on "wqh". A read(2) will return the "count"
> > +	 *   value to userspace, and will reset "count" to zero (or decrement
> > +	 *   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
> > +	 *   side eventfd_signal() also, adds to the "count" counter and
> > +	 *   issue a wakeup.
> > +	 *
> > +	 * If the EFD_VPOLL flag was set at eventfd creation:
> > +	 *   count is the set of pending EPOLL events.
> > +	 *   read(2) returns the current value of count.
> > +	 *   The argument of write(2) is an 8-byte integer:
> > +	 *   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
> > +	 *   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
> > +	 *   events to be added, deleted to the current set of pending events.
> > +	 *   (i.e. which bits of "count" must be set or reset).
> > +	 *   EFD_VPOLL_MODEVENTS redefines the set of pending events.
> 
> Ugh, overloading stuff, this is increased complexity, do you _have_ to
> do it this way?
There can be other approaches: e.g. two specific new system calls like "vpollfd_create" and "vpollfd_ctl".
Their signature could be:
  int vpollfd_create(unsigned int init_events, int flags);
  where flags are the usual NONBLOCK/CLOEXEC
  int vpollfd_ctl(int fd, int op, unsigned int events);
  where op can be VPOLL_ADDEVENTS, VPOLL_DELEVENTS, VPOLL_MODEVENTS

It possible to reimplement the patch this way. It needs the definition of the new system calls.
I am proposing just a new tag for eventfd as eventfd purpose is conceptually close to the new feature.
Eventfd creates a file descriptor which generates events. The default eventfd mode uses counters while
EFD_VPOLL uses event flags.  The new feature can be implemented on eventfd with a very limited
impact on the kernel core code.
Instead of syscalls, the vpollfd_create/vpollfd_ctl API could be provided by the glibc as (very simple) 
library functions, as it is the case for eventfd_read/eventfd_write in /usr/include/sys/eventfd.h

It seems to me that EFD_VPOLL is just a different MODE of the same system call, not a real overloading.
Something similar happened to seccomp: the original implementation by Arcangeli is now
MODE_STRICT, the MODE_FILTER has been added in a second time as an extension.

> 
> >  	 */
> >  	__u64 count;
> >  	unsigned int flags;
> > @@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
> >  	return res;
> >  }
> >  
> > +static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	__poll_t events = 0;
> > +	u64 count;
> > +
> > +	poll_wait(file, &ctx->wqh, wait);
> > +
> > +	count = READ_ONCE(ctx->count);
> > +
> > +	events = (count & EPOLLALLMASK);
> 
> Why mask?
Because the four most significant bits are not events but policy modifiers (mainly for
		multithreading support):
#define EPOLLEXCLUSIVE  ((__force __poll_t)(1U << 28))
#define EPOLLWAKEUP ((__force __poll_t)(1U << 29))
#define EPOLLONESHOT  ((__force __poll_t)(1U << 30))
#define EPOLLET   ((__force __poll_t)(1U << 31))
a file_operations poll function implementation should never generate these, isn't it?
> 
> > +
> > +	return events;
> > +}
> > +
> > +static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
> > +		size_t count, loff_t *ppos)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	ssize_t res;
> > +	__u64 ucnt = 0;
> > +
> > +	if (count < sizeof(ucnt))
> > +		return -EINVAL;
> 
> What is magic about the size of a __u64 here?
Just for consistency with the non EFD_VPOLL eventfd.

> 
> > +	res = sizeof(ucnt);
> > +	ucnt = READ_ONCE(ctx->count);
> > +	if (put_user(ucnt, (__u64 __user *)buf))
> > +		return -EFAULT;
> > +
> > +	return res;
> > +}
> > +
> > +static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
> > +		size_t count, loff_t *ppos)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	ssize_t res;
> > +	__u64 ucnt;
> > +	__u32 events;
> > +
> > +	if (count < sizeof(ucnt))
> > +		return -EINVAL;
> 
> Why can it not be less than 64?
This is the imeplementation of 'write'. The 64 bits include the 'command'
EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the most
significant 32 bits) and the set of events (in the lowest 32 bits).

> 
> > +	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
> > +		return -EFAULT;
> > +	spin_lock_irq(&ctx->wqh.lock);
> > +
> > +	events = ucnt & EPOLLALLMASK;
> > +	res = sizeof(ucnt);
> > +	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
> > +	case EFD_VPOLL_ADDEVENTS:
> > +		ctx->count |= events;
> > +		break;
> > +	case EFD_VPOLL_DELEVENTS:
> > +		ctx->count &= ~(events);
> > +		break;
> > +	case EFD_VPOLL_MODEVENTS:
> > +		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
> > +		break;
> > +	default:
> > +		res = -EINVAL;
> > +	}
> > +
> > +	/* wake up waiting threads */
> > +	if (res >= 0 && waitqueue_active(&ctx->wqh))
> > +		wake_up_locked_poll(&ctx->wqh, res);
> 
> Can you call this with a spinlock held?  I really don't remember, sorry,
> if so, nevermind, but you should check...

I would have done the same objection. eventfd_vpoll_write uses the same pattern
of code of eventfd_write, the difference is in the cause to unblock processes:
counter values vs. events. So this is as correct as eventfd_write is, isn't it? :)

> 
> > +
> > +	spin_unlock_irq(&ctx->wqh.lock);
> > +
> > +	return res;
> > +
> > +}
> > +
> >  #ifdef CONFIG_PROC_FS
> >  static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
> >  {
> > @@ -319,6 +404,17 @@ static const struct file_operations eventfd_fops = {
> >  	.llseek		= noop_llseek,
> >  };
> >  
> > +static const struct file_operations eventfd_vpoll_fops = {
> > +#ifdef CONFIG_PROC_FS
> > +	.show_fdinfo	= eventfd_show_fdinfo,
> > +#endif
> > +	.release	= eventfd_release,
> > +	.poll		= eventfd_vpoll_poll,
> > +	.read		= eventfd_vpoll_read,
> > +	.write		= eventfd_vpoll_write,
> > +	.llseek		= noop_llseek,
> > +};
> > +
> >  /**
> >   * eventfd_fget - Acquire a reference of an eventfd file descriptor.
> >   * @fd: [in] Eventfd file descriptor.
> > @@ -391,6 +487,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
> >  static int do_eventfd(unsigned int count, int flags)
> >  {
> >  	struct eventfd_ctx *ctx;
> > +	const struct file_operations *fops = &eventfd_fops;
> >  	int fd;
> >  
> >  	/* Check the EFD_* constants for consistency.  */
> > @@ -410,7 +507,11 @@ static int do_eventfd(unsigned int count, int flags)
> >  	ctx->flags = flags;
> >  	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
> >  
> > -	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
> > +	if (flags & EFD_VPOLL) {
> > +		fops = &eventfd_vpoll_fops;
> > +		ctx->count &= EPOLLALLMASK;
> > +	}
> > +	fd = anon_inode_getfd("[eventfd]", fops, ctx,
> >  			      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
> >  	if (fd < 0)
> >  		eventfd_free_ctx(ctx);
> > diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> > index ffcc7724ca21..63258cf29344 100644
> > --- a/include/linux/eventfd.h
> > +++ b/include/linux/eventfd.h
> > @@ -21,11 +21,16 @@
> >   * shared O_* flags.
> >   */
> >  #define EFD_SEMAPHORE (1 << 0)
> > +#define EFD_VPOLL (1 << 1)
> 
> BIT(1)?
It is for the sake of consistency with the rest of the header file,
I can change also the line above:
	#define EFD_SEMAPHORE BIT(0)
	#define EFD_VPOLL BIT(1)
The value has been chosen to fullfil the request written just above the patched statement:
/*
 * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
 * new flags, since they might collide with O_* ones. We want
 * to re-use O_* flags that couldn't possibly have a meaning
 * from eventfd, in order to leave a free define-space for
 * shared O_* flags.
 */
EFD_SEMAPHORE uses the same value of O_WRONLY, I decided to use the same value of O_RDWR.
Both should not have a meaning from eventfd.

> 
> >  #define EFD_CLOEXEC O_CLOEXEC
> >  #define EFD_NONBLOCK O_NONBLOCK
> >  
> >  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> > -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> > +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
> > +
> > +#define EFD_VPOLL_ADDEVENTS (1UL << 32)
> > +#define EFD_VPOLL_DELEVENTS (2UL << 32)
> > +#define EFD_VPOLL_MODEVENTS (3UL << 32)
> 
> Aren't these part of the uapi?  Why are they hidden in here?

You are right. There is not a uapi/linux/eventfd.h. EFD_SEMAPHORE is here, so for
consistency I added EFD_VPOLL/EFD_VPOLL_* here, too. (glibc provides the value of EFD_SEMAPHORE
in /usr/include/bits/eventfd.h).

> 
> >  
> >  struct eventfd_ctx;
> >  struct file;
> > diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> > index 8a3432d0f0dc..814de6d869c7 100644
> > --- a/include/uapi/linux/eventpoll.h
> > +++ b/include/uapi/linux/eventpoll.h
> > @@ -41,6 +41,8 @@
> >  #define EPOLLMSG	(__force __poll_t)0x00000400
> >  #define EPOLLRDHUP	(__force __poll_t)0x00002000
> >  
> > +#define EPOLLALLMASK	((__force __poll_t)0x0fffffff)
> 
> Why is this part of the uapi?
It can be useful. As I have explained above it permits to split betweeen
event bits and behavioral flags in poll_t.

> thanks,
> greg k-h

many thanks to you
  renzo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-27 13:36   ` Renzo Davoli
@ 2019-05-31  9:34     ` Roman Penyaev
  2019-05-31 10:45       ` Renzo Davoli
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-05-31  9:34 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner

Hi Renzo,

On 2019-05-27 15:36, Renzo Davoli wrote:
> On Mon, May 27, 2019 at 09:33:32AM +0200, Greg KH wrote:
>> On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
>> > This patch implements an extension of eventfd to define file descriptors
>> > whose I/O events can be generated at user level. These file descriptors
>> > trigger notifications for [p]select/[p]poll/epoll.
>> >
>> > This feature is useful for user-level implementations of network stacks
>> > or virtual device drivers as libraries.
>> 
>> How can this be used to create a "virtual device driver"?  Do you have
>> any examples of this new interface being used anywhere?
> 
> Networking programs use system calls implementing the Berkeley sockets 
> API:
> socket, accept, connect, listen, recv*, send* etc.  Programs dealing 
> with a
> device use system calls like open, read, write, ioctl etc.
> 
> When somebody wants to write a library able to behave like a network 
> stack (say
> lwipv6, picotcp) or a device, they can implement functions like 
> my_socket,
> my_accept, my_open or my_ioctl, as drop-in replacement of their system
> call counterpart.  (It is also possible to use dynamic library magic to
> rename/divert the system call requests to use their 'virtual'
> implementation provided by the library: socket maps to my_socket, recv
> to my_recv etc).
> 
> In this way portability and compatibility is easier, using a well known 
> API
> instead of inventing new ones.
> 
> Unfortunately this approach cannot be applied to
> poll/select/ppoll/pselect/epoll.

If you have to override other systemcalls, what is the problem to 
override
poll family?  It will add, let's say, 50 extra code lines complexity to 
your
userspace code.  All you need is to be woken up by *any* event and check
one mask variable, in order to understand what you need to do: read or 
write,
basically exactly what you do in your eventfd modification, but only in
userspace.


>> Why can it not be less than 64?
> This is the imeplementation of 'write'. The 64 bits include the 
> 'command'
> EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the 
> most
> significant 32 bits) and the set of events (in the lowest 32 bits).

Do you really need add/del/mod semantics?  Userspace still has to keep 
mask
somewhere, so you can have one simple command, which does:

    ctx->count = events;

in kernel, so no masks and this games with bits are needed.  That will
simplify API.

--
Roman


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-31  9:34     ` Roman Penyaev
@ 2019-05-31 10:45       ` Renzo Davoli
  2019-05-31 11:48         ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: Renzo Davoli @ 2019-05-31 10:45 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner

HI Roman,

On Fri, May 31, 2019 at 11:34:08AM +0200, Roman Penyaev wrote:
> On 2019-05-27 15:36, Renzo Davoli wrote:
> > Unfortunately this approach cannot be applied to
> > poll/select/ppoll/pselect/epoll.
> 
> If you have to override other systemcalls, what is the problem to override
> poll family?  It will add, let's say, 50 extra code lines complexity to your
> userspace code.  All you need is to be woken up by *any* event and check
> one mask variable, in order to understand what you need to do: read or
> write,
> basically exactly what you do in your eventfd modification, but only in
> userspace.

This approach would not scale. If I want to use both a (user-space) network stack
and a (emulated) device (or more stacks and devices) which (overridden) poll would I use?

The poll of the first stack is not able to to deal with the third device.

> 
> 
> > > Why can it not be less than 64?
> > This is the imeplementation of 'write'. The 64 bits include the
> > 'command'
> > EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the
> > most
> > significant 32 bits) and the set of events (in the lowest 32 bits).
> 
> Do you really need add/del/mod semantics?  Userspace still has to keep mask
> somewhere, so you can have one simple command, which does:
>    ctx->count = events;
> in kernel, so no masks and this games with bits are needed.  That will
> simplify API.

It is true, at the price to have more complex code in user space.
Other system calls could have beeen implemented as "set the value", instead there are
ADD/DEL modification flags.
I mean for example sigprocmask (SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK), or even epoll_ctl.
While poll requires the program to keep the struct pollfd array stored somewhere,
epoll is more powerful and flexible as different file descriptors can be added
and deleted by different modules/components.

If I have two threads implementing the send and receive path of a socket in a user-space
network stack implementation the epoll pending bitmap is shared so I have to create
critical sections like the following one any time I need to set or reset a bit.
	pthread_mutex_lock(mylock)
	events |= EPOLLIN
	write(efd, &events, sizeof(events));
	pthread_mutex_unlock(mylock)
Using add/del semantics locking is not required as the send path thread deals with EPOLLOUT while
its siblings receive thread uses EPOLLIN or EPOLLPRI

I would prefer the add/del/mod semantics, but if this is generally perceived as a unnecessary 
complexity in the kernel code I can update my patch.  

Thank you Roman,
			
			renzo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-31 10:45       ` Renzo Davoli
@ 2019-05-31 11:48         ` Roman Penyaev
  2019-06-03 15:00           ` Renzo Davoli
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-05-31 11:48 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner

On 2019-05-31 12:45, Renzo Davoli wrote:
> HI Roman,
> 
> On Fri, May 31, 2019 at 11:34:08AM +0200, Roman Penyaev wrote:
>> On 2019-05-27 15:36, Renzo Davoli wrote:
>> > Unfortunately this approach cannot be applied to
>> > poll/select/ppoll/pselect/epoll.
>> 
>> If you have to override other systemcalls, what is the problem to 
>> override
>> poll family?  It will add, let's say, 50 extra code lines complexity 
>> to your
>> userspace code.  All you need is to be woken up by *any* event and 
>> check
>> one mask variable, in order to understand what you need to do: read or
>> write,
>> basically exactly what you do in your eventfd modification, but only 
>> in
>> userspace.
> 
> This approach would not scale. If I want to use both a (user-space)
> network stack
> and a (emulated) device (or more stacks and devices) which
> (overridden) poll would I use?
> 
> The poll of the first stack is not able to to deal with the third 
> device.

Since each such a stack has a set of read/write/etc functions you always
can extend you stack with another call which returns you event mask,
specifying what exactly you have to do, e.g.:

     nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
     for (n = 0; n < nfds; ++n) {
          struct sock *sock;

          sock = events[n].data.ptr;
          events = sock->get_events(sock, &events[n]);

          if (events & EPOLLIN)
              sock->read(sock);
          if (events & EPOLLOUT)
              sock->write(sock);
     }


With such a virtual table you can mix all userspace stacks and even
with normal sockets, for which 'get_events' function can be declared as

static poll_t kernel_sock_get_events(struct sock *sock, struct 
epoll_event *ev)
{
     return ev->events;
}

Do I miss something?


>> > > Why can it not be less than 64?
>> > This is the imeplementation of 'write'. The 64 bits include the
>> > 'command'
>> > EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the
>> > most
>> > significant 32 bits) and the set of events (in the lowest 32 bits).
>> 
>> Do you really need add/del/mod semantics?  Userspace still has to keep 
>> mask
>> somewhere, so you can have one simple command, which does:
>>    ctx->count = events;
>> in kernel, so no masks and this games with bits are needed.  That will
>> simplify API.
> 
> It is true, at the price to have more complex code in user space.
> Other system calls could have beeen implemented as "set the value",
> instead there are
> ADD/DEL modification flags.
> I mean for example sigprocmask (SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK),
> or even epoll_ctl.
> While poll requires the program to keep the struct pollfd array stored
> somewhere,
> epoll is more powerful and flexible as different file descriptors can 
> be added
> and deleted by different modules/components.
> 
> If I have two threads implementing the send and receive path of a
> socket in a user-space

Eventually you come up with such a lock to protect your tcp or whatever
state machine.  Or you have a real example where read and write paths
can work completely independently?

--
Roman

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-05-31 11:48         ` Roman Penyaev
@ 2019-06-03 15:00           ` Renzo Davoli
  2019-06-06 20:11             ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: Renzo Davoli @ 2019-06-03 15:00 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner

Hi Roman,

	 I sorry for the delay in my answer, but I needed to set up a minimal
tutorial to show what I am working on and why I need a feature like the
one I am proposing.

Please, have a look of the README.md page here:
https://github.com/virtualsquare/vuos
(everything can be downloaded and tested)

On Fri, May 31, 2019 at 01:48:39PM +0200, Roman Penyaev wrote:
> Since each such a stack has a set of read/write/etc functions you always
> can extend you stack with another call which returns you event mask,
> specifying what exactly you have to do, e.g.:
> 
>     nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
>     for (n = 0; n < nfds; ++n) {
>          struct sock *sock;
> 
>          sock = events[n].data.ptr;
>          events = sock->get_events(sock, &events[n]);
> 
>          if (events & EPOLLIN)
>              sock->read(sock);
>          if (events & EPOLLOUT)
>              sock->write(sock);
>     }
> 
> 
> With such a virtual table you can mix all userspace stacks and even
> with normal sockets, for which 'get_events' function can be declared as
> 
> static poll_t kernel_sock_get_events(struct sock *sock, struct epoll_event
> *ev)
> {
>     return ev->events;
> }
> 
> Do I miss something?

I am not trying to port some tools to use user-space implemented stacks or device
drivers/emulators, I am seeking to a general purpose approach.

I think that the example in the section of the README "mount a user-level 
networking stack" explains the situation.

The submodule vunetvdestack uses a namespace to define a networking stack connected
to a VDE network (see https://github.com/rd235/vdeplug4).

The API is clean (as it can be seen at the end of the file vunet_modules/vunetvdestack.c).
All the methods but "socket" are directly mapped to their system call counterparts:

struct vunet_operations vunet_ops = {
  .socket = vdestack_socket,
  .bind = bind,
  .connect = connect,
  .listen = listen,
  .accept4 = accept4,
....
	.epoll_ctl = epoll_ctl,
...
}

(the elegance of the API can be seen also in vunet_modules/vunetreal.c: a 38 lines module
 implementing a gateway to the real networking of the hosting machine)

Unfortunately I cannot use the same clean interface to support user-library implemented
stacks like lwip/lwipv6/picotcp because I cannot generate EPOLL events...

Bizantine workarounds based on data structures exchanged in the data.ptr field of epoll_event
that must be decoded by the hypervisor to retrieve the missing information about the event
can be implemented... but it would be a pity ;-)

The same problem arises in umdev modules: virtual devices should generate the same
EPOLL events of their real couterparts.

I feel that the ability to generate/synthesize EPOLL events could be useful for many projects.
(In my first message I included some URLs of people seeking for this feature, retrieved by
 some queries on a web search engine)

Implementations may vary as well as the kernel API to support such a feature.
As I told, my proposal has a minimal impact on the code, it does not require the definition
of new syscalls, it simply enhances the features of eventfd.

> 
> Eventually you come up with such a lock to protect your tcp or whatever
> state machine.  Or you have a real example where read and write paths
> can work completely independently?

Actually umvu hypervisor uses concurrent tracing of concurrent processes.
We have named this technique "guardian angels": each process/thread running in the
partial virtual machine has a correspondent thread in the hypervisor.
So if a process uses two threads to manage a network connection (say a TCP stream),
the two guardian angels replicate their requests towards the networking module.

So I am looking for a general solution, not to a pattern to port some projects.
(and I cannot use two different approaches for event driven and multi-threaded
 implementations as I have to support both).

If you reached this point...  Thank you for your patience.
I am more than pleased to receive further comments or proposals.

	renzo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-06-03 15:00           ` Renzo Davoli
@ 2019-06-06 20:11             ` Roman Penyaev
  2019-06-07  9:40               ` Renzo Davoli
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-06-06 20:11 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner

Hi Renzo,

On 2019-06-03 17:00, Renzo Davoli wrote:
> Hi Roman,
> 
> 	 I sorry for the delay in my answer, but I needed to set up a minimal
> tutorial to show what I am working on and why I need a feature like the
> one I am proposing.
> 
> Please, have a look of the README.md page here:
> https://github.com/virtualsquare/vuos
> (everything can be downloaded and tested)

Is that similar to what user-mode linux does?  I mean the principle.

> I am not trying to port some tools to use user-space implemented
> stacks or device
> drivers/emulators, I am seeking to a general purpose approach.

You still intersect *each* syscall, why not to do the same for 
epoll_wait()
and replace events with correct value?  Seems you do something similar 
already
in a vu_wrap_poll.c: wo_epoll_wait(), right?

Don't get me wrong, I really want to understand whether everything 
really
looks so bad without proposed change. It seems not, because the whole 
principle
is based on intersection of each syscall, thus one more one less - it 
does not
become more clean and especially does not look like a generic purpose 
solution,
which you seek.  I may be wrong.

--
Roman


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
  2019-06-06 20:11             ` Roman Penyaev
@ 2019-06-07  9:40               ` Renzo Davoli
  0 siblings, 0 replies; 12+ messages in thread
From: Renzo Davoli @ 2019-06-07  9:40 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api, linux-kernel-owner

Hi Roman,

On Thu, Jun 06, 2019 at 10:11:57PM +0200, Roman Penyaev wrote:
> Hi Renzo,
> On 2019-06-03 17:00, Renzo Davoli wrote:
> > Please, have a look of the README.md page here:
> > https://github.com/virtualsquare/vuos
> Is that similar to what user-mode linux does?  I mean the principle.

let us write this proportion:
user-mode-linux / umvu = linux / namespace

In a comparison between user-mode linux and umvu,
while the way to get the system call requests is the same (ptrace)
the goal is different.

user-mode linux catches all the system calls, none of them is forwarded to
the real kernel: it uses a linux kernel compiled as a process to give processes
the illusion to live in another machine.

umvu catches all the system calls and then it decides if the syscall must be forwarded
to the kernel (maybe modified) or entirely processed at user-level by the hypervisor
(by means of specific plug-in modules like vufuse for file systems, vudev for devices
 and so on).
While the "illusion" of user-mode linux is a global illusion, the "illusion" provided by
umvu is limited and configurable. After a "mount" of a filesystem using vufuse,
the file system tree is the same *but* the subtree of the mountpoint.
The illusion is limited to the subtree as only the system call requests for paths inside
the subtree are processed by umvu and its modules.

It is similar to a namespace implemented at user level.
w.r.t. namespaces:
* umvu does not change the attack surface of the kernel (it is just a virtualization
		-- a.k.a. illusion -- provided by a user process to other user processes)
* umvu can provide features not currently supported by the kernel (e.g. a file system
		organization unavailable as kernel code, networking stacks at user level etc.)
* ...

umvu is an implementation of vuos concepts using ptrace. In a future maybe it will be
possibile to reimplement the same idea of partial virtual machines using other
syscall tracing/filtering tools.
> 
> > I am not trying to port some tools to use user-space implemented
> > stacks or device
> > drivers/emulators, I am seeking to a general purpose approach.
> 
> You still intersect *each* syscall, why not to do the same for epoll_wait()
> and replace events with correct value?  Seems you do something similar
> already
> in a vu_wrap_poll.c: wo_epoll_wait(), right?
> 
> Don't get me wrong, I really want to understand whether everything really
> looks so bad without proposed change. It seems not, because the whole
> principle
> is based on intersection of each syscall, thus one more one less - it does
> not
> become more clean and especially does not look like a generic purpose
> solution,
> which you seek.  I may be wrong.
Your comments are precious. Thank you as I see that you have browsed into my code
to have a better view of the problem.

umvu is a modular tool. The executable of umvu is a dispatcher between the
system call requests coming from the user processes and modules (loaded at
run time as dynamic plug-in libraries)

+-----------------+         +------+      +---------------------------------+
+processes running|<------->| umvu |<---->| module (e.g. vufuse/vudev/vunet)|
+  "inside" umvu  |         +------+      +---------------------------------+
+-----------------+

Each module "registers" to umvu its "responsabilities"
It can register:
* a pathname (it will receive the syscall requests for that subtree)
* an address_family (all the syscall for sockets of that AF)
* major/minor numbers of a char or block device
* a systam call number
* ....
(each module can register more items)

The problem is not in the dialogue between umvu and the user processes
(<---> on the left in the diagram above) but between umvu and its modules
(<---> on the right). 
(wi_epoll_wait, wd_epoll_wait, wo_epoll_wait are the three wrappers used
 respectively before, during and after epoll_wait in the dialogue on the left with the
user processes).

When a user process generates a "read" syscall request and umvu discovers that
the fd is managed by vufuse, it forwards to vufuse a "read" request
having the same signature of the "read" system call (plus a trailing fdprivate arg for
syscalls using a fd. This arg can be used to speed up virtualization but can be safely
ignored).

If for the same "read" request the file descriptor is managed by vunet,
it is forwarded to vunet (actually it is converted to "recvmsg": if fd is a socket
recvmesg manages all read/recv/recvfrom/recvmsg, umvu tends to simplify the API 
by unifying similar system calls).

But what about poll/epoll/ppoll/select/pselect?
umvu takes care of all the system call requests but it needs a clean way to ask
modules some feedback when the expected events happen.

I think the clean way to do it is illustrated in test_modules/unrealsock.c
This is a test module: it registers the address families AF_INET, AF_INET6 and AF_NETLINK
then it forwards all the requests to the system calls.
In this way when this module is loaded all system calls requests 
related to sockets of the mentioned families are sent to unrealsock 
which uses the system call.

When unrealsock is loaded
   vu_insmod unrealsock
nothing apparently happens.
It is possible to run ssh, ask for ip addresses using "ip addr" etc.
The difference is that all the system call requests "pass through" umvu
and are sent to the real kernel by unrealsock.

We use modules like unrealsock to test umvu features in an independent manner (without
specific modules and submodules).

All the stuff related the virtualization of poll/ppoll/select/pselect/epoll is managed at
line 77:
		vu_syscall_handler(s, epoll_ctl) = epoll_ctl;

It says: "dear module if I need to get informed by you about an event on a file descriptor
I'll call an epoll_ctl". That's all.

Here it works as I am diverting the request to the system call which is able to generate
EPOLL_PRI and all other flavours of EPOLL_*.
When I want to write a module able to virtualize a network stack I need to write my epoll_ctl 
so I need a way to generate EPOLL_PRI when I receive an OOB packet, EPOLL_IN must be set 
when the stack receives a packet and reset when the buffer gets empty, etc.

For sure I could teach people aiming to write a module for umvu that there is a variable in
proxima centaury (I mean an helper function using data.ptr of struct epoll_event) and
then if they generate an EPOLL_IN it will be processed as if it were an EPOLL_PRI.
why?

Let us reverse the question.
Why not giving linux programmers the ability to have file descriptors on which arbitrary 
EPOLL events can be generated as they need/please? eventfd/EFD_VPOLL is (in my opinion) 
a simple way to implement this feature, we can converge on a different API.
It is not a Copernican revolution of the code. It seems to me that it is not a dangerous
feature (I could be wrong, please tell me!). It is possible to generate signals normally
sent by the kernel (e.g. SEGV), why it is not possible to generate an EPOLL_PRI?
There is at least one happy user of the new feature (me) and other may come.
It is just one (tiny) more degree of freedom.

Many kernel features were added for one usage (e.g. seccomp was created to lend/rent processing
power in a safe way) and then it was discovered that they are useful in other cases (e.g.
sandboxes for browsers).

Again thanks to everybody who will have read this message up to this point ;-)

	renzo

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-06-07  9:41 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-26 14:25 [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events Renzo Davoli
2019-05-26 20:24 ` kbuild test robot
2019-05-26 20:49 ` kbuild test robot
2019-05-27  3:09 ` kbuild test robot
2019-05-27  7:33 ` Greg KH
2019-05-27 13:36   ` Renzo Davoli
2019-05-31  9:34     ` Roman Penyaev
2019-05-31 10:45       ` Renzo Davoli
2019-05-31 11:48         ` Roman Penyaev
2019-06-03 15:00           ` Renzo Davoli
2019-06-06 20:11             ` Roman Penyaev
2019-06-07  9:40               ` Renzo Davoli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).