linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/4] signalfd: introduce signalfd_cleanup()
       [not found] <20120222173326.GA7139@redhat.com>
@ 2012-02-22 17:33 ` Oleg Nesterov
  2012-02-22 17:34 ` [PATCH 2/4] epoll: introduce POLLFREE for ep_poll_callback() Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-22 17:33 UTC (permalink / raw)
  To: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Linus Torvalds, Roland McGrath
  Cc: Eugene Teo, Maxime Bizon, Denys Vlasenko, linux-kernel

Preparation. __cleanup_sighand() obviously must not free sighand
if ->signalfd_wqh is in use. And the new helper which checks this
wait_queue_head_t before kmem_cache_free(). Currently it does
nothing except it helps to catch the bug early.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/signalfd.c            |    7 +++++++
 include/linux/signalfd.h |    5 ++++-
 kernel/fork.c            |    5 ++++-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/signalfd.c b/fs/signalfd.c
index 492465b..35d19ae 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -30,6 +30,13 @@
 #include <linux/signalfd.h>
 #include <linux/syscalls.h>
 
+void signalfd_cleanup(struct sighand_struct *sighand)
+{
+	wait_queue_head_t *wqh = &sighand->signalfd_wqh;
+
+	BUG_ON(waitqueue_active(wqh));
+}
+
 struct signalfd_ctx {
 	sigset_t sigmask;
 };
diff --git a/include/linux/signalfd.h b/include/linux/signalfd.h
index 3ff4961..247399b 100644
--- a/include/linux/signalfd.h
+++ b/include/linux/signalfd.h
@@ -61,13 +61,16 @@ static inline void signalfd_notify(struct task_struct *tsk, int sig)
 		wake_up(&tsk->sighand->signalfd_wqh);
 }
 
+extern void signalfd_cleanup(struct sighand_struct *sighand);
+
 #else /* CONFIG_SIGNALFD */
 
 static inline void signalfd_notify(struct task_struct *tsk, int sig) { }
 
+static inline void signalfd_cleanup(struct sighand_struct *sighand) { }
+
 #endif /* CONFIG_SIGNALFD */
 
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SIGNALFD_H */
-
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..e2cd3e2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/signalfd.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -935,8 +936,10 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 
 void __cleanup_sighand(struct sighand_struct *sighand)
 {
-	if (atomic_dec_and_test(&sighand->count))
+	if (atomic_dec_and_test(&sighand->count)) {
+		signalfd_cleanup(sighand);
 		kmem_cache_free(sighand_cachep, sighand);
+	}
 }
 
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/4] epoll: introduce POLLFREE for ep_poll_callback()
       [not found] <20120222173326.GA7139@redhat.com>
  2012-02-22 17:33 ` [PATCH 1/4] signalfd: introduce signalfd_cleanup() Oleg Nesterov
@ 2012-02-22 17:34 ` Oleg Nesterov
  2012-02-22 17:34 ` [PATCH 3/4] signalfd: signalfd_cleanup() can race with remove_wait_queue() Oleg Nesterov
  2012-02-22 17:35 ` [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead Oleg Nesterov
  3 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-22 17:34 UTC (permalink / raw)
  To: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Linus Torvalds, Roland McGrath
  Cc: Eugene Teo, Maxime Bizon, Denys Vlasenko, linux-kernel

Note: this patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next changes.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note: we do not have wake_up_all_poll() but wake_up_poll() should
be fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/eventpoll.c             |    4 ++++
 fs/signalfd.c              |    5 +++++
 include/asm-generic/poll.h |    2 ++
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aabdfc3..442bedb 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -844,6 +844,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 
 	spin_lock_irqsave(&ep->lock, flags);
 
+	/* the caller holds eppoll_entry->whead->lock */
+	if ((unsigned long)key & POLLFREE)
+		list_del_init(&wait->task_list);
+
 	/*
 	 * If the event mask does not contain any poll(2) event, we consider the
 	 * descriptor to be disabled. This condition is likely the effect of the
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 35d19ae..838ba21 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -34,6 +34,11 @@ void signalfd_cleanup(struct sighand_struct *sighand)
 {
 	wait_queue_head_t *wqh = &sighand->signalfd_wqh;
 
+	if (likely(!waitqueue_active(wqh)))
+		return;
+
+	/* ask wait_queue_t->func() to remove_wait_queue() */
+	wake_up_poll(wqh, POLLHUP | POLLFREE);
 	BUG_ON(waitqueue_active(wqh));
 }
 
diff --git a/include/asm-generic/poll.h b/include/asm-generic/poll.h
index 44bce83..9ce7f44 100644
--- a/include/asm-generic/poll.h
+++ b/include/asm-generic/poll.h
@@ -28,6 +28,8 @@
 #define POLLRDHUP       0x2000
 #endif
 
+#define POLLFREE	0x4000	/* currently only for epoll */
+
 struct pollfd {
 	int fd;
 	short events;
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/4] signalfd: signalfd_cleanup() can race with remove_wait_queue()
       [not found] <20120222173326.GA7139@redhat.com>
  2012-02-22 17:33 ` [PATCH 1/4] signalfd: introduce signalfd_cleanup() Oleg Nesterov
  2012-02-22 17:34 ` [PATCH 2/4] epoll: introduce POLLFREE for ep_poll_callback() Oleg Nesterov
@ 2012-02-22 17:34 ` Oleg Nesterov
  2012-02-22 17:35 ` [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead Oleg Nesterov
  3 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-22 17:34 UTC (permalink / raw)
  To: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Linus Torvalds, Roland McGrath
  Cc: Eugene Teo, Maxime Bizon, Denys Vlasenko, linux-kernel

signalfd_cleanup() checks waitqueue_active() lockless, this can
race with ep_unregister_pollwait(). We can see list_empty() == T
before remove_wait_queue() completes and list_empty_careful()
can't help. Add spin_unlock_wait() to serialize.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/signalfd.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/signalfd.c b/fs/signalfd.c
index 838ba21..6e51887 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -34,6 +34,10 @@ void signalfd_cleanup(struct sighand_struct *sighand)
 {
 	wait_queue_head_t *wqh = &sighand->signalfd_wqh;
 
+	/* make sure we can't race with remove_wait_queue() in progress */
+	spin_unlock_wait(&wqh->lock);
+	smp_rmb();
+
 	if (likely(!waitqueue_active(wqh)))
 		return;
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead
       [not found] <20120222173326.GA7139@redhat.com>
                   ` (2 preceding siblings ...)
  2012-02-22 17:34 ` [PATCH 3/4] signalfd: signalfd_cleanup() can race with remove_wait_queue() Oleg Nesterov
@ 2012-02-22 17:35 ` Oleg Nesterov
  2012-02-23 15:44   ` Oleg Nesterov
  3 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-22 17:35 UTC (permalink / raw)
  To: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Linus Torvalds, Roland McGrath
  Cc: Eugene Teo, Maxime Bizon, Denys Vlasenko, linux-kernel

signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
this is not enough. eppoll_entry->whead still points to the memory
we are going to free, ep_unregister_pollwait()->remove_wait_queue()
is obviously unsafe.

Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
change ep_unregister_pollwait() to check pwq->whead != NULL before
remove_wait_queue(). We add the new ep_remove_wait_queue() helper
for this.

However this needs more locking. ep_remove_wait_queue() should take
ep->lock first to avoid the race and pin pwq->whead, then it needs
pwq->whead->lock for __remove_wait_queue().

This can obviously AB-BA deadlock with wake_up()->ep_poll_callback(),
so ep_remove_wait_queue() does the nasty lock + trylock-or-retry dance.

Of course, this also assumes that it is safe to take ep->lock in
ep_unregister_pollwait() paths, afaics this is true.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/eventpoll.c |   43 +++++++++++++++++++++++++++++++++++++++----
 1 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 442bedb..ac8bd15 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -320,6 +320,11 @@ static inline int ep_is_linked(struct list_head *p)
 	return !list_empty(p);
 }
 
+static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_t *p)
+{
+	return container_of(p, struct eppoll_entry, wait);
+}
+
 /* Get the "struct epitem" from a wait queue pointer */
 static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
 {
@@ -467,6 +472,33 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
 	put_cpu();
 }
 
+static void ep_remove_wait_queue(struct eventpoll *ep, struct eppoll_entry *pwq)
+{
+	for (;;) {
+		unsigned long flags;	/* probably unneeded */
+
+		spin_lock_irqsave(&ep->lock, flags);
+		/* can be cleared by ep_poll_callback(POLLFREE) */
+		if (!pwq->whead)
+			goto unlock;
+
+		/* _trylock to avoid the deadlock, retry if it fails */
+		if (!spin_trylock(&pwq->whead->lock))
+			goto unlock;
+
+		__remove_wait_queue(pwq->whead, &pwq->wait);
+		spin_unlock(&pwq->whead->lock);
+		pwq->whead = NULL;
+ unlock:
+		spin_unlock_irqrestore(&ep->lock, flags);
+
+		if (!pwq->whead)
+			break;
+
+		cpu_relax();
+	}
+}
+
 /*
  * This function unregisters poll callbacks from the associated file
  * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
@@ -481,7 +513,7 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 		pwq = list_first_entry(lsthead, struct eppoll_entry, llink);
 
 		list_del(&pwq->llink);
-		remove_wait_queue(pwq->whead, &pwq->wait);
+		ep_remove_wait_queue(ep, pwq);
 		kmem_cache_free(pwq_cache, pwq);
 	}
 }
@@ -844,9 +876,12 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 
 	spin_lock_irqsave(&ep->lock, flags);
 
-	/* the caller holds eppoll_entry->whead->lock */
-	if ((unsigned long)key & POLLFREE)
-		list_del_init(&wait->task_list);
+	if ((unsigned long)key & POLLFREE) {
+		struct eppoll_entry *pwq = ep_pwq_from_wait(wait);
+		/* the caller holds pwq->whead->lock */
+		__remove_wait_queue(pwq->whead, wait);
+		pwq->whead = NULL;
+	}
 
 	/*
 	 * If the event mask does not contain any poll(2) event, we consider the
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead
  2012-02-22 17:35 ` [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead Oleg Nesterov
@ 2012-02-23 15:44   ` Oleg Nesterov
  2012-02-23 22:17     ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-23 15:44 UTC (permalink / raw)
  To: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Linus Torvalds, Roland McGrath
  Cc: Eugene Teo, Maxime Bizon, Denys Vlasenko, linux-kernel

On 02/22, Oleg Nesterov wrote:
>
> However this needs more locking. ep_remove_wait_queue() should take
> ep->lock first to avoid the race and pin pwq->whead, then it needs
> pwq->whead->lock for __remove_wait_queue().
>
> This can obviously AB-BA deadlock with wake_up()->ep_poll_callback(),
> so ep_remove_wait_queue() does the nasty lock + trylock-or-retry dance.

Or we can rely on the fact that sighand_cachep is SLAB_DESTROY_BY_RCU,
and assume that ->whead is always rcu-protected if it can go away.

In this case we don't need 3/4 (although it makes sense to add the
fat comment), and 4/4 can be simplified, see below.

ep_pwq_from_wait() is not really needed, it has a single caller.
Just I tried to follow the coding style.

I'd prefer the explicit locking, but I don't really mind.

Oleg.

--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -320,6 +320,11 @@ static inline int ep_is_linked(struct list_head *p)
 	return !list_empty(p);
 }
 
+static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_t *p)
+{
+	return container_of(p, struct eppoll_entry, wait);
+}
+
 /* Get the "struct epitem" from a wait queue pointer */
 static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
 {
@@ -467,6 +472,17 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
 	put_cpu();
 }
 
+static void ep_remove_wait_queue(struct eppoll_entry *pwq)
+{
+	wait_queue_head_t *whead;
+
+	rcu_read_lock();
+	whead = rcu_dereference(pwq->whead);
+	if (whead)
+		remove_wait_queue(whead, &pwq->wait);
+	rcu_read_unlock();
+}
+
 /*
  * This function unregisters poll callbacks from the associated file
  * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
@@ -481,7 +497,7 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 		pwq = list_first_entry(lsthead, struct eppoll_entry, llink);
 
 		list_del(&pwq->llink);
-		remove_wait_queue(pwq->whead, &pwq->wait);
+		ep_remove_wait_queue(pwq);
 		kmem_cache_free(pwq_cache, pwq);
 	}
 }
@@ -845,8 +861,11 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	spin_lock_irqsave(&ep->lock, flags);
 
 	/* the caller holds eppoll_entry->whead->lock */
-	if ((unsigned long)key & POLLFREE)
+	if ((unsigned long)key & POLLFREE) {
+		/* can't use __remove_wait_queue(), we need list_del_init() */
 		list_del_init(&wait->task_list);
+		ep_pwq_from_wait(wait)->whead = NULL;
+	}
 
 	/*
 	 * If the event mask does not contain any poll(2) event, we consider the


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead
  2012-02-23 15:44   ` Oleg Nesterov
@ 2012-02-23 22:17     ` Linus Torvalds
  2012-02-24 19:06       ` [PATCH v2 0/2] signalfd/epoll fixes Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2012-02-23 22:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On Thu, Feb 23, 2012 at 7:44 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Or we can rely on the fact that sighand_cachep is SLAB_DESTROY_BY_RCU,
> and assume that ->whead is always rcu-protected if it can go away.
>
> In this case we don't need 3/4 (although it makes sense to add the
> fat comment), and 4/4 can be simplified, see below.

Ok.

Can you also get rid of 1/4, because quite frankly, adding that
BUG_ON() is just annoying. Either the thing gets fixed or not, but at
no point is it ok to say "ok, I'm going to fix it, but before I do
I'll just make it much worse".

Hmm?
               Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 0/2] signalfd/epoll fixes
  2012-02-23 22:17     ` Linus Torvalds
@ 2012-02-24 19:06       ` Oleg Nesterov
  2012-02-24 19:07         ` [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() Oleg Nesterov
                           ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-24 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On 02/23, Linus Torvalds wrote:
>
> On Thu, Feb 23, 2012 at 7:44 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Or we can rely on the fact that sighand_cachep is SLAB_DESTROY_BY_RCU,
> > and assume that ->whead is always rcu-protected if it can go away.
> >
> > In this case we don't need 3/4 (although it makes sense to add the
> > fat comment), and 4/4 can be simplified, see below.
>
> Ok.
>
> Can you also get rid of 1/4, because quite frankly, adding that
> BUG_ON() is just annoying. Either the thing gets fixed or not, but at
> no point is it ok to say "ok, I'm going to fix it, but before I do
> I'll just make it much worse".

OK. Please see v2.

Other changes:

	- some comments

	- now that we rely on rcu, ep_poll_callback() can do
	  remove_wait_queue() outsife of ep->lock



Davide, I see the new email, but it is too late for me to reply
today. Anyway, I think it makes sense to make the "simple" fix
before anything else. IOW, I'd suggest these changes for now in
any case (unless, of course, you see some problems).

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()
  2012-02-24 19:06       ` [PATCH v2 0/2] signalfd/epoll fixes Oleg Nesterov
@ 2012-02-24 19:07         ` Oleg Nesterov
  2012-02-29 19:57           ` Andy Lutomirski
  2012-02-24 19:07         ` [PATCH v2 2/2] epoll: ep_unregister_pollwait() can use the freed pwq->whead Oleg Nesterov
  2012-02-24 20:23         ` [PATCH v2 0/2] signalfd/epoll fixes Linus Torvalds
  2 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-24 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note:

	- we do not have wake_up_all_poll() but wake_up_poll()
	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
	  we need a couple of simple changes in eventpoll.c to
	  make sure it can't be "lost".

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/eventpoll.c             |    4 ++++
 fs/signalfd.c              |   11 +++++++++++
 include/asm-generic/poll.h |    2 ++
 include/linux/signalfd.h   |    5 ++++-
 kernel/fork.c              |    5 ++++-
 5 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aabdfc3..34bbfc6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -842,6 +842,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
 
+	/* the caller holds eppoll_entry->whead->lock */
+	if ((unsigned long)key & POLLFREE)
+		list_del_init(&wait->task_list);
+
 	spin_lock_irqsave(&ep->lock, flags);
 
 	/*
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 492465b..79c1eea 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -30,6 +30,17 @@
 #include <linux/signalfd.h>
 #include <linux/syscalls.h>
 
+void signalfd_cleanup(struct sighand_struct *sighand)
+{
+	wait_queue_head_t *wqh = &sighand->signalfd_wqh;
+
+	if (likely(!waitqueue_active(wqh)))
+		return;
+
+	/* wait_queue_t->func(POLLFREE) should do remove_wait_queue() */
+	wake_up_poll(wqh, POLLHUP | POLLFREE);
+}
+
 struct signalfd_ctx {
 	sigset_t sigmask;
 };
diff --git a/include/asm-generic/poll.h b/include/asm-generic/poll.h
index 44bce83..9ce7f44 100644
--- a/include/asm-generic/poll.h
+++ b/include/asm-generic/poll.h
@@ -28,6 +28,8 @@
 #define POLLRDHUP       0x2000
 #endif
 
+#define POLLFREE	0x4000	/* currently only for epoll */
+
 struct pollfd {
 	int fd;
 	short events;
diff --git a/include/linux/signalfd.h b/include/linux/signalfd.h
index 3ff4961..247399b 100644
--- a/include/linux/signalfd.h
+++ b/include/linux/signalfd.h
@@ -61,13 +61,16 @@ static inline void signalfd_notify(struct task_struct *tsk, int sig)
 		wake_up(&tsk->sighand->signalfd_wqh);
 }
 
+extern void signalfd_cleanup(struct sighand_struct *sighand);
+
 #else /* CONFIG_SIGNALFD */
 
 static inline void signalfd_notify(struct task_struct *tsk, int sig) { }
 
+static inline void signalfd_cleanup(struct sighand_struct *sighand) { }
+
 #endif /* CONFIG_SIGNALFD */
 
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SIGNALFD_H */
-
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..e2cd3e2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/signalfd.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -935,8 +936,10 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 
 void __cleanup_sighand(struct sighand_struct *sighand)
 {
-	if (atomic_dec_and_test(&sighand->count))
+	if (atomic_dec_and_test(&sighand->count)) {
+		signalfd_cleanup(sighand);
 		kmem_cache_free(sighand_cachep, sighand);
+	}
 }
 
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/2] epoll: ep_unregister_pollwait() can use the freed pwq->whead
  2012-02-24 19:06       ` [PATCH v2 0/2] signalfd/epoll fixes Oleg Nesterov
  2012-02-24 19:07         ` [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() Oleg Nesterov
@ 2012-02-24 19:07         ` Oleg Nesterov
  2012-02-24 20:23         ` [PATCH v2 0/2] signalfd/epoll fixes Linus Torvalds
  2 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-24 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
this is not enough. eppoll_entry->whead still points to the memory
we are going to free, ep_unregister_pollwait()->remove_wait_queue()
is obviously unsafe.

Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
change ep_unregister_pollwait() to check pwq->whead != NULL under
rcu_read_lock() before remove_wait_queue(). We add the new helper,
ep_remove_wait_queue(), for this.

This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
ep_unregister_pollwait()->remove_wait_queue() can play with already
freed and potentially reused ->sighand, but this is fine. This memory
must have the valid ->signalfd_wqh until rcu_read_unlock().

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/eventpoll.c |   30 +++++++++++++++++++++++++++---
 fs/signalfd.c  |    6 +++++-
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 34bbfc6..ea54cde 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -320,6 +320,11 @@ static inline int ep_is_linked(struct list_head *p)
 	return !list_empty(p);
 }
 
+static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_t *p)
+{
+	return container_of(p, struct eppoll_entry, wait);
+}
+
 /* Get the "struct epitem" from a wait queue pointer */
 static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
 {
@@ -467,6 +472,18 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
 	put_cpu();
 }
 
+static void ep_remove_wait_queue(struct eppoll_entry *pwq)
+{
+	wait_queue_head_t *whead;
+
+	rcu_read_lock();
+	/* If it is cleared by POLLFREE, it should be rcu-safe */
+	whead = rcu_dereference(pwq->whead);
+	if (whead)
+		remove_wait_queue(whead, &pwq->wait);
+	rcu_read_unlock();
+}
+
 /*
  * This function unregisters poll callbacks from the associated file
  * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
@@ -481,7 +498,7 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 		pwq = list_first_entry(lsthead, struct eppoll_entry, llink);
 
 		list_del(&pwq->llink);
-		remove_wait_queue(pwq->whead, &pwq->wait);
+		ep_remove_wait_queue(pwq);
 		kmem_cache_free(pwq_cache, pwq);
 	}
 }
@@ -842,9 +859,16 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
 
-	/* the caller holds eppoll_entry->whead->lock */
-	if ((unsigned long)key & POLLFREE)
+	if ((unsigned long)key & POLLFREE) {
+		ep_pwq_from_wait(wait)->whead = NULL;
+		/*
+		 * whead = NULL above can race with ep_remove_wait_queue()
+		 * which can do another remove_wait_queue() after us, so we
+		 * can't use __remove_wait_queue(). whead->lock is held by
+		 * the caller.
+		 */
 		list_del_init(&wait->task_list);
+	}
 
 	spin_lock_irqsave(&ep->lock, flags);
 
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 79c1eea..7ae2a57 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -33,7 +33,11 @@
 void signalfd_cleanup(struct sighand_struct *sighand)
 {
 	wait_queue_head_t *wqh = &sighand->signalfd_wqh;
-
+	/*
+	 * The lockless check can race with remove_wait_queue() in progress,
+	 * but in this case its caller should run under rcu_read_lock() and
+	 * sighand_cachep is SLAB_DESTROY_BY_RCU, we can safely return.
+	 */
 	if (likely(!waitqueue_active(wqh)))
 		return;
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/2] signalfd/epoll fixes
  2012-02-24 19:06       ` [PATCH v2 0/2] signalfd/epoll fixes Oleg Nesterov
  2012-02-24 19:07         ` [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() Oleg Nesterov
  2012-02-24 19:07         ` [PATCH v2 2/2] epoll: ep_unregister_pollwait() can use the freed pwq->whead Oleg Nesterov
@ 2012-02-24 20:23         ` Linus Torvalds
  2012-02-24 23:14           ` Linus Torvalds
  2 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2012-02-24 20:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On Fri, Feb 24, 2012 at 11:06 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> OK. Please see v2.

Ok, I applied these.

And then, dammit, I unapplied them again.

And then I applied them again.

I'm *really* conflicted, because I have this really strong feeling
that it's just papering over a symptom, and we damn well shouldn't be
doing that. I really think that what we really should do is allow
"poll()" to have a "poll_remove" callback (so each "add_poll_wait()"
will have a callback when it gets remove).

Then we could make the poll() functions actually do allocations and
crap - or at least add refcounts - and the "poll_remove()" ones would
undo them.

And then we could rip out all this, and make signalfd just do

  static void poll_remove(struct file *file, struct wait_queue *wq)
  {
    struct sighand *sighand = container_of(wq, struct sighand, signalfd_wqh);
    __cleanup_sighand(sighand);
  }

and add that "poll_remove" to its file handler operations. And in
"poll()", it would just do

   atomic_inc(&sighand->count);

as it does the poll_wait() thing. Sure, we need to have some way to
test "did we really add it", and only increment the count if so (so
poll_wait() would need to return a value), but this seems to be the
*real* fix. Because the real problem is that we cannot currently
refcount the poll users.

Ok, so it's just a strong feeling, and I *did* end up applying these
two patches after all, but I really wonder how hard it would be to
just add a single new callback function and be able to refcount that
sighand structure itself.

                           Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/2] signalfd/epoll fixes
  2012-02-24 20:23         ` [PATCH v2 0/2] signalfd/epoll fixes Linus Torvalds
@ 2012-02-24 23:14           ` Linus Torvalds
  2012-02-25 16:08             ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2012-02-24 23:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 807 bytes --]

On Fri, Feb 24, 2012 at 12:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm *really* conflicted, because I have this really strong feeling
> that it's just papering over a symptom, and we damn well shouldn't be
> doing that. I really think that what we really should do is allow
> "poll()" to have a "poll_remove" callback (so each "add_poll_wait()"
> will have a callback when it gets remove).
>
> Then we could make the poll() functions actually do allocations and
> crap - or at least add refcounts - and the "poll_remove()" ones would
> undo them.

Ok, I have *NOT* tested this, and by now it's certainly not 3.3
material any more, but here's a patch to kind of lay the groundwork
and show what I mean.

UNTESTED! UNTESTED! CAVEAT! Probably horribly broken.

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 9099 bytes --]

 drivers/vhost/vhost.c |    5 +++--
 fs/eventpoll.c        |    9 ++++++++-
 fs/select.c           |   12 ++++++++----
 fs/signalfd.c         |   12 +++++++++++-
 include/linux/fs.h    |    1 +
 include/linux/poll.h  |    7 ++++---
 kernel/cgroup.c       |    3 ++-
 net/9p/trans_fd.c     |    5 +++--
 virt/kvm/eventfd.c    |    3 ++-
 9 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c14c42b95ab8..5880f3d81333 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -42,14 +42,15 @@ static unsigned vhost_zcopy_mask __read_mostly;
 #define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((u16 __user *)&vq->used->ring[vq->num])
 
-static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
-			    poll_table *pt)
+static int vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
+			   poll_table *pt)
 {
 	struct vhost_poll *poll;
 
 	poll = container_of(pt, struct vhost_poll, table);
 	poll->wqh = wqh;
 	add_wait_queue(wqh, &poll->wait);
+	return 1;
 }
 
 static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aabdfc38cf24..010b9eb9de74 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -474,13 +474,18 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
  */
 static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 {
+	struct file *file = epi->ffd.file;
 	struct list_head *lsthead = &epi->pwqlist;
 	struct eppoll_entry *pwq;
+	void (*poll_rm) (struct file *, wait_queue_head_t *);
 
+	poll_rm = file->f_op->poll_rm;
 	while (!list_empty(lsthead)) {
 		pwq = list_first_entry(lsthead, struct eppoll_entry, llink);
 
 		list_del(&pwq->llink);
+		if (poll_rm)
+			poll_rm(file, pwq->whead);
 		remove_wait_queue(pwq->whead, &pwq->wait);
 		kmem_cache_free(pwq_cache, pwq);
 	}
@@ -903,7 +908,7 @@ out_unlock:
  * This is the callback that is used to add our wait queue to the
  * target file wakeup lists.
  */
-static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
+static int ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 				 poll_table *pt)
 {
 	struct epitem *epi = ep_item_from_epqueue(pt);
@@ -916,9 +921,11 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
+		return 1;
 	} else {
 		/* We have to signal that an error occurred */
 		epi->nwait = -1;
+		return 0;
 	}
 }
 
diff --git a/fs/select.c b/fs/select.c
index e782258d0de3..a4e369dda85c 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -110,7 +110,7 @@ struct poll_table_page {
  * as all select/poll functions have to call it to add an entry to the
  * poll table.
  */
-static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
+static int __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 		       poll_table *p);
 
 void poll_initwait(struct poll_wqueues *pwq)
@@ -126,8 +126,11 @@ EXPORT_SYMBOL(poll_initwait);
 
 static void free_poll_entry(struct poll_table_entry *entry)
 {
+	struct file *file = entry->filp;
+	if (file->f_op->poll_rm)
+		file->f_op->poll_rm(file, entry->wait_address);
 	remove_wait_queue(entry->wait_address, &entry->wait);
-	fput(entry->filp);
+	fput(file);
 }
 
 void poll_freewait(struct poll_wqueues *pwq)
@@ -213,13 +216,13 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
 }
 
 /* Add a new entry */
-static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
+static int __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 				poll_table *p)
 {
 	struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
 	struct poll_table_entry *entry = poll_get_entry(pwq);
 	if (!entry)
-		return;
+		return 0;
 	get_file(filp);
 	entry->filp = filp;
 	entry->wait_address = wait_address;
@@ -227,6 +230,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
+	return 1;
 }
 
 int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 492465b451dd..89afafafbf2e 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -40,12 +40,21 @@ static int signalfd_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static void signalfd_poll_rm(struct file *file, wait_queue_head_t *p)
+{
+	struct sighand_struct *sighand;
+
+	sighand = container_of(p, struct sighand_struct, signalfd_wqh);
+	__cleanup_sighand(sighand);
+}
+
 static unsigned int signalfd_poll(struct file *file, poll_table *wait)
 {
 	struct signalfd_ctx *ctx = file->private_data;
 	unsigned int events = 0;
 
-	poll_wait(file, &current->sighand->signalfd_wqh, wait);
+	if (poll_wait(file, &current->sighand->signalfd_wqh, wait))
+		atomic_inc(&current->sighand->count);
 
 	spin_lock_irq(&current->sighand->siglock);
 	if (next_signal(&current->pending, &ctx->sigmask) ||
@@ -215,6 +224,7 @@ static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
 static const struct file_operations signalfd_fops = {
 	.release	= signalfd_release,
 	.poll		= signalfd_poll,
+	.poll_rm	= signalfd_poll_rm,
 	.read		= signalfd_read,
 	.llseek		= noop_llseek,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb640f5..a96e193dfa6b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1604,6 +1604,7 @@ struct file_operations {
 	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
+	void (*poll_rm) (struct file *, wait_queue_head_t * wait_address);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index cf40010ce0cd..9b73a2557add 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -30,17 +30,18 @@ struct poll_table_struct;
 /* 
  * structures and helpers for f_op->poll implementations
  */
-typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
+typedef int (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
 
 typedef struct poll_table_struct {
 	poll_queue_proc qproc;
 	unsigned long key;
 } poll_table;
 
-static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
+static inline int poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
 {
 	if (p && wait_address)
-		p->qproc(filp, wait_address, p);
+		return p->qproc(filp, wait_address, p);
+	return 0;
 }
 
 static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a5d3b5325f77..3029d66d8b51 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3520,7 +3520,7 @@ static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
 	return 0;
 }
 
-static void cgroup_event_ptable_queue_proc(struct file *file,
+static int cgroup_event_ptable_queue_proc(struct file *file,
 		wait_queue_head_t *wqh, poll_table *pt)
 {
 	struct cgroup_event *event = container_of(pt,
@@ -3528,6 +3528,7 @@ static void cgroup_event_ptable_queue_proc(struct file *file,
 
 	event->wqh = wqh;
 	add_wait_queue(wqh, &event->wait);
+	return 1;
 }
 
 /*
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index fccae26fa674..d87b30b6e8e9 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -538,7 +538,7 @@ static int p9_pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
  * called by files poll operation to add v9fs-poll task to files wait queue
  */
 
-static void
+static int
 p9_pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
 {
 	struct p9_conn *m = container_of(p, struct p9_conn, pt);
@@ -554,13 +554,14 @@ p9_pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
 
 	if (!pwait) {
 		p9_debug(P9_DEBUG_ERROR, "not enough wait_address slots\n");
-		return;
+		return 0;
 	}
 
 	pwait->conn = m;
 	pwait->wait_addr = wait_address;
 	init_waitqueue_func_entry(&pwait->wait, p9_pollwake);
 	add_wait_queue(wait_address, &pwait->wait);
+	return 1;
 }
 
 /**
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index f59c1e8de7a2..f5e59036a937 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -168,12 +168,13 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 	return 0;
 }
 
-static void
+static int
 irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 			poll_table *pt)
 {
 	struct _irqfd *irqfd = container_of(pt, struct _irqfd, pt);
 	add_wait_queue(wqh, &irqfd->wait);
+	return 1;
 }
 
 /* Must be called under irqfds.lock */

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/2] signalfd/epoll fixes
  2012-02-24 23:14           ` Linus Torvalds
@ 2012-02-25 16:08             ` Oleg Nesterov
  2012-02-25 19:00               ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-25 16:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On 02/24, Linus Torvalds wrote:
>
> On Fri, Feb 24, 2012 at 12:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'm *really* conflicted, because I have this really strong feeling
> > that it's just papering over a symptom, and we damn well shouldn't be
> > doing that.

Heh. This is even documented in the changelog.

> I really think that what we really should do is allow
> > "poll()" to have a "poll_remove" callback (so each "add_poll_wait()"
> > will have a callback when it gets remove).

Yes. I also thought about this. In fact I think this probably makes
sense even ignoring epoll problems. Although I was thinking about
file_operations::poll_v2(file, poll_table, poll_table_entry, mode)
which could could replace the old ->poll() eventually. But perhaps
the explicit poll_rm() is better.

However, speaking about epoll/sigfd, imho this hides the problem too.

> +static void signalfd_poll_rm(struct file *file, wait_queue_head_t *p)
> +{
> +	struct sighand_struct *sighand;
> +
> +	sighand = container_of(p, struct sighand_struct, signalfd_wqh);
> +	__cleanup_sighand(sighand);
> +}
> +
>  static unsigned int signalfd_poll(struct file *file, poll_table *wait)
>  {
>  	struct signalfd_ctx *ctx = file->private_data;
>  	unsigned int events = 0;
>
> -	poll_wait(file, &current->sighand->signalfd_wqh, wait);
> +	if (poll_wait(file, &current->sighand->signalfd_wqh, wait))
> +		atomic_inc(&current->sighand->count);

Yes, this fixes use-after-free. But suppose the the application does:

	sigfd = signalfd(...);
	epoll_ctl(epollfd, EPOLL_CTL_ADD, sigfd);
	execve();

de_thread() unshares ->sighand because of atomic_inc() above and
epollfd no longer works after exec, and the application can't know
this. (yes, currently we have the same problem with CLONE_SIGHAND'ed
apps, but still).

We can turn ->signalfd_wqh into the pointer to the refcountable
memory, or add another counter, but this is not nice.

And in any case. If the task exits, to me it looks simply pointless
to keep its ->sighand until __fput(). This only makes poll_rm() or
whatever safe, it can't report a signal. OK, OK, I am not arguing,
POLLFREE is equally ugly or even worse.

I _think_ that the right fix should move wait_queue_head_t from
sighand_struct to signalfd_ctx (file->private_data) somehow...

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/2] signalfd/epoll fixes
  2012-02-25 16:08             ` Oleg Nesterov
@ 2012-02-25 19:00               ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2012-02-25 19:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Davide Libenzi, Eric Dumazet, Greg KH,
	Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On Sat, Feb 25, 2012 at 8:08 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Yes, this fixes use-after-free. But suppose the the application does:
>
>        sigfd = signalfd(...);
>        epoll_ctl(epollfd, EPOLL_CTL_ADD, sigfd);
>        execve();

Well, that can't work anyway. We're adding to the wrong sighand.

Sure, we can force a wakeup for it at execve(), but that's not the old
behavior either, so why would we care about the above case? It's not a
sane thing to do, and it has never worked before either, so why
bother? My patch should make it "work" as well as it ever did.

Quite frankly, if you wanted to make signalfd work sanely with
eventpoll, you'd need to change the semantics entirely, and just say
"signalfd works in the context it was creared in, no other". Those
aren't the semantics we have, though, and there's no point in trying
to make up some *new* semantics for "if you execve() we'll still
follow it"

> And in any case. If the task exits, to me it looks simply pointless
> to keep its ->sighand until __fput(). This only makes poll_rm() or
> whatever safe, it can't report a signal. OK, OK, I am not arguing,
> POLLFREE is equally ugly or even worse.

That's my point. POLLFREE really does look worse, and doesn't really
give any saner behaviors.

> I _think_ that the right fix should move wait_queue_head_t from
> sighand_struct to signalfd_ctx (file->private_data) somehow...

I looked at it - and it requires the same "poll_rm()" callback, and
then instead you have to make allocations in poll() (for the list) and
de-allocate in "poll_rm()".

I do agree that in many ways it would be the "right" thing to do, but
it does require the same infrastructure, and the advantage is dubious
for this case. The main advantage is that we can extend the signalfd +
epoll interaction in new directions (like the "across execve" or other
cases), but I'd argue that we don't *want* to do that.

So your POLLFREE patches are merged, and I think they work. The reason
I like the poll_rm() thing is that I think it's conceptually the right
solution, and while our "poll_wait()" interface has really worked very
well, the fact that you cannot do anything stateful in it (because
there is never any way to undo anything except for the "remove from
wait queue" action) is very rigid and inflexible.

Of course, aside from signalfd, that inflexibility has never really
mattered. So..

                  Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()
  2012-02-24 19:07         ` [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() Oleg Nesterov
@ 2012-02-29 19:57           ` Andy Lutomirski
  2012-02-29 20:06             ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2012-02-29 19:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Davide Libenzi, Eric Dumazet,
	Greg KH, Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On 02/24/2012 11:07 AM, Oleg Nesterov wrote:
> This patch is intentionally incomplete to simplify the review.
> It ignores ep_unregister_pollwait() which plays with the same wqh.
> See the next change.
> 
> epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
> f_op->poll() needs. In particular it assumes that the wait queue
> can't go away until eventpoll_release(). This is not true in case
> of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
> which is not connected to the file.
> 
> This patch adds the special event, POLLFREE, currently only for
> epoll. It expects that init_poll_funcptr()'ed hook should do the
> necessary cleanup. Perhaps it should be defined as EPOLLFREE in
> eventpoll.

> [lots of kernel-internal technical stuff]

I have a bunch of userspace code that uses signalfd via epoll.  Does
this affect the ABI?  Will epoll_wait ever set POLLFREE?  Does
EPOLL_CTL_MOD accept POLLFREE?

IOW, from a userspace point of view, wtf does this do?  I don't want to
end up with another POLLRDHUP-like* special case in my code.

--Andy

* IMO the right fix would have been to make EPOLLET fire POLLIN again
when the read point advances to EOF but before EOF is actually seen when
read() returns zero.  Then POLLRDHUP would be unnecessary and user code
could do its thing in blissful ignorance.  I hope POLLFREE isn't like that.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()
  2012-02-29 19:57           ` Andy Lutomirski
@ 2012-02-29 20:06             ` Oleg Nesterov
  2012-02-29 20:16               ` Andrew Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2012-02-29 20:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Andrew Morton, Davide Libenzi, Eric Dumazet,
	Greg KH, Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On 02/29, Andy Lutomirski wrote:
>
> On 02/24/2012 11:07 AM, Oleg Nesterov wrote:
> > This patch adds the special event, POLLFREE, currently only for
> > epoll. It expects that init_poll_funcptr()'ed hook should do the
> > necessary cleanup. Perhaps it should be defined as EPOLLFREE in
> > eventpoll.
>
> I have a bunch of userspace code that uses signalfd via epoll.  Does
> this affect the ABI?

I hope not ;)

> Will epoll_wait ever set POLLFREE?

No.

> Does
> EPOLL_CTL_MOD accept POLLFREE?

Yes, it doesn't check the bits. EPOLL_CTL_MOD accepts any bit.

> IOW, from a userspace point of view, wtf does this do?

I hope this is transparent to the user-space. If the application
does EPOLL_CTL_MOD(POLLFREE) by mistake then ep_poll_callback(POLLFREE)
can trigger the unnecessary wakeup and move the file to ep->rdllist
but this is harmless. The subsequent ->poll() can't return POLLFREE.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()
  2012-02-29 20:06             ` Oleg Nesterov
@ 2012-02-29 20:16               ` Andrew Lutomirski
  2012-03-01 19:26                 ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Lutomirski @ 2012-02-29 20:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Davide Libenzi, Eric Dumazet,
	Greg KH, Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On Wed, Feb 29, 2012 at 12:06 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 02/29, Andy Lutomirski wrote:
>>
>> On 02/24/2012 11:07 AM, Oleg Nesterov wrote:
>> > This patch adds the special event, POLLFREE, currently only for
>> > epoll. It expects that init_poll_funcptr()'ed hook should do the
>> > necessary cleanup. Perhaps it should be defined as EPOLLFREE in
>> > eventpoll.
>>
>> I have a bunch of userspace code that uses signalfd via epoll.  Does
>> this affect the ABI?
>
> I hope not ;)
>

Excellent!

To avoid further confusion, would it make sense to update the comment
in poll.h to indicate that POLLFREE is only for epoll and is internal
to the kernel?

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()
  2012-02-29 20:16               ` Andrew Lutomirski
@ 2012-03-01 19:26                 ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2012-03-01 19:26 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Andrew Morton, Davide Libenzi, Eric Dumazet,
	Greg KH, Jason Baron, Roland McGrath, Eugene Teo, Maxime Bizon,
	Denys Vlasenko, linux-kernel

On 02/29, Andrew Lutomirski wrote:
>
> On Wed, Feb 29, 2012 at 12:06 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 02/29, Andy Lutomirski wrote:
> >>
> >> On 02/24/2012 11:07 AM, Oleg Nesterov wrote:
> >> > This patch adds the special event, POLLFREE, currently only for
> >> > epoll. It expects that init_poll_funcptr()'ed hook should do the
> >> > necessary cleanup. Perhaps it should be defined as EPOLLFREE in
> >> > eventpoll.
> >>
> >> I have a bunch of userspace code that uses signalfd via epoll.  Does
> >> this affect the ABI?
> >
> > I hope not ;)
> >
>
> Excellent!
>
> To avoid further confusion, would it make sense to update the comment
> in poll.h to indicate that POLLFREE is only for epoll and is internal
> to the kernel?

Agreed. Or perhaps we can simply use POLLREMOVE instead...

But note that nobody (including me) likes this fix, just I was asked
to make something simple/backportable. Probably it will be reverted
later.

If not, then perhaps it makes sense to do a couple of simple changes
on top to cleanup the usage of POLLFREE and to ensure that POLLHUP
is actually delivered (the latter will be user-visible).

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-03-01 19:33 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20120222173326.GA7139@redhat.com>
2012-02-22 17:33 ` [PATCH 1/4] signalfd: introduce signalfd_cleanup() Oleg Nesterov
2012-02-22 17:34 ` [PATCH 2/4] epoll: introduce POLLFREE for ep_poll_callback() Oleg Nesterov
2012-02-22 17:34 ` [PATCH 3/4] signalfd: signalfd_cleanup() can race with remove_wait_queue() Oleg Nesterov
2012-02-22 17:35 ` [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead Oleg Nesterov
2012-02-23 15:44   ` Oleg Nesterov
2012-02-23 22:17     ` Linus Torvalds
2012-02-24 19:06       ` [PATCH v2 0/2] signalfd/epoll fixes Oleg Nesterov
2012-02-24 19:07         ` [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() Oleg Nesterov
2012-02-29 19:57           ` Andy Lutomirski
2012-02-29 20:06             ` Oleg Nesterov
2012-02-29 20:16               ` Andrew Lutomirski
2012-03-01 19:26                 ` Oleg Nesterov
2012-02-24 19:07         ` [PATCH v2 2/2] epoll: ep_unregister_pollwait() can use the freed pwq->whead Oleg Nesterov
2012-02-24 20:23         ` [PATCH v2 0/2] signalfd/epoll fixes Linus Torvalds
2012-02-24 23:14           ` Linus Torvalds
2012-02-25 16:08             ` Oleg Nesterov
2012-02-25 19:00               ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).