All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/11] nested sleeps, fixes and debug infrastructure
@ 2014-09-24  8:18 Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 01/11] locking/mutex: Dont assume TASK_RUNNING Peter Zijlstra
                   ` (11 more replies)
  0 siblings, 12 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

Hi,

This is a refresh of the nested sleep debug stuff which I posted as an RFC a
while back: lkml.kernel.org/r/20140804103025.478913141@infradead.org

Since then a number of issues identified by these patches have allready made
their way upstream:

    de713b57947a ("atm/svc: Fix blocking in wait loop")
    7c3af9752573 ("nfs: don't sleep with inode lock in lock_and_join_requests")

And I finally got some time to finish up these patches so we could merge them.
So please have a look and if nobody holllers we'll merge this 'soon'.

 ~ Peter


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 01/11] locking/mutex: Dont assume TASK_RUNNING
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-10-28 11:09   ` [tip:sched/core] locking/mutex: Don't " tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking Peter Zijlstra
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-mutex_lock-vs-task_state.patch --]
[-- Type: text/plain, Size: 1196 bytes --]

We're going to make might_sleep() test for TASK_RUNNING, because
blocking without TASK_RUNNING will destroy the task state by setting
it to TASK_RUNNING.

There are a few occasions where its 'valid' to call blocking
primitives (and mutex_lock in particular) and not have TASK_RUNNING,
typically such cases are right before we set TASK_RUNNING anyhow.

Robustify the code by not assuming this; this has the beneficial side
effect of allowing optional code emission for fixing the above
might_sleep() false positives.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/mutex.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -378,8 +378,14 @@ static bool mutex_optimistic_spin(struct
 	 * reschedule now, before we try-lock the mutex. This avoids getting
 	 * scheduled out right after we obtained the mutex.
 	 */
-	if (need_resched())
+	if (need_resched()) {
+		/*
+		 * We _should_ have TASK_RUNNING here, but just in case
+		 * we do not, make it so, otherwise we might get stuck.
+		 */
+		__set_current_state(TASK_RUNNING);
 		schedule_preempt_disabled();
+	}
 
 	return false;
 }



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 01/11] locking/mutex: Dont assume TASK_RUNNING Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-09-29 21:02   ` Oleg Nesterov
  2014-10-28 11:09   ` [tip:sched/core] sched/wait: " tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 03/11] wait: Add might_sleep() Peter Zijlstra
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-wait-nested.patch --]
[-- Type: text/plain, Size: 3854 bytes --]

There are a few places that call blocking primitives from wait loops,
provide infrastructure to support this without the typical
task_struct::state collision.

We record the wakeup in wait_queue_t::flags which leaves
task_struct::state free to be used by others.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/wait.h |    7 +++++
 kernel/sched/wait.c  |   61 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 1 deletion(-)

--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -13,9 +13,12 @@ typedef struct __wait_queue wait_queue_t
 typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int flags, void *key);
 int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *key);
 
+/* __wait_queue::flags */
+#define WQ_FLAG_EXCLUSIVE	0x01
+#define WQ_FLAG_WOKEN		0x02
+
 struct __wait_queue {
 	unsigned int		flags;
-#define WQ_FLAG_EXCLUSIVE	0x01
 	void			*private;
 	wait_queue_func_t	func;
 	struct list_head	task_list;
@@ -829,6 +832,8 @@ void prepare_to_wait_exclusive(wait_queu
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
+long wait_woken(wait_queue_t *wait, unsigned mode, long timeout);
+int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -297,6 +297,67 @@ int autoremove_wake_function(wait_queue_
 }
 EXPORT_SYMBOL(autoremove_wake_function);
 
+
+/*
+ * DEFINE_WAIT_FUNC(wait, woken_wait_func);
+ *
+ * add_wait_queue(&wq, &wait);
+ * for (;;) {
+ *     if (condition)
+ *         break;
+ *
+ *     p->state = mode;				condition = true;
+ *     smp_mb(); // A				smp_wmb(); // C
+ *     if (!wait->flags & WQ_FLAG_WOKEN)	wait->flags |= WQ_FLAG_WOKEN;
+ *         schedule()				try_to_wake_up();
+ *     p->state = TASK_RUNNING;		    ~~~~~~~~~~~~~~~~~~
+ *     wait->flags &= ~WQ_FLAG_WOKEN;		condition = true;
+ *     smp_mb() // B				smp_wmb(); // C
+ *						wait->flags |= WQ_FLAG_WOKEN;
+ * }
+ * remove_wait_queue(&wq, &wait);
+ *
+ */
+long wait_woken(wait_queue_t *wait, unsigned mode, long timeout)
+{
+	set_current_state(mode); /* A */
+	/*
+	 * The above implies an smp_mb(), which matches with the smp_wmb() from
+	 * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
+	 * also observe all state before the wakeup.
+	 */
+	if (!(wait->flags & WQ_FLAG_WOKEN))
+		timeout = schedule_timeout(timeout);
+	__set_current_state(TASK_RUNNING);
+
+	/*
+	 * The below implies an smp_mb(), it too pairs with the smp_wmb() from
+	 * woken_wake_function() such that we must either observe the wait
+	 * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
+	 * an event.
+	 */
+	set_mb(wait->flags, wait->flags & ~WQ_FLAG_WOKEN); /* B */
+
+	return timeout;
+}
+EXPORT_SYMBOL(wait_woken);
+
+int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	/*
+	 * Although this function is called under waitqueue lock, LOCK
+	 * doesn't imply write barrier and the users expects write
+	 * barrier semantics on wakeup functions.  The following
+	 * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
+	 * and is paired with set_mb() in wait_woken().
+	 */
+	smp_wmb(); /* C */
+	wait->flags |= WQ_FLAG_WOKEN;
+
+	return default_wake_function(wait, mode, sync, key);
+}
+EXPORT_SYMBOL(woken_wake_function);
+
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 03/11] wait: Add might_sleep()
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 01/11] locking/mutex: Dont assume TASK_RUNNING Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-10-28 11:09   ` [tip:sched/core] sched/wait: Add might_sleep() checks tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 04/11] exit: Deal with nested sleeps Peter Zijlstra
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-wait-might_sleep.patch --]
[-- Type: text/plain, Size: 4669 bytes --]

Add more might_sleep() checks, suppose someone put a wait_event() like
thing in a wait loop..

Can't put might_sleep() in ___wait_event() because there's the locked
primitives which call ___wait_event() with locks held.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/wait.h |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -260,6 +260,7 @@ __out:	__ret;								\
  */
 #define wait_event(wq, condition)					\
 do {									\
+	might_sleep();							\
 	if (condition)							\
 		break;							\
 	__wait_event(wq, condition);					\
@@ -292,6 +293,7 @@ do {									\
 #define wait_event_timeout(wq, condition, timeout)			\
 ({									\
 	long __ret = timeout;						\
+	might_sleep();							\
 	if (!___wait_cond_timeout(condition))				\
 		__ret = __wait_event_timeout(wq, condition, timeout);	\
 	__ret;								\
@@ -317,6 +319,7 @@ do {									\
  */
 #define wait_event_cmd(wq, condition, cmd1, cmd2)			\
 do {									\
+	might_sleep();							\
 	if (condition)							\
 		break;							\
 	__wait_event_cmd(wq, condition, cmd1, cmd2);			\
@@ -344,6 +347,7 @@ do {									\
 #define wait_event_interruptible(wq, condition)				\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_interruptible(wq, condition);	\
 	__ret;								\
@@ -377,6 +381,7 @@ do {									\
 #define wait_event_interruptible_timeout(wq, condition, timeout)	\
 ({									\
 	long __ret = timeout;						\
+	might_sleep();							\
 	if (!___wait_cond_timeout(condition))				\
 		__ret = __wait_event_interruptible_timeout(wq,		\
 						condition, timeout);	\
@@ -427,6 +432,7 @@ do {									\
 #define wait_event_hrtimeout(wq, condition, timeout)			\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_hrtimeout(wq, condition, timeout,	\
 					       TASK_UNINTERRUPTIBLE);	\
@@ -452,6 +458,7 @@ do {									\
 #define wait_event_interruptible_hrtimeout(wq, condition, timeout)	\
 ({									\
 	long __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_hrtimeout(wq, condition, timeout,	\
 					       TASK_INTERRUPTIBLE);	\
@@ -465,6 +472,7 @@ do {									\
 #define wait_event_interruptible_exclusive(wq, condition)		\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_interruptible_exclusive(wq, condition);\
 	__ret;								\
@@ -639,6 +647,7 @@ do {									\
 #define wait_event_killable(wq, condition)				\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_killable(wq, condition);		\
 	__ret;								\
@@ -888,6 +897,7 @@ extern int bit_wait_io(struct wait_bit_k
 static inline int
 wait_on_bit(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit(word, bit,
@@ -912,6 +922,7 @@ wait_on_bit(void *word, int bit, unsigne
 static inline int
 wait_on_bit_io(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit(word, bit,
@@ -938,6 +949,7 @@ wait_on_bit_io(void *word, int bit, unsi
 static inline int
 wait_on_bit_action(void *word, int bit, wait_bit_action_f *action, unsigned mode)
 {
+	might_sleep();
 	if (!test_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit(word, bit, action, mode);
@@ -965,6 +977,7 @@ wait_on_bit_action(void *word, int bit,
 static inline int
 wait_on_bit_lock(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_and_set_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, bit_wait, mode);
@@ -988,6 +1001,7 @@ wait_on_bit_lock(void *word, int bit, un
 static inline int
 wait_on_bit_lock_io(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_and_set_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, bit_wait_io, mode);
@@ -1013,6 +1027,7 @@ wait_on_bit_lock_io(void *word, int bit,
 static inline int
 wait_on_bit_lock_action(void *word, int bit, wait_bit_action_f *action, unsigned mode)
 {
+	might_sleep();
 	if (!test_and_set_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, action, mode);
@@ -1031,6 +1046,7 @@ wait_on_bit_lock_action(void *word, int
 static inline
 int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 {
+	might_sleep();
 	if (atomic_read(val) == 0)
 		return 0;
 	return out_of_line_wait_on_atomic_t(val, action, mode);



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 04/11] exit: Deal with nested sleeps
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (2 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 03/11] wait: Add might_sleep() Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-10-28 11:10   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 05/11] inotify: " Peter Zijlstra
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-might_sleep-exit.patch --]
[-- Type: text/plain, Size: 2625 bytes --]

do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end
up calling potential sleeps before we reset it.

Not strictly a bug since we're guaranteed to exit the loop and not
call schedule(); put in annotations to quiet might_sleep().

 WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90()
 do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270

 Call Trace:
  [<ffffffff81694991>] dump_stack+0x4e/0x7a
  [<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0
  [<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50
  [<ffffffff810bca6e>] __might_sleep+0x7e/0x90
  [<ffffffff811a1c15>] might_fault+0x55/0xb0
  [<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10
  [<ffffffff8109a804>] do_wait+0x104/0x270
  [<ffffffff8109b837>] SyS_wait4+0x77/0x100
  [<ffffffff8169d692>] system_call_fastpath+0x16/0x1b

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/kernel.h |    2 ++
 kernel/exit.c          |    5 +++++
 2 files changed, 7 insertions(+)

--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -175,10 +175,12 @@ extern int _cond_resched(void);
  */
 # define might_sleep() \
 	do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
+# define fixup_sleep()	__set_current_state(TASK_RUNNING)
 #else
   static inline void __might_sleep(const char *file, int line,
 				   int preempt_offset) { }
 # define might_sleep() do { might_resched(); } while (0)
+# define fixup_sleep() do { } while (0)
 #endif
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -997,6 +997,8 @@ static int wait_task_zombie(struct wait_
 
 		get_task_struct(p);
 		read_unlock(&tasklist_lock);
+		fixup_sleep();
+
 		if ((exit_code & 0x7f) == 0) {
 			why = CLD_EXITED;
 			status = exit_code >> 8;
@@ -1079,6 +1081,7 @@ static int wait_task_zombie(struct wait_
 	 * thread can reap it because we its state == DEAD/TRACE.
 	 */
 	read_unlock(&tasklist_lock);
+	fixup_sleep();
 
 	retval = wo->wo_rusage
 		? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0;
@@ -1210,6 +1213,7 @@ static int wait_task_stopped(struct wait
 	pid = task_pid_vnr(p);
 	why = ptrace ? CLD_TRAPPED : CLD_STOPPED;
 	read_unlock(&tasklist_lock);
+	fixup_sleep();
 
 	if (unlikely(wo->wo_flags & WNOWAIT))
 		return wait_noreap_copyout(wo, p, pid, uid, why, exit_code);
@@ -1272,6 +1276,7 @@ static int wait_task_continued(struct wa
 	pid = task_pid_vnr(p);
 	get_task_struct(p);
 	read_unlock(&tasklist_lock);
+	fixup_sleep();
 
 	if (!wo->wo_info) {
 		retval = wo->wo_rusage



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 05/11] inotify: Deal with nested sleeps
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (3 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 04/11] exit: Deal with nested sleeps Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-10-28 11:10   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 06/11] tty: " Peter Zijlstra
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra,
	Robert Love, Eric Paris, John McCutchan

[-- Attachment #1: peterz-might_sleep-inotify.patch --]
[-- Type: text/plain, Size: 1722 bytes --]

inotify_read is a wait loop with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state and the wait loop breaks the sleep functions that assume
TASK_RUNNING (mutex_lock).

Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.

Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/notify/inotify/inotify_user.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -227,14 +227,13 @@ static ssize_t inotify_read(struct file
 	struct fsnotify_event *kevent;
 	char __user *start;
 	int ret;
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
 	start = buf;
 	group = file->private_data;
 
+	add_wait_queue(&group->notification_waitq, &wait);
 	while (1) {
-		prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
-
 		mutex_lock(&group->notification_mutex);
 		kevent = get_one_event(group, count);
 		mutex_unlock(&group->notification_mutex);
@@ -264,10 +263,10 @@ static ssize_t inotify_read(struct file
 		if (start != buf)
 			break;
 
-		schedule();
+		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
 	}
+	remove_wait_queue(&group->notification_waitq, &wait);
 
-	finish_wait(&group->notification_waitq, &wait);
 	if (start != buf && ret != -EFAULT)
 		ret = buf - start;
 	return ret;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 06/11] tty: Deal with nested sleeps
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (4 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 05/11] inotify: " Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-10-28 11:10   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 07/11] smp: Correctly deal " Peter Zijlstra
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra,
	Jiri Slaby, Greg Kroah-Hartman

[-- Attachment #1: peterz-might_sleep-tty.patch --]
[-- Type: text/plain, Size: 2848 bytes --]

n_tty_{read,write} are wait loops with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state.

Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.

Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/tty/n_tty.c |   17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -2123,7 +2123,7 @@ static ssize_t n_tty_read(struct tty_str
 {
 	struct n_tty_data *ldata = tty->disc_data;
 	unsigned char __user *b = buf;
-	DECLARE_WAITQUEUE(wait, current);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 	int c;
 	int minimum, time;
 	ssize_t retval = 0;
@@ -2186,10 +2186,6 @@ static ssize_t n_tty_read(struct tty_str
 			nr--;
 			break;
 		}
-		/* This statement must be first before checking for input
-		   so that any interrupt will set the state back to
-		   TASK_RUNNING. */
-		set_current_state(TASK_INTERRUPTIBLE);
 
 		if (((minimum - (b - buf)) < ldata->minimum_to_wake) &&
 		    ((minimum - (b - buf)) >= 1))
@@ -2220,13 +2216,13 @@ static ssize_t n_tty_read(struct tty_str
 				n_tty_set_room(tty);
 				up_read(&tty->termios_rwsem);
 
-				timeout = schedule_timeout(timeout);
+				timeout = wait_woken(&wait, TASK_INTERRUPTIBLE,
+						     timeout);
 
 				down_read(&tty->termios_rwsem);
 				continue;
 			}
 		}
-		__set_current_state(TASK_RUNNING);
 
 		/* Deal with packet mode. */
 		if (packet && b == buf) {
@@ -2273,7 +2269,6 @@ static ssize_t n_tty_read(struct tty_str
 
 	mutex_unlock(&ldata->atomic_read_lock);
 
-	__set_current_state(TASK_RUNNING);
 	if (b - buf)
 		retval = b - buf;
 
@@ -2306,7 +2301,7 @@ static ssize_t n_tty_write(struct tty_st
 			   const unsigned char *buf, size_t nr)
 {
 	const unsigned char *b = buf;
-	DECLARE_WAITQUEUE(wait, current);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 	int c;
 	ssize_t retval = 0;
 
@@ -2324,7 +2319,6 @@ static ssize_t n_tty_write(struct tty_st
 
 	add_wait_queue(&tty->write_wait, &wait);
 	while (1) {
-		set_current_state(TASK_INTERRUPTIBLE);
 		if (signal_pending(current)) {
 			retval = -ERESTARTSYS;
 			break;
@@ -2378,12 +2372,11 @@ static ssize_t n_tty_write(struct tty_st
 		}
 		up_read(&tty->termios_rwsem);
 
-		schedule();
+		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
 
 		down_read(&tty->termios_rwsem);
 	}
 break_out:
-	__set_current_state(TASK_RUNNING);
 	remove_wait_queue(&tty->write_wait, &wait);
 	if (b - buf != nr && tty->fasync)
 		set_bit(TTY_DO_WRITE_WAKEUP, &tty->flags);



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 07/11] smp: Correctly deal with nested sleeps
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (5 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 06/11] tty: " Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-10-28 11:11   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 08/11] module: Fix nested sleep Peter Zijlstra
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-might_sleep_smpboot.patch --]
[-- Type: text/plain, Size: 1770 bytes --]

smp_hotplug_thread::{setup,unpark} functions can sleep too, so be
consistent and do the same for all callbacks.

 __might_sleep+0x74/0x80
 kmem_cache_alloc_trace+0x4e/0x1c0
 perf_event_alloc+0x55/0x450
 perf_event_create_kernel_counter+0x2f/0x100
 watchdog_nmi_enable+0x8d/0x160
 watchdog_enable+0x45/0x90
 smpboot_thread_fn+0xec/0x2b0
 kthread+0xe4/0x100
 ret_from_fork+0x7c/0xb0

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/smpboot.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -110,7 +110,7 @@ static int smpboot_thread_fn(void *data)
 		set_current_state(TASK_INTERRUPTIBLE);
 		preempt_disable();
 		if (kthread_should_stop()) {
-			set_current_state(TASK_RUNNING);
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			if (ht->cleanup)
 				ht->cleanup(td->cpu, cpu_online(td->cpu));
@@ -136,26 +136,27 @@ static int smpboot_thread_fn(void *data)
 		/* Check for state change setup */
 		switch (td->status) {
 		case HP_THREAD_NONE:
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			if (ht->setup)
 				ht->setup(td->cpu);
 			td->status = HP_THREAD_ACTIVE;
-			preempt_disable();
-			break;
+			continue;
+
 		case HP_THREAD_PARKED:
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			if (ht->unpark)
 				ht->unpark(td->cpu);
 			td->status = HP_THREAD_ACTIVE;
-			preempt_disable();
-			break;
+			continue;
 		}
 
 		if (!ht->thread_should_run(td->cpu)) {
-			preempt_enable();
+			preempt_enable_no_resched();
 			schedule();
 		} else {
-			set_current_state(TASK_RUNNING);
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			ht->thread_fn(td->cpu);
 		}



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 08/11] module: Fix nested sleep
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (6 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 07/11] smp: Correctly deal " Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-09-29 22:18   ` Oleg Nesterov
  2014-10-28 11:11   ` [tip:sched/core] sched, modules: Fix nested sleep in add_unformed_module() tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 09/11] net: Clean up sk_wait_event() vs might_sleep() Peter Zijlstra
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra,
	Rusty Russell, Fengguang Wu

[-- Attachment #1: peterz-sleep-module.patch --]
[-- Type: text/plain, Size: 1661 bytes --]

A genuine bug, we cannot use blocking primitives inside a wait loop.

So rewrite the wait_event_interruptible() usage to use the fresh
wait_woken() stuff.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3095,6 +3095,32 @@ static int may_init_module(void)
 }
 
 /*
+ * Can't use wait_event_interruptible() because our condition
+ * 'finished_loading()' contains a blocking primitive itself (mutex_lock).
+ */
+static int wait_finished_loading(struct module *mod)
+{
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
+	int ret = 0;
+
+	add_wait_queue(&module_wq, &wait);
+	for (;;) {
+		if (finished_loading(mod->name))
+			break;
+
+		if (signal_pending_state(TASK_INTERRUPTIBLE, current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
+	}
+	remove_wait_queue(&module_wq, &wait);
+
+	return ret;
+}
+
+/*
  * We try to place it in the list now to make sure it's unique before
  * we dedicate too many resources.  In particular, temporary percpu
  * memory exhaustion.
@@ -3114,8 +3140,8 @@ static int add_unformed_module(struct mo
 		    || old->state == MODULE_STATE_UNFORMED) {
 			/* Wait in case it fails to load. */
 			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
+
+			err = wait_finished_loading(mod);
 			if (err)
 				goto out_unlocked;
 			goto again;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 09/11] net: Clean up sk_wait_event() vs might_sleep()
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (7 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 08/11] module: Fix nested sleep Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-09-24  8:36   ` Peter Zijlstra
  2014-10-28 11:11   ` [tip:sched/core] sched, net: Clean up sk_wait_event() vs. might_sleep() tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 10/11] sched: Debug nested sleeps Peter Zijlstra
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra,
	David Miller

[-- Attachment #1: peterz-sleep-net-sk_stream_wait_memory.patch --]
[-- Type: text/plain, Size: 1093 bytes --]

WARNING: CPU: 1 PID: 1744 at kernel/sched/core.c:7104 __might_sleep+0x58/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff81070e10>] prepare_to_wait+0x50 /0xa0

 [<ffffffff8105bc38>] __might_sleep+0x58/0x90
 [<ffffffff8148c671>] lock_sock_nested+0x31/0xb0
 [<ffffffff81498aaa>] sk_stream_wait_memory+0x18a/0x2d0

Which is a false positive because sk_wait_event() will already have
TASK_RUNNING at that point if it would've gone through
schedule_timeout().

So annotate with fixup_sleep(); which goes away on !DEBUG builds.

Cc: netdev@vger.kernel.org
Cc: David Miller <davem@davemloft.net>
Reported-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/net/sock.h |    1 +
 1 file changed, 1 insertion(+)

--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -902,6 +902,7 @@ static inline void sock_rps_reset_rxhash
 		if (!__rc) {						\
 			*(__timeo) = schedule_timeout(*(__timeo));	\
 		}							\
+		fixup_sleep();						\
 		lock_sock(__sk);					\
 		__rc = __condition;					\
 		__rc;							\



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 10/11] sched: Debug nested sleeps
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (8 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 09/11] net: Clean up sk_wait_event() vs might_sleep() Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-09-29 22:13   ` Oleg Nesterov
  2014-10-28 11:11   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2014-09-24  8:18 ` [PATCH 11/11] sched: Exclude cond_resched() from nested sleep test Peter Zijlstra
  2014-09-25  8:30 ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Mike Galbraith
  11 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-might_sleep_nesting.patch --]
[-- Type: text/plain, Size: 3739 bytes --]

Validate we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, eg. mutex usage in a wait loop.

Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.

Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |   46 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/core.c   |   13 +++++++++++++
 2 files changed, 57 insertions(+), 2 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -242,6 +242,43 @@ extern char ___assert_task_state[1 - 2*!
 				((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
 				 (task->flags & PF_FROZEN) == 0)
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+
+#define __set_task_state(tsk, state_value)			\
+	do {							\
+		(tsk)->task_state_change = _THIS_IP_;		\
+		(tsk)->state = (state_value);			\
+	} while (0)
+#define set_task_state(tsk, state_value)			\
+	do {							\
+		(tsk)->task_state_change = _THIS_IP_;		\
+		set_mb((tsk)->state, (state_value));		\
+	} while (0)
+
+/*
+ * set_current_state() includes a barrier so that the write of current->state
+ * is correctly serialised wrt the caller's subsequent test of whether to
+ * actually sleep:
+ *
+ *	set_current_state(TASK_UNINTERRUPTIBLE);
+ *	if (do_i_need_to_sleep())
+ *		schedule();
+ *
+ * If the caller does not need such serialisation then use __set_current_state()
+ */
+#define __set_current_state(state_value)			\
+	do {							\
+		current->task_state_change = _THIS_IP_;		\
+		current->state = (state_value);			\
+	} while (0)
+#define set_current_state(state_value)				\
+	do {							\
+		current->task_state_change = _THIS_IP_;		\
+		set_mb(current->state, (state_value));		\
+	} while (0)
+
+#else
+
 #define __set_task_state(tsk, state_value)		\
 	do { (tsk)->state = (state_value); } while (0)
 #define set_task_state(tsk, state_value)		\
@@ -258,11 +295,13 @@ extern char ___assert_task_state[1 - 2*!
  *
  * If the caller does not need such serialisation then use __set_current_state()
  */
-#define __set_current_state(state_value)			\
+#define __set_current_state(state_value)		\
 	do { current->state = (state_value); } while (0)
-#define set_current_state(state_value)		\
+#define set_current_state(state_value)			\
 	set_mb(current->state, (state_value))
 
+#endif
+
 /* Task command name length */
 #define TASK_COMM_LEN 16
 
@@ -1660,6 +1699,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	unsigned long	task_state_change;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7143,6 +7143,19 @@ void __might_sleep(const char *file, int
 {
 	static unsigned long prev_jiffy;	/* ratelimiting */
 
+	/*
+	 * Blocking primitives will set (and therefore destroy) current->state,
+	 * since we will exit with TASK_RUNNING make sure we enter with it,
+	 * otherwise we will destroy state.
+	 */
+	if (WARN(current->state != TASK_RUNNING,
+			"do not call blocking ops when !TASK_RUNNING; "
+			"state=%lx set at [<%p>] %pS\n",
+			current->state,
+			(void *)current->task_state_change,
+			(void *)current->task_state_change))
+		__set_current_state(TASK_RUNNING);
+
 	rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
 	     !is_idle_task(current)) ||



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 11/11] sched: Exclude cond_resched() from nested sleep test
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (9 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 10/11] sched: Debug nested sleeps Peter Zijlstra
@ 2014-09-24  8:18 ` Peter Zijlstra
  2014-10-28 11:12   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2014-09-25  8:30 ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Mike Galbraith
  11 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:18 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, Peter Zijlstra

[-- Attachment #1: peterz-might_sleep-cond_resched.patch --]
[-- Type: text/plain, Size: 3146 bytes --]

cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.

In particular, preemption preserves task_struct::state.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/kernel.h |    3 +++
 include/linux/sched.h  |    6 +++---
 kernel/sched/core.c    |   12 +++++++++---
 3 files changed, 15 insertions(+), 6 deletions(-)

--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -162,6 +162,7 @@ extern int _cond_resched(void);
 #endif
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+  void ___might_sleep(const char *file, int line, int preempt_offset);
   void __might_sleep(const char *file, int line, int preempt_offset);
 /**
  * might_sleep - annotation for functions that can sleep
@@ -177,6 +178,8 @@ extern int _cond_resched(void);
 	do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
 # define fixup_sleep()	__set_current_state(TASK_RUNNING)
 #else
+  static inline void ___might_sleep(const char *file, int line,
+				   int preempt_offset) { }
   static inline void __might_sleep(const char *file, int line,
 				   int preempt_offset) { }
 # define might_sleep() do { might_resched(); } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2787,7 +2787,7 @@ static inline int signal_pending_state(l
 extern int _cond_resched(void);
 
 #define cond_resched() ({			\
-	__might_sleep(__FILE__, __LINE__, 0);	\
+	___might_sleep(__FILE__, __LINE__, 0);	\
 	_cond_resched();			\
 })
 
@@ -2800,14 +2800,14 @@ extern int __cond_resched_lock(spinlock_
 #endif
 
 #define cond_resched_lock(lock) ({				\
-	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
 	__cond_resched_lock(lock);				\
 })
 
 extern int __cond_resched_softirq(void);
 
 #define cond_resched_softirq() ({					\
-	__might_sleep(__FILE__, __LINE__, SOFTIRQ_DISABLE_OFFSET);	\
+	___might_sleep(__FILE__, __LINE__, SOFTIRQ_DISABLE_OFFSET);	\
 	__cond_resched_softirq();					\
 })
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7141,8 +7141,6 @@ static inline int preempt_count_equals(i
 
 void __might_sleep(const char *file, int line, int preempt_offset)
 {
-	static unsigned long prev_jiffy;	/* ratelimiting */
-
 	/*
 	 * Blocking primitives will set (and therefore destroy) current->state,
 	 * since we will exit with TASK_RUNNING make sure we enter with it,
@@ -7156,6 +7154,14 @@ void __might_sleep(const char *file, int
 			(void *)current->task_state_change))
 		__set_current_state(TASK_RUNNING);
 
+	___might_sleep(file, line, preempt_offset);
+}
+EXPORT_SYMBOL(__might_sleep);
+
+void ___might_sleep(const char *file, int line, int preempt_offset)
+{
+	static unsigned long prev_jiffy;	/* ratelimiting */
+
 	rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
 	     !is_idle_task(current)) ||
@@ -7185,7 +7191,7 @@ void __might_sleep(const char *file, int
 #endif
 	dump_stack();
 }
-EXPORT_SYMBOL(__might_sleep);
+EXPORT_SYMBOL(___might_sleep);
 #endif
 
 #ifdef CONFIG_MAGIC_SYSRQ



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 09/11] net: Clean up sk_wait_event() vs might_sleep()
  2014-09-24  8:18 ` [PATCH 09/11] net: Clean up sk_wait_event() vs might_sleep() Peter Zijlstra
@ 2014-09-24  8:36   ` Peter Zijlstra
  2014-10-28 11:11   ` [tip:sched/core] sched, net: Clean up sk_wait_event() vs. might_sleep() tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-24  8:36 UTC (permalink / raw)
  To: mingo, oleg, torvalds
  Cc: tglx, ilya.dryomov, umgwanakikbuti, linux-kernel, David Miller

On Wed, Sep 24, 2014 at 10:18:54AM +0200, Peter Zijlstra wrote:
> WARNING: CPU: 1 PID: 1744 at kernel/sched/core.c:7104 __might_sleep+0x58/0x90()
> do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff81070e10>] prepare_to_wait+0x50 /0xa0
> 
>  [<ffffffff8105bc38>] __might_sleep+0x58/0x90
>  [<ffffffff8148c671>] lock_sock_nested+0x31/0xb0
>  [<ffffffff81498aaa>] sk_stream_wait_memory+0x18a/0x2d0
> 
> Which is a false positive because sk_wait_event() will already have
> TASK_RUNNING at that point if it would've gone through
> schedule_timeout().
> 
> So annotate with fixup_sleep(); which goes away on !DEBUG builds.
> 
> Cc: netdev@vger.kernel.org
> Cc: David Miller <davem@davemloft.net>
> Reported-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Dave, FYI this patch relies on a previous one that introduces the
fixup_sleep()

 lkml.kernel.org/r/20140924082242.186408915@infradead.org

If you're ok with things, I'll merge this one along with the rest of
things.

> ---
>  include/net/sock.h |    1 +
>  1 file changed, 1 insertion(+)
> 
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -902,6 +902,7 @@ static inline void sock_rps_reset_rxhash
>  		if (!__rc) {						\
>  			*(__timeo) = schedule_timeout(*(__timeo));	\
>  		}							\
> +		fixup_sleep();						\
>  		lock_sock(__sk);					\
>  		__rc = __condition;					\
>  		__rc;							\
> 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
                   ` (10 preceding siblings ...)
  2014-09-24  8:18 ` [PATCH 11/11] sched: Exclude cond_resched() from nested sleep test Peter Zijlstra
@ 2014-09-25  8:30 ` Mike Galbraith
  2014-09-25  9:06   ` Peter Zijlstra
  11 siblings, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2014-09-25  8:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel

On Wed, 2014-09-24 at 10:18 +0200, Peter Zijlstra wrote: 
> Hi,
> 
> This is a refresh of the nested sleep debug stuff which I posted as an RFC a
> while back: lkml.kernel.org/r/20140804103025.478913141@infradead.org
> 
> Since then a number of issues identified by these patches have allready made
> their way upstream:
> 
>     de713b57947a ("atm/svc: Fix blocking in wait loop")
>     7c3af9752573 ("nfs: don't sleep with inode lock in lock_and_join_requests")
> 
> And I finally got some time to finish up these patches so we could merge them.
> So please have a look and if nobody holllers we'll merge this 'soon'.

My DL980 hollered itself to death while booting.

[   39.587224] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff811021d0>] kauditd_thread+0x130/0x1e0
[   39.706325] Modules linked in: iTCO_wdt(E) gpio_ich(E) iTCO_vendor_support(E) joydev(E) i7core_edac(E) lpc_ich(E) hid_generic(E) hpwdt(E) mfd_core(E) edac_core(E) bnx2(E) shpchp(E) sr_mod(E) ehci_pci(E) hpilo(E) netxen_nic(E) ipmi_si(E) cdrom(E) pcspkr(E) sg(E) acpi_power_meter(E) ipmi_msghandler(E) button(E) ext4(E) jbd2(E) mbcache(E) crc16(E) usbhid(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) sd_mod(E) thermal(E) usb_common(E) processor(E) scsi_dh_hp_sw(E) scsi_dh_emc(E) scsi_dh_rdac(E) scsi_dh_alua(E) scsi_dh(E) ata_generic(E) ata_piix(E) libata(E) hpsa(E) cciss(E) scsi_mod(E)
[   40.373599] CPU: 9 PID: 1974 Comm: kauditd Tainted: G            E  3.17.0-default #2
[   40.506928] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[   40.613753]  0000000000001bd9 ffff88026f3d3d78 ffffffff815b2fc2 ffff88026f3d3db8
[   40.728720]  ffffffff8106613c ffff88026f3d3da8 ffff88026b4fa110 0000000000000000
[   40.816116]  0000000000000038 ffffffff8180ff47 ffff88026f3d3e58 ffff88026f3d3e18
[   40.905088] Call Trace:
[   40.938325]  [<ffffffff815b2fc2>] dump_stack+0x72/0x88
[   41.000143]  [<ffffffff8106613c>] warn_slowpath_common+0x8c/0xc0
[   41.067996]  [<ffffffff81066226>] warn_slowpath_fmt+0x46/0x50
[   41.132669]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
[   41.204105]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
[   41.270699]  [<ffffffff8108d214>] __might_sleep+0x84/0xa0
[   41.333979]  [<ffffffff8110224b>] kauditd_thread+0x1ab/0x1e0
[   41.398612]  [<ffffffff810940c0>] ? try_to_wake_up+0x210/0x210
[   41.465435]  [<ffffffff811020a0>] ? audit_printk_skb+0x70/0x70
[   41.534628]  [<ffffffff810859db>] kthread+0xeb/0x100
[   41.596562]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80
[   41.678973]  [<ffffffff815b85bc>] ret_from_fork+0x7c/0xb0
[   41.742073]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80

Then printk() went gaga printing eg ** 54502 printk messages dropped **
plus snippets of above endlessly, so I had to power reset.

Without your patches, but CONFIG_DEBUG_ATOMIC_SLEEP still enabled, there
was a reboot time gripe.  My other boxen seems to be gripe free.

[ 1031.427792] BUG: sleeping function called from invalid context at include/linux/netdevice.h:476
[ 1031.522628] in_atomic(): 1, irqs_disabled(): 0, pid: 23854, name: ip
[ 1031.591809] CPU: 61 PID: 23854 Comm: ip Tainted: G            E  3.17.0-default #1
[ 1031.673757] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[ 1031.756990]  ffff88026b486800 ffff8800792ab568 ffffffff815b2462 ffff8800792ab578
[ 1031.836756]  ffffffff8108cf46 ffff8800792ab5a8 ffffffffa02e4df4 ffff880272595200
[ 1031.915786]  0000000000000001 ffff880272595280 ffff880274527000 ffff8800792ab5f8
[ 1032.021354] Call Trace:
[ 1032.049208]  [<ffffffff815b2462>] dump_stack+0x72/0x88
[ 1032.107115]  [<ffffffff8108cf46>] __might_sleep+0xd6/0x110
[ 1032.168179]  [<ffffffffa02e4df4>] netxen_napi_disable+0x94/0xf0 [netxen_nic]
[ 1032.245813]  [<ffffffffa02e79f0>] __netxen_nic_down+0x160/0x1d0 [netxen_nic]
[ 1032.327204]  [<ffffffffa02e7d1b>] netxen_nic_close+0x1b/0x20 [netxen_nic]
[ 1032.407045]  [<ffffffff814bafed>] __dev_close_many+0x9d/0xf0
[ 1032.472971]  [<ffffffff814bb076>] __dev_close+0x36/0x50
[ 1032.531286]  [<ffffffff814bc36c>] __dev_change_flags+0xac/0x180
[ 1032.596742]  [<ffffffff814bc477>] dev_change_flags+0x37/0x80
[ 1032.658692]  [<ffffffff814cf214>] do_setlink+0x244/0x7e0
[ 1032.718090]  [<ffffffff814d0cb0>] rtnl_newlink+0x5a0/0x7d0
[ 1032.778735]  [<ffffffff814d085a>] ? rtnl_newlink+0x14a/0x7d0
[ 1032.840688]  [<ffffffff814d0321>] rtnetlink_rcv_msg+0xa1/0x240
[ 1032.905125]  [<ffffffff813084a6>] ? rhashtable_lookup_compare+0x46/0x70
[ 1032.981889]  [<ffffffff814d0280>] ? __rtnl_unlock+0x20/0x20
[ 1033.043229]  [<ffffffff814ef619>] netlink_rcv_skb+0x89/0xb0
[ 1033.103637]  [<ffffffff814d051c>] rtnetlink_rcv+0x2c/0x40
[ 1033.163646]  [<ffffffff814ef019>] netlink_unicast+0x119/0x180
[ 1033.227938]  [<ffffffff81305f4c>] ? memcpy_fromiovec+0x6c/0x90
[ 1033.293557]  [<ffffffff814efa1a>] netlink_sendmsg+0x3da/0x470
[ 1033.355779]  [<ffffffff814a4dfc>] sock_sendmsg+0x9c/0xd0
[ 1033.417217]  [<ffffffff810a8583>] ? __wake_up+0x53/0x70
[ 1033.480884]  [<ffffffff81164e15>] ? unlock_page+0x65/0x70
[ 1033.541126]  [<ffffffff81192053>] ? might_fault+0x43/0x50
[ 1033.599727]  [<ffffffff814b2a6e>] ? verify_iovec+0x5e/0xf0
[ 1033.659925]  [<ffffffff814a5756>] ___sys_sendmsg+0x436/0x440
[ 1033.723060]  [<ffffffff81198283>] ? handle_pte_fault+0x213/0x260
[ 1033.789348]  [<ffffffff814a289f>] ? copy_to_user+0x2f/0x40
[ 1033.849237]  [<ffffffff81198429>] ? __handle_mm_fault+0x159/0x330
[ 1033.915082]  [<ffffffff811986ff>] ? handle_mm_fault+0xff/0x1b0
[ 1033.979227]  [<ffffffff81051d5c>] ? __do_page_fault+0x2dc/0x4c0
[ 1034.065054]  [<ffffffff8119c3c1>] ? __vma_link_rb+0x101/0x120
[ 1034.135430]  [<ffffffff8119dc38>] ? do_brk+0x1c8/0x340
[ 1034.193893]  [<ffffffff814a2e52>] ? SyS_getsockname+0xb2/0xc0
[ 1034.257707]  [<ffffffff814a5939>] __sys_sendmsg+0x49/0x80
[ 1034.317279]  [<ffffffff814a5989>] SyS_sendmsg+0x19/0x20
[ 1034.376290]  [<ffffffff815b7969>] system_call_fastpath+0x16/0x1b



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-09-25  8:30 ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Mike Galbraith
@ 2014-09-25  9:06   ` Peter Zijlstra
  2014-09-25  9:10     ` Mike Galbraith
  2014-09-25  9:15     ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-25  9:06 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel

On Thu, Sep 25, 2014 at 10:30:03AM +0200, Mike Galbraith wrote:
> On Wed, 2014-09-24 at 10:18 +0200, Peter Zijlstra wrote: 
> > Hi,
> > 
> > This is a refresh of the nested sleep debug stuff which I posted as an RFC a
> > while back: lkml.kernel.org/r/20140804103025.478913141@infradead.org
> > 
> > Since then a number of issues identified by these patches have allready made
> > their way upstream:
> > 
> >     de713b57947a ("atm/svc: Fix blocking in wait loop")
> >     7c3af9752573 ("nfs: don't sleep with inode lock in lock_and_join_requests")
> > 
> > And I finally got some time to finish up these patches so we could merge them.
> > So please have a look and if nobody holllers we'll merge this 'soon'.
> 
> My DL980 hollered itself to death while booting.
> 
> [   39.587224] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff811021d0>] kauditd_thread+0x130/0x1e0
> [   39.706325] Modules linked in: iTCO_wdt(E) gpio_ich(E) iTCO_vendor_support(E) joydev(E) i7core_edac(E) lpc_ich(E) hid_generic(E) hpwdt(E) mfd_core(E) edac_core(E) bnx2(E) shpchp(E) sr_mod(E) ehci_pci(E) hpilo(E) netxen_nic(E) ipmi_si(E) cdrom(E) pcspkr(E) sg(E) acpi_power_meter(E) ipmi_msghandler(E) button(E) ext4(E) jbd2(E) mbcache(E) crc16(E) usbhid(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) sd_mod(E) thermal(E) usb_common(E) processor(E) scsi_dh_hp_sw(E) scsi_dh_emc(E) scsi_dh_rdac(E) scsi_dh_alua(E) scsi_dh(E) ata_generic(E) ata_piix(E) libata(E) hpsa(E) cciss(E) scsi_mod(E)
> [   40.373599] CPU: 9 PID: 1974 Comm: kauditd Tainted: G            E  3.17.0-default #2
> [   40.506928] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
> [   40.613753]  0000000000001bd9 ffff88026f3d3d78 ffffffff815b2fc2 ffff88026f3d3db8
> [   40.728720]  ffffffff8106613c ffff88026f3d3da8 ffff88026b4fa110 0000000000000000
> [   40.816116]  0000000000000038 ffffffff8180ff47 ffff88026f3d3e58 ffff88026f3d3e18
> [   40.905088] Call Trace:
> [   40.938325]  [<ffffffff815b2fc2>] dump_stack+0x72/0x88
> [   41.000143]  [<ffffffff8106613c>] warn_slowpath_common+0x8c/0xc0
> [   41.067996]  [<ffffffff81066226>] warn_slowpath_fmt+0x46/0x50
> [   41.132669]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> [   41.204105]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> [   41.270699]  [<ffffffff8108d214>] __might_sleep+0x84/0xa0
> [   41.333979]  [<ffffffff8110224b>] kauditd_thread+0x1ab/0x1e0
> [   41.398612]  [<ffffffff810940c0>] ? try_to_wake_up+0x210/0x210
> [   41.465435]  [<ffffffff811020a0>] ? audit_printk_skb+0x70/0x70
> [   41.534628]  [<ffffffff810859db>] kthread+0xeb/0x100
> [   41.596562]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80
> [   41.678973]  [<ffffffff815b85bc>] ret_from_fork+0x7c/0xb0
> [   41.742073]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80

Cute,.. where exactly is that __might_sleep coming from; nothing obvious
there (then again, I'm half asleep) and stuff seems to have gotten
inlined.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-09-25  9:06   ` Peter Zijlstra
@ 2014-09-25  9:10     ` Mike Galbraith
  2014-09-25  9:15     ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Mike Galbraith @ 2014-09-25  9:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel

On Thu, 2014-09-25 at 11:06 +0200, Peter Zijlstra wrote: 
> On Thu, Sep 25, 2014 at 10:30:03AM +0200, Mike Galbraith wrote:
> > On Wed, 2014-09-24 at 10:18 +0200, Peter Zijlstra wrote: 
> > > Hi,
> > > 
> > > This is a refresh of the nested sleep debug stuff which I posted as an RFC a
> > > while back: lkml.kernel.org/r/20140804103025.478913141@infradead.org
> > > 
> > > Since then a number of issues identified by these patches have allready made
> > > their way upstream:
> > > 
> > >     de713b57947a ("atm/svc: Fix blocking in wait loop")
> > >     7c3af9752573 ("nfs: don't sleep with inode lock in lock_and_join_requests")
> > > 
> > > And I finally got some time to finish up these patches so we could merge them.
> > > So please have a look and if nobody holllers we'll merge this 'soon'.
> > 
> > My DL980 hollered itself to death while booting.
> > 
> > [   39.587224] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff811021d0>] kauditd_thread+0x130/0x1e0
> > [   39.706325] Modules linked in: iTCO_wdt(E) gpio_ich(E) iTCO_vendor_support(E) joydev(E) i7core_edac(E) lpc_ich(E) hid_generic(E) hpwdt(E) mfd_core(E) edac_core(E) bnx2(E) shpchp(E) sr_mod(E) ehci_pci(E) hpilo(E) netxen_nic(E) ipmi_si(E) cdrom(E) pcspkr(E) sg(E) acpi_power_meter(E) ipmi_msghandler(E) button(E) ext4(E) jbd2(E) mbcache(E) crc16(E) usbhid(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) sd_mod(E) thermal(E) usb_common(E) processor(E) scsi_dh_hp_sw(E) scsi_dh_emc(E) scsi_dh_rdac(E) scsi_dh_alua(E) scsi_dh(E) ata_generic(E) ata_piix(E) libata(E) hpsa(E) cciss(E) scsi_mod(E)
> > [   40.373599] CPU: 9 PID: 1974 Comm: kauditd Tainted: G            E  3.17.0-default #2
> > [   40.506928] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
> > [   40.613753]  0000000000001bd9 ffff88026f3d3d78 ffffffff815b2fc2 ffff88026f3d3db8
> > [   40.728720]  ffffffff8106613c ffff88026f3d3da8 ffff88026b4fa110 0000000000000000
> > [   40.816116]  0000000000000038 ffffffff8180ff47 ffff88026f3d3e58 ffff88026f3d3e18
> > [   40.905088] Call Trace:
> > [   40.938325]  [<ffffffff815b2fc2>] dump_stack+0x72/0x88
> > [   41.000143]  [<ffffffff8106613c>] warn_slowpath_common+0x8c/0xc0
> > [   41.067996]  [<ffffffff81066226>] warn_slowpath_fmt+0x46/0x50
> > [   41.132669]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> > [   41.204105]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> > [   41.270699]  [<ffffffff8108d214>] __might_sleep+0x84/0xa0
> > [   41.333979]  [<ffffffff8110224b>] kauditd_thread+0x1ab/0x1e0
> > [   41.398612]  [<ffffffff810940c0>] ? try_to_wake_up+0x210/0x210
> > [   41.465435]  [<ffffffff811020a0>] ? audit_printk_skb+0x70/0x70
> > [   41.534628]  [<ffffffff810859db>] kthread+0xeb/0x100
> > [   41.596562]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80
> > [   41.678973]  [<ffffffff815b85bc>] ret_from_fork+0x7c/0xb0
> > [   41.742073]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80
> 
> Cute,.. where exactly is that __might_sleep coming from; nothing obvious
> there (then again, I'm half asleep) and stuff seems to have gotten
> inlined.

0xffffffff8110227a is in kauditd_thread (include/linux/freezer.h:56).
51       * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
52       * If try_to_freeze causes a lockdep warning it means the caller may deadlock
53       */
54      static inline bool try_to_freeze_unsafe(void)
55      {
56              might_sleep();
57              if (likely(!freezing(current)))
58                      return false;
59              return __refrigerator(false);
60      }


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-09-25  9:06   ` Peter Zijlstra
  2014-09-25  9:10     ` Mike Galbraith
@ 2014-09-25  9:15     ` Peter Zijlstra
  2014-09-25  9:56       ` Mike Galbraith
  2014-10-02 10:22       ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
  1 sibling, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-25  9:15 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel

On Thu, Sep 25, 2014 at 11:06:19AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 25, 2014 at 10:30:03AM +0200, Mike Galbraith wrote:
> > On Wed, 2014-09-24 at 10:18 +0200, Peter Zijlstra wrote: 
> > > Hi,
> > > 
> > > This is a refresh of the nested sleep debug stuff which I posted as an RFC a
> > > while back: lkml.kernel.org/r/20140804103025.478913141@infradead.org
> > > 
> > > Since then a number of issues identified by these patches have allready made
> > > their way upstream:
> > > 
> > >     de713b57947a ("atm/svc: Fix blocking in wait loop")
> > >     7c3af9752573 ("nfs: don't sleep with inode lock in lock_and_join_requests")
> > > 
> > > And I finally got some time to finish up these patches so we could merge them.
> > > So please have a look and if nobody holllers we'll merge this 'soon'.
> > 
> > My DL980 hollered itself to death while booting.
> > 
> > [   39.587224] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff811021d0>] kauditd_thread+0x130/0x1e0
> > [   39.706325] Modules linked in: iTCO_wdt(E) gpio_ich(E) iTCO_vendor_support(E) joydev(E) i7core_edac(E) lpc_ich(E) hid_generic(E) hpwdt(E) mfd_core(E) edac_core(E) bnx2(E) shpchp(E) sr_mod(E) ehci_pci(E) hpilo(E) netxen_nic(E) ipmi_si(E) cdrom(E) pcspkr(E) sg(E) acpi_power_meter(E) ipmi_msghandler(E) button(E) ext4(E) jbd2(E) mbcache(E) crc16(E) usbhid(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) sd_mod(E) thermal(E) usb_common(E) processor(E) scsi_dh_hp_sw(E) scsi_dh_emc(E) scsi_dh_rdac(E) scsi_dh_alua(E) scsi_dh(E) ata_generic(E) ata_piix(E) libata(E) hpsa(E) cciss(E) scsi_mod(E)
> > [   40.373599] CPU: 9 PID: 1974 Comm: kauditd Tainted: G            E  3.17.0-default #2
> > [   40.506928] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
> > [   40.613753]  0000000000001bd9 ffff88026f3d3d78 ffffffff815b2fc2 ffff88026f3d3db8
> > [   40.728720]  ffffffff8106613c ffff88026f3d3da8 ffff88026b4fa110 0000000000000000
> > [   40.816116]  0000000000000038 ffffffff8180ff47 ffff88026f3d3e58 ffff88026f3d3e18
> > [   40.905088] Call Trace:
> > [   40.938325]  [<ffffffff815b2fc2>] dump_stack+0x72/0x88
> > [   41.000143]  [<ffffffff8106613c>] warn_slowpath_common+0x8c/0xc0
> > [   41.067996]  [<ffffffff81066226>] warn_slowpath_fmt+0x46/0x50
> > [   41.132669]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> > [   41.204105]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> > [   41.270699]  [<ffffffff8108d214>] __might_sleep+0x84/0xa0
> > [   41.333979]  [<ffffffff8110224b>] kauditd_thread+0x1ab/0x1e0
> > [   41.398612]  [<ffffffff810940c0>] ? try_to_wake_up+0x210/0x210
> > [   41.465435]  [<ffffffff811020a0>] ? audit_printk_skb+0x70/0x70
> > [   41.534628]  [<ffffffff810859db>] kthread+0xeb/0x100
> > [   41.596562]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80
> > [   41.678973]  [<ffffffff815b85bc>] ret_from_fork+0x7c/0xb0
> > [   41.742073]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80
> 
> Cute,.. where exactly is that __might_sleep coming from; nothing obvious
> there (then again, I'm half asleep) and stuff seems to have gotten
> inlined.

D'uh asleep indeed. Its the try_to_freeze() call. Curious construct
there, they don't actually have a wait loop, just a single go round.

I suppose just flipping the scheudle() and try_to_freeze() should do the
trick. Ugly code that, but then again, I think audit is known for that.

Can you give that a go?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-09-25  9:15     ` Peter Zijlstra
@ 2014-09-25  9:56       ` Mike Galbraith
  2014-09-25 13:59         ` BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370 Mike Galbraith
  2014-10-02 10:22       ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2014-09-25  9:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel

On Thu, 2014-09-25 at 11:15 +0200, Peter Zijlstra wrote:

> I suppose just flipping the scheudle() and try_to_freeze() should do the
> trick. Ugly code that, but then again, I think audit is known for that.
> 
> Can you give that a go?

Yeah, seems that fixed that one.

And thanks to my clever ratelimiting, printk() no longer goes gaga as it
emits this one.. forever.  This boot is gonna take a while too :)


[   53.951297] BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
[   54.044733] in_atomic(): 1, irqs_disabled(): 0, pid: 2676, name: modprobe
[   54.120213] CPU: 42 PID: 2676 Comm: modprobe Tainted: G            E  3.17.0-default #5
[   54.212540] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[   54.294917]  0000000000000000 ffff880273f83838 ffffffff815b2ff2 ffff880273f83848
[   54.377182]  ffffffff8108d156 ffff880273f83878 ffffffff8108d1d8 ffff88026e71e600
[   54.459679]  ffff880273f83898 ffff88026e71e748 ffff88026e71e744 ffff880273f838f8
[   54.542107] Call Trace:
[   54.573965]  [<ffffffff815b2ff2>] dump_stack+0x72/0x88
[   54.635104]  [<ffffffff8108d156>] ___might_sleep+0xd6/0x110
[   54.700956]  [<ffffffff8108d1d8>] __might_sleep+0x48/0xd0
[   54.765821]  [<ffffffff8146c1d5>] cpufreq_freq_transition_begin+0x75/0x140
[   54.842571]  [<ffffffffa0548523>] pcc_cpufreq_target+0x63/0x208 [pcc_cpufreq]
[   54.923403]  [<ffffffff810dec5f>] ? update_ts_time_stats+0x7f/0xb0
[   54.996678]  [<ffffffff8146c725>] __cpufreq_driver_target+0x85/0x170
[   55.067871]  [<ffffffff8146ed98>] od_check_cpu+0xa8/0xb0
[   55.126035]  [<ffffffff8146f7c0>] dbs_check_cpu+0x180/0x1d0
[   55.191425]  [<ffffffff8146fbc0>] cpufreq_governor_dbs+0x3b0/0x720
[   55.259809]  [<ffffffff8146f493>] od_cpufreq_governor_dbs+0x33/0xe0
[   55.329556]  [<ffffffff81469cb9>] __cpufreq_governor+0xa9/0x210
[   55.396925]  [<ffffffff8146a892>] cpufreq_set_policy+0x1e2/0x2e0
[   55.462852]  [<ffffffff8146afac>] cpufreq_init_policy+0x8c/0x110
[   55.462854]  [<ffffffff8146d270>] ? cpufreq_update_policy+0x1b0/0x1b0
[   55.462857]  [<ffffffff8146cf96>] __cpufreq_add_dev+0x596/0x6b0
[   55.462859]  [<ffffffffa054d608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
[   55.462861]  [<ffffffff8146d0be>] cpufreq_add_dev+0xe/0x10
[   55.462863]  [<ffffffff8140eef1>] subsys_interface_register+0xc1/0xf0
[   55.462865]  [<ffffffff8146bcb7>] cpufreq_register_driver+0x117/0x2a0
[   55.462868]  [<ffffffffa054d65d>] pcc_cpufreq_init+0x55/0x9f8 [pcc_cpufreq]
[   55.462870]  [<ffffffffa054d608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
[   55.462873]  [<ffffffff81000298>] do_one_initcall+0xc8/0x1f0
[   55.462875]  [<ffffffff811a581d>] ? __vunmap+0x9d/0x100
[   55.462877]  [<ffffffff810e9680>] do_init_module+0x30/0x1b0
[   55.462878]  [<ffffffff810ebaaa>] load_module+0x67a/0x700
[   55.462879]  [<ffffffff810e9800>] ? do_init_module+0x1b0/0x1b0
[   55.462881]  [<ffffffff810ebcdb>] SyS_init_module+0x9b/0xc0
[   55.462883]  [<ffffffff815b86a9>] system_call_fastpath+0x16/0x1b
[   55.463467] BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
[   55.463468] in_atomic(): 1, irqs_disabled(): 0, pid: 2676, name: modprobe
[   55.463469] CPU: 42 PID: 2676 Comm: modprobe Tainted: G         










^ permalink raw reply	[flat|nested] 59+ messages in thread

* BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
  2014-09-25  9:56       ` Mike Galbraith
@ 2014-09-25 13:59         ` Mike Galbraith
  2014-09-26  6:24           ` Mike Galbraith
  0 siblings, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2014-09-25 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel,
	Rafael J. Wysocki

On Thu, 2014-09-25 at 11:56 +0200, Mike Galbraith wrote: 
> On Thu, 2014-09-25 at 11:15 +0200, Peter Zijlstra wrote:
> 
> > I suppose just flipping the scheudle() and try_to_freeze() should do the
> > trick. Ugly code that, but then again, I think audit is known for that.
> > 
> > Can you give that a go?
> 
> Yeah, seems that fixed that one.

pcc-cpufreq: spin_lock() -> wait_event() -> gripe() gripe() gripe()...

[   50.756280] BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
[   50.850582] in_atomic(): 1, irqs_disabled(): 0, pid: 2636, name: modprobe
[   50.926744] Preemption disabled at:[<ffffffffa04d74d7>] pcc_cpufreq_target+0x27/0x200 [pcc_cpufreq]
[   51.025044]
[   51.045454] CPU: 57 PID: 2636 Comm: modprobe Tainted: G            E  3.17.0-default #7
[   51.136727] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[   51.218484]  00000000ffffffff ffff88026c46b828 ffffffff81589dbd 0000000000000000
[   51.300293]  ffff880037978090 ffff88026c46b848 ffffffff8108e1df ffff880037978090
[   51.382666]  0000000000000000 ffff88026c46b878 ffffffff8108e298 ffff88026d73ec00
[   51.464884] Call Trace:
[   51.495159]  [<ffffffff81589dbd>] dump_stack+0x4d/0x90
[   51.555439]  [<ffffffff8108e1df>] ___might_sleep+0x10f/0x180
[   51.620610]  [<ffffffff8108e298>] __might_sleep+0x48/0xd0
[   51.684774]  [<ffffffff8145b905>] cpufreq_freq_transition_begin+0x75/0x140 drivers/cpufreq/cpufreq.c:370 wait_event(policy->transition_wait, !policy->transition_ongoing);
[   51.767992]  [<ffffffff8108fc99>] ? preempt_count_add+0xb9/0xc0
[   51.833294]  [<ffffffffa04d7513>] pcc_cpufreq_target+0x63/0x200 [pcc_cpufreq] drivers/cpufreq/pcc-cpufreq.c:207 spin_lock(&pcc_lock);
[   51.930204]  [<ffffffff810e0d0f>] ? update_ts_time_stats+0x7f/0xb0
[   52.009288]  [<ffffffff8145be55>] __cpufreq_driver_target+0x85/0x170
[   52.082479]  [<ffffffff8145e4c8>] od_check_cpu+0xa8/0xb0
[   52.143874]  [<ffffffff8145ef10>] dbs_check_cpu+0x180/0x1d0
[   52.207329]  [<ffffffff8145f310>] cpufreq_governor_dbs+0x3b0/0x720
[   52.276916]  [<ffffffff8145ebe3>] od_cpufreq_governor_dbs+0x33/0xe0
[   52.349048]  [<ffffffff814593d9>] __cpufreq_governor+0xa9/0x210
[   52.415040]  [<ffffffff81459fb2>] cpufreq_set_policy+0x1e2/0x2e0
[   52.480473]  [<ffffffff8145a6cc>] cpufreq_init_policy+0x8c/0x110
[   52.545829]  [<ffffffff8145c9a0>] ? cpufreq_update_policy+0x1b0/0x1b0
[   52.615796]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
[   52.681961]  [<ffffffff8145c6c6>] __cpufreq_add_dev+0x596/0x6b0
[   52.746288]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
[   52.830472]  [<ffffffff8145c7ee>] cpufreq_add_dev+0xe/0x10
[   52.894106]  [<ffffffff81408e81>] subsys_interface_register+0xc1/0xf0
[   52.970763]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
[   53.038953]  [<ffffffff8145b3d7>] cpufreq_register_driver+0x117/0x2a0
[   53.110751]  [<ffffffffa016c65d>] pcc_cpufreq_init+0x55/0x9f8 [pcc_cpufreq]
[   53.186920]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
[   53.267583]  [<ffffffff81000298>] do_one_initcall+0xc8/0x1f0
[   53.330601]  [<ffffffff811a731d>] ? __vunmap+0x9d/0x100
[   53.388285]  [<ffffffff810eb9a0>] do_init_module+0x30/0x1b0
[   53.451958]  [<ffffffff810edfa6>] load_module+0x686/0x710
[   53.514007]  [<ffffffff810ebb20>] ? do_init_module+0x1b0/0x1b0
[   53.581246]  [<ffffffff810ee1db>] SyS_init_module+0x9b/0xc0
[   53.644024]  [<ffffffff8158f7a9>] system_call_fastpath+0x16/0x1b



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
  2014-09-25 13:59         ` BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370 Mike Galbraith
@ 2014-09-26  6:24           ` Mike Galbraith
  2014-09-26  7:54             ` Mike Galbraith
  0 siblings, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2014-09-26  6:24 UTC (permalink / raw)
  To: LKML; +Cc: Rafael Wysocki, Viresh Kumar

(the hazards of multitasking.. post escaped early, and went to mostly
the wrong folks)


While testing some scheduler patches, the below pcc-cpufreq
might_sleep() gripe fell out.

On Thu, 2014-09-25 at 15:59 +0200, Mike Galbraith wrote:

> pcc-cpufreq: spin_lock() -> wait_event() -> gripe() gripe() gripe()...
> 
> [   50.756280] BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
> [   50.850582] in_atomic(): 1, irqs_disabled(): 0, pid: 2636, name: modprobe
> [   50.926744] Preemption disabled at:[<ffffffffa04d74d7>] pcc_cpufreq_target+0x27/0x200 [pcc_cpufreq]
> [   51.025044]
> [   51.045454] CPU: 57 PID: 2636 Comm: modprobe Tainted: G            E  3.17.0-default #7
> [   51.136727] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
> [   51.218484]  00000000ffffffff ffff88026c46b828 ffffffff81589dbd 0000000000000000
> [   51.300293]  ffff880037978090 ffff88026c46b848 ffffffff8108e1df ffff880037978090
> [   51.382666]  0000000000000000 ffff88026c46b878 ffffffff8108e298 ffff88026d73ec00
> [   51.464884] Call Trace:
> [   51.495159]  [<ffffffff81589dbd>] dump_stack+0x4d/0x90
> [   51.555439]  [<ffffffff8108e1df>] ___might_sleep+0x10f/0x180
> [   51.620610]  [<ffffffff8108e298>] __might_sleep+0x48/0xd0
> [   51.684774]  [<ffffffff8145b905>] cpufreq_freq_transition_begin+0x75/0x140 drivers/cpufreq/cpufreq.c:370 wait_event(policy->transition_wait, !policy->transition_ongoing);
> [   51.767992]  [<ffffffff8108fc99>] ? preempt_count_add+0xb9/0xc0
> [   51.833294]  [<ffffffffa04d7513>] pcc_cpufreq_target+0x63/0x200 [pcc_cpufreq] drivers/cpufreq/pcc-cpufreq.c:207 spin_lock(&pcc_lock);
> [   51.930204]  [<ffffffff810e0d0f>] ? update_ts_time_stats+0x7f/0xb0
> [   52.009288]  [<ffffffff8145be55>] __cpufreq_driver_target+0x85/0x170
> [   52.082479]  [<ffffffff8145e4c8>] od_check_cpu+0xa8/0xb0
> [   52.143874]  [<ffffffff8145ef10>] dbs_check_cpu+0x180/0x1d0
> [   52.207329]  [<ffffffff8145f310>] cpufreq_governor_dbs+0x3b0/0x720
> [   52.276916]  [<ffffffff8145ebe3>] od_cpufreq_governor_dbs+0x33/0xe0
> [   52.349048]  [<ffffffff814593d9>] __cpufreq_governor+0xa9/0x210
> [   52.415040]  [<ffffffff81459fb2>] cpufreq_set_policy+0x1e2/0x2e0
> [   52.480473]  [<ffffffff8145a6cc>] cpufreq_init_policy+0x8c/0x110
> [   52.545829]  [<ffffffff8145c9a0>] ? cpufreq_update_policy+0x1b0/0x1b0
> [   52.615796]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
> [   52.681961]  [<ffffffff8145c6c6>] __cpufreq_add_dev+0x596/0x6b0
> [   52.746288]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
> [   52.830472]  [<ffffffff8145c7ee>] cpufreq_add_dev+0xe/0x10
> [   52.894106]  [<ffffffff81408e81>] subsys_interface_register+0xc1/0xf0
> [   52.970763]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
> [   53.038953]  [<ffffffff8145b3d7>] cpufreq_register_driver+0x117/0x2a0
> [   53.110751]  [<ffffffffa016c65d>] pcc_cpufreq_init+0x55/0x9f8 [pcc_cpufreq]
> [   53.186920]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
> [   53.267583]  [<ffffffff81000298>] do_one_initcall+0xc8/0x1f0
> [   53.330601]  [<ffffffff811a731d>] ? __vunmap+0x9d/0x100
> [   53.388285]  [<ffffffff810eb9a0>] do_init_module+0x30/0x1b0
> [   53.451958]  [<ffffffff810edfa6>] load_module+0x686/0x710
> [   53.514007]  [<ffffffff810ebb20>] ? do_init_module+0x1b0/0x1b0
> [   53.581246]  [<ffffffff810ee1db>] SyS_init_module+0x9b/0xc0
> [   53.644024]  [<ffffffff8158f7a9>] system_call_fastpath+0x16/0x1b
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
  2014-09-26  6:24           ` Mike Galbraith
@ 2014-09-26  7:54             ` Mike Galbraith
  2014-09-26 14:10               ` Rafael J. Wysocki
  2014-09-26 22:44               ` Rafael J. Wysocki
  0 siblings, 2 replies; 59+ messages in thread
From: Mike Galbraith @ 2014-09-26  7:54 UTC (permalink / raw)
  To: LKML; +Cc: Rafael Wysocki, Viresh Kumar

On Fri, 2014-09-26 at 08:24 +0200, Mike Galbraith wrote: 
> (the hazards of multitasking.. post escaped early, and went to mostly
> the wrong folks)
> 
> 
> While testing some scheduler patches, the below pcc-cpufreq
> might_sleep() gripe fell out.

Because the bits below from 8fec051e didn't make the lock go away first.
Reverting only pcc-cpufreq back to notifier.. works.

--------------------- drivers/cpufreq/integrator-cpufreq.c ---------------------
index 0e27844e8c2d..e5122f1bfe78 100644
@@ -122,7 +122,7 @@ static int integrator_set_target(struct cpufreq_policy *policy,
 		return 0;
 	}
 
-	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
+	cpufreq_freq_transition_begin(policy, &freqs);
 
 	cm_osc = __raw_readl(cm_base + INTEGRATOR_HDR_OSC_OFFSET);
 
@@ -143,7 +143,7 @@ static int integrator_set_target(struct cpufreq_policy *policy,
 	 */
 	set_cpus_allowed(current, cpus_allowed);
 
-	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
+	cpufreq_freq_transition_end(policy, &freqs, 0);
 
 	return 0;
 }

> On Thu, 2014-09-25 at 15:59 +0200, Mike Galbraith wrote:
> 
> > pcc-cpufreq: spin_lock() -> wait_event() -> gripe() gripe() gripe()...
> > 
> > [   50.756280] BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
> > [   50.850582] in_atomic(): 1, irqs_disabled(): 0, pid: 2636, name: modprobe
> > [   50.926744] Preemption disabled at:[<ffffffffa04d74d7>] pcc_cpufreq_target+0x27/0x200 [pcc_cpufreq]
> > [   51.025044]
> > [   51.045454] CPU: 57 PID: 2636 Comm: modprobe Tainted: G            E  3.17.0-default #7
> > [   51.136727] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
> > [   51.218484]  00000000ffffffff ffff88026c46b828 ffffffff81589dbd 0000000000000000
> > [   51.300293]  ffff880037978090 ffff88026c46b848 ffffffff8108e1df ffff880037978090
> > [   51.382666]  0000000000000000 ffff88026c46b878 ffffffff8108e298 ffff88026d73ec00
> > [   51.464884] Call Trace:
> > [   51.495159]  [<ffffffff81589dbd>] dump_stack+0x4d/0x90
> > [   51.555439]  [<ffffffff8108e1df>] ___might_sleep+0x10f/0x180
> > [   51.620610]  [<ffffffff8108e298>] __might_sleep+0x48/0xd0
> > [   51.684774]  [<ffffffff8145b905>] cpufreq_freq_transition_begin+0x75/0x140 drivers/cpufreq/cpufreq.c:370 wait_event(policy->transition_wait, !policy->transition_ongoing);
> > [   51.767992]  [<ffffffff8108fc99>] ? preempt_count_add+0xb9/0xc0
> > [   51.833294]  [<ffffffffa04d7513>] pcc_cpufreq_target+0x63/0x200 [pcc_cpufreq] drivers/cpufreq/pcc-cpufreq.c:207 spin_lock(&pcc_lock);
> > [   51.930204]  [<ffffffff810e0d0f>] ? update_ts_time_stats+0x7f/0xb0
> > [   52.009288]  [<ffffffff8145be55>] __cpufreq_driver_target+0x85/0x170
> > [   52.082479]  [<ffffffff8145e4c8>] od_check_cpu+0xa8/0xb0
> > [   52.143874]  [<ffffffff8145ef10>] dbs_check_cpu+0x180/0x1d0
> > [   52.207329]  [<ffffffff8145f310>] cpufreq_governor_dbs+0x3b0/0x720
> > [   52.276916]  [<ffffffff8145ebe3>] od_cpufreq_governor_dbs+0x33/0xe0
> > [   52.349048]  [<ffffffff814593d9>] __cpufreq_governor+0xa9/0x210
> > [   52.415040]  [<ffffffff81459fb2>] cpufreq_set_policy+0x1e2/0x2e0
> > [   52.480473]  [<ffffffff8145a6cc>] cpufreq_init_policy+0x8c/0x110
> > [   52.545829]  [<ffffffff8145c9a0>] ? cpufreq_update_policy+0x1b0/0x1b0
> > [   52.615796]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
> > [   52.681961]  [<ffffffff8145c6c6>] __cpufreq_add_dev+0x596/0x6b0
> > [   52.746288]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
> > [   52.830472]  [<ffffffff8145c7ee>] cpufreq_add_dev+0xe/0x10
> > [   52.894106]  [<ffffffff81408e81>] subsys_interface_register+0xc1/0xf0
> > [   52.970763]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
> > [   53.038953]  [<ffffffff8145b3d7>] cpufreq_register_driver+0x117/0x2a0
> > [   53.110751]  [<ffffffffa016c65d>] pcc_cpufreq_init+0x55/0x9f8 [pcc_cpufreq]
> > [   53.186920]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
> > [   53.267583]  [<ffffffff81000298>] do_one_initcall+0xc8/0x1f0
> > [   53.330601]  [<ffffffff811a731d>] ? __vunmap+0x9d/0x100
> > [   53.388285]  [<ffffffff810eb9a0>] do_init_module+0x30/0x1b0
> > [   53.451958]  [<ffffffff810edfa6>] load_module+0x686/0x710
> > [   53.514007]  [<ffffffff810ebb20>] ? do_init_module+0x1b0/0x1b0
> > [   53.581246]  [<ffffffff810ee1db>] SyS_init_module+0x9b/0xc0
> > [   53.644024]  [<ffffffff8158f7a9>] system_call_fastpath+0x16/0x1b
> > 
> 
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
  2014-09-26  7:54             ` Mike Galbraith
@ 2014-09-26 14:10               ` Rafael J. Wysocki
  2014-09-26 22:44               ` Rafael J. Wysocki
  1 sibling, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2014-09-26 14:10 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Viresh Kumar, Linux PM list

On Friday, September 26, 2014 09:54:00 AM Mike Galbraith wrote:
> On Fri, 2014-09-26 at 08:24 +0200, Mike Galbraith wrote: 
> > (the hazards of multitasking.. post escaped early, and went to mostly
> > the wrong folks)
> > 
> > 
> > While testing some scheduler patches, the below pcc-cpufreq
> > might_sleep() gripe fell out.
> 
> Because the bits below from 8fec051e didn't make the lock go away first.
> Reverting only pcc-cpufreq back to notifier.. works.

OK, thanks!

I may just apply the below, but I'd prefer to keep things consistent overall.
I need to think about this for a while.

> --------------------- drivers/cpufreq/integrator-cpufreq.c ---------------------
> index 0e27844e8c2d..e5122f1bfe78 100644
> @@ -122,7 +122,7 @@ static int integrator_set_target(struct cpufreq_policy *policy,
>  		return 0;
>  	}
>  
> -	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
> +	cpufreq_freq_transition_begin(policy, &freqs);
>  
>  	cm_osc = __raw_readl(cm_base + INTEGRATOR_HDR_OSC_OFFSET);
>  
> @@ -143,7 +143,7 @@ static int integrator_set_target(struct cpufreq_policy *policy,
>  	 */
>  	set_cpus_allowed(current, cpus_allowed);
>  
> -	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
> +	cpufreq_freq_transition_end(policy, &freqs, 0);
>  
>  	return 0;
>  }
> 
> > On Thu, 2014-09-25 at 15:59 +0200, Mike Galbraith wrote:
> > 
> > > pcc-cpufreq: spin_lock() -> wait_event() -> gripe() gripe() gripe()...
> > > 
> > > [   50.756280] BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
> > > [   50.850582] in_atomic(): 1, irqs_disabled(): 0, pid: 2636, name: modprobe
> > > [   50.926744] Preemption disabled at:[<ffffffffa04d74d7>] pcc_cpufreq_target+0x27/0x200 [pcc_cpufreq]
> > > [   51.025044]
> > > [   51.045454] CPU: 57 PID: 2636 Comm: modprobe Tainted: G            E  3.17.0-default #7
> > > [   51.136727] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
> > > [   51.218484]  00000000ffffffff ffff88026c46b828 ffffffff81589dbd 0000000000000000
> > > [   51.300293]  ffff880037978090 ffff88026c46b848 ffffffff8108e1df ffff880037978090
> > > [   51.382666]  0000000000000000 ffff88026c46b878 ffffffff8108e298 ffff88026d73ec00
> > > [   51.464884] Call Trace:
> > > [   51.495159]  [<ffffffff81589dbd>] dump_stack+0x4d/0x90
> > > [   51.555439]  [<ffffffff8108e1df>] ___might_sleep+0x10f/0x180
> > > [   51.620610]  [<ffffffff8108e298>] __might_sleep+0x48/0xd0
> > > [   51.684774]  [<ffffffff8145b905>] cpufreq_freq_transition_begin+0x75/0x140 drivers/cpufreq/cpufreq.c:370 wait_event(policy->transition_wait, !policy->transition_ongoing);
> > > [   51.767992]  [<ffffffff8108fc99>] ? preempt_count_add+0xb9/0xc0
> > > [   51.833294]  [<ffffffffa04d7513>] pcc_cpufreq_target+0x63/0x200 [pcc_cpufreq] drivers/cpufreq/pcc-cpufreq.c:207 spin_lock(&pcc_lock);
> > > [   51.930204]  [<ffffffff810e0d0f>] ? update_ts_time_stats+0x7f/0xb0
> > > [   52.009288]  [<ffffffff8145be55>] __cpufreq_driver_target+0x85/0x170
> > > [   52.082479]  [<ffffffff8145e4c8>] od_check_cpu+0xa8/0xb0
> > > [   52.143874]  [<ffffffff8145ef10>] dbs_check_cpu+0x180/0x1d0
> > > [   52.207329]  [<ffffffff8145f310>] cpufreq_governor_dbs+0x3b0/0x720
> > > [   52.276916]  [<ffffffff8145ebe3>] od_cpufreq_governor_dbs+0x33/0xe0
> > > [   52.349048]  [<ffffffff814593d9>] __cpufreq_governor+0xa9/0x210
> > > [   52.415040]  [<ffffffff81459fb2>] cpufreq_set_policy+0x1e2/0x2e0
> > > [   52.480473]  [<ffffffff8145a6cc>] cpufreq_init_policy+0x8c/0x110
> > > [   52.545829]  [<ffffffff8145c9a0>] ? cpufreq_update_policy+0x1b0/0x1b0
> > > [   52.615796]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
> > > [   52.681961]  [<ffffffff8145c6c6>] __cpufreq_add_dev+0x596/0x6b0
> > > [   52.746288]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
> > > [   52.830472]  [<ffffffff8145c7ee>] cpufreq_add_dev+0xe/0x10
> > > [   52.894106]  [<ffffffff81408e81>] subsys_interface_register+0xc1/0xf0
> > > [   52.970763]  [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
> > > [   53.038953]  [<ffffffff8145b3d7>] cpufreq_register_driver+0x117/0x2a0
> > > [   53.110751]  [<ffffffffa016c65d>] pcc_cpufreq_init+0x55/0x9f8 [pcc_cpufreq]
> > > [   53.186920]  [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
> > > [   53.267583]  [<ffffffff81000298>] do_one_initcall+0xc8/0x1f0
> > > [   53.330601]  [<ffffffff811a731d>] ? __vunmap+0x9d/0x100
> > > [   53.388285]  [<ffffffff810eb9a0>] do_init_module+0x30/0x1b0
> > > [   53.451958]  [<ffffffff810edfa6>] load_module+0x686/0x710
> > > [   53.514007]  [<ffffffff810ebb20>] ? do_init_module+0x1b0/0x1b0
> > > [   53.581246]  [<ffffffff810ee1db>] SyS_init_module+0x9b/0xc0
> > > [   53.644024]  [<ffffffff8158f7a9>] system_call_fastpath+0x16/0x1b
> > > 
> > 
> > 
> 
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
  2014-09-26  7:54             ` Mike Galbraith
  2014-09-26 14:10               ` Rafael J. Wysocki
@ 2014-09-26 22:44               ` Rafael J. Wysocki
  2014-09-27  6:14                 ` Mike Galbraith
  1 sibling, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2014-09-26 22:44 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Viresh Kumar, Linux PM list

On Friday, September 26, 2014 09:54:00 AM Mike Galbraith wrote:
> On Fri, 2014-09-26 at 08:24 +0200, Mike Galbraith wrote: 
> > (the hazards of multitasking.. post escaped early, and went to mostly
> > the wrong folks)
> > 
> > 
> > While testing some scheduler patches, the below pcc-cpufreq
> > might_sleep() gripe fell out.
> 
> Because the bits below from 8fec051e didn't make the lock go away first.
> Reverting only pcc-cpufreq back to notifier.. works.
> 
> --------------------- drivers/cpufreq/integrator-cpufreq.c ---------------------

Are you sure this is the right file?

Shouldn't that be pcc-cpufreq.c rather?

Also moving the spin_lock(&pcc_lock) after the cpufreq_freq_transition_begin()
should fix the problem too (like the below).  Have you tried that?


---
 drivers/cpufreq/pcc-cpufreq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-pm/drivers/cpufreq/pcc-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/pcc-cpufreq.c
+++ linux-pm/drivers/cpufreq/pcc-cpufreq.c
@@ -204,7 +204,6 @@ static int pcc_cpufreq_target(struct cpu
 	u32 input_buffer;
 	int cpu;
 
-	spin_lock(&pcc_lock);
 	cpu = policy->cpu;
 	pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
 
@@ -216,6 +215,7 @@ static int pcc_cpufreq_target(struct cpu
 	freqs.old = policy->cur;
 	freqs.new = target_freq;
 	cpufreq_freq_transition_begin(policy, &freqs);
+	spin_lock(&pcc_lock);
 
 	input_buffer = 0x1 | (((target_freq * 100)
 			       / (ioread32(&pcch_hdr->nominal) * 1000)) << 8);


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
  2014-09-26 22:44               ` Rafael J. Wysocki
@ 2014-09-27  6:14                 ` Mike Galbraith
  2014-09-27 19:57                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Mike Galbraith @ 2014-09-27  6:14 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: LKML, Viresh Kumar, Linux PM list

On Sat, 2014-09-27 at 00:44 +0200, Rafael J. Wysocki wrote:

> Shouldn't that be pcc-cpufreq.c rather?

Yeah, silly mouse copy/pasted the wrong gitk bits.

> Also moving the spin_lock(&pcc_lock) after the cpufreq_freq_transition_begin()
> should fix the problem too (like the below).  Have you tried that?

Have now.  Yup, works and is prettier.

-Mike


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
  2014-09-27  6:14                 ` Mike Galbraith
@ 2014-09-27 19:57                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2014-09-27 19:57 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, Viresh Kumar, Linux PM list

On Saturday, September 27, 2014 08:14:46 AM Mike Galbraith wrote:
> On Sat, 2014-09-27 at 00:44 +0200, Rafael J. Wysocki wrote:
> 
> > Shouldn't that be pcc-cpufreq.c rather?
> 
> Yeah, silly mouse copy/pasted the wrong gitk bits.
> 
> > Also moving the spin_lock(&pcc_lock) after the cpufreq_freq_transition_begin()
> > should fix the problem too (like the below).  Have you tried that?
> 
> Have now.  Yup, works and is prettier.

OK, thanks!

Below it goes again with full changelog.

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: cpufreq: pcc-cpufreq: Fix wait_event() under spinlock

Fix the following bug introduced by commit 8fec051eea73 (cpufreq:
Convert existing drivers to use cpufreq_freq_transition_{begin|end})
that forgot to move the spin_lock() in pcc_cpufreq_target() past
cpufreq_freq_transition_begin() which calls wait_event():

BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370
in_atomic(): 1, irqs_disabled(): 0, pid: 2636, name: modprobe
Preemption disabled at:[<ffffffffa04d74d7>] pcc_cpufreq_target+0x27/0x200 [pcc_cpufreq]
[   51.025044]
CPU: 57 PID: 2636 Comm: modprobe Tainted: G            E  3.17.0-default #7
Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
 00000000ffffffff ffff88026c46b828 ffffffff81589dbd 0000000000000000
 ffff880037978090 ffff88026c46b848 ffffffff8108e1df ffff880037978090
 0000000000000000 ffff88026c46b878 ffffffff8108e298 ffff88026d73ec00
Call Trace:
 [<ffffffff81589dbd>] dump_stack+0x4d/0x90
 [<ffffffff8108e1df>] ___might_sleep+0x10f/0x180
 [<ffffffff8108e298>] __might_sleep+0x48/0xd0
 [<ffffffff8145b905>] cpufreq_freq_transition_begin+0x75/0x140 drivers/cpufreq/cpufreq.c:370 wait_event(policy->transition_wait, !policy->transition_ongoing);
 [<ffffffff8108fc99>] ? preempt_count_add+0xb9/0xc0
 [<ffffffffa04d7513>] pcc_cpufreq_target+0x63/0x200 [pcc_cpufreq] drivers/cpufreq/pcc-cpufreq.c:207 spin_lock(&pcc_lock);
 [<ffffffff810e0d0f>] ? update_ts_time_stats+0x7f/0xb0
 [<ffffffff8145be55>] __cpufreq_driver_target+0x85/0x170
 [<ffffffff8145e4c8>] od_check_cpu+0xa8/0xb0
 [<ffffffff8145ef10>] dbs_check_cpu+0x180/0x1d0
 [<ffffffff8145f310>] cpufreq_governor_dbs+0x3b0/0x720
 [<ffffffff8145ebe3>] od_cpufreq_governor_dbs+0x33/0xe0
 [<ffffffff814593d9>] __cpufreq_governor+0xa9/0x210
 [<ffffffff81459fb2>] cpufreq_set_policy+0x1e2/0x2e0
 [<ffffffff8145a6cc>] cpufreq_init_policy+0x8c/0x110
 [<ffffffff8145c9a0>] ? cpufreq_update_policy+0x1b0/0x1b0
 [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
 [<ffffffff8145c6c6>] __cpufreq_add_dev+0x596/0x6b0
 [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
 [<ffffffff8145c7ee>] cpufreq_add_dev+0xe/0x10
 [<ffffffff81408e81>] subsys_interface_register+0xc1/0xf0
 [<ffffffff8108fb99>] ? preempt_count_sub+0xb9/0x100
 [<ffffffff8145b3d7>] cpufreq_register_driver+0x117/0x2a0
 [<ffffffffa016c65d>] pcc_cpufreq_init+0x55/0x9f8 [pcc_cpufreq]
 [<ffffffffa016c608>] ? pcc_cpufreq_probe+0x4b4/0x4b4 [pcc_cpufreq]
 [<ffffffff81000298>] do_one_initcall+0xc8/0x1f0
 [<ffffffff811a731d>] ? __vunmap+0x9d/0x100
 [<ffffffff810eb9a0>] do_init_module+0x30/0x1b0
 [<ffffffff810edfa6>] load_module+0x686/0x710
 [<ffffffff810ebb20>] ? do_init_module+0x1b0/0x1b0
 [<ffffffff810ee1db>] SyS_init_module+0x9b/0xc0
 [<ffffffff8158f7a9>] system_call_fastpath+0x16/0x1b

Fixes: 8fec051eea73 (cpufreq: Convert existing drivers to use cpufreq_freq_transition_{begin|end})
Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: 3.15+ <stable@vger.kernel.org> # 3.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/pcc-cpufreq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-pm/drivers/cpufreq/pcc-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/pcc-cpufreq.c
+++ linux-pm/drivers/cpufreq/pcc-cpufreq.c
@@ -204,7 +204,6 @@ static int pcc_cpufreq_target(struct cpu
 	u32 input_buffer;
 	int cpu;
 
-	spin_lock(&pcc_lock);
 	cpu = policy->cpu;
 	pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
 
@@ -216,6 +215,7 @@ static int pcc_cpufreq_target(struct cpu
 	freqs.old = policy->cur;
 	freqs.new = target_freq;
 	cpufreq_freq_transition_begin(policy, &freqs);
+	spin_lock(&pcc_lock);
 
 	input_buffer = 0x1 | (((target_freq * 100)
 			       / (ioread32(&pcch_hdr->nominal) * 1000)) << 8);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking
  2014-09-24  8:18 ` [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking Peter Zijlstra
@ 2014-09-29 21:02   ` Oleg Nesterov
  2014-10-02  7:37     ` Peter Zijlstra
  2014-10-28 11:09   ` [tip:sched/core] sched/wait: " tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2014-09-29 21:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On 09/24, Peter Zijlstra wrote:
>
> There are a few places that call blocking primitives from wait loops,
> provide infrastructure to support this without the typical
> task_struct::state collision.
>
> We record the wakeup in wait_queue_t::flags which leaves
> task_struct::state free to be used by others.

Sorry for delay. FWIW,

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

> +/*
> + * DEFINE_WAIT_FUNC(wait, woken_wait_func);
                             ^^^^^^^^^^^^^^^^
woken_wake_function ;)

> +int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +{
> +	/*
> +	 * Although this function is called under waitqueue lock, LOCK
> +	 * doesn't imply write barrier and the users expects write
> +	 * barrier semantics on wakeup functions.  The following
> +	 * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
> +	 * and is paired with set_mb() in wait_woken().
> +	 */
> +	smp_wmb(); /* C */
> +	wait->flags |= WQ_FLAG_WOKEN;

Perhaps it is just me, but I was a bit confused by the comment above wmb().
Afaics, it is not that "users expects write barrier semantics", just we
need to ensure that

	CONDITION = true;
	wait->flags |= WQ_FLAG_WOKEN;

can't be reordered (and this differs from smp_wmb() in try_to_wake_up()).
Otherwise we can obviously race with

	// wait_woken() -> set_mb()
	wait->flags &= ~WQ_FLAG_WOKEN;
	mb();

	if (CONDITION)
		break;

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/11] sched: Debug nested sleeps
  2014-09-24  8:18 ` [PATCH 10/11] sched: Debug nested sleeps Peter Zijlstra
@ 2014-09-29 22:13   ` Oleg Nesterov
  2014-09-30 13:49     ` Peter Zijlstra
  2014-10-28 11:11   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2014-09-29 22:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On 09/24, Peter Zijlstra wrote:
>
> +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> +
> +#define __set_task_state(tsk, state_value)			\
> +	do {							\
> +		(tsk)->task_state_change = _THIS_IP_;		\
> +		(tsk)->state = (state_value);			\
> +	} while (0)

...

> @@ -7143,6 +7143,19 @@ void __might_sleep(const char *file, int
>  {
>  	static unsigned long prev_jiffy;	/* ratelimiting */
>
> +	/*
> +	 * Blocking primitives will set (and therefore destroy) current->state,
> +	 * since we will exit with TASK_RUNNING make sure we enter with it,
> +	 * otherwise we will destroy state.
> +	 */
> +	if (WARN(current->state != TASK_RUNNING,
> +			"do not call blocking ops when !TASK_RUNNING; "
> +			"state=%lx set at [<%p>] %pS\n",
> +			current->state,
> +			(void *)current->task_state_change,
> +			(void *)current->task_state_change))
> +		__set_current_state(TASK_RUNNING);

Question: now that we have ->task_state_change, perhaps it makes sense
to redefine fixup_sleep()

	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	#define fixup_sleep()	(current->task_state_change = 0)
	#else
	#define fixup_sleep()	do { } while (0)
	#endif

and make the WARN() above depend on task_state_change != 0 ?

This is minor, but this way CONFIG_DEBUG_ATOMIC_SLEEP will not imply
a subtle behavioural change.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/11] module: Fix nested sleep
  2014-09-24  8:18 ` [PATCH 08/11] module: Fix nested sleep Peter Zijlstra
@ 2014-09-29 22:18   ` Oleg Nesterov
  2014-09-30 13:43     ` Peter Zijlstra
  2014-10-28 11:11   ` [tip:sched/core] sched, modules: Fix nested sleep in add_unformed_module() tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2014-09-29 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti,
	linux-kernel, Rusty Russell, Fengguang Wu

On 09/24, Peter Zijlstra wrote:
>
> A genuine bug, we cannot use blocking primitives inside a wait loop.
>
> So rewrite the wait_event_interruptible() usage to use the fresh
> wait_woken() stuff.

OK, but ...

> +static int wait_finished_loading(struct module *mod)
> +{
> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
> +	int ret = 0;
> +
> +	add_wait_queue(&module_wq, &wait);
> +	for (;;) {
> +		if (finished_loading(mod->name))
> +			break;
> +
> +		if (signal_pending_state(TASK_INTERRUPTIBLE, current)) {

I am puzzled by this line, why not

		if (signal_pending(current)) {

?

this should be 100% equivalent.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/11] module: Fix nested sleep
  2014-09-29 22:18   ` Oleg Nesterov
@ 2014-09-30 13:43     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-30 13:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti,
	linux-kernel, Rusty Russell, Fengguang Wu

On Tue, Sep 30, 2014 at 12:18:23AM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > A genuine bug, we cannot use blocking primitives inside a wait loop.
> >
> > So rewrite the wait_event_interruptible() usage to use the fresh
> > wait_woken() stuff.
> 
> OK, but ...
> 
> > +static int wait_finished_loading(struct module *mod)
> > +{
> > +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
> > +	int ret = 0;
> > +
> > +	add_wait_queue(&module_wq, &wait);
> > +	for (;;) {
> > +		if (finished_loading(mod->name))
> > +			break;
> > +
> > +		if (signal_pending_state(TASK_INTERRUPTIBLE, current)) {
> 
> I am puzzled by this line, why not
> 
> 		if (signal_pending(current)) {
> 
> ?
> 
> this should be 100% equivalent.

Ah, because I looked at ___wait_event() and that ends up using
signal_pending_state() through preprare_to_wait_event().

I did not actually 'think' much. Lemme change that for you.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/11] sched: Debug nested sleeps
  2014-09-29 22:13   ` Oleg Nesterov
@ 2014-09-30 13:49     ` Peter Zijlstra
  2014-09-30 21:47       ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-09-30 13:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On Tue, Sep 30, 2014 at 12:13:44AM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> > +
> > +#define __set_task_state(tsk, state_value)			\
> > +	do {							\
> > +		(tsk)->task_state_change = _THIS_IP_;		\
> > +		(tsk)->state = (state_value);			\
> > +	} while (0)
> 
> ...
> 
> > @@ -7143,6 +7143,19 @@ void __might_sleep(const char *file, int
> >  {
> >  	static unsigned long prev_jiffy;	/* ratelimiting */
> >
> > +	/*
> > +	 * Blocking primitives will set (and therefore destroy) current->state,
> > +	 * since we will exit with TASK_RUNNING make sure we enter with it,
> > +	 * otherwise we will destroy state.
> > +	 */
> > +	if (WARN(current->state != TASK_RUNNING,
> > +			"do not call blocking ops when !TASK_RUNNING; "
> > +			"state=%lx set at [<%p>] %pS\n",
> > +			current->state,
> > +			(void *)current->task_state_change,
> > +			(void *)current->task_state_change))
> > +		__set_current_state(TASK_RUNNING);
> 
> Question: now that we have ->task_state_change, perhaps it makes sense
> to redefine fixup_sleep()
> 
> 	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> 	#define fixup_sleep()	(current->task_state_change = 0)
> 	#else
> 	#define fixup_sleep()	do { } while (0)
> 	#endif
> 
> and make the WARN() above depend on task_state_change != 0 ?
> 
> This is minor, but this way CONFIG_DEBUG_ATOMIC_SLEEP will not imply
> a subtle behavioural change.

You mean the __set_current_state() that's extra? I would actually argue
to keep that since it makes the 'problem' much worse.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/11] sched: Debug nested sleeps
  2014-09-30 13:49     ` Peter Zijlstra
@ 2014-09-30 21:47       ` Oleg Nesterov
  2014-10-01 16:10         ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2014-09-30 21:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On 09/30, Peter Zijlstra wrote:
>
> On Tue, Sep 30, 2014 at 12:13:44AM +0200, Oleg Nesterov wrote:
> > On 09/24, Peter Zijlstra wrote:
> > >
> > > +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> > > +
> > > +#define __set_task_state(tsk, state_value)			\
> > > +	do {							\
> > > +		(tsk)->task_state_change = _THIS_IP_;		\
> > > +		(tsk)->state = (state_value);			\
> > > +	} while (0)
> >
> > ...
> >
> > > @@ -7143,6 +7143,19 @@ void __might_sleep(const char *file, int
> > >  {
> > >  	static unsigned long prev_jiffy;	/* ratelimiting */
> > >
> > > +	/*
> > > +	 * Blocking primitives will set (and therefore destroy) current->state,
> > > +	 * since we will exit with TASK_RUNNING make sure we enter with it,
> > > +	 * otherwise we will destroy state.
> > > +	 */
> > > +	if (WARN(current->state != TASK_RUNNING,
> > > +			"do not call blocking ops when !TASK_RUNNING; "
> > > +			"state=%lx set at [<%p>] %pS\n",
> > > +			current->state,
> > > +			(void *)current->task_state_change,
> > > +			(void *)current->task_state_change))
> > > +		__set_current_state(TASK_RUNNING);
> >
> > Question: now that we have ->task_state_change, perhaps it makes sense
> > to redefine fixup_sleep()
> >
> > 	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> > 	#define fixup_sleep()	(current->task_state_change = 0)
> > 	#else
> > 	#define fixup_sleep()	do { } while (0)
> > 	#endif
> >
> > and make the WARN() above depend on task_state_change != 0 ?
> >
> > This is minor, but this way CONFIG_DEBUG_ATOMIC_SLEEP will not imply
> > a subtle behavioural change.
>
> You mean the __set_current_state() that's extra?

Yes, and note that it only does __set_current_state(RUNNING) if
CONFIG_DEBUG_ATOMIC_SLEEP. This means that disabling/enabling this
option can, silently hide/uncover a bug.

> I would actually argue
> to keep that since it makes the 'problem' much worse.

OK, I won't insist, but could you explain why the suggested change can
make the problem (and which problem ;) worse?

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/11] sched: Debug nested sleeps
  2014-09-30 21:47       ` Oleg Nesterov
@ 2014-10-01 16:10         ` Peter Zijlstra
  2014-10-01 18:35           ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-01 16:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On Tue, Sep 30, 2014 at 11:47:32PM +0200, Oleg Nesterov wrote:

> > > This is minor, but this way CONFIG_DEBUG_ATOMIC_SLEEP will not imply
> > > a subtle behavioural change.
> >
> > You mean the __set_current_state() that's extra?
> 
> Yes, and note that it only does __set_current_state(RUNNING) if
> CONFIG_DEBUG_ATOMIC_SLEEP. This means that disabling/enabling this
> option can, silently hide/uncover a bug.
> 
> > I would actually argue
> > to keep that since it makes the 'problem' much worse.
> 
> OK, I won't insist, but could you explain why the suggested change can
> make the problem (and which problem ;) worse?

Sure, so the trivial problem is not actually going to sleep in the outer
wait primitive because the inner wait primitive reset ->state to
TASK_RUNNING.

So by always setting the ->state to TASK_RUNNING it never goes to sleep
and it'll revert to spinning, causing spikes in CPU usage that should
hopefully be far easier to notice than the occasional funny.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/11] sched: Debug nested sleeps
  2014-10-01 16:10         ` Peter Zijlstra
@ 2014-10-01 18:35           ` Oleg Nesterov
  2014-10-02  9:07             ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2014-10-01 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On 10/01, Peter Zijlstra wrote:
>
> On Tue, Sep 30, 2014 at 11:47:32PM +0200, Oleg Nesterov wrote:
>
> > > > This is minor, but this way CONFIG_DEBUG_ATOMIC_SLEEP will not imply
> > > > a subtle behavioural change.
> > >
> > > You mean the __set_current_state() that's extra?
> >
> > Yes, and note that it only does __set_current_state(RUNNING) if
> > CONFIG_DEBUG_ATOMIC_SLEEP. This means that disabling/enabling this
> > option can, silently hide/uncover a bug.
> >
> > > I would actually argue
> > > to keep that since it makes the 'problem' much worse.
> >
> > OK, I won't insist, but could you explain why the suggested change can
> > make the problem (and which problem ;) worse?
>
> Sure, so the trivial problem is not actually going to sleep in the outer
> wait primitive because the inner wait primitive reset ->state to
> TASK_RUNNING.

But this means that fixup_sleep() must not be used?

> So by always setting the ->state to TASK_RUNNING it never goes to sleep
> and it'll revert to spinning,

But I tried to suggest to not set TASK_RUNNING?


Peter, I am sorry for wasting your time, this is really minor, but still
I'd like to understand.

Let me try again. With this series we have

	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	#define fixup_sleep()	__set_current_state(TASK_RUNNING)
	#else
	#define fixup_sleep()	do { } while (0)
	#endif

and this means that we do not need __set_current_state(RUNNING) for
correctness, just we want to shut up the warning in __might_sleep().
This is fine (and the self-documenting helper is nice), but this means
that CONFIG_DEBUG_ATOMIC_SLEEP adds a subtle difference.

For example, let's suppose that we do not have 01/11 which fixes
mutex_lock(). Then this code

	set_current_state(TASK_UNINTERRUPTIBLE);
	...
	fixup_sleep();
	...
	mutex_lock(some_mutex);

can hang, but only if !CONFIG_DEBUG_ATOMIC_SLEEP.

So perhaps it makes sense to redefine it

	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	#define fixup_sleep()	(current->task_state_change = 0)
	#else
	#define fixup_sleep()	do { } while (0)
	#endif

and change __might_sleep()

	-	if (WARN(current->state != TASK_RUNNING,
	+	if (WARN(current->state != TASK_RUNNING && current->task_state_change != 0,

?

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking
  2014-09-29 21:02   ` Oleg Nesterov
@ 2014-10-02  7:37     ` Peter Zijlstra
  2014-10-02 21:21       ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-02  7:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On Mon, Sep 29, 2014 at 11:02:21PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > There are a few places that call blocking primitives from wait loops,
> > provide infrastructure to support this without the typical
> > task_struct::state collision.
> >
> > We record the wakeup in wait_queue_t::flags which leaves
> > task_struct::state free to be used by others.
> 
> Sorry for delay. FWIW,
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 
> > +/*
> > + * DEFINE_WAIT_FUNC(wait, woken_wait_func);
>                              ^^^^^^^^^^^^^^^^
> woken_wake_function ;)
> 
> > +int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
> > +{
> > +	/*
> > +	 * Although this function is called under waitqueue lock, LOCK
> > +	 * doesn't imply write barrier and the users expects write
> > +	 * barrier semantics on wakeup functions.  The following
> > +	 * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
> > +	 * and is paired with set_mb() in wait_woken().
> > +	 */
> > +	smp_wmb(); /* C */
> > +	wait->flags |= WQ_FLAG_WOKEN;
> 
> Perhaps it is just me, but I was a bit confused by the comment above wmb().
> Afaics, it is not that "users expects write barrier semantics", just we
> need to ensure that
> 
> 	CONDITION = true;
> 	wait->flags |= WQ_FLAG_WOKEN;
> 
> can't be reordered (and this differs from smp_wmb() in try_to_wake_up()).
> Otherwise we can obviously race with
> 
> 	// wait_woken() -> set_mb()
> 	wait->flags &= ~WQ_FLAG_WOKEN;
> 	mb();
> 
> 	if (CONDITION)
> 		break;
> 

Yes, that comment could be clearer. It is however, to me, the 'same' as
a regular wakeup in that we need to separate whatever state changes
before the wakeup (CONDITION=true typically) from whatever writes are
required to affect the wakeup (->state=TASK_RUNNING typically, +optional
enqueuing on runqueues and all that).



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/11] sched: Debug nested sleeps
  2014-10-01 18:35           ` Oleg Nesterov
@ 2014-10-02  9:07             ` Peter Zijlstra
  2014-10-02 21:34               ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-02  9:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On Wed, Oct 01, 2014 at 08:35:49PM +0200, Oleg Nesterov wrote:
> On 10/01, Peter Zijlstra wrote:
> > Sure, so the trivial problem is not actually going to sleep in the outer
> > wait primitive because the inner wait primitive reset ->state to
> > TASK_RUNNING.
> 
> But this means that fixup_sleep() must not be used?

Right, in case its an actual bug, we'll not use fixup_sleep(). Those are
only used to annotate the few odd cases.

> > So by always setting the ->state to TASK_RUNNING it never goes to sleep
> > and it'll revert to spinning,
> 
> But I tried to suggest to not set TASK_RUNNING?

That's what I understood, because that's the difference between
CONFIG_DEBUG_ATOMIC_SLEEP and not. Or I made a complete mess of things,
which could well have happened, I had a terrible headache yesterday.

> Peter, I am sorry for wasting your time, this is really minor, but still
> I'd like to understand.
> 
> Let me try again. With this series we have
> 
> 	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> 	#define fixup_sleep()	__set_current_state(TASK_RUNNING)
> 	#else
> 	#define fixup_sleep()	do { } while (0)
> 	#endif
> 
> and this means that we do not need __set_current_state(RUNNING) for
> correctness, just we want to shut up the warning in __might_sleep().
> This is fine (and the self-documenting helper is nice), but this means
> that CONFIG_DEBUG_ATOMIC_SLEEP adds a subtle difference.
> 
> For example, let's suppose that we do not have 01/11 which fixes
> mutex_lock(). Then this code
> 
> 	set_current_state(TASK_UNINTERRUPTIBLE);
> 	...
> 	fixup_sleep();
> 	...
> 	mutex_lock(some_mutex);
> 
> can hang, but only if !CONFIG_DEBUG_ATOMIC_SLEEP.

Right, but we should not use fixup_sleep() in this case, because its an
actual proper bug, we should fix it, not paper over it. Arguably we
should use preempt_schedule in mutex_lock() in that particular case, but
that's another discussion.

> So perhaps it makes sense to redefine it
> 
> 	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> 	#define fixup_sleep()	(current->task_state_change = 0)
> 	#else
> 	#define fixup_sleep()	do { } while (0)
> 	#endif
> 
> and change __might_sleep()
> 
> 	-	if (WARN(current->state != TASK_RUNNING,
> 	+	if (WARN(current->state != TASK_RUNNING && current->task_state_change != 0,
> 
> ?

So I'm hesitant to go that way because it adds extra state dependency.
What if someone 'forgets' to use the *set*state() helpers.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-09-25  9:15     ` Peter Zijlstra
  2014-09-25  9:56       ` Mike Galbraith
@ 2014-10-02 10:22       ` Peter Zijlstra
  2014-10-02 12:15         ` Peter Zijlstra
  2014-11-04 16:08         ` [tip:sched/core] audit, sched/wait: Fixup kauditd_thread() wait loop tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-02 10:22 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel, Eric Paris

On Thu, Sep 25, 2014 at 11:15:56AM +0200, Peter Zijlstra wrote:
> > > My DL980 hollered itself to death while booting.
> > > 
> > > [   39.587224] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff811021d0>] kauditd_thread+0x130/0x1e0
> > > [   39.706325] Modules linked in: iTCO_wdt(E) gpio_ich(E) iTCO_vendor_support(E) joydev(E) i7core_edac(E) lpc_ich(E) hid_generic(E) hpwdt(E) mfd_core(E) edac_core(E) bnx2(E) shpchp(E) sr_mod(E) ehci_pci(E) hpilo(E) netxen_nic(E) ipmi_si(E) cdrom(E) pcspkr(E) sg(E) acpi_power_meter(E) ipmi_msghandler(E) button(E) ext4(E) jbd2(E) mbcache(E) crc16(E) usbhid(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) sd_mod(E) thermal(E) usb_common(E) processor(E) scsi_dh_hp_sw(E) scsi_dh_emc(E) scsi_dh_rdac(E) scsi_dh_alua(E) scsi_dh(E) ata_generic(E) ata_piix(E) libata(E) hpsa(E) cciss(E) scsi_mod(E)
> > > [   40.373599] CPU: 9 PID: 1974 Comm: kauditd Tainted: G            E  3.17.0-default #2
> > > [   40.506928] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
> > > [   40.613753]  0000000000001bd9 ffff88026f3d3d78 ffffffff815b2fc2 ffff88026f3d3db8
> > > [   40.728720]  ffffffff8106613c ffff88026f3d3da8 ffff88026b4fa110 0000000000000000
> > > [   40.816116]  0000000000000038 ffffffff8180ff47 ffff88026f3d3e58 ffff88026f3d3e18
> > > [   40.905088] Call Trace:
> > > [   40.938325]  [<ffffffff815b2fc2>] dump_stack+0x72/0x88
> > > [   41.000143]  [<ffffffff8106613c>] warn_slowpath_common+0x8c/0xc0
> > > [   41.067996]  [<ffffffff81066226>] warn_slowpath_fmt+0x46/0x50
> > > [   41.132669]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> > > [   41.204105]  [<ffffffff811021d0>] ? kauditd_thread+0x130/0x1e0
> > > [   41.270699]  [<ffffffff8108d214>] __might_sleep+0x84/0xa0
> > > [   41.333979]  [<ffffffff8110224b>] kauditd_thread+0x1ab/0x1e0
> > > [   41.398612]  [<ffffffff810940c0>] ? try_to_wake_up+0x210/0x210
> > > [   41.465435]  [<ffffffff811020a0>] ? audit_printk_skb+0x70/0x70
> > > [   41.534628]  [<ffffffff810859db>] kthread+0xeb/0x100
> > > [   41.596562]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80
> > > [   41.678973]  [<ffffffff815b85bc>] ret_from_fork+0x7c/0xb0
> > > [   41.742073]  [<ffffffff810858f0>] ? kthread_freezable_should_stop+0x80/0x80


---
Subject: audit,wait: Fixup kauditd_thread wait loop

The kauditd_thread wait loop is a bit iffy; it has a number of problems:

 - calls try_to_freeze() before schedule(); you typically want the
   thread to re-evaluate the sleep condition when unfreezing, also
   freeze_task() issues a wakeup.

 - it unconditionally does the {add,remove}_wait_queue(), even when the
   sleep condition is false.

Introduce a new wait_event() variant, wait_event_freezable() that does
all the right things and employ it here.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/wait.h |   25 +++++++++++++++++++++++++
 kernel/audit.c       |   11 +----------
 2 files changed, 26 insertions(+), 10 deletions(-)

--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -266,6 +266,31 @@ do {									\
 	__wait_event(wq, condition);					\
 } while (0)
 
+#define __wait_event_freezable(wq, condition)				\
+	(void)___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0,	\
+			    schedule(); try_to_freeze())
+
+/**
+ * wait_event - sleep until a condition gets true or freeze (for kthreads)
+ * @wq: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ *
+ * The process is put to sleep (TASK_INTERRUPTIBLE -- so as not to contribute
+ * to system load) until the @condition evaluates to true. The
+ * @condition is checked each time the waitqueue @wq is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ */
+#define wait_event_freezable(wq, condition)				\
+do {									\
+	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));			\
+	might_sleep();							\
+	if (condition)							\
+		break;							\
+	__wait_event_freezable(wq, condition);				\
+} while (0)
+
 #define __wait_event_timeout(wq, condition, timeout)			\
 	___wait_event(wq, ___wait_cond_timeout(condition),		\
 		      TASK_UNINTERRUPTIBLE, 0, timeout,			\
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -499,7 +499,6 @@ static int kauditd_thread(void *dummy)
 	set_freezable();
 	while (!kthread_should_stop()) {
 		struct sk_buff *skb;
-		DECLARE_WAITQUEUE(wait, current);
 
 		flush_hold_queue();
 
@@ -514,16 +513,8 @@ static int kauditd_thread(void *dummy)
 				audit_printk_skb(skb);
 			continue;
 		}
-		set_current_state(TASK_INTERRUPTIBLE);
-		add_wait_queue(&kauditd_wait, &wait);
 
-		if (!skb_queue_len(&audit_skb_queue)) {
-			try_to_freeze();
-			schedule();
-		}
-
-		__set_current_state(TASK_RUNNING);
-		remove_wait_queue(&kauditd_wait, &wait);
+		wait_event_freezable(&kauditd_wait, skb_queue_len(&audit_skb_queue));
 	}
 	return 0;
 }

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-02 10:22       ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
@ 2014-10-02 12:15         ` Peter Zijlstra
  2014-10-27 13:41           ` Peter Zijlstra
  2014-11-04 16:08         ` [tip:sched/core] audit, sched/wait: Fixup kauditd_thread() wait loop tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-02 12:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel,
	Eric Paris, rafael.j.wysocki

On Thu, Oct 02, 2014 at 12:22:51PM +0200, Peter Zijlstra wrote:
> Subject: audit,wait: Fixup kauditd_thread wait loop
> 
> The kauditd_thread wait loop is a bit iffy; it has a number of problems:
> 
>  - calls try_to_freeze() before schedule(); you typically want the
>    thread to re-evaluate the sleep condition when unfreezing, also
>    freeze_task() issues a wakeup.
> 
>  - it unconditionally does the {add,remove}_wait_queue(), even when the
>    sleep condition is false.
> 
> Introduce a new wait_event() variant, wait_event_freezable() that does
> all the right things and employ it here.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Eric Paris <eparis@redhat.com>
> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/wait.h |   25 +++++++++++++++++++++++++
>  kernel/audit.c       |   11 +----------
>  2 files changed, 26 insertions(+), 10 deletions(-)
> 
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -266,6 +266,31 @@ do {									\
>  	__wait_event(wq, condition);					\
>  } while (0)
>  
> +#define __wait_event_freezable(wq, condition)				\
> +	(void)___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0,	\
> +			    schedule(); try_to_freeze())
> +
> +/**
> + * wait_event - sleep until a condition gets true or freeze (for kthreads)
> + * @wq: the waitqueue to wait on
> + * @condition: a C expression for the event to wait for
> + *
> + * The process is put to sleep (TASK_INTERRUPTIBLE -- so as not to contribute
> + * to system load) until the @condition evaluates to true. The
> + * @condition is checked each time the waitqueue @wq is woken up.
> + *
> + * wake_up() has to be called after changing any variable that could
> + * change the result of the wait condition.
> + */
> +#define wait_event_freezable(wq, condition)				\
> +do {									\
> +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));			\
> +	might_sleep();							\
> +	if (condition)							\
> +		break;							\
> +	__wait_event_freezable(wq, condition);				\
> +} while (0)
> +
>  #define __wait_event_timeout(wq, condition, timeout)			\
>  	___wait_event(wq, ___wait_cond_timeout(condition),		\
>  		      TASK_UNINTERRUPTIBLE, 0, timeout,			\

Bah, that doesn't compile, because there already appears to be one,
hidden in freezer.h. Now I can't actually tell if it does the same thing
or not.

Rafael?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking
  2014-10-02  7:37     ` Peter Zijlstra
@ 2014-10-02 21:21       ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2014-10-02 21:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On 10/02, Peter Zijlstra wrote:
>
> On Mon, Sep 29, 2014 at 11:02:21PM +0200, Oleg Nesterov wrote:
> > On 09/24, Peter Zijlstra wrote:
> > >
> > > +int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
> > > +{
> > > +	/*
> > > +	 * Although this function is called under waitqueue lock, LOCK
> > > +	 * doesn't imply write barrier and the users expects write
> > > +	 * barrier semantics on wakeup functions.  The following
> > > +	 * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
> > > +	 * and is paired with set_mb() in wait_woken().
> > > +	 */
> > > +	smp_wmb(); /* C */
> > > +	wait->flags |= WQ_FLAG_WOKEN;
> >
> > Perhaps it is just me, but I was a bit confused by the comment above wmb().
> > Afaics, it is not that "users expects write barrier semantics", just we
> > need to ensure that
> >
> > 	CONDITION = true;
> > 	wait->flags |= WQ_FLAG_WOKEN;
> >
> > can't be reordered (and this differs from smp_wmb() in try_to_wake_up()).
> > Otherwise we can obviously race with
> >
> > 	// wait_woken() -> set_mb()
> > 	wait->flags &= ~WQ_FLAG_WOKEN;
> > 	mb();
> >
> > 	if (CONDITION)
> > 		break;
> >
>
> Yes, that comment could be clearer. It is however, to me, the 'same' as
> a regular wakeup in that we need to separate whatever state changes
> before the wakeup (CONDITION=true typically) from whatever writes are
> required to affect the wakeup (->state=TASK_RUNNING typically,

Not really, ttwu() needs to serialize CONDITION=true and the reading of
task->state. And for the waiter its state is write only, it doesn't need
to check it.

While in this case we need to separate CONDITION and WQ_FLAG_WOKEN, and
the waiter needs to check them in the right order.

But please forget, the code looks clear with or without the comment, and
"paired with set_mb() in wait_woken()" should explain the intent anyway.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/11] sched: Debug nested sleeps
  2014-10-02  9:07             ` Peter Zijlstra
@ 2014-10-02 21:34               ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2014-10-02 21:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, tglx, ilya.dryomov, umgwanakikbuti, linux-kernel

On 10/02, Peter Zijlstra wrote:
>
> On Wed, Oct 01, 2014 at 08:35:49PM +0200, Oleg Nesterov wrote:
> >
> > For example, let's suppose that we do not have 01/11 which fixes
> > mutex_lock(). Then this code
> >
> > 	set_current_state(TASK_UNINTERRUPTIBLE);
> > 	...
> > 	fixup_sleep();
> > 	...
> > 	mutex_lock(some_mutex);
> >
> > can hang, but only if !CONFIG_DEBUG_ATOMIC_SLEEP.
>
> Right, but we should not use fixup_sleep() in this case,

(well, I am not really sure but this is off-topic and I agree this needs
 another discussion)

> because its an
> actual proper bug, we should fix it, not paper over it.

Exactly! this is what I meant: CONFIG_DEBUG_ATOMIC_SLEEP will hide the
bug we need to fix.

> > So perhaps it makes sense to redefine it
> >
> > 	#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> > 	#define fixup_sleep()	(current->task_state_change = 0)
> > 	#else
> > 	#define fixup_sleep()	do { } while (0)
> > 	#endif
> >
> > and change __might_sleep()
> >
> > 	-	if (WARN(current->state != TASK_RUNNING,
> > 	+	if (WARN(current->state != TASK_RUNNING && current->task_state_change != 0,
> >
> > ?
>
> So I'm hesitant to go that way because it adds extra state dependency.

OK. We can always reconsider this later. I spammed you only because I
wanted to understand what did me/you/both missed in this discussion.

> What if someone 'forgets' to use the *set*state() helpers.

Yes, this is true. Although we want to fix them anyway, if nothing else
for this warning in might_sleep().

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-02 12:15         ` Peter Zijlstra
@ 2014-10-27 13:41           ` Peter Zijlstra
  2014-10-28  0:07             ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-27 13:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, oleg, torvalds, tglx, ilya.dryomov, linux-kernel,
	Eric Paris, rafael.j.wysocki

On Thu, Oct 02, 2014 at 02:15:53PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 02, 2014 at 12:22:51PM +0200, Peter Zijlstra wrote:

> > +#define __wait_event_freezable(wq, condition)				\
> > +	(void)___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0,	\
> > +			    schedule(); try_to_freeze())
> > +
> > +/**
> > + * wait_event - sleep until a condition gets true or freeze (for kthreads)
> > + * @wq: the waitqueue to wait on
> > + * @condition: a C expression for the event to wait for
> > + *
> > + * The process is put to sleep (TASK_INTERRUPTIBLE -- so as not to contribute
> > + * to system load) until the @condition evaluates to true. The
> > + * @condition is checked each time the waitqueue @wq is woken up.
> > + *
> > + * wake_up() has to be called after changing any variable that could
> > + * change the result of the wait condition.
> > + */
> > +#define wait_event_freezable(wq, condition)				\
> > +do {									\
> > +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));			\
> > +	might_sleep();							\
> > +	if (condition)							\
> > +		break;							\
> > +	__wait_event_freezable(wq, condition);				\
> > +} while (0)
> > +
> >  #define __wait_event_timeout(wq, condition, timeout)			\
> >  	___wait_event(wq, ___wait_cond_timeout(condition),		\
> >  		      TASK_UNINTERRUPTIBLE, 0, timeout,			\
> 
> Bah, that doesn't compile, because there already appears to be one,
> hidden in freezer.h. Now I can't actually tell if it does the same thing
> or not.
> 
> Rafael?

Ping?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-27 13:41           ` Peter Zijlstra
@ 2014-10-28  0:07             ` Oleg Nesterov
  2014-10-28  8:23               ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2014-10-28  0:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, mingo, torvalds, tglx, ilya.dryomov,
	linux-kernel, Eric Paris, rafael.j.wysocki

On 10/27, Peter Zijlstra wrote:
>
> On Thu, Oct 02, 2014 at 02:15:53PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 02, 2014 at 12:22:51PM +0200, Peter Zijlstra wrote:
>
> > > +#define __wait_event_freezable(wq, condition)				\
> > > +	(void)___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0,	\
> > > +			    schedule(); try_to_freeze())
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

see below.

> > > +
> > > +/**
> > > + * wait_event - sleep until a condition gets true or freeze (for kthreads)
> > > + * @wq: the waitqueue to wait on
> > > + * @condition: a C expression for the event to wait for
> > > + *
> > > + * The process is put to sleep (TASK_INTERRUPTIBLE -- so as not to contribute
> > > + * to system load) until the @condition evaluates to true. The
> > > + * @condition is checked each time the waitqueue @wq is woken up.
> > > + *
> > > + * wake_up() has to be called after changing any variable that could
> > > + * change the result of the wait condition.
> > > + */
> > > +#define wait_event_freezable(wq, condition)				\
> > > +do {									\
> > > +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));			\
> > > +	might_sleep();							\
> > > +	if (condition)							\
> > > +		break;							\
> > > +	__wait_event_freezable(wq, condition);				\
> > > +} while (0)
> > > +
> > >  #define __wait_event_timeout(wq, condition, timeout)			\
> > >  	___wait_event(wq, ___wait_cond_timeout(condition),		\
> > >  		      TASK_UNINTERRUPTIBLE, 0, timeout,			\
> >
> > Bah, that doesn't compile, because there already appears to be one,
> > hidden in freezer.h. Now I can't actually tell if it does the same thing
> > or not.
> >
> > Rafael?
>
> Ping?

I was going to say that wait_event_freezable() in kauditd_thread()
is not friendly wrt kthread_should_stop() and thus we we need
kthread_freezable_should_stop().

But in fact we never stop this kauditd_task, so I think we should
turn the main loop into "for (;;)" and change this code to use
wait_event_freezable() like your patch does.

Perhaps it also makes sense to redefine wait_event_freezable.*()
via ___wait_event(cmd => freezable_schedule), but I think this needs
another patch.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-28  0:07             ` Oleg Nesterov
@ 2014-10-28  8:23               ` Peter Zijlstra
  2014-10-29  0:00                 ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-28  8:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, mingo, torvalds, tglx, ilya.dryomov,
	linux-kernel, Eric Paris, rafael.j.wysocki

On Tue, Oct 28, 2014 at 01:07:03AM +0100, Oleg Nesterov wrote:
> On 10/27, Peter Zijlstra wrote:
> >
> > On Thu, Oct 02, 2014 at 02:15:53PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 02, 2014 at 12:22:51PM +0200, Peter Zijlstra wrote:
> >
> > > > +#define __wait_event_freezable(wq, condition)				\
> > > > +	(void)___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0,	\
> > > > +			    schedule(); try_to_freeze())
>                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> see below.

> I was going to say that wait_event_freezable() in kauditd_thread()
> is not friendly wrt kthread_should_stop() and thus we we need
> kthread_freezable_should_stop().

I'm not sure those two would interact, yes, both would first set either
the freezable or stop bit and then wake. If both were to race, all we
need to ensure is to check both before calling schedule again.

A loop like:

	while (!kthread_should_stop()) {
		wait_event_freezable(wq, cond);
	}

Would satisfy that, because even if kthread_should_stop() gets set first
and then freezing happens and we get into try_to_freeze() first, we'd
still to the kthread_should_stop() check right after we thaw.

So I don't se a reason to collect both stop and freeze into a single
thing.

> But in fact we never stop this kauditd_task, so I think we should
> turn the main loop into "for (;;)" and change this code to use
> wait_event_freezable() like your patch does.

Right.

> Perhaps it also makes sense to redefine wait_event_freezable.*()
> via ___wait_event(cmd => freezable_schedule), but I think this needs
> another patch.

So I talked to Rafael yesterday and I'm going to replace all the
wait_event*() stuff, and I suppose also freezable_schedule() because
they're racy.

The moment we call freezer_do_not_count() the freezer will ignore us,
this means the thread could still be running (albeit not for long) when
the freezer reports success.

Ideally I'll be able to kill the entire freezer_do_not_count() stuff.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] locking/mutex: Don't assume TASK_RUNNING
  2014-09-24  8:18 ` [PATCH 01/11] locking/mutex: Dont assume TASK_RUNNING Peter Zijlstra
@ 2014-10-28 11:09   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:09 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, torvalds, mingo, oleg, hpa, tglx, linux-kernel

Commit-ID:  6f942a1f264e875c5f3ad6f505d7b500a3e7fa82
Gitweb:     http://git.kernel.org/tip/6f942a1f264e875c5f3ad6f505d7b500a3e7fa82
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:46 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:55:08 +0100

locking/mutex: Don't assume TASK_RUNNING

We're going to make might_sleep() test for TASK_RUNNING, because
blocking without TASK_RUNNING will destroy the task state by setting
it to TASK_RUNNING.

There are a few occasions where its 'valid' to call blocking
primitives (and mutex_lock in particular) and not have TASK_RUNNING,
typically such cases are right before we set TASK_RUNNING anyhow.

Robustify the code by not assuming this; this has the beneficial side
effect of allowing optional code emission for fixing the above
might_sleep() false positives.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/mutex.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index dadbf88..4541951 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -378,8 +378,14 @@ done:
 	 * reschedule now, before we try-lock the mutex. This avoids getting
 	 * scheduled out right after we obtained the mutex.
 	 */
-	if (need_resched())
+	if (need_resched()) {
+		/*
+		 * We _should_ have TASK_RUNNING here, but just in case
+		 * we do not, make it so, otherwise we might get stuck.
+		 */
+		__set_current_state(TASK_RUNNING);
 		schedule_preempt_disabled();
+	}
 
 	return false;
 }

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched/wait: Provide infrastructure to deal with nested blocking
  2014-09-24  8:18 ` [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking Peter Zijlstra
  2014-09-29 21:02   ` Oleg Nesterov
@ 2014-10-28 11:09   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:09 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, oleg, mingo, hpa, linux-kernel, peterz, torvalds

Commit-ID:  61ada528dea028331e99e8ceaed87c683ad25de2
Gitweb:     http://git.kernel.org/tip/61ada528dea028331e99e8ceaed87c683ad25de2
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:47 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:55:15 +0100

sched/wait: Provide infrastructure to deal with nested blocking

There are a few places that call blocking primitives from wait loops,
provide infrastructure to support this without the typical
task_struct::state collision.

We record the wakeup in wait_queue_t::flags which leaves
task_struct::state free to be used by others.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/wait.h |  7 +++++-
 kernel/sched/wait.c  | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index e4a8eb9..fc0e993 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -13,9 +13,12 @@ typedef struct __wait_queue wait_queue_t;
 typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int flags, void *key);
 int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *key);
 
+/* __wait_queue::flags */
+#define WQ_FLAG_EXCLUSIVE	0x01
+#define WQ_FLAG_WOKEN		0x02
+
 struct __wait_queue {
 	unsigned int		flags;
-#define WQ_FLAG_EXCLUSIVE	0x01
 	void			*private;
 	wait_queue_func_t	func;
 	struct list_head	task_list;
@@ -830,6 +833,8 @@ void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int sta
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
+long wait_woken(wait_queue_t *wait, unsigned mode, long timeout);
+int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 5a62915..4dae188 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -297,6 +297,67 @@ int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *
 }
 EXPORT_SYMBOL(autoremove_wake_function);
 
+
+/*
+ * DEFINE_WAIT_FUNC(wait, woken_wake_func);
+ *
+ * add_wait_queue(&wq, &wait);
+ * for (;;) {
+ *     if (condition)
+ *         break;
+ *
+ *     p->state = mode;				condition = true;
+ *     smp_mb(); // A				smp_wmb(); // C
+ *     if (!wait->flags & WQ_FLAG_WOKEN)	wait->flags |= WQ_FLAG_WOKEN;
+ *         schedule()				try_to_wake_up();
+ *     p->state = TASK_RUNNING;		    ~~~~~~~~~~~~~~~~~~
+ *     wait->flags &= ~WQ_FLAG_WOKEN;		condition = true;
+ *     smp_mb() // B				smp_wmb(); // C
+ *						wait->flags |= WQ_FLAG_WOKEN;
+ * }
+ * remove_wait_queue(&wq, &wait);
+ *
+ */
+long wait_woken(wait_queue_t *wait, unsigned mode, long timeout)
+{
+	set_current_state(mode); /* A */
+	/*
+	 * The above implies an smp_mb(), which matches with the smp_wmb() from
+	 * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
+	 * also observe all state before the wakeup.
+	 */
+	if (!(wait->flags & WQ_FLAG_WOKEN))
+		timeout = schedule_timeout(timeout);
+	__set_current_state(TASK_RUNNING);
+
+	/*
+	 * The below implies an smp_mb(), it too pairs with the smp_wmb() from
+	 * woken_wake_function() such that we must either observe the wait
+	 * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
+	 * an event.
+	 */
+	set_mb(wait->flags, wait->flags & ~WQ_FLAG_WOKEN); /* B */
+
+	return timeout;
+}
+EXPORT_SYMBOL(wait_woken);
+
+int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	/*
+	 * Although this function is called under waitqueue lock, LOCK
+	 * doesn't imply write barrier and the users expects write
+	 * barrier semantics on wakeup functions.  The following
+	 * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
+	 * and is paired with set_mb() in wait_woken().
+	 */
+	smp_wmb(); /* C */
+	wait->flags |= WQ_FLAG_WOKEN;
+
+	return default_wake_function(wait, mode, sync, key);
+}
+EXPORT_SYMBOL(woken_wake_function);
+
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched/wait: Add might_sleep() checks
  2014-09-24  8:18 ` [PATCH 03/11] wait: Add might_sleep() Peter Zijlstra
@ 2014-10-28 11:09   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:09 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, hpa, linux-kernel, peterz, mingo, torvalds, oleg

Commit-ID:  e22b886a8a43b147e1994a9f970f678fc0df2033
Gitweb:     http://git.kernel.org/tip/e22b886a8a43b147e1994a9f970f678fc0df2033
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:48 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:55:23 +0100

sched/wait: Add might_sleep() checks

Add more might_sleep() checks, suppose someone put a wait_event() like
thing in a wait loop..

Can't put might_sleep() in ___wait_event() because there's the locked
primitives which call ___wait_event() with locks held.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.119255706@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/wait.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index fc0e993..0421775 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -261,6 +261,7 @@ __out:	__ret;								\
  */
 #define wait_event(wq, condition)					\
 do {									\
+	might_sleep();							\
 	if (condition)							\
 		break;							\
 	__wait_event(wq, condition);					\
@@ -293,6 +294,7 @@ do {									\
 #define wait_event_timeout(wq, condition, timeout)			\
 ({									\
 	long __ret = timeout;						\
+	might_sleep();							\
 	if (!___wait_cond_timeout(condition))				\
 		__ret = __wait_event_timeout(wq, condition, timeout);	\
 	__ret;								\
@@ -318,6 +320,7 @@ do {									\
  */
 #define wait_event_cmd(wq, condition, cmd1, cmd2)			\
 do {									\
+	might_sleep();							\
 	if (condition)							\
 		break;							\
 	__wait_event_cmd(wq, condition, cmd1, cmd2);			\
@@ -345,6 +348,7 @@ do {									\
 #define wait_event_interruptible(wq, condition)				\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_interruptible(wq, condition);	\
 	__ret;								\
@@ -378,6 +382,7 @@ do {									\
 #define wait_event_interruptible_timeout(wq, condition, timeout)	\
 ({									\
 	long __ret = timeout;						\
+	might_sleep();							\
 	if (!___wait_cond_timeout(condition))				\
 		__ret = __wait_event_interruptible_timeout(wq,		\
 						condition, timeout);	\
@@ -428,6 +433,7 @@ do {									\
 #define wait_event_hrtimeout(wq, condition, timeout)			\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_hrtimeout(wq, condition, timeout,	\
 					       TASK_UNINTERRUPTIBLE);	\
@@ -453,6 +459,7 @@ do {									\
 #define wait_event_interruptible_hrtimeout(wq, condition, timeout)	\
 ({									\
 	long __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_hrtimeout(wq, condition, timeout,	\
 					       TASK_INTERRUPTIBLE);	\
@@ -466,6 +473,7 @@ do {									\
 #define wait_event_interruptible_exclusive(wq, condition)		\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_interruptible_exclusive(wq, condition);\
 	__ret;								\
@@ -640,6 +648,7 @@ do {									\
 #define wait_event_killable(wq, condition)				\
 ({									\
 	int __ret = 0;							\
+	might_sleep();							\
 	if (!(condition))						\
 		__ret = __wait_event_killable(wq, condition);		\
 	__ret;								\
@@ -891,6 +900,7 @@ extern int bit_wait_io_timeout(struct wait_bit_key *);
 static inline int
 wait_on_bit(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit(word, bit,
@@ -915,6 +925,7 @@ wait_on_bit(void *word, int bit, unsigned mode)
 static inline int
 wait_on_bit_io(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit(word, bit,
@@ -941,6 +952,7 @@ wait_on_bit_io(void *word, int bit, unsigned mode)
 static inline int
 wait_on_bit_action(void *word, int bit, wait_bit_action_f *action, unsigned mode)
 {
+	might_sleep();
 	if (!test_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit(word, bit, action, mode);
@@ -968,6 +980,7 @@ wait_on_bit_action(void *word, int bit, wait_bit_action_f *action, unsigned mode
 static inline int
 wait_on_bit_lock(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_and_set_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, bit_wait, mode);
@@ -991,6 +1004,7 @@ wait_on_bit_lock(void *word, int bit, unsigned mode)
 static inline int
 wait_on_bit_lock_io(void *word, int bit, unsigned mode)
 {
+	might_sleep();
 	if (!test_and_set_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, bit_wait_io, mode);
@@ -1016,6 +1030,7 @@ wait_on_bit_lock_io(void *word, int bit, unsigned mode)
 static inline int
 wait_on_bit_lock_action(void *word, int bit, wait_bit_action_f *action, unsigned mode)
 {
+	might_sleep();
 	if (!test_and_set_bit(bit, word))
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, action, mode);
@@ -1034,6 +1049,7 @@ wait_on_bit_lock_action(void *word, int bit, wait_bit_action_f *action, unsigned
 static inline
 int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 {
+	might_sleep();
 	if (atomic_read(val) == 0)
 		return 0;
 	return out_of_line_wait_on_atomic_t(val, action, mode);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched, exit: Deal with nested sleeps
  2014-09-24  8:18 ` [PATCH 04/11] exit: Deal with nested sleeps Peter Zijlstra
@ 2014-10-28 11:10   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ionut.m.alexa, mhocko, hpa, jbaron, tglx, rusty, torvalds,
	rostedt, mschmidt, linux-kernel, axel.lin, davej, alex.elder,
	riel, guillaume, oleg, paulmck, dborkman, mingo, akpm, peterz

Commit-ID:  1029a2b52c09e479fd7b07275812ad97868c0fb0
Gitweb:     http://git.kernel.org/tip/1029a2b52c09e479fd7b07275812ad97868c0fb0
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:49 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:55:30 +0100

sched, exit: Deal with nested sleeps

do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end
up calling potential sleeps before we reset it.

Not strictly a bug since we're guaranteed to exit the loop and not
call schedule(); put in annotations to quiet might_sleep().

 WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90()
 do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270

 Call Trace:
  [<ffffffff81694991>] dump_stack+0x4e/0x7a
  [<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0
  [<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50
  [<ffffffff810bca6e>] __might_sleep+0x7e/0x90
  [<ffffffff811a1c15>] might_fault+0x55/0xb0
  [<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10
  [<ffffffff8109a804>] do_wait+0x104/0x270
  [<ffffffff8109b837>] SyS_wait4+0x77/0x100
  [<ffffffff8169d692>] system_call_fastpath+0x16/0x1b

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: umgwanakikbuti@gmail.com
Cc: ilya.dryomov@inktank.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/kernel.h | 2 ++
 kernel/exit.c          | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3d770f55..5068a0d 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -175,10 +175,12 @@ extern int _cond_resched(void);
  */
 # define might_sleep() \
 	do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
+# define sched_annotate_sleep()	__set_current_state(TASK_RUNNING)
 #else
   static inline void __might_sleep(const char *file, int line,
 				   int preempt_offset) { }
 # define might_sleep() do { might_resched(); } while (0)
+# define sched_annotate_sleep() do { } while (0)
 #endif
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019..232c4bc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -997,6 +997,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 
 		get_task_struct(p);
 		read_unlock(&tasklist_lock);
+		sched_annotate_sleep();
+
 		if ((exit_code & 0x7f) == 0) {
 			why = CLD_EXITED;
 			status = exit_code >> 8;
@@ -1079,6 +1081,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 	 * thread can reap it because we its state == DEAD/TRACE.
 	 */
 	read_unlock(&tasklist_lock);
+	sched_annotate_sleep();
 
 	retval = wo->wo_rusage
 		? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0;
@@ -1210,6 +1213,7 @@ unlock_sig:
 	pid = task_pid_vnr(p);
 	why = ptrace ? CLD_TRAPPED : CLD_STOPPED;
 	read_unlock(&tasklist_lock);
+	sched_annotate_sleep();
 
 	if (unlikely(wo->wo_flags & WNOWAIT))
 		return wait_noreap_copyout(wo, p, pid, uid, why, exit_code);
@@ -1272,6 +1276,7 @@ static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
 	pid = task_pid_vnr(p);
 	get_task_struct(p);
 	read_unlock(&tasklist_lock);
+	sched_annotate_sleep();
 
 	if (!wo->wo_info) {
 		retval = wo->wo_rusage

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched, inotify: Deal with nested sleeps
  2014-09-24  8:18 ` [PATCH 05/11] inotify: " Peter Zijlstra
@ 2014-10-28 11:10   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, peterz, hpa, john, eparis, rlove, linux-kernel, tglx,
	mingo, oleg, viro

Commit-ID:  e23738a7300a7591a57a22f47b813fd1b53ec404
Gitweb:     http://git.kernel.org/tip/e23738a7300a7591a57a22f47b813fd1b53ec404
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:50 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:55:37 +0100

sched, inotify: Deal with nested sleeps

inotify_read is a wait loop with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state and the wait loop breaks the sleep functions that assume
TASK_RUNNING (mutex_lock).

Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20140924082242.254858080@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 fs/notify/inotify/inotify_user.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index daf7665..283aa31 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -227,14 +227,13 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
 	struct fsnotify_event *kevent;
 	char __user *start;
 	int ret;
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
 	start = buf;
 	group = file->private_data;
 
+	add_wait_queue(&group->notification_waitq, &wait);
 	while (1) {
-		prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
-
 		mutex_lock(&group->notification_mutex);
 		kevent = get_one_event(group, count);
 		mutex_unlock(&group->notification_mutex);
@@ -264,10 +263,10 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
 		if (start != buf)
 			break;
 
-		schedule();
+		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
 	}
+	remove_wait_queue(&group->notification_waitq, &wait);
 
-	finish_wait(&group->notification_waitq, &wait);
 	if (start != buf && ret != -EFAULT)
 		ret = buf - start;
 	return ret;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched, tty: Deal with nested sleeps
  2014-09-24  8:18 ` [PATCH 06/11] tty: " Peter Zijlstra
@ 2014-10-28 11:10   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: gregkh, peterz, jslaby, linux-kernel, torvalds, tglx, hpa, mingo

Commit-ID:  97d9e28d1a27b84a6a0b155f2390289afa279341
Gitweb:     http://git.kernel.org/tip/97d9e28d1a27b84a6a0b155f2390289afa279341
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:51 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:56:10 +0100

sched, tty: Deal with nested sleeps

n_tty_{read,write} are wait loops with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state.

Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/20140924082242.323011233@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/tty/n_tty.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index 89c4cee..fb3f519 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -2123,7 +2123,7 @@ static ssize_t n_tty_read(struct tty_struct *tty, struct file *file,
 {
 	struct n_tty_data *ldata = tty->disc_data;
 	unsigned char __user *b = buf;
-	DECLARE_WAITQUEUE(wait, current);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 	int c;
 	int minimum, time;
 	ssize_t retval = 0;
@@ -2186,10 +2186,6 @@ static ssize_t n_tty_read(struct tty_struct *tty, struct file *file,
 			nr--;
 			break;
 		}
-		/* This statement must be first before checking for input
-		   so that any interrupt will set the state back to
-		   TASK_RUNNING. */
-		set_current_state(TASK_INTERRUPTIBLE);
 
 		if (((minimum - (b - buf)) < ldata->minimum_to_wake) &&
 		    ((minimum - (b - buf)) >= 1))
@@ -2220,13 +2216,13 @@ static ssize_t n_tty_read(struct tty_struct *tty, struct file *file,
 				n_tty_set_room(tty);
 				up_read(&tty->termios_rwsem);
 
-				timeout = schedule_timeout(timeout);
+				timeout = wait_woken(&wait, TASK_INTERRUPTIBLE,
+						     timeout);
 
 				down_read(&tty->termios_rwsem);
 				continue;
 			}
 		}
-		__set_current_state(TASK_RUNNING);
 
 		/* Deal with packet mode. */
 		if (packet && b == buf) {
@@ -2273,7 +2269,6 @@ static ssize_t n_tty_read(struct tty_struct *tty, struct file *file,
 
 	mutex_unlock(&ldata->atomic_read_lock);
 
-	__set_current_state(TASK_RUNNING);
 	if (b - buf)
 		retval = b - buf;
 
@@ -2306,7 +2301,7 @@ static ssize_t n_tty_write(struct tty_struct *tty, struct file *file,
 			   const unsigned char *buf, size_t nr)
 {
 	const unsigned char *b = buf;
-	DECLARE_WAITQUEUE(wait, current);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 	int c;
 	ssize_t retval = 0;
 
@@ -2324,7 +2319,6 @@ static ssize_t n_tty_write(struct tty_struct *tty, struct file *file,
 
 	add_wait_queue(&tty->write_wait, &wait);
 	while (1) {
-		set_current_state(TASK_INTERRUPTIBLE);
 		if (signal_pending(current)) {
 			retval = -ERESTARTSYS;
 			break;
@@ -2378,12 +2372,11 @@ static ssize_t n_tty_write(struct tty_struct *tty, struct file *file,
 		}
 		up_read(&tty->termios_rwsem);
 
-		schedule();
+		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
 
 		down_read(&tty->termios_rwsem);
 	}
 break_out:
-	__set_current_state(TASK_RUNNING);
 	remove_wait_queue(&tty->write_wait, &wait);
 	if (b - buf != nr && tty->fasync)
 		set_bit(TTY_DO_WRITE_WAKEUP, &tty->flags);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched, smp: Correctly deal with nested sleeps
  2014-09-24  8:18 ` [PATCH 07/11] smp: Correctly deal " Peter Zijlstra
@ 2014-10-28 11:11   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, linux-kernel, torvalds, peterz, mingo, hpa

Commit-ID:  7d4d26966e0b6443c78123a8a8b602e8eaf67694
Gitweb:     http://git.kernel.org/tip/7d4d26966e0b6443c78123a8a8b602e8eaf67694
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:52 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:56:24 +0100

sched, smp: Correctly deal with nested sleeps

smp_hotplug_thread::{setup,unpark} functions can sleep too, so be
consistent and do the same for all callbacks.

 __might_sleep+0x74/0x80
 kmem_cache_alloc_trace+0x4e/0x1c0
 perf_event_alloc+0x55/0x450
 perf_event_create_kernel_counter+0x2f/0x100
 watchdog_nmi_enable+0x8d/0x160
 watchdog_enable+0x45/0x90
 smpboot_thread_fn+0xec/0x2b0
 kthread+0xe4/0x100
 ret_from_fork+0x7c/0xb0

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.392279328@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/smpboot.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index eb89e18..f032fb5 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -110,7 +110,7 @@ static int smpboot_thread_fn(void *data)
 		set_current_state(TASK_INTERRUPTIBLE);
 		preempt_disable();
 		if (kthread_should_stop()) {
-			set_current_state(TASK_RUNNING);
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			if (ht->cleanup)
 				ht->cleanup(td->cpu, cpu_online(td->cpu));
@@ -136,26 +136,27 @@ static int smpboot_thread_fn(void *data)
 		/* Check for state change setup */
 		switch (td->status) {
 		case HP_THREAD_NONE:
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			if (ht->setup)
 				ht->setup(td->cpu);
 			td->status = HP_THREAD_ACTIVE;
-			preempt_disable();
-			break;
+			continue;
+
 		case HP_THREAD_PARKED:
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			if (ht->unpark)
 				ht->unpark(td->cpu);
 			td->status = HP_THREAD_ACTIVE;
-			preempt_disable();
-			break;
+			continue;
 		}
 
 		if (!ht->thread_should_run(td->cpu)) {
-			preempt_enable();
+			preempt_enable_no_resched();
 			schedule();
 		} else {
-			set_current_state(TASK_RUNNING);
+			__set_current_state(TASK_RUNNING);
 			preempt_enable();
 			ht->thread_fn(td->cpu);
 		}

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched, modules: Fix nested sleep in add_unformed_module()
  2014-09-24  8:18 ` [PATCH 08/11] module: Fix nested sleep Peter Zijlstra
  2014-09-29 22:18   ` Oleg Nesterov
@ 2014-10-28 11:11   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: gregkh, tglx, hpa, rusty, linux-kernel, torvalds, akpm, mingo,
	fengguang.wu, peterz

Commit-ID:  3c9b2c3d64a49f264422d7743599cf7f6535972d
Gitweb:     http://git.kernel.org/tip/3c9b2c3d64a49f264422d7743599cf7f6535972d
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:53 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:56:30 +0100

sched, modules: Fix nested sleep in add_unformed_module()

This is a genuine bug in add_unformed_module(), we cannot use blocking
primitives inside a wait loop.

So rewrite the wait_event_interruptible() usage to use the fresh
wait_woken() stuff.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20140924082242.458562904@infradead.org
[ So this is probably complex to backport and the race wasn't reported AFAIK,
  so not marked for -stable. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/module.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 88cec1d..e52a873 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3097,6 +3097,32 @@ static int may_init_module(void)
 }
 
 /*
+ * Can't use wait_event_interruptible() because our condition
+ * 'finished_loading()' contains a blocking primitive itself (mutex_lock).
+ */
+static int wait_finished_loading(struct module *mod)
+{
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
+	int ret = 0;
+
+	add_wait_queue(&module_wq, &wait);
+	for (;;) {
+		if (finished_loading(mod->name))
+			break;
+
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
+	}
+	remove_wait_queue(&module_wq, &wait);
+
+	return ret;
+}
+
+/*
  * We try to place it in the list now to make sure it's unique before
  * we dedicate too many resources.  In particular, temporary percpu
  * memory exhaustion.
@@ -3116,8 +3142,8 @@ again:
 		    || old->state == MODULE_STATE_UNFORMED) {
 			/* Wait in case it fails to load. */
 			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
+
+			err = wait_finished_loading(mod);
 			if (err)
 				goto out_unlocked;
 			goto again;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched, net: Clean up sk_wait_event() vs. might_sleep()
  2014-09-24  8:18 ` [PATCH 09/11] net: Clean up sk_wait_event() vs might_sleep() Peter Zijlstra
  2014-09-24  8:36   ` Peter Zijlstra
@ 2014-10-28 11:11   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, torvalds, davem, ilya.dryomov, peterz, tglx, mingo, hpa

Commit-ID:  26cabd31259ba43f68026ce3f62b78094124333f
Gitweb:     http://git.kernel.org/tip/26cabd31259ba43f68026ce3f62b78094124333f
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:54 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:56:37 +0100

sched, net: Clean up sk_wait_event() vs. might_sleep()

WARNING: CPU: 1 PID: 1744 at kernel/sched/core.c:7104 __might_sleep+0x58/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff81070e10>] prepare_to_wait+0x50 /0xa0

 [<ffffffff8105bc38>] __might_sleep+0x58/0x90
 [<ffffffff8148c671>] lock_sock_nested+0x31/0xb0
 [<ffffffff81498aaa>] sk_stream_wait_memory+0x18a/0x2d0

Which is a false positive because sk_wait_event() will already have
TASK_RUNNING at that point if it would've gone through
schedule_timeout().

So annotate with sched_annotate_sleep(); which goes away on !DEBUG builds.

Reported-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140924082242.524407432@infradead.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: netdev@vger.kernel.org
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/net/sock.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7db3db1..e6f235e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -897,6 +897,7 @@ static inline void sock_rps_reset_rxhash(struct sock *sk)
 		if (!__rc) {						\
 			*(__timeo) = schedule_timeout(*(__timeo));	\
 		}							\
+		sched_annotate_sleep();						\
 		lock_sock(__sk);					\
 		__rc = __condition;					\
 		__rc;							\

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Debug nested sleeps
  2014-09-24  8:18 ` [PATCH 10/11] sched: Debug nested sleeps Peter Zijlstra
  2014-09-29 22:13   ` Oleg Nesterov
@ 2014-10-28 11:11   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, hpa, torvalds, peterz, tglx, linux-kernel

Commit-ID:  8eb23b9f35aae413140d3fda766a98092c21e9b0
Gitweb:     http://git.kernel.org/tip/8eb23b9f35aae413140d3fda766a98092c21e9b0
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:55 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:56:52 +0100

sched: Debug nested sleeps

Validate we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, eg. mutex usage in a wait loop.

Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.

Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/core.c   | 13 +++++++++++++
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 320a977..4648e07 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -243,6 +243,43 @@ extern char ___assert_task_state[1 - 2*!!(
 				((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
 				 (task->flags & PF_FROZEN) == 0)
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+
+#define __set_task_state(tsk, state_value)			\
+	do {							\
+		(tsk)->task_state_change = _THIS_IP_;		\
+		(tsk)->state = (state_value);			\
+	} while (0)
+#define set_task_state(tsk, state_value)			\
+	do {							\
+		(tsk)->task_state_change = _THIS_IP_;		\
+		set_mb((tsk)->state, (state_value));		\
+	} while (0)
+
+/*
+ * set_current_state() includes a barrier so that the write of current->state
+ * is correctly serialised wrt the caller's subsequent test of whether to
+ * actually sleep:
+ *
+ *	set_current_state(TASK_UNINTERRUPTIBLE);
+ *	if (do_i_need_to_sleep())
+ *		schedule();
+ *
+ * If the caller does not need such serialisation then use __set_current_state()
+ */
+#define __set_current_state(state_value)			\
+	do {							\
+		current->task_state_change = _THIS_IP_;		\
+		current->state = (state_value);			\
+	} while (0)
+#define set_current_state(state_value)				\
+	do {							\
+		current->task_state_change = _THIS_IP_;		\
+		set_mb(current->state, (state_value));		\
+	} while (0)
+
+#else
+
 #define __set_task_state(tsk, state_value)		\
 	do { (tsk)->state = (state_value); } while (0)
 #define set_task_state(tsk, state_value)		\
@@ -259,11 +296,13 @@ extern char ___assert_task_state[1 - 2*!!(
  *
  * If the caller does not need such serialisation then use __set_current_state()
  */
-#define __set_current_state(state_value)			\
+#define __set_current_state(state_value)		\
 	do { current->state = (state_value); } while (0)
-#define set_current_state(state_value)		\
+#define set_current_state(state_value)			\
 	set_mb(current->state, (state_value))
 
+#endif
+
 /* Task command name length */
 #define TASK_COMM_LEN 16
 
@@ -1661,6 +1700,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	unsigned long	task_state_change;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0456a55..5b4b96b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7298,6 +7298,19 @@ void __might_sleep(const char *file, int line, int preempt_offset)
 {
 	static unsigned long prev_jiffy;	/* ratelimiting */
 
+	/*
+	 * Blocking primitives will set (and therefore destroy) current->state,
+	 * since we will exit with TASK_RUNNING make sure we enter with it,
+	 * otherwise we will destroy state.
+	 */
+	if (WARN(current->state != TASK_RUNNING,
+			"do not call blocking ops when !TASK_RUNNING; "
+			"state=%lx set at [<%p>] %pS\n",
+			current->state,
+			(void *)current->task_state_change,
+			(void *)current->task_state_change))
+		__set_current_state(TASK_RUNNING);
+
 	rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
 	     !is_idle_task(current)) ||

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Exclude cond_resched() from nested sleep test
  2014-09-24  8:18 ` [PATCH 11/11] sched: Exclude cond_resched() from nested sleep test Peter Zijlstra
@ 2014-10-28 11:12   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-10-28 11:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: rusty, akpm, linux-kernel, jbaron, hpa, mingo, rostedt, peterz,
	alex.elder, davej, torvalds, axel.lin, dborkman, tglx

Commit-ID:  3427445afd26bd2395f29241319283a93f362cd0
Gitweb:     http://git.kernel.org/tip/3427445afd26bd2395f29241319283a93f362cd0
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 24 Sep 2014 10:18:56 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:56:57 +0100

sched: Exclude cond_resched() from nested sleep test

cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.

In particular, preemption preserves task_struct::state.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/kernel.h |  3 +++
 include/linux/sched.h  |  6 +++---
 kernel/sched/core.c    | 12 +++++++++---
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5068a0d..446d76a 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -162,6 +162,7 @@ extern int _cond_resched(void);
 #endif
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+  void ___might_sleep(const char *file, int line, int preempt_offset);
   void __might_sleep(const char *file, int line, int preempt_offset);
 /**
  * might_sleep - annotation for functions that can sleep
@@ -177,6 +178,8 @@ extern int _cond_resched(void);
 	do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
 # define sched_annotate_sleep()	__set_current_state(TASK_RUNNING)
 #else
+  static inline void ___might_sleep(const char *file, int line,
+				   int preempt_offset) { }
   static inline void __might_sleep(const char *file, int line,
 				   int preempt_offset) { }
 # define might_sleep() do { might_resched(); } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4648e07..4400ddc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2806,7 +2806,7 @@ static inline int signal_pending_state(long state, struct task_struct *p)
 extern int _cond_resched(void);
 
 #define cond_resched() ({			\
-	__might_sleep(__FILE__, __LINE__, 0);	\
+	___might_sleep(__FILE__, __LINE__, 0);	\
 	_cond_resched();			\
 })
 
@@ -2819,14 +2819,14 @@ extern int __cond_resched_lock(spinlock_t *lock);
 #endif
 
 #define cond_resched_lock(lock) ({				\
-	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
 	__cond_resched_lock(lock);				\
 })
 
 extern int __cond_resched_softirq(void);
 
 #define cond_resched_softirq() ({					\
-	__might_sleep(__FILE__, __LINE__, SOFTIRQ_DISABLE_OFFSET);	\
+	___might_sleep(__FILE__, __LINE__, SOFTIRQ_DISABLE_OFFSET);	\
 	__cond_resched_softirq();					\
 })
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b4b96b..b9f78f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7296,8 +7296,6 @@ static inline int preempt_count_equals(int preempt_offset)
 
 void __might_sleep(const char *file, int line, int preempt_offset)
 {
-	static unsigned long prev_jiffy;	/* ratelimiting */
-
 	/*
 	 * Blocking primitives will set (and therefore destroy) current->state,
 	 * since we will exit with TASK_RUNNING make sure we enter with it,
@@ -7311,6 +7309,14 @@ void __might_sleep(const char *file, int line, int preempt_offset)
 			(void *)current->task_state_change))
 		__set_current_state(TASK_RUNNING);
 
+	___might_sleep(file, line, preempt_offset);
+}
+EXPORT_SYMBOL(__might_sleep);
+
+void ___might_sleep(const char *file, int line, int preempt_offset)
+{
+	static unsigned long prev_jiffy;	/* ratelimiting */
+
 	rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
 	     !is_idle_task(current)) ||
@@ -7340,7 +7346,7 @@ void __might_sleep(const char *file, int line, int preempt_offset)
 #endif
 	dump_stack();
 }
-EXPORT_SYMBOL(__might_sleep);
+EXPORT_SYMBOL(___might_sleep);
 #endif
 
 #ifdef CONFIG_MAGIC_SYSRQ

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-28  8:23               ` Peter Zijlstra
@ 2014-10-29  0:00                 ` Oleg Nesterov
  2014-10-29  9:35                   ` Peter Zijlstra
  2014-10-29 14:26                   ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Oleg Nesterov @ 2014-10-29  0:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, mingo, torvalds, tglx, ilya.dryomov,
	linux-kernel, Eric Paris, rafael.j.wysocki

On 10/28, Peter Zijlstra wrote:
>
> On Tue, Oct 28, 2014 at 01:07:03AM +0100, Oleg Nesterov wrote:
>
> > I was going to say that wait_event_freezable() in kauditd_thread()
> > is not friendly wrt kthread_should_stop() and thus we we need
> > kthread_freezable_should_stop().
>
> I'm not sure those two would interact, yes, both would first set either
> the freezable or stop bit and then wake. If both were to race, all we
> need to ensure is to check both before calling schedule again.
>
> A loop like:
>
> 	while (!kthread_should_stop()) {
> 		wait_event_freezable(wq, cond);
> 	}
>
> Would satisfy that, because even if kthread_should_stop() gets set first
> and then freezing happens and we get into try_to_freeze() first, we'd
> still to the kthread_should_stop() check right after we thaw.

Right after, yes.

But what if it calls try_to_freeze() and another thread (which should
be frozen too) sleeps in kthread_stop() ?

> > Perhaps it also makes sense to redefine wait_event_freezable.*()
> > via ___wait_event(cmd => freezable_schedule), but I think this needs
> > another patch.
>
> So I talked to Rafael yesterday and I'm going to replace all the
> wait_event*() stuff, and I suppose also freezable_schedule() because
> they're racy.
>
> The moment we call freezer_do_not_count() the freezer will ignore us,
> this means the thread could still be running (albeit not for long) when
> the freezer reports success.

Yes, sure. IIRC the theory was that a PF_FREEZER_SKIP will do nothing
"wrong" wrt freezing/suspend before it actually sleeps, but I guess
today we can't assume this.

> Ideally I'll be able to kill the entire freezer_do_not_count() stuff.

Agreed... but it is not clear to me what exactly we can/should do.


Anyway, I only meant that I believe your patch is correct (just it should
not define wait_freezable which we already have), and you could also remove
that kthread_should_stop() which only adds the confusion.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-29  0:00                 ` Oleg Nesterov
@ 2014-10-29  9:35                   ` Peter Zijlstra
  2014-10-29 11:31                     ` Peter Zijlstra
  2014-10-29 14:26                   ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-29  9:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, mingo, torvalds, tglx, ilya.dryomov,
	linux-kernel, Eric Paris, rafael.j.wysocki

On Wed, Oct 29, 2014 at 01:00:56AM +0100, Oleg Nesterov wrote:
> On 10/28, Peter Zijlstra wrote:
> >
> > On Tue, Oct 28, 2014 at 01:07:03AM +0100, Oleg Nesterov wrote:
> >
> > > I was going to say that wait_event_freezable() in kauditd_thread()
> > > is not friendly wrt kthread_should_stop() and thus we we need
> > > kthread_freezable_should_stop().
> >
> > I'm not sure those two would interact, yes, both would first set either
> > the freezable or stop bit and then wake. If both were to race, all we
> > need to ensure is to check both before calling schedule again.
> >
> > A loop like:
> >
> > 	while (!kthread_should_stop()) {
> > 		wait_event_freezable(wq, cond);
> > 	}
> >
> > Would satisfy that, because even if kthread_should_stop() gets set first
> > and then freezing happens and we get into try_to_freeze() first, we'd
> > still to the kthread_should_stop() check right after we thaw.
> 
> Right after, yes.
> 
> But what if it calls try_to_freeze() and another thread (which should
> be frozen too) sleeps in kthread_stop() ?

Fair point indeed. Now I had a look at __refrigerator() and is there any
reason we should not remove that .check_kthr_stop argument and replace
it with an unconditional (current->flags & PF_KTHREAD) ?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-29  9:35                   ` Peter Zijlstra
@ 2014-10-29 11:31                     ` Peter Zijlstra
  2014-10-29 11:36                       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-29 11:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, mingo, torvalds, tglx, ilya.dryomov,
	linux-kernel, Eric Paris, rafael.j.wysocki

On Wed, Oct 29, 2014 at 10:35:03AM +0100, Peter Zijlstra wrote:
> On Wed, Oct 29, 2014 at 01:00:56AM +0100, Oleg Nesterov wrote:

> > But what if it calls try_to_freeze() and another thread (which should
> > be frozen too) sleeps in kthread_stop() ?
> 
> Fair point indeed. Now I had a look at __refrigerator() and is there any
> reason we should not remove that .check_kthr_stop argument and replace
> it with an unconditional (current->flags & PF_KTHREAD) ?

Hmm, doesn't kthread_freezable_should_stop() suffer the same problem? It
should not refrigerate when should_stop.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-29 11:31                     ` Peter Zijlstra
@ 2014-10-29 11:36                       ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-29 11:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, mingo, torvalds, tglx, ilya.dryomov,
	linux-kernel, Eric Paris, rafael.j.wysocki

On Wed, Oct 29, 2014 at 12:31:03PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 29, 2014 at 10:35:03AM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 29, 2014 at 01:00:56AM +0100, Oleg Nesterov wrote:
> 
> > > But what if it calls try_to_freeze() and another thread (which should
> > > be frozen too) sleeps in kthread_stop() ?
> > 
> > Fair point indeed. Now I had a look at __refrigerator() and is there any
> > reason we should not remove that .check_kthr_stop argument and replace
> > it with an unconditional (current->flags & PF_KTHREAD) ?
> 
> Hmm, doesn't kthread_freezable_should_stop() suffer the same problem? It
> should not refrigerate when should_stop.

Oh never you mind, i've got my head on backwards it seems, it uses
__refrigerator(.check_kthr_stop = true), which does another should_stop
test.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/11] nested sleeps, fixes and debug infrastructure
  2014-10-29  0:00                 ` Oleg Nesterov
  2014-10-29  9:35                   ` Peter Zijlstra
@ 2014-10-29 14:26                   ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2014-10-29 14:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Galbraith, mingo, torvalds, tglx, ilya.dryomov,
	linux-kernel, Eric Paris, rafael.j.wysocki

On Wed, Oct 29, 2014 at 01:00:56AM +0100, Oleg Nesterov wrote:
> On 10/28, Peter Zijlstra wrote:

> > So I talked to Rafael yesterday and I'm going to replace all the
> > wait_event*() stuff, and I suppose also freezable_schedule() because
> > they're racy.
> >
> > The moment we call freezer_do_not_count() the freezer will ignore us,
> > this means the thread could still be running (albeit not for long) when
> > the freezer reports success.
> 
> Yes, sure. IIRC the theory was that a PF_FREEZER_SKIP will do nothing
> "wrong" wrt freezing/suspend before it actually sleeps, but I guess
> today we can't assume this.

Esp. the wait_event_freezable*() family seems suspicious in that the
cond stmt can actually result in quite a lot of code.

But see below, I don't think we have a guarantee it will _ever_ sleep.

Also, this calls schedule(); try_to_freeze() in a suitable loop that's
safe against spurious wakeups, OTOH..

> > Ideally I'll be able to kill the entire freezer_do_not_count() stuff.
> 
> Agreed... but it is not clear to me what exactly we can/should do.

.. I looked at freezable_schedule() and I'm not sure how to 'fix' that.
The problem being things like signal.c:ptrace_stop() that will actually
misbehave in the face of spurious wakeups as allowed by try_to_freeze().

Then again, freezable_schedule() isn't nearly as bad as the
wait_event_freezable() stuff because it does indeed guarantee the task
only calls schedule().

Then again, it is possible to miss these tasks and report freeze success
with a running task all the same, suppose its already woken but
preempted before freezer_count(). The for_each_process_thread() loop in
try_to_freeze_tasks() will skip over it.

And all I can come up with is horrible.. maybe for another day.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] audit, sched/wait: Fixup kauditd_thread() wait loop
  2014-10-02 10:22       ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
  2014-10-02 12:15         ` Peter Zijlstra
@ 2014-11-04 16:08         ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-11-04 16:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, torvalds, eparis, mingo, linux-kernel, umgwanakikbuti, tglx, peterz

Commit-ID:  6b55fc63f46ba299f3d84013e9232be4bd259eab
Gitweb:     http://git.kernel.org/tip/6b55fc63f46ba299f3d84013e9232be4bd259eab
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 2 Oct 2014 12:22:51 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 4 Nov 2014 07:17:47 +0100

audit, sched/wait: Fixup kauditd_thread() wait loop

The kauditd_thread wait loop is a bit iffy; it has a number of problems:

 - calls try_to_freeze() before schedule(); you typically want the
   thread to re-evaluate the sleep condition when unfreezing, also
   freeze_task() issues a wakeup.

 - it unconditionally does the {add,remove}_wait_queue(), even when the
   sleep condition is false.

Use wait_event_freezable() that does the right thing.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: oleg@redhat.com
Cc: Eric Paris <eparis@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141002102251.GA6324@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/audit.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 80983df..32bfc43 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -499,7 +499,6 @@ static int kauditd_thread(void *dummy)
 	set_freezable();
 	while (!kthread_should_stop()) {
 		struct sk_buff *skb;
-		DECLARE_WAITQUEUE(wait, current);
 
 		flush_hold_queue();
 
@@ -514,16 +513,8 @@ static int kauditd_thread(void *dummy)
 				audit_printk_skb(skb);
 			continue;
 		}
-		set_current_state(TASK_INTERRUPTIBLE);
-		add_wait_queue(&kauditd_wait, &wait);
 
-		if (!skb_queue_len(&audit_skb_queue)) {
-			try_to_freeze();
-			schedule();
-		}
-
-		__set_current_state(TASK_RUNNING);
-		remove_wait_queue(&kauditd_wait, &wait);
+		wait_event_freezable(kauditd_wait, skb_queue_len(&audit_skb_queue));
 	}
 	return 0;
 }

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2014-11-04 16:09 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-24  8:18 [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
2014-09-24  8:18 ` [PATCH 01/11] locking/mutex: Dont assume TASK_RUNNING Peter Zijlstra
2014-10-28 11:09   ` [tip:sched/core] locking/mutex: Don't " tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 02/11] wait: Provide infrastructure to deal with nested blocking Peter Zijlstra
2014-09-29 21:02   ` Oleg Nesterov
2014-10-02  7:37     ` Peter Zijlstra
2014-10-02 21:21       ` Oleg Nesterov
2014-10-28 11:09   ` [tip:sched/core] sched/wait: " tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 03/11] wait: Add might_sleep() Peter Zijlstra
2014-10-28 11:09   ` [tip:sched/core] sched/wait: Add might_sleep() checks tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 04/11] exit: Deal with nested sleeps Peter Zijlstra
2014-10-28 11:10   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 05/11] inotify: " Peter Zijlstra
2014-10-28 11:10   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 06/11] tty: " Peter Zijlstra
2014-10-28 11:10   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 07/11] smp: Correctly deal " Peter Zijlstra
2014-10-28 11:11   ` [tip:sched/core] sched, " tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 08/11] module: Fix nested sleep Peter Zijlstra
2014-09-29 22:18   ` Oleg Nesterov
2014-09-30 13:43     ` Peter Zijlstra
2014-10-28 11:11   ` [tip:sched/core] sched, modules: Fix nested sleep in add_unformed_module() tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 09/11] net: Clean up sk_wait_event() vs might_sleep() Peter Zijlstra
2014-09-24  8:36   ` Peter Zijlstra
2014-10-28 11:11   ` [tip:sched/core] sched, net: Clean up sk_wait_event() vs. might_sleep() tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 10/11] sched: Debug nested sleeps Peter Zijlstra
2014-09-29 22:13   ` Oleg Nesterov
2014-09-30 13:49     ` Peter Zijlstra
2014-09-30 21:47       ` Oleg Nesterov
2014-10-01 16:10         ` Peter Zijlstra
2014-10-01 18:35           ` Oleg Nesterov
2014-10-02  9:07             ` Peter Zijlstra
2014-10-02 21:34               ` Oleg Nesterov
2014-10-28 11:11   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2014-09-24  8:18 ` [PATCH 11/11] sched: Exclude cond_resched() from nested sleep test Peter Zijlstra
2014-10-28 11:12   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2014-09-25  8:30 ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Mike Galbraith
2014-09-25  9:06   ` Peter Zijlstra
2014-09-25  9:10     ` Mike Galbraith
2014-09-25  9:15     ` Peter Zijlstra
2014-09-25  9:56       ` Mike Galbraith
2014-09-25 13:59         ` BUG: sleeping function called from invalid context at drivers/cpufreq/cpufreq.c:370 Mike Galbraith
2014-09-26  6:24           ` Mike Galbraith
2014-09-26  7:54             ` Mike Galbraith
2014-09-26 14:10               ` Rafael J. Wysocki
2014-09-26 22:44               ` Rafael J. Wysocki
2014-09-27  6:14                 ` Mike Galbraith
2014-09-27 19:57                   ` Rafael J. Wysocki
2014-10-02 10:22       ` [PATCH 00/11] nested sleeps, fixes and debug infrastructure Peter Zijlstra
2014-10-02 12:15         ` Peter Zijlstra
2014-10-27 13:41           ` Peter Zijlstra
2014-10-28  0:07             ` Oleg Nesterov
2014-10-28  8:23               ` Peter Zijlstra
2014-10-29  0:00                 ` Oleg Nesterov
2014-10-29  9:35                   ` Peter Zijlstra
2014-10-29 11:31                     ` Peter Zijlstra
2014-10-29 11:36                       ` Peter Zijlstra
2014-10-29 14:26                   ` Peter Zijlstra
2014-11-04 16:08         ` [tip:sched/core] audit, sched/wait: Fixup kauditd_thread() wait loop tip-bot for Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.