* [ANNOUNCE] 3.14.3-rt5
@ 2014-05-09 18:12 Sebastian Andrzej Siewior
  2014-05-09 22:54 ` Pavel Vasilyev
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Sebastian Andrzej Siewior @ 2014-05-09 18:12 UTC (permalink / raw)
  To: linux-rt-users; +Cc: LKML, Thomas Gleixner, rostedt, John Kacur

Dear RT folks!

I'm pleased to announce the v3.14.3-rt5 patch set.

Changes since v3.14.3-rt4
- remove one of the two identical rt_mutex_init() definitions. A patch
  from Steven Rostedt.
- use EXPORT_SYMBOL() on __rt_mutex_init() and
  rt_down_write_nested_lock(). The former was dropped accidentally and is
  needed by some binary-only modules, the latter was requested by the f2fs
  module. Patch by Joakim Hernberg. (A minimal sketch of why this matters
  for such modules follows this list.)
- during the v3.14 port I accidentally dropped preempt_check_resched() in
  the non-preempt case, which means non-preempt configs did not build.
  Reported by Yang Honggang.
- NETCONSOLE is no longer disabled on RT. Daniel Bristot de Oliveira did
  some testing and did not find anything wrong with it. That means it can
  be enabled if someone needs/wants it.
- rt_read_lock() uses rwlock_acquire() instead of rwlock_acquire_read()
  for the lockdep annotation. It was different from what the trylock
  variant used, and on RT both act the same way. Patch by Mike Galbraith.
- the tracing code wrongly disabled preemption while shrinking the ring
  buffer. Reported by Stanislav Meduna.
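
The sketch referenced in the EXPORT_SYMBOL() item above: a hypothetical
out-of-tree module (all names invented for illustration, nothing from
this release). rt_mutex_init() is a macro that expands to a call to
__rt_mutex_init(), so any module that uses it, including non-GPL
"binary only" ones, must be able to link against that symbol; a plain
EXPORT_SYMBOL() allows that, while EXPORT_SYMBOL_GPL() would not.

	/* Hypothetical module, for illustration only. */
	#include <linux/module.h>
	#include <linux/rtmutex.h>

	static struct rt_mutex example_lock;	/* invented name */

	static int __init example_init(void)
	{
		/* Expands to a __rt_mutex_init() call, hence the symbol
		 * must be exported for modules. */
		rt_mutex_init(&example_lock);
		return 0;
	}
	module_init(example_init);

	MODULE_LICENSE("Proprietary");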

Known issues:

      - bcache is disabled.

      - lazy preempt on x86_64 leads to a crash with some load.

      - CPU hotplug works in general. Steven's test script however
        deadlocks usually on the second invocation.

The delta patch against v3.14.3-rt4 is appended below and can be found
here:
   https://www.kernel.org/pub/linux/kernel/projects/rt/3.14/incr/patch-3.14.3-rt4-rt5.patch.xz

The RT patch against 3.14.3 can be found here:

   https://www.kernel.org/pub/linux/kernel/projects/rt/3.14/patch-3.14.3-rt5.patch.xz

The split quilt queue is available at:

   https://www.kernel.org/pub/linux/kernel/projects/rt/3.14/patches-3.14.3-rt5.tar.xz

Sebastian

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 33a4a85..494b888 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -160,7 +160,6 @@ config VXLAN
 
 config NETCONSOLE
 	tristate "Network console logging support"
-	depends on !PREEMPT_RT_FULL
 	---help---
 	If you want to log kernel messages over the network, enable this.
 	See <file:Documentation/networking/netconsole.txt> for details.
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 5b2cdf4..66587bf 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -149,6 +149,7 @@ do { \
 #define sched_preempt_enable_no_resched()	barrier()
 #define preempt_enable_no_resched()		barrier()
 #define preempt_enable()			barrier()
+#define preempt_check_resched()			do { } while (0)
 
 #define preempt_disable_notrace()		barrier()
 #define preempt_enable_no_resched_notrace()	barrier()
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index f8f3dfdd2..edb77fd 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -59,23 +59,18 @@ struct hrtimer_sleeper;
 # define rt_mutex_debug_check_no_locks_held(task)	do { } while (0)
 #endif
 
-#ifdef CONFIG_DEBUG_RT_MUTEXES
-# define __DEBUG_RT_MUTEX_INITIALIZER(mutexname) \
-	, .name = #mutexname, .file = __FILE__, .line = __LINE__
 # define rt_mutex_init(mutex)					\
 	do {							\
 		raw_spin_lock_init(&(mutex)->wait_lock);	\
 		__rt_mutex_init(mutex, #mutex);			\
 	} while (0)
 
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+# define __DEBUG_RT_MUTEX_INITIALIZER(mutexname) \
+	, .name = #mutexname, .file = __FILE__, .line = __LINE__
  extern void rt_mutex_debug_task_free(struct task_struct *tsk);
 #else
 # define __DEBUG_RT_MUTEX_INITIALIZER(mutexname)
-# define rt_mutex_init(mutex)					\
-	 do {							\
-		 raw_spin_lock_init(&(mutex)->wait_lock);	\
-		 __rt_mutex_init(mutex, #mutex);		\
-	 } while (0)
 # define rt_mutex_debug_task_free(t)			do { } while (0)
 #endif
 
diff --git a/kernel/locking/rt.c b/kernel/locking/rt.c
index 055a3df..90b8ba0 100644
--- a/kernel/locking/rt.c
+++ b/kernel/locking/rt.c
@@ -250,7 +250,7 @@ void __lockfunc rt_read_lock(rwlock_t *rwlock)
 	 */
 	if (rt_mutex_owner(lock) != current) {
 		migrate_disable();
-		rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_);
+		rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_);
 		__rt_spin_lock(lock);
 	}
 	rwlock->read_depth++;
@@ -366,6 +366,7 @@ void rt_down_write_nested_lock(struct rw_semaphore *rwsem,
 	rwsem_acquire_nest(&rwsem->dep_map, 0, 0, nest, _RET_IP_);
 	rt_mutex_lock(&rwsem->lock);
 }
+EXPORT_SYMBOL(rt_down_write_nested_lock);
 
 int  rt_down_read_trylock(struct rw_semaphore *rwsem)
 {
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 5c5cc76..fbf152b 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1552,7 +1552,7 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 
 	debug_rt_mutex_init(lock, name);
 }
-EXPORT_SYMBOL_GPL(__rt_mutex_init);
+EXPORT_SYMBOL(__rt_mutex_init);
 
 /**
  * rt_mutex_init_proxy_locked - initialize and lock a rt_mutex on behalf of a
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fc4da2d..112d4a5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1682,28 +1682,22 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
 		 * We can't schedule on offline CPUs, but it's not necessary
 		 * since we can change their buffer sizes without any race.
 		 */
+		migrate_disable();
 		for_each_buffer_cpu(buffer, cpu) {
 			cpu_buffer = buffer->buffers[cpu];
 			if (!cpu_buffer->nr_pages_to_update)
 				continue;
 
 			/* The update must run on the CPU that is being updated. */
-			preempt_disable();
 			if (cpu == smp_processor_id() || !cpu_online(cpu)) {
 				rb_update_pages(cpu_buffer);
 				cpu_buffer->nr_pages_to_update = 0;
 			} else {
-				/*
-				 * Can not disable preemption for schedule_work_on()
-				 * on PREEMPT_RT.
-				 */
-				preempt_enable();
 				schedule_work_on(cpu,
 						&cpu_buffer->update_pages_work);
-				preempt_disable();
 			}
-			preempt_enable();
 		}
+		migrate_enable();
 
 		/* wait for all the updates to complete */
 		for_each_buffer_cpu(buffer, cpu) {
@@ -1740,22 +1734,16 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
 
 		get_online_cpus();
 
-		preempt_disable();
+		migrate_disable();
 		/* The update must run on the CPU that is being updated. */
 		if (cpu_id == smp_processor_id() || !cpu_online(cpu_id))
 			rb_update_pages(cpu_buffer);
 		else {
-			/*
-			 * Can not disable preemption for schedule_work_on()
-			 * on PREEMPT_RT.
-			 */
-			preempt_enable();
 			schedule_work_on(cpu_id,
 					 &cpu_buffer->update_pages_work);
 			wait_for_completion(&cpu_buffer->update_done);
-			preempt_disable();
 		}
-		preempt_enable();
+		migrate_enable();
 
 		cpu_buffer->nr_pages_to_update = 0;
 		put_online_cpus();


* Re: [ANNOUNCE] 3.14.3-rt5
  2014-05-09 18:12 [ANNOUNCE] 3.14.3-rt5 Sebastian Andrzej Siewior
@ 2014-05-09 22:54 ` Pavel Vasilyev
  2014-05-13 15:33   ` Sebastian Andrzej Siewior
  2014-05-10  4:15 ` Mike Galbraith
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Pavel Vasilyev @ 2014-05-09 22:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-rt-users
  Cc: LKML, Thomas Gleixner, rostedt, John Kacur

On 09.05.2014 22:12, Sebastian Andrzej Siewior wrote:

> The delta patch against v3.14.3-rt4 is appended below and can be found

Where is the delta from 3 to 4?

-- 

                                                          Pavel.


* Re: [ANNOUNCE] 3.14.3-rt5
  2014-05-09 18:12 [ANNOUNCE] 3.14.3-rt5 Sebastian Andrzej Siewior
  2014-05-09 22:54 ` Pavel Vasilyev
@ 2014-05-10  4:15 ` Mike Galbraith
  2014-05-13 15:40   ` Sebastian Andrzej Siewior
  2014-05-13 13:30 ` [ANNOUNCE] 3.14.3-rt5 Juri Lelli
  2014-05-17  3:36 ` [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat Mike Galbraith
  3 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2014-05-10  4:15 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

On Fri, 2014-05-09 at 20:12 +0200, Sebastian Andrzej Siewior wrote:

> Known issues:
> 
>       - bcache is disabled.
> 
>       - lazy preempt on x86_64 leads to a crash with some load.

That is only with NO_HZ_FULL enabled here.  Box blows the stack during
task exit, eyeballing hasn't spotted the why.

> - CPU hotplug works in general. Steven's test script however
>         deadlocks usually on the second invocation.

My 64 core box runs for up to 14 hours, and never deadlocks.. it
explodes in what looks like it should be an impossible manner instead.

-Mike



* Re: [ANNOUNCE] 3.14.3-rt5
  2014-05-09 18:12 [ANNOUNCE] 3.14.3-rt5 Sebastian Andrzej Siewior
  2014-05-09 22:54 ` Pavel Vasilyev
  2014-05-10  4:15 ` Mike Galbraith
@ 2014-05-13 13:30 ` Juri Lelli
  2015-02-16 11:29   ` Sebastian Andrzej Siewior
  2014-05-17  3:36 ` [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat Mike Galbraith
  3 siblings, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2014-05-13 13:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

Hi,

On Fri, 9 May 2014 20:12:14 +0200
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> Dear RT folks!
> 
> I'm pleased to announce the v3.14.3-rt5 patch set.
> 
> Changes since v3.14.3-rt4
> - remove one of the two identical rt_mutex_init() definitions. A patch
>   from Steven Rostedt.
> - use EXPORT_SYMBOL() on __rt_mutex_init() and
>   rt_down_write_nested_lock(). The former was dropped accidentally and is
>   needed by some binary-only modules, the latter was requested by the f2fs
>   module. Patch by Joakim Hernberg.
> - during the v3.14 port I accidentally dropped preempt_check_resched() in
>   the non-preempt case, which means non-preempt configs did not build.
>   Reported by Yang Honggang.
> - NETCONSOLE is no longer disabled on RT. Daniel Bristot de Oliveira did
>   some testing and did not find anything wrong with it. That means it can
>   be enabled if someone needs/wants it.
> - rt_read_lock() uses rwlock_acquire() instead of rwlock_acquire_read()
>   for the lockdep annotation. It was different from what the trylock
>   variant used, and on RT both act the same way. Patch by Mike Galbraith.
> - the tracing code wrongly disabled preemption while shrinking the ring
>   buffer. Reported by Stanislav Meduna.
> 
> Known issues:
> 
>       - bcache is disabled.
> 
>       - lazy preempt on x86_64 leads to a crash with some load.
> 
>       - CPU hotplug works in general. Steven's test script however
>         deadlocks usually on the second invocation.
> 

Also SCHED_DEADLINE dies without the following.

Thanks,

- Juri

---
From 3ca5943538c728399037823e5632431bc2da707c Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@gmail.com>
Date: Tue, 13 May 2014 15:21:16 +0200
Subject: [PATCH] sched/deadline: dl_task_timer has to be irqsafe

As for rt_period_timer, dl_task_timer has to be irqsafe.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/sched/deadline.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6e79b3f..48b04ce 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -537,6 +537,7 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se)
 
 	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	timer->function = dl_task_timer;
+	timer->irqsafe = 1;
 }
 
 static
-- 
1.7.10.4


* Re: [ANNOUNCE] 3.14.3-rt5
  2014-05-09 22:54 ` Pavel Vasilyev
@ 2014-05-13 15:33   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 18+ messages in thread
From: Sebastian Andrzej Siewior @ 2014-05-13 15:33 UTC (permalink / raw)
  To: Pavel Vasilyev; +Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

* Pavel Vasilyev | 2014-05-10 02:54:34 [+0400]:

>On 09.05.2014 22:12, Sebastian Andrzej Siewior wrote:
>
>>The delta patch against v3.14.3-rt4 is appended below and can be found
>
>Where is the delta from 3 to 4?
I can remember answering that question for an earlier release. The change
is mostly in
    https://git.kernel.org/cgit/linux/kernel/git/bigeasy/rt-devel.git/commit/?id=be0f124e477586b437108e09b66153be66dad29d

For more details feel free to "git diff v3.14.2-rt3..v3.14.3-rt4" in
that tree. This is however just a stable fixup.

Sebastian


* Re: [ANNOUNCE] 3.14.3-rt5
  2014-05-10  4:15 ` Mike Galbraith
@ 2014-05-13 15:40   ` Sebastian Andrzej Siewior
  2014-05-14  3:10     ` Mike Galbraith
  2014-05-16 13:53     ` [patch] rt/sched: fix recursion when CONTEXT_TRACKING and PREEMPT_LAZY are enabled Mike Galbraith
  0 siblings, 2 replies; 18+ messages in thread
From: Sebastian Andrzej Siewior @ 2014-05-13 15:40 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

* Mike Galbraith | 2014-05-10 06:15:03 [+0200]:

>On Fri, 2014-05-09 at 20:12 +0200, Sebastian Andrzej Siewior wrote:
>
>> Known issues:
>> 
>>       - bcache is disabled.
>> 
>>       - lazy preempt on x86_64 leads to a crash with some load.
>
>That is only with NO_HZ_FULL enabled here.  Box blows the stack during
>task exit, eyeballing hasn't spotted the why.

Even if I disable NO_HZ_FULL it explodes as soon as hackbench starts.

>> - CPU hotplug works in general. Steven's test script however
>>         deadlocks usually on the second invocation.
>
>My 64 core box runs for up to 14 hours, and never deadlocks.. it
>explodes in what looks like it should be an impossible manner instead.

It deadlocks here and I haven't figured out the exact root cause. From
what it looks like, the irq thread blocks on something during startup
(migrate_disable() or so). One of the blocked irq threads is the disk
driver. The userland tasks then block on ext4 waiting for the requests
to complete.

I also noticed that the frequent cpu up/down fails at some point and my
kvm guest has just 7 out of 8 CPUs. That one CPU remains dead and can't
get back online. Once that happens, the deadlock is coming within a few
minutes :)

>-Mike

Sebastian


* Re: [ANNOUNCE] 3.14.3-rt5
  2014-05-13 15:40   ` Sebastian Andrzej Siewior
@ 2014-05-14  3:10     ` Mike Galbraith
  2014-05-16 13:53     ` [patch] rt/sched: fix recursion when CONTEXT_TRACKING and PREEMPT_LAZY are enabled Mike Galbraith
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2014-05-14  3:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

On Tue, 2014-05-13 at 17:40 +0200, Sebastian Andrzej Siewior wrote: 
> * Mike Galbraith | 2014-05-10 06:15:03 [+0200]:
> 
> >On Fri, 2014-05-09 at 20:12 +0200, Sebastian Andrzej Siewior wrote:
> >
> >> Known issues:
> >> 
> >>       - bcache is disabled.
> >> 
> >>       - lazy preempt on x86_64 leads to a crash with some load.
> >
> >That is only with NO_HZ_FULL enabled here.  Box blows the stack during
> >task exit, eyeballing hasn't spotted the why.
> 
> Even if I disable NO_HZ_FULL it explodes as soon as hackbench starts.

Well good, that makes a hell of a lot more sense.  The below is with
NO_HZ_FULL enabled, and hackbench exploding on exit.  Every kaboom I've
seen has been a dead task exploding on scrambled thread_info.

Accessing per-anti-cpu data doesn't work well from our universe ;-)

crash> bt  6657
PID: 6657   TASK: ffff8801f947ac00  CPU: 1   COMMAND: "hackbench"
 #0 [ffff88022fc86e00] crash_nmi_callback at ffffffff8102b8f4
 #1 [ffff88022fc86e10] nmi_handle at ffffffff8164865a
 #2 [ffff88022fc86ea0] default_do_nmi at ffffffff81648883
 #3 [ffff88022fc86ed0] do_nmi at ffffffff81648b50
 #4 [ffff88022fc86ef0] end_repeat_nmi at ffffffff81647b71
    [exception RIP: oops_begin+162]
    RIP: ffffffff816483e2  RSP: ffff8800b220d9d8  RFLAGS: 00000097
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000097
    RDX: ffff8800b220d9d8  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff816483e2   R8: ffffffff816483e2   R9: 0000000000000018
    R10: ffff8800b220d9d8  R11: 0000000000000097  R12: ffffffffffffffff
    R13: ffff88022700bf00  R14: 0000000000000100  R15: 0000000000000001
    ORIG_RAX: 0000000000000001  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffff8800b220d9d8] oops_begin at ffffffff816483e2
 #6 [ffff8800b220d9f0] no_context at ffffffff8162ef25
 #7 [ffff8800b220da40] __bad_area_nosemaphore at ffffffff8162f19d
 #8 [ffff8800b220daa0] bad_area_nosemaphore at ffffffff8162f1ca
 #9 [ffff8800b220dab0] __do_page_fault at ffffffff8164a68e
#10 [ffff8800b220dbd0] do_page_fault at ffffffff8164ab9e
#11 [ffff8800b220dc00] page_fault at ffffffff81647808
    [exception RIP: cpuacct_charge+148]
    RIP: ffffffff810a1874  RSP: ffff8800b220dcb8  RFLAGS: 00010046
    RAX: 0000000000000040  RBX: 000000000000dd08  RCX: 0000000000000003
    RDX: 0000000000000006  RSI: 0000000000000006  RDI: ffff88022700bf00
    RBP: ffff8800b220dcf8   R8: 00000000000006c0   R9: 000000000000000b
    R10: 0000000000000000  R11: 0000000000013f40  R12: ffffffff81c3b180
    R13: ffff8801f947ac00  R14: ffffffffb220ddd8  R15: 0000000000001d64
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#12 [ffff8800b220dd00] update_curr at ffffffff81092451
#13 [ffff8800b220dd60] dequeue_entity at ffffffff810928f3
#14 [ffff8800b220ddc0] dequeue_task_fair at ffffffff81092d4d
#15 [ffff8800b220de10] dequeue_task at ffffffff8108442e
#16 [ffff8800b220de40] deactivate_task at ffffffff81084f9e
#17 [ffff8800b220de50] __schedule at ffffffff816440d4
#18 [ffff8800b220ded0] schedule at ffffffff81644899
#19 [ffff8800b220def0] do_exit at ffffffff810530d0
#20 [ffff8800b220df40] do_group_exit at ffffffff8105334c
#21 [ffff8800b220df70] sys_exit_group at ffffffff810533e2
#22 [ffff8800b220df80] tracesys at ffffffff8164f109 (via system_call)
    RIP: 00007fcc1a078ca8  RSP: 00007fff62546c48  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8164f109  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000003c  RDI: 0000000000000000
    RBP: 00007fcc1a355840   R8: 00000000000000e7   R9: ffffffffffffffa8
    R10: 00007fcc1a969700  R11: 0000000000000246  R12: ffffffff810533e2
    R13: ffff8800b220df78  R14: 0000000001ad9c88  R15: 0000000000000001
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

crash> struct thread_info 0xffff8800b220c000
struct thread_info {
  task = 0xffffffff, 
  exec_domain = 0xffffffff811bae66 <__d_free+70>, 
  flags = 2, 
  status = 0, 
  cpu = 2988498392, 
  saved_preempt_count = -30720, 
  preempt_lazy_count = -112742225, 
  addr_limit = {
    seg = 524802
  }, 
  restart_block = {
    fn = 0xffff88022fc91358, 
    {
      futex = {
        uaddr = 0x80202, 
        val = 3, 
        flags = 0, 
        bitset = 2988490752, 
        time = 18446744071585425101, 
        uaddr2 = 0xffff88022fc91358
      }, 
      nanosleep = {
        clockid = 524802, 
        rmtp = 0x3, 
        compat_rmtp = 0xffff8800b220c000, 
        expires = 18446744071585425101
      }, 
      poll = {
        ufds = 0x80202, 
        nfds = 3, 
        has_timeout = 0, 
        tv_sec = 18446612135302709248, 
        tv_nsec = 18446744071585425101
      }
    }
  }, 
  sysenter_return = 0xffffffff, 
  sig_on_uaccess_error = 0, 
  uaccess_err = 0
}




* [patch] rt/sched: fix recursion when CONTEXT_TRACKING and PREEMPT_LAZY are enabled
  2014-05-13 15:40   ` Sebastian Andrzej Siewior
  2014-05-14  3:10     ` Mike Galbraith
@ 2014-05-16 13:53     ` Mike Galbraith
  2014-05-25  8:16       ` [patch v2] " Mike Galbraith
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2014-05-16 13:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

On Tue, 2014-05-13 at 17:40 +0200, Sebastian Andrzej Siewior wrote: 
> * Mike Galbraith | 2014-05-10 06:15:03 [+0200]:
> 
> >On Fri, 2014-05-09 at 20:12 +0200, Sebastian Andrzej Siewior wrote:
> >
> >> Known issues:
> >> 
> >>       - bcache is disabled.
> >> 
> >>       - lazy preempt on x86_64 leads to a crash with some load.
> >
> >That is only with NO_HZ_FULL enabled here.  Box blows the stack during
> >task exit, eyeballing hasn't spotted the why.
> 
> Even if I disable NO_HZ_FULL it explodes as soon as hackbench starts.

Ah, you didn't turn CONTEXT_TRACKING off too.  The below made the dirty
little SOB die here.

If context tracking is enabled, we can recurse, and explode violently.
Add missing checks to preempt_schedule_context().

Fix other inconsistencies spotted while searching.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 arch/x86/include/asm/thread_info.h |    1 +
 include/linux/preempt.h            |    2 +-
 include/linux/preempt_mask.h       |   10 ++++++++--
 kernel/context_tracking.c          |    2 +-
 kernel/fork.c                      |    1 +
 kernel/sched/core.c                |   16 +++++-----------
 kernel/sched/fair.c                |    2 +-
 7 files changed, 18 insertions(+), 16 deletions(-)

--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -51,6 +51,7 @@ struct thread_info {
 	.flags		= 0,			\
 	.cpu		= 0,			\
 	.saved_preempt_count = INIT_PREEMPT_COUNT,	\
+	.preempt_lazy_count = 0,		\
 	.addr_limit	= KERNEL_DS,		\
 	.restart_block = {			\
 		.fn = do_no_restart_syscall,	\
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -91,8 +91,8 @@ do { \
 
 #define preempt_lazy_enable() \
 do { \
-	dec_preempt_lazy_count(); \
 	barrier(); \
+	dec_preempt_lazy_count(); \
 	preempt_check_resched(); \
 } while (0)
 
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -118,9 +118,15 @@ extern int in_serving_softirq(void);
 		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
 
 #ifdef CONFIG_PREEMPT_COUNT
-# define preemptible()	(preempt_count() == 0 && !irqs_disabled())
+# define preemptible()		(preempt_count() == 0 && !irqs_disabled())
+#ifdef CONFIG_PREEMPT_LAZY
+# define preemptible_lazy()	(preempt_lazy_count() !=0 && !need_resched_now())
 #else
-# define preemptible()	0
+# define preemptible_lazy()	1
+#endif
+#else
+# define preemptible()		0
+# define preemptible_lazy()	0
 #endif
 
 #endif /* LINUX_PREEMPT_MASK_H */
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -124,7 +124,7 @@ asmlinkage void __sched notrace preempt_
 {
 	enum ctx_state prev_ctx;
 
-	if (likely(!preemptible()))
+	if (likely(!preemptible() || !preemptible_lazy()))
 		return;
 
 	/*
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -329,6 +329,7 @@ static struct task_struct *dup_task_stru
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
+	clear_tsk_need_resched_lazy(tsk);
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2861,8 +2861,8 @@ void migrate_enable(void)
 		p->migrate_disable = 0;
 
 	unpin_current_cpu();
-	preempt_enable();
 	preempt_lazy_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(migrate_enable);
 #else
@@ -3096,19 +3096,13 @@ asmlinkage void __sched notrace preempt_
 {
 	/*
 	 * If there is a non-zero preempt_count or interrupts are disabled,
-	 * we do not want to preempt the current task. Just return..
+	 * we do not want to preempt the current task. Just return.  For
+	 * lazy preemption we also check for non-zero preempt_count_lazy,
+	 * and bail if no immediate preemption is required.
 	 */
-	if (likely(!preemptible()))
+	if (likely(!preemptible() || !preemptible_lazy()))
 		return;
 
-#ifdef CONFIG_PREEMPT_LAZY
-	/*
-	 * Check for lazy preemption
-	 */
-	if (current_thread_info()->preempt_lazy_count &&
-			!test_thread_flag(TIF_NEED_RESCHED))
-		return;
-#endif
 	do {
 		__preempt_count_add(PREEMPT_ACTIVE);
 		/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4446,7 +4446,7 @@ static void check_preempt_wakeup(struct
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(curr) || test_tsk_need_resched_lazy(curr))
 		return;
 
 	/* Idle tasks are by definition preempted by non-idle tasks. */




* [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat
  2014-05-09 18:12 [ANNOUNCE] 3.14.3-rt5 Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2014-05-13 13:30 ` [ANNOUNCE] 3.14.3-rt5 Juri Lelli
@ 2014-05-17  3:36 ` Mike Galbraith
  2014-05-27 18:18   ` Steven Rostedt
  2014-05-27 18:55   ` [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat Steven Rostedt
  3 siblings, 2 replies; 18+ messages in thread
From: Mike Galbraith @ 2014-05-17  3:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, Thomas Gleixner, rostedt, John Kacur

3.14-rt being built with a non-rt config is unlikely, but..

>From 60e69eed85bb7b5198ef70643b5895c26ad76ef7 Mon Sep 17 00:00:00 2001
From: Mike Galbraith <bitbucket@online.de>
Date: Mon, 7 Apr 2014 10:55:15 +0200
Subject: [PATCH] sched/numa: Fix task_numa_free() lockdep splat

Sasha reported that lockdep claims that the following commit
made numa_group.lock interrupt unsafe:

  156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")

While I don't see how that could be, given the commit in question moved
task_numa_free() from one irq enabled region to another, the below does
make both gripes and lockups upon gripe with numa=fake=4 go away.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Fixes: 156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")
Signed-off-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org
Cc: mgorman@suse.com
Cc: akpm@linux-foundation.org
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/1396860915.5170.5.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

---
 kernel/sched/fair.c  |   13 +++++++------
 kernel/sched/sched.h |    9 +++++++++
 2 files changed, 16 insertions(+), 6 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1371,7 +1371,7 @@ static void task_numa_placement(struct t
 	/* If the task is part of a group prevent parallel updates to group stats */
 	if (p->numa_group) {
 		group_lock = &p->numa_group->lock;
-		spin_lock(group_lock);
+		spin_lock_irq(group_lock);
 	}
 
 	/* Find the node with the highest number of faults */
@@ -1432,7 +1432,7 @@ static void task_numa_placement(struct t
 			}
 		}
 
-		spin_unlock(group_lock);
+		spin_unlock_irq(group_lock);
 	}
 
 	/* Preferred node as the node with the most faults */
@@ -1532,7 +1532,8 @@ static void task_numa_group(struct task_
 	if (!join)
 		return;
 
-	double_lock(&my_grp->lock, &grp->lock);
+	BUG_ON(irqs_disabled());
+	double_lock_irq(&my_grp->lock, &grp->lock);
 
 	for (i = 0; i < 2*nr_node_ids; i++) {
 		my_grp->faults[i] -= p->numa_faults[i];
@@ -1546,7 +1547,7 @@ static void task_numa_group(struct task_
 	grp->nr_tasks++;
 
 	spin_unlock(&my_grp->lock);
-	spin_unlock(&grp->lock);
+	spin_unlock_irq(&grp->lock);
 
 	rcu_assign_pointer(p->numa_group, grp);
 
@@ -1565,14 +1566,14 @@ void task_numa_free(struct task_struct *
 	void *numa_faults = p->numa_faults;
 
 	if (grp) {
-		spin_lock(&grp->lock);
+		spin_lock_irq(&grp->lock);
 		for (i = 0; i < 2*nr_node_ids; i++)
 			grp->faults[i] -= p->numa_faults[i];
 		grp->total_faults -= p->total_numa_faults;
 
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
-		spin_unlock(&grp->lock);
+		spin_unlock_irq(&grp->lock);
 		rcu_assign_pointer(p->numa_group, NULL);
 		put_numa_group(grp);
 	}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1392,6 +1392,15 @@ static inline void double_lock(spinlock_
 	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
 }
 
+static inline void double_lock_irq(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock_irq(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
 {
 	if (l1 > l2)




* [patch v2] rt/sched: fix recursion when CONTEXT_TRACKING and PREEMPT_LAZY are enabled
  2014-05-16 13:53     ` [patch] rt/sched: fix recursion when CONTEXT_TRACKING and PREEMPT_LAZY are enabled Mike Galbraith
@ 2014-05-25  8:16       ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2014-05-25  8:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

On Fri, 2014-05-16 at 15:53 +0200, Mike Galbraith wrote: 
> On Tue, 2014-05-13 at 17:40 +0200, Sebastian Andrzej Siewior wrote: 
> > * Mike Galbraith | 2014-05-10 06:15:03 [+0200]:
> > 
> > >On Fri, 2014-05-09 at 20:12 +0200, Sebastian Andrzej Siewior wrote:
> > >
> > >> Known issues:
> > >> 
> > >>       - bcache is disabled.
> > >> 
> > >>       - lazy preempt on x86_64 leads to a crash with some load.
> > >
> > >That is only with NO_HZ_FULL enabled here.  Box blows the stack during
> > >task exit, eyeballing hasn't spotted the why.
> > 
> > Even if I disable NO_HZ_FULL it explodes as soon as hackbench starts.
> 
> Ah, you didn't turn CONTEXT_TRACKING off too.  The below made the dirty
> little SOB die here.

Something obviously went wrong with retest after deciding to do..

> --- a/include/linux/preempt_mask.h
> +++ b/include/linux/preempt_mask.h
> @@ -118,9 +118,15 @@ extern int in_serving_softirq(void);
>  		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
>  
>  #ifdef CONFIG_PREEMPT_COUNT
> -# define preemptible()	(preempt_count() == 0 && !irqs_disabled())
> +# define preemptible()		(preempt_count() == 0 && !irqs_disabled())
> +#ifdef CONFIG_PREEMPT_LAZY
> +# define preemptible_lazy()	(preempt_lazy_count() !=0 && !need_resched_now())
                             ahem ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

..that preemptible_lazy() bit.  Turn that back right side up.

(or you can do something completely different, like make it just not go
there ala 12-rt, or do the fold thing for rt as well, vs flag being set
meaning try to schedule and bail if not allowed [as usual])

If context tracking is enabled, we can recurse, and explode violently.
Add missing checks to preempt_schedule_context().

Fix other inconsistencies spotted while searching for the little SOB.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 arch/x86/include/asm/thread_info.h |    1 +
 include/linux/preempt.h            |    2 +-
 include/linux/preempt_mask.h       |   10 ++++++++--
 kernel/context_tracking.c          |    2 +-
 kernel/fork.c                      |    1 +
 kernel/sched/core.c                |   16 +++++-----------
 kernel/sched/fair.c                |    2 +-
 7 files changed, 18 insertions(+), 16 deletions(-)

--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -51,6 +51,7 @@ struct thread_info {
 	.flags		= 0,			\
 	.cpu		= 0,			\
 	.saved_preempt_count = INIT_PREEMPT_COUNT,	\
+	.preempt_lazy_count = 0,		\
 	.addr_limit	= KERNEL_DS,		\
 	.restart_block = {			\
 		.fn = do_no_restart_syscall,	\
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -91,8 +91,8 @@ do { \
 
 #define preempt_lazy_enable() \
 do { \
-	dec_preempt_lazy_count(); \
 	barrier(); \
+	dec_preempt_lazy_count(); \
 	preempt_check_resched(); \
 } while (0)
 
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -118,9 +118,15 @@ extern int in_serving_softirq(void);
 		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
 
 #ifdef CONFIG_PREEMPT_COUNT
-# define preemptible()	(preempt_count() == 0 && !irqs_disabled())
+# define preemptible()		(preempt_count() == 0 && !irqs_disabled())
+#ifdef CONFIG_PREEMPT_LAZY
+# define preemptible_lazy()	(preempt_lazy_count() == 0 || need_resched_now())
 #else
-# define preemptible()	0
+# define preemptible_lazy()	1
+#endif
+#else
+# define preemptible()		0
+# define preemptible_lazy()	0
 #endif
 
 #endif /* LINUX_PREEMPT_MASK_H */
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -124,7 +124,7 @@ asmlinkage void __sched notrace preempt_
 {
 	enum ctx_state prev_ctx;
 
-	if (likely(!preemptible()))
+	if (likely(!preemptible() || !preemptible_lazy()))
 		return;
 
 	/*
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -329,6 +329,7 @@ static struct task_struct *dup_task_stru
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
+	clear_tsk_need_resched_lazy(tsk);
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2866,8 +2866,8 @@ void migrate_enable(void)
 		p->migrate_disable = 0;
 
 	unpin_current_cpu();
-	preempt_enable();
 	preempt_lazy_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(migrate_enable);
 #else
@@ -3101,19 +3101,13 @@ asmlinkage void __sched notrace preempt_
 {
 	/*
 	 * If there is a non-zero preempt_count or interrupts are disabled,
-	 * we do not want to preempt the current task. Just return..
+	 * we do not want to preempt the current task. Just return.  For
+	 * lazy preemption we also check for non-zero preempt_count_lazy,
+	 * and bail if no immediate preemption is required.
 	 */
-	if (likely(!preemptible()))
+	if (likely(!preemptible() || !preemptible_lazy()))
 		return;
 
-#ifdef CONFIG_PREEMPT_LAZY
-	/*
-	 * Check for lazy preemption
-	 */
-	if (current_thread_info()->preempt_lazy_count &&
-			!test_thread_flag(TIF_NEED_RESCHED))
-		return;
-#endif
 	do {
 		__preempt_count_add(PREEMPT_ACTIVE);
 		/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4447,7 +4447,7 @@ static void check_preempt_wakeup(struct
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(curr) || test_tsk_need_resched_lazy(curr))
 		return;
 
 	/* Idle tasks are by definition preempted by non-idle tasks. */




* Re: [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat
  2014-05-17  3:36 ` [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat Mike Galbraith
@ 2014-05-27 18:18   ` Steven Rostedt
  2014-05-27 18:25     ` Peter Zijlstra
  2014-05-27 18:55   ` [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat Steven Rostedt
  1 sibling, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2014-05-27 18:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Sebastian Andrzej Siewior, linux-rt-users, Thomas Gleixner,
	John Kacur, LKML, Paul E. McKenney, Sasha Levin, Peter Zijlstra

[ moving this to LKML from linux-rt-users, as that's where it should be ]

On Sat, 17 May 2014 05:36:59 +0200
Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> 3.14-rt being built with a non-rt config is unlikely, but..
> 
> >From 60e69eed85bb7b5198ef70643b5895c26ad76ef7 Mon Sep 17 00:00:00 2001
> From: Mike Galbraith <bitbucket@online.de>
> Date: Mon, 7 Apr 2014 10:55:15 +0200
> Subject: [PATCH] sched/numa: Fix task_numa_free() lockdep splat
> 
> > Sasha reported that lockdep claims that the following commit
> made numa_group.lock interrupt unsafe:
> 
>   156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")
> 
> While I don't see how that could be, given the commit in question moved
> task_numa_free() from one irq enabled region to another, the below does
> make both gripes and lockups upon gripe with numa=fake=4 go away.

It wasn't the irqs that were causing the lockdep splat, but the
softirqs. You moved it into __put_task_struct(), which is called as an
rcu callback, and rcu callbacks run from softirqs. So yes, you need to
prevent softirqs from happening whenever you take the lock.
spin_lock_irq() is a bigger hammer than needed. The patch below should
be good enough.

I kept the double_lock_irq() as there is no double_lock_bh(). Should we
bother to make one?
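
(Purely illustrative, nothing in this thread adds it: a double_lock_bh()
modeled on the double_lock()/double_lock_irq() helpers in
kernel/sched/sched.h could look like the sketch below; whether it is
worth carrying is exactly the open question above.)

	/* Hypothetical helper, mirroring double_lock_irq() from the patch
	 * earlier in the thread: take both locks in address order, with
	 * softirqs disabled. */
	static inline void double_lock_bh(spinlock_t *l1, spinlock_t *l2)
	{
		if (l1 > l2)
			swap(l1, l2);

		spin_lock_bh(l1);
		spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
	}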

-- Steve


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..f072ea9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1497,7 +1497,7 @@ static void task_numa_placement(struct task_struct *p)
 	/* If the task is part of a group prevent parallel updates to group stats */
 	if (p->numa_group) {
 		group_lock = &p->numa_group->lock;
-		spin_lock_irq(group_lock);
+		spin_lock_bh(group_lock);
 	}
 
 	/* Find the node with the highest number of faults */
@@ -1572,7 +1572,7 @@ static void task_numa_placement(struct task_struct *p)
 			}
 		}
 
-		spin_unlock_irq(group_lock);
+		spin_unlock_bh(group_lock);
 	}
 
 	/* Preferred node as the node with the most faults */
@@ -1711,14 +1711,14 @@ void task_numa_free(struct task_struct *p)
 	void *numa_faults = p->numa_faults_memory;
 
 	if (grp) {
-		spin_lock_irq(&grp->lock);
+		spin_lock_bh(&grp->lock);
 		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
 			grp->faults[i] -= p->numa_faults_memory[i];
 		grp->total_faults -= p->total_numa_faults;
 
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
-		spin_unlock_irq(&grp->lock);
+		spin_unlock_bh(&grp->lock);
 		rcu_assign_pointer(p->numa_group, NULL);
 		put_numa_group(grp);
 	}


* Re: [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat
  2014-05-27 18:18   ` Steven Rostedt
@ 2014-05-27 18:25     ` Peter Zijlstra
  2014-05-27 18:52       ` Steven Rostedt
  2014-06-05 14:33       ` [tip:sched/urgent] sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled tip-bot for Steven Rostedt
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-05-27 18:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mike Galbraith, Sebastian Andrzej Siewior, linux-rt-users,
	Thomas Gleixner, John Kacur, LKML, Paul E. McKenney, Sasha Levin


On Tue, May 27, 2014 at 02:18:36PM -0400, Steven Rostedt wrote:
> [ moving this to LKML from linux-rt-users, as that's where it should be ]
> 
> On Sat, 17 May 2014 05:36:59 +0200
> Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> 
> > 3.14-rt being built with a non-rt config is unlikely, but..
> > 
> > >From 60e69eed85bb7b5198ef70643b5895c26ad76ef7 Mon Sep 17 00:00:00 2001
> > From: Mike Galbraith <bitbucket@online.de>
> > Date: Mon, 7 Apr 2014 10:55:15 +0200
> > Subject: [PATCH] sched/numa: Fix task_numa_free() lockdep splat
> > 
> > > Sasha reported that lockdep claims that the following commit
> > made numa_group.lock interrupt unsafe:
> > 
> >   156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")
> > 
> > While I don't see how that could be, given the commit in question moved
> > task_numa_free() from one irq enabled region to another, the below does
> > make both gripes and lockups upon gripe with numa=fake=4 go away.
> 
> It wasn't the irqs that was causing the lockdep splat, but the
> softirqs. You moved it into __put_task_struct() which is called as a
> rcu callback that gets called from soft irqs. So yes, you need to
> prevent softirqs from happening whenever you take the lock.
> spin_lock_irq() is a bigger hammer than needed. The patch below should
> be good enough.
> 
> I kept the double_lock_irq() as there is no double_lock_bh(). Should we
> bother to make one?

Nope, it's really IRQs.

do_exit()
  exit_itimers()
    itimer_delete()
      spin_lock_irqsave(&timer->it_lock, &flags);
      timer_delete_hook(timer);
        kc->timer_del(timer) := posix_cpu_timer_del()
          put_task_struct()
            __put_task_struct()
              task_numa_free()
                spin_lock(&grp->lock);

Which nests the grp->lock inside the timer->it_lock, and while the
timer->it_lock is IRQ-safe, the grp->lock is not.

This allows for IRQ deadlocks.
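
(An illustrative sketch, with hypothetical names rather than code from
this mail, of the locking pattern the eventual fix at the end of the
thread converges on: since task_numa_free() can be reached with
interrupts already disabled via the posix-timer path above, an
unconditional spin_lock_irq()/spin_unlock_irq() pair would re-enable
interrupts behind the caller's back, whereas irqsave/irqrestore
preserves the caller's interrupt state.)

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_grp_lock);	/* stand-in for grp->lock */

	static void example_numa_free(void)
	{
		unsigned long flags;

		/* Safe whether or not the caller already disabled interrupts. */
		spin_lock_irqsave(&example_grp_lock, flags);
		/* ... drop this task's faults from the group ... */
		spin_unlock_irqrestore(&example_grp_lock, flags);
	}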





* Re: [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat
  2014-05-27 18:25     ` Peter Zijlstra
@ 2014-05-27 18:52       ` Steven Rostedt
  2014-05-27 18:53         ` Steven Rostedt
  2014-06-05 14:33       ` [tip:sched/urgent] sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled tip-bot for Steven Rostedt
  1 sibling, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2014-05-27 18:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Sebastian Andrzej Siewior, linux-rt-users,
	Thomas Gleixner, John Kacur, LKML, Paul E. McKenney, Sasha Levin

On Tue, 27 May 2014 20:25:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:


> Nope, its really IRQs.
> 
> do_exit()
>   exit_itimers()
>     itimer_delete()
>       spin_lock_irqsave(&timer->it_lock, &flags);
>       timer_delete_hook(timer);
>         kc->timer_del(timer) := posix_cpu_timer_del()
>           put_task_struct()
>             __put_task_struct()
>               task_numa_free()
>                 spin_lock(&grp->lock);
> 
> Which nests the grp->lock inside the timer->it_lock, and where the
> timer->it_lock is IRQ-safe, the grp->lock is not.
> 
> This allows for IRQ deadlocks.

Ah crap. I did a search on all the callers of put_task_struct(), and
somehow missed this one.  Yep, I was looking for places that called
this while holding other irq safe locks.

-- Steve



* Re: [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat
  2014-05-27 18:52       ` Steven Rostedt
@ 2014-05-27 18:53         ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2014-05-27 18:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mike Galbraith, Sebastian Andrzej Siewior,
	linux-rt-users, Thomas Gleixner, John Kacur, LKML,
	Paul E. McKenney, Sasha Levin

On Tue, 27 May 2014 14:52:01 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

 
> Ah crap. I did a search on all the callers of put_task_struct(), and
> somehow missed this one.  Yep, I was looking for places that called
> this while holding other irq safe locks.
> 

Anyway, this nicely answers Mike's question as to why it causes the lockdep
splat. There exists a location that calls put_task_struct() with irqs
disabled (not to mention holding irq unsafe locks).

I was having problems just finding a place that called it with irqs off.

-- Steve


* Re: [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat
  2014-05-17  3:36 ` [PATCH 3.14-rt] sched/numa: Fix task_numa_free() lockdep splat Mike Galbraith
  2014-05-27 18:18   ` Steven Rostedt
@ 2014-05-27 18:55   ` Steven Rostedt
  1 sibling, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2014-05-27 18:55 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Sebastian Andrzej Siewior, linux-rt-users, Thomas Gleixner,
	John Kacur, Peter Zijlstra

On Sat, 17 May 2014 05:36:59 +0200
Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> 3.14-rt being built with a non-rt config is unlikely, but..
> 
> >From 60e69eed85bb7b5198ef70643b5895c26ad76ef7 Mon Sep 17 00:00:00 2001
> From: Mike Galbraith <bitbucket@online.de>
> Date: Mon, 7 Apr 2014 10:55:15 +0200
> Subject: [PATCH] sched/numa: Fix task_numa_free() lockdep splat
> 
> > Sasha reported that lockdep claims that the following commit
> made numa_group.lock interrupt unsafe:
> 
>   156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")
> 
> While I don't see how that could be, given the commit in question moved
> task_numa_free() from one irq enabled region to another, the below does
> make both gripes and lockups upon gripe with numa=fake=4 go away.

I couldn't find such a location, but Peter was able to point one out. You
actually did move it from an irq enabled region to an irq disabled region
(just not disabled most of the time).

Anyway, for inclusion into -rt...

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


> 
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Fixes: 156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")
> Signed-off-by: Mike Galbraith <bitbucket@online.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Cc: torvalds@linux-foundation.org
> Cc: mgorman@suse.com
> Cc: akpm@linux-foundation.org
> Cc: Dave Jones <davej@redhat.com>
> Link: http://lkml.kernel.org/r/1396860915.5170.5.camel@marge.simpson.net
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> ---
>  kernel/sched/fair.c  |   13 +++++++------
>  kernel/sched/sched.h |    9 +++++++++
>  2 files changed, 16 insertions(+), 6 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1371,7 +1371,7 @@ static void task_numa_placement(struct t
>  	/* If the task is part of a group prevent parallel updates to group stats */
>  	if (p->numa_group) {
>  		group_lock = &p->numa_group->lock;
> -		spin_lock(group_lock);
> +		spin_lock_irq(group_lock);
>  	}
>  
>  	/* Find the node with the highest number of faults */
> @@ -1432,7 +1432,7 @@ static void task_numa_placement(struct t
>  			}
>  		}
>  
> -		spin_unlock(group_lock);
> +		spin_unlock_irq(group_lock);
>  	}
>  
>  	/* Preferred node as the node with the most faults */
> @@ -1532,7 +1532,8 @@ static void task_numa_group(struct task_
>  	if (!join)
>  		return;
>  
> -	double_lock(&my_grp->lock, &grp->lock);
> +	BUG_ON(irqs_disabled());
> +	double_lock_irq(&my_grp->lock, &grp->lock);
>  
>  	for (i = 0; i < 2*nr_node_ids; i++) {
>  		my_grp->faults[i] -= p->numa_faults[i];
> @@ -1546,7 +1547,7 @@ static void task_numa_group(struct task_
>  	grp->nr_tasks++;
>  
>  	spin_unlock(&my_grp->lock);
> -	spin_unlock(&grp->lock);
> +	spin_unlock_irq(&grp->lock);
>  
>  	rcu_assign_pointer(p->numa_group, grp);
>  
> @@ -1565,14 +1566,14 @@ void task_numa_free(struct task_struct *
>  	void *numa_faults = p->numa_faults;
>  
>  	if (grp) {
> -		spin_lock(&grp->lock);
> +		spin_lock_irq(&grp->lock);
>  		for (i = 0; i < 2*nr_node_ids; i++)
>  			grp->faults[i] -= p->numa_faults[i];
>  		grp->total_faults -= p->total_numa_faults;
>  
>  		list_del(&p->numa_entry);
>  		grp->nr_tasks--;
> -		spin_unlock(&grp->lock);
> +		spin_unlock_irq(&grp->lock);
>  		rcu_assign_pointer(p->numa_group, NULL);
>  		put_numa_group(grp);
>  	}
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1392,6 +1392,15 @@ static inline void double_lock(spinlock_
>  	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
>  }
>  
> +static inline void double_lock_irq(spinlock_t *l1, spinlock_t *l2)
> +{
> +	if (l1 > l2)
> +		swap(l1, l2);
> +
> +	spin_lock_irq(l1);
> +	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> +}
> +
>  static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
>  {
>  	if (l1 > l2)
> 



* [tip:sched/urgent] sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled
  2014-05-27 18:25     ` Peter Zijlstra
  2014-05-27 18:52       ` Steven Rostedt
@ 2014-06-05 14:33       ` tip-bot for Steven Rostedt
  1 sibling, 0 replies; 18+ messages in thread
From: tip-bot for Steven Rostedt @ 2014-06-05 14:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, eric.dumazet, torvalds, peterz,
	umgwanakikbuti, rostedt, tglx

Commit-ID:  e9dd685ce81815811fb4da72e6ab10a694ac8468
Gitweb:     http://git.kernel.org/tip/e9dd685ce81815811fb4da72e6ab10a694ac8468
Author:     Steven Rostedt <rostedt@goodmis.org>
AuthorDate: Tue, 27 May 2014 17:02:04 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jun 2014 11:07:41 +0200

sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled

As Peter Zijlstra told me, we have the following path:

do_exit()
  exit_itimers()
    itimer_delete()
      spin_lock_irqsave(&timer->it_lock, &flags);
      timer_delete_hook(timer);
        kc->timer_del(timer) := posix_cpu_timer_del()
          put_task_struct()
            __put_task_struct()
              task_numa_free()
                spin_lock(&grp->lock);

Which means that task_numa_free() can be called with interrupts
disabled, which means that we should not be using spin_lock_irq() but
spin_lock_irqsave() instead. Otherwise we are enabling interrupts while
holding an interrupt unsafe lock!

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140527182541.GH11096@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fdb96d..b4768c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1707,18 +1707,19 @@ no_join:
 void task_numa_free(struct task_struct *p)
 {
 	struct numa_group *grp = p->numa_group;
-	int i;
 	void *numa_faults = p->numa_faults_memory;
+	unsigned long flags;
+	int i;
 
 	if (grp) {
-		spin_lock_irq(&grp->lock);
+		spin_lock_irqsave(&grp->lock, flags);
 		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
 			grp->faults[i] -= p->numa_faults_memory[i];
 		grp->total_faults -= p->total_numa_faults;
 
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
-		spin_unlock_irq(&grp->lock);
+		spin_unlock_irqrestore(&grp->lock, flags);
 		rcu_assign_pointer(p->numa_group, NULL);
 		put_numa_group(grp);
 	}


* Re: [ANNOUNCE] 3.14.3-rt5
  2014-05-13 13:30 ` [ANNOUNCE] 3.14.3-rt5 Juri Lelli
@ 2015-02-16 11:29   ` Sebastian Andrzej Siewior
  2015-02-16 12:34     ` Juri Lelli
  0 siblings, 1 reply; 18+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-02-16 11:29 UTC (permalink / raw)
  To: Juri Lelli; +Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

* Juri Lelli | 2014-05-13 15:30:20 [+0200]:

>Hi,
Hi Juri,

>Also SCHED_DEADLINE dies without the following.
>
>Thanks,
>
>- Juri
>
>---From 3ca5943538c728399037823e5632431bc2da707c Mon Sep 17 00:00:00 2001
>From: Juri Lelli <juri.lelli@gmail.com>
>Date: Tue, 13 May 2014 15:21:16 +0200
>Subject: [PATCH] sched/deadline: dl_task_timer has to be irqsafe

applied.

Sebastian


* Re: [ANNOUNCE] 3.14.3-rt5
  2015-02-16 11:29   ` Sebastian Andrzej Siewior
@ 2015-02-16 12:34     ` Juri Lelli
  0 siblings, 0 replies; 18+ messages in thread
From: Juri Lelli @ 2015-02-16 12:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Juri Lelli
  Cc: linux-rt-users, LKML, Thomas Gleixner, rostedt, John Kacur

Hi Sebastian,

On 16/02/15 11:29, Sebastian Andrzej Siewior wrote:
> * Juri Lelli | 2014-05-13 15:30:20 [+0200]:
> 
>> Hi,
> Hi Juri,
> 
>> Also SCHED_DEADLINE dies without the following.
>>
>> Thanks,
>>
>> - Juri
>>
>> ---From 3ca5943538c728399037823e5632431bc2da707c Mon Sep 17 00:00:00 2001
>> From: Juri Lelli <juri.lelli@gmail.com>
>> Date: Tue, 13 May 2014 15:21:16 +0200
>> Subject: [PATCH] sched/deadline: dl_task_timer has to be irqsafe
> 
> applied.
> 

Thanks!

- Juri

> Sebastian
> 


