linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
@ 2015-05-26 22:50 Thomas Gleixner
  2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

I still have a couple of patches against the timer code in my review
list, but the more I look at them the more horrible I find them.

All these patches are related to the NOHZ stuff and try to mitigate
shortcomings with even more bandaids. And of course these bandaids
cost cycles and are a burden for timer heavy use cases like
networking.

Sadly enough the NOHZ crowd is happy to throw more and more crap at
the kernel and none of these people is even thinking about whether
this stuff could be solved in a different way.

After Eric mentioned in one of the recent discussions that the
timer_migration sysctl is not a lot of gain, I tried to mitigate at
least that issue. That caused me to look deeper and I came up with the
following patch series which:

  - Clean up the sloppy catchup timer jiffies stuff

  - Replace the hash bucket lists by hlists, which shrinks the wheel
    footprint by 50%

  - Replaces the timer base pointer in struct timer_list by a cpu
    index, which shrinks struct timer_list by 4 - 8 bytes depending on
    alignement and architecture.

  - Caches nohz_active and timer_migration in the timer per cpu bases,
    so we can avoid going through loops and hoops to figure that out.

This series applies against

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core

With a pretty naive timer test module and a sched other cpu hog on an
isolated core I verified that I did not wreckage anything. The
following table shows the resulting CPU time of the hog for the
various scenarios.

	     nohz=on	      	nohz=on			nohz=off
	     timer_migration=on	timer_migration=off	    

Unpatched:   46.57%		46.87%			46.64%

Patched:     47.76%		48.20%			48.73%

Though, I do not trust my numbers, so I really hope that Eric or any
other power timer wheel user can throw a bunch of tests at it.


Now some more rant about the whole NOHZ issues. The timer wheel is
fundamentally unfriendly for NOHZ because:

  - it's impossible to keep reliably track of the next expiring timer

  - finding the next expiring timer is in the worst case O(N) and
    involves touching a gazillion of cache lines

The other disaster area is the nohz timer migration mechanism. I think
it's fundamentally broken. 

 - We decide at timer enqueue time, on which CPU we queue the timer,
   based on cpu idle heuristics which even fails to recognize that a
   cpu is really busy with interrupt and softirq processing (reported
   by Eric).

 - When moving a timer to some "non-idle" CPU we speculate about the
   system situation in the future (at timer expiry time). This kinda
   works for limited scenarios (NOHZ_FULL and constraint power saving
   on mobile devices). But it is pretty much nonsensical for
   everything else. For network heavy workloads it can be even
   outright wrong and dangerous as Eric explained.

So we really need to put a full stop at all this NOHZ tinkering and
think about proper solutions. I'm not going to merge any NOHZ related
features^Whacks before the existing problems are not solved. In
hindsight I should have done that earlier, but it's never too late.

So we need to address two issues:

1) Keeping track of the first expiring timer in the wheel.

   Don't even think about rbtrees or such, it just wont work, but I'm
   willing to accept prove of the contrary.

   One of the ideas I had a long time ago is to have a wheel
   implementation, which does never cascade and therefor provides
   different levels of granularity per wheel level.

   LVL0	   1  jiffy granularity
   LVL1	   8  jiffies granularity
   LVL1	   64 jiffies granularity
   ....

   This requires more levels than the classic timer wheel, so its not
   feasible from a memory consumption POV.

   But we can have a compromise and put all 'soon' expiring timers
   into these fancy new wheels and stick all 'far out' timers into the
   last level of the wheel and cascade them occasionally.

   That should work because:

     - Most timers are short term expiry (< 1h)
     - Most timers are canceled before expiry

   So we need a sensible upper bound of levels and get the following
   properties:

   	- Natural batching (no more slack magic). This might also help
          networking to avoid rearming of timers.

	- Long out timers are queued in the slowest wheel and
          ocassionally with the granularity of the slowest wheel
          cascaded

	- Tracking the next expiring timer can be done with a bitmap,
          so the 'get next expiring timer' becomes O(1) without
          touching anything else than the bitmap, if we accept that
          the upper limit of what we can retrieve O(1) is the
          granularity of the last level, i.e. we treat timers which
          need recascading simple as normal inhabitants of the last
          wheel level.
	  
        - The expiry code (run_timers) gets more complicated as it
          needs to walk through the different levels more often, but
          with the bitmap that shouldnt be a too big issue.

        - Seperate storage for non-deferrable and deferrable timers.

    I spent some time in coding that up. It barely boots and has
    certainly a few bugs left and right, but I will send it out
    nevertheless as a reply to this mail to get the discussion going.

2) The timer migration problem

   I think we need to address this from a different angle.

   Joonwoo tried to solve the deferrable timer issue for non pinned
   timers by introducing a global timer wheel for those. That sucks
   and adds obviously more overhead into the fast pathes. But the idea
   itself that the non idle cpus consume events from the idle ones is
   not the worst. A global wheel might work for the few deferrable
   timers, but it wont work for anything else.

   So instead of deciding at enqueue time, we should come up with
   different wheels for the different types of timers. That way we
   could keep the locality for the networking scenario, and at the
   same time have a way to delegate all non bound timers of a cpu to
   some other cpu at idle time or pull them from the other cpu when
   requested.

   I haven't thought that through fully yet, but it's something we
   should at least investigate thoroughly. I won't do that for the next
   10 days as I'm about to leave for vacation and conferencing.

Thanks,

	tglx
---
 include/linux/hrtimer.h      |    4 
 include/linux/sched.h        |    6 
 include/linux/sched/sysctl.h |   12 -
 include/linux/timer.h        |   56 ++++----
 include/trace/events/timer.h |   13 --
 kernel/rcu/tree_plugin.h     |    2 
 kernel/sched/core.c          |    9 -
 kernel/sysctl.c              |   18 +-
 kernel/time/hrtimer.c        |   38 ++++--
 kernel/time/tick-internal.h  |   14 ++
 kernel/time/tick-sched.c     |   25 ++-
 kernel/time/timer.c          |  271 ++++++++++++++++++++-----------------------
 kernel/time/timer_list.c     |    2 
 kernel/time/timer_stats.c    |   10 -
 14 files changed, 244 insertions(+), 236 deletions(-)

   
   

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-05-27  5:39   ` Viresh Kumar
  2015-06-19 13:22   ` [tip:timers/core] " tip-bot for Thomas Gleixner
  2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

[-- Attachment #1: timers-sanitize-catchup-jiffies.patch --]
[-- Type: text/plain, Size: 3247 bytes --]

catchup_timer_jiffies() has been applied blindly to several functions
without looking for possible better ways to do it.

1) internal_add_timer()

   Move the update to base->all_timers before we actually insert the
   timer into the wheel.

2) detach_if_pending()

   Again the update to base->all_timers allows us to explicitely do
   the timer_jiffies update in place, if this was the last timer which
   got removed.

3) __run_timers()

   We only check on entry, which is silly, because base->timer_jiffies
   can be behind - especially on NOHZ kernels - and if there is a
   single deferrable timer somewhere between base->timer_jiffies and
   jiffies we expire it and then loop until base->timer_jiffies ==
   jiffies.

   Move it into the loop.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timer.c |   26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -356,7 +356,7 @@ EXPORT_SYMBOL_GPL(set_timer_slack);
  * The caller must hold the tvec_base lock.  Returns true if the list
  * was empty and therefore ->timer_jiffies was updated.
  */
-static bool catchup_timer_jiffies(struct tvec_base *base)
+static inline bool catchup_timer_jiffies(struct tvec_base *base)
 {
 	if (!base->all_timers) {
 		base->timer_jiffies = jiffies;
@@ -411,7 +411,10 @@ __internal_add_timer(struct tvec_base *b
 
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 {
-	(void)catchup_timer_jiffies(base);
+	/* Advance base->jiffies, if the base is empty */
+	if (!base->all_timers++)
+		base->timer_jiffies = jiffies;
+
 	__internal_add_timer(base, timer);
 	/*
 	 * Update base->active_timers and base->next_timer
@@ -421,7 +424,6 @@ static void internal_add_timer(struct tv
 		    time_before(timer->expires, base->next_timer))
 			base->next_timer = timer->expires;
 	}
-	base->all_timers++;
 
 	/*
 	 * Check whether the other CPU is in dynticks mode and needs
@@ -719,7 +721,6 @@ detach_expired_timer(struct timer_list *
 	if (!tbase_get_deferrable(timer->base))
 		base->active_timers--;
 	base->all_timers--;
-	(void)catchup_timer_jiffies(base);
 }
 
 static int detach_if_pending(struct timer_list *timer, struct tvec_base *base,
@@ -734,8 +735,9 @@ static int detach_if_pending(struct time
 		if (timer->expires == base->next_timer)
 			base->next_timer = base->timer_jiffies;
 	}
-	base->all_timers--;
-	(void)catchup_timer_jiffies(base);
+	/* If this was the last timer, advance base->jiffies */
+	if (!--base->all_timers)
+		base->timer_jiffies = jiffies;
 	return 1;
 }
 
@@ -1185,14 +1187,16 @@ static inline void __run_timers(struct t
 	struct timer_list *timer;
 
 	spin_lock_irq(&base->lock);
-	if (catchup_timer_jiffies(base)) {
-		spin_unlock_irq(&base->lock);
-		return;
-	}
+
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
 		struct list_head work_list;
 		struct list_head *head = &work_list;
-		int index = base->timer_jiffies & TVR_MASK;
+		int index;
+
+		if (catchup_timer_jiffies(base))
+			break;
+
+		index = base->timer_jiffies & TVR_MASK;
 
 		/*
 		 * Cascade timers:



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 2/7] timer: Remove FIFO guarantee
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
  2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-05-27  9:11   ` Viresh Kumar
  2015-06-19 13:22   ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
  2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

[-- Attachment #1: timer-remove-fifo-guarantee.patch --]
[-- Type: text/plain, Size: 968 bytes --]

The FIFO guarantee has been violated by the introduction of timer
slack already. Remove it.

This is a preparatory patch for converting the timer wheel to hlist
which reduces the memory foot print of the wheel by 50%. It's a
seperate patch so any (unlikely to happen) regression caused by this
can be identified clearly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timer.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -403,10 +403,8 @@ __internal_add_timer(struct tvec_base *b
 		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
 		vec = base->tv5.vec + i;
 	}
-	/*
-	 * Timers are FIFO:
-	 */
-	list_add_tail(&timer->entry, vec);
+
+	list_add(&timer->entry, vec);
 }
 
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 3/7] timer: Use hlist for the timer wheel hash buckets
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
  2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
  2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-05-27  9:13   ` Viresh Kumar
  2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
  2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

[-- Attachment #1: timer-use-hlist.patch --]
[-- Type: text/plain, Size: 7861 bytes --]

This reduces the size of struct tvec_base by 50% and results in
slightly smaller code as well.

Before:
   struct tvec_base: size: 8256, cachelines: 129

   text	   data	    bss	    dec	    hex	filename
  17698	  13297	   8256	  39251	   9953	../build/kernel/time/timer.o

After:
  struct tvec_base: 4160, cachelines: 65, members: 12 */

   text	   data	    bss	    dec	    hex	filename
  17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/timer.h |    6 ++--
 kernel/time/timer.c   |   64 +++++++++++++++++++++-----------------------------
 2 files changed, 30 insertions(+), 40 deletions(-)

Index: tip/include/linux/timer.h
===================================================================
--- tip.orig/include/linux/timer.h
+++ tip/include/linux/timer.h
@@ -14,7 +14,7 @@ struct timer_list {
 	 * All fields that change during normal runtime grouped to the
 	 * same cacheline
 	 */
-	struct list_head entry;
+	struct hlist_node entry;
 	unsigned long expires;
 	struct tvec_base *base;
 
@@ -71,7 +71,7 @@ extern struct tvec_base boot_tvec_bases;
 #define TIMER_FLAG_MASK			0x3LU
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
-		.entry = { .prev = TIMER_ENTRY_STATIC },	\
+		.entry = { .next = TIMER_ENTRY_STATIC },	\
 		.function = (_function),			\
 		.expires = (_expires),				\
 		.data = (_data),				\
@@ -168,7 +168,7 @@ static inline void init_timer_on_stack_k
  */
 static inline int timer_pending(const struct timer_list * timer)
 {
-	return timer->entry.next != NULL;
+	return timer->entry.pprev != NULL;
 }
 
 extern void add_timer_on(struct timer_list *timer, int cpu);
Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -70,11 +70,11 @@ EXPORT_SYMBOL(jiffies_64);
 #define MAX_TVAL ((unsigned long)((1ULL << (TVR_BITS + 4*TVN_BITS)) - 1))
 
 struct tvec {
-	struct list_head vec[TVN_SIZE];
+	struct hlist_head vec[TVN_SIZE];
 };
 
 struct tvec_root {
-	struct list_head vec[TVR_SIZE];
+	struct hlist_head vec[TVR_SIZE];
 };
 
 struct tvec_base {
@@ -370,7 +370,7 @@ __internal_add_timer(struct tvec_base *b
 {
 	unsigned long expires = timer->expires;
 	unsigned long idx = expires - base->timer_jiffies;
-	struct list_head *vec;
+	struct hlist_head *vec;
 
 	if (idx < TVR_SIZE) {
 		int i = expires & TVR_MASK;
@@ -404,7 +404,7 @@ __internal_add_timer(struct tvec_base *b
 		vec = base->tv5.vec + i;
 	}
 
-	list_add(&timer->entry, vec);
+	hlist_add_head(&timer->entry, vec);
 }
 
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
@@ -518,8 +518,8 @@ static int timer_fixup_activate(void *ad
 		 * statically initialized. We just make sure that it
 		 * is tracked in the object tracker.
 		 */
-		if (timer->entry.next == NULL &&
-		    timer->entry.prev == TIMER_ENTRY_STATIC) {
+		if (timer->entry.pprev == NULL &&
+		    timer->entry.next == TIMER_ENTRY_STATIC) {
 			debug_object_init(timer, &timer_debug_descr);
 			debug_object_activate(timer, &timer_debug_descr);
 			return 0;
@@ -565,7 +565,7 @@ static int timer_fixup_assert_init(void
 
 	switch (state) {
 	case ODEBUG_STATE_NOTAVAILABLE:
-		if (timer->entry.prev == TIMER_ENTRY_STATIC) {
+		if (timer->entry.next == TIMER_ENTRY_STATIC) {
 			/*
 			 * This is not really a fixup. The timer was
 			 * statically initialized. We just make sure that it
@@ -669,7 +669,7 @@ static void do_init_timer(struct timer_l
 {
 	struct tvec_base *base = raw_cpu_read(tvec_bases);
 
-	timer->entry.next = NULL;
+	timer->entry.pprev = NULL;
 	timer->base = (void *)((unsigned long)base | flags);
 	timer->slack = -1;
 #ifdef CONFIG_TIMER_STATS
@@ -701,14 +701,14 @@ EXPORT_SYMBOL(init_timer_key);
 
 static inline void detach_timer(struct timer_list *timer, bool clear_pending)
 {
-	struct list_head *entry = &timer->entry;
+	struct hlist_node *entry = &timer->entry;
 
 	debug_deactivate(timer);
 
-	__list_del(entry->prev, entry->next);
+	__hlist_del(entry);
 	if (clear_pending)
-		entry->next = NULL;
-	entry->prev = LIST_POISON2;
+		entry->pprev = NULL;
+	entry->next = LIST_POISON2;
 }
 
 static inline void
@@ -1109,16 +1109,17 @@ EXPORT_SYMBOL(del_timer_sync);
 static int cascade(struct tvec_base *base, struct tvec *tv, int index)
 {
 	/* cascade all the timers from tv up one level */
-	struct timer_list *timer, *tmp;
-	struct list_head tv_list;
+	struct timer_list *timer;
+	struct hlist_node *tmp;
+	struct hlist_head tv_list;
 
-	list_replace_init(tv->vec + index, &tv_list);
+	hlist_move_list(tv->vec + index, &tv_list);
 
 	/*
 	 * We are removing _all_ timers from the list, so we
 	 * don't have to detach them individually.
 	 */
-	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
+	hlist_for_each_entry_safe(timer, tmp, &tv_list, entry) {
 		BUG_ON(tbase_get_base(timer->base) != base);
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
@@ -1186,8 +1187,8 @@ static inline void __run_timers(struct t
 	spin_lock_irq(&base->lock);
 
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
-		struct list_head work_list;
-		struct list_head *head = &work_list;
+		struct hlist_head work_list;
+		struct hlist_head *head = &work_list;
 		int index;
 
 		if (catchup_timer_jiffies(base))
@@ -1204,13 +1205,13 @@ static inline void __run_timers(struct t
 					!cascade(base, &base->tv4, INDEX(2)))
 			cascade(base, &base->tv5, INDEX(3));
 		++base->timer_jiffies;
-		list_replace_init(base->tv1.vec + index, head);
-		while (!list_empty(head)) {
+		hlist_move_list(base->tv1.vec + index, head);
+		while (!hlist_empty(head)) {
 			void (*fn)(unsigned long);
 			unsigned long data;
 			bool irqsafe;
 
-			timer = list_first_entry(head, struct timer_list,entry);
+			timer = hlist_entry(head->first, struct timer_list, entry);
 			fn = timer->function;
 			data = timer->data;
 			irqsafe = tbase_get_irqsafe(timer->base);
@@ -1252,7 +1253,7 @@ static unsigned long __next_timer_interr
 	/* Look for timer events in tv1. */
 	index = slot = timer_jiffies & TVR_MASK;
 	do {
-		list_for_each_entry(nte, base->tv1.vec + slot, entry) {
+		hlist_for_each_entry(nte, base->tv1.vec + slot, entry) {
 			if (tbase_get_deferrable(nte->base))
 				continue;
 
@@ -1283,7 +1284,7 @@ cascade:
 
 		index = slot = timer_jiffies & TVN_MASK;
 		do {
-			list_for_each_entry(nte, varp->vec + slot, entry) {
+			hlist_for_each_entry(nte, varp->vec + slot, entry) {
 				if (tbase_get_deferrable(nte->base))
 					continue;
 
@@ -1542,12 +1543,12 @@ signed long __sched schedule_timeout_uni
 EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
 #ifdef CONFIG_HOTPLUG_CPU
-static void migrate_timer_list(struct tvec_base *new_base, struct list_head *head)
+static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
 {
 	struct timer_list *timer;
 
-	while (!list_empty(head)) {
-		timer = list_first_entry(head, struct timer_list, entry);
+	while (!hlist_empty(head)) {
+		timer = hlist_entry(head->first, struct timer_list, entry);
 		/* We ignore the accounting on the dying cpu */
 		detach_timer(timer, false);
 		timer_set_base(timer, new_base);
@@ -1615,23 +1616,12 @@ static inline void timer_register_cpu_no
 
 static void __init init_timer_cpu(struct tvec_base *base, int cpu)
 {
-	int j;
-
 	BUG_ON(base != tbase_get_base(base));
 
 	base->cpu = cpu;
 	per_cpu(tvec_bases, cpu) = base;
 	spin_lock_init(&base->lock);
 
-	for (j = 0; j < TVN_SIZE; j++) {
-		INIT_LIST_HEAD(base->tv5.vec + j);
-		INIT_LIST_HEAD(base->tv4.vec + j);
-		INIT_LIST_HEAD(base->tv3.vec + j);
-		INIT_LIST_HEAD(base->tv2.vec + j);
-	}
-	for (j = 0; j < TVR_SIZE; j++)
-		INIT_LIST_HEAD(base->tv1.vec + j);
-
 	base->timer_jiffies = jiffies;
 	base->next_timer = base->timer_jiffies;
 }



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 4/7] timer: Replace timer base by a cpu index
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
                   ` (2 preceding siblings ...)
  2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-05-27  9:22   ` Viresh Kumar
                     ` (2 more replies)
  2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang, Steven Rostedt, Badhri Jagan Sridharan

[-- Attachment #1: timer-replace-timer-base.patch --]
[-- Type: text/plain, Size: 16765 bytes --]

Instead of storing a pointer to the per cpu tvec_base we can simply
cache a CPU index in the timer_list and use that to get hold of the
correct per cpu tvec_base. This is only used in lock_timer_base() and
the slightly larger code is peanuts versus the spinlock operation and
the d-cache foot print of the timer wheel.

Aside of that this allows to get rid of following nuisances:

 - boot_tvec_base

   That statically allocated 4k bss data is just kept around so the
   timer has a home when it gets statically initialized. It serves no
   other purpose.

   With the CPU index we assign the timer to CPU0 at static
   initialization time and therefor can avoid the whole boot_tvec_base
   dance.  That also simplifies the init code, which just can use the
   per cpu base.

   Before:
     text	   data	    bss	    dec	    hex	filename
    17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
   After:
     text	   data	    bss	    dec	    hex	filename
    17440	   9193	      0	  26633	   6809	../build/kernel/time/timer.o

 - Overloading the base pointer with various flags

   The CPU index has enough space to hold the flags (deferrable,
   irqsafe) so we can get rid of the extra masking and bit fiddling
   with the base pointer.

As a benefit we reduce the size of struct timer_list on 64 bit
machines. 4 - 8 bytes, a size reduction up to 15% per struct timer_list,
which is a real win as we have tons of them embedded in other structs.

This changes also the newly added deferrable printout of the timer
start trace point to capture and print all timer->flags, which allows
us to decode the target cpu of the timer as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Badhri Jagan Sridharan <Badhri@google.com>

---
 include/linux/timer.h        |   38 +++++-------
 include/trace/events/timer.h |   13 ++--
 kernel/time/timer.c          |  127 ++++++++++++-------------------------------
 3 files changed, 58 insertions(+), 120 deletions(-)

Index: tip/include/linux/timer.h
===================================================================
--- tip.orig/include/linux/timer.h
+++ tip/include/linux/timer.h
@@ -14,27 +14,23 @@ struct timer_list {
 	 * All fields that change during normal runtime grouped to the
 	 * same cacheline
 	 */
-	struct hlist_node entry;
-	unsigned long expires;
-	struct tvec_base *base;
-
-	void (*function)(unsigned long);
-	unsigned long data;
-
-	int slack;
+	struct hlist_node	entry;
+	unsigned long		expires;
+	void			(*function)(unsigned long);
+	unsigned long		data;
+	u32			flags;
+	int			slack;
 
 #ifdef CONFIG_TIMER_STATS
-	int start_pid;
-	void *start_site;
-	char start_comm[16];
+	int			start_pid;
+	void			*start_site;
+	char			start_comm[16];
 #endif
 #ifdef CONFIG_LOCKDEP
-	struct lockdep_map lockdep_map;
+	struct lockdep_map	lockdep_map;
 #endif
 };
 
-extern struct tvec_base boot_tvec_bases;
-
 #ifdef CONFIG_LOCKDEP
 /*
  * NB: because we have to copy the lockdep_map, setting the lockdep_map key
@@ -49,9 +45,6 @@ extern struct tvec_base boot_tvec_bases;
 #endif
 
 /*
- * Note that all tvec_bases are at least 4 byte aligned and lower two bits
- * of base in timer_list is guaranteed to be zero. Use them for flags.
- *
  * A deferrable timer will work normally when the system is busy, but
  * will not cause a CPU to come out of idle just to service it; instead,
  * the timer will be serviced when the CPU eventually wakes up with a
@@ -65,17 +58,18 @@ extern struct tvec_base boot_tvec_bases;
  * workqueue locking issues. It's not meant for executing random crap
  * with interrupts disabled. Abuse is monitored!
  */
-#define TIMER_DEFERRABLE		0x1LU
-#define TIMER_IRQSAFE			0x2LU
-
-#define TIMER_FLAG_MASK			0x3LU
+#define TIMER_CPUMASK		0x0007FFFF
+#define TIMER_MIGRATING		0x00080000
+#define TIMER_BASEMASK		(TIMER_CPUMASK | TIMER_MIGRATING)
+#define TIMER_DEFERRABLE	0x00100000
+#define TIMER_IRQSAFE		0x00200000
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
 		.entry = { .next = TIMER_ENTRY_STATIC },	\
 		.function = (_function),			\
 		.expires = (_expires),				\
 		.data = (_data),				\
-		.base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
+		.flags = (_flags),				\
 		.slack = -1,					\
 		__TIMER_LOCKDEP_MAP_INITIALIZER(		\
 			__FILE__ ":" __stringify(__LINE__))	\
Index: tip/include/trace/events/timer.h
===================================================================
--- tip.orig/include/trace/events/timer.h
+++ tip/include/trace/events/timer.h
@@ -45,16 +45,16 @@ TRACE_EVENT(timer_start,
 
 	TP_PROTO(struct timer_list *timer,
 		unsigned long expires,
-		unsigned int deferrable),
+		unsigned int flags),
 
-	TP_ARGS(timer, expires, deferrable),
+	TP_ARGS(timer, expires, flags),
 
 	TP_STRUCT__entry(
 		__field( void *,	timer		)
 		__field( void *,	function	)
 		__field( unsigned long,	expires		)
 		__field( unsigned long,	now		)
-		__field( unsigned int,	deferrable	)
+		__field( unsigned int,	flags		)
 	),
 
 	TP_fast_assign(
@@ -62,13 +62,12 @@ TRACE_EVENT(timer_start,
 		__entry->function	= timer->function;
 		__entry->expires	= expires;
 		__entry->now		= jiffies;
-		__entry->deferrable     = deferrable;
+		__entry->flags		= flags;
 	),
 
-	TP_printk("timer=%p function=%pf expires=%lu [timeout=%ld] defer=%c",
+	TP_printk("timer=%p function=%pf expires=%lu [timeout=%ld] flags=0x%08x",
 		  __entry->timer, __entry->function, __entry->expires,
-		  (long)__entry->expires - __entry->now,
-		  __entry->deferrable > 0 ? 'y':'n')
+		  (long)__entry->expires - __entry->now, __entry->flags)
 );
 
 /**
Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -92,43 +92,8 @@ struct tvec_base {
 	struct tvec tv5;
 } ____cacheline_aligned;
 
-/*
- * __TIMER_INITIALIZER() needs to set ->base to a valid pointer (because we've
- * made NULL special, hint: lock_timer_base()) and we cannot get a compile time
- * pointer to per-cpu entries because we don't know where we'll map the section,
- * even for the boot cpu.
- *
- * And so we use boot_tvec_bases for boot CPU and per-cpu __tvec_bases for the
- * rest of them.
- */
-struct tvec_base boot_tvec_bases;
-EXPORT_SYMBOL(boot_tvec_bases);
-
-static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
-
-/* Functions below help us manage 'deferrable' flag */
-static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
-{
-	return ((unsigned int)(unsigned long)base & TIMER_DEFERRABLE);
-}
-
-static inline unsigned int tbase_get_irqsafe(struct tvec_base *base)
-{
-	return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE);
-}
-
-static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
-{
-	return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
-}
-
-static inline void
-timer_set_base(struct timer_list *timer, struct tvec_base *new_base)
-{
-	unsigned long flags = (unsigned long)timer->base & TIMER_FLAG_MASK;
 
-	timer->base = (struct tvec_base *)((unsigned long)(new_base) | flags);
-}
+static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
 
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
 		bool force_up)
@@ -417,7 +382,7 @@ static void internal_add_timer(struct tv
 	/*
 	 * Update base->active_timers and base->next_timer
 	 */
-	if (!tbase_get_deferrable(timer->base)) {
+	if (!(timer->flags & TIMER_DEFERRABLE)) {
 		if (!base->active_timers++ ||
 		    time_before(timer->expires, base->next_timer))
 			base->next_timer = timer->expires;
@@ -436,7 +401,7 @@ static void internal_add_timer(struct tv
 	 * require special care against races with idle_cpu(), lets deal
 	 * with that later.
 	 */
-	if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(base->cpu))
+	if (!(timer->flags & TIMER_DEFERRABLE) || tick_nohz_full_cpu(base->cpu))
 		wake_up_nohz_cpu(base->cpu);
 }
 
@@ -457,7 +422,7 @@ static void timer_stats_account_timer(st
 
 	if (likely(!timer->start_site))
 		return;
-	if (unlikely(tbase_get_deferrable(timer->base)))
+	if (unlikely(timer->flags & TIMER_DEFERRABLE))
 		flag |= TIMER_STATS_FLAG_DEFERRABLE;
 
 	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
@@ -650,7 +615,7 @@ static inline void
 debug_activate(struct timer_list *timer, unsigned long expires)
 {
 	debug_timer_activate(timer);
-	trace_timer_start(timer, expires, tbase_get_deferrable(timer->base));
+	trace_timer_start(timer, expires, timer->flags);
 }
 
 static inline void debug_deactivate(struct timer_list *timer)
@@ -667,10 +632,8 @@ static inline void debug_assert_init(str
 static void do_init_timer(struct timer_list *timer, unsigned int flags,
 			  const char *name, struct lock_class_key *key)
 {
-	struct tvec_base *base = raw_cpu_read(tvec_bases);
-
 	timer->entry.pprev = NULL;
-	timer->base = (void *)((unsigned long)base | flags);
+	timer->flags = flags | raw_smp_processor_id();
 	timer->slack = -1;
 #ifdef CONFIG_TIMER_STATS
 	timer->start_site = NULL;
@@ -715,7 +678,7 @@ static inline void
 detach_expired_timer(struct timer_list *timer, struct tvec_base *base)
 {
 	detach_timer(timer, true);
-	if (!tbase_get_deferrable(timer->base))
+	if (!(timer->flags & TIMER_DEFERRABLE))
 		base->active_timers--;
 	base->all_timers--;
 }
@@ -727,7 +690,7 @@ static int detach_if_pending(struct time
 		return 0;
 
 	detach_timer(timer, clear_pending);
-	if (!tbase_get_deferrable(timer->base)) {
+	if (!(timer->flags & TIMER_DEFERRABLE)) {
 		base->active_timers--;
 		if (timer->expires == base->next_timer)
 			base->next_timer = base->timer_jiffies;
@@ -746,24 +709,22 @@ static int detach_if_pending(struct time
  * So __run_timers/migrate_timers can safely modify all timers which could
  * be found on ->tvX lists.
  *
- * When the timer's base is locked, and the timer removed from list, it is
- * possible to set timer->base = NULL and drop the lock: the timer remains
- * locked.
+ * When the timer's base is locked and removed from the list, the
+ * TIMER_MIGRATING flag is set, FIXME
  */
 static struct tvec_base *lock_timer_base(struct timer_list *timer,
 					unsigned long *flags)
 	__acquires(timer->base->lock)
 {
-	struct tvec_base *base;
-
 	for (;;) {
-		struct tvec_base *prelock_base = timer->base;
-		base = tbase_get_base(prelock_base);
-		if (likely(base != NULL)) {
+		u32 tf = timer->flags;
+		struct tvec_base *base;
+
+		if (!(tf & TIMER_MIGRATING)) {
+			base = per_cpu_ptr(&tvec_bases, tf & TIMER_CPUMASK);
 			spin_lock_irqsave(&base->lock, *flags);
-			if (likely(prelock_base == timer->base))
+			if (timer->flags == tf)
 				return base;
-			/* The timer has migrated to another CPU */
 			spin_unlock_irqrestore(&base->lock, *flags);
 		}
 		cpu_relax();
@@ -790,7 +751,7 @@ __mod_timer(struct timer_list *timer, un
 	debug_activate(timer, expires);
 
 	cpu = get_nohz_timer_target(pinned);
-	new_base = per_cpu(tvec_bases, cpu);
+	new_base = per_cpu_ptr(&tvec_bases, cpu);
 
 	if (base != new_base) {
 		/*
@@ -802,11 +763,12 @@ __mod_timer(struct timer_list *timer, un
 		 */
 		if (likely(base->running_timer != timer)) {
 			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
+			timer->flags |= TIMER_MIGRATING;
+
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer_set_base(timer, base);
+			timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
 		}
 	}
 
@@ -968,13 +930,13 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
 	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
+	timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	spin_unlock_irqrestore(&base->lock, flags);
@@ -1039,8 +1001,6 @@ int try_to_del_timer_sync(struct timer_l
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);
-
 /**
  * del_timer_sync - deactivate a timer and wait for the handler to finish.
  * @timer: the timer to be deactivated
@@ -1095,7 +1055,7 @@ int del_timer_sync(struct timer_list *ti
 	 * don't use it in hardirq context, because it
 	 * could lead to deadlock.
 	 */
-	WARN_ON(in_irq() && !tbase_get_irqsafe(timer->base));
+	WARN_ON(in_irq() && !(timer->flags & TIMER_IRQSAFE));
 	for (;;) {
 		int ret = try_to_del_timer_sync(timer);
 		if (ret >= 0)
@@ -1120,7 +1080,6 @@ static int cascade(struct tvec_base *bas
 	 * don't have to detach them individually.
 	 */
 	hlist_for_each_entry_safe(timer, tmp, &tv_list, entry) {
-		BUG_ON(tbase_get_base(timer->base) != base);
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
 	}
@@ -1214,7 +1173,7 @@ static inline void __run_timers(struct t
 			timer = hlist_entry(head->first, struct timer_list, entry);
 			fn = timer->function;
 			data = timer->data;
-			irqsafe = tbase_get_irqsafe(timer->base);
+			irqsafe = timer->flags & TIMER_IRQSAFE;
 
 			timer_stats_account_timer(timer);
 
@@ -1254,7 +1213,7 @@ static unsigned long __next_timer_interr
 	index = slot = timer_jiffies & TVR_MASK;
 	do {
 		hlist_for_each_entry(nte, base->tv1.vec + slot, entry) {
-			if (tbase_get_deferrable(nte->base))
+			if (nte->flags & TIMER_DEFERRABLE)
 				continue;
 
 			found = 1;
@@ -1285,7 +1244,7 @@ cascade:
 		index = slot = timer_jiffies & TVN_MASK;
 		do {
 			hlist_for_each_entry(nte, varp->vec + slot, entry) {
-				if (tbase_get_deferrable(nte->base))
+				if (nte->flags & TIMER_DEFERRABLE)
 					continue;
 
 				found = 1;
@@ -1355,7 +1314,7 @@ static u64 cmp_next_hrtimer_event(u64 ba
  */
 u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 {
-	struct tvec_base *base = __this_cpu_read(tvec_bases);
+	struct tvec_base *base = this_cpu_ptr(&tvec_bases);
 	u64 expires = KTIME_MAX;
 	unsigned long nextevt;
 
@@ -1407,7 +1366,7 @@ void update_process_times(int user_tick)
  */
 static void run_timer_softirq(struct softirq_action *h)
 {
-	struct tvec_base *base = __this_cpu_read(tvec_bases);
+	struct tvec_base *base = this_cpu_ptr(&tvec_bases);
 
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
@@ -1546,12 +1505,13 @@ EXPORT_SYMBOL(schedule_timeout_uninterru
 static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
 {
 	struct timer_list *timer;
+	int cpu = new_base->cpu;
 
 	while (!hlist_empty(head)) {
 		timer = hlist_entry(head->first, struct timer_list, entry);
 		/* We ignore the accounting on the dying cpu */
 		detach_timer(timer, false);
-		timer_set_base(timer, new_base);
+		timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
 		internal_add_timer(new_base, timer);
 	}
 }
@@ -1563,8 +1523,8 @@ static void migrate_timers(int cpu)
 	int i;
 
 	BUG_ON(cpu_online(cpu));
-	old_base = per_cpu(tvec_bases, cpu);
-	new_base = get_cpu_var(tvec_bases);
+	old_base = per_cpu_ptr(&tvec_bases, cpu);
+	new_base = this_cpu_ptr(&tvec_bases);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
@@ -1588,7 +1548,6 @@ static void migrate_timers(int cpu)
 
 	spin_unlock(&old_base->lock);
 	spin_unlock_irq(&new_base->lock);
-	put_cpu_var(tvec_bases);
 }
 
 static int timer_cpu_notify(struct notifier_block *self,
@@ -1614,12 +1573,11 @@ static inline void timer_register_cpu_no
 static inline void timer_register_cpu_notifier(void) { }
 #endif /* CONFIG_HOTPLUG_CPU */
 
-static void __init init_timer_cpu(struct tvec_base *base, int cpu)
+static void __init init_timer_cpu(int cpu)
 {
-	BUG_ON(base != tbase_get_base(base));
+	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
 
 	base->cpu = cpu;
-	per_cpu(tvec_bases, cpu) = base;
 	spin_lock_init(&base->lock);
 
 	base->timer_jiffies = jiffies;
@@ -1628,27 +1586,14 @@ static void __init init_timer_cpu(struct
 
 static void __init init_timer_cpus(void)
 {
-	struct tvec_base *base;
-	int local_cpu = smp_processor_id();
 	int cpu;
 
-	for_each_possible_cpu(cpu) {
-		if (cpu == local_cpu)
-			base = &boot_tvec_bases;
-#ifdef CONFIG_SMP
-		else
-			base = per_cpu_ptr(&__tvec_bases, cpu);
-#endif
-
-		init_timer_cpu(base, cpu);
-	}
+	for_each_possible_cpu(cpu)
+		init_timer_cpu(cpu);
 }
 
 void __init init_timers(void)
 {
-	/* ensure there are enough low bits for flags in timer->base pointer */
-	BUILD_BUG_ON(__alignof__(struct tvec_base) & TIMER_FLAG_MASK);
-
 	init_timer_cpus();
 	init_timer_stats();
 	timer_register_cpu_notifier();



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 5/7] timer: stats: Simplify the flags handling
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
                   ` (3 preceding siblings ...)
  2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-06-19 13:23   ` [tip:timers/core] timer: Stats: " tip-bot for Thomas Gleixner
  2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

[-- Attachment #1: timer-stats-simplify-flags-handling.patch --]
[-- Type: text/plain, Size: 3575 bytes --]

Simplify the handling of the flag storage for the timer statistics. No
intermediate storage anymore. Just hand over the flags field.

I left the printout of 'deferrable' for now because changing this
would be an ABI update and I have no idea how strong people feel about
that. OTOH, I wonder whether we should kill the whole timer stats
stuff because all of that information can be retrieved via ftrace/perf
as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/timer.h     |    5 +----
 kernel/time/timer.c       |    7 ++-----
 kernel/time/timer_stats.c |   10 +++++-----
 3 files changed, 8 insertions(+), 14 deletions(-)

Index: tip/include/linux/timer.h
===================================================================
--- tip.orig/include/linux/timer.h
+++ tip/include/linux/timer.h
@@ -188,13 +188,10 @@ extern void set_timer_slack(struct timer
 
 extern int timer_stats_active;
 
-#define TIMER_STATS_FLAG_DEFERRABLE	0x1
-
 extern void init_timer_stats(void);
 
 extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
-				     void *timerf, char *comm,
-				     unsigned int timer_flag);
+				     void *timerf, char *comm, u32 flags);
 
 extern void __timer_stats_timer_set_start_info(struct timer_list *timer,
 					       void *addr);
Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -418,15 +418,12 @@ void __timer_stats_timer_set_start_info(
 
 static void timer_stats_account_timer(struct timer_list *timer)
 {
-	unsigned int flag = 0;
-
 	if (likely(!timer->start_site))
 		return;
-	if (unlikely(timer->flags & TIMER_DEFERRABLE))
-		flag |= TIMER_STATS_FLAG_DEFERRABLE;
 
 	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
-				 timer->function, timer->start_comm, flag);
+				 timer->function, timer->start_comm,
+				 timer->flags);
 }
 
 #else
Index: tip/kernel/time/timer_stats.c
===================================================================
--- tip.orig/kernel/time/timer_stats.c
+++ tip/kernel/time/timer_stats.c
@@ -68,7 +68,7 @@ struct entry {
 	 * Number of timeout events:
 	 */
 	unsigned long		count;
-	unsigned int		timer_flag;
+	u32			flags;
 
 	/*
 	 * We save the command-line string to preserve
@@ -227,13 +227,13 @@ static struct entry *tstat_lookup(struct
  * @startf:	pointer to the function which did the timer setup
  * @timerf:	pointer to the timer callback function of the timer
  * @comm:	name of the process which set up the timer
+ * @tflags:	The flags field of the timer
  *
  * When the timer is already registered, then the event counter is
  * incremented. Otherwise the timer is registered in a free slot.
  */
 void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
-			      void *timerf, char *comm,
-			      unsigned int timer_flag)
+			      void *timerf, char *comm, u32 tflags)
 {
 	/*
 	 * It doesn't matter which lock we take:
@@ -251,7 +251,7 @@ void timer_stats_update_stats(void *time
 	input.start_func = startf;
 	input.expire_func = timerf;
 	input.pid = pid;
-	input.timer_flag = timer_flag;
+	input.flags = tflags;
 
 	raw_spin_lock_irqsave(lock, flags);
 	if (!timer_stats_active)
@@ -306,7 +306,7 @@ static int tstats_show(struct seq_file *
 
 	for (i = 0; i < nr_entries; i++) {
 		entry = entries + i;
-		if (entry->timer_flag & TIMER_STATS_FLAG_DEFERRABLE) {
+		if (entry->flags & TIMER_DEFERRABLE) {
 			seq_printf(m, "%4luD, %5d %-16s ",
 				entry->count, entry->pid, entry->comm);
 		} else {



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 6/7] timer: Reduce timer migration overhead if disabled
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
                   ` (4 preceding siblings ...)
  2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
  2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
  2015-05-27 14:53 ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Eric Dumazet
  7 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

[-- Attachment #1: timer-minimize-timer-migration-overhead.patch --]
[-- Type: text/plain, Size: 16269 bytes --]

Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h      |    2 +
 include/linux/sched.h        |    6 ----
 include/linux/sched/sysctl.h |   12 --------
 include/linux/timer.h        |    9 ++++++
 kernel/rcu/tree_plugin.h     |    2 -
 kernel/sched/core.c          |    9 ++----
 kernel/sysctl.c              |   18 ++++++-------
 kernel/time/hrtimer.c        |   35 +++++++++++++++++++------
 kernel/time/tick-internal.h  |   14 ++++++++++
 kernel/time/tick-sched.c     |   25 ++++++++++--------
 kernel/time/timer.c          |   59 +++++++++++++++++++++++++++++++++++++++----
 kernel/time/timer_list.c     |    2 -
 12 files changed, 133 insertions(+), 60 deletions(-)

Index: tip/include/linux/hrtimer.h
===================================================================
--- tip.orig/include/linux/hrtimer.h
+++ tip/include/linux/hrtimer.h
@@ -170,6 +170,7 @@ enum  hrtimer_base_type {
  * @cpu:		cpu number
  * @active_bases:	Bitfield to mark bases with active timers
  * @clock_was_set_seq:	Sequence counter of clock was set events
+ * @migration_enabled:	The migration of hrtimers to other cpus is enabled
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @next_timer:		Pointer to the first expiring timer
@@ -191,6 +192,7 @@ struct hrtimer_cpu_base {
 	unsigned int			cpu;
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
+	bool				migration_enabled;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	unsigned int			in_hrtirq	: 1,
 					hres_active	: 1,
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -335,14 +335,10 @@ extern int runqueue_is_locked(int cpu);
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern void set_cpu_sd_state_idle(void);
-extern int get_nohz_timer_target(int pinned);
+extern int get_nohz_timer_target(void);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
 static inline void set_cpu_sd_state_idle(void) { }
-static inline int get_nohz_timer_target(int pinned)
-{
-	return smp_processor_id();
-}
 #endif
 
 /*
Index: tip/include/linux/sched/sysctl.h
===================================================================
--- tip.orig/include/linux/sched/sysctl.h
+++ tip/include/linux/sched/sysctl.h
@@ -57,24 +57,12 @@ extern unsigned int sysctl_numa_balancin
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
-extern unsigned int sysctl_timer_migration;
 extern unsigned int sysctl_sched_shares_window;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
 #endif
-#ifdef CONFIG_SCHED_DEBUG
-static inline unsigned int get_sysctl_timer_migration(void)
-{
-	return sysctl_timer_migration;
-}
-#else
-static inline unsigned int get_sysctl_timer_migration(void)
-{
-	return 1;
-}
-#endif
 
 /*
  *  control realtime throttling:
Index: tip/include/linux/timer.h
===================================================================
--- tip.orig/include/linux/timer.h
+++ tip/include/linux/timer.h
@@ -238,6 +238,15 @@ extern void run_local_timers(void);
 struct hrtimer;
 extern enum hrtimer_restart it_real_fn(struct hrtimer *);
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+#include <linux/sysctl.h>
+
+extern unsigned int sysctl_timer_migration;
+int timer_migration_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos);
+#endif
+
 unsigned long __round_jiffies(unsigned long j, int cpu);
 unsigned long __round_jiffies_relative(unsigned long j, int cpu);
 unsigned long round_jiffies(unsigned long j);
Index: tip/kernel/rcu/tree_plugin.h
===================================================================
--- tip.orig/kernel/rcu/tree_plugin.h
+++ tip/kernel/rcu/tree_plugin.h
@@ -1432,8 +1432,6 @@ module_param(rcu_idle_gp_delay, int, 064
 static int rcu_idle_lazy_gp_delay = RCU_IDLE_LAZY_GP_DELAY;
 module_param(rcu_idle_lazy_gp_delay, int, 0644);
 
-extern int tick_nohz_active;
-
 /*
  * Try to advance callbacks for all flavors of RCU on the current CPU, but
  * only if it has been awhile since the last time we did so.  Afterwards,
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -572,13 +572,12 @@ void resched_cpu(int cpu)
  * selecting an idle cpu will add more delays to the timers than intended
  * (as that cpu's timer base may not be uptodate wrt jiffies etc).
  */
-int get_nohz_timer_target(int pinned)
+int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
-	int i;
+	int i, cpu = smp_processor_id();
 	struct sched_domain *sd;
 
-	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
+	if (!idle_cpu(cpu))
 		return cpu;
 
 	rcu_read_lock();
@@ -7050,8 +7049,6 @@ void __init sched_init_smp(void)
 }
 #endif /* CONFIG_SMP */
 
-const_debug unsigned int sysctl_timer_migration = 1;
-
 int in_sched_functions(unsigned long addr)
 {
 	return in_lock_functions(addr) ||
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -349,15 +349,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "timer_migration",
-		.data		= &sysctl_timer_migration,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
-		.extra2		= &one,
-	},
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
@@ -1132,6 +1123,15 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+	{
+		.procname	= "timer_migration",
+		.data		= &sysctl_timer_migration,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= timer_migration_handler,
+	},
+#endif
 	{ }
 };
 
Index: tip/kernel/time/hrtimer.c
===================================================================
--- tip.orig/kernel/time/hrtimer.c
+++ tip/kernel/time/hrtimer.c
@@ -164,6 +164,24 @@ hrtimer_check_target(struct hrtimer *tim
 #endif
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static inline
+struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
+					 int pinned)
+{
+	if (pinned || !base->migration_enabled)
+		return this_cpu_ptr(&hrtimer_bases);
+	return &per_cpu(hrtimer_bases, get_nohz_timer_target());
+}
+#else
+static inline
+struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
+					 int pinned)
+{
+	return this_cpu_ptr(&hrtimer_bases);
+}
+#endif
+
 /*
  * Switch the timer base to the current CPU when possible.
  */
@@ -171,14 +189,13 @@ static inline struct hrtimer_clock_base
 switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 		    int pinned)
 {
+	struct hrtimer_cpu_base *new_cpu_base, *this_base;
 	struct hrtimer_clock_base *new_base;
-	struct hrtimer_cpu_base *new_cpu_base;
-	int this_cpu = smp_processor_id();
-	int cpu = get_nohz_timer_target(pinned);
 	int basenum = base->index;
 
+	this_base = this_cpu_ptr(&hrtimer_bases);
+	new_cpu_base = get_target_base(this_base, pinned);
 again:
-	new_cpu_base = &per_cpu(hrtimer_bases, cpu);
 	new_base = &new_cpu_base->clock_base[basenum];
 
 	if (base != new_base) {
@@ -199,17 +216,19 @@ again:
 		raw_spin_unlock(&base->cpu_base->lock);
 		raw_spin_lock(&new_base->cpu_base->lock);
 
-		if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
-			cpu = this_cpu;
+		if (new_cpu_base != this_base &&
+		    hrtimer_check_target(timer, new_base)) {
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
+			new_cpu_base = this_base;
 			timer->base = base;
 			goto again;
 		}
 		timer->base = new_base;
 	} else {
-		if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
-			cpu = this_cpu;
+		if (new_cpu_base != this_base &&
+		    hrtimer_check_target(timer, new_base)) {
+			new_cpu_base = this_base;
 			goto again;
 		}
 	}
Index: tip/kernel/time/tick-internal.h
===================================================================
--- tip.orig/kernel/time/tick-internal.h
+++ tip/kernel/time/tick-internal.h
@@ -138,4 +138,18 @@ extern void tick_nohz_init(void);
 static inline void tick_nohz_init(void) { }
 #endif
 
+#ifdef CONFIG_NO_HZ_COMMON
+extern unsigned long tick_nohz_active;
+#else
+#define tick_nohz_active (0)
+#endif
+
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+extern void timers_update_migration(void);
+#else
+static inline void timers_update_migration(void) { }
+#endif
+
+DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
+
 extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
Index: tip/kernel/time/tick-sched.c
===================================================================
--- tip.orig/kernel/time/tick-sched.c
+++ tip/kernel/time/tick-sched.c
@@ -399,7 +399,7 @@ void __init tick_nohz_init(void)
  * NO HZ enabled ?
  */
 static int tick_nohz_enabled __read_mostly  = 1;
-int tick_nohz_active  __read_mostly;
+unsigned long tick_nohz_active  __read_mostly;
 /*
  * Enable / Disable tickless mode
  */
@@ -956,6 +956,16 @@ static void tick_nohz_handler(struct clo
 	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
 }
 
+static inline void tick_nohz_activate(struct tick_sched *ts, int mode)
+{
+	if (!tick_nohz_enabled)
+		return;
+	ts->nohz_mode = mode;
+	/* One update is enough */
+	if (!test_and_set_bit(0, &tick_nohz_active))
+		timers_update_migration();
+}
+
 /**
  * tick_nohz_switch_to_nohz - switch to nohz mode
  */
@@ -970,9 +980,6 @@ static void tick_nohz_switch_to_nohz(voi
 	if (tick_switch_to_oneshot(tick_nohz_handler))
 		return;
 
-	tick_nohz_active = 1;
-	ts->nohz_mode = NOHZ_MODE_LOWRES;
-
 	/*
 	 * Recycle the hrtimer in ts, so we can share the
 	 * hrtimer_forward with the highres code.
@@ -984,6 +991,7 @@ static void tick_nohz_switch_to_nohz(voi
 	hrtimer_forward_now(&ts->sched_timer, tick_period);
 	hrtimer_set_expires(&ts->sched_timer, next);
 	tick_program_event(next, 1);
+	tick_nohz_activate(ts, NOHZ_MODE_LOWRES);
 }
 
 /*
@@ -1035,6 +1043,7 @@ static inline void tick_nohz_irq_enter(v
 
 static inline void tick_nohz_switch_to_nohz(void) { }
 static inline void tick_nohz_irq_enter(void) { }
+static inline void tick_nohz_activate(struct tick_sched *ts, int mode) { }
 
 #endif /* CONFIG_NO_HZ_COMMON */
 
@@ -1117,13 +1126,7 @@ void tick_setup_sched_timer(void)
 
 	hrtimer_forward(&ts->sched_timer, now, tick_period);
 	hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
-
-#ifdef CONFIG_NO_HZ_COMMON
-	if (tick_nohz_enabled) {
-		ts->nohz_mode = NOHZ_MODE_HIGHRES;
-		tick_nohz_active = 1;
-	}
-#endif
+	tick_nohz_activate(ts, NOHZ_MODE_HIGHRES);
 }
 #endif /* HIGH_RES_TIMERS */
 
Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -85,6 +85,7 @@ struct tvec_base {
 	unsigned long active_timers;
 	unsigned long all_timers;
 	int cpu;
+	bool migration_enabled;
 	struct tvec_root tv1;
 	struct tvec tv2;
 	struct tvec tv3;
@@ -95,6 +96,54 @@ struct tvec_base {
 
 static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+unsigned int sysctl_timer_migration = 1;
+
+void timers_update_migration(void)
+{
+	bool on = sysctl_timer_migration && tick_nohz_active;
+	unsigned int cpu;
+
+	/* Avoid the loop, if nothing to update */
+	if (this_cpu_read(tvec_bases.migration_enabled) == on)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(tvec_bases.migration_enabled, cpu) = on;
+		per_cpu(hrtimer_bases.migration_enabled, cpu) = on;
+	}
+}
+
+int timer_migration_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos)
+{
+	static DEFINE_MUTEX(mutex);
+	int ret;
+
+	mutex_lock(&mutex);
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (!ret && write)
+		timers_update_migration();
+	mutex_unlock(&mutex);
+	return ret;
+}
+
+static inline struct tvec_base *get_target_base(struct tvec_base *base,
+						int pinned)
+{
+	if (pinned || !base->migration_enabled)
+		return this_cpu_ptr(&tvec_bases);
+	return per_cpu_ptr(&tvec_bases, get_nohz_timer_target());
+}
+#else
+static inline struct tvec_base *get_target_base(struct tvec_base *base,
+						int pinned)
+{
+	return this_cpu_ptr(&tvec_bases);
+}
+#endif
+
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
 		bool force_up)
 {
@@ -730,11 +779,11 @@ static struct tvec_base *lock_timer_base
 
 static inline int
 __mod_timer(struct timer_list *timer, unsigned long expires,
-						bool pending_only, int pinned)
+	    bool pending_only, int pinned)
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0 , cpu;
+	int ret = 0;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -747,8 +796,7 @@ __mod_timer(struct timer_list *timer, un
 
 	debug_activate(timer, expires);
 
-	cpu = get_nohz_timer_target(pinned);
-	new_base = per_cpu_ptr(&tvec_bases, cpu);
+	new_base = get_target_base(base, pinned);
 
 	if (base != new_base) {
 		/*
@@ -765,7 +813,8 @@ __mod_timer(struct timer_list *timer, un
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
+			timer->flags &= ~TIMER_BASEMASK;
+			timer->flags |= base->cpu;
 		}
 	}
 
Index: tip/kernel/time/timer_list.c
===================================================================
--- tip.orig/kernel/time/timer_list.c
+++ tip/kernel/time/timer_list.c
@@ -29,8 +29,6 @@ struct timer_list_iter {
 
 typedef void (*print_fn_t)(struct seq_file *m, unsigned int *classes);
 
-DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
-
 /*
  * This allows printing both to /proc/timer_list and
  * to the console (on SysRq-Q):



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [patch 7/7] timer: Minimize nohz off overhead
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
                   ` (5 preceding siblings ...)
  2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-06-19 13:24   ` [tip:timers/core] " tip-bot for Thomas Gleixner
  2015-05-27 14:53 ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Eric Dumazet
  7 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

[-- Attachment #1: timer-minimize-nohz-off-overhead.patch --]
[-- Type: text/plain, Size: 5709 bytes --]

If nohz is disabled on the kernel command line the [hr]timer code
still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
avoid the overhead.

Before:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.73%  hog       [.] main
  15.36%  [kernel]  [k] _raw_spin_lock_irqsave
   9.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.61%  [kernel]  [k] lock_timer_base.isra.38
   6.42%  [kernel]  [k] mod_timer
   3.90%  [kernel]  [k] detach_if_pending
   3.76%  [kernel]  [k] del_timer
   2.41%  [kernel]  [k] internal_add_timer
   1.39%  [kernel]  [k] __internal_add_timer
   0.76%  [kernel]  [k] timerfn

We probably should have a cached value for nohz full in the per cpu
bases as well to avoid the cpumask check. The base cache line is hot
already, the cpumask not necessarily.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h     |    2 ++
 kernel/time/hrtimer.c       |    3 ++-
 kernel/time/tick-internal.h |    4 ++--
 kernel/time/tick-sched.c    |    2 +-
 kernel/time/timer.c         |   16 ++++++++++++----
 5 files changed, 19 insertions(+), 8 deletions(-)

Index: tip/include/linux/hrtimer.h
===================================================================
--- tip.orig/include/linux/hrtimer.h
+++ tip/include/linux/hrtimer.h
@@ -171,6 +171,7 @@ enum  hrtimer_base_type {
  * @active_bases:	Bitfield to mark bases with active timers
  * @clock_was_set_seq:	Sequence counter of clock was set events
  * @migration_enabled:	The migration of hrtimers to other cpus is enabled
+ * @nohz_active:	The nohz functionality is enabled
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @next_timer:		Pointer to the first expiring timer
@@ -193,6 +194,7 @@ struct hrtimer_cpu_base {
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
 	bool				migration_enabled;
+	bool				nohz_active;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	unsigned int			in_hrtirq	: 1,
 					hres_active	: 1,
Index: tip/kernel/time/hrtimer.c
===================================================================
--- tip.orig/kernel/time/hrtimer.c
+++ tip/kernel/time/hrtimer.c
@@ -987,7 +987,8 @@ void hrtimer_start_range_ns(struct hrtim
 		 * Kick to reschedule the next tick to handle the new timer
 		 * on dynticks target.
 		 */
-		wake_up_nohz_cpu(new_base->cpu_base->cpu);
+		if (new_base->cpu_base->nohz_active)
+			wake_up_nohz_cpu(new_base->cpu_base->cpu);
 	} else {
 		hrtimer_reprogram(timer, new_base);
 	}
Index: tip/kernel/time/tick-internal.h
===================================================================
--- tip.orig/kernel/time/tick-internal.h
+++ tip/kernel/time/tick-internal.h
@@ -145,9 +145,9 @@ extern unsigned long tick_nohz_active;
 #endif
 
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
-extern void timers_update_migration(void);
+extern void timers_update_migration(bool update_nohz);
 #else
-static inline void timers_update_migration(void) { }
+static inline void timers_update_migration(bool update_nohz) { }
 #endif
 
 DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
Index: tip/kernel/time/tick-sched.c
===================================================================
--- tip.orig/kernel/time/tick-sched.c
+++ tip/kernel/time/tick-sched.c
@@ -963,7 +963,7 @@ static inline void tick_nohz_activate(st
 	ts->nohz_mode = mode;
 	/* One update is enough */
 	if (!test_and_set_bit(0, &tick_nohz_active))
-		timers_update_migration();
+		timers_update_migration(true);
 }
 
 /**
Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -86,6 +86,7 @@ struct tvec_base {
 	unsigned long all_timers;
 	int cpu;
 	bool migration_enabled;
+	bool nohz_active;
 	struct tvec_root tv1;
 	struct tvec tv2;
 	struct tvec tv3;
@@ -99,7 +100,7 @@ static DEFINE_PER_CPU(struct tvec_base,
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 unsigned int sysctl_timer_migration = 1;
 
-void timers_update_migration(void)
+void timers_update_migration(bool update_nohz)
 {
 	bool on = sysctl_timer_migration && tick_nohz_active;
 	unsigned int cpu;
@@ -111,6 +112,10 @@ void timers_update_migration(void)
 	for_each_possible_cpu(cpu) {
 		per_cpu(tvec_bases.migration_enabled, cpu) = on;
 		per_cpu(hrtimer_bases.migration_enabled, cpu) = on;
+		if (!update_nohz)
+			continue;
+		per_cpu(tvec_bases.nohz_active, cpu) = true;
+		per_cpu(hrtimer_bases.nohz_active, cpu) = true;
 	}
 }
 
@@ -124,7 +129,7 @@ int timer_migration_handler(struct ctl_t
 	mutex_lock(&mutex);
 	ret = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (!ret && write)
-		timers_update_migration();
+		timers_update_migration(false);
 	mutex_unlock(&mutex);
 	return ret;
 }
@@ -450,8 +455,11 @@ static void internal_add_timer(struct tv
 	 * require special care against races with idle_cpu(), lets deal
 	 * with that later.
 	 */
-	if (!(timer->flags & TIMER_DEFERRABLE) || tick_nohz_full_cpu(base->cpu))
-		wake_up_nohz_cpu(base->cpu);
+	if (base->nohz_active) {
+		if (!(timer->flags & TIMER_DEFERRABLE) ||
+		    tick_nohz_full_cpu(base->cpu))
+			wake_up_nohz_cpu(base->cpu);
+	}
 }
 
 #ifdef CONFIG_TIMER_STATS



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage
  2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
@ 2015-05-27  5:39   ` Viresh Kumar
  2015-06-19 13:22   ` [tip:timers/core] " tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: Viresh Kumar @ 2015-05-27  5:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, John Stultz, Joonwoo Park,
	Wenbo Wang

On 26-05-15, 22:50, Thomas Gleixner wrote:
> 3) __run_timers()
> 
>    We only check on entry, which is silly, because base->timer_jiffies
>    can be behind - especially on NOHZ kernels - and if there is a
>    single deferrable timer somewhere between base->timer_jiffies and
>    jiffies we expire it and then loop until base->timer_jiffies ==
>    jiffies.

This may be incorrect. Once we expire that single deferrable timer, we
call detach_expired_timer(), which calls catchup_timer_jiffies() at
its end. And so the following loop should end right away, isn't it ?

        while (time_after_eq(jiffies, base->timer_jiffies))

> +++ tip/kernel/time/timer.c
> -static bool catchup_timer_jiffies(struct tvec_base *base)
> +static inline bool catchup_timer_jiffies(struct tvec_base *base)

There is only one user left for this routine now, i.e. __run_timers().
Should we drop this routine ?

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
@ 2015-05-27  9:11   ` Viresh Kumar
  2015-06-19 13:22   ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: Viresh Kumar @ 2015-05-27  9:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, John Stultz, Joonwoo Park,
	Wenbo Wang

On 26-05-15, 22:50, Thomas Gleixner wrote:
> The FIFO guarantee has been violated by the introduction of timer
> slack already. Remove it.
> 
> This is a preparatory patch for converting the timer wheel to hlist
> which reduces the memory foot print of the wheel by 50%. It's a
> seperate patch so any (unlikely to happen) regression caused by this
> can be identified clearly.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/time/timer.c |    6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> Index: tip/kernel/time/timer.c
> ===================================================================
> --- tip.orig/kernel/time/timer.c
> +++ tip/kernel/time/timer.c
> @@ -403,10 +403,8 @@ __internal_add_timer(struct tvec_base *b
>  		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
>  		vec = base->tv5.vec + i;
>  	}
> -	/*
> -	 * Timers are FIFO:
> -	 */
> -	list_add_tail(&timer->entry, vec);
> +
> +	list_add(&timer->entry, vec);
>  }

Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 3/7] timer: Use hlist for the timer wheel hash buckets
  2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
@ 2015-05-27  9:13   ` Viresh Kumar
  2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: Viresh Kumar @ 2015-05-27  9:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, John Stultz, Joonwoo Park,
	Wenbo Wang

On 26-05-15, 22:50, Thomas Gleixner wrote:
> This reduces the size of struct tvec_base by 50% and results in
> slightly smaller code as well.
> 
> Before:
>    struct tvec_base: size: 8256, cachelines: 129
> 
>    text	   data	    bss	    dec	    hex	filename
>   17698	  13297	   8256	  39251	   9953	../build/kernel/time/timer.o
> 
> After:
>   struct tvec_base: 4160, cachelines: 65, members: 12 */
> 
>    text	   data	    bss	    dec	    hex	filename
>   17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/timer.h |    6 ++--
>  kernel/time/timer.c   |   64 +++++++++++++++++++++-----------------------------
>  2 files changed, 30 insertions(+), 40 deletions(-)

Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/7] timer: Replace timer base by a cpu index
  2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
@ 2015-05-27  9:22   ` Viresh Kumar
  2015-05-27 12:09     ` Peter Zijlstra
  2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
  2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
  2 siblings, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2015-05-27  9:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, John Stultz, Joonwoo Park,
	Wenbo Wang, Steven Rostedt, Badhri Jagan Sridharan

On 26-05-15, 22:50, Thomas Gleixner wrote:
>  static struct tvec_base *lock_timer_base(struct timer_list *timer,
>  					unsigned long *flags)
>  	__acquires(timer->base->lock)
>  {
> -	struct tvec_base *base;
> -
>  	for (;;) {
> -		struct tvec_base *prelock_base = timer->base;
> -		base = tbase_get_base(prelock_base);
> -		if (likely(base != NULL)) {
> +		u32 tf = timer->flags;
> +		struct tvec_base *base;
> +
> +		if (!(tf & TIMER_MIGRATING)) {
> +			base = per_cpu_ptr(&tvec_bases, tf & TIMER_CPUMASK);
>  			spin_lock_irqsave(&base->lock, *flags);
> -			if (likely(prelock_base == timer->base))
> +			if (timer->flags == tf)
>  				return base;
> -			/* The timer has migrated to another CPU */

Maybe we should retain this comment. Its helpful for people who aren't
very familiar with timer core.

>  static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
>  {
>  	struct timer_list *timer;
> +	int cpu = new_base->cpu;
>  
>  	while (!hlist_empty(head)) {
>  		timer = hlist_entry(head->first, struct timer_list, entry);
>  		/* We ignore the accounting on the dying cpu */
>  		detach_timer(timer, false);
> -		timer_set_base(timer, new_base);
> +		timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;

Because 'cpu' is used only once in this routine, maybe we can use
'new_base->cpu' here and the line will still be exactly 80 columns
long.

Not sure, but maybe we can create a inline helper for this operation
as it is repeated at multiple places.

Look good otherwise:

Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/7] timer: Replace timer base by a cpu index
  2015-05-27  9:22   ` Viresh Kumar
@ 2015-05-27 12:09     ` Peter Zijlstra
  2015-06-02 13:58       ` Thomas Gleixner
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2015-05-27 12:09 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, John Stultz, Joonwoo Park,
	Wenbo Wang, Steven Rostedt, Badhri Jagan Sridharan

On Wed, May 27, 2015 at 02:52:36PM +0530, Viresh Kumar wrote:
> > +		timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
> 
> Because 'cpu' is used only once in this routine, maybe we can use
> 'new_base->cpu' here and the line will still be exactly 80 columns
> long.
> 
> Not sure, but maybe we can create a inline helper for this operation
> as it is repeated at multiple places.

I suggested bitfields at some point I think. That gets the compiler to
generate those helpers for you.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
                   ` (6 preceding siblings ...)
  2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
@ 2015-05-27 14:53 ` Eric Dumazet
  2015-06-02 13:58   ` Thomas Gleixner
  7 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2015-05-27 14:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, Viresh Kumar, John Stultz,
	Joonwoo Park, Wenbo Wang

On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
> 
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
> 
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
> 
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl is not a lot of gain, I tried to mitigate at
> least that issue. That caused me to look deeper and I came up with the
> following patch series which:
> 
>   - Clean up the sloppy catchup timer jiffies stuff
> 
>   - Replace the hash bucket lists by hlists, which shrinks the wheel
>     footprint by 50%
> 
>   - Replaces the timer base pointer in struct timer_list by a cpu
>     index, which shrinks struct timer_list by 4 - 8 bytes depending on
>     alignement and architecture.
> 
>   - Caches nohz_active and timer_migration in the timer per cpu bases,
>     so we can avoid going through loops and hoops to figure that out.
> 
> This series applies against
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
> 
> With a pretty naive timer test module and a sched other cpu hog on an
> isolated core I verified that I did not wreckage anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios.
> 
> 	     nohz=on	      	nohz=on			nohz=off
> 	     timer_migration=on	timer_migration=off	    
> 
> Unpatched:   46.57%		46.87%			46.64%
> 
> Patched:     47.76%		48.20%			48.73%
> 
> Though, I do not trust my numbers, so I really hope that Eric or any
> other power timer wheel user can throw a bunch of tests at it.
> 
> 
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
> 
>   - it's impossible to keep reliably track of the next expiring timer
> 
>   - finding the next expiring timer is in the worst case O(N) and
>     involves touching a gazillion of cache lines
> 
> The other disaster area is the nohz timer migration mechanism. I think
> it's fundamentally broken. 
> 
>  - We decide at timer enqueue time, on which CPU we queue the timer,
>    based on cpu idle heuristics which even fails to recognize that a
>    cpu is really busy with interrupt and softirq processing (reported
>    by Eric).
> 
>  - When moving a timer to some "non-idle" CPU we speculate about the
>    system situation in the future (at timer expiry time). This kinda
>    works for limited scenarios (NOHZ_FULL and constraint power saving
>    on mobile devices). But it is pretty much nonsensical for
>    everything else. For network heavy workloads it can be even
>    outright wrong and dangerous as Eric explained.
> 
> So we really need to put a full stop at all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are not solved. In
> hindsight I should have done that earlier, but it's never too late.
> 
> So we need to address two issues:
> 
> 1) Keeping track of the first expiring timer in the wheel.
> 
>    Don't even think about rbtrees or such, it just wont work, but I'm
>    willing to accept prove of the contrary.
> 
>    One of the ideas I had a long time ago is to have a wheel
>    implementation, which does never cascade and therefor provides
>    different levels of granularity per wheel level.
> 
>    LVL0	   1  jiffy granularity
>    LVL1	   8  jiffies granularity
>    LVL1	   64 jiffies granularity
>    ....
> 
>    This requires more levels than the classic timer wheel, so its not
>    feasible from a memory consumption POV.
> 
>    But we can have a compromise and put all 'soon' expiring timers
>    into these fancy new wheels and stick all 'far out' timers into the
>    last level of the wheel and cascade them occasionally.
> 
>    That should work because:
> 
>      - Most timers are short term expiry (< 1h)
>      - Most timers are canceled before expiry
> 
>    So we need a sensible upper bound of levels and get the following
>    properties:
> 
>    	- Natural batching (no more slack magic). This might also help
>           networking to avoid rearming of timers.
> 
> 	- Long out timers are queued in the slowest wheel and
>           ocassionally with the granularity of the slowest wheel
>           cascaded
> 
> 	- Tracking the next expiring timer can be done with a bitmap,
>           so the 'get next expiring timer' becomes O(1) without
>           touching anything else than the bitmap, if we accept that
>           the upper limit of what we can retrieve O(1) is the
>           granularity of the last level, i.e. we treat timers which
>           need recascading simple as normal inhabitants of the last
>           wheel level.
> 	  
>         - The expiry code (run_timers) gets more complicated as it
>           needs to walk through the different levels more often, but
>           with the bitmap that shouldnt be a too big issue.
> 
>         - Seperate storage for non-deferrable and deferrable timers.
> 
>     I spent some time in coding that up. It barely boots and has
>     certainly a few bugs left and right, but I will send it out
>     nevertheless as a reply to this mail to get the discussion going.

Thanks a lot Thomas !

I finally got back two hosts with 72 cpus where I can try your patches.

I have one comment about socket TCP timers : They are re-armed very
often, but they are never canceled (unless the socket is dismantled)

I am too lazy to search for the commit adding this stuff (before git),
but I guess it helped a lot RPC sessions : No need to cancel a timer if
we're likely to rearm it in the following 10 usec ;)

$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
  2015-05-27 14:53 ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Eric Dumazet
@ 2015-06-02 13:58   ` Thomas Gleixner
  0 siblings, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-06-02 13:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, Viresh Kumar, John Stultz,
	Joonwoo Park, Wenbo Wang

On Wed, 27 May 2015, Eric Dumazet wrote:
> I finally got back two hosts with 72 cpus where I can try your patches.

Waiting for your results :)

> I have one comment about socket TCP timers : They are re-armed very
> often, but they are never canceled (unless the socket is dismantled)
> 
> I am too lazy to search for the commit adding this stuff (before git),
> but I guess it helped a lot RPC sessions : No need to cancel a timer if
> we're likely to rearm it in the following 10 usec ;)

Right, but that's a 'cancel + rearm' in the timer core code, unless
the expiry time does not change.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/7] timer: Replace timer base by a cpu index
  2015-05-27 12:09     ` Peter Zijlstra
@ 2015-06-02 13:58       ` Thomas Gleixner
  0 siblings, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-06-02 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Viresh Kumar, LKML, Ingo Molnar, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, John Stultz, Joonwoo Park,
	Wenbo Wang, Steven Rostedt, Badhri Jagan Sridharan

On Wed, 27 May 2015, Peter Zijlstra wrote:

> On Wed, May 27, 2015 at 02:52:36PM +0530, Viresh Kumar wrote:
> > > +		timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
> > 
> > Because 'cpu' is used only once in this routine, maybe we can use
> > 'new_base->cpu' here and the line will still be exactly 80 columns
> > long.
> > 
> > Not sure, but maybe we can create a inline helper for this operation
> > as it is repeated at multiple places.
> 
> I suggested bitfields at some point I think. That gets the compiler to
> generate those helpers for you.

I tried, but they are a PITA for the static initializiers and I could
not be bothered to deal with the endianess mess. Maybe I should try
again :)

Thanks,

	tglx






^ permalink raw reply	[flat|nested] 27+ messages in thread

* [tip:timers/core] timers: Sanitize catchup_timer_jiffies() usage
  2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
  2015-05-27  5:39   ` Viresh Kumar
@ 2015-06-19 13:22   ` tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2015-06-19 13:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, john.stultz, joonwoop, fweisbec, edumazet,
	wenbo.wang, hpa, peterz, tglx, viresh.kumar, mingo, paulmck

Commit-ID:  3bb475a3446facd0425d3f2fe7e85bf03c5c6c05
Gitweb:     http://git.kernel.org/tip/3bb475a3446facd0425d3f2fe7e85bf03c5c6c05
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 26 May 2015 22:50:24 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 15:18:27 +0200

timers: Sanitize catchup_timer_jiffies() usage

catchup_timer_jiffies() has been applied blindly to several functions
without looking for possible better ways to do it.

1) internal_add_timer()

   Move the update to base->all_timers before we actually insert the
   timer into the wheel.

2) detach_if_pending()

   Again the update to base->all_timers allows us to explicitely do
   the timer_jiffies update in place, if this was the last timer which
   got removed.

3) __run_timers()

   We only check on entry, which is silly, because base->timer_jiffies
   can be behind - especially on NOHZ kernels - and if there is a
   single deferrable timer somewhere between base->timer_jiffies and
   jiffies we expire it and then loop until base->timer_jiffies ==
   jiffies.

   Move it into the loop.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224511.662994644@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timer.c | 40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 7775d45..d5e0179 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -351,20 +351,6 @@ void set_timer_slack(struct timer_list *timer, int slack_hz)
 }
 EXPORT_SYMBOL_GPL(set_timer_slack);
 
-/*
- * If the list is empty, catch up ->timer_jiffies to the current time.
- * The caller must hold the tvec_base lock.  Returns true if the list
- * was empty and therefore ->timer_jiffies was updated.
- */
-static bool catchup_timer_jiffies(struct tvec_base *base)
-{
-	if (!base->all_timers) {
-		base->timer_jiffies = jiffies;
-		return true;
-	}
-	return false;
-}
-
 static void
 __internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 {
@@ -411,7 +397,10 @@ __internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 {
-	(void)catchup_timer_jiffies(base);
+	/* Advance base->jiffies, if the base is empty */
+	if (!base->all_timers++)
+		base->timer_jiffies = jiffies;
+
 	__internal_add_timer(base, timer);
 	/*
 	 * Update base->active_timers and base->next_timer
@@ -421,7 +410,6 @@ static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 		    time_before(timer->expires, base->next_timer))
 			base->next_timer = timer->expires;
 	}
-	base->all_timers++;
 
 	/*
 	 * Check whether the other CPU is in dynticks mode and needs
@@ -718,7 +706,6 @@ detach_expired_timer(struct timer_list *timer, struct tvec_base *base)
 	if (!tbase_get_deferrable(timer->base))
 		base->active_timers--;
 	base->all_timers--;
-	(void)catchup_timer_jiffies(base);
 }
 
 static int detach_if_pending(struct timer_list *timer, struct tvec_base *base,
@@ -733,8 +720,9 @@ static int detach_if_pending(struct timer_list *timer, struct tvec_base *base,
 		if (timer->expires == base->next_timer)
 			base->next_timer = base->timer_jiffies;
 	}
-	base->all_timers--;
-	(void)catchup_timer_jiffies(base);
+	/* If this was the last timer, advance base->jiffies */
+	if (!--base->all_timers)
+		base->timer_jiffies = jiffies;
 	return 1;
 }
 
@@ -1184,14 +1172,18 @@ static inline void __run_timers(struct tvec_base *base)
 	struct timer_list *timer;
 
 	spin_lock_irq(&base->lock);
-	if (catchup_timer_jiffies(base)) {
-		spin_unlock_irq(&base->lock);
-		return;
-	}
+
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
 		struct list_head work_list;
 		struct list_head *head = &work_list;
-		int index = base->timer_jiffies & TVR_MASK;
+		int index;
+
+		if (!base->all_timers) {
+			base->timer_jiffies = jiffies;
+			break;
+		}
+
+		index = base->timer_jiffies & TVR_MASK;
 
 		/*
 		 * Cascade timers:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [tip:timers/core] timer: Remove FIFO "guarantee"
  2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
  2015-05-27  9:11   ` Viresh Kumar
@ 2015-06-19 13:22   ` tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2015-06-19 13:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: edumazet, paulmck, john.stultz, linux-kernel, linux, hpa, mingo,
	fweisbec, viresh.kumar, tglx, joonwoop, wenbo.wang, peterz

Commit-ID:  1bd04bf6f68d65f5422b2b85c495d65d49587a54
Gitweb:     http://git.kernel.org/tip/1bd04bf6f68d65f5422b2b85c495d65d49587a54
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 26 May 2015 22:50:26 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 15:18:27 +0200

timer: Remove FIFO "guarantee"

The FIFO guarantee is only there if two timers are queued into the
same bucket at the same jiffie on the same cpu:

 - The slack value depends on the delta between expiry and enqueue
   time, so the resulting expiry time can be different for timers
   which are queued in different jiffies.

 - Timers which are queued into the secondary array end up after a
   later queued timer which was queued into the primary array due to
   cascading.

 - Timers can end up on different cpus due to the NOHZ target moving
   around. Obviously there is no guarantee of expiry ordering between
   cpus.

So anything which relies on FIFO behaviour of the timer wheel is
broken already.

This is a preparatory patch for converting the timer wheel to hlist
which reduces the memory foot print of the wheel by 50%.

It's a seperate patch so any (unlikely to happen) regression caused by
this can be identified clearly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Cc: George Spelvin <linux@horizon.com>
Link: http://lkml.kernel.org/r/20150526224511.757520403@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timer.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index d5e0179..e212df2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -389,10 +389,8 @@ __internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
 		vec = base->tv5.vec + i;
 	}
-	/*
-	 * Timers are FIFO:
-	 */
-	list_add_tail(&timer->entry, vec);
+
+	list_add(&timer->entry, vec);
 }
 
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [tip:timers/core] timer: Use hlist for the timer wheel hash buckets
  2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
  2015-05-27  9:13   ` Viresh Kumar
@ 2015-06-19 13:23   ` tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2015-06-19 13:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: viresh.kumar, john.stultz, joonwoop, hpa, paulmck, tglx,
	wenbo.wang, peterz, mingo, linux-kernel, fweisbec, edumazet

Commit-ID:  1dabbcec2c0a36fe43509d06499b9e512e70a028
Gitweb:     http://git.kernel.org/tip/1dabbcec2c0a36fe43509d06499b9e512e70a028
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 26 May 2015 22:50:28 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 15:18:27 +0200

timer: Use hlist for the timer wheel hash buckets

This reduces the size of struct tvec_base by 50% and results in
slightly smaller code as well.

Before:
   struct tvec_base: size: 8256, cachelines: 129

   text	   data	    bss	    dec	    hex	filename
  17698	  13297	   8256	  39251	   9953	../build/kernel/time/timer.o

After:
  struct tvec_base: 4160, cachelines: 65

   text	   data	    bss	    dec	    hex	filename
  17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224511.854731214@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/timer.h |  6 ++---
 kernel/time/timer.c   | 64 ++++++++++++++++++++++-----------------------------
 2 files changed, 30 insertions(+), 40 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index fbb80e0..064ee24 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -14,7 +14,7 @@ struct timer_list {
 	 * All fields that change during normal runtime grouped to the
 	 * same cacheline
 	 */
-	struct list_head entry;
+	struct hlist_node entry;
 	unsigned long expires;
 	struct tvec_base *base;
 
@@ -71,7 +71,7 @@ extern struct tvec_base boot_tvec_bases;
 #define TIMER_FLAG_MASK			0x3LU
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
-		.entry = { .prev = TIMER_ENTRY_STATIC },	\
+		.entry = { .next = TIMER_ENTRY_STATIC },	\
 		.function = (_function),			\
 		.expires = (_expires),				\
 		.data = (_data),				\
@@ -168,7 +168,7 @@ static inline void init_timer_on_stack_key(struct timer_list *timer,
  */
 static inline int timer_pending(const struct timer_list * timer)
 {
-	return timer->entry.next != NULL;
+	return timer->entry.pprev != NULL;
 }
 
 extern void add_timer_on(struct timer_list *timer, int cpu);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index e212df2..3a5e0c8 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -70,11 +70,11 @@ EXPORT_SYMBOL(jiffies_64);
 #define MAX_TVAL ((unsigned long)((1ULL << (TVR_BITS + 4*TVN_BITS)) - 1))
 
 struct tvec {
-	struct list_head vec[TVN_SIZE];
+	struct hlist_head vec[TVN_SIZE];
 };
 
 struct tvec_root {
-	struct list_head vec[TVR_SIZE];
+	struct hlist_head vec[TVR_SIZE];
 };
 
 struct tvec_base {
@@ -356,7 +356,7 @@ __internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 {
 	unsigned long expires = timer->expires;
 	unsigned long idx = expires - base->timer_jiffies;
-	struct list_head *vec;
+	struct hlist_head *vec;
 
 	if (idx < TVR_SIZE) {
 		int i = expires & TVR_MASK;
@@ -390,7 +390,7 @@ __internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 		vec = base->tv5.vec + i;
 	}
 
-	list_add(&timer->entry, vec);
+	hlist_add_head(&timer->entry, vec);
 }
 
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
@@ -504,8 +504,8 @@ static int timer_fixup_activate(void *addr, enum debug_obj_state state)
 		 * statically initialized. We just make sure that it
 		 * is tracked in the object tracker.
 		 */
-		if (timer->entry.next == NULL &&
-		    timer->entry.prev == TIMER_ENTRY_STATIC) {
+		if (timer->entry.pprev == NULL &&
+		    timer->entry.next == TIMER_ENTRY_STATIC) {
 			debug_object_init(timer, &timer_debug_descr);
 			debug_object_activate(timer, &timer_debug_descr);
 			return 0;
@@ -551,7 +551,7 @@ static int timer_fixup_assert_init(void *addr, enum debug_obj_state state)
 
 	switch (state) {
 	case ODEBUG_STATE_NOTAVAILABLE:
-		if (timer->entry.prev == TIMER_ENTRY_STATIC) {
+		if (timer->entry.next == TIMER_ENTRY_STATIC) {
 			/*
 			 * This is not really a fixup. The timer was
 			 * statically initialized. We just make sure that it
@@ -655,7 +655,7 @@ static void do_init_timer(struct timer_list *timer, unsigned int flags,
 {
 	struct tvec_base *base = raw_cpu_read(tvec_bases);
 
-	timer->entry.next = NULL;
+	timer->entry.pprev = NULL;
 	timer->base = (void *)((unsigned long)base | flags);
 	timer->slack = -1;
 #ifdef CONFIG_TIMER_STATS
@@ -687,14 +687,14 @@ EXPORT_SYMBOL(init_timer_key);
 
 static inline void detach_timer(struct timer_list *timer, bool clear_pending)
 {
-	struct list_head *entry = &timer->entry;
+	struct hlist_node *entry = &timer->entry;
 
 	debug_deactivate(timer);
 
-	__list_del(entry->prev, entry->next);
+	__hlist_del(entry);
 	if (clear_pending)
-		entry->next = NULL;
-	entry->prev = LIST_POISON2;
+		entry->pprev = NULL;
+	entry->next = LIST_POISON2;
 }
 
 static inline void
@@ -1095,16 +1095,17 @@ EXPORT_SYMBOL(del_timer_sync);
 static int cascade(struct tvec_base *base, struct tvec *tv, int index)
 {
 	/* cascade all the timers from tv up one level */
-	struct timer_list *timer, *tmp;
-	struct list_head tv_list;
+	struct timer_list *timer;
+	struct hlist_node *tmp;
+	struct hlist_head tv_list;
 
-	list_replace_init(tv->vec + index, &tv_list);
+	hlist_move_list(tv->vec + index, &tv_list);
 
 	/*
 	 * We are removing _all_ timers from the list, so we
 	 * don't have to detach them individually.
 	 */
-	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
+	hlist_for_each_entry_safe(timer, tmp, &tv_list, entry) {
 		BUG_ON(tbase_get_base(timer->base) != base);
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
@@ -1172,8 +1173,8 @@ static inline void __run_timers(struct tvec_base *base)
 	spin_lock_irq(&base->lock);
 
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
-		struct list_head work_list;
-		struct list_head *head = &work_list;
+		struct hlist_head work_list;
+		struct hlist_head *head = &work_list;
 		int index;
 
 		if (!base->all_timers) {
@@ -1192,13 +1193,13 @@ static inline void __run_timers(struct tvec_base *base)
 					!cascade(base, &base->tv4, INDEX(2)))
 			cascade(base, &base->tv5, INDEX(3));
 		++base->timer_jiffies;
-		list_replace_init(base->tv1.vec + index, head);
-		while (!list_empty(head)) {
+		hlist_move_list(base->tv1.vec + index, head);
+		while (!hlist_empty(head)) {
 			void (*fn)(unsigned long);
 			unsigned long data;
 			bool irqsafe;
 
-			timer = list_first_entry(head, struct timer_list,entry);
+			timer = hlist_entry(head->first, struct timer_list, entry);
 			fn = timer->function;
 			data = timer->data;
 			irqsafe = tbase_get_irqsafe(timer->base);
@@ -1240,7 +1241,7 @@ static unsigned long __next_timer_interrupt(struct tvec_base *base)
 	/* Look for timer events in tv1. */
 	index = slot = timer_jiffies & TVR_MASK;
 	do {
-		list_for_each_entry(nte, base->tv1.vec + slot, entry) {
+		hlist_for_each_entry(nte, base->tv1.vec + slot, entry) {
 			if (tbase_get_deferrable(nte->base))
 				continue;
 
@@ -1271,7 +1272,7 @@ cascade:
 
 		index = slot = timer_jiffies & TVN_MASK;
 		do {
-			list_for_each_entry(nte, varp->vec + slot, entry) {
+			hlist_for_each_entry(nte, varp->vec + slot, entry) {
 				if (tbase_get_deferrable(nte->base))
 					continue;
 
@@ -1530,12 +1531,12 @@ signed long __sched schedule_timeout_uninterruptible(signed long timeout)
 EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
 #ifdef CONFIG_HOTPLUG_CPU
-static void migrate_timer_list(struct tvec_base *new_base, struct list_head *head)
+static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
 {
 	struct timer_list *timer;
 
-	while (!list_empty(head)) {
-		timer = list_first_entry(head, struct timer_list, entry);
+	while (!hlist_empty(head)) {
+		timer = hlist_entry(head->first, struct timer_list, entry);
 		/* We ignore the accounting on the dying cpu */
 		detach_timer(timer, false);
 		timer_set_base(timer, new_base);
@@ -1603,23 +1604,12 @@ static inline void timer_register_cpu_notifier(void) { }
 
 static void __init init_timer_cpu(struct tvec_base *base, int cpu)
 {
-	int j;
-
 	BUG_ON(base != tbase_get_base(base));
 
 	base->cpu = cpu;
 	per_cpu(tvec_bases, cpu) = base;
 	spin_lock_init(&base->lock);
 
-	for (j = 0; j < TVN_SIZE; j++) {
-		INIT_LIST_HEAD(base->tv5.vec + j);
-		INIT_LIST_HEAD(base->tv4.vec + j);
-		INIT_LIST_HEAD(base->tv3.vec + j);
-		INIT_LIST_HEAD(base->tv2.vec + j);
-	}
-	for (j = 0; j < TVR_SIZE; j++)
-		INIT_LIST_HEAD(base->tv1.vec + j);
-
 	base->timer_jiffies = jiffies;
 	base->next_timer = base->timer_jiffies;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [tip:timers/core] timer: Replace timer base by a cpu index
  2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
  2015-05-27  9:22   ` Viresh Kumar
@ 2015-06-19 13:23   ` tip-bot for Thomas Gleixner
  2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
  2 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2015-06-19 13:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, wenbo.wang, hpa, viresh.kumar, john.stultz, mingo,
	edumazet, tglx, rostedt, paulmck, fweisbec, joonwoop, Badhri,
	linux-kernel

Commit-ID:  0eeda71bc30d74f66f8231f45621d5ace3419186
Gitweb:     http://git.kernel.org/tip/0eeda71bc30d74f66f8231f45621d5ace3419186
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 26 May 2015 22:50:29 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 15:18:27 +0200

timer: Replace timer base by a cpu index

Instead of storing a pointer to the per cpu tvec_base we can simply
cache a CPU index in the timer_list and use that to get hold of the
correct per cpu tvec_base. This is only used in lock_timer_base() and
the slightly larger code is peanuts versus the spinlock operation and
the d-cache foot print of the timer wheel.

Aside of that this allows to get rid of following nuisances:

 - boot_tvec_base

   That statically allocated 4k bss data is just kept around so the
   timer has a home when it gets statically initialized. It serves no
   other purpose.

   With the CPU index we assign the timer to CPU0 at static
   initialization time and therefor can avoid the whole boot_tvec_base
   dance.  That also simplifies the init code, which just can use the
   per cpu base.

   Before:
     text	   data	    bss	    dec	    hex	filename
    17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
   After:
     text	   data	    bss	    dec	    hex	filename
    17440	   9193	      0	  26633	   6809	../build/kernel/time/timer.o

 - Overloading the base pointer with various flags

   The CPU index has enough space to hold the flags (deferrable,
   irqsafe) so we can get rid of the extra masking and bit fiddling
   with the base pointer.

As a benefit we reduce the size of struct timer_list on 64 bit
machines. 4 - 8 bytes, a size reduction up to 15% per struct timer_list,
which is a real win as we have tons of them embedded in other structs.

This changes also the newly added deferrable printout of the timer
start trace point to capture and print all timer->flags, which allows
us to decode the target cpu of the timer as well.

We might have used bitfields for this, but that would change the
static initializers and the init function for no value to accomodate
big endian bitfields.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Badhri Jagan Sridharan <Badhri@google.com>
Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/timer.h        |  38 ++++++-------
 include/trace/events/timer.h |  13 ++---
 kernel/time/timer.c          | 127 ++++++++++++-------------------------------
 3 files changed, 58 insertions(+), 120 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 064ee24..4a0d52b 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -14,27 +14,23 @@ struct timer_list {
 	 * All fields that change during normal runtime grouped to the
 	 * same cacheline
 	 */
-	struct hlist_node entry;
-	unsigned long expires;
-	struct tvec_base *base;
-
-	void (*function)(unsigned long);
-	unsigned long data;
-
-	int slack;
+	struct hlist_node	entry;
+	unsigned long		expires;
+	void			(*function)(unsigned long);
+	unsigned long		data;
+	u32			flags;
+	int			slack;
 
 #ifdef CONFIG_TIMER_STATS
-	int start_pid;
-	void *start_site;
-	char start_comm[16];
+	int			start_pid;
+	void			*start_site;
+	char			start_comm[16];
 #endif
 #ifdef CONFIG_LOCKDEP
-	struct lockdep_map lockdep_map;
+	struct lockdep_map	lockdep_map;
 #endif
 };
 
-extern struct tvec_base boot_tvec_bases;
-
 #ifdef CONFIG_LOCKDEP
 /*
  * NB: because we have to copy the lockdep_map, setting the lockdep_map key
@@ -49,9 +45,6 @@ extern struct tvec_base boot_tvec_bases;
 #endif
 
 /*
- * Note that all tvec_bases are at least 4 byte aligned and lower two bits
- * of base in timer_list is guaranteed to be zero. Use them for flags.
- *
  * A deferrable timer will work normally when the system is busy, but
  * will not cause a CPU to come out of idle just to service it; instead,
  * the timer will be serviced when the CPU eventually wakes up with a
@@ -65,17 +58,18 @@ extern struct tvec_base boot_tvec_bases;
  * workqueue locking issues. It's not meant for executing random crap
  * with interrupts disabled. Abuse is monitored!
  */
-#define TIMER_DEFERRABLE		0x1LU
-#define TIMER_IRQSAFE			0x2LU
-
-#define TIMER_FLAG_MASK			0x3LU
+#define TIMER_CPUMASK		0x0007FFFF
+#define TIMER_MIGRATING		0x00080000
+#define TIMER_BASEMASK		(TIMER_CPUMASK | TIMER_MIGRATING)
+#define TIMER_DEFERRABLE	0x00100000
+#define TIMER_IRQSAFE		0x00200000
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
 		.entry = { .next = TIMER_ENTRY_STATIC },	\
 		.function = (_function),			\
 		.expires = (_expires),				\
 		.data = (_data),				\
-		.base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
+		.flags = (_flags),				\
 		.slack = -1,					\
 		__TIMER_LOCKDEP_MAP_INITIALIZER(		\
 			__FILE__ ":" __stringify(__LINE__))	\
diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index d7abef1..073b9ac 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -45,16 +45,16 @@ TRACE_EVENT(timer_start,
 
 	TP_PROTO(struct timer_list *timer,
 		unsigned long expires,
-		unsigned int deferrable),
+		unsigned int flags),
 
-	TP_ARGS(timer, expires, deferrable),
+	TP_ARGS(timer, expires, flags),
 
 	TP_STRUCT__entry(
 		__field( void *,	timer		)
 		__field( void *,	function	)
 		__field( unsigned long,	expires		)
 		__field( unsigned long,	now		)
-		__field( unsigned int,	deferrable	)
+		__field( unsigned int,	flags		)
 	),
 
 	TP_fast_assign(
@@ -62,13 +62,12 @@ TRACE_EVENT(timer_start,
 		__entry->function	= timer->function;
 		__entry->expires	= expires;
 		__entry->now		= jiffies;
-		__entry->deferrable     = deferrable;
+		__entry->flags		= flags;
 	),
 
-	TP_printk("timer=%p function=%pf expires=%lu [timeout=%ld] defer=%c",
+	TP_printk("timer=%p function=%pf expires=%lu [timeout=%ld] flags=0x%08x",
 		  __entry->timer, __entry->function, __entry->expires,
-		  (long)__entry->expires - __entry->now,
-		  __entry->deferrable > 0 ? 'y':'n')
+		  (long)__entry->expires - __entry->now, __entry->flags)
 );
 
 /**
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 3a5e0c8..1540af9 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -92,43 +92,8 @@ struct tvec_base {
 	struct tvec tv5;
 } ____cacheline_aligned;
 
-/*
- * __TIMER_INITIALIZER() needs to set ->base to a valid pointer (because we've
- * made NULL special, hint: lock_timer_base()) and we cannot get a compile time
- * pointer to per-cpu entries because we don't know where we'll map the section,
- * even for the boot cpu.
- *
- * And so we use boot_tvec_bases for boot CPU and per-cpu __tvec_bases for the
- * rest of them.
- */
-struct tvec_base boot_tvec_bases;
-EXPORT_SYMBOL(boot_tvec_bases);
-
-static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
-
-/* Functions below help us manage 'deferrable' flag */
-static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
-{
-	return ((unsigned int)(unsigned long)base & TIMER_DEFERRABLE);
-}
-
-static inline unsigned int tbase_get_irqsafe(struct tvec_base *base)
-{
-	return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE);
-}
-
-static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
-{
-	return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
-}
-
-static inline void
-timer_set_base(struct timer_list *timer, struct tvec_base *new_base)
-{
-	unsigned long flags = (unsigned long)timer->base & TIMER_FLAG_MASK;
 
-	timer->base = (struct tvec_base *)((unsigned long)(new_base) | flags);
-}
+static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
 
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
 		bool force_up)
@@ -403,7 +368,7 @@ static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 	/*
 	 * Update base->active_timers and base->next_timer
 	 */
-	if (!tbase_get_deferrable(timer->base)) {
+	if (!(timer->flags & TIMER_DEFERRABLE)) {
 		if (!base->active_timers++ ||
 		    time_before(timer->expires, base->next_timer))
 			base->next_timer = timer->expires;
@@ -422,7 +387,7 @@ static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 	 * require special care against races with idle_cpu(), lets deal
 	 * with that later.
 	 */
-	if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(base->cpu))
+	if (!(timer->flags & TIMER_DEFERRABLE) || tick_nohz_full_cpu(base->cpu))
 		wake_up_nohz_cpu(base->cpu);
 }
 
@@ -443,7 +408,7 @@ static void timer_stats_account_timer(struct timer_list *timer)
 
 	if (likely(!timer->start_site))
 		return;
-	if (unlikely(tbase_get_deferrable(timer->base)))
+	if (unlikely(timer->flags & TIMER_DEFERRABLE))
 		flag |= TIMER_STATS_FLAG_DEFERRABLE;
 
 	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
@@ -636,7 +601,7 @@ static inline void
 debug_activate(struct timer_list *timer, unsigned long expires)
 {
 	debug_timer_activate(timer);
-	trace_timer_start(timer, expires, tbase_get_deferrable(timer->base));
+	trace_timer_start(timer, expires, timer->flags);
 }
 
 static inline void debug_deactivate(struct timer_list *timer)
@@ -653,10 +618,8 @@ static inline void debug_assert_init(struct timer_list *timer)
 static void do_init_timer(struct timer_list *timer, unsigned int flags,
 			  const char *name, struct lock_class_key *key)
 {
-	struct tvec_base *base = raw_cpu_read(tvec_bases);
-
 	timer->entry.pprev = NULL;
-	timer->base = (void *)((unsigned long)base | flags);
+	timer->flags = flags | raw_smp_processor_id();
 	timer->slack = -1;
 #ifdef CONFIG_TIMER_STATS
 	timer->start_site = NULL;
@@ -701,7 +664,7 @@ static inline void
 detach_expired_timer(struct timer_list *timer, struct tvec_base *base)
 {
 	detach_timer(timer, true);
-	if (!tbase_get_deferrable(timer->base))
+	if (!(timer->flags & TIMER_DEFERRABLE))
 		base->active_timers--;
 	base->all_timers--;
 }
@@ -713,7 +676,7 @@ static int detach_if_pending(struct timer_list *timer, struct tvec_base *base,
 		return 0;
 
 	detach_timer(timer, clear_pending);
-	if (!tbase_get_deferrable(timer->base)) {
+	if (!(timer->flags & TIMER_DEFERRABLE)) {
 		base->active_timers--;
 		if (timer->expires == base->next_timer)
 			base->next_timer = base->timer_jiffies;
@@ -732,24 +695,22 @@ static int detach_if_pending(struct timer_list *timer, struct tvec_base *base,
  * So __run_timers/migrate_timers can safely modify all timers which could
  * be found on ->tvX lists.
  *
- * When the timer's base is locked, and the timer removed from list, it is
- * possible to set timer->base = NULL and drop the lock: the timer remains
- * locked.
+ * When the timer's base is locked and removed from the list, the
+ * TIMER_MIGRATING flag is set, FIXME
  */
 static struct tvec_base *lock_timer_base(struct timer_list *timer,
 					unsigned long *flags)
 	__acquires(timer->base->lock)
 {
-	struct tvec_base *base;
-
 	for (;;) {
-		struct tvec_base *prelock_base = timer->base;
-		base = tbase_get_base(prelock_base);
-		if (likely(base != NULL)) {
+		u32 tf = timer->flags;
+		struct tvec_base *base;
+
+		if (!(tf & TIMER_MIGRATING)) {
+			base = per_cpu_ptr(&tvec_bases, tf & TIMER_CPUMASK);
 			spin_lock_irqsave(&base->lock, *flags);
-			if (likely(prelock_base == timer->base))
+			if (timer->flags == tf)
 				return base;
-			/* The timer has migrated to another CPU */
 			spin_unlock_irqrestore(&base->lock, *flags);
 		}
 		cpu_relax();
@@ -776,7 +737,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	debug_activate(timer, expires);
 
 	cpu = get_nohz_timer_target(pinned);
-	new_base = per_cpu(tvec_bases, cpu);
+	new_base = per_cpu_ptr(&tvec_bases, cpu);
 
 	if (base != new_base) {
 		/*
@@ -788,11 +749,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 		 */
 		if (likely(base->running_timer != timer)) {
 			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
+			timer->flags |= TIMER_MIGRATING;
+
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer_set_base(timer, base);
+			timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
 		}
 	}
 
@@ -954,13 +916,13 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
 	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
+	timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	spin_unlock_irqrestore(&base->lock, flags);
@@ -1025,8 +987,6 @@ int try_to_del_timer_sync(struct timer_list *timer)
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);
-
 /**
  * del_timer_sync - deactivate a timer and wait for the handler to finish.
  * @timer: the timer to be deactivated
@@ -1081,7 +1041,7 @@ int del_timer_sync(struct timer_list *timer)
 	 * don't use it in hardirq context, because it
 	 * could lead to deadlock.
 	 */
-	WARN_ON(in_irq() && !tbase_get_irqsafe(timer->base));
+	WARN_ON(in_irq() && !(timer->flags & TIMER_IRQSAFE));
 	for (;;) {
 		int ret = try_to_del_timer_sync(timer);
 		if (ret >= 0)
@@ -1106,7 +1066,6 @@ static int cascade(struct tvec_base *base, struct tvec *tv, int index)
 	 * don't have to detach them individually.
 	 */
 	hlist_for_each_entry_safe(timer, tmp, &tv_list, entry) {
-		BUG_ON(tbase_get_base(timer->base) != base);
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
 	}
@@ -1202,7 +1161,7 @@ static inline void __run_timers(struct tvec_base *base)
 			timer = hlist_entry(head->first, struct timer_list, entry);
 			fn = timer->function;
 			data = timer->data;
-			irqsafe = tbase_get_irqsafe(timer->base);
+			irqsafe = timer->flags & TIMER_IRQSAFE;
 
 			timer_stats_account_timer(timer);
 
@@ -1242,7 +1201,7 @@ static unsigned long __next_timer_interrupt(struct tvec_base *base)
 	index = slot = timer_jiffies & TVR_MASK;
 	do {
 		hlist_for_each_entry(nte, base->tv1.vec + slot, entry) {
-			if (tbase_get_deferrable(nte->base))
+			if (nte->flags & TIMER_DEFERRABLE)
 				continue;
 
 			found = 1;
@@ -1273,7 +1232,7 @@ cascade:
 		index = slot = timer_jiffies & TVN_MASK;
 		do {
 			hlist_for_each_entry(nte, varp->vec + slot, entry) {
-				if (tbase_get_deferrable(nte->base))
+				if (nte->flags & TIMER_DEFERRABLE)
 					continue;
 
 				found = 1;
@@ -1343,7 +1302,7 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
  */
 u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 {
-	struct tvec_base *base = __this_cpu_read(tvec_bases);
+	struct tvec_base *base = this_cpu_ptr(&tvec_bases);
 	u64 expires = KTIME_MAX;
 	unsigned long nextevt;
 
@@ -1395,7 +1354,7 @@ void update_process_times(int user_tick)
  */
 static void run_timer_softirq(struct softirq_action *h)
 {
-	struct tvec_base *base = __this_cpu_read(tvec_bases);
+	struct tvec_base *base = this_cpu_ptr(&tvec_bases);
 
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
@@ -1534,12 +1493,13 @@ EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
 {
 	struct timer_list *timer;
+	int cpu = new_base->cpu;
 
 	while (!hlist_empty(head)) {
 		timer = hlist_entry(head->first, struct timer_list, entry);
 		/* We ignore the accounting on the dying cpu */
 		detach_timer(timer, false);
-		timer_set_base(timer, new_base);
+		timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
 		internal_add_timer(new_base, timer);
 	}
 }
@@ -1551,8 +1511,8 @@ static void migrate_timers(int cpu)
 	int i;
 
 	BUG_ON(cpu_online(cpu));
-	old_base = per_cpu(tvec_bases, cpu);
-	new_base = get_cpu_var(tvec_bases);
+	old_base = per_cpu_ptr(&tvec_bases, cpu);
+	new_base = this_cpu_ptr(&tvec_bases);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
@@ -1576,7 +1536,6 @@ static void migrate_timers(int cpu)
 
 	spin_unlock(&old_base->lock);
 	spin_unlock_irq(&new_base->lock);
-	put_cpu_var(tvec_bases);
 }
 
 static int timer_cpu_notify(struct notifier_block *self,
@@ -1602,12 +1561,11 @@ static inline void timer_register_cpu_notifier(void)
 static inline void timer_register_cpu_notifier(void) { }
 #endif /* CONFIG_HOTPLUG_CPU */
 
-static void __init init_timer_cpu(struct tvec_base *base, int cpu)
+static void __init init_timer_cpu(int cpu)
 {
-	BUG_ON(base != tbase_get_base(base));
+	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
 
 	base->cpu = cpu;
-	per_cpu(tvec_bases, cpu) = base;
 	spin_lock_init(&base->lock);
 
 	base->timer_jiffies = jiffies;
@@ -1616,27 +1574,14 @@ static void __init init_timer_cpu(struct tvec_base *base, int cpu)
 
 static void __init init_timer_cpus(void)
 {
-	struct tvec_base *base;
-	int local_cpu = smp_processor_id();
 	int cpu;
 
-	for_each_possible_cpu(cpu) {
-		if (cpu == local_cpu)
-			base = &boot_tvec_bases;
-#ifdef CONFIG_SMP
-		else
-			base = per_cpu_ptr(&__tvec_bases, cpu);
-#endif
-
-		init_timer_cpu(base, cpu);
-	}
+	for_each_possible_cpu(cpu)
+		init_timer_cpu(cpu);
 }
 
 void __init init_timers(void)
 {
-	/* ensure there are enough low bits for flags in timer->base pointer */
-	BUILD_BUG_ON(__alignof__(struct tvec_base) & TIMER_FLAG_MASK);
-
 	init_timer_cpus();
 	init_timer_stats();
 	timer_register_cpu_notifier();
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [tip:timers/core] timer: Stats: Simplify the flags handling
  2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
@ 2015-06-19 13:23   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2015-06-19 13:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: paulmck, hpa, edumazet, viresh.kumar, fweisbec, peterz, tglx,
	mingo, john.stultz, linux-kernel, joonwoop, wenbo.wang

Commit-ID:  c74441a17eb975b604e339ca6c11b9ab9aaca11f
Gitweb:     http://git.kernel.org/tip/c74441a17eb975b604e339ca6c11b9ab9aaca11f
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 26 May 2015 22:50:31 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 15:18:27 +0200

timer: Stats: Simplify the flags handling

Simplify the handling of the flag storage for the timer statistics. No
intermediate storage anymore. Just hand over the flags field.

I left the printout of 'deferrable' for now because changing this
would be an ABI update and I have no idea how strong people feel about
that. OTOH, I wonder whether we should kill the whole timer stats
stuff because all of that information can be retrieved via ftrace/perf
as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/timer.h     |  5 +----
 kernel/time/timer.c       |  7 ++-----
 kernel/time/timer_stats.c | 10 +++++-----
 3 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 4a0d52b..ff0689b 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -188,13 +188,10 @@ extern void set_timer_slack(struct timer_list *time, int slack_hz);
 
 extern int timer_stats_active;
 
-#define TIMER_STATS_FLAG_DEFERRABLE	0x1
-
 extern void init_timer_stats(void);
 
 extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
-				     void *timerf, char *comm,
-				     unsigned int timer_flag);
+				     void *timerf, char *comm, u32 flags);
 
 extern void __timer_stats_timer_set_start_info(struct timer_list *timer,
 					       void *addr);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 1540af9..3398d93 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -404,15 +404,12 @@ void __timer_stats_timer_set_start_info(struct timer_list *timer, void *addr)
 
 static void timer_stats_account_timer(struct timer_list *timer)
 {
-	unsigned int flag = 0;
-
 	if (likely(!timer->start_site))
 		return;
-	if (unlikely(timer->flags & TIMER_DEFERRABLE))
-		flag |= TIMER_STATS_FLAG_DEFERRABLE;
 
 	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
-				 timer->function, timer->start_comm, flag);
+				 timer->function, timer->start_comm,
+				 timer->flags);
 }
 
 #else
diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index 1fb08f2..1adecb4 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -68,7 +68,7 @@ struct entry {
 	 * Number of timeout events:
 	 */
 	unsigned long		count;
-	unsigned int		timer_flag;
+	u32			flags;
 
 	/*
 	 * We save the command-line string to preserve
@@ -227,13 +227,13 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
  * @startf:	pointer to the function which did the timer setup
  * @timerf:	pointer to the timer callback function of the timer
  * @comm:	name of the process which set up the timer
+ * @tflags:	The flags field of the timer
  *
  * When the timer is already registered, then the event counter is
  * incremented. Otherwise the timer is registered in a free slot.
  */
 void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
-			      void *timerf, char *comm,
-			      unsigned int timer_flag)
+			      void *timerf, char *comm, u32 tflags)
 {
 	/*
 	 * It doesn't matter which lock we take:
@@ -251,7 +251,7 @@ void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
 	input.start_func = startf;
 	input.expire_func = timerf;
 	input.pid = pid;
-	input.timer_flag = timer_flag;
+	input.flags = tflags;
 
 	raw_spin_lock_irqsave(lock, flags);
 	if (!timer_stats_active)
@@ -306,7 +306,7 @@ static int tstats_show(struct seq_file *m, void *v)
 
 	for (i = 0; i < nr_entries; i++) {
 		entry = entries + i;
-		if (entry->timer_flag & TIMER_STATS_FLAG_DEFERRABLE) {
+		if (entry->flags & TIMER_DEFERRABLE) {
 			seq_printf(m, "%4luD, %5d %-16s ",
 				entry->count, entry->pid, entry->comm);
 		} else {
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [tip:timers/core] timer: Reduce timer migration overhead if disabled
  2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
@ 2015-06-19 13:23   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2015-06-19 13:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: fweisbec, viresh.kumar, hpa, peterz, joonwoop, edumazet,
	wenbo.wang, tglx, linux-kernel, john.stultz, mingo, paulmck

Commit-ID:  bc7a34b8b9ebfb0f4b8a35a72a0b134fd6c5ef50
Gitweb:     http://git.kernel.org/tip/bc7a34b8b9ebfb0f4b8a35a72a0b134fd6c5ef50
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 26 May 2015 22:50:33 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 15:18:28 +0200

timer: Reduce timer migration overhead if disabled

Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h      |  2 ++
 include/linux/sched.h        |  6 +----
 include/linux/sched/sysctl.h | 12 ---------
 include/linux/timer.h        |  9 +++++++
 kernel/rcu/tree_plugin.h     |  2 --
 kernel/sched/core.c          |  9 +++----
 kernel/sysctl.c              | 18 +++++++-------
 kernel/time/hrtimer.c        | 35 ++++++++++++++++++++------
 kernel/time/tick-internal.h  | 14 +++++++++++
 kernel/time/tick-sched.c     | 25 ++++++++++---------
 kernel/time/timer.c          | 59 ++++++++++++++++++++++++++++++++++++++++----
 kernel/time/timer_list.c     |  2 --
 12 files changed, 133 insertions(+), 60 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 5db0558..6955102 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -163,6 +163,7 @@ enum  hrtimer_base_type {
  * @cpu:		cpu number
  * @active_bases:	Bitfield to mark bases with active timers
  * @clock_was_set_seq:	Sequence counter of clock was set events
+ * @migration_enabled:	The migration of hrtimers to other cpus is enabled
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @next_timer:		Pointer to the first expiring timer
@@ -186,6 +187,7 @@ struct hrtimer_cpu_base {
 	unsigned int			cpu;
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
+	bool				migration_enabled;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	unsigned int			in_hrtirq	: 1,
 					hres_active	: 1,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a2e61..d715146 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -335,14 +335,10 @@ extern int runqueue_is_locked(int cpu);
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern void set_cpu_sd_state_idle(void);
-extern int get_nohz_timer_target(int pinned);
+extern int get_nohz_timer_target(void);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
 static inline void set_cpu_sd_state_idle(void) { }
-static inline int get_nohz_timer_target(int pinned)
-{
-	return smp_processor_id();
-}
 #endif
 
 /*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e0..c9e4731 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -57,24 +57,12 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
-extern unsigned int sysctl_timer_migration;
 extern unsigned int sysctl_sched_shares_window;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
 #endif
-#ifdef CONFIG_SCHED_DEBUG
-static inline unsigned int get_sysctl_timer_migration(void)
-{
-	return sysctl_timer_migration;
-}
-#else
-static inline unsigned int get_sysctl_timer_migration(void)
-{
-	return 1;
-}
-#endif
 
 /*
  *  control realtime throttling:
diff --git a/include/linux/timer.h b/include/linux/timer.h
index ff0689b..61aa61d 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -238,6 +238,15 @@ extern void run_local_timers(void);
 struct hrtimer;
 extern enum hrtimer_restart it_real_fn(struct hrtimer *);
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+#include <linux/sysctl.h>
+
+extern unsigned int sysctl_timer_migration;
+int timer_migration_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos);
+#endif
+
 unsigned long __round_jiffies(unsigned long j, int cpu);
 unsigned long __round_jiffies_relative(unsigned long j, int cpu);
 unsigned long round_jiffies(unsigned long j);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0ef80a0..d72fa24 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1432,8 +1432,6 @@ module_param(rcu_idle_gp_delay, int, 0644);
 static int rcu_idle_lazy_gp_delay = RCU_IDLE_LAZY_GP_DELAY;
 module_param(rcu_idle_lazy_gp_delay, int, 0644);
 
-extern int tick_nohz_active;
-
 /*
  * Try to advance callbacks for all flavors of RCU on the current CPU, but
  * only if it has been awhile since the last time we did so.  Afterwards,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ecb7c42..e9f25ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -572,13 +572,12 @@ void resched_cpu(int cpu)
  * selecting an idle cpu will add more delays to the timers than intended
  * (as that cpu's timer base may not be uptodate wrt jiffies etc).
  */
-int get_nohz_timer_target(int pinned)
+int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
-	int i;
+	int i, cpu = smp_processor_id();
 	struct sched_domain *sd;
 
-	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
+	if (!idle_cpu(cpu))
 		return cpu;
 
 	rcu_read_lock();
@@ -7050,8 +7049,6 @@ void __init sched_init_smp(void)
 }
 #endif /* CONFIG_SMP */
 
-const_debug unsigned int sysctl_timer_migration = 1;
-
 int in_sched_functions(unsigned long addr)
 {
 	return in_lock_functions(addr) ||
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a..b13e9d2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -349,15 +349,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "timer_migration",
-		.data		= &sysctl_timer_migration,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
-		.extra2		= &one,
-	},
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
@@ -1132,6 +1123,15 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+	{
+		.procname	= "timer_migration",
+		.data		= &sysctl_timer_migration,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= timer_migration_handler,
+	},
+#endif
 	{ }
 };
 
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index f026413..6115f4d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -177,6 +177,24 @@ hrtimer_check_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base)
 #endif
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static inline
+struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
+					 int pinned)
+{
+	if (pinned || !base->migration_enabled)
+		return this_cpu_ptr(&hrtimer_bases);
+	return &per_cpu(hrtimer_bases, get_nohz_timer_target());
+}
+#else
+static inline
+struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
+					 int pinned)
+{
+	return this_cpu_ptr(&hrtimer_bases);
+}
+#endif
+
 /*
  * Switch the timer base to the current CPU when possible.
  */
@@ -184,14 +202,13 @@ static inline struct hrtimer_clock_base *
 switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 		    int pinned)
 {
+	struct hrtimer_cpu_base *new_cpu_base, *this_base;
 	struct hrtimer_clock_base *new_base;
-	struct hrtimer_cpu_base *new_cpu_base;
-	int this_cpu = smp_processor_id();
-	int cpu = get_nohz_timer_target(pinned);
 	int basenum = base->index;
 
+	this_base = this_cpu_ptr(&hrtimer_bases);
+	new_cpu_base = get_target_base(this_base, pinned);
 again:
-	new_cpu_base = &per_cpu(hrtimer_bases, cpu);
 	new_base = &new_cpu_base->clock_base[basenum];
 
 	if (base != new_base) {
@@ -212,17 +229,19 @@ again:
 		raw_spin_unlock(&base->cpu_base->lock);
 		raw_spin_lock(&new_base->cpu_base->lock);
 
-		if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
-			cpu = this_cpu;
+		if (new_cpu_base != this_base &&
+		    hrtimer_check_target(timer, new_base)) {
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
+			new_cpu_base = this_base;
 			timer->base = base;
 			goto again;
 		}
 		timer->base = new_base;
 	} else {
-		if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
-			cpu = this_cpu;
+		if (new_cpu_base != this_base &&
+		    hrtimer_check_target(timer, new_base)) {
+			new_cpu_base = this_base;
 			goto again;
 		}
 	}
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index ec2208a..2edde84 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -149,4 +149,18 @@ extern void tick_nohz_init(void);
 static inline void tick_nohz_init(void) { }
 #endif
 
+#ifdef CONFIG_NO_HZ_COMMON
+extern unsigned long tick_nohz_active;
+#else
+#define tick_nohz_active (0)
+#endif
+
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+extern void timers_update_migration(void);
+#else
+static inline void timers_update_migration(void) { }
+#endif
+
+DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
+
 extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 812f7a3..b1cb016 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -399,7 +399,7 @@ void __init tick_nohz_init(void)
  * NO HZ enabled ?
  */
 static int tick_nohz_enabled __read_mostly  = 1;
-int tick_nohz_active  __read_mostly;
+unsigned long tick_nohz_active  __read_mostly;
 /*
  * Enable / Disable tickless mode
  */
@@ -956,6 +956,16 @@ static void tick_nohz_handler(struct clock_event_device *dev)
 	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
 }
 
+static inline void tick_nohz_activate(struct tick_sched *ts, int mode)
+{
+	if (!tick_nohz_enabled)
+		return;
+	ts->nohz_mode = mode;
+	/* One update is enough */
+	if (!test_and_set_bit(0, &tick_nohz_active))
+		timers_update_migration();
+}
+
 /**
  * tick_nohz_switch_to_nohz - switch to nohz mode
  */
@@ -970,9 +980,6 @@ static void tick_nohz_switch_to_nohz(void)
 	if (tick_switch_to_oneshot(tick_nohz_handler))
 		return;
 
-	tick_nohz_active = 1;
-	ts->nohz_mode = NOHZ_MODE_LOWRES;
-
 	/*
 	 * Recycle the hrtimer in ts, so we can share the
 	 * hrtimer_forward with the highres code.
@@ -984,6 +991,7 @@ static void tick_nohz_switch_to_nohz(void)
 	hrtimer_forward_now(&ts->sched_timer, tick_period);
 	hrtimer_set_expires(&ts->sched_timer, next);
 	tick_program_event(next, 1);
+	tick_nohz_activate(ts, NOHZ_MODE_LOWRES);
 }
 
 /*
@@ -1035,6 +1043,7 @@ static inline void tick_nohz_irq_enter(void)
 
 static inline void tick_nohz_switch_to_nohz(void) { }
 static inline void tick_nohz_irq_enter(void) { }
+static inline void tick_nohz_activate(struct tick_sched *ts, int mode) { }
 
 #endif /* CONFIG_NO_HZ_COMMON */
 
@@ -1117,13 +1126,7 @@ void tick_setup_sched_timer(void)
 
 	hrtimer_forward(&ts->sched_timer, now, tick_period);
 	hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
-
-#ifdef CONFIG_NO_HZ_COMMON
-	if (tick_nohz_enabled) {
-		ts->nohz_mode = NOHZ_MODE_HIGHRES;
-		tick_nohz_active = 1;
-	}
-#endif
+	tick_nohz_activate(ts, NOHZ_MODE_HIGHRES);
 }
 #endif /* HIGH_RES_TIMERS */
 
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 3398d93..343142e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -85,6 +85,7 @@ struct tvec_base {
 	unsigned long active_timers;
 	unsigned long all_timers;
 	int cpu;
+	bool migration_enabled;
 	struct tvec_root tv1;
 	struct tvec tv2;
 	struct tvec tv3;
@@ -95,6 +96,54 @@ struct tvec_base {
 
 static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+unsigned int sysctl_timer_migration = 1;
+
+void timers_update_migration(void)
+{
+	bool on = sysctl_timer_migration && tick_nohz_active;
+	unsigned int cpu;
+
+	/* Avoid the loop, if nothing to update */
+	if (this_cpu_read(tvec_bases.migration_enabled) == on)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(tvec_bases.migration_enabled, cpu) = on;
+		per_cpu(hrtimer_bases.migration_enabled, cpu) = on;
+	}
+}
+
+int timer_migration_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos)
+{
+	static DEFINE_MUTEX(mutex);
+	int ret;
+
+	mutex_lock(&mutex);
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (!ret && write)
+		timers_update_migration();
+	mutex_unlock(&mutex);
+	return ret;
+}
+
+static inline struct tvec_base *get_target_base(struct tvec_base *base,
+						int pinned)
+{
+	if (pinned || !base->migration_enabled)
+		return this_cpu_ptr(&tvec_bases);
+	return per_cpu_ptr(&tvec_bases, get_nohz_timer_target());
+}
+#else
+static inline struct tvec_base *get_target_base(struct tvec_base *base,
+						int pinned)
+{
+	return this_cpu_ptr(&tvec_bases);
+}
+#endif
+
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
 		bool force_up)
 {
@@ -716,11 +765,11 @@ static struct tvec_base *lock_timer_base(struct timer_list *timer,
 
 static inline int
 __mod_timer(struct timer_list *timer, unsigned long expires,
-						bool pending_only, int pinned)
+	    bool pending_only, int pinned)
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0 , cpu;
+	int ret = 0;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -733,8 +782,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 
 	debug_activate(timer, expires);
 
-	cpu = get_nohz_timer_target(pinned);
-	new_base = per_cpu_ptr(&tvec_bases, cpu);
+	new_base = get_target_base(base, pinned);
 
 	if (base != new_base) {
 		/*
@@ -751,7 +799,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
+			timer->flags &= ~TIMER_BASEMASK;
+			timer->flags |= base->cpu;
 		}
 	}
 
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 1327004..a4536e1 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -29,8 +29,6 @@ struct timer_list_iter {
 
 typedef void (*print_fn_t)(struct seq_file *m, unsigned int *classes);
 
-DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
-
 /*
  * This allows printing both to /proc/timer_list and
  * to the console (on SysRq-Q):
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [tip:timers/core] timer: Minimize nohz off overhead
  2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
@ 2015-06-19 13:24   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2015-06-19 13:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: joonwoop, tglx, edumazet, paulmck, peterz, linux-kernel, hpa,
	fweisbec, john.stultz, wenbo.wang, mingo, viresh.kumar

Commit-ID:  683be13a284720205228e29207ef11a1c3c322b9
Gitweb:     http://git.kernel.org/tip/683be13a284720205228e29207ef11a1c3c322b9
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 26 May 2015 22:50:35 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 15:18:28 +0200

timer: Minimize nohz off overhead

If nohz is disabled on the kernel command line the [hr]timer code
still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
avoid the overhead.

Before:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.73%  hog       [.] main
  15.36%  [kernel]  [k] _raw_spin_lock_irqsave
   9.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.61%  [kernel]  [k] lock_timer_base.isra.38
   6.42%  [kernel]  [k] mod_timer
   3.90%  [kernel]  [k] detach_if_pending
   3.76%  [kernel]  [k] del_timer
   2.41%  [kernel]  [k] internal_add_timer
   1.39%  [kernel]  [k] __internal_add_timer
   0.76%  [kernel]  [k] timerfn

We probably should have a cached value for nohz full in the per cpu
bases as well to avoid the cpumask check. The base cache line is hot
already, the cpumask not necessarily.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h     |  2 ++
 kernel/time/hrtimer.c       |  3 ++-
 kernel/time/tick-internal.h |  4 ++--
 kernel/time/tick-sched.c    |  2 +-
 kernel/time/timer.c         | 16 ++++++++++++----
 5 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 6955102..76dd4f0 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -164,6 +164,7 @@ enum  hrtimer_base_type {
  * @active_bases:	Bitfield to mark bases with active timers
  * @clock_was_set_seq:	Sequence counter of clock was set events
  * @migration_enabled:	The migration of hrtimers to other cpus is enabled
+ * @nohz_active:	The nohz functionality is enabled
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @next_timer:		Pointer to the first expiring timer
@@ -188,6 +189,7 @@ struct hrtimer_cpu_base {
 	unsigned int			active_bases;
 	unsigned int			clock_was_set_seq;
 	bool				migration_enabled;
+	bool				nohz_active;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	unsigned int			in_hrtirq	: 1,
 					hres_active	: 1,
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 6115f4d..db5c950 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -994,7 +994,8 @@ void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 		 * Kick to reschedule the next tick to handle the new timer
 		 * on dynticks target.
 		 */
-		wake_up_nohz_cpu(new_base->cpu_base->cpu);
+		if (new_base->cpu_base->nohz_active)
+			wake_up_nohz_cpu(new_base->cpu_base->cpu);
 	} else {
 		hrtimer_reprogram(timer, new_base);
 	}
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 2edde84..966a5a6 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -156,9 +156,9 @@ extern unsigned long tick_nohz_active;
 #endif
 
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
-extern void timers_update_migration(void);
+extern void timers_update_migration(bool update_nohz);
 #else
-static inline void timers_update_migration(void) { }
+static inline void timers_update_migration(bool update_nohz) { }
 #endif
 
 DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b1cb016..c792429 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -963,7 +963,7 @@ static inline void tick_nohz_activate(struct tick_sched *ts, int mode)
 	ts->nohz_mode = mode;
 	/* One update is enough */
 	if (!test_and_set_bit(0, &tick_nohz_active))
-		timers_update_migration();
+		timers_update_migration(true);
 }
 
 /**
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 343142e..520499d 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -86,6 +86,7 @@ struct tvec_base {
 	unsigned long all_timers;
 	int cpu;
 	bool migration_enabled;
+	bool nohz_active;
 	struct tvec_root tv1;
 	struct tvec tv2;
 	struct tvec tv3;
@@ -99,7 +100,7 @@ static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 unsigned int sysctl_timer_migration = 1;
 
-void timers_update_migration(void)
+void timers_update_migration(bool update_nohz)
 {
 	bool on = sysctl_timer_migration && tick_nohz_active;
 	unsigned int cpu;
@@ -111,6 +112,10 @@ void timers_update_migration(void)
 	for_each_possible_cpu(cpu) {
 		per_cpu(tvec_bases.migration_enabled, cpu) = on;
 		per_cpu(hrtimer_bases.migration_enabled, cpu) = on;
+		if (!update_nohz)
+			continue;
+		per_cpu(tvec_bases.nohz_active, cpu) = true;
+		per_cpu(hrtimer_bases.nohz_active, cpu) = true;
 	}
 }
 
@@ -124,7 +129,7 @@ int timer_migration_handler(struct ctl_table *table, int write,
 	mutex_lock(&mutex);
 	ret = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (!ret && write)
-		timers_update_migration();
+		timers_update_migration(false);
 	mutex_unlock(&mutex);
 	return ret;
 }
@@ -436,8 +441,11 @@ static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 	 * require special care against races with idle_cpu(), lets deal
 	 * with that later.
 	 */
-	if (!(timer->flags & TIMER_DEFERRABLE) || tick_nohz_full_cpu(base->cpu))
-		wake_up_nohz_cpu(base->cpu);
+	if (base->nohz_active) {
+		if (!(timer->flags & TIMER_DEFERRABLE) ||
+		    tick_nohz_full_cpu(base->cpu))
+			wake_up_nohz_cpu(base->cpu);
+	}
 }
 
 #ifdef CONFIG_TIMER_STATS
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH] timer: Fix unsafe cpu variable access in migrate_timers
  2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
  2015-05-27  9:22   ` Viresh Kumar
  2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
@ 2015-06-27  9:55   ` Jan Kiszka
  2015-06-27 11:00     ` Borislav Petkov
  2 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2015-06-27  9:55 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang, Steven Rostedt, Badhri Jagan Sridharan

From: Jan Kiszka <jan.kiszka@siemens.com>

migrate_timers is invoked with preemption enabled. Therefore we have to
get/put the cpu-local variable tvec_bases like before commit 0eeda71bc3.

This fixes

BUG: using smp_processor_id() in preemptible [00000000] code: bash/4917
caller is debug_smp_processor_id+0x17/0x19
CPU: 0 PID: 4917 Comm: bash Not tainted 4.1.0-dbg+ #97
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
 ffff880038674040 ffff880002e6fb98 ffffffff815356a0 0000000000000002
 0000000000000000 ffff880002e6fbc8 ffffffff8130c16f ffff88003fd8d500
 000000000000d500 0000000000000000 0000000000000003 ffff880002e6fbd8
Call Trace:
 [<ffffffff815356a0>] dump_stack+0x4f/0x7b
 [<ffffffff8130c16f>] check_preemption_disabled+0xdd/0xef
 [<ffffffff8130c198>] debug_smp_processor_id+0x17/0x19
 [<ffffffff810a9b37>] timer_cpu_notify+0x4f/0x112
 [<ffffffff8106718b>] notifier_call_chain+0x4c/0x71
 [<ffffffff810671be>] __raw_notifier_call_chain+0xe/0x10
 [<ffffffff810484b6>] __cpu_notify+0x20/0x37
 [<ffffffff810484e0>] cpu_notify+0x13/0x15
 [<ffffffff81048591>] cpu_notify_nofail+0xe/0x16
 [<ffffffff8152efb2>] _cpu_down+0x178/0x268
 [<ffffffff8108bc3a>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8152f0ca>] cpu_down+0x28/0x3c
 [<ffffffff813cbdb9>] cpu_subsys_offline+0x14/0x16
 [<ffffffff813c779d>] device_offline+0x7d/0xb1
 [<ffffffff813c78a2>] online_store+0x48/0x68
 [<ffffffff813c5544>] dev_attr_store+0x18/0x22
 [<ffffffff811dac6c>] sysfs_kf_write+0x49/0x51
 [<ffffffff811da139>] kernfs_fop_write+0x105/0x158
 [<ffffffff8116c54f>] __vfs_write+0x28/0xbd
 [<ffffffff812ab014>] ? security_file_permission+0x23/0x90
 [<ffffffff8116ccb2>] vfs_write+0xb2/0x169
 [<ffffffff8116d717>] SyS_write+0x4a/0x91
 [<ffffffff8153d12e>] entry_SYSCALL_64_fastpath+0x12/0x76

triggered when offlining a CPU, e.g. via sysfs.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 kernel/time/timer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 520499d..c826178 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1566,7 +1566,7 @@ static void migrate_timers(int cpu)
 
 	BUG_ON(cpu_online(cpu));
 	old_base = per_cpu_ptr(&tvec_bases, cpu);
-	new_base = this_cpu_ptr(&tvec_bases);
+	new_base = &get_cpu_var(tvec_bases);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
@@ -1590,6 +1590,7 @@ static void migrate_timers(int cpu)
 
 	spin_unlock(&old_base->lock);
 	spin_unlock_irq(&new_base->lock);
+	put_cpu_var(tvec_bases);
 }
 
 static int timer_cpu_notify(struct notifier_block *self,

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] timer: Fix unsafe cpu variable access in migrate_timers
  2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
@ 2015-06-27 11:00     ` Borislav Petkov
  2015-06-27 11:19       ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Borislav Petkov @ 2015-06-27 11:00 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Peter Zijlstra,
	Paul McKenney, Frederic Weisbecker, Eric Dumazet, Viresh Kumar,
	John Stultz, Joonwoo Park, Wenbo Wang, Steven Rostedt,
	Badhri Jagan Sridharan

On Sat, Jun 27, 2015 at 11:55:49AM +0200, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> migrate_timers is invoked with preemption enabled. Therefore we have to
> get/put the cpu-local variable tvec_bases like before commit 0eeda71bc3.
> 
> This fixes
> 
> BUG: using smp_processor_id() in preemptible [00000000] code: bash/4917
> caller is debug_smp_processor_id+0x17/0x19
> CPU: 0 PID: 4917 Comm: bash Not tainted 4.1.0-dbg+ #97
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>  ffff880038674040 ffff880002e6fb98 ffffffff815356a0 0000000000000002
>  0000000000000000 ffff880002e6fbc8 ffffffff8130c16f ffff88003fd8d500
>  000000000000d500 0000000000000000 0000000000000003 ffff880002e6fbd8
> Call Trace:
>  [<ffffffff815356a0>] dump_stack+0x4f/0x7b
>  [<ffffffff8130c16f>] check_preemption_disabled+0xdd/0xef
>  [<ffffffff8130c198>] debug_smp_processor_id+0x17/0x19
>  [<ffffffff810a9b37>] timer_cpu_notify+0x4f/0x112
>  [<ffffffff8106718b>] notifier_call_chain+0x4c/0x71
>  [<ffffffff810671be>] __raw_notifier_call_chain+0xe/0x10
>  [<ffffffff810484b6>] __cpu_notify+0x20/0x37
>  [<ffffffff810484e0>] cpu_notify+0x13/0x15
>  [<ffffffff81048591>] cpu_notify_nofail+0xe/0x16
>  [<ffffffff8152efb2>] _cpu_down+0x178/0x268
>  [<ffffffff8108bc3a>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff8152f0ca>] cpu_down+0x28/0x3c
>  [<ffffffff813cbdb9>] cpu_subsys_offline+0x14/0x16
>  [<ffffffff813c779d>] device_offline+0x7d/0xb1
>  [<ffffffff813c78a2>] online_store+0x48/0x68
>  [<ffffffff813c5544>] dev_attr_store+0x18/0x22
>  [<ffffffff811dac6c>] sysfs_kf_write+0x49/0x51
>  [<ffffffff811da139>] kernfs_fop_write+0x105/0x158
>  [<ffffffff8116c54f>] __vfs_write+0x28/0xbd
>  [<ffffffff812ab014>] ? security_file_permission+0x23/0x90
>  [<ffffffff8116ccb2>] vfs_write+0xb2/0x169
>  [<ffffffff8116d717>] SyS_write+0x4a/0x91
>  [<ffffffff8153d12e>] entry_SYSCALL_64_fastpath+0x12/0x76
> 
> triggered when offlining a CPU, e.g. via sysfs.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>  kernel/time/timer.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 520499d..c826178 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1566,7 +1566,7 @@ static void migrate_timers(int cpu)
>  
>  	BUG_ON(cpu_online(cpu));
>  	old_base = per_cpu_ptr(&tvec_bases, cpu);
> -	new_base = this_cpu_ptr(&tvec_bases);
> +	new_base = &get_cpu_var(tvec_bases);

Hmm, tglx's version doesn't disable preemtion around it:

http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=24bfcb100959c8641a627b5604d967243f8f240c

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] timer: Fix unsafe cpu variable access in migrate_timers
  2015-06-27 11:00     ` Borislav Petkov
@ 2015-06-27 11:19       ` Jan Kiszka
  2015-06-27 11:25         ` Borislav Petkov
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2015-06-27 11:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Peter Zijlstra,
	Paul McKenney, Frederic Weisbecker, Eric Dumazet, Viresh Kumar,
	John Stultz, Joonwoo Park, Wenbo Wang, Steven Rostedt,
	Badhri Jagan Sridharan

[-- Attachment #1: Type: text/plain, Size: 3032 bytes --]

On 2015-06-27 13:00, Borislav Petkov wrote:
> On Sat, Jun 27, 2015 at 11:55:49AM +0200, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> migrate_timers is invoked with preemption enabled. Therefore we have to
>> get/put the cpu-local variable tvec_bases like before commit 0eeda71bc3.
>>
>> This fixes
>>
>> BUG: using smp_processor_id() in preemptible [00000000] code: bash/4917
>> caller is debug_smp_processor_id+0x17/0x19
>> CPU: 0 PID: 4917 Comm: bash Not tainted 4.1.0-dbg+ #97
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>  ffff880038674040 ffff880002e6fb98 ffffffff815356a0 0000000000000002
>>  0000000000000000 ffff880002e6fbc8 ffffffff8130c16f ffff88003fd8d500
>>  000000000000d500 0000000000000000 0000000000000003 ffff880002e6fbd8
>> Call Trace:
>>  [<ffffffff815356a0>] dump_stack+0x4f/0x7b
>>  [<ffffffff8130c16f>] check_preemption_disabled+0xdd/0xef
>>  [<ffffffff8130c198>] debug_smp_processor_id+0x17/0x19
>>  [<ffffffff810a9b37>] timer_cpu_notify+0x4f/0x112
>>  [<ffffffff8106718b>] notifier_call_chain+0x4c/0x71
>>  [<ffffffff810671be>] __raw_notifier_call_chain+0xe/0x10
>>  [<ffffffff810484b6>] __cpu_notify+0x20/0x37
>>  [<ffffffff810484e0>] cpu_notify+0x13/0x15
>>  [<ffffffff81048591>] cpu_notify_nofail+0xe/0x16
>>  [<ffffffff8152efb2>] _cpu_down+0x178/0x268
>>  [<ffffffff8108bc3a>] ? trace_hardirqs_on+0xd/0xf
>>  [<ffffffff8152f0ca>] cpu_down+0x28/0x3c
>>  [<ffffffff813cbdb9>] cpu_subsys_offline+0x14/0x16
>>  [<ffffffff813c779d>] device_offline+0x7d/0xb1
>>  [<ffffffff813c78a2>] online_store+0x48/0x68
>>  [<ffffffff813c5544>] dev_attr_store+0x18/0x22
>>  [<ffffffff811dac6c>] sysfs_kf_write+0x49/0x51
>>  [<ffffffff811da139>] kernfs_fop_write+0x105/0x158
>>  [<ffffffff8116c54f>] __vfs_write+0x28/0xbd
>>  [<ffffffff812ab014>] ? security_file_permission+0x23/0x90
>>  [<ffffffff8116ccb2>] vfs_write+0xb2/0x169
>>  [<ffffffff8116d717>] SyS_write+0x4a/0x91
>>  [<ffffffff8153d12e>] entry_SYSCALL_64_fastpath+0x12/0x76
>>
>> triggered when offlining a CPU, e.g. via sysfs.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>  kernel/time/timer.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
>> index 520499d..c826178 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -1566,7 +1566,7 @@ static void migrate_timers(int cpu)
>>  
>>  	BUG_ON(cpu_online(cpu));
>>  	old_base = per_cpu_ptr(&tvec_bases, cpu);
>> -	new_base = this_cpu_ptr(&tvec_bases);
>> +	new_base = &get_cpu_var(tvec_bases);
> 
> Hmm, tglx's version doesn't disable preemtion around it:
> 
> http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=24bfcb100959c8641a627b5604d967243f8f240c
> 

Oh, there is a fix already. It's just the same: both implicitly disable
preemption via get_cpu_ptr/var, look at the macros.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] timer: Fix unsafe cpu variable access in migrate_timers
  2015-06-27 11:19       ` Jan Kiszka
@ 2015-06-27 11:25         ` Borislav Petkov
  0 siblings, 0 replies; 27+ messages in thread
From: Borislav Petkov @ 2015-06-27 11:25 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Peter Zijlstra,
	Paul McKenney, Frederic Weisbecker, Eric Dumazet, Viresh Kumar,
	John Stultz, Joonwoo Park, Wenbo Wang, Steven Rostedt,
	Badhri Jagan Sridharan

On Sat, Jun 27, 2015 at 01:19:24PM +0200, Jan Kiszka wrote:
> Oh, there is a fix already. It's just the same: both implicitly
> disable preemption via get_cpu_ptr/var, look at the macros.

Ah, yes.

/me goes to the coffee machine to make eyes open.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-06-27 11:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
2015-05-27  5:39   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
2015-05-27  9:11   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
2015-05-27  9:13   ` Viresh Kumar
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
2015-05-27  9:22   ` Viresh Kumar
2015-05-27 12:09     ` Peter Zijlstra
2015-06-02 13:58       ` Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
2015-06-27 11:00     ` Borislav Petkov
2015-06-27 11:19       ` Jan Kiszka
2015-06-27 11:25         ` Borislav Petkov
2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] timer: Stats: " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
2015-06-19 13:24   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-27 14:53 ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Eric Dumazet
2015-06-02 13:58   ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).