Re: [patch 13/20] timer: Switch to a non cascading wheel

From: "George Spelvin" <linux@sciencehorizons.net>
To: tglx@linutronix.de
Cc: edumazet@google.com, linux@sciencehorizons.net,
	linux-kernel@vger.kernel.org, peterz@infradead.org,
	richardcochran@gmail.com
Subject: Re: [patch 13/20] timer: Switch to a non cascading wheel
Date: 14 Jun 2016 04:16:02 -0400	[thread overview]
Message-ID: <20160614081602.10755.qmail@ns.sciencehorizons.net> (raw)

Nice cleanup!

I think I see a buglet in your level-5 cascading.

Suppose a timer is requested far in the future for a time
that is an exact multiple of 32768 jiffies.

collect_expired_timers() scans level 5 after all the previous ones,
and will cascade it to level 0, in a level-0 bucket which has already
been scanned, and won't be scanned again for 64 jiffies.

I agree that 64 jiffies is well within your allowed rounding accuracy,
and order of timer firing is not guaranteed when they're for the same
time, but it is a bit odd when a timer fires 32 jiffies *before* another
timer scheduled for 32 jiffies later.  That's the sort of peculiarity
that could lead to a subtle bug.

While I like the cleanup of just limiting long-term resolution, if
it turns out to be necessary, it's not too hard to add exact timers
back in if a need is found in future.  All you need is a second
__internal_add_timer function that rounds down rather than up, and to
teach expire_timers() to cascade in the unlikely situation that a timer
does have an expiry time in the future.

(It also gets rid of the special case for level 5.)

Other, mostly minor, code comments:

> +/* Level offsets in the wheel */
> +#define LVL0_OFFS	(0)
> +#define LVL1_OFFS	(LVL_SIZE)
> +#define LVL2_OFFS	(LVL1_OFFS + LVL_SIZE)
> +#define LVL3_OFFS	(LVL2_OFFS + LVL_SIZE)
> +#define LVL4_OFFS	(LVL3_OFFS + LVL_SIZE)
> +#define LVL5_OFFS	(LVL4_OFFS + LVL_SIZE)
> +
> +/* Clock divisor for the next level */
> +#define LVL_CLK_SHIFT	3
> +#define LVL_CLK_DIV	(1 << LVL_CLK_SHIFT)
> +#define LVL_CLK_MASK	(LVL_CLK_DIV - 1)
> +
> +/* The shift constants for selecting the bucket at the levels */
> +#define LVL1_SHIFT	(1 * LVL_CLK_SHIFT)
> +#define LVL2_SHIFT	(2 * LVL_CLK_SHIFT)
> +#define LVL3_SHIFT	(3 * LVL_CLK_SHIFT)
> +#define LVL4_SHIFT	(4 * LVL_CLK_SHIFT)
> +#define LVL5_SHIFT	(5 * LVL_CLK_SHIFT)
> +
> +/* The granularity of each level */
> +#define LVL0_GRAN	0x00000001
> +#define LVL1_GRAN	(LVL0_GRAN << LVL_CLK_SHIFT)
> +#define LVL2_GRAN	(LVL1_GRAN << LVL_CLK_SHIFT)
> +#define LVL3_GRAN	(LVL2_GRAN << LVL_CLK_SHIFT)
> +#define LVL4_GRAN	(LVL3_GRAN << LVL_CLK_SHIFT)
> +#define LVL5_GRAN	(LVL4_GRAN << LVL_CLK_SHIFT)

Wouldn't this all be so much simpler as

#define LVL_BITS	6	/* Renamed previous LVL_SHIFT */
#define LVL_SIZE	(1 << LVL_BITS)
#define LVL_MASK	(LVL_BITS - 1)
#define LVL_OFFS(n)	((n) * LVL_SIZE)
#define LVL_SHIFT(n)	((n) * LVL_CLK_SHIFT)
#define LVL_GRAN(n)	(1 << LVL_SHIFT(n))

Then you could do
+static inline unsigned calc_index(unsigned expires, unsigned level),
+{
+	/* Round up to next bin bin */
+	expires = ((expires - 1) >> LVL_SHIFT(level)) + 1;
+	return LVL_OFFS(level) + (expires & LVL_MASK);
+}

> +#define LVL1_TSTART	(LVL_SIZE - 1)

Er... isn't that LVL_SIZE, as documented in the table above?
Then it could be
#define LVL_TSTART(n) (LVL_SIZE << LVL_SHIFT(n))

Ideally, you'd like all of that

+	if (delta < LVL1_TSTART) {
+		idx = (expires + LVL0_GRAN) & LVL_MASK;
+	} else if (delta < LVL2_TSTART) {
+		idx = calc_index(expires, LVL1_GRAN, LVL1_SHIFT, LVL1_OFFS);
+	} else if (delta < LVL3_TSTART) {
+		idx = calc_index(expires, LVL2_GRAN, LVL2_SHIFT, LVL2_OFFS);
+	} else if (delta < LVL4_TSTART) {
+		idx = calc_index(expires, LVL3_GRAN, LVL3_SHIFT, LVL3_OFFS);
+	} else if (delta < LVL5_TSTART) {
+		idx = calc_index(expires, LVL4_GRAN, LVL4_SHIFT, LVL4_OFFS);

to be replaced with __builtin_clz or similar:

	level = __fls(delta | LVL_MASK);
	if (level <  LVL_BITS + LVL_SHIFT(LVL_DEPTH-1)) {	/* or LVL_DEPTH-2, no difference */
		level = (level + LVL_CLK_SHIFT - LVL_BITS) / LVL_CLK_SHIFT;
	} else if ((long)delta < 0) {
		expires = base->clk;
		level = 0;
	} else {
		level = LVL_DEPTH - 1;
	}
	index = calc_index(expires, level);

> +static inline void detach_expired_timer(struct timer_list *timer)
>  {
>  	detach_timer(timer, true);
> -	if (!(timer->flags & TIMER_DEFERRABLE))
> -		base->active_timers--;
> -	base->all_timers--;
>  }

Is there even a reason to have this wrapper any more?  Why not
just replace all calls to it in the source?

> +		timer = hlist_entry(head->first, struct timer_list, entry);
> +		fn = timer->function;
> +		data = timer->data;
> +
> +		timer_stats_account_timer(timer);
> +
> +		base->running_timer = timer;
> +		detach_expired_timer(timer);

Is there some non-obvious reason that you have to fetch fn and data
so early?  It seems like a register pressure pessimization, if the
compiler can't figure out that timer_stats code can't change them.

The cache line containing this timer was already prefetched when you
updated its entry.pprev as part of removing the previous entry from
the list.

I see why you want to fetch them with the lock held in case there's some
freaky race, but I'd do it all after detach_timer().