From: Kirill Tkhai <ktkhai@odin.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>, <umgwanakikbuti@gmail.com>,
	<mingo@elte.hu>, <ktkhai@parallels.com>, <rostedt@goodmis.org>,
	<tglx@linutronix.de>, <juri.lelli@gmail.com>,
	<pang.xunlei@linaro.org>, <wanpeng.li@linux.intel.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 08/14] hrtimer: Allow hrtimer::function() to free the timer
Date: Wed, 10 Jun 2015 10:46:51 +0300
Message-ID: <1433922411.23588.132.camel@odin.com>
In-Reply-To: <20150609213318.GA12436@redhat.com>

Hi, Oleg,

On Tue, 09/06/2015 at 23:33 +0200, Oleg Nesterov wrote:
> On 06/08, Peter Zijlstra wrote:
> >
> > On Mon, Jun 08, 2015 at 11:14:17AM +0200, Peter Zijlstra wrote:
> > > > Finally. Suppose that timer->function() returns HRTIMER_RESTART
> > > > and hrtimer_active() is called right after __run_hrtimer() sets
> > > > cpu_base->running = NULL. I can't understand why hrtimer_active()
> > > > can't miss ENQUEUED in this case. We have wmb() in between, yes,
> > > > but then hrtimer_active() should do something like
> > > >
> > > > 	active = cpu_base->running == timer;
> > > > 	if (!active) {
> > > > 		rmb();
> > > > 		active = state != HRTIMER_STATE_INACTIVE;
> > > > 	}
> > > >
> > > > No?
> > >
> > > Hmm, good point. Let me think about that. It would be nice to be able to
> > > avoid more memory barriers.
> >
> > So your scenario is:
> >
> > 				[R] seq
> > 				  RMB
> > [S] ->state = ACTIVE
> >   WMB
> > [S] ->running = NULL
> > 				[R] ->running (== NULL)
> > 				[R] ->state (== INACTIVE; fail to observe
> > 				             the ->state store due to
> > 					     lack of order)
> > 				  RMB
> > 				[R] seq (== seq)
> > [S] seq++
> >
> > Conversely, if we re-order the (first) seq++ store such that it comes
> > first:
> >
> > [S] seq++
> >
> > 				[R] seq
> > 				  RMB
> > 				[R] ->running (== NULL)
> > [S] ->running = timer;
> >   WMB
> > [S] ->state = INACTIVE
> > 				[R] ->state (== INACTIVE)
> > 				  RMB
> > 				[R] seq (== seq)
> >
> > And we have another false negative.
> >
> > And in this case we need the read order the other way around, we'd need:
> >
> > 	active = timer->state != HRTIMER_STATE_INACTIVE;
> > 	if (!active) {
> > 		smp_rmb();
> > 		active = cpu_base->running == timer;
> > 	}
> >
> > Now I think we can fix this by either doing:
> >
> > 	WMB
> > 	seq++
> > 	WMB
> >
> > On both sides of __run_hrtimer(), or do
> >
> > bool hrtimer_active(const struct hrtimer *timer)
> > {
> > 	struct hrtimer_cpu_base *cpu_base;
> > 	unsigned int seq;
> >
> > 	do {
> > 		cpu_base = READ_ONCE(timer->base->cpu_base);
> > 		seq = raw_read_seqcount(&cpu_base->seq);
> >
> > 		if (timer->state != HRTIMER_STATE_INACTIVE)
> > 			return true;
> >
> > 		smp_rmb();
> >
> > 		if (cpu_base->running == timer)
> > 			return true;
> >
> > 		smp_rmb();
> >
> > 		if (timer->state != HRTIMER_STATE_INACTIVE)
> > 			return true;
> >
> > 	} while (read_seqcount_retry(&cpu_base->seq, seq) ||
> > 		 cpu_base != READ_ONCE(timer->base->cpu_base));
> >
> > 	return false;
> > }
> 
> You know, I simply can't convince myself I understand why this code
> is correct... or not.
> 
> But contrary to what I said before, I agree that we need to recheck
> timer->base. This probably needs more discussion, to me it is very
> unobvious why we can trust this cpu_base != READ_ONCE() check. Yes,
> we have a lot of barriers, but they do not pair with each other. Let's
> ignore this for now.
> 
> > And since __run_hrtimer() is the more performance critical code, I think
> > it would be best to reduce the amount of memory barriers there.
> 
> Yes, but wmb() is cheap on x86... Perhaps we can make this code
> "obviously correct" ?
> 
> 
> How about the following..... We add cpu_base->seq as before but
> limit its "write" scope so that we can use the regular read/retry.
> 
> So,
> 
> 	hrtimer_active(timer)
> 	{
> 
> 		do {
> 			base = READ_ONCE(timer->base->cpu_base);
> 			seq = read_seqcount_begin(&base->seq);
> 
> 			if (timer->state & ENQUEUED ||
> 			    base->running == timer)
> 				return true;
> 
> 		} while (read_seqcount_retry(&base->seq, seq) ||
> 			 base != READ_ONCE(timer->base->cpu_base));
> 
> 		return false;
> 	}
> 
> And we need to avoid the races with 2 transitions in __run_hrtimer().
> 
> The first race is trivial, we change __run_hrtimer() to do
> 
> 	write_seqcount_begin(cpu_base->seq);
> 	cpu_base->running = timer;
> 	__remove_hrtimer(timer);	// clears ENQUEUED
> 	write_seqcount_end(cpu_base->seq);

We use the seqcount because we are afraid that hrtimer_active() may miss
timer->state or cpu_base->running at the moment we clear one of them.

If we use two write_seqcount_{begin,end} pairs in __run_hrtimer(),
we can protect only the places where we do the clearing:

	cpu_base->running = timer;
	write_seqcount_begin(&cpu_base->seq);
	__remove_hrtimer(timer);	// clears ENQUEUED
	write_seqcount_end(&cpu_base->seq);

	....

	timer->state |= HRTIMER_STATE_ENQUEUED;
	write_seqcount_begin(&cpu_base->seq);
	cpu_base->running = NULL;
	write_seqcount_end(&cpu_base->seq);
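
To make the ordering concrete, here is a minimal userspace sketch of the
scheme above. It is only a model under stated assumptions: C11 sequentially
consistent atomics stand in for the kernel's seqcount and barriers (which is
stronger than what the kernel would use), and every model_* name and both
STATE_* constants are hypothetical, not kernel API. The reader follows the
hrtimer_active() shape quoted earlier; the two writers set the "new" flag
first and wrap only the clearing store in a write section.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not kernel code. */
#define STATE_INACTIVE 0u
#define STATE_ENQUEUED 1u

struct model_timer {
	atomic_uint state;
};

struct model_cpu_base {
	atomic_uint seq;	/* even: stable, odd: writer inside */
	_Atomic(struct model_timer *) running;
};

static void model_write_begin(struct model_cpu_base *b)
{
	atomic_fetch_add(&b->seq, 1);	/* seq becomes odd */
}

static void model_write_end(struct model_cpu_base *b)
{
	atomic_fetch_add(&b->seq, 1);	/* seq becomes even again */
}

/* Reader: retry if a write section overlapped the two loads. */
static bool model_hrtimer_active(struct model_cpu_base *b,
				 struct model_timer *t)
{
	unsigned int seq;

	do {
		while ((seq = atomic_load(&b->seq)) & 1)
			;	/* writer in progress, wait */

		if ((atomic_load(&t->state) & STATE_ENQUEUED) ||
		    atomic_load(&b->running) == t)
			return true;
	} while (atomic_load(&b->seq) != seq);

	return false;
}

/* First transition: ->running is set before ENQUEUED is cleared. */
static void model_run_begin(struct model_cpu_base *b, struct model_timer *t)
{
	atomic_store(&b->running, t);
	model_write_begin(b);
	atomic_fetch_and(&t->state, ~STATE_ENQUEUED);
	model_write_end(b);
}

/* Second transition (re-arm): ENQUEUED is set before ->running is cleared. */
static void model_run_end_restart(struct model_cpu_base *b,
				  struct model_timer *t)
{
	atomic_fetch_or(&t->state, STATE_ENQUEUED);
	model_write_begin(b);
	atomic_store(&b->running, NULL);
	model_write_end(b);
}
```

Walked single-threaded, the timer looks active before model_run_begin(),
between the two transitions, and after model_run_end_restart(). This only
illustrates the intended invariant; it does not model the kernel's weaker
barriers, so it says nothing about whether the real code is race-free.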

> 
> and hrtimer_active() obviously can't race with this section.
> 
> Then we change enqueue_hrtimer()
> 
> 
> 	+	bool need_lock = base->cpu_base->running == timer;
> 	+	if (need_lock)
> 	+		write_seqcount_begin(cpu_base->seq);
> 	+
> 		timer->state |= HRTIMER_STATE_ENQUEUED;
> 	+
> 	+	if (need_lock)
> 	+		write_seqcount_end(cpu_base->seq);
> 
> 
> Now. If the timer is re-queued by the time __run_hrtimer() clears
> ->running we have the following sequence:
> 
> 	write_seqcount_begin(cpu_base->seq);
> 	timer->state |= HRTIMER_STATE_ENQUEUED;
> 	write_seqcount_end(cpu_base->seq);
> 
> 	base->running = NULL;
> 
> and I think this should equally work, because in this case we do not
> care if hrtimer_active() misses "running = NULL".
> 
> Yes, we only have this 2nd write_seqcount_begin/end if the timer re-
> arms itself, but otherwise we do not race. If another thread does
> hrtimer_start() in between we can pretend that hrtimer_active() hits
> the "inactive".
> 
> What do you think?
> 
> 
> And. Note that we can rewrite these 2 "write" critical sections in
> __run_hrtimer() and enqueue_hrtimer() as
> 
> 	cpu_base->running = timer;
> 
> 	write_seqcount_begin(cpu_base->seq);
> 	write_seqcount_end(cpu_base->seq);
> 
> 	__remove_hrtimer(timer);
> 
> and
> 
> 	timer->state |= HRTIMER_STATE_ENQUEUED;
> 
> 	write_seqcount_begin(cpu_base->seq);
> 	write_seqcount_end(cpu_base->seq);
> 
> 	base->running = NULL;
> 
> So we can probably use write_seqcount_barrier() except I am not sure
> about the 2nd wmb...

