* [patch 3/21] x86, bts: wait until traced task has been scheduled out
@ 2009-03-31 12:59 Markus Metzger
  2009-04-01  0:17 ` Oleg Nesterov
  2009-04-01  0:26 ` Oleg Nesterov
  0 siblings, 2 replies; 10+ messages in thread
From: Markus Metzger @ 2009-03-31 12:59 UTC (permalink / raw)
  To: linux-kernel, mingo, tglx, hpa
  Cc: markus.t.metzger, markus.t.metzger, roland, eranian, oleg,
	juan.villacis, ak

In order to stop branch tracing for a running task, we need to first
clear the branch tracing control bits before we can free the tracing
buffer. If the traced task is running, the cpu might still trace that
task after the branch trace control bits have been cleared.

Wait until the traced task has been scheduled out before proceeding.


A similar problem affects the task debug store context. We first remove
the context, then we need to wait until the task has been scheduled
out before we can free the context memory.
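
In outline, the release paths must order their steps like this (an
illustrative sketch only; these helper names are placeholders, not
the actual ds.c functions):

	clear_bts_control_bits(task);	/* no new trace output */
	wait_to_unschedule(task);	/* cpu may still use the old config */
	free_tracing_buffer(task);	/* now it is safe to free */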


Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
---

Index: git-tip/arch/x86/kernel/ds.c
===================================================================
--- git-tip.orig/arch/x86/kernel/ds.c	2009-03-30 17:19:14.000000000 +0200
+++ git-tip/arch/x86/kernel/ds.c	2009-03-30 17:20:11.000000000 +0200
@@ -250,6 +250,42 @@ static DEFINE_PER_CPU(struct ds_context 
 #define system_context per_cpu(system_context_array, smp_processor_id())
 
 
+/*
+ * Wait for the traced task to unschedule.
+ *
+ * This guarantees that the bts trace configuration has been
+ * synchronized with the cpu executing the task.
+ */
+static void wait_to_unschedule(struct task_struct *task)
+{
+	unsigned long nvcsw;
+	unsigned long nivcsw;
+
+	if (!task)
+		return;
+
+	if (task == current)
+		return;
+
+	nvcsw  = task->nvcsw;
+	nivcsw = task->nivcsw;
+	for (;;) {
+		if (!task_is_running(task))
+			break;
+		/*
+		 * The switch count is incremented before the actual
+		 * context switch. We thus wait for two switches to be
+		 * sure at least one completed.
+		 */
+		if ((task->nvcsw - nvcsw) > 1)
+			break;
+		if ((task->nivcsw - nivcsw) > 1)
+			break;
+
+		schedule();
+	}
+}
+
 static inline struct ds_context *ds_get_context(struct task_struct *task)
 {
 	struct ds_context **p_context =
@@ -321,6 +357,9 @@ static inline void ds_put_context(struct
 
 	spin_unlock_irqrestore(&ds_lock, irq);
 
+	/* The context might still be in use for context switching. */
+	wait_to_unschedule(context->task);
+
 	kfree(context);
 }
 
@@ -789,6 +828,9 @@ void ds_release_bts(struct bts_tracer *t
 	WARN_ON_ONCE(tracer->ds.context->bts_master != tracer);
 	tracer->ds.context->bts_master = NULL;
 
+	/* Make sure tracing stopped and the tracer is not in use. */
+	wait_to_unschedule(tracer->ds.context->task);
+
 	put_tracer(tracer->ds.context->task);
 	ds_put_context(tracer->ds.context);
 

* Re: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-03-31 12:59 [patch 3/21] x86, bts: wait until traced task has been scheduled out Markus Metzger
@ 2009-04-01  0:17 ` Oleg Nesterov
  2009-04-01  8:09   ` Metzger, Markus T
  2009-04-01 11:41   ` Ingo Molnar
  2009-04-01  0:26 ` Oleg Nesterov
  1 sibling, 2 replies; 10+ messages in thread
From: Oleg Nesterov @ 2009-04-01  0:17 UTC (permalink / raw)
  To: Markus Metzger
  Cc: linux-kernel, mingo, tglx, hpa, markus.t.metzger, roland,
	eranian, juan.villacis, ak

On 03/31, Markus Metzger wrote:
>
> +static void wait_to_unschedule(struct task_struct *task)
> +{
> +	unsigned long nvcsw;
> +	unsigned long nivcsw;
> +
> +	if (!task)
> +		return;
> +
> +	if (task == current)
> +		return;
> +
> +	nvcsw  = task->nvcsw;
> +	nivcsw = task->nivcsw;
> +	for (;;) {
> +		if (!task_is_running(task))
> +			break;
> +		/*
> +		 * The switch count is incremented before the actual
> +		 * context switch. We thus wait for two switches to be
> +		 * sure at least one completed.
> +		 */
> +		if ((task->nvcsw - nvcsw) > 1)
> +			break;
> +		if ((task->nivcsw - nivcsw) > 1)
> +			break;
> +
> +		schedule();

schedule() is a nop here. We can wait unpredictably long...

Ingo, do you have any ideas to improve this helper?

Not that I really like it, but how about

	int force_unschedule(struct task_struct *p)
	{
		struct rq *rq;
		unsigned long flags;
		int running;

		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		task_rq_unlock(rq, &flags);

		if (running)
			wake_up_process(rq->migration_thread);

		return running;
	}

which should be used instead of task_is_running() ?
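
(Plugged into the polling loop from the patch, this would look roughly
like the following -- an untested sketch that keeps the nvcsw/nivcsw
exit checks:)

	nvcsw  = task->nvcsw;
	nivcsw = task->nivcsw;
	for (;;) {
		/* kick the task off its cpu if it is currently on one */
		if (!force_unschedule(task))
			break;
		if ((task->nvcsw - nvcsw) > 1)
			break;
		if ((task->nivcsw - nivcsw) > 1)
			break;

		schedule();
	}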


We can even do something like

	void wait_to_unschedule(struct task_struct *task)
	{
		struct migration_req req;
		unsigned long flags;
		struct rq *rq;
		int running;

		rq = task_rq_lock(task, &flags);
		running = task_running(rq, task);
		if (running) {
			/* make sure __migrate_task() will do nothing */
			req.task = task;
			req.dest_cpu = NR_CPUS + 1;
			init_completion(&req.done);
			list_add(&req.list, &rq->migration_queue);
		}
		task_rq_unlock(rq, &flags);

		if (running) {
			wake_up_process(rq->migration_thread);
			wait_for_completion(&req.done);
		}
	}

This way we don't poll, and we need only one helper.
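
(For reference: the completion would be signalled by the per-cpu
migration thread. Roughly, and from memory of the migration_thread()
loop in sched.c, it does

	req = list_entry(head->next, struct migration_req, list);
	list_del_init(head->next);
	spin_unlock(&rq->lock);

	__migrate_task(req->task, cpu, req->dest_cpu);	/* no-op here */
	local_irq_enable();
	complete(&req->done);		/* wakes up wait_to_unschedule() */

which is why req.task must be initialized as well, as done above.)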

(Can't resist: this patch is not bisect friendly; without the next patches,
 wait_to_unschedule() is called under write_lock_irq, which can deadlock.)

But anyway, I think we can do this later.

Oleg.



* Re: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-03-31 12:59 [patch 3/21] x86, bts: wait until traced task has been scheduled out Markus Metzger
  2009-04-01  0:17 ` Oleg Nesterov
@ 2009-04-01  0:26 ` Oleg Nesterov
  1 sibling, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2009-04-01  0:26 UTC (permalink / raw)
  To: Markus Metzger
  Cc: linux-kernel, mingo, tglx, hpa, markus.t.metzger, roland,
	eranian, juan.villacis, ak

Sorry for noise, forgot to mention...

On 03/31, Markus Metzger wrote:
>
>  static inline struct ds_context *ds_get_context(struct task_struct *task)

Completely off-topic, but ds_get_context() is rather fat, imho makes
sense to uninline.

Oleg.



* RE: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-04-01  0:17 ` Oleg Nesterov
@ 2009-04-01  8:09   ` Metzger, Markus T
  2009-04-01 19:04     ` Oleg Nesterov
  2009-04-01 11:41   ` Ingo Molnar
  1 sibling, 1 reply; 10+ messages in thread
From: Metzger, Markus T @ 2009-04-01  8:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mingo, tglx, hpa, markus.t.metzger, roland,
	eranian, Villacis, Juan, ak

>-----Original Message-----
>From: Oleg Nesterov [mailto:oleg@redhat.com]
>Sent: Wednesday, April 01, 2009 2:17 AM
>To: Metzger, Markus T


>> +static void wait_to_unschedule(struct task_struct *task)
>> +{
>> +	unsigned long nvcsw;
>> +	unsigned long nivcsw;
>> +
>> +	if (!task)
>> +		return;
>> +
>> +	if (task == current)
>> +		return;
>> +
>> +	nvcsw  = task->nvcsw;
>> +	nivcsw = task->nivcsw;
>> +	for (;;) {
>> +		if (!task_is_running(task))
>> +			break;
>> +		/*
>> +		 * The switch count is incremented before the actual
>> +		 * context switch. We thus wait for two switches to be
>> +		 * sure at least one completed.
>> +		 */
>> +		if ((task->nvcsw - nvcsw) > 1)
>> +			break;
>> +		if ((task->nivcsw - nivcsw) > 1)
>> +			break;
>> +
>> +		schedule();
>
>schedule() is a nop here. We can wait unpredictably long...

Hmmm, as far as I understand the code, rt-workqueues use a higher sched_class
and thus cannot be preempted by normal threads. Non-rt workqueues
use the fair_sched_class. And schedule_work() uses a non-rt workqueue.

In practice, task is ptraced. It is either stopped or exiting.
I don't expect to loop very often.


>
>Ingo, do you have any ideas to improve this helper?
>
>Not that I really like it, but how about
>
>	int force_unschedule(struct task_struct *p)
>	{
>		struct rq *rq;
>		unsigned long flags;
>		int running;
>
>		rq = task_rq_lock(p, &flags);
>		running = task_running(rq, p);
>		task_rq_unlock(rq, &flags);
>
>		if (running)
>			wake_up_process(rq->migration_thread);
>
>		return running;
>	}
>
>which should be used instead of task_is_running() ?
>
>
>We can even do something like
>
>	void wait_to_unschedule(struct task_struct *task)
>	{
>		struct migration_req req;
>		unsigned long flags;
>		struct rq *rq;
>		int running;
>
>		rq = task_rq_lock(task, &flags);
>		running = task_running(rq, task);
>		if (running) {
>			/* make sure __migrate_task() will do nothing */
>			req.task = task;
>			req.dest_cpu = NR_CPUS + 1;
>			init_completion(&req.done);
>			list_add(&req.list, &rq->migration_queue);
>		}
>		task_rq_unlock(rq, &flags);
>
>		if (running) {
>			wake_up_process(rq->migration_thread);
>			wait_for_completion(&req.done);
>		}
>	}
>
>This way we don't poll, and we need only one helper.
>
>(Can't resist: this patch is not bisect friendly; without the next patches,
> wait_to_unschedule() is called under write_lock_irq, which can deadlock.)

I know. See the reply to patch 0; I tried to keep the patches small and focused
to simplify the review work and attract reviewers.

thanks and regards,
markus.


* Re: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-04-01  0:17 ` Oleg Nesterov
  2009-04-01  8:09   ` Metzger, Markus T
@ 2009-04-01 11:41   ` Ingo Molnar
  2009-04-01 12:43     ` Metzger, Markus T
  2009-04-01 19:45     ` Oleg Nesterov
  1 sibling, 2 replies; 10+ messages in thread
From: Ingo Molnar @ 2009-04-01 11:41 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Markus Metzger, linux-kernel, tglx, hpa, markus.t.metzger,
	roland, eranian, juan.villacis, ak


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 03/31, Markus Metzger wrote:
> >
> > +static void wait_to_unschedule(struct task_struct *task)
> > +{
> > +	unsigned long nvcsw;
> > +	unsigned long nivcsw;
> > +
> > +	if (!task)
> > +		return;
> > +
> > +	if (task == current)
> > +		return;
> > +
> > +	nvcsw  = task->nvcsw;
> > +	nivcsw = task->nivcsw;
> > +	for (;;) {
> > +		if (!task_is_running(task))
> > +			break;
> > +		/*
> > +		 * The switch count is incremented before the actual
> > +		 * context switch. We thus wait for two switches to be
> > +		 * sure at least one completed.
> > +		 */
> > +		if ((task->nvcsw - nvcsw) > 1)
> > +			break;
> > +		if ((task->nivcsw - nivcsw) > 1)
> > +			break;
> > +
> > +		schedule();
> 
> schedule() is a nop here. We can wait unpredictably long...
> 
> Ingo, do you have any ideas to improve this helper?

hm, there's a similar looking existing facility: 
wait_task_inactive(). Have i missed some subtle detail that makes it 
inappropriate for use here?

> Not that I really like it, but how about
> 
> 	int force_unschedule(struct task_struct *p)
> 	{
> 		struct rq *rq;
> 		unsigned long flags;
> 		int running;
> 
> 		rq = task_rq_lock(p, &flags);
> 		running = task_running(rq, p);
> 		task_rq_unlock(rq, &flags);
> 
> 		if (running)
> 			wake_up_process(rq->migration_thread);
> 
> 		return running;
> 	}
> 
> which should be used instead of task_is_running() ?

Yes - wait_task_inactive() should be switched to a scheme like that 
- it would fix bugs like:

  53da1d9: fix ptrace slowness

in a cleaner way.

> We can even do something like
> 
> 	void wait_to_unschedule(struct task_struct *task)
> 	{
> 		struct migration_req req;
> 		unsigned long flags;
> 		struct rq *rq;
> 		int running;
> 
> 		rq = task_rq_lock(task, &flags);
> 		running = task_running(rq, task);
> 		if (running) {
> 			/* make sure __migrate_task() will do nothing */
> 			req.task = task;
> 			req.dest_cpu = NR_CPUS + 1;
> 			init_completion(&req.done);
> 			list_add(&req.list, &rq->migration_queue);
> 		}
> 		task_rq_unlock(rq, &flags);
> 
> 		if (running) {
> 			wake_up_process(rq->migration_thread);
> 			wait_for_completion(&req.done);
> 		}
> 	}
> 
> This way we don't poll, and we need only one helper.

Looks even better. The migration thread would run complete(), right?

A detail: i suspect this needs to be in a while() loop, for the case 
that the victim task raced with us and went to another CPU before we 
kicked it off via the migration thread.
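
(Something like the following, perhaps -- an untested sketch of that
while() variant, reusing the migration_req scheme from above:)

	void wait_to_unschedule(struct task_struct *task)
	{
		int running;

		do {
			struct migration_req req;
			unsigned long flags;
			struct rq *rq;

			rq = task_rq_lock(task, &flags);
			running = task_running(rq, task);
			if (running) {
				/* a no-op migration request, as above */
				req.task = task;
				req.dest_cpu = NR_CPUS + 1;
				init_completion(&req.done);
				list_add(&req.list, &rq->migration_queue);
			}
			task_rq_unlock(rq, &flags);

			if (running) {
				wake_up_process(rq->migration_thread);
				wait_for_completion(&req.done);
			}
			/* re-check: the task may have moved to another cpu */
		} while (running);
	}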

This looks very useful to me. It could also be tested easily: revert 
53da1d9 and you should see:

   time strace dd if=/dev/zero of=/dev/null bs=1024 count=1000000

performance plummet on an SMP box. Then, with your fix, it should go
back up to near full speed again.

	Ingo


* RE: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-04-01 11:41   ` Ingo Molnar
@ 2009-04-01 12:43     ` Metzger, Markus T
  2009-04-01 12:53       ` Ingo Molnar
  2009-04-01 19:45     ` Oleg Nesterov
  1 sibling, 1 reply; 10+ messages in thread
From: Metzger, Markus T @ 2009-04-01 12:43 UTC (permalink / raw)
  To: Ingo Molnar, Oleg Nesterov, Peter Zijlstra
  Cc: linux-kernel, tglx, hpa, markus.t.metzger, roland, eranian,
	Villacis, Juan, ak

>-----Original Message-----
>From: Ingo Molnar [mailto:mingo@elte.hu]
>Sent: Wednesday, April 01, 2009 1:42 PM
>To: Oleg Nesterov; Peter Zijlstra


>* Oleg Nesterov <oleg@redhat.com> wrote:
>
>> On 03/31, Markus Metzger wrote:
>> >
>> > +static void wait_to_unschedule(struct task_struct *task)
>> > +{
>> > +	unsigned long nvcsw;
>> > +	unsigned long nivcsw;
>> > +
>> > +	if (!task)
>> > +		return;
>> > +
>> > +	if (task == current)
>> > +		return;
>> > +
>> > +	nvcsw  = task->nvcsw;
>> > +	nivcsw = task->nivcsw;
>> > +	for (;;) {
>> > +		if (!task_is_running(task))
>> > +			break;
>> > +		/*
>> > +		 * The switch count is incremented before the actual
>> > +		 * context switch. We thus wait for two switches to be
>> > +		 * sure at least one completed.
>> > +		 */
>> > +		if ((task->nvcsw - nvcsw) > 1)
>> > +			break;
>> > +		if ((task->nivcsw - nivcsw) > 1)
>> > +			break;
>> > +
>> > +		schedule();
>>
>> schedule() is a nop here. We can wait unpredictably long...
>>
>> Ingo, do you have any ideas to improve this helper?
>
>hm, there's a similar looking existing facility:
>wait_task_inactive(). Have i missed some subtle detail that makes it
>inappropriate for use here?


wait_task_inactive() waits until the task is no longer TASK_RUNNING.

I need to wait until the task has been scheduled out at least once.


regards,
markus.


* Re: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-04-01 12:43     ` Metzger, Markus T
@ 2009-04-01 12:53       ` Ingo Molnar
  0 siblings, 0 replies; 10+ messages in thread
From: Ingo Molnar @ 2009-04-01 12:53 UTC (permalink / raw)
  To: Metzger, Markus T
  Cc: Oleg Nesterov, Peter Zijlstra, linux-kernel, tglx, hpa,
	markus.t.metzger, roland, eranian, Villacis, Juan, ak


* Metzger, Markus T <markus.t.metzger@intel.com> wrote:

> >-----Original Message-----
> >From: Ingo Molnar [mailto:mingo@elte.hu]
> >Sent: Wednesday, April 01, 2009 1:42 PM
> >To: Oleg Nesterov; Peter Zijlstra
> 
> 
> >* Oleg Nesterov <oleg@redhat.com> wrote:
> >
> >> On 03/31, Markus Metzger wrote:
> >> >
> >> > +static void wait_to_unschedule(struct task_struct *task)
> >> > +{
> >> > +	unsigned long nvcsw;
> >> > +	unsigned long nivcsw;
> >> > +
> >> > +	if (!task)
> >> > +		return;
> >> > +
> >> > +	if (task == current)
> >> > +		return;
> >> > +
> >> > +	nvcsw  = task->nvcsw;
> >> > +	nivcsw = task->nivcsw;
> >> > +	for (;;) {
> >> > +		if (!task_is_running(task))
> >> > +			break;
> >> > +		/*
> >> > +		 * The switch count is incremented before the actual
> >> > +		 * context switch. We thus wait for two switches to be
> >> > +		 * sure at least one completed.
> >> > +		 */
> >> > +		if ((task->nvcsw - nvcsw) > 1)
> >> > +			break;
> >> > +		if ((task->nivcsw - nivcsw) > 1)
> >> > +			break;
> >> > +
> >> > +		schedule();
> >>
> >> schedule() is a nop here. We can wait unpredictably long...
> >>
> >> Ingo, do you have any ideas to improve this helper?
> >
> >hm, there's a similar looking existing facility:
> >wait_task_inactive(). Have i missed some subtle detail that makes it
> >inappropriate for use here?
> 
> wait_task_inactive() waits until the task is no longer 
> TASK_RUNNING.

No, that's wrong: wait_task_inactive() waits until the task
deschedules.

	Ingo


* Re: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-04-01  8:09   ` Metzger, Markus T
@ 2009-04-01 19:04     ` Oleg Nesterov
  2009-04-01 19:52       ` Markus Metzger
  0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2009-04-01 19:04 UTC (permalink / raw)
  To: Metzger, Markus T
  Cc: linux-kernel, mingo, tglx, hpa, markus.t.metzger, roland,
	eranian, Villacis, Juan, ak

On 04/01, Metzger, Markus T wrote:
>
> >-----Original Message-----
> >From: Oleg Nesterov [mailto:oleg@redhat.com]
> >Sent: Wednesday, April 01, 2009 2:17 AM
> >To: Metzger, Markus T
>
> >> +static void wait_to_unschedule(struct task_struct *task)
> >> +{
> >> +	unsigned long nvcsw;
> >> +	unsigned long nivcsw;
> >> +
> >> +	if (!task)
> >> +		return;
> >> +
> >> +	if (task == current)
> >> +		return;
> >> +
> >> +	nvcsw  = task->nvcsw;
> >> +	nivcsw = task->nivcsw;
> >> +	for (;;) {
> >> +		if (!task_is_running(task))
> >> +			break;
> >> +		/*
> >> +		 * The switch count is incremented before the actual
> >> +		 * context switch. We thus wait for two switches to be
> >> +		 * sure at least one completed.
> >> +		 */
> >> +		if ((task->nvcsw - nvcsw) > 1)
> >> +			break;
> >> +		if ((task->nivcsw - nivcsw) > 1)
> >> +			break;
> >> +
> >> +		schedule();
> >
> >schedule() is a nop here. We can wait unpredictably long...
>
> Hmmm, as far as I understand the code, rt-workqueues use a higher sched_class
> and thus cannot be preempted by normal threads. Non-rt workqueues
> use the fair_sched_class. And schedule_work() uses a non-rt workqueue.

I was unclear, sorry.

I meant, in this case

	while (!CONDITION)
		schedule();

is not better compared to

	while (!CONDITION)
		; /* do nothing */

(OK, schedule() is better without CONFIG_PREEMPT, but this doesn't matter).
wait_to_unschedule() just spins waiting for ->nXvcsw, which is not optimal.

And there is another problem: we can wait unpredictably long, because

> In practice, task is ptraced. It is either stopped or exiting.
> I don't expect to loop very often.

No. The task _was_ ptraced when we called (say) ptrace_detach(). But when
work->func() runs, the tracee is not traced, it is running (not necessarily,
of course; the tracer _can_ leave it in TASK_STOPPED).

Now, again, suppose that this task does "for (;;) ;" in user-space.
If CPU is "free", it can spin "forever" without re-scheduling. Yes sure,
this case is not likely in practice, but still.

Oleg.



* Re: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-04-01 11:41   ` Ingo Molnar
  2009-04-01 12:43     ` Metzger, Markus T
@ 2009-04-01 19:45     ` Oleg Nesterov
  1 sibling, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2009-04-01 19:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Markus Metzger, linux-kernel, tglx, hpa,
	markus.t.metzger, roland, eranian, juan.villacis, ak

On 04/01, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 03/31, Markus Metzger wrote:
> > >
> > > +static void wait_to_unschedule(struct task_struct *task)
> > > +{
> > > +	unsigned long nvcsw;
> > > +	unsigned long nivcsw;
> > > +
> > > +	if (!task)
> > > +		return;
> > > +
> > > +	if (task == current)
> > > +		return;
> > > +
> > > +	nvcsw  = task->nvcsw;
> > > +	nivcsw = task->nivcsw;
> > > +	for (;;) {
> > > +		if (!task_is_running(task))
> > > +			break;
> > > +		/*
> > > +		 * The switch count is incremented before the actual
> > > +		 * context switch. We thus wait for two switches to be
> > > +		 * sure at least one completed.
> > > +		 */
> > > +		if ((task->nvcsw - nvcsw) > 1)
> > > +			break;
> > > +		if ((task->nivcsw - nivcsw) > 1)
> > > +			break;
> > > +
> > > +		schedule();
> >
> > schedule() is a nop here. We can wait unpredictably long...
> >
> > Ingo, do you have any ideas to improve this helper?
>
> hm, there's a similar looking existing facility:
> wait_task_inactive(). Have i missed some subtle detail that makes it
> inappropriate for use here?

Yes, they are similar, but still different.

wait_to_unschedule(task) waits until this task does context switch at
least once. It is fine if this task runs again when wait_to_unschedule()
returns. (if !task_is_running(task), it already did context switch).

wait_task_inactive() ensures that this task is deactivated. It can't be
used here, because it can "never" be deactivated.
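
Schematically, assuming a task that spins in user-space:

	wait_task_inactive(p);	/* returns only once p is off the cpu and
				   deactivated; may never happen here */
	wait_to_unschedule(p);	/* returns once p context-switched at least
				   once, even if p runs again right after */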

> > 	int force_unschedule(struct task_struct *p)
> > 	{
> > 		struct rq *rq;
> > 		unsigned long flags;
> > 		int running;
> >
> > 		rq = task_rq_lock(p, &flags);
> > 		running = task_running(rq, p);
> > 		task_rq_unlock(rq, &flags);
> >
> > 		if (running)
> > 			wake_up_process(rq->migration_thread);
> >
> > 		return running;
> > 	}
> >
> > which should be used instead of task_is_running() ?
>
> Yes - wait_task_inactive() should be switched to a scheme like that

Yes, I thought about this, perhaps we can improve wait_task_inactive()
a bit. Unfortunately, this is not enough to kill schedule_timeout(1).

> - it would fix bugs like:
>
>   53da1d9: fix ptrace slowness

I don't think so. Quite the contrary: the problem with "fix ptrace slowness"
is that we do not want the TASK_TRACED task to be preempted before it
does the voluntary schedule() (without PREEMPT_ACTIVE).

> > 	void wait_to_unschedule(struct task_struct *task)
> > 	{
> > 		struct migration_req req;
> > 		unsigned long flags;
> > 		struct rq *rq;
> > 		int running;
> >
> > 		rq = task_rq_lock(task, &flags);
> > 		running = task_running(rq, task);
> > 		if (running) {
> > 			/* make sure __migrate_task() will do nothing */
> > 			req.task = task;
> > 			req.dest_cpu = NR_CPUS + 1;
> > 			init_completion(&req.done);
> > 			list_add(&req.list, &rq->migration_queue);
> > 		}
> > 		task_rq_unlock(rq, &flags);
> >
> > 		if (running) {
> > 			wake_up_process(rq->migration_thread);
> > 			wait_for_completion(&req.done);
> > 		}
> > 	}
> >
> > This way we don't poll, and we need only one helper.
>
> Looks even better. The migration thread would run complete(), right?

Yes,

> A detail: i suspect this needs to be in a while() loop, for the case
> that the victim task raced with us and went to another CPU before we
> kicked it off via the migration thread.

I think this doesn't matter. If the task is not running - we don't
care and do nothing. If it is running and migrates - it should do
a context switch at least once.

But the code above is not right wrt cpu hotplug. wake_up_process()
can hit the NULL rq->migration_thread if we race with CPU_DEAD.

Hmm, don't we have this problem in, say, set_cpus_allowed_ptr()?
Unless it is called under get_online_cpus(), ->migration_thread
can go away once we drop rq->lock.

Perhaps, we need something like this

	--- kernel/sched.c
	+++ kernel/sched.c
	@@ -6132,8 +6132,10 @@ int set_cpus_allowed_ptr(struct task_str
	 
		if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req)) {
			/* Need help from migration thread: drop lock and wait. */
	+		preempt_disable();
			task_rq_unlock(rq, &flags);
			wake_up_process(rq->migration_thread);
	+		preempt_enable();
			wait_for_completion(&req.done);
			tlb_migrate_finish(p->mm);
			return 0;

?

Oleg.



* Re: [patch 3/21] x86, bts: wait until traced task has been scheduled out
  2009-04-01 19:04     ` Oleg Nesterov
@ 2009-04-01 19:52       ` Markus Metzger
  0 siblings, 0 replies; 10+ messages in thread
From: Markus Metzger @ 2009-04-01 19:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Metzger, Markus T, linux-kernel, mingo, tglx, hpa, roland,
	eranian, Villacis, Juan, ak

On Wed, 2009-04-01 at 21:04 +0200, Oleg Nesterov wrote:
> On 04/01, Metzger, Markus T wrote:
> >
> > >-----Original Message-----
> > >From: Oleg Nesterov [mailto:oleg@redhat.com]
> > >Sent: Wednesday, April 01, 2009 2:17 AM
> > >To: Metzger, Markus T
> >
> > >> +static void wait_to_unschedule(struct task_struct *task)
> > >> +{
> > >> +	unsigned long nvcsw;
> > >> +	unsigned long nivcsw;
> > >> +
> > >> +	if (!task)
> > >> +		return;
> > >> +
> > >> +	if (task == current)
> > >> +		return;
> > >> +
> > >> +	nvcsw  = task->nvcsw;
> > >> +	nivcsw = task->nivcsw;
> > >> +	for (;;) {
> > >> +		if (!task_is_running(task))
> > >> +			break;
> > >> +		/*
> > >> +		 * The switch count is incremented before the actual
> > >> +		 * context switch. We thus wait for two switches to be
> > >> +		 * sure at least one completed.
> > >> +		 */
> > >> +		if ((task->nvcsw - nvcsw) > 1)
> > >> +			break;
> > >> +		if ((task->nivcsw - nivcsw) > 1)
> > >> +			break;
> > >> +
> > >> +		schedule();
> > >
> > >schedule() is a nop here. We can wait unpredictably long...
> >
> > Hmmm, as far as I understand the code, rt-workqueues use a higher sched_class
> > and thus cannot be preempted by normal threads. Non-rt workqueues
> > use the fair_sched_class. And schedule_work() uses a non-rt workqueue.
> 
> I was unclear, sorry.
> 
> I meant, in this case
> 
> 	while (!CONDITION)
> 		schedule();
> 
> is not better compared to
> 
> 	while (!CONDITION)
> 		; /* do nothing */
> 
> (OK, schedule() is better without CONFIG_PREEMPT, but this doesn't matter).
> wait_to_unschedule() just spins waiting for ->nXvcsw, this is not optimal.
> 
> And another problem, we can wait unpredictably long, because
> 
> > In practice, task is ptraced. It is either stopped or exiting.
> > I don't expect to loop very often.
> 
> No. The task _was_ ptraced when we called (say) ptrace_detach(). But when
> work->func() runs, the tracee is not traced, it is running (not necessarily,
> of course; the tracer _can_ leave it in TASK_STOPPED).
> 
> Now, again, suppose that this task does "for (;;) ;" in user-space.
> If CPU is "free", it can spin "forever" without re-scheduling. Yes sure,
> this case is not likely in practice, but still.

So I should rather not call schedule()?

I thought it's better to yield the cpu than to spin.


I will resend a bisect-friendly version of the series (using quilt mail,
this time) tomorrow.

I will remove schedule() in the wait_to_unschedule() loop and also
address the minor nitpicks you mentioned in your other reviews.

thanks,
markus.




Thread overview: 10+ messages
2009-03-31 12:59 [patch 3/21] x86, bts: wait until traced task has been scheduled out Markus Metzger
2009-04-01  0:17 ` Oleg Nesterov
2009-04-01  8:09   ` Metzger, Markus T
2009-04-01 19:04     ` Oleg Nesterov
2009-04-01 19:52       ` Markus Metzger
2009-04-01 11:41   ` Ingo Molnar
2009-04-01 12:43     ` Metzger, Markus T
2009-04-01 12:53       ` Ingo Molnar
2009-04-01 19:45     ` Oleg Nesterov
2009-04-01  0:26 ` Oleg Nesterov
