linux-kernel.vger.kernel.org archive mirror
* [PATCH] context_tracking: remove local_irq_save from __acct_update_integrals
@ 2015-04-24  1:57 Rik van Riel
  2015-04-24  9:11 ` Heiko Carstens
  0 siblings, 1 reply; 3+ messages in thread
From: Rik van Riel @ 2015-04-24  1:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andy Lutomirsky, Frederic Weisbecker, Peter Zijlstra,
	Heiko Carstens, Luiz Capitulino, Marcelo Tosatti, Clark Williams

The function __acct_update_integrals() is called both from irq context
and task context. This creates a race where irq context can advance
tsk->acct_timexpd to a value larger than time, making dtime negative,
which causes a divide error. See commit 6d5b5acca9e5
("Fix fixpoint divide exception in acct_update_integrals").

In 2012, __acct_update_integrals() was changed to get utime and stime
as function parameters. This re-introduced the bug, because an irq
can hit in-between the call to task_cputime() and where irqs actually
get disabled.
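
To make the interleaving concrete, here is a minimal userspace sketch of
the race. It is not the kernel code: sketch_cputime_t, struct sketch_task
and both functions are hypothetical stand-ins, and the signed cast
assumes two's complement conversion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the kernel uses cputime_t and struct task_struct. */
typedef unsigned long sketch_cputime_t;

struct sketch_task {
	sketch_cputime_t acct_timexpd;	/* time already accounted for */
};

/* Models the irq path: it accounts up to 'now' and advances acct_timexpd. */
static void sketch_irq_update(struct sketch_task *tsk, sketch_cputime_t now)
{
	tsk->acct_timexpd = now;
}

/*
 * Models the task path, which was handed 'time' before the irq ran.
 * Returns true if the resulting delta is unusable (zero or "negative").
 */
static bool sketch_delta_is_stale(const struct sketch_task *tsk,
				  sketch_cputime_t time)
{
	/* Unsigned subtraction wraps when acct_timexpd is ahead of time. */
	sketch_cputime_t dtime = time - tsk->acct_timexpd;

	/* Assumption: two's complement conversion to signed long. */
	return (signed long)dtime <= 0;
}
```

If the task path samples time = 100 and the irq path then advances
acct_timexpd to 105, the task path computes a wrapped dtime that the
signed comparison reports as stale.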

However, this race condition was originally reproduced on Hercules,
and I have not seen any reports of it re-occurring since it was
re-introduced 3 years ago.

On the other hand, the irq disabling and re-enabling, which no longer
even protects us against the race today, show up prominently in the
perf profile of a program that makes a very large number of system calls
in a short period of time, when nohz_full= (and context tracking) is
enabled.

This patch replaces the (now ineffective) irq blocking with a cheaper
way to test for the race condition, and speeds up my microbenchmark
with 10 million iterations:

		run time	system time
vanilla		5.49s		2.08s
patch		5.21s		1.92s

The above shows a reduction in system time of about 7.7%
((2.08 - 1.92) / 2.08) and in run time of about 5%.
The standard deviation is mostly in the third digit after
the decimal point.
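
For readers who want to try something similar, a syscall-heavy loop
could look roughly like the sketch below. This is not the microbenchmark
used for the numbers above: the choice of getppid and the function name
bench_syscalls() are assumptions made for illustration only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issues n cheap system calls and returns how many were made. */
static long bench_syscalls(long n)
{
	long i;

	for (i = 0; i < n; i++)
		syscall(SYS_getppid);	/* raw syscall, one kernel entry per call */
	return i;
}
```

Calling bench_syscalls(10000000) from main() and timing the program with
time(1) gives run time and system time figures in the same form as the
table above.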

Cc: Andy Lutomirsky <amluto@amacapital.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..0b967f116a6b 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -126,23 +126,29 @@ static void __acct_update_integrals(struct task_struct *tsk,
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
 		struct timeval value;
-		unsigned long flags;
 		u64 delta;
 
-		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
+		/*
+		 * This code is called both from irq context and from
+		 * task context. There is a race where irq context advances
+		 * tsk->acct_timexpd to a value larger than time, creating
+		 * a negative value. In that case, the irq has already
+		 * updated the statistics.
+		 */
+		if (unlikely((signed long)dtime <= 0))
+			return;
+
 		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
 		delta = value.tv_sec;
 		delta = delta * USEC_PER_SEC + value.tv_usec;
 
 		if (delta == 0)
-			goto out;
+			return;
 		tsk->acct_timexpd = time;
 		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
 		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
-	out:
-		local_irq_restore(flags);
 	}
 }
 



* Re: [PATCH] context_tracking: remove local_irq_save from __acct_update_integrals
  2015-04-24  1:57 [PATCH] context_tracking: remove local_irq_save from __acct_update_integrals Rik van Riel
@ 2015-04-24  9:11 ` Heiko Carstens
  2015-04-24 15:10   ` Rik van Riel
  0 siblings, 1 reply; 3+ messages in thread
From: Heiko Carstens @ 2015-04-24  9:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Andy Lutomirsky, Frederic Weisbecker,
	Peter Zijlstra, Luiz Capitulino, Marcelo Tosatti, Clark Williams

On Thu, Apr 23, 2015 at 09:57:13PM -0400, Rik van Riel wrote:
> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> index 975cb49e32bf..0b967f116a6b 100644
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -126,23 +126,29 @@ static void __acct_update_integrals(struct task_struct *tsk,
>  	if (likely(tsk->mm)) {
>  		cputime_t time, dtime;
>  		struct timeval value;
> -		unsigned long flags;
>  		u64 delta;
> 
> -		local_irq_save(flags);
>  		time = stime + utime;
>  		dtime = time - tsk->acct_timexpd;
> +		/*
> +		 * This code is called both from irq context and from
> +		 * task context. There is a race where irq context advances
> +		 * tsk->acct_timexpd to a value larger than time, creating
> +		 * a negative value. In that case, the irq has already
> +		 * updated the statistics.
> +		 */
> +		if (unlikely((signed long)dtime <= 0))
> +			return;

FWIW, I think you either need a barrier() before the if-statement or use
READ_ONCE() when reading tsk->acct_timexpd above.

Otherwise the compiler could (in theory at least) generate code which
would translate to 
		if (unlikely(time <= tsk->acct_timexpd))
in order to achieve the same result, no?

Besides that cputime_t might be 64 bit in size, therefore you don't have
much of a guarantee that reading tsk->acct_timexpd happens atomically on
32 bit architectures, so you _may_ end up with garbage, no?
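
A rough userspace sketch of the READ_ONCE() variant follows. It is not a
proposed patch: SKETCH_READ_ONCE is a simplified volatile-cast stand-in
for the kernel macro, and once_cputime_t, struct once_task and
once_delta() are hypothetical names.

```c
#include <assert.h>

/* Simplified stand-in for READ_ONCE(): forces exactly one load of x. */
#define SKETCH_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-ins for cputime_t and struct task_struct. */
typedef unsigned long once_cputime_t;

struct once_task {
	once_cputime_t acct_timexpd;
};

/* Returns the delta to account, or 0 if irq context got there first. */
static once_cputime_t once_delta(struct once_task *tsk, once_cputime_t time)
{
	/* Single snapshot; the compiler may not re-read tsk->acct_timexpd. */
	once_cputime_t expd = SKETCH_READ_ONCE(tsk->acct_timexpd);
	once_cputime_t dtime = time - expd;

	/* Assumption: two's complement conversion to signed long. */
	if ((signed long)dtime <= 0)
		return 0;
	return dtime;
}
```

The sketch uses unsigned long, so it only covers the first concern. It
does nothing about a 64 bit cputime_t being read in two halves on a
32 bit architecture.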



* Re: [PATCH] context_tracking: remove local_irq_save from __acct_update_integrals
  2015-04-24  9:11 ` Heiko Carstens
@ 2015-04-24 15:10   ` Rik van Riel
  0 siblings, 0 replies; 3+ messages in thread
From: Rik van Riel @ 2015-04-24 15:10 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: linux-kernel, Andy Lutomirsky, Frederic Weisbecker,
	Peter Zijlstra, Luiz Capitulino, Marcelo Tosatti, Clark Williams

On 04/24/2015 05:11 AM, Heiko Carstens wrote:
> On Thu, Apr 23, 2015 at 09:57:13PM -0400, Rik van Riel wrote:
>> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
>> index 975cb49e32bf..0b967f116a6b 100644
>> --- a/kernel/tsacct.c
>> +++ b/kernel/tsacct.c
>> @@ -126,23 +126,29 @@ static void __acct_update_integrals(struct task_struct *tsk,
>>  	if (likely(tsk->mm)) {
>>  		cputime_t time, dtime;
>>  		struct timeval value;
>> -		unsigned long flags;
>>  		u64 delta;
>>
>> -		local_irq_save(flags);
>>  		time = stime + utime;
>>  		dtime = time - tsk->acct_timexpd;
>> +		/*
>> +		 * This code is called both from irq context and from
>> +		 * task context. There is a race where irq context advances
>> +		 * tsk->acct_timexpd to a value larger than time, creating
>> +		 * a negative value. In that case, the irq has already
>> +		 * updated the statistics.
>> +		 */
>> +		if (unlikely((signed long)dtime <= 0))
>> +			return;
> 
> FWIW, I think you either need a barrier() before the if-statement or use
> READ_ONCE() when reading tsk->acct_timexpd above.
> 
> Otherwise the compiler could (in theory at least) generate code which
> would translate to 
> 		if (unlikely(time <= tsk->acct_timexpd))
> in order to achieve the same result, no?
> 
> Besides that cputime_t might be 64 bit in size, therefore you don't have
> much of a guarantee that reading tsk->acct_timexpd happens atomically on
> 32 bit architectures, so you _may_ end up with garbage, no?

You are right on both counts. Thank you for pointing out what
should have been obvious...

Let me post a new patch :)

-- 
All rights reversed.

