* [PATCH 0/3] reduce nohz_full syscall overhead by 10%
@ 2015-04-30 21:23 riel
  2015-04-30 21:23 ` [PATCH 1/3] reduce indentation in __acct_update_integrals riel
                   ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: riel @ 2015-04-30 21:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, williams, luto, mingo, bonzini, fweisbec, peterz,
	heiko.carstens, tglx

Profiling reveals that a lot of the overhead from the nohz_full
accounting seems to come not from the accounting itself, but from
disabling and re-enabling interrupts.

This patch series removes the interrupt disabling & re-enabling
from __acct_update_integrals, which is called on both syscall
entry and exit, as well as from the user_exit called on syscall
entry.

Together they speed up a microbenchmark that calls getpriority()
in a loop 10 million times by about 10%:

		run time	system time
vanilla		5.49s		2.08s
__acct patch	5.21s		1.92s
both patches	4.88s		1.71s
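
For reference, the benchmark is essentially just a tight getpriority()
loop, timed with something like time(1); a minimal sketch (the actual
harness is not included in this posting) looks like:

	#include <sys/resource.h>

	int main(void)
	{
		int i;

		/*
		 * getpriority() is about the cheapest real syscall there is,
		 * so this mostly measures kernel entry/exit overhead.
		 */
		for (i = 0; i < 10000000; i++)
			getpriority(PRIO_PROCESS, 0);

		return 0;
	}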


^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 1/3] reduce indentation in __acct_update_integrals
  2015-04-30 21:23 [PATCH 0/3] reduce nohz_full syscall overhead by 10% riel
@ 2015-04-30 21:23 ` riel
  2015-04-30 21:23 ` [PATCH 2/3] remove local_irq_save from __acct_update_integrals riel
  2015-04-30 21:23 ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry riel
  2 siblings, 0 replies; 83+ messages in thread
From: riel @ 2015-04-30 21:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, williams, luto, mingo, bonzini, fweisbec, peterz,
	heiko.carstens, tglx, Andy Lutomirski, Rik van Riel

From: Peter Zijlstra <peterz@infradead.org>

Reduce indentation in __acct_update_integrals.

Cc: Andy Lutomirski <amluto@amacapital.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..9e225425bc3a 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -123,27 +123,27 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
-	if (likely(tsk->mm)) {
-		cputime_t time, dtime;
-		struct timeval value;
-		unsigned long flags;
-		u64 delta;
-
-		local_irq_save(flags);
-		time = stime + utime;
-		dtime = time - tsk->acct_timexpd;
-		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
-		delta = value.tv_sec;
-		delta = delta * USEC_PER_SEC + value.tv_usec;
-
-		if (delta == 0)
-			goto out;
+	cputime_t time, dtime;
+	struct timeval value;
+	unsigned long flags;
+	u64 delta;
+
+	if (unlikely(!tsk->mm))
+		return;
+
+	local_irq_save(flags);
+	time = stime + utime;
+	dtime = time - tsk->acct_timexpd;
+	jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
+	delta = value.tv_sec;
+	delta = delta * USEC_PER_SEC + value.tv_usec;
+
+	if (delta) {
 		tsk->acct_timexpd = time;
 		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
 		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
-	out:
-		local_irq_restore(flags);
 	}
+	local_irq_restore(flags);
 }
 
 /**
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 2/3] remove local_irq_save from __acct_update_integrals
  2015-04-30 21:23 [PATCH 0/3] reduce nohz_full syscall overhead by 10% riel
  2015-04-30 21:23 ` [PATCH 1/3] reduce indentation in __acct_update_integrals riel
@ 2015-04-30 21:23 ` riel
  2015-04-30 21:23 ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry riel
  2 siblings, 0 replies; 83+ messages in thread
From: riel @ 2015-04-30 21:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, williams, luto, mingo, bonzini, fweisbec, peterz,
	heiko.carstens, tglx, Rik van Riel, Andy Lutomirski

From: Rik van Riel <riel@redhat.com>

The function __acct_update_integrals() is called both from irq context
and task context. This creates a race where irq context can advance
tsk->acct_timexpd to a value larger than time, leading to a negative
value, which causes a divide error. See commit 6d5b5acca9e5
("Fix fixpoint divide exception in acct_update_integrals")

In 2012, __acct_update_integrals() was changed to get utime and stime
as function parameters. This re-introduced the bug, because an irq
can hit in-between the call to task_cputime() and where irqs actually
get disabled.
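
Roughly, the problematic interleaving looks like this (illustrative
only):

	task context                        irq context
	------------                        -----------
	task_cputime(tsk, &utime, &stime);
	                                    __acct_update_integrals()
	                                      tsk->acct_timexpd = <newer time>;
	__acct_update_integrals(tsk, utime, stime);
	  local_irq_save(flags);            /* too late */
	  dtime = time - tsk->acct_timexpd; /* goes "negative" */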

However, this race condition was originally reproduced on Hercules,
and I have not seen any reports of it re-occurring since it was
re-introduced 3 years ago.

On the other hand, the irq disabling and re-enabling, which no longer
even protect us against the race today, show up prominently in the
perf profile of a program that makes a very large number of system calls
in a short period of time when nohz_full= (and context tracking) is
enabled.

This patch replaces the (now ineffective) irq blocking with a cheaper
test for the race condition, and speeds up my microbenchmark of
10 million iterations (average of 5 runs, tiny stddev):

		run time	system time
vanilla		5.49s		2.08s
patch		5.21s		1.92s

Cc: Andy Lutomirski <amluto@amacapital.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/powerpc/include/asm/cputime.h    |  3 +++
 arch/s390/include/asm/cputime.h       |  3 +++
 include/asm-generic/cputime_jiffies.h |  2 ++
 include/asm-generic/cputime_nsecs.h   |  3 +++
 kernel/tsacct.c                       | 16 ++++++++++++----
 5 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index e2452550bcb1..e41b32f68a2c 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,6 +32,9 @@ static inline void setup_cputime_one_jiffy(void) { }
 typedef u64 __nocast cputime_t;
 typedef u64 __nocast cputime64_t;
 
+typedef s64 signed_cputime_t;
+typedef s64 signed_cputime64_t;
+
 #define cmpxchg_cputime(ptr, old, new) cmpxchg(ptr, old, new)
 
 #ifdef __KERNEL__
diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h
index 221b454c734a..2e8c268cc2a7 100644
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -18,6 +18,9 @@
 typedef unsigned long long __nocast cputime_t;
 typedef unsigned long long __nocast cputime64_t;
 
+typedef signed long long signed_cputime_t;
+typedef signed long long signed_cputime64_t;
+
 #define cmpxchg_cputime(ptr, old, new) cmpxchg64(ptr, old, new)
 
 static inline unsigned long __div(unsigned long long n, unsigned long base)
diff --git a/include/asm-generic/cputime_jiffies.h b/include/asm-generic/cputime_jiffies.h
index fe386fc6e85e..b96b6a1b6c97 100644
--- a/include/asm-generic/cputime_jiffies.h
+++ b/include/asm-generic/cputime_jiffies.h
@@ -2,6 +2,7 @@
 #define _ASM_GENERIC_CPUTIME_JIFFIES_H
 
 typedef unsigned long __nocast cputime_t;
+typedef signed long signed_cputime_t;
 
 #define cmpxchg_cputime(ptr, old, new) cmpxchg(ptr, old, new)
 
@@ -11,6 +12,7 @@ typedef unsigned long __nocast cputime_t;
 #define jiffies_to_cputime(__hz)	(__force cputime_t)(__hz)
 
 typedef u64 __nocast cputime64_t;
+typedef s64 signed_cputime64_t;
 
 #define cputime64_to_jiffies64(__ct)	(__force u64)(__ct)
 #define jiffies64_to_cputime64(__jif)	(__force cputime64_t)(__jif)
diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h
index 0419485891f2..c1ad2f90a4d9 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -21,6 +21,9 @@
 typedef u64 __nocast cputime_t;
 typedef u64 __nocast cputime64_t;
 
+typedef s64 signed_cputime_t;
+typedef s64 signed_cputime64_t;
+
 #define cmpxchg_cputime(ptr, old, new) cmpxchg64(ptr, old, new)
 
 #define cputime_one_jiffy		jiffies_to_cputime(1)
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 9e225425bc3a..e497c1c05675 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -125,15 +125,24 @@ static void __acct_update_integrals(struct task_struct *tsk,
 {
 	cputime_t time, dtime;
 	struct timeval value;
-	unsigned long flags;
 	u64 delta;
 
 	if (unlikely(!tsk->mm))
 		return;
 
-	local_irq_save(flags);
+	/*
+	 * This code is called both from task context and irq context.
+	 * There is a rare race where irq context advances tsk->acct_timexpd
+	 * to a value larger than time, leading to a negative dtime, which
+	 * could lead to a divide error in cputime_to_jiffies.
+	 * The statistics updated here are fairly rough estimates; just
+	 * ignore irq and task double accounting the same timer tick.
+	 */
 	time = stime + utime;
-	dtime = time - tsk->acct_timexpd;
+	dtime = time - READ_ONCE(tsk->acct_timexpd);
+	if (unlikely((signed_cputime_t)dtime <= 0))
+		return;
+
 	jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
 	delta = value.tv_sec;
 	delta = delta * USEC_PER_SEC + value.tv_usec;
@@ -143,7 +152,6 @@ static void __acct_update_integrals(struct task_struct *tsk,
 		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
 		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
 	}
-	local_irq_restore(flags);
 }
 
 /**
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-04-30 21:23 [PATCH 0/3] reduce nohz_full syscall overhead by 10% riel
  2015-04-30 21:23 ` [PATCH 1/3] reduce indentation in __acct_update_integrals riel
  2015-04-30 21:23 ` [PATCH 2/3] remove local_irq_save from __acct_update_integrals riel
@ 2015-04-30 21:23 ` riel
  2015-04-30 21:56   ` Andy Lutomirski
  2015-05-01  6:40   ` Ingo Molnar
  2 siblings, 2 replies; 83+ messages in thread
From: riel @ 2015-04-30 21:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, williams, luto, mingo, bonzini, fweisbec, peterz,
	heiko.carstens, tglx, Rik van Riel, Ingo Molnar, Paolo Bonzini

From: Rik van Riel <riel@redhat.com>

On syscall entry with nohz_full on, we enable interrupts, call user_exit,
disable interrupts, do something, re-enable interrupts, and go on our
merry way.

Profiling shows that a large amount of the nohz_full overhead comes
from the extraneous disabling and re-enabling of interrupts. Andy
suggested simply not enabling interrupts until after the context
tracking code has done its thing, which allows us to skip a whole
interrupt disable & re-enable cycle.

This patch builds on top of these patches by Paolo:
https://lkml.org/lkml/2015/4/28/188
https://lkml.org/lkml/2015/4/29/139

Together with this patch I posted earlier this week, the syscall path
on a nohz_full cpu seems to be about 10% faster.
https://lkml.org/lkml/2015/4/24/394

My test is a simple microbenchmark that calls getpriority() in a loop
10 million times:

		run time	system time
vanilla		5.49s		2.08s
__acct patch	5.21s		1.92s
both patches	4.88s		1.71s

Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <amluto@amacapital.net>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/kernel/entry_32.S       |  4 ++--
 arch/x86/kernel/entry_64.S       |  4 ++--
 arch/x86/kernel/ptrace.c         |  6 +++++-
 include/linux/context_tracking.h | 11 +++++++++++
 4 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 1c309763e321..0bdf8c7057e4 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -406,7 +406,6 @@ ENTRY(ia32_sysenter_target)
 
 	pushl_cfi %eax
 	SAVE_ALL
-	ENABLE_INTERRUPTS(CLBR_NONE)
 
 /*
  * Load the potential sixth argument from user stack.
@@ -424,6 +423,7 @@ ENTRY(ia32_sysenter_target)
 
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
 	jnz sysenter_audit
+	ENABLE_INTERRUPTS(CLBR_NONE)
 sysenter_do_call:
 	cmpl $(NR_syscalls), %eax
 	jae sysenter_badsys
@@ -647,7 +647,7 @@ END(work_pending)
 syscall_trace_entry:
 	movl $-ENOSYS,PT_EAX(%esp)
 	movl %esp, %eax
-	call syscall_trace_enter
+	call syscall_trace_enter	/* returns with irqs enabled */
 	/* What it returned is what we'll actually use.  */
 	cmpl $(NR_syscalls), %eax
 	jnae syscall_call
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 02c2eff7478d..f7751da7b53e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -228,7 +228,6 @@ GLOBAL(system_call_after_swapgs)
 	 * task preemption. We must enable interrupts only after we're done
 	 * with using rsp_scratch:
 	 */
-	ENABLE_INTERRUPTS(CLBR_NONE)
 	pushq_cfi	%r11			/* pt_regs->flags */
 	pushq_cfi	$__USER_CS		/* pt_regs->cs */
 	pushq_cfi	%rcx			/* pt_regs->ip */
@@ -248,6 +247,7 @@ GLOBAL(system_call_after_swapgs)
 
 	testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
 	jnz tracesys
+	ENABLE_INTERRUPTS(CLBR_NONE)
 system_call_fastpath:
 #if __SYSCALL_MASK == ~0
 	cmpq $__NR_syscall_max,%rax
@@ -313,7 +313,7 @@ GLOBAL(system_call_after_swapgs)
 tracesys:
 	movq %rsp, %rdi
 	movl $AUDIT_ARCH_X86_64, %esi
-	call syscall_trace_enter_phase1
+	call syscall_trace_enter_phase1 /* returns with interrupts enabled */
 	test %rax, %rax
 	jnz tracesys_phase2		/* if needed, run the slow path */
 	RESTORE_C_REGS_EXCEPT_RAX	/* else restore clobbered regs */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..066c86d0b68c 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1456,6 +1456,8 @@ static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
  *
  * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
  * are fully functional.
+ * Called with IRQs disabled, to be enabled after the context tracking
+ * code has run.
  *
  * For phase 2's benefit, our return value is:
  * 0:			resume the syscall
@@ -1477,10 +1479,12 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 * doing anything that could touch RCU.
 	 */
 	if (work & _TIF_NOHZ) {
-		user_exit();
+		user_exit_irqsoff();
 		work &= ~_TIF_NOHZ;
 	}
 
+	local_irq_enable();
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp first -- it should minimize exposure of other
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 5d3719aed958..dc3b169b2b70 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -25,12 +25,23 @@ static inline void user_enter(void)
 		context_tracking_enter(CONTEXT_USER);
 
 }
+
 static inline void user_exit(void)
 {
 	if (context_tracking_is_enabled())
 		context_tracking_exit(CONTEXT_USER);
 }
 
+/* Called with IRQs already disabled. */
+static inline void user_exit_irqsoff(void)
+{
+	if (in_interrupt())
+		return;
+
+	if (context_tracking_is_enabled())
+		__context_tracking_exit(CONTEXT_USER);
+}
+
 static inline enum ctx_state exception_enter(void)
 {
 	enum ctx_state prev_ctx;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-04-30 21:23 ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry riel
@ 2015-04-30 21:56   ` Andy Lutomirski
  2015-05-01  6:40   ` Ingo Molnar
  1 sibling, 0 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-04-30 21:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, Ingo Molnar,
	bonzini, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini, Borislav Petkov,
	Denys Vlasenko, H. Peter Anvin, Steven Rostedt

[added some cc's]

On Thu, Apr 30, 2015 at 2:23 PM,  <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> On syscall entry with nohz_full on, we enable interrupts, call user_exit,
> disable interrupts, do something, re-enable interrupts, and go on our
> merry way.
>
> Profiling shows that a large amount of the nohz_full overhead comes
> from the extraneous disabling and re-enabling of interrupts. Andy
> suggested simply not enabling interrupts until after the context
> tracking code has done its thing, which allows us to skip a whole
> interrupt disable & re-enable cycle.
>
> This patch builds on top of these patches by Paolo:
> https://lkml.org/lkml/2015/4/28/188
> https://lkml.org/lkml/2015/4/29/139
>
> Together with this patch I posted earlier this week, the syscall path
> on a nohz_full cpu seems to be about 10% faster.
> https://lkml.org/lkml/2015/4/24/394
>
> My test is a simple microbenchmark that calls getpriority() in a loop
> 10 million times:
>
>                 run time        system time
> vanilla         5.49s           2.08s
> __acct patch    5.21s           1.92s
> both patches    4.88s           1.71s

This has two downsides:

1. It lengthens (slightly) the IRQs-off window at the beginning of syscalls.

2. It replaces an untraced irq disable with a traced irq disable.
That's probably not quite free.

Nonetheless, I'm okay with it in principle.  Context tracking is
useful and increasingly common, and I'd like to make it fast.  This is
a minimally invasive change that helps quite a bit.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-04-30 21:23 ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry riel
  2015-04-30 21:56   ` Andy Lutomirski
@ 2015-05-01  6:40   ` Ingo Molnar
  2015-05-01 15:20     ` Rik van Riel
  1 sibling, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01  6:40 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, x86, williams, luto, bonzini, fweisbec, peterz,
	heiko.carstens, tglx, Ingo Molnar, Paolo Bonzini


* riel@redhat.com <riel@redhat.com> wrote:

> From: Rik van Riel <riel@redhat.com>
> 
> On syscall entry with nohz_full on, we enable interrupts, call user_exit,
> disable interrupts, do something, re-enable interrupts, and go on our
> merry way.
> 
> Profiling shows that a large amount of the nohz_full overhead comes
> from the extraneous disabling and re-enabling of interrupts. Andy
> suggested simply not enabling interrupts until after the context
> tracking code has done its thing, which allows us to skip a whole
> interrupt disable & re-enable cycle.
> 
> This patch builds on top of these patches by Paolo:
> https://lkml.org/lkml/2015/4/28/188
> https://lkml.org/lkml/2015/4/29/139
> 
> Together with this patch I posted earlier this week, the syscall path
> on a nohz_full cpu seems to be about 10% faster.
> https://lkml.org/lkml/2015/4/24/394
> 
> My test is a simple microbenchmark that calls getpriority() in a loop
> 10 million times:
> 
> 		run time	system time
> vanilla		5.49s		2.08s
> __acct patch	5.21s		1.92s
> both patches	4.88s		1.71s

Just curious, what are the numbers if you don't have context tracking 
enabled, i.e. without nohz_full?

I.e. what's the baseline we are talking about?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01  6:40   ` Ingo Molnar
@ 2015-05-01 15:20     ` Rik van Riel
  2015-05-01 15:59       ` Ingo Molnar
  2015-05-03 13:23       ` Mike Galbraith
  0 siblings, 2 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 15:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, williams, luto, fweisbec, peterz,
	heiko.carstens, tglx, Ingo Molnar, Paolo Bonzini

On 05/01/2015 02:40 AM, Ingo Molnar wrote:

>> This patch builds on top of these patches by Paolo:
>> https://lkml.org/lkml/2015/4/28/188
>> https://lkml.org/lkml/2015/4/29/139
>>
>> Together with this patch I posted earlier this week, the syscall path
>> on a nohz_full cpu seems to be about 10% faster.
>> https://lkml.org/lkml/2015/4/24/394
>>
>> My test is a simple microbenchmark that calls getpriority() in a loop
>> 10 million times:
>>
>> 		run time	system time
>> vanilla		5.49s		2.08s
>> __acct patch	5.21s		1.92s
>> both patches	4.88s		1.71s
>
> Just curious, what are the numbers if you don't have context tracking 
> enabled, i.e. without nohz_full?
> 
> I.e. what's the baseline we are talking about?

It's an astounding difference. This is not a kernel without nohz_full,
just a CPU without nohz_full running the same kernel I tested with
yesterday:

 		run time	system time
vanilla		5.49s		2.08s
__acct patch	5.21s		1.92s
both patches	4.88s		1.71s
CPU w/o nohz	3.12s		1.63s    <-- your numbers, mostly

What is even more interesting is that the majority of the time
difference seems to come from _user_ time, which has gone down
from around 3.4 seconds in the vanilla kernel to around 1.5 seconds
on the CPU without nohz_full enabled...

At syscall entry time, the nohz_full context tracking code is very
straightforward. We check thread_info->flags & _TIF_WORK_SYSCALL_ENTRY,
and call syscall_trace_enter_phase1, which handles USER -> KERNEL
context transition.

Syscall exit time is a convoluted mess. Both do_notify_resume and
syscall_trace_leave call user_exit() on entry and user_enter()
on exit, leaving the time spent looping around between int_with_check
and syscall_return: in entry_64.S accounted as user time.

I sent an email about this last night; it may be useful to add a
third test & function call point to the syscall return code, where
we can call user_enter() just ONCE, and remove the other context
tracking calls from that loop.
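
Very roughly, I am thinking of something like the sketch below
(hypothetical and untested; it assumes do_notify_resume and
syscall_trace_leave drop their own context tracking calls, and the
flags checked here are just examples):

	/* Hypothetical sketch only -- not actual kernel code. */
	__visible void syscall_return_slowpath(struct pt_regs *regs)
	{
		u32 cached_flags = READ_ONCE(current_thread_info()->flags);

		while (cached_flags &
		       (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME)) {
			local_irq_enable();

			if (cached_flags & _TIF_NEED_RESCHED)
				schedule();
			if (cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME))
				do_notify_resume(regs, NULL, cached_flags);

			local_irq_disable();
			cached_flags = READ_ONCE(current_thread_info()->flags);
		}

		/* Single USER transition, right before SYSRET/IRET. */
		user_enter();
	}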

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 15:20     ` Rik van Riel
@ 2015-05-01 15:59       ` Ingo Molnar
  2015-05-01 16:03         ` Andy Lutomirski
  2015-05-03 13:23       ` Mike Galbraith
  1 sibling, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 15:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, williams, luto, fweisbec, peterz,
	heiko.carstens, tglx, Ingo Molnar, Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> > I.e. what's the baseline we are talking about?
> 
> It's an astounding difference. This is not a kernel without 
> nohz_full, just a CPU without nohz_full running the same kernel I 
> tested with yesterday:
> 
>  		run time	system time
> vanilla		5.49s		2.08s
> __acct patch	5.21s		1.92s
> both patches	4.88s		1.71s
> CPU w/o nohz	3.12s		1.63s    <-- your numbers, mostly
> 
> What is even more interesting is that the majority of the time 
> difference seems to come from _user_ time, which has gone down from 
> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on 
> the CPU without nohz_full enabled...
> 
> At syscall entry time, the nohz_full context tracking code is very 
> straightforward. We check thread_info->flags & 
> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which 
> handles USER -> KERNEL context transition.
> 
> Syscall exit time is a convoluted mess. Both do_notify_resume and 
> syscall_trace_leave call user_exit() on entry and user_enter() on
> exit, leaving the time spent looping around between int_with_check 
> and syscall_return: in entry_64.S accounted as user time.
> 
> I sent an email about this last night, it may be useful to add a 
> third test & function call point to the syscall return code, where 
> we can call user_enter() just ONCE, and remove the other context 
> tracking calls from that loop.

So what I'm wondering about is the big picture:

 - This is crazy big overhead in something as fundamental as system
   calls!

 - We don't even have the excuse of the syscall auditing code, which
   kind of has to run for every syscall if it wants to do its job!

 - [ The 'precise vtime' stuff that is driven from syscall entry/exit 
     is crazy, and I hope not enabled in any distro. ]

 - So why are we doing this in every syscall time at all?

Basically the whole point of user-context tracking is to be able to 
flush pending RCU callbacks. But that's crazy, we can sure defer a few 
kfree()s on this CPU, even indefinitely!

If some other CPU does a sync_rcu(), then it can very well pluck those 
callbacks from this super low latency CPU's RCU lists (with due care) 
and go and free stuff itself ... There's no need to disturb this CPU 
for that!

If user-space does not do anything kernel-ish then there won't be any 
new RCU callbacks piled up, so it's not like it's a resource leak 
issue either.

So what's the point? Why not remove this big source of overhead 
altogether?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 15:59       ` Ingo Molnar
@ 2015-05-01 16:03         ` Andy Lutomirski
  2015-05-01 16:21           ` Ingo Molnar
  2015-05-01 16:22           ` Rik van Riel
  0 siblings, 2 replies; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-01 16:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paolo Bonzini

On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Rik van Riel <riel@redhat.com> wrote:
>
>> > I.e. what's the baseline we are talking about?
>>
>> It's an astounding difference. This is not a kernel without
>> nohz_full, just a CPU without nohz_full running the same kernel I
>> tested with yesterday:
>>
>>              run time        system time
>> vanilla       5.49s           2.08s
>> __acct patch  5.21s           1.92s
>> both patches  4.88s           1.71s
>> CPU w/o nohz  3.12s           1.63s    <-- your numbers, mostly
>>
>> What is even more interesting is that the majority of the time
>> difference seems to come from _user_ time, which has gone down from
>> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
>> the CPU without nohz_full enabled...
>>
>> At syscall entry time, the nohz_full context tracking code is very
>> straightforward. We check thread_info->flags &
>> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
>> handles USER -> KERNEL context transition.
>>
>> Syscall exit time is a convoluted mess. Both do_notify_resume and
>> syscall_trace_leave call user_exit() on entry and user_enter() on
>> exit, leaving the time spent looping around between int_with_check
>> and syscall_return: in entry_64.S accounted as user time.
>>
>> I sent an email about this last night, it may be useful to add a
>> third test & function call point to the syscall return code, where
>> we can call user_enter() just ONCE, and remove the other context
>> tracking calls from that loop.
>
> So what I'm wondering about is the big picture:
>
>  - This is crazy big overhead in something as fundamental as system
>    calls!
>
>  - We don't even have the excuse of the syscall auditing code, which
>    kind of has to run for every syscall if it wants to do its job!
>
>  - [ The 'precise vtime' stuff that is driven from syscall entry/exit
>      is crazy, and I hope not enabled in any distro. ]
>
>  - So why are we doing this in every syscall time at all?
>
> Basically the whole point of user-context tracking is to be able to
> flush pending RCU callbacks. But that's crazy, we can sure defer a few
> kfree()s on this CPU, even indefinitely!
>
> If some other CPU does a sync_rcu(), then it can very well pluck those
> callbacks from this super low latency CPU's RCU lists (with due care)
> and go and free stuff itself ... There's no need to disturb this CPU
> for that!
>
> If user-space does not do anything kernel-ish then there won't be any
> new RCU callbacks piled up, so it's not like it's a resource leak
> issue either.
>
> So what's the point? Why not remove this big source of overhead
> altogether?

The last time I asked, the impression I got was that we needed two things:

1. We can't pluck things from the RCU list without knowing whether the
CPU is in an RCU read-side critical section, and we can't know that
unless we have regular grace periods or we know that the CPU is idle.
To make the CPU detectably idle, we need to set a bit somewhere.

2. To suppress the timing tick, we need to get some timing for, um,
the scheduler?  I wasn't really sure about this one.

Could we reduce the overhead by making the IN_USER vs IN_KERNEL
indication be a single bit and, worst case, an rdtsc and maybe a
subtraction?  We could probably get away with banning full nohz on
non-invariant tsc systems.

(I do understand why it would be tricky to transition from IN_USER to
IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:03         ` Andy Lutomirski
@ 2015-05-01 16:21           ` Ingo Molnar
  2015-05-01 16:26             ` Rik van Riel
  2015-05-01 16:37             ` Ingo Molnar
  2015-05-01 16:22           ` Rik van Riel
  1 sibling, 2 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 16:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paolo Bonzini


* Andy Lutomirski <luto@amacapital.net> wrote:

> > So what's the point? Why not remove this big source of overhead 
> > altogether?
> 
> The last time I asked, the impression I got was that we needed two 
> things:
> 
> 1. We can't pluck things from the RCU list without knowing whether 
> the CPU is in an RCU read-side critical section, and we can't know 
> that unless we have regular grace periods or we know that the CPU is 
> idle. To make the CPU detectably idle, we need to set a bit 
> somewhere.

'Idle' as in 'executing pure user-space mode, without entering the 
kernel and possibly doing an rcu_read_lock()', right?

So we don't have to test it from the remote CPU: we could probe such 
CPUs via a single low-overhead IPI. I'd much rather push such overhead 
to sync_rcu() than to the syscall entry code!

I can understand people running hard-RT workloads not wanting to see 
the overhead of a timer tick or a scheduler tick with variable (and 
occasionally heavy) work done in IRQ context, but the jitter caused by 
a single trivial IPI with constant work should be very, very low and 
constant.

If user-space RT code does not tolerate _that_ kind of latency then 
it really has its priorities wrong and we should not try to please it. 
It should not hurt the other 99.9% of sane hard-RT users.

And the other usecase, virtualization, obviously does not care and 
could take the IPI just fine.

> 2. To suppress the timing tick, we need to get some timing for, um, 
> the scheduler?  I wasn't really sure about this one.

So we have variable timeslice timers implemented for the scheduler; 
they are off by default, but they worked the last time someone tried 
them. See the 'HRTICK' scheduler feature.

And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
stopping the scheduler tick. (within reason.)

> Could we reduce the overhead by making the IN_USER vs IN_KERNEL 
> indication be a single bit and, worst case, an rdtsc and maybe a 
> subtraction?  We could probably get away with banning full nohz on 
> non-invariant tsc systems.
> 
> (I do understand why it would be tricky to transition from IN_USER 
> to IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

We can make it literally zero overhead: by using an IPI from 
synchronize_rcu() and friend.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:03         ` Andy Lutomirski
  2015-05-01 16:21           ` Ingo Molnar
@ 2015-05-01 16:22           ` Rik van Riel
  2015-05-01 16:27             ` Ingo Molnar
  1 sibling, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 16:22 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paolo Bonzini

On 05/01/2015 12:03 PM, Andy Lutomirski wrote:

> The last time I asked, the impression I got was that we needed two things:
> 
> 1. We can't pluck things from the RCU list without knowing whether the
> CPU is in an RCU read-side critical section, and we can't know that
> unless we have regular grace periods or we know that the CPU is idle.
> To make the CPU detectably idle, we need to set a bit somewhere.

More than that. We also need a way for another CPU to identify the
callbacks they could run for us, without confusing them with new
callbacks queued after we transitioned back from USER to KERNEL
context.

> 2. To suppress the timing tick, we need to get some timing for, um,
> the scheduler?  I wasn't really sure about this one.
> 
> Could we reduce the overhead by making the IN_USER vs IN_KERNEL
> indication be a single bit and, worst case, an rdtsc and maybe a
> subtraction?  We could probably get away with banning full nohz on
> non-invariant tsc systems.

I suspect we can.

There is no need to update the vtime every single time
we enter vtime_user_enter and functions like it.

We can keep a buffer, which:
1) keeps values in TSC cycles (or whatever unit local_clock does)
2) is only ever accessed by the current task, so it requires no
   locking
3) values can be actually folded into vtime periodically, when
   they exceed a certain threshold (1 second ?)

That means the vtime_seqlock is something that we would only take
once a second or so, and the calculations in account_user_time
would only be done on a fairly infrequent basis. That has the
potential to reduce overhead by a lot.

If nobody has any objections to that kind of change, I would be
happy to implement it.
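
As a very rough sketch (untested; the vtime_buf fields and
vtime_flush_buffered() are made-up names):

	static inline void vtime_user_enter_fast(struct task_struct *tsk)
	{
		u64 now = local_clock();

		/* A kernel span ends here; just accumulate, no locks taken. */
		tsk->vtime_buf.system += now - tsk->vtime_buf.stamp;
		tsk->vtime_buf.stamp = now;

		/* Fold into the real vtime (under vtime_seqlock) only rarely. */
		if (unlikely(tsk->vtime_buf.system + tsk->vtime_buf.user >
			     NSEC_PER_SEC))
			vtime_flush_buffered(tsk);
	}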

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:21           ` Ingo Molnar
@ 2015-05-01 16:26             ` Rik van Riel
  2015-05-01 16:34               ` Ingo Molnar
  2015-05-01 16:37             ` Ingo Molnar
  1 sibling, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 16:26 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paolo Bonzini

On 05/01/2015 12:21 PM, Ingo Molnar wrote:
> 
> * Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>> So what's the point? Why not remove this big source of overhead 
>>> altogether?
>>
>> The last time I asked, the impression I got was that we needed two 
>> things:
>>
>> 1. We can't pluck things from the RCU list without knowing whether 
>> the CPU is in an RCU read-side critical section, and we can't know 
>> that unless we have regular grace periods or we know that the CPU is 
>> idle. To make the CPU detectably idle, we need to set a bit 
>> somewhere.
> 
> 'Idle' as in 'executing pure user-space mode, without entering the 
> kernel and possibly doing an rcu_read_lock()', right?
> 
> So we don't have to test it from the remote CPU: we could probe such 
> CPUs via a single low-overhead IPI. I'd much rather push such overhead 
> to sync_rcu() than to the syscall entry code!
>
> I can understand people running hard-RT workloads not wanting to see 
> the overhead of a timer tick or a scheduler tick with variable (and 
> occasionally heavy) work done in IRQ context, but the jitter caused by 
> a single trivial IPI with constant work should be very, very low and 
> constant.

Not if the realtime workload is running inside a KVM
guest.

At that point an IPI, either on the host or in the
guest, involves a full VMEXIT & VMENTER cycle.

I suspect it would be easy enough at user_enter() or
guest_enter() time to:
1) flip the bit
2) check to see if we have any callbacks queued on our on queue, and
3) if so, move the callbacks from that queue to the "another CPU can
   take these" queue

At that point, another CPU wanting to do an RCU grace period
advance can:
1) skip trying to advance the grace period on the CPU that
   is in userspace or guest mode, and
2) take the "for somebody else" callbacks of that CPU, and
   run them

This does not seem overly complicated.
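
In pseudo-code (made-up structure and field names, and ignoring the
locking a real version would need against the remote CPU):

	struct rcu_cpu_cbs {
		bool			in_user;
		struct list_head	local;		/* queued by this CPU */
		struct list_head	for_others;	/* up for grabs */
	};
	static DEFINE_PER_CPU(struct rcu_cpu_cbs, rcu_cpu_cbs);

	static void rcu_user_enter_handoff(void)
	{
		struct rcu_cpu_cbs *cbs = this_cpu_ptr(&rcu_cpu_cbs);

		WRITE_ONCE(cbs->in_user, true);		/* 1) flip the bit */

		if (!list_empty(&cbs->local))		/* 2) anything queued? */
			list_splice_tail_init(&cbs->local,	/* 3) hand off */
					      &cbs->for_others);
	}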

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:22           ` Rik van Riel
@ 2015-05-01 16:27             ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 16:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> On 05/01/2015 12:03 PM, Andy Lutomirski wrote:
> 
> > The last time I asked, the impression I got was that we needed two things:
> > 
> > 1. We can't pluck things from the RCU list without knowing whether the
> > CPU is in an RCU read-side critical section, and we can't know that
> > unless we have regular grace periods or we know that the CPU is idle.
> > To make the CPU detectably idle, we need to set a bit somewhere.
> 
> More than that. We also need a way for another CPU to identify the 
> callbacks they could run for us, without confusing them with new 
> callbacks queued after we transitioned back from USER to KERNEL 
> context.

That's easy: a simple, constant-work IPI sent to nohz-full CPUs could 
just dequeue these callbacks, if it sees that we are still in 
user-mode.

It's super easy because such an IPI would essentially be equivalent to 
a system call context from RCU's POV, if it uses this test:

	if (user_mode(regs)) {
		... pluck RCU callbacks ...
	}

That way we can have lots of overhead removed from the syscall 
path ...
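
As a sketch (illustrative names; the actual callback adoption is left
out):

	static void rcu_probe_remote(void *unused)
	{
		struct pt_regs *regs = get_irq_regs();

		if (regs && user_mode(regs)) {
			/*
			 * The IPI interrupted pure user-space execution, so
			 * this CPU is quiescent: its callbacks can safely be
			 * adopted by the synchronize_rcu() caller.
			 */
		}
	}

	static void rcu_probe_cpu(int cpu)
	{
		/* Constant, tiny amount of work on the nohz-full CPU. */
		smp_call_function_single(cpu, rcu_probe_remote, NULL, 1);
	}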

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:26             ` Rik van Riel
@ 2015-05-01 16:34               ` Ingo Molnar
  2015-05-01 18:05                 ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 16:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> > I can understand people running hard-RT workloads not wanting to 
> > see the overhead of a timer tick or a scheduler tick with variable 
> > (and occasionally heavy) work done in IRQ context, but the jitter 
> > caused by a single trivial IPI with constant work should be very, 
> > very low and constant.
> 
> Not if the realtime workload is running inside a KVM guest.

I don't buy this:

> At that point an IPI, either on the host or in the guest, involves a 
> full VMEXIT & VMENTER cycle.

So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
usec on recent hardware, and I bet it will get better with time.

I'm not aware of any hard-RT workload that cannot take 1 usec 
latencies.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:21           ` Ingo Molnar
  2015-05-01 16:26             ` Rik van Riel
@ 2015-05-01 16:37             ` Ingo Molnar
  2015-05-01 16:40               ` Rik van Riel
  1 sibling, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 16:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paolo Bonzini


* Ingo Molnar <mingo@kernel.org> wrote:

> > 2. To suppress the timing tick, we need to get some timing for, 
> > um, the scheduler?  I wasn't really sure about this one.
> 
> So we have variable timeslice timers implemented for the scheduler; 
> they are off by default, but they worked the last time someone tried 
> them. See the 'HRTICK' scheduler feature.
> 
> And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
> stopping the scheduler tick. (within reason.)

Also note that this bit in context_tracking_enter():

                        if (state == CONTEXT_USER) {
                                trace_user_enter(0);
                                vtime_user_enter(current);
                        }


is related to precise time measurement of user/kernel execution times, 
it's not needed by the scheduler at all, it's just exported to 
tooling. It's not fundamental to the scheduler.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:37             ` Ingo Molnar
@ 2015-05-01 16:40               ` Rik van Riel
  2015-05-01 16:45                 ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 16:40 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paolo Bonzini

On 05/01/2015 12:37 PM, Ingo Molnar wrote:

> Also note that this bit in context_tracking_enter():
> 
>                         if (state == CONTEXT_USER) {
>                                 trace_user_enter(0);
>                                 vtime_user_enter(current);
>                         }
> 
> 
> is related to precise time measurement of user/kernel execution times, 
> it's not needed by the scheduler at all, it's just exported to 
> tooling. It's not fundamental to the scheduler.

Any objections to the idea from the other thread to simply keep
the time accumulating in buffers in local_clock() units, and
only update the task vtime once a second or so?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:40               ` Rik van Riel
@ 2015-05-01 16:45                 ` Ingo Molnar
  2015-05-01 16:54                   ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 16:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
> 
> > Also note that this bit in context_tracking_enter():
> > 
> >                         if (state == CONTEXT_USER) {
> >                                 trace_user_enter(0);
> >                                 vtime_user_enter(current);
> >                         }
> > 
> > 
> > is related to precise time measurement of user/kernel execution 
> > times, it's not needed by the scheduler at all, it's just exported 
> > to tooling. It's not fundamental to the scheduler.
> 
> Any objections to the idea from the other thread to simply keep the 
> time accumulating in buffers in local_clock() units, and only update 
> the task vtime once a second or so?

So I really think per syscall overhead is the wrong thing to do for 
anything that a common distro enables.

I see very little use for such precise, high-freq measurements on 
normal systems - and abnormal systems could enable it dynamically just 
like they can enable syscall auditing.

I.e. I don't mind the occasional crazy piece of code, as long as it 
does not hurt the innocent.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:45                 ` Ingo Molnar
@ 2015-05-01 16:54                   ` Rik van Riel
  2015-05-01 17:12                     ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 16:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On 05/01/2015 12:45 PM, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
>> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
>>
>>> Also note that this bit in context_tracking_enter():
>>>
>>>                         if (state == CONTEXT_USER) {
>>>                                 trace_user_enter(0);
>>>                                 vtime_user_enter(current);
>>>                         }
>>>
>>>
>>> is related to precise time measurement of user/kernel execution 
>>> times, it's not needed by the scheduler at all, it's just exported 
>>> to tooling. It's not fundamental to the scheduler.
>>
>> Any objections to the idea from the other thread to simply keep the 
>> time accumulating in buffers in local_clock() units, and only update 
>> the task vtime once a second or so?
> 
> So I really think per syscall overhead is the wrong thing to do for 
> anything that a common distro enables.
> 
> I see very little use for such precise, high-freq measurements on 
> normal systems - and abnormal systems could enable it dynamically just 
> like they can enable syscall auditing.
> 
> I.e. I don't mind the occasional crazy piece of code, as long as it 
> does not hurt the innocent.

Then how should/could we keep a rough idea of user / system / guest
time when running without a periodic timer tick?

These statistics are useful, and gathering them in a low impact
way (local_clock read + variable read + subtraction + addition +
store) may be acceptable overhead.

This is a few function calls, one tsc read, a possible cache
line miss, and an unlikely branch.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:54                   ` Rik van Riel
@ 2015-05-01 17:12                     ` Ingo Molnar
  2015-05-01 17:22                       ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 17:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <riel@redhat.com> wrote:
> > 
> >> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
> >>
> >>> Also note that this bit in context_tracking_enter():
> >>>
> >>>                         if (state == CONTEXT_USER) {
> >>>                                 trace_user_enter(0);
> >>>                                 vtime_user_enter(current);
> >>>                         }
> >>>
> >>>
> >>> is related to precise time measurement of user/kernel execution 
> >>> times, it's not needed by the scheduler at all, it's just exported 
> >>> to tooling. It's not fundamental to the scheduler.
> >>
> >> Any objections to the idea from the other thread to simply keep the 
> >> time accumulating in buffers in local_clock() units, and only update 
> >> the task vtime once a second or so?
> > 
> > So I really think per syscall overhead is the wrong thing to do for 
> > anything that a common distro enables.
> > 
> > I see very little use for such precise, high-freq measurements on 
> > normal systems - and abnormal systems could enable it dynamically just 
> > like they can enable syscall auditing.
> > 
> > I.e. I don't mind the occasional crazy piece of code, as long as it 
> > does not hurt the innocent.
> 
> Then how should/could we keep a rough idea of user / system / guest 
> time when running without a periodic timer tick?

So I'd split the timer tick into two parts: just the constant-work 
sampling bit that doesn't do much, and the variable-work part which 
gets essentially shut off when the timeout is far into the future.

Then we could do IRQ driven sampling without introducing a variable 
amount of jitter into hard-RT execution time.

I.e. much of what we do today, except that we could skip variable work 
such as the scheduler tick or (unforced) RCU processing like the RCU 
softirq work.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 17:12                     ` Ingo Molnar
@ 2015-05-01 17:22                       ` Rik van Riel
  2015-05-01 17:59                         ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 17:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On 05/01/2015 01:12 PM, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
>> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
>>>
>>> * Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
>>>>
>>>>> Also note that this bit in context_tracking_enter():
>>>>>
>>>>>                         if (state == CONTEXT_USER) {
>>>>>                                 trace_user_enter(0);
>>>>>                                 vtime_user_enter(current);
>>>>>                         }
>>>>>
>>>>>
>>>>> is related to precise time measurement of user/kernel execution 
>>>>> times, it's not needed by the scheduler at all, it's just exported 
>>>>> to tooling. It's not fundamental to the scheduler.
>>>>
>>>> Any objections to the idea from the other thread to simply keep the 
>>>> time accumulating in buffers in local_clock() units, and only update 
>>>> the task vtime once a second or so?
>>>
>>> So I really think per syscall overhead is the wrong thing to do for 
>>> anything that a common distro enables.
>>>
>>> I see very little use for such precise, high-freq measurements on 
>>> normal systems - and abnormal systems could enable it dynamically just 
>>> like they can enable syscall auditing.
>>>
>>> I.e. I don't mind the occasional crazy piece of code, as long as it 
>>> does not hurt the innocent.
>>
>> Then how should/could we keep a rough idea of user / system / guest 
>> time when running without a periodic timer tick?
> 
> So I'd split the timer tick into two parts: just the constant-work 
> sampling bit that doesn't do much, and the variable-work part which 
> gets essentially shut off when the timeout is far into the future.
> 
> Then we could do IRQ driven sampling without introducing a variable 
> amount of jitter into hard-RT execution time.
> 
> I.e. much of what we do today, except that we could skip variable work 
> such as the scheduler tick or (unforced) RCU processing like the RCU 
> softirq work.

Any ideas how we could avoid that sampling timer interrupt
latency stacking up when dealing with both guest and host?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 17:22                       ` Rik van Riel
@ 2015-05-01 17:59                         ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 17:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> > I.e. much of what we do today, except that we could skip variable 
> > work such as the scheduler tick or (unforced) RCU processing like 
> > the RCU softirq work.
> 
> Any ideas how we could avoid that sampling timer interrupt latency 
> stacking up when dealing with both guest and host?

Well, it would be host_latency+guest_latency worst-case, right?

That should still be bounded, as long as both guest and host is 
running nohz-full mode.

Also, technically when the host gets such an IRQ, the guest is 
interrupted as well. So the host could do the guest's work and pass 
through the result via paravirt or so.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 16:34               ` Ingo Molnar
@ 2015-05-01 18:05                 ` Rik van Riel
  2015-05-01 18:40                   ` Ingo Molnar
  2015-05-02  4:06                   ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry Mike Galbraith
  0 siblings, 2 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 18:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
>>> I can understand people running hard-RT workloads not wanting to 
>>> see the overhead of a timer tick or a scheduler tick with variable 
>>> (and occasionally heavy) work done in IRQ context, but the jitter 
>>> caused by a single trivial IPI with constant work should be very, 
>>> very low and constant.
>>
>> Not if the realtime workload is running inside a KVM guest.
> 
> I don't buy this:
> 
>> At that point an IPI, either on the host or in the guest, involves a 
>> full VMEXIT & VMENTER cycle.
> 
> So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> usec on recent hardware, and I bet it will get better with time.
> 
> I'm not aware of any hard-RT workload that cannot take 1 usec 
> latencies.

Now think about doing this kind of IPI from inside a guest,
to another VCPU on the same guest.

Now you are looking at VMEXIT/VMENTER on the first VCPU,
plus the cost of the IPI on the host, plus the cost of
the emulation layer, plus VMEXIT/VMENTER on the second
VCPU to trigger the IPI work, and possibly a second
VMEXIT/VMENTER for IPI completion.

I suspect it would be better to do RCU callback offload
in some other way.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 18:05                 ` Rik van Riel
@ 2015-05-01 18:40                   ` Ingo Molnar
  2015-05-01 19:11                     ` Rik van Riel
  2015-05-02  4:06                   ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry Mike Galbraith
  1 sibling, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-01 18:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini, Paul E. McKenney,
	Linus Torvalds


* Rik van Riel <riel@redhat.com> wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <riel@redhat.com> wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to 
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload, but it's about the basic 
synchronization guarantee of synchronize_rcu(): that all RCU read-side 
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know, in an SMP-coherent way, that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

        pushq_cfi $__USER_DS                    /* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.

So the remote CPU would have to do a read of this location. There are 
two cases:

 - If it's 0, then it has observed a quiescent state on that CPU. (The 
   read does not have to be atomic anymore, as we'd only observe the 
   value and MESI coherency takes care of it.)

 - If it's not 0 then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag in it and expect it 
   to process it either on return to user-space or on a context 
   switch.

This, unless I'm missing something, reduces the overhead to a 
single store to a hot cacheline on return-to-userspace - an 
instruction which, if we place it well, might as well be close to zero cost. 
No syscall entry cost. Slow-return cost only in the (rare) case of 
someone using synchronize_rcu().
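
As a minimal user-space model of those two cases (C11 atomics stand in 
for the pt_regs store; the names and the 0/1 encoding are assumptions 
of the sketch, not kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Models the per-task word written on kernel entry and cleared on return. */
static _Atomic int in_kernel;

static void syscall_entry(void)  { atomic_store(&in_kernel, 1); }
static void syscall_return(void) { atomic_store(&in_kernel, 0); }

/* Remote CPU: a plain load is enough, cache coherency does the rest. */
static bool remote_sees_quiescent(void)
{
	if (atomic_load(&in_kernel) == 0)
		return true;	/* case 1: user space, quiescent */
	return false;		/* case 2: in the kernel, set TIF_NOHZ instead */
}

int main(void)
{
	syscall_entry();
	printf("quiescent: %d\n", remote_sees_quiescent());	/* 0 */
	syscall_return();
	printf("quiescent: %d\n", remote_sees_quiescent());	/* 1 */
	return 0;
}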

Hm?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 18:40                   ` Ingo Molnar
@ 2015-05-01 19:11                     ` Rik van Riel
  2015-05-01 19:37                       ` Andy Lutomirski
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-01 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini, Paul E. McKenney,
	Linus Torvalds

On 05/01/2015 02:40 PM, Ingo Molnar wrote:

> Or we could do that in the syscall path with a single store of a 
> constant flag to a location in the task struct. We have a number of 
> natural flags that get written on syscall entry, such as:
> 
>         pushq_cfi $__USER_DS                    /* pt_regs->ss */
> 
> That goes to a constant location on the kernel stack. On return from 
> system calls we could write 0 to that location.
> 
> So the remote CPU would have to do a read of this location. There are 
> two cases:
> 
>  - If it's 0, then it has observed a quiescent state on that CPU. (The 
>    read does not have to be atomic anymore, as we'd only observe the 
>    value and MESI coherency takes care of it.)

That should do the trick.

>  - If it's not 0 then the remote CPU is not executing user-space code 
>    and we can install (remotely) a TIF_NOHZ flag in it and expect it 
>    to process it either on return to user-space or on a context 
>    switch.

I may have to think about this a little more, but
it seems like it should work.

Can we use a separate byte in the flags word for
flags that can get set remotely, so we can set
and clear local-only flags without atomic
instructions?

> This, unless I'm missing something, reduces the overhead to a 
> single store to a hot cacheline on return-to-userspace - an 
> instruction which, if we place it well, might as well be close to zero cost. 
> No syscall entry cost. Slow-return cost only in the (rare) case of 
> someone using synchronize_rcu().

I think that should take care of the RCU aspect of
nohz_full.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 19:11                     ` Rik van Riel
@ 2015-05-01 19:37                       ` Andy Lutomirski
  2015-05-02  5:27                         ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-01 19:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paolo Bonzini, Paul E. McKenney, Linus Torvalds

On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <riel@redhat.com> wrote:
> On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>
>> Or we could do that in the syscall path with a single store of a
>> constant flag to a location in the task struct. We have a number of
>> natural flags that get written on syscall entry, such as:
>>
>>         pushq_cfi $__USER_DS                    /* pt_regs->ss */
>>
>> That goes to a constant location on the kernel stack. On return from
>> system calls we could write 0 to that location.

Huh?  IRET with zero there will fault, and we can't guarantee that all
syscalls return using sysret.  Also, we'd have to audit all the entries,
and system_call_after_swapgs currently enables interrupts early enough
that an interrupt before all the pushes will do unpredictable things to
pt_regs.

We could further abuse orig_ax, but that would require a lot of
auditing.  Honestly, though, I think keeping a flag in an
otherwise-hot cache line is a better bet.  Also, making it per-cpu
instead of per-task will probably be easier on the RCU code, since
otherwise the RCU code will have to do some kind of synchronization
(even if it's a wait-free probe of the rq lock or similar) to figure
out which pt_regs to look at.

If we went that route, I'd advocate sticking the flag in tss->sp1.
That cacheline is unconditionally read on kernel entry already, and
sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
lost track). [1]

Alternatively, I'd suggest that we actually add a whole new word to pt_regs.

[1] It's not unconditionally accessed yet, but it will be once Denys'
latest patches are in.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 18:05                 ` Rik van Riel
  2015-05-01 18:40                   ` Ingo Molnar
@ 2015-05-02  4:06                   ` Mike Galbraith
  1 sibling, 0 replies; 83+ messages in thread
From: Mike Galbraith @ 2015-05-02  4:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andy Lutomirski, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On Fri, 2015-05-01 at 14:05 -0400, Rik van Riel wrote:
> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <riel@redhat.com> wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest,
> to another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,
> plus the cost of the IPI on the host, plus the cost of
> the emulation layer, plus VMEXIT/VMENTER on the second
> VCPU to trigger the IPI work, and possibly a second
> VMEXIT/VMENTER for IPI completion.
> 
> I suspect it would be better to do RCU callback offload
> in some other way.

I don't get it.  How the heck do people manage to talk about realtime in
virtual boxen, and not at least crack a smile.  Real, virtual, real,
virtual... what's wrong with this picture?

Why is virtual realtime not an oxymoron?

I did that for grins once, and it was either really funny, or really
sad, not sure which... but it did not look really really useful.

	-Mike


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 19:37                       ` Andy Lutomirski
@ 2015-05-02  5:27                         ` Ingo Molnar
  2015-05-02 18:27                           ` Rik van Riel
                                             ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-02  5:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paolo Bonzini, Paul E. McKenney, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <riel@redhat.com> wrote:
> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> >
> >> Or we could do that in the syscall path with a single store of a
> >> constant flag to a location in the task struct. We have a number of
> >> natural flags that get written on syscall entry, such as:
> >>
> >>         pushq_cfi $__USER_DS                    /* pt_regs->ss */
> >>
> >> That goes to a constant location on the kernel stack. On return from
> >> system calls we could write 0 to that location.
> 
> Huh?  IRET with zero there will fault, and we can't guarantee that 
> all syscalls return using sysret. [...]

So IRET is a slowpath - I was thinking about the fastpath mainly.

> [...]  Also, we'd have to audit all the entries, and 
> system_call_after_swapgs currently enables interrupts early enough 
> that an interrupt before all the pushes will do unpredictable things 
> to pt_regs.

An irq hardware frame won't push zero to that selector value, right? 
That's the only bad thing that would confuse the code.

> We could further abuse orig_ax, but that would require a lot of 
> auditing.  Honestly, though, I think keeping a flag in an 
> otherwise-hot cache line is a better bet. [...]

That would work too, at the cost of one more instruction, as now we'd 
have to both set and clear it.

> [...]  Also, making it per-cpu instead of per-task will probably be 
> easier on the RCU code, since otherwise the RCU code will have to do 
> some kind of synchronization (even if it's a wait-free probe of the 
> rq lock or similar) to figure out which pt_regs to look at.

So the synchronize_rcu() part is an even slower slow path, in 
comparison with system call entry overhead.

Yes, safely accessing the remote task is a complication, but with 
such a scheme we cannot avoid it: we'd still have to set TIF_NOHZ. So 
even if we do:

> If we went that route, I'd advocate sticking the flag in tss->sp1. 
> That cacheline is unconditionally read on kernel entry already, and 
> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I 
> lost track). [1]
> 
> Alternatively, I'd suggest that we actually add a whole new word to 
> pt_regs.

... we'd still have to set TIF_NOHZ and are back to square one in 
terms of race complexity.

But it's not overly complex: by taking the remote CPU's rq-lock from 
synchronize_rcu() we could get a stable pointer to the currently 
executing task.

And if we do that, we might as well use the opportunity and take a 
look at pt_regs as well - this is why sticking it into pt_regs might 
be better.

So I'd:

  - enlarge pt_regs by a word and stick the flag there (this 
    allocation is essentially free)

  - update the word from entry/exit

  - synchronize_rcu() avoids having to send an IPI by taking a 
    peek at rq->curr's pt_regs::flag, and if:

     - the flag is 0 then it has observed a quiescent state.

     - the flag is 1, then it would set TIF_NOHZ and wait for a 
       completion from a TIF_NOHZ callback.

synchronize_rcu() often involves waiting for (timer tick driven) grace 
periods anyway, so this is a relatively fast solution - and it would 
limit the overhead to 2 instructions.
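
Modelled in user space, that slow path would be something like the 
below (the flag values, the lock and the completion are just stand-ins 
for the sketch, not actual kernel code; the re-check after posting the 
request is there so the report cannot be lost if the flag is cleared 
concurrently):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int pt_regs_flag;	/* 1 = in kernel, 0 = user space */
static _Atomic int qs_requested;	/* stand-in for TIF_NOHZ         */
static _Atomic int qs_reported;		/* stand-in for the completion   */

/* nohz CPU: the "2 instructions" on entry/exit, plus the TIF_NOHZ callback. */
static void kernel_entry(void) { atomic_store(&pt_regs_flag, 1); }
static void return_to_user(void)
{
	atomic_store(&pt_regs_flag, 0);
	if (atomic_exchange(&qs_requested, 0))
		atomic_store(&qs_reported, 1);	/* complete the waiter */
}

/* synchronize_rcu() side; in the real proposal this runs with the remote
 * rq lock held so rq->curr and its pt_regs cannot go away underneath us. */
static void wait_for_remote_qs(void)
{
	if (atomic_load(&pt_regs_flag) == 0)
		return;				/* flag 0: quiescent      */
	atomic_store(&qs_requested, 1);		/* flag 1: post the work  */
	if (atomic_load(&pt_regs_flag) == 0)
		return;				/* left the kernel meanwhile */
	while (!atomic_load(&qs_reported))
		;				/* wait for the callback  */
}

static void *waiter(void *arg)
{
	(void)arg;
	wait_for_remote_qs();
	return NULL;
}

int main(void)
{
	pthread_t t;

	kernel_entry();				/* nohz CPU busy in the kernel */
	pthread_create(&t, NULL, waiter, NULL);
	return_to_user();			/* ...until it returns to user */
	pthread_join(t, NULL);
	puts("remote quiescent state observed");
	return 0;
}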

On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled)
we could patch out those two instructions into NOPs, which would be
eliminated in the decoder.

Regarding the user/kernel execution time split measurement:

1) the same flag could be used to sample a remote CPU's statistics 
from another CPU and update the stats in the currently executing task. 
As long as there's at least one non-nohz-full CPU, this would work. Or 
are there systems where all CPUs are nohz-full?

2) Alternatively we could just drive user/kernel split statistics from 
context switches, which would be inaccurate if the workload is a 
SCHED_FIFO task that only rarely context switches.

How does this sound?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-02  5:27                         ` Ingo Molnar
@ 2015-05-02 18:27                           ` Rik van Riel
  2015-05-03 18:41                           ` Andy Lutomirski
  2015-05-04  9:26                           ` Paolo Bonzini
  2 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-02 18:27 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paolo Bonzini, Paul E. McKenney, Linus Torvalds

On 05/02/2015 01:27 AM, Ingo Molnar wrote:

> Regarding the user/kernel execution time split measurement:
> 
> 1) the same flag could be used to sample a remote CPU's statistics 
> from another CPU and update the stats in the currently executing task. 
> As long as there's at least one non-nohz-full CPU, this would work. Or 
> are there systems where all CPUs are nohz-full?

On a NO_HZ_FULL system, you need at least one CPU to execute
RCU callbacks and handle other housekeeping work, so there
is at least one CPU that is not nohz_full.

On NUMA systems, I could even see the sane option being one
CPU that is not isolated or nohz_full per NUMA node, so we
have a place to route irqs, etc...

> 2) Alternatively we could just drive user/kernel split statistics from 
> context switches, which would be inaccurate if the workload is a 
> SCHED_FIFO task that only rarely context switches.
> 
> How does this sound?

I think option (1) sounds nicer :)

What locks do we need, besides the runqueue lock to make sure
the task does not go away, and later the task's vtime_lock to
update its time statistics?

Do we even need the lock_trace(task) as taken in proc_pid_stack(),
since all we care about is whether or not the task is in kernel, user,
or guest mode?

For guest mode, we set a flag in the task struct somewhere, that
part is easy.

It also looks like dump_trace() can distinguish between normal,
exception, and irq stacks. Not sure how fancy we need to get...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-01 15:20     ` Rik van Riel
  2015-05-01 15:59       ` Ingo Molnar
@ 2015-05-03 13:23       ` Mike Galbraith
  2015-05-03 17:30         ` Rik van Riel
  1 sibling, 1 reply; 83+ messages in thread
From: Mike Galbraith @ 2015-05-03 13:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, linux-kernel, x86, williams, luto, fweisbec, peterz,
	heiko.carstens, tglx, Ingo Molnar, Paolo Bonzini

On Fri, 2015-05-01 at 11:20 -0400, Rik van Riel wrote:
> On 05/01/2015 02:40 AM, Ingo Molnar wrote:
> 
> >> This patch builds on top of these patches by Paolo:
> >> https://lkml.org/lkml/2015/4/28/188
> >> https://lkml.org/lkml/2015/4/29/139
> >>
> >> Together with this patch I posted earlier this week, the syscall path
> >> on a nohz_full cpu seems to be about 10% faster.
> >> https://lkml.org/lkml/2015/4/24/394
> >>
> >> My test is a simple microbenchmark that calls getpriority() in a loop
> >> 10 million times:
> >>
> >> 		run time	system time
> >> vanilla		5.49s		2.08s
> >> __acct patch	5.21s		1.92s
> >> both patches	4.88s		1.71s
> >
> > Just curious, what are the numbers if you don't have context tracking 
> > enabled, i.e. without nohz_full?
> > 
> > I.e. what's the baseline we are talking about?
> 
> It's an astounding difference. This is not a kernel without nohz_full,
> just a CPU without nohz_full running the same kernel I tested with
> yesterday:
> 
>  		run time	system time
> vanilla		5.49s		2.08s
> __acct patch	5.21s		1.92s
> both patches	4.88s		1.71s
> CPU w/o nohz	3.12s		1.63s    <-- your numbers, mostly

Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.

100M * stat() on isolated cpu

NO_HZ_FULL off        inactive     housekeeper    nohz_full
real    0m14.266s     0m14.367s    0m20.427s      0m27.921s
user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
(real)  1.000         1.007        1.431          1.957
                                   1.000          1.000

                         real      0m20.423s      0m27.930s  +rik 1,2
                         user      0m2.072s       0m10.450s
                         sys       0m18.304s      0m17.471s
                         vs off    1.431          1.957
                         vs prev   1.000          1.000

                         real      0m20.256s      0m27.803s  +paolo 1,2 (2 missing prototypes)
                         user      0m1.884s       0m10.551s
                         sys       0m18.353s      0m17.242s
                         vs off    1.419          1.948
                         vs prev    .991           .995

                         real      0m19.122s      0m26.946s  +rik 3
                         user      0m1.896s       0m10.292s
                         sys       0m17.198s      0m16.644s
                         vs off    1.340          1.888
                         vs prev   .944            .969
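
Something like the following loop, pinned to the CPU under test with 
taskset, reproduces the setup (the path and the pinning details are 
arbitrary choices for the sketch):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;
	long i;

	for (i = 0; i < 100000000L; i++)	/* 100M * stat() */
		stat("/tmp", &st);
	return 0;
}

/* build and run, e.g.:
 *   gcc -O2 -o statloop statloop.c
 *   time taskset -c <isolated cpu> ./statloop
 */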




^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-03 13:23       ` Mike Galbraith
@ 2015-05-03 17:30         ` Rik van Riel
  2015-05-03 18:24           ` Andy Lutomirski
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-03 17:30 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, linux-kernel, x86, williams, luto, fweisbec, peterz,
	heiko.carstens, tglx, Ingo Molnar, Paolo Bonzini

On 05/03/2015 09:23 AM, Mike Galbraith wrote:

> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
> 
> 100M * stat() on isolated cpu
> 
> NO_HZ_FULL off        inactive     housekeeper    nohz_full
> real    0m14.266s     0m14.367s    0m20.427s      0m27.921s
> user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
> sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
> (real)  1.000         1.007        1.431          1.957
>                                    1.000          1.000
> 
>                          real      0m20.423s      0m27.930s  +rik 1,2
>                          user      0m2.072s       0m10.450s
>                          sys       0m18.304s      0m17.471s
>                          vs off    1.431          1.957
>                          vs prev   1.000          1.000
> 
>                          real      0m20.256s      0m27.803s  +paolo 1,2 (2 missing prototypes)
>                          user      0m1.884s       0m10.551s
>                          sys       0m18.353s      0m17.242s
>                          vs off    1.419          1.948
>                          vs prev    .991           .995
> 
>                          real      0m19.122s      0m26.946s  +rik 3
>                          user      0m1.896s       0m10.292s
>                          sys       0m17.198s      0m16.644s
>                          vs off    1.340          1.888
>                          vs prev   .944            .969

I'm convinced.

Time to try the remote sampling of CPU use statistics, and
lighten up the RCU overhead of context tracking.

I should be able to start on the remote sampling of CPU use
statistics this week.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-03 17:30         ` Rik van Riel
@ 2015-05-03 18:24           ` Andy Lutomirski
  2015-05-03 18:52             ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-03 18:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mike Galbraith, Ingo Molnar, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On Sun, May 3, 2015 at 10:30 AM, Rik van Riel <riel@redhat.com> wrote:
> On 05/03/2015 09:23 AM, Mike Galbraith wrote:
>
>> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
>>
>> 100M * stat() on isolated cpu
>>
>> NO_HZ_FULL off        inactive     housekeeper    nohz_full
>> real    0m14.266s     0m14.367s    0m20.427s      0m27.921s
>> user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
>> sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
>> (real)  1.000         1.007        1.431          1.957
>>                                    1.000          1.000
>>
>>                          real      0m20.423s      0m27.930s  +rik 1,2
>>                          user      0m2.072s       0m10.450s
>>                          sys       0m18.304s      0m17.471s
>>                          vs off    1.431          1.957
>>                          vs prev   1.000          1.000
>>
>>                          real      0m20.256s      0m27.803s  +paolo 1,2 (2 missing prototypes)
>>                          user      0m1.884s       0m10.551s
>>                          sys       0m18.353s      0m17.242s
>>                          vs off    1.419          1.948
>>                          vs prev    .991           .995
>>
>>                          real      0m19.122s      0m26.946s  +rik 3
>>                          user      0m1.896s       0m10.292s
>>                          sys       0m17.198s      0m16.644s
>>                          vs off    1.340          1.888
>>                          vs prev   .944            .969
>
> I'm convinced.
>
> Time to try the remote sampling of CPU use statistics, and
> lighten up the RCU overhead of context tracking.
>

I don't understand the remote sampling proposal.  Isn't the whole
point of full nohz to avoid periodically interrupting a busy CPU?  If
what you have in mind is sending IPIs, then that's just as bad, right?

If, on the other hand, you're just going to remotely sample the
in-memory context, that sounds good.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-02  5:27                         ` Ingo Molnar
  2015-05-02 18:27                           ` Rik van Riel
@ 2015-05-03 18:41                           ` Andy Lutomirski
  2015-05-07 10:35                             ` Ingo Molnar
  2015-05-04  9:26                           ` Paolo Bonzini
  2 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-03 18:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paolo Bonzini, Paul E. McKenney, Linus Torvalds

On Fri, May 1, 2015 at 10:27 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <riel@redhat.com> wrote:
>> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>> >
>> >> Or we could do that in the syscall path with a single store of a
>> >> constant flag to a location in the task struct. We have a number of
>> >> natural flags that get written on syscall entry, such as:
>> >>
>> >>         pushq_cfi $__USER_DS                    /* pt_regs->ss */
>> >>
>> >> That goes to a constant location on the kernel stack. On return from
>> >> system calls we could write 0 to that location.
>>
>> Huh?  IRET with zero there will fault, and we can't guarantee that
>> all syscalls return using sysret. [...]
>
> So IRET is a slowpath - I was thinking about the fastpath mainly.

Slowpath or not, if we're using part of the hardware frame in pt_regs
as an indication of whether we're in user or kernel mode, we don't get
to arbitrarily change it to the "user" state before IRET or the IRET
will fail.  We could work around that with some kind of trampoline,
but that's complicated.

>
>> [...]  Also, we'd have to audit all the entries, and
>> system_call_after_swapgs currently enables interrupts early enough
>> that an interrupt before all the pushes will do unpredictable things
>> to pt_regs.
>
> An irq hardware frame won't push zero to that selector value, right?
> That's the only bad thing that would confuse the code.
>

I think it's not quite that simple.  The syscall entry looks like, roughly:

fix rsp;
sti;
push ss;
push rsp;
push flags;
push cs;
push rip;

We can get an interrupt part-way through those pushes.  Maybe there's
no bad place where we could get an IRQ since SS is first, but this is
still nasty.

>
> Yes, safely accessing the remote task is a complication, but with
> such a scheme we cannot avoid it: we'd still have to set TIF_NOHZ. So
> even if we do:
>
>> If we went that route, I'd advocate sticking the flag in tss->sp1.
>> That cacheline is unconditionally read on kernel entry already, and
>> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
>> lost track). [1]
>>
>> Alternatively, I'd suggest that we actually add a whole new word to
>> pt_regs.
>
> ... we'd still have to set TIF_NOHZ and are back to square one in
> terms of race complexity.
>
> But it's not overly complex: by taking the remote CPU's rq-lock from
> synchronize_rcu() we could get a stable pointer to the currently
> executing task.
>
> And if we do that, we might as well use the opportunity and take a
> look at pt_regs as well - this is why sticking it into pt_regs might
> be better.
>
> So I'd:
>
>   - enlarge pt_regs by a word and stick the flag there (this
>     allocation is essentially free)
>
>   - update the word from entry/exit
>
>   - synchronize_rcu() avoids having to send an IPI by taking a
>     peek at rq->curr's pt_regs::flag, and if:
>
>      - the flag is 0 then it has observed a quiescent state.
>
>      - the flag is 1, then it would set TIF_NOHZ and wait for a
>        completion from a TIF_NOHZ callback.

This seems complicated, especially given IST entries that put their
pt_regs in different places.  We get a bit of a free pass on some of
the IST entries due to the IST stack switch thing, but we still need
to worry about NMI and MCE at least.

I think I want to understand the considerations a bit better before I
express a real opinion.  IIUC the remote CPU needs to reliably detect
that we've had a grace period.  This can potentially happen by
observing us with some quiescent (user-mode, guest-mode or idle) flag
set or by an interrupt incrementing a counter.  In CONFIG_PREEMPT_RCU
mode, the remote CPU can presumably observe that we're quiescent
directly.

Does this mean that, in CONFIG_PREEMPT_RCU mode, all of this is
unnecessary?  We may still want context tracking for the timing stats
sampling, but I don't see why we need it for RCU.

In non-CONFIG_PREEMPT_RCU mode, I think we need to worry about
possible races that could cause us to fail to ever detect a grace
period.  If we have the periodic tick turned off and if the remote CPU
checks our context and sees us non-quiescent, then either we need a
non-racy way to get us to do work when we become quiescent or the
remote CPU needs to eventually send us an IPI.  I don't see any way to
do the former without an RMW or barrier on every transition into a
quiescent state.  Even if we used ti->flags, I still think that would
require us to upgrade the cli; test on exit to userspace to a
test-and-set or test-and-clear, either of which is much more
expensive.

I think I prefer a per-cpu approach over a per-task approach, since
it's easier to reason about and it should still only require one
instruction on entry and one instruction on exit.

FWIW, if we go the per-cpu route, what if we used a counter instead of
a simple context flag?  With a counter, the remote CPU could observe
that we changed state and therefore went through a grace period even
if the remote CPU never actually observes us in the quiescent state.

I'll be travelling and have very high latency for two weeks.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-03 18:24           ` Andy Lutomirski
@ 2015-05-03 18:52             ` Rik van Riel
  2015-05-07 10:48               ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-03 18:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Galbraith, Ingo Molnar, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On 05/03/2015 02:24 PM, Andy Lutomirski wrote:
> On Sun, May 3, 2015 at 10:30 AM, Rik van Riel <riel@redhat.com> wrote:
>> On 05/03/2015 09:23 AM, Mike Galbraith wrote:
>>
>>> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
>>>
>>> 100M * stat() on isolated cpu
>>>
>>> NO_HZ_FULL off        inactive     housekeeper    nohz_full
>>> real    0m14.266s     0m14.367s    0m20.427s      0m27.921s
>>> user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
>>> sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
>>> (real)  1.000         1.007        1.431          1.957

>> I'm convinced.
>>
>> Time to try the remote sampling of CPU use statistics, and
>> lighten up the RCU overhead of context tracking.
>>
> 
> I don't understand the remote sampling proposal.  Isn't the whole
> point of full nohz to avoid periodically interrupting a busy CPU?  If
> what you have in mind is sending IPIs, then that's just as bad, right?
> 
> If, on the other hand, you're just going to remotely sample the
> in-memory context, that sounds good.

It's the latter.

If you look at /proc/<pid>/{stack,syscall,wchan} and other files,
you will see we already have ways to determine, from in memory
content, where a program is running at a certain point in time.

In fact, the timer interrupt based accounting does a similar thing.
It has a task examine its own in-memory state to figure out what it
was doing before the timer interrupt happened.

The kernel side stack pointer is probably enough to tell us whether
a task is active in kernel space, on an irq stack, or (maybe) in user
space. I'm not convinced about the latter; we may need to look at the
same state the RCU code keeps track of to see what mode a task is in...

I am looking at the code to see what locks we need to grab.

I suspect the runqueue lock may be enough, to ensure that the task
struct, and stack do not go away while we are looking at them.

We cannot take the lock_trace(task) from irq context, and we probably
do not need to anyway, since we do not care about a precise stack trace
for the task.
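
As a rough user-space sketch of that direction (the context word, the
sampling period and all names are made up for the sketch): a sampler
thread charges ticks to user/kernel/guest time purely from the worker's
published in-memory state, without ever interrupting it.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

enum ctx { CTX_USER, CTX_KERNEL, CTX_GUEST };

static _Atomic int cur_ctx = CTX_USER;	/* what the task publishes in memory */
static _Atomic int stop;
static long ticks[3];			/* sampled user/kernel/guest time    */

static void *sampler(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		/* In the kernel this read would be done under the runqueue
		 * lock, so task_struct and stack cannot go away under us. */
		ticks[atomic_load(&cur_ctx)]++;
		usleep(1000);			/* 1 ms sampling period */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, sampler, NULL);

	/* The "nohz task": flips its published context, never gets an IPI. */
	for (int i = 0; i < 200; i++) {
		atomic_store(&cur_ctx, i % 2 ? CTX_KERNEL : CTX_USER);
		usleep(1000);
	}
	atomic_store(&stop, 1);
	pthread_join(t, NULL);
	printf("user=%ld kernel=%ld guest=%ld\n",
	       ticks[CTX_USER], ticks[CTX_KERNEL], ticks[CTX_GUEST]);
	return 0;
}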

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-02  5:27                         ` Ingo Molnar
  2015-05-02 18:27                           ` Rik van Riel
  2015-05-03 18:41                           ` Andy Lutomirski
@ 2015-05-04  9:26                           ` Paolo Bonzini
  2015-05-04 13:30                             ` Rik van Riel
                                               ` (3 more replies)
  2 siblings, 4 replies; 83+ messages in thread
From: Paolo Bonzini @ 2015-05-04  9:26 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Rik van Riel, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paul E. McKenney, Linus Torvalds



On 02/05/2015 07:27, Ingo Molnar wrote:
> 
>   - synchronize_rcu() avoids having to send an IPI by taking a 
>     peek at rq->curr's pt_regs::flag, and if:
> 
>      - the flag is 0 then it has observed a quiescent state.
> 
>      - the flag is 1, then it would set TIF_NOHZ and wait for a 
>        completion from a TIF_NOHZ callback.

Isn't this racy?

	synchronize_rcu CPU		nohz CPU
	---------------------------------------------------------
					set flag = 0
	read flag = 0
					return to userspace
	set TIF_NOHZ

and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Paolo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-04  9:26                           ` Paolo Bonzini
@ 2015-05-04 13:30                             ` Rik van Riel
  2015-05-04 14:06                             ` Rik van Riel
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 13:30 UTC (permalink / raw)
  To: Paolo Bonzini, Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paul E. McKenney, Linus Torvalds

On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>>     peek at rq->curr's pt_regs::flag, and if:
>>
>>      - the flag is 0 then it has observed a quiescent state.
>>
>>      - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>        completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
> 	synchronize_rcu CPU		nohz CPU
> 	---------------------------------------------------------
> 					set flag = 0
> 	read flag = 0
> 					return to userspace
> 	set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

If the flag and TIF_NOHZ live in the same word, we can cmpxchg
and be guaranteed atomicity.
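
Something along these lines (bit layout and names invented for the
sketch): with the context flag and the TIF_NOHZ-style bit in one word,
both transitions are a single compare-and-swap, so the flag can never
be cleared after the work bit was set without the work being noticed.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define CTX_KERNEL	0x1	/* "flag": task is in the kernel      */
#define WORK_NOHZ	0x2	/* "TIF_NOHZ": remote work requested  */

static _Atomic unsigned int word = CTX_KERNEL;

/* nohz CPU, return to user space: clear the context bit, but only if no
 * work is pending; if the cmpxchg fails, run the callback first and retry. */
static bool try_return_to_user(void)
{
	unsigned int old = CTX_KERNEL;

	return atomic_compare_exchange_strong(&word, &old, 0);
}

/* synchronize_rcu() side (single waiter assumed): post the request, but
 * only while the task is still in the kernel. */
static bool request_qs_report(void)
{
	unsigned int old = CTX_KERNEL;

	if (atomic_compare_exchange_strong(&word, &old,
					   CTX_KERNEL | WORK_NOHZ))
		return true;	/* posted; wait for the callback's completion */
	return false;		/* old was 0: already in user space, quiescent */
}

int main(void)
{
	printf("request posted: %d\n", request_qs_report());	 /* 1 */
	printf("fast user return: %d\n", try_return_to_user()); /* 0: run callback */
	return 0;
}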

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-04  9:26                           ` Paolo Bonzini
  2015-05-04 13:30                             ` Rik van Riel
@ 2015-05-04 14:06                             ` Rik van Riel
  2015-05-04 14:19                             ` Rik van Riel
  2015-05-04 15:59                             ` question about RCU dynticks_nesting Rik van Riel
  3 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 14:06 UTC (permalink / raw)
  To: Paolo Bonzini, Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paul E. McKenney, Linus Torvalds

On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>>     peek at rq->curr's pt_regs::flag, and if:
>>
>>      - the flag is 0 then it has observed a quiescent state.
>>
>>      - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>        completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
> 	synchronize_rcu CPU		nohz CPU
> 	---------------------------------------------------------
> 					set flag = 0
> 	read flag = 0
> 					return to userspace
> 	set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

As an aside, I suspect the remote timer sampling stuff is probably
best off piggybacking on the RCU status (and PF_VCPU) of the tasks
being sampled.

That seems like the most reliable in-memory state we can examine,
since the frame pointer may be in registers only, and not in
memory to begin with.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-04  9:26                           ` Paolo Bonzini
  2015-05-04 13:30                             ` Rik van Riel
  2015-05-04 14:06                             ` Rik van Riel
@ 2015-05-04 14:19                             ` Rik van Riel
  2015-05-04 15:59                             ` question about RCU dynticks_nesting Rik van Riel
  3 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 14:19 UTC (permalink / raw)
  To: Paolo Bonzini, Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paul E. McKenney, Linus Torvalds

On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>>     peek at rq->curr's pt_regs::flag, and if:
>>
>>      - the flag is 0 then it has observed a quiescent state.
>>
>>      - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>        completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
> 	synchronize_rcu CPU		nohz CPU
> 	---------------------------------------------------------
> 					set flag = 0
> 	read flag = 0
> 					return to userspace
> 	set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Actually, the "race" in this direction is fine. If flag==0, then
the nohz CPU is not accessing any RCU-protected data structures,
and the synchronize_rcu CPU will not be setting TIF_NOHZ.

The race is only a concern if the synchronize_rcu CPU reads
flag==1 (nohz CPU is in kernel space), and sets TIF_NOHZ after
the nohz CPU has cleared flag (and is unable to handle RCU
stuff). An atomic compare and swap prevents that issue.

The other race, of the synchronize_rcu CPU reading 0, followed
by the nohz CPU going into kernel space, and setting the flag
to 1, should be fine. After all, this means the nohz_full CPU
just went into a new RCU grace period, which is just what the
synchronize_rcu CPU was waiting for.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* question about RCU dynticks_nesting
  2015-05-04  9:26                           ` Paolo Bonzini
                                               ` (2 preceding siblings ...)
  2015-05-04 14:19                             ` Rik van Riel
@ 2015-05-04 15:59                             ` Rik van Riel
  2015-05-04 18:39                               ` Paul E. McKenney
  2015-05-04 19:00                               ` Rik van Riel
  3 siblings, 2 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 15:59 UTC (permalink / raw)
  To: Paolo Bonzini, Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paul E. McKenney, Linus Torvalds

On 05/04/2015 05:26 AM, Paolo Bonzini wrote:

> Isn't this racy?
> 
> 	synchronize_rcu CPU		nohz CPU
> 	---------------------------------------------------------
> 					set flag = 0
> 	read flag = 0
> 					return to userspace
> 	set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Looking at the code some more, a flag is not going to be enough.

An irq can hit while we are in kernel mode, leading to the
task's "rcu active" counter being incremented twice.

However, currently the RCU code seems to use a much more
complex counting scheme, with a different increment for
kernel/task use, and irq use.

This counter seems to be modeled on the task preempt_counter,
where we do care about whether we are in task context, irq
context, or softirq context.

On the other hand, the RCU code only seems to care about
whether or not a CPU is in an extended quiescent state,
or is potentially in an RCU critical section.

Paul, what is the reason for RCU using a complex counter,
instead of a simple increment for each potential kernel/RCU
entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
enabled?

In fact, would we be able to simply use tsk->rcu_read_lock_nesting
as an indicator of whether or not we should bother waiting on that
task or CPU when doing synchronize_rcu?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 15:59                             ` question about RCU dynticks_nesting Rik van Riel
@ 2015-05-04 18:39                               ` Paul E. McKenney
  2015-05-04 19:39                                 ` Rik van Riel
  2015-05-04 19:00                               ` Rik van Riel
  1 sibling, 1 reply; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-04 18:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> > Isn't this racy?
> > 
> > 	synchronize_rcu CPU		nohz CPU
> > 	---------------------------------------------------------
> > 					set flag = 0
> > 	read flag = 0
> > 					return to userspace
> > 	set TIF_NOHZ
> > 
> > and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.
> 
> Looking at the code some more, a flag is not going to be enough.
> 
> An irq can hit while we are in kernel mode, leading to the
> task's "rcu active" counter being incremented twice.
> 
> However, currently the RCU code seems to use a much more
> complex counting scheme, with a different increment for
> kernel/task use, and irq use.
> 
> This counter seems to be modeled on the task preempt_counter,
> where we do care about whether we are in task context, irq
> context, or softirq context.
> 
> On the other hand, the RCU code only seems to care about
> whether or not a CPU is in an extended quiescent state,
> or is potentially in an RCU critical section.
> 
> Paul, what is the reason for RCU using a complex counter,
> instead of a simple increment for each potential kernel/RCU
> entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
> enabled?

Heh!  I found out why the hard way.

You see, there are architectures where a CPU can enter an interrupt level
without ever exiting, and perhaps vice versa.  But only if that CPU is
non-idle at the time.  So, when a CPU enters idle, it is necessary to
reset the interrupt nesting to zero.  But that means that it is in turn
necessary to count task-level nesting separately from interrupt-level
nesting, so that we can determine when the CPU goes idle from a task-level
viewpoint.  Hence the use of masks and fields within the counter.
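
Very roughly, and purely as an illustration (the field split and the
constants below are not the kernel's actual dynticks_nesting encoding):
task-level nesting lives in the upper bits, irq nesting in the lower
bits, and the irq part is simply discarded when the CPU goes idle.

#include <stdio.h>

#define IRQ_BITS	16
#define IRQ_MASK	((1L << IRQ_BITS) - 1)
#define TASK_ONE	(1L << IRQ_BITS)

static long nesting;			/* per-cpu in the real thing */

static void idle_enter(void) { nesting = 0; }	/* also wipes irq nesting */
static void idle_exit(void)  { nesting = TASK_ONE; }
static void irq_enter(void)  { nesting++; }
static void irq_exit(void)			/* tolerate exit-without-enter */
{
	if (nesting & IRQ_MASK)
		nesting--;
}

static int idle_from_task_view(void) { return (nesting & ~IRQ_MASK) == 0; }

int main(void)
{
	idle_exit();			/* task-level: running              */
	irq_enter();			/* arch enters an irq level...      */
	idle_enter();			/* ...then goes idle without irq_exit */
	printf("idle? %d (irq nesting was discarded)\n", idle_from_task_view());
	return 0;
}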

It -might- be possible to simplify this somewhat, especially now that
we have unified idle loops.  Except that I don't trust the architectures
to be reasonable about this at this point.  Furthermore, the associated
nesting checks do trigger when people are making certain types of changes
to architectures, so it is a useful debugging tool.  Which is another
reason that I am reluctant to change it.

> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> as an indicator of whether or not we should bother waiting on that
> task or CPU when doing synchronize_rcu?

Depends on exactly what you are asking.  If you are asking if I could add
a few more checks to preemptible RCU and speed up grace-period detection
in a number of cases, the answer is very likely "yes".  This is on my
list, but not particularly high priority.  If you are asking whether
CPU 0 could access ->rcu_read_lock_nesting of some task running on
some other CPU, in theory, the answer is "yes", but in practice that
would require putting full memory barriers in both rcu_read_lock()
and rcu_read_unlock(), so the real answer is "no".

Or am I missing your point?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 15:59                             ` question about RCU dynticks_nesting Rik van Riel
  2015-05-04 18:39                               ` Paul E. McKenney
@ 2015-05-04 19:00                               ` Rik van Riel
  2015-05-04 19:39                                 ` Paul E. McKenney
  2015-05-05 10:48                                 ` Peter Zijlstra
  1 sibling, 2 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 19:00 UTC (permalink / raw)
  To: Paolo Bonzini, Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paul E. McKenney, Linus Torvalds

On 05/04/2015 11:59 AM, Rik van Riel wrote:

> However, currently the RCU code seems to use a much more
> complex counting scheme, with a different increment for
> kernel/task use, and irq use.
> 
> This counter seems to be modeled on the task preempt_counter,
> where we do care about whether we are in task context, irq
> context, or softirq context.
> 
> On the other hand, the RCU code only seems to care about
> whether or not a CPU is in an extended quiescent state,
> or is potentially in an RCU critical section.
> 
> Paul, what is the reason for RCU using a complex counter,
> instead of a simple increment for each potential kernel/RCU
> entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
> enabled?

Looking at the code for a while more, I have not found
any reason why the rcu dynticks counter is so complex.

The rdtp->dynticks atomic seems to be used as a serial
number. Even means the cpu is in an rcu quiescent state,
odd means it is not.

This test is used to verify whether or not a CPU is
in rcu quiescent state. Presumably the atomic_add_return
is used to add a memory barrier.

	(atomic_add_return(0, &rdtp->dynticks) & 0x1)
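
As a stand-alone model of what that gives us (not the kernel's code,
just the same even/odd idea): the counter is bumped on every EQS entry
and exit, the masked read says whether the CPU is in an extended
quiescent state right now, and comparing against an earlier snapshot
says whether it passed through one in the meantime; the
atomic_add_return(0, ...) above is that read with full-barrier
semantics.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Model of rdtp->dynticks: incremented on every EQS entry and exit, so an
 * even value means the CPU is currently in an extended quiescent state. */
static _Atomic long dynticks = 1;	/* starts odd: active in the kernel */

static void eqs_enter(void) { atomic_fetch_add(&dynticks, 1); }	/* -> even */
static void eqs_exit(void)  { atomic_fetch_add(&dynticks, 1); }	/* -> odd  */

/* Remote snapshot: a read-modify-write of 0, so it orders like a full barrier. */
static long dynticks_snap(void) { return atomic_fetch_add(&dynticks, 0); }

static bool in_eqs(long snap) { return (snap & 0x1) == 0; }

/* A quiescent state has been observed if the CPU was in an EQS when sampled,
 * or if the counter moved since the snapshot (it went through one meanwhile). */
static bool qs_since(long snap) { return in_eqs(snap) || dynticks_snap() != snap; }

int main(void)
{
	long snap = dynticks_snap();	/* remote CPU busy in the kernel */

	printf("in EQS: %d, QS since: %d\n", in_eqs(snap), qs_since(snap)); /* 0 0 */
	eqs_enter();
	eqs_exit();			/* brief userspace/idle sojourn */
	printf("in EQS: %d, QS since: %d\n", in_eqs(snap), qs_since(snap)); /* 0 1 */
	return 0;
}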

> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> as an indicator of whether or not we should bother waiting on that
> task or CPU when doing synchronize_rcu?

We seem to have two variants of __rcu_read_lock().

One increments current->rcu_read_lock_nesting, the other
calls preempt_disable().

In case of the non-preemptible RCU, we could easily also
increase current->rcu_read_lock_nesting at the same time
we increase the preempt counter, and use that as the
indicator to test whether the cpu is in an extended
rcu quiescent state. That way there would be no extra
overhead at syscall entry or exit at all. The trick
would be getting the preempt count and the rcu read
lock nesting count in the same cache line for each task.

In case of the preemptible RCU scheme, we would have to
examine the per-task state (under the runqueue lock)
to get the current task info of all CPUs, and in
addition wait for the blkd_tasks list to empty out
when doing a synchronize_rcu().

That does not appear to require special per-cpu
counters; examining the per-cpu rdp and the lists
inside it, with the rnp->lock held if doing any
list manipulation, looks like it would be enough.

However, the current code is a lot more complicated
than that. Am I overlooking something obvious, Paul?
Maybe something non-obvious? :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 19:00                               ` Rik van Riel
@ 2015-05-04 19:39                                 ` Paul E. McKenney
  2015-05-04 19:59                                   ` Rik van Riel
  2015-05-05 10:53                                   ` Peter Zijlstra
  2015-05-05 10:48                                 ` Peter Zijlstra
  1 sibling, 2 replies; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-04 19:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> On 05/04/2015 11:59 AM, Rik van Riel wrote:
> 
> > However, currently the RCU code seems to use a much more
> > complex counting scheme, with a different increment for
> > kernel/task use, and irq use.
> > 
> > This counter seems to be modeled on the task preempt_counter,
> > where we do care about whether we are in task context, irq
> > context, or softirq context.
> > 
> > On the other hand, the RCU code only seems to care about
> > whether or not a CPU is in an extended quiescent state,
> > or is potentially in an RCU critical section.
> > 
> > Paul, what is the reason for RCU using a complex counter,
> > instead of a simple increment for each potential kernel/RCU
> > entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
> > enabled?
> 
> Looking at the code for a while more, I have not found
> any reason why the rcu dynticks counter is so complex.

For the nesting counter, please see my earlier email.

> The rdtp->dynticks atomic seems to be used as a serial
> number. Even means the cpu is in an rcu quiescent state,
> odd means it is not.

Yep.

> This test is used to verify whether or not a CPU is
> in rcu quiescent state. Presumably the atomic_add_return
> is used to add a memory barrier.
> 
> 	(atomic_add_return(0, &rdtp->dynticks) & 0x1)

Yep.  It is sampled remotely, hence the need for full memory barriers.
It doesn't help to sample the counter if the sampling gets reordered
with the surrounding code.  Ditto for the increments.

By the end of the year, and hopefully much sooner, I expect to have
testing infrastructure capable of detecting ordering bugs in this code.
At which point, I can start experimenting with alternative code sequences.

But full ordering is still required, and cache misses can happen.

> > In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> > as an indicator of whether or not we should bother waiting on that
> > task or CPU when doing synchronize_rcu?
> 
> We seem to have two variants of __rcu_read_lock().
> 
> One increments current->rcu_read_lock_nesting, the other
> calls preempt_disable().

Yep.  The first is preemptible RCU, the second classic RCU.

> In case of the non-preemptible RCU, we could easily also
> increase current->rcu_read_lock_nesting at the same time
> we increase the preempt counter, and use that as the
> indicator to test whether the cpu is in an extended
> rcu quiescent state. That way there would be no extra
> overhead at syscall entry or exit at all. The trick
> would be getting the preempt count and the rcu read
> lock nesting count in the same cache line for each task.

But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
counter in production kernels.  Even if there was, we have to sample this
on other CPUs, so the overhead of preempt_disable() and preempt_enable()
would be where kernel entry/exit is, so I expect that this would be a
net loss in overall performance.

> In case of the preemptible RCU scheme, we would have to
> examine the per-task state (under the runqueue lock)
> to get the current task info of all CPUs, and in
> addition wait for the blkd_tasks list to empty out
> when doing a synchronize_rcu().
> 
> That does not appear to require special per-cpu
> counters; examining the per-cpu rdp and the lists
> inside it, with the rnp->lock held if doing any
> list manipulation, looks like it would be enough.
> 
> However, the current code is a lot more complicated
> than that. Am I overlooking something obvious, Paul?
> Maybe something non-obvious? :)

Ummm...  The need to maintain memory ordering when sampling task
state from remote CPUs?

Or am I completely confused about what you are suggesting?

That said, are you chasing a real system-visible performance issue
that you tracked to RCU's dyntick-idle system?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 18:39                               ` Paul E. McKenney
@ 2015-05-04 19:39                                 ` Rik van Riel
  2015-05-04 20:02                                   ` Paul E. McKenney
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 19:39 UTC (permalink / raw)
  To: paulmck
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:

>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
>> as an indicator of whether or not we should bother waiting on that
>> task or CPU when doing synchronize_rcu?
> 
> Depends on exactly what you are asking.  If you are asking if I could add
> a few more checks to preemptible RCU and speed up grace-period detection
> in a number of cases, the answer is very likely "yes".  This is on my
> list, but not particularly high priority.  If you are asking whether
> CPU 0 could access ->rcu_read_lock_nesting of some task running on
> some other CPU, in theory, the answer is "yes", but in practice that
> would require putting full memory barriers in both rcu_read_lock()
> and rcu_read_unlock(), so the real answer is "no".
> 
> Or am I missing your point?

The main question is "how can we greatly reduce the overhead
of nohz_full, by simplifying the RCU extended quiescent state
code called in the syscall fast path, and maybe piggyback on
that to do time accounting for remote CPUs?"

Your memory barrier answer above makes it clear we will still
want to do the RCU stuff at syscall entry & exit time, at least
on x86, where we already have automatic and implicit memory
barriers.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 19:39                                 ` Paul E. McKenney
@ 2015-05-04 19:59                                   ` Rik van Riel
  2015-05-04 20:40                                     ` Paul E. McKenney
  2015-05-05 10:53                                   ` Peter Zijlstra
  1 sibling, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 19:59 UTC (permalink / raw)
  To: paulmck
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On 05/04/2015 03:39 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:

>> In case of the non-preemptible RCU, we could easily also
>> increase current->rcu_read_lock_nesting at the same time
>> we increase the preempt counter, and use that as the
>> indicator to test whether the cpu is in an extended
>> rcu quiescent state. That way there would be no extra
>> overhead at syscall entry or exit at all. The trick
>> would be getting the preempt count and the rcu read
>> lock nesting count in the same cache line for each task.
> 
> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> counter in production kernels.  Even if there was, we have to sample this
> on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> would be where kernel entry/exit is, so I expect that this would be a
> net loss in overall performance.

CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT.
Not sure why, but they are :)

>> In case of the preemptible RCU scheme, we would have to
>> examine the per-task state (under the runqueue lock)
>> to get the current task info of all CPUs, and in
>> addition wait for the blkd_tasks list to empty out
>> when doing a synchronize_rcu().
>>
>> That does not appear to require special per-cpu
>> counters; examining the per-cpu rdp and the lists
>> inside it, with the rnp->lock held if doing any
>> list manipulation, looks like it would be enough.
>>
>> However, the current code is a lot more complicated
>> than that. Am I overlooking something obvious, Paul?
>> Maybe something non-obvious? :)
> 
> Ummm...  The need to maintain memory ordering when sampling task
> state from remote CPUs?
> 
> Or am I completely confused about what you are suggesting?
> 
> That said, are you chasing a real system-visible performance issue
> that you tracked to RCU's dyntick-idle system?

The goal is to reduce the syscall overhead of nohz_full.

Part of the overhead is in the vtime updates, part of it is
in the way RCU extended quiescent state is tracked.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 19:39                                 ` Rik van Riel
@ 2015-05-04 20:02                                   ` Paul E. McKenney
  2015-05-04 20:13                                     ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-04 20:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> 
> >> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> >> as an indicator of whether or not we should bother waiting on that
> >> task or CPU when doing synchronize_rcu?
> > 
> > Depends on exactly what you are asking.  If you are asking if I could add
> > a few more checks to preemptible RCU and speed up grace-period detection
> > in a number of cases, the answer is very likely "yes".  This is on my
> > list, but not particularly high priority.  If you are asking whether
> > CPU 0 could access ->rcu_read_lock_nesting of some task running on
> > some other CPU, in theory, the answer is "yes", but in practice that
> > would require putting full memory barriers in both rcu_read_lock()
> > and rcu_read_unlock(), so the real answer is "no".
> > 
> > Or am I missing your point?
> 
> The main question is "how can we greatly reduce the overhead
> of nohz_full, by simplifying the RCU extended quiescent state
> code called in the syscall fast path, and maybe piggyback on
> that to do time accounting for remote CPUs?"
> 
> Your memory barrier answer above makes it clear we will still
> want to do the RCU stuff at syscall entry & exit time, at least
> on x86, where we already have automatic and implicit memory
> barriers.

We do need to keep in mind that x86's automatic and implicit memory
barriers do not order prior stores against later loads.
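
The classic store-buffering pattern is the usual illustration; a
minimal sketch with made-up variables (not kernel code):

	int x = 0, y = 0, r0, r1;

	/* CPU 0 */			/* CPU 1 */
	x = 1;				y = 1;
	smp_mb();			smp_mb();
	r0 = y;				r1 = x;

Drop the two smp_mb()s and x86 permits r0 == 0 && r1 == 0: each CPU's
store can sit in its store buffer while the later load completes, so
the store is not ordered against the later load.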

Hmmm...  But didn't earlier performance measurements show that the bulk of
the overhead was the delta-time computations rather than RCU accounting?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 20:02                                   ` Paul E. McKenney
@ 2015-05-04 20:13                                     ` Rik van Riel
  2015-05-04 20:38                                       ` Paul E. McKenney
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 20:13 UTC (permalink / raw)
  To: paulmck
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
>> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
>>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
>>
>>>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
>>>> as an indicator of whether or not we should bother waiting on that
>>>> task or CPU when doing synchronize_rcu?
>>>
>>> Depends on exactly what you are asking.  If you are asking if I could add
>>> a few more checks to preemptible RCU and speed up grace-period detection
>>> in a number of cases, the answer is very likely "yes".  This is on my
>>> list, but not particularly high priority.  If you are asking whether
>>> CPU 0 could access ->rcu_read_lock_nesting of some task running on
>>> some other CPU, in theory, the answer is "yes", but in practice that
>>> would require putting full memory barriers in both rcu_read_lock()
>>> and rcu_read_unlock(), so the real answer is "no".
>>>
>>> Or am I missing your point?
>>
>> The main question is "how can we greatly reduce the overhead
>> of nohz_full, by simplifying the RCU extended quiescent state
>> code called in the syscall fast path, and maybe piggyback on
>> that to do time accounting for remote CPUs?"
>>
>> Your memory barrier answer above makes it clear we will still
>> want to do the RCU stuff at syscall entry & exit time, at least
>> on x86, where we already have automatic and implicit memory
>> barriers.
> 
> We do need to keep in mind that x86's automatic and implicit memory
> barriers do not order prior stores against later loads.
> 
> Hmmm...  But didn't earlier performance measurements show that the bulk of
> the overhead was the delta-time computations rather than RCU accounting?

The bulk of the overhead was disabling and re-enabling
irqs around the calls to rcu_user_exit and rcu_user_enter :)

Of the remaining time, about 2/3 seems to be the vtime
stuff, and the other 1/3 the rcu code.

I suspect it makes sense to optimize both, though the
vtime code may be the easiest :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 20:13                                     ` Rik van Riel
@ 2015-05-04 20:38                                       ` Paul E. McKenney
  2015-05-04 20:53                                         ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-04 20:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 03:39:25PM -0400, Rik van Riel wrote:
> >> On 05/04/2015 02:39 PM, Paul E. McKenney wrote:
> >>> On Mon, May 04, 2015 at 11:59:05AM -0400, Rik van Riel wrote:
> >>
> >>>> In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> >>>> as an indicator of whether or not we should bother waiting on that
> >>>> task or CPU when doing synchronize_rcu?
> >>>
> >>> Depends on exactly what you are asking.  If you are asking if I could add
> >>> a few more checks to preemptible RCU and speed up grace-period detection
> >>> in a number of cases, the answer is very likely "yes".  This is on my
> >>> list, but not particularly high priority.  If you are asking whether
> >>> CPU 0 could access ->rcu_read_lock_nesting of some task running on
> >>> some other CPU, in theory, the answer is "yes", but in practice that
> >>> would require putting full memory barriers in both rcu_read_lock()
> >>> and rcu_read_unlock(), so the real answer is "no".
> >>>
> >>> Or am I missing your point?
> >>
> >> The main question is "how can we greatly reduce the overhead
> >> of nohz_full, by simplifying the RCU extended quiescent state
> >> code called in the syscall fast path, and maybe piggyback on
> >> that to do time accounting for remote CPUs?"
> >>
> >> Your memory barrier answer above makes it clear we will still
> >> want to do the RCU stuff at syscall entry & exit time, at least
> >> on x86, where we already have automatic and implicit memory
> >> barriers.
> > 
> > We do need to keep in mind that x86's automatic and implicit memory
> > barriers do not order prior stores against later loads.
> > 
> > Hmmm...  But didn't earlier performance measurements show that the bulk of
> > the overhead was the delta-time computations rather than RCU accounting?
> 
> The bulk of the overhead was disabling and re-enabling
> irqs around the calls to rcu_user_exit and rcu_user_enter :)

Really???  OK...  How about software irq masking?  (I know, that is
probably a bit of a scary change as well.)

> Of the remaining time, about 2/3 seems to be the vtime
> stuff, and the other 1/3 the rcu code.

OK, worth some thought, then.

> I suspect it makes sense to optimize both, though the
> vtime code may be the easiest :)

Making a crude version that does jiffies (or whatever) instead of
fine-grained computations might give good bang for the buck.  ;-)
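
One possible shape of such a crude version (acct_jiffies is a made-up
field; account_user_time() shown only for flavor):

static void vtime_account_user_crude(struct task_struct *tsk)
{
	unsigned long now = jiffies;
	unsigned long delta = now - tsk->acct_jiffies;	/* hypothetical field */

	if (!delta)
		return;

	/* charge whole jiffies instead of a fine-grained cputime delta */
	tsk->acct_jiffies = now;
	account_user_time(tsk, jiffies_to_cputime(delta),
			  jiffies_to_cputime(delta));
}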

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 19:59                                   ` Rik van Riel
@ 2015-05-04 20:40                                     ` Paul E. McKenney
  0 siblings, 0 replies; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-04 20:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Mon, May 04, 2015 at 03:59:02PM -0400, Rik van Riel wrote:
> On 05/04/2015 03:39 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> 
> >> In case of the non-preemptible RCU, we could easily also
> >> increase current->rcu_read_lock_nesting at the same time
> >> we increase the preempt counter, and use that as the
> >> indicator to test whether the cpu is in an extended
> >> rcu quiescent state. That way there would be no extra
> >> overhead at syscall entry or exit at all. The trick
> >> would be getting the preempt count and the rcu read
> >> lock nesting count in the same cache line for each task.
> > 
> > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > counter in production kernels.  Even if there was, we have to sample this
> > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > would be where kernel entry/exit is, so I expect that this would be a
> > net loss in overall performance.
> 
> CONFIG_PREEMPT_RCU seems to be independent of CONFIG_PREEMPT.
> Not sure why, but they are :)

Well, they used to be independent.  But the "depends" clauses force
them.  You cannot have TREE_RCU unless !PREEMPT && SMP.

> >> In case of the preemptible RCU scheme, we would have to
> >> examine the per-task state (under the runqueue lock)
> >> to get the current task info of all CPUs, and in
> >> addition wait for the blkd_tasks list to empty out
> >> when doing a synchronize_rcu().
> >>
> >> That does not appear to require special per-cpu
> >> counters; examining the per-cpu rdp and the lists
> >> inside it, with the rnp->lock held if doing any
> >> list manipulation, looks like it would be enough.
> >>
> >> However, the current code is a lot more complicated
> >> than that. Am I overlooking something obvious, Paul?
> >> Maybe something non-obvious? :)
> > 
> > Ummm...  The need to maintain memory ordering when sampling task
> > state from remote CPUs?
> > 
> > Or am I completely confused about what you are suggesting?
> > 
> > That said, are you chasing a real system-visible performance issue
> > that you tracked to RCU's dyntick-idle system?
> 
> The goal is to reduce the syscall overhead of nohz_full.
> 
> Part of the overhead is in the vtime updates, part of it is
> in the way RCU extended quiescent state is tracked.

OK, as long as it is actual measurements rather than guesswork.

								Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 20:38                                       ` Paul E. McKenney
@ 2015-05-04 20:53                                         ` Rik van Riel
  2015-05-05  5:54                                           ` Paul E. McKenney
  2015-05-07  0:59                                           ` Frederic Weisbecker
  0 siblings, 2 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-04 20:53 UTC (permalink / raw)
  To: paulmck
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
> On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
>> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:

>>> Hmmm...  But didn't earlier performance measurements show that the bulk of
>>> the overhead was the delta-time computations rather than RCU accounting?
>>
>> The bulk of the overhead was disabling and re-enabling
>> irqs around the calls to rcu_user_exit and rcu_user_enter :)
> 
> Really???  OK...  How about software irq masking?  (I know, that is
> probably a bit of a scary change as well.)
> 
>> Of the remaining time, about 2/3 seems to be the vtime
>> stuff, and the other 1/3 the rcu code.
> 
> OK, worth some thought, then.
> 
>> I suspect it makes sense to optimize both, though the
>> vtime code may be the easiest :)
> 
> Making a crude version that does jiffies (or whatever) instead of
> fine-grained computations might give good bang for the buck.  ;-)

Ingo's idea is to simply have cpu 0 check the current task
on all other CPUs, see whether that task is running in system
mode, user mode, guest mode, irq mode, etc and update that
task's vtime accordingly.

I suspect the runqueue lock is probably enough to do that,
and between rcu state and PF_VCPU we probably have enough
information to see what mode the task is running in, with
just remote memory reads.

I looked at implementing the vtime bits (and am pretty sure
how to do those now), and then spent some hours looking at
the RCU bits, to see if we could not simplify both things at
once, especially considering that the current RCU context
tracking bits need to be called with irqs disabled.
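
A very rough sketch of that sampling loop, where remote_task_mode()
and account_remote_tick() are made-up names standing for "look at
rcu state + PF_VCPU" and "add the delta to the right utime/stime/gtime
bucket":

static void vtime_sample_remote_cpus(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct rq *rq = cpu_rq(cpu);
		struct task_struct *p;
		unsigned long flags;

		if (cpu == smp_processor_id())
			continue;

		/* runqueue lock keeps rq->curr stable while we look */
		raw_spin_lock_irqsave(&rq->lock, flags);
		p = rq->curr;
		account_remote_tick(p, remote_task_mode(p));
		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}
}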

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 20:53                                         ` Rik van Riel
@ 2015-05-05  5:54                                           ` Paul E. McKenney
  2015-05-06  1:49                                             ` Mike Galbraith
  2015-05-07  0:59                                           ` Frederic Weisbecker
  1 sibling, 1 reply; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-05  5:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
> >> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> 
> >>> Hmmm...  But didn't earlier performance measurements show that the bulk of
> >>> the overhead was the delta-time computations rather than RCU accounting?
> >>
> >> The bulk of the overhead was disabling and re-enabling
> >> irqs around the calls to rcu_user_exit and rcu_user_enter :)
> > 
> > Really???  OK...  How about software irq masking?  (I know, that is
> > probably a bit of a scary change as well.)
> > 
> >> Of the remaining time, about 2/3 seems to be the vtime
> >> stuff, and the other 1/3 the rcu code.
> > 
> > OK, worth some thought, then.
> > 
> >> I suspect it makes sense to optimize both, though the
> >> vtime code may be the easiest :)
> > 
> > Making a crude version that does jiffies (or whatever) instead of
> > fine-grained computations might give good bang for the buck.  ;-)
> 
> Ingo's idea is to simply have cpu 0 check the current task
> on all other CPUs, see whether that task is running in system
> mode, user mode, guest mode, irq mode, etc and update that
> task's vtime accordingly.
> 
> I suspect the runqueue lock is probably enough to do that,
> and between rcu state and PF_VCPU we probably have enough
> information to see what mode the task is running in, with
> just remote memory reads.
> 
> I looked at implementing the vtime bits (and am pretty sure
> how to do those now), and then spent some hours looking at
> the RCU bits, to see if we could not simplify both things at
> once, especially considering that the current RCU context
> tracking bits need to be called with irqs disabled.

Remotely sampling the vtime info without memory barriers makes sense.
After all, the result is statistical anyway.  Unfortunately, as noted
earlier, RCU correctness depends on ordering.

The current RCU idle entry/exit code most definitely absolutely
requires irqs be disabled.  However, I will see if that can be changed.
No promises, especially no short-term promises, but it does not feel
impossible.

You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
RCU_FAST_NO_HZ=n?  If that has a significant effect, an easy quick win is
turning it off -- and I could then make it a boot parameter to get you
back to one kernel for everyone.  (The existing tick_nohz_active boot
parameter already turns it off, but also turns off dyntick idle, which
might be a bit excessive.)  Or if there is some way that the kernel can
know that the system is currently running on battery or some such.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 19:00                               ` Rik van Riel
  2015-05-04 19:39                                 ` Paul E. McKenney
@ 2015-05-05 10:48                                 ` Peter Zijlstra
  2015-05-05 10:51                                   ` Peter Zijlstra
  1 sibling, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2015-05-05 10:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paul E. McKenney, Linus Torvalds

On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> In case of the non-preemptible RCU, we could easily also
> increase current->rcu_read_lock_nesting at the same time
> we increase the preempt counter, and use that as the
> indicator to test whether the cpu is in an extended
> rcu quiescent state. That way there would be no extra
> overhead at syscall entry or exit at all. The trick
> would be getting the preempt count and the rcu read
> lock nesting count in the same cache line for each task.

Can't do that. Remember, on x86 we have per-cpu preempt count, and your
rcu_read_lock_nesting is per task.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05 10:48                                 ` Peter Zijlstra
@ 2015-05-05 10:51                                   ` Peter Zijlstra
  2015-05-05 12:30                                     ` Paul E. McKenney
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2015-05-05 10:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paul E. McKenney, Linus Torvalds

On Tue, May 05, 2015 at 12:48:34PM +0200, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
>  In case of the non-preemptible RCU, we could easily also
> > increase current->rcu_read_lock_nesting at the same time
> > we increase the preempt counter, and use that as the
> > indicator to test whether the cpu is in an extended
> > rcu quiescent state. That way there would be no extra
> > overhead at syscall entry or exit at all. The trick
> > would be getting the preempt count and the rcu read
> > lock nesting count in the same cache line for each task.
> 
> Can't do that. Remember, on x86 we have per-cpu preempt count, and your
> rcu_read_lock_nesting is per task.

Hmm, I suppose you could do the rcu_read_lock_nesting thing in a per-cpu
counter too and transfer that into the task_struct on context switch.

If you manage to put both sides of that in the same cache things should
not add significant overhead.

You'd have to move the rcu_read_lock_nesting into the thread_info, which
would be painful as you'd have to go touch all archs etc.
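
Roughly this shape, with made-up names:

/* keep the nesting count per cpu, next to the preempt count, and
 * swap it with the task's copy at context switch */
DECLARE_PER_CPU(int, rcu_read_nesting);

static inline void rcu_nesting_switch(struct task_struct *prev,
				      struct task_struct *next)
{
	prev->rcu_read_lock_nesting = this_cpu_read(rcu_read_nesting);
	this_cpu_write(rcu_read_nesting, next->rcu_read_lock_nesting);
}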

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 19:39                                 ` Paul E. McKenney
  2015-05-04 19:59                                   ` Rik van Riel
@ 2015-05-05 10:53                                   ` Peter Zijlstra
  2015-05-05 12:34                                     ` Paul E. McKenney
  1 sibling, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2015-05-05 10:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> counter in production kernels.  Even if there was, we have to sample this
> on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> would be where kernel entry/exit is, so I expect that this would be a
> net loss in overall performance.

We unconditionally have the preempt_count, its just not used much for
PREEMPT_COUNT=n kernels.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05 10:51                                   ` Peter Zijlstra
@ 2015-05-05 12:30                                     ` Paul E. McKenney
  0 siblings, 0 replies; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-05 12:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Tue, May 05, 2015 at 12:51:02PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 12:48:34PM +0200, Peter Zijlstra wrote:
> > On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> >  In case of the non-preemptible RCU, we could easily also
> > > increase current->rcu_read_lock_nesting at the same time
> > > we increase the preempt counter, and use that as the
> > > indicator to test whether the cpu is in an extended
> > > rcu quiescent state. That way there would be no extra
> > > overhead at syscall entry or exit at all. The trick
> > > would be getting the preempt count and the rcu read
> > > lock nesting count in the same cache line for each task.
> > 
> > Can't do that. Remember, on x86 we have per-cpu preempt count, and your
> > rcu_read_lock_nesting is per task.
> 
> Hmm, I suppose you could do the rcu_read_lock_nesting thing in a per-cpu
> counter too and transfer that into the task_struct on context switch.
> 
> If you manage to put both sides of that in the same cache things should
> not add significant overhead.
> 
> You'd have to move the rcu_read_lock_nesting into the thread_info, which
> would be painful as you'd have to go touch all archs etc..

Last I tried doing that, things got really messy at context-switch time.
Perhaps I simply didn't do the save/restore in the right place?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05 10:53                                   ` Peter Zijlstra
@ 2015-05-05 12:34                                     ` Paul E. McKenney
  2015-05-05 13:00                                       ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-05 12:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > counter in production kernels.  Even if there was, we have to sample this
> > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > would be where kernel entry/exit is, so I expect that this would be a
> > net loss in overall performance.
> 
> We unconditionally have the preempt_count, its just not used much for
> PREEMPT_COUNT=n kernels.

We have the field, you mean?  I might be missing something, but it still
appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
So what am I missing?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05 12:34                                     ` Paul E. McKenney
@ 2015-05-05 13:00                                       ` Peter Zijlstra
  2015-05-05 18:35                                         ` Paul E. McKenney
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2015-05-05 13:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> > On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > > counter in production kernels.  Even if there was, we have to sample this
> > > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > > would be where kernel entry/exit is, so I expect that this would be a
> > > net loss in overall performance.
> > 
> > We unconditionally have the preempt_count, its just not used much for
> > PREEMPT_COUNT=n kernels.
> 
> We have the field, you mean?  I might be missing something, but it still
> appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
> So what am I missing?

There's another layer of accessors that can in fact manipulate the
preempt_count even for !PREEMPT_COUNT kernels. They are currently used
by things like pagefault_disable().
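
For reference, pagefault_disable() is (roughly) built on those raw
accessors, so the count gets bumped even on PREEMPT_COUNT=n kernels:

static inline void pagefault_disable(void)
{
	preempt_count_inc();
	/* make sure the count is visible before a fault can hit */
	barrier();
}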

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05 13:00                                       ` Peter Zijlstra
@ 2015-05-05 18:35                                         ` Paul E. McKenney
  2015-05-05 21:09                                           ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-05 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
> > On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> > > On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> > > > But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> > > > counter in production kernels.  Even if there was, we have to sample this
> > > > on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> > > > would be where kernel entry/exit is, so I expect that this would be a
> > > > net loss in overall performance.
> > > 
> > > We unconditionally have the preempt_count, its just not used much for
> > > PREEMPT_COUNT=n kernels.
> > 
> > We have the field, you mean?  I might be missing something, but it still
> > appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
> > So what am I missing?
> 
> There's another layer of accessors that can in fact manipulate the
> preempt_count even for !PREEMPT_COUNT kernels. They are currently used
> by things like pagefault_disable().

OK, fair enough.

I am going to focus first on getting rid of (or at least greatly reducing)
RCU's interrupt disabling on the user-kernel entry/exit paths, since
that seems to be the biggest cost.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05 18:35                                         ` Paul E. McKenney
@ 2015-05-05 21:09                                           ` Rik van Riel
  2015-05-06  5:41                                             ` Paul E. McKenney
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-05 21:09 UTC (permalink / raw)
  To: paulmck, Peter Zijlstra
  Cc: Paolo Bonzini, Ingo Molnar, Andy Lutomirski, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Linus Torvalds

On 05/05/2015 02:35 PM, Paul E. McKenney wrote:
> On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
>> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
>>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
>>>> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
>>>>> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
>>>>> counter in production kernels.  Even if there was, we have to sample this
>>>>> on other CPUs, so the overhead of preempt_disable() and preempt_enable()
>>>>> would be where kernel entry/exit is, so I expect that this would be a
>>>>> net loss in overall performance.
>>>>
>>>> We unconditionally have the preempt_count, its just not used much for
>>>> PREEMPT_COUNT=n kernels.
>>>
>>> We have the field, you mean?  I might be missing something, but it still
>>> appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
>>> So what am I missing?
>>
>> There's another layer of accessors that can in fact manipulate the
>> preempt_count even for !PREEMPT_COUNT kernels. They are currently used
>> by things like pagefault_disable().
> 
> OK, fair enough.
> 
> I am going to focus first on getting rid of (or at least greatly reducing)
> RCU's interrupt disabling on the user-kernel entry/exit paths, since
> that seems to be the biggest cost.

Interrupts are already disabled on kernel-user and kernel-guest
switches.  Paolo and I have patches to move a bunch of the calls
to user_enter, user_exit, guest_enter, and guest_exit to places
where interrupts are already disabled, so we do not need to
disable them again.

With those in place, the vtime calculations are the largest
CPU user. I am working on those.
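
The shape of those call sites is roughly the following, where
__user_exit_raw() is a made-up name for a context tracking exit that
assumes interrupts are already off (the real patches differ):

static inline void user_exit_irqs_off(void)
{
	WARN_ON_ONCE(!irqs_disabled());	/* caller guarantees irqs off */
	__user_exit_raw();		/* no local_irq_save()/restore() pair inside */
}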

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05  5:54                                           ` Paul E. McKenney
@ 2015-05-06  1:49                                             ` Mike Galbraith
  2015-05-06  3:44                                               ` Mike Galbraith
  0 siblings, 1 reply; 83+ messages in thread
From: Mike Galbraith @ 2015-05-06  1:49 UTC (permalink / raw)
  To: paulmck
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Linus Torvalds

On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:

> You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
> RCU_FAST_NO_HZ=n?

FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n.  (I didn't
profile to see where costs lie though)

	-Mike


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-06  1:49                                             ` Mike Galbraith
@ 2015-05-06  3:44                                               ` Mike Galbraith
  2015-05-06  6:06                                                 ` Paul E. McKenney
  0 siblings, 1 reply; 83+ messages in thread
From: Mike Galbraith @ 2015-05-06  3:44 UTC (permalink / raw)
  To: paulmck
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Linus Torvalds

On Wed, 2015-05-06 at 03:49 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:
> 
> > You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
> > RCU_FAST_NO_HZ=n?
> 
> FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n.  (I didn't
> profile to see where costs lie though)

(did that)

100000000 * stat() on isolated cpu

NO_HZ_FULL off        inactive     housekeeper    nohz_full
real    0m14.266s     0m14.367s    0m20.427s      0m27.921s 
user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
(real)  1.000         1.007        1.431          1.957

 inactive                                         housekeeper                                        nohz_full
----------------------------------------------------------------------------------------------------------------------------------------------
 7.61%  [.] __xstat64                             11.12%  [k] context_tracking_exit                  7.41%  [k] context_tracking_exit
 7.04%  [k] system_call                            6.18%  [k] context_tracking_enter                 6.02%  [k] native_sched_clock
 6.96%  [k] copy_user_enhanced_fast_string         5.18%  [.] __xstat64                              4.69%  [k] rcu_eqs_enter_common.isra.37
 6.57%  [k] path_init                              4.89%  [k] system_call                            4.35%  [k] _raw_spin_lock
 5.92%  [k] system_call_after_swapgs               4.84%  [k] copy_user_enhanced_fast_string         4.30%  [k] context_tracking_enter
 5.44%  [k] lockref_put_return                     4.46%  [k] path_init                              4.25%  [k] kmem_cache_alloc
 4.69%  [k] link_path_walk                         4.30%  [k] system_call_after_swapgs               4.14%  [.] __xstat64
 4.47%  [k] lockref_get_not_dead                   4.12%  [k] kmem_cache_free                        3.89%  [k] rcu_eqs_exit_common.isra.38
 4.46%  [k] kmem_cache_free                        3.78%  [k] link_path_walk                         3.50%  [k] system_call
 4.20%  [k] kmem_cache_alloc                       3.62%  [k] lockref_put_return                     3.48%  [k] copy_user_enhanced_fast_string
 4.09%  [k] cp_new_stat                            3.43%  [k] kmem_cache_alloc                       3.02%  [k] system_call_after_swapgs
 3.38%  [k] vfs_getattr_nosec                      2.95%  [k] lockref_get_not_dead                   2.97%  [k] kmem_cache_free
 2.82%  [k] vfs_fstatat                            2.87%  [k] cp_new_stat                            2.88%  [k] lockref_put_return
 2.60%  [k] user_path_at_empty                     2.62%  [k] syscall_trace_leave                    2.61%  [k] link_path_walk
 2.47%  [k] path_lookupat                          1.91%  [k] vfs_getattr_nosec                      2.58%  [k] path_init
 2.14%  [k] strncpy_from_user                      1.89%  [k] syscall_trace_enter_phase1             2.15%  [k] lockref_get_not_dead
 2.11%  [k] getname_flags                          1.77%  [k] path_lookupat                          2.04%  [k] cp_new_stat
 2.10%  [k] generic_fillattr                       1.67%  [k] complete_walk                          1.89%  [k] generic_fillattr
 2.05%  [.] main                                   1.65%  [k] vfs_fstatat                            1.67%  [k] syscall_trace_leave
 1.89%  [k] complete_walk                          1.56%  [k] generic_fillattr                       1.59%  [k] vfs_getattr_nosec
 1.73%  [k] generic_permission                     1.55%  [k] user_path_at_empty                     1.49%  [k] get_vtime_delta
 1.50%  [k] system_call_fastpath                   1.54%  [k] strncpy_from_user                      1.32%  [k] user_path_at_empty
 1.37%  [k] legitimize_mnt                         1.53%  [k] getname_flags                          1.30%  [k] syscall_trace_enter_phase1
 1.30%  [k] dput                                   1.46%  [k] legitimize_mnt                         1.21%  [k] rcu_eqs_exit
 1.26%  [k] putname                                1.34%  [.] main                                   1.21%  [k] vfs_fstatat
 1.19%  [k] path_put                               1.32%  [k] int_with_check                         1.18%  [k] path_lookupat
 1.18%  [k] filename_lookup                        1.28%  [k] generic_permission                     1.15%  [k] getname_flags
 1.01%  [k] SYSC_newstat                           1.16%  [k] int_very_careful                       1.03%  [k] strncpy_from_user
 0.96%  [k] mntput_no_expire                       1.04%  [k] putname                                1.01%  [k] account_system_time
 0.79%  [k] path_cleanup                           0.94%  [k] dput                                   1.00%  [k] complete_walk
 0.79%  [k] mntput                                 0.91%  [k] context_tracking_user_exit             0.99%  [k] vtime_account_user


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-05 21:09                                           ` Rik van Riel
@ 2015-05-06  5:41                                             ` Paul E. McKenney
  0 siblings, 0 replies; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-06  5:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Linus Torvalds

On Tue, May 05, 2015 at 05:09:23PM -0400, Rik van Riel wrote:
> On 05/05/2015 02:35 PM, Paul E. McKenney wrote:
> > On Tue, May 05, 2015 at 03:00:26PM +0200, Peter Zijlstra wrote:
> >> On Tue, May 05, 2015 at 05:34:46AM -0700, Paul E. McKenney wrote:
> >>> On Tue, May 05, 2015 at 12:53:46PM +0200, Peter Zijlstra wrote:
> >>>> On Mon, May 04, 2015 at 12:39:23PM -0700, Paul E. McKenney wrote:
> >>>>> But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
> >>>>> counter in production kernels.  Even if there was, we have to sample this
> >>>>> on other CPUs, so the overhead of preempt_disable() and preempt_enable()
> >>>>> would be where kernel entry/exit is, so I expect that this would be a
> >>>>> net loss in overall performance.
> >>>>
> >>>> We unconditionally have the preempt_count, its just not used much for
> >>>> PREEMPT_COUNT=n kernels.
> >>>
> >>> We have the field, you mean?  I might be missing something, but it still
> >>> appears to me that preempt_disable() does nothing for PREEMPT=n kernels.
> >>> So what am I missing?
> >>
> >> There's another layer of accessors that can in fact manipulate the
> >> preempt_count even for !PREEMPT_COUNT kernels. They are currently used
> >> by things like pagefault_disable().
> > 
> > OK, fair enough.
> > 
> > I am going to focus first on getting rid of (or at least greatly reducing)
> > RCU's interrupt disabling on the user-kernel entry/exit paths, since
> > that seems to be the biggest cost.
> 
> Interrupts are already disabled on kernel-user and kernel-guest
> switches.  Paolo and I have patches to move a bunch of the calls
> to user_enter, user_exit, guest_enter, and guest_exit to places
> where interrupts are already disabled, so we do not need to
> disable them again.
> 
> With those in place, the vtime calculations are the largest
> CPU user. I am working on those.

OK, so I should stop worrying about making rcu_user_enter() and
rcu_user_exit() operate with interrupts disabled, and instead think about
the overhead of the operations themselves.  Probably starting from Mike
Galbraith's profile (thank you!) unless Rik has some reason to believe
that it is nonrepresentative.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-06  3:44                                               ` Mike Galbraith
@ 2015-05-06  6:06                                                 ` Paul E. McKenney
  2015-05-06  6:52                                                   ` Mike Galbraith
  0 siblings, 1 reply; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-06  6:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Linus Torvalds

On Wed, May 06, 2015 at 05:44:54AM +0200, Mike Galbraith wrote:
> On Wed, 2015-05-06 at 03:49 +0200, Mike Galbraith wrote:
> > On Mon, 2015-05-04 at 22:54 -0700, Paul E. McKenney wrote:
> > 
> > > You have RCU_FAST_NO_HZ=y, correct?  Could you please try measuring with
> > > RCU_FAST_NO_HZ=n?
> > 
> > FWIW, the syscall numbers I posted were RCU_FAST_NO_HZ=n.  (I didn't
> > profile to see where costs lie though)
> 
> (did that)

Nice, thank you!!!

> 100000000 * stat() on isolated cpu
> 
> NO_HZ_FULL off        inactive     housekeeper    nohz_full
> real    0m14.266s     0m14.367s    0m20.427s      0m27.921s 
> user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
> sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
> (real)  1.000         1.007        1.431          1.957
> 
>  inactive                                         housekeeper                                        nohz_full
> ----------------------------------------------------------------------------------------------------------------------------------------------
>  7.61%  [.] __xstat64                             11.12%  [k] context_tracking_exit                  7.41%  [k] context_tracking_exit
>  7.04%  [k] system_call                            6.18%  [k] context_tracking_enter                 6.02%  [k] native_sched_clock
>  6.96%  [k] copy_user_enhanced_fast_string         5.18%  [.] __xstat64                              4.69%  [k] rcu_eqs_enter_common.isra.37
>  6.57%  [k] path_init                              4.89%  [k] system_call                            4.35%  [k] _raw_spin_lock
>  5.92%  [k] system_call_after_swapgs               4.84%  [k] copy_user_enhanced_fast_string         4.30%  [k] context_tracking_enter
>  5.44%  [k] lockref_put_return                     4.46%  [k] path_init                              4.25%  [k] kmem_cache_alloc
>  4.69%  [k] link_path_walk                         4.30%  [k] system_call_after_swapgs               4.14%  [.] __xstat64
>  4.47%  [k] lockref_get_not_dead                   4.12%  [k] kmem_cache_free                        3.89%  [k] rcu_eqs_exit_common.isra.38
>  4.46%  [k] kmem_cache_free                        3.78%  [k] link_path_walk                         3.50%  [k] system_call
>  4.20%  [k] kmem_cache_alloc                       3.62%  [k] lockref_put_return                     3.48%  [k] copy_user_enhanced_fast_string
>  4.09%  [k] cp_new_stat                            3.43%  [k] kmem_cache_alloc                       3.02%  [k] system_call_after_swapgs
>  3.38%  [k] vfs_getattr_nosec                      2.95%  [k] lockref_get_not_dead                   2.97%  [k] kmem_cache_free
>  2.82%  [k] vfs_fstatat                            2.87%  [k] cp_new_stat                            2.88%  [k] lockref_put_return
>  2.60%  [k] user_path_at_empty                     2.62%  [k] syscall_trace_leave                    2.61%  [k] link_path_walk
>  2.47%  [k] path_lookupat                          1.91%  [k] vfs_getattr_nosec                      2.58%  [k] path_init
>  2.14%  [k] strncpy_from_user                      1.89%  [k] syscall_trace_enter_phase1             2.15%  [k] lockref_get_not_dead
>  2.11%  [k] getname_flags                          1.77%  [k] path_lookupat                          2.04%  [k] cp_new_stat
>  2.10%  [k] generic_fillattr                       1.67%  [k] complete_walk                          1.89%  [k] generic_fillattr
>  2.05%  [.] main                                   1.65%  [k] vfs_fstatat                            1.67%  [k] syscall_trace_leave
>  1.89%  [k] complete_walk                          1.56%  [k] generic_fillattr                       1.59%  [k] vfs_getattr_nosec
>  1.73%  [k] generic_permission                     1.55%  [k] user_path_at_empty                     1.49%  [k] get_vtime_delta
>  1.50%  [k] system_call_fastpath                   1.54%  [k] strncpy_from_user                      1.32%  [k] user_path_at_empty
>  1.37%  [k] legitimize_mnt                         1.53%  [k] getname_flags                          1.30%  [k] syscall_trace_enter_phase1
>  1.30%  [k] dput                                   1.46%  [k] legitimize_mnt                         1.21%  [k] rcu_eqs_exit
>  1.26%  [k] putname                                1.34%  [.] main                                   1.21%  [k] vfs_fstatat
>  1.19%  [k] path_put                               1.32%  [k] int_with_check                         1.18%  [k] path_lookupat
>  1.18%  [k] filename_lookup                        1.28%  [k] generic_permission                     1.15%  [k] getname_flags
>  1.01%  [k] SYSC_newstat                           1.16%  [k] int_very_careful                       1.03%  [k] strncpy_from_user
>  0.96%  [k] mntput_no_expire                       1.04%  [k] putname                                1.01%  [k] account_system_time
>  0.79%  [k] path_cleanup                           0.94%  [k] dput                                   1.00%  [k] complete_walk
>  0.79%  [k] mntput                                 0.91%  [k] context_tracking_user_exit             0.99%  [k] vtime_account_user
> 

So we have rcu_eqs_enter_common(4.69%), rcu_eqs_exit_common(3.89%),
and rcu_eqs_exit(1.21%).  Interesting that rcu_eqs_exit() appears at all,
given that it just does very simple operations, and rcu_eqs_exit_common()
is apparently not inlined (OK, perhaps it is partially inlined?).
This suggests that there are useful gains to be had via simple changes,
for example, placing the warnings behind CONFIG_NO_HZ_DEBUG or some such.
Or not, as this overhead might instead be due to cache misses on first
access to the relevant data structures.  But worth a try, perhaps.

Does the attached patch help at all?  If so, there might be some similar
gains to be had.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4e6902005228..3f09e5abb7b0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -616,7 +616,8 @@ static void rcu_eqs_enter_common(long long oldval, bool user)
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 
 	trace_rcu_dyntick(TPS("Start"), oldval, rdtp->dynticks_nesting);
-	if (!user && !is_idle_task(current)) {
+	if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+	    !user && !is_idle_task(current)) {
 		struct task_struct *idle __maybe_unused =
 			idle_task(smp_processor_id());
 
@@ -635,7 +636,8 @@ static void rcu_eqs_enter_common(long long oldval, bool user)
 	smp_mb__before_atomic();  /* See above. */
 	atomic_inc(&rdtp->dynticks);
 	smp_mb__after_atomic();  /* Force ordering with next sojourn. */
-	WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     atomic_read(&rdtp->dynticks) & 0x1);
 	rcu_dynticks_task_enter();
 
 	/*
@@ -661,7 +663,8 @@ static void rcu_eqs_enter(bool user)
 
 	rdtp = this_cpu_ptr(&rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
-	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     (oldval & DYNTICK_TASK_NEST_MASK) == 0);
 	if ((oldval & DYNTICK_TASK_NEST_MASK) == DYNTICK_TASK_NEST_VALUE) {
 		rdtp->dynticks_nesting = 0;
 		rcu_eqs_enter_common(oldval, user);
@@ -734,7 +737,8 @@ void rcu_irq_exit(void)
 	rdtp = this_cpu_ptr(&rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
 	rdtp->dynticks_nesting--;
-	WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     rdtp->dynticks_nesting < 0);
 	if (rdtp->dynticks_nesting)
 		trace_rcu_dyntick(TPS("--="), oldval, rdtp->dynticks_nesting);
 	else
@@ -759,10 +763,12 @@ static void rcu_eqs_exit_common(long long oldval, int user)
 	atomic_inc(&rdtp->dynticks);
 	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
 	smp_mb__after_atomic();  /* See above. */
-	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     !(atomic_read(&rdtp->dynticks) & 0x1));
 	rcu_cleanup_after_idle();
 	trace_rcu_dyntick(TPS("End"), oldval, rdtp->dynticks_nesting);
-	if (!user && !is_idle_task(current)) {
+	if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+	    !user && !is_idle_task(current)) {
 		struct task_struct *idle __maybe_unused =
 			idle_task(smp_processor_id());
 
@@ -786,7 +792,7 @@ static void rcu_eqs_exit(bool user)
 
 	rdtp = this_cpu_ptr(&rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
-	WARN_ON_ONCE(oldval < 0);
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && oldval < 0);
 	if (oldval & DYNTICK_TASK_NEST_MASK) {
 		rdtp->dynticks_nesting += DYNTICK_TASK_NEST_VALUE;
 	} else {
@@ -859,7 +865,8 @@ void rcu_irq_enter(void)
 	rdtp = this_cpu_ptr(&rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
 	rdtp->dynticks_nesting++;
-	WARN_ON_ONCE(rdtp->dynticks_nesting == 0);
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     rdtp->dynticks_nesting == 0);
 	if (oldval)
 		trace_rcu_dyntick(TPS("++="), oldval, rdtp->dynticks_nesting);
 	else
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c4e1cf04cf57..b908048f8d6a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1373,6 +1373,17 @@ config RCU_TRACE
 	  Say Y here if you want to enable RCU tracing
 	  Say N if you are unsure.
 
+config RCU_EQS_DEBUG
+	bool "Use this when adding any sort of NO_HZ support to your arch"
+	depends on DEBUG_KERNEL
+	help
+	  This option provides consistency checks in RCU's handling of
+	  NO_HZ.  These checks have proven quite helpful in detecting
+	  bugs in arch-specific NO_HZ code.
+
+	  Say N here if you need ultimate kernel/user switch latencies
+	  Say Y if you are unsure
+
 endmenu # "RCU Debugging"
 
 config DEBUG_BLOCK_EXT_DEVT


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-06  6:06                                                 ` Paul E. McKenney
@ 2015-05-06  6:52                                                   ` Mike Galbraith
  2015-05-06  7:01                                                     ` Mike Galbraith
  0 siblings, 1 reply; 83+ messages in thread
From: Mike Galbraith @ 2015-05-06  6:52 UTC (permalink / raw)
  To: paulmck
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Linus Torvalds

On Tue, 2015-05-05 at 23:06 -0700, Paul E. McKenney wrote:

> > 100000000 * stat() on isolated cpu
> > 
> > NO_HZ_FULL off        inactive     housekeeper    nohz_full
> > real    0m14.266s     0m14.367s    0m20.427s      0m27.921s 
> > user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
> > sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
> > (real)  1.000         1.007        1.431          1.957


> Does the attached patch help at all?

nohz_full
0m27.073s
0m9.423s
0m17.602s

Not a complete retest, and a pull in between, but I'd say that's a no.

	-Mike


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-06  6:52                                                   ` Mike Galbraith
@ 2015-05-06  7:01                                                     ` Mike Galbraith
  0 siblings, 0 replies; 83+ messages in thread
From: Mike Galbraith @ 2015-05-06  7:01 UTC (permalink / raw)
  To: paulmck
  Cc: Rik van Riel, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Linus Torvalds

On Wed, 2015-05-06 at 08:52 +0200, Mike Galbraith wrote:
> On Tue, 2015-05-05 at 23:06 -0700, Paul E. McKenney wrote:
> 
> > > 100000000 * stat() on isolated cpu
> > > 
> > > NO_HZ_FULL off        inactive     housekeeper    nohz_full
> > > real    0m14.266s     0m14.367s    0m20.427s      0m27.921s 
> > > user    0m1.756s      0m1.553s     0m1.976s       0m10.447s
> > > sys     0m12.508s     0m12.769s    0m18.400s      0m17.464s
> > > (real)  1.000         1.007        1.431          1.957
> 
> 
> > Does the attached patch help at all?
> 
> nohz_full
> 0m27.073s
> 0m9.423s
> 0m17.602s
> 
> Not a complete retest, and a pull in between, but I'd say that's a no.

(well, a second is a nice "at all", just not a _huge_ "at all";)


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-04 20:53                                         ` Rik van Riel
  2015-05-05  5:54                                           ` Paul E. McKenney
@ 2015-05-07  0:59                                           ` Frederic Weisbecker
  2015-05-07 15:44                                             ` Rik van Riel
  1 sibling, 1 reply; 83+ messages in thread
From: Frederic Weisbecker @ 2015-05-07  0:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: paulmck, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Linus Torvalds

On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:
> On 05/04/2015 04:38 PM, Paul E. McKenney wrote:
> > On Mon, May 04, 2015 at 04:13:50PM -0400, Rik van Riel wrote:
> >> On 05/04/2015 04:02 PM, Paul E. McKenney wrote:
> 
> >>> Hmmm...  But didn't earlier performance measurements show that the bulk of
> >>> the overhead was the delta-time computations rather than RCU accounting?
> >>
> >> The bulk of the overhead was disabling and re-enabling
> >> irqs around the calls to rcu_user_exit and rcu_user_enter :)
> > 
> > Really???  OK...  How about software irq masking?  (I know, that is
> > probably a bit of a scary change as well.)
> > 
> >> Of the remaining time, about 2/3 seems to be the vtime
> >> stuff, and the other 1/3 the rcu code.
> > 
> > OK, worth some thought, then.
> > 
> >> I suspect it makes sense to optimize both, though the
> >> vtime code may be the easiest :)
> > 
> > Making a crude version that does jiffies (or whatever) instead of
> > fine-grained computations might give good bang for the buck.  ;-)
> 
> Ingo's idea is to simply have cpu 0 check the current task
> on all other CPUs, see whether that task is running in system
> mode, user mode, guest mode, irq mode, etc and update that
> task's vtime accordingly.
> 
> I suspect the runqueue lock is probably enough to do that,
> and between rcu state and PF_VCPU we probably have enough
> information to see what mode the task is running in, with
> just remote memory reads.

Note that we could significantly reduce the overhead of vtime accounting
by only accumulating utime/stime in per-cpu buffers and actually accounting
them on context switch or task_cputime() calls. That way we remove the
overhead of the account_user/system_time() functions and the vtime locks.
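
For illustration, a rough sketch of that kind of deferred accumulation
(hypothetical names and helpers; the account_*_time() call details are
only an approximation of the real interfaces):

struct vtime_accum {
	u64	utime;	/* pending user time, in ns */
	u64	stime;	/* pending system time, in ns */
};
static DEFINE_PER_CPU(struct vtime_accum, vtime_accum);

/* Hot path: just accumulate locally, no locks taken. */
static inline void vtime_accumulate(bool user, u64 delta_ns)
{
	struct vtime_accum *va = this_cpu_ptr(&vtime_accum);

	if (user)
		va->utime += delta_ns;
	else
		va->stime += delta_ns;
}

/* Called on context switch or from task_cputime(): */
static void vtime_flush(struct task_struct *tsk)
{
	struct vtime_accum *va = this_cpu_ptr(&vtime_accum);

	/* The only place where the expensive accounting work happens: */
	account_user_time(tsk, nsecs_to_cputime(va->utime), 0);
	account_system_time(tsk, 0, nsecs_to_cputime(va->stime), 0);
	va->utime = 0;
	va->stime = 0;
}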

But doing the accounting from CPU 0 by just accounting 1 tick to the context
we remotely observe would certainly reduce the local accounting overhead to
the strict minimum. And I don't think we even need to lock the rq for that;
we can live with some lack of precision.

Now we must expect quite some overhead on CPU 0. Perhaps it should be
an option, as I'm not sure every full dynticks use case wants that.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-03 18:41                           ` Andy Lutomirski
@ 2015-05-07 10:35                             ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-07 10:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, linux-kernel, X86 ML, williams, Andrew Lutomirski,
	fweisbec, Peter Zijlstra, Heiko Carstens, Thomas Gleixner,
	Ingo Molnar, Paolo Bonzini, Paul E. McKenney, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> >> [...]  Also, we'd have to audit all the entries, and 
> >> system_call_after_swapgs currently enables interrupts early 
> >> enough that an interrupt before all the pushes will do 
> >> unpredictable things to pt_regs.
> >
> > An irq hardware frame won't push zero to that selector value, 
> > right? That's the only bad thing that would confuse the code.
> >
> 
> I think it's not quite that simple.  The syscall entry looks like, 
> roughly:
> 
> fix rsp;
> sti;
> push ss;
> push rsp;
> push flags;
> push cs;
> push rip;
> 
> We can get an interrupt part-way through those pushes.  Maybe there's
> no bad place where we could get an IRQ since SS is first, [...]

... and it should be covered by the 'STI window' where the instruction 
following a STI is still part of the irqs-off section.

> [...] but this is still nasty.

True.

Another approach would be to set up two aliases in the GDT, so we 
could freely change 'ss' between them and thus store information, 
without possibly confusing the syscall entry/exit code.

That still gets complicated by IST entries, which creates multiple 
positions for the 'flag'.

> I think I prefer a per-cpu approach over a per-task approach, since 
> it's easier to reason about and it should still only require one 
> instruction on entry and one instruction on exit.

Yes, agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-03 18:52             ` Rik van Riel
@ 2015-05-07 10:48               ` Ingo Molnar
  2015-05-07 12:18                 ` Frederic Weisbecker
  2015-05-07 12:22                 ` Andy Lutomirski
  0 siblings, 2 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-07 10:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, Mike Galbraith, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> > If, on the other hand, you're just going to remotely sample the 
> > in-memory context, that sounds good.
> 
> It's the latter.
> 
> If you look at /proc/<pid>/{stack,syscall,wchan} and other files, 
> you will see we already have ways to determine, from in memory 
> content, where a program is running at a certain point in time.
> 
> In fact, the timer interrupt based accounting does a similar thing. 
> It has a task examine its own in-memory state to figure out what it 
> was doing before the timer interrupt happened.
> 
> The kernel side stack pointer is probably enough to tell us whether 
> a task is active in kernel space, on an irq stack, or (maybe) in 
> user space. Not convinced about the latter, we may need to look at 
> the same state the RCU code keeps track of to see what mode a task 
> is in...
> 
> I am looking at the code to see what locks we need to grab.
> 
> I suspect the runqueue lock may be enough, to ensure that the task 
> struct, and stack do not go away while we are looking at them.

That will be enough, especially if you get to the task reference via 
rq->curr.

> We cannot take the lock_trace(task) from irq context, and we 
> probably do not need to anyway, since we do not care about a precise 
> stack trace for the task.

So one worry with this and similar approaches of statistically 
detecting user mode would be the fact that on the way out to 
user-space we don't really destroy the previous call trace - we just 
pop off the stack (non-destructively), restore RIPs and are gone.

We'll need that percpu flag I suspect.

And once we have the flag, we can get rid of the per syscall RCU 
callback as well, relatively easily: with CMPXCHG (in 
synchronize_rcu()!) we can reliably sample whether a CPU is in user 
mode right now, while the syscall entry/exit path does not use any 
atomics, we can just use a simple MOV.

Once we observe 'user mode', then we have observed quiescent state and 
synchronize_rcu() can continue. If we've observed kernel mode we can 
frob the remote task's TIF_ flags to make it go into a quiescent state 
publishing routine on syscall-return.

The only hard requirement of this scheme from the RCU synchronization 
POV is that all kernel contexts that may touch RCU state need to flip 
this flag reliably to 'kernel mode': i.e. all irq handlers, traps, 
NMIs and all syscall variants need to do this.

But once it's there, it's really neat.
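
To make the shape of this concrete, a very rough sketch with invented
names (memory ordering and the CMPXCHG detail deliberately ignored here,
and TIF_RCU_QS is only a proposed flag, it does not exist yet):

#define CTX_KERNEL	0
#define CTX_USER	1

DEFINE_PER_CPU(int, ctx_state);

/* Fast path: one plain store on entry, one on exit - no atomics. */
static inline void ctx_enter_kernel(void)
{
	this_cpu_write(ctx_state, CTX_KERNEL);
}

static inline void ctx_enter_user(void)
{
	this_cpu_write(ctx_state, CTX_USER);
}

/* synchronize_rcu() side, checked for each nohz_full CPU: */
static bool cpu_observed_quiescent(int cpu, struct task_struct *curr)
{
	if (per_cpu(ctx_state, cpu) == CTX_USER)
		return true;	/* user mode == extended quiescent state */

	/* Still in kernel mode: make it report a QS on syscall return. */
	set_tsk_thread_flag(curr, TIF_RCU_QS);
	return false;
}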

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 10:48               ` Ingo Molnar
@ 2015-05-07 12:18                 ` Frederic Weisbecker
  2015-05-07 12:29                   ` Ingo Molnar
  2015-05-07 12:22                 ` Andy Lutomirski
  1 sibling, 1 reply; 83+ messages in thread
From: Frederic Weisbecker @ 2015-05-07 12:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Andy Lutomirski, Mike Galbraith, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On Thu, May 07, 2015 at 12:48:45PM +0200, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
> > > If, on the other hand, you're just going to remotely sample the 
> > > in-memory context, that sounds good.
> > 
> > It's the latter.
> > 
> > If you look at /proc/<pid>/{stack,syscall,wchan} and other files, 
> > you will see we already have ways to determine, from in memory 
> > content, where a program is running at a certain point in time.
> > 
> > In fact, the timer interrupt based accounting does a similar thing. 
> > It has a task examine its own in-memory state to figure out what it 
> > was doing before the timer interrupt happened.
> > 
> > The kernel side stack pointer is probably enough to tell us whether 
> > a task is active in kernel space, on an irq stack, or (maybe) in 
> > user space. Not convinced about the latter, we may need to look at 
> > the same state the RCU code keeps track of to see what mode a task 
> > is in...
> > 
> > I am looking at the code to see what locks we need to grab.
> > 
> > I suspect the runqueue lock may be enough, to ensure that the task 
> > struct, and stack do not go away while we are looking at them.
> 
> That will be enough, especially if you get to the task reference via 
> rq->curr.
> 
> > We cannot take the lock_trace(task) from irq context, and we 
> > probably do not need to anyway, since we do not care about a precise 
> > stack trace for the task.
> 
> So one worry with this and similar approaches of statistically 
> detecting user mode would be the fact that on the way out to 
> user-space we don't really destroy the previous call trace - we just 
> pop off the stack (non-destructively), restore RIPs and are gone.
> 
> We'll need that percpu flag I suspect.

Note we have the context tracking state which tells where the current
task is: user/system/guest.

> 
> And once we have the flag, we can get rid of the per syscall RCU 
> callback as well, relatively easily: with CMPXCHG (in 
> synchronize_rcu()!) we can reliably sample whether a CPU is in user 
> mode right now, while the syscall entry/exit path does not use any 
> atomics, we can just use a simple MOV.
> 
> Once we observe 'user mode', then we have observed quiescent state and 
> synchronize_rcu() can continue. If we've observed kernel mode we can 
> frob the remote task's TIF_ flags to make it go into a quiescent state 
> publishing routine on syscall-return.
> 
> The only hard requirement of this scheme from the RCU synchronization 
> POV is that all kernel contexts that may touch RCU state need to flip 
> this flag reliably to 'kernel mode': i.e. all irq handlers, traps, 
> NMIs and all syscall variants need to do this.
> 
> But once it's there, it's really neat.
> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 10:48               ` Ingo Molnar
  2015-05-07 12:18                 ` Frederic Weisbecker
@ 2015-05-07 12:22                 ` Andy Lutomirski
  2015-05-07 12:44                   ` Ingo Molnar
  1 sibling, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-07 12:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams

On May 7, 2015 4:18 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Rik van Riel <riel@redhat.com> wrote:
>
> > > If, on the other hand, you're just going to remotely sample the
> > > in-memory context, that sounds good.
> >
> > It's the latter.
> >
> > If you look at /proc/<pid>/{stack,syscall,wchan} and other files,
> > you will see we already have ways to determine, from in memory
> > content, where a program is running at a certain point in time.
> >
> > In fact, the timer interrupt based accounting does a similar thing.
> > It has a task examine its own in-memory state to figure out what it
> > was doing before the timer interrupt happened.
> >
> > The kernel side stack pointer is probably enough to tell us whether
> > a task is active in kernel space, on an irq stack, or (maybe) in
> > user space. Not convinced about the latter, we may need to look at
> > the same state the RCU code keeps track of to see what mode a task
> > is in...
> >
> > I am looking at the code to see what locks we need to grab.
> >
> > I suspect the runqueue lock may be enough, to ensure that the task
> > struct, and stack do not go away while we are looking at them.
>
> That will be enough, especially if you get to the task reference via
> rq->curr.
>
> > We cannot take the lock_trace(task) from irq context, and we
> > probably do not need to anyway, since we do not care about a precise
> > stack trace for the task.
>
> So one worry with this and similar approaches of statistically
> detecting user mode would be the fact that on the way out to
> user-space we don't really destroy the previous call trace - we just
> pop off the stack (non-destructively), restore RIPs and are gone.
>
> We'll need that percpu flag I suspect.
>
> And once we have the flag, we can get rid of the per syscall RCU
> callback as well, relatively easily: with CMPXCHG (in
> synchronize_rcu()!) we can reliably sample whether a CPU is in user
> mode right now, while the syscall entry/exit path does not use any
> atomics, we can just use a simple MOV.
>
> Once we observe 'user mode', then we have observed quiescent state and
> synchronize_rcu() can continue. If we've observed kernel mode we can
> frob the remote task's TIF_ flags to make it go into a quiescent state
> publishing routine on syscall-return.
>

How does that work?

If the exit code writes the per-cpu flag and then checks TIF_whatever,
we need a barrier to avoid a race where we end up in user mode without
seeing the flag.

I think the right solution is to accept that race and just have the
RCU code send an IPI (or check again) if it sees too long of a period
elapse in kernel mode.

I think the flag should be a counter, though.  That way a workload
that makes lots of syscalls will be quickly detected as going through
quiescent states even if it's never actually observed in user mode.

> The only hard requirement of this scheme from the RCU synchronization
> POV is that all kernel contexts that may touch RCU state need to flip
> this flag reliably to 'kernel mode': i.e. all irq handlers, traps,
> NMIs and all syscall variants need to do this.
>
> But once it's there, it's really neat.
>

We already have to do this with the current code.  I went through and
checked all of the IST entries a couple versions ago.

I think we need to clean up the current garbage asm first, though.  See:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=d22f1dca4c7c93fdd1ce754e38d71d1961c0f9ac

(Very much unfinished, and it should probably be split up, but AFAICT
it works.  Don't hold your breath for a real version.)

--Andy

> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 12:18                 ` Frederic Weisbecker
@ 2015-05-07 12:29                   ` Ingo Molnar
  2015-05-07 15:47                     ` Rik van Riel
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-07 12:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rik van Riel, Andy Lutomirski, Mike Galbraith, linux-kernel,
	X86 ML, williams, Andrew Lutomirski, fweisbec, Peter Zijlstra,
	Heiko Carstens, Thomas Gleixner, Ingo Molnar, Paolo Bonzini


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> > > We cannot take the lock_trace(task) from irq context, and we 
> > > probably do not need to anyway, since we do not care about a 
> > > precise stack trace for the task.
> > 
> > So one worry with this and similar approaches of statistically 
> > detecting user mode would be the fact that on the way out to 
> > user-space we don't really destroy the previous call trace - we 
> > just pop off the stack (non-destructively), restore RIPs and are 
> > gone.
> > 
> > We'll need that percpu flag I suspect.
> 
> Note we have the context tracking state which tells where the 
> current task is: user/system/guest.

Yes, but that overhead is what I'm suggesting we get rid of, I thought 
Rik was trying to find a mechanism that would be independent of that?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 12:22                 ` Andy Lutomirski
@ 2015-05-07 12:44                   ` Ingo Molnar
  2015-05-07 12:49                     ` Ingo Molnar
  2015-05-07 12:52                     ` Andy Lutomirski
  0 siblings, 2 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-07 12:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams


* Andy Lutomirski <luto@amacapital.net> wrote:

> > > We cannot take the lock_trace(task) from irq context, and we 
> > > probably do not need to anyway, since we do not care about a 
> > > precise stack trace for the task.
> >
> > So one worry with this and similar approaches of statistically 
> > detecting user mode would be the fact that on the way out to 
> > user-space we don't really destroy the previous call trace - we 
> > just pop off the stack (non-destructively), restore RIPs and are 
> > gone.
> >
> > We'll need that percpu flag I suspect.
> >
> > And once we have the flag, we can get rid of the per syscall RCU 
> > callback as well, relatively easily: with CMPXCHG (in 
> > synchronize_rcu()!) we can reliably sample whether a CPU is in 
> > user mode right now, while the syscall entry/exit path does not 
> > use any atomics, we can just use a simple MOV.
> >
> > Once we observe 'user mode', then we have observed quiescent state 
> > and synchronize_rcu() can continue. If we've observed kernel mode 
> > we can frob the remote task's TIF_ flags to make it go into a 
> > quiescent state publishing routine on syscall-return.
> >
> 
> How does that work?
>
> If the exit code writes the per-cpu flag and then checks 
> TIF_whatever, we need a barrier to avoid a race where we end up in 
> user mode without seeing the flag.

No, the TIF_RCU_SYNC flag would be independent of the user-mode flag.

Also, the user-mode flag would be changed from the 'kernel-mode' value 
to the 'user-mode' value after we do the regular TIF checks - just 
before we hit user-space.

The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being 
executed on some other CPU not doing RT work) to intelligently wait 
for the remote (RT work doing) CPU to finish executing kernel code, 
without polling or so.

And yes, the TIF_RCU_QS handler on the remote CPU would have to 
execute an SMP barrier, before it allows an RCU quiescent state to 
pass. Note that the regular RCU code for quiescent state publishing 
already has that barrier, typically something like:

	this_cpu_inc(rcu_qs_ctr);

That in itself is enough for synchronize_rcu(): it could do a small 
timing loop with short sleeps until it sees rcu_qs_ctr increase.

> I think the right solution is to accept that race and just have the 
> RCU code send an IPI (or check again) if it sees too long of a 
> period elapse in kernel mode.

I don't think there's any need for an IPI.

> I think the flag should be a counter, though.  That way a workload 
> that makes lots of syscalls will be quickly detected as going 
> through quiescent states even if it's never actually observed in 
> user mode.

Flag write is easier on the CPU than an INC/DEC: only a store to a 
percpu location, no load needed, and no value dependencies. The store 
will be program ordered, so it's a spin_unlock() work-alike.

> > The only hard requirement of this scheme from the RCU 
> > synchronization POV is that all kernel contexts that may touch RCU 
> > state need to flip this flag reliably to 'kernel mode': i.e. all 
> > irq handlers, traps, NMIs and all syscall variants need to do 
> > this.
> >
> > But once it's there, it's really neat.
> 
> We already have to do this with the current code.  I went through 
> and checked all of the IST entries a couple versions ago.

Yeah - I just mean if the primary flag is _not_ TIF driven (which is 
my suggestion, and which I thought you suggested as well) then all 
code paths need to be covered.

Things like the irq-tracking callbacks are debugging and 
instrumentation code - the user/kernel mode flag would be a hard 
requirement on RCU correctness: missing a flag update might cause the 
kernel to crash in non-obvious ways.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 12:44                   ` Ingo Molnar
@ 2015-05-07 12:49                     ` Ingo Molnar
  2015-05-08  6:17                       ` Paul E. McKenney
  2015-05-07 12:52                     ` Andy Lutomirski
  1 sibling, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-07 12:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams, Paul E. McKenney


* Ingo Molnar <mingo@kernel.org> wrote:

> The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() 
> (being executed on some other CPU not doing RT work) to 
> intelligently wait for the remote (RT work doing) CPU to finish 
> executing kernel code, without polling or so.

it's basically a cheap IPI being inserted on the remote CPU.

We need the TIF_RCU_QS callback not just to wait intelligently, but 
mainly to elapse a grace period, otherwise synchronize_rcu() might not 
ever make progress: think a SCHED_FIFO task doing some kernel work, 
synchronize_rcu() stumbling upon it - but the SCHED_FIFO task 
otherwise never scheduling and never getting any timer irqs either, 
and thus never entering quiescent state.

(Cc:-ed Paul too, he might be interested in this as well.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 12:44                   ` Ingo Molnar
  2015-05-07 12:49                     ` Ingo Molnar
@ 2015-05-07 12:52                     ` Andy Lutomirski
  2015-05-07 15:08                       ` Ingo Molnar
  1 sibling, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-07 12:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams

On Thu, May 7, 2015 at 5:44 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> > > We cannot take the lock_trace(task) from irq context, and we
>> > > probably do not need to anyway, since we do not care about a
>> > > precise stack trace for the task.
>> >
>> > So one worry with this and similar approaches of statistically
>> > detecting user mode would be the fact that on the way out to
>> > user-space we don't really destroy the previous call trace - we
>> > just pop off the stack (non-destructively), restore RIPs and are
>> > gone.
>> >
>> > We'll need that percpu flag I suspect.
>> >
>> > And once we have the flag, we can get rid of the per syscall RCU
>> > callback as well, relatively easily: with CMPXCHG (in
>> > synchronize_rcu()!) we can reliably sample whether a CPU is in
>> > user mode right now, while the syscall entry/exit path does not
>> > use any atomics, we can just use a simple MOV.
>> >
>> > Once we observe 'user mode', then we have observed quiescent state
>> > and synchronize_rcu() can continue. If we've observed kernel mode
>> > we can frob the remote task's TIF_ flags to make it go into a
>> > quiescent state publishing routine on syscall-return.
>> >
>>
>> How does that work?
>>
>> If the exit code writes the per-cpu flag and then checks
>> TIF_whatever, we need a barrier to avoid a race where we end up in
>> user mode without seeing the flag.
>
> No, the TIF_RCU_SYNC flag would be independent of the user-mode flag.
>
> Also, the user-mode flag would be changed from the 'kernel-mode' value
> to the 'user-mode' value after we do the regular TIF checks - just
> before we hit user-space.
>
> The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being
> executed on some other CPU not doing RT work) to intelligently wait
> for the remote (RT work doing) CPU to finish executing kernel code,
> without polling or so.
>
> And yes, the TIF_RCU_QS handler on the remote CPU would have to
> execute an SMP barrier, before it allows an RCU quiescent state to
> pass. Note that the regular RCU code for quiescent state publishing
> already has that barrier, typically something like:
>
>         this_cpu_inc(rcu_qs_ctr);
>
> That in itself is enough for synchronize_rcu(): it could do a small
> timing loop with short sleeps until it sees rcu_qs_ctr increase.

I think one or both of us is missing something or we're just talking
about different things.

If the nohz/RT cpu is about to enter user mode and stay there for a
long time, it does:

this_cpu_inc(rcu_qs_ctr);

or whatever.  Maybe we add:

this_cpu_set(rcu_state) = IN_USER;

or however it's spelled.

The remote CPU wants to check our state.  If this happens just before
the IN_USER write or rcu_qs_ctr increment becomes visible, then it'll
think we're in kernel mode.  Now it either has to poll (which is fine)
or try to get us to tell the RCU core when we become quiescent by
setting TIF_RCU_THINGY.

The problem is that I don't see how TIF_RCU_THINGY can work reliably.
If the remote CPU sets it, it'll be too late and we'll still enter
user mode without seeing it.  If it's just an optimization, though,
then it should be fine.

I still feel like this is all overengineered.  Shouldn't rcu_qs_ctr be
enough for all of the above?

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 12:52                     ` Andy Lutomirski
@ 2015-05-07 15:08                       ` Ingo Molnar
  2015-05-07 17:47                         ` Andy Lutomirski
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-07 15:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams


* Andy Lutomirski <luto@amacapital.net> wrote:

> I think one or both of us is missing something or we're just talking 
> about different things.

That's very much possible!

I think part of the problem is that I called the 'remote CPU' the RT 
CPU, while you seem to be calling it the CPU that does the 
synchronize_rcu().

So lets start again, with calling the synchronize_rcu() the 'remote 
CPU', and the one doing the RT workload the 'RT CPU':

> If the nohz/RT cpu is about to enter user mode and stay there for a 
> long time, it does:
> 
>   this_cpu_inc(rcu_qs_ctr);
> 
> or whatever.  Maybe we add:
> 
>   this_cpu_set(rcu_state) = IN_USER;
> 
> or however it's spelled.
> 
> The remote CPU wants to check our state.  If this happens just 
> before the IN_USER write or rcu_qs_ctr increment becomes visible, 
> then it'll think we're in kernel mode.  Now it either has to poll 
> (which is fine) or try to get us to tell the RCU core when we become 
> quiescent by setting TIF_RCU_THINGY.

So do you mean:

   this_cpu_set(rcu_state) = IN_KERNEL;
   ...
   this_cpu_inc(rcu_qs_ctr);
   this_cpu_set(rcu_state) = IN_USER;

?

So in your proposal we'd have an INC and two MOVs. I think we can make 
it just two simple stores into a byte flag, one on entry and one on 
exit:

   this_cpu_set(rcu_state) = IN_KERNEL;
   ...
   this_cpu_set(rcu_state) = IN_USER;

plus the rare but magic TIF_RCU_THINGY that tells a waiting 
synchronize_rcu() about the next quiescent state.

> The problem is that I don't see how TIF_RCU_THINGY can work 
> reliably. If the remote CPU sets it, it'll be too late and we'll 
> still enter user mode without seeing it.  If it's just an 
> optimization, though, then it should be fine.

Well, after setting it, the remote CPU has to re-check whether the RT 
CPU has entered user-mode - before it goes to wait.

If it has entered user mode then the remote CPU has observed quiescent 
state and is done. If it's still in kernel mode then it waits until 
the TIF_RCU_THINGY does its completion.

> I still feel like this is all overengineered.  Shouldn't rcu_qs_ctr 
> be enough for all of the above?

How would just INC rcu_qs_ctr (without the IN_KERNEL/IN_USER flag) 
work?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: question about RCU dynticks_nesting
  2015-05-07  0:59                                           ` Frederic Weisbecker
@ 2015-05-07 15:44                                             ` Rik van Riel
  0 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2015-05-07 15:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paulmck, Paolo Bonzini, Ingo Molnar, Andy Lutomirski,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Linus Torvalds

On 05/06/2015 08:59 PM, Frederic Weisbecker wrote:
> On Mon, May 04, 2015 at 04:53:16PM -0400, Rik van Riel wrote:

>> Ingo's idea is to simply have cpu 0 check the current task
>> on all other CPUs, see whether that task is running in system
>> mode, user mode, guest mode, irq mode, etc and update that
>> task's vtime accordingly.
>>
>> I suspect the runqueue lock is probably enough to do that,
>> and between rcu state and PF_VCPU we probably have enough
>> information to see what mode the task is running in, with
>> just remote memory reads.
> 
> Note that we could significantly reduce the overhead of vtime accounting
> by only accumulating utime/stime in per-cpu buffers and actually accounting
> them on context switch or task_cputime() calls. That way we remove the
> overhead of the account_user/system_time() functions and the vtime locks.
> 
> But doing the accounting from CPU 0 by just accounting 1 tick to the context
> we remotely observe would certainly reduce the local accounting overhead to
> the strict minimum. And I don't think we even need to lock the rq for that;
> we can live with some lack of precision.

We can live with lack of precision, but we cannot live with data
structures being re-used and pointers pointing off into la-la
land while we are following them :)

> Now we must expect quite some overhead on CPU 0. Perhaps it should be
> an option, as I'm not sure every full dynticks use case wants that.

Let's see if I can get this to work before deciding whether we need yet
another configurable option :)

It may be possible to have most of the overhead happen from schedulable
context, maybe softirq code. Right now I am still stuck in the giant
spaghetti mess under account_process_tick, with dozens of functions that
only work on cpu-local, task-local, or architecture-dependent cpu- or
task-local data...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 12:29                   ` Ingo Molnar
@ 2015-05-07 15:47                     ` Rik van Riel
  2015-05-08  7:58                       ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Rik van Riel @ 2015-05-07 15:47 UTC (permalink / raw)
  To: Ingo Molnar, Frederic Weisbecker
  Cc: Andy Lutomirski, Mike Galbraith, linux-kernel, X86 ML, williams,
	Andrew Lutomirski, fweisbec, Peter Zijlstra, Heiko Carstens,
	Thomas Gleixner, Ingo Molnar, Paolo Bonzini

On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
>>>> We cannot take the lock_trace(task) from irq context, and we 
>>>> probably do not need to anyway, since we do not care about a 
>>>> precise stack trace for the task.
>>>
>>> So one worry with this and similar approaches of statistically 
>>> detecting user mode would be the fact that on the way out to 
>>> user-space we don't really destroy the previous call trace - we 
>>> just pop off the stack (non-destructively), restore RIPs and are 
>>> gone.
>>>
>>> We'll need that percpu flag I suspect.
>>
>> Note we have the context tracking state which tells where the 
>> current task is: user/system/guest.
> 
> Yes, but that overhead is what I'm suggesting we get rid of, I thought 
> Rik was trying to find a mechanism that would be independent of that?

One thing at a time :)

I am working on the timer sampling stuff, which should be easy
to adapt to a different user/system/guest/irq/softirq/... tracking
thing, if somebody else comes up with a more efficient way to do
that.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 15:08                       ` Ingo Molnar
@ 2015-05-07 17:47                         ` Andy Lutomirski
  2015-05-08  6:37                           ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-07 17:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fweisbec, Paolo Bonzini, X86 ML, Thomas Gleixner, Peter Zijlstra,
	Heiko Carstens, Ingo Molnar, Mike Galbraith, Rik van Riel,
	linux-kernel, williams

On May 7, 2015 8:38 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
> > I think one or both of us is missing something or we're just talking
> > about different things.
>
> That's very much possible!
>
> I think part of the problem is that I called the 'remote CPU' the RT
> CPU, while you seem to be calling it the CPU that does the
> synchronize_rcu().
>
> So lets start again, with calling the synchronize_rcu() the 'remote
> CPU', and the one doing the RT workload the 'RT CPU':
>
> > If the nohz/RT cpu is about to enter user mode and stay there for a
> > long time, it does:
> >
> >   this_cpu_inc(rcu_qs_ctr);
> >
> > or whatever.  Maybe we add:
> >
> >   this_cpu_set(rcu_state) = IN_USER;
> >
> > or however it's spelled.
> >
> > The remote CPU wants to check our state.  If this happens just
> > before the IN_USER write or rcu_qs_ctr increment becomes visible,
> > then it'll think we're in kernel mode.  Now it either has to poll
> > (which is fine) or try to get us to tell the RCU core when we become
> > quiescent by setting TIF_RCU_THINGY.
>
> So do you mean:
>
>    this_cpu_set(rcu_state) = IN_KERNEL;
>    ...
>    this_cpu_inc(rcu_qs_ctr);
>    this_cpu_set(rcu_state) = IN_USER;
>
> ?
>
> So in your proposal we'd have an INC and two MOVs. I think we can make
> it just two simple stores into a byte flag, one on entry and one on
> exit:
>
>    this_cpu_set(rcu_state) = IN_KERNEL;
>    ...
>    this_cpu_set(rcu_state) = IN_USER;
>

I was thinking that either a counter or a state flag could make sense.
Doing both would be pointless.  The counter could use the low bits to
indicate the state.  The benefit of the counter would be that the
RCU-waiting CPU could observe that the counter has incremented and
that therefore a grace period has elapsed.  Getting it right would
require lots of care.

> plus the rare but magic TIF_RCU_THINGY that tells a waiting
> synchronize_rcu() about the next quiescent state.
>
> > The problem is that I don't see how TIF_RCU_THINGY can work
> > reliably. If the remote CPU sets it, it'll be too late and we'll
> > still enter user mode without seeing it.  If it's just an
> > optimization, though, then it should be fine.
>
> Well, after setting it, the remote CPU has to re-check whether the RT
> CPU has entered user-mode - before it goes to wait.

How?

Suppose the exit path looked like:

this_cpu_write(rcu_state, IN_USER);

if (ti->flags & _TIF_RCU_NOTIFY) {
    if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
        slow_notify_rcu_that_we_are_exiting();
}

iret or sysret;

The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
_TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
because stores can be delayed.  (It could even happen after sysret,
IIRC, but IRET is serializing.)

If we put an mfence after this_cpu_set or did an unconditional
test_and_clear_bit on ti->flags then this problem goes away, but that
would probably be slower than we'd like.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 12:49                     ` Ingo Molnar
@ 2015-05-08  6:17                       ` Paul E. McKenney
  0 siblings, 0 replies; 83+ messages in thread
From: Paul E. McKenney @ 2015-05-08  6:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, fweisbec, Paolo Bonzini, Thomas Gleixner,
	X86 ML, Peter Zijlstra, Ingo Molnar, Heiko Carstens,
	Mike Galbraith, linux-kernel, Rik van Riel, williams

On Thu, May 07, 2015 at 02:49:39PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() 
> > (being executed on some other CPU not doing RT work) to 
> > intelligently wait for the remote (RT work doing) CPU to finish 
> > executing kernel code, without polling or so.
> 
> it's basically a cheap IPI being inserted on the remote CPU.
> 
> We need the TIF_RCU_QS callback not just to wait intelligently, but 
> mainly to elapse a grace period, otherwise synchronize_rcu() might not 
> ever make progress: think a SCHED_FIFO task doing some kernel work, 
> synchronize_rcu() stumbling upon it - but the SCHED_FIFO task 
> otherwise never scheduling and never getting any timer irqs either, 
> and thus never entering quiescent state.
> 
> (Cc:-ed Paul too, he might be interested in this as well.)

Hmmm...  So the point is that a NO_HZ_FULL CPU periodically posts
callbacks to indicate that it has passed through a quiescent state,
for example, upon entry to and/or exit from userspace?  These callbacks
would then be offloaded to some other CPU.

But the callback would not be invoked until RCU saw a grace period,
so I must be missing something here...  Probably that the TIF_RCU_QS
callback is not an RCU callback, but something else?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 17:47                         ` Andy Lutomirski
@ 2015-05-08  6:37                           ` Ingo Molnar
  2015-05-08 10:59                             ` Andy Lutomirski
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-08  6:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: fweisbec, Paolo Bonzini, X86 ML, Thomas Gleixner, Peter Zijlstra,
	Heiko Carstens, Ingo Molnar, Mike Galbraith, Rik van Riel,
	linux-kernel, williams


* Andy Lutomirski <luto@amacapital.net> wrote:

> > So do you mean:
> >
> >    this_cpu_set(rcu_state) = IN_KERNEL;
> >    ...
> >    this_cpu_inc(rcu_qs_ctr);
> >    this_cpu_set(rcu_state) = IN_USER;
> >
> > ?
> >
> > So in your proposal we'd have an INC and two MOVs. I think we can make
> > it just two simple stores into a byte flag, one on entry and one on
> > exit:
> >
> >    this_cpu_set(rcu_state) = IN_KERNEL;
> >    ...
> >    this_cpu_set(rcu_state) = IN_USER;
> >
> 
> I was thinking that either a counter or a state flag could make sense.
> Doing both would be pointless.  The counter could use the low bits to
> indicate the state.  The benefit of the counter would be that the
> RCU-waiting CPU could observe that the counter has incremented and
> that therefore a grace period has elapsed.  Getting it right would
> require lots of care.

So if you mean:

       <syscall entry>
       ...
       this_cpu_inc(rcu_qs_ctr);
       <syscall exit>

I don't see how this would work reliably: how do you handle the case 
of a SCHED_FIFO task never returning from user-space (common technique 
in RT apps)? synchronize_rcu() would block indefinitely as it would 
never see rcu_qs_ctr increase.

We have to be able to observe user-mode anyway, for system-time 
statistics purposes, and that flag could IMHO also drive the RCU GC 
machinery.

> > > The problem is that I don't see how TIF_RCU_THINGY can work 
> > > reliably. If the remote CPU sets it, it'll be too late and we'll 
> > > still enter user mode without seeing it.  If it's just an 
> > > optimization, though, then it should be fine.
> >
> > Well, after setting it, the remote CPU has to re-check whether the 
> > RT CPU has entered user-mode - before it goes to wait.
> 
> How?
> 
> Suppose the exit path looked like:
> 
> this_cpu_write(rcu_state, IN_USER);
> 
> if (ti->flags & _TIF_RCU_NOTIFY) {
>     if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
>         slow_notify_rcu_that_we_are_exiting();
> }
> 
> iret or sysret;

No, it would look like this:

   this_cpu_write(rcu_state, IN_USER);
   iret or sysret;

I.e. IN_USER is set well after all notifications are checked. No 
kernel execution happens afterwards. (No extra checks added - the 
regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)

( Same goes for idle: we just mark it IN_IDLE and move it back to 
  IN_KERNEL after the idling ends. )

> The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> because stores can be delayed.  (It could even happen after sysret, 
> IIRC, but IRET is serializing.)

All it has to do in the synchronize_rcu() slowpath is something like:

	if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
		smp_mb__before_atomic();
		set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
		smp_rmb();
		if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
			... go wait ...
	}
	/* Cool, we observed quiescent state: */

The cost of the trivial barrier is nothing compared to the 'go wait' 
cost which we will pay in 99.9% of the cases!

> If we put an mfence after this_cpu_set or did an unconditional 
> test_and_clear_bit on ti->flags then this problem goes away, but 
> that would probably be slower than we'd like.

We are talking about a dozen cycles, while a typical synchronize_rcu() 
will wait millions (sometimes billions) of cycles. There's absolutely 
zero performance concern here and it's all CPU local in any case.

In fact a user-mode/kernel-mode flag speeds up naive implementations 
of synchronize_rcu(): because it's able to observe extended quiescent 
state immediately, without having to wait for a counter to increase 
(which was always the classic way to observe grace periods).

If all CPUs are in user mode or are idle (which is rather common!) 
then synchronize_rcu() could return almost immediately - while 
previously it had to wait for scheduling or periodic timer irqs to 
trigger on all CPUs - adding many millisecs of delay even in the best 
of cases.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-07 15:47                     ` Rik van Riel
@ 2015-05-08  7:58                       ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-08  7:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Andy Lutomirski, Mike Galbraith,
	linux-kernel, X86 ML, williams, Andrew Lutomirski, fweisbec,
	Peter Zijlstra, Heiko Carstens, Thomas Gleixner, Ingo Molnar,
	Paolo Bonzini


* Rik van Riel <riel@redhat.com> wrote:

> On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> >>>> We cannot take the lock_trace(task) from irq context, and we 
> >>>> probably do not need to anyway, since we do not care about a 
> >>>> precise stack trace for the task.
> >>>
> >>> So one worry with this and similar approaches of statistically 
> >>> detecting user mode would be the fact that on the way out to 
> >>> user-space we don't really destroy the previous call trace - we 
> >>> just pop off the stack (non-destructively), restore RIPs and are 
> >>> gone.
> >>>
> >>> We'll need that percpu flag I suspect.
> >>
> >> Note we have the context tracking state which tells where the 
> >> current task is: user/system/guest.
> > 
> > Yes, but that overhead is what I'm suggesting we get rid of, I thought 
> > Rik was trying to find a mechanism that would be independent of that?
> 
> One thing at a time :)
> 
> I am working on the timer sampling stuff, which should be easy to 
> adapt to a different user/system/guest/irq/softirq/... tracking 
> thing, if somebody else comes up with a more efficient way to do 
> that.

So if you make the timer sampling use a percpu variable, and set that 
variable from the existing callbacks, then we could do this gradually: 
first the timer sampling uses the flag, then RCU could use it, and 
finally we could push it out to minimal assembly code.

But it's important to start out with a percpu flag to track this all.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-08  6:37                           ` Ingo Molnar
@ 2015-05-08 10:59                             ` Andy Lutomirski
  2015-05-08 11:27                               ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-08 10:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams

On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
> > > So do you mean:
> > >
> > >    this_cpu_set(rcu_state) = IN_KERNEL;
> > >    ...
> > >    this_cpu_inc(rcu_qs_ctr);
> > >    this_cpu_set(rcu_state) = IN_USER;
> > >
> > > ?
> > >
> > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > it just two simple stores into a byte flag, one on entry and one on
> > > exit:
> > >
> > >    this_cpu_set(rcu_state) = IN_KERNEL;
> > >    ...
> > >    this_cpu_set(rcu_state) = IN_USER;
> > >
> >
> > I was thinking that either a counter or a state flag could make sense.
> > Doing both would be pointless.  The counter could use the low bits to
> > indicate the state.  The benefit of the counter would be that the
> > RCU-waiting CPU could observe that the counter has incremented and
> > that therefore a grace period has elapsed.  Getting it right would
> > require lots of care.
>
> So if you mean:
>
>        <syscall entry>
>        ...
>        this_cpu_inc(rcu_qs_ctr);
>        <syscall exit>
>
> I don't see how this would work reliably: how do you handle the case
> of a SCHED_FIFO task never returning from user-space (common technique
> in RT apps)? synchronize_rcu() would block indefinitely as it would
> never see rcu_qs_ctr increase.
>
> We have to be able to observe user-mode anyway, for system-time
> statistics purposes, and that flag could IMHO also drive the RCU GC
> machinery.

The counter would have to be fancier than that to work.  We could say
that an even value means user mode, for example.  IOW the high bits of
the counter would count transitions to quiescent and the low bits
would indicate the state.

Actually, I'd count backwards.  We could use an andl instruction to go
to user mode and a decl to go to kernel mode.

>
> > > > The problem is that I don't see how TIF_RCU_THINGY can work
> > > > reliably. If the remote CPU sets it, it'll be too late and we'll
> > > > still enter user mode without seeing it.  If it's just an
> > > > optimization, though, then it should be fine.
> > >
> > > Well, after setting it, the remote CPU has to re-check whether the
> > > RT CPU has entered user-mode - before it goes to wait.
> >
> > How?
> >
> > Suppose the exit path looked like:
> >
> > this_cpu_write(rcu_state, IN_USER);
> >
> > if (ti->flags & _TIF_RCU_NOTIFY) {
> >     if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
> >         slow_notify_rcu_that_we_are_exiting();
> > }
> >
> > iret or sysret;
>
> No, it would look like this:
>
>    this_cpu_write(rcu_state, IN_USER);
>    iret or sysret;
>
> I.e. IN_USER is set well after all notifications are checked. No
> kernel execution happens afterwards. (No extra checks added - the
> regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)
>
> ( Same goes for idle: we just mark it IN_IDLE and move it back to
>   IN_KERNEL after the idling ends. )
>
> > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
> > because stores can be delayed.  (It could even happen after sysret,
> > IIRC, but IRET is serializing.)
>
> All it has to do in the synchronize_rcu() slowpath is something like:

I don't think this works.  Suppose rt_cpu starts in kernel mode.  Then
it checks ti flags.

>
>         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>                 smp_mb__before_atomic();
>                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>                 smp_rmb();
>                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

Now it sets rcu_state and exits to user mode.  It never notices
TIF_RCU_NOTIFY.  We could fix this by sending an IPI to kick it.

>                         ... go wait ...
>         }
>         /* Cool, we observed quiescent state: */
>
> The cost of the trivial barrier is nothing compared to the 'go wait'
> cost which we will pay in 99.9% of the cases!
>
> > If we put an mfence after this_cpu_set or did an unconditional
> > test_and_clear_bit on ti->flags then this problem goes away, but
> > that would probably be slower than we'd like.
>
> We are talking about a dozen cycles, while a typical synchronize_rcu()
> will wait millions (sometimes billions) of cycles. There's absolutely
> zero performance concern here and it's all CPU local in any case.
>

The barrier would affect all syscalls, though.  Admittedly that's
still much cheaper than the current thing, but given that we can round
trip a syscall in 110 cycles, we'd take at least 10% overhead.

My preference would be to use a counter and skip the barrier.  If the
waiting CPU polled the counter, then, even if it lost this race, it
would be guaranteed to see the counter change reasonably soon as long
as no CPU implementation sits on its store buffer forever.

> In fact a user-mode/kernel-mode flag speeds up naive implementations
> of synchronize_rcu(): because it's able to observe extended quiescent
> state immediately, without having to wait for a counter to increase
> (which was always the classic way to observe grace periods).
>
> If all CPUs are in user mode or are idle (which is rather common!)
> then synchronize_rcu() could return almost immediately - while
> previously it had to wait for scheduling or periodic timer irqs to
> trigger on all CPUs - adding many millisecs of delay even in the best
> of cases.

I agree absolutely, except in the whole-tons-of-cpus case, where I
think the RCU tree stuff might preclude this.  I'm not sure how well
that works with full nohz.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-08 10:59                             ` Andy Lutomirski
@ 2015-05-08 11:27                               ` Ingo Molnar
  2015-05-08 12:56                                 ` Andy Lutomirski
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2015-05-08 11:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams


* Andy Lutomirski <luto@amacapital.net> wrote:

> On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
> >
> >
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > > > So do you mean:
> > > >
> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
> > > >    ...
> > > >    this_cpu_inc(rcu_qs_ctr);
> > > >    this_cpu_set(rcu_state) = IN_USER;
> > > >
> > > > ?
> > > >
> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > > it just two simple stores into a byte flag, one on entry and one on
> > > > exit:
> > > >
> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
> > > >    ...
> > > >    this_cpu_set(rcu_state) = IN_USER;
> > > >
> > >
> > > I was thinking that either a counter or a state flag could make sense.
> > > Doing both would be pointless.  The counter could use the low bits to
> > > indicate the state.  The benefit of the counter would be that the
> > > RCU-waiting CPU could observe that the counter has incremented and
> > > that therefore a grace period has elapsed.  Getting it right would
> > > require lots of care.
> >
> > So if you mean:
> >
> >        <syscall entry>
> >        ...
> >        this_cpu_inc(rcu_qs_ctr);
> >        <syscall exit>
> >
> > I don't see how this would work reliably: how do you handle the case
> > of a SCHED_FIFO task never returning from user-space (common technique
> > in RT apps)? synchronize_rcu() would block indefinitely as it would
> > never see rcu_qs_ctr increase.
> >
> > We have to be able to observe user-mode anyway, for system-time
> > statistics purposes, and that flag could IMHO also drive the RCU GC
> > machinery.
> 
> The counter would have to be fancier than that to work.  We could 
> say that an even value means user mode, for example.  IOW the high 
> bits of the counter would count transitions to quiescent and the low 
> bits would indicate the state.
> 
> Actually, I'd count backwards.  We could use an andl instruction to 
> go to user mode and a decl to go to kernel mode.

at which point it's not really a monotonic quiescent state counter - 
which your naming suggested before.

Also it would have to be done both at entry and at exit.

But yes, if you replace a state flag with a recursive state flag that 
is INC/DEC maintained that would work as well. That's not a counter 
though.

> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> > > because stores can be delayed.  (It could even happen after 
> > > sysret, IIRC, but IRET is serializing.)
> >
> > All it has to do in the synchronize_rcu() slowpath is something 
> > like:
> 
> I don't think this works.  Suppose rt_cpu starts in kernel mode.  
> Then it checks ti flags.
>
> >
> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> >                 smp_mb__before_atomic();
> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> >                 smp_rmb();
> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
> 
> Now it sets rcu_state and exits to user mode.  It never notices
> TIF_RCU_NOTIFY.

Indeed, you are right, that's racy.

> [...]  We could fix this by sending an IPI to kick it.

With an IPI we won't need any flags or counters on the RT CPU - it's 
the IPI we are trying to avoid.

So how about the following way to fix the race: simply do a poll loop 
with a fixed timeout sleep - like synchronize_rcu() would do anyway if 
it was waiting for the timer irq to end the grace period on the RT-CPU.

The TIF flag would cause an RCU grace period to elapse; no need to wake 
up the synchronize_rcu() side, it will notice the (regular) increased 
RCU counter.
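
Roughly, reusing the made-up names from the earlier pseudo-code (sketch
only, TIF_RCU_QS still being a proposed flag):

static void wait_for_rt_cpu_qs(int cpu, struct task_struct *rt_task)
{
	unsigned long snap = per_cpu(rcu_qs_ctr, cpu);

	if (per_cpu(rcu_state, cpu) != IN_KERNEL)
		return;				/* already quiescent */

	set_tsk_thread_flag(rt_task, TIF_RCU_QS);

	/* Jiffies-granular polling, like waiting for a timer irq: */
	while (per_cpu(rcu_state, cpu) == IN_KERNEL &&
	       per_cpu(rcu_qs_ctr, cpu) == snap)
		schedule_timeout_uninterruptible(1);
}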

> > We are talking about a dozen cycles, while a typical 
> > synchronize_rcu() will wait millions (sometimes billions) of 
> > cycles. There's absolutely zero performance concern here and it's 
> > all CPU local in any case.
> 
> The barrier would affect all syscalls, though. [...]

What barrier? I never suggested any barrier in the syscall fast path, 
this bit:

> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> >                 smp_mb__before_atomic();
> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> >                 smp_rmb();
> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

runs inside synchronize_rcu() or so.

But none of that is needed if we do:

	- simple IN_KERNEL/IN_USER[/IN_IDLE] flag updated at context 
	  entry/exit time, straight in the assembly

	- TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does 
	  nothing but: this_cpu_inc(rcu_qs_ctr).

	- synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and 
	  then does a timeout-poll-loop with jiffies granular 
	  timeouts, simulating a timer IRQ in essence. It will either 
	  observe IN_USER or will see the regular RCU qs counter 
	  increase.

this should be race free.
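
As a sketch of the synchronize_rcu() side - only rcu_wait_for_rt_cpu()
and 'snap' are invented names, the rest reuses the names from above,
and memory ordering is hand-waved away:

	static void rcu_wait_for_rt_cpu(int rt_cpu, struct task_struct *remote_task)
	{
		unsigned long snap = per_cpu(rcu_qs_ctr, rt_cpu);

		if (per_cpu(rcu_state, rt_cpu) != IN_KERNEL)
			return;		/* user/idle: already quiescent */

		set_tsk_thread_flag(remote_task, TIF_RCU_QS);

		/* Poll with jiffies granular timeouts, like a timer irq would: */
		while (per_cpu(rcu_state, rt_cpu) == IN_KERNEL &&
		       per_cpu(rcu_qs_ctr, rt_cpu) == snap)
			schedule_timeout_uninterruptible(1);
	}

Even if the RT-CPU exits to user-space without ever noticing the TIF
flag, the poll loop sees IN_USER (or the counter moving) on its next
iteration, so nothing is lost beyond a jiffy of latency.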

Alternatively we can even get rid of the TIF flag by merging the 
percpu counter with the percpu state. Quiescent states are recorded 
via a counter shifted by two bits:

	rcu_qs_ctr += 4;

while user/kernel/idle mode is recorded in the lower 2 bits.

So on entering the kernel we'd do:

	rcu_qs_ctr += 4+1; /* Register QS and set bit 0 to IN_KERNEL */

on returning to user-space we'd do:

	rcu_qs_ctr += 4-1; /* We have bit 0 set already, clear it and register a QS */

on going idle we'd do:

	rcu_qs_ctr += 4+2; /* Register QS, set bit 1 */

on return from idle we'd do:

	rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */

etc. On all boundary transitions we can use a constant ADD on a 
suitable percpu variable.
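
In code this would be something like (helper names made up), for the
simple user <-> kernel case, ignoring the nesting question for now:

	/* Low 2 bits: 0 == user, 1 == kernel, 2 == idle. Upper bits: QS count. */
	DEFINE_PER_CPU(unsigned long, rcu_qs_ctr);

	static inline void rcu_qs_kernel_enter(void)	/* user -> kernel */
	{
		__this_cpu_add(rcu_qs_ctr, 4 + 1);
	}

	static inline void rcu_qs_user_enter(void)	/* kernel -> user */
	{
		__this_cpu_add(rcu_qs_ctr, 4 - 1);
	}

The remote side only ever reads the value: low bits other than
IN_KERNEL, or any movement of the upper bits, means a quiescent state
has passed.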

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-08 11:27                               ` Ingo Molnar
@ 2015-05-08 12:56                                 ` Andy Lutomirski
  2015-05-08 13:27                                   ` Ingo Molnar
  0 siblings, 1 reply; 83+ messages in thread
From: Andy Lutomirski @ 2015-05-08 12:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams

On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
>> >
>> >
>> > * Andy Lutomirski <luto@amacapital.net> wrote:
>> >
>> > > > So do you mean:
>> > > >
>> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >    ...
>> > > >    this_cpu_inc(rcu_qs_ctr);
>> > > >    this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > > > ?
>> > > >
>> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
>> > > > it just two simple stores into a byte flag, one on entry and one on
>> > > > exit:
>> > > >
>> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >    ...
>> > > >    this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > >
>> > > I was thinking that either a counter or a state flag could make sense.
>> > > Doing both would be pointless.  The counter could use the low bits to
>> > > indicate the state.  The benefit of the counter would be that the
>> > > RCU-waiting CPU could observe that the counter has incremented and
>> > > that therefore a grace period has elapsed.  Getting it right would
>> > > require lots of care.
>> >
>> > So if you mean:
>> >
>> >        <syscall entry>
>> >        ...
>> >        this_cpu_inc(rcu_qs_ctr);
>> >        <syscall exit>
>> >
>> > I don't see how this would work reliably: how do you handle the case
>> > of a SCHED_FIFO task never returning from user-space (common technique
>> > in RT apps)? synchronize_rcu() would block indefinitely as it would
>> > never see rcu_qs_ctr increase.
>> >
>> > We have to be able to observe user-mode anyway, for system-time
>> > statistics purposes, and that flag could IMHO also drive the RCU GC
>> > machinery.
>>
>> The counter would have to be fancier than that to work.  We could
>> say that an even value means user mode, for example.  IOW the high
>> bits of the counter would count transitions to quiescent and the low
>> bits would indicate the state.
>>
>> Actually, I'd count backwards.  We could use an andl instruction to
>> go to user mode and a decl to go to kernel mode.
>
> at which point it's not really a monotonic quiescent state counter -
> which your naming suggested before.
>
> Also it would have to be done both at entry and at exit.
>
> But yes, if you replace a state flag with a recursive state flag that
> is INC/DEC maintained that would work as well. That's not a counter
> though.

I won't quibble about the name :)

>
>> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
>> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
>> > > because stores can be delayed.  (It could even happen after
>> > > sysret, IIRC, but IRET is serializing.)
>> >
>> > All it has to do in the synchronize_rcu() slowpath is something
>> > like:
>>
>> I don't think this works.  Suppose rt_cpu starts in kernel mode.
>> Then it checks ti flags.
>>
>> >
>> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> >                 smp_mb__before_atomic();
>> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> >                 smp_rmb();
>> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>>
>> Now it sets rcu_state and exits to user mode.  It never notices
>> TIF_RCU_NOTIFY.
>
> Indeed, you are right, that's racy.
>
>> [...]  We could fix this by sending an IPI to kick it.
>
> With an IPI we won't need any flags or counters on the RT CPU - it's
> the IPI we are trying to avoid.
>
> So how about the following way to fix the race: simply do a poll loop
> with a fixed timeout sleep: like it would do anyway if
> synchronize_rcu() was waiting for the timer irq to end the grace
> period on the RT-CPU.

Seems reasonable to me.

>
> The TIF flag would cause an RCU grace period to lapse, no need to wake
> up the synchronize_rcu() side: it will notice the (regular) increased
> RCU counter.
>
>> > We are talking about a dozen cycles, while a typical
>> > synchronize_rcu() will wait millions (sometimes billions) of
>> > cycles. There's absolutely zero performance concern here and it's
>> > all CPU local in any case.
>>
>> The barrier would affect all syscalls, though. [...]
>
> What barrier? I never suggested any barrier in the syscall fast path,
> this bit:

Oh, I misunderstood.

>
>> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> >                 smp_mb__before_atomic();
>> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> >                 smp_rmb();
>> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>
> runs inside synchronize_rcu() or so.
>
> But none of that is needed if we do:
>
>         - simple IN_KERNEL/IN_USER[/IN_IDLE] flag updated at context
>           entry/exit time, straight in the assembly
>
>         - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does
>           nothing but: this_cpu_inc(rcu_qs_ctr).
>
>         - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and
>           then does a timeout-poll-loop with jiffies granular
>           timeouts, simulating a timer IRQ in essence. It will either
>           observe IN_USER or will see the regular RCU qs counter
>           increase.
>
> this should be race free.
>
> Alternatively we can even get rid of the TIF flag by merging the
> percpu counter with the percpu state. Quiescent states are recorded
> via a counter shifted by two bits:
>
>         rcu_qs_ctr += 4;
>
> while user/kernel/idle mode is recorded in the lower 2 bits.
>
> So on entering the kernel we'd do:
>
>         rcu_qs_ctr += 4+1; /* Register QS and set bit 0 to IN_KERNEL */
>
> on returning to user-space we'd do:
>
>         rcu_qs_ctr += 4-1; /* We have bit 0 set already, clear it and register a QS */
>
> on going idle we'd do:
>
>         rcu_qs_ctr += 4+2; /* Register QS, set bit 1 */
>
> on return from idle we'd do:
>
>         rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */
>
> etc. On all boundary transitions we can use a constant ADD on a
> suitable percpu variable.

Sounds good to me, except that we need to be careful to distinguish
between non-syscall entries from quiescent states and non-syscall
entries from quiescent states.  We could save the old state (as the
current exception_enter code does) or we could allocate enough low
bits for the state that the problem goes away.
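
The save-the-old-state variant would look roughly like this (made-up
helper names, just to show the shape - it gives up the single constant
ADD for non-syscall entries):

	static inline unsigned long rcu_qs_nested_enter(void)
	{
		unsigned long prev = __this_cpu_read(rcu_qs_ctr) & 3;

		if (prev != 1)	/* 1 == kernel; interrupted user/idle was a QS */
			__this_cpu_add(rcu_qs_ctr, 4 + 1 - prev);
		return prev;
	}

	static inline void rcu_qs_nested_exit(unsigned long prev)
	{
		if (prev != 1)	/* back out to user/idle: QS again */
			__this_cpu_add(rcu_qs_ctr, 4 + prev - 1);
	}

i.e. the caller carries prev around the same way exception_enter() /
exception_exit() carry prev_ctx today.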

I don't think the TIF_RCU_QS variant is worthwhile -- merging the
counter and state is probably both easier and faster.

--Andy

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
  2015-05-08 12:56                                 ` Andy Lutomirski
@ 2015-05-08 13:27                                   ` Ingo Molnar
  0 siblings, 0 replies; 83+ messages in thread
From: Ingo Molnar @ 2015-05-08 13:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: fweisbec, Paolo Bonzini, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Ingo Molnar, Heiko Carstens, Mike Galbraith, linux-kernel,
	Rik van Riel, williams


* Andy Lutomirski <luto@amacapital.net> wrote:

> > on return from idle we'd do:
> >
> >         rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */
> >
> > etc. On all boundary transitions we can use a constant ADD on a 
> > suitable percpu variable.
> 
> Sounds good to me, except that we need to be careful to distinguish 
> between non-syscall entries from quiescent states
>     and non-syscall entries from quiescent states.

It might be hard to make that distinction! ;-)

I suspect you wanted to raise the issue of various contexts nesting on 
each other - such as a syscall triggering a page fault, which then 
gets an irq nested on top, etc. - versus non-nested contexts such as 
user-space triggering a page fault or user-space getting an irq?

> [...]  We could save the old state (as the current exception_enter 
> code does) or we could allocate enough low bits for the state that 
> the problem goes away.

So I think, if it all works in practice just as well as it does in 
email, we might be better off with more state bits: that would tell 
any remote statistics/sampling code more as well.

It might also be easier to patch in/out, because this kind of state 
tracking will affect non-RT CPUs as well. (Later on we could do a 
separate IDT for RT CPUs as well, with a patched version of the entry 
code.)
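
For example behind a static key (name invented), so that non-RT setups
only pay a patched-out branch:

	static struct static_key rcu_qs_tracking = STATIC_KEY_INIT_FALSE;

	static inline void rcu_qs_user_enter(void)	/* kernel -> user */
	{
		if (static_key_false(&rcu_qs_tracking))
			__this_cpu_add(rcu_qs_ctr, 4 - 1);
	}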

> I don't think the TIF_RCU_QS variant is worthwhile -- merging the 
> counter and state is probably both easier and faster.

Yeah.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2015-05-08 13:27 UTC | newest]

Thread overview: 83+ messages
2015-04-30 21:23 [PATCH 0/3] reduce nohz_full syscall overhead by 10% riel
2015-04-30 21:23 ` [PATCH 1/3] reduce indentation in __acct_update_integrals riel
2015-04-30 21:23 ` [PATCH 2/3] remove local_irq_save from __acct_update_integrals riel
2015-04-30 21:23 ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry riel
2015-04-30 21:56   ` Andy Lutomirski
2015-05-01  6:40   ` Ingo Molnar
2015-05-01 15:20     ` Rik van Riel
2015-05-01 15:59       ` Ingo Molnar
2015-05-01 16:03         ` Andy Lutomirski
2015-05-01 16:21           ` Ingo Molnar
2015-05-01 16:26             ` Rik van Riel
2015-05-01 16:34               ` Ingo Molnar
2015-05-01 18:05                 ` Rik van Riel
2015-05-01 18:40                   ` Ingo Molnar
2015-05-01 19:11                     ` Rik van Riel
2015-05-01 19:37                       ` Andy Lutomirski
2015-05-02  5:27                         ` Ingo Molnar
2015-05-02 18:27                           ` Rik van Riel
2015-05-03 18:41                           ` Andy Lutomirski
2015-05-07 10:35                             ` Ingo Molnar
2015-05-04  9:26                           ` Paolo Bonzini
2015-05-04 13:30                             ` Rik van Riel
2015-05-04 14:06                             ` Rik van Riel
2015-05-04 14:19                             ` Rik van Riel
2015-05-04 15:59                             ` question about RCU dynticks_nesting Rik van Riel
2015-05-04 18:39                               ` Paul E. McKenney
2015-05-04 19:39                                 ` Rik van Riel
2015-05-04 20:02                                   ` Paul E. McKenney
2015-05-04 20:13                                     ` Rik van Riel
2015-05-04 20:38                                       ` Paul E. McKenney
2015-05-04 20:53                                         ` Rik van Riel
2015-05-05  5:54                                           ` Paul E. McKenney
2015-05-06  1:49                                             ` Mike Galbraith
2015-05-06  3:44                                               ` Mike Galbraith
2015-05-06  6:06                                                 ` Paul E. McKenney
2015-05-06  6:52                                                   ` Mike Galbraith
2015-05-06  7:01                                                     ` Mike Galbraith
2015-05-07  0:59                                           ` Frederic Weisbecker
2015-05-07 15:44                                             ` Rik van Riel
2015-05-04 19:00                               ` Rik van Riel
2015-05-04 19:39                                 ` Paul E. McKenney
2015-05-04 19:59                                   ` Rik van Riel
2015-05-04 20:40                                     ` Paul E. McKenney
2015-05-05 10:53                                   ` Peter Zijlstra
2015-05-05 12:34                                     ` Paul E. McKenney
2015-05-05 13:00                                       ` Peter Zijlstra
2015-05-05 18:35                                         ` Paul E. McKenney
2015-05-05 21:09                                           ` Rik van Riel
2015-05-06  5:41                                             ` Paul E. McKenney
2015-05-05 10:48                                 ` Peter Zijlstra
2015-05-05 10:51                                   ` Peter Zijlstra
2015-05-05 12:30                                     ` Paul E. McKenney
2015-05-02  4:06                   ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry Mike Galbraith
2015-05-01 16:37             ` Ingo Molnar
2015-05-01 16:40               ` Rik van Riel
2015-05-01 16:45                 ` Ingo Molnar
2015-05-01 16:54                   ` Rik van Riel
2015-05-01 17:12                     ` Ingo Molnar
2015-05-01 17:22                       ` Rik van Riel
2015-05-01 17:59                         ` Ingo Molnar
2015-05-01 16:22           ` Rik van Riel
2015-05-01 16:27             ` Ingo Molnar
2015-05-03 13:23       ` Mike Galbraith
2015-05-03 17:30         ` Rik van Riel
2015-05-03 18:24           ` Andy Lutomirski
2015-05-03 18:52             ` Rik van Riel
2015-05-07 10:48               ` Ingo Molnar
2015-05-07 12:18                 ` Frederic Weisbecker
2015-05-07 12:29                   ` Ingo Molnar
2015-05-07 15:47                     ` Rik van Riel
2015-05-08  7:58                       ` Ingo Molnar
2015-05-07 12:22                 ` Andy Lutomirski
2015-05-07 12:44                   ` Ingo Molnar
2015-05-07 12:49                     ` Ingo Molnar
2015-05-08  6:17                       ` Paul E. McKenney
2015-05-07 12:52                     ` Andy Lutomirski
2015-05-07 15:08                       ` Ingo Molnar
2015-05-07 17:47                         ` Andy Lutomirski
2015-05-08  6:37                           ` Ingo Molnar
2015-05-08 10:59                             ` Andy Lutomirski
2015-05-08 11:27                               ` Ingo Molnar
2015-05-08 12:56                                 ` Andy Lutomirski
2015-05-08 13:27                                   ` Ingo Molnar
