* [PATCH 0/2] locking: Simplify mutex and rwsem spinning code @ 2015-04-08 19:39 Jason Low 2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low ` (2 more replies) 0 siblings, 3 replies; 108+ messages in thread From: Jason Low @ 2015-04-08 19:39 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran Cc: LKML, Jason Low This patchset applies on top of tip. Jason Low (2): locking/mutex: Further refactor mutex_spin_on_owner() locking/rwsem: Use a return variable in rwsem_spin_on_owner() kernel/locking/mutex.c | 14 ++++---------- kernel/locking/rwsem-xadd.c | 25 ++++++++++++------------- 2 files changed, 16 insertions(+), 23 deletions(-) -- 1.7.2.5 ^ permalink raw reply [flat|nested] 108+ messages in thread
* [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() 2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low @ 2015-04-08 19:39 ` Jason Low 2015-04-09 9:00 ` [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() tip-bot for Jason Low 2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low 2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso 2 siblings, 1 reply; 108+ messages in thread From: Jason Low @ 2015-04-08 19:39 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran Cc: LKML, Jason Low Similar to what Linus suggested for rwsem_spin_on_owner(), in mutex_spin_on_owner() instead of having while (true) and breaking out of the spin loop on lock->owner != owner, we can have the loop directly check for while (lock->owner == owner) to improve the readability of the code. Signed-off-by: Jason Low <jason.low2@hp.com> --- kernel/locking/mutex.c | 14 ++++---------- 1 files changed, 4 insertions(+), 10 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 16b2d3c..4cccea6 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -224,20 +224,14 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock, static noinline bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) { - bool ret; + bool ret = true; rcu_read_lock(); - while (true) { - /* Return success when the lock owner changed */ - if (lock->owner != owner) { - ret = true; - break; - } - + while (lock->owner == owner) { /* * Ensure we emit the owner->on_cpu, dereference _after_ - * checking lock->owner still matches owner, if that fails, - * owner might point to free()d memory, if it still matches, + * checking lock->owner still matches owner. If that fails, + * owner might point to freed memory. 
If it still matches, * the rcu_read_lock() ensures the memory stays valid. */ barrier(); -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() 2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low @ 2015-04-09 9:00 ` tip-bot for Jason Low 0 siblings, 0 replies; 108+ messages in thread From: tip-bot for Jason Low @ 2015-04-09 9:00 UTC (permalink / raw) To: linux-tip-commits Cc: mingo, linux-kernel, tim.c.chen, tglx, akpm, aswin, peterz, torvalds, paulmck, dave, hpa, jason.low2 Commit-ID: 01ac33c1f907b366dcc50551316b372f1519cca9 Gitweb: http://git.kernel.org/tip/01ac33c1f907b366dcc50551316b372f1519cca9 Author: Jason Low <jason.low2@hp.com> AuthorDate: Wed, 8 Apr 2015 12:39:19 -0700 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Thu, 9 Apr 2015 08:10:23 +0200 locking/mutex: Further simplify mutex_spin_on_owner() Similar to what Linus suggested for rwsem_spin_on_owner(), in mutex_spin_on_owner() instead of having while (true) and breaking out of the spin loop on lock->owner != owner, we can have the loop directly check for while (lock->owner == owner) to improve the readability of the code. It also shrinks the code a bit: text data bss dec hex filename 3721 0 0 3721 e89 mutex.o.before 3705 0 0 3705 e79 mutex.o.after Signed-off-by: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1428521960-5268-2-git-send-email-jason.low2@hp.com [ Added code generation info. 
] Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/locking/mutex.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 16b2d3c..4cccea6 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -224,20 +224,14 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock, static noinline bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) { - bool ret; + bool ret = true; rcu_read_lock(); - while (true) { - /* Return success when the lock owner changed */ - if (lock->owner != owner) { - ret = true; - break; - } - + while (lock->owner == owner) { /* * Ensure we emit the owner->on_cpu, dereference _after_ - * checking lock->owner still matches owner, if that fails, - * owner might point to free()d memory, if it still matches, + * checking lock->owner still matches owner. If that fails, + * owner might point to freed memory. If it still matches, * the rcu_read_lock() ensures the memory stays valid. */ barrier(); ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low 2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low @ 2015-04-08 19:39 ` Jason Low 2015-04-09 5:37 ` Ingo Molnar 2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso 2 siblings, 1 reply; 108+ messages in thread From: Jason Low @ 2015-04-08 19:39 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran Cc: LKML, Jason Low Ingo suggested for mutex_spin_on_owner() that having multiple return statements is not the cleanest approach, especially when holding locks. The same thing applies to the rwsem variant. This patch rewrites much of this function to use a "ret" return value. Signed-off-by: Jason Low <jason.low2@hp.com> --- kernel/locking/rwsem-xadd.c | 25 ++++++++++++------------- 1 files changed, 12 insertions(+), 13 deletions(-) diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 3417d01..b1c9156 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -327,38 +327,37 @@ done: static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner) { - long count; + bool ret = true; rcu_read_lock(); while (sem->owner == owner) { /* * Ensure we emit the owner->on_cpu, dereference _after_ - * checking sem->owner still matches owner, if that fails, - * owner might point to free()d memory, if it still matches, + * checking sem->owner still matches owner. If that fails, + * owner might point to freed memory. If it still matches, * the rcu_read_lock() ensures the memory stays valid. */ barrier(); - /* abort spinning when need_resched or owner is not running */ + /* Abort spinning when need_resched or owner is not running. 
*/ if (!owner->on_cpu || need_resched()) { - rcu_read_unlock(); - return false; + ret = false; + break; } cpu_relax_lowlatency(); } rcu_read_unlock(); - if (READ_ONCE(sem->owner)) - return true; /* new owner, continue spinning */ - /* * When the owner is not set, the lock could be free or - * held by readers. Check the counter to verify the - * state. + * held by readers. Check the counter to verify the state. */ - count = READ_ONCE(sem->count); - return (count == 0 || count == RWSEM_WAITING_BIAS); + if (!READ_ONCE(sem->owner)) { + long count = READ_ONCE(sem->count); + ret = (count == 0 || count == RWSEM_WAITING_BIAS); + } + return ret; } static bool rwsem_optimistic_spin(struct rw_semaphore *sem) -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low @ 2015-04-09 5:37 ` Ingo Molnar 2015-04-09 6:40 ` Jason Low 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-09 5:37 UTC (permalink / raw) To: Jason Low Cc: Peter Zijlstra, Linus Torvalds, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Jason Low <jason.low2@hp.com> wrote: > Ingo suggested for mutex_spin_on_owner() that having multiple return > statements is not the cleanest approach, especially when holding locks. > > The same thing applies to the rwsem variant. This patch rewrites > much of this function to use a "ret" return value. > > Signed-off-by: Jason Low <jason.low2@hp.com> > --- > kernel/locking/rwsem-xadd.c | 25 ++++++++++++------------- > 1 files changed, 12 insertions(+), 13 deletions(-) > > diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c > index 3417d01..b1c9156 100644 > --- a/kernel/locking/rwsem-xadd.c > +++ b/kernel/locking/rwsem-xadd.c > @@ -327,38 +327,37 @@ done: > static noinline > bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner) > { > - long count; > + bool ret = true; > > rcu_read_lock(); > while (sem->owner == owner) { > /* > * Ensure we emit the owner->on_cpu, dereference _after_ > - * checking sem->owner still matches owner, if that fails, > - * owner might point to free()d memory, if it still matches, > + * checking sem->owner still matches owner. If that fails, > + * owner might point to freed memory. If it still matches, > * the rcu_read_lock() ensures the memory stays valid. > */ > barrier(); > > - /* abort spinning when need_resched or owner is not running */ > + /* Abort spinning when need_resched or owner is not running. 
*/ > if (!owner->on_cpu || need_resched()) { > - rcu_read_unlock(); > - return false; > + ret = false; > + break; > } > > cpu_relax_lowlatency(); > } > rcu_read_unlock(); > > - if (READ_ONCE(sem->owner)) > - return true; /* new owner, continue spinning */ > - > /* > * When the owner is not set, the lock could be free or > - * held by readers. Check the counter to verify the > - * state. > + * held by readers. Check the counter to verify the state. > */ > - count = READ_ONCE(sem->count); > - return (count == 0 || count == RWSEM_WAITING_BIAS); > + if (!READ_ONCE(sem->owner)) { > + long count = READ_ONCE(sem->count); > + ret = (count == 0 || count == RWSEM_WAITING_BIAS); > + } > + return ret; > } > > static bool rwsem_optimistic_spin(struct rw_semaphore *sem) The 'break' path does not seem to be equivalent, we used to do: > - rcu_read_unlock(); > - return false; and now we'll do: > + ret = false; ... > + if (!READ_ONCE(sem->owner)) { > + long count = READ_ONCE(sem->count); it's harmless (we do one more round of checking), but that's not an equivalent transformation and slows down the preemption trigger a (tiny) bit, because the chance that we actually catch the lock when breaking out early is vanishingly small. (It might in fact do the wrong thing in returning true if need_resched() is set and we've switched owners in that small window.) Given how dissimilar the return path is in this case, I'm not sure it's worth sharing it. This might be one of the few cases where separate return statements is the better solution. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 5:37 ` Ingo Molnar @ 2015-04-09 6:40 ` Jason Low 2015-04-09 7:53 ` Ingo Molnar 0 siblings, 1 reply; 108+ messages in thread From: Jason Low @ 2015-04-09 6:40 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Linus Torvalds, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, jason.low2 On Thu, 2015-04-09 at 07:37 +0200, Ingo Molnar wrote: > The 'break' path does not seem to be equivalent, we used to do: > > > - rcu_read_unlock(); > > - return false; > > and now we'll do: > > > + ret = false; > ... > > + if (!READ_ONCE(sem->owner)) { > > + long count = READ_ONCE(sem->count); > > it's harmless (we do one more round of checking), but that's not an > equivalent transformation and slows down the preemption trigger a > (tiny) bit, because the chance that we actually catch the lock when > breaking out early is vanishingly small. (It might in fact do the > wrong thing in returning true if need_resched() is set and we've > switched owners in that small window.) > > Given how dissimilar the return path is in this case, I'm not sure > it's worth sharing it. This might be one of the few cases where > separate return statements is the better solution. I also preferred the multiple returns for the rwsem variant to avoid needing to check sem->owner when it should go to slowpath, as you mentioned. Now that I think of it though, if we want to have just one return path, we can still do that if we add an "out" label. 
--- kernel/locking/rwsem-xadd.c | 25 +++++++++++++------------ 1 files changed, 13 insertions(+), 12 deletions(-) diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 3417d01..e74240f 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -327,38 +327,39 @@ done: static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner) { - long count; + bool ret = true; rcu_read_lock(); while (sem->owner == owner) { /* * Ensure we emit the owner->on_cpu, dereference _after_ - * checking sem->owner still matches owner, if that fails, - * owner might point to free()d memory, if it still matches, + * checking sem->owner still matches owner. If that fails, + * owner might point to freed memory. If it still matches, * the rcu_read_lock() ensures the memory stays valid. */ barrier(); - /* abort spinning when need_resched or owner is not running */ + /* Abort spinning when need_resched or owner is not running. */ if (!owner->on_cpu || need_resched()) { rcu_read_unlock(); - return false; + ret = false; + goto out; } cpu_relax_lowlatency(); } rcu_read_unlock(); - if (READ_ONCE(sem->owner)) - return true; /* new owner, continue spinning */ - /* * When the owner is not set, the lock could be free or - * held by readers. Check the counter to verify the - * state. + * held by readers. Check the counter to verify the state. */ - count = READ_ONCE(sem->count); - return (count == 0 || count == RWSEM_WAITING_BIAS); + if (!READ_ONCE(sem->owner)) { + long count = READ_ONCE(sem->count); + ret = (count == 0 || count == RWSEM_WAITING_BIAS); + } +out: + return ret; } static bool rwsem_optimistic_spin(struct rw_semaphore *sem) -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 6:40 ` Jason Low @ 2015-04-09 7:53 ` Ingo Molnar 2015-04-09 16:47 ` Linus Torvalds 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-09 7:53 UTC (permalink / raw) To: Jason Low Cc: Peter Zijlstra, Linus Torvalds, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Jason Low <jason.low2@hp.com> wrote: > On Thu, 2015-04-09 at 07:37 +0200, Ingo Molnar wrote: > > > The 'break' path does not seem to be equivalent, we used to do: > > > > > - rcu_read_unlock(); > > > - return false; > > > > and now we'll do: > > > > > + ret = false; > > ... > > > + if (!READ_ONCE(sem->owner)) { > > > + long count = READ_ONCE(sem->count); > > > > it's harmless (we do one more round of checking), but that's not an > > equivalent transformation and slows down the preemption trigger a > > (tiny) bit, because the chance that we actually catch the lock when > > breaking out early is vanishingly small. (It might in fact do the > > wrong thing in returning true if need_resched() is set and we've > > switched owners in that small window.) > > > > Given how dissimilar the return path is in this case, I'm not sure > > it's worth sharing it. This might be one of the few cases where > > separate return statements is the better solution. > > I also preferred the multiple returns for the rwsem variant to avoid > needing to check sem->owner when it should go to slowpath, as you > mentioned. > > Now that I think of it though, if we want to have just one return path, > we can still do that if we add an "out" label. That's the usual pattern we use, but: > - /* abort spinning when need_resched or owner is not running */ > + /* Abort spinning when need_resched or owner is not running. */ > if (!owner->on_cpu || need_resched()) { > rcu_read_unlock(); > - return false; > + ret = false; > + goto out; > } The point is to generally unify the 'out' paths - i.e. 
to merge it with the rcu_read_unlock() as well, so that we have really simple gotos and only a single exit path. That's not really doable here without extra overhead AFAICS, so I'd suggest we leave it alone ... I have applied your other patch. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 7:53 ` Ingo Molnar @ 2015-04-09 16:47 ` Linus Torvalds 2015-04-09 17:56 ` Paul E. McKenney 0 siblings, 1 reply; 108+ messages in thread From: Linus Torvalds @ 2015-04-09 16:47 UTC (permalink / raw) To: Ingo Molnar, Paul McKenney Cc: Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 9, 2015 at 12:53 AM, Ingo Molnar <mingo@kernel.org> wrote: > > The point is to generally unify the 'out' paths - i.e. to merge it > with the rcu_read_unlock() as well, so that we have really simple > gotos and only a single exit path. Maybe just have the rcu read-locking be done in the *caller* (possibly through using just a helper wrapper function that does nothing but the locking), so that you can just do a simple "return false" in the function itself. That said, it worries me a bit that we do that spinning while holding the RCU read lock in the first place. Yes, we stop spinning if "need_resched()" is set, but what effect - if any - does all of this have on RCU latency? If somebody is waiting for a RCU grace period, I'm not seeing that setting need-resched... At least with CONFIG_PREEMPT_RCU, the read-unlock is *not* just doing a preempt-disable, so it's not necessarily just about need_resched(). It does all the magic with 'rcu_read_unlock_special.s' too.. Adding Paul. From a RCU locking standpoint, the thing is basically (not the real code, edited down): rcu_read_lock(); while (sem->owner == owner) { if (!owner->on_cpu || need_resched()) break; cpu_relax_lowlatency(); } rcu_read_unlock(); so we busy-loop while holding the RCU read lock while sem->owner == owner && owner->on_cpu && !need_resched() is true. That is usually not very long, but we've already had watchdogs go off when we get this wrong, so.. Paul, comments? Are there particular latency concerns wrt CONFIG_PREEMPT_RCU here? Or am I just being silly? 
Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 16:47 ` Linus Torvalds @ 2015-04-09 17:56 ` Paul E. McKenney 2015-04-09 18:08 ` Linus Torvalds 0 siblings, 1 reply; 108+ messages in thread From: Paul E. McKenney @ 2015-04-09 17:56 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 09, 2015 at 09:47:36AM -0700, Linus Torvalds wrote: > On Thu, Apr 9, 2015 at 12:53 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > The point is to generally unify the 'out' paths - i.e. to merge it > > with the rcu_read_unlock() as well, so that we have really simple > > gotos and only a single exit path. > > Maybe just have the rcu read-locking be done in the *caller* (possibly > through using just a helper wrapper function that does nothing but the > locking), so that you can just do a simple "return false" in the > function itself. > > That said, it worries me a bit that we do that spinning while holding > the RCU read lock in the first place. Yes, we stop spinning if > "need_resched()" is set, but what effect - if any - does all of this > have on RCU latency? If somebody is waiting for a RCU grace period, > I'm not seeing that setting need-resched... > > At least with CONFIG_PREEMPT_RCU, the read-unlock is *not* just doing > a preempt-disable, so it's not necessarily just about need_resched(). > It does all the magic with 'rcu_read_unlock_special.s' too.. > > Adding Paul. From a RCU locking standpoint, the thing is basically > (not the real code, edited down): > > rcu_read_lock(); > while (sem->owner == owner) { > if (!owner->on_cpu || need_resched()) > break; > cpu_relax_lowlatency(); > } > rcu_read_unlock(); > > so we busy-loop while holding the RCU read lock while > > sem->owner == owner && owner->on_cpu && !need_resched() > > is true. That is usually not very long, but we've already had > watchdogs go off when we get this wrong, so.. 
> > Paul, comments? Are there particular latency concerns wrt > CONFIG_PREEMPT_RCU here? Or am I just being silly? If this was a pure spinlock, then the effects of spinning would overwhelm any problems from extended grace periods. But this is a sleeplock. Of course, we stay in the loop only as long as the lock holder is actually running. But given that this is a sleeplock, I am worried that some lock holders might run for long time periods. After all, that is one of the traditional uses for a sleeplock. :-/ If the RCU read-side critical section lasts a few hundred milliseconds, no problem. If it lasts for more than 500 milliseconds, I would start getting concerned. And if such long-term spins are likely, I cannot resist asking if this should be instead using SRCU. If you have your own srcu_struct, you get to delay your own SRCU grace periods as long as you want. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 17:56 ` Paul E. McKenney @ 2015-04-09 18:08 ` Linus Torvalds 2015-04-09 18:16 ` Linus Torvalds 0 siblings, 1 reply; 108+ messages in thread From: Linus Torvalds @ 2015-04-09 18:08 UTC (permalink / raw) To: Paul McKenney Cc: Ingo Molnar, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 9, 2015 at 10:56 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > And if such long-term spins are likely, I cannot resist asking if this > should be instead using SRCU. If you have your own srcu_struct, you > get to delay your own SRCU grace periods as long as you want. ;-) No, this is plain RCU, and it is only needed because the 'struct task_struct' is RCU-allocated, and we do an optimistic access of that 'owner->on_cpu' without actually holding any locks. And even *that* wouldn't be needed if it wasn't for DEBUG_PAGEALLOC. We could just access stale memory. I wonder if we should get rid of the whole RCU thing (which does add overhead to a potentially critical piece of code), and replace it with a new "optimistic_kernel_read()" function that basically just does a memory read with an exception table entry (ie like __get_user(), but without any of the user access overhead - no clac etc), so that if we fault due to DEBUG_PAGEALLOC it just ignores the fault. Hmm? I think there might be a few other places that currently get RCU read locks just because they want to do an optimistic read of something that might be going away from under them. The pointer is a known-safe kernel pointer - it's just that it was "known safe" a few instructions ago, and might be rcu-free'd at any time. Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 18:08 ` Linus Torvalds @ 2015-04-09 18:16 ` Linus Torvalds 2015-04-09 18:39 ` Paul E. McKenney ` (2 more replies) 0 siblings, 3 replies; 108+ messages in thread From: Linus Torvalds @ 2015-04-09 18:16 UTC (permalink / raw) To: Paul McKenney Cc: Ingo Molnar, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The pointer is a known-safe kernel pointer - it's just that it was > "known safe" a few instructions ago, and might be rcu-free'd at any > time. Actually, we could even do something like this: static inline int sem_owner_on_cpu(struct semaphore *sem, struct task_struct *owner) { int on_cpu; #ifdef CONFIG_DEBUG_PAGEALLOC rcu_read_lock(); #endif on_cpu = sem->owner == owner && owner->on_cpu; #ifdef CONFIG_DEBUG_PAGEALLOC rcu_read_unlock(); #endif return on_cpu; } because we really don't need to hold the RCU lock over the whole loop, we just need to validate that the semaphore owner still matches, and if so, check that it's on_cpu. And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance *at*all*. We will have worse performance problems than doing some RCU read-locking inside the loop. And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about locking, since at worst we just access stale memory for one iteration. Hmm. It's not pretty, but neither is the current "let's just take a rcu lock that we don't really need over a loop that doesn't have very strict bounding". Comments? Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 18:16 ` Linus Torvalds @ 2015-04-09 18:39 ` Paul E. McKenney 2015-04-10 9:00 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar 2015-04-09 19:43 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low 2015-04-09 20:36 ` Jason Low 2 siblings, 1 reply; 108+ messages in thread From: Paul E. McKenney @ 2015-04-09 18:39 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 09, 2015 at 11:16:24AM -0700, Linus Torvalds wrote: > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The pointer is a known-safe kernel pointer - it's just that it was > > "known safe" a few instructions ago, and might be rcu-free'd at any > > time. > > Actually, we could even do something like this: > > static inline int sem_owner_on_cpu(struct semaphore *sem, struct > task_struct *owner) > { > int on_cpu; > > #ifdef CONFIG_DEBUG_PAGEALLOC > rcu_read_lock(); > #endif > on_cpu = sem->owner == owner && owner->on_cpu; > #ifdef CONFIG_DEBUG_PAGEALLOC > rcu_read_unlock(); > #endif > return on_cpu; > } > > because we really don't need to hold the RCU lock over the whole loop, > we just need to validate that the semaphore owner still matches, and > if so, check that it's on_cpu. Much better from an RCU grace-period-latency perspective. > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance > *at*all*. We will have worse performance problems than doing some RCU > read-locking inside the loop. > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about > locking, since at worst we just access stale memory for one iteration. 
But if we are running on a hypervisor, mightn't our VCPU be preempted just before accessing ->on_cpu, the task exit and its structures be freed and unmapped? Or is the task structure in memory that is never unmapped? (If the latter, clearly not a problem.) Thanx, Paul > Hmm. It's not pretty, but neither is the current "let's just take a > rcu lock that we don't really need over a loop that doesn't have very > strict bounding". > > Comments? > > Linus > ^ permalink raw reply [flat|nested] 108+ messages in thread
* [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock 2015-04-09 18:39 ` Paul E. McKenney @ 2015-04-10 9:00 ` Ingo Molnar 2015-04-10 9:12 ` Ingo Molnar 2015-04-10 14:20 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney 0 siblings, 2 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 9:00 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about > > performance *at*all*. We will have worse performance problems than > > doing some RCU read-locking inside the loop. > > > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care > > about locking, since at worst we just access stale memory for one > > iteration. > > But if we are running on a hypervisor, mightn't our VCPU be > preempted just before accessing ->on_cpu, the task exit and its > structures be freed and unmapped? Or is the task structure in > memory that is never unmapped? (If the latter, clearly not a > problem.) kmalloc()able kernel memory is never unmapped in that fashion [*]. Even hotplug memory is based on limiting what gets allocated in that area and never putting critical kernel data structures there. Personally I'd be more comfortable with having a special primitive for this that is DEBUG_PAGEALLOC aware (Linus's first suggestion), so that we don't use different RCU primitives in the rare case someone tests CONFIG_DEBUG_PAGEALLOC=y ... We even have such a primitive: __copy_from_user_inatomic(). It compiles to a single instruction for integer types on x86. 
I've attached a patch that implements it for the regular mutexes (xadd can be done too), and it all compiles to a rather sweet, compact routine: 0000000000000030 <mutex_spin_on_owner.isra.4>: 30: 48 3b 37 cmp (%rdi),%rsi 33: 48 8d 4e 28 lea 0x28(%rsi),%rcx 37: 75 4e jne 87 <mutex_spin_on_owner.isra.4+0x57> 39: 55 push %rbp 3a: 45 31 c0 xor %r8d,%r8d 3d: 65 4c 8b 0c 25 00 00 mov %gs:0x0,%r9 44: 00 00 46: 48 89 e5 mov %rsp,%rbp 49: 48 83 ec 10 sub $0x10,%rsp 4d: eb 08 jmp 57 <mutex_spin_on_owner.isra.4+0x27> 4f: 90 nop 50: f3 90 pause 52: 48 3b 37 cmp (%rdi),%rsi 55: 75 29 jne 80 <mutex_spin_on_owner.isra.4+0x50> 57: 44 89 c0 mov %r8d,%eax 5a: 90 nop 5b: 90 nop 5c: 90 nop 5d: 8b 11 mov (%rcx),%edx 5f: 90 nop 60: 90 nop 61: 90 nop 62: 85 d2 test %edx,%edx 64: 89 55 fc mov %edx,-0x4(%rbp) 67: 74 0b je 74 <mutex_spin_on_owner.isra.4+0x44> 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax 70: a8 08 test $0x8,%al 72: 74 dc je 50 <mutex_spin_on_owner.isra.4+0x20> 74: 31 c0 xor %eax,%eax 76: c9 leaveq 77: c3 retq 78: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 7f: 00 80: b8 01 00 00 00 mov $0x1,%eax 85: c9 leaveq 86: c3 retq 87: b8 01 00 00 00 mov $0x1,%eax 8c: c3 retq 8d: 0f 1f 00 nopl (%rax) No RCU overhead, and this is the access to owner->on_cpu: 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax Totally untested and all that, I only built the mutex.o. What do you think? Am I missing anything? Thanks, Ingo [*] with the exception of CONFIG_DEBUG_PAGEALLOC and other debug mechanisms like CONFIG_KMEMCHECK (which is on the way out) that are based on provoking page faults and fixing up page tables to catch unexpected memory accesses. =================================> >From ef3e5e763747d47a43a32f846ee94706089222cf Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Fri, 10 Apr 2015 10:49:11 +0200 Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Linus suggested to get rid of the held RCU read-lock in mutex_spin_on_owner(). 
The only real complication is that the 'owner' task might be freed from under us and we might dereference into possibly freed kernel memory. As long as the kernel pointer itself is valid this approach is fine (see the analysis below) - with the exception of CONFIG_DEBUG_PAGEALLOC=y and similarly instrumented kernels, which might fault on referencing freed kernel memory. Use the non-faulting copy-from-user primitive (the same one we use in NMI handlers) to get the owner->on_cpu value; it works even on CONFIG_DEBUG_PAGEALLOC=y instrumented kernels and compiles to a single instruction on most platforms. This approach might briefly copy in junk data from an already freed (previous) owner task, which might trigger two scenarios: 1) The junk data causes us to loop once more. This is not a problem as we'll check the owner on the next loop and break out of the loop. 2) If the junk value causes us to break out of the loop that's fine too: it's what we'd have done anyway on the next iteration, as the lock owner changed. The inatomic context copy primitives are compiler barriers too - this matters to make sure the above owner check is emitted before the copy attempt. We also ignore failed copies, as the next iteration will clean up after us. This saves an extra branch in the common case. 
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/locking/mutex.c | 40 +++++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 4cccea6b8934..fcc7db45d62e 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -224,28 +224,42 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock, static noinline bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) { - bool ret = true; + int on_cpu; + int ret; - rcu_read_lock(); while (lock->owner == owner) { /* - * Ensure we emit the owner->on_cpu, dereference _after_ - * checking lock->owner still matches owner. If that fails, - * owner might point to freed memory. If it still matches, - * the rcu_read_lock() ensures the memory stays valid. + * Use the non-faulting copy-user primitive to get the owner->on_cpu + * value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented + * kernels. This compiles to a single instruction on most platforms. + * + * This might briefly copy in junk data from an already freed + * (previous) owner task, which might trigger two scenarios: + * + * 1) The junk data causes us to loop once more. This is not + * a problem as we'll check the owner on the next loop and + * break out of the loop. + * + * 2) If the junk value causes us to break out of the loop + * that's fine too: it's what we'd have done anyway on + * the next iteration, as the lock owner changed. + * + * NOTE: the inatomic context copy primitives are compiler barriers + * too - this matters to make sure the above owner check is + * emitted before the copy attempt. + * + * NOTE2: We ignore failed copies, as the next iteration will clean + * up after us. This saves an extra branch in the common case. 
*/ - barrier(); + ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu)); - if (!owner->on_cpu || need_resched()) { - ret = false; - break; - } + if (!on_cpu || need_resched()) + return false; cpu_relax_lowlatency(); } - rcu_read_unlock(); - return ret; + return true; } /* ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock 2015-04-10 9:00 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar @ 2015-04-10 9:12 ` Ingo Molnar 2015-04-10 9:21 ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar 2015-04-10 14:20 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney 1 sibling, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 9:12 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Ingo Molnar <mingo@kernel.org> wrote: > 0000000000000030 <mutex_spin_on_owner.isra.4>: > 30: 48 3b 37 cmp (%rdi),%rsi > 33: 48 8d 4e 28 lea 0x28(%rsi),%rcx > 37: 75 4e jne 87 <mutex_spin_on_owner.isra.4+0x57> > 39: 55 push %rbp > 3a: 45 31 c0 xor %r8d,%r8d > 3d: 65 4c 8b 0c 25 00 00 mov %gs:0x0,%r9 > 44: 00 00 > 46: 48 89 e5 mov %rsp,%rbp > 49: 48 83 ec 10 sub $0x10,%rsp > 4d: eb 08 jmp 57 <mutex_spin_on_owner.isra.4+0x27> > 4f: 90 nop > 50: f3 90 pause > 52: 48 3b 37 cmp (%rdi),%rsi > 55: 75 29 jne 80 <mutex_spin_on_owner.isra.4+0x50> > 57: 44 89 c0 mov %r8d,%eax > 5a: 90 nop > 5b: 90 nop > 5c: 90 nop > 5d: 8b 11 mov (%rcx),%edx > 5f: 90 nop > 60: 90 nop > 61: 90 nop Yeah, so what I missed here are those nops: placeholders for the STAC/CLAC instructions on x86... and this is what Linus mentioned about the clac() overhead. But this could be solved I think: by adding a copy_from_kernel_inatomic() primitive which simply leaves out the STAC/CLAC sequence: as these are always guaranteed to be kernel addresses, the SMAP fault should not be generated. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive 2015-04-10 9:12 ` Ingo Molnar @ 2015-04-10 9:21 ` Ingo Molnar 2015-04-10 11:14 ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 9:21 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Ingo Molnar <mingo@kernel.org> wrote: > Yeah, so what I missed here are those nops: placeholders for the > STAC/CLAC instructions on x86... and this is what Linus mentioned > about the clac() overhead. > > But this could be solved I think: by adding a > copy_from_kernel_inatomic() primitive which simply leaves out the > STAC/CLAC sequence: as these are always guaranteed to be kernel > addresses, the SMAP fault should not be generated. So the first step would be to introduce a generic __copy_from_kernel_inatomic() primitive as attached below. The next patch will implement efficient __copy_from_kernel_inatomic() for x86. Thanks, Ingo ==================================> >From 89b2ac882933947513c0aabd38e6b6c5a203c337 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Fri, 10 Apr 2015 11:19:23 +0200 Subject: [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Most architectures can just reuse __copy_from_user_inatomic() to copy possibly-faulting data from known-valid kernel addresses. 
Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- include/linux/uaccess.h | 12 ++++++++++++ kernel/locking/mutex.c | 3 ++- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index ecd3319dac33..885eea43b69f 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -107,4 +107,16 @@ extern long __probe_kernel_read(void *dst, const void *src, size_t size); extern long notrace probe_kernel_write(void *dst, const void *src, size_t size); extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size); +/* + * Generic wrapper, most architectures can just use __copy_from_user_inatomic() + * to implement __copy_from_kernel_inatomic(): + */ +#ifndef ARCH_HAS_COPY_FROM_KERNEL_INATOMIC +static __must_check __always_inline int +__copy_from_kernel_inatomic(void *dst, const void __user *src, unsigned size) +{ + return __copy_from_user_inatomic(dst, src, size); +} +#endif + #endif /* __LINUX_UACCESS_H__ */ diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index fcc7db45d62e..a4f74cda9fc4 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -26,6 +26,7 @@ #include <linux/interrupt.h> #include <linux/debug_locks.h> #include <linux/osq_lock.h> +#include <linux/uaccess.h> /* * In the DEBUG case we are using the "NULL fastpath" for mutexes, @@ -251,7 +252,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) * NOTE2: We ignore failed copies, as the next iteration will clean * up after us. This saves an extra branch in the common case. */ - ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu)); + ret = __copy_from_kernel_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu)); if (!on_cpu || need_resched()) return false; ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [PATCH] x86/uaccess: Implement get_kernel() 2015-04-10 9:21 ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar @ 2015-04-10 11:14 ` Ingo Molnar 2015-04-10 11:27 ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 11:14 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Ingo Molnar <mingo@kernel.org> wrote: > * Ingo Molnar <mingo@kernel.org> wrote: > > > Yeah, so what I missed here are those nops: placeholders for the > > STAC/CLAC instructions on x86... and this is what Linus mentioned > > about the clac() overhead. > > > > But this could be solved I think: by adding a > > copy_from_kernel_inatomic() primitive which simply leaves out the > > STAC/CLAC sequence: as these are always guaranteed to be kernel > > addresses, the SMAP fault should not be generated. > > So the first step would be to introduce a generic > __copy_from_kernel_inatomic() primitive as attached below. > > The next patch will implement efficient __copy_from_kernel_inatomic() > for x86. The patch below does that. Note, for simplicity I've changed the interface to 'get_kernel()' (will propagate this through the other patches as well). Minimally boot tested. 
owner-spinning looks sane now: 0000000000000030 <mutex_spin_on_owner.isra.4>: 30: 48 3b 37 cmp (%rdi),%rsi 33: 55 push %rbp 34: 48 89 e5 mov %rsp,%rbp 37: 75 4f jne 88 <mutex_spin_on_owner.isra.4+0x58> 39: 48 8d 56 28 lea 0x28(%rsi),%rdx 3d: 8b 46 28 mov 0x28(%rsi),%eax 40: 85 c0 test %eax,%eax 42: 74 3c je 80 <mutex_spin_on_owner.isra.4+0x50> 44: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax 4b: 00 00 4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax 54: a8 08 test $0x8,%al 56: 75 28 jne 80 <mutex_spin_on_owner.isra.4+0x50> 58: 65 48 8b 0c 25 00 00 mov %gs:0x0,%rcx 5f: 00 00 61: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 68: f3 90 pause 6a: 48 3b 37 cmp (%rdi),%rsi 6d: 75 19 jne 88 <mutex_spin_on_owner.isra.4+0x58> 6f: 8b 02 mov (%rdx),%eax 71: 85 c0 test %eax,%eax 73: 74 0b je 80 <mutex_spin_on_owner.isra.4+0x50> 75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax 7c: a8 08 test $0x8,%al 7e: 74 e8 je 68 <mutex_spin_on_owner.isra.4+0x38> 80: 31 c0 xor %eax,%eax 82: 5d pop %rbp 83: c3 retq 84: 0f 1f 40 00 nopl 0x0(%rax) 88: b8 01 00 00 00 mov $0x1,%eax 8d: 5d pop %rbp 8e: c3 retq 8f: 90 nop although the double unrolled need_resched() check looks silly: 4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax 54: a8 08 test $0x8,%al 75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax 7c: a8 08 test $0x8,%al Thanks, Ingo =============================> >From 7f5917c7f9ee3598b353adbd3d0849ecf4485748 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Fri, 10 Apr 2015 13:01:39 +0200 Subject: [PATCH] x86/uaccess: Implement get_kernel() Implement an optimized get_kernel() primitive on x86: avoid the STAC/CLAC overhead. Only offer a 64-bit variant for now, 32-bit will fall back to get_user(). 
Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/include/asm/uaccess_64.h | 32 ++++++++++++++++++++++++++++++++ include/linux/uaccess.h | 8 ++------ kernel/locking/mutex.c | 3 +-- 3 files changed, 35 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index f2f9b39b274a..3ef75f0addac 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -207,6 +207,38 @@ __copy_from_user_inatomic(void *dst, const void __user *src, unsigned size) return __copy_from_user_nocheck(dst, src, size); } + +#define __get_kernel_asm_ex(dst, src, int_type, reg_type, lh_type) \ + asm volatile("1: mov"int_type" %1,%"reg_type"0\n" \ + "2:\n" \ + _ASM_EXTABLE_EX(1b, 2b) \ + : lh_type(dst) : "m" (__m(src))) + +extern void __get_kernel_BUILD_ERROR(void); + +/* + * Simple copy-from-possibly-faulting-kernel-addresses method that + * avoids the STAC/CLAC SMAP overhead. + * + * NOTE: this does not propagate the error code of faulting kernel + * addresses properly. You can recover it via uaccess_catch() + * if you really need to. 
+ */ +#define get_kernel(dst, src) \ +do { \ + typeof(*(src)) __val; \ + \ + switch (sizeof(__val)) { \ + case 1: __get_kernel_asm_ex(__val, src, "b", "b", "=q"); break; \ + case 2: __get_kernel_asm_ex(__val, src, "w", "w", "=r"); break; \ + case 4: __get_kernel_asm_ex(__val, src, "l", "k", "=r"); break; \ + case 8: __get_kernel_asm_ex(__val, src, "q", " ", "=r"); break; \ + default: __get_kernel_BUILD_ERROR(); \ + } \ + (dst) = __val; \ +} while (0) + +#define ARCH_HAS_GET_KERNEL + static __must_check __always_inline int __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size) { diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 885eea43b69f..03a025b98d2e 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -111,12 +111,8 @@ extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size * Generic wrapper, most architectures can just use __copy_from_user_inatomic() * to implement __copy_from_kernel_inatomic(): */ -#ifndef ARCH_HAS_COPY_FROM_KERNEL_INATOMIC -static __must_check __always_inline int -__copy_from_kernel_inatomic(void *dst, const void __user *src, unsigned size) -{ - return __copy_from_user_inatomic(dst, src, size); -} +#ifndef ARCH_HAS_GET_KERNEL +# define get_kernel(dst, src) get_user(dst, src) #endif #endif /* __LINUX_UACCESS_H__ */ diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index a4f74cda9fc4..68c750f4e8e8 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -226,7 +226,6 @@ static noinline bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) { int on_cpu; - int ret; while (lock->owner == owner) { /* @@ -252,7 +251,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) * NOTE2: We ignore failed copies, as the next iteration will clean * up after us. This saves an extra branch in the common case. 
*/ - ret = __copy_from_kernel_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu)); + get_kernel(on_cpu, &owner->on_cpu); if (!on_cpu || need_resched()) return false; ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [PATCH] mutex: Improve mutex_spin_on_owner() code generation 2015-04-10 11:14 ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar @ 2015-04-10 11:27 ` Ingo Molnar 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar 2015-04-10 11:34 ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra 2015-04-10 17:49 ` Linus Torvalds 2 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 11:27 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Ingo Molnar <mingo@kernel.org> wrote: > although the double unrolled need_resched() check looks silly: > > 4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax > 54: a8 08 test $0x8,%al > > 75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax > 7c: a8 08 test $0x8,%al The patch below fixes that and shaves 40 bytes off mutex_spin_on_owner()'s code size. Thanks, Ingo ===================================> >From 065e46b7398e38f2e4be98c453e797ee511170e2 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Fri, 10 Apr 2015 13:21:24 +0200 Subject: [PATCH] mutex: Improve mutex_spin_on_owner() code generation GCC somewhat stupidly decides that the loop within mutex_spin_on_owner() needs unrolling: 0000000000000030 <mutex_spin_on_owner.isra.4>: 30: 48 3b 37 cmp (%rdi),%rsi 33: 55 push %rbp 34: 48 89 e5 mov %rsp,%rbp 37: 75 4f jne 88 <mutex_spin_on_owner.isra.4+0x58> 39: 48 8d 56 28 lea 0x28(%rsi),%rdx 3d: 8b 46 28 mov 0x28(%rsi),%eax 40: 85 c0 test %eax,%eax 42: 74 3c je 80 <mutex_spin_on_owner.isra.4+0x50> 44: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax 4b: 00 00 4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax 54: a8 08 test $0x8,%al 56: 75 28 jne 80 <mutex_spin_on_owner.isra.4+0x50> 58: 65 48 8b 0c 25 00 00 mov %gs:0x0,%rcx 5f: 00 00 61: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 68: f3 90 pause 6a: 48 3b 37 cmp (%rdi),%rsi 6d: 75 19 jne 88 <mutex_spin_on_owner.isra.4+0x58> 6f: 8b 02 mov 
(%rdx),%eax 71: 85 c0 test %eax,%eax 73: 74 0b je 80 <mutex_spin_on_owner.isra.4+0x50> 75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax 7c: a8 08 test $0x8,%al 7e: 74 e8 je 68 <mutex_spin_on_owner.isra.4+0x38> 80: 31 c0 xor %eax,%eax 82: 5d pop %rbp 83: c3 retq 84: 0f 1f 40 00 nopl 0x0(%rax) 88: b8 01 00 00 00 mov $0x1,%eax 8d: 5d pop %rbp 8e: c3 retq The need_resched() check is duplicated: 4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax 54: a8 08 test $0x8,%al 56: 75 28 jne 80 <mutex_spin_on_owner.isra.4+0x50> 75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax 7c: a8 08 test $0x8,%al 7e: 74 e8 je 68 <mutex_spin_on_owner.isra.4+0x38> So restructure the loop a bit, to get much tighter code: 0000000000000030 <mutex_spin_on_owner.isra.5>: 30: 55 push %rbp 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx 38: 00 00 3a: 48 89 e5 mov %rsp,%rbp 3d: 48 39 37 cmp %rsi,(%rdi) 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30> 42: 8b 46 28 mov 0x28(%rsi),%eax 45: 85 c0 test %eax,%eax 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26> 49: f3 90 pause 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax 52: a8 08 test $0x8,%al 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd> 56: 31 c0 xor %eax,%eax 58: 5d pop %rbp 59: c3 retq 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 60: b8 01 00 00 00 mov $0x1,%eax 65: 5d pop %rbp 66: c3 retq What changed relative to the previous loop is that cpu_relax() is done before the need_resched() check. This in fact makes sense: only after waiting a bit (or a lot, on virtualized platforms) should we expect 'need_resched' to have changed. 
Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/locking/mutex.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 68c750f4e8e8..45d2445e457a 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -225,9 +225,12 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock, static noinline bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) { - int on_cpu; + int owner_on_cpu; + + for (;;) { + if (unlikely(lock->owner != owner)) + return true; - while (lock->owner == owner) { /* * Use the non-faulting copy-user primitive to get the owner->on_cpu * value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented @@ -251,15 +254,16 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) * NOTE2: We ignore failed copies, as the next iteration will clean * up after us. This saves an extra branch in the common case. */ - get_kernel(on_cpu, &owner->on_cpu); - - if (!on_cpu || need_resched()) + get_kernel(owner_on_cpu, &owner->on_cpu); + if (unlikely(!owner_on_cpu)) return false; cpu_relax_lowlatency(); - } - return true; + /* Stop spinning if we are to be preempted: */ + if (need_resched()) + return false; + } } /* ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 11:27 ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar @ 2015-04-10 12:08 ` Ingo Molnar 2015-04-10 12:18 ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar ` (6 more replies) 0 siblings, 7 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 12:08 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra * Ingo Molnar <mingo@kernel.org> wrote: > So restructure the loop a bit, to get much tighter code: > > 0000000000000030 <mutex_spin_on_owner.isra.5>: > 30: 55 push %rbp > 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx > 38: 00 00 > 3a: 48 89 e5 mov %rsp,%rbp > 3d: 48 39 37 cmp %rsi,(%rdi) > 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30> > 42: 8b 46 28 mov 0x28(%rsi),%eax > 45: 85 c0 test %eax,%eax > 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26> > 49: f3 90 pause > 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax > 52: a8 08 test $0x8,%al > 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd> > 56: 31 c0 xor %eax,%eax > 58: 5d pop %rbp > 59: c3 retq > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > 60: b8 01 00 00 00 mov $0x1,%eax > 65: 5d pop %rbp > 66: c3 retq Btw., totally off topic, the following NOP caught my attention: > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) That's a dead NOP that bloats the function a bit, added for the 16 byte alignment of one of the jump targets. I realize that x86 CPU manufacturers recommend 16-byte jump target alignments (it's in the Intel optimization manual), but the cost of that is very significant: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte By using 1 byte jump target alignment (i.e.
no alignment at all) we get an almost 3% reduction in kernel size (!) - and a probably similar reduction in I$ footprint. So I'm wondering, is the 16 byte jump target optimization suggestion really worth this price? The patch below boots fine and I've not measured any noticeable slowdown, but I've not tried hard. Now, the usual justification for jump target alignment is the following: with 16 byte instruction-cache cacheline sizes, if a forward jump is aligned to cacheline boundary then prefetches will start from a new cacheline. But I think that argument is flawed for typical optimized kernel code flows: forward jumps often go to 'cold' (uncommon) pieces of code, and aligning cold code to cache lines does not bring a lot of advantages (they are uncommon), while it causes collateral damage: - their alignment 'spreads out' the cache footprint, it shifts followup hot code further out - plus it slows down even 'cold' code that immediately follows 'hot' code (like in the above case), which could have benefited from the partial cacheline that comes off the end of hot code. What do you guys think about this? I think we should seriously consider relaxing our alignment defaults. 
Thanks, Ingo ==================================> >From 5b83a095e1abdfee5c710c34a5785232ce74f939 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Fri, 10 Apr 2015 13:50:05 +0200 Subject: [PATCH] x86: Align jump targets to 1 byte boundaries Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 5ba2d9ce82dc..0366d6b44a14 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -77,6 +77,9 @@ else KBUILD_AFLAGS += -m64 KBUILD_CFLAGS += -m64 + # Align jump targets to 1 byte, not the default 16 bytes: + KBUILD_CFLAGS += -falign-jumps=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [PATCH] x86: Pack function addresses tightly as well 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar @ 2015-04-10 12:18 ` Ingo Molnar 2015-04-10 12:30 ` [PATCH] x86: Pack loops " Ingo Molnar 2015-05-15 9:39 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar 2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko ` (5 subsequent siblings) 6 siblings, 2 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 12:18 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra * Ingo Molnar <mingo@kernel.org> wrote: > I realize that x86 CPU manufacturers recommend 16-byte jump target > alignments (it's in the Intel optimization manual), but the cost of > that is very significant: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > > By using 1 byte jump target alignment (i.e. no alignment at all) we > get an almost 3% reduction in kernel size (!) - and a probably > similar reduction in I$ footprint. Likewise we could pack functions tightly as well via the patch below: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte Which brings another 2% reduction in the kernel's code size. It would be interesting to see some benchmarks with these two patches applied. Only lightly tested. 
Thanks, Ingo ============================> >From 773dbbfbf37df3520dd98e91972fbfdef5fe91ad Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Fri, 10 Apr 2015 14:14:29 +0200 Subject: [PATCH] x86: Pack function addresses tightly as well Not-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 0366d6b44a14..573d0c459f99 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -77,9 +77,12 @@ else KBUILD_AFLAGS += -m64 KBUILD_CFLAGS += -m64 - # Align jump targets to 1 byte, not the default 16 bytes: + # Pack jump targets tightly, don't align them to the default 16 bytes: KBUILD_CFLAGS += -falign-jumps=1 + # Pack functions tightly as well: + KBUILD_CFLAGS += -falign-functions=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [PATCH] x86: Pack loops tightly as well 2015-04-10 12:18 ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar @ 2015-04-10 12:30 ` Ingo Molnar 2015-04-10 13:46 ` Borislav Petkov ` (2 more replies) 2015-05-15 9:39 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar 1 sibling, 3 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 12:30 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra * Ingo Molnar <mingo@kernel.org> wrote: > > I realize that x86 CPU manufacturers recommend 16-byte jump target > > alignments (it's in the Intel optimization manual), but the cost > > of that is very significant: > > > > text data bss dec filename > > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > > > > By using 1 byte jump target alignment (i.e. no alignment at all) > > we get an almost 3% reduction in kernel size (!) - and a probably > > similar reduction in I$ footprint. > > Likewise we could pack functions tightly as well via the patch > below: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte > > Which brings another 2% reduction in the kernel's code size. > > It would be interesting to see some benchmarks with these two > patches applied. Only lightly tested. 
And the final patch below also packs loops tightly: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte 11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte The total reduction is 5.5%. Now loop alignment is beneficial if: - a loop is cache-hot and its surroundings are not. Loop alignment is harmful if: - a loop is cache-cold - a loop's surroundings are cache-hot as well - two cache-hot loops are close to each other and I'd argue that the latter three harmful scenarios are much more common in the kernel. Similar arguments can be made for function alignment as well. (Jump target alignment is a bit different but I think the same conclusion holds.) (I might have missed some CPU microarchitectural details though that would make such packing undesirable.) Thanks, Ingo =============================> >From cfc2ca24908cce66b9df1f711225d461f5d59b97 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Fri, 10 Apr 2015 14:20:30 +0200 Subject: [PATCH] x86: Pack loops tightly as well Not-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 573d0c459f99..10989a73b986 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -83,6 +83,9 @@ else # Pack functions tightly as well: KBUILD_CFLAGS += -falign-functions=1 + # Pack loops tightly as well: + KBUILD_CFLAGS += -falign-loops=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Pack loops tightly as well 2015-04-10 12:30 ` [PATCH] x86: Pack loops " Ingo Molnar @ 2015-04-10 13:46 ` Borislav Petkov 2015-05-15 9:40 ` [tip:x86/asm] " tip-bot for Ingo Molnar 2015-05-17 6:03 ` [tip:x86/apic] " tip-bot for Ingo Molnar 2 siblings, 0 replies; 108+ messages in thread From: Borislav Petkov @ 2015-04-10 13:46 UTC (permalink / raw) To: Ingo Molnar Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 02:30:18PM +0200, Ingo Molnar wrote: > And the final patch below also packs loops tightly: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte > 11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte > > The total reduction is 5.5%. > > Now loop alignment is beneficial if: > > - a loop is cache-hot and its surroundings are not. > > Loop alignment is harmful if: > > - a loop is cache-cold > - a loop's surroundings are cache-hot as well > - two cache-hot loops are close to each other > > and I'd argue that the latter three harmful scenarios are much more > common in the kernel. Similar arguments can be made for function > alignment as well. (Jump target alignment is a bit different but I > think the same conclusion holds.) So I IMHO think the loop alignment is coupled to the fetch window size and alignment. I'm looking at the AMD opt. manuals and both for fam 0x15 and 0x16 say that hot loops should be 32-byte aligned due to 32-byte aligned fetch window in each cycle. So if we have hot loops, we probably want them 32-byte aligned (I don't know what that number on Intel is, need to look). 
Family 0x16 says, in addition, that if you have branches in those loops, the first two branches in a cacheline can be processed in a cycle when they're in the branch predictor. And so to guarantee that you should align your loop start to a cacheline. And this all depends on the uarch so I can imagine optimizing for the one would harm the other. Looks like a long project of experimenting and running perf counters :-) -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 108+ messages in thread
* [tip:x86/asm] x86: Pack loops tightly as well 2015-04-10 12:30 ` [PATCH] x86: Pack loops " Ingo Molnar 2015-04-10 13:46 ` Borislav Petkov @ 2015-05-15 9:40 ` tip-bot for Ingo Molnar 2015-05-17 6:03 ` [tip:x86/apic] " tip-bot for Ingo Molnar 2 siblings, 0 replies; 108+ messages in thread From: tip-bot for Ingo Molnar @ 2015-05-15 9:40 UTC (permalink / raw) To: linux-tip-commits Cc: aswin, tim.c.chen, jason.low2, hpa, brgerst, peterz, a.p.zijlstra, dvlasenk, luto, dave, torvalds, tglx, paulmck, bp, mingo, linux-kernel Commit-ID: 24b0af706205ba49cd139913f92fea837a5724a7 Gitweb: http://git.kernel.org/tip/24b0af706205ba49cd139913f92fea837a5724a7 Author: Ingo Molnar <mingo@kernel.org> AuthorDate: Fri, 10 Apr 2015 14:30:18 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Fri, 15 May 2015 11:17:11 +0200 x86: Pack loops tightly as well Packing loops tightly (-falign-loops=1) is beneficial to code size: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte 11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte Which reduces the size of the kernel by another 0.6%, so the total combined size reduction of the alignment-packing patches is ~5.5%. The x86 decoder bandwidth and caching arguments laid out in: be6cb02779ca ("x86: Align jump targets to 1-byte boundaries") apply to loop alignment as well. Furthermore, modern CPU uarchs have a loop cache/buffer that is an L0 cache before even any uop cache, covering a few dozen most recently executed instructions. This loop cache generally does not have the 16-byte alignment restrictions of the uop cache. Now loop alignment can still be beneficial if: - a loop is cache-hot and its surroundings are not. 
- if the loop is so cache hot that the instruction flow becomes x86 decoder bandwidth limited But loop alignment is harmful if: - a loop is cache-cold - a loop's surroundings are cache-hot as well - two cache-hot loops are close to each other - if the loop fits into the loop cache - if the code flow is not decoder bandwidth limited and I'd argue that the latter five scenarios are much more common in the kernel, as our hottest loops are typically: - pointer chasing: this should fit into the loop cache in most cases and is typically data cache and address generation limited - generic memory ops (memset, memcpy, etc.): these generally fit into the loop cache as well, and are likewise data cache limited. So this patch packs loop addresses tightly as well. Acked-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20150410123017.GB19918@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 5c7edf9..8c7cc44 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -83,6 +83,9 @@ else # Pack functions tightly as well: KBUILD_CFLAGS += -falign-functions=1 + # Pack loops tightly as well: + KBUILD_CFLAGS += -falign-loops=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [tip:x86/apic] x86: Pack loops tightly as well 2015-04-10 12:30 ` [PATCH] x86: Pack loops " Ingo Molnar 2015-04-10 13:46 ` Borislav Petkov 2015-05-15 9:40 ` [tip:x86/asm] " tip-bot for Ingo Molnar @ 2015-05-17 6:03 ` tip-bot for Ingo Molnar 2 siblings, 0 replies; 108+ messages in thread From: tip-bot for Ingo Molnar @ 2015-05-17 6:03 UTC (permalink / raw) To: linux-tip-commits Cc: a.p.zijlstra, aswin, dvlasenk, hpa, tglx, luto, bp, linux-kernel, paulmck, mingo, jason.low2, torvalds, tim.c.chen, brgerst, peterz, dave Commit-ID: 52648e83c9a6b9f7fc3dd272d4d10175e93aa62a Gitweb: http://git.kernel.org/tip/52648e83c9a6b9f7fc3dd272d4d10175e93aa62a Author: Ingo Molnar <mingo@kernel.org> AuthorDate: Sun, 17 May 2015 07:56:54 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Sun, 17 May 2015 07:56:54 +0200 x86: Pack loops tightly as well Packing loops tightly (-falign-loops=1) is beneficial to code size: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte 11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte Which reduces the size of the kernel by another 0.6%, so the total combined size reduction of the alignment-packing patches is ~5.5%. The x86 decoder bandwidth and caching arguments laid out in: be6cb02779ca ("x86: Align jump targets to 1-byte boundaries") apply to loop alignment as well. Furthermore, modern CPU uarchs have a loop cache/buffer that is an L0 cache before even any uop cache, covering a few dozen most recently executed instructions. This loop cache generally does not have the 16-byte alignment restrictions of the uop cache. Now loop alignment can still be beneficial if: - a loop is cache-hot and its surroundings are not. 
- if the loop is so cache hot that the instruction flow becomes x86 decoder bandwidth limited But loop alignment is harmful if: - a loop is cache-cold - a loop's surroundings are cache-hot as well - two cache-hot loops are close to each other - if the loop fits into the loop cache - if the code flow is not decoder bandwidth limited and I'd argue that the latter five scenarios are much more common in the kernel, as our hottest loops are typically: - pointer chasing: this should fit into the loop cache in most cases and is typically data cache and address generation limited - generic memory ops (memset, memcpy, etc.): these generally fit into the loop cache as well, and are likewise data cache limited. So this patch packs loop addresses tightly as well. Acked-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20150410123017.GB19918@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index ca17e5f..57996ee 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -80,6 +80,9 @@ else # Align jump targets to 1 byte, not the default 16 bytes: KBUILD_CFLAGS += -falign-jumps=1 + # Pack loops tightly as well: + KBUILD_CFLAGS += -falign-loops=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* [tip:x86/asm] x86: Pack function addresses tightly as well 2015-04-10 12:18 ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar 2015-04-10 12:30 ` [PATCH] x86: Pack loops " Ingo Molnar @ 2015-05-15 9:39 ` tip-bot for Ingo Molnar 2015-05-15 18:36 ` Linus Torvalds 1 sibling, 1 reply; 108+ messages in thread From: tip-bot for Ingo Molnar @ 2015-05-15 9:39 UTC (permalink / raw) To: linux-tip-commits Cc: peterz, a.p.zijlstra, aswin, torvalds, brgerst, paulmck, tglx, jason.low2, mingo, luto, hpa, dave, dvlasenk, tim.c.chen, bp, linux-kernel Commit-ID: 4874fe1eeb40b403a8c9d0ddeb4d166cab3f37ba Gitweb: http://git.kernel.org/tip/4874fe1eeb40b403a8c9d0ddeb4d166cab3f37ba Author: Ingo Molnar <mingo@kernel.org> AuthorDate: Fri, 10 Apr 2015 14:18:08 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Fri, 15 May 2015 11:05:21 +0200 x86: Pack function addresses tightly as well So as per the arguments laid out in: be6cb02779ca ("x86: Align jump targets to 1-byte boundaries") We can pack function addresses tightly as well: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte Which brings another 2% reduction in the defconfig kernel's code size. Acked-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20150410121808.GA19918@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index ca17e5f..5c7edf9 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -80,6 +80,9 @@ else # Align jump targets to 1 byte, not the default 16 bytes: KBUILD_CFLAGS += -falign-jumps=1 + # Pack functions tightly as well: + KBUILD_CFLAGS += -falign-functions=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [tip:x86/asm] x86: Pack function addresses tightly as well 2015-05-15 9:39 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar @ 2015-05-15 18:36 ` Linus Torvalds 2015-05-15 20:52 ` Denys Vlasenko 2015-05-17 5:58 ` Ingo Molnar 0 siblings, 2 replies; 108+ messages in thread From: Linus Torvalds @ 2015-05-15 18:36 UTC (permalink / raw) To: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Linus Torvalds, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Ingo Molnar, Jason Low Cc: linux-tip-commits On Fri, May 15, 2015 at 2:39 AM, tip-bot for Ingo Molnar <tipbot@zytor.com> wrote: > > We can pack function addresses tightly as well: So I really want to see performance numbers on a few microarchitectures for this one in particular. The kernel generally doesn't have loops (well, not the kinds of high-rep loops that tend to be worth aligning), and I think the general branch/loop alignment is likely fine. But the function alignment doesn't tend to have the same kind of I$ advantages, it's more likely purely a size issue and not as interesting. Function targets are also more likely to be not in the cache, I suspect, since you don't have a loop priming it or a short forward jump that just got the cacheline anyway. And then *not* aligning the function would actually tend to make it *less* dense in the I$. Put another way: I suspect this is more likely to hurt, and less likely to help than the others. Size matters, but size matters mainly from an I$ standpoint, not from some absolute 'big is bad' issue. Also, even when size matters, performance matters too. I do want performance numbers. Is this measurable? Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [tip:x86/asm] x86: Pack function addresses tightly as well 2015-05-15 18:36 ` Linus Torvalds @ 2015-05-15 20:52 ` Denys Vlasenko 0 siblings, 0 replies; 108+ messages in thread From: Denys Vlasenko @ 2015-05-15 20:52 UTC (permalink / raw) To: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Ingo Molnar, Jason Low Cc: linux-tip-commits On 05/15/2015 08:36 PM, Linus Torvalds wrote: > On Fri, May 15, 2015 at 2:39 AM, tip-bot for Ingo Molnar > <tipbot@zytor.com> wrote: >> >> We can pack function addresses tightly as well: > > So I really want to see performance numbers on a few > microarchitectures for this one in particular. > > The kernel generally doesn't have loops (well, not the kinds of > high-rep loops that tend to be worth aligning), and I think the > general branch/loop alignment is likely fine. But the function > alignment doesn't tend to have the same kind of I$ advantages, it's > more likely purely a size issue and not as interesting. Function > targets are also more likely to be not in the cache, I suspect, since > you don't have a loop priming it or a short forward jump that just got > the cacheline anyway. And then *not* aligning the function would > actually tend to make it *less* dense in the I$. How about taking an intermediate step and using -falign-functions=6. This means "align to 8 if it requires skipping less than 6 bytes". Why < 6? Because with CONFIG_FTRACE=y, every function starts with a 5-byte instruction ("call ftrace", replaced by a 5-byte nop). We want at least this one insn to be decoded at once. Without CONFIG_FTRACE, it's not as clear-cut, but typical x86 insns are 5 bytes or less, so it will still make most functions start executing reasonably quickly at the cost of only 2.5 bytes of padding on average. 
I'd prefer "align to 16 if it requires skipping less than 6 bytes" because aligning to 8 which is not a multiple of 16 doesn't make sense on modern CPUs (it can in fact hurt a bit), but alas, gcc's option format doesn't allow that. If you don't like the 8-byte alignment, the smallest option which would align to 16 bytes is -falign-functions=9: it means "align to 16 if it requires skipping less than 9 bytes". Still significantly better than insane padding to 16 even if we are at an address just a few bytes past the cacheline start (0x1231 -> 0x1240). The last thing. If CONFIG_CC_OPTIMIZE_FOR_SIZE=y, we probably shouldn't do any alignment. ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [tip:x86/asm] x86: Pack function addresses tightly as well 2015-05-15 18:36 ` Linus Torvalds 2015-05-15 20:52 ` Denys Vlasenko @ 2015-05-17 5:58 ` Ingo Molnar 2015-05-17 7:09 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-17 5:58 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, May 15, 2015 at 2:39 AM, tip-bot for Ingo Molnar > <tipbot@zytor.com> wrote: > > > > We can pack function addresses tightly as well: > > So I really want to see performance numbers on a few > microarchitectures for this one in particular. > > The kernel generally doesn't have loops (well, not the kinds of > high-rep loops that tend to be worth aligning), and I think the > general branch/loop alignment is likely fine. But the function > alignment doesn't tend to have the same kind of I$ advantages, it's > more lilely purely a size issue and not as interesting. Function > targets are also more likely to be not in the cache, I suspect, > since you don't have a loop priming it or a short forward jump that > just got the cacheline anyway. And then *not* aligning the function > would actually tend to make it *less* dense in the I$. > > Put another way: I suspect this is more likely to hurt, and less > likely to help than the others. Yeah, indeed. So my thinking was that it would help, because: - There's often locality of reference between functions: we often have a handful of hot functions that are sitting next to each other and they could thus be packed closer to each other this way, creating a smaller net I$ footprint. 
- We have a handful of 'clusters' of small and often hot functions, especially in the locking code:

  ffffffff81893080 T _raw_spin_unlock_irqrestore
  ffffffff81893090 T _raw_read_unlock_irqrestore
  ffffffff818930a0 T _raw_write_unlock_irqrestore
  ffffffff818930b0 T _raw_spin_trylock_bh
  ffffffff81893110 T _raw_spin_unlock_bh
  ffffffff81893130 T _raw_read_unlock_bh
  ffffffff81893150 T _raw_write_unlock_bh
  ffffffff81893170 T _raw_read_trylock
  ffffffff818931a0 T _raw_write_trylock
  ffffffff818931d0 T _raw_read_lock_irqsave
  ffffffff81893200 T _raw_write_lock_irqsave
  ffffffff81893230 T _raw_spin_lock_bh
  ffffffff81893270 T _raw_spin_lock_irqsave
  ffffffff818932c0 T _raw_write_lock
  ffffffff818932e0 T _raw_write_lock_irq
  ffffffff81893310 T _raw_write_lock_bh
  ffffffff81893340 T _raw_spin_trylock
  ffffffff81893380 T _raw_read_lock
  ffffffff818933a0 T _raw_read_lock_irq
  ffffffff818933c0 T _raw_read_lock_bh
  ffffffff818933f0 T _raw_spin_lock
  ffffffff81893430 T _raw_spin_lock_irq
  ffffffff81893450

That's 976 bytes total if 16 bytes aligned. 
With function packing, they compress into:

  ffffffff817f2458 T _raw_spin_unlock_irqrestore
  ffffffff817f2463 T _raw_read_unlock_irqrestore
  ffffffff817f2472 T _raw_write_unlock_irqrestore
  ffffffff817f247d T _raw_read_unlock_bh
  ffffffff817f2498 T _raw_write_unlock_bh
  ffffffff817f24af T _raw_spin_unlock_bh
  ffffffff817f24c6 T _raw_read_trylock
  ffffffff817f24ef T _raw_write_trylock
  ffffffff817f250e T _raw_spin_lock_bh
  ffffffff817f2536 T _raw_read_lock_irqsave
  ffffffff817f255e T _raw_write_lock_irqsave
  ffffffff817f2588 T _raw_spin_lock_irqsave
  ffffffff817f25be T _raw_spin_trylock_bh
  ffffffff817f25f6 T _raw_spin_trylock
  ffffffff817f2615 T _raw_spin_lock
  ffffffff817f2632 T _raw_spin_lock_irq
  ffffffff817f2650 T _raw_write_lock
  ffffffff817f266b T _raw_write_lock_irq
  ffffffff817f2687 T _raw_write_lock_bh
  ffffffff817f26ad T _raw_read_lock
  ffffffff817f26c6 T _raw_read_lock_bh
  ffffffff817f26ea T _raw_read_lock_irq
  ffffffff817f2704

That's 684 bytes - a very stark difference that will show up in better I$ footprint even if usage is sparse. OTOH, on the flip side, their ordering is far from ideal, so for example the rarely used 'trylock' variants are mixed into the middle, and the way we mix rwlock with spinlock ops isn't very pretty either. So we could reduce alignment for just the locking APIs, via per .o cflags in the Makefile, if packing otherwise hurts the common case.

This function packing argument fails:

 - for large functions that are physically fragmented

 - if less than half of all functions in a hot workload are packed together. This might be the common case in fact.

 - even if functions are technically 'packed' next to each other, this only works for small functions: larger functions typically are hotter near their heads, with unlikely codepaths being in their tails.

> Size matters, but size matters mainly from an I$ standpoint, not
> from some absolute 'big is bad" issue.

Absolutely.

> [...] Also, even when size matters, performance matters too. I do
> want performance numbers. Is this measurable?

Will try to measure this. I'm somewhat sceptical that I'll be able to measure any signal: alignment effects are very hard to measure on x86, especially on any realistic workload.

In any case, consider this function alignment patch shelved until it's properly measured. Thanks, ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [tip:x86/asm] x86: Pack function addresses tightly as well 2015-05-17 5:58 ` Ingo Molnar @ 2015-05-17 7:09 ` Ingo Molnar 2015-05-17 7:30 ` Ingo Molnar 2015-05-18 9:28 ` Denys Vlasenko 2015-05-19 21:38 ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar 2 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-05-17 7:09 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits * Ingo Molnar <mingo@kernel.org> wrote: > > Put another way: I suspect this is more likely to hurt, and less > > likely to help than the others. > > Yeah, indeed. > > So my thinking was that it would help, because: > > - There's often locality of reference between functions: we often > have a handful of hot functions that are sitting next to each > other and they could thus be packed closer to each other this way, > creating a smaller net I$ footprint. 
> > - We have a handful of 'clusters' or small and often hot functions, > especially in the locking code: > > ffffffff81893080 T _raw_spin_unlock_irqrestore > ffffffff81893090 T _raw_read_unlock_irqrestore > ffffffff818930a0 T _raw_write_unlock_irqrestore > ffffffff818930b0 T _raw_spin_trylock_bh > ffffffff81893110 T _raw_spin_unlock_bh > ffffffff81893130 T _raw_read_unlock_bh > ffffffff81893150 T _raw_write_unlock_bh > ffffffff81893170 T _raw_read_trylock > ffffffff818931a0 T _raw_write_trylock > ffffffff818931d0 T _raw_read_lock_irqsave > ffffffff81893200 T _raw_write_lock_irqsave > ffffffff81893230 T _raw_spin_lock_bh > ffffffff81893270 T _raw_spin_lock_irqsave > ffffffff818932c0 T _raw_write_lock > ffffffff818932e0 T _raw_write_lock_irq > ffffffff81893310 T _raw_write_lock_bh > ffffffff81893340 T _raw_spin_trylock > ffffffff81893380 T _raw_read_lock > ffffffff818933a0 T _raw_read_lock_irq > ffffffff818933c0 T _raw_read_lock_bh > ffffffff818933f0 T _raw_spin_lock > ffffffff81893430 T _raw_spin_lock_irq > ffffffff81893450 > > That's 976 bytes total if 16 bytes aligned. 
> > With function packing, they compress into: > > ffffffff817f2458 T _raw_spin_unlock_irqrestore > ffffffff817f2463 T _raw_read_unlock_irqrestore > ffffffff817f2472 T _raw_write_unlock_irqrestore > ffffffff817f247d T _raw_read_unlock_bh > ffffffff817f2498 T _raw_write_unlock_bh > ffffffff817f24af T _raw_spin_unlock_bh > ffffffff817f24c6 T _raw_read_trylock > ffffffff817f24ef T _raw_write_trylock > ffffffff817f250e T _raw_spin_lock_bh > ffffffff817f2536 T _raw_read_lock_irqsave > ffffffff817f255e T _raw_write_lock_irqsave > ffffffff817f2588 T _raw_spin_lock_irqsave > ffffffff817f25be T _raw_spin_trylock_bh > ffffffff817f25f6 T _raw_spin_trylock > ffffffff817f2615 T _raw_spin_lock > ffffffff817f2632 T _raw_spin_lock_irq > ffffffff817f2650 T _raw_write_lock > ffffffff817f266b T _raw_write_lock_irq > ffffffff817f2687 T _raw_write_lock_bh > ffffffff817f26ad T _raw_read_lock > ffffffff817f26c6 T _raw_read_lock_bh > ffffffff817f26ea T _raw_read_lock_irq > ffffffff817f2704 > > That's 684 bytes - a very stark difference that will show up in > better I$ footprint even if usage is sparse. > > OTOH, on the flip side, their ordering is far from ideal, so for > example the rarely used 'trylock' variants are mixed into the > middle, and the way we mix rwlock with spinlock ops isn't > very pretty either. > > So we could reduce alignment for just the locking APIs, via per > .o cflags in the Makefile, if packing otherwise hurts the common > case. Another datapoint: despite the widespread stereotype of bloated, kernel stack guzzling kernel functions, most kernel functions are pretty small and fit into a single (or at most two) cachelines. 
Here's a histogram of x86-64 defconfig function sizes (in bytes, cut off at around 1200 bytes):

  [ASCII histogram: "kernel function sizes histogram", x axis 0..1200 bytes, y axis 0..600 functions - a tall peak of roughly 500 functions at very small sizes, falling off steeply into a long, thin tail out to 1200 bytes]

Zoomed to just the first 200 bytes:

  [ASCII histogram: the same data over x axis 0..200 bytes - the bulk of the distribution is scattered between roughly 30 and 200 bytes]

The median function size is around 1 cacheline (64-byte one), ~80% fitting into two cachelines, with a big peak for very small functions that make up something like 20% of all functions - so packing makes sense unless the actual functions being executed are more fragmented than I'd estimate around 50%. Which is still a significant hurdle. Plus there are a truckload of other details, like how exactly the flow of execution goes within the function (which part is hot). 
For example locking APIs tend to return early after a relatively straight path of execution, making the effective hot function length even smaller than the median. All in all: there's no clear winner, so there's no way around actually measuring it ... ;-) Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [tip:x86/asm] x86: Pack function addresses tightly as well 2015-05-17 7:09 ` Ingo Molnar @ 2015-05-17 7:30 ` Ingo Molnar 0 siblings, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-17 7:30 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits * Ingo Molnar <mingo@kernel.org> wrote: > The median function size is around 1 cacheline (64-byte one), ~80% > fitting into two cachelines, with a big peak for very small > functions that make up something like 20% of all functions [...] Correction: 32% of kernel functions fit into a single cacheline, 55% fit into two cachelines, 70% into three cachelines, 76% into four cachelines so the tail is longer than my quick read of the graph suggested. OTOH, probability of use is biased towards smaller functions: we tend to use smaller, facility functions more frequently. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [tip:x86/asm] x86: Pack function addresses tightly as well 2015-05-17 5:58 ` Ingo Molnar 2015-05-17 7:09 ` Ingo Molnar @ 2015-05-18 9:28 ` Denys Vlasenko 2015-05-19 21:38 ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar 2 siblings, 0 replies; 108+ messages in thread From: Denys Vlasenko @ 2015-05-18 9:28 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits On Sun, May 17, 2015 at 7:58 AM, Ingo Molnar <mingo@kernel.org> wrote: > With function packing, they compress into: > > ffffffff817f2458 T _raw_spin_unlock_irqrestore > ffffffff817f2463 T _raw_read_unlock_irqrestore So _raw_spin_unlock_irqrestore is only 11 bytes long? Sounds like it should have been inlined. Is this with CONFIG_OPTIMIZE_INLINES=y ? ^ permalink raw reply [flat|nested] 108+ messages in thread
* [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-17 5:58 ` Ingo Molnar 2015-05-17 7:09 ` Ingo Molnar 2015-05-18 9:28 ` Denys Vlasenko @ 2015-05-19 21:38 ` Ingo Molnar 2015-05-20 0:47 ` Linus Torvalds 2015-05-20 11:29 ` Denys Vlasenko 2 siblings, 2 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-19 21:38 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton * Ingo Molnar <mingo@kernel.org> wrote: > This function packing argument fails: > > - for large functions that are physically fragmented > > - if less than half of all functions in a hot workload are > packed together. This might be the common case in fact. > > - even if functions are technically 'packed' next to each other, > this only works for small functions: larger functions typically > are hotter near their heads, with unlikely codepaths being in > their tails. > > > Size matters, but size matters mainly from an I$ standpoint, not > > from some absolute 'big is bad" issue. > > Absolutely. > > > [...] Also, even when size matters, performance matters too. I do > > want performance numbers. Is this measurable? > > Will try to measure this. I'm somewhat sceptical that I'll be able > to measure any signal: alignment effects are very hard to measure on > x86, especially on any realistic workload. So I spent the better part of today trying to measure all this. As expected it's very hard to measure such alignment effects: anything that misses the cache and is a real workload tends to be rather noisy. After a couple of fruitless attempts I wrote the test program below. 
It tries to create a more or less meaningful macro-workload that executes a lot of diverse kernel functions, while still having relatively low data footprint:

/*
 * Copyright (C) 2015, Ingo Molnar <mingo@redhat.com>
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <stdlib.h>
#include <locale.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define BUG_ON(cond) assert(!(cond))

#define DEFAULT_LOOPS 100000

int main(int argc, char **argv)
{
	const char *file_name = "./tmp-test-creat.txt";
	struct stat stat_buf;
	long loops;
	long i;
	int ret;

	setlocale(LC_ALL, "");

	if (argc > 1)
		loops = atol(argv[1]);
	else
		loops = DEFAULT_LOOPS;

	printf("# VFS-mix: creat()+stat()+mmap()+munmap()+close()+unlink()ing a file in $(pwd) %'ld times ... ", loops);
	fflush(stdout);

	for (i = 0; i < loops; i++) {
		int fd;

		fd = creat(file_name, 00700);
		BUG_ON(fd < 0);

		ret = lstat(file_name, &stat_buf);
		BUG_ON(ret != 0);

		ret = lseek(fd, 4095, SEEK_SET);
		BUG_ON(ret != 4095);

		close(fd);

		fd = open(file_name, O_RDWR|O_CREAT|O_TRUNC);
		BUG_ON(fd < 0);

		{
			char c = 1;

			ret = write(fd, &c, 1);
			BUG_ON(ret != 1);
		}

		{
			char *mmap_buf = (char *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

			BUG_ON(mmap_buf == (void *)-1L);

			mmap_buf[0] = 1;

			ret = munmap(mmap_buf, 4096);
			BUG_ON(ret != 0);
		}

		close(fd);

		ret = unlink(file_name);
		BUG_ON(ret != 0);
	}

	printf("done.\n");

	return 0;
}

This is a pretty silly test in itself that re-creates the same file again and again and does some VFS and MM ops on it, then deletes it. 
But the nice thing about it is that in every iteration, if run on a journalling filesystem such as ext4, it touches:

 - the VFS
 - various filesystem low level functions
 - various IO/block layer functions
 - low level IO driver
 - memory allocators (slub and page allocator)
 - IRQ handling
 - page fault handling
 - mmap() paths
 - syscall entry/exit path
 - scheduling
 - the RCU subsystem
 - the workqueue subsystem
 - the perf events subsystem

That's a ton of activity: running this on ext4 executes around 1,500 unique kernel functions (!), around 118,000 instructions per iteration - with relatively small data footprint. The profile is pretty flat, with almost 1,000 functions being beyond the 0.01% overhead threshold.

This test workload has a non-trivial I$ footprint: with 1,500 functions and the median kernel function size of around 150 bytes, it's about 220k of text executed - well beyond L1 instruction cache sizes.

So this is a test that has a realistic instruction cache footprint, but has micro-benchmark data footprint properties - which reduces noise while still exercising the I$ effect we seek.

To further reduce measurement noise, I ran this on a system with SSD disks, so the IRQ and block IO workload is almost synchronous and the CPU is fully utilized.
Then I bound the workload to a single CPU, with prio SCHED_FIFO:90, and ran it the following way:

  taskset 1 chrt -f 99 perf stat -a -C 0 --pre sync --repeat 10 \
     -e L1-icache-load-misses \
     -e instructions \
     -e context-switches \
     ./test-vfs

Then I built 11 kernels, each with different function alignment settings:

 fomalhaut:~/linux> size linux-*/vmlinux | sort -n
     text    data     bss      dec filename
 12150606 2565544 1634304 16350454 linux-____CC_OPTIMIZE_FOR_SIZE=y/vmlinux
 13534242 2579560 1634304 17748106 linux-falign-functions=__1-bytes/vmlinux
 13554530 2579560 1634304 17768394 linux-falign-functions=__2-bytes/vmlinux
 13590946 2579560 1634304 17804810 linux-falign-functions=__4-bytes/vmlinux
 13658786 2579560 1634304 17872650 linux-falign-functions=__8-bytes/vmlinux
 13799602 2579560 1634304 18013466 linux-falign-functions=_16-bytes/vmlinux
 14075330 2579560 1634304 18289194 linux-falign-functions=_32-bytes/vmlinux
 14664898 2579560 1634304 18878762 linux-falign-functions=_64-bytes/vmlinux
 15980994 2579560 1634304 20194858 linux-falign-functions=128-bytes/vmlinux
 19038018 2591848 1634304 23264170 linux-falign-functions=256-bytes/vmlinux
 26391106 2591848 1634304 30617258 linux-falign-functions=512-bytes/vmlinux

(I've added the -Os kernel as a comparison.)

A single run results in output like:

 Performance counter stats for 'system wide' (10 runs):

       649,398,972      L1-icache-load-misses                     ( +- 0.09% ) (100.00%)
    11,875,234,916      instructions                              ( +- 0.00% )
           300,038      context-switches                          ( +- 0.00% )

       7.198533444 seconds time elapsed                           ( +- 0.28% )

The 'instructions' and 'context-switches' metrics can be used to cross-check that the run was undisturbed and that it executed exactly the same workload as other runs.
'time elapsed' is rather noisy, while 'L1-icache-load-misses' is the L1 instruction cache misses metric that is the most interesting one: it's a proxy metric for the effective I$ footprint of the workload: if the footprint is larger then the number of misses goes up, if it's tighter then the number of misses goes down.

I ran this on two systems:

... an Intel system:

 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 62
 model name      : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
 stepping        : 7
 cache size      : 25600 KB
 cache_alignment : 64

... and an AMD system:

 vendor_id       : AuthenticAMD
 cpu family      : 21
 model           : 1
 model name      : AMD Opteron(tm) Processor 6278
 stepping        : 2
 cache size      : 2048 KB
 cache_alignment : 64

The CPU frequencies were set to the max on both systems, to reduce power throttling noise and run-to-run skew. The AMD system did not have SSDs, so there I used tmpfs: this reduced the cache footprint to about 30% of that of the Intel system. The alignment effect was still borderline measurable.

I ran the test 10 times on the AMD system (so 100 runs), and 3 times on the Intel system (which took longer due to the file touching ext4/SSD) - i.e. 30 runs.

I picked the best result from the tests, to reduce noise from workload perturbations and from data cache layout effects. (If I take the average values then the effect is still visible, but noisier.)
Here's the result from the Intel system:

 linux-falign-functions=_64-bytes/res.txt:   647,853,942 L1-icache-load-misses                   ( +- 0.07% ) (100.00%)
 linux-falign-functions=128-bytes/res.txt:   669,401,612 L1-icache-load-misses                   ( +- 0.08% ) (100.00%)
 linux-falign-functions=_32-bytes/res.txt:   685,969,043 L1-icache-load-misses                   ( +- 0.08% ) (100.00%)
 linux-falign-functions=256-bytes/res.txt:   699,130,207 L1-icache-load-misses                   ( +- 0.06% ) (100.00%)
 linux-falign-functions=512-bytes/res.txt:   699,130,207 L1-icache-load-misses                   ( +- 0.06% ) (100.00%)
 linux-falign-functions=_16-bytes/res.txt:   706,080,917 L1-icache-load-misses [vanilla kernel]  ( +- 0.05% ) (100.00%)
 linux-falign-functions=__1-bytes/res.txt:   724,539,055 L1-icache-load-misses                   ( +- 0.31% ) (100.00%)
 linux-falign-functions=__4-bytes/res.txt:   725,707,848 L1-icache-load-misses                   ( +- 0.12% ) (100.00%)
 linux-falign-functions=__8-bytes/res.txt:   726,543,194 L1-icache-load-misses                   ( +- 0.04% ) (100.00%)
 linux-falign-functions=__2-bytes/res.txt:   738,946,179 L1-icache-load-misses                   ( +- 0.12% ) (100.00%)
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:   921,910,808 L1-icache-load-misses                   ( +- 0.05% ) (100.00%)

The optimal I$ miss rate is at 64 bytes - which is 9% better than the default kernel's I$ miss rate at 16 bytes alignment.

The 128/256/512 bytes numbers show an increasing amount of cache misses: probably due to the artificially reduced associativity of the caching.
Surprisingly there's a rather marked improvement in elapsed time as well:

 linux-falign-functions=_64-bytes/res.txt:   7.154816369 seconds time elapsed                    ( +- 0.03% )
 linux-falign-functions=_32-bytes/res.txt:   7.231074263 seconds time elapsed                    ( +- 0.12% )
 linux-falign-functions=__8-bytes/res.txt:   7.292203002 seconds time elapsed                    ( +- 0.30% )
 linux-falign-functions=128-bytes/res.txt:   7.314226040 seconds time elapsed                    ( +- 0.29% )
 linux-falign-functions=_16-bytes/res.txt:   7.333597250 seconds time elapsed [vanilla kernel]   ( +- 0.48% )
 linux-falign-functions=__1-bytes/res.txt:   7.367139908 seconds time elapsed                    ( +- 0.28% )
 linux-falign-functions=__4-bytes/res.txt:   7.371721930 seconds time elapsed                    ( +- 0.26% )
 linux-falign-functions=__2-bytes/res.txt:   7.410033936 seconds time elapsed                    ( +- 0.34% )
 linux-falign-functions=256-bytes/res.txt:   7.507029637 seconds time elapsed                    ( +- 0.07% )
 linux-falign-functions=512-bytes/res.txt:   7.507029637 seconds time elapsed                    ( +- 0.07% )
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:   8.531418784 seconds time elapsed                    ( +- 0.19% )

The workload got 2.5% faster - which is pretty nice! This result is 5+ standard deviations above the noise of the measurement.

Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y) performance is: markedly higher cache miss rate despite a 'smaller' kernel, and the workload is 16.3% slower (!).
Part of the -Os picture is that the -Os kernel is executing many more instructions:

 linux-falign-functions=_64-bytes/res.txt:   11,851,763,357 instructions                 ( +- 0.01% )
 linux-falign-functions=__1-bytes/res.txt:   11,852,538,446 instructions                 ( +- 0.01% )
 linux-falign-functions=_16-bytes/res.txt:   11,854,159,736 instructions                 ( +- 0.01% )
 linux-falign-functions=__4-bytes/res.txt:   11,864,421,708 instructions                 ( +- 0.01% )
 linux-falign-functions=__8-bytes/res.txt:   11,865,947,941 instructions                 ( +- 0.01% )
 linux-falign-functions=_32-bytes/res.txt:   11,867,369,566 instructions                 ( +- 0.01% )
 linux-falign-functions=128-bytes/res.txt:   11,867,698,477 instructions                 ( +- 0.01% )
 linux-falign-functions=__2-bytes/res.txt:   11,870,853,247 instructions                 ( +- 0.01% )
 linux-falign-functions=256-bytes/res.txt:   11,876,281,686 instructions                 ( +- 0.01% )
 linux-falign-functions=512-bytes/res.txt:   11,876,281,686 instructions                 ( +- 0.01% )
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:   14,318,175,358 instructions                 ( +- 0.01% )

21.2% more instructions executed ... that cannot go well.

So this should be a reminder that it's effective I$ footprint and the number of instructions executed that matter to performance, not kernel size alone. With current GCC, -Os should only be used on embedded systems where one is willing to make the kernel 10%+ slower, in exchange for a 20% smaller kernel.
The AMD system, with a starkly different x86 microarchitecture, is showing similar characteristics:

 linux-falign-functions=_64-bytes/res-amd.txt:   108,886,550 L1-icache-load-misses       ( +- 0.10% ) (100.00%)
 linux-falign-functions=_32-bytes/res-amd.txt:   110,433,214 L1-icache-load-misses       ( +- 0.15% ) (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:   113,623,200 L1-icache-load-misses       ( +- 0.17% ) (100.00%)
 linux-falign-functions=128-bytes/res-amd.txt:   119,100,216 L1-icache-load-misses       ( +- 0.22% ) (100.00%)
 linux-falign-functions=_16-bytes/res-amd.txt:   122,916,937 L1-icache-load-misses       ( +- 0.15% ) (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:   123,810,566 L1-icache-load-misses       ( +- 0.18% ) (100.00%)
 linux-falign-functions=__2-bytes/res-amd.txt:   124,337,908 L1-icache-load-misses       ( +- 0.71% ) (100.00%)
 linux-falign-functions=__4-bytes/res-amd.txt:   125,221,805 L1-icache-load-misses       ( +- 0.09% ) (100.00%)
 linux-falign-functions=256-bytes/res-amd.txt:   135,761,433 L1-icache-load-misses       ( +- 0.18% ) (100.00%)
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:   159,918,181 L1-icache-load-misses       ( +- 0.10% ) (100.00%)
 linux-falign-functions=512-bytes/res-amd.txt:   170,307,064 L1-icache-load-misses       ( +- 0.26% ) (100.00%)

64 bytes is a similar sweet spot. Note that the penalty at 512 bytes is much steeper than on Intel systems: cache associativity is likely lower on this AMD CPU.

Interestingly the 1 byte alignment result is still pretty good on AMD systems - and I used the exact same kernel image on both systems, so the layout of the functions is exactly the same.
Elapsed time is noisier, but shows a similar trend:

 linux-falign-functions=_64-bytes/res-amd.txt:   1.928409143 seconds time elapsed        ( +- 2.74% )
 linux-falign-functions=128-bytes/res-amd.txt:   1.932961745 seconds time elapsed        ( +- 2.18% )
 linux-falign-functions=__8-bytes/res-amd.txt:   1.940703051 seconds time elapsed        ( +- 1.84% )
 linux-falign-functions=__1-bytes/res-amd.txt:   1.940744001 seconds time elapsed        ( +- 2.15% )
 linux-falign-functions=_32-bytes/res-amd.txt:   1.962074787 seconds time elapsed        ( +- 2.38% )
 linux-falign-functions=_16-bytes/res-amd.txt:   2.000941789 seconds time elapsed        ( +- 1.18% )
 linux-falign-functions=__4-bytes/res-amd.txt:   2.002305627 seconds time elapsed        ( +- 2.75% )
 linux-falign-functions=256-bytes/res-amd.txt:   2.003218532 seconds time elapsed        ( +- 3.16% )
 linux-falign-functions=__2-bytes/res-amd.txt:   2.031252839 seconds time elapsed        ( +- 1.77% )
 linux-falign-functions=512-bytes/res-amd.txt:   2.080632439 seconds time elapsed        ( +- 1.06% )
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:   2.346644318 seconds time elapsed        ( +- 2.19% )

64 bytes alignment is the sweet spot here as well: it's 3.7% faster than the default 16 bytes alignment.

So based on those measurements, I think we should do the exact opposite of my original patch that reduced alignment to 1 byte, and increase kernel function address alignment from 16 bytes to the natural cache line size (64 bytes on modern CPUs).

The cost is a 6.2% larger kernel image:

 13799602 2579560 1634304 18013466 linux-falign-functions=_16-bytes/vmlinux
 14664898 2579560 1634304 18878762 linux-falign-functions=_64-bytes/vmlinux

But this is basically just a RAM cost, it does not increase effective cache footprint (in fact it decreases it), so it's likely a win on everything but RAM-starved embedded systems.

In the future we could perhaps still pack small functions of certain subsystems (such as hot locking functions).
The patch below implements the alignment optimization by the build system using X86_L1_CACHE_SIZE to align functions, limited to 64-bit systems for the time being.

Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig.cpu | 17 +++++++++++++++++
 arch/x86/Makefile    |  6 ++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6983314c8b37..9eacb85efda9 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -304,6 +304,23 @@ config X86_L1_CACHE_SHIFT
 	default "4" if MELAN || M486 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
 
+config X86_L1_CACHE_SIZE
+	int
+	default "16" if X86_L1_CACHE_SHIFT=4
+	default "32" if X86_L1_CACHE_SHIFT=5
+	default "64" if X86_L1_CACHE_SHIFT=6
+	default "128" if X86_L1_CACHE_SHIFT=7
+
+#
+# We optimize function alignment on 64-bit kernels,
+# to pack the instruction cache optimally:
+#
+config X86_FUNCTION_ALIGNMENT
+	int
+	default "16" if !64BIT
+	default X86_L1_CACHE_SIZE if 64BIT && !CC_OPTIMIZE_FOR_SIZE
+	default "1" if 64BIT && CC_OPTIMIZE_FOR_SIZE
+
 config X86_PPRO_FENCE
 	bool "PentiumPro memory ordering errata workaround"
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || MGEODEGX1
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 57996ee840dd..45ebf0bbe833 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,12 @@ else
         # Pack loops tightly as well:
         KBUILD_CFLAGS += -falign-loops=1
 
+        #
+        # Allocate a separate cacheline for every function,
+        # for optimal instruction cache packing:
+        #
+        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-19 21:38 ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar @ 2015-05-20 0:47 ` Linus Torvalds 2015-05-20 12:21 ` Denys Vlasenko 2015-05-20 13:09 ` Ingo Molnar 2015-05-20 11:29 ` Denys Vlasenko 1 sibling, 2 replies; 108+ messages in thread From: Linus Torvalds @ 2015-05-20 0:47 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <mingo@kernel.org> wrote: > > The optimal I$ miss rate is at 64 bytes - which is 9% better than the > default kernel's I$ miss rate at 16 bytes alignment. Ok, these numbers looks reasonable (which is, of course, defined as "meets Linus' expectations"), so I like it. At the same time, I have to admit that I abhor a 64-byte function alignment, when we have a fair number of functions that are (much) smaller than that. Is there some way to get gcc to take the size of the function into account? Because aligning a 16-byte or 32-byte function on a 64-byte alignment is just criminally nasty and wasteful. 
From your numbers the 64-byte alignment definitely makes sense in general, but I really think it would be much nicer if we could get something like "align functions to their power-of-two size rounded up, up to a maximum of 64 bytes".

Maybe I did something wrong, but doing this:

  export last=0
  nm vmlinux | grep ' [tT] ' | sort | while read i t name
  do
      size=$((0x$i-$last)); last=0x$i; lastname=$name
      [ $size -ge 16 ] && echo $size $lastname
  done | sort -n | less -S

seems to say that we have a *lot* of small functions (don't do this with a debug build that has a lot of odd things, do it with something you'd actually boot and run).

The above assumes the default 16-byte alignment, and gets rid of the zero-sized ones (due to mainly system call aliases), and the ones less than 16 bytes (obviously not aligned as-is). But you still end up with a *lot* of functions. A lot of the really small ones are silly setup functions etc, but there's actually a fair number of 16-byte functions.

I seem to get ~30k functions in my defconfig vmlinux file, and about half seem to be less than 96 bytes (that's _with_ the 16-byte alignment). In fact, there seems to be ~5500 functions that are 32 bytes or less, of which 1850 functions are 16 bytes or less.

Aligning a 16-byte function to 64 bytes really does sound wrong, and there's a fair number of them. Of course, it depends on what's around it just how much memory it wastes, but it *definitely* doesn't help I$ to round small functions up to the next cacheline too.

I dunno. I might have screwed up the above shellscript badly and my numbers may be pure garbage. But apart from the tail end that has insane big sizes (due to section changes or intermixed data or something, I suspect) it doesn't look obviously wrong. So I think it might be a reasonable approximation.

We'd need toolchain help to do saner alignment.

                Linus

^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-20 0:47 ` Linus Torvalds @ 2015-05-20 12:21 ` Denys Vlasenko 2015-05-21 11:36 ` Ingo Molnar 2015-05-21 11:38 ` Denys Vlasenko 2015-05-20 13:09 ` Ingo Molnar 1 sibling, 2 replies; 108+ messages in thread From: Denys Vlasenko @ 2015-05-20 12:21 UTC (permalink / raw) To: Linus Torvalds, Ingo Molnar Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton On 05/20/2015 02:47 AM, Linus Torvalds wrote: > On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <mingo@kernel.org> wrote: >> >> The optimal I$ miss rate is at 64 bytes - which is 9% better than the >> default kernel's I$ miss rate at 16 bytes alignment. > > Ok, these numbers looks reasonable (which is, of course, defined as > "meets Linus' expectations"), so I like it. > > At the same time, I have to admit that I abhor a 64-byte function > alignment, when we have a fair number of functions that are (much) > smaller than that. > > Is there some way to get gcc to take the size of the function into > account? Because aligning a 16-byte or 32-byte function on a 64-byte > alignment is just criminally nasty and wasteful. > > From your numbers the 64-byte alignment definitely makes sense in > general, but I really think it would be much nicer if we could get > something like "align functions to their power-of-two size rounded up, > up to a maximum of 64 bytes" Well, that would be a bit hard to implement for gcc, at least in its traditional mode where it emits assembly source, not machine code. However, not all is lost. 
I was thinking about Ingo's AMD results:

 linux-falign-functions=_64-bytes/res-amd.txt:   1.928409143 seconds time elapsed
 linux-falign-functions=__8-bytes/res-amd.txt:   1.940703051 seconds time elapsed
 linux-falign-functions=__1-bytes/res-amd.txt:   1.940744001 seconds time elapsed

AMD is almost perfect. Having no alignment at all still works very well. Almost perfect. Where does "almost" come from? I bet it comes from the small fraction of functions which got unlucky enough to have their first instruction split by a 64-byte boundary. If we could avoid just this corner case, that would help a lot.

And GNU as has means to do that! See https://sourceware.org/binutils/docs/as/P2align.html

 .p2align N1,FILL,N3

"The third expression is also absolute, and is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive."

So what we need is to put something like ".p2align 64,,7" before every function.

(
Why 7?

defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
 6923 of them have 1st insn 5 or more bytes long,
 5841 of them have 1st insn 6 or more bytes long,
 5095 of them have 1st insn 7 or more bytes long,
  786 of them have 1st insn 8 or more bytes long,
  548 of them have 1st insn 9 or more bytes long,
  375 of them have 1st insn 10 or more bytes long,
   73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long: this "heroic" instruction is in local_touch_nmi()

 65 48 c7 05 44 3c 00 7f 00 00 00 00	movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least seven first bytes do not cross 64-byte boundary would cover >98% of all functions.
)

gcc can't do that right now. With -falign-functions=N, it emits ".p2align next_power_of_2(N),,N-1".

We need to make it just a tiny bit smarter.

> We'd need toolchain help to do saner alignment.

Yep. I'm going to create a gcc BZ with a feature request, unless you disagree with my musings above.
-- vda ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-20 12:21 ` Denys Vlasenko @ 2015-05-21 11:36 ` Ingo Molnar 2015-05-21 11:38 ` Denys Vlasenko 1 sibling, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-21 11:36 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton * Denys Vlasenko <dvlasenk@redhat.com> wrote: > I was thinking about Ingo's AMD results: > > linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed > linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed > linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed > > AMD is almost perfect. Having no alignment at all still works very > well. [...] Not quite. As I mentioned it in my post, the 'time elapsed' numbers were very noisy in the AMD case - and you've cut off the stddev column that shows this. Here is the full data: linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% ) linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% ) linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% ) 2-3% of stddev for a 3.7% speedup is not conclusive. 
What you should use instead is the cachemiss counts, which is a good proxy and a lot more stable statistically: linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%) linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%) linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%) which shows that 64 bytes alignment still generates a better I$ layout than tight packing, resulting in 4.3% fewer I$ misses. On Intel it's more pronounced: linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%) linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%) 12% difference. Note that the Intel workload is running on SSDs which makes the cache footprint several times larger, and the workload is more realistic as well than the AMD test that was running in tmpfs. I think it's a fair bet to assume that the AMD system will show a similar difference if it were to run the same workload. Allowing smaller functions to be cut in half by cacheline boundaries looks like a losing strategy, especially with larger workloads. The modified scheme I suggested: 64 bytes alignment + intelligent packing might do even better than dumb 64 bytes alignment. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-20 12:21 ` Denys Vlasenko 2015-05-21 11:36 ` Ingo Molnar @ 2015-05-21 11:38 ` Denys Vlasenko 2016-04-16 21:08 ` Denys Vlasenko 1 sibling, 1 reply; 108+ messages in thread From: Denys Vlasenko @ 2015-05-21 11:38 UTC (permalink / raw) To: Linus Torvalds, Ingo Molnar Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton On 05/20/2015 02:21 PM, Denys Vlasenko wrote: > So what we need is to put something like ".p2align 64,,7" > before every function. > > ( > Why 7? > > defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions. > 6923 of them have 1st insn 5 or more bytes long, > 5841 of them have 1st insn 6 or more bytes long, > 5095 of them have 1st insn 7 or more bytes long, > 786 of them have 1st insn 8 or more bytes long, > 548 of them have 1st insn 9 or more bytes long, > 375 of them have 1st insn 10 or more bytes long, > 73 of them have 1st insn 11 or more bytes long, > one of them has 1st insn 12 bytes long: > this "heroic" instruction is in local_touch_nmi() > 65 48 c7 05 44 3c 00 7f 00 00 00 00 > movq $0x0,%gs:0x7f003c44(%rip) > > Thus ensuring that at least seven first bytes do not cross > 64-byte boundary would cover >98% of all functions. > ) > > gcc can't do that right now. With -falign-functions=N, > it emits ".p2align next_power_of_2(N),,N-1" > > We need to make it just a tiny bit smarter. > >> We'd need toolchain help to do saner alignment. > > Yep. > I'm going to create a gcc BZ with a feature request, > unless you disagree with my musings above. The BZ is here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-21 11:38 ` Denys Vlasenko @ 2016-04-16 21:08 ` Denys Vlasenko 0 siblings, 0 replies; 108+ messages in thread From: Denys Vlasenko @ 2016-04-16 21:08 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Ingo Molnar, Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton On Thu, May 21, 2015 at 1:38 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote: > On 05/20/2015 02:21 PM, Denys Vlasenko wrote: >> So what we need is to put something like ".p2align 64,,7" >> before every function. >> >> ( >> Why 7? >> >> defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions. >> 6923 of them have 1st insn 5 or more bytes long, >> 5841 of them have 1st insn 6 or more bytes long, >> 5095 of them have 1st insn 7 or more bytes long, >> 786 of them have 1st insn 8 or more bytes long, >> 548 of them have 1st insn 9 or more bytes long, >> 375 of them have 1st insn 10 or more bytes long, >> 73 of them have 1st insn 11 or more bytes long, >> one of them has 1st insn 12 bytes long: >> this "heroic" instruction is in local_touch_nmi() >> 65 48 c7 05 44 3c 00 7f 00 00 00 00 >> movq $0x0,%gs:0x7f003c44(%rip) >> >> Thus ensuring that at least seven first bytes do not cross >> 64-byte boundary would cover >98% of all functions. >> ) >> >> gcc can't do that right now. With -falign-functions=N, >> it emits ".p2align next_power_of_2(N),,N-1" >> >> We need to make it just a tiny bit smarter. >> >>> We'd need toolchain help to do saner alignment. >> >> Yep. >> I'm going to create a gcc BZ with a feature request, >> unless you disagree with my musings above. 
> > The BZ is here: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240 ...and now this BZ has a working patch, which implements e.g. -falign-functions=64,7 ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-20 0:47 ` Linus Torvalds 2015-05-20 12:21 ` Denys Vlasenko @ 2015-05-20 13:09 ` Ingo Molnar 1 sibling, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-20 13:09 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <mingo@kernel.org> wrote: > > > The optimal I$ miss rate is at 64 bytes - which is 9% better than > > the default kernel's I$ miss rate at 16 bytes alignment. > > Ok, these numbers looks reasonable (which is, of course, defined as > "meets Linus' expectations"), so I like it. > > At the same time, I have to admit that I abhor a 64-byte function > alignment, when we have a fair number of functions that are (much) > smaller than that. > > Is there some way to get gcc to take the size of the function into > account? Because aligning a 16-byte or 32-byte function on a 64-byte > alignment is just criminally nasty and wasteful. > > From your numbers the 64-byte alignment definitely makes sense in > general, but I really think it would be much nicer if we could get > something like "align functions to their power-of-two size rounded > up, up to a maximum of 64 bytes" I think the ideal strategy would be to minimize the number of cache line boundaries that cut across a function body, but otherwise pack as tightly as possible. I.e. a good first approximation would be to pack functions tightly within a single cache line as long as the next function still fits - and go to the next cacheline if it doesn't. 
This makes sure we use the cachelines to their max, while also making sure that functions aren't fragmented across more cachelines than necessary.

> Maybe I did something wrong, but doing this:
>
>   export last=0
>   nm vmlinux | grep ' [tT] ' | sort | while read i t name
>   do
>       size=$((0x$i-$last)); last=0x$i; lastname=$name
>       [ $size -ge 16 ] && echo $size $lastname
>   done | sort -n | less -S
>
> seems to say that we have a *lot* of small functions (don't do this
> with a debug build that has a lot of odd things, do it with
> something you'd actually boot and run).

Yeah, we do, and I ran your script and it looks similar to what I did a few days ago, so I think your observations are correct.

> The above assumes the default 16-byte alignment, and gets rid of the
> zero-sized ones (due to mainly system call aliases), and the ones
> less than 16 bytes (obviously not aligned as-is). But you still end
> up with a *lot* of functions. A lot of the really small ones are
> silly setup functions etc, but there's actually a fair number of
> 16-byte functions.

So if you build with -falign-functions=1 to get the true size of the functions then the numbers are even more convincing: about 8% of all functions in vmlinux on a typical distro config are 16 bytes or smaller, 20% are 32 bytes or smaller, 36% are 64 bytes or smaller.

> I seem to get ~30k functions in my defconfig vmlinux file, and about
> half seem to be less than 96 bytes (that's _with_ the 16-byte
> alignment). In fact, there seems to be ~5500 functions that are 32
> bytes or less, of which 1850 functions are 16 bytes or less.

Yes. So given the prevalence of small functions I still find my result highly non-intuitive: packing them tightly _should_ have helped I$ footprint. But I'm certainly not going to argue against numbers!

> Aligning a 16-byte function to 64 bytes really does sound wrong, and
> there's a fair number of them. Of course, it depends on what's
> around it just how much memory it wastes, but it *definitely*
> doesn't help I$ to round small functions up to the next cacheline
> too.
>
> I dunno. I might have screwed up the above shellscript badly and my
> numbers may be pure garbage. But apart from the tail end that has
> insane big sizes (due to section changes or intermixed data or
> something, I suspect) it doesn't look obviously wrong. So I think it
> might be a reasonable approximation.
>
> We'd need toolchain help to do saner alignment.

So in theory we could use -ffunction-sections and then create a linker script on the fly with arbitrary alignment logic to our liking, but I'd guess it would be a bit slow and possibly also somewhat fragile, as linker scripts aren't the most robust pieces of GNU tooling. Another advantage would be that we could reorder functions (within the same .o) to achieve better packing.

I'll try to play with it a bit to see how feasible it is, and to see whether more performance is possible with better I$ packing.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-19 21:38 ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar 2015-05-20 0:47 ` Linus Torvalds @ 2015-05-20 11:29 ` Denys Vlasenko 2015-05-21 13:28 ` Ingo Molnar 2015-05-21 14:03 ` Ingo Molnar 1 sibling, 2 replies; 108+ messages in thread From: Denys Vlasenko @ 2015-05-20 11:29 UTC (permalink / raw) To: Ingo Molnar, Linus Torvalds Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton On 05/19/2015 11:38 PM, Ingo Molnar wrote: > Here's the result from the Intel system: > > linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%) > linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%) > linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%) > linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%) > linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%) > linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%) > linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%) > linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%) > linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%) > linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%) > linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%) > > The optimal I$ 
miss rate is at 64 bytes - which is 9% better than the > default kernel's I$ miss rate at 16 bytes alignment. > > The 128/256/512 bytes numbers show an increasing amount of cache > misses: probably due to the artificially reduced associativity of the > caching. > > Surprisingly there's a rather marked improvement in elapsed time as > well: > > linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed ( +- 0.03% ) > linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed ( +- 0.12% ) > linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed ( +- 0.30% ) > linux-falign-functions=128-bytes/res.txt: 7.314226040 seconds time elapsed ( +- 0.29% ) > linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed [vanilla kernel] ( +- 0.48% ) > linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed ( +- 0.28% ) > linux-falign-functions=__4-bytes/res.txt: 7.371721930 seconds time elapsed ( +- 0.26% ) > linux-falign-functions=__2-bytes/res.txt: 7.410033936 seconds time elapsed ( +- 0.34% ) > linux-falign-functions=256-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% ) > linux-falign-functions=512-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% ) > linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 8.531418784 seconds time elapsed ( +- 0.19% ) > > the workload got 2.5% faster - which is pretty nice! This result is 5+ > standard deviations above the noise of the measurement. > > Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y) > performance is: markedly higher cache miss rate despite a 'smaller' > kernel, and the workload is 16.3% slower (!). 
> > Part of the -Os picture is that the -Os kernel is executing much more > instructions: > > linux-falign-functions=_64-bytes/res.txt: 11,851,763,357 instructions ( +- 0.01% ) > linux-falign-functions=__1-bytes/res.txt: 11,852,538,446 instructions ( +- 0.01% ) > linux-falign-functions=_16-bytes/res.txt: 11,854,159,736 instructions ( +- 0.01% ) > linux-falign-functions=__4-bytes/res.txt: 11,864,421,708 instructions ( +- 0.01% ) > linux-falign-functions=__8-bytes/res.txt: 11,865,947,941 instructions ( +- 0.01% ) > linux-falign-functions=_32-bytes/res.txt: 11,867,369,566 instructions ( +- 0.01% ) > linux-falign-functions=128-bytes/res.txt: 11,867,698,477 instructions ( +- 0.01% ) > linux-falign-functions=__2-bytes/res.txt: 11,870,853,247 instructions ( +- 0.01% ) > linux-falign-functions=256-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% ) > linux-falign-functions=512-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% ) > linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 14,318,175,358 instructions ( +- 0.01% ) > > 21.2% more instructions executed ... that cannot go well. > > So this should be a reminder that it's effective I$ footprint and > number of instructions executed that matters to performance, not > kernel size alone. With current GCC -Os should only be used on > embedded systems where one is willing to make the kernel 10%+ slower, > in exchange for a 20% smaller kernel. Can you post your .config for the test? If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test, consider re-testing with it turned off. 
You may be seeing this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 > The AMD system, with a starkly different x86 microarchitecture, is > showing similar characteristics: > > linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%) > linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%) > linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%) > linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%) > linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%) > linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%) > linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%) > linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%) > linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%) > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%) > linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%) > > 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes > is much steeper than on Intel systems: cache associativity is likely > lower on this AMD CPU. > > Interestingly the 1 byte alignment result is still pretty good on AMD > systems - and I used the exact same kernel image on both systems, so > the layout of the functions is exactly the same. 
> > Elapsed time is noisier, but shows a similar trend: > > linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% ) > linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% ) > linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% ) > linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% ) > linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% ) > linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% ) > linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% ) > linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% ) > linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% ) > linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% ) > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% ) > > 64 bytes alignment is the sweet spot here as well, it's 3.7% faster > than the default 16 bytes alignment. In AMD, 64 bytes win too, yes, but by a *very* small margin. 8 bytes and 1 byte alignments have basically same timings, and both take what, +0.63% of time longer to run? linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed I wouldn't say that it's the same as Intel. 
There the difference between 64 byte alignment and no alignment at all is five times larger than for AMD, it's +3%: linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed > So based on those measurements, I think we should do the exact > opposite of my original patch that reduced alignment to 1 byte, and > increase kernel function address alignment from 16 bytes to the > natural cache line size (64 bytes on modern CPUs). > + # > + # Allocate a separate cacheline for every function, > + # for optimal instruction cache packing: > + # > + KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT) How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1 instead? This avoids pathological cases where a function starting just a few bytes after a 64-byte boundary gets aligned to the next one, wasting ~60 bytes. -- vda ^ permalink raw reply [flat|nested] 108+ messages in thread
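The effect Denys describes can be checked with a little arithmetic. As I understand GCC's behavior, a non-power-of-two value like `-falign-functions=33` is emitted roughly as `.p2align 6,,32` — align to the next 64-byte boundary only when that costs at most 32 padding bytes. The offsets below are made up for illustration:

```python
# Padding cost at a hypothetical function start offset:
#  - pad_always:      plain unconditional 64-byte alignment
#  - pad_conditional: align to 64 only if it costs <= 32 bytes
#    (roughly what -falign-functions=33 is understood to produce)

def pad_always(offset, align=64):
    return (-offset) % align

def pad_conditional(offset, align=64, max_skip=32):
    pad = (-offset) % align
    return pad if pad <= max_skip else 0

print(pad_always(68), pad_conditional(68))     # 60 0 - pathological waste avoided
print(pad_always(124), pad_conditional(124))   # 4 4  - cheap alignment still taken
```

A function starting 4 bytes past a boundary pays almost a full cacheline under unconditional alignment and nothing under the conditional form, while one that is already close to the next boundary still gets aligned in both cases.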
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-20 11:29 ` Denys Vlasenko @ 2015-05-21 13:28 ` Ingo Molnar 2015-05-21 14:03 ` Ingo Molnar 1 sibling, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-21 13:28 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton * Denys Vlasenko <dvlasenk@redhat.com> wrote: > Can you post your .config for the test? > If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test, > consider re-testing with it turned off. Yes, I had CONFIG_OPTIMIZE_INLINING=y. With that turned off, on GCC 4.9.2, I'm seeing: fomalhaut:~/linux/linux-____CC_OPTIMIZE_FOR_SIZE=y> size vmlinux.OPTIMIZE_INLINING\=* text data bss dec hex filename 12150606 2565544 1634304 16350454 f97cf6 vmlinux.OPTIMIZE_INLINING=y 12354814 2572520 1634304 16561638 fcb5e6 vmlinux.OPTIMIZE_INLINING=n I.e. forcing the inlining increases the kernel size again, by about 1.7%. I re-ran the tests on the Intel system, and got these I$ miss rates: linux-falign-functions=_64-bytes: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%) linux-falign-functions=_16-bytes: 706,080,917 L1-icache-load-misses ( +- 0.05% ) (100.00%) linux-CC_OPTIMIZE_FOR_SIZE=y+OPTIMIZE_INLINING=y: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%) linux-CC_OPTIMIZE_FOR_SIZE=y+OPTIMIZE_INLINING=n: 792,395,265 L1-icache-load-misses ( +- 0.05% ) (100.00%) So yeah, it got better - but the I$ cache miss rate is still 22.4% higher than that of the 64-bytes aligned kernel and 12.2% higher than the vanilla kernel. 
Elapsed time had this original OPTIMIZE_FOR_SIZE result: 8.531418784 seconds time elapsed ( +- 0.19% ) this now improved to: 7.686174880 seconds time elapsed ( +- 0.18% ) but it's still much worse than the 64-byte aligned one: 7.154816369 seconds time elapsed ( +- 0.03% ) and the 16-byte aligned one: 7.333597250 seconds time elapsed ( +- 0.48% ) > You may be seeing this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 Yeah, disabling OPTIMIZE_INLINING made a difference - but it didn't recover the performance loss, -Os is still 4.8% slower in this workload than the vanilla kernel. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 2015-05-20 11:29 ` Denys Vlasenko 2015-05-21 13:28 ` Ingo Molnar @ 2015-05-21 14:03 ` Ingo Molnar 1 sibling, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-21 14:03 UTC (permalink / raw) To: Denys Vlasenko Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Linux Kernel Mailing List, Tim Chen, Borislav Petkov, Peter Zijlstra, Chandramouleeswaran, Aswin, Peter Zijlstra, Brian Gerst, Paul McKenney, Thomas Gleixner, Jason Low, linux-tip-commits, Arjan van de Ven, Andrew Morton * Denys Vlasenko <dvlasenk@redhat.com> wrote: > > The AMD system, with a starkly different x86 microarchitecture, is > > showing similar characteristics: > > > > linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%) > > linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%) > > linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%) > > linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%) > > linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%) > > linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%) > > linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%) > > linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%) > > linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%) > > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%) > > linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%) > > > > 64 bytes is a similar sweet spot. 
Note that the penalty at 512 bytes > > is much steeper than on Intel systems: cache associativity is likely > > lower on this AMD CPU. > > > > Interestingly the 1 byte alignment result is still pretty good on AMD > > systems - and I used the exact same kernel image on both systems, so > > the layout of the functions is exactly the same. > > > > Elapsed time is noisier, but shows a similar trend: > > > > linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% ) > > linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% ) > > linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% ) > > linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% ) > > linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% ) > > linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% ) > > linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% ) > > linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% ) > > linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% ) > > linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% ) > > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% ) > > > > 64 bytes alignment is the sweet spot here as well, it's 3.7% > > faster than the default 16 bytes alignment. > > In AMD, 64 bytes win too, yes, but by a *very* small margin. 8 bytes > and 1 byte alignments have basically same timings, and both take > what, +0.63% of time longer to run? See my previous mail why this is a misleading conclusion: the timings have so much stddev that they are not really comparable. 
But if you look at the cachemiss you'll see the true difference: linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%) linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%) linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%) 1 byte alignment isn't optimal on AMD either. > > + # > > + # Allocate a separate cacheline for every function, > > + # for optimal instruction cache packing: > > + # > > + KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT) > > How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1 instead? > > This avoids pathological cases where function starting just a few > bytes after 64-bytes boundary gets aligned to the next one, wasting > ~60 bytes. Well, that should be pretty similar to the 32-byte aligned numbers I already posted, right? The 32-byte numbers are better than the default 16-byte alignment, but 64-byte alignment was even better. 
A quick build shows: text data bss dec hex filename 13534242 2579560 1634304 17748106 10ed08a linux-falign-functions=__1-bytes/vmlinux 13554530 2579560 1634304 17768394 10f1fca linux-falign-functions=__2-bytes/vmlinux 13590946 2579560 1634304 17804810 10fae0a linux-falign-functions=__4-bytes/vmlinux 13658786 2579560 1634304 17872650 110b70a linux-falign-functions=__8-bytes/vmlinux 13799602 2579560 1634304 18013466 112dd1a linux-falign-functions=_16-bytes/vmlinux 13943906 2579560 1634304 18157770 11510ca linux-falign-functions=_33-bytes/vmlinux [1] 14075330 2579560 1634304 18289194 117122a linux-falign-functions=_32-bytes/vmlinux [2] 14664898 2579560 1634304 18878762 120112a linux-falign-functions=_64-bytes/vmlinux 15980994 2579560 1634304 20194858 134262a linux-falign-functions=128-bytes/vmlinux 19038018 2591848 1634304 23264170 162fbaa linux-falign-functions=256-bytes/vmlinux 26391106 2591848 1634304 30617258 1d32eaa linux-falign-functions=512-bytes/vmlinux Interestingly the -falign-functions=33 kernel is smaller than the -falign-functions=32 kernel, by about 0.9% (going by the text sizes above).
But the numbers aren't exceptional: 671,702,272 L1-icache-load-misses ( +- 0.06% ) (100.00%) 11,892,913,320 instructions ( +- 0.01% ) 300,030 context-switches ( +- 0.00% ) 7.349312237 seconds time elapsed ( +- 1.20% ) Compared to the others it ranks third: linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%) linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%) linux-falign-functions=_33-bytes/res.txt: [NEW] 671,702,272 L1-icache-load-misses ( +- 0.06% ) (100.00%) linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%) linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%) linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%) linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%) linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%) linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%) linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%) linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%) linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%) So I still think the best strategy would be a variant of what I mentioned before: 1) 'small functions': whose size is 1-63 bytes. 2) 'large functions': whose size is 64+ bytes. 3) align all large functions to 64-byte cachelines. 4) pack all small functions into remaining large function tail holes, without them spilling over into the next cacheline. 5) pack the remaining small functions with each other, tightly, as long as they don't spill over to the next cacheline.
6) once optimization steps 1-5 are performed, for all cachelines that contain small functions and still have a hole near the end of the cacheline, allow functions to be aligned to 32/16/8/4/2 bytes as long as they still fit into the cacheline. So for example lets consider this layout with tight packing: [small-fn1][small-fn2][large-fn3...........................][small-fn4 ..... ][hole............] [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........] Both 'fn3' and 'fn4' are on two cachelines. After step 5 we'd get such a packing: [small-fn1][small-fn2][hole....][large-fn3...........................][small-fn4 ..... ][hole..] [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........] fn1, fn2 and fn4 each have their own cacheline, while large-fn3 necessarily has to spill into the next cacheline due to its size. 'fn4' is now on a single cache line. After step 6 we'd get some extra intra-cacheline alignment by utilizing the holes: [small-fn1]..[small-fn2][hole..][large-fn3...........................]...[small-fn4 ..... ][hle] [......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........] Note how small-fn2 got moved forward by 2 bytes, to move it to an 8 bytes boundary and small-fn4 got moved forward by 3 bytes - at the expense of the holes at the end of the cachelines. This function alignment optimization basically minimizes the number of cache line boundaries intersecting individual functions, so this minimizes the I$ footprint aggressively. This is not a trivial computation btw.: because even if we do this only per .o object file we'd have to consider nr_fns! permutations of all possible function ordering. We'd want to pick the ordering that matches the above 1-6 constraints, _and_ which is as close to 'source code ordering' as possible. 
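The greedy core of this strategy (roughly steps 3-5) can be sketched in a few lines. This is an illustration of the idea only, with invented function sizes, not actual linker or toolchain code:

```python
# Greedy illustration of the cacheline-packing strategy sketched above
# (invented function sizes; not real linker/toolchain code).
LINE = 64  # cacheline size in bytes

def lines_spanned(start, size):
    """How many cachelines the byte range [start, start+size) touches."""
    return (start + size - 1) // LINE - start // LINE + 1

def naive_layout(funcs):
    """Tight packing: every function starts where the previous ended."""
    pos, layout = 0, {}
    for name, size in funcs:
        layout[name] = pos
        pos += size
    return layout

def packed_layout(funcs):
    """Large functions on cacheline boundaries, small ones first-fit
    into the tail holes, never crossing a line boundary."""
    large = [(n, s) for n, s in funcs if s >= LINE]
    small = [(n, s) for n, s in funcs if s < LINE]
    layout, pos, holes = {}, 0, []
    for name, size in large:
        pos = (pos + LINE - 1) // LINE * LINE      # align to a cacheline
        layout[name] = pos
        pos += size
        tail = -pos % LINE                         # hole after the tail
        if tail:
            holes.append([pos, tail])
            pos += tail
    for name, size in small:
        for hole in holes:                         # fill tail holes first
            if size <= hole[1]:
                layout[name] = hole[0]
                hole[0] += size
                hole[1] -= size
                break
        else:                                      # else pack tightly,
            if -pos % LINE and -pos % LINE < size: # but don't straddle
                pos += -pos % LINE                 # a line boundary
            layout[name] = pos
            pos += size
    return layout

funcs = [("fn1", 10), ("fn2", 10), ("fn3", 100), ("fn4", 40)]
for build in (naive_layout, packed_layout):
    lay = build(funcs)
    print(build.__name__,
          sum(lines_spanned(lay[n], s) for n, s in funcs))
# naive_layout 6, packed_layout 5 for this toy input
```

Even on this four-function toy input, reusing the tail hole behind the large function saves one cacheline crossing; the hard part, as noted, is choosing an ordering, which no local alignment rule can express.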
Such an optimization cannot be described via simple local alignment rules like any variant of -falign-functions, as it requires reordering of functions. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar 2015-04-10 12:18 ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar @ 2015-04-10 12:50 ` Denys Vlasenko 2015-04-10 13:18 ` H. Peter Anvin ` (3 more replies) 2015-04-10 13:19 ` Borislav Petkov ` (4 subsequent siblings) 6 siblings, 4 replies; 108+ messages in thread From: Denys Vlasenko @ 2015-04-10 12:50 UTC (permalink / raw) To: Ingo Molnar, Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 04/10/2015 02:08 PM, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > >> So restructure the loop a bit, to get much tighter code: >> >> 0000000000000030 <mutex_spin_on_owner.isra.5>: >> 30: 55 push %rbp >> 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx >> 38: 00 00 >> 3a: 48 89 e5 mov %rsp,%rbp >> 3d: 48 39 37 cmp %rsi,(%rdi) >> 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30> >> 42: 8b 46 28 mov 0x28(%rsi),%eax >> 45: 85 c0 test %eax,%eax >> 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26> >> 49: f3 90 pause >> 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax >> 52: a8 08 test $0x8,%al >> 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd> >> 56: 31 c0 xor %eax,%eax >> 58: 5d pop %rbp >> 59: c3 retq >> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) >> 60: b8 01 00 00 00 mov $0x1,%eax >> 65: 5d pop %rbp >> 66: c3 retq > > Btw., totally off topic, the following NOP caught my attention: > >> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > That's a dead NOP that bloats the function a bit, added for the 16 byte > alignment of one of the jump targets.
> > I realize that x86 CPU manufacturers recommend 16-byte jump target > alignments (it's in the Intel optimization manual), but the cost of > that is very significant: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > > By using 1 byte jump target alignment (i.e. no alignment at all) we > get an almost 3% reduction in kernel size (!) - and a probably similar > reduction in I$ footprint. > > So I'm wondering, is the 16 byte jump target optimization suggestion > really worth this price? The patch below boots fine and I've not > measured any noticeable slowdown, but I've not tried hard. I am absolutely thrilled by the proposal to cut down on sadistic amounts of alignment. However, I'm an -Os guy. Expect -O2 people to disagree :) New-ish versions of gcc allow people to specify optimization options per function: https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes optimize The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. Arguments can either be numbers or strings. Numbers are assumed to be an optimization level. Strings that begin with O are assumed to be an optimization option, while other options are assumed to be used with a -f prefix. How about not aligning code by default, and using #define hot_func __attribute__((optimize("O2","align-functions=16","align-jumps=16"))) ... void hot_func super_often_called_func(...) {...} in hot code paths? ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko @ 2015-04-10 13:18 ` H. Peter Anvin 2015-04-10 17:54 ` Ingo Molnar 2015-04-11 14:41 ` Markus Trippelsdorf 2015-04-10 18:48 ` Linus Torvalds ` (2 subsequent siblings) 3 siblings, 2 replies; 108+ messages in thread From: H. Peter Anvin @ 2015-04-10 13:18 UTC (permalink / raw) To: Denys Vlasenko, Ingo Molnar, Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra On 04/10/2015 05:50 AM, Denys Vlasenko wrote: > > However, I'm an -Os guy. Expect -O2 people to disagree :) > The problem with -Os is that the compiler will make *any* tradeoffs to save a byte. It is really designed to squeeze as much code into a fixed-size chunk, e.g. a ROM, as possible. We have asked for an -Okernel mode from the gcc folks forever. It basically would mean "-Os except when really dumb." As far as the 16-byte alignment, my understanding is not that it is related to the I$ but rather is the decoder datum. -hpa ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 13:18 ` H. Peter Anvin @ 2015-04-10 17:54 ` Ingo Molnar 2015-04-10 18:32 ` H. Peter Anvin 2015-04-11 14:41 ` Markus Trippelsdorf 1 sibling, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 17:54 UTC (permalink / raw) To: H. Peter Anvin Cc: Denys Vlasenko, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra * H. Peter Anvin <hpa@zytor.com> wrote: > On 04/10/2015 05:50 AM, Denys Vlasenko wrote: > > > > However, I'm an -Os guy. Expect -O2 people to disagree :) > > > > The problem with -Os is that the compiler will make *any* tradeoffs > to save a byte. It is really designed to squeeze as much code into > a fixed-size chunk, e.g. a ROM, as possible. > > We have asked for an -Okernel mode from the gcc folks forever. It > basically would mean "-Os except when really dumb." Yes, and it appears that not aligning to 16 bytes gives 5.5% size savings already - which is a big chunk of the -Os win. So we might be able to get a "poor man's -Okernel" by not aligning. (I'm also looking at GCC options to make loop unrolls less aggressive - that's another common source of bloat.) I strongly suspect it's the silly 'use weird, wildly data-dependent instructions just to save a single byte' games that are killing -Os performance in practice. > As far as the 16-byte alignment, my understanding is not that it is > related to the I$ but rather is the decoder datum. Yeah, but the decoder stops if the prefetch crosses a cache line? So it appears to be an interaction of the 16 byte prefetch window and cache line boundaries? Btw., given that much of a real life kernel's instructions execute cache-cold, a 5.5% reduction in kernel size could easily speed up cache-cold execution by a couple of percent.
In the cache-cold case the prefetch window size is probably not important at all, what determines execution speed is cache miss latency and cache footprint. [ At least in my simple mental picture of it, which might be wrong ;-) ] Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 17:54 ` Ingo Molnar @ 2015-04-10 18:32 ` H. Peter Anvin 0 siblings, 0 replies; 108+ messages in thread From: H. Peter Anvin @ 2015-04-10 18:32 UTC (permalink / raw) To: Ingo Molnar Cc: Denys Vlasenko, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra On 04/10/2015 10:54 AM, Ingo Molnar wrote: > > Yeah, but the decoder stops if the prefetch crosses a cache line? So > it appears to be an interaction of the 16 byte prefetch window and > cache line boundaries? > I don't think it has anything to do with cache lines. Rather, you may lose a cycle if the first instruction bundle crosses into a new 16-byte chunk. -hpa ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 13:18 ` H. Peter Anvin 2015-04-10 17:54 ` Ingo Molnar @ 2015-04-11 14:41 ` Markus Trippelsdorf 2015-04-12 10:14 ` Ingo Molnar 1 sibling, 1 reply; 108+ messages in thread From: Markus Trippelsdorf @ 2015-04-11 14:41 UTC (permalink / raw) To: H. Peter Anvin Cc: Denys Vlasenko, Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra On 2015.04.10 at 06:18 -0700, H. Peter Anvin wrote: > On 04/10/2015 05:50 AM, Denys Vlasenko wrote: > > > > However, I'm an -Os guy. Expect -O2 people to disagree :) > > > > The problem with -Os is that the compiler will make *any* tradeoffs to > save a byte. It is really designed to squeeze as much code into a > fixed-size chunk, e.g. a ROM, as possible. > > We have asked for an -Okernel mode from the gcc folks forever. It > basically would mean "-Os except when really dumb." If you want the best of both worlds perhaps you should reconsider Andy's LTO patch? With -flto gcc automatically optimizes all functions that it considers cold for size. So you could expect some code size savings even with -O2 (or -O3). -- Markus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-11 14:41 ` Markus Trippelsdorf @ 2015-04-12 10:14 ` Ingo Molnar 2015-04-13 16:23 ` Markus Trippelsdorf 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-12 10:14 UTC (permalink / raw) To: Markus Trippelsdorf Cc: H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra * Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > On 2015.04.10 at 06:18 -0700, H. Peter Anvin wrote: > > On 04/10/2015 05:50 AM, Denys Vlasenko wrote: > > > > > > However, I'm an -Os guy. Expect -O2 people to disagree :) > > > > > > > The problem with -Os is that the compiler will make *any* tradeoffs to > > save a byte. It is really designed to squeeze as much code into a > > fixed-size chunk, e.g. a ROM, as possible. > > > > We have asked for an -Okernel mode from the gcc folks forever. It > > basically would mean "-Os except when really dumb." > > If you want the best of both worlds perhaps you should reconsider Andy's > LTO patch? With -flto gcc automatically optimizes all functions that it > considers cold for size. So you could expect some code size savings even > with -O2 (or -O3). In my (past) experience the main win from -flto is not due to better hot/cold decisions, but simply due to more aggressive dead code elimination. -flto has less of an effect on code that is actually being executed. Which isn't to be sneered at, but it's far less of a direct effect as branch probabilities are, which cut to the core of most hotpaths in the kernel. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries
  2015-04-12 10:14     ` Ingo Molnar
@ 2015-04-13 16:23       ` Markus Trippelsdorf
  2015-04-13 17:26         ` Markus Trippelsdorf
  0 siblings, 1 reply; 108+ messages in thread
From: Markus Trippelsdorf @ 2015-04-13 16:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Linus Torvalds,
	Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen,
	Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski,
	Brian Gerst, Thomas Gleixner, Peter Zijlstra

On 2015.04.12 at 12:14 +0200, Ingo Molnar wrote:
> In my (past) experience the main win from -flto is not due to better
> hot/cold decisions, but simply due to more aggressive dead code
> elimination. -flto has less of an effect on code that is actually
> being executed.
>
> Which isn't to be sneered at, but it's far less of a direct effect as
> branch probabilities are, which cut to the core of most hotpaths in
> the kernel.

I did some measurements with gcc-5.1-RC on X86_64 using Andi's latest
LTO kernel patch for 4.0. With my simple monolithic .config the code
size savings are below 1%. That is lower than I expected.

-- 
Markus

^ permalink raw reply	[flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries
  2015-04-13 16:23       ` Markus Trippelsdorf
@ 2015-04-13 17:26         ` Markus Trippelsdorf
  2015-04-13 18:31           ` Linus Torvalds
  0 siblings, 1 reply; 108+ messages in thread
From: Markus Trippelsdorf @ 2015-04-13 17:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Linus Torvalds,
	Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen,
	Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski,
	Brian Gerst, Thomas Gleixner, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1104 bytes --]

On 2015.04.13 at 18:23 +0200, Markus Trippelsdorf wrote:
> On 2015.04.12 at 12:14 +0200, Ingo Molnar wrote:
> > In my (past) experience the main win from -flto is not due to better
> > hot/cold decisions, but simply due to more aggressive dead code
> > elimination. -flto has less of an effect on code that is actually
> > being executed.
> >
> > Which isn't to be sneered at, but it's far less of a direct effect as
> > branch probabilities are, which cut to the core of most hotpaths in
> > the kernel.
>
> I did some measurements with gcc-5.1-RC on X86_64 using Andi's latest
> LTO kernel patch for 4.0. With my simple monolithic .config the code
> size savings are below 1%. That is lower than I've expected.

I must have made a measurement mistake above, because the actual code
size savings are roughly 5%:

   text	   data	    bss	     dec	filename
8746230	 970072	 802816	10519118	./vmlinux gcc-5 (lto)
9202488	 978512	 811008	10992008	./vmlinux gcc-5
8686246	1009104	 811008	10506358	./vmlinux gcc-4.9 (lto)
9228994	 992976	 815104	11037074	./vmlinux gcc-4.9

-- 
Markus

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 71758 bytes --]
set # CONFIG_SENSORS_ASB100 is not set # CONFIG_SENSORS_ATXP1 is not set # CONFIG_SENSORS_DS620 is not set # CONFIG_SENSORS_DS1621 is not set # CONFIG_SENSORS_I5K_AMB is not set # CONFIG_SENSORS_F71805F is not set # CONFIG_SENSORS_F71882FG is not set # CONFIG_SENSORS_F75375S is not set # CONFIG_SENSORS_FSCHMD is not set # CONFIG_SENSORS_GL518SM is not set # CONFIG_SENSORS_GL520SM is not set # CONFIG_SENSORS_G760A is not set # CONFIG_SENSORS_G762 is not set # CONFIG_SENSORS_HIH6130 is not set # CONFIG_SENSORS_I5500 is not set # CONFIG_SENSORS_CORETEMP is not set # CONFIG_SENSORS_IT87 is not set # CONFIG_SENSORS_JC42 is not set # CONFIG_SENSORS_POWR1220 is not set # CONFIG_SENSORS_LINEAGE is not set # CONFIG_SENSORS_LTC2945 is not set # CONFIG_SENSORS_LTC4151 is not set # CONFIG_SENSORS_LTC4215 is not set # CONFIG_SENSORS_LTC4222 is not set # CONFIG_SENSORS_LTC4245 is not set # CONFIG_SENSORS_LTC4260 is not set # CONFIG_SENSORS_LTC4261 is not set # CONFIG_SENSORS_MAX16065 is not set # CONFIG_SENSORS_MAX1619 is not set # CONFIG_SENSORS_MAX1668 is not set # CONFIG_SENSORS_MAX197 is not set # CONFIG_SENSORS_MAX6639 is not set # CONFIG_SENSORS_MAX6642 is not set # CONFIG_SENSORS_MAX6650 is not set # CONFIG_SENSORS_MAX6697 is not set # CONFIG_SENSORS_HTU21 is not set # CONFIG_SENSORS_MCP3021 is not set # CONFIG_SENSORS_LM63 is not set # CONFIG_SENSORS_LM73 is not set # CONFIG_SENSORS_LM75 is not set # CONFIG_SENSORS_LM77 is not set # CONFIG_SENSORS_LM78 is not set # CONFIG_SENSORS_LM80 is not set # CONFIG_SENSORS_LM83 is not set # CONFIG_SENSORS_LM85 is not set # CONFIG_SENSORS_LM87 is not set # CONFIG_SENSORS_LM90 is not set # CONFIG_SENSORS_LM92 is not set # CONFIG_SENSORS_LM93 is not set # CONFIG_SENSORS_LM95234 is not set # CONFIG_SENSORS_LM95241 is not set # CONFIG_SENSORS_LM95245 is not set # CONFIG_SENSORS_PC87360 is not set # CONFIG_SENSORS_PC87427 is not set # CONFIG_SENSORS_NTC_THERMISTOR is not set # CONFIG_SENSORS_NCT6683 is not set # CONFIG_SENSORS_NCT6775 is 
not set # CONFIG_SENSORS_NCT7802 is not set # CONFIG_SENSORS_PCF8591 is not set # CONFIG_PMBUS is not set # CONFIG_SENSORS_SHT21 is not set # CONFIG_SENSORS_SHTC1 is not set # CONFIG_SENSORS_SIS5595 is not set # CONFIG_SENSORS_DME1737 is not set # CONFIG_SENSORS_EMC1403 is not set # CONFIG_SENSORS_EMC2103 is not set # CONFIG_SENSORS_EMC6W201 is not set # CONFIG_SENSORS_SMSC47M1 is not set # CONFIG_SENSORS_SMSC47M192 is not set # CONFIG_SENSORS_SMSC47B397 is not set # CONFIG_SENSORS_SCH56XX_COMMON is not set # CONFIG_SENSORS_SMM665 is not set # CONFIG_SENSORS_ADC128D818 is not set # CONFIG_SENSORS_ADS1015 is not set # CONFIG_SENSORS_ADS7828 is not set # CONFIG_SENSORS_AMC6821 is not set # CONFIG_SENSORS_INA209 is not set # CONFIG_SENSORS_INA2XX is not set # CONFIG_SENSORS_THMC50 is not set # CONFIG_SENSORS_TMP102 is not set # CONFIG_SENSORS_TMP103 is not set # CONFIG_SENSORS_TMP401 is not set # CONFIG_SENSORS_TMP421 is not set # CONFIG_SENSORS_VIA_CPUTEMP is not set # CONFIG_SENSORS_VIA686A is not set # CONFIG_SENSORS_VT1211 is not set # CONFIG_SENSORS_VT8231 is not set # CONFIG_SENSORS_W83781D is not set # CONFIG_SENSORS_W83791D is not set # CONFIG_SENSORS_W83792D is not set # CONFIG_SENSORS_W83793 is not set # CONFIG_SENSORS_W83795 is not set # CONFIG_SENSORS_W83L785TS is not set # CONFIG_SENSORS_W83L786NG is not set # CONFIG_SENSORS_W83627HF is not set # CONFIG_SENSORS_W83627EHF is not set # # ACPI drivers # # CONFIG_SENSORS_ACPI_POWER is not set CONFIG_SENSORS_ATK0110=y CONFIG_THERMAL=y CONFIG_THERMAL_HWMON=y CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y # CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set # CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set # CONFIG_THERMAL_GOV_FAIR_SHARE is not set CONFIG_THERMAL_GOV_STEP_WISE=y # CONFIG_THERMAL_GOV_BANG_BANG is not set # CONFIG_THERMAL_GOV_USER_SPACE is not set # CONFIG_THERMAL_EMULATION is not set # CONFIG_INTEL_SOC_DTS_THERMAL is not set # CONFIG_INT340X_THERMAL is not set # # Texas Instruments thermal drivers # # 
CONFIG_WATCHDOG is not set CONFIG_SSB_POSSIBLE=y # # Sonics Silicon Backplane # # CONFIG_SSB is not set CONFIG_BCMA_POSSIBLE=y # # Broadcom specific AMBA # # CONFIG_BCMA is not set # # Multifunction device drivers # # CONFIG_MFD_CORE is not set # CONFIG_MFD_AS3711 is not set # CONFIG_PMIC_ADP5520 is not set # CONFIG_MFD_BCM590XX is not set # CONFIG_MFD_AXP20X is not set # CONFIG_MFD_CROS_EC is not set # CONFIG_PMIC_DA903X is not set # CONFIG_MFD_DA9052_I2C is not set # CONFIG_MFD_DA9055 is not set # CONFIG_MFD_DA9063 is not set # CONFIG_MFD_DA9150 is not set # CONFIG_MFD_DLN2 is not set # CONFIG_MFD_MC13XXX_I2C is not set # CONFIG_HTC_PASIC3 is not set # CONFIG_LPC_ICH is not set # CONFIG_LPC_SCH is not set # CONFIG_INTEL_SOC_PMIC is not set # CONFIG_MFD_JANZ_CMODIO is not set # CONFIG_MFD_KEMPLD is not set # CONFIG_MFD_88PM800 is not set # CONFIG_MFD_88PM805 is not set # CONFIG_MFD_88PM860X is not set # CONFIG_MFD_MAX14577 is not set # CONFIG_MFD_MAX77693 is not set # CONFIG_MFD_MAX8907 is not set # CONFIG_MFD_MAX8925 is not set # CONFIG_MFD_MAX8997 is not set # CONFIG_MFD_MAX8998 is not set # CONFIG_MFD_MENF21BMC is not set # CONFIG_MFD_VIPERBOARD is not set # CONFIG_MFD_RETU is not set # CONFIG_MFD_PCF50633 is not set # CONFIG_MFD_RDC321X is not set # CONFIG_MFD_RTSX_PCI is not set # CONFIG_MFD_RT5033 is not set # CONFIG_MFD_RTSX_USB is not set # CONFIG_MFD_RC5T583 is not set # CONFIG_MFD_RN5T618 is not set # CONFIG_MFD_SEC_CORE is not set # CONFIG_MFD_SI476X_CORE is not set # CONFIG_MFD_SM501 is not set # CONFIG_MFD_SMSC is not set # CONFIG_ABX500_CORE is not set # CONFIG_MFD_SYSCON is not set # CONFIG_MFD_TI_AM335X_TSCADC is not set # CONFIG_MFD_LP3943 is not set # CONFIG_MFD_LP8788 is not set # CONFIG_MFD_PALMAS is not set # CONFIG_TPS6105X is not set # CONFIG_TPS6507X is not set # CONFIG_MFD_TPS65090 is not set # CONFIG_MFD_TPS65217 is not set # CONFIG_MFD_TPS65218 is not set # CONFIG_MFD_TPS6586X is not set # CONFIG_MFD_TPS80031 is not set # 
CONFIG_TWL4030_CORE is not set # CONFIG_TWL6040_CORE is not set # CONFIG_MFD_WL1273_CORE is not set # CONFIG_MFD_LM3533 is not set # CONFIG_MFD_TC3589X is not set # CONFIG_MFD_TMIO is not set # CONFIG_MFD_VX855 is not set # CONFIG_MFD_ARIZONA_I2C is not set # CONFIG_MFD_WM8400 is not set # CONFIG_MFD_WM831X_I2C is not set # CONFIG_MFD_WM8350_I2C is not set # CONFIG_MFD_WM8994 is not set # CONFIG_REGULATOR is not set # CONFIG_MEDIA_SUPPORT is not set # # Graphics support # # CONFIG_AGP is not set # CONFIG_VGA_ARB is not set # CONFIG_VGA_SWITCHEROO is not set # # Direct Rendering Manager # CONFIG_DRM=y CONFIG_DRM_KMS_HELPER=y CONFIG_DRM_KMS_FB_HELPER=y # CONFIG_DRM_LOAD_EDID_FIRMWARE is not set CONFIG_DRM_TTM=y # # I2C encoder or helper chips # # CONFIG_DRM_I2C_ADV7511 is not set # CONFIG_DRM_I2C_CH7006 is not set # CONFIG_DRM_I2C_SIL164 is not set # CONFIG_DRM_I2C_NXP_TDA998X is not set # CONFIG_DRM_TDFX is not set # CONFIG_DRM_R128 is not set CONFIG_DRM_RADEON=y # CONFIG_DRM_RADEON_UMS is not set # CONFIG_DRM_NOUVEAU is not set # CONFIG_DRM_I915 is not set # CONFIG_DRM_MGA is not set # CONFIG_DRM_VIA is not set # CONFIG_DRM_SAVAGE is not set # CONFIG_DRM_VMWGFX is not set # CONFIG_DRM_GMA500 is not set # CONFIG_DRM_UDL is not set # CONFIG_DRM_AST is not set # CONFIG_DRM_MGAG200 is not set # CONFIG_DRM_CIRRUS_QEMU is not set # CONFIG_DRM_QXL is not set # CONFIG_DRM_BOCHS is not set # # Frame buffer Devices # CONFIG_FB=y # CONFIG_FIRMWARE_EDID is not set CONFIG_FB_CMDLINE=y # CONFIG_FB_DDC is not set # CONFIG_FB_BOOT_VESA_SUPPORT is not set CONFIG_FB_CFB_FILLRECT=y CONFIG_FB_CFB_COPYAREA=y CONFIG_FB_CFB_IMAGEBLIT=y # CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set # CONFIG_FB_SYS_FILLRECT is not set # CONFIG_FB_SYS_COPYAREA is not set # CONFIG_FB_SYS_IMAGEBLIT is not set # CONFIG_FB_FOREIGN_ENDIAN is not set # CONFIG_FB_SYS_FOPS is not set # CONFIG_FB_SVGALIB is not set # CONFIG_FB_MACMODES is not set # CONFIG_FB_BACKLIGHT is not set CONFIG_FB_MODE_HELPERS=y # 
CONFIG_FB_TILEBLITTING is not set # # Frame buffer hardware drivers # # CONFIG_FB_CIRRUS is not set # CONFIG_FB_PM2 is not set # CONFIG_FB_CYBER2000 is not set # CONFIG_FB_ARC is not set # CONFIG_FB_ASILIANT is not set # CONFIG_FB_IMSTT is not set # CONFIG_FB_VGA16 is not set # CONFIG_FB_VESA is not set # CONFIG_FB_N411 is not set # CONFIG_FB_HGA is not set # CONFIG_FB_OPENCORES is not set # CONFIG_FB_S1D13XXX is not set # CONFIG_FB_NVIDIA is not set # CONFIG_FB_RIVA is not set # CONFIG_FB_I740 is not set # CONFIG_FB_LE80578 is not set # CONFIG_FB_MATROX is not set # CONFIG_FB_RADEON is not set # CONFIG_FB_ATY128 is not set # CONFIG_FB_ATY is not set # CONFIG_FB_S3 is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set # CONFIG_FB_NEOMAGIC is not set # CONFIG_FB_KYRO is not set # CONFIG_FB_3DFX is not set # CONFIG_FB_VOODOO1 is not set # CONFIG_FB_VT8623 is not set # CONFIG_FB_TRIDENT is not set # CONFIG_FB_ARK is not set # CONFIG_FB_PM3 is not set # CONFIG_FB_CARMINE is not set # CONFIG_FB_SMSCUFX is not set # CONFIG_FB_UDL is not set # CONFIG_FB_VIRTUAL is not set # CONFIG_FB_METRONOME is not set # CONFIG_FB_MB862XX is not set # CONFIG_FB_BROADSHEET is not set # CONFIG_FB_AUO_K190X is not set # CONFIG_FB_SIMPLE is not set # CONFIG_BACKLIGHT_LCD_SUPPORT is not set CONFIG_BACKLIGHT_CLASS_DEVICE=y # CONFIG_VGASTATE is not set CONFIG_HDMI=y # # Console display driver support # CONFIG_VGA_CONSOLE=y CONFIG_VGACON_SOFT_SCROLLBACK=y CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64 CONFIG_DUMMY_CONSOLE=y CONFIG_DUMMY_CONSOLE_COLUMNS=80 CONFIG_DUMMY_CONSOLE_ROWS=25 CONFIG_FRAMEBUFFER_CONSOLE=y CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y # CONFIG_LOGO is not set CONFIG_SOUND=y CONFIG_SOUND_OSS_CORE=y CONFIG_SOUND_OSS_CORE_PRECLAIM=y CONFIG_SND=y CONFIG_SND_TIMER=y CONFIG_SND_PCM=y CONFIG_SND_HWDEP=y CONFIG_SND_RAWMIDI=y CONFIG_SND_SEQUENCER=y CONFIG_SND_SEQ_DUMMY=y CONFIG_SND_OSSEMUL=y CONFIG_SND_MIXER_OSS=y CONFIG_SND_PCM_OSS=y 
CONFIG_SND_PCM_OSS_PLUGINS=y CONFIG_SND_SEQUENCER_OSS=y CONFIG_SND_HRTIMER=y CONFIG_SND_SEQ_HRTIMER_DEFAULT=y CONFIG_SND_DYNAMIC_MINORS=y CONFIG_SND_MAX_CARDS=4 CONFIG_SND_SUPPORT_OLD_API=y # CONFIG_SND_VERBOSE_PROCFS is not set # CONFIG_SND_VERBOSE_PRINTK is not set # CONFIG_SND_DEBUG is not set CONFIG_SND_VMASTER=y CONFIG_SND_KCTL_JACK=y CONFIG_SND_DMA_SGBUF=y CONFIG_SND_RAWMIDI_SEQ=y # CONFIG_SND_OPL3_LIB_SEQ is not set # CONFIG_SND_OPL4_LIB_SEQ is not set # CONFIG_SND_SBAWE_SEQ is not set # CONFIG_SND_EMU10K1_SEQ is not set # CONFIG_SND_DRIVERS is not set CONFIG_SND_PCI=y # CONFIG_SND_AD1889 is not set # CONFIG_SND_ALS300 is not set # CONFIG_SND_ALS4000 is not set # CONFIG_SND_ALI5451 is not set # CONFIG_SND_ASIHPI is not set # CONFIG_SND_ATIIXP is not set # CONFIG_SND_ATIIXP_MODEM is not set # CONFIG_SND_AU8810 is not set # CONFIG_SND_AU8820 is not set # CONFIG_SND_AU8830 is not set # CONFIG_SND_AW2 is not set # CONFIG_SND_AZT3328 is not set # CONFIG_SND_BT87X is not set # CONFIG_SND_CA0106 is not set # CONFIG_SND_CMIPCI is not set # CONFIG_SND_OXYGEN is not set # CONFIG_SND_CS4281 is not set # CONFIG_SND_CS46XX is not set # CONFIG_SND_CTXFI is not set # CONFIG_SND_DARLA20 is not set # CONFIG_SND_GINA20 is not set # CONFIG_SND_LAYLA20 is not set # CONFIG_SND_DARLA24 is not set # CONFIG_SND_GINA24 is not set # CONFIG_SND_LAYLA24 is not set # CONFIG_SND_MONA is not set # CONFIG_SND_MIA is not set # CONFIG_SND_ECHO3G is not set # CONFIG_SND_INDIGO is not set # CONFIG_SND_INDIGOIO is not set # CONFIG_SND_INDIGODJ is not set # CONFIG_SND_INDIGOIOX is not set # CONFIG_SND_INDIGODJX is not set # CONFIG_SND_EMU10K1 is not set # CONFIG_SND_EMU10K1X is not set # CONFIG_SND_ENS1370 is not set # CONFIG_SND_ENS1371 is not set # CONFIG_SND_ES1938 is not set # CONFIG_SND_ES1968 is not set # CONFIG_SND_FM801 is not set # CONFIG_SND_HDSP is not set # CONFIG_SND_HDSPM is not set # CONFIG_SND_ICE1712 is not set # CONFIG_SND_ICE1724 is not set # CONFIG_SND_INTEL8X0 is not set # 
CONFIG_SND_INTEL8X0M is not set # CONFIG_SND_KORG1212 is not set # CONFIG_SND_LOLA is not set # CONFIG_SND_LX6464ES is not set # CONFIG_SND_MAESTRO3 is not set # CONFIG_SND_MIXART is not set # CONFIG_SND_NM256 is not set # CONFIG_SND_PCXHR is not set # CONFIG_SND_RIPTIDE is not set # CONFIG_SND_RME32 is not set # CONFIG_SND_RME96 is not set # CONFIG_SND_RME9652 is not set # CONFIG_SND_SE6X is not set # CONFIG_SND_SONICVIBES is not set # CONFIG_SND_TRIDENT is not set # CONFIG_SND_VIA82XX is not set # CONFIG_SND_VIA82XX_MODEM is not set # CONFIG_SND_VIRTUOSO is not set # CONFIG_SND_VX222 is not set # CONFIG_SND_YMFPCI is not set # # HD-Audio # CONFIG_SND_HDA=y CONFIG_SND_HDA_INTEL=y CONFIG_SND_HDA_PREALLOC_SIZE=64 # CONFIG_SND_HDA_HWDEP is not set # CONFIG_SND_HDA_RECONFIG is not set # CONFIG_SND_HDA_INPUT_BEEP is not set # CONFIG_SND_HDA_INPUT_JACK is not set # CONFIG_SND_HDA_PATCH_LOADER is not set # CONFIG_SND_HDA_CODEC_REALTEK is not set # CONFIG_SND_HDA_CODEC_ANALOG is not set # CONFIG_SND_HDA_CODEC_SIGMATEL is not set CONFIG_SND_HDA_CODEC_VIA=y # CONFIG_SND_HDA_CODEC_HDMI is not set # CONFIG_SND_HDA_CODEC_CIRRUS is not set # CONFIG_SND_HDA_CODEC_CONEXANT is not set # CONFIG_SND_HDA_CODEC_CA0110 is not set # CONFIG_SND_HDA_CODEC_CA0132 is not set # CONFIG_SND_HDA_CODEC_CMEDIA is not set # CONFIG_SND_HDA_CODEC_SI3054 is not set CONFIG_SND_HDA_GENERIC=y CONFIG_SND_HDA_POWER_SAVE_DEFAULT=60 CONFIG_SND_USB=y CONFIG_SND_USB_AUDIO=y # CONFIG_SND_USB_UA101 is not set # CONFIG_SND_USB_USX2Y is not set # CONFIG_SND_USB_CAIAQ is not set # CONFIG_SND_USB_US122L is not set # CONFIG_SND_USB_6FIRE is not set # CONFIG_SND_USB_HIFACE is not set # CONFIG_SND_BCD2000 is not set # CONFIG_SND_USB_POD is not set # CONFIG_SND_USB_PODHD is not set # CONFIG_SND_USB_TONEPORT is not set # CONFIG_SND_USB_VARIAX is not set # CONFIG_SND_SOC is not set # CONFIG_SOUND_PRIME is not set # # HID support # CONFIG_HID=y # CONFIG_HID_BATTERY_STRENGTH is not set CONFIG_HIDRAW=y # CONFIG_UHID is not 
set CONFIG_HID_GENERIC=y # # Special HID drivers # # CONFIG_HID_A4TECH is not set # CONFIG_HID_ACRUX is not set # CONFIG_HID_APPLE is not set # CONFIG_HID_APPLEIR is not set # CONFIG_HID_AUREAL is not set # CONFIG_HID_BELKIN is not set # CONFIG_HID_BETOP_FF is not set # CONFIG_HID_CHERRY is not set # CONFIG_HID_CHICONY is not set # CONFIG_HID_PRODIKEYS is not set # CONFIG_HID_CYPRESS is not set # CONFIG_HID_DRAGONRISE is not set # CONFIG_HID_EMS_FF is not set # CONFIG_HID_ELECOM is not set # CONFIG_HID_ELO is not set # CONFIG_HID_EZKEY is not set # CONFIG_HID_HOLTEK is not set # CONFIG_HID_HUION is not set # CONFIG_HID_KEYTOUCH is not set # CONFIG_HID_KYE is not set # CONFIG_HID_UCLOGIC is not set # CONFIG_HID_WALTOP is not set # CONFIG_HID_GYRATION is not set # CONFIG_HID_ICADE is not set # CONFIG_HID_TWINHAN is not set # CONFIG_HID_KENSINGTON is not set # CONFIG_HID_LCPOWER is not set # CONFIG_HID_LENOVO is not set CONFIG_HID_LOGITECH=y CONFIG_HID_LOGITECH_DJ=y CONFIG_HID_LOGITECH_HIDPP=y # CONFIG_LOGITECH_FF is not set # CONFIG_LOGIRUMBLEPAD2_FF is not set # CONFIG_LOGIG940_FF is not set # CONFIG_LOGIWHEELS_FF is not set # CONFIG_HID_MAGICMOUSE is not set # CONFIG_HID_MICROSOFT is not set # CONFIG_HID_MONTEREY is not set # CONFIG_HID_MULTITOUCH is not set # CONFIG_HID_NTRIG is not set # CONFIG_HID_ORTEK is not set # CONFIG_HID_PANTHERLORD is not set # CONFIG_HID_PENMOUNT is not set # CONFIG_HID_PETALYNX is not set # CONFIG_HID_PICOLCD is not set # CONFIG_HID_PLANTRONICS is not set # CONFIG_HID_PRIMAX is not set # CONFIG_HID_ROCCAT is not set # CONFIG_HID_SAITEK is not set # CONFIG_HID_SAMSUNG is not set # CONFIG_HID_SPEEDLINK is not set # CONFIG_HID_STEELSERIES is not set # CONFIG_HID_SUNPLUS is not set # CONFIG_HID_RMI is not set # CONFIG_HID_GREENASIA is not set # CONFIG_HID_SMARTJOYPLUS is not set # CONFIG_HID_TIVO is not set # CONFIG_HID_TOPSEED is not set # CONFIG_HID_THRUSTMASTER is not set # CONFIG_HID_WACOM is not set # CONFIG_HID_XINMO is not set # 
CONFIG_HID_ZEROPLUS is not set # CONFIG_HID_ZYDACRON is not set # CONFIG_HID_SENSOR_HUB is not set # # USB HID support # CONFIG_USB_HID=y # CONFIG_HID_PID is not set CONFIG_USB_HIDDEV=y # # I2C HID support # # CONFIG_I2C_HID is not set CONFIG_USB_OHCI_LITTLE_ENDIAN=y CONFIG_USB_SUPPORT=y CONFIG_USB_COMMON=y CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB=y # CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set # # Miscellaneous USB options # CONFIG_USB_DEFAULT_PERSIST=y # CONFIG_USB_DYNAMIC_MINORS is not set # CONFIG_USB_OTG is not set # CONFIG_USB_OTG_WHITELIST is not set # CONFIG_USB_OTG_BLACKLIST_HUB is not set # CONFIG_USB_OTG_FSM is not set CONFIG_USB_MON=y # CONFIG_USB_WUSB_CBAF is not set # # USB Host Controller Drivers # # CONFIG_USB_C67X00_HCD is not set # CONFIG_USB_XHCI_HCD is not set CONFIG_USB_EHCI_HCD=y # CONFIG_USB_EHCI_ROOT_HUB_TT is not set # CONFIG_USB_EHCI_TT_NEWSCHED is not set CONFIG_USB_EHCI_PCI=y # CONFIG_USB_EHCI_HCD_PLATFORM is not set # CONFIG_USB_OXU210HP_HCD is not set # CONFIG_USB_ISP116X_HCD is not set # CONFIG_USB_ISP1362_HCD is not set # CONFIG_USB_FUSBH200_HCD is not set # CONFIG_USB_FOTG210_HCD is not set CONFIG_USB_OHCI_HCD=y CONFIG_USB_OHCI_HCD_PCI=y # CONFIG_USB_OHCI_HCD_PLATFORM is not set # CONFIG_USB_UHCI_HCD is not set # CONFIG_USB_SL811_HCD is not set # CONFIG_USB_R8A66597_HCD is not set # CONFIG_USB_HCD_TEST_MODE is not set # # USB Device Class drivers # # CONFIG_USB_ACM is not set CONFIG_USB_PRINTER=y # CONFIG_USB_WDM is not set # CONFIG_USB_TMC is not set # # NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may # # # also be needed; see USB_STORAGE Help for more info # CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_REALTEK is not set # CONFIG_USB_STORAGE_DATAFAB is not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_ISD200 is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set # CONFIG_USB_STORAGE_JUMPSHOT is not set # 
CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_ONETOUCH is not set # CONFIG_USB_STORAGE_KARMA is not set # CONFIG_USB_STORAGE_CYPRESS_ATACB is not set # CONFIG_USB_STORAGE_ENE_UB6250 is not set # CONFIG_USB_UAS is not set # # USB Imaging devices # # CONFIG_USB_MDC800 is not set # CONFIG_USB_MICROTEK is not set # CONFIG_USBIP_CORE is not set # CONFIG_USB_MUSB_HDRC is not set # CONFIG_USB_DWC3 is not set # CONFIG_USB_DWC2 is not set # CONFIG_USB_CHIPIDEA is not set # CONFIG_USB_ISP1760 is not set # # USB port drivers # # CONFIG_USB_SERIAL is not set # # USB Miscellaneous drivers # # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set # CONFIG_USB_SEVSEG is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set # CONFIG_USB_LED is not set # CONFIG_USB_CYPRESS_CY7C63 is not set # CONFIG_USB_CYTHERM is not set # CONFIG_USB_IDMOUSE is not set # CONFIG_USB_FTDI_ELAN is not set # CONFIG_USB_APPLEDISPLAY is not set # CONFIG_USB_SISUSBVGA is not set # CONFIG_USB_LD is not set # CONFIG_USB_TRANCEVIBRATOR is not set # CONFIG_USB_IOWARRIOR is not set # CONFIG_USB_TEST is not set # CONFIG_USB_EHSET_TEST_FIXTURE is not set # CONFIG_USB_ISIGHTFW is not set # CONFIG_USB_YUREX is not set # CONFIG_USB_EZUSB_FX2 is not set # CONFIG_USB_HSIC_USB3503 is not set # CONFIG_USB_LINK_LAYER_TEST is not set # # USB Physical Layer drivers # # CONFIG_USB_PHY is not set # CONFIG_NOP_USB_XCEIV is not set # CONFIG_USB_ISP1301 is not set # CONFIG_USB_GADGET is not set # CONFIG_UWB is not set # CONFIG_MMC is not set # CONFIG_MEMSTICK is not set # CONFIG_NEW_LEDS is not set # CONFIG_ACCESSIBILITY is not set # CONFIG_INFINIBAND is not set CONFIG_EDAC=y CONFIG_EDAC_LEGACY_SYSFS=y # CONFIG_EDAC_DEBUG is not set CONFIG_EDAC_DECODE_MCE=y # CONFIG_EDAC_MCE_INJ is not set CONFIG_EDAC_MM_EDAC=y CONFIG_EDAC_AMD64=y # CONFIG_EDAC_AMD64_ERROR_INJECTION is not set # CONFIG_EDAC_E752X is not set # CONFIG_EDAC_I82975X is 
not set # CONFIG_EDAC_I3000 is not set # CONFIG_EDAC_I3200 is not set # CONFIG_EDAC_IE31200 is not set # CONFIG_EDAC_X38 is not set # CONFIG_EDAC_I5400 is not set # CONFIG_EDAC_I5000 is not set # CONFIG_EDAC_I5100 is not set # CONFIG_EDAC_I7300 is not set CONFIG_RTC_LIB=y CONFIG_RTC_CLASS=y CONFIG_RTC_HCTOSYS=y CONFIG_RTC_SYSTOHC=y CONFIG_RTC_HCTOSYS_DEVICE="rtc0" # CONFIG_RTC_DEBUG is not set # # RTC interfaces # CONFIG_RTC_INTF_SYSFS=y CONFIG_RTC_INTF_PROC=y CONFIG_RTC_INTF_DEV=y # CONFIG_RTC_INTF_DEV_UIE_EMUL is not set # CONFIG_RTC_DRV_TEST is not set # # I2C RTC drivers # # CONFIG_RTC_DRV_ABB5ZES3 is not set # CONFIG_RTC_DRV_DS1307 is not set # CONFIG_RTC_DRV_DS1374 is not set # CONFIG_RTC_DRV_DS1672 is not set # CONFIG_RTC_DRV_DS3232 is not set # CONFIG_RTC_DRV_MAX6900 is not set # CONFIG_RTC_DRV_RS5C372 is not set # CONFIG_RTC_DRV_ISL1208 is not set # CONFIG_RTC_DRV_ISL12022 is not set # CONFIG_RTC_DRV_ISL12057 is not set # CONFIG_RTC_DRV_X1205 is not set # CONFIG_RTC_DRV_PCF2127 is not set # CONFIG_RTC_DRV_PCF8523 is not set # CONFIG_RTC_DRV_PCF8563 is not set # CONFIG_RTC_DRV_PCF85063 is not set # CONFIG_RTC_DRV_PCF8583 is not set # CONFIG_RTC_DRV_M41T80 is not set # CONFIG_RTC_DRV_BQ32K is not set # CONFIG_RTC_DRV_S35390A is not set # CONFIG_RTC_DRV_FM3130 is not set # CONFIG_RTC_DRV_RX8581 is not set # CONFIG_RTC_DRV_RX8025 is not set # CONFIG_RTC_DRV_EM3027 is not set # CONFIG_RTC_DRV_RV3029C2 is not set # # SPI RTC drivers # # # Platform RTC drivers # CONFIG_RTC_DRV_CMOS=y # CONFIG_RTC_DRV_DS1286 is not set # CONFIG_RTC_DRV_DS1511 is not set # CONFIG_RTC_DRV_DS1553 is not set # CONFIG_RTC_DRV_DS1685_FAMILY is not set # CONFIG_RTC_DRV_DS1742 is not set # CONFIG_RTC_DRV_DS2404 is not set # CONFIG_RTC_DRV_STK17TA8 is not set # CONFIG_RTC_DRV_M48T86 is not set # CONFIG_RTC_DRV_M48T35 is not set # CONFIG_RTC_DRV_M48T59 is not set # CONFIG_RTC_DRV_MSM6242 is not set # CONFIG_RTC_DRV_BQ4802 is not set # CONFIG_RTC_DRV_RP5C01 is not set # CONFIG_RTC_DRV_V3020 
is not set # # on-CPU RTC drivers # # CONFIG_RTC_DRV_XGENE is not set # # HID Sensor RTC drivers # # CONFIG_RTC_DRV_HID_SENSOR_TIME is not set # CONFIG_DMADEVICES is not set # CONFIG_AUXDISPLAY is not set # CONFIG_UIO is not set # CONFIG_VIRT_DRIVERS is not set CONFIG_VIRTIO=y # # Virtio drivers # CONFIG_VIRTIO_PCI=y CONFIG_VIRTIO_PCI_LEGACY=y # CONFIG_VIRTIO_BALLOON is not set # CONFIG_VIRTIO_MMIO is not set # # Microsoft Hyper-V guest support # # CONFIG_STAGING is not set # CONFIG_X86_PLATFORM_DEVICES is not set # CONFIG_CHROME_PLATFORMS is not set # # Hardware Spinlock drivers # # # Clock Source drivers # CONFIG_CLKEVT_I8253=y CONFIG_CLKBLD_I8253=y # CONFIG_ATMEL_PIT is not set # CONFIG_SH_TIMER_CMT is not set # CONFIG_SH_TIMER_MTU2 is not set # CONFIG_SH_TIMER_TMU is not set # CONFIG_EM_TIMER_STI is not set # CONFIG_MAILBOX is not set # CONFIG_IOMMU_SUPPORT is not set # # Remoteproc drivers # # CONFIG_STE_MODEM_RPROC is not set # # Rpmsg drivers # # # SOC (System On Chip) specific Drivers # # CONFIG_SOC_TI is not set # CONFIG_PM_DEVFREQ is not set # CONFIG_EXTCON is not set # CONFIG_MEMORY is not set # CONFIG_IIO is not set # CONFIG_NTB is not set # CONFIG_VME_BUS is not set # CONFIG_PWM is not set # CONFIG_IPACK_BUS is not set # CONFIG_RESET_CONTROLLER is not set # CONFIG_FMC is not set # # PHY Subsystem # # CONFIG_GENERIC_PHY is not set # CONFIG_BCM_KONA_USB2_PHY is not set # CONFIG_POWERCAP is not set # CONFIG_MCB is not set CONFIG_RAS=y # CONFIG_THUNDERBOLT is not set # # Android # # CONFIG_ANDROID is not set # # Firmware Drivers # # CONFIG_EDD is not set CONFIG_FIRMWARE_MEMMAP=y # CONFIG_DELL_RBU is not set # CONFIG_DCDBAS is not set # CONFIG_DMIID is not set # CONFIG_DMI_SYSFS is not set CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y # CONFIG_ISCSI_IBFT_FIND is not set # CONFIG_GOOGLE_FIRMWARE is not set # # File systems # CONFIG_DCACHE_WORD_ACCESS=y # CONFIG_EXT2_FS is not set # CONFIG_EXT3_FS is not set CONFIG_EXT4_FS=y # CONFIG_EXT4_USE_FOR_EXT23 is not 
set # CONFIG_EXT4_FS_POSIX_ACL is not set # CONFIG_EXT4_FS_SECURITY is not set # CONFIG_EXT4_DEBUG is not set CONFIG_JBD2=y # CONFIG_JBD2_DEBUG is not set CONFIG_FS_MBCACHE=y # CONFIG_REISERFS_FS is not set # CONFIG_JFS_FS is not set # CONFIG_XFS_FS is not set # CONFIG_GFS2_FS is not set CONFIG_BTRFS_FS=y # CONFIG_BTRFS_FS_POSIX_ACL is not set # CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set # CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set # CONFIG_BTRFS_DEBUG is not set # CONFIG_BTRFS_ASSERT is not set # CONFIG_NILFS2_FS is not set # CONFIG_FS_DAX is not set # CONFIG_FS_POSIX_ACL is not set CONFIG_FILE_LOCKING=y CONFIG_FSNOTIFY=y CONFIG_DNOTIFY=y CONFIG_INOTIFY_USER=y # CONFIG_FANOTIFY is not set # CONFIG_QUOTA is not set # CONFIG_QUOTACTL is not set # CONFIG_AUTOFS4_FS is not set # CONFIG_FUSE_FS is not set # CONFIG_OVERLAY_FS is not set # # Caches # # CONFIG_FSCACHE is not set # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=y CONFIG_JOLIET=y # CONFIG_ZISOFS is not set CONFIG_UDF_FS=y CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=y CONFIG_MSDOS_FS=y CONFIG_VFAT_FS=y CONFIG_FAT_DEFAULT_CODEPAGE=437 CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1" # CONFIG_NTFS_FS is not set # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_PROC_PAGE_MONITOR=y CONFIG_KERNFS=y CONFIG_SYSFS=y CONFIG_TMPFS=y # CONFIG_TMPFS_POSIX_ACL is not set # CONFIG_TMPFS_XATTR is not set # CONFIG_HUGETLBFS is not set # CONFIG_HUGETLB_PAGE is not set # CONFIG_CONFIGFS_FS is not set # CONFIG_MISC_FILESYSTEMS is not set CONFIG_NETWORK_FILESYSTEMS=y # CONFIG_NFS_FS is not set # CONFIG_NFSD is not set # CONFIG_CEPH_FS is not set # CONFIG_CIFS is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set CONFIG_9P_FS=y # CONFIG_9P_FS_POSIX_ACL is not set # CONFIG_9P_FS_SECURITY is not set CONFIG_NLS=y CONFIG_NLS_DEFAULT="iso8859-1" CONFIG_NLS_CODEPAGE_437=y # CONFIG_NLS_CODEPAGE_737 is not set # CONFIG_NLS_CODEPAGE_775 is not set # 
CONFIG_NLS_CODEPAGE_850 is not set # CONFIG_NLS_CODEPAGE_852 is not set # CONFIG_NLS_CODEPAGE_855 is not set # CONFIG_NLS_CODEPAGE_857 is not set # CONFIG_NLS_CODEPAGE_860 is not set # CONFIG_NLS_CODEPAGE_861 is not set # CONFIG_NLS_CODEPAGE_862 is not set # CONFIG_NLS_CODEPAGE_863 is not set # CONFIG_NLS_CODEPAGE_864 is not set # CONFIG_NLS_CODEPAGE_865 is not set # CONFIG_NLS_CODEPAGE_866 is not set # CONFIG_NLS_CODEPAGE_869 is not set # CONFIG_NLS_CODEPAGE_936 is not set # CONFIG_NLS_CODEPAGE_950 is not set # CONFIG_NLS_CODEPAGE_932 is not set # CONFIG_NLS_CODEPAGE_949 is not set # CONFIG_NLS_CODEPAGE_874 is not set # CONFIG_NLS_ISO8859_8 is not set # CONFIG_NLS_CODEPAGE_1250 is not set # CONFIG_NLS_CODEPAGE_1251 is not set CONFIG_NLS_ASCII=y CONFIG_NLS_ISO8859_1=y # CONFIG_NLS_ISO8859_2 is not set # CONFIG_NLS_ISO8859_3 is not set # CONFIG_NLS_ISO8859_4 is not set # CONFIG_NLS_ISO8859_5 is not set # CONFIG_NLS_ISO8859_6 is not set # CONFIG_NLS_ISO8859_7 is not set # CONFIG_NLS_ISO8859_9 is not set # CONFIG_NLS_ISO8859_13 is not set # CONFIG_NLS_ISO8859_14 is not set CONFIG_NLS_ISO8859_15=y # CONFIG_NLS_KOI8_R is not set # CONFIG_NLS_KOI8_U is not set # CONFIG_NLS_MAC_ROMAN is not set # CONFIG_NLS_MAC_CELTIC is not set # CONFIG_NLS_MAC_CENTEURO is not set # CONFIG_NLS_MAC_CROATIAN is not set # CONFIG_NLS_MAC_CYRILLIC is not set # CONFIG_NLS_MAC_GAELIC is not set # CONFIG_NLS_MAC_GREEK is not set # CONFIG_NLS_MAC_ICELAND is not set # CONFIG_NLS_MAC_INUIT is not set # CONFIG_NLS_MAC_ROMANIAN is not set # CONFIG_NLS_MAC_TURKISH is not set CONFIG_NLS_UTF8=y # # Kernel hacking # CONFIG_TRACE_IRQFLAGS_SUPPORT=y # # printk and dmesg options # # CONFIG_PRINTK_TIME is not set CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4 # CONFIG_BOOT_PRINTK_DELAY is not set # CONFIG_DYNAMIC_DEBUG is not set # # Compile-time checks and compiler options # # CONFIG_DEBUG_INFO is not set # CONFIG_ENABLE_WARN_DEPRECATED is not set # CONFIG_ENABLE_MUST_CHECK is not set CONFIG_FRAME_WARN=1024 # 
CONFIG_STRIP_ASM_SYMS is not set # CONFIG_UNUSED_SYMBOLS is not set # CONFIG_PAGE_OWNER is not set CONFIG_DEBUG_FS=y # CONFIG_HEADERS_CHECK is not set # CONFIG_DEBUG_SECTION_MISMATCH is not set CONFIG_ARCH_WANT_FRAME_POINTERS=y # CONFIG_FRAME_POINTER is not set # CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set CONFIG_MAGIC_SYSRQ=y CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1 CONFIG_DEBUG_KERNEL=y # # Memory Debugging # # CONFIG_PAGE_EXTENSION is not set # CONFIG_DEBUG_PAGEALLOC is not set # CONFIG_DEBUG_OBJECTS is not set # CONFIG_SLUB_STATS is not set CONFIG_HAVE_DEBUG_KMEMLEAK=y # CONFIG_DEBUG_KMEMLEAK is not set # CONFIG_DEBUG_STACK_USAGE is not set # CONFIG_DEBUG_VM is not set # CONFIG_DEBUG_VIRTUAL is not set # CONFIG_DEBUG_MEMORY_INIT is not set # CONFIG_DEBUG_PER_CPU_MAPS is not set CONFIG_HAVE_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_STACKOVERFLOW is not set CONFIG_HAVE_ARCH_KMEMCHECK=y # CONFIG_KMEMCHECK is not set CONFIG_HAVE_ARCH_KASAN=y CONFIG_KASAN_SHADOW_OFFSET=0xdffffc0000000000 # CONFIG_DEBUG_SHIRQ is not set # # Debug Lockups and Hangs # # CONFIG_LOCKUP_DETECTOR is not set # CONFIG_DETECT_HUNG_TASK is not set # CONFIG_PANIC_ON_OOPS is not set CONFIG_PANIC_ON_OOPS_VALUE=0 CONFIG_PANIC_TIMEOUT=0 # CONFIG_SCHED_DEBUG is not set # CONFIG_SCHEDSTATS is not set # CONFIG_SCHED_STACK_END_CHECK is not set # CONFIG_TIMER_STATS is not set # # Lock Debugging (spinlocks, mutexes, etc...) 
# # CONFIG_DEBUG_RT_MUTEXES is not set # CONFIG_DEBUG_SPINLOCK is not set # CONFIG_DEBUG_MUTEXES is not set # CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set # CONFIG_DEBUG_LOCK_ALLOC is not set # CONFIG_PROVE_LOCKING is not set # CONFIG_LOCK_STAT is not set # CONFIG_DEBUG_ATOMIC_SLEEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set # CONFIG_LOCK_TORTURE_TEST is not set # CONFIG_STACKTRACE is not set # CONFIG_DEBUG_KOBJECT is not set CONFIG_DEBUG_BUGVERBOSE=y # CONFIG_DEBUG_LIST is not set # CONFIG_DEBUG_PI_LIST is not set # CONFIG_DEBUG_SG is not set # CONFIG_DEBUG_NOTIFIERS is not set # CONFIG_DEBUG_CREDENTIALS is not set # # RCU Debugging # # CONFIG_SPARSE_RCU_POINTER is not set # CONFIG_TORTURE_TEST is not set # CONFIG_RCU_TORTURE_TEST is not set CONFIG_RCU_CPU_STALL_TIMEOUT=60 # CONFIG_RCU_CPU_STALL_INFO is not set # CONFIG_RCU_TRACE is not set # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set # CONFIG_NOTIFIER_ERROR_INJECTION is not set # CONFIG_FAULT_INJECTION is not set # CONFIG_LATENCYTOP is not set CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y # CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set CONFIG_USER_STACKTRACE_SUPPORT=y CONFIG_HAVE_FUNCTION_TRACER=y CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y CONFIG_HAVE_DYNAMIC_FTRACE=y CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y CONFIG_HAVE_SYSCALL_TRACEPOINTS=y CONFIG_HAVE_FENTRY=y CONFIG_HAVE_C_RECORDMCOUNT=y CONFIG_TRACING_SUPPORT=y # CONFIG_FTRACE is not set # # Runtime Testing # # CONFIG_LKDTM is not set # CONFIG_TEST_LIST_SORT is not set # CONFIG_BACKTRACE_SELF_TEST is not set # CONFIG_RBTREE_TEST is not set # CONFIG_ATOMIC64_SELFTEST is not set # CONFIG_TEST_HEXDUMP is not set # CONFIG_TEST_STRING_HELPERS is not set # CONFIG_TEST_KSTRTOX is not set # CONFIG_TEST_RHASHTABLE is not set # CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set # CONFIG_DMA_API_DEBUG is not set # CONFIG_TEST_FIRMWARE is not set # CONFIG_TEST_UDELAY is not set # CONFIG_SAMPLES is not 
set CONFIG_HAVE_ARCH_KGDB=y # CONFIG_KGDB is not set CONFIG_STRICT_DEVMEM=y # CONFIG_X86_VERBOSE_BOOTUP is not set # CONFIG_EARLY_PRINTK is not set # CONFIG_X86_PTDUMP is not set CONFIG_DEBUG_RODATA=y # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DOUBLEFAULT=y # CONFIG_DEBUG_TLBFLUSH is not set # CONFIG_IOMMU_STRESS is not set CONFIG_HAVE_MMIOTRACE_SUPPORT=y CONFIG_IO_DELAY_TYPE_0X80=0 CONFIG_IO_DELAY_TYPE_0XED=1 CONFIG_IO_DELAY_TYPE_UDELAY=2 CONFIG_IO_DELAY_TYPE_NONE=3 CONFIG_IO_DELAY_0X80=y # CONFIG_IO_DELAY_0XED is not set # CONFIG_IO_DELAY_UDELAY is not set # CONFIG_IO_DELAY_NONE is not set CONFIG_DEFAULT_IO_DELAY_TYPE=0 # CONFIG_DEBUG_BOOT_PARAMS is not set # CONFIG_CPA_DEBUG is not set CONFIG_OPTIMIZE_INLINING=y # CONFIG_DEBUG_NMI_SELFTEST is not set # CONFIG_X86_DEBUG_STATIC_CPU_HAS is not set # # Security options # # CONFIG_KEYS is not set # CONFIG_SECURITY_DMESG_RESTRICT is not set # CONFIG_SECURITY is not set # CONFIG_SECURITYFS is not set CONFIG_DEFAULT_SECURITY_DAC=y CONFIG_DEFAULT_SECURITY="" CONFIG_XOR_BLOCKS=y CONFIG_CRYPTO=y # # Crypto core or helper # CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_ALGAPI2=y CONFIG_CRYPTO_AEAD2=y CONFIG_CRYPTO_BLKCIPHER2=y CONFIG_CRYPTO_HASH=y CONFIG_CRYPTO_HASH2=y CONFIG_CRYPTO_RNG2=y CONFIG_CRYPTO_PCOMP2=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_MANAGER2=y # CONFIG_CRYPTO_USER is not set CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y # CONFIG_CRYPTO_GF128MUL is not set # CONFIG_CRYPTO_NULL is not set # CONFIG_CRYPTO_PCRYPT is not set CONFIG_CRYPTO_WORKQUEUE=y # CONFIG_CRYPTO_CRYPTD is not set # CONFIG_CRYPTO_MCRYPTD is not set # CONFIG_CRYPTO_AUTHENC is not set # # Authenticated Encryption with Associated Data # # CONFIG_CRYPTO_CCM is not set # CONFIG_CRYPTO_GCM is not set # CONFIG_CRYPTO_SEQIV is not set # # Block modes # # CONFIG_CRYPTO_CBC is not set # CONFIG_CRYPTO_CTR is not set # CONFIG_CRYPTO_CTS is not set # CONFIG_CRYPTO_ECB is not set # CONFIG_CRYPTO_LRW is not set # CONFIG_CRYPTO_PCBC is not set # CONFIG_CRYPTO_XTS is not 
set # # Hash modes # # CONFIG_CRYPTO_CMAC is not set # CONFIG_CRYPTO_HMAC is not set # CONFIG_CRYPTO_XCBC is not set # CONFIG_CRYPTO_VMAC is not set # # Digest # CONFIG_CRYPTO_CRC32C=y # CONFIG_CRYPTO_CRC32C_INTEL is not set # CONFIG_CRYPTO_CRC32 is not set # CONFIG_CRYPTO_CRC32_PCLMUL is not set # CONFIG_CRYPTO_CRCT10DIF is not set # CONFIG_CRYPTO_GHASH is not set # CONFIG_CRYPTO_MD4 is not set # CONFIG_CRYPTO_MD5 is not set # CONFIG_CRYPTO_MICHAEL_MIC is not set # CONFIG_CRYPTO_RMD128 is not set # CONFIG_CRYPTO_RMD160 is not set # CONFIG_CRYPTO_RMD256 is not set # CONFIG_CRYPTO_RMD320 is not set # CONFIG_CRYPTO_SHA1 is not set # CONFIG_CRYPTO_SHA1_SSSE3 is not set # CONFIG_CRYPTO_SHA256_SSSE3 is not set # CONFIG_CRYPTO_SHA512_SSSE3 is not set # CONFIG_CRYPTO_SHA1_MB is not set # CONFIG_CRYPTO_SHA256 is not set # CONFIG_CRYPTO_SHA512 is not set # CONFIG_CRYPTO_TGR192 is not set # CONFIG_CRYPTO_WP512 is not set # CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set # # Ciphers # CONFIG_CRYPTO_AES=y # CONFIG_CRYPTO_AES_X86_64 is not set # CONFIG_CRYPTO_AES_NI_INTEL is not set # CONFIG_CRYPTO_ANUBIS is not set # CONFIG_CRYPTO_ARC4 is not set # CONFIG_CRYPTO_BLOWFISH is not set # CONFIG_CRYPTO_BLOWFISH_X86_64 is not set # CONFIG_CRYPTO_CAMELLIA is not set # CONFIG_CRYPTO_CAMELLIA_X86_64 is not set # CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set # CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set # CONFIG_CRYPTO_CAST5 is not set # CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set # CONFIG_CRYPTO_CAST6 is not set # CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set # CONFIG_CRYPTO_DES is not set # CONFIG_CRYPTO_DES3_EDE_X86_64 is not set # CONFIG_CRYPTO_FCRYPT is not set # CONFIG_CRYPTO_KHAZAD is not set # CONFIG_CRYPTO_SALSA20 is not set # CONFIG_CRYPTO_SALSA20_X86_64 is not set # CONFIG_CRYPTO_SEED is not set # CONFIG_CRYPTO_SERPENT is not set # CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set # CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set # CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set # 
CONFIG_CRYPTO_TEA is not set # CONFIG_CRYPTO_TWOFISH is not set # CONFIG_CRYPTO_TWOFISH_X86_64 is not set # CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set # CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set # # Compression # # CONFIG_CRYPTO_DEFLATE is not set # CONFIG_CRYPTO_ZLIB is not set # CONFIG_CRYPTO_LZO is not set # CONFIG_CRYPTO_LZ4 is not set # CONFIG_CRYPTO_LZ4HC is not set # # Random Number Generation # # CONFIG_CRYPTO_ANSI_CPRNG is not set # CONFIG_CRYPTO_DRBG_MENU is not set # CONFIG_CRYPTO_USER_API_HASH is not set # CONFIG_CRYPTO_USER_API_SKCIPHER is not set # CONFIG_CRYPTO_USER_API_RNG is not set # CONFIG_CRYPTO_HW is not set CONFIG_HAVE_KVM=y CONFIG_HAVE_KVM_IRQCHIP=y CONFIG_HAVE_KVM_IRQFD=y CONFIG_HAVE_KVM_IRQ_ROUTING=y CONFIG_HAVE_KVM_EVENTFD=y CONFIG_KVM_APIC_ARCHITECTURE=y CONFIG_KVM_MMIO=y CONFIG_KVM_ASYNC_PF=y CONFIG_HAVE_KVM_MSI=y CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y CONFIG_KVM_VFIO=y CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y CONFIG_VIRTUALIZATION=y CONFIG_KVM=y CONFIG_KVM_AMD=y # CONFIG_BINARY_PRINTF is not set # # Library routines # CONFIG_RAID6_PQ=y CONFIG_BITREVERSE=y # CONFIG_HAVE_ARCH_BITREVERSE is not set CONFIG_GENERIC_STRNCPY_FROM_USER=y CONFIG_GENERIC_STRNLEN_USER=y CONFIG_GENERIC_NET_UTILS=y CONFIG_GENERIC_FIND_FIRST_BIT=y CONFIG_GENERIC_PCI_IOMAP=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_IO=y CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y CONFIG_ARCH_HAS_FAST_MULTIPLIER=y # CONFIG_CRC_CCITT is not set CONFIG_CRC16=y # CONFIG_CRC_T10DIF is not set CONFIG_CRC_ITU_T=y CONFIG_CRC32=y # CONFIG_CRC32_SELFTEST is not set CONFIG_CRC32_SLICEBY8=y # CONFIG_CRC32_SLICEBY4 is not set # CONFIG_CRC32_SARWATE is not set # CONFIG_CRC32_BIT is not set # CONFIG_CRC7 is not set # CONFIG_LIBCRC32C is not set # CONFIG_CRC8 is not set # CONFIG_AUDIT_ARCH_COMPAT_GENERIC is not set # CONFIG_RANDOM32_SELFTEST is not set CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=y CONFIG_LZO_COMPRESS=y CONFIG_LZO_DECOMPRESS=y # CONFIG_XZ_DEC is not set # CONFIG_XZ_DEC_BCJ is not set 
CONFIG_INTERVAL_TREE=y CONFIG_HAS_IOMEM=y CONFIG_HAS_IOPORT_MAP=y CONFIG_HAS_DMA=y CONFIG_CPU_RMAP=y CONFIG_DQL=y CONFIG_GLOB=y # CONFIG_GLOB_SELFTEST is not set CONFIG_NLATTR=y CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y CONFIG_AVERAGE=y # CONFIG_CORDIC is not set # CONFIG_DDR is not set CONFIG_FONT_SUPPORT=y # CONFIG_FONTS is not set CONFIG_FONT_8x8=y CONFIG_FONT_8x16=y CONFIG_ARCH_HAS_SG_CHAIN=y ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-13 17:26 ` Markus Trippelsdorf @ 2015-04-13 18:31 ` Linus Torvalds 2015-04-13 19:09 ` Markus Trippelsdorf 0 siblings, 1 reply; 108+ messages in thread From: Linus Torvalds @ 2015-04-13 18:31 UTC (permalink / raw) To: Markus Trippelsdorf Cc: Ingo Molnar, H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra On Mon, Apr 13, 2015 at 10:26 AM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > > I must have made a measurement mistake above, because the actual code > size savings are roughly 5%: Can you check against the -fno-guess-branch-probability output? Does lto (without pgo) perhaps end up undoing a lot of the random "branch out and back" noise? Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-13 18:31 ` Linus Torvalds @ 2015-04-13 19:09 ` Markus Trippelsdorf 2015-04-14 5:38 ` Ingo Molnar 0 siblings, 1 reply; 108+ messages in thread From: Markus Trippelsdorf @ 2015-04-13 19:09 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra On 2015.04.13 at 11:31 -0700, Linus Torvalds wrote: > On Mon, Apr 13, 2015 at 10:26 AM, Markus Trippelsdorf > <markus@trippelsdorf.de> wrote: > > > > I must have made a measurement mistake above, because the actual code > > size savings are roughly 5%: > > Can you check against the -fno-guess-branch-probability output? text data bss dec filename 8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto) 9202488 978512 811008 10992008 ./vmlinux gcc-5 8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability) 8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability) > Does lto (without pgo) perhaps end up undoing a lot of the random > "branch out and back" noise? As Honza wrote somewhere else in this thread, gcc uses -fguess-branch-probability to get a profile estimate, that is then used throughout the whole optimization chain. So disabling this option really makes no sense and will result in significant performance loss. On the other hand the LTO code size savings should not affect performance negatively at all. -- Markus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-13 19:09 ` Markus Trippelsdorf @ 2015-04-14 5:38 ` Ingo Molnar 2015-04-14 8:23 ` Markus Trippelsdorf 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-14 5:38 UTC (permalink / raw) To: Markus Trippelsdorf Cc: Linus Torvalds, H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra * Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > On 2015.04.13 at 11:31 -0700, Linus Torvalds wrote: > > On Mon, Apr 13, 2015 at 10:26 AM, Markus Trippelsdorf > > <markus@trippelsdorf.de> wrote: > > > > > > I must have made a measurement mistake above, because the actual code > > > size savings are roughly 5%: > > > > Can you check against the -fno-guess-branch-probability output? > > text data bss dec filename > 8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto) > 9202488 978512 811008 10992008 ./vmlinux gcc-5 > 8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability) > 8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability) Just to make sure, could you please also apply the 3 alignment patches attached below? There's a lot of noise from extra alignment. Having said that, LTO should have three main effects: 1) better cross-unit inlining decisions 2) better register allocation and clobbering knowledge (if a small function is known not to clobber caller-saved registers, then the saving can be skipped) 3) better dead code elimination 1)-2) is probably worth the price, 3) in isolation isn't. So we'd have to estimate which one is how significant, to judge the value of LTO - but I haven't seen any effort so far to disambiguate it. 
_Possibly_ if you build kernel/built-in.o only, and compare its sizes, that would help a bit, because the core kernel has very little dead code, giving a fairer estimation of 'true' optimizations. Thanks, Ingo ======

 arch/x86/Makefile | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..10989a73b986 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,6 +77,15 @@ else
         KBUILD_AFLAGS += -m64
         KBUILD_CFLAGS += -m64
 
+        # Pack jump targets tightly, don't align them to the default 16 bytes:
+        KBUILD_CFLAGS += -falign-jumps=1
+
+        # Pack functions tightly as well:
+        KBUILD_CFLAGS += -falign-functions=1
+
+        # Pack loops tightly as well:
+        KBUILD_CFLAGS += -falign-loops=1
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-14 5:38 ` Ingo Molnar @ 2015-04-14 8:23 ` Markus Trippelsdorf 2015-04-14 9:16 ` Ingo Molnar 0 siblings, 1 reply; 108+ messages in thread From: Markus Trippelsdorf @ 2015-04-14 8:23 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra On 2015.04.14 at 07:38 +0200, Ingo Molnar wrote: > > Just to make sure, could you please also apply the 3 alignment patches > attached below? There's a lot of noise from extra alignment. Here's an updated table:

   text    data    bss       dec  filename
8746230  970072  802816  10519118  ./vmlinux gcc-5 (lto)
9202488  978512  811008  10992008  ./vmlinux gcc-5
8036915  970296  802816   9810027  ./vmlinux gcc-5 (lto -fno-guess-branch-probability)
8593615  978512  811008  10383135  ./vmlinux gcc-5 (-fno-guess-branch-probability)
8202614  970072  802816   9975502  ./vmlinux gcc-5 (lto + Ingo's patch)
8801016  978512  811008  10590536  ./vmlinux gcc-5 (Ingo's patch)
8733943  952088  798720  10484751  ./vmlinux gcc-5 (lto + -malign-data=abi)
9186105  958320  806912  10951337  ./vmlinux gcc-5 (-malign-data=abi)
8190327  952088  798720   9941135  ./vmlinux gcc-5 (lto + Ingo's patch + -malign-data=abi)
8784633  958320  806912  10549865  ./vmlinux gcc-5 (Ingo's patch + -malign-data=abi)

For the "lto + Ingo's patch + -malign-data=abi" combination there is a 10% text size reduction. -malign-data is a new option for gcc-5 that controls how the compiler aligns variables. "abi" aligns variables according to the psABI and gives the tightest packing. "compat" is the default and uses an increased alignment value compatible with gcc-4.8. But this should be unnecessary for the kernel. (The other possible value is "cache", which increases the alignment value to match the cache line size.)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..93702eef1684 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,6 +77,9 @@ else
         KBUILD_AFLAGS += -m64
         KBUILD_CFLAGS += -m64
 
+        # Align variables according to psABI
+        KBUILD_CFLAGS += -malign-data=abi
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

> Having said that, LTO should have three main effects: > > 1) better cross-unit inlining decisions > > 2) better register allocation and clobbering knowledge (if a small > function is known not to clobber caller-saved registers, then the > saving can be skipped) > > 3) better dead code elimination > > 1)-2) is probably worth the price, 3) in isolation isn't. So we'd have > to estimate which one is how significant, to judge the value of LTO - > but I haven't seen any effort so far to disambiguate it. For a high level overview of LTO in gcc-5 see Honza's recent article: http://hubicka.blogspot.de/2015/04/GCC5-IPA-LTO-news.html I haven't looked at the generated code at all yet, because the kernel is huge and I'm not sure where to best look for specific changes. > _Possibly_ if you build kernel/built-in.o only, and compared its > sizes, that would help a bit, because the core kernel has very little > dead code, giving a fairer estimation of 'true' optimizations. This isn't possible, because kernel/built-in.o is a 'slim' lto object file that only contains compressed LTO sections with the compiler's internal representation. -- Markus ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-14 8:23 ` Markus Trippelsdorf @ 2015-04-14 9:16 ` Ingo Molnar 2015-04-14 11:17 ` Markus Trippelsdorf 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-14 9:16 UTC (permalink / raw) To: Markus Trippelsdorf Cc: Linus Torvalds, H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra * Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > On 2015.04.14 at 07:38 +0200, Ingo Molnar wrote: > > > > Just to make sure, could you please also apply the 3 alignment patches > > attached below? There's a lot of noise from extra alignment. > > Here's an updated table: > > > > text data bss dec filename > > 8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto) > > 9202488 978512 811008 10992008 ./vmlinux gcc-5 > > 8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability) > > 8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability) > > 8202614 970072 802816 9975502 ./vmlinux gcc-5 (lto + Ingo's patch) > > 8801016 978512 811008 10590536 ./vmlinux gcc-5 (Ingo's patch) > > 8733943 952088 798720 10484751 ./vmlinux gcc-5 (lto + -malign-data=abi) > > 9186105 958320 806912 10951337 ./vmlinux gcc-5 (-malign-data=abi) > > 8190327 952088 798720 9941135 ./vmlinux gcc-5 (lto + Ingo's patch + -malign-data=abi) > > 8784633 958320 806912 10549865 ./vmlinux gcc-5 (Ingo's patch + -malign-data=abi) > > > > For the "lto + Ingo's patch + -malign-data=abi" combination there is a > > 10% text size reduction. Let's rename "Ingo's patch" to "code_align=1". The interesting one would be to compare: code_align=1 + -fno-guess-branch-probability vs. lto + code_align=1 + -fno-guess-branch-probability Or rather, you could try "Ingo's combo patch" further below, with and without LTO. I'd expect LTO to still be in the 5% reduction range. 
> -malign-data is a new option for gcc-5 that controls how the > compiler aligns variables. "abi" aligns variables according to psABI > and give the tightest packing. I'm not so sure about that one, our data access patterns are usually a lot more well thought out than our code alignment (which is really mostly compiler controlled). It also gives limited savings: 9202488 vmlinux gcc-5 9186105 vmlinux gcc-5 (-malign-data=abi) Which is 0.1%. I've got a handful of options in that size range: + # Reduces vmlinux size by 0.25%: + KBUILD_CFLAGS += -fno-caller-saves + + # Reduces vmlinux size by 1.10%: + KBUILD_CFLAGS += -fno-inline-small-functions + + # Reduces vmlinux size by about 0.95%: + KBUILD_CFLAGS += -fno-tree-ch but obviously they are more obscure and thus riskier. Find below an updated "Ingo's combo patch". It gives more than 10% savings here on x86 defconfig using gcc 4.9, without LTO. Thanks, Ingo ----------------> arch/x86/Makefile | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 5ba2d9ce82dc..999e94685d12 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -77,10 +77,34 @@ else KBUILD_AFLAGS += -m64 KBUILD_CFLAGS += -m64 + # Pack jump targets tightly, don't align them to the default 16 bytes: + KBUILD_CFLAGS += -falign-jumps=1 + + # Pack functions tightly as well: + KBUILD_CFLAGS += -falign-functions=1 + + # Pack loops tightly as well: + KBUILD_CFLAGS += -falign-loops=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) + # + # Don't guess branch probabilities, follow the code and unlikely()/likely() hints, + # which reduces vmlinux size by about 5.4%: + # + KBUILD_CFLAGS += -fno-guess-branch-probability + + # Reduces vmlinux size by 0.25%: + KBUILD_CFLAGS += -fno-caller-saves + + # Reduces vmlinux size by 1.10%: + KBUILD_CFLAGS += -fno-inline-small-functions + + # Reduces vmlinux 
size by about 0.95%: + KBUILD_CFLAGS += -fno-tree-ch + # Use -mpreferred-stack-boundary=3 if supported. KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-14 9:16 ` Ingo Molnar @ 2015-04-14 11:17 ` Markus Trippelsdorf 2015-04-14 12:09 ` Ingo Molnar 0 siblings, 1 reply; 108+ messages in thread From: Markus Trippelsdorf @ 2015-04-14 11:17 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra On 2015.04.14 at 11:16 +0200, Ingo Molnar wrote: > > * Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > > > On 2015.04.14 at 07:38 +0200, Ingo Molnar wrote: > > > > > > Just to make sure, could you please also apply the 3 alignment patches > > > attached below? There's a lot of noise from extra alignment. > > > > Here's an updated table: > > > > text data bss dec filename > > 8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto) > > 9202488 978512 811008 10992008 ./vmlinux gcc-5 > > 8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability) > > 8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability) > > 8202614 970072 802816 9975502 ./vmlinux gcc-5 (lto + Ingo's patch) > > 8801016 978512 811008 10590536 ./vmlinux gcc-5 (Ingo's patch) > > 8733943 952088 798720 10484751 ./vmlinux gcc-5 (lto + -malign-data=abi) > > 9186105 958320 806912 10951337 ./vmlinux gcc-5 (-malign-data=abi) > > 8190327 952088 798720 9941135 ./vmlinux gcc-5 (lto + Ingo's patch + -malign-data=abi) > > 8784633 958320 806912 10549865 ./vmlinux gcc-5 (Ingo's patch + -malign-data=abi) > > > > For the "lto + Ingo's patch + -malign-data=abi" combination there is a > > 10% text size reduction. > > Lets rename "Ingo's patch" to "code_align=1". The interesting one > would be to compare: > > code_align=1 + -fno-guess-branch-probability > vs. 
> lto + code_align=1 + -fno-guess-branch-probability > > I'd expect LTO to still be in the 5% reduction range. Yes:

   text    data    bss       dec
7886231  970296  802816   9659343  lto + code_align=1 + -fno-guess-branch-probability
8398284  978512  811008  10187804  code_align=1 + -fno-guess-branch-probability

> > -malign-data is a new option for gcc-5 that controls how the > > compiler aligns variables. "abi" aligns variables according to psABI > > and give the tightest packing. > > I'm not so sure about that one, our data access patterns are usually a > lot more well thought out than our code alignment (which is really > mostly compiler controlled). It also gives limited savings: > > 9202488 vmlinux gcc-5 > 9186105 vmlinux gcc-5 (-malign-data=abi) > > Which is 0.1%. I've got a handful of options in that size range: > > + # Reduces vmlinux size by 0.25%: > + KBUILD_CFLAGS += -fno-caller-saves > + > + # Reduces vmlinux size by 1.10%: > + KBUILD_CFLAGS += -fno-inline-small-functions > + > + # Reduces vmlinux size by about 0.95%: > + KBUILD_CFLAGS += -fno-tree-ch > > but obviously they are more obscure and thus riskier. Find below an > updated "Ingo's combo patch". It gives more than 10% savings here on > x86 defconfig using gcc 4.9, without LTO. Well obviously, if you do not care about performance you can reduce the text size further and further. But what is interesting is to keep the performance up (or even increase it) and still reduce the text size. And that is what the "lto + Ingo's patch + -malign-data=abi" kernel hopefully achieves (but it would need further benchmarking to confirm this claim). -- Markus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-14 11:17 ` Markus Trippelsdorf @ 2015-04-14 12:09 ` Ingo Molnar 0 siblings, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-14 12:09 UTC (permalink / raw) To: Markus Trippelsdorf Cc: Linus Torvalds, H. Peter Anvin, Denys Vlasenko, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, Thomas Gleixner, Peter Zijlstra * Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > > I'm not so sure about that one, our data access patterns are > > usually a lot more well thought out than our code alignment (which > > is really mostly compiler controlled). It also gives limited > > savings: > > > > 9202488 vmlinux gcc-5 > > 9186105 vmlinux gcc-5 (-malign-data=abi) > > > > Which is 0.1%. I've got a handful of options in that size range: > > > > + # Reduces vmlinux size by 0.25%: > > + KBUILD_CFLAGS += -fno-caller-saves > > + > > + # Reduces vmlinux size by 1.10%: > > + KBUILD_CFLAGS += -fno-inline-small-functions > > + > > + # Reduces vmlinux size by about 0.95%: > > + KBUILD_CFLAGS += -fno-tree-ch > > > > but obviously they are more obscure and thus riskier. Find below > > an updated "Ingo's combo patch". It gives more than 10% savings > > here on x86 defconfig using gcc 4.9, without LTO. > > Well obviously, if you do not care about performance you can reduce > the text size further and further. [...] Yes, but I picked GCC options that I don't think impact performance negatively and offer a sizable debloating effect. Especially with inlining if code size increases it's probably a net loss. > [...] But what is interesting is to keep the performance up (or even > increase it) and still reduce the text size. By my (admittedly quick) review I think those 3 extra options I added still generate pretty OK code in the end. I.e. they are not like -Os that generates utter crap to save a byte. 
Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko 2015-04-10 13:18 ` H. Peter Anvin @ 2015-04-10 18:48 ` Linus Torvalds 2015-04-12 23:44 ` Maciej W. Rozycki 2015-04-10 19:23 ` Daniel Borkmann 2015-04-11 13:48 ` Markus Trippelsdorf 3 siblings, 1 reply; 108+ messages in thread From: Linus Torvalds @ 2015-04-10 18:48 UTC (permalink / raw) To: Denys Vlasenko Cc: Ingo Molnar, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 5:50 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote: > > However, I'm an -Os guy. Expect -O2 people to disagree :) I used to be an -Os guy too. I'm a big believer in I$ density. HOWEVER. It turns out that gcc's -Os is just horrible nasty crap. It doesn't actually make good tradeoffs for code density, because it doesn't make any tradeoffs at all. It tries to choose small code, even when it's ridiculously bad small code. For example, a 24-byte static memcpy is best done as three quad-word load/store pairs. That's very cheap, and not at all unreasonable. But what does gcc do? It does a "rep movsl". Seriously. That's *shit*. It absolutely kills performance on some very critical code. I'm not making that up. Try "-O2" and "-Os" on the appended trivial code. Yes, the "rep movsl" is smaller, but it's incredibly expensive, particularly if the result is partially used afterwards. And I'm not a hater of "rep movs" - not at all. I think that "rep movsb" is basically a perfect way to tell the CPU "do an optimized memcpy with whatever cache situation you have". So I'm a big fan of the string instructions, but only when appropriate. 
And "appropriate" here very much includes "I don't know the memory copy size, so I'm going to call out to some complex generic code that does all kinds of size checks and tricks". Replacing three pairs of "mov" instructions with a "rep movs" is insane. (There are a couple of other examples of that kind of issue with "-Os". Like using "imul $15" instead of a single shift-by-4 and subtract. Again, the "imul" is certainly smaller, but can have quite bad latency and throughput issues). So I'm no longer a fan of -Os. It disables too many obviously good code optimizations. Linus

---
struct dummy {
	unsigned long a, b, c;
};

void test(struct dummy *a, struct dummy *b)
{
	*b = *a;
}

^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 18:48 ` Linus Torvalds @ 2015-04-12 23:44 ` Maciej W. Rozycki 0 siblings, 0 replies; 108+ messages in thread From: Maciej W. Rozycki @ 2015-04-12 23:44 UTC (permalink / raw) To: Linus Torvalds Cc: Denys Vlasenko, Ingo Molnar, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, 10 Apr 2015, Linus Torvalds wrote: > It turns out that gcc's -Os is just horrible nasty crap. It doesn't > actually make good tradeoffs for code density, because it doesn't make > any tradeoffs at all. It tries to choose small code, even when it's > ridiculously bad small code. > > For example, a 24-byte static memcpy is best done as three quad-word > load/store pairs. That's very cheap, and not at all unreasonable. > > But what does gcc do? It does a "rep movsl". > > Seriously. That's *shit*. It absolutely kills performance on some very > critical code. > > I'm not making that up. Try "-O2" and "-Os" on the appended trivial > code. Yes, the "rep movsl" is smaller, but it's incredibly expensive, > particularly if the result is partially used afterwards. > > And I'm not a hater of "rep movs" - not at all. I think that "rep > movsb" is basically a perfect way to tell the CPU "do an optimized > memcpy with whatever cache situation you have". So I'm a big fan of > the string instructions, but only when appropriate. And "appropriate" > here very much includes "I don't know the memory copy size, so I'm > going to call out to some complex generic code that does all kinds of > size checks and tricks". > > Replacing three pairs of "mov" instructions with a "rep movs" is insane. > > (There are a couple of other examples of that kind of issues with > "-Os". Like using "imul $15" instead of single shift-by-4 and > subtract. 
Again, the "imul" is certainly smaller, but can have quite > bad latency and throughput issues). > > So I'm no longer a fan of -Os. It disables too many obviously good > code optimizations. I think the issue is that -Os is a binary yes/no option without further tuning as to how desperate about code size saving GCC is asked to be. That's what we'd probably have with speed optimisation too if there was only a single -O GCC option -- equivalent to today's -O3. However, instead GCC has -O1, -O2, -O3 that turn on more and more possibly insane optimisations gradually (plus a load of -f options for further fine tuning). So a possible complementary solution for size saving could be keeping -Os as it is for people's build recipe compatibility, and then having, say, -Os1, -Os2, -Os3 enabling more and more insane optimisations, on the size side for a change. In that case -Os3 would be equivalent to today's -Os. There could be further fine-tune options to control things like the string moves you mention. The thing here is someone would have to implement all of it, and I gather GCC folks have more than enough stuff to do already. I'm fairly sure they wouldn't decline a patch though. ;) Maciej ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko 2015-04-10 13:18 ` H. Peter Anvin 2015-04-10 18:48 ` Linus Torvalds @ 2015-04-10 19:23 ` Daniel Borkmann 2015-04-11 13:48 ` Markus Trippelsdorf 3 siblings, 0 replies; 108+ messages in thread From: Daniel Borkmann @ 2015-04-10 19:23 UTC (permalink / raw) To: Denys Vlasenko, Ingo Molnar, Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 04/10/2015 02:50 PM, Denys Vlasenko wrote: ... > New-ish versions of gcc allow people to specify optimization > options per function: > > https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes > > optimize > The optimize attribute is used to specify that a function is to be compiled > with different optimization options than specified on the command line. > Arguments can either be numbers or strings. Numbers are assumed to be an > optimization level. Strings that begin with O are assumed to be an > optimization option, while other options are assumed to be used with > a -f prefix. > > How about not aligning code by default, and using > > #define hot_func __attribute__((optimize("O2","align-functions=16","align-jumps=16"))) I stumbled over that some time ago in a different context. Apparently, that's being considered broken by gcc folks [1]. ;) [1] http://gcc.gnu.org/ml/gcc/2012-07/msg00211.html ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko ` (2 preceding siblings ...) 2015-04-10 19:23 ` Daniel Borkmann @ 2015-04-11 13:48 ` Markus Trippelsdorf 3 siblings, 0 replies; 108+ messages in thread From: Markus Trippelsdorf @ 2015-04-11 13:48 UTC (permalink / raw) To: Denys Vlasenko Cc: Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 2015.04.10 at 14:50 +0200, Denys Vlasenko wrote: > New-ish versions of gcc allow people to specify optimization > options per function: > > https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes > > optimize > The optimize attribute is used to specify that a function is to be compiled > with different optimization options than specified on the command line. > Arguments can either be numbers or strings. Numbers are assumed to be an > optimization level. Strings that begin with O are assumed to be an > optimization option, while other options are assumed to be used with > a -f prefix. > > How about not aligning code by default, and using > > #define hot_func __attribute__((optimize("O2","align-functions=16","align-jumps=16"))) > ... > > void hot_func super_often_called_func(...) {...} > > in hot code paths? __attribute__((optimize)) is meant for compiler debugging only (to help folks to reduce their testcases to a single function). It should _not_ be used in production code, because the implementation is very buggy. -- Markus ^ permalink raw reply [flat|nested] 108+ messages in thread
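For concreteness, the macro Denys proposes would look like the following. This is a syntax sketch only: as both replies above warn, GCC's `optimize` attribute of this era is considered buggy and meant for compiler debugging, not production code, and the function body here is a placeholder.

```c
/* Syntax sketch of the per-function optimize attribute discussed above.
 * NOT a recommendation: the thread's replies note the attribute's
 * implementation is unreliable. */
#define hot_func __attribute__((optimize("O2", "align-functions=16", "align-jumps=16")))

hot_func static int super_often_called_func(int x)
{
	return x * 2;	/* placeholder body; the attribute is the point */
}
```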
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar 2015-04-10 12:18 ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar 2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko @ 2015-04-10 13:19 ` Borislav Petkov 2015-04-10 13:54 ` Denys Vlasenko 2015-04-10 14:10 ` Paul E. McKenney ` (3 subsequent siblings) 6 siblings, 1 reply; 108+ messages in thread From: Borislav Petkov @ 2015-04-10 13:19 UTC (permalink / raw) To: Ingo Molnar Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote: > Now, the usual justification for jump target alignment is the > following: with 16 byte instruction-cache cacheline sizes, if a You mean 64 bytes? Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably some branch predictor stride thing. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 13:19 ` Borislav Petkov @ 2015-04-10 13:54 ` Denys Vlasenko 2015-04-10 14:01 ` Borislav Petkov 0 siblings, 1 reply; 108+ messages in thread From: Denys Vlasenko @ 2015-04-10 13:54 UTC (permalink / raw) To: Borislav Petkov, Ingo Molnar Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 04/10/2015 03:19 PM, Borislav Petkov wrote: > On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote: >> Now, the usual justification for jump target alignment is the >> following: with 16 byte instruction-cache cacheline sizes, if a > > You mean 64 bytes? > > Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably > some branch predictor stride thing. IIRC it's a maximum decode bandwidth. Decoders on the most powerful x86 CPUs, both Intel and AMD, attempt to decode in one cycle up to four instructions. For this they fetch up to 16 bytes. If cacheline ends before 16 bytes are available, then decode will operate on fewer bytes, or it will wait for next cacheline to be fetched. ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 13:54 ` Denys Vlasenko @ 2015-04-10 14:01 ` Borislav Petkov 2015-04-10 14:53 ` Denys Vlasenko 0 siblings, 1 reply; 108+ messages in thread From: Borislav Petkov @ 2015-04-10 14:01 UTC (permalink / raw) To: Denys Vlasenko Cc: Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 03:54:57PM +0200, Denys Vlasenko wrote: > On 04/10/2015 03:19 PM, Borislav Petkov wrote: > > On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote: > >> Now, the usual justification for jump target alignment is the > >> following: with 16 byte instruction-cache cacheline sizes, if a > > > > You mean 64 bytes? > > > > Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably > > some branch predictor stride thing. > > IIRC it's a maximum decode bandwidth. Decoders on the most powerful > x86 CPUs, both Intel and AMD, attempt to decode in one cycle > up to four instructions. For this they fetch up to 16 bytes. 32 bytes fetch window per cycle for AMD F15h and F16h, see my other mail. And Intel probably do the same. > If cacheline ends before 16 bytes are available, then decode > will operate on fewer bytes, or it will wait for next cacheline > to be fetched. Yap. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 14:01 ` Borislav Petkov @ 2015-04-10 14:53 ` Denys Vlasenko 2015-04-10 15:25 ` Borislav Petkov 2015-04-10 18:54 ` Linus Torvalds 0 siblings, 2 replies; 108+ messages in thread From: Denys Vlasenko @ 2015-04-10 14:53 UTC (permalink / raw) To: Borislav Petkov Cc: Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 04/10/2015 04:01 PM, Borislav Petkov wrote: > On Fri, Apr 10, 2015 at 03:54:57PM +0200, Denys Vlasenko wrote: >> On 04/10/2015 03:19 PM, Borislav Petkov wrote: >>> On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote: >>>> Now, the usual justification for jump target alignment is the >>>> following: with 16 byte instruction-cache cacheline sizes, if a >>> >>> You mean 64 bytes? >>> >>> Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably >>> some branch predictor stride thing. >> >> IIRC it's a maximum decode bandwidth. Decoders on the most powerful >> x86 CPUs, both Intel and AMD, attempt to decode in one cycle >> up to four instructions. For this they fetch up to 16 bytes. > > 32 bytes fetch window per cycle for AMD F15h and F16h, see my other > mail. And Intel probably do the same. There are people who experimentally researched this. According to this guy: http://www.agner.org/optimize/microarchitecture.pdf Intel CPUs can decode only up to 16 bytes at a time (but they have loop buffers and some have a uop cache, which can skip decoding entirely). AMD CPUs can decode 21 bytes at best. With two cores active, only 16 bytes. """ 10 Haswell pipeline ... 10.1 Pipeline The pipeline is similar to previous designs, but improved with more of everything. It is designed for a throughput of four instructions per clock cycle.
Each core has a reorder buffer with 192 entries, the reservation station has 60 entries, and the register file has 168 integer registers and 168 vector registers, according to the literature listed on page 145 below. All parts of the pipeline are shared between two threads in those CPU models that can run two threads in each core. Each thread gets half of the total throughput when two threads are running in the same core. 10.2 Instruction fetch and decoding The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle in single threaded applications. There are four decoders, which can handle instructions generating up to four μops per clock cycle in the way described on page 120 for Sandy Bridge. Instructions with any number of prefixes are decoded in a single clock cycle. There is no penalty for redundant prefixes. ... ... 15 AMD Bulldozer, Piledriver and Steamroller pipeline 15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller ... 15.2 Instruction fetch The instruction fetcher is shared between the two cores of an execution unit. The instruction fetcher can fetch 32 aligned bytes of code per clock cycle from the level-1 code cache. The measured fetch rate was up to 16 bytes per clock per core when two cores were active, and up to 21 bytes per clock in linear code when only one core was active. The fetch rate is lower than these maximum values when instructions are misaligned. Critical subroutine entries and loop entries should not start near the end of a 32-bytes block. You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in the first four instructions after a critical label. """ ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 14:53 ` Denys Vlasenko @ 2015-04-10 15:25 ` Borislav Petkov 2015-04-10 15:48 ` Denys Vlasenko 2015-04-10 18:54 ` Linus Torvalds 1 sibling, 1 reply; 108+ messages in thread From: Borislav Petkov @ 2015-04-10 15:25 UTC (permalink / raw) To: Denys Vlasenko Cc: Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 04:53:29PM +0200, Denys Vlasenko wrote: > There are people who experimentally researched this. > According to this guy: > > http://www.agner.org/optimize/microarchitecture.pdf > > Intel CPUs can decode only up to 16 bytes at a time > (but the have loop buffers and some has uop cache, > which can skip decoding entirely). Ok, so Intel don't need a 32-byte fetch window. Probably the uop cache and other tricks make a larger fetch window not really important in that case. A larger fetch window means also more power and having predecoded stuff in a cache is a much better story no matter from where you look at it. > AMD CPUs can decode 21 bytes at best. With two cores active, > only 16 bytes. > > > """ > 10 Haswell pipeline > ... > 10.1 Pipeline > The pipeline is similar to previous designs, but improved with more of everything. It is "more of everything", yeah! :) > 10.2 Instruction fetch and decoding > The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle in single > threaded applications. That uop cache is simply spot on, it seems. > There are four decoders, which can handle instructions generating up to four μops per clock > cycle in the way described on page 120 for Sandy Bridge. > Instructions with any number of prefixes are decoded in a single clock cycle. There is no > penalty for redundant prefixes. That's nice. 
> 15 AMD Bulldozer, Piledriver and Steamroller pipeline > 15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller > ... > 15.2 Instruction fetch > The instruction fetcher is shared between the two cores of an execution unit. The instruction > fetcher can fetch 32 aligned bytes of code per clock cycle from the level-1 code cache. The That's also understandable - you want to enlarge the fetch window for the two cores of a compute unit as they share a front end. > measured fetch rate was up to 16 bytes per clock per core when two cores were active, and > up to 21 bytes per clock in linear code when only one core was active. The fetch rate is > lower than these maximum values when instructions are misaligned. > Critical subroutine entries and loop entries should not start near the end of a 32-bytes block. > You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in > the first four instructions after a critical label. > """ All F15h models are Bulldozer uarch with improvements. For example, later F15h models have things like loop buffer and loop predictor which can replay loops under certain conditions, thus diminishing the importance of the fetch window size wrt to loops performance. And then there's AMD F16h Software Optimization Guide, that's the Jaguar uarch: "...The processor can fetch 32 bytes per cycle and can scan two 16-byte instruction windows for up to two instruction decodes per cycle. ... 2.7.2 Loop Alignment For the Family 16h processor loop alignment is not usually a significant issue. However, for hot loops, some further knowledge of trade-offs can be helpful. Since the processor can read an aligned 32-byte fetch block every cycle, to achieve maximum fetch bandwidth the loop start point should be aligned to 32 bytes. For very hot loops, it may be useful to further consider branch placement. The branch predictor can process the first two branches in a cache line in a single cycle through the sparse predictor. 
For best performance, any branches in the first cache line of the hot loop should be in the sparse predictor. The simplest way to guarantee this for very hot loops is to align the start point to a cache line (64-byte) boundary." -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 108+ messages in thread
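At the source level, entry alignment of the kind the quoted optimization guide recommends can be requested per function with GCC's `aligned` function attribute (loop-entry alignment inside a function is instead controlled globally by flags such as `-falign-loops=`). A hypothetical example, aligning a hot function's entry to a 64-byte cache-line boundary:

```c
#include <stddef.h>

/* Hypothetical illustration only: request that the function entry (and
 * hence its first cache line) start on a 64-byte boundary, per the
 * quoted F16h guidance for very hot code. */
__attribute__((aligned(64)))
static int hot_sum(const int *v, size_t n)
{
	int s = 0;
	for (size_t i = 0; i < n; i++)
		s += v[i];
	return s;
}
```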
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 15:25 ` Borislav Petkov @ 2015-04-10 15:48 ` Denys Vlasenko 2015-04-10 15:54 ` Borislav Petkov 0 siblings, 1 reply; 108+ messages in thread From: Denys Vlasenko @ 2015-04-10 15:48 UTC (permalink / raw) To: Borislav Petkov Cc: Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 04/10/2015 05:25 PM, Borislav Petkov wrote: >> measured fetch rate was up to 16 bytes per clock per core when two cores were active, and >> up to 21 bytes per clock in linear code when only one core was active. The fetch rate is >> lower than these maximum values when instructions are misaligned. >> Critical subroutine entries and loop entries should not start near the end of a 32-bytes block. >> You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in >> the first four instructions after a critical label. >> """ > > All F15h models are Bulldozer uarch with improvements. For example, > later F15h models have things like loop buffer and loop predictor > which can replay loops under certain conditions, thus diminishing the > importance of the fetch window size wrt to loops performance. > > And then there's AMD F16h Software Optimization Guide, that's the Jaguar > uarch: > > "...The processor can fetch 32 bytes per cycle and can scan two 16-byte > instruction windows for up to two instruction decodes per cycle. As you know, manuals are not be-all, end-all documents. They contain mistakes. And they are written before silicon is finalized, and sometimes they advertise capabilities which in the end had to be downscaled. It's hard to check a 1000+ page document and correct all mistakes, especially hard-to-quantify ones.
In the same document by Agner Fog, he says that he failed to confirm 32-byte fetch on Fam16h CPUs: """ 16 AMD Bobcat and Jaguar pipeline ... ... 16.2 Instruction fetch The instruction fetch rate is stated as "up to 32 bytes per cycle", but this is not confirmed by my measurements which consistently show a maximum of 16 bytes per clock cycle on average for both Bobcat and Jaguar. Some reports say that the Jaguar has a loop buffer, but I cannot detect any improvement in performance for tiny loops. """ ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 15:48 ` Denys Vlasenko @ 2015-04-10 15:54 ` Borislav Petkov 2015-04-10 21:44 ` Borislav Petkov 0 siblings, 1 reply; 108+ messages in thread From: Borislav Petkov @ 2015-04-10 15:54 UTC (permalink / raw) To: Denys Vlasenko Cc: Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 05:48:51PM +0200, Denys Vlasenko wrote: > As you know, manuals are not be-all, end-all documents. > They contains mistakes. And they are written before silicon > is finalized, and sometimes they advertise capabilities > which in the end had to be downscaled. It's hard to check > a 1000+ pages document and correct all mistakes, especially > hard-to-quantify ones. > > In the same document by Agner Fog, he says that he failed to confirm > 32-byte fetch on Fam16h CPUs: Hmm, ok. I probably should poke some people about it. Lemme see... -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 15:54 ` Borislav Petkov @ 2015-04-10 21:44 ` Borislav Petkov 0 siblings, 0 replies; 108+ messages in thread From: Borislav Petkov @ 2015-04-10 21:44 UTC (permalink / raw) To: Denys Vlasenko Cc: Ingo Molnar, Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 05:54:10PM +0200, Borislav Petkov wrote: > On Fri, Apr 10, 2015 at 05:48:51PM +0200, Denys Vlasenko wrote: > > As you know, manuals are not be-all, end-all documents. > > They contains mistakes. And they are written before silicon > > is finalized, and sometimes they advertise capabilities > > which in the end had to be downscaled. It's hard to check > > a 1000+ pages document and correct all mistakes, especially > > hard-to-quantify ones. > > > > In the same document by Agner Fog, he says that he failed to confirm > > 32-byte fetch on Fam16h CPUs: So reportedly, the fetch window is 32-byte wide but I think there are restrictions in the pipe elsewhere, which would have negative influence on the actual throughput. And which could explain Agner's observations. AFAICT. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 14:53 ` Denys Vlasenko 2015-04-10 15:25 ` Borislav Petkov @ 2015-04-10 18:54 ` Linus Torvalds 1 sibling, 0 replies; 108+ messages in thread From: Linus Torvalds @ 2015-04-10 18:54 UTC (permalink / raw) To: Denys Vlasenko Cc: Borislav Petkov, Ingo Molnar, Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 7:53 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote: > > There are people who experimentally researched this. > According to this guy: > > http://www.agner.org/optimize/microarchitecture.pdf > > Intel CPUs can decode only up to 16 bytes at a time Indeed. For Intel decoding, the old "4-1-1-1" decode patterns are almost entirely immaterial these days. Even the "single uop" (the "1"s in the 4-1-1-1) cover the vast majority of cases. So for Intel decoders, the biggest limit - especially for x86-64 instructions - tends to be the 16-byte decode window. The problem with x86 decoding isn't that individual instructions are complicated, but the fact that when you try to decode multiple instructions at once, finding the start of each instruction is somewhat painful. What I *think* Intel does is have this rather complex net of logic that basically decodes 16 bytes in parallel, but has this rippling thing that just disables the incorrect decodes. That said, the fetch boundary from L2 is probably an issue too, especially if the front-end hasn't had time to run ahead of the execution engine. That's likely where the "32 byte alignment" comes from. Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar ` (2 preceding siblings ...) 2015-04-10 13:19 ` Borislav Petkov @ 2015-04-10 14:10 ` Paul E. McKenney 2015-04-11 14:28 ` Josh Triplett 2015-04-11 9:20 ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar ` (2 subsequent siblings) 6 siblings, 1 reply; 108+ messages in thread From: Paul E. McKenney @ 2015-04-10 14:10 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra, josh On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > > > So restructure the loop a bit, to get much tighter code: > > > > 0000000000000030 <mutex_spin_on_owner.isra.5>: > > 30: 55 push %rbp > > 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx > > 38: 00 00 > > 3a: 48 89 e5 mov %rsp,%rbp > > 3d: 48 39 37 cmp %rsi,(%rdi) > > 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30> > > 42: 8b 46 28 mov 0x28(%rsi),%eax > > 45: 85 c0 test %eax,%eax > > 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26> > > 49: f3 90 pause > > 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax > > 52: a8 08 test $0x8,%al > > 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd> > > 56: 31 c0 xor %eax,%eax > > 58: 5d pop %rbp > > 59: c3 retq > > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > > 60: b8 01 00 00 00 mov $0x1,%eax > > 65: 5d pop %rbp > > 66: c3 retq > > Btw., totally off topic, the following NOP caught my attention: > > > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > > That's a dead NOP that boats the function a bit, added for the 16 byte > alignment of one of the jump targets. 
> > I realize that x86 CPU manufacturers recommend 16-byte jump target > alignments (it's in the Intel optimization manual), but the cost of > that is very significant: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > > By using 1 byte jump target alignment (i.e. no alignment at all) we > get an almost 3% reduction in kernel size (!) - and a probably similar > reduction in I$ footprint. > > So I'm wondering, is the 16 byte jump target optimization suggestion > really worth this price? The patch below boots fine and I've not > measured any noticeable slowdown, but I've not tried hard. Good point, adding Josh Triplett on CC. I suspect that he might be interested. ;-) Thanx, Paul > Now, the usual justification for jump target alignment is the > following: with 16 byte instruction-cache cacheline sizes, if a > forward jump is aligned to cacheline boundary then prefetches will > start from a new cacheline. > > But I think that argument is flawed for typical optimized kernel code > flows: forward jumps often go to 'cold' (uncommon) pieces of code, and > aligning cold code to cache lines does not bring a lot of advantages > (they are uncommon), while it causes collateral damage: > > - their alignment 'spreads out' the cache footprint, it shifts > followup hot code further out > > - plus it slows down even 'cold' code that immediately follows 'hot' > code (like in the above case), which could have benefited from the > partial cacheline that comes off the end of hot code. > > What do you guys think about this? I think we should seriously > consider relaxing our alignment defaults. 
> > Thanks, > > Ingo > > ==================================> > >From 5b83a095e1abdfee5c710c34a5785232ce74f939 Mon Sep 17 00:00:00 2001 > From: Ingo Molnar <mingo@kernel.org> > Date: Fri, 10 Apr 2015 13:50:05 +0200 > Subject: [PATCH] x86: Align jumps targets to 1 byte boundaries > > Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org> > --- > arch/x86/Makefile | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/x86/Makefile b/arch/x86/Makefile > index 5ba2d9ce82dc..0366d6b44a14 100644 > --- a/arch/x86/Makefile > +++ b/arch/x86/Makefile > @@ -77,6 +77,9 @@ else > KBUILD_AFLAGS += -m64 > KBUILD_CFLAGS += -m64 > > + # Align jump targets to 1 byte, not the default 16 bytes: > + KBUILD_CFLAGS += -falign-jumps=1 > + > # Don't autogenerate traditional x87 instructions > KBUILD_CFLAGS += $(call cc-option,-mno-80387) > KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) > ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 14:10 ` Paul E. McKenney @ 2015-04-11 14:28 ` Josh Triplett 0 siblings, 0 replies; 108+ messages in thread From: Josh Triplett @ 2015-04-11 14:28 UTC (permalink / raw) To: Paul E. McKenney Cc: Ingo Molnar, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Apr 10, 2015 at 07:10:08AM -0700, Paul E. McKenney wrote: > On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: [...] > > Btw., totally off topic, the following NOP caught my attention: > > > > > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > > > > That's a dead NOP that boats the function a bit, added for the 16 byte > > alignment of one of the jump targets. > > > > I realize that x86 CPU manufacturers recommend 16-byte jump target > > alignments (it's in the Intel optimization manual), but the cost of > > that is very significant: > > > > text data bss dec filename > > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > > > > By using 1 byte jump target alignment (i.e. no alignment at all) we > > get an almost 3% reduction in kernel size (!) - and a probably similar > > reduction in I$ footprint. > > > > So I'm wondering, is the 16 byte jump target optimization suggestion > > really worth this price? The patch below boots fine and I've not > > measured any noticeable slowdown, but I've not tried hard. > > Good point, adding Josh Triplett on CC. I suspect that he might be > interested. ;-) Quite interested, yes. Even if there *are* benchmarks to support keeping the optimization (which wouldn't surprise me), it'd be nice to have a Kconfig option to enable the jump-target optimization. 
(With 'y' meaning "pad jump targets to 16 bytes", so that allnoconfig and tinyconfig automatically don't.) - Josh Triplett ^ permalink raw reply [flat|nested] 108+ messages in thread
* [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar ` (3 preceding siblings ...) 2015-04-10 14:10 ` Paul E. McKenney @ 2015-04-11 9:20 ` Ingo Molnar 2015-04-11 17:41 ` Linus Torvalds 2015-05-14 11:59 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko 2015-05-15 9:39 ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar 6 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-11 9:20 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Denys Vlasenko, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra * Ingo Molnar <mingo@kernel.org> wrote: > Btw., totally off topic, the following NOP caught my attention: > > > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > > That's a dead NOP that boats the function a bit, added for the 16 byte > alignment of one of the jump targets. Another thing caught my attention (and I'm hijacking the RCU thread again): GCC's notion of how to place branches seems somewhat random, and rather bloaty. So I tried the experiment below on an x86 defconfig, turning off GCC's branch heuristics, and it's rather surprising: text data bss dec filename 12566447 1617840 1089536 15273823 vmlinux.fguess-branch-probability 11923593 1617840 1089536 14630969 vmlinux.fno-guess-branch-probability That's a 5.4% code size improvement! So maybe we should try this, as it results in much more predictable (and more compact!) code by default - and allows us to shape loops and branches in a natural fashion: by their placement, and also via likely()/unlikely() hints when absolutely necessary. Thoughts?
Thanks, Ingo =================> From: Ingo Molnar <mingo@kernel.org> Date: Sat, 11 Apr 2015 11:16:30 +0200 Subject: [PATCH] x86: Turn off GCC branch probability heuristics Not-Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 5ba2d9ce82dc..7c12b3f56915 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -81,6 +81,9 @@ else KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) + # Don't guess branch probabilities, follow the code and unlikely()/likely() hints: + KBUILD_CFLAGS += -fno-guess-branch-probability + # Use -mpreferred-stack-boundary=3 if supported. KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-11 9:20 ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar @ 2015-04-11 17:41 ` Linus Torvalds 2015-04-11 18:57 ` Thomas Gleixner 0 siblings, 1 reply; 108+ messages in thread From: Linus Torvalds @ 2015-04-11 17:41 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra On Apr 11, 2015 2:20 AM, "Ingo Molnar" <mingo@kernel.org> wrote: > > Another thing caught my attention (and I'm hijacking the RCU thread > again): GCC's notion of how to place branches seems somewhat random, > and rather bloaty. > > So I tried the experiment below on an x86 defconfig, turning off GCC's > branch heuristics, and it's rather surprising: > > text data bss dec filename > 12566447 1617840 1089536 15273823 vmlinux.fguess-branch-probability > 11923593 1617840 1089536 14630969 vmlinux.fno-guess-branch-probability > > That's an 5.4% code size improvement! Ugh. That's much larger than I would have expected. Is it just because gcc ends up turning if (a) b c into if (a) goto out-of-line return: c .. out-of-line: b goto return; a lot? Still, 5% sounds insanely big. How much of that 5% comes from code alignment? Or was this on *top* of the 1-byte alignment test? Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-11 17:41 ` Linus Torvalds @ 2015-04-11 18:57 ` Thomas Gleixner 2015-04-11 19:35 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 108+ messages in thread From: Thomas Gleixner @ 2015-04-11 18:57 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra On Sat, 11 Apr 2015, Linus Torvalds wrote: > On Apr 11, 2015 2:20 AM, "Ingo Molnar" <mingo@kernel.org> wrote: > > > > Another thing caught my attention (and I'm hijacking the RCU thread > > again): GCC's notion of how to place branches seems somewhat random, > > and rather bloaty. > > > > So I tried the experiment below on an x86 defconfig, turning off GCC's > > branch heuristics, and it's rather surprising: > > > > text data bss dec filename > > 12566447 1617840 1089536 15273823 vmlinux.fguess-branch-probability > > 11923593 1617840 1089536 14630969 vmlinux.fno-guess-branch-probability > > > > That's a 5.4% code size improvement! > > Ugh. That's much larger than I would have expected. Is it just because > gcc ends up turning > > if (a) > b > c > > into > > if (a) goto out-of-line > return: > c > .. > out-of-line: > b > goto return; > > a lot? Still, 5% sounds insanely big. > > How much of that 5% comes from code alignment? Or was this on *top* of > the 1-byte alignment test? I think it's just the no-guess one: text data dec patch reduction 7563475 1781048 10302987 7192973 1780024 9931461 no-guess -4.8% 7354819 1781048 10094331 align-1 -2.7% 7192973 1780024 9931461 no-guess + align-1 -4.8% So with the no-guess applied the align-1 does not matter anymore. Thanks, tglx ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-11 18:57 ` Thomas Gleixner @ 2015-04-11 19:35 ` Linus Torvalds 2015-04-12 5:47 ` Ingo Molnar 2015-04-12 7:41 ` Ingo Molnar 2015-04-12 8:07 ` Ingo Molnar 2015-04-12 21:11 ` Jan Hubicka 2 siblings, 2 replies; 108+ messages in thread From: Linus Torvalds @ 2015-04-11 19:35 UTC (permalink / raw) To: Thomas Gleixner, Jakub Jelinek Cc: Ingo Molnar, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > I thinks its just the no-guess one: > > text data dec patch reduction > 7563475 1781048 10302987 > 7192973 1780024 9931461 no-guess -4.8% > 7354819 1781048 958464 align-1 -2.7% > 7192973 1780024 9931461 no-guess + align-1 -4.8% Yeah, a 5% code expansion is a big deal. Sadly, it looks like 'no-guess' also disables our explicit likely/unlikely handling. Damn. If it actually honored likely/unlikely, then we should just do it - and manually fix up any places where we really care. But the fact that it apparently entirely disables not just the guesses, but our *explicit* likely/unlikely, means that we can't fix up the mistakes. And in many of the hot codepaths that likely/unlikely really does matter. Some of our hottest paths have known "this basically never happens" situations that we do *not* want to break up our L1 I$ over. There's a number of functions that have been optimized to really generate good code, and "-fno-guess-branch-probability" disables those manual optimizations. So we'd have no way to fix it for the cases that matter. Sad. It might be worth bringing this up with some gcc people. I added Jakub to the cc. Any other gcc people suggestions? Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-11 19:35 ` Linus Torvalds @ 2015-04-12 5:47 ` Ingo Molnar 2015-04-12 6:20 ` Markus Trippelsdorf 2015-04-12 7:56 ` Mike Galbraith 2015-04-12 7:41 ` Ingo Molnar 1 sibling, 2 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-12 5:47 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Jakub Jelinek, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > I thinks its just the no-guess one: > > > > text data dec patch reduction > > 7563475 1781048 10302987 > > 7192973 1780024 9931461 no-guess -4.8% > > 7354819 1781048 958464 align-1 -2.7% > > 7192973 1780024 9931461 no-guess + align-1 -4.8% > > Yeah, a 5% code expansion is a big deal. Sadly, it looks like > 'no-guess' also disables our explicit likely/unlikely handling. So I spent some time trying to get as much code size reduction as possible via GCC optimization options, and the total savings possible are 10.1%: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.vanilla 11416805 1617840 1089536 14124181 vmlinux.combo 10532552 1596080 1089536 13218168 vmlinux.Os (combo patch attached below.) The -Os savings are 19% total - but as you mentioned before, it's sometimes achieved through unacceptable techniques. Unfortunately I found no other GCC options to achieve what -Os does - the missing 9% can purely be achieved via -Os, with no cherry-picking possible. 
The other, smaller savings are: + # Reduces vmlinux size by 0.25%: + KBUILD_CFLAGS += -fno-caller-saves + + # Reduces vmlinux size by 1.10%: + KBUILD_CFLAGS += -fno-inline-small-functions + + # Reduces vmlinux size by about 0.95%: + KBUILD_CFLAGS += -fno-tree-ch (each of them has to be double checked to make sure it leads to nothing silly and unacceptable - I just blindly tried to find GCC options that impacted kernel code size.) Thanks, Ingo --- arch/x86/Makefile | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 5ba2d9ce82dc..999e94685d12 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -77,10 +77,34 @@ else KBUILD_AFLAGS += -m64 KBUILD_CFLAGS += -m64 + # Pack jump targets tightly, don't align them to the default 16 bytes: + KBUILD_CFLAGS += -falign-jumps=1 + + # Pack functions tightly as well: + KBUILD_CFLAGS += -falign-functions=1 + + # Pack loops tightly as well: + KBUILD_CFLAGS += -falign-loops=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) + # + # Don't guess branch probabilities, follow the code and unlikely()/likely() hints, + # which reduces vmlinux size by about 5.4%: + # + KBUILD_CFLAGS += -fno-guess-branch-probability + + # Reduces vmlinux size by 0.25%: + KBUILD_CFLAGS += -fno-caller-saves + + # Reduces vmlinux size by 1.10%: + KBUILD_CFLAGS += -fno-inline-small-functions + + # Reduces vmlinux size by about 0.95%: + KBUILD_CFLAGS += -fno-tree-ch + # Use -mpreferred-stack-boundary=3 if supported. KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-12 5:47 ` Ingo Molnar @ 2015-04-12 6:20 ` Markus Trippelsdorf 2015-04-12 10:15 ` Ingo Molnar 2015-04-12 7:56 ` Mike Galbraith 1 sibling, 1 reply; 108+ messages in thread From: Markus Trippelsdorf @ 2015-04-12 6:20 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Thomas Gleixner, Jakub Jelinek, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra On 2015.04.12 at 07:47 +0200, Ingo Molnar wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > I thinks its just the no-guess one: > > > > > > text data dec patch reduction > > > 7563475 1781048 10302987 > > > 7192973 1780024 9931461 no-guess -4.8% > > > 7354819 1781048 958464 align-1 -2.7% > > > 7192973 1780024 9931461 no-guess + align-1 -4.8% > > > > Yeah, a 5% code expansion is a big deal. Sadly, it looks like > > 'no-guess' also disables our explicit likely/unlikely handling. > > So I spent some time trying to get as much code size reduction as > possible via GCC optimization options, and the total savings possible > are 10.1%: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.vanilla > 11416805 1617840 1089536 14124181 vmlinux.combo > 10532552 1596080 1089536 13218168 vmlinux.Os If you like to play with more knobs you could explore the various --param options that are listed in the gcc man page... -- Markus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-12 6:20 ` Markus Trippelsdorf @ 2015-04-12 10:15 ` Ingo Molnar 0 siblings, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-12 10:15 UTC (permalink / raw) To: Markus Trippelsdorf Cc: Linus Torvalds, Thomas Gleixner, Jakub Jelinek, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra * Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > On 2015.04.12 at 07:47 +0200, Ingo Molnar wrote: > > > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > I thinks its just the no-guess one: > > > > > > > > text data dec patch reduction > > > > 7563475 1781048 10302987 > > > > 7192973 1780024 9931461 no-guess -4.8% > > > > 7354819 1781048 958464 align-1 -2.7% > > > > 7192973 1780024 9931461 no-guess + align-1 -4.8% > > > > > > Yeah, a 5% code expansion is a big deal. Sadly, it looks like > > > 'no-guess' also disables our explicit likely/unlikely handling. > > > > So I spent some time trying to get as much code size reduction as > > possible via GCC optimization options, and the total savings possible > > are 10.1%: > > > > text data bss dec filename > > 12566391 1617840 1089536 15273767 vmlinux.vanilla > > 11416805 1617840 1089536 14124181 vmlinux.combo > > 10532552 1596080 1089536 13218168 vmlinux.Os > > If you like to play with more knobs you could explore the various > --param options that are listed in the gcc man page... Well, I had a look, they are rather incomplete (at least as far as branch optimizations go), and I wouldn't want to rely on them for production kernel patches in any case, only on the more well-known compiler options. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-12 5:47 ` Ingo Molnar 2015-04-12 6:20 ` Markus Trippelsdorf @ 2015-04-12 7:56 ` Mike Galbraith 1 sibling, 0 replies; 108+ messages in thread From: Mike Galbraith @ 2015-04-12 7:56 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Thomas Gleixner, Jakub Jelinek, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra On Sun, 2015-04-12 at 07:47 +0200, Ingo Molnar wrote: > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > I think it's just the no-guess one: > > > > > > text data dec patch reduction > > > 7563475 1781048 10302987 > > > 7192973 1780024 9931461 no-guess -4.8% > > > 7354819 1781048 10094331 align-1 -2.7% > > > 7192973 1780024 9931461 no-guess + align-1 -4.8% > > > > Yeah, a 5% code expansion is a big deal. Sadly, it looks like > > 'no-guess' also disables our explicit likely/unlikely handling. > > So I spent some time trying to get as much code size reduction as > possible via GCC optimization options, and the total savings possible > are 10.1%: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.vanilla > 11416805 1617840 1089536 14124181 vmlinux.combo > 10532552 1596080 1089536 13218168 vmlinux.Os > > (combo patch attached below.) > > The -Os savings are 19% total - but as you mentioned before, it's > sometimes achieved through unacceptable techniques. > > Unfortunately I found no other GCC options to achieve what -Os does - > the missing 9% can purely be achieved via -Os, with no cherry-picking > possible.
> > The other, smaller savings are: > > + # Reduces vmlinux size by 0.25%: > + KBUILD_CFLAGS += -fno-caller-saves > + > + # Reduces vmlinux size by 1.10%: > + KBUILD_CFLAGS += -fno-inline-small-functions > + > + # Reduces vmlinux size by about 0.95%: > + KBUILD_CFLAGS += -fno-tree-ch > > (each of them has to be double checked to make sure it leads to > nothing silly and unacceptable - I just blindly tried to find GCC > options that impacted kernel code size.) Ew, my i4790 really hated this patch. taskset 0xc pipe-test 1 avg 733.2 727.1 743.1 746.6 737.1 737 KHz 1.000 -gcc_twiddle 713.7 717.9 715.5 718.0 708.7 714 KHz .968 +gcc_twiddle tbench.sh 8 30 3566.14 3560.91 3566.35 3556.57 3549.69 3559.93 MB/S 1.000 -gcc_twiddle 2862.18 2899.51 2888.74 2897.18 2878.63 2885.25 MB/S .810 +gcc_twiddle -Mike ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-11 19:35 ` Linus Torvalds 2015-04-12 5:47 ` Ingo Molnar @ 2015-04-12 7:41 ` Ingo Molnar 1 sibling, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-12 7:41 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Jakub Jelinek, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra, Richard Henderson * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > I thinks its just the no-guess one: > > > > text data dec patch reduction > > 7563475 1781048 10302987 > > 7192973 1780024 9931461 no-guess -4.8% > > 7354819 1781048 958464 align-1 -2.7% > > 7192973 1780024 9931461 no-guess + align-1 -4.8% > > Yeah, a 5% code expansion is a big deal. Sadly, it looks like > 'no-guess' also disables our explicit likely/unlikely handling. > > Damn. If it actually honored likely/unlikely, then we should just do > it - and manually fix up any places where we really care. > > But the fact that it apparently entirely disables not just the > guesses, but our *explicit* likely/unlikely, means that we can't fix > up the mistakes. > > And in many of the hot codepaths that likely/unlikely really does > matter. Some of our hottest paths have known "this basically never > happens" situations that we do *not* want to break up our L1 I$ over. > There's a number of functions that have been optimized to really > generate good code, and "-fno-guess-branch-probability" disables those > manual optimizations. > > So we'd have no way to fix it for the cases that matter. > > Sad. > > It might be worth bringing this up with some gcc people. I added Jakub > to the cc. Any other gcc people suggestions? (Skip to the 'More numbers' section below to see more measurements.) 
So what would be nice to have is if GCC had an optimization option to disable all branch probability heuristics (like -fno-guess-branch-probability), except explicit __builtin_expect() hints (which -fno-guess-branch-probability unfortunately disables). So what would be useful to have is a -fno-branch-probability-heuristics GCC option or so, or a "--param branch-probability-heuristics=0" option. I found one related option: --param predictable-branch-outcome=N predictable-branch-outcome When branch is predicted to be taken with probability lower than this threshold (in percent), then it is considered well predictable. The default is 10. I tried the values 1 and 0, on a -O2 kernel (i.e. guess-branch-probability was enabled), in the hope that it maybe turns off all non-__builtin_expect() branch heuristics, but it only had very minor size impact in the 0.001% range. More numbers: ------------- I also measured the effect of our __builtin_expect() (likely()/unlikely()) branch annotations. As an experiment I mapped both likely()/unlikely() to the four possible __builtin_expect() probability settings: likely(): __builtin_expect(, 1) unlikely(): __builtin_expect(, 0) likely(): __builtin_expect(, 0) unlikely(): __builtin_expect(, 1) likely(): __builtin_expect(, 0) unlikely(): __builtin_expect(, 0) likely(): __builtin_expect(, 1) unlikely(): __builtin_expect(, 1) (the first mapping is the only one that makes sense, and it is the one that is used by the kernel currently.) 
The goal of mixing up the probability mappings was to measure the full 'range' of code size impact that the kernel's explicit hints are causing, versus the impact that GCC's own branch probability heuristics are causing: text data bss dec filename 12566383 1617840 1089536 15273759 vmlinux.expect=10 [==vanilla] 12460250 1617840 1089536 15167626 vmlinux.expect=01 12563332 1617840 1089536 15270708 vmlinux.expect=00 12463035 1617840 1089536 15170411 vmlinux.expect=11 12533382 1617840 1089536 15240758 vmlinux.no-expect 11923529 1617840 1089536 14630905 vmlinux.-fno-guess-branch-probability [ This was done on a v4.1-rc7-ish vanilla -O2 kernel (no alignment tweaks), using 'make defconfig' and 'make kvmconfig' on x86-64 to turn it into a minimally bootable kernel. I used GCC 4.9.1. ] The 'vmlinux.no-expect' kernel is a vanilla kernel with all __builtin_expect() hints removed: i.e. GCC heuristics are deciding all branch probabilities in the kernel. So the code size 'range' that the kernel's own probability hints are moving in is around 0.4%: - the 'vanilla' mapping is the largest (not unexpectedly) - the 'inverse' mapping is the smallest, by 0.8% - the 00 and 11 mappings are about mid range, 0.4% away from the extremes - the 'no-expect' mapping is also mid-range, which too is somewhat expected: GCC heuristics would pick only part of our hints. But note how the 'no GCC heuristics at all' setting reduces size brutally, by 5.4%. So if we were able to do that, while keeping __builtin_expect(), and used our own heuristics only via __builtin_expect(), then we could still possibly expect a total code size shrinkage of around 5.0%: -5.4% code size reduction that comes from removal of all hints, +0.4% code size increase between the 'no-expect' and the '10' mappings in the experiment above.
Note that even if I apply all the align=1 patches, -fno-guess-branch-probability on top of that still gives me 2.6% code size savings: text data bss dec filename 12566383 1617840 1089536 15273759 vmlinux.expect=10 [==vanilla] 11923529 1617840 1089536 14630905 vmlinux.-fno-guess-branch-probability 11903663 1617840 1089536 14611039 vmlinux.align=1 11646102 1617840 1089536 14353478 vmlinux.align=1+fno-guess-branch-probability The smallest vmlinux has: - about 41,000 functions ('tT' symbols in System.map) - about 300,000 branch/jump instructions (all objdump -d asm mnemonics starting with 'j') - about 165,000 function calls (all objdump -d asm mnemonics matching 'call') - about 2,330,000 instructions (all objdump -d asm mnemonics) With align=1, GCC's heuristics added about 1,200 new branches and 1,350 new function calls, 76,900 instructions, altogether +257,561 bytes of code: # of x86 instructions 2549742 vmlinux.expect=10 [==vanilla] 2391069 vmlinux.-fno-guess-branch-probability 2411568 vmlinux.align=1 2334595 vmlinux.align=1+fno-guess-branch-probability (For completeness, the patch generating the smallest kernel is attached below.) The takeaway: the code size savings above are 7.9%. Even if we restored __builtin_expect() hints, we'd probably still see a combined 7.5% code shrinkage. That would be a rather significant I$ win, with very little cost that I can see! 
Thanks, Ingo --- arch/x86/Makefile | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 5ba2d9ce82dc..a6d3feb90b97 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -77,10 +77,25 @@ else KBUILD_AFLAGS += -m64 KBUILD_CFLAGS += -m64 + # Pack jump targets tightly, don't align them to the default 16 bytes: + KBUILD_CFLAGS += -falign-jumps=1 + + # Pack functions tightly as well: + KBUILD_CFLAGS += -falign-functions=1 + + # Pack loops tightly as well: + KBUILD_CFLAGS += -falign-loops=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) + # + # Don't guess branch probabilities, follow the code and unlikely()/likely() hints, + # which reduces vmlinux size by about 5.4%: + # + KBUILD_CFLAGS += -fno-guess-branch-probability + # Use -mpreferred-stack-boundary=3 if supported. KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-11 18:57 ` Thomas Gleixner 2015-04-11 19:35 ` Linus Torvalds @ 2015-04-12 8:07 ` Ingo Molnar 2015-04-12 21:11 ` Jan Hubicka 2 siblings, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-12 8:07 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Denys Vlasenko, Borislav Petkov, Tim Chen, Andy Lutomirski, Jason Low, Brian Gerst, Aswin Chandramouleeswaran, Paul E. McKenney, Davidlohr Bueso, Peter Zijlstra, H. Peter Anvin, LKML, Peter Zijlstra * Thomas Gleixner <tglx@linutronix.de> wrote: > > How much of that 5% comes from code alignment? Or was this on > > *top* of the 1-byte alignment test? > > I think it's just the no-guess one: > > text data dec patch reduction > > 7563475 1781048 10302987 > > 7192973 1780024 9931461 no-guess -4.8% > > 7354819 1781048 10094331 align-1 -2.7% > > 7192973 1780024 9931461 no-guess + align-1 -4.8% > > So with the no-guess applied the align-1 does not matter anymore. What kernel config and GCC version did you use? I used x86-64 defconfig+kvmconfig with GCC 4.9.1, and it gave me: text filename reduction 12566383 vmlinux.expect=10 [==vanilla] 11923529 vmlinux.-fno-guess-branch-probability -5.4% 11903663 vmlinux.align=1 -5.6% 11646102 vmlinux.align=1+fno-guess-branch-probability -7.9% So there still appears to be an additional -2.5% to be gained from turning off GCC branch heuristics. x86 defconfig was derived from a distro kernel config and is still pretty close to what distros typically enable, so it's a good 'typical' config to use for testing.
To double check that assumption I also tested a distro kernel .config and looked at vmlinux sizes: text filename reduction 12612201 vmlinux.expect=10 [==vanilla] 12107614 vmlinux.-fno-guess-branch-probability -4.1% 12021817 vmlinux.align=1 -4.9% 11846771 vmlinux.align=1+fno-guess-branch-probability -6.5% this was cloned from a different major Linux distro than the one the original x86 defconfig was derived from - still the vmlinux sizes are pretty similar. So x86 'make defconfig' measurements are pretty representative if you are looking for a quick, independent way to measure 'typical Linux distro' kernel characteristics. ( The only bigger difference is that FUNCTION_TRACER was turned on in the distro config - which bloats function prologues a bit and thus reduces the relative savings a bit. ) Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Turn off GCC branch probability heuristics 2015-04-11 18:57 ` Thomas Gleixner 2015-04-11 19:35 ` Linus Torvalds 2015-04-12 8:07 ` Ingo Molnar @ 2015-04-12 21:11 ` Jan Hubicka 2 siblings, 0 replies; 108+ messages in thread From: Jan Hubicka @ 2015-04-12 21:11 UTC (permalink / raw) To: linux-kernel Hello, for me as a GCC developer, this is definitely an interesting observation. Let me briefly explain what happens here. -fguess-branch-probability does a lot more than just BB reordering. The way GCC works is that it first guesses the probability of every branch and then uses the probabilities to estimate a profile (that is, the frequency at which every BB is executed). This profile is then used throughout the optimization queue to make decisions that either bloat the code (such as loop header copying) or that pessimize one code path and improve another (such as spill code placement). This is based on T Ball, JR Larus: Branch prediction for free and Y Wu, JR Larus: Static branch frequency and program profile analysis. In GCC it was implemented in 2000 (http://www.ucw.cz/~hubicka/papers/proj/index.html) and the earlier ad-hoc heuristics (that, for example, placed spill code depending on the loop depth of the basic blocks) were slowly replaced by profile-driven ones. (Most compilers nowadays work this way.) This however also means that the compiler is quite dependent on the profile. -fno-guess-branch-probability will leave all frequencies at 0 and all edge probabilities at 0%. This will make most of these heuristics shut themselves off or possibly behave funny. Disabling all the heuristics except for builtin_expect will likely result in quite funny profiles, too, where frequencies will drop to 50% after every branch. I am very interested in getting -O2 code size down (GCC 5 has quite a few improvements in this area, mostly motivated by C++ code and LTO). We should try to identify which optimizations cause most of the bloat and start from those.
http://www.ucw.cz/~hubicka/papers/amd64/node4.html has 13-year-old data on performance and code size with this flag. Back then, branch prediction improved performance by 4.1% on SPECint at the expense of a 5.66% code size cost. A very large portion of the bloat was the basic block reordering. I will try to re-run the benchmarks. In 2003 the largest portion of growth came from basic block reordering (-freorder-blocks), which does some tail code duplication that helped the AMD K7/K8 generation chips it was tested on because of their decoder limitations. The basic block reordering pass was implemented in 2000 based on http://dl.acm.org/citation.cfm?id=305178 and while it has been improved in the meantime (primarily by Google adding special cases and hot/cold partitioning), it is still the same software trace algorithm. So it may be time to rewrite it, which should not be too hard. Would it be possible to identify the functions that change most and look into those? Honza ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar ` (4 preceding siblings ...) 2015-04-11 9:20 ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar @ 2015-05-14 11:59 ` Denys Vlasenko 2015-05-14 18:17 ` Ingo Molnar 2015-05-15 9:39 ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar 6 siblings, 1 reply; 108+ messages in thread From: Denys Vlasenko @ 2015-05-14 11:59 UTC (permalink / raw) To: Ingo Molnar, Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 04/10/2015 02:08 PM, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > >> So restructure the loop a bit, to get much tighter code: >> >> 0000000000000030 <mutex_spin_on_owner.isra.5>: >> 30: 55 push %rbp >> 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx >> 38: 00 00 >> 3a: 48 89 e5 mov %rsp,%rbp >> 3d: 48 39 37 cmp %rsi,(%rdi) >> 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30> >> 42: 8b 46 28 mov 0x28(%rsi),%eax >> 45: 85 c0 test %eax,%eax >> 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26> >> 49: f3 90 pause >> 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax >> 52: a8 08 test $0x8,%al >> 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd> >> 56: 31 c0 xor %eax,%eax >> 58: 5d pop %rbp >> 59: c3 retq >> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) >> 60: b8 01 00 00 00 mov $0x1,%eax >> 65: 5d pop %rbp >> 66: c3 retq > > Btw., totally off topic, the following NOP caught my attention: > >> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > > That's a dead NOP that boats the function a bit, added for the 16 byte > alignment of one of the jump targets. 
> > I realize that x86 CPU manufacturers recommend 16-byte jump target > alignments (it's in the Intel optimization manual), but the cost of > that is very significant: > > text data bss dec filename > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte > > By using 1 byte jump target alignment (i.e. no alignment at all) we > get an almost 3% reduction in kernel size (!) - and a probably similar > reduction in I$ footprint. > > So I'm wondering, is the 16 byte jump target optimization suggestion > really worth this price? The patch below boots fine and I've not > measured any noticeable slowdown, but I've not tried hard. > > Now, the usual justification for jump target alignment is the > following: with 16 byte instruction-cache cacheline sizes, if a > forward jump is aligned to cacheline boundary then prefetches will > start from a new cacheline. > > But I think that argument is flawed for typical optimized kernel code > flows: forward jumps often go to 'cold' (uncommon) pieces of code, and > aligning cold code to cache lines does not bring a lot of advantages > (they are uncommon), while it causes collateral damage: > > - their alignment 'spreads out' the cache footprint, it shifts > followup hot code further out > > - plus it slows down even 'cold' code that immediately follows 'hot' > code (like in the above case), which could have benefited from the > partial cacheline that comes off the end of hot code. > > What do you guys think about this? I think we should seriously > consider relaxing our alignment defaults. Looks like nobody objected. I think it's ok to submit this patch for real. > + # Align jump targets to 1 byte, not the default 16 bytes: > + KBUILD_CFLAGS += -falign-jumps=1 ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-05-14 11:59 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko @ 2015-05-14 18:17 ` Ingo Molnar 2015-05-14 19:04 ` Denys Vlasenko 2015-05-15 15:45 ` Josh Triplett 0 siblings, 2 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-14 18:17 UTC (permalink / raw) To: Denys Vlasenko Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra * Denys Vlasenko <dvlasenk@redhat.com> wrote: > > What do you guys think about this? I think we should seriously > > consider relaxing our alignment defaults. > > Looks like nobody objected. I think it's ok to submit > this patch for real. Yeah, so my plan is to apply the following three changes from that discussion: --- tip.orig/arch/x86/Makefile +++ tip/arch/x86/Makefile @@ -77,6 +77,15 @@ else KBUILD_AFLAGS += -m64 KBUILD_CFLAGS += -m64 + # Pack jump targets tightly, don't align them to the default 16 bytes: + KBUILD_CFLAGS += -falign-jumps=1 + + # Pack functions tightly as well: + KBUILD_CFLAGS += -falign-functions=1 + + # Pack loops tightly as well: + KBUILD_CFLAGS += -falign-loops=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ... and not do -fno-guess-branch-probability, because it destroys likely()/unlikely() annotations. Which is a pity, considering the size effect on defconfig: text data bss dec filename 12566383 1617840 1089536 15273759 vmlinux.expect=10 [==vanilla] 11923529 1617840 1089536 14630905 vmlinux.-fno-guess-branch-probability 11903663 1617840 1089536 14611039 vmlinux.align=1 11646102 1617840 1089536 14353478 vmlinux.align=1+fno-guess-branch-probability I.e. 
2.6% of savings on top of the above three patches, while the effect of our hot/cold branch annotations is only around 0.4%, so if GCC preserved our annotations under -fno-guess-branch-probability we'd be good by at least 2%. But GCC doesn't. There were also these other changes I tested: + # Reduces vmlinux size by 0.25%: + KBUILD_CFLAGS += -fno-caller-saves + + # Reduces vmlinux size by 1.10%: + KBUILD_CFLAGS += -fno-inline-small-functions + + # Reduces vmlinux size by about 0.95%: + KBUILD_CFLAGS += -fno-tree-ch We could maybe consider -fno-caller-saves. What do you think about that option? -fno-inline-small-functions is probably a bad idea, and -fno-tree-ch is probably a bad idea as well and is a dangerously rare option in any case that could break in unexpected ways. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
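[Editorial note: the hot/cold branch annotations referred to above are the kernel's likely()/unlikely() macros, which are thin wrappers around GCC's __builtin_expect(). A minimal user-space sketch of the idiom follows; parse_len() is a made-up example function, not kernel code.]

```c
#include <assert.h>

/* Sketch of the kernel's hot/cold branch annotations: they expand to
 * GCC's __builtin_expect(), which lets the compiler lay out the
 * unlikely branch out of line, away from the hot path. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Example: the error path is annotated as cold. */
static int parse_len(int len)
{
	if (unlikely(len < 0))	/* cold path, placed out of line */
		return -1;
	return len;		/* hot path falls straight through */
}
```

Per the measurements above, -fno-guess-branch-probability makes GCC stop honoring these hints, which is why that flag was rejected despite its additional size win.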
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-05-14 18:17 ` Ingo Molnar @ 2015-05-14 19:04 ` Denys Vlasenko 2015-05-14 19:44 ` Ingo Molnar 0 siblings, 1 reply; 108+ messages in thread From: Denys Vlasenko @ 2015-05-14 19:04 UTC (permalink / raw) To: Ingo Molnar Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On 05/14/2015 08:17 PM, Ingo Molnar wrote: > There were also these other changes I tested: > > + # Reduces vmlinux size by 0.25%: > + KBUILD_CFLAGS += -fno-caller-saves > + > + # Reduces vmlinux size by 1.10%: > + KBUILD_CFLAGS += -fno-inline-small-functions > + > + # Reduces vmlinux size by about 0.95%: > + KBUILD_CFLAGS += -fno-tree-ch > > We could maybe consider -fno-caller-saves. What do you think about > that option? Quick googling didn't even find what it does. It would need a comment :) ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-05-14 19:04 ` Denys Vlasenko @ 2015-05-14 19:44 ` Ingo Molnar 0 siblings, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-14 19:44 UTC (permalink / raw) To: Denys Vlasenko Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, Borislav Petkov, Andy Lutomirski, Brian Gerst, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra * Denys Vlasenko <dvlasenk@redhat.com> wrote: > On 05/14/2015 08:17 PM, Ingo Molnar wrote: > > There were also these other changes I tested: > > > > + # Reduces vmlinux size by 0.25%: > > + KBUILD_CFLAGS += -fno-caller-saves > > + > > + # Reduces vmlinux size by 1.10%: > > + KBUILD_CFLAGS += -fno-inline-small-functions > > + > > + # Reduces vmlinux size by about 0.95%: > > + KBUILD_CFLAGS += -fno-tree-ch > > > > We could maybe consider -fno-caller-saves. What do you think about > > that option? > > Quick googling din't even find what it does. > It would need a comment :) The GCC manpage says: -fcaller-saves Enable allocation of values to registers that are clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls. Such allocation is done only when it seems to result in better code. This option is always enabled by default on certain machines, usually those which have no call-preserved registers to use instead. Enabled at levels -O2, -O3, -Os. but I have not checked the disassembly. 0.25% defconfig reduction still seems significant. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-05-14 18:17 ` Ingo Molnar 2015-05-14 19:04 ` Denys Vlasenko @ 2015-05-15 15:45 ` Josh Triplett 2015-05-17 5:34 ` Ingo Molnar 1 sibling, 1 reply; 108+ messages in thread From: Josh Triplett @ 2015-05-15 15:45 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel On Mon, Sep 17, 2001 at 07:00:00AM +0000, Ingo Molnar wrote: > * Denys Vlasenko <dvlasenk@redhat.com> wrote: > > > What do you guys think about this? I think we should seriously > > > consider relaxing our alignment defaults. > > > > Looks like nobody objected. I think it's ok to submit > > this patch for real. > > Yeah, so my plan is to apply the following three changes from that > discussion: > > --- tip.orig/arch/x86/Makefile > +++ tip/arch/x86/Makefile > @@ -77,6 +77,15 @@ else > KBUILD_AFLAGS += -m64 > KBUILD_CFLAGS += -m64 > > + # Pack jump targets tightly, don't align them to the default 16 bytes: > + KBUILD_CFLAGS += -falign-jumps=1 > + > + # Pack functions tightly as well: > + KBUILD_CFLAGS += -falign-functions=1 > + > + # Pack loops tightly as well: > + KBUILD_CFLAGS += -falign-loops=1 > + > # Don't autogenerate traditional x87 instructions > KBUILD_CFLAGS += $(call cc-option,-mno-80387) > KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) It looks like the patch you applied to the tip tree only included one of these (-falign-jumps=1), not the other two. Also, you've only applied these to 64-bit; could you please apply them to both 32-bit and 64-bit, since many embedded systems aiming for small code size use 32-bit? (Unless 32-bit already defaults to these.) Have you considered including -falign-labels=1 as well? Does that make a difference on top of the other three? Finally, it looks like -Os already implies all four of those, as well as a few others, so unfortunately the code size benefits don't actually apply to the tiniest kernels, which already effectively incorporate this change. Oh well.
- Josh Triplett ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-05-15 15:45 ` Josh Triplett @ 2015-05-17 5:34 ` Ingo Molnar 2015-05-17 19:18 ` Josh Triplett 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-05-17 5:34 UTC (permalink / raw) To: Josh Triplett; +Cc: linux-kernel * Josh Triplett <josh@joshtriplett.org> wrote: > On Mon, Sep 17, 2001 at 07:00:00AM +0000, Ingo Molnar wrote: > > * Denys Vlasenko <dvlasenk@redhat.com> wrote: > > > > What do you guys think about this? I think we should seriously > > > > consider relaxing our alignment defaults. > > > > > > Looks like nobody objected. I think it's ok to submit > > > this patch for real. > > > > Yeah, so my plan is to apply the following three changes from that > > discussion: > > > > --- tip.orig/arch/x86/Makefile > > +++ tip/arch/x86/Makefile > > @@ -77,6 +77,15 @@ else > > KBUILD_AFLAGS += -m64 > > KBUILD_CFLAGS += -m64 > > > > + # Pack jump targets tightly, don't align them to the default 16 bytes: > > + KBUILD_CFLAGS += -falign-jumps=1 > > + > > + # Pack functions tightly as well: > > + KBUILD_CFLAGS += -falign-functions=1 > > + > > + # Pack loops tightly as well: > > + KBUILD_CFLAGS += -falign-loops=1 > > + > > # Don't autogenerate traditional x87 instructions > > KBUILD_CFLAGS += $(call cc-option,-mno-80387) > > KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) > > It looks like the patch you applied to the tip tree only included one of > these (-falign-jumps=1), not the other two. It's three separate patches, in case there are any regressions. > Also, you've only applied these to 64-bit; could you please apply > them to both 32-bit and 64-bit, since many embedded systems aiming > for small code size use 32-bit? First things first - 64-bit is getting far more testing these days than 32-bit. > Have you considered including -falign-labels=1 as well? Does that > make a difference on top of the other three?
So isn't the default on x86 for -falign-labels already 1? Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-05-17 5:34 ` Ingo Molnar @ 2015-05-17 19:18 ` Josh Triplett 2015-05-18 6:48 ` Ingo Molnar 0 siblings, 1 reply; 108+ messages in thread From: Josh Triplett @ 2015-05-17 19:18 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel On Sun, May 17, 2015 at 07:34:29AM +0200, Ingo Molnar wrote: > > * Josh Triplett <josh@joshtriplett.org> wrote: > > > On Mon, Sep 17, 2001 at 07:00:00AM +0000, Ingo Molnar wrote: > > > * Denys Vlasenko <dvlasenk@redhat.com> wrote: > > > > > What do you guys think about this? I think we should seriously > > > > > consider relaxing our alignment defaults. > > > > > > > > Looks like nobody objected. I think it's ok to submit > > > > this patch for real. > > > > > > Yeah, so my plan is to apply the following three changes from that > > > discussion: > > > > > > --- tip.orig/arch/x86/Makefile > > > +++ tip/arch/x86/Makefile > > > @@ -77,6 +77,15 @@ else > > > KBUILD_AFLAGS += -m64 > > > KBUILD_CFLAGS += -m64 > > > > > > + # Pack jump targets tightly, don't align them to the default 16 bytes: > > > + KBUILD_CFLAGS += -falign-jumps=1 > > > + > > > + # Pack functions tightly as well: > > > + KBUILD_CFLAGS += -falign-functions=1 > > > + > > > + # Pack loops tightly as well: > > > + KBUILD_CFLAGS += -falign-loops=1 > > > + > > > # Don't autogenerate traditional x87 instructions > > > KBUILD_CFLAGS += $(call cc-option,-mno-80387) > > > KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) > > > > It looks like the patch you applied to the tip tree only included one of > > these (-falign-jumps=1), not the other two. > > It's three separate patches, in case there are any regressions. Fair enough. At the time I sent my mail, only the first of the three had shown up on LKML. > > Also, you've only applied these to 64-bit; could you please apply > > them to both 32-bit and 64-bit, since many embedded systems aiming > > for small code size use 32-bit?
(Unless 32-bit already defaults to > > these.) > > First things first - 64-bit is getting far more testing these days > than 32-bit. What testing do you want to see on these patches before applying them to 32-bit as well? > > Have you considered including -falign-labels=1 as well? Does that > > make a difference on top of the other three. > > So isn't the default on x86 for -falign-labels already 1? GCC's manual says that -O2 and above turn on -falign-labels, which has a machine-specific default alignment. A fair bit of digging turned up gcc/config/i386/i386.c, which does seem to have processor-specific defaults for the other three but not for align-labels. So it looks like it does indeed use the general default of 1. Nevermind. - Josh Triplett ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86: Align jump targets to 1 byte boundaries 2015-05-17 19:18 ` Josh Triplett @ 2015-05-18 6:48 ` Ingo Molnar 0 siblings, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-05-18 6:48 UTC (permalink / raw) To: Josh Triplett; +Cc: linux-kernel * Josh Triplett <josh@joshtriplett.org> wrote: > > > Also, you've only applied these to 64-bit; could you please > > > apply them to both 32-bit and 64-bit, since many embedded > > > systems aiming for small code size use 32-bit? (Unless 32-bit > > > already defaults to these.) > > > > First things first - 64-bit is getting far more testing these days > > than 32-bit. > > What testing do you want to see on these patches before applying > them to 32-bit as well? Not much specific testing is needed I think, basically just 'no regressions for a few weeks'. These alignment changes should be pretty safe, since -Os (via CONFIG_CC_OPTIMIZE_FOR_SIZE=y) already enabled these. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* [tip:x86/asm] x86: Align jump targets to 1-byte boundaries 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar ` (5 preceding siblings ...) 2015-05-14 11:59 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko @ 2015-05-15 9:39 ` tip-bot for Ingo Molnar 6 siblings, 0 replies; 108+ messages in thread From: tip-bot for Ingo Molnar @ 2015-05-15 9:39 UTC (permalink / raw) To: linux-tip-commits Cc: mingo, tglx, linux-kernel, dave, a.p.zijlstra, hpa, jason.low2, aswin, luto, tim.c.chen, torvalds, paulmck, peterz, bp, dvlasenk, brgerst Commit-ID: be6cb02779ca74d83481f017db21578cfe92891c Gitweb: http://git.kernel.org/tip/be6cb02779ca74d83481f017db21578cfe92891c Author: Ingo Molnar <mingo@kernel.org> AuthorDate: Fri, 10 Apr 2015 14:08:46 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Fri, 15 May 2015 11:04:28 +0200 x86: Align jump targets to 1-byte boundaries The following NOP in a hot function caught my attention: > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) That's a dead NOP that bloats the function a bit, added for the default 16-byte alignment that GCC applies for jump targets. I realize that x86 CPU manufacturers recommend 16-byte jump target alignments (it's in the Intel optimization manual), to help their relatively narrow decoder prefetch alignment and uop cache constraints, but the cost of that is very significant: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte By using 1-byte jump target alignment (i.e. no alignment at all) we get an almost 3% reduction in kernel size (!) - and a probably similar reduction in I$ footprint. Now, the usual justification for jump target alignment is the following: - modern decoders tend to have 16-byte (effective) decoder prefetch windows. 
(AMD documents it higher but measurements suggest the effective prefetch window on current uarchs is still around 16 bytes) - on Intel there's also the uop-cache with cachelines that have 16-byte granularity and limited associativity. - older x86 uarchs had a penalty for decoder fetches that crossed 16-byte boundaries. These limits are mostly gone from recent uarchs. So if a forward jump target is aligned to cacheline boundary then prefetches will start from a new prefetch-cacheline and there's higher chance for decoding in fewer steps and packing tightly. But I think that argument is flawed for typical optimized kernel code flows: forward jumps often go to 'cold' (uncommon) pieces of code, and aligning cold code to cache lines does not bring a lot of advantages (they are uncommon), while it causes collateral damage: - their alignment 'spreads out' the cache footprint, it shifts followup hot code further out - plus it slows down even 'cold' code that immediately follows 'hot' code (like in the above case), which could have benefited from the partial cacheline that comes off the end of hot code. But even in the cache-hot case the 16 byte alignment brings disadvantages: - it spreads out the cache footprint, possibly making the code fall out of the L1 I$. - On Intel CPUs, recent microarchitectures have plenty of uop cache (typically doubling every 3 years) - while the size of the L1 cache grows much less aggressively. So workloads are rarely uop cache limited. The only situation where alignment might matter is tight loops that could fit into a single 16 byte chunk - but those are pretty rare in the kernel: if they exist they tend to be pointer chasing or generic memory ops, which both tend to be cache miss (or cache allocation) intensive and are not decoder bandwidth limited. 
So the balance of arguments strongly favors packing kernel instructions tightly versus maximizing for decoder bandwidth: this patch changes the jump target alignment from 16 bytes to 1 byte (tightly packed, unaligned). Acked-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20150410120846.GA17101@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Makefile b/arch/x86/Makefile index c7c3187..ca17e5f 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -77,6 +77,9 @@ else KBUILD_AFLAGS += -m64 KBUILD_CFLAGS += -m64 + # Align jump targets to 1 byte, not the default 16 bytes: + KBUILD_CFLAGS += -falign-jumps=1 + # Don't autogenerate traditional x87 instructions KBUILD_CFLAGS += $(call cc-option,-mno-80387) KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387) ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH] x86/uaccess: Implement get_kernel() 2015-04-10 11:14 ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar 2015-04-10 11:27 ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar @ 2015-04-10 11:34 ` Peter Zijlstra 2015-04-10 18:04 ` Ingo Molnar 2015-04-10 17:49 ` Linus Torvalds 2 siblings, 1 reply; 108+ messages in thread From: Peter Zijlstra @ 2015-04-10 11:34 UTC (permalink / raw) To: Ingo Molnar Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Fri, Apr 10, 2015 at 01:14:27PM +0200, Ingo Molnar wrote: > +/* > + * Simple copy-from-possibly-faulting-kernel-addresses method that > + * avoids the STAC/CLAC SMAP overhead. > + * > + * NOTE: this does not propagate the error code of faulting kernel > + * addresses properly. You can recover it via uaccess_catch() > + * if you really need to. > + */ > +#define get_kernel(dst, src) \ > +do { \ > + typeof(*(src)) __val; \ Should we make that: typeof(*(src)) __val = (dst); > + \ > + switch (sizeof(__val)) { \ > + case 1: __get_kernel_asm_ex(__val, src, "b", "b", "=q"); break; \ > + case 2: __get_kernel_asm_ex(__val, src, "w", "w", "=r"); break; \ > + case 4: __get_kernel_asm_ex(__val, src, "l", "k", "=r"); break; \ > + case 8: __get_kernel_asm_ex(__val, src, "q", " ", "=r"); break; \ > + default: __get_kernel_BUILD_ERROR(); \ > + } \ > + (dst) = __val; \ > +} while (0) Such that when we fault, the value is unmodified? The way it is we'll assign whatever was on stack for __val, which seems undesirable, no? ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86/uaccess: Implement get_kernel() 2015-04-10 11:34 ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra @ 2015-04-10 18:04 ` Ingo Molnar 0 siblings, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 18:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul E. McKenney, Linus Torvalds, Jason Low, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Apr 10, 2015 at 01:14:27PM +0200, Ingo Molnar wrote: > > +/* > > + * Simple copy-from-possibly-faulting-kernel-addresses method that > > + * avoids the STAC/CLAC SMAP overhead. > > + * > > + * NOTE: this does not propagate the error code of faulting kernel > > + * addresses properly. You can recover it via uaccess_catch() > > + * if you really need to. > > + */ > > +#define get_kernel(dst, src) \ > > +do { \ > > + typeof(*(src)) __val; \ > > Should we make that: > > typeof(*(src)) __val = (dst); > > > + \ > > + switch (sizeof(__val)) { \ > > + case 1: __get_kernel_asm_ex(__val, src, "b", "b", "=q"); break; \ > > + case 2: __get_kernel_asm_ex(__val, src, "w", "w", "=r"); break; \ > > + case 4: __get_kernel_asm_ex(__val, src, "l", "k", "=r"); break; \ > > + case 8: __get_kernel_asm_ex(__val, src, "q", " ", "=r"); break; \ > > + default: __get_kernel_BUILD_ERROR(); \ > > + } \ > > + (dst) = __val; \ > > +} while (0) > > Such that when we fault, the value is unmodified? The way it is we'll > assign whatever was on stack for __val, which seems undesirable, no? Yes, indeed. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
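[Editorial note: Peter's point is that initializing __val to the old dst value makes a faulting access leave dst unchanged instead of assigning stack junk. A user-space model of that behavior follows; read_maybe() and get_kernel_sketch() are hypothetical stand-ins for the asm and exception-table machinery, not the kernel API.]

```c
#include <assert.h>

/* Stand-in for the faulting load: returns -1 instead of faulting
 * when the pointer is bad, and leaves *out untouched in that case. */
static int read_maybe(const int *src, int *out)
{
	if (!src)
		return -1;	/* models a faulting kernel address */
	*out = *src;
	return 0;
}

/* Sketch of get_kernel() with the suggested initialization: __val
 * starts out as the old dst, so a "fault" leaves dst unchanged. */
#define get_kernel_sketch(dst, src)		\
do {						\
	__typeof__(*(src)) __val = (dst);	\
	read_maybe((src), &__val);		\
	(dst) = __val;				\
} while (0)
```

Without the `= (dst)` initializer, the failure path would copy whatever happened to be on the stack into dst, which is the undesirable behavior Peter flags.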
* Re: [PATCH] x86/uaccess: Implement get_kernel() 2015-04-10 11:14 ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar 2015-04-10 11:27 ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar 2015-04-10 11:34 ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra @ 2015-04-10 17:49 ` Linus Torvalds 2015-04-10 18:04 ` Ingo Molnar 2 siblings, 1 reply; 108+ messages in thread From: Linus Torvalds @ 2015-04-10 17:49 UTC (permalink / raw) To: Ingo Molnar Cc: Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Fri, Apr 10, 2015 at 4:14 AM, Ingo Molnar <mingo@kernel.org> wrote: > >> >> The next patch will implement efficient __copy_from_kernel_inatomic() >> for x86. > > The patch below does that. Note, for simplicity I've changed the > interface to 'get_kernel()' (will propagate this through the other > patches as well). So I think this needs a couple of changes: - That "get_kernel()" name is not clear enough about what the issue is. I think it should make it clearer that it's an unsafe access that could fault, and that we don't want a user access. So maybe "get_kernel_stalepointer()" or something like that. - you're just re-implementing "__get_user_asm_ex()" afaik. Try to share the code, renaming it to be something common. - I think we should look at sharing the code for __get_user(). Could we do something like this: (a) implement the basic "load with exceptions" as __get_with_exception() (b) #define get_kernel_stalepointer() __get_with_exception (c) make "__get_user()" be "stac(); __get_with_exception(); clac()" - finally, I wonder what the exact semantics of "get_kernel_stalepointer()" should be. 
I could well imagine that what we should do is #ifdef CONFIG_DEBUG_PAGEALLOC #define get_kernel_stalepointer(x,ptr) ((x)=READ_ONCE(*(ptr)), 0) #else #define get_kernel_stalepointer(x,ptr) __get_with_exception(x,ptr) #endif because I think it's reasonable to require that the kernel pointer is _stale_, and not "invalid". IOW, guarantee that it *has* been a kernel pointer, and that the only reason it would trap is for DEBUG_PAGEALLOC. That last point might need to be verified with hotplug memory. I think hotplug memory does a stop_machine() or similar, but I'm not sure. Hmm? Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86/uaccess: Implement get_kernel() 2015-04-10 17:49 ` Linus Torvalds @ 2015-04-10 18:04 ` Ingo Molnar 2015-04-10 18:09 ` Linus Torvalds 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 18:04 UTC (permalink / raw) To: Linus Torvalds Cc: Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Apr 10, 2015 at 4:14 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > >> > >> The next patch will implement efficient > >> __copy_from_kernel_inatomic() for x86. > > > > The patch below does that. Note, for simplicity I've changed the > > interface to 'get_kernel()' (will propagate this through the other > > patches as well). > > So I think this needs a couple of changes: > > - That "get_kernel()" name is not clear enough about what the issue > is. I think it should make it clearer that it's an unsafe access > that could fault, and that we don't want a user access. > > So maybe "get_kernel_stalepointer()" or something like that. Ok. > - you're just re-implementing "__get_user_asm_ex()" afaik. Try to > share the code, renaming it to be something common. Ok, will try that. > - I think we should look at sharing the code for __get_user(). Could > we do something like this: > > (a) implement the basic "load with exceptions" as __get_with_exception() > (b) #define get_kernel_stalepointer() __get_with_exception > (c) make "__get_user()" be "stac(); __get_with_exception(); clac()" Will try. The only possible complication there might be the way we don't recover the error code in the _ex() variants, that's actually a pretty important aspect to making this zero cost. Since the error code comes back from assembly code in some cases we cannot make it go away in the _ex() case. So I'm not sure we can share code between _ex() and the normal methods - but we can certainly share with the _ex() variants. 
> - finally, I wonder what the exact semantics of > "get_kernel_stalepointer()" should be. I could well imagine that > what we should do is > > #ifdef CONFIG_DEBUG_PAGEALLOC > #define get_kernel_stalepointer(x,ptr) ((x)=READ_ONCE(*(ptr)), 0) > #else > #define get_kernel_stalepointer(x,ptr) __get_with_exception(x,ptr) > #endif I guess you meant that to be the other way around? > because I think it's reasonable to require that the kernel pointer > is _stale_, and not "invalid". [...] Absolutely, and I think this is a hard requirement: we don't (ever) want to dereference random addresses, due to possible mmio side effects. > [...] IOW, guarantee that it *has* been a kernel pointer, and that > the only reason it would trap is for DEBUG_PAGEALLOC. Yes. > That last point might need to be verified with hotplug memory. I > think hotplug memory does a stop_machine() or similar, but I'm not > sure. So memory hotplug does it in a pretty simple fashion IIRC: only such zones are movable and hot-unpluggable which don't contain kmalloc()-able or gfp()-able memory - they are limited purpose memory pools only usable for user pages and the page cache. So stale pointers should never point to hot-unpluggable memory. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] x86/uaccess: Implement get_kernel() 2015-04-10 18:04 ` Ingo Molnar @ 2015-04-10 18:09 ` Linus Torvalds 0 siblings, 0 replies; 108+ messages in thread From: Linus Torvalds @ 2015-04-10 18:09 UTC (permalink / raw) To: Ingo Molnar Cc: Paul E. McKenney, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Fri, Apr 10, 2015 at 11:04 AM, Ingo Molnar <mingo@kernel.org> wrote: > >> >> So maybe "get_kernel_stalepointer()" or something like that. > > Ok. Note that I'm not at all married to that particular name, I just want something that describes the requirements a bit more. >> - I think we should look at sharing the code for __get_user(). Could >> we do something like this: >> >> (a) implement the basic "load with exceptions" as __get_with_exception() >> (b) #define get_kernel_stalepointer() __get_with_exception >> (c) make "__get_user()" be "stac(); __get_with_exception(); clac()" > > Will try. > > The only possible complication there might be the way we don't recover > the error code in the _ex() variants, that's actually a pretty > important aspect to making this zero cost. Yeah. You may be right. What I would really want is that "asm goto" with an output register, but gcc doesn't do that. Then we could improve on the whole try/catch thing too, so that it would just jump to the catch.. >> #ifdef CONFIG_DEBUG_PAGEALLOC >> #define get_kernel_stalepointer(x,ptr) ((x)=READ_ONCE(*(ptr)), 0) >> #else >> #define get_kernel_stalepointer(x,ptr) __get_with_exception(x,ptr) >> #endif > > I guess you meant that to be the other way around? Yes I did. Linus ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock 2015-04-10 9:00 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar 2015-04-10 9:12 ` Ingo Molnar @ 2015-04-10 14:20 ` Paul E. McKenney 2015-04-10 17:44 ` Ingo Molnar 1 sibling, 1 reply; 108+ messages in thread From: Paul E. McKenney @ 2015-04-10 14:20 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Fri, Apr 10, 2015 at 11:00:51AM +0200, Ingo Molnar wrote: > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > > > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about > > > performance *at*all*. We will have worse performance problems than > > > doing some RCU read-locking inside the loop. > > > > > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care > > > about locking, since at worst we just access stale memory for one > > > iteration. > > > > But if we are running on a hypervisor, mightn't our VCPU be > > preempted just before accessing ->on_cpu, the task exit and its > > structures be freed and unmapped? Or is the task structure in > > memory that is never unmapped? (If the latter, clearly not a > > problem.) > > kmalloc()able kernel memory is never unmapped in that fashion [*]. > Even hotplug memory is based on limiting what gets allocated in that > area and never putting critical kernel data structures there. > > Personally I'd be more comfortable with having a special primitive for > this that is DEBUG_PAGEALLOC aware (Linus's first suggestion), so that > we don't use different RCU primitives in the rare case someone tests > CONFIG_DEBUG_PAGEALLOC=y ... > > We even have such a primitive: __copy_from_user_inatomic(). It > compiles to a single instruction for integer types on x86. 
I've > attached a patch that implements it for the regular mutexes (xadd can > be done too), and it all compiles to a rather sweet, compact routine: > > 0000000000000030 <mutex_spin_on_owner.isra.4>: > 30: 48 3b 37 cmp (%rdi),%rsi > 33: 48 8d 4e 28 lea 0x28(%rsi),%rcx > 37: 75 4e jne 87 <mutex_spin_on_owner.isra.4+0x57> > 39: 55 push %rbp > 3a: 45 31 c0 xor %r8d,%r8d > 3d: 65 4c 8b 0c 25 00 00 mov %gs:0x0,%r9 > 44: 00 00 > 46: 48 89 e5 mov %rsp,%rbp > 49: 48 83 ec 10 sub $0x10,%rsp > 4d: eb 08 jmp 57 <mutex_spin_on_owner.isra.4+0x27> > 4f: 90 nop > 50: f3 90 pause > 52: 48 3b 37 cmp (%rdi),%rsi > 55: 75 29 jne 80 <mutex_spin_on_owner.isra.4+0x50> > 57: 44 89 c0 mov %r8d,%eax > 5a: 90 nop > 5b: 90 nop > 5c: 90 nop > 5d: 8b 11 mov (%rcx),%edx > 5f: 90 nop > 60: 90 nop > 61: 90 nop > 62: 85 d2 test %edx,%edx > 64: 89 55 fc mov %edx,-0x4(%rbp) > 67: 74 0b je 74 <mutex_spin_on_owner.isra.4+0x44> > 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax > 70: a8 08 test $0x8,%al > 72: 74 dc je 50 <mutex_spin_on_owner.isra.4+0x20> > 74: 31 c0 xor %eax,%eax > 76: c9 leaveq > 77: c3 retq > 78: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) > 7f: 00 > 80: b8 01 00 00 00 mov $0x1,%eax > 85: c9 leaveq > 86: c3 retq > 87: b8 01 00 00 00 mov $0x1,%eax > 8c: c3 retq > 8d: 0f 1f 00 nopl (%rax) > > No RCU overhead, and this is the access to owner->on_cpu: > > 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax > > Totally untested and all that, I only built the mutex.o. > > What do you think? Am I missing anything? I suspect it is good, but let's take a look at Linus' summary of the code: rcu_read_lock(); while (sem->owner == owner) { if (!owner->on_cpu || need_resched()) break; cpu_relax_lowlatency(); } rcu_read_unlock(); The cpu_relax_lowlatency() looks to have barrier() semantics, so the sem->owner should get reloaded every time through the loop. 
This is needed, because otherwise the task structure could get freed and reallocated as something else that happened to have the field at the ->on_cpu offset always zero, resulting in an infinite loop. Of course, the implicit barrier() could well be forcing unnecessary register spilling. If so, it might be worth trying a variant of cpu_relax_lowlatency() that doesn't have barrier() semantics and placing READ_ONCE() around the sem->owner in the "while" condition. So, in short, given the big fat comment you called for, I don't see any problem with this approach. Which means that any bugs will be a surprise, which will at least have the advantage of avoiding boredom. ;-) Thanx, Paul > Thanks, > > Ingo > > [*] with the exception of CONFIG_DEBUG_PAGEALLOC and other debug > mechanisms like CONFIG_KMEMCHECK (which is on the way out) that > are based on provoking page faults and fixing up page tables to > catch unexpected memory accesses. > > =================================> > >From ef3e5e763747d47a43a32f846ee94706089222cf Mon Sep 17 00:00:00 2001 > From: Ingo Molnar <mingo@kernel.org> > Date: Fri, 10 Apr 2015 10:49:11 +0200 > Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock > > Linus suggested to get rid of the held RCU read-lock in > mutex_spin_on_owner(). The only real complication is that the 'owner' > task might be freed from under us and we might dereference into > possibly freed kernel memory. > > As long as the kernel pointer itself is valid this approach is fine in > this case (see the analysis below) - with the exception of > CONFIG_DEBUG_PAGEALLOC=y and similarly instrumented kernels which > might fault on referencing freed kernel memory. > > Use the non-faulting copy-from-user primitive to get the owner->on_cpu > value that we use in NMI handlers and which works even on > CONFIG_DEBUG_PAGEALLOC=y instrumented kernels. This compiles to a > single instruction on most platforms. 
> > This approach might briefly copy in junk data from an already freed > (previous) owner task, which might trigger two scenarios: > > 1) The junk data causes us to loop once more. This is not > a problem as we'll check the owner on the next loop and > break out of the loop. > > 2) If the junk value causes us to break out of the loop > that's fine too: it's what we'd have done anyway on > the next iteration, as the lock owner changed. > > The inatomic context copy primitives are compiler barriers > too - this matters to make sure the above owner check is > emitted to before the copy attempt. > > We also ignore failed copies, as the next iteration will clean > up after us. This saves an extra branch in the common case. > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org> > --- > kernel/locking/mutex.c | 40 +++++++++++++++++++++++++++------------- > 1 file changed, 27 insertions(+), 13 deletions(-) > > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c > index 4cccea6b8934..fcc7db45d62e 100644 > --- a/kernel/locking/mutex.c > +++ b/kernel/locking/mutex.c > @@ -224,28 +224,42 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock, > static noinline > bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) > { > - bool ret = true; > + int on_cpu; > + int ret; > > - rcu_read_lock(); > while (lock->owner == owner) { > /* > - * Ensure we emit the owner->on_cpu, dereference _after_ > - * checking lock->owner still matches owner. If that fails, > - * owner might point to freed memory. If it still matches, > - * the rcu_read_lock() ensures the memory stays valid. > + * Use the non-faulting copy-user primitive to get the owner->on_cpu > + * value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented > + * kernels. This compiles to a single instruction on most platforms. 
> + * > + * This might briefly copy in junk data from an already freed > + * (previous) owner task, which might trigger two scenarios: > + * > + * 1) The junk data causes us to loop once more. This is not > + * a problem as we'll check the owner on the next loop and > + * break out of the loop. > + * > + * 2) If the junk value causes us to break out of the loop > + * that's fine too: it's what we'd have done anyway on > + * the next iteration, as the lock owner changed. > + * > + * NOTE: the inatomic context copy primitives are compiler barriers > + * too - this matters to make sure the above owner check is > + * emitted to before the copy attempt. > + * > + * NOTE2: We ignore failed copies, as the next iteration will clean > + * up after us. This saves an extra branch in the common case. > */ > - barrier(); > + ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu)); > > - if (!owner->on_cpu || need_resched()) { > - ret = false; > - break; > - } > + if (!on_cpu || need_resched()) > + return false; > > cpu_relax_lowlatency(); > } > - rcu_read_unlock(); > > - return ret; > + return true; > } > > /* > ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock 2015-04-10 14:20 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney @ 2015-04-10 17:44 ` Ingo Molnar 2015-04-10 18:05 ` Paul E. McKenney 0 siblings, 1 reply; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 17:44 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > No RCU overhead, and this is the access to owner->on_cpu: > > > > 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax > > > > Totally untested and all that, I only built the mutex.o. > > > > What do you think? Am I missing anything? > > I suspect it is good, but let's take a look at Linus' summary of the code: > > rcu_read_lock(); > while (sem->owner == owner) { > if (!owner->on_cpu || need_resched()) > break; > cpu_relax_lowlatency(); > } > rcu_read_unlock(); Note that I patched the mutex case as a prototype, which is more commonly used than rwsem-xadd. But the rwsem case is similar as well. > The cpu_relax_lowlatency() looks to have barrier() semantics, so the > sem->owner should get reloaded every time through the loop. This is > needed, because otherwise the task structure could get freed and > reallocated as something else that happened to have the field at the > ->on_cpu offset always zero, resulting in an infinite loop. So at least with the get_kernel(..., &owner->on_cpu) approach, the get_kernel() copy has barrier semantics as well (it's in assembly), so it will be reloaded in every iteration in a natural fashion. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock 2015-04-10 17:44 ` Ingo Molnar @ 2015-04-10 18:05 ` Paul E. McKenney 0 siblings, 0 replies; 108+ messages in thread From: Paul E. McKenney @ 2015-04-10 18:05 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Fri, Apr 10, 2015 at 07:44:00PM +0200, Ingo Molnar wrote: > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > > > No RCU overhead, and this is the access to owner->on_cpu: > > > > > > 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax > > > > > > Totally untested and all that, I only built the mutex.o. > > > > > > What do you think? Am I missing anything? > > > > I suspect it is good, but let's take a look at Linus' summary of the code: > > > > rcu_read_lock(); > > while (sem->owner == owner) { > > if (!owner->on_cpu || need_resched()) > > break; > > cpu_relax_lowlatency(); > > } > > rcu_read_unlock(); > > Note that I patched the mutex case as a prototype, which is more > commonly used than rwsem-xadd. But the rwsem case is similar as well. > > > The cpu_relax_lowlatency() looks to have barrier() semantics, so the > > sem->owner should get reloaded every time through the loop. This is > > needed, because otherwise the task structure could get freed and > > reallocated as something else that happened to have the field at the > > ->on_cpu offset always zero, resulting in an infinite loop. > > So at least with the get_kernel(..., &owner->on_cpu) approach, the > get_kernel() copy has barrier semantics as well (it's in assembly), so > it will be reloaded in every iteration in a natural fashion. Good point, even better! Thanx, Paul ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 18:16 ` Linus Torvalds 2015-04-09 18:39 ` Paul E. McKenney @ 2015-04-09 19:43 ` Jason Low 2015-04-09 19:58 ` Paul E. McKenney 2015-04-09 19:59 ` Davidlohr Bueso 2015-04-09 20:36 ` Jason Low 2 siblings, 2 replies; 108+ messages in thread From: Jason Low @ 2015-04-09 19:43 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Ingo Molnar, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML, jason.low2 On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote: > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The pointer is a known-safe kernel pointer - it's just that it was > > "known safe" a few instructions ago, and might be rcu-free'd at any > > time. > > Actually, we could even do something like this: > > static inline int sem_owner_on_cpu(struct semaphore *sem, struct > task_struct *owner) > { > int on_cpu; > > #ifdef CONFIG_DEBUG_PAGEALLOC > rcu_read_lock(); > #endif > on_cpu = sem->owner == owner && owner->on_cpu; > #ifdef CONFIG_DEBUG_PAGEALLOC > rcu_read_unlock(); > #endif > return on_cpu; > } > > because we really don't need to hold the RCU lock over the whole loop, > we just need to validate that the semaphore owner still matches, and > if so, check that it's on_cpu. > > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance > *at*all*. We will have worse performance problems than doing some RCU > read-locking inside the loop. > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about > locking, since at worst we just access stale memory for one iteration. > > Hmm. It's not pretty, but neither is the current "let's just take a > rcu lock that we don't really need over a loop that doesn't have very > strict bounding". > > Comments? 
So that looks more similar to how the original code was where the rcu_read_lock() and rcu_read_unlock() were done inside the owner_running helper function (though without the CONFIG_DEBUG_PAGEALLOC), before commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified it to be done outside the loop. ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 19:43 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low @ 2015-04-09 19:58 ` Paul E. McKenney 2015-04-09 20:58 ` Jason Low 2015-04-09 19:59 ` Davidlohr Bueso 1 sibling, 1 reply; 108+ messages in thread From: Paul E. McKenney @ 2015-04-09 19:58 UTC (permalink / raw) To: Jason Low Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 09, 2015 at 12:43:38PM -0700, Jason Low wrote: > On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote: > > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > The pointer is a known-safe kernel pointer - it's just that it was > > > "known safe" a few instructions ago, and might be rcu-free'd at any > > > time. > > > > Actually, we could even do something like this: > > > > static inline int sem_owner_on_cpu(struct semaphore *sem, struct > > task_struct *owner) > > { > > int on_cpu; > > > > #ifdef CONFIG_DEBUG_PAGEALLOC > > rcu_read_lock(); > > #endif > > on_cpu = sem->owner == owner && owner->on_cpu; > > #ifdef CONFIG_DEBUG_PAGEALLOC > > rcu_read_unlock(); > > #endif > > return on_cpu; > > } > > > > because we really don't need to hold the RCU lock over the whole loop, > > we just need to validate that the semaphore owner still matches, and > > if so, check that it's on_cpu. > > > > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance > > *at*all*. We will have worse performance problems than doing some RCU > > read-locking inside the loop. > > > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about > > locking, since at worst we just access stale memory for one iteration. > > > > Hmm. It's not pretty, but neither is the current "let's just take a > > rcu lock that we don't really need over a loop that doesn't have very > > strict bounding". > > > > Comments? 
> > So that looks more similar to how the original code was where the > rcu_read_lock() and rcu_read_unlock() was done inside the owner_running > helper function (though without the CONFIG_DEBUG_PAGEALLOC), before > commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified > it to be done outside the loop. Another approach would be to post a timer before entering the spinloop, and have the timer handler set the resched bit. Then the loop would be bounded, safe, and would run at full speed. Thanx, Paul ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 19:58 ` Paul E. McKenney @ 2015-04-09 20:58 ` Jason Low 2015-04-09 21:07 ` Paul E. McKenney 0 siblings, 1 reply; 108+ messages in thread From: Jason Low @ 2015-04-09 20:58 UTC (permalink / raw) To: Paul McKenney Cc: Jason Low, Linus Torvalds, Ingo Molnar, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 9, 2015 at 12:58 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Thu, Apr 09, 2015 at 12:43:38PM -0700, Jason Low wrote: >> On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote: >> > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds >> > <torvalds@linux-foundation.org> wrote: >> > > >> > > The pointer is a known-safe kernel pointer - it's just that it was >> > > "known safe" a few instructions ago, and might be rcu-free'd at any >> > > time. >> > >> > Actually, we could even do something like this: >> > >> > static inline int sem_owner_on_cpu(struct semaphore *sem, struct >> > task_struct *owner) >> > { >> > int on_cpu; >> > >> > #ifdef CONFIG_DEBUG_PAGEALLOC >> > rcu_read_lock(); >> > #endif >> > on_cpu = sem->owner == owner && owner->on_cpu; >> > #ifdef CONFIG_DEBUG_PAGEALLOC >> > rcu_read_unlock(); >> > #endif >> > return on_cpu; >> > } >> > >> > because we really don't need to hold the RCU lock over the whole loop, >> > we just need to validate that the semaphore owner still matches, and >> > if so, check that it's on_cpu. >> > >> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance >> > *at*all*. We will have worse performance problems than doing some RCU >> > read-locking inside the loop. >> > >> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about >> > locking, since at worst we just access stale memory for one iteration. >> > >> > Hmm. 
It's not pretty, but neither is the current "let's just take a >> > rcu lock that we don't really need over a loop that doesn't have very >> > strict bounding". >> > >> > Comments? >> >> So that looks more similar to how the original code was where the >> rcu_read_lock() and rcu_read_unlock() was done inside the owner_running >> helper function (though without the CONFIG_DEBUG_PAGEALLOC), before >> commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified >> it to be done outside the loop. > > Another approach would be to post a timer before entering the spinloop, > and have the timer handler set the resched bit. Then the loop would > be bounded, safe, and would run at full speed. Though posting a timer, ect... would also add more overhead right? ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 20:58 ` Jason Low @ 2015-04-09 21:07 ` Paul E. McKenney 0 siblings, 0 replies; 108+ messages in thread From: Paul E. McKenney @ 2015-04-09 21:07 UTC (permalink / raw) To: Jason Low Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, Apr 09, 2015 at 01:58:16PM -0700, Jason Low wrote: > On Thu, Apr 9, 2015 at 12:58 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Thu, Apr 09, 2015 at 12:43:38PM -0700, Jason Low wrote: > >> On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote: > >> > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds > >> > <torvalds@linux-foundation.org> wrote: > >> > > > >> > > The pointer is a known-safe kernel pointer - it's just that it was > >> > > "known safe" a few instructions ago, and might be rcu-free'd at any > >> > > time. > >> > > >> > Actually, we could even do something like this: > >> > > >> > static inline int sem_owner_on_cpu(struct semaphore *sem, struct > >> > task_struct *owner) > >> > { > >> > int on_cpu; > >> > > >> > #ifdef CONFIG_DEBUG_PAGEALLOC > >> > rcu_read_lock(); > >> > #endif > >> > on_cpu = sem->owner == owner && owner->on_cpu; > >> > #ifdef CONFIG_DEBUG_PAGEALLOC > >> > rcu_read_unlock(); > >> > #endif > >> > return on_cpu; > >> > } > >> > > >> > because we really don't need to hold the RCU lock over the whole loop, > >> > we just need to validate that the semaphore owner still matches, and > >> > if so, check that it's on_cpu. > >> > > >> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance > >> > *at*all*. We will have worse performance problems than doing some RCU > >> > read-locking inside the loop. > >> > > >> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about > >> > locking, since at worst we just access stale memory for one iteration. > >> > > >> > Hmm. 
It's not pretty, but neither is the current "let's just take a > >> > rcu lock that we don't really need over a loop that doesn't have very > >> > strict bounding". > >> > > >> > Comments? > >> > >> So that looks more similar to how the original code was where the > >> rcu_read_lock() and rcu_read_unlock() was done inside the owner_running > >> helper function (though without the CONFIG_DEBUG_PAGEALLOC), before > >> commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified > >> it to be done outside the loop. > > > > Another approach would be to post a timer before entering the spinloop, > > and have the timer handler set the resched bit. Then the loop would > > be bounded, safe, and would run at full speed. > > Though posting a timer, ect... would also add more overhead right? It would. Not on each pass through the loop, though. But yes, might be too much overhead. Thanx, Paul ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 19:43 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low 2015-04-09 19:58 ` Paul E. McKenney @ 2015-04-09 19:59 ` Davidlohr Bueso 1 sibling, 0 replies; 108+ messages in thread From: Davidlohr Bueso @ 2015-04-09 19:59 UTC (permalink / raw) To: Jason Low Cc: Linus Torvalds, Paul McKenney, Ingo Molnar, Peter Zijlstra, Tim Chen, Aswin Chandramouleeswaran, LKML On Thu, 2015-04-09 at 12:43 -0700, Jason Low wrote: > So that looks more similar to how the original code was where the > rcu_read_lock() and rcu_read_unlock() was done inside the owner_running > helper function (though without the CONFIG_DEBUG_PAGEALLOC), before > commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified > it to be done outside the loop. I think this is why Linus was mentioning the CONFIG_PREEMPT_RCU case as well, so taking and releasing the read lock in those cases in the loop would actually hurt more. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 18:16 ` Linus Torvalds 2015-04-09 18:39 ` Paul E. McKenney 2015-04-09 19:43 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low @ 2015-04-09 20:36 ` Jason Low 2015-04-10 2:43 ` Andev 2015-04-10 9:04 ` Ingo Molnar 2 siblings, 2 replies; 108+ messages in thread From: Jason Low @ 2015-04-09 20:36 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Ingo Molnar, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Thomas Gleixner, Aswin Chandramouleeswaran, LKML, jason.low2 On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote: > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The pointer is a known-safe kernel pointer - it's just that it was > > "known safe" a few instructions ago, and might be rcu-free'd at any > > time. > > Actually, we could even do something like this: > > static inline int sem_owner_on_cpu(struct semaphore *sem, struct > task_struct *owner) > { > int on_cpu; > > #ifdef CONFIG_DEBUG_PAGEALLOC > rcu_read_lock(); > #endif > on_cpu = sem->owner == owner && owner->on_cpu; > #ifdef CONFIG_DEBUG_PAGEALLOC > rcu_read_unlock(); > #endif > return on_cpu; > } > > because we really don't need to hold the RCU lock over the whole loop, > we just need to validate that the semaphore owner still matches, and > if so, check that it's on_cpu. > > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance > *at*all*. We will have worse performance problems than doing some RCU > read-locking inside the loop. > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about > locking, since at worst we just access stale memory for one iteration. > > Hmm. It's not pretty, but neither is the current "let's just take a > rcu lock that we don't really need over a loop that doesn't have very > strict bounding". So then something like the following (for rwsem)? 
We can also run some tests to see how the worst case "access stale memory for one iteration" to the heuristic can have an effect on performance, though that probably wouldn't be much of an issue in practice. --- kernel/locking/rwsem-xadd.c | 43 ++++++++++++++++++++++++++++--------------- 1 files changed, 28 insertions(+), 15 deletions(-) diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 3417d01..870c574 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -295,6 +295,31 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem) } } +static inline bool owner_running(struct rw_semaphore *sem, struct task_struct *owner) +{ + bool ret; + +#ifdef CONFIG_DEBUG_PAGEALLOC + rcu_read_lock(); +#endif + if (READ_ONCE(sem->owner) == owner) { + /* + * Ensure we emit the owner->on_cpu dereference + * after checking sem->owner still matches owner. + */ + barrier(); + ret = owner->on_cpu; + } else { + /* Owner changed. */ + ret = false; + } + +#ifdef CONFIG_DEBUG_PAGEALLOC + rcu_read_unlock(); +#endif + return ret; +} + static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) { struct task_struct *owner; @@ -329,25 +354,13 @@ bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner) { long count; - rcu_read_lock(); - while (sem->owner == owner) { - /* - * Ensure we emit the owner->on_cpu, dereference _after_ - * checking sem->owner still matches owner, if that fails, - * owner might point to free()d memory, if it still matches, - * the rcu_read_lock() ensures the memory stays valid. - */ - barrier(); - - /* abort spinning when need_resched or owner is not running */ - if (!owner->on_cpu || need_resched()) { - rcu_read_unlock(); + while (owner_running(sem, owner)) { + /* Abort spinning when need resched. 
*/ + if (need_resched()) return false; - } cpu_relax_lowlatency(); } - rcu_read_unlock(); if (READ_ONCE(sem->owner)) return true; /* new owner, continue spinning */ -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 20:36 ` Jason Low @ 2015-04-10 2:43 ` Andev 2015-04-10 9:04 ` Ingo Molnar 1 sibling, 0 replies; 108+ messages in thread From: Andev @ 2015-04-10 2:43 UTC (permalink / raw) To: Jason Low Cc: Linus Torvalds, Paul McKenney, Ingo Molnar, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Thomas Gleixner, Aswin Chandramouleeswaran, LKML On Thu, Apr 9, 2015 at 4:36 PM, Jason Low <jason.low2@hp.com> wrote: > On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote: >> On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >> > >> > The pointer is a known-safe kernel pointer - it's just that it was >> > "known safe" a few instructions ago, and might be rcu-free'd at any >> > time. >> >> Actually, we could even do something like this: >> >> static inline int sem_owner_on_cpu(struct semaphore *sem, struct >> task_struct *owner) >> { >> int on_cpu; >> >> #ifdef CONFIG_DEBUG_PAGEALLOC >> rcu_read_lock(); >> #endif >> on_cpu = sem->owner == owner && owner->on_cpu; >> #ifdef CONFIG_DEBUG_PAGEALLOC >> rcu_read_unlock(); >> #endif >> return on_cpu; >> } >> >> because we really don't need to hold the RCU lock over the whole loop, >> we just need to validate that the semaphore owner still matches, and >> if so, check that it's on_cpu. >> >> And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance >> *at*all*. We will have worse performance problems than doing some RCU >> read-locking inside the loop. >> >> And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about >> locking, since at worst we just access stale memory for one iteration. >> >> Hmm. It's not pretty, but neither is the current "let's just take a >> rcu lock that we don't really need over a loop that doesn't have very >> strict bounding". > > So then something like the following (for rwsem)? 
> > We can also run some tests to see how the worst case "access > stale memory for one iteration" to the heuristic can have an affect on > performance, though that probably wouldn't be much of an issue in > practice. > > --- > kernel/locking/rwsem-xadd.c | 43 ++++++++++++++++++++++++++++--------------- > 1 files changed, 28 insertions(+), 15 deletions(-) > > diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c > index 3417d01..870c574 100644 > --- a/kernel/locking/rwsem-xadd.c > +++ b/kernel/locking/rwsem-xadd.c > @@ -295,6 +295,31 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem) > } > } > > +static inline bool owner_running(struct rw_semaphore *sem, struct task_struct *owner) > +{ > + bool ret; > + > +#ifdef CONFIG_DEBUG_PAGEALLOC > + rcu_read_lock(); > +#endif Please use 'if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {}' here. Makes the code much more readable IMHO. -- Andev ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() 2015-04-09 20:36 ` Jason Low 2015-04-10 2:43 ` Andev @ 2015-04-10 9:04 ` Ingo Molnar 1 sibling, 0 replies; 108+ messages in thread From: Ingo Molnar @ 2015-04-10 9:04 UTC (permalink / raw) To: Jason Low Cc: Linus Torvalds, Paul McKenney, Peter Zijlstra, Davidlohr Bueso, Tim Chen, Thomas Gleixner, Aswin Chandramouleeswaran, LKML * Jason Low <jason.low2@hp.com> wrote: > +static inline bool owner_running(struct rw_semaphore *sem, struct task_struct *owner) > +{ > + bool ret; > + > +#ifdef CONFIG_DEBUG_PAGEALLOC > + rcu_read_lock(); > +#endif > + if (READ_ONCE(sem->owner) == owner) { > + /* > + * Ensure we emit the owner->on_cpu dereference > + * after checking sem->owner still matches owner. > + */ > + barrier(); > + ret = owner->on_cpu; > + } else { > + /* Owner changed. */ > + ret = false; > + } > + > +#ifdef CONFIG_DEBUG_PAGEALLOC > + rcu_read_unlock(); > +#endif > + return ret; > +} So I really don't like this due to the assymetric RCU pattern. (Also, the fact that we might read from already freed kernel memory here needs big honking comments.) But, I think we can do this differently, see the patch I just sent: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock that should I think work just as well, without having to introduce owner_running() as the whole mutex_spin_on_owner() logic is kept pretty simple. Thanks, Ingo ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 0/2] locking: Simplify mutex and rwsem spinning code 2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low 2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low 2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low @ 2015-04-08 19:49 ` Davidlohr Bueso 2015-04-08 20:10 ` Jason Low 2 siblings, 1 reply; 108+ messages in thread From: Davidlohr Bueso @ 2015-04-08 19:49 UTC (permalink / raw) To: Jason Low Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Tim Chen, Aswin Chandramouleeswaran, LKML On Wed, 2015-04-08 at 12:39 -0700, Jason Low wrote: > This patchset applies on top of tip. > > Jason Low (2): > locking/mutex: Further refactor mutex_spin_on_owner() > locking/rwsem: Use a return variable in rwsem_spin_on_owner() > > kernel/locking/mutex.c | 14 ++++---------- > kernel/locking/rwsem-xadd.c | 25 ++++++++++++------------- > 2 files changed, 16 insertions(+), 23 deletions(-) Meh. Personally I wouldn't mind leaving these files alone for a change, the code is already pretty readable imho. Of course it's a matter of taste, so I won't argue. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [PATCH 0/2] locking: Simplify mutex and rwsem spinning code 2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso @ 2015-04-08 20:10 ` Jason Low 0 siblings, 0 replies; 108+ messages in thread From: Jason Low @ 2015-04-08 20:10 UTC (permalink / raw) To: Davidlohr Bueso Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Tim Chen, Aswin Chandramouleeswaran, LKML, jason.low2 On Wed, 2015-04-08 at 12:49 -0700, Davidlohr Bueso wrote: > On Wed, 2015-04-08 at 12:39 -0700, Jason Low wrote: > > This patchset applies on top of tip. > > > > Jason Low (2): > > locking/mutex: Further refactor mutex_spin_on_owner() > > locking/rwsem: Use a return variable in rwsem_spin_on_owner() > > > > kernel/locking/mutex.c | 14 ++++---------- > > kernel/locking/rwsem-xadd.c | 25 ++++++++++++------------- > > 2 files changed, 16 insertions(+), 23 deletions(-) > > Meh. Personally I wouldn't mid leaving these files alone for a change, > the code is already pretty readable imho. Of course its a matter of > taste, so I won't argue. I agree that it's quite readable now too, though Peter and Ingo seem like they would prefer these changes :) https://lkml.org/lkml/2015/3/7/46 https://lkml.org/lkml/2015/3/16/102 ^ permalink raw reply [flat|nested] 108+ messages in thread
end of thread, other threads:[~2016-04-16 21:08 UTC | newest] Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low 2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low 2015-04-09 9:00 ` [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() tip-bot for Jason Low 2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low 2015-04-09 5:37 ` Ingo Molnar 2015-04-09 6:40 ` Jason Low 2015-04-09 7:53 ` Ingo Molnar 2015-04-09 16:47 ` Linus Torvalds 2015-04-09 17:56 ` Paul E. McKenney 2015-04-09 18:08 ` Linus Torvalds 2015-04-09 18:16 ` Linus Torvalds 2015-04-09 18:39 ` Paul E. McKenney 2015-04-10 9:00 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar 2015-04-10 9:12 ` Ingo Molnar 2015-04-10 9:21 ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar 2015-04-10 11:14 ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar 2015-04-10 11:27 ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar 2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar 2015-04-10 12:18 ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar 2015-04-10 12:30 ` [PATCH] x86: Pack loops " Ingo Molnar 2015-04-10 13:46 ` Borislav Petkov 2015-05-15 9:40 ` [tip:x86/asm] " tip-bot for Ingo Molnar 2015-05-17 6:03 ` [tip:x86/apic] " tip-bot for Ingo Molnar 2015-05-15 9:39 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar 2015-05-15 18:36 ` Linus Torvalds 2015-05-15 20:52 ` Denys Vlasenko 2015-05-17 5:58 ` Ingo Molnar 2015-05-17 7:09 ` Ingo Molnar 2015-05-17 7:30 ` Ingo Molnar 2015-05-18 9:28 ` Denys Vlasenko 2015-05-19 21:38 ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions 
Ingo Molnar 2015-05-20 0:47 ` Linus Torvalds 2015-05-20 12:21 ` Denys Vlasenko 2015-05-21 11:36 ` Ingo Molnar 2015-05-21 11:38 ` Denys Vlasenko 2016-04-16 21:08 ` Denys Vlasenko 2015-05-20 13:09 ` Ingo Molnar 2015-05-20 11:29 ` Denys Vlasenko 2015-05-21 13:28 ` Ingo Molnar 2015-05-21 14:03 ` Ingo Molnar 2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko 2015-04-10 13:18 ` H. Peter Anvin 2015-04-10 17:54 ` Ingo Molnar 2015-04-10 18:32 ` H. Peter Anvin 2015-04-11 14:41 ` Markus Trippelsdorf 2015-04-12 10:14 ` Ingo Molnar 2015-04-13 16:23 ` Markus Trippelsdorf 2015-04-13 17:26 ` Markus Trippelsdorf 2015-04-13 18:31 ` Linus Torvalds 2015-04-13 19:09 ` Markus Trippelsdorf 2015-04-14 5:38 ` Ingo Molnar 2015-04-14 8:23 ` Markus Trippelsdorf 2015-04-14 9:16 ` Ingo Molnar 2015-04-14 11:17 ` Markus Trippelsdorf 2015-04-14 12:09 ` Ingo Molnar 2015-04-10 18:48 ` Linus Torvalds 2015-04-12 23:44 ` Maciej W. Rozycki 2015-04-10 19:23 ` Daniel Borkmann 2015-04-11 13:48 ` Markus Trippelsdorf 2015-04-10 13:19 ` Borislav Petkov 2015-04-10 13:54 ` Denys Vlasenko 2015-04-10 14:01 ` Borislav Petkov 2015-04-10 14:53 ` Denys Vlasenko 2015-04-10 15:25 ` Borislav Petkov 2015-04-10 15:48 ` Denys Vlasenko 2015-04-10 15:54 ` Borislav Petkov 2015-04-10 21:44 ` Borislav Petkov 2015-04-10 18:54 ` Linus Torvalds 2015-04-10 14:10 ` Paul E. 
McKenney 2015-04-11 14:28 ` Josh Triplett 2015-04-11 9:20 ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar 2015-04-11 17:41 ` Linus Torvalds 2015-04-11 18:57 ` Thomas Gleixner 2015-04-11 19:35 ` Linus Torvalds 2015-04-12 5:47 ` Ingo Molnar 2015-04-12 6:20 ` Markus Trippelsdorf 2015-04-12 10:15 ` Ingo Molnar 2015-04-12 7:56 ` Mike Galbraith 2015-04-12 7:41 ` Ingo Molnar 2015-04-12 8:07 ` Ingo Molnar 2015-04-12 21:11 ` Jan Hubicka 2015-05-14 11:59 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko 2015-05-14 18:17 ` Ingo Molnar 2015-05-14 19:04 ` Denys Vlasenko 2015-05-14 19:44 ` Ingo Molnar 2015-05-15 15:45 ` Josh Triplett 2015-05-17 5:34 ` Ingo Molnar 2015-05-17 19:18 ` Josh Triplett 2015-05-18 6:48 ` Ingo Molnar 2015-05-15 9:39 ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar 2015-04-10 11:34 ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra 2015-04-10 18:04 ` Ingo Molnar 2015-04-10 17:49 ` Linus Torvalds 2015-04-10 18:04 ` Ingo Molnar 2015-04-10 18:09 ` Linus Torvalds 2015-04-10 14:20 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney 2015-04-10 17:44 ` Ingo Molnar 2015-04-10 18:05 ` Paul E. McKenney 2015-04-09 19:43 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low 2015-04-09 19:58 ` Paul E. McKenney 2015-04-09 20:58 ` Jason Low 2015-04-09 21:07 ` Paul E. McKenney 2015-04-09 19:59 ` Davidlohr Bueso 2015-04-09 20:36 ` Jason Low 2015-04-10 2:43 ` Andev 2015-04-10 9:04 ` Ingo Molnar 2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso 2015-04-08 20:10 ` Jason Low