* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
  2008-12-11 22:40                               ` Eric Dumazet
  (?)
@ 2007-07-24  1:13                                 ` Nick Piggin
  -1 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2007-07-24  1:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> From: Christoph Lameter <cl@linux-foundation.org>
>
> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
>
> Currently we schedule RCU frees for each file we free separately. That has
> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
> did not require RCU callbacks:
>
> 1. Excessive number of RCU callbacks can be generated causing long RCU
>   queues that in turn cause long latencies. We hit SLUB page allocation
>   more often than necessary.
>
> 2. The cache hot object is not preserved between free and realloc. A close
>   followed by another open is very fast with the RCUless approach because
>   the last freed object is returned by the slab allocator that is
>   still cache hot. RCU free means that the object is not immediately
>   available again. The new object is cache cold and therefore open/close
>   performance tests show a significant degradation with the RCU
>   implementation.
>
> One solution to this problem is to move the RCU freeing into the Slab
> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
> time. The slab allocator will do RCU frees only when it is necessary
> to dispose of slabs of objects (rare). So with that approach we can cut
> out the RCU overhead significantly.
>
> However, the slab allocator may return the object for another use even
> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
> there is the (unlikely) possibility that the object is going to be
> switched under us in sections protected by rcu_read_lock() and
> rcu_read_unlock(). So we need to verify that we have acquired the correct
> object after establishing a stable object reference (incrementing the
> refcounter does that).
>
>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  Documentation/filesystems/files.txt |   21 ++++++++++++++--
>  fs/file_table.c                     |   33 ++++++++++++++++++--------
>  include/linux/fs.h                  |    5 ---
>  3 files changed, 42 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
> index ac2facc..6916baa 100644
> --- a/Documentation/filesystems/files.txt
> +++ b/Documentation/filesystems/files.txt
> @@ -78,13 +78,28 @@ the fdtable structure -
>     that look-up may race with the last put() operation on the
>     file structure. This is avoided using atomic_long_inc_not_zero()
>     on ->f_count :
> +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
> +   they can also be freed before a RCU grace period, and reused,
> +   but still as a struct file.
> +   It is necessary to check again after getting
> +   a stable reference (ie after atomic_long_inc_not_zero()),
> +   that fcheck_files(files, fd) points to the same file.
>
>  	rcu_read_lock();
>  	file = fcheck_files(files, fd);
>  	if (file) {
> -		if (atomic_long_inc_not_zero(&file->f_count))
> +		if (atomic_long_inc_not_zero(&file->f_count)) {
>  			*fput_needed = 1;
> -		else
> +			/*
> +			 * Now we have a stable reference to an object.
> +			 * Check if other threads freed file and reallocated it.
> +			 */
> +			if (file != fcheck_files(files, fd)) {
> +				*fput_needed = 0;
> +				put_filp(file);
> +				file = NULL;
> +			}
> +		} else
>  		/* Didn't get the reference, someone's freed */
>  			file = NULL;
>  	}
> @@ -95,6 +110,8 @@ the fdtable structure -
>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>     goes to zero during increment. If it does, we fail
>     fget()/fget_light().
> +   The second call to fcheck_files(files, fd) checks that this filp
> +   was not freed, then reused by an other thread.
>
>  6. Since both fdtable and file structures can be looked up
>     lock-free, they must be installed using rcu_assign_pointer()
> diff --git a/fs/file_table.c b/fs/file_table.c
> index a46e880..3e9259d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>
>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>
> -static inline void file_free_rcu(struct rcu_head *head)
> -{
> -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
> -	kmem_cache_free(filp_cachep, f);
> -}
> -
>  static inline void file_free(struct file *f)
>  {
>  	percpu_counter_dec(&nr_files);
>  	file_check_state(f);
> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> +	kmem_cache_free(filp_cachep, f);
>  }
>
>  /*
> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>  			rcu_read_unlock();
>  			return NULL;
>  		}
> +		/*
> +		 * Now we have a stable reference to an object.
> +		 * Check if other threads freed file and re-allocated it.
> +		 */
> +		if (unlikely(file != fcheck_files(files, fd))) {
> +			put_filp(file);
> +			file = NULL;
> +		}

This is a non-trivial change, because that put_filp may drop the last
reference to the file. So now we have the case where we free the file
from a context in which it had never been allocated.

From a quick glance through the callchains, I can't see an obvious
problem. But it needs to have documentation in put_filp, or at least
a mention in the changelog, and should also be cc'ed to the security lists.

Also, it adds code and cost to the get/put path in return for
improvement in the free path. get/put is the more common path, but
it is a small loss for a big improvement. So it might be worth it. But
it is not justified by your microbenchmark. Do we have a more useful
case that it helps?
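
For readers following the review, here is a minimal user-space sketch of the
"take a reference, then recheck" pattern the patch adds, and of the case
raised above where the put on the failed-recheck path drops the last
reference. It is not the kernel code: struct obj, slot, inc_not_zero(),
put_ref(), recheck_or_put() and get_obj() are hypothetical stand-ins, C11
atomics replace the kernel primitives, and the rcu_read_lock() section is
only mentioned in a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct obj {
	atomic_long count;
};

/* Hypothetical stand-in for one fd table slot. */
static _Atomic(struct obj *) slot;

/* Take a reference unless the count has already dropped to zero. */
static bool inc_not_zero(atomic_long *v)
{
	long old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool put_ref(struct obj *o)
{
	return atomic_fetch_sub(&o->count, 1) == 1;
}

/*
 * Called with a reference already held on 'seen'. If the slot no longer
 * points at 'seen', the object was recycled under us: give the reference
 * back. *dropped_last reports whether that put released the final
 * reference, i.e. whether the lookup path is the one freeing the object.
 */
static struct obj *recheck_or_put(struct obj *seen, bool *dropped_last)
{
	*dropped_last = false;
	if (seen != atomic_load(&slot)) {
		*dropped_last = put_ref(seen);
		return NULL;
	}
	return seen;
}

/* Whole lookup; in the kernel this would sit under rcu_read_lock(). */
static struct obj *get_obj(bool *dropped_last)
{
	struct obj *seen = atomic_load(&slot);

	*dropped_last = false;
	if (!seen || !inc_not_zero(&seen->count))
		return NULL;
	return recheck_or_put(seen, dropped_last);
}
```

With the slot unchanged, get_obj() returns the object with its count raised
by one. If the object is recycled and its new owner drops its own reference
between inc_not_zero() and the recheck, recheck_or_put() returns NULL with
*dropped_last set, which is the "free from a context that never allocated
it" case described above.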

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry
  2008-12-11 22:38                               ` Eric Dumazet
@ 2007-07-24  1:24                                 ` Nick Piggin
  -1 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2007-07-24  1:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:38, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping pongs
> between cpus to maintain this metric, and dcache_lock is
> no more needed to protect dentry_stat.nr_dentry
>
> We centralize nr_dentry updates at the right place :
> - increments in d_alloc()
> - decrements in d_free()
>
> d_alloc() can avoid taking dcache_lock if parent is NULL
>
> ("socketallocbench -n8" result : 27.5s to 25s)

Seems like a good idea.
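
As a rough illustration of why a batched per-CPU counter avoids the cache
line ping-pong, here is a single-threaded user-space sketch. It is not the
kernel's percpu_counter: NR_SLOTS, BATCH and the batched_* names are
hypothetical, the caller passes the CPU index by hand, and an atomic stands
in for the lock that protects the shared total.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SLOTS 4	/* hypothetical number of CPUs */
#define BATCH    32	/* hypothetical fold threshold */

struct batched_counter {
	atomic_long global;	/* shared total, touched only on a fold */
	long local[NR_SLOTS];	/* one slot per CPU, touched on every update */
};

/* Update the local slot; fold into the shared total once it reaches BATCH. */
static void batched_add(struct batched_counter *c, int cpu, long delta)
{
	long v = c->local[cpu] + delta;

	if (v >= BATCH || v <= -BATCH) {
		atomic_fetch_add(&c->global, v);
		v = 0;
	}
	c->local[cpu] = v;
}

/* Cheap read: may be off by up to NR_SLOTS * (BATCH - 1) either way. */
static long batched_read_approx(struct batched_counter *c)
{
	return atomic_load(&c->global);
}

/* Exact read: walks every slot, so it costs more than the cheap read. */
static long batched_sum(struct batched_counter *c)
{
	long sum = atomic_load(&c->global);

	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->local[i];
	return sum;
}
```

In this sketch the shared word is written once per BATCH updates, so most
increments and decrements stay in the slot belonging to the updating CPU.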


> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  			 * otherwise we ascend to the parent and move to the
>  			 * next sibling if there is one */
>  			if (!parent)
> -				goto out;
> +				return;
>
>  			dentry = parent;
>

Andrew doesn't like return from middle of function.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
@ 2007-07-24  1:30                                 ` Nick Piggin
  0 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2007-07-24  1:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:39, Eric Dumazet wrote:
> Avoids cache line ping pongs between cpus and prepare next patch,
> because updates of nr_inodes dont need inode_lock anymore.
>
> (socket8 bench result : no difference at this point)

Looks good.

But.... If we never actually need fast access to the approximate
total, (which seems to apply to this and the previous patch) we
could use something much simpler which does not have the spinlock
or all this batching stuff that percpu counters have. I'd prefer
that because it will be faster in a straight line...

(BTW. percpu counters can't be used in interrupt context? That's
nice.)
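
One possible shape for the "much simpler" counter suggested above, as a
single-threaded user-space sketch: plain per-CPU slots with no shared total,
no lock and no batching, summed only when the total is actually wanted. The
names (simple_counter, simple_inc, simple_dec, simple_sum, NR_SLOTS) are
hypothetical and the CPU index is passed by hand; a kernel version would
presumably use per-CPU data and update it with preemption disabled.

```c
#include <assert.h>

#define NR_SLOTS 4	/* hypothetical number of CPUs */

struct simple_counter {
	long slot[NR_SLOTS];	/* each CPU only ever writes its own slot */
};

/* Fast path: one local add, no shared cache line, no lock. */
static void simple_inc(struct simple_counter *c, int cpu)
{
	c->slot[cpu]++;
}

static void simple_dec(struct simple_counter *c, int cpu)
{
	c->slot[cpu]--;
}

/* Slow path: the only way to get a total is to walk every slot. */
static long simple_sum(const struct simple_counter *c)
{
	long sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->slot[i];
	return sum;
}
```

A single slot can go negative when an object is allocated on one CPU and
freed on another; only the sum is meaningful. The trade-off is that there is
no cheap approximate read at all, which only works if nothing needs one.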


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator
  2008-12-11 22:39                               ` Eric Dumazet
@ 2007-07-24  1:34                                 ` Nick Piggin
  -1 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2007-07-24  1:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:39, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get increasing
> inode numbers.
>
> Solve this problem by providing to each cpu a per_cpu variable,
> feeded by the shared last_ino, but once every 1024 allocations.
>
> This reduce contention on the shared last_ino, and give same
> spreading ino numbers than before.
> (same wraparound after 2^32 allocations)

I don't suppose this would cause any filesystems to do silly
things?

Seems like a good idea, if you could just add a #define instead
of 1024.

>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
>  1 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index f94f889..dc8e72a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -556,6 +556,36 @@ repeat:
>  	return node ? inode : NULL;
>  }
>
> +#ifdef CONFIG_SMP
> +/*
> + * Each cpu owns a range of 1024 numbers.
> + * 'shared_last_ino' is dirtied only once out of 1024 allocations,
> + * to renew the exhausted range.
> + */
> +static DEFINE_PER_CPU(int, last_ino);
> +
> +static int last_ino_get(void)
> +{
> +	static atomic_t shared_last_ino;
> +	int *p = &get_cpu_var(last_ino);
> +	int res = *p;
> +
> +	if (unlikely((res & 1023) == 0))
> +		res = atomic_add_return(1024, &shared_last_ino) - 1024;
> +
> +	*p = ++res;
> +	put_cpu_var(last_ino);
> +	return res;
> +}
> +#else
> +static int last_ino_get(void)
> +{
> +	static int last_ino;
> +
> +	return ++last_ino;
> +}
> +#endif
> +
>  /**
>   *	new_inode 	- obtain an inode
>   *	@sb: superblock
> @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
>  	 * error if st_ino won't fit in target struct field. Use 32bit counter
>  	 * here to attempt to avoid that.
>  	 */
> -	static unsigned int last_ino;
>  	struct inode * inode;
>
>  	spin_lock_prefetch(&inode_lock);
> @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
>  	inode = alloc_inode(sb);
>  	if (inode) {
>  		percpu_counter_inc(&nr_inodes);
> +		inode->i_state = 0;
> +		inode->i_ino = last_ino_get();
>  		spin_lock(&inode_lock);
>  		list_add(&inode->i_list, &inode_in_use);
>  		list_add(&inode->i_sb_list, &sb->s_inodes);
> -		inode->i_ino = ++last_ino;
> -		inode->i_state = 0;
>  		spin_unlock(&inode_lock);
>  	}
>  	return inode;
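
A user-space sketch of the quoted allocator with the batch size behind a
named constant, as requested above. LAST_INO_BATCH is a hypothetical name,
_Thread_local stands in for the per-CPU variable, and C11 atomic_fetch_add()
stands in for atomic_add_return(1024, ...) - 1024 (both yield the value
before the add). The counters are unsigned here so that wraparound is well
defined in portable C; the quoted patch uses int.

```c
#include <assert.h>
#include <stdatomic.h>

#define LAST_INO_BATCH 1024u	/* hypothetical name for the range size */

/* The mask test below only works for a power-of-two batch. */
_Static_assert((LAST_INO_BATCH & (LAST_INO_BATCH - 1)) == 0,
	       "LAST_INO_BATCH must be a power of two");

static atomic_uint shared_last_ino;		/* dirtied once per batch */
static _Thread_local unsigned int last_ino;	/* stand-in for the per-CPU var */

static unsigned int last_ino_get(void)
{
	unsigned int res = last_ino;

	/* Local range exhausted (or never set up): claim a new one. */
	if ((res & (LAST_INO_BATCH - 1)) == 0)
		res = atomic_fetch_add(&shared_last_ino, LAST_INO_BATCH);

	last_ino = ++res;
	return res;
}
```

With a single thread the numbers come out as 1, 2, 3, ... exactly as with
the old global counter; with several threads each one hands out numbers from
its own range of LAST_INO_BATCH values.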

^ permalink raw reply	[flat|nested] 349+ messages in thread

* 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27
@ 2008-11-16 17:38 ` Rafael J. Wysocki
  0 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:38 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Andrew Morton, Natalie Protasevich, Kernel Testers List

[NOTE:
 I closed a number of Bugzilla entries dedicated to regressions introduced
 between 2.6.26 and 2.6.27 that appeared to have been fixed to me or where
 the reporters had been totally unresponsive for extended periods of time
 (given that they are notified every week ...).]

This message contains a list of some regressions introduced between 2.6.26 and
2.6.27, for which there are no fixes in the mainline I know of.  If any of them
have been fixed already, please let me know.

If you know of any other unresolved regressions introduced between 2.6.26
and 2.6.27, please let me know as well and I'll add them to the list.
Also, please let me know if any of the entries below are invalid.

Each entry from the list will be sent additionally in an automatic reply to
this message with CCs to the people involved in reporting and handling the
issue.


Listed regressions statistics:

  Date          Total  Pending  Unresolved
  ----------------------------------------
  2008-11-16      199       18          14
  2008-11-09      196       28          23
  2008-11-02      195       34          28
  2008-10-26      190       34          29
  2008-10-04      181       41          33
  2008-09-27      173       35          28
  2008-09-21      169       45          36
  2008-09-15      163       46          32
  2008-09-12      163       51          38
  2008-09-07      150       43          33
  2008-08-30      135       48          36
  2008-08-23      122       48          40
  2008-08-16      103       47          37
  2008-08-10       80       52          31
  2008-08-02       47       31          20


Unresolved regressions
----------------------

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=12048
Subject		: Regression in bonding between 2.6.26.8 and 2.6.27.6
Submitter	: Jesper Krogh <jesper@krogh.cc>
Date		: 2008-11-16 9:41 (1 days old)
References	: http://marc.info/?l=linux-kernel&m=122682977001048&w=4


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=12039
Subject		: Regression: USB/DVB 2.6.26.8 --> 2.6.27.6
Submitter	: David <david@unsolicited.net>
Date		: 2008-11-14 20:20 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=122669568022274&w=4


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11983
Subject		: iwlagn: wrong command queue 31, command id 0x0
Submitter	: Matt Mackall <mpm@selenic.com>
Date		: 2008-11-06 4:16 (11 days old)
References	: http://marc.info/?l=linux-kernel&m=122598672815803&w=4
		  http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703
Handled-By	: reinette chatre <reinette.chatre@intel.com>


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11886
Subject		: without serial console system doesn't  poweroff
Submitter	: Daniel Smolik <marvin@mydatex.cz>
Date		: 2008-10-29 04:06 (19 days old)


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11876
Subject		: RCU hang on cpu re-hotplug with 2.6.27rc8
Submitter	: Andi Kleen <andi@firstfloor.org>
Date		: 2008-10-06 23:28 (42 days old)
References	: http://marc.info/?l=linux-kernel&m=122333610602399&w=2
Handled-By	: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11836
Subject		: Scheduler on C2D CPU and latest 2.6.27 kernel
Submitter	: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date		: 2008-10-21 9:59 (27 days old)
References	: http://marc.info/?l=linux-kernel&m=122458320502371&w=4
Handled-By	: Chris Snook <csnook@redhat.com>


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11698
Subject		: 2.6.27-rc7, freezes with > 1 s2ram cycle
Submitter	: Soeren Sonnenburg <kernel@nn7.de>
Date		: 2008-09-29 11:29 (49 days old)
References	: http://marc.info/?l=linux-kernel&m=122268780926859&w=4
Handled-By	: Rafael J. Wysocki <rjw@sisk.pl>


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11664
Subject		: acpi errors and random freeze on sony vaio sr
Submitter	: Giovanni Pellerano <giovanni.pellerano@gmail.com>
Date		: 2008-09-28 03:48 (50 days old)


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject		: Panic stop CPUs regression
Submitter	: Andi Kleen <andi@firstfloor.org>
Date		: 2008-09-02 13:49 (76 days old)
References	: http://marc.info/?l=linux-kernel&m=122036356127282&w=4


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject		: kernel panic: softlockup in tick_periodic() ???
Submitter	: Joshua Hoblitt <j_kernel@hoblitt.com>
Date		: 2008-09-11 16:46 (67 days old)
References	: http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By	: Thomas Gleixner <tglx@linutronix.de>
		  Cyrill Gorcunov <gorcunov@gmail.com>
		  Ingo Molnar <mingo@elte.hu>


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject		: BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter	: rdunlap <randy.dunlap@oracle.com>
Date		: 2008-08-21 5:52 (88 days old)
References	: http://marc.info/?l=linux-kernel&m=121929819616273&w=4
		  http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By	: Miller, Mike (OS Dev) <Mike.Miller@hp.com>
		  James Bottomley <James.Bottomley@hansenpartnership.com>


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject		: tbench regression on each kernel release from  2.6.22 -> 2.6.28
Submitter	: Christoph Lameter <cl@linux-foundation.org>
Date		: 2008-08-11 18:36 (98 days old)
References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject		: INFO: possible recursive locking detected ps2_command
Submitter	: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date		: 2008-07-31 9:41 (109 days old)
References	: http://marc.info/?l=linux-kernel&m=121749737011637&w=4


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject		: VolanoMark regression with 2.6.27-rc1
Submitter	: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Date		: 2008-07-31 3:20 (109 days old)
References	: http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By	: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
		  Peter Zijlstra <a.p.zijlstra@chello.nl>
		  Dhaval Giani <dhaval@linux.vnet.ibm.com>
		  Miao Xie <miaox@cn.fujitsu.com>


Regressions with patches
------------------------

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11865
Subject		: WOL for E100 Doesn't Work Anymore
Submitter	: roger <rogerx@sdf.lonestar.org>
Date		: 2008-10-26 21:56 (22 days old)
Handled-By	: Rafael J. Wysocki <rjw@sisk.pl>
Patch		: http://bugzilla.kernel.org/attachment.cgi?id=18646&action=view


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11843
Subject		: usb hdd problems with 2.6.27.2
Submitter	: Luciano Rocha <luciano@eurotux.com>
Date		: 2008-10-22 16:22 (26 days old)
References	: http://marc.info/?l=linux-kernel&m=122469318102679&w=4
Handled-By	: Luciano Rocha <luciano@eurotux.com>
Patch		: http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject		: mounting XFS produces a segfault
Submitter	: Tiago Maluta <maluta_tiago@yahoo.com.br>
Date		: 2008-10-21 18:00 (27 days old)
Handled-By	: Dave Chinner <dgc@sgi.com>
Patch		: http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11795
Subject		: ks959-sir dongle no longer works under 2.6.27 (REGRESSION)
Submitter	: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Date		: 2008-10-20 10:49 (28 days old)
Handled-By	: Samuel Ortiz <samuel@sortiz.org>
Patch		: http://bugzilla.kernel.org/show_bug.cgi?id=11795#c22


For details, please visit the bug entries and follow the links given in
references.

As you can see, there is a Bugzilla entry for each of the listed regressions.
There also is a Bugzilla entry used for tracking the regressions introduced
between 2.6.26 and 2.6.27, unresolved as well as resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=11167

Please let me know if there are any Bugzilla entries that should be added to
the list in there.

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11207] VolanoMark regression with 2.6.27-rc1
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:38   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:38 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Dhaval Giani, Miao Xie, Peter Zijlstra,
	Zhang, Yanmin

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject		: VolanoMark regression with 2.6.27-rc1
Submitter	: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Date		: 2008-07-31 3:20 (109 days old)
References	: http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By	: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
		  Peter Zijlstra <a.p.zijlstra@chello.nl>
		  Dhaval Giani <dhaval@linux.vnet.ibm.com>
		  Miao Xie <miaox@cn.fujitsu.com>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11215] INFO: possible recursive locking detected ps2_command
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Zdenek Kabelac

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject		: INFO: possible recursive locking detected ps2_command
Submitter	: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date		: 2008-07-31 9:41 (109 days old)
References	: http://marc.info/?l=linux-kernel&m=121749737011637&w=4



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11308] tbench regression on each kernel release from  2.6.22 -> 2.6.28
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Christoph Lameter

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject		: tbench regression on each kernel release from  2.6.22 -> 2.6.28
Submitter	: Christoph Lameter <cl@linux-foundation.org>
Date		: 2008-08-11 18:36 (98 days old)
References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, James Bottomley, Miller, Mike (OS Dev), rdunlap

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject		: BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter	: rdunlap <randy.dunlap@oracle.com>
Date		: 2008-08-21 5:52 (88 days old)
References	: http://marc.info/?l=linux-kernel&m=121929819616273&w=4
		  http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By	: Miller, Mike (OS Dev) <Mike.Miller@hp.com>
		  James Bottomley <James.Bottomley@hansenpartnership.com>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11543] kernel panic: softlockup in tick_periodic() ???
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Cyrill Gorcunov, Ingo Molnar,
	Joshua Hoblitt, Thomas Gleixner

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject		: kernel panic: softlockup in tick_periodic() ???
Submitter	: Joshua Hoblitt <j_kernel@hoblitt.com>
Date		: 2008-09-11 16:46 (67 days old)
References	: http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By	: Thomas Gleixner <tglx@linutronix.de>
		  Cyrill Gorcunov <gorcunov@gmail.com>
		  Ingo Molnar <mingo@elte.hu>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11569] Panic stop CPUs regression
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Andi Kleen

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject		: Panic stop CPUs regression
Submitter	: Andi Kleen <andi@firstfloor.org>
Date		: 2008-09-02 13:49 (76 days old)
References	: http://marc.info/?l=linux-kernel&m=122036356127282&w=4



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11664] acpi errors and random freeze on sony vaio sr
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Giovanni Pellerano, ykzhao, Zhang Rui

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11664
Subject		: acpi errors and random freeze on sony vaio sr
Submitter	: Giovanni Pellerano <giovanni.pellerano@gmail.com>
Date		: 2008-09-28 03:48 (50 days old)
Patch		: http://marc.info/?l=linux-acpi&m=122514341319748&w=4



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Rafael J. Wysocki, Soeren Sonnenburg

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11698
Subject		: 2.6.27-rc7, freezes with > 1 s2ram cycle
Submitter	: Soeren Sonnenburg <kernel@nn7.de>
Date		: 2008-09-29 11:29 (49 days old)
References	: http://marc.info/?l=linux-kernel&m=122268780926859&w=4
Handled-By	: Rafael J. Wysocki <rjw@sisk.pl>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11543] kernel panic: softlockup in tick_periodic() ???
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  0 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Cyrill Gorcunov, Ingo Molnar,
	Joshua Hoblitt, Thomas Gleixner

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject		: kernel panic: softlockup in tick_periodic() ???
Submitter	: Joshua Hoblitt <j_kernel-amK9oZtvyLhBDgjK7y7TUQ@public.gmane.org>
Date		: 2008-09-11 16:46 (67 days old)
References	: http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By	: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
		  Cyrill Gorcunov <gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
		  Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>
		  Cyrill Gorcunov <gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  0 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, James Bottomley, Miller, Mike (OS Dev), rdunlap

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject		: BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter	: rdunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date		: 2008-08-21 5:52 (88 days old)
References	: http://marc.info/?l=linux-kernel&m=121929819616273&w=4
		  http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By	: Miller, Mike (OS Dev) <Mike.Miller-VXdhtT5mjnY@public.gmane.org>
		  James Bottomley <James.Bottomley-JuX6DAaQMKPCXq6kfMZ53/egYHeGw8Jk@public.gmane.org>


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11569] Panic stop CPUs regression
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  0 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Andi Kleen

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject		: Panic stop CPUs regression
Submitter	: Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
Date		: 2008-09-02 13:49 (76 days old)
References	: http://marc.info/?l=linux-kernel&m=122036356127282&w=4


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11664] acpi errors and random freeze on sony vaio sr
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  0 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Giovanni Pellerano, ykzhao, Zhang Rui

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11664
Subject		: acpi errors and random freeze on sony vaio sr
Submitter	: Giovanni Pellerano <giovanni.pellerano-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date		: 2008-09-28 03:48 (50 days old)
Patch		: http://marc.info/?l=linux-acpi&m=122514341319748&w=4


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  0 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Rafael J. Wysocki, Soeren Sonnenburg

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11698
Subject		: 2.6.27-rc7, freezes with > 1 s2ram cycle
Submitter	: Soeren Sonnenburg <kernel-YxZl4NRrHdw@public.gmane.org>
Date		: 2008-09-29 11:29 (49 days old)
References	: http://marc.info/?l=linux-kernel&m=122268780926859&w=4
Handled-By	: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11836] Scheduler on C2D CPU and latest 2.6.27 kernel
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Chris Snook, Zdenek Kabelac

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11836
Subject		: Scheduler on C2D CPU and latest 2.6.27 kernel
Submitter	: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date		: 2008-10-21 9:59 (27 days old)
References	: http://marc.info/?l=linux-kernel&m=122458320502371&w=4
Handled-By	: Chris Snook <csnook@redhat.com>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11795] ks959-sir dongle no longer works under 2.6.27 (REGRESSION)
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Alex Villacis Lasso, Samuel Ortiz, Vasily

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11795
Subject		: ks959-sir dongle no longer works under 2.6.27 (REGRESSION)
Submitter	: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Date		: 2008-10-20 10:49 (28 days old)
Handled-By	: Samuel Ortiz <samuel@sortiz.org>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11805] mounting XFS produces a segfault
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Dave Chinner, Tiago Maluta

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject		: mounting XFS produces a segfault
Submitter	: Tiago Maluta <maluta_tiago@yahoo.com.br>
Date		: 2008-10-21 18:00 (27 days old)
Handled-By	: Dave Chinner <dgc@sgi.com>
Patch		: http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11886] without serial console system doesn't  poweroff
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Daniel Smolik, Rafael J. Wysocki

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11886
Subject		: without serial console system doesn't  poweroff
Submitter	: Daniel Smolik <marvin@mydatex.cz>
Date		: 2008-10-29 04:06 (19 days old)



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11876] RCU hang on cpu re-hotplug with 2.6.27rc8
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Andi Kleen, Paul E. McKenney

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11876
Subject		: RCU hang on cpu re-hotplug with 2.6.27rc8
Submitter	: Andi Kleen <andi@firstfloor.org>
Date		: 2008-10-06 23:28 (42 days old)
References	: http://marc.info/?l=linux-kernel&m=122333610602399&w=2
Handled-By	: Paul E. McKenney <paulmck@linux.vnet.ibm.com>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11865] WOL for E100 Doesn't Work Anymore
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Rafael J. Wysocki, roger

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11865
Subject		: WOL for E100 Doesn't Work Anymore
Submitter	: roger <rogerx@sdf.lonestar.org>
Date		: 2008-10-26 21:56 (22 days old)
Handled-By	: Rafael J. Wysocki <rjw@sisk.pl>
Patch		: http://bugzilla.kernel.org/attachment.cgi?id=18646&action=view



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11843] usb hdd problems with 2.6.27.2
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:40   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Alan Stern, Luciano Rocha, Tim Wright

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11843
Subject		: usb hdd problems with 2.6.27.2
Submitter	: Luciano Rocha <luciano@eurotux.com>
Date		: 2008-10-22 16:22 (26 days old)
References	: http://marc.info/?l=linux-kernel&m=122469318102679&w=4
Handled-By	: Luciano Rocha <luciano@eurotux.com>
Patch		: http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #12039] Regression: USB/DVB 2.6.26.8 --> 2.6.27.6
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:41   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:41 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, David

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=12039
Subject		: Regression: USB/DVB 2.6.26.8 --> 2.6.27.6
Submitter	: David <david@unsolicited.net>
Date		: 2008-11-14 20:20 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=122669568022274&w=4



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #11983] iwlagn: wrong command queue 31, command id 0x0
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:41   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:41 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List, Luis R. Rodriguez, Marcel Holtmann,
	Matt Mackall, reinette chatre

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11983
Subject		: iwlagn: wrong command queue 31, command id 0x0
Submitter	: Matt Mackall <mpm@selenic.com>
Date		: 2008-11-06 4:16 (11 days old)
References	: http://marc.info/?l=linux-kernel&m=122598672815803&w=4
		  http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703
Handled-By	: reinette chatre <reinette.chatre@intel.com>



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [Bug #12048] Regression in bonding between 2.6.26.8 and 2.6.27.6
  2008-11-16 17:38 ` Rafael J. Wysocki
@ 2008-11-16 17:41   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 349+ messages in thread
From: Rafael J. Wysocki @ 2008-11-16 17:41 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Jesper Krogh

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27.  Please verify if it still should
be listed and let me know (either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=12048
Subject		: Regression in bonding between 2.6.26.8 and 2.6.27.6
Submitter	: Jesper Krogh <jesper@krogh.cc>
Date		: 2008-11-16 9:41 (1 days old)
References	: http://marc.info/?l=linux-kernel&m=122682977001048&w=4



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11843] usb hdd problems with 2.6.27.2
  2008-11-16 17:40   ` Rafael J. Wysocki
  (?)
@ 2008-11-16 21:37   ` Luciano Rocha
  -1 siblings, 0 replies; 349+ messages in thread
From: Luciano Rocha @ 2008-11-16 21:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, Kernel Testers List, Alan Stern, Tim Wright

On Sun, Nov 16, 2008 at 06:40:59PM +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
> 
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> be listed and let me know (either way).
> 
> 
> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11843
> Subject		: usb hdd problems with 2.6.27.2
> Submitter	: Luciano Rocha <luciano@eurotux.com>
> Date		: 2008-10-22 16:22 (26 days old)
> References	: http://marc.info/?l=linux-kernel&m=122469318102679&w=4
> Handled-By	: Luciano Rocha <luciano@eurotux.com>
> Patch		: http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26

What does "Handled-By" mean? The patches were created by Alan Stern
<stern@rowland.harvard.edu>, I just tested them.

Regards,
Luciano Rocha

-- 
Luciano Rocha <luciano@eurotux.com>
Eurotux Informática, S.A. <http://www.eurotux.com/>

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-16 17:40   ` Rafael J. Wysocki
@ 2008-11-17  9:06     ` Ingo Molnar
  -1 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17  9:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, Kernel Testers List,
	Christoph Lameter, Mike Galbraith, Peter Zijlstra


* Rafael J. Wysocki <rjw@sisk.pl> wrote:

> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
> 
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> be listed and let me know (either way).
> 
> 
> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
> Subject		: tbench regression on each kernel release from  2.6.22 -> 2.6.28
> Submitter	: Christoph Lameter <cl@linux-foundation.org>
> Date		: 2008-08-11 18:36 (98 days old)
> References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> 		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4

Christoph, as per the recent analysis of Mike:

 http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html

all scheduler components of this regression have been eliminated.

In fact his numbers show that scheduler speedups since 2.6.22 have 
offset and hidden most other sources of tbench regression. (i.e. the 
scheduler portion got 5% faster, hence it was able to offset a 
slowdown of 5% in other areas of the kernel that tbench triggers)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17  9:14       ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17  9:14 UTC (permalink / raw)
  To: mingo; +Cc: rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 10:06:48 +0100

> 
> * Rafael J. Wysocki <rjw@sisk.pl> wrote:
> 
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.26 and 2.6.27.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> > be listed and let me know (either way).
> > 
> > 
> > Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > Subject		: tbench regression on each kernel release from  2.6.22 -> 2.6.28
> > Submitter	: Christoph Lameter <cl@linux-foundation.org>
> > Date		: 2008-08-11 18:36 (98 days old)
> > References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > 		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4
> 
> Christoph, as per the recent analysis of Mike:
> 
>  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> 
> all scheduler components of this regression have been eliminated.
> 
> In fact his numbers show that scheduler speedups since 2.6.22 have 
> offset and hidden most other sources of tbench regression. (i.e. the 
> scheduler portion got 5% faster, hence it was able to offset a 
> slowdown of 5% in other areas of the kernel that tbench triggers)

Although I respect the improvements, wake_up() is still several orders
of magnitude slower than it was in 2.6.22 and wake_up() is at the top
of the profiles in tbench runs.

It really is premature to close this regression at this time.

I am working with every spare moment I have to try and nail this
stuff, but unless someone else helps me people need to be patient.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 11:01         ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 11:01 UTC (permalink / raw)
  To: David Miller
  Cc: rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra,
	Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 13750 bytes --]


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 10:06:48 +0100
> 
> > 
> > * Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > 
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.26 and 2.6.27.
> > > 
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> > > be listed and let me know (either way).
> > > 
> > > 
> > > Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > > Subject		: tbench regression on each kernel release from  2.6.22 -> 2.6.28
> > > Submitter	: Christoph Lameter <cl@linux-foundation.org>
> > > Date		: 2008-08-11 18:36 (98 days old)
> > > References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > > 		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4
> > 
> > Christoph, as per the recent analysis of Mike:
> > 
> >  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> > 
> > all scheduler components of this regression have been eliminated.
> > 
> > In fact his numbers show that scheduler speedups since 2.6.22 have 
> > offset and hidden most other sources of tbench regression. (i.e. the 
> > scheduler portion got 5% faster, hence it was able to offset a 
> > slowdown of 5% in other areas of the kernel that tbench triggers)
> 
> Although I respect the improvements, wake_up() is still several 
> orders of magnitude slower than it was in 2.6.22 and wake_up() is at 
> the top of the profiles in tbench runs.

hm, several orders of magnitude slower? That contradicts Mike's 
numbers and my own numbers and profiles as well: see below.

The scheduler's overhead barely even registers on a 16-way x86 system 
I'm running tbench on. Here's the NMI profile during a 64-thread tbench 
run on a 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:

  Throughput 3437.65 MB/sec 64 procs
  ==================================
  21570252  total 
  ........
   1494803  copy_user_generic_string 
    998232  sock_rfree 
    491471  tcp_ack 
    482405  ip_dont_fragment 
    470685  ip_local_deliver 
    436325  constant_test_bit         [ called by napi_disable_pending() ]
    375469  avc_has_perm_noaudit 
    347663  tcp_sendmsg 
    310383  tcp_recvmsg 
    300412  __inet_lookup_established 
    294377  system_call 
    286603  tcp_transmit_skb 
    251782  selinux_ip_postroute 
    236028  tcp_current_mss 
    235631  schedule 
    234013  netif_rx 
    229854  _local_bh_enable_ip 
    219501  tcp_v4_rcv 

    [ etc. - see full profile attached further below ]

Note that the scheduler does not even show up in the profile up to 
entry #15!
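
For readers who want to check that ranking themselves, here is a minimal
Python sketch that parses the excerpt above and reports where schedule()
lands. The helper names (parse_profile, rank_of) are made up for
illustration, and the input is only the 18-line excerpt quoted above, not
the full attached profile.

```python
# Excerpt of the NMI profile quoted above: "<sample count>  <symbol>".
PROFILE_TEXT = """\
   1494803  copy_user_generic_string
    998232  sock_rfree
    491471  tcp_ack
    482405  ip_dont_fragment
    470685  ip_local_deliver
    436325  constant_test_bit
    375469  avc_has_perm_noaudit
    347663  tcp_sendmsg
    310383  tcp_recvmsg
    300412  __inet_lookup_established
    294377  system_call
    286603  tcp_transmit_skb
    251782  selinux_ip_postroute
    236028  tcp_current_mss
    235631  schedule
    234013  netif_rx
    229854  _local_bh_enable_ip
    219501  tcp_v4_rcv
"""


def parse_profile(text):
    """Return a list of (samples, symbol) tuples, skipping malformed lines."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            entries.append((int(parts[0]), parts[1]))
    return entries


def rank_of(entries, symbol):
    """1-based position of `symbol` when sorted by descending sample count."""
    ordered = sorted(entries, key=lambda entry: entry[0], reverse=True)
    for position, (_, name) in enumerate(ordered, start=1):
        if name == symbol:
            return position
    return None


entries = parse_profile(PROFILE_TEXT)
print("schedule is entry #%d" % rank_of(entries, "schedule"))
```

Run against the excerpt, rank_of(entries, "schedule") gives 15, which
matches the "entry #15" remark.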

I've also summarized NMI profiler output by major subsystems:

           NET       overhead (12603450/21570252): 58.43%
           security  overhead ( 1903598/21570252):  8.83%
           usercopy  overhead ( 1753617/21570252):  8.13%
           sched     overhead ( 1599406/21570252):  7.41%
           syscall   overhead (  560487/21570252):  2.60%
           IRQ       overhead (  555439/21570252):  2.58%
           slab      overhead (  492421/21570252):  2.28%
           timer     overhead (  226573/21570252):  1.05%
           pagealloc overhead (  192681/21570252):  0.89%
           PID       overhead (  115123/21570252):  0.53%
           VFS       overhead (  107926/21570252):  0.50%
           pagecache overhead (   62552/21570252):  0.29%
           gtod      overhead (   38651/21570252):  0.18%
           IDLE      overhead (       0/21570252):  0.00%
---------------------------------------------------------
                         left ( 1349494/21570252):  6.26%
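
The percentages in this table are just each subsystem's sample count
divided by the profile total. The sketch below recomputes them from the
counts listed above. How individual symbols were assigned to subsystems
is not stated in this mail, so the per-subsystem counts are taken as
given; the function name overhead_percent is a placeholder.

```python
# Sample counts copied from the subsystem table above.
TOTAL_SAMPLES = 21570252

SUBSYSTEM_SAMPLES = {
    "NET": 12603450,
    "security": 1903598,
    "usercopy": 1753617,
    "sched": 1599406,
    "syscall": 560487,
    "IRQ": 555439,
    "slab": 492421,
    "timer": 226573,
    "pagealloc": 192681,
    "PID": 115123,
    "VFS": 107926,
    "pagecache": 62552,
    "gtod": 38651,
    "IDLE": 0,
}


def overhead_percent(samples, total=TOTAL_SAMPLES):
    """Share of all profile samples, as a percentage with two decimals."""
    return round(100.0 * samples / total, 2)


for name, samples in SUBSYSTEM_SAMPLES.items():
    print(f"{name:<10} ({samples:>8}/{TOTAL_SAMPLES}): "
          f"{overhead_percent(samples):5.2f}%")
```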

The scheduler's functions are absolutely flat, and consistent with an 
extreme context-switching rate of 1.35 million per second. The 
scheduler can go up to about 20 million context switches per second on 
this system:

 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 32  0      0 32229696  29308 649880    0    0     0     0 164135 20026853 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164203 20032770 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164201 20036492 25 75  0  0  0

... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
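
That back-of-the-envelope check can be written out as below. It is only
a sketch: it assumes scheduler cost scales roughly linearly with the
context-switch rate, and it takes the three vmstat "cs" samples above as
the peak rate. The variable names are illustrative.

```python
# "cs" column from the three vmstat lines above (context switches/sec).
VMSTAT_CS = [20026853, 20032770, 20036492]

# Context-switch rate quoted for the tbench run.
TBENCH_CS_PER_SEC = 1.35e6

# Scheduler samples and total samples from the subsystem summary.
SCHED_SAMPLES = 1599406
TOTAL_SAMPLES = 21570252

peak_cs_per_sec = sum(VMSTAT_CS) / len(VMSTAT_CS)

# If the scheduler saturates the machine at the peak rate, the fraction of
# CPU time it needs at the tbench rate is (assumed) the ratio of the rates.
expected_share = TBENCH_CS_PER_SEC / peak_cs_per_sec
measured_share = SCHED_SAMPLES / TOTAL_SAMPLES

print(f"expected scheduler share: {100 * expected_share:.2f}%")
print(f"measured scheduler share: {100 * measured_share:.2f}%")
```

The two figures come out within about one percentage point of each
other, which is the sense in which the numbers are "roughly consistent".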

Wakeup affinities and data-flow caching are just fine in this workload 
- we've got scheduler statistics for that and they look good too.

It all looks like pure old-fashioned straight overhead in the 
networking layer to me. Do we still touch the same global cacheline 
for every localhost packet we process? Anything like that would show 
up big time.

Anyway, in terms of scheduling there's absolutely nothing anomalous I 
can see about this workload. Scheduling looks healthy throughout - and 
the few things we noticed causing unnecessary overhead are now fixed 
in -rc5. (but it's all in the <5% range of impact of total scheduling 
overhead - i.e. in the 0.4% absolute range in this workload)
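
The conversion from a relative to an absolute figure is a single
multiplication. A small sketch, using the 7.41% scheduler share from the
subsystem summary and treating 5% as the upper bound on the relative
impact (variable names are illustrative):

```python
# Scheduler share of all profile samples, in percent (subsystem summary).
SCHED_SHARE_PERCENT = 7.41

# Upper bound on how much of that share the fixed inefficiencies cost.
RELATIVE_IMPACT_BOUND = 0.05

# Absolute impact on the whole workload, in percentage points.
absolute_bound = SCHED_SHARE_PERCENT * RELATIVE_IMPACT_BOUND

print(f"at most about {absolute_bound:.2f} percentage points of total CPU time")
```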

And the thing is, the scheduler's task in this workload is by far the 
most difficult one conceptually: it has to manage and optimize 
concurrency of _future_ processing, with an event frequency that is 
_WAY_ out of the normal patterns: more than 1.3 million context 
switches per second (!). It also switches to/from completely 
independent contexts of computing, with all the implications that 
this brings.

Networking and VFS "just" have to shuffle around bits in memory along a 
very specific plan given to it by user-space. That plan is 
well-specified and goes along the lines of: "copy this (already 
cached) file content to that socket" and back.

By the raw throughput figures the system is pushing a couple of 
million data packets per second.

Still, we spend more than 7 times as much CPU time in the networking 
code as in the scheduler or in the user-copy code. Why?
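
The "7 times" figure follows from the sample counts in the subsystem
summary. A minimal sketch of the arithmetic, with the counts copied from
that table and variable names chosen here for illustration:

```python
# Sample counts from the subsystem summary.
NET_SAMPLES = 12603450
SCHED_SAMPLES = 1599406
USERCOPY_SAMPLES = 1753617

net_vs_sched = NET_SAMPLES / SCHED_SAMPLES
net_vs_usercopy = NET_SAMPLES / USERCOPY_SAMPLES

print(f"NET / sched    = {net_vs_sched:.2f}x")
print(f"NET / usercopy = {net_vs_usercopy:.2f}x")
```

Both ratios fall between 7 and 8.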

	Ingo

------------------------->
  21570252 total 
  ........
  1494803 copy_user_generic_string 
  998232 sock_rfree 
  491471 tcp_ack 
  482405 ip_dont_fragment 
  470685 ip_local_deliver 
  436325 constant_test_bit 
  375469 avc_has_perm_noaudit 
  347663 tcp_sendmsg 
  310383 tcp_recvmsg 
  300412 __inet_lookup_established 
  294377 system_call 
  286603 tcp_transmit_skb 
  251782 selinux_ip_postroute 
  236028 tcp_current_mss 
  235631 schedule 
  234013 netif_rx 
  229854 _local_bh_enable_ip 
  219501 tcp_v4_rcv 
  210046 netlbl_enabled 
  205022 constant_test_bit 
  199598 skb_release_head_state 
  187952 ip_queue_xmit 
  178779 tcp_established_options 
  175955 dev_queue_xmit 
  169904 netif_receive_skb 
  166629 ip_finish_output2 
  162291 sysret_check 
  151262 __switch_to 
  143355 audit_syscall_entry 
  142694 load_cr3 
  136571 memset_c 
  136115 nf_hook_slow 
  130825 ip_local_deliver_finish 
  128795 ip_rcv 
  125995 selinux_socket_sock_rcv_skb 
  123944 net_rx_action 
  123100 __copy_skb_header 
  122052 __inet_lookup 
  121744 constant_test_bit 
  119444 get_page_from_freelist 
  116486 avc_has_perm 
  115643 audit_syscall_exit 
  115123 find_pid_ns 
  114483 tcp_cleanup_rbuf 
  111350 tcp_rcv_established 
  109853 __mod_timer 
  107891 lock_sock_nested 
  107316 napi_disable_pending 
  106581 release_sock 
  104402 skb_copy_datagram_iovec 
  101591 __tcp_push_pending_frames 
  101206 tcp_event_data_recv 
   98046 kmem_cache_alloc_node
   97982 tcp_v4_do_rcv
   92714 sys_recvfrom
   91551 rb_erase
   89730 kfree
   87979 ip_rcv_finish
   87166 compare_ether_addr
   86982 selinux_parse_skb
   86731 nf_iterate
   79690 selinux_ipv4_output
   79347 __cache_free
   78992 audit_free_names
   78127 skb_release_data
   77501 mod_timer
   77241 __sock_recvmsg
   77228 sock_recvmsg
   77211 ____cache_alloc
   76495 tcp_rcv_space_adjust
   75283 sk_wait_data
   71772 sys_sendto
   71594 sched_clock
   70880 eth_type_trans
   70238 memcpy_toiovec
   69193 do_softirq
   68341 __update_sched_clock
   67597 tcp_v4_md5_lookup
   67424 try_to_wake_up
   64465 sock_common_recvmsg
   64116 put_prev_task_fair
   63964 process_backlog
   62216 __do_softirq
   62093 tcp_cwnd_validate
   61128 __alloc_skb
   60588 put_page
   59536 dput
   58411 __ip_local_out
   56349 avc_audit
   55626 __napi_schedule
   55525 selinux_ipv4_postroute
   54499 __enqueue_entity
   53599 local_bh_disable
   53418 unroll_tree_refs
   53162 __unlazy_fpu
   53084 cfs_rq_of
   52475 set_next_entity
   51108 thread_return
   50458 ip_output
   50268 sched_clock_cpu
   49974 tcp_send_delayed_ack
   49736 ip_finish_output
   49670 finish_task_switch
   49070 ___swab16
   48499 audit_get_context
   48347 raw_local_deliver
   47824 tcp_rtt_estimator
   46707 tcp_push
   46405 constant_test_bit
   45859 select_task_rq_fair
   45188 math_state_restore
   44889 check_preempt_wakeup
   44449 task_rq_lock
   43704 sel_netif_sid
   43377 sock_sendmsg
   42612 sk_reset_timer
   42606 __skb_clone
   42223 __find_general_cachep
   41950 selinux_socket_sendmsg
   41716 constant_test_bit
   41097 skb_push
   40723 lock_sock
   40715 system_call_after_swapgs
   40399 selinux_netlbl_inode_permission
   40179 rb_insert_color
   40021 __kfree_skb
   40015 sockfd_lookup_light
   39216 internal_add_timer
   39024 skb_can_coalesce
   38838 __tcp_select_window
   38651 current_kernel_time
   38533 tcp_v4_md5_do_lookup
   38372 __sock_sendmsg
   38162 selinux_socket_recvmsg
   37812 sel_netport_sid
   37727 account_group_exec_runtime
   37695 switch_mm
   36247 nf_hook_thresh
   36057 auditsys
   35266 pick_next_task_fair
   35064 __tcp_ack_snd_check
   35052 sock_def_readable
   34826 sysret_careful
   34578 _local_bh_enable
   34498 free_hot_cold_page
   34338 kmap
   34028 loopback_xmit
   33320 sk_stream_alloc_skb
   33269 test_ti_thread_flag
   33219 skb_fill_page_desc
   33049 tcp_is_cwnd_limited
   33012 update_min_vruntime
   32431 native_read_tsc
   32398 dst_release
   31661 get_pageblock_flags_group
   31652 path_put
   31516 tcp_push_pending_frames
   31265 netif_needs_gso
   31175 constant_test_bit
   31077 __cycles_2_ns
   30971 socket_has_perm
   30893 __phys_addr
   30867 lock_timer_base
   30585 __wake_up
   30456 ret_from_sys_call
   30147 skb_release_all
   29356 local_bh_enable
   29334 __skb_insert
   28681 tcp_cwnd_test
   28652 __skb_dequeue
   28612 prepare_to_wait
   28268 kmem_cache_free
   28193 set_bit
   28149 dequeue_task_fair
   27906 skb_header_pointer
   27861 sys_kill
   27803 selinux_task_kill
   27627 audit_free_aux
   27600 selinux_netlbl_sock_rcv_skb
   26794 update_curr
   26777 __alloc_pages_internal
   26469 skb_entail
   26458 pskb_may_pull
   26216 inet_ehashfn
   26075 call_softirq
   26033 copy_from_user
   25933 __local_bh_disable
   25666 fget_light
   25270 inet_csk_reset_xmit_timer
   25071 signal_pending_state
   24117 tcp_init_tso_segs
   24109 TCP_ECN_check_ce
   23702 nf_hook_thresh
   23558 copy_to_user
   23426 sysret_audit
   23267 sk_wake_async
   22627 tcp_options_write
   22174 netif_tx_queue_stopped
   21795 tcp_prequeue_process
   21757 tcp_set_skb_tso_segs
   21579 avc_hash
   21565 ___swab16
   21560 ip_local_out
   21445 sk_wmem_schedule
   21234 get_page
   21200 __wake_up_common
   21042 sel_netnode_find
   20772 sock_put
   20625 schedule_timeout
   20613 __napi_complete
   20563 fput_light
   20532 tcp_bound_to_half_wnd
   19912 cap_task_kill
   19773 sysret_signal
   19374 compound_head
   19121 get_seconds
   19048 PageLRU
   18893 zone_watermark_ok
   18635 tcp_snd_wnd_test
   18634 enqueue_task_fair
   18603 rb_next
   18598 next_zones_zonelist
   18534 resched_task
   17820 hash_64
   17801 autoremove_wake_function
   17451 __skb_queue_before
   17283 native_load_tls
   17227 __skb_dequeue
   17149 xfrm4_policy_check
   16942 zone_statistics
   16886 skb_reset_network_header
   16824 ___swab16
   16725 pskb_may_pull
   16645 dev_hard_start_xmit
   16580 sk_filter
   16523 tcp_ca_event
   16479 tcp_win_from_space
   16408 tcp_parse_aligned_timestamp
   16204 finish_wait
   16124 virt_to_slab
   15965 tcp_v4_send_check
   15920 skb_reset_transport_header
   15867 tcp_data_snd_check
   15819 security_sock_rcv_skb
   15665 tcp_ack_saw_tstamp
   15621 skb_network_offset
   15568 virt_to_head_page
   15553 dst_confirm
   15320 skb_pull
   15277 clear_bit
   15179 alloc_pages_current
   14991 bictcp_acked
   14743 tcp_store_ts_recent
   14660 sel_netnode_sid
   14650 __xchg
   14573 task_has_perm
   14561 tcp_v4_check
   14492 net_invalid_timestamp
   14485 security_socket_recvmsg
   14363 __dequeue_entity
   14318 pid_nr_ns
   14311 device_not_available
   14212 local_bh_enable_ip
   14092 virt_to_cache
   13804 netpoll_rx
   13781 fcheck_files
   13724 tcp_adjust_fackets_out
   13717 net_timestamp
   13638 ___swab16
   13576 sel_netport_find
   13563 __kmalloc_node
   13530 __inc_zone_state
   13215 pid_vnr
   13208 free_pages_check
   13008 security_socket_sendmsg
   12971 ip_skb_dst_mtu
   12827 __cpu_set
   12782 bictcp_cong_avoid
   12779 test_tsk_thread_flag
   12734 wakeup_preempt_entity
   12651 sel_netif_find
   12545 skb_set_owner_r
   12534 skb_headroom
   12348 tcp_event_new_data_sent
   12251 place_entity
   12047 set_bit
   11805 update_rq_clock
   11788 detach_timer
   11659 policy_zonelist
   11423 skb_clone
   11380 __skb_queue_tail
   11249 dequeue_task
   10823 init_rootdomain
   10690 __cpu_clear
   10558 default_wake_function
   10556 tcp_rcv_rtt_measure_ts
   10451 PageSlab
   10427 sock_wfree
   10277 calc_delta_fair
   10237 tcp_validate_incoming
   10218 task_rq_unlock
   10023 page_get_cache

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 72924 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.28-rc5
# Mon Nov 17 11:59:36 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=20
# CONFIG_CGROUPS is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
# CONFIG_GROUP_SCHED is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
# CONFIG_MARKERS is not set
CONFIG_OPROFILE=m
CONFIG_OPROFILE_IBS=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_BLK_DEV_BSG is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_CLASSIC_RCU=y
CONFIG_FREEZER=y

#
# Processor type and features
#
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR_64=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=255
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
# CONFIG_MICROCODE_AMD is not set
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_MMU_NOTIFIER=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
# CONFIG_X86_PAT is not set
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_SCHED_HRTICK is not set
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_WMI is not set
CONFIG_ACPI_ASUS=m
CONFIG_ACPI_TOSHIBA=m
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=m

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set
# CONFIG_CPU_IDLE is not set

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
CONFIG_PD6729=m
CONFIG_I82092=m
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
# CONFIG_HOTPLUG_PCI_CPCI is not set
CONFIG_HOTPLUG_PCI_SHPC=m

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
CONFIG_XFRM_MIGRATE=y
# CONFIG_XFRM_STATISTICS is not set
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=m
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
CONFIG_TCP_CONG_CUBIC=m
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_BIC=y
# CONFIG_DEFAULT_CUBIC is not set
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="bic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CT_ACCT=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CT_PROTO_DCCP=m
CONFIG_NF_CT_PROTO_GRE=m
CONFIG_NF_CT_PROTO_SCTP=m
# CONFIG_NF_CT_PROTO_UDPLITE is not set
CONFIG_NF_CONNTRACK_AMANDA=m
CONFIG_NF_CONNTRACK_FTP=m
CONFIG_NF_CONNTRACK_H323=m
CONFIG_NF_CONNTRACK_IRC=m
CONFIG_NF_CONNTRACK_NETBIOS_NS=m
CONFIG_NF_CONNTRACK_PPTP=m
CONFIG_NF_CONNTRACK_SANE=m
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
# CONFIG_NF_CT_NETLINK is not set
# CONFIG_NETFILTER_TPROXY is not set
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_TRACE is not set
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
# CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP is not set
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
CONFIG_NETFILTER_XT_MATCH_REALM=m
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
CONFIG_IP_VS=m
# CONFIG_IP_VS_IPV6 is not set
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y

#
# IPVS scheduler
#
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m

#
# IPVS application helper
#
CONFIG_IP_VS_FTP=m

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_PROTO_DCCP=m
CONFIG_NF_NAT_PROTO_GRE=m
CONFIG_NF_NAT_PROTO_SCTP=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_TFTP=m
CONFIG_NF_NAT_AMANDA=m
CONFIG_NF_NAT_PPTP=m
CONFIG_NF_NAT_H323=m
CONFIG_NF_NAT_SIP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
# CONFIG_IP_NF_SECURITY is not set
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_RAW=m
# CONFIG_IP6_NF_SECURITY is not set

#
# DECnet: Netfilter Configuration
#
# CONFIG_DECNET_NF_GRABULATOR is not set
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
# CONFIG_BRIDGE_EBT_IP6 is not set
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_ULOG=m
# CONFIG_BRIDGE_EBT_NFLOG is not set
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m
CONFIG_IP_DCCP_ACKVEC=y

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
CONFIG_IP_DCCP_CCID2=m
# CONFIG_IP_DCCP_CCID2_DEBUG is not set
CONFIG_IP_DCCP_CCID3=m
# CONFIG_IP_DCCP_CCID3_DEBUG is not set
CONFIG_IP_DCCP_CCID3_RTO=100
CONFIG_IP_DCCP_TFRC_LIB=m

#
# DCCP Kernel Hacking
#
# CONFIG_IP_DCCP_DEBUG is not set
# CONFIG_NET_DCCPPROBE is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
CONFIG_TIPC=m
# CONFIG_TIPC_ADVANCED is not set
# CONFIG_TIPC_DEBUG is not set
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
# CONFIG_ATM_CLIP_NO_ICMP is not set
CONFIG_ATM_LANE=m
# CONFIG_ATM_MPOA is not set
CONFIG_ATM_BR2684=m
# CONFIG_ATM_BR2684_IPFILTER is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=m
# CONFIG_VLAN_8021Q_GVRP is not set
CONFIG_DECNET=m
CONFIG_DECNET_ROUTER=y
CONFIG_LLC=y
# CONFIG_LLC2 is not set
CONFIG_IPX=m
# CONFIG_IPX_INTERN is not set
CONFIG_ATALK=m
CONFIG_DEV_APPLETALK=m
CONFIG_IPDDP=m
CONFIG_IPDDP_ENCAP=y
CONFIG_IPDDP_DECAP=y
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
CONFIG_WAN_ROUTER=m
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
# CONFIG_NET_SCH_MULTIQ is not set
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_INGRESS=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
# CONFIG_NET_ACT_NAT is not set
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
# CONFIG_NET_ACT_SKBEDIT is not set
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
# CONFIG_NET_TCPPROBE is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
CONFIG_IRDA=m

#
# IrDA protocols
#
CONFIG_IRLAN=m
CONFIG_IRNET=m
CONFIG_IRCOMM=m
# CONFIG_IRDA_ULTRA is not set

#
# IrDA options
#
CONFIG_IRDA_CACHE_LAST_LSAP=y
CONFIG_IRDA_FAST_RR=y
# CONFIG_IRDA_DEBUG is not set

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
CONFIG_IRTTY_SIR=m

#
# Dongle support
#
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=m
CONFIG_ACTISYS_DONGLE=m
CONFIG_TEKRAM_DONGLE=m
CONFIG_TOIM3232_DONGLE=m
CONFIG_LITELINK_DONGLE=m
CONFIG_MA600_DONGLE=m
CONFIG_GIRBIL_DONGLE=m
CONFIG_MCP2120_DONGLE=m
CONFIG_OLD_BELKIN_DONGLE=m
CONFIG_ACT200L_DONGLE=m
# CONFIG_KINGSUN_DONGLE is not set
# CONFIG_KSDAZZLE_DONGLE is not set
# CONFIG_KS959_DONGLE is not set

#
# FIR device drivers
#
CONFIG_USB_IRDA=m
CONFIG_SIGMATEL_FIR=m
CONFIG_NSC_FIR=m
CONFIG_WINBOND_FIR=m
CONFIG_SMC_IRCC_FIR=m
CONFIG_ALI_FIR=m
CONFIG_VLSI_FIR=m
CONFIG_VIA_FIR=m
CONFIG_MCS_FIR=m
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIBTSDIO is not set
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
# CONFIG_BT_HCIUART_LL is not set
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIDTL1=m
CONFIG_BT_HCIBT3C=m
CONFIG_BT_HCIBLUECARD=m
CONFIG_BT_HCIBTUART=m
CONFIG_BT_HCIVHCI=m
# CONFIG_AF_RXRPC is not set
# CONFIG_PHONET is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
CONFIG_WIRELESS_OLD_REGULATORY=y
CONFIG_WIRELESS_EXT=y
CONFIG_WIRELESS_EXT_SYSFS=y
# CONFIG_MAC80211 is not set
CONFIG_IEEE80211=m
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_CRYPT_TKIP=m
CONFIG_RFKILL=m
# CONFIG_RFKILL_INPUT is not set
CONFIG_RFKILL_LEDS=y
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
# CONFIG_PARIDE is not set
CONFIG_BLK_CPQ_DA=y
CONFIG_BLK_CPQ_CISS_DA=m
CONFIG_CISS_SCSI_TAPE=y
CONFIG_BLK_DEV_DAC960=m
CONFIG_BLK_DEV_UMEM=m
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_SX8=m
CONFIG_BLK_DEV_UB=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
CONFIG_ATA_OVER_ETH=m
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_EEPROM_93CX6 is not set
CONFIG_SGI_IOC4=m
CONFIG_TIFM_CORE=m
CONFIG_TIFM_7XX1=m
# CONFIG_ACER_WMI is not set
CONFIG_ASUS_LAPTOP=m
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_ICS932S401 is not set
CONFIG_MSI_LAPTOP=m
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
CONFIG_SONY_LAPTOP=m
# CONFIG_SONYPI_COMPAT is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_SGI_XP is not set
# CONFIG_HP_ILO is not set
# CONFIG_SGI_GRU is not set
# CONFIG_C2PORT is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=m
# CONFIG_SCSI_FC_TGT_ATTRS is not set
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
# CONFIG_SCSI_SAS_LIBSAS is not set
CONFIG_SCSI_SRP_ATTRS=m
# CONFIG_SCSI_SRP_TGT_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_ARCMSR=m
# CONFIG_SCSI_ARCMSR_AER is not set
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
# CONFIG_SCSI_MVSAS is not set
CONFIG_SCSI_STEX=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_QLA_ISCSI=m
CONFIG_SCSI_LPFC=m
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
# CONFIG_SCSI_DEBUG is not set
CONFIG_SCSI_SRP=m
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
# CONFIG_SCSI_DH is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y
CONFIG_SATA_SVW=m
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=m
CONFIG_SATA_NV=y
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SX4=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIS=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m
CONFIG_SATA_INIC162X=m
# CONFIG_PATA_ACPI is not set
CONFIG_PATA_ALI=m
CONFIG_PATA_AMD=y
CONFIG_PATA_ARTOP=m
CONFIG_PATA_ATIIXP=m
# CONFIG_PATA_CMD640_PCI is not set
CONFIG_PATA_CMD64X=m
CONFIG_PATA_CS5520=m
CONFIG_PATA_CS5530=m
CONFIG_PATA_CYPRESS=m
CONFIG_PATA_EFAR=m
CONFIG_ATA_GENERIC=m
CONFIG_PATA_HPT366=m
CONFIG_PATA_HPT37X=m
CONFIG_PATA_HPT3X2N=m
CONFIG_PATA_HPT3X3=m
# CONFIG_PATA_HPT3X3_DMA is not set
CONFIG_PATA_IT821X=m
CONFIG_PATA_IT8213=m
CONFIG_PATA_JMICRON=m
CONFIG_PATA_TRIFLEX=m
CONFIG_PATA_MARVELL=m
CONFIG_PATA_MPIIX=m
CONFIG_PATA_OLDPIIX=y
CONFIG_PATA_NETCELL=m
# CONFIG_PATA_NINJA32 is not set
CONFIG_PATA_NS87410=m
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OPTI=m
CONFIG_PATA_OPTIDMA=m
CONFIG_PATA_PCMCIA=m
CONFIG_PATA_PDC_OLD=m
CONFIG_PATA_RADISYS=m
CONFIG_PATA_RZ1000=m
CONFIG_PATA_SC1200=m
CONFIG_PATA_SERVERWORKS=m
CONFIG_PATA_PDC2027X=m
CONFIG_PATA_SIL680=m
CONFIG_PATA_SIS=m
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=m
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
# CONFIG_FUSION_LOGGING is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
CONFIG_IEEE1394=m
CONFIG_IEEE1394_OHCI1394=m
CONFIG_IEEE1394_PCILYNX=m
CONFIG_IEEE1394_SBP2=m
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394_ROM_ENTRY=y
CONFIG_IEEE1394_ETH1394=m
CONFIG_IEEE1394_RAWIO=m
CONFIG_IEEE1394_VIDEO1394=m
CONFIG_IEEE1394_DV1394=m
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
CONFIG_I2O=m
# CONFIG_I2O_LCT_NOTIFY_ON_CHANGES is not set
CONFIG_I2O_EXT_ADAPTEC=y
CONFIG_I2O_EXT_ADAPTEC_DMA64=y
# CONFIG_I2O_CONFIG is not set
CONFIG_I2O_BUS=m
CONFIG_I2O_BLOCK=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_IFB=m
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
CONFIG_EQUALIZER=m
CONFIG_TUN=m
# CONFIG_VETH is not set
CONFIG_NET_SB1000=m
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
CONFIG_HAPPYMEAL=m
CONFIG_SUNGEM=m
CONFIG_CASSINI=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=y
CONFIG_TYPHOON=m
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_TULIP=m
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_TULIP_NAPI is not set
CONFIG_DE4X5=m
CONFIG_WINBOND_840=m
CONFIG_DM9102=m
CONFIG_ULI526X=m
CONFIG_PCMCIA_XIRCOM=m
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_ADAPTEC_STARFIRE=m
CONFIG_B44=m
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_FORCEDETH=y
CONFIG_FORCEDETH_NAPI=y
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
CONFIG_FEALNX=m
CONFIG_NATSEMI=m
CONFIG_NE2K_PCI=m
CONFIG_8139CP=m
CONFIG_8139TOO=y
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
CONFIG_SIS900=m
CONFIG_EPIC100=m
CONFIG_SUNDANCE=m
# CONFIG_SUNDANCE_MMIO is not set
# CONFIG_TLAN is not set
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
CONFIG_SC92031=m
CONFIG_NET_POCKET=y
CONFIG_ATP=m
CONFIG_DE600=m
CONFIG_DE620=m
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
CONFIG_DL2K=m
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
CONFIG_NS83820=m
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
CONFIG_R8169=m
CONFIG_R8169_VLAN=y
# CONFIG_SIS190 is not set
CONFIG_SKGE=m
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
CONFIG_VIA_VELOCITY=m
CONFIG_TIGON3=y
CONFIG_BNX2=m
CONFIG_QLA3XXX=m
CONFIG_ATL1=m
# CONFIG_ATL1E is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
CONFIG_CHELSIO_T1=m
CONFIG_CHELSIO_T1_1G=y
CONFIG_CHELSIO_T3=m
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
CONFIG_IXGB=m
CONFIG_S2IO=m
CONFIG_MYRI10GE=m
CONFIG_NETXEN_NIC=m
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLGE is not set
# CONFIG_SFC is not set
CONFIG_TR=y
CONFIG_IBMOL=m
CONFIG_3C359=m
# CONFIG_TMS380TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# USB Network Adapters
#
CONFIG_USB_CATC=m
CONFIG_USB_KAWETH=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_DM9601=m
# CONFIG_USB_NET_SMSC95XX is not set
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
CONFIG_USB_NET_PLUSB=m
CONFIG_USB_NET_MCS7830=m
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
CONFIG_USB_NET_ZAURUS=m
# CONFIG_USB_HSO is not set
CONFIG_NET_PCMCIA=y
CONFIG_PCMCIA_3C589=m
CONFIG_PCMCIA_3C574=m
CONFIG_PCMCIA_FMVJ18X=m
CONFIG_PCMCIA_PCNET=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_PCMCIA_SMC91C92=m
CONFIG_PCMCIA_XIRC2PS=m
CONFIG_PCMCIA_AXNET=m
# CONFIG_PCMCIA_IBMTR is not set
# CONFIG_WAN is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
CONFIG_ATM_TCP=m
CONFIG_ATM_LANAI=m
CONFIG_ATM_ENI=m
# CONFIG_ATM_ENI_DEBUG is not set
# CONFIG_ATM_ENI_TUNE_BURST is not set
CONFIG_ATM_FIRESTREAM=m
# CONFIG_ATM_ZATM is not set
CONFIG_ATM_IDT77252=m
# CONFIG_ATM_IDT77252_DEBUG is not set
# CONFIG_ATM_IDT77252_RCV_ALL is not set
CONFIG_ATM_IDT77252_USE_SUNI=y
CONFIG_ATM_AMBASSADOR=m
# CONFIG_ATM_AMBASSADOR_DEBUG is not set
CONFIG_ATM_HORIZON=m
# CONFIG_ATM_HORIZON_DEBUG is not set
# CONFIG_ATM_IA is not set
# CONFIG_ATM_FORE200E is not set
CONFIG_ATM_HE=m
# CONFIG_ATM_HE_USE_SUNI is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
CONFIG_SKFP=m
# CONFIG_HIPPI is not set
CONFIG_PLIP=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
# CONFIG_PPP_BSDCOMP is not set
CONFIG_PPP_MPPE=m
CONFIG_PPPOE=m
CONFIG_PPPOATM=m
# CONFIG_PPPOL2TP is not set
CONFIG_SLIP=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLHC=m
CONFIG_SLIP_SMART=y
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=y
# CONFIG_NETCONSOLE_DYNAMIC is not set
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_KEYBOARD_STOWAWAY=m
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=m
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
CONFIG_MOUSE_VSXXXAA=m
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_USB=y
CONFIG_JOYSTICK_IFORCE_232=y
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
# CONFIG_JOYSTICK_ZHENHUA is not set
CONFIG_JOYSTICK_DB9=m
CONFIG_JOYSTICK_GAMECON=m
CONFIG_JOYSTICK_TURBOGRAFX=m
CONFIG_JOYSTICK_JOYDUMP=m
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_INPUT_TABLET is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_FUJITSU is not set
CONFIG_TOUCHSCREEN_GUNZE=m
CONFIG_TOUCHSCREEN_ELO=m
CONFIG_TOUCHSCREEN_MTOUCH=m
# CONFIG_TOUCHSCREEN_INEXIO is not set
CONFIG_TOUCHSCREEN_MK712=m
CONFIG_TOUCHSCREEN_PENMOUNT=m
CONFIG_TOUCHSCREEN_TOUCHRIGHT=m
CONFIG_TOUCHSCREEN_TOUCHWIN=m
# CONFIG_TOUCHSCREEN_WM97XX is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_APANEL is not set
CONFIG_INPUT_ATLAS_BTNS=m
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
CONFIG_GAMEPORT_EMU10K1=m
CONFIG_GAMEPORT_FM801=m

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_N_HDLC=m
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
# CONFIG_STALDRV is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CS=m
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_NVRAM=y
CONFIG_R3964=m
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_IPWIRELESS is not set
CONFIG_MWAVE=m
CONFIG_PC8736x_GPIO=m
CONFIG_NSC_GPIO=m
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
CONFIG_HANGCHECK_TIMER=m
CONFIG_TCG_TPM=m
CONFIG_TCG_TIS=m
CONFIG_TCG_NSC=m
CONFIG_TCG_ATMEL=m
CONFIG_TCG_INFINEON=m
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=m
# CONFIG_I2C_AMD756_S4882 is not set
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=y
CONFIG_I2C_NFORCE2=y
# CONFIG_I2C_NFORCE2_S4985 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
CONFIG_I2C_SIS96X=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=m

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_PLATFORM is not set
CONFIG_I2C_STUB=m

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_AT24 is not set
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
CONFIG_W1=m
CONFIG_W1_CON=y

#
# 1-wire Bus Masters
#
CONFIG_W1_MASTER_MATROX=m
CONFIG_W1_MASTER_DS2490=m
CONFIG_W1_MASTER_DS2482=m

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
CONFIG_W1_SLAVE_SMEM=m
CONFIG_W1_SLAVE_DS2433=m
CONFIG_W1_SLAVE_DS2433_CRC=y
# CONFIG_W1_SLAVE_DS2760 is not set
# CONFIG_W1_SLAVE_BQ27000 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=m
CONFIG_SENSORS_ABITUGURU=m
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1029=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7473 is not set
CONFIG_SENSORS_K8TEMP=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS1621=m
# CONFIG_SENSORS_I5K_AMB is not set
CONFIG_SENSORS_F71805F=m
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
# CONFIG_SENSORS_FSCHMD is not set
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
# CONFIG_SENSORS_LM93 is not set
CONFIG_SENSORS_MAX1619=m
# CONFIG_SENSORS_MAX6650 is not set
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_PC87427=m
CONFIG_SENSORS_SIS5595=m
# CONFIG_SENSORS_DME1737 is not set
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_THMC50 is not set
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83793=m
CONFIG_SENSORS_W83L785TS=m
# CONFIG_SENSORS_W83L786NG is not set
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
CONFIG_SENSORS_HDAPS=m
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_HWMON is not set
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=m
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
CONFIG_ALIM1535_WDT=m
CONFIG_ALIM7101_WDT=m
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
CONFIG_IBMASR=m
# CONFIG_WAFER_WDT is not set
CONFIG_I6300ESB_WDT=m
CONFIG_ITCO_WDT=m
CONFIG_ITCO_VENDOR_SUPPORT=y
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
CONFIG_PC87413_WDT=m
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
CONFIG_W83627HF_WDT=m
CONFIG_W83697HF_WDT=m
# CONFIG_W83697UG_WDT is not set
CONFIG_W83877F_WDT=m
CONFIG_W83977F_WDT=m
CONFIG_MACHZ_WDT=m
# CONFIG_SBC_EPX_C3_WATCHDOG is not set

#
# PCI-based Watchdog Cards
#
CONFIG_PCIPCWATCHDOG=m
CONFIG_WDTPCI=m
CONFIG_WDT_501_PCI=y

#
# USB-based Watchdog Cards
#
CONFIG_USBPCWATCHDOG=m
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=m
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
CONFIG_SSB_PCMCIAHOST_POSSIBLE=y
# CONFIG_SSB_PCMCIAHOST is not set
# CONFIG_SSB_DEBUG is not set
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
CONFIG_MFD_SM501=m
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
# CONFIG_VIDEO_DEV is not set
CONFIG_DVB_CORE=m
CONFIG_VIDEO_MEDIA=m

#
# Multimedia drivers
#
# CONFIG_MEDIA_ATTACH is not set
CONFIG_MEDIA_TUNER=m
# CONFIG_MEDIA_TUNER_CUSTOMIZE is not set
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_DVB_CAPTURE_DRIVERS=y

#
# Supported SAA7146 based PCI Adapters
#
# CONFIG_TTPCI_EEPROM is not set
# CONFIG_DVB_BUDGET_CORE is not set

#
# Supported USB Adapters
#
# CONFIG_DVB_USB is not set
CONFIG_DVB_TTUSB_BUDGET=m
CONFIG_DVB_TTUSB_DEC=m
# CONFIG_DVB_SIANO_SMS1XXX is not set

#
# Supported FlexCopII (B2C2) Adapters
#
CONFIG_DVB_B2C2_FLEXCOP=m
CONFIG_DVB_B2C2_FLEXCOP_PCI=m
CONFIG_DVB_B2C2_FLEXCOP_USB=m
# CONFIG_DVB_B2C2_FLEXCOP_DEBUG is not set

#
# Supported BT878 Adapters
#

#
# Supported Pluto2 Adapters
#
CONFIG_DVB_PLUTO2=m

#
# Supported SDMC DM1105 Adapters
#
# CONFIG_DVB_DM1105 is not set

#
# Supported DVB Frontends
#

#
# Customise DVB Frontends
#
# CONFIG_DVB_FE_CUSTOMISE is not set

#
# DVB-S (satellite) frontends
#
CONFIG_DVB_CX24110=m
CONFIG_DVB_CX24123=m
CONFIG_DVB_MT312=m
CONFIG_DVB_S5H1420=m
# CONFIG_DVB_STV0288 is not set
# CONFIG_DVB_STB6000 is not set
CONFIG_DVB_STV0299=m
CONFIG_DVB_TDA8083=m
CONFIG_DVB_TDA10086=m
CONFIG_DVB_VES1X93=m
CONFIG_DVB_TUNER_ITD1000=m
CONFIG_DVB_TDA826X=m
CONFIG_DVB_TUA6100=m
# CONFIG_DVB_CX24116 is not set
# CONFIG_DVB_SI21XX is not set

#
# DVB-T (terrestrial) frontends
#
CONFIG_DVB_SP8870=m
CONFIG_DVB_SP887X=m
CONFIG_DVB_CX22700=m
CONFIG_DVB_CX22702=m
# CONFIG_DVB_DRX397XD is not set
CONFIG_DVB_L64781=m
CONFIG_DVB_TDA1004X=m
CONFIG_DVB_NXT6000=m
CONFIG_DVB_MT352=m
CONFIG_DVB_ZL10353=m
CONFIG_DVB_DIB3000MB=m
CONFIG_DVB_DIB3000MC=m
CONFIG_DVB_DIB7000M=m
CONFIG_DVB_DIB7000P=m
# CONFIG_DVB_TDA10048 is not set

#
# DVB-C (cable) frontends
#
CONFIG_DVB_VES1820=m
CONFIG_DVB_TDA10021=m
# CONFIG_DVB_TDA10023 is not set
CONFIG_DVB_STV0297=m

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#
CONFIG_DVB_NXT200X=m
CONFIG_DVB_OR51211=m
CONFIG_DVB_OR51132=m
CONFIG_DVB_BCM3510=m
CONFIG_DVB_LGDT330X=m
# CONFIG_DVB_S5H1409 is not set
# CONFIG_DVB_AU8522 is not set
# CONFIG_DVB_S5H1411 is not set

#
# Digital terrestrial only tuners/PLL
#
CONFIG_DVB_PLL=m
CONFIG_DVB_TUNER_DIB0070=m

#
# SEC control devices for DVB-S
#
CONFIG_DVB_LNBP21=m
# CONFIG_DVB_ISL6405 is not set
CONFIG_DVB_ISL6421=m
# CONFIG_DVB_LGS8GL5 is not set

#
# Tools to develop new frontends
#
# CONFIG_DVB_DUMMY_FE is not set
# CONFIG_DVB_AF9013 is not set
# CONFIG_DAB is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_DRM=m
CONFIG_DRM_TDFX=m
CONFIG_DRM_R128=m
CONFIG_DRM_RADEON=m
CONFIG_DRM_I810=m
CONFIG_DRM_I830=m
CONFIG_DRM_I915=m
CONFIG_DRM_MGA=m
# CONFIG_DRM_SIS is not set
CONFIG_DRM_VIA=m
CONFIG_DRM_SAVAGE=m
CONFIG_VGASTATE=m
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
CONFIG_FB_SVGALIB=m
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RIVA=m
# CONFIG_FB_RIVA_I2C is not set
# CONFIG_FB_RIVA_DEBUG is not set
CONFIG_FB_RIVA_BACKLIGHT=y
# CONFIG_FB_LE80578 is not set
CONFIG_FB_INTEL=m
# CONFIG_FB_INTEL_DEBUG is not set
CONFIG_FB_INTEL_I2C=y
CONFIG_FB_MATROX=m
CONFIG_FB_MATROX_MILLENIUM=y
CONFIG_FB_MATROX_MYSTIQUE=y
CONFIG_FB_MATROX_G=y
CONFIG_FB_MATROX_I2C=m
CONFIG_FB_MATROX_MAVEN=m
CONFIG_FB_MATROX_MULTIHEAD=y
# CONFIG_FB_RADEON is not set
CONFIG_FB_ATY128=m
CONFIG_FB_ATY128_BACKLIGHT=y
CONFIG_FB_ATY=m
CONFIG_FB_ATY_CT=y
CONFIG_FB_ATY_GENERIC_LCD=y
CONFIG_FB_ATY_GX=y
CONFIG_FB_ATY_BACKLIGHT=y
CONFIG_FB_S3=m
CONFIG_FB_SAVAGE=m
CONFIG_FB_SAVAGE_I2C=y
CONFIG_FB_SAVAGE_ACCEL=y
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
CONFIG_FB_NEOMAGIC=m
CONFIG_FB_KYRO=m
CONFIG_FB_3DFX=m
CONFIG_FB_3DFX_ACCEL=y
CONFIG_FB_VOODOO1=m
# CONFIG_FB_VT8623 is not set
CONFIG_FB_TRIDENT=m
CONFIG_FB_TRIDENT_ACCEL=y
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
CONFIG_FB_SM501=m
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
# CONFIG_LCD_ILI9320 is not set
# CONFIG_LCD_PLATFORM is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_CORGI is not set
CONFIG_BACKLIGHT_PROGEAR=m
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE is not set
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=m
CONFIG_SOUND_OSS_CORE=y
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_DYNAMIC_MINORS=y
# CONFIG_SND_SUPPORT_OLD_API is not set
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_MPU401_UART=m
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_VX_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
# CONFIG_SND_MTPAV is not set
CONFIG_SND_MTS64=m
# CONFIG_SND_SERIAL_U16550 is not set
CONFIG_SND_MPU401=m
CONFIG_SND_PORTMAN2X4=m
CONFIG_SND_AC97_POWER_SAVE=y
CONFIG_SND_AC97_POWER_SAVE_DEFAULT=0
CONFIG_SND_SB_COMMON=m
CONFIG_SND_PCI=y
CONFIG_SND_AD1889=m
CONFIG_SND_ALS300=m
CONFIG_SND_ALS4000=m
CONFIG_SND_ALI5451=m
CONFIG_SND_ATIIXP=m
CONFIG_SND_ATIIXP_MODEM=m
CONFIG_SND_AU8810=m
CONFIG_SND_AU8820=m
CONFIG_SND_AU8830=m
# CONFIG_SND_AW2 is not set
CONFIG_SND_AZT3328=m
CONFIG_SND_BT87X=m
# CONFIG_SND_BT87X_OVERCLOCK is not set
CONFIG_SND_CA0106=m
CONFIG_SND_CMIPCI=m
# CONFIG_SND_OXYGEN is not set
CONFIG_SND_CS4281=m
CONFIG_SND_CS46XX=m
CONFIG_SND_CS46XX_NEW_DSP=y
# CONFIG_SND_CS5530 is not set
CONFIG_SND_DARLA20=m
CONFIG_SND_GINA20=m
CONFIG_SND_LAYLA20=m
CONFIG_SND_DARLA24=m
CONFIG_SND_GINA24=m
CONFIG_SND_LAYLA24=m
CONFIG_SND_MONA=m
CONFIG_SND_MIA=m
CONFIG_SND_ECHO3G=m
CONFIG_SND_INDIGO=m
CONFIG_SND_INDIGOIO=m
CONFIG_SND_INDIGODJ=m
CONFIG_SND_EMU10K1=m
CONFIG_SND_EMU10K1X=m
CONFIG_SND_ENS1370=m
CONFIG_SND_ENS1371=m
CONFIG_SND_ES1938=m
CONFIG_SND_ES1968=m
CONFIG_SND_FM801=m
CONFIG_SND_HDA_INTEL=m
# CONFIG_SND_HDA_HWDEP is not set
# CONFIG_SND_HDA_INPUT_BEEP is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
CONFIG_SND_HDA_CODEC_NVHDMI=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
# CONFIG_SND_HDA_POWER_SAVE is not set
CONFIG_SND_HDSP=m
CONFIG_SND_HDSPM=m
# CONFIG_SND_HIFIER is not set
CONFIG_SND_ICE1712=m
CONFIG_SND_ICE1724=m
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
CONFIG_SND_KORG1212=m
CONFIG_SND_MAESTRO3=m
CONFIG_SND_MIXART=m
CONFIG_SND_NM256=m
CONFIG_SND_PCXHR=m
CONFIG_SND_RIPTIDE=m
CONFIG_SND_RME32=m
CONFIG_SND_RME96=m
CONFIG_SND_RME9652=m
CONFIG_SND_SONICVIBES=m
CONFIG_SND_TRIDENT=m
CONFIG_SND_VIA82XX=m
CONFIG_SND_VIA82XX_MODEM=m
# CONFIG_SND_VIRTUOSO is not set
CONFIG_SND_VX222=m
CONFIG_SND_YMFPCI=m
CONFIG_SND_USB=y
CONFIG_SND_USB_AUDIO=m
CONFIG_SND_USB_USX2Y=m
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
CONFIG_SND_PCMCIA=y
# CONFIG_SND_VXPOCKET is not set
# CONFIG_SND_PDAUDIOCF is not set
CONFIG_SND_SOC=m
# CONFIG_SND_SOC_ALL_CODECS is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set
# CONFIG_HIDRAW is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
CONFIG_HID_COMPAT=y
CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_BRIGHT=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_DELL=y
CONFIG_HID_EZKEY=y
CONFIG_HID_GYRATION=y
CONFIG_HID_LOGITECH=y
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
CONFIG_HID_PANTHERLORD=y
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=y
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
CONFIG_HID_SUNPLUS=y
CONFIG_THRUSTMASTER_FF=y
CONFIG_ZEROPLUS_FF=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_ISP116X_HCD=m
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=m
CONFIG_USB_SL811_HCD=m
CONFIG_USB_SL811_CS=m
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed;
#

#
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_DPCM=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
# CONFIG_USB_STORAGE_ONETOUCH is not set
CONFIG_USB_STORAGE_KARMA=y
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m

#
# USB port drivers
#
CONFIG_USB_USS720=m
CONFIG_USB_SERIAL=m
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_AIRCABLE=m
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
# CONFIG_USB_SERIAL_CH341 is not set
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP2101=m
CONFIG_USB_SERIAL_CYPRESS_M8=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_GARMIN=m
CONFIG_USB_SERIAL_IPW=m
# CONFIG_USB_SERIAL_IUU is not set
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KEYSPAN_MPR=y
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_MOS7720=m
CONFIG_USB_SERIAL_MOS7840=m
# CONFIG_USB_SERIAL_MOTOROLA is not set
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_TI=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_SERIAL_DEBUG=m

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_ADUTUX=m
# CONFIG_USB_SEVSEG is not set
CONFIG_USB_RIO500=m
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_BERRY_CHARGE=m
CONFIG_USB_LED=m
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
CONFIG_USB_PHIDGET=m
CONFIG_USB_PHIDGETKIT=m
CONFIG_USB_PHIDGETMOTORCONTROL=m
CONFIG_USB_PHIDGETSERVO=m
CONFIG_USB_IDMOUSE=m
CONFIG_USB_FTDI_ELAN=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=m
CONFIG_USB_TRANCEVIBRATOR=m
CONFIG_USB_IOWARRIOR=m
CONFIG_USB_TEST=m
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
CONFIG_USB_ATM=m
CONFIG_USB_SPEEDTOUCH=m
CONFIG_USB_CXACRU=m
CONFIG_USB_UEAGLEATM=m
CONFIG_USB_XUSBATM=m
# CONFIG_USB_GADGET is not set
# CONFIG_UWB is not set
CONFIG_MMC=m
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_UNSAFE_RESUME is not set

#
# MMC/SD/SDIO Card Drivers
#
CONFIG_MMC_BLOCK=m
CONFIG_MMC_BLOCK_BOUNCE=y
# CONFIG_SDIO_UART is not set
# CONFIG_MMC_TEST is not set

#
# MMC/SD/SDIO Host Controller Drivers
#
CONFIG_MMC_SDHCI=m
# CONFIG_MMC_SDHCI_PCI is not set
CONFIG_MMC_WBSD=m
CONFIG_MMC_TIFM_SD=m
# CONFIG_MMC_SDRICOH_CS is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_HP_DISK is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set
# CONFIG_ACCESSIBILITY is not set
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_IPATH=m
# CONFIG_INFINIBAND_AMSO1100 is not set
CONFIG_INFINIBAND_CXGB3=m
# CONFIG_INFINIBAND_CXGB3_DEBUG is not set
# CONFIG_MLX4_INFINIBAND is not set
# CONFIG_INFINIBAND_NES is not set
CONFIG_INFINIBAND_IPOIB=m
CONFIG_INFINIBAND_IPOIB_CM=y
CONFIG_INFINIBAND_IPOIB_DEBUG=y
CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
CONFIG_INFINIBAND_SRP=m
CONFIG_INFINIBAND_ISER=m
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=m
CONFIG_EDAC_E752X=m
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
# CONFIG_RTC_DRV_DS1374 is not set
CONFIG_RTC_DRV_DS1672=m
# CONFIG_RTC_DRV_MAX6900 is not set
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=m
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
CONFIG_RTC_DRV_V3020=m

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set
# CONFIG_STAGING is not set
CONFIG_STAGING_EXCLUDE_BUILD=y

#
# Firmware Drivers
#
CONFIG_EDD=m
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DELL_RBU=m
CONFIG_DCDBAS=m
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_FS_XIP=y
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=m
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=m
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
# CONFIG_JFS_DEBUG is not set
# CONFIG_JFS_STATISTICS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
# CONFIG_XFS_RT is not set
# CONFIG_XFS_DEBUG is not set
CONFIG_GFS2_FS=m
CONFIG_GFS2_FS_LOCKING_DLM=m
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_FS_O2CB=m
CONFIG_OCFS2_FS_USERSPACE_CLUSTER=m
CONFIG_OCFS2_FS_STATS=y
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
# CONFIG_OCFS2_DEBUG_FS is not set
# CONFIG_OCFS2_COMPAT_JBD is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_GENERIC_ACL=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
CONFIG_AFFS_FS=m
# CONFIG_ECRYPT_FS is not set
CONFIG_HFS_FS=m
CONFIG_HFSPLUS_FS=m
CONFIG_BEFS_FS=m
# CONFIG_BEFS_DEBUG is not set
CONFIG_BFS_FS=m
CONFIG_EFS_FS=m
CONFIG_CRAMFS=m
CONFIG_VXFS_FS=m
CONFIG_MINIX_FS=m
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
CONFIG_QNX4FS_FS=m
CONFIG_ROMFS_FS=m
CONFIG_SYSV_FS=m
CONFIG_UFS_FS=m
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_SUNRPC_XPRT_RDMA=m
# CONFIG_SUNRPC_REGISTER_V4 is not set
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_RPCSEC_GSS_SPKM3=m
# CONFIG_SMB_FS is not set
CONFIG_CIFS=m
# CONFIG_CIFS_STATS is not set
CONFIG_CIFS_WEAK_PW_HASH=y
# CONFIG_CIFS_UPCALL is not set
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
CONFIG_NCP_FS=m
CONFIG_NCPFS_PACKET_SIGNING=y
CONFIG_NCPFS_IOCTL_LOCKING=y
CONFIG_NCPFS_STRONG=y
CONFIG_NCPFS_NFS_NS=y
CONFIG_NCPFS_OS2_NS=y
CONFIG_NCPFS_SMALLDOS=y
CONFIG_NCPFS_NLS=y
CONFIG_NCPFS_EXTRAS=y
CONFIG_CODA_FS=m
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_UTF8=m
CONFIG_DLM=m
CONFIG_DLM_DEBUG=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_ENABLE_WARN_DEPRECATED=y
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_SCHED_DEBUG is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
CONFIG_RT_MUTEX_TESTER=y
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y

#
# Tracers
#
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SYSPROF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_CONTEXT_SWITCH_TRACER is not set
# CONFIG_BOOT_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_PRINTK_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
CONFIG_DIRECT_GBPAGES=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_MMIOTRACE is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
# CONFIG_SECURITY_ROOTPLUG is not set
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT is not set
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_SMACK is not set
CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_NULL=m
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_LRW=m
CONFIG_CRYPTO_PCBC=m
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=m

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_CAMELLIA=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_KHAZAD=m
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
CONFIG_KVM_AMD=m
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y


* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 11:01         ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 11:01 UTC (permalink / raw)
  To: David Miller
  Cc: rjw-KKrjLPT3xs0, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 13847 bytes --]


* David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:

> From: Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>
> Date: Mon, 17 Nov 2008 10:06:48 +0100
> 
> > 
> > * Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
> > 
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.26 and 2.6.27.
> > > 
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> > > be listed and let me know (either way).
> > > 
> > > 
> > > Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > > Subject		: tbench regression on each kernel release from  2.6.22 -> 2.6.28
> > > Submitter	: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> > > Date		: 2008-08-11 18:36 (98 days old)
> > > References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > > 		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4
> > 
> > Christoph, as per the recent analysis of Mike:
> > 
> >  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> > 
> > all scheduler components of this regression have been eliminated.
> > 
> > In fact his numbers show that scheduler speedups since 2.6.22 have 
> > offset and hidden most other sources of tbench regression. (i.e. the 
> > scheduler portion got 5% faster, hence it was able to offset a 
> > slowdown of 5% in other areas of the kernel that tbench triggers)
> 
> Although I respect the improvements, wake_up() is still several 
> orders of magnitude slower than it was in 2.6.22 and wake_up() is at 
> the top of the profiles in tbench runs.

hm, several orders of magnitude slower? That contradicts Mike's 
numbers and my own numbers and profiles as well: see below.

The scheduler's overhead barely even registers on a 16-way x86 system 
i'm running tbench on. Here's the NMI profile during a 64-thread tbench 
run on that 16-way box with a v2.6.28-rc5 kernel [config attached]:

  Throughput 3437.65 MB/sec 64 procs
  ==================================
  21570252  total 
  ........
   1494803  copy_user_generic_string 
    998232  sock_rfree 
    491471  tcp_ack 
    482405  ip_dont_fragment 
    470685  ip_local_deliver 
    436325  constant_test_bit         [ called by napi_disable_pending() ]
    375469  avc_has_perm_noaudit 
    347663  tcp_sendmsg 
    310383  tcp_recvmsg 
    300412  __inet_lookup_established 
    294377  system_call 
    286603  tcp_transmit_skb 
    251782  selinux_ip_postroute 
    236028  tcp_current_mss 
    235631  schedule 
    234013  netif_rx 
    229854  _local_bh_enable_ip 
    219501  tcp_v4_rcv 

    [ etc. - see full profile attached further below ]

Note that the scheduler does not even show up in the profile until 
entry #15!

I've also summarized NMI profiler output by major subsystems:

           NET       overhead (12603450/21570252): 58.43%
           security  overhead ( 1903598/21570252):  8.83%
           usercopy  overhead ( 1753617/21570252):  8.13%
           sched     overhead ( 1599406/21570252):  7.41%
           syscall   overhead (  560487/21570252):  2.60%
           IRQ       overhead (  555439/21570252):  2.58%
           slab      overhead (  492421/21570252):  2.28%
           timer     overhead (  226573/21570252):  1.05%
           pagealloc overhead (  192681/21570252):  0.89%
           PID       overhead (  115123/21570252):  0.53%
           VFS       overhead (  107926/21570252):  0.50%
           pagecache overhead (   62552/21570252):  0.29%
           gtod      overhead (   38651/21570252):  0.18%
           IDLE      overhead (       0/21570252):  0.00%
---------------------------------------------------------
                         left ( 1349494/21570252):  6.26%
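
A minimal sketch of how the percentages in this summary follow from the raw sample counts. The counts are copied from the table above; the dictionary layout and the `percent` helper are illustrative assumptions, not the script that actually produced the summary.

```python
# Sample counts copied from the per-subsystem summary above.
TOTAL_SAMPLES = 21570252

SUBSYSTEM_SAMPLES = {
    "NET": 12603450,
    "security": 1903598,
    "usercopy": 1753617,
    "sched": 1599406,
    "syscall": 560487,
    "IRQ": 555439,
    "slab": 492421,
    "timer": 226573,
    "pagealloc": 192681,
    "PID": 115123,
    "VFS": 107926,
    "pagecache": 62552,
    "gtod": 38651,
    "IDLE": 0,
}


def percent(samples: int, total: int = TOTAL_SAMPLES) -> float:
    """Share of all NMI profiler samples, in percent."""
    return 100.0 * samples / total


# Print one line per subsystem, largest share first.
for name, samples in sorted(SUBSYSTEM_SAMPLES.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10} overhead ({samples:>8}/{TOTAL_SAMPLES}): {percent(samples):5.2f}%")
```

How individual symbols were assigned to subsystems is not stated in the mail, so the sketch starts from the already-aggregated counts.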

The scheduler's functions are absolutely flat, and consistent with an 
extreme context-switching rate of 1.35 million per second. The 
scheduler can go up to about 20 million context switches per second on 
this system:

 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 32  0      0 32229696  29308 649880    0    0     0     0 164135 20026853 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164203 20032770 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164201 20036492 25 75  0  0  0

... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
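
The 1.35/20.0 estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes scheduling cost scales linearly with the context-switch rate, and the variable names are illustrative.

```python
# Context-switch rates ("cs" column) from the three vmstat samples above,
# which the text cites as roughly the maximum the scheduler reaches here.
max_cs_samples = [20026853, 20032770, 20036492]
max_cs_rate = sum(max_cs_samples) / len(max_cs_samples)

# Context-switch rate quoted for the tbench run.
tbench_cs_rate = 1.35e6

# Under the linear-scaling assumption, this is the fraction of CPU time
# that scheduling would be expected to take during tbench.
estimated_share = tbench_cs_rate / max_cs_rate

# Measured scheduler share from the subsystem summary (sched / total samples).
measured_share = 1599406 / 21570252

print(f"estimated: {estimated_share:.2%}, measured: {measured_share:.2%}")
```

The two figures land within about one percentage point of each other, which is the sense in which they are "roughly consistent".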

Wakeup affinities and data flow caching are just fine in this workload 
- we've got scheduler statistics for that and they look good too.

It all looks like pure old-fashioned straight overhead in the 
networking layer to me. Do we still touch the same global cacheline 
for every localhost packet we process? Anything like that would show 
up big time.

Anyway, in terms of scheduling there's absolutely nothing anomalous i 
can see about this workload. Scheduling looks healthy throughout - and 
the few things we noticed causing unnecessary overhead are now fixed 
in -rc5. (but it's all in the <5% range of impact of total scheduling 
overhead - i.e. in the 0.4% absolute range in this workload)

And the thing is, the scheduler's task in this workload is by far the 
most difficult one conceptually: it has to manage and optimize 
concurrency of _future_ processing, with an event frequency that is 
_WAY_ out of the normal patterns: more than 1.3 million context 
switches per second (!). It also switches to/from completely 
independent contexts of computing, with all the implications that 
this brings.

Networking and VFS "just" have to shuffle around bits in memory along a 
very specific plan given to it by user-space. That plan is 
well-specified and goes along the lines of: "copy this (already 
cached) file content to that socket" and back.

By the raw throughput figures the system is pushing a couple of 
million data packets per second.

Still we spend 7 times more CPU time in the networking code than in 
the scheduler or in the user-copy code. Why?
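
The "7 times" figure can be read off the subsystem summary. A small sketch using the sample counts quoted there (the variable names are illustrative):

```python
# Sample counts from the per-subsystem summary earlier in this mail.
net_samples = 12603450
sched_samples = 1599406
usercopy_samples = 1753617

# How many times more samples landed in networking than in each of the others.
net_vs_sched = net_samples / sched_samples        # about 7.9
net_vs_usercopy = net_samples / usercopy_samples  # about 7.2

print(f"NET/sched: {net_vs_sched:.1f}x, NET/usercopy: {net_vs_usercopy:.1f}x")
```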

	Ingo

------------------------->
  21570252 total 
  ........
  1494803 copy_user_generic_string 
  998232 sock_rfree 
  491471 tcp_ack 
  482405 ip_dont_fragment 
  470685 ip_local_deliver 
  436325 constant_test_bit 
  375469 avc_has_perm_noaudit 
  347663 tcp_sendmsg 
  310383 tcp_recvmsg 
  300412 __inet_lookup_established 
  294377 system_call 
  286603 tcp_transmit_skb 
  251782 selinux_ip_postroute 
  236028 tcp_current_mss 
  235631 schedule 
  234013 netif_rx 
  229854 _local_bh_enable_ip 
  219501 tcp_v4_rcv 
  210046 netlbl_enabled 
  205022 constant_test_bit 
  199598 skb_release_head_state 
  187952 ip_queue_xmit 
  178779 tcp_established_options 
  175955 dev_queue_xmit 
  169904 netif_receive_skb 
  166629 ip_finish_output2 
  162291 sysret_check 
  151262 __switch_to 
  143355 audit_syscall_entry 
  142694 load_cr3 
  136571 memset_c 
  136115 nf_hook_slow 
  130825 ip_local_deliver_finish 
  128795 ip_rcv 
  125995 selinux_socket_sock_rcv_skb 
  123944 net_rx_action 
  123100 __copy_skb_header 
  122052 __inet_lookup 
  121744 constant_test_bit 
  119444 get_page_from_freelist 
  116486 avc_has_perm 
  115643 audit_syscall_exit 
  115123 find_pid_ns 
  114483 tcp_cleanup_rbuf 
  111350 tcp_rcv_established 
  109853 __mod_timer 
  107891 lock_sock_nested 
  107316 napi_disable_pending 
  106581 release_sock 
  104402 skb_copy_datagram_iovec 
  101591 __tcp_push_pending_frames 
  101206 tcp_event_data_recv 
   98046 kmem_cache_alloc_node
   97982 tcp_v4_do_rcv
   92714 sys_recvfrom
   91551 rb_erase
   89730 kfree
   87979 ip_rcv_finish
   87166 compare_ether_addr
   86982 selinux_parse_skb
   86731 nf_iterate
   79690 selinux_ipv4_output
   79347 __cache_free
   78992 audit_free_names
   78127 skb_release_data
   77501 mod_timer
   77241 __sock_recvmsg
   77228 sock_recvmsg
   77211 ____cache_alloc
   76495 tcp_rcv_space_adjust
   75283 sk_wait_data
   71772 sys_sendto
   71594 sched_clock
   70880 eth_type_trans
   70238 memcpy_toiovec
   69193 do_softirq
   68341 __update_sched_clock
   67597 tcp_v4_md5_lookup
   67424 try_to_wake_up
   64465 sock_common_recvmsg
   64116 put_prev_task_fair
   63964 process_backlog
   62216 __do_softirq
   62093 tcp_cwnd_validate
   61128 __alloc_skb
   60588 put_page
   59536 dput
   58411 __ip_local_out
   56349 avc_audit
   55626 __napi_schedule
   55525 selinux_ipv4_postroute
   54499 __enqueue_entity
   53599 local_bh_disable
   53418 unroll_tree_refs
   53162 __unlazy_fpu
   53084 cfs_rq_of
   52475 set_next_entity
   51108 thread_return
   50458 ip_output
   50268 sched_clock_cpu
   49974 tcp_send_delayed_ack
   49736 ip_finish_output
   49670 finish_task_switch
   49070 ___swab16
   48499 audit_get_context
   48347 raw_local_deliver
   47824 tcp_rtt_estimator
   46707 tcp_push
   46405 constant_test_bit
   45859 select_task_rq_fair
   45188 math_state_restore
   44889 check_preempt_wakeup
   44449 task_rq_lock
   43704 sel_netif_sid
   43377 sock_sendmsg
   42612 sk_reset_timer
   42606 __skb_clone
   42223 __find_general_cachep
   41950 selinux_socket_sendmsg
   41716 constant_test_bit
   41097 skb_push
   40723 lock_sock
   40715 system_call_after_swapgs
   40399 selinux_netlbl_inode_permission
   40179 rb_insert_color
   40021 __kfree_skb
   40015 sockfd_lookup_light
   39216 internal_add_timer
   39024 skb_can_coalesce
   38838 __tcp_select_window
   38651 current_kernel_time
   38533 tcp_v4_md5_do_lookup
   38372 __sock_sendmsg
   38162 selinux_socket_recvmsg
   37812 sel_netport_sid
   37727 account_group_exec_runtime
   37695 switch_mm
   36247 nf_hook_thresh
   36057 auditsys
   35266 pick_next_task_fair
   35064 __tcp_ack_snd_check
   35052 sock_def_readable
   34826 sysret_careful
   34578 _local_bh_enable
   34498 free_hot_cold_page
   34338 kmap
   34028 loopback_xmit
   33320 sk_stream_alloc_skb
   33269 test_ti_thread_flag
   33219 skb_fill_page_desc
   33049 tcp_is_cwnd_limited
   33012 update_min_vruntime
   32431 native_read_tsc
   32398 dst_release
   31661 get_pageblock_flags_group
   31652 path_put
   31516 tcp_push_pending_frames
   31265 netif_needs_gso
   31175 constant_test_bit
   31077 __cycles_2_ns
   30971 socket_has_perm
   30893 __phys_addr
   30867 lock_timer_base
   30585 __wake_up
   30456 ret_from_sys_call
   30147 skb_release_all
   29356 local_bh_enable
   29334 __skb_insert
   28681 tcp_cwnd_test
   28652 __skb_dequeue
   28612 prepare_to_wait
   28268 kmem_cache_free
   28193 set_bit
   28149 dequeue_task_fair
   27906 skb_header_pointer
   27861 sys_kill
   27803 selinux_task_kill
   27627 audit_free_aux
   27600 selinux_netlbl_sock_rcv_skb
   26794 update_curr
   26777 __alloc_pages_internal
   26469 skb_entail
   26458 pskb_may_pull
   26216 inet_ehashfn
   26075 call_softirq
   26033 copy_from_user
   25933 __local_bh_disable
   25666 fget_light
   25270 inet_csk_reset_xmit_timer
   25071 signal_pending_state
   24117 tcp_init_tso_segs
   24109 TCP_ECN_check_ce
   23702 nf_hook_thresh
   23558 copy_to_user
   23426 sysret_audit
   23267 sk_wake_async
   22627 tcp_options_write
   22174 netif_tx_queue_stopped
   21795 tcp_prequeue_process
   21757 tcp_set_skb_tso_segs
   21579 avc_hash
   21565 ___swab16
   21560 ip_local_out
   21445 sk_wmem_schedule
   21234 get_page
   21200 __wake_up_common
   21042 sel_netnode_find
   20772 sock_put
   20625 schedule_timeout
   20613 __napi_complete
   20563 fput_light
   20532 tcp_bound_to_half_wnd
   19912 cap_task_kill
   19773 sysret_signal
   19374 compound_head
   19121 get_seconds
   19048 PageLRU
   18893 zone_watermark_ok
   18635 tcp_snd_wnd_test
   18634 enqueue_task_fair
   18603 rb_next
   18598 next_zones_zonelist
   18534 resched_task
   17820 hash_64
   17801 autoremove_wake_function
   17451 __skb_queue_before
   17283 native_load_tls
   17227 __skb_dequeue
   17149 xfrm4_policy_check
   16942 zone_statistics
   16886 skb_reset_network_header
   16824 ___swab16
   16725 pskb_may_pull
   16645 dev_hard_start_xmit
   16580 sk_filter
   16523 tcp_ca_event
   16479 tcp_win_from_space
   16408 tcp_parse_aligned_timestamp
   16204 finish_wait
   16124 virt_to_slab
   15965 tcp_v4_send_check
   15920 skb_reset_transport_header
   15867 tcp_data_snd_check
   15819 security_sock_rcv_skb
   15665 tcp_ack_saw_tstamp
   15621 skb_network_offset
   15568 virt_to_head_page
   15553 dst_confirm
   15320 skb_pull
   15277 clear_bit
   15179 alloc_pages_current
   14991 bictcp_acked
   14743 tcp_store_ts_recent
   14660 sel_netnode_sid
   14650 __xchg
   14573 task_has_perm
   14561 tcp_v4_check
   14492 net_invalid_timestamp
   14485 security_socket_recvmsg
   14363 __dequeue_entity
   14318 pid_nr_ns
   14311 device_not_available
   14212 local_bh_enable_ip
   14092 virt_to_cache
   13804 netpoll_rx
   13781 fcheck_files
   13724 tcp_adjust_fackets_out
   13717 net_timestamp
   13638 ___swab16
   13576 sel_netport_find
   13563 __kmalloc_node
   13530 __inc_zone_state
   13215 pid_vnr
   13208 free_pages_check
   13008 security_socket_sendmsg
   12971 ip_skb_dst_mtu
   12827 __cpu_set
   12782 bictcp_cong_avoid
   12779 test_tsk_thread_flag
   12734 wakeup_preempt_entity
   12651 sel_netif_find
   12545 skb_set_owner_r
   12534 skb_headroom
   12348 tcp_event_new_data_sent
   12251 place_entity
   12047 set_bit
   11805 update_rq_clock
   11788 detach_timer
   11659 policy_zonelist
   11423 skb_clone
   11380 __skb_queue_tail
   11249 dequeue_task
   10823 init_rootdomain
   10690 __cpu_clear
   10558 default_wake_function
   10556 tcp_rcv_rtt_measure_ts
   10451 PageSlab
   10427 sock_wfree
   10277 calc_delta_fair
   10237 tcp_validate_incoming
   10218 task_rq_unlock
   10023 page_get_cache

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 72924 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.28-rc5
# Mon Nov 17 11:59:36 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=20
# CONFIG_CGROUPS is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
# CONFIG_GROUP_SCHED is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
# CONFIG_MARKERS is not set
CONFIG_OPROFILE=m
CONFIG_OPROFILE_IBS=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_BLK_DEV_BSG is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_CLASSIC_RCU=y
CONFIG_FREEZER=y

#
# Processor type and features
#
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR_64=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=255
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
# CONFIG_MICROCODE_AMD is not set
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_MMU_NOTIFIER=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
# CONFIG_X86_PAT is not set
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_SCHED_HRTICK is not set
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_WMI is not set
CONFIG_ACPI_ASUS=m
CONFIG_ACPI_TOSHIBA=m
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=m

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set
# CONFIG_CPU_IDLE is not set

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
CONFIG_PD6729=m
CONFIG_I82092=m
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
# CONFIG_HOTPLUG_PCI_CPCI is not set
CONFIG_HOTPLUG_PCI_SHPC=m

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
CONFIG_XFRM_MIGRATE=y
# CONFIG_XFRM_STATISTICS is not set
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=m
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
CONFIG_TCP_CONG_CUBIC=m
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_BIC=y
# CONFIG_DEFAULT_CUBIC is not set
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="bic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CT_ACCT=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CT_PROTO_DCCP=m
CONFIG_NF_CT_PROTO_GRE=m
CONFIG_NF_CT_PROTO_SCTP=m
# CONFIG_NF_CT_PROTO_UDPLITE is not set
CONFIG_NF_CONNTRACK_AMANDA=m
CONFIG_NF_CONNTRACK_FTP=m
CONFIG_NF_CONNTRACK_H323=m
CONFIG_NF_CONNTRACK_IRC=m
CONFIG_NF_CONNTRACK_NETBIOS_NS=m
CONFIG_NF_CONNTRACK_PPTP=m
CONFIG_NF_CONNTRACK_SANE=m
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
# CONFIG_NF_CT_NETLINK is not set
# CONFIG_NETFILTER_TPROXY is not set
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_TRACE is not set
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
# CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP is not set
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
CONFIG_NETFILTER_XT_MATCH_REALM=m
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
CONFIG_IP_VS=m
# CONFIG_IP_VS_IPV6 is not set
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y

#
# IPVS scheduler
#
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m

#
# IPVS application helper
#
CONFIG_IP_VS_FTP=m

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_PROTO_DCCP=m
CONFIG_NF_NAT_PROTO_GRE=m
CONFIG_NF_NAT_PROTO_SCTP=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_TFTP=m
CONFIG_NF_NAT_AMANDA=m
CONFIG_NF_NAT_PPTP=m
CONFIG_NF_NAT_H323=m
CONFIG_NF_NAT_SIP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
# CONFIG_IP_NF_SECURITY is not set
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_RAW=m
# CONFIG_IP6_NF_SECURITY is not set

#
# DECnet: Netfilter Configuration
#
# CONFIG_DECNET_NF_GRABULATOR is not set
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
# CONFIG_BRIDGE_EBT_IP6 is not set
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_ULOG=m
# CONFIG_BRIDGE_EBT_NFLOG is not set
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m
CONFIG_IP_DCCP_ACKVEC=y

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
CONFIG_IP_DCCP_CCID2=m
# CONFIG_IP_DCCP_CCID2_DEBUG is not set
CONFIG_IP_DCCP_CCID3=m
# CONFIG_IP_DCCP_CCID3_DEBUG is not set
CONFIG_IP_DCCP_CCID3_RTO=100
CONFIG_IP_DCCP_TFRC_LIB=m

#
# DCCP Kernel Hacking
#
# CONFIG_IP_DCCP_DEBUG is not set
# CONFIG_NET_DCCPPROBE is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
CONFIG_TIPC=m
# CONFIG_TIPC_ADVANCED is not set
# CONFIG_TIPC_DEBUG is not set
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
# CONFIG_ATM_CLIP_NO_ICMP is not set
CONFIG_ATM_LANE=m
# CONFIG_ATM_MPOA is not set
CONFIG_ATM_BR2684=m
# CONFIG_ATM_BR2684_IPFILTER is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=m
# CONFIG_VLAN_8021Q_GVRP is not set
CONFIG_DECNET=m
CONFIG_DECNET_ROUTER=y
CONFIG_LLC=y
# CONFIG_LLC2 is not set
CONFIG_IPX=m
# CONFIG_IPX_INTERN is not set
CONFIG_ATALK=m
CONFIG_DEV_APPLETALK=m
CONFIG_IPDDP=m
CONFIG_IPDDP_ENCAP=y
CONFIG_IPDDP_DECAP=y
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
CONFIG_WAN_ROUTER=m
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
# CONFIG_NET_SCH_MULTIQ is not set
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_INGRESS=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
# CONFIG_NET_ACT_NAT is not set
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
# CONFIG_NET_ACT_SKBEDIT is not set
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
# CONFIG_NET_TCPPROBE is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
CONFIG_IRDA=m

#
# IrDA protocols
#
CONFIG_IRLAN=m
CONFIG_IRNET=m
CONFIG_IRCOMM=m
# CONFIG_IRDA_ULTRA is not set

#
# IrDA options
#
CONFIG_IRDA_CACHE_LAST_LSAP=y
CONFIG_IRDA_FAST_RR=y
# CONFIG_IRDA_DEBUG is not set

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
CONFIG_IRTTY_SIR=m

#
# Dongle support
#
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=m
CONFIG_ACTISYS_DONGLE=m
CONFIG_TEKRAM_DONGLE=m
CONFIG_TOIM3232_DONGLE=m
CONFIG_LITELINK_DONGLE=m
CONFIG_MA600_DONGLE=m
CONFIG_GIRBIL_DONGLE=m
CONFIG_MCP2120_DONGLE=m
CONFIG_OLD_BELKIN_DONGLE=m
CONFIG_ACT200L_DONGLE=m
# CONFIG_KINGSUN_DONGLE is not set
# CONFIG_KSDAZZLE_DONGLE is not set
# CONFIG_KS959_DONGLE is not set

#
# FIR device drivers
#
CONFIG_USB_IRDA=m
CONFIG_SIGMATEL_FIR=m
CONFIG_NSC_FIR=m
CONFIG_WINBOND_FIR=m
CONFIG_SMC_IRCC_FIR=m
CONFIG_ALI_FIR=m
CONFIG_VLSI_FIR=m
CONFIG_VIA_FIR=m
CONFIG_MCS_FIR=m
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIBTSDIO is not set
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
# CONFIG_BT_HCIUART_LL is not set
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIDTL1=m
CONFIG_BT_HCIBT3C=m
CONFIG_BT_HCIBLUECARD=m
CONFIG_BT_HCIBTUART=m
CONFIG_BT_HCIVHCI=m
# CONFIG_AF_RXRPC is not set
# CONFIG_PHONET is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
CONFIG_WIRELESS_OLD_REGULATORY=y
CONFIG_WIRELESS_EXT=y
CONFIG_WIRELESS_EXT_SYSFS=y
# CONFIG_MAC80211 is not set
CONFIG_IEEE80211=m
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_CRYPT_TKIP=m
CONFIG_RFKILL=m
# CONFIG_RFKILL_INPUT is not set
CONFIG_RFKILL_LEDS=y
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
# CONFIG_PARIDE is not set
CONFIG_BLK_CPQ_DA=y
CONFIG_BLK_CPQ_CISS_DA=m
CONFIG_CISS_SCSI_TAPE=y
CONFIG_BLK_DEV_DAC960=m
CONFIG_BLK_DEV_UMEM=m
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_SX8=m
CONFIG_BLK_DEV_UB=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
CONFIG_ATA_OVER_ETH=m
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_EEPROM_93CX6 is not set
CONFIG_SGI_IOC4=m
CONFIG_TIFM_CORE=m
CONFIG_TIFM_7XX1=m
# CONFIG_ACER_WMI is not set
CONFIG_ASUS_LAPTOP=m
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_ICS932S401 is not set
CONFIG_MSI_LAPTOP=m
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
CONFIG_SONY_LAPTOP=m
# CONFIG_SONYPI_COMPAT is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_SGI_XP is not set
# CONFIG_HP_ILO is not set
# CONFIG_SGI_GRU is not set
# CONFIG_C2PORT is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=m
# CONFIG_SCSI_FC_TGT_ATTRS is not set
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
# CONFIG_SCSI_SAS_LIBSAS is not set
CONFIG_SCSI_SRP_ATTRS=m
# CONFIG_SCSI_SRP_TGT_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_ARCMSR=m
# CONFIG_SCSI_ARCMSR_AER is not set
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
# CONFIG_SCSI_MVSAS is not set
CONFIG_SCSI_STEX=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_QLA_ISCSI=m
CONFIG_SCSI_LPFC=m
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
# CONFIG_SCSI_DEBUG is not set
CONFIG_SCSI_SRP=m
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
# CONFIG_SCSI_DH is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y
CONFIG_SATA_SVW=m
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=m
CONFIG_SATA_NV=y
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SX4=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIS=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m
CONFIG_SATA_INIC162X=m
# CONFIG_PATA_ACPI is not set
CONFIG_PATA_ALI=m
CONFIG_PATA_AMD=y
CONFIG_PATA_ARTOP=m
CONFIG_PATA_ATIIXP=m
# CONFIG_PATA_CMD640_PCI is not set
CONFIG_PATA_CMD64X=m
CONFIG_PATA_CS5520=m
CONFIG_PATA_CS5530=m
CONFIG_PATA_CYPRESS=m
CONFIG_PATA_EFAR=m
CONFIG_ATA_GENERIC=m
CONFIG_PATA_HPT366=m
CONFIG_PATA_HPT37X=m
CONFIG_PATA_HPT3X2N=m
CONFIG_PATA_HPT3X3=m
# CONFIG_PATA_HPT3X3_DMA is not set
CONFIG_PATA_IT821X=m
CONFIG_PATA_IT8213=m
CONFIG_PATA_JMICRON=m
CONFIG_PATA_TRIFLEX=m
CONFIG_PATA_MARVELL=m
CONFIG_PATA_MPIIX=m
CONFIG_PATA_OLDPIIX=y
CONFIG_PATA_NETCELL=m
# CONFIG_PATA_NINJA32 is not set
CONFIG_PATA_NS87410=m
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OPTI=m
CONFIG_PATA_OPTIDMA=m
CONFIG_PATA_PCMCIA=m
CONFIG_PATA_PDC_OLD=m
CONFIG_PATA_RADISYS=m
CONFIG_PATA_RZ1000=m
CONFIG_PATA_SC1200=m
CONFIG_PATA_SERVERWORKS=m
CONFIG_PATA_PDC2027X=m
CONFIG_PATA_SIL680=m
CONFIG_PATA_SIS=m
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=m
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
# CONFIG_FUSION_LOGGING is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
CONFIG_IEEE1394=m
CONFIG_IEEE1394_OHCI1394=m
CONFIG_IEEE1394_PCILYNX=m
CONFIG_IEEE1394_SBP2=m
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394_ROM_ENTRY=y
CONFIG_IEEE1394_ETH1394=m
CONFIG_IEEE1394_RAWIO=m
CONFIG_IEEE1394_VIDEO1394=m
CONFIG_IEEE1394_DV1394=m
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
CONFIG_I2O=m
# CONFIG_I2O_LCT_NOTIFY_ON_CHANGES is not set
CONFIG_I2O_EXT_ADAPTEC=y
CONFIG_I2O_EXT_ADAPTEC_DMA64=y
# CONFIG_I2O_CONFIG is not set
CONFIG_I2O_BUS=m
CONFIG_I2O_BLOCK=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_IFB=m
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
CONFIG_EQUALIZER=m
CONFIG_TUN=m
# CONFIG_VETH is not set
CONFIG_NET_SB1000=m
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
CONFIG_HAPPYMEAL=m
CONFIG_SUNGEM=m
CONFIG_CASSINI=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=y
CONFIG_TYPHOON=m
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_TULIP=m
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_TULIP_NAPI is not set
CONFIG_DE4X5=m
CONFIG_WINBOND_840=m
CONFIG_DM9102=m
CONFIG_ULI526X=m
CONFIG_PCMCIA_XIRCOM=m
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_ADAPTEC_STARFIRE=m
CONFIG_B44=m
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_FORCEDETH=y
CONFIG_FORCEDETH_NAPI=y
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
CONFIG_FEALNX=m
CONFIG_NATSEMI=m
CONFIG_NE2K_PCI=m
CONFIG_8139CP=m
CONFIG_8139TOO=y
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
CONFIG_SIS900=m
CONFIG_EPIC100=m
CONFIG_SUNDANCE=m
# CONFIG_SUNDANCE_MMIO is not set
# CONFIG_TLAN is not set
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
CONFIG_SC92031=m
CONFIG_NET_POCKET=y
CONFIG_ATP=m
CONFIG_DE600=m
CONFIG_DE620=m
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
CONFIG_DL2K=m
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
CONFIG_NS83820=m
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
CONFIG_R8169=m
CONFIG_R8169_VLAN=y
# CONFIG_SIS190 is not set
CONFIG_SKGE=m
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
CONFIG_VIA_VELOCITY=m
CONFIG_TIGON3=y
CONFIG_BNX2=m
CONFIG_QLA3XXX=m
CONFIG_ATL1=m
# CONFIG_ATL1E is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
CONFIG_CHELSIO_T1=m
CONFIG_CHELSIO_T1_1G=y
CONFIG_CHELSIO_T3=m
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
CONFIG_IXGB=m
CONFIG_S2IO=m
CONFIG_MYRI10GE=m
CONFIG_NETXEN_NIC=m
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLGE is not set
# CONFIG_SFC is not set
CONFIG_TR=y
CONFIG_IBMOL=m
CONFIG_3C359=m
# CONFIG_TMS380TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# USB Network Adapters
#
CONFIG_USB_CATC=m
CONFIG_USB_KAWETH=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_DM9601=m
# CONFIG_USB_NET_SMSC95XX is not set
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
CONFIG_USB_NET_PLUSB=m
CONFIG_USB_NET_MCS7830=m
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
CONFIG_USB_NET_ZAURUS=m
# CONFIG_USB_HSO is not set
CONFIG_NET_PCMCIA=y
CONFIG_PCMCIA_3C589=m
CONFIG_PCMCIA_3C574=m
CONFIG_PCMCIA_FMVJ18X=m
CONFIG_PCMCIA_PCNET=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_PCMCIA_SMC91C92=m
CONFIG_PCMCIA_XIRC2PS=m
CONFIG_PCMCIA_AXNET=m
# CONFIG_PCMCIA_IBMTR is not set
# CONFIG_WAN is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
CONFIG_ATM_TCP=m
CONFIG_ATM_LANAI=m
CONFIG_ATM_ENI=m
# CONFIG_ATM_ENI_DEBUG is not set
# CONFIG_ATM_ENI_TUNE_BURST is not set
CONFIG_ATM_FIRESTREAM=m
# CONFIG_ATM_ZATM is not set
CONFIG_ATM_IDT77252=m
# CONFIG_ATM_IDT77252_DEBUG is not set
# CONFIG_ATM_IDT77252_RCV_ALL is not set
CONFIG_ATM_IDT77252_USE_SUNI=y
CONFIG_ATM_AMBASSADOR=m
# CONFIG_ATM_AMBASSADOR_DEBUG is not set
CONFIG_ATM_HORIZON=m
# CONFIG_ATM_HORIZON_DEBUG is not set
# CONFIG_ATM_IA is not set
# CONFIG_ATM_FORE200E is not set
CONFIG_ATM_HE=m
# CONFIG_ATM_HE_USE_SUNI is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
CONFIG_SKFP=m
# CONFIG_HIPPI is not set
CONFIG_PLIP=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
# CONFIG_PPP_BSDCOMP is not set
CONFIG_PPP_MPPE=m
CONFIG_PPPOE=m
CONFIG_PPPOATM=m
# CONFIG_PPPOL2TP is not set
CONFIG_SLIP=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLHC=m
CONFIG_SLIP_SMART=y
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=y
# CONFIG_NETCONSOLE_DYNAMIC is not set
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_KEYBOARD_STOWAWAY=m
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=m
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
CONFIG_MOUSE_VSXXXAA=m
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_USB=y
CONFIG_JOYSTICK_IFORCE_232=y
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
# CONFIG_JOYSTICK_ZHENHUA is not set
CONFIG_JOYSTICK_DB9=m
CONFIG_JOYSTICK_GAMECON=m
CONFIG_JOYSTICK_TURBOGRAFX=m
CONFIG_JOYSTICK_JOYDUMP=m
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_INPUT_TABLET is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_FUJITSU is not set
CONFIG_TOUCHSCREEN_GUNZE=m
CONFIG_TOUCHSCREEN_ELO=m
CONFIG_TOUCHSCREEN_MTOUCH=m
# CONFIG_TOUCHSCREEN_INEXIO is not set
CONFIG_TOUCHSCREEN_MK712=m
CONFIG_TOUCHSCREEN_PENMOUNT=m
CONFIG_TOUCHSCREEN_TOUCHRIGHT=m
CONFIG_TOUCHSCREEN_TOUCHWIN=m
# CONFIG_TOUCHSCREEN_WM97XX is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_APANEL is not set
CONFIG_INPUT_ATLAS_BTNS=m
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
CONFIG_GAMEPORT_EMU10K1=m
CONFIG_GAMEPORT_FM801=m

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_N_HDLC=m
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
# CONFIG_STALDRV is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CS=m
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_NVRAM=y
CONFIG_R3964=m
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_IPWIRELESS is not set
CONFIG_MWAVE=m
CONFIG_PC8736x_GPIO=m
CONFIG_NSC_GPIO=m
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
CONFIG_HANGCHECK_TIMER=m
CONFIG_TCG_TPM=m
CONFIG_TCG_TIS=m
CONFIG_TCG_NSC=m
CONFIG_TCG_ATMEL=m
CONFIG_TCG_INFINEON=m
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=m
# CONFIG_I2C_AMD756_S4882 is not set
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=y
CONFIG_I2C_NFORCE2=y
# CONFIG_I2C_NFORCE2_S4985 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
CONFIG_I2C_SIS96X=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=m

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_PLATFORM is not set
CONFIG_I2C_STUB=m

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_AT24 is not set
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
CONFIG_W1=m
CONFIG_W1_CON=y

#
# 1-wire Bus Masters
#
CONFIG_W1_MASTER_MATROX=m
CONFIG_W1_MASTER_DS2490=m
CONFIG_W1_MASTER_DS2482=m

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
CONFIG_W1_SLAVE_SMEM=m
CONFIG_W1_SLAVE_DS2433=m
CONFIG_W1_SLAVE_DS2433_CRC=y
# CONFIG_W1_SLAVE_DS2760 is not set
# CONFIG_W1_SLAVE_BQ27000 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=m
CONFIG_SENSORS_ABITUGURU=m
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1029=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7473 is not set
CONFIG_SENSORS_K8TEMP=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS1621=m
# CONFIG_SENSORS_I5K_AMB is not set
CONFIG_SENSORS_F71805F=m
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
# CONFIG_SENSORS_FSCHMD is not set
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
# CONFIG_SENSORS_LM93 is not set
CONFIG_SENSORS_MAX1619=m
# CONFIG_SENSORS_MAX6650 is not set
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_PC87427=m
CONFIG_SENSORS_SIS5595=m
# CONFIG_SENSORS_DME1737 is not set
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_THMC50 is not set
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83793=m
CONFIG_SENSORS_W83L785TS=m
# CONFIG_SENSORS_W83L786NG is not set
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
CONFIG_SENSORS_HDAPS=m
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_HWMON is not set
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=m
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
CONFIG_ALIM1535_WDT=m
CONFIG_ALIM7101_WDT=m
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
CONFIG_IBMASR=m
# CONFIG_WAFER_WDT is not set
CONFIG_I6300ESB_WDT=m
CONFIG_ITCO_WDT=m
CONFIG_ITCO_VENDOR_SUPPORT=y
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
CONFIG_PC87413_WDT=m
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
CONFIG_W83627HF_WDT=m
CONFIG_W83697HF_WDT=m
# CONFIG_W83697UG_WDT is not set
CONFIG_W83877F_WDT=m
CONFIG_W83977F_WDT=m
CONFIG_MACHZ_WDT=m
# CONFIG_SBC_EPX_C3_WATCHDOG is not set

#
# PCI-based Watchdog Cards
#
CONFIG_PCIPCWATCHDOG=m
CONFIG_WDTPCI=m
CONFIG_WDT_501_PCI=y

#
# USB-based Watchdog Cards
#
CONFIG_USBPCWATCHDOG=m
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=m
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
CONFIG_SSB_PCMCIAHOST_POSSIBLE=y
# CONFIG_SSB_PCMCIAHOST is not set
# CONFIG_SSB_DEBUG is not set
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
CONFIG_MFD_SM501=m
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
# CONFIG_VIDEO_DEV is not set
CONFIG_DVB_CORE=m
CONFIG_VIDEO_MEDIA=m

#
# Multimedia drivers
#
# CONFIG_MEDIA_ATTACH is not set
CONFIG_MEDIA_TUNER=m
# CONFIG_MEDIA_TUNER_CUSTOMIZE is not set
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_DVB_CAPTURE_DRIVERS=y

#
# Supported SAA7146 based PCI Adapters
#
# CONFIG_TTPCI_EEPROM is not set
# CONFIG_DVB_BUDGET_CORE is not set

#
# Supported USB Adapters
#
# CONFIG_DVB_USB is not set
CONFIG_DVB_TTUSB_BUDGET=m
CONFIG_DVB_TTUSB_DEC=m
# CONFIG_DVB_SIANO_SMS1XXX is not set

#
# Supported FlexCopII (B2C2) Adapters
#
CONFIG_DVB_B2C2_FLEXCOP=m
CONFIG_DVB_B2C2_FLEXCOP_PCI=m
CONFIG_DVB_B2C2_FLEXCOP_USB=m
# CONFIG_DVB_B2C2_FLEXCOP_DEBUG is not set

#
# Supported BT878 Adapters
#

#
# Supported Pluto2 Adapters
#
CONFIG_DVB_PLUTO2=m

#
# Supported SDMC DM1105 Adapters
#
# CONFIG_DVB_DM1105 is not set

#
# Supported DVB Frontends
#

#
# Customise DVB Frontends
#
# CONFIG_DVB_FE_CUSTOMISE is not set

#
# DVB-S (satellite) frontends
#
CONFIG_DVB_CX24110=m
CONFIG_DVB_CX24123=m
CONFIG_DVB_MT312=m
CONFIG_DVB_S5H1420=m
# CONFIG_DVB_STV0288 is not set
# CONFIG_DVB_STB6000 is not set
CONFIG_DVB_STV0299=m
CONFIG_DVB_TDA8083=m
CONFIG_DVB_TDA10086=m
CONFIG_DVB_VES1X93=m
CONFIG_DVB_TUNER_ITD1000=m
CONFIG_DVB_TDA826X=m
CONFIG_DVB_TUA6100=m
# CONFIG_DVB_CX24116 is not set
# CONFIG_DVB_SI21XX is not set

#
# DVB-T (terrestrial) frontends
#
CONFIG_DVB_SP8870=m
CONFIG_DVB_SP887X=m
CONFIG_DVB_CX22700=m
CONFIG_DVB_CX22702=m
# CONFIG_DVB_DRX397XD is not set
CONFIG_DVB_L64781=m
CONFIG_DVB_TDA1004X=m
CONFIG_DVB_NXT6000=m
CONFIG_DVB_MT352=m
CONFIG_DVB_ZL10353=m
CONFIG_DVB_DIB3000MB=m
CONFIG_DVB_DIB3000MC=m
CONFIG_DVB_DIB7000M=m
CONFIG_DVB_DIB7000P=m
# CONFIG_DVB_TDA10048 is not set

#
# DVB-C (cable) frontends
#
CONFIG_DVB_VES1820=m
CONFIG_DVB_TDA10021=m
# CONFIG_DVB_TDA10023 is not set
CONFIG_DVB_STV0297=m

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#
CONFIG_DVB_NXT200X=m
CONFIG_DVB_OR51211=m
CONFIG_DVB_OR51132=m
CONFIG_DVB_BCM3510=m
CONFIG_DVB_LGDT330X=m
# CONFIG_DVB_S5H1409 is not set
# CONFIG_DVB_AU8522 is not set
# CONFIG_DVB_S5H1411 is not set

#
# Digital terrestrial only tuners/PLL
#
CONFIG_DVB_PLL=m
CONFIG_DVB_TUNER_DIB0070=m

#
# SEC control devices for DVB-S
#
CONFIG_DVB_LNBP21=m
# CONFIG_DVB_ISL6405 is not set
CONFIG_DVB_ISL6421=m
# CONFIG_DVB_LGS8GL5 is not set

#
# Tools to develop new frontends
#
# CONFIG_DVB_DUMMY_FE is not set
# CONFIG_DVB_AF9013 is not set
# CONFIG_DAB is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_DRM=m
CONFIG_DRM_TDFX=m
CONFIG_DRM_R128=m
CONFIG_DRM_RADEON=m
CONFIG_DRM_I810=m
CONFIG_DRM_I830=m
CONFIG_DRM_I915=m
CONFIG_DRM_MGA=m
# CONFIG_DRM_SIS is not set
CONFIG_DRM_VIA=m
CONFIG_DRM_SAVAGE=m
CONFIG_VGASTATE=m
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
CONFIG_FB_SVGALIB=m
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RIVA=m
# CONFIG_FB_RIVA_I2C is not set
# CONFIG_FB_RIVA_DEBUG is not set
CONFIG_FB_RIVA_BACKLIGHT=y
# CONFIG_FB_LE80578 is not set
CONFIG_FB_INTEL=m
# CONFIG_FB_INTEL_DEBUG is not set
CONFIG_FB_INTEL_I2C=y
CONFIG_FB_MATROX=m
CONFIG_FB_MATROX_MILLENIUM=y
CONFIG_FB_MATROX_MYSTIQUE=y
CONFIG_FB_MATROX_G=y
CONFIG_FB_MATROX_I2C=m
CONFIG_FB_MATROX_MAVEN=m
CONFIG_FB_MATROX_MULTIHEAD=y
# CONFIG_FB_RADEON is not set
CONFIG_FB_ATY128=m
CONFIG_FB_ATY128_BACKLIGHT=y
CONFIG_FB_ATY=m
CONFIG_FB_ATY_CT=y
CONFIG_FB_ATY_GENERIC_LCD=y
CONFIG_FB_ATY_GX=y
CONFIG_FB_ATY_BACKLIGHT=y
CONFIG_FB_S3=m
CONFIG_FB_SAVAGE=m
CONFIG_FB_SAVAGE_I2C=y
CONFIG_FB_SAVAGE_ACCEL=y
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
CONFIG_FB_NEOMAGIC=m
CONFIG_FB_KYRO=m
CONFIG_FB_3DFX=m
CONFIG_FB_3DFX_ACCEL=y
CONFIG_FB_VOODOO1=m
# CONFIG_FB_VT8623 is not set
CONFIG_FB_TRIDENT=m
CONFIG_FB_TRIDENT_ACCEL=y
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
CONFIG_FB_SM501=m
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
# CONFIG_LCD_ILI9320 is not set
# CONFIG_LCD_PLATFORM is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_CORGI is not set
CONFIG_BACKLIGHT_PROGEAR=m
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE is not set
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=m
CONFIG_SOUND_OSS_CORE=y
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_DYNAMIC_MINORS=y
# CONFIG_SND_SUPPORT_OLD_API is not set
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_MPU401_UART=m
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_VX_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
# CONFIG_SND_MTPAV is not set
CONFIG_SND_MTS64=m
# CONFIG_SND_SERIAL_U16550 is not set
CONFIG_SND_MPU401=m
CONFIG_SND_PORTMAN2X4=m
CONFIG_SND_AC97_POWER_SAVE=y
CONFIG_SND_AC97_POWER_SAVE_DEFAULT=0
CONFIG_SND_SB_COMMON=m
CONFIG_SND_PCI=y
CONFIG_SND_AD1889=m
CONFIG_SND_ALS300=m
CONFIG_SND_ALS4000=m
CONFIG_SND_ALI5451=m
CONFIG_SND_ATIIXP=m
CONFIG_SND_ATIIXP_MODEM=m
CONFIG_SND_AU8810=m
CONFIG_SND_AU8820=m
CONFIG_SND_AU8830=m
# CONFIG_SND_AW2 is not set
CONFIG_SND_AZT3328=m
CONFIG_SND_BT87X=m
# CONFIG_SND_BT87X_OVERCLOCK is not set
CONFIG_SND_CA0106=m
CONFIG_SND_CMIPCI=m
# CONFIG_SND_OXYGEN is not set
CONFIG_SND_CS4281=m
CONFIG_SND_CS46XX=m
CONFIG_SND_CS46XX_NEW_DSP=y
# CONFIG_SND_CS5530 is not set
CONFIG_SND_DARLA20=m
CONFIG_SND_GINA20=m
CONFIG_SND_LAYLA20=m
CONFIG_SND_DARLA24=m
CONFIG_SND_GINA24=m
CONFIG_SND_LAYLA24=m
CONFIG_SND_MONA=m
CONFIG_SND_MIA=m
CONFIG_SND_ECHO3G=m
CONFIG_SND_INDIGO=m
CONFIG_SND_INDIGOIO=m
CONFIG_SND_INDIGODJ=m
CONFIG_SND_EMU10K1=m
CONFIG_SND_EMU10K1X=m
CONFIG_SND_ENS1370=m
CONFIG_SND_ENS1371=m
CONFIG_SND_ES1938=m
CONFIG_SND_ES1968=m
CONFIG_SND_FM801=m
CONFIG_SND_HDA_INTEL=m
# CONFIG_SND_HDA_HWDEP is not set
# CONFIG_SND_HDA_INPUT_BEEP is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
CONFIG_SND_HDA_CODEC_NVHDMI=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
# CONFIG_SND_HDA_POWER_SAVE is not set
CONFIG_SND_HDSP=m
CONFIG_SND_HDSPM=m
# CONFIG_SND_HIFIER is not set
CONFIG_SND_ICE1712=m
CONFIG_SND_ICE1724=m
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
CONFIG_SND_KORG1212=m
CONFIG_SND_MAESTRO3=m
CONFIG_SND_MIXART=m
CONFIG_SND_NM256=m
CONFIG_SND_PCXHR=m
CONFIG_SND_RIPTIDE=m
CONFIG_SND_RME32=m
CONFIG_SND_RME96=m
CONFIG_SND_RME9652=m
CONFIG_SND_SONICVIBES=m
CONFIG_SND_TRIDENT=m
CONFIG_SND_VIA82XX=m
CONFIG_SND_VIA82XX_MODEM=m
# CONFIG_SND_VIRTUOSO is not set
CONFIG_SND_VX222=m
CONFIG_SND_YMFPCI=m
CONFIG_SND_USB=y
CONFIG_SND_USB_AUDIO=m
CONFIG_SND_USB_USX2Y=m
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
CONFIG_SND_PCMCIA=y
# CONFIG_SND_VXPOCKET is not set
# CONFIG_SND_PDAUDIOCF is not set
CONFIG_SND_SOC=m
# CONFIG_SND_SOC_ALL_CODECS is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set
# CONFIG_HIDRAW is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
CONFIG_HID_COMPAT=y
CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_BRIGHT=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_DELL=y
CONFIG_HID_EZKEY=y
CONFIG_HID_GYRATION=y
CONFIG_HID_LOGITECH=y
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
CONFIG_HID_PANTHERLORD=y
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=y
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
CONFIG_HID_SUNPLUS=y
CONFIG_THRUSTMASTER_FF=y
CONFIG_ZEROPLUS_FF=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_ISP116X_HCD=m
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=m
CONFIG_USB_SL811_HCD=m
CONFIG_USB_SL811_CS=m
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed;
#

#
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_DPCM=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
# CONFIG_USB_STORAGE_ONETOUCH is not set
CONFIG_USB_STORAGE_KARMA=y
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m

#
# USB port drivers
#
CONFIG_USB_USS720=m
CONFIG_USB_SERIAL=m
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_AIRCABLE=m
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
# CONFIG_USB_SERIAL_CH341 is not set
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP2101=m
CONFIG_USB_SERIAL_CYPRESS_M8=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_GARMIN=m
CONFIG_USB_SERIAL_IPW=m
# CONFIG_USB_SERIAL_IUU is not set
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KEYSPAN_MPR=y
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_MOS7720=m
CONFIG_USB_SERIAL_MOS7840=m
# CONFIG_USB_SERIAL_MOTOROLA is not set
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_TI=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_SERIAL_DEBUG=m

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_ADUTUX=m
# CONFIG_USB_SEVSEG is not set
CONFIG_USB_RIO500=m
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_BERRY_CHARGE=m
CONFIG_USB_LED=m
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
CONFIG_USB_PHIDGET=m
CONFIG_USB_PHIDGETKIT=m
CONFIG_USB_PHIDGETMOTORCONTROL=m
CONFIG_USB_PHIDGETSERVO=m
CONFIG_USB_IDMOUSE=m
CONFIG_USB_FTDI_ELAN=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=m
CONFIG_USB_TRANCEVIBRATOR=m
CONFIG_USB_IOWARRIOR=m
CONFIG_USB_TEST=m
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
CONFIG_USB_ATM=m
CONFIG_USB_SPEEDTOUCH=m
CONFIG_USB_CXACRU=m
CONFIG_USB_UEAGLEATM=m
CONFIG_USB_XUSBATM=m
# CONFIG_USB_GADGET is not set
# CONFIG_UWB is not set
CONFIG_MMC=m
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_UNSAFE_RESUME is not set

#
# MMC/SD/SDIO Card Drivers
#
CONFIG_MMC_BLOCK=m
CONFIG_MMC_BLOCK_BOUNCE=y
# CONFIG_SDIO_UART is not set
# CONFIG_MMC_TEST is not set

#
# MMC/SD/SDIO Host Controller Drivers
#
CONFIG_MMC_SDHCI=m
# CONFIG_MMC_SDHCI_PCI is not set
CONFIG_MMC_WBSD=m
CONFIG_MMC_TIFM_SD=m
# CONFIG_MMC_SDRICOH_CS is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_HP_DISK is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set
# CONFIG_ACCESSIBILITY is not set
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_IPATH=m
# CONFIG_INFINIBAND_AMSO1100 is not set
CONFIG_INFINIBAND_CXGB3=m
# CONFIG_INFINIBAND_CXGB3_DEBUG is not set
# CONFIG_MLX4_INFINIBAND is not set
# CONFIG_INFINIBAND_NES is not set
CONFIG_INFINIBAND_IPOIB=m
CONFIG_INFINIBAND_IPOIB_CM=y
CONFIG_INFINIBAND_IPOIB_DEBUG=y
CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
CONFIG_INFINIBAND_SRP=m
CONFIG_INFINIBAND_ISER=m
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=m
CONFIG_EDAC_E752X=m
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
# CONFIG_RTC_DRV_DS1374 is not set
CONFIG_RTC_DRV_DS1672=m
# CONFIG_RTC_DRV_MAX6900 is not set
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=m
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
CONFIG_RTC_DRV_V3020=m

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set
# CONFIG_STAGING is not set
CONFIG_STAGING_EXCLUDE_BUILD=y

#
# Firmware Drivers
#
CONFIG_EDD=m
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DELL_RBU=m
CONFIG_DCDBAS=m
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_FS_XIP=y
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=m
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=m
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
# CONFIG_JFS_DEBUG is not set
# CONFIG_JFS_STATISTICS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
# CONFIG_XFS_RT is not set
# CONFIG_XFS_DEBUG is not set
CONFIG_GFS2_FS=m
CONFIG_GFS2_FS_LOCKING_DLM=m
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_FS_O2CB=m
CONFIG_OCFS2_FS_USERSPACE_CLUSTER=m
CONFIG_OCFS2_FS_STATS=y
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
# CONFIG_OCFS2_DEBUG_FS is not set
# CONFIG_OCFS2_COMPAT_JBD is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_GENERIC_ACL=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
CONFIG_AFFS_FS=m
# CONFIG_ECRYPT_FS is not set
CONFIG_HFS_FS=m
CONFIG_HFSPLUS_FS=m
CONFIG_BEFS_FS=m
# CONFIG_BEFS_DEBUG is not set
CONFIG_BFS_FS=m
CONFIG_EFS_FS=m
CONFIG_CRAMFS=m
CONFIG_VXFS_FS=m
CONFIG_MINIX_FS=m
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
CONFIG_QNX4FS_FS=m
CONFIG_ROMFS_FS=m
CONFIG_SYSV_FS=m
CONFIG_UFS_FS=m
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_SUNRPC_XPRT_RDMA=m
# CONFIG_SUNRPC_REGISTER_V4 is not set
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_RPCSEC_GSS_SPKM3=m
# CONFIG_SMB_FS is not set
CONFIG_CIFS=m
# CONFIG_CIFS_STATS is not set
CONFIG_CIFS_WEAK_PW_HASH=y
# CONFIG_CIFS_UPCALL is not set
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
CONFIG_NCP_FS=m
CONFIG_NCPFS_PACKET_SIGNING=y
CONFIG_NCPFS_IOCTL_LOCKING=y
CONFIG_NCPFS_STRONG=y
CONFIG_NCPFS_NFS_NS=y
CONFIG_NCPFS_OS2_NS=y
CONFIG_NCPFS_SMALLDOS=y
CONFIG_NCPFS_NLS=y
CONFIG_NCPFS_EXTRAS=y
CONFIG_CODA_FS=m
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_UTF8=m
CONFIG_DLM=m
CONFIG_DLM_DEBUG=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_ENABLE_WARN_DEPRECATED=y
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_SCHED_DEBUG is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
CONFIG_RT_MUTEX_TESTER=y
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y

#
# Tracers
#
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SYSPROF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_CONTEXT_SWITCH_TRACER is not set
# CONFIG_BOOT_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_PRINTK_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
CONFIG_DIRECT_GBPAGES=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_MMIOTRACE is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
# CONFIG_SECURITY_ROOTPLUG is not set
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT is not set
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_SMACK is not set
CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_NULL=m
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_LRW=m
CONFIG_CRYPTO_PCBC=m
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=m

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_CAMELLIA=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_KHAZAD=m
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
CONFIG_KVM_AMD=m
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y


* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17 11:01         ` Ingo Molnar
  (?)
@ 2008-11-17 11:20         ` Eric Dumazet
  2008-11-17 16:11             ` Ingo Molnar
  -1 siblings, 1 reply; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, Linus Torvalds

Ingo Molnar wrote:
> * David Miller <davem@davemloft.net> wrote:
> 
>> From: Ingo Molnar <mingo@elte.hu>
>> Date: Mon, 17 Nov 2008 10:06:48 +0100
>>
>>> * Rafael J. Wysocki <rjw@sisk.pl> wrote:
>>>
>>>> This message has been generated automatically as a part of a report
>>>> of regressions introduced between 2.6.26 and 2.6.27.
>>>>
>>>> The following bug entry is on the current list of known regressions
>>>> introduced between 2.6.26 and 2.6.27.  Please verify if it still should
>>>> be listed and let me know (either way).
>>>>
>>>>
>>>> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
>>>> Subject		: tbench regression on each kernel release from 2.6.22 -> 2.6.28
>>>> Submitter	: Christoph Lameter <cl@linux-foundation.org>
>>>> Date		: 2008-08-11 18:36 (98 days old)
>>>> References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
>>>> 		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4
>>> Christoph, as per the recent analysis of Mike:
>>>
>>>  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
>>>
>>> all scheduler components of this regression have been eliminated.
>>>
>>> In fact his numbers show that scheduler speedups since 2.6.22 have 
>>> offset and hidden most other sources of tbench regression. (i.e. the 
>>> scheduler portion got 5% faster, hence it was able to offset a 
>>> slowdown of 5% in other areas of the kernel that tbench triggers)
>> Although I respect the improvements, wake_up() is still several 
>> orders of magnitude slower than it was in 2.6.22 and wake_up() is at 
>> the top of the profiles in tbench runs.
> 
> hm, several orders of magnitude slower? That contradicts Mike's 
> numbers and my own numbers and profiles as well: see below.
> 
> The scheduler's overhead barely even registers on a 16-way x86 system 
> i'm running tbench on. Here's the NMI profile during 64 threads tbench 
> on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]:
> 
>   Throughput 3437.65 MB/sec 64 procs
>   ==================================
>   21570252  total 
>   ........
>    1494803  copy_user_generic_string 
>     998232  sock_rfree 
>     491471  tcp_ack 
>     482405  ip_dont_fragment 
>     470685  ip_local_deliver 
>     436325  constant_test_bit         [ called by napi_disable_pending() ]
>     375469  avc_has_perm_noaudit 
>     347663  tcp_sendmsg 
>     310383  tcp_recvmsg 
>     300412  __inet_lookup_established 
>     294377  system_call 
>     286603  tcp_transmit_skb 
>     251782  selinux_ip_postroute 
>     236028  tcp_current_mss 
>     235631  schedule 
>     234013  netif_rx 
>     229854  _local_bh_enable_ip 
>     219501  tcp_v4_rcv 
> 
>     [ etc. - see full profile attached further below ]
> 
> Note that the scheduler does not even show up in the profile up to 
> entry #15!
> 
> I've also summarized NMI profiler output by major subsystems:
> 
>            NET       overhead (12603450/21570252): 58.43%
>            security  overhead ( 1903598/21570252):  8.83%
>            usercopy  overhead ( 1753617/21570252):  8.13%
>            sched     overhead ( 1599406/21570252):  7.41%
>            syscall   overhead (  560487/21570252):  2.60%
>            IRQ       overhead (  555439/21570252):  2.58%
>            slab      overhead (  492421/21570252):  2.28%
>            timer     overhead (  226573/21570252):  1.05%
>            pagealloc overhead (  192681/21570252):  0.89%
>            PID       overhead (  115123/21570252):  0.53%
>            VFS       overhead (  107926/21570252):  0.50%
>            pagecache overhead (   62552/21570252):  0.29%
>            gtod      overhead (   38651/21570252):  0.18%
>            IDLE      overhead (       0/21570252):  0.00%
> ---------------------------------------------------------
>                          left ( 1349494/21570252):  6.26%
> 
> The scheduler's functions are absolutely flat, and consistent with an 
> extreme context-switching rate of 1.35 million per second. The 
> scheduler can go up to about 20 million context switches per second on 
> this system:
> 
>  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  32  0      0 32229696  29308 649880    0    0     0     0 164135 20026853 24 76  0  0  0
>  32  0      0 32229752  29308 649880    0    0     0     0 164203 20032770 24 76  0  0  0
>  32  0      0 32229752  29308 649880    0    0     0     0 164201 20036492 25 75  0  0  0
> 
> ... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
> 
> Wake up affinities and data flow caching is just fine in this workload 
> - we've got scheduler statistics for that and they look good too.
> 
> It all looks like pure old-fashioned straight overhead in the 
> networking layer to me. Do we still touch the same global cacheline 
> for every localhost packet we process? Anything like that would show 
> up big time.

Yes, we do. I find it strange that dst_release() does not show up in
your NMI profile.

I posted a patch (commit 5635c10d976716ef47ae441998aeae144c7e7387,
"net: make sure struct dst_entry refcount is aligned on 64 bytes",
in the net-next-2.6 tree) to properly align the struct dst_entry
refcount, and got a 4% speedup on tbench on my machine.

There are small speedups too with commit
ef711cf1d156428d4c2911b8c86c6ce90519dc45 ("net: speedup dst_release()").

Also in net-next-2.6 are patches that avoid dirtying last_rx on
netdevices (loopback, for example); they help tbench a lot too.




* Re: [Bug #11805] mounting XFS produces a segfault
  2008-11-16 17:40   ` Rafael J. Wysocki
@ 2008-11-17 14:44     ` Christoph Hellwig
  -1 siblings, 0 replies; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-17 14:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, Kernel Testers List, Dave Chinner,
	Tiago Maluta

On Sun, Nov 16, 2008 at 06:40:58PM +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
> 
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> be listed and let me know (either way).

The patch for this is both in mainline and -stable

> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11805
> Subject		: mounting XFS produces a segfault
> Submitter	: Tiago Maluta <maluta_tiago@yahoo.com.br>
> Date		: 2008-10-21 18:00 (27 days old)
> Handled-By	: Dave Chinner <dgc@sgi.com>

And that email address for Dave is severely outdated.



* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 16:11             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 16:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, Linus Torvalds


* Eric Dumazet <dada1@cosmosbay.com> wrote:

>> It all looks like pure old-fashioned straight overhead in the 
>> networking layer to me. Do we still touch the same global cacheline 
>> for every localhost packet we process? Anything like that would 
>> show up big time.
>
> Yes we do, I find strange we dont see dst_release() in your NMI 
> profile
>
> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387 
> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in 
> net-next-2.6 tree) to properly align struct dst_entry refcounter and 
> got 4% speedup on tbench on my machine.

Ouch, +4% from a one-liner networking change? That's a _huge_ speedup
compared to the things we were after in scheduler land. A lot of
scheduler folks worked hard to squeeze the last 1-2% out of the
scheduler fastpath (which was not trivial at all). The _full_
scheduler accounts for only about 7% of the total system overhead here
on a 16-way box...

So why should we be handling this as anything but a plain networking
performance regression/weakness? The localhost scalability bottleneck
was reported a _long_ time ago.

	Ingo


* Re: [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr
  2008-11-16 17:40   ` Rafael J. Wysocki
  (?)
@ 2008-11-17 16:19   ` Randy Dunlap
  -1 siblings, 0 replies; 349+ messages in thread
From: Randy Dunlap @ 2008-11-17 16:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, Kernel Testers List, James Bottomley,
	Miller, Mike (OS Dev)

Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
> 
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> be listed and let me know (either way).
> 

Nothing has changed.  IMO that means leave the bug as is (alive).

> 
> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11404
> Subject		: BUG: in 2.6.23-rc3-git7 in do_cciss_intr
> Submitter	: rdunlap <randy.dunlap@oracle.com>
> Date		: 2008-08-21 5:52 (88 days old)
> References	: http://marc.info/?l=linux-kernel&m=121929819616273&w=4
> 		  http://marc.info/?l=linux-kernel&m=121932889105368&w=4
> Handled-By	: Miller, Mike (OS Dev) <Mike.Miller@hp.com>
> 		  James Bottomley <James.Bottomley@hansenpartnership.com>

-- 
~Randy


* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 16:35               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 16:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, Linus Torvalds, Stephen Hemminger

Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>>> It all looks like pure old-fashioned straight overhead in the 
>>> networking layer to me. Do we still touch the same global cacheline 
>>> for every localhost packet we process? Anything like that would 
>>> show up big time.
>> Yes we do, I find strange we dont see dst_release() in your NMI 
>> profile
>>
>> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387 
>> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in 
>> net-next-2.6 tree) to properly align struct dst_entry refcounter and 
>> got 4% speedup on tbench on my machine.
> 
> Ouch, +4% from a oneliner networking change? That's a _huge_ speedup 
> compared to the things we were after in scheduler land. A lot of 
> scheduler folks worked hard to squeeze the last 1-2% out of the 
> scheduler fastpath (which was not trivial at all). The _full_ 
> scheduler accounts for only about 7% of the total system overhead here 
> on a 16-way box...

4% on my machine, but apparently my machine is sooooo special (see the
oprofile thread), so maybe its CPUs have a hard time playing with a
contended cache line.

It definitely needs more testing on other machines.

Maybe you'll discover the patch is bad on your machines; this is why it's
in net-next-2.6.

> 
> So why should we be handling this anything but a plain networking 
> performance regression/weakness? The localhost scalability bottleneck 
> has been reported a _long_ time ago.
> 

The struct dst_entry problem was already discovered a _long_ time ago
and was probably solved at that time.

(commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
Thu, 13 Mar 2008 05:52:37 +0000 (22:52 -0700)
[NET]: Fix tbench regression in 2.6.25-rc1)

Then, a gremlin came and broke the thing.

There are many contended cache lines in the system; we can do our
best to try to make them disappear. That's not always possible.

Another contended cache line is the rwlock in iptables.
I remember Stephen had a patch to make the thing use RCU.



* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 17:08                 ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 17:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, Linus Torvalds, Stephen Hemminger


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Ingo Molnar a écrit :
>> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>>
>>>> It all looks like pure old-fashioned straight overhead in the  
>>>> networking layer to me. Do we still touch the same global cacheline 
>>>> for every localhost packet we process? Anything like that would  
>>>> show up big time.
>>> Yes we do, I find strange we dont see dst_release() in your NMI  
>>> profile
>>>
>>> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387  
>>> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in  
>>> net-next-2.6 tree) to properly align struct dst_entry refcounter and  
>>> got 4% speedup on tbench on my machine.
>>
>> Ouch, +4% from a oneliner networking change? That's a _huge_ speedup  
>> compared to the things we were after in scheduler land. A lot of  
>> scheduler folks worked hard to squeeze the last 1-2% out of the  
>> scheduler fastpath (which was not trivial at all). The _full_  
>> scheduler accounts for only about 7% of the total system overhead here  
>> on a 16-way box...
>
> 4% on my machine, but apparently my machine is sooooo special (see 
> oprofile thread), so maybe its cpus have a hard time playing with a 
> contended cache line.
>
> It definitely needs more testing on other machines.
>
> Maybe you'll discover patch is bad on your machines, this is why 
> it's in net-next-2.6

ok, i'll try it on my testbox too, to check whether it has any effect 
- find below the port to -git.

tbench _is_ very sensitive to seemingly small details - it seems to be 
hovering around some sort of CPU cache boundary and penalizing 
random alignment changes, as we drop in and out of the sweet spot.

Mike Galbraith has been spending months trying to pin down all the 
issues.

	Ingo

------------->
From 8fbd307d402647b07c3c2662fdac589494d16e5e Mon Sep 17 00:00:00 2001
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 16 Nov 2008 19:46:36 -0800
Subject: [PATCH] net: make sure struct dst_entry refcount is aligned on 64 bytes

As found in the past (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
[NET]: Fix tbench regression in 2.6.25-rc1), it is really
important that struct dst_entry refcount is aligned on a cache line.

We cannot use __attribute__((aligned)), so manually pad the structure
for 32 and 64 bit arches.

for 32bit : offsetof(struct dst_entry, __refcnt) is 0x80
for 64bit : offsetof(struct dst_entry, __refcnt) is 0xc0

As it is not possible to guess at compile time cache line size,
we use a generic value of 64 bytes, that satisfies many current arches.
(Using 128 bytes alignment on 64bit arches would waste 64 bytes)

Add a BUILD_BUG_ON to make sure future updates to "struct dst_entry" don't
break this alignment.

"tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
(2350 MB/s instead of 2250 MB/s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/dst.h |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 8a8b71e..1b4de18 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -59,7 +59,11 @@ struct dst_entry
 
 	struct neighbour	*neighbour;
 	struct hh_cache		*hh;
+#ifdef CONFIG_XFRM
 	struct xfrm_state	*xfrm;
+#else
+	void			*__pad1;
+#endif
 
 	int			(*input)(struct sk_buff*);
 	int			(*output)(struct sk_buff*);
@@ -70,8 +74,20 @@ struct dst_entry
 
 #ifdef CONFIG_NET_CLS_ROUTE
 	__u32			tclassid;
+#else
+	__u32			__pad2;
 #endif
 
+
+	/*
+	 * Align __refcnt to a 64 bytes alignment
+	 * (L1_CACHE_SIZE would be too much)
+	 */
+#ifdef CONFIG_64BIT
+	long			__pad_to_align_refcnt[2];
+#else
+	long			__pad_to_align_refcnt[1];
+#endif
 	/*
 	 * __refcnt wants to be on a different cache line from
 	 * input/output/ops or performance tanks badly
@@ -157,6 +173,11 @@ dst_metric_locked(struct dst_entry *dst, int metric)
 
 static inline void dst_hold(struct dst_entry * dst)
 {
+	/*
+	 * If your kernel compilation stops here, please check
+	 * __pad_to_align_refcnt declaration in struct dst_entry
+	 */
+	BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
 	atomic_inc(&dst->__refcnt);
 }
 

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 17:25                   ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 17:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, Linus Torvalds, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> > 4% on my machine, but apparently my machine is sooooo special (see 
> > oprofile thread), so maybe its cpus have a hard time playing with 
> > a contended cache line.
> >
> > It definitely needs more testing on other machines.
> >
> > Maybe you'll discover patch is bad on your machines, this is why 
> > it's in net-next-2.6
> 
> ok, i'll try it on my testbox too, to check whether it has any effect 
> - find below the port to -git.

it gives a small speedup of ~1% on my box:

   before:      Throughput 3437.65 MB/sec 64 procs
   after:       Throughput 3473.99 MB/sec 64 procs

... although that's still a bit close to the natural tbench noise 
range so it's not conclusive and not like a smoking gun IMO.

But i think this change might just be papering over the real 
scalability problem that this workload has in my opinion: that there's 
a single localhost route/dst/device that millions of packets are 
squeezed through every second:

 phoenix:~> ifconfig lo
 lo        Link encap:Local Loopback  
           inet addr:127.0.0.1  Mask:255.0.0.0
           UP LOOPBACK RUNNING  MTU:16436  Metric:1
           RX packets:258001524 errors:0 dropped:0 overruns:0 frame:0
           TX packets:258001524 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0 
           RX bytes:679809512144 (633.1 GiB)  TX bytes:679809512144 (633.1 GiB)

There does not seem to be any per-CPU-ness in localhost networking - 
it has a globally single-threaded rx/tx queue AFAICS even if both the 
client and server tasks are on the same CPU - how is that supposed to 
perform well? (but i might be missing something)

What kind of test-system do you have - one with P4 style Xeon CPUs 
perhaps where dirty-cacheline cachemisses to DRAM were particularly 
expensive?

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 17:33                     ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 17:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, Linus Torvalds, Stephen Hemminger

Ingo Molnar a écrit :
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>>> 4% on my machine, but apparently my machine is sooooo special (see 
>>> oprofile thread), so maybe its cpus have a hard time playing with 
>>> a contended cache line.
>>>
>>> It definitely needs more testing on other machines.
>>>
>>> Maybe you'll discover patch is bad on your machines, this is why 
>>> it's in net-next-2.6
>> ok, i'll try it on my testbox too, to check whether it has any effect 
>> - find below the port to -git.
> 
> it gives a small speedup of ~1% on my box:
> 
>    before:      Throughput 3437.65 MB/sec 64 procs
>    after:       Throughput 3473.99 MB/sec 64 procs

Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8"

> 
> ... although that's still a bit close to the natural tbench noise 
> range so it's not conclusive and not like a smoking gun IMO.
> 
> But i think this change might just be papering over the real 
> scalability problem that this workload has in my opinion: that there's 
> a single localhost route/dst/device that millions of packets are 
> squeezed through every second:

Yes, this point was mentioned on netdev a while back.

> 
>  phoenix:~> ifconfig lo
>  lo        Link encap:Local Loopback  
>            inet addr:127.0.0.1  Mask:255.0.0.0
>            UP LOOPBACK RUNNING  MTU:16436  Metric:1
>            RX packets:258001524 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:258001524 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:0 
>            RX bytes:679809512144 (633.1 GiB)  TX bytes:679809512144 (633.1 GiB)
> 
> There does not seem to be any per CPU ness in localhost networking - 
> it has a globally single-threaded rx/tx queue AFAICS even if both the 
> client and server task is on the same CPU - how is that supposed to 
> perform well? (but i might be missing something)

Stephen had a patch for this one too, but with that patch the tbench gain was within noise as well

http://kerneltrap.org/mailarchive/linux-netdev/2008/11/5/3926034


> 
> What kind of test-system do you have - one with P4 style Xeon CPUs 
> perhaps where dirty-cacheline cachemisses to DRAM were particularly 
> expensive?

It's an HP BL460c G1

Dual quad-core Intel E5450 cpus @3.00GHz

So 8 logical cpus. My bench was "tbench 8"



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 17:38                       ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 17:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, Stephen Hemminger



On Mon, 17 Nov 2008, Eric Dumazet wrote:

> Ingo Molnar a écrit :

> > it gives a small speedup of ~1% on my box:
> > 
> >    before:      Throughput 3437.65 MB/sec 64 procs
> >    after:       Throughput 3473.99 MB/sec 64 procs
> 
> Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8"

I think Ingo may have a Nehalem. Let's just say that those things rock, 
and have rather good memory throughput.

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 17:42                         ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 17:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, Stephen Hemminger

Linus Torvalds a écrit :
> 
> On Mon, 17 Nov 2008, Eric Dumazet wrote:
> 
>> Ingo Molnar a écrit :
> 
>>> it gives a small speedup of ~1% on my box:
>>>
>>>    before:      Throughput 3437.65 MB/sec 64 procs
>>>    after:       Throughput 3473.99 MB/sec 64 procs
>> Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8"
> 
> I think Ingo may have a Nehalem. Let's just say that those things rock, 
> and have rather good memory throughput.
> 

I want one :)

Or even two of them :)



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 18:23                         ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 17 Nov 2008, Eric Dumazet wrote:
> 
> > Ingo Molnar a écrit :
> 
> > > it gives a small speedup of ~1% on my box:
> > > 
> > >    before:      Throughput 3437.65 MB/sec 64 procs
> > >    after:       Throughput 3473.99 MB/sec 64 procs
> > 
> > Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8"
> 
> I think Ingo may have a Nehalem. Let's just say that those things 
> rock, and have rather good memory throughput.

hm, i'm not sure whether i can post benchmarks from the Nehalem box - 
but i can confirm it in general terms that it's rather nice ;-)

This was run on another testbox (4x4 Barcelona) that rocks similarly 
well in terms of memory subsystem latencies: which seems to be 
tbench's main current critical path.

For the tbench bragging rights i'd probably turn off CONFIG_SECURITY 
and a few other options. Plus i'd run with 16 threads only - in this 
test i ran with 4x overload (64 tbench threads, not 16) to stress the 
scheduler harder.

Although we degrade very gently with overload so the numbers aren't all 
that much different:

   16 threads: Throughput 3463.14 MB/sec 16 procs
   64 threads: Throughput 3473.99 MB/sec 64 procs
  256 threads: Throughput 3457.67 MB/sec 256 procs
 1024 threads: Throughput 3448.85 MB/sec 1024 procs

 [ so it's the same within noise range. ]

1024 threads is already a massive 64x overload so beyond any 
reasonable limit of workload sanity.

Which suggests that the main limiting factor is cacheline ping-pong 
that is already in full effect at 16 threads.

Which is supported by the "most expensive instructions" top-10 sorted 
list:

            RIP     #hits
..........................                           

                           [ usercopy ]
ffffffff80350fcd:  1373300 	f3 48 a5             	rep movsq %ds:(%rsi),%es:(%rdi)

ffffffff804a2f33:          <sock_rfree>:
ffffffff804a2f34:   985253 	48 89 e5             	mov    %rsp,%rbp


ffffffff804d2eb7:          <ip_local_deliver>:
ffffffff804d2eb8:   432659 	48 89 e5             	mov    %rsp,%rbp

ffffffff804aa23c:          <constant_test_bit>: [ => napi_disable_pending() ]
ffffffff804aa24c:   374052 	89 d1                	mov    %edx,%ecx

ffffffff804d5076:          <ip_dont_fragment>:
ffffffff804d5076:   310051 	8a 97 56 02 00 00    	mov    0x256(%rdi),%dl

ffffffff804d9b17:          <__inet_lookup_established>:
ffffffff804d9bdf:   247224 	eb ba                	jmp    ffffffff804d9b9b <__inet_lookup_established+0x84>

ffffffff80321529:          <selinux_ip_postroute>:
ffffffff8032152a:   183700 	48 89 e5             	mov    %rsp,%rbp

ffffffff8020c020:          <system_call>:
ffffffff8020c020:   183600 	0f 01 f8             	swapgs 

ffffffff8051884a:          <netlbl_enabled>:
ffffffff8051884a:   179538 	55                   	push   %rbp

The usual profiling caveat applies: it's not _these_ instructions that 
matter, but the surrounding code that calls them. Profiling overhead 
is delayed by a couple of instructions - the more out-of-order a CPU 
is, the larger this delay can be. But even a quick look at the list 
above shows that all of the heavy cachemisses are generated by 
networking.

Beyond the usual suspects of syscall entry and memcpy, it's only 
networking. We don't even have the mov %cr3 TLB flush overhead in this 
list, load_cr3() is a distant #30:

ffffffff8023049f:        0      0f 22 d8                mov    %rax,%cr3
ffffffff802304a2:   126303      c9                      leaveq

The place for the sock_rfree() hit looks a bit weird, and i'll 
investigate it now a bit more to place the real overhead point 
properly. (i already mapped the test-bit overhead: that comes from 
napi_disable_pending())
 
The first entry is 10x the cost of the last entry in the list so 
clearly we've got 1-2 brutal cacheline ping-pongs that dominate the 
overhead of this workload.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 18:33                           ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 18:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger



On Mon, 17 Nov 2008, Ingo Molnar wrote:
> 
> hm, i'm not sure whether i can post benchmarks from the Nehalem box - 
> but i can confirm it in general terms that it's rather nice ;-)

Intel released the NDA from various web sites a week or two ago, and Intel 
is now selling it in the US (I think today was in fact the official 
launch), so I think benchmarks are safe - you can buy the dang things on 
the street.

I don't know what availability is, of course. But I doubt that Intel would 
mind Nehalem benchmarks even if it were a paper launch - at least from my 
personal experience, I've not seen any bad behavior (and plenty of good).

> This was run on another testbox (4x4 Barcelona) that rocks similarly 
> well in terms of memory subsystem latencies: which seems to be 
> tbench's main current critical path.

Ahh, ok.

			Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 18:49                           ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> The place for the sock_rfree() hit looks a bit weird, and i'll 
> investigate it now a bit more to place the real overhead point 
> properly. (i already mapped the test-bit overhead: that comes from 
> napi_disable_pending())

ok, here's a new set of profiles. (again for tbench 64-thread on a 
16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i 
posted before.)

Here are the per major subsystem percentages:

           NET       overhead ( 5786945/10096751): 57.31%
           security  overhead (  925933/10096751):  9.17%
           usercopy  overhead (  837887/10096751):  8.30%
           sched     overhead (  753662/10096751):  7.46%
           syscall   overhead (  268809/10096751):  2.66%
           IRQ       overhead (  266500/10096751):  2.64%
           slab      overhead (  180258/10096751):  1.79%
           timer     overhead (   92986/10096751):  0.92%
           pagealloc overhead (   87381/10096751):  0.87%
           VFS       overhead (   53295/10096751):  0.53%
           PID       overhead (   44469/10096751):  0.44%
           pagecache overhead (   33452/10096751):  0.33%
           gtod      overhead (   11064/10096751):  0.11%
           IDLE      overhead (       0/10096751):  0.00%
---------------------------------------------------------
                         left (  753878/10096751):  7.47%

The breakdown is very similar to what i sent before, within noise.

[ 'left' is random overhead from all around the place - i categorized 
  the 500 most expensive functions in the profile per subsystem.
  I stopped short of doing it for all 1300+ functions: it's rather
  laborious manual work even with hefty use of regex patterns.
  It's also less meaningful in practice: the trend in the first 500
  functions is present in the remaining 800 functions as well. I 
  watched the breakdown evolve as i increased the coverage - in 
  practice it is the first 100 functions that matter - it just doesn't 
  change after that. ]
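
A minimal sketch of how such a regex-based per-subsystem breakdown could be
scripted. The patterns and the classify()/breakdown() helpers are hypothetical
(the regexes actually used were not posted), and the assignment of individual
functions to subsystems here is only a guess:

```python
import re
from collections import defaultdict

# Hypothetical patterns, checked in order; first match wins.
# These are illustrative guesses, not the regexes used for the table above.
SUBSYSTEM_PATTERNS = [
    ("usercopy", re.compile(r"^(copy_user_|copy_(to|from)_user$|memcpy_toiovec$)")),
    ("security", re.compile(r"^(avc_|selinux_|sel_|audit_|security_)")),
    ("NET", re.compile(r"^(tcp_|__tcp_|ip_|skb_|__skb_|sock_|sk_|netif_|inet_|__inet_|nf_)")),
    ("sched", re.compile(r"^(schedule$|__switch_to$|try_to_wake_up$|.*_task_fair$)")),
    ("syscall", re.compile(r"^(system_call|sysret_|sys_)")),
    ("slab", re.compile(r"^(kfree$|kmem_cache_|__kmalloc)")),
]


def classify(func):
    """Return the first subsystem whose pattern matches, else 'left'."""
    for name, pattern in SUBSYSTEM_PATTERNS:
        if pattern.search(func):
            return name
    return "left"


def breakdown(hits):
    """Map {function: hit count} to {subsystem: percent of all hits}."""
    totals = defaultdict(int)
    for func, count in hits.items():
        totals[classify(func)] += count
    grand_total = sum(hits.values())
    return {name: 100.0 * count / grand_total for name, count in totals.items()}
```

Anything no pattern claims ends up in 'left', which mirrors how the
uncategorized remainder is reported in the table above.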

The readprofile output below seems structured in a more useful way now 
- i tweaked compiler options to have the profiler hits spread out in a 
more meaningful way. I collected 10 million NMI profiler hits, and 
normalized the readprofile output up to 100%.
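
A rough sketch of that normalization step, assuming each input line starts
with a raw hit count followed by a symbol name and that any summary "total"
line should be skipped. The normalize() and format_report() helpers are
hypothetical; the actual post-processing script was not posted:

```python
def normalize(lines):
    """Turn 'hits symbol ...' lines into (percent, symbol), largest first."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines and any pre-existing summary line.
        if len(fields) < 2 or not fields[0].isdigit() or fields[1] == "total":
            continue
        rows.append((int(fields[0]), fields[1]))
    total = sum(hits for hits, _ in rows)
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(100.0 * hits / total, name) for hits, name in rows]


def format_report(rows):
    """Render rows in the same column layout as the list in this mail."""
    out = ["100.000000 total", "................"]
    out += [f"{pct:10.6f} {name}" for pct, name in rows]
    return "\n".join(out)
```

Each symbol's share is simply its hit count divided by the sum of all counted
hits, so the column adds up to 100% regardless of how many samples were taken.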

[ I'll post per function analysis as i complete them, as a reply to
  this mail. ]

	Ingo

100.000000 total
................
  7.253355 copy_user_generic_string
  3.934833 avc_has_perm_noaudit
  3.356152 ip_queue_xmit
  3.038025 skb_release_data
  2.118525 skb_release_head_state
  1.997533 tcp_ack
  1.833688 tcp_recvmsg
  1.717771 eth_type_trans
  1.673249 __inet_lookup_established
  1.508888 system_call
  1.469183 tcp_current_mss
  1.431553 tcp_transmit_skb
  1.385125 tcp_sendmsg
  1.327643 tcp_v4_rcv
  1.292328 nf_hook_thresh
  1.203205 schedule
  1.059501 nf_hook_slow
  1.027373 constant_test_bit
  0.945183 sock_rfree
  0.922748 __switch_to
  0.911605 netif_rx
  0.876270 register_gifconf
  0.788200 ip_local_deliver_finish
  0.781467 dev_queue_xmit
  0.766530 constant_test_bit
  0.758208 _local_bh_enable_ip
  0.747184 load_cr3
  0.704341 memset_c
  0.671260 sysret_check
  0.651845 ip_finish_output2
  0.620204 audit_free_names
  0.617781 audit_syscall_exit
  0.615149 skb_copy_datagram_iovec
  0.613848 selinux_socket_sock_rcv_skb
  0.606995 constant_test_bit
  0.593936 __tcp_push_pending_frames
  0.592198 tcp_cleanup_rbuf
  0.574093 ip_rcv
  0.567886 netif_receive_skb
  0.563377 get_page_from_freelist
  0.557657 tcp_event_data_recv
  0.539274 ip_local_deliver
  0.534130 sys_recvfrom
  0.512321 __tcp_select_window
  0.498427 tcp_rcv_established
  0.494862 sys_sendto
  0.487473 audit_syscall_entry
  0.478495 sched_clock_cpu
  0.474861 kfree
  0.466310 tcp_established_options
  0.461384 net_rx_action
  0.447162 __mod_timer
  0.442078 ip_rcv_finish
  0.441631 find_pid_ns
  0.441124 sk_wait_data
  0.423943 __sock_recvmsg
  0.422126 selinux_parse_skb
  0.417975 __napi_schedule
  0.414082 __do_softirq
  0.403604 task_rq_lock
  0.380792 nf_iterate
  0.377614 select_task_rq_fair
  0.374973 sock_sendmsg
  0.374635 kmem_cache_alloc_node
  0.368775 avc_has_perm
  0.368706 local_bh_disable
  0.361834 release_sock
  0.346400 sock_common_recvmsg
  0.342825 skb_clone
  0.338704 __alloc_skb
  0.326488 do_softirq
  0.323410 lock_sock_nested
  0.322129 __copy_skb_header
  0.316835 put_page
  0.310966 selinux_ip_postroute
  0.306229 sel_netport_sid
  0.299863 try_to_wake_up
  0.296288 process_backlog
  0.294818 __inet_lookup
  0.294778 thread_return
  0.293219 cfs_rq_of
  0.292315 internal_add_timer
  0.292305 tcp_rcv_space_adjust
  0.281053 constant_test_bit
  0.278779 local_bh_enable
  0.272910 *unknown*
  0.269593 schedule_timeout
  0.261846 tcp_v4_md5_lookup
  0.260992 __ip_local_out
  0.255868 __enqueue_entity
  0.253931 avc_audit
  0.252004 finish_task_switch
  0.249263 audit_get_context
  0.248290 sockfd_lookup_light
  0.247416 virt_to_head_page
  0.244149 tcp_options_write
  0.243603 memcpy_toiovec
  0.243434 sock_recvmsg
  0.242599 call_softirq
  0.242391 __unlazy_fpu
  0.236412 fput_light
  0.235628 ret_from_sys_call
  0.234933 sk_reset_timer
  0.228358 math_state_restore
  0.227117 socket_has_perm
  0.223492 virt_to_cache
  0.219063 __cache_free
  0.216401 update_curr
  0.216232 tcp_v4_send_check
  0.213978 audit_free_aux
  0.213223 tcp_v4_do_rcv
  0.212975 __kfree_skb
  0.211137 dev_hard_start_xmit
  0.209052 tcp_rtt_estimator
  0.207999 netif_needs_gso
  0.207662 __update_sched_clock
  0.207284 rb_erase
  0.204861 enqueue_task_fair
  0.203490 skb_release_all
  0.203252 tcp_send_delayed_ack
  0.203232 inet_ehashfn
  0.199846 sel_netport_find
  0.195396 system_call_after_swapgs
  0.186756 lock_timer_base
  0.186687 pick_next_task_fair
  0.183986 mod_timer
  0.182982 loopback_xmit
  0.182605 native_read_tsc
  0.181195 skb_set_owner_r
  0.179248 switch_mm
  0.175584 set_next_entity
  0.173329 raw_local_deliver
  0.171641 sys_kill
  0.164510 dequeue_task_fair
  0.161938 clear_bit
  0.160528 sock_def_readable
  0.157628 __tcp_ack_snd_check
  0.156893 skb_can_coalesce
  0.156556 tcp_snd_wnd_test
  0.155662 ip_output
  0.150627 sk_stream_alloc_skb
  0.150219 cpu_sdc
  0.149425 sysret_careful
  0.148760 tcp_data_snd_check
  0.147816 auditsys
  0.147419 pskb_may_pull
  0.147151 fget_light
  0.143774 tcp_cwnd_test
  0.143029 rb_insert_color
  0.142265 __wake_up
  0.141808 tcp_bound_to_half_wnd
  0.138600 __sk_dst_check
  0.138431 free_hot_cold_page
  0.137954 unroll_tree_refs
  0.137080 __skb_unlink
  0.135124 __sock_sendmsg
  0.135064 get_pageblock_flags_group
  0.132701 kmem_cache_free
  0.128152 bictcp_cong_avoid
  0.127874 __napi_complete
  0.127527 ____cache_alloc
  0.127368 tcp_is_cwnd_limited
  0.127278 find_vpid
  0.126941 constant_test_bit
  0.126504 sk_mem_charge
  0.126255 __alloc_pages_internal
  0.125977 dst_release
  0.125521 hash_64
  0.124895 put_prev_task_fair
  0.123802 netlbl_enabled
  0.122829 sched_clock
  0.122640 skb_push
  0.122035 __phys_addr
  0.121161 dput
  0.120515 tcp_prequeue_process
  0.118916 __skb_dequeue
  0.117715 selinux_socket_sendmsg
  0.117536 __inc_zone_state
  0.115907 sk_wake_async
  0.113504 selinux_ipv4_output
  0.113017 sel_netif_sid
  0.112431 skb_reset_network_header
  0.111170 check_preempt_wakeup
  0.111061 bictcp_acked
  0.110882 sel_netnode_find
  0.109978 update_min_vruntime
  0.109889 resched_task
  0.109879 current_kernel_time
  0.109432 tcp_checksum_complete_user
  0.107476 ip_dont_fragment
  0.107386 sysret_audit
  0.106979 inet_csk_reset_xmit_timer
  0.106006 skb_entail
  0.105777 sysret_signal
  0.105420 avc_hash
  0.105251 __skb_clone
  0.105211 tcp_init_tso_segs
  0.103523 __dequeue_entity
  0.101715 PageLRU
  0.101378 tcp_parse_aligned_timestamp
  0.101219 __xchg
  0.100544 constant_test_bit
  0.097991 __kmalloc
  0.097584 test_tsk_thread_flag
  0.097475 autoremove_wake_function
  0.095747 selinux_task_kill
  0.094416 get_page
  0.093353 dequeue_task
  0.092728 __local_bh_disable
  0.091943 selinux_netlbl_sock_rcv_skb
  0.091655 path_put
  0.090970 skb_headroom
  0.090950 PageTail
  0.090642 dst_destroy
  0.090523 netpoll_rx
  0.089589 skb_header_pointer
  0.085935 security_socket_recvmsg
  0.084008 alloc_pages_current
  0.083184 compare_ether_addr
  0.082479 rb_next
  0.082439 sk_wmem_schedule
  0.081635 next_zones_zonelist
  0.080135 tcp_cwnd_validate
  0.079877 tcp_event_new_data_sent
  0.079817 fcheck_files
  0.079082 ip_skb_dst_mtu
  0.078804 ip_finish_output
  0.078278 wakeup_preempt_entity
  0.077026 sel_netif_find
  0.076788 __skb_queue_tail
  0.076570 sock_flag
  0.076520 tcp_win_from_space
  0.076510 zone_watermark_ok
  0.076282 sel_netnode_sid
  0.076162 policy_zonelist
  0.074732 __wake_up_common
  0.074613 compound_head
  0.074593 task_has_perm
  0.073243 __find_general_cachep
  0.073064 tcp_push
  0.072925 skb_cloned
  0.072309 pskb_may_pull
  0.071852 TCP_ECN_check_ce
  0.071495 cap_task_to_inode
  0.070770 default_wake_function
  0.069429 xfrm4_policy_check
  0.069091 tcp_parse_md5sig_option
  0.068287 tcp_v4_md5_do_lookup
  0.068059 tcp_v4_tw_remember_stamp
  0.067344 tcp_ca_event
  0.067125 tcp_ca_event
  0.065457 place_entity
  0.065318 write_seqlock
  0.065089 device_not_available
  0.065069 test_ti_thread_flag
  0.063878 tcp_set_skb_tso_segs
  0.063550 selinux_netlbl_inode_permission
  0.063391 sock_wfree
  0.063311 prepare_to_wait
  0.058872 pid_vnr
  0.058803 __cycles_2_ns
  0.057631 ip_local_out
  0.057333 tcp_ack_saw_tstamp
  0.056896 copy_to_user
  0.056628 set_bit
  0.055913 free_pages_check
  0.054969 tcp_rcv_rtt_measure_ts
  0.053797 init_rootdomain
  0.053708 selinux_socket_recvmsg
  0.053698 pid_nr_ns
  0.053629 sk_eat_skb
  0.052814 _local_bh_enable
  0.052645 nf_hook_thresh
  0.052516 sched_info_queued
  0.052457 enqueue_task
  0.052228 sk_filter
  0.052159 __cpu_clear
  0.051980 local_bh_enable_ip
  0.050292 update_rq_clock
  0.048981 task_tgid_vnr
  0.048881 copy_from_user
  0.048782 tcp_parse_options
  0.048484 lock_sock
  0.047779 net_timestamp
  0.047044 open_softirq
  0.046955 tcp_win_from_space
  0.045981 __skb_dequeue
  0.043846 getboottime
  0.043777 account_group_exec_runtime
  0.043519 can_checksum_protocol
  0.043469 set_user_nice
  0.042784 skb_fill_page_desc
  0.042247 security_socket_sendmsg
  0.041989 read_profile
  0.041930 tcp_validate_incoming
  0.041612 check_preempt_curr
  0.041413 skb_pull
  0.041026 generic_smp_call_function_interrupt
  0.041016 calc_delta_fair
  0.040936 clear_buddies
  0.040768 tcp_data_queue
  0.040698 page_count
  0.039695 lock_sock
  0.039099 skb_headroom
  0.038851 system_call_fastpath
  0.038622 zone_statistics
  0.037500 tcp_sack_extend
  0.037381 __kmalloc_node
  0.036587 first_zones_zonelist
  0.036497 mntput
  0.036179 pick_next_task
  0.035991 kmap
  0.035911 sock_put
  0.035613 deactivate_task
  0.035027 __nr_to_section
  0.033985 page_zone
  0.033190 native_load_tls
  0.032882 netif_tx_queue_stopped
  0.032713 __skb_insert
  0.032187 sock_flag
  0.031988 check_kill_permission
  0.031790 policy_nodemask
  0.031621 detach_timer
  0.030558 inet_csk_clear_xmit_timer
  0.030469 task_rq_unlock
  0.029883 tcp_nagle_test
  0.029744 tracesys
  0.028383 virt_to_slab
  0.028115 tcp_v4_check
  0.028046 __cpu_set
  0.027658 page_get_cache
  0.027063 tcp_store_ts_recent
  0.027053 __skb_pull
  0.026953 gfp_zone
  0.026586 sock_rcvlowat
  0.026576 csum_partial
  0.026397 init_waitqueue_head
  0.026109 finish_wait
  0.026040 kill_pid_info
  0.025404 tcp_full_space
  0.024888 __skb_queue_before
  0.024550 dst_confirm
  0.022603 inet_ehash_bucket
  0.021888 activate_task
  0.021650 tcp_rto_min
  0.021283 d_callback
  0.020965 signal_pending
  0.020925 avc_node_free
  0.020915 empty_bucket
  0.020746 group_send_sig_info
  0.020657 skb_reset_transport_header
  0.020061 sock_put
  0.019992 signal_pending_state
  0.019684 tcp_sync_mss
  0.019346 skb_network_offset
  0.019276 skb_split
  0.018988 tcp_adjust_fackets_out
  0.018204 tcp_fast_path_check
  0.017727 __skb_unlink
  0.017687 napi_disable_pending
  0.017678 sg_set_page
  0.017022 get_pageblock_bitmap
  0.016972 tcp_cong_avoid
  0.016962 pid_task
  0.016754 skb_set_tail_pointer
  0.016039 selinux_ipv4_postroute
  0.015930 idle_cpu
  0.015632 skb_reset_network_header
  0.015552 __count_vm_events
  0.015483 source_load
  0.014867 __skb_unlink
  0.014738 skb_reset_transport_header
  0.014599 set_bit
  0.014241 audit_zero_context
  0.014231 zone_page_state
  0.014152 clear_bit
  0.013874 PageSlab
  0.013546 __memset
  0.013238 get_pageblock_migratetype
  0.012623 __rb_rotate_right
  0.012543 kmem_find_general_cachep
  0.012414 __kprobes_text_start
  0.012344 security_sock_rcv_skb
  0.012344 node_zonelist
  0.012335 dnotify_parent
  0.012096 skb_headroom
  0.011778 tcp_push_one
  0.011540 mnt_want_write
  0.011143 kmalloc
  0.011073 retint_swapgs
  0.010954 __rb_rotate_left
  0.010805 check_pgd_range
  0.010785 tcp_mss_split_point
  0.010755 migrate_timer_list
  0.010338 __send_IPI_dest_field
  0.010229 reschedule_interrupt
  0.010179 sock_flag
  0.009882 smp_call_function_mask
  0.009673 test_tsk_need_resched
  0.009564 tcp_urg
  0.009504 generic_file_aio_read
  0.009176 PageReserved
  0.009147 net_invalid_timestamp
  0.009087 __node_set
  0.008749 do_tcp_setsockopt
  0.008730 set_tsk_thread_flag
  0.008720 tcp_enter_loss
  0.008422 sock_error
  0.008362 target_load
  0.008302 crypto_hash_update
  0.008104 PageReadahead
  0.008044 tcp_poll
  0.007915 tcp_checksum_complete
  0.007329 tcp_snd_test
  0.007309 selinux_file_permission
  0.007290 sel_netif_destroy
  0.007220 put_pages_list
  0.006992 dst_output
  0.006743 prepare_to_copy
  0.006694 tcp_init_cwnd
  0.006555 clear_bit
  0.006535 set_bit
  0.006425 normal_prio
  0.006366 msleep
  0.006346 error_sti
  0.006336 tcp_rcv_rtt_update
  0.006167 tcp_send_ack
  0.005989 tcp_init_nondata_skb
  0.005720 kfree_skb
  0.005502 call_function_interrupt
  0.005413 __count_vm_event
  0.005403 __skb_checksum_complete_head
  0.005363 page_cache_get_speculative
  0.005323 dev_kfree_skb_irq
  0.005174 skb_store_bits
  0.004956 cpu_avg_load_per_task
  0.004916 dev_cpu_callback
  0.004807 __kmem_cache_destroy
  0.004777 tcp_init_metrics
  0.004777 io_schedule
  0.004777 find_get_page
  0.004707 eth_header_parse
  0.004688 cap_task_kill
  0.004678 error_exit
  0.004668 rb_prev
  0.004658 tso_fragment
  0.004648 mmdrop
  0.004628 skb_reset_tail_pointer
  0.004598 apic_timer_interrupt
  0.004588 clear_bit
  0.004519 tcp_simple_retransmit
  0.004449 get_max_files
  0.004370 sk_stop_timer
  0.004340 tcp_reset
  0.004251 netlbl_cache_add
  0.004201 tcp_add_reno_sack
  0.004151 __pskb_trim_head
  0.004102 __profile_flip_buffers
  0.004092 sk_common_release
  0.004052 audit_copy_inode
  0.003953 eth_change_mtu
  0.003943 vfs_read
  0.003923 run_timer_softirq
  0.003843 mnt_drop_write
  0.003814 clear_page_c
  0.003804 do_sync_read
  0.003744 unset_migratetype_isolate
  0.003714 sk_stream_moderate_sndbuf
  0.003545 tcp_try_rmem_schedule
  0.003476 native_apic_mem_write
  0.003466 sys_read
  0.003446 skb_checksum
  0.003436 timer_set_base
  0.003426 security_task_kill
  0.003416 __flow_cache_shrink
  0.003406 __skb_checksum_complete
  0.003277 alloc_skb
  0.003267 physflat_send_IPI_mask
  0.003218 skb_gso_ok
  0.003178 constant_test_bit
  0.003168 find_next_bit
  0.003158 selinux_netlbl_skbuff_getsid
  0.003118 constant_test_bit
  0.003099 pull_task
  0.003079 hrtimer_run_queues
  0.003049 free_hot_page
  0.003009 scheduler_tick
  0.002900 set_32bit_tls
  0.002890 tcp_acceptable_seq
  0.002811 rw_verify_area
  0.002751 radix_tree_lookup_slot
  0.002731 zero_user_segment
  0.002731 sock_common_setsockopt
  0.002612 __load_balance_iterator
  0.002473 run_posix_cpu_timers
  0.002264 task_utime
  0.002254 switched_to_fair
  0.002185 fsnotify_access
  0.002145 __rmqueue_smallest
  0.002125 __schedule_bug
  0.002095 __task_rq_lock
  0.002086 tcp_may_update_window
  0.002076 restore_args
  0.002066 hrtimer_run_pending
  0.002056 generic_segment_checks
  0.002026 getnstimeofday
  0.002006 idle_task
  0.001976 touch_atime
  0.001956 __wake_up_locked
  0.001927 sk_mem_charge
  0.001877 smp_apic_timer_interrupt
  0.001827 native_smp_send_reschedule
  0.001798 __tcp_fast_path_on
  0.001788 file_read_actor
  0.001768 _cond_resched
  0.001738 avc_policy_seqno
  0.001718 tcp_ack_snd_check
  0.001629 ip_send_check
  0.001619 account_system_time
  0.001579 __xapic_wait_icr_idle
  0.001579 get_stats
  0.001539 tcp_set_state
  0.001539 bictcp_state
  0.001529 tcp_fast_path_on
  0.001519 file_accessed
  0.001480 get_seconds
  0.001450 kernel_math_error
  0.001410 ktime_set
  0.001331 kmap_atomic
  0.001281 printk_tick
  0.001281 __next_cpu_nr
  0.001271 account_group_system_time
  0.001261 __mod_zone_page_state
  0.001222 weighted_cpuload
  0.001192 security_file_permission
  0.001162 ack_APIC_irq
  0.001152 __free_one_page
  0.001142 rcu_pending
  0.001142 drain_array
  0.001122 sched_clock_tick
  0.001122 csum_fold
  0.001102 ret_from_intr
  0.001083 retint_careful
  0.001073 need_resched
  0.001073 calc_delta_mine
  0.001043 tcp_v4_md5_do_del
  0.001043 PageActive
  0.001033 mark_page_accessed
  0.001033 ktime_get_ts
  0.001023 tcp_insert_write_queue_after
  0.001013 tcp_delack_timer
  0.001013 task_tick_fair
  0.000973 delay_tsc
  0.000963 nv_nic_irq_optimized
  0.000904 tick_periodic
  0.000894 skb_reserve
  0.000884 cache_reap
  0.000874 timespec_trunc
  0.000864 skb_header_release
  0.000854 zone_page_state_add
  0.000844 update_process_times
  0.000834 sk_rmem_schedule
  0.000824 find_busiest_group
  0.000804 current_fs_time
  0.000785 tick_handle_periodic
  0.000785 __sk_mem_schedule
  0.000785 irq_enter
  0.000755 use_cpu_writer_for_mount
  0.000755 tcp_ratehalving_spur_to_response
  0.000745 update_wall_time
  0.000745 tcp_sendpage
  0.000745 __alloc_pages_nodemask
  0.000725 ktime_get
  0.000725 irq_exit
  0.000705 inotify_inode_queue_event
  0.000665 set_pageblock_flags_group
  0.000646 inotify_dentry_parent_queue_event
  0.000626 ack_APIC_irq
  0.000606 write_profile
  0.000566 set_normalized_timespec
  0.000566 raise_softirq
  0.000526 task_cputime_zero
  0.000516 smp_reschedule_interrupt
  0.000516 __skb_insert
  0.000497 page_fault
  0.000497 __copy_user_nocache
  0.000487 run_local_timers
  0.000487 read_tsc
  0.000487 nf_unregister_hook
  0.000477 __rcu_pending
  0.000477 jiffies_to_usecs
  0.000457 timespec_to_ktime
  0.000437 __skb_trim
  0.000427 __call_rcu
  0.000417 free_pages_bulk
  0.000407 smp_call_function_interrupt
  0.000397 set_irq_regs
  0.000397 radix_tree_deref_slot
  0.000397 expand
  0.000387 handle_mm_fault
  0.000387 handle_IRQ_event
  0.000387 fput_light
  0.000377 refresh_cpu_vm_stats
  0.000377 n_tty_write
  0.000367 get_page
  0.000358 run_rebalance_domains
  0.000358 get_cpu_mask
  0.000348 task_hot
  0.000348 __skb_queue_after
  0.000348 retint_check
  0.000348 do_select
  0.000338 PageUptodate
  0.000338 copy_page_c
  0.000328 cond_resched
  0.000318 unmap_vmas
  0.000318 sk_mem_reclaim
  0.000318 rmqueue_bulk
  0.000318 reciprocal_value
  0.000318 irq_return
  0.000308 rb_first
  0.000308 alloc_skb
  0.000308 account_process_tick
  0.000298 net_enable_timestamp
  0.000298 clocksource_read
  0.000298 account_system_time_scaled
  0.000288 sched_slice
  0.000278 ip_compute_csum
  0.000278 constant_test_bit
  0.000278 constant_test_bit
  0.000268 set_curr_task_fair
  0.000268 note_interrupt
  0.000268 exit_idle
  0.000258 native_apic_mem_write
  0.000258 exit_intr
  0.000248 PageReferenced
  0.000238 usb_hcd_irq
  0.000238 __mnt_is_readonly
  0.000238 constant_test_bit
  0.000218 IRQ0xba_interrupt
  0.000218 handle_fasteoi_irq
  0.000209 raise_softirq_irqoff
  0.000209 __find_get_block
  0.000199 tcp_current_ssthresh
  0.000199 n_tty_receive_buf
  0.000189 wake_up_page
  0.000189 vgacon_save_screen
  0.000189 free_block
  0.000189 constant_test_bit
  0.000179 pagefault_disable
  0.000169 clocksource_get_next
  0.000169 __bitmap_weight
  0.000159 tty_ldisc_deref
  0.000159 tcp_write_timer
  0.000159 kmem_cache_alloc
  0.000159 free_alien_cache
  0.000159 ext3_mark_iloc_dirty
  0.000159 constant_test_bit
  0.000159 __bitmap_equal
  0.000149 transfer_objects
  0.000149 __rcu_process_callbacks
  0.000149 page_waitqueue
  0.000149 constant_test_bit
  0.000139 __rmqueue
  0.000139 release_pages
  0.000139 constant_test_bit
  0.000129 __tcp_checksum_complete
  0.000129 run_workqueue
  0.000129 poll_freewait
  0.000129 n_tty_read
  0.000129 iommu_area_free
  0.000129 generic_file_llseek
  0.000129 __cpus_setall
  0.000129 cond_resched_softirq
  0.000129 avc_node_populate
  0.000129 add_to_page_cache_lru
  0.000129 account_user_time
  0.000119 wait_consider_task
  0.000119 sys_select
  0.000119 round_jiffies_common
  0.000119 nv_start_xmit_optimized
  0.000119 core_sys_select
  0.000109 tcp_tso_segment
  0.000109 sigprocmask
  0.000109 proc_reg_read
  0.000109 path_to_nameidata
  0.000109 PageBuddy
  0.000109 ohci_irq
  0.000109 nv_tx_done_optimized
  0.000109 nv_msi_workaround
  0.000109 IRQ0xc2_interrupt
  0.000109 __ext3_get_inode_loc
  0.000109 account_group_user_time
  0.000099 __wake_up_sync
  0.000099 __up_read
  0.000099 update_vsyscall
  0.000099 memmove
  0.000099 kmalloc
  0.000099 ext3_get_blocks_handle
  0.000099 do_device_not_available
  0.000099 constant_test_bit
  0.000089 tcp_incr_quickack
  0.000089 smp_send_reschedule
  0.000089 remove_from_page_cache
  0.000089 rcu_process_callbacks
  0.000089 prepare_to_wait_exclusive
  0.000089 pde_users_dec
  0.000089 find_first_bit
  0.000089 constant_test_bit
  0.000089 common_interrupt
  0.000089 add_wait_queue
  0.000079 task_gtime
  0.000079 sys_lseek
  0.000079 start_this_handle
  0.000079 schedule_hrtimeout_range
  0.000079 __sched_fork
  0.000079 journal_put_journal_head
  0.000079 find_first_zero_bit
  0.000079 do_syslog
  0.000079 do_sync_write
  0.000079 constant_test_bit
  0.000079 ack_apic_level
  0.000070 write_seqlock
  0.000070 slab_get_obj
  0.000070 remove_wait_queue
  0.000070 pty_chars_in_buffer
  0.000070 ____pagevec_lru_add
  0.000070 lock_hrtimer_base
  0.000070 kstat_incr_irqs_this_cpu
  0.000070 journal_dirty_data
  0.000070 journal_add_journal_head
  0.000070 find_lock_page
  0.000070 copy_from_read_buf
  0.000070 bit_waitqueue
  0.000070 alloc_page_vma
  0.000060 vfs_write
  0.000060 tty_write
  0.000060 __strnlen_user
  0.000060 sk_mem_uncharge
  0.000060 rt_worker_func
  0.000060 radix_tree_preload
  0.000060 poll_select_copy_remaining
  0.000060 pagefault_enable
  0.000060 __mark_inode_dirty
  0.000060 lru_add_drain_all
  0.000060 lock_page
  0.000060 list_replace_init
  0.000060 journal_stop
  0.000060 iowrite8
  0.000060 hrtimer_forward
  0.000060 gart_unmap_single
  0.000060 find_vma
  0.000060 __down_read_trylock
  0.000060 do_page_fault
  0.000060 do_IRQ
  0.000060 create_empty_buffers
  0.000060 constant_test_bit
  0.000060 constant_test_bit
  0.000060 alloc_iommu
  0.000060 add_to_page_cache_locked
  0.000050 zero_fd_set
  0.000050 vsnprintf
  0.000050 unlock_page
  0.000050 tty_read
  0.000050 tty_poll
  0.000050 sock_poll
  0.000050 sock_def_error_report
  0.000050 set_wq_data
  0.000050 rcu_check_callbacks
  0.000050 radix_tree_node_rcu_free
  0.000050 pipe_poll
  0.000050 opost
  0.000050 n_tty_chars_in_buffer
  0.000050 __next_cpu
  0.000050 mutex_trylock
  0.000050 msecs_to_jiffies
  0.000050 mempool_alloc_slab
  0.000050 load_elf_binary
  0.000050 __link_path_walk
  0.000050 __journal_remove_journal_head
  0.000050 journal_commit_transaction
  0.000050 journal_cancel_revoke
  0.000050 irq_complete_move
  0.000050 irq_cfg
  0.000050 fsnotify_modify
  0.000050 __first_cpu
  0.000050 file_update_time
  0.000050 filemap_fault
  0.000050 ext3_new_blocks
  0.000050 ext3_mark_inode_dirty
  0.000050 do_wp_page
  0.000050 __do_fault
  0.000050 buffer_dirty
  0.000050 anon_vma_prepare
  0.000040 yield
  0.000040 wq_per_cpu
  0.000040 walk_page_buffers
  0.000040 __wake_up_bit
  0.000040 vma_adjust
  0.000040 tty_put_char
  0.000040 tty_paranoia_check
  0.000040 tcp_current_ssthresh
  0.000040 sys_write
  0.000040 sys_rt_sigprocmask
  0.000040 sock_no_bind
  0.000040 show_stat
  0.000040 SetPageSwapBacked
  0.000040 set_irq_regs
  0.000040 set_buffer_write_io_error
  0.000040 recalc_sigpending
  0.000040 radix_tree_delete
  0.000040 queue_delayed_work_on
  0.000040 pty_write
  0.000040 __pollwait
  0.000040 physflat_send_IPI_allbutself
  0.000040 page_zone
  0.000040 page_remove_rmap
  0.000040 page_is_file_cache
  0.000040 page_evictable
  0.000040 nv_get_empty_tx_slots
  0.000040 n_tty_poll
  0.000040 next_zone
  0.000040 next_online_pgdat
  0.000040 need_resched
  0.000040 mutex_unlock
  0.000040 mpol_needs_cond_ref
  0.000040 __lookup
  0.000040 journal_invalidatepage
  0.000040 journal_dirty_metadata
  0.000040 ioread8
  0.000040 input_available_p
  0.000040 inet_csk_reset_xmit_timer
  0.000040 get_fd_set
  0.000040 generic_write_checks
  0.000040 free_poll_entry
  0.000040 fput
  0.000040 __ext3_journal_stop
  0.000040 ext3_get_group_desc
  0.000040 ext3_get_block
  0.000040 do_mpage_readpage
  0.000040 __d_lookup
  0.000040 del_page_from_lru
  0.000040 __dec_zone_state
  0.000040 copy_user_generic
  0.000040 __bitmap_and
  0.000040 add_page_to_lru_list
  0.000040 account_user_time_scaled
  0.000040 account_steal_time
  0.000030 worker_thread
  0.000030 wake_up_bit
  0.000030 vmstat_update
  0.000030 vm_normal_page
  0.000030 tty_write_unlock
  0.000030 tty_write_lock
  0.000030 tty_wakeup
  0.000030 tty_ldisc_try
  0.000030 tty_ioctl
  0.000030 tag_get
  0.000030 sys_pread64
  0.000030 submit_bh
  0.000030 stop_this_cpu
  0.000030 sock_aio_write
  0.000030 sk_mem_reclaim
  0.000030 sk_backlog_rcv
  0.000030 show_interrupts
  0.000030 sg_next
  0.000030 seq_printf
  0.000030 send_remote_softirq
  0.000030 remove_vma
  0.000030 reg_delay
  0.000030 radix_tree_lookup
  0.000030 radix_tree_insert
  0.000030 proc_lookup_de
  0.000030 pipe_write
  0.000030 __percpu_counter_add
  0.000030 pci_map_single
  0.000030 nv_napi_poll
  0.000030 __next_node
  0.000030 native_send_call_func_ipi
  0.000030 mpage_readpages
  0.000030 mix_pool_bytes_extract
  0.000030 mii_rw
  0.000030 mempool_alloc
  0.000030 __make_request
  0.000030 jbd_lock_bh_state
  0.000030 iov_iter_copy_from_user_atomic
  0.000030 insert_work
  0.000030 hrtimer_try_to_cancel
  0.000030 get_dma_ops
  0.000030 __generic_file_aio_write_nolock
  0.000030 gart_map_sg
  0.000030 __fput
  0.000030 fixup_irqs
  0.000030 __find_get_block_slow
  0.000030 filp_close
  0.000030 ext3_get_branch
  0.000030 ext3_dirty_inode
  0.000030 ext3_block_to_path
  0.000030 do_get_write_access
  0.000030 delayed_work_timer_fn
  0.000030 csum_block_add
  0.000030 copy_process
  0.000030 copy_page_range
  0.000030 constant_test_bit
  0.000030 constant_test_bit
  0.000030 check_irqs_on
  0.000030 call_rcu
  0.000030 __brelse
  0.000030 _atomic_dec_and_lock
  0.000020 __xchg
  0.000020 vm_stat_account
  0.000020 vma_prio_tree_remove
  0.000020 tty_mode_ioctl
  0.000020 tty_audit_add_data
  0.000020 try_to_free_buffers
  0.000020 truncate_inode_pages_range
  0.000020 tcp_slow_start
  0.000020 task_curr
  0.000020 sys_setpgid
  0.000020 sys_rt_sigreturn
  0.000020 sys_getppid
  0.000020 strncpy_from_user
  0.000020 sock_put
  0.000020 smp_call_function
  0.000020 __sk_mem_reclaim
  0.000020 signal_wake_up
  0.000020 signal_pending
  0.000020 set_termios
  0.000020 SetPageUptodate
  0.000020 SetPageLRU
  0.000020 set_fd_set
  0.000020 set_bit
  0.000020 __send_IPI_shortcut
  0.000020 security_inode_need_killpriv
  0.000020 scsi_request_fn
  0.000020 sb_bread
  0.000020 restore_i387_xstate
  0.000020 __qdisc_run
  0.000020 pud_alloc
  0.000020 pmd_alloc
  0.000020 pfn_pte
  0.000020 pfifo_fast_enqueue
  0.000020 pfifo_fast_dequeue
  0.000020 pci_map_page
  0.000020 path_get
  0.000020 __pagevec_free
  0.000020 pagevec_add
  0.000020 PageUnevictable
  0.000020 page_mapping
  0.000020 nv_get_hw_stats
  0.000020 number
  0.000020 normalize_rt_tasks
  0.000020 __netif_tx_lock
  0.000020 mk_pid
  0.000020 memscan
  0.000020 memcpy_c
  0.000020 __lru_cache_add
  0.000020 __lookup_mnt
  0.000020 load_balance_rt
  0.000020 kthread_should_stop
  0.000020 journal_start
  0.000020 journal_remove_journal_head
  0.000020 __journal_file_buffer
  0.000020 jbd_unlock_bh_journal_head
  0.000020 itimer_get_remtime
  0.000020 irq_to_desc
  0.000020 iowrite32
  0.000020 inotify_remove_watch_locked
  0.000020 inode_permission
  0.000020 inode_has_perm
  0.000020 init_timer
  0.000020 goal_in_my_reservation
  0.000020 get_vma_policy
  0.000020 __get_free_pages
  0.000020 generic_sync_sb_inodes
  0.000020 gart_map_single
  0.000020 freezing
  0.000020 free_pgtables
  0.000020 free_pages_and_swap_cache
  0.000020 free_buffer_head
  0.000020 __follow_mount
  0.000020 flush_tlb_page
  0.000020 find_busiest_queue
  0.000020 file_has_perm
  0.000020 ext3_try_to_allocate
  0.000020 ext3_journal_start
  0.000020 __ext3_journal_dirty_metadata
  0.000020 ext3_file_write
  0.000020 enqueue_hrtimer
  0.000020 dup_mm
  0.000020 do_wait
  0.000020 do_vfs_ioctl
  0.000020 do_path_lookup
  0.000020 do_munmap
  0.000020 do_machine_check
  0.000020 do_lookup
  0.000020 do_follow_link
  0.000020 dma_unmap_single
  0.000020 __dec_zone_page_state
  0.000020 count_vm_event
  0.000020 constant_test_bit
  0.000020 constant_test_bit
  0.000020 compound_head
  0.000020 clear_buffer_jbddirty
  0.000020 clear_buffer_delay
  0.000020 claim_block
  0.000020 cascade
  0.000020 cancel_dirty_page
  0.000020 cache_grow
  0.000020 brelse
  0.000020 __block_prepare_write
  0.000020 __blocking_notifier_call_chain
  0.000020 blk_rq_map_sg
  0.000020 __bitmap_empty
  0.000020 __bitmap_andnot
  0.000020 anon_vma_unlink
  0.000010 zone_page_state
  0.000010 zero_user_segments
  0.000010 __xchg
  0.000010 __vma_link_rb
  0.000010 vma_link
  0.000010 vfs_llseek
  0.000010 __up_write
  0.000010 update_xtime_cache
  0.000010 unmap_underlying_metadata
  0.000010 unmap_region
  0.000010 unix_poll
  0.000010 tty_write_room
  0.000010 tty_unthrottle
  0.000010 tty_ldisc_ref_wait
  0.000010 tty_ldisc_ref
  0.000010 tty_fasync
  0.000010 tty_check_change
  0.000010 tty_chars_in_buffer
  0.000010 tty_audit_fork
  0.000010 truncate_complete_page
  0.000010 test_tsk_thread_flag
  0.000010 taskstats_exit
  0.000010 sys_writev
  0.000010 sys_readahead
  0.000010 sys_poll
  0.000010 sys_newstat
  0.000010 sys_nanosleep
  0.000010 sys_ioctl
  0.000010 syscall_trace_leave
  0.000010 sync_supers
  0.000010 stub_execve
  0.000010 split_page
  0.000010 sock_kfree_s
  0.000010 __sleep_on_page_lock
  0.000010 skip_atoi
  0.000010 signal_pending
  0.000010 signal_pending
  0.000010 sg_init_table
  0.000010 set_task_cpu
  0.000010 __set_page_dirty
  0.000010 SetPageActive
  0.000010 set_bit
  0.000010 seq_puts
  0.000010 selinux_task_setpgid
  0.000010 selinux_secctx_to_secid
  0.000010 selinux_sb_show_options
  0.000010 selinux_inode_permission
  0.000010 selinux_inode_need_killpriv
  0.000010 selinux_inode_free_security
  0.000010 selinux_inode_alloc_security
  0.000010 selinux_d_instantiate
  0.000010 security_vm_enough_memory
  0.000010 second_overflow
  0.000010 scsi_run_queue
  0.000010 __scsi_put_command
  0.000010 scsi_init_sgtable
  0.000010 scsi_end_request
  0.000010 schedule_tail
  0.000010 schedule_delayed_work
  0.000010 sb_any_quota_enabled
  0.000010 rt_hash
  0.000010 round_jiffies_relative
  0.000010 remove_hrtimer
  0.000010 __remove_hrtimer
  0.000010 __remove_from_page_cache
  0.000010 rcu_bh_qsctr_inc
  0.000010 radix_tree_tag_clear
  0.000010 radix_tree_gang_lookup_tag_slot
  0.000010 radix_tree_gang_lookup_slot
  0.000010 queue_delayed_work
  0.000010 qdisc_run
  0.000010 put_tty_queue_nolock
  0.000010 put_io_context
  0.000010 pty_write_room
  0.000010 pty_open
  0.000010 ptep_set_access_flags
  0.000010 profile_munmap
  0.000010 proc_pident_lookup
  0.000010 proc_get_inode
  0.000010 prio_tree_replace
  0.000010 prio_tree_remove
  0.000010 prio_tree_insert
  0.000010 pmd_none_or_clear_bad
  0.000010 pipe_release
  0.000010 pipe_read
  0.000010 pid_revalidate
  0.000010 pgd_alloc
  0.000010 pci_unmap_single
  0.000010 pci_read_config_dword
  0.000010 pci_conf1_write
  0.000010 pci_bus_read_config_dword
  0.000010 path_walk
  0.000010 page_zone
  0.000010 PageSwapCache
  0.000010 PageSwapCache
  0.000010 PageSwapCache
  0.000010 __page_set_anon_rmap
  0.000010 PagePrivate
  0.000010 PagePrivate
  0.000010 PagePrivate
  0.000010 page_add_file_rmap
  0.000010 on_each_cpu
  0.000010 nv_do_interrupt
  0.000010 net_tx_action
  0.000010 netif_start_queue
  0.000010 netif_carrier_ok
  0.000010 need_resched
  0.000010 need_iommu
  0.000010 native_pte_clear
  0.000010 native_io_delay
  0.000010 mutex_lock
  0.000010 mprotect_fixup
  0.000010 mod_zone_page_state
  0.000010 mntput_no_expire
  0.000010 mm_init
  0.000010 mmap_region
  0.000010 mempool_free
  0.000010 memcmp
  0.000010 mcheck_check_cpu
  0.000010 may_open
  0.000010 __lookup_tag
  0.000010 locks_remove_posix
  0.000010 locks_remove_flock
  0.000010 lock_buffer
  0.000010 load_elf_binary
  0.000010 load_balance_fair
  0.000010 ll_back_merge_fn
  0.000010 kzalloc
  0.000010 ktime_add_safe
  0.000010 kill_fasync
  0.000010 __journal_temp_unlink_buffer
  0.000010 journal_switch_revoke_table
  0.000010 __journal_remove_checkpoint
  0.000010 journal_get_write_access
  0.000010 journal_get_undo_access
  0.000010 journal_get_descriptor_buffer
  0.000010 journal_bmap
  0.000010 jbd_unlock_bh_state
  0.000010 jbd_unlock_bh_state
  0.000010 IRQ0xd2_interrupt
  0.000010 ip_append_data
  0.000010 iov_iter_advance
  0.000010 iov_fault_in_pages_read
  0.000010 iommu_area_alloc
  0.000010 inode_sub_bytes
  0.000010 inode_doinit_with_dentry
  0.000010 inode_add_bytes
  0.000010 __inc_zone_page_state
  0.000010 inc_zone_page_state
  0.000010 hweight_long
  0.000010 hweight64
  0.000010 hrtimer_wakeup
  0.000010 hrtimer_init
  0.000010 hash_64
  0.000010 half_md4_transform
  0.000010 __grab_cache_page
  0.000010 get_user_pages
  0.000010 get_signal_to_deliver
  0.000010 get_random_int
  0.000010 getname
  0.000010 get_empty_filp
  0.000010 __getblk
  0.000010 generic_permission
  0.000010 generic_make_request
  0.000010 generic_fillattr
  0.000010 generic_file_open
  0.000010 generic_file_llseek_unlocked
  0.000010 generic_file_buffered_write
  0.000010 generic_file_aio_write
  0.000010 generic_cont_expand_simple
  0.000010 generic_block_bmap
  0.000010 freezing
  0.000010 free_swap_cache
  0.000010 free_pid
  0.000010 free_pgd_range
  0.000010 free_pages
  0.000010 flush_old_exec
  0.000010 first_online_pgdat
  0.000010 find_vma_prepare
  0.000010 find_task_by_pid_type_ns
  0.000010 find_next_zero_bit
  0.000010 find_inode_fast
  0.000010 file_remove_suid
  0.000010 file_mask_to_av
  0.000010 file_free_rcu
  0.000010 __FD_CLR
  0.000010 ext3_write_begin
  0.000010 ext3_try_to_allocate_with_rsv
  0.000010 ext3_ordered_write_end
  0.000010 ext3_journalled_set_page_dirty
  0.000010 ext3_invalidatepage
  0.000010 ext3_iget_acl
  0.000010 ext3_get_inode_flags
  0.000010 ext3_free_data
  0.000010 ext3_discard_reservation
  0.000010 exit_thread
  0.000010 exit_task_namespaces
  0.000010 exit_sem
  0.000010 end_that_request_last
  0.000010 end_buffer_write_sync
  0.000010 end_buffer_async_write
  0.000010 elv_rb_del
  0.000010 elv_queue_empty
  0.000010 elv_merged_request
  0.000010 elv_completed_request
  0.000010 elf_map
  0.000010 echo_char
  0.000010 e1000_watchdog
  0.000010 e1000_read_phy_reg
  0.000010 __drain_alien_cache
  0.000010 __d_path
  0.000010 __down_write_nested
  0.000010 __down_write
  0.000010 double_rq_lock
  0.000010 do_timer
  0.000010 do_sys_open
  0.000010 do_sigaltstack
  0.000010 do_sigaction
  0.000010 do_setitimer
  0.000010 do_pipe_flags
  0.000010 __do_page_cache_readahead
  0.000010 do_notify_parent
  0.000010 do_filp_open
  0.000010 do_exit
  0.000010 dnotify_flush
  0.000010 d_kill
  0.000010 destroy_inode
  0.000010 dequeue_signal
  0.000010 de_put
  0.000010 delayacct_end
  0.000010 create_write_pipe
  0.000010 create_workqueue_thread
  0.000010 __cpus_equal
  0.000010 cpu_quiet
  0.000010 __cpu_clear
  0.000010 __cpu_clear
  0.000010 count
  0.000010 copy_thread
  0.000010 copy_namespaces
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 __cond_resched
  0.000010 clocksource_forward_now
  0.000010 __clear_user
  0.000010 clear_inode
  0.000010 clear_buffer_new
  0.000010 clear_bit
  0.000010 clear_bit
  0.000010 check_for_bios_corruption
  0.000010 __cfq_slice_expired
  0.000010 cfq_set_request
  0.000010 cfq_dispatch_requests
  0.000010 cfq_completed_request
  0.000010 cap_set_effective
  0.000010 can_share_swap_page
  0.000010 bvec_alloc_bs
  0.000010 buffer_uptodate
  0.000010 buffer_mapped
  0.000010 buffer_locked
  0.000010 buffer_jbd
  0.000010 buffer_jbd
  0.000010 brelse
  0.000010 __bread
  0.000010 blk_invoke_request_fn
  0.000010 __blk_complete_request
  0.000010 blk_add_trace_generic
  0.000010 blk_add_trace_bio
  0.000010 bit_spin_lock
  0.000010 bio_put
  0.000010 bio_alloc_bioset
  0.000010 bdi_read_congested
  0.000010 balance_runtime
  0.000010 balance_dirty_pages_ratelimited_nr
  0.000010 audit_log_task_context
  0.000010 ata_sff_qc_prep
  0.000010 ata_scsi_queuecmd
  0.000010 ata_link_max_devices
  0.000010 ata_get_xlat_func
  0.000010 arp_process
  0.000010 arch_pick_mmap_layout
  0.000010 arch_irq_stat_cpu
  0.000010 arch_dup_task_struct
  0.000010 alloc_pid
  0.000010 alloc_fdtable
  0.000010 alloc_fd
  0.000010 add_mm_rss
  0.000010 acct_collect


* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 18:49                           ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger


* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:

> The place for the sock_rfree() hit looks a bit weird, and i'll 
> investigate it now a bit more to place the real overhead point 
> properly. (i already mapped the test-bit overhead: that comes from 
> napi_disable_pending())

ok, here's a new set of profiles. (again for tbench 64-thread on a 
16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i 
posted before.)

Here are the per major subsystem percentages:

           NET       overhead ( 5786945/10096751): 57.31%
           security  overhead (  925933/10096751):  9.17%
           usercopy  overhead (  837887/10096751):  8.30%
           sched     overhead (  753662/10096751):  7.46%
           syscall   overhead (  268809/10096751):  2.66%
           IRQ       overhead (  266500/10096751):  2.64%
           slab      overhead (  180258/10096751):  1.79%
           timer     overhead (   92986/10096751):  0.92%
           pagealloc overhead (   87381/10096751):  0.87%
           VFS       overhead (   53295/10096751):  0.53%
           PID       overhead (   44469/10096751):  0.44%
           pagecache overhead (   33452/10096751):  0.33%
           gtod      overhead (   11064/10096751):  0.11%
           IDLE      overhead (       0/10096751):  0.00%
---------------------------------------------------------
                         left (  753878/10096751):  7.47%

The breakdown is very similar to what i sent before, within noise.
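
Each percentage in the table above is just that subsystem's hit count divided by the total number of profiler hits. A minimal Python sketch of the arithmetic follows; the counts are copied from a few rows of the table, while the names `TOTAL`, `HITS` and `share` are made up for illustration and are not part of any profiling tool:

```python
# Hit counts copied from the subsystem table; TOTAL is the denominator
# that appears on every row of that table.
TOTAL = 10096751
HITS = {
    "NET": 5786945,
    "security": 925933,
    "usercopy": 837887,
    "sched": 753662,
    "left": 753878,
}


def share(hits, total=TOTAL):
    """Return hits as a percentage of total, rounded to two decimals."""
    return round(100.0 * hits / total, 2)


# Print rows in roughly the same layout as the table.
for name, hits in HITS.items():
    print(f"{name:<10} overhead ({hits:>8}/{TOTAL}): {share(hits):5.2f}%")
```

Only five rows are included to keep the sketch short; the remaining rows work the same way.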

[ 'left' is random overhead from all around the place - i categorized 
  the 500 most expensive functions in the profile per subsystem.
  I stopped short of doing it for all 1300+ functions: it's rather
  laborious manual work even with hefty use of regex patterns.
  It's also less meaningful in practice: the trend in the first 500
  functions is present in the remaining 800 functions as well. I 
  watched the breakdown evolve as i increased the coverage - in 
  practice it is the first 100 functions that matter - it just doesn't 
  change after that. ]
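
The per-subsystem categorization described in the note above could be approximated along the following lines. This is only a sketch: the actual regex patterns were not posted, so every pattern here is a guess made for illustration, and `SUBSYSTEM_PATTERNS`, `classify` and `breakdown` are hypothetical names. The sample rows are taken from the readprofile listing in this mail.

```python
import re
from collections import Counter

# Hypothetical patterns, for illustration only. The real categorization
# covered about 500 functions and its patterns were not posted.
SUBSYSTEM_PATTERNS = [
    ("NET", re.compile(r"^(__)?(tcp|ip|skb|sock|netif|inet)_")),
    ("security", re.compile(r"^(avc|selinux|sel|audit)_")),
    ("usercopy", re.compile(r"^copy_(user|to_user|from_user)")),
    ("sched", re.compile(r"^(schedule|__switch_to|pick_next_task)")),
]


def classify(func):
    """Return the first subsystem whose pattern matches, else 'left'."""
    for subsystem, pattern in SUBSYSTEM_PATTERNS:
        if pattern.search(func):
            return subsystem
    return "left"


def breakdown(profile):
    """Sum the per-function percentages for each subsystem."""
    totals = Counter()
    for func, percent in profile:
        totals[classify(func)] += percent
    return totals


# A few (function, percent) rows from the readprofile output.
sample = [
    ("copy_user_generic_string", 7.253355),
    ("avc_has_perm_noaudit", 3.934833),
    ("ip_queue_xmit", 3.356152),
    ("skb_release_data", 3.038025),
    ("tcp_ack", 1.997533),
    ("schedule", 1.203205),
    ("load_cr3", 0.747184),
]

for subsystem, percent in breakdown(sample).most_common():
    print(f"{subsystem:<10} {percent:9.6f}")
```

Anything that no pattern matches (here `load_cr3`) ends up in 'left', which mirrors how the uncategorized remainder is reported in the table.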

The readprofile output below seems structured in a more useful way now 
- i tweaked compiler options to have the profiler hits spread out in a 
more meaningful way. I collected 10 million NMI profiler hits, and 
normalized the readprofile output up to 100%.
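
For reference, normalizing raw per-function hit counts to percentages could look roughly like the sketch below. The hit counts and function names are invented for illustration, and `normalize` and `render` are hypothetical helpers, not part of readprofile; only the column layout imitates the listing in this mail.

```python
def normalize(raw_hits):
    """Convert per-function hit counts to percentages, largest first."""
    total = sum(raw_hits.values())
    rows = sorted(raw_hits.items(), key=lambda kv: kv[1], reverse=True)
    return [(100.0 * hits / total, func) for func, hits in rows]


def render(rows):
    """Format rows as a total line followed by one line per function."""
    lines = [f"{100.0:10.6f} total"]
    lines += [f"{pct:10.6f} {func}" for pct, func in rows]
    return "\n".join(lines)


# Invented hit counts, for illustration only.
raw = {"func_b": 300, "func_a": 600, "func_c": 100}
print(render(normalize(raw)))
```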

[ I'll post per function analysis as i complete them, as a reply to
  this mail. ]

	Ingo

100.000000 total
................
  7.253355 copy_user_generic_string
  3.934833 avc_has_perm_noaudit
  3.356152 ip_queue_xmit
  3.038025 skb_release_data
  2.118525 skb_release_head_state
  1.997533 tcp_ack
  1.833688 tcp_recvmsg
  1.717771 eth_type_trans
  1.673249 __inet_lookup_established
  1.508888 system_call
  1.469183 tcp_current_mss
  1.431553 tcp_transmit_skb
  1.385125 tcp_sendmsg
  1.327643 tcp_v4_rcv
  1.292328 nf_hook_thresh
  1.203205 schedule
  1.059501 nf_hook_slow
  1.027373 constant_test_bit
  0.945183 sock_rfree
  0.922748 __switch_to
  0.911605 netif_rx
  0.876270 register_gifconf
  0.788200 ip_local_deliver_finish
  0.781467 dev_queue_xmit
  0.766530 constant_test_bit
  0.758208 _local_bh_enable_ip
  0.747184 load_cr3
  0.704341 memset_c
  0.671260 sysret_check
  0.651845 ip_finish_output2
  0.620204 audit_free_names
  0.617781 audit_syscall_exit
  0.615149 skb_copy_datagram_iovec
  0.613848 selinux_socket_sock_rcv_skb
  0.606995 constant_test_bit
  0.593936 __tcp_push_pending_frames
  0.592198 tcp_cleanup_rbuf
  0.574093 ip_rcv
  0.567886 netif_receive_skb
  0.563377 get_page_from_freelist
  0.557657 tcp_event_data_recv
  0.539274 ip_local_deliver
  0.534130 sys_recvfrom
  0.512321 __tcp_select_window
  0.498427 tcp_rcv_established
  0.494862 sys_sendto
  0.487473 audit_syscall_entry
  0.478495 sched_clock_cpu
  0.474861 kfree
  0.466310 tcp_established_options
  0.461384 net_rx_action
  0.447162 __mod_timer
  0.442078 ip_rcv_finish
  0.441631 find_pid_ns
  0.441124 sk_wait_data
  0.423943 __sock_recvmsg
  0.422126 selinux_parse_skb
  0.417975 __napi_schedule
  0.414082 __do_softirq
  0.403604 task_rq_lock
  0.380792 nf_iterate
  0.377614 select_task_rq_fair
  0.374973 sock_sendmsg
  0.374635 kmem_cache_alloc_node
  0.368775 avc_has_perm
  0.368706 local_bh_disable
  0.361834 release_sock
  0.346400 sock_common_recvmsg
  0.342825 skb_clone
  0.338704 __alloc_skb
  0.326488 do_softirq
  0.323410 lock_sock_nested
  0.322129 __copy_skb_header
  0.316835 put_page
  0.310966 selinux_ip_postroute
  0.306229 sel_netport_sid
  0.299863 try_to_wake_up
  0.296288 process_backlog
  0.294818 __inet_lookup
  0.294778 thread_return
  0.293219 cfs_rq_of
  0.292315 internal_add_timer
  0.292305 tcp_rcv_space_adjust
  0.281053 constant_test_bit
  0.278779 local_bh_enable
  0.272910 *unknown*
  0.269593 schedule_timeout
  0.261846 tcp_v4_md5_lookup
  0.260992 __ip_local_out
  0.255868 __enqueue_entity
  0.253931 avc_audit
  0.252004 finish_task_switch
  0.249263 audit_get_context
  0.248290 sockfd_lookup_light
  0.247416 virt_to_head_page
  0.244149 tcp_options_write
  0.243603 memcpy_toiovec
  0.243434 sock_recvmsg
  0.242599 call_softirq
  0.242391 __unlazy_fpu
  0.236412 fput_light
  0.235628 ret_from_sys_call
  0.234933 sk_reset_timer
  0.228358 math_state_restore
  0.227117 socket_has_perm
  0.223492 virt_to_cache
  0.219063 __cache_free
  0.216401 update_curr
  0.216232 tcp_v4_send_check
  0.213978 audit_free_aux
  0.213223 tcp_v4_do_rcv
  0.212975 __kfree_skb
  0.211137 dev_hard_start_xmit
  0.209052 tcp_rtt_estimator
  0.207999 netif_needs_gso
  0.207662 __update_sched_clock
  0.207284 rb_erase
  0.204861 enqueue_task_fair
  0.203490 skb_release_all
  0.203252 tcp_send_delayed_ack
  0.203232 inet_ehashfn
  0.199846 sel_netport_find
  0.195396 system_call_after_swapgs
  0.186756 lock_timer_base
  0.186687 pick_next_task_fair
  0.183986 mod_timer
  0.182982 loopback_xmit
  0.182605 native_read_tsc
  0.181195 skb_set_owner_r
  0.179248 switch_mm
  0.175584 set_next_entity
  0.173329 raw_local_deliver
  0.171641 sys_kill
  0.164510 dequeue_task_fair
  0.161938 clear_bit
  0.160528 sock_def_readable
  0.157628 __tcp_ack_snd_check
  0.156893 skb_can_coalesce
  0.156556 tcp_snd_wnd_test
  0.155662 ip_output
  0.150627 sk_stream_alloc_skb
  0.150219 cpu_sdc
  0.149425 sysret_careful
  0.148760 tcp_data_snd_check
  0.147816 auditsys
  0.147419 pskb_may_pull
  0.147151 fget_light
  0.143774 tcp_cwnd_test
  0.143029 rb_insert_color
  0.142265 __wake_up
  0.141808 tcp_bound_to_half_wnd
  0.138600 __sk_dst_check
  0.138431 free_hot_cold_page
  0.137954 unroll_tree_refs
  0.137080 __skb_unlink
  0.135124 __sock_sendmsg
  0.135064 get_pageblock_flags_group
  0.132701 kmem_cache_free
  0.128152 bictcp_cong_avoid
  0.127874 __napi_complete
  0.127527 ____cache_alloc
  0.127368 tcp_is_cwnd_limited
  0.127278 find_vpid
  0.126941 constant_test_bit
  0.126504 sk_mem_charge
  0.126255 __alloc_pages_internal
  0.125977 dst_release
  0.125521 hash_64
  0.124895 put_prev_task_fair
  0.123802 netlbl_enabled
  0.122829 sched_clock
  0.122640 skb_push
  0.122035 __phys_addr
  0.121161 dput
  0.120515 tcp_prequeue_process
  0.118916 __skb_dequeue
  0.117715 selinux_socket_sendmsg
  0.117536 __inc_zone_state
  0.115907 sk_wake_async
  0.113504 selinux_ipv4_output
  0.113017 sel_netif_sid
  0.112431 skb_reset_network_header
  0.111170 check_preempt_wakeup
  0.111061 bictcp_acked
  0.110882 sel_netnode_find
  0.109978 update_min_vruntime
  0.109889 resched_task
  0.109879 current_kernel_time
  0.109432 tcp_checksum_complete_user
  0.107476 ip_dont_fragment
  0.107386 sysret_audit
  0.106979 inet_csk_reset_xmit_timer
  0.106006 skb_entail
  0.105777 sysret_signal
  0.105420 avc_hash
  0.105251 __skb_clone
  0.105211 tcp_init_tso_segs
  0.103523 __dequeue_entity
  0.101715 PageLRU
  0.101378 tcp_parse_aligned_timestamp
  0.101219 __xchg
  0.100544 constant_test_bit
  0.097991 __kmalloc
  0.097584 test_tsk_thread_flag
  0.097475 autoremove_wake_function
  0.095747 selinux_task_kill
  0.094416 get_page
  0.093353 dequeue_task
  0.092728 __local_bh_disable
  0.091943 selinux_netlbl_sock_rcv_skb
  0.091655 path_put
  0.090970 skb_headroom
  0.090950 PageTail
  0.090642 dst_destroy
  0.090523 netpoll_rx
  0.089589 skb_header_pointer
  0.085935 security_socket_recvmsg
  0.084008 alloc_pages_current
  0.083184 compare_ether_addr
  0.082479 rb_next
  0.082439 sk_wmem_schedule
  0.081635 next_zones_zonelist
  0.080135 tcp_cwnd_validate
  0.079877 tcp_event_new_data_sent
  0.079817 fcheck_files
  0.079082 ip_skb_dst_mtu
  0.078804 ip_finish_output
  0.078278 wakeup_preempt_entity
  0.077026 sel_netif_find
  0.076788 __skb_queue_tail
  0.076570 sock_flag
  0.076520 tcp_win_from_space
  0.076510 zone_watermark_ok
  0.076282 sel_netnode_sid
  0.076162 policy_zonelist
  0.074732 __wake_up_common
  0.074613 compound_head
  0.074593 task_has_perm
  0.073243 __find_general_cachep
  0.073064 tcp_push
  0.072925 skb_cloned
  0.072309 pskb_may_pull
  0.071852 TCP_ECN_check_ce
  0.071495 cap_task_to_inode
  0.070770 default_wake_function
  0.069429 xfrm4_policy_check
  0.069091 tcp_parse_md5sig_option
  0.068287 tcp_v4_md5_do_lookup
  0.068059 tcp_v4_tw_remember_stamp
  0.067344 tcp_ca_event
  0.067125 tcp_ca_event
  0.065457 place_entity
  0.065318 write_seqlock
  0.065089 device_not_available
  0.065069 test_ti_thread_flag
  0.063878 tcp_set_skb_tso_segs
  0.063550 selinux_netlbl_inode_permission
  0.063391 sock_wfree
  0.063311 prepare_to_wait
  0.058872 pid_vnr
  0.058803 __cycles_2_ns
  0.057631 ip_local_out
  0.057333 tcp_ack_saw_tstamp
  0.056896 copy_to_user
  0.056628 set_bit
  0.055913 free_pages_check
  0.054969 tcp_rcv_rtt_measure_ts
  0.053797 init_rootdomain
  0.053708 selinux_socket_recvmsg
  0.053698 pid_nr_ns
  0.053629 sk_eat_skb
  0.052814 _local_bh_enable
  0.052645 nf_hook_thresh
  0.052516 sched_info_queued
  0.052457 enqueue_task
  0.052228 sk_filter
  0.052159 __cpu_clear
  0.051980 local_bh_enable_ip
  0.050292 update_rq_clock
  0.048981 task_tgid_vnr
  0.048881 copy_from_user
  0.048782 tcp_parse_options
  0.048484 lock_sock
  0.047779 net_timestamp
  0.047044 open_softirq
  0.046955 tcp_win_from_space
  0.045981 __skb_dequeue
  0.043846 getboottime
  0.043777 account_group_exec_runtime
  0.043519 can_checksum_protocol
  0.043469 set_user_nice
  0.042784 skb_fill_page_desc
  0.042247 security_socket_sendmsg
  0.041989 read_profile
  0.041930 tcp_validate_incoming
  0.041612 check_preempt_curr
  0.041413 skb_pull
  0.041026 generic_smp_call_function_interrupt
  0.041016 calc_delta_fair
  0.040936 clear_buddies
  0.040768 tcp_data_queue
  0.040698 page_count
  0.039695 lock_sock
  0.039099 skb_headroom
  0.038851 system_call_fastpath
  0.038622 zone_statistics
  0.037500 tcp_sack_extend
  0.037381 __kmalloc_node
  0.036587 first_zones_zonelist
  0.036497 mntput
  0.036179 pick_next_task
  0.035991 kmap
  0.035911 sock_put
  0.035613 deactivate_task
  0.035027 __nr_to_section
  0.033985 page_zone
  0.033190 native_load_tls
  0.032882 netif_tx_queue_stopped
  0.032713 __skb_insert
  0.032187 sock_flag
  0.031988 check_kill_permission
  0.031790 policy_nodemask
  0.031621 detach_timer
  0.030558 inet_csk_clear_xmit_timer
  0.030469 task_rq_unlock
  0.029883 tcp_nagle_test
  0.029744 tracesys
  0.028383 virt_to_slab
  0.028115 tcp_v4_check
  0.028046 __cpu_set
  0.027658 page_get_cache
  0.027063 tcp_store_ts_recent
  0.027053 __skb_pull
  0.026953 gfp_zone
  0.026586 sock_rcvlowat
  0.026576 csum_partial
  0.026397 init_waitqueue_head
  0.026109 finish_wait
  0.026040 kill_pid_info
  0.025404 tcp_full_space
  0.024888 __skb_queue_before
  0.024550 dst_confirm
  0.022603 inet_ehash_bucket
  0.021888 activate_task
  0.021650 tcp_rto_min
  0.021283 d_callback
  0.020965 signal_pending
  0.020925 avc_node_free
  0.020915 empty_bucket
  0.020746 group_send_sig_info
  0.020657 skb_reset_transport_header
  0.020061 sock_put
  0.019992 signal_pending_state
  0.019684 tcp_sync_mss
  0.019346 skb_network_offset
  0.019276 skb_split
  0.018988 tcp_adjust_fackets_out
  0.018204 tcp_fast_path_check
  0.017727 __skb_unlink
  0.017687 napi_disable_pending
  0.017678 sg_set_page
  0.017022 get_pageblock_bitmap
  0.016972 tcp_cong_avoid
  0.016962 pid_task
  0.016754 skb_set_tail_pointer
  0.016039 selinux_ipv4_postroute
  0.015930 idle_cpu
  0.015632 skb_reset_network_header
  0.015552 __count_vm_events
  0.015483 source_load
  0.014867 __skb_unlink
  0.014738 skb_reset_transport_header
  0.014599 set_bit
  0.014241 audit_zero_context
  0.014231 zone_page_state
  0.014152 clear_bit
  0.013874 PageSlab
  0.013546 __memset
  0.013238 get_pageblock_migratetype
  0.012623 __rb_rotate_right
  0.012543 kmem_find_general_cachep
  0.012414 __kprobes_text_start
  0.012344 security_sock_rcv_skb
  0.012344 node_zonelist
  0.012335 dnotify_parent
  0.012096 skb_headroom
  0.011778 tcp_push_one
  0.011540 mnt_want_write
  0.011143 kmalloc
  0.011073 retint_swapgs
  0.010954 __rb_rotate_left
  0.010805 check_pgd_range
  0.010785 tcp_mss_split_point
  0.010755 migrate_timer_list
  0.010338 __send_IPI_dest_field
  0.010229 reschedule_interrupt
  0.010179 sock_flag
  0.009882 smp_call_function_mask
  0.009673 test_tsk_need_resched
  0.009564 tcp_urg
  0.009504 generic_file_aio_read
  0.009176 PageReserved
  0.009147 net_invalid_timestamp
  0.009087 __node_set
  0.008749 do_tcp_setsockopt
  0.008730 set_tsk_thread_flag
  0.008720 tcp_enter_loss
  0.008422 sock_error
  0.008362 target_load
  0.008302 crypto_hash_update
  0.008104 PageReadahead
  0.008044 tcp_poll
  0.007915 tcp_checksum_complete
  0.007329 tcp_snd_test
  0.007309 selinux_file_permission
  0.007290 sel_netif_destroy
  0.007220 put_pages_list
  0.006992 dst_output
  0.006743 prepare_to_copy
  0.006694 tcp_init_cwnd
  0.006555 clear_bit
  0.006535 set_bit
  0.006425 normal_prio
  0.006366 msleep
  0.006346 error_sti
  0.006336 tcp_rcv_rtt_update
  0.006167 tcp_send_ack
  0.005989 tcp_init_nondata_skb
  0.005720 kfree_skb
  0.005502 call_function_interrupt
  0.005413 __count_vm_event
  0.005403 __skb_checksum_complete_head
  0.005363 page_cache_get_speculative
  0.005323 dev_kfree_skb_irq
  0.005174 skb_store_bits
  0.004956 cpu_avg_load_per_task
  0.004916 dev_cpu_callback
  0.004807 __kmem_cache_destroy
  0.004777 tcp_init_metrics
  0.004777 io_schedule
  0.004777 find_get_page
  0.004707 eth_header_parse
  0.004688 cap_task_kill
  0.004678 error_exit
  0.004668 rb_prev
  0.004658 tso_fragment
  0.004648 mmdrop
  0.004628 skb_reset_tail_pointer
  0.004598 apic_timer_interrupt
  0.004588 clear_bit
  0.004519 tcp_simple_retransmit
  0.004449 get_max_files
  0.004370 sk_stop_timer
  0.004340 tcp_reset
  0.004251 netlbl_cache_add
  0.004201 tcp_add_reno_sack
  0.004151 __pskb_trim_head
  0.004102 __profile_flip_buffers
  0.004092 sk_common_release
  0.004052 audit_copy_inode
  0.003953 eth_change_mtu
  0.003943 vfs_read
  0.003923 run_timer_softirq
  0.003843 mnt_drop_write
  0.003814 clear_page_c
  0.003804 do_sync_read
  0.003744 unset_migratetype_isolate
  0.003714 sk_stream_moderate_sndbuf
  0.003545 tcp_try_rmem_schedule
  0.003476 native_apic_mem_write
  0.003466 sys_read
  0.003446 skb_checksum
  0.003436 timer_set_base
  0.003426 security_task_kill
  0.003416 __flow_cache_shrink
  0.003406 __skb_checksum_complete
  0.003277 alloc_skb
  0.003267 physflat_send_IPI_mask
  0.003218 skb_gso_ok
  0.003178 constant_test_bit
  0.003168 find_next_bit
  0.003158 selinux_netlbl_skbuff_getsid
  0.003118 constant_test_bit
  0.003099 pull_task
  0.003079 hrtimer_run_queues
  0.003049 free_hot_page
  0.003009 scheduler_tick
  0.002900 set_32bit_tls
  0.002890 tcp_acceptable_seq
  0.002811 rw_verify_area
  0.002751 radix_tree_lookup_slot
  0.002731 zero_user_segment
  0.002731 sock_common_setsockopt
  0.002612 __load_balance_iterator
  0.002473 run_posix_cpu_timers
  0.002264 task_utime
  0.002254 switched_to_fair
  0.002185 fsnotify_access
  0.002145 __rmqueue_smallest
  0.002125 __schedule_bug
  0.002095 __task_rq_lock
  0.002086 tcp_may_update_window
  0.002076 restore_args
  0.002066 hrtimer_run_pending
  0.002056 generic_segment_checks
  0.002026 getnstimeofday
  0.002006 idle_task
  0.001976 touch_atime
  0.001956 __wake_up_locked
  0.001927 sk_mem_charge
  0.001877 smp_apic_timer_interrupt
  0.001827 native_smp_send_reschedule
  0.001798 __tcp_fast_path_on
  0.001788 file_read_actor
  0.001768 _cond_resched
  0.001738 avc_policy_seqno
  0.001718 tcp_ack_snd_check
  0.001629 ip_send_check
  0.001619 account_system_time
  0.001579 __xapic_wait_icr_idle
  0.001579 get_stats
  0.001539 tcp_set_state
  0.001539 bictcp_state
  0.001529 tcp_fast_path_on
  0.001519 file_accessed
  0.001480 get_seconds
  0.001450 kernel_math_error
  0.001410 ktime_set
  0.001331 kmap_atomic
  0.001281 printk_tick
  0.001281 __next_cpu_nr
  0.001271 account_group_system_time
  0.001261 __mod_zone_page_state
  0.001222 weighted_cpuload
  0.001192 security_file_permission
  0.001162 ack_APIC_irq
  0.001152 __free_one_page
  0.001142 rcu_pending
  0.001142 drain_array
  0.001122 sched_clock_tick
  0.001122 csum_fold
  0.001102 ret_from_intr
  0.001083 retint_careful
  0.001073 need_resched
  0.001073 calc_delta_mine
  0.001043 tcp_v4_md5_do_del
  0.001043 PageActive
  0.001033 mark_page_accessed
  0.001033 ktime_get_ts
  0.001023 tcp_insert_write_queue_after
  0.001013 tcp_delack_timer
  0.001013 task_tick_fair
  0.000973 delay_tsc
  0.000963 nv_nic_irq_optimized
  0.000904 tick_periodic
  0.000894 skb_reserve
  0.000884 cache_reap
  0.000874 timespec_trunc
  0.000864 skb_header_release
  0.000854 zone_page_state_add
  0.000844 update_process_times
  0.000834 sk_rmem_schedule
  0.000824 find_busiest_group
  0.000804 current_fs_time
  0.000785 tick_handle_periodic
  0.000785 __sk_mem_schedule
  0.000785 irq_enter
  0.000755 use_cpu_writer_for_mount
  0.000755 tcp_ratehalving_spur_to_response
  0.000745 update_wall_time
  0.000745 tcp_sendpage
  0.000745 __alloc_pages_nodemask
  0.000725 ktime_get
  0.000725 irq_exit
  0.000705 inotify_inode_queue_event
  0.000665 set_pageblock_flags_group
  0.000646 inotify_dentry_parent_queue_event
  0.000626 ack_APIC_irq
  0.000606 write_profile
  0.000566 set_normalized_timespec
  0.000566 raise_softirq
  0.000526 task_cputime_zero
  0.000516 smp_reschedule_interrupt
  0.000516 __skb_insert
  0.000497 page_fault
  0.000497 __copy_user_nocache
  0.000487 run_local_timers
  0.000487 read_tsc
  0.000487 nf_unregister_hook
  0.000477 __rcu_pending
  0.000477 jiffies_to_usecs
  0.000457 timespec_to_ktime
  0.000437 __skb_trim
  0.000427 __call_rcu
  0.000417 free_pages_bulk
  0.000407 smp_call_function_interrupt
  0.000397 set_irq_regs
  0.000397 radix_tree_deref_slot
  0.000397 expand
  0.000387 handle_mm_fault
  0.000387 handle_IRQ_event
  0.000387 fput_light
  0.000377 refresh_cpu_vm_stats
  0.000377 n_tty_write
  0.000367 get_page
  0.000358 run_rebalance_domains
  0.000358 get_cpu_mask
  0.000348 task_hot
  0.000348 __skb_queue_after
  0.000348 retint_check
  0.000348 do_select
  0.000338 PageUptodate
  0.000338 copy_page_c
  0.000328 cond_resched
  0.000318 unmap_vmas
  0.000318 sk_mem_reclaim
  0.000318 rmqueue_bulk
  0.000318 reciprocal_value
  0.000318 irq_return
  0.000308 rb_first
  0.000308 alloc_skb
  0.000308 account_process_tick
  0.000298 net_enable_timestamp
  0.000298 clocksource_read
  0.000298 account_system_time_scaled
  0.000288 sched_slice
  0.000278 ip_compute_csum
  0.000278 constant_test_bit
  0.000278 constant_test_bit
  0.000268 set_curr_task_fair
  0.000268 note_interrupt
  0.000268 exit_idle
  0.000258 native_apic_mem_write
  0.000258 exit_intr
  0.000248 PageReferenced
  0.000238 usb_hcd_irq
  0.000238 __mnt_is_readonly
  0.000238 constant_test_bit
  0.000218 IRQ0xba_interrupt
  0.000218 handle_fasteoi_irq
  0.000209 raise_softirq_irqoff
  0.000209 __find_get_block
  0.000199 tcp_current_ssthresh
  0.000199 n_tty_receive_buf
  0.000189 wake_up_page
  0.000189 vgacon_save_screen
  0.000189 free_block
  0.000189 constant_test_bit
  0.000179 pagefault_disable
  0.000169 clocksource_get_next
  0.000169 __bitmap_weight
  0.000159 tty_ldisc_deref
  0.000159 tcp_write_timer
  0.000159 kmem_cache_alloc
  0.000159 free_alien_cache
  0.000159 ext3_mark_iloc_dirty
  0.000159 constant_test_bit
  0.000159 __bitmap_equal
  0.000149 transfer_objects
  0.000149 __rcu_process_callbacks
  0.000149 page_waitqueue
  0.000149 constant_test_bit
  0.000139 __rmqueue
  0.000139 release_pages
  0.000139 constant_test_bit
  0.000129 __tcp_checksum_complete
  0.000129 run_workqueue
  0.000129 poll_freewait
  0.000129 n_tty_read
  0.000129 iommu_area_free
  0.000129 generic_file_llseek
  0.000129 __cpus_setall
  0.000129 cond_resched_softirq
  0.000129 avc_node_populate
  0.000129 add_to_page_cache_lru
  0.000129 account_user_time
  0.000119 wait_consider_task
  0.000119 sys_select
  0.000119 round_jiffies_common
  0.000119 nv_start_xmit_optimized
  0.000119 core_sys_select
  0.000109 tcp_tso_segment
  0.000109 sigprocmask
  0.000109 proc_reg_read
  0.000109 path_to_nameidata
  0.000109 PageBuddy
  0.000109 ohci_irq
  0.000109 nv_tx_done_optimized
  0.000109 nv_msi_workaround
  0.000109 IRQ0xc2_interrupt
  0.000109 __ext3_get_inode_loc
  0.000109 account_group_user_time
  0.000099 __wake_up_sync
  0.000099 __up_read
  0.000099 update_vsyscall
  0.000099 memmove
  0.000099 kmalloc
  0.000099 ext3_get_blocks_handle
  0.000099 do_device_not_available
  0.000099 constant_test_bit
  0.000089 tcp_incr_quickack
  0.000089 smp_send_reschedule
  0.000089 remove_from_page_cache
  0.000089 rcu_process_callbacks
  0.000089 prepare_to_wait_exclusive
  0.000089 pde_users_dec
  0.000089 find_first_bit
  0.000089 constant_test_bit
  0.000089 common_interrupt
  0.000089 add_wait_queue
  0.000079 task_gtime
  0.000079 sys_lseek
  0.000079 start_this_handle
  0.000079 schedule_hrtimeout_range
  0.000079 __sched_fork
  0.000079 journal_put_journal_head
  0.000079 find_first_zero_bit
  0.000079 do_syslog
  0.000079 do_sync_write
  0.000079 constant_test_bit
  0.000079 ack_apic_level
  0.000070 write_seqlock
  0.000070 slab_get_obj
  0.000070 remove_wait_queue
  0.000070 pty_chars_in_buffer
  0.000070 ____pagevec_lru_add
  0.000070 lock_hrtimer_base
  0.000070 kstat_incr_irqs_this_cpu
  0.000070 journal_dirty_data
  0.000070 journal_add_journal_head
  0.000070 find_lock_page
  0.000070 copy_from_read_buf
  0.000070 bit_waitqueue
  0.000070 alloc_page_vma
  0.000060 vfs_write
  0.000060 tty_write
  0.000060 __strnlen_user
  0.000060 sk_mem_uncharge
  0.000060 rt_worker_func
  0.000060 radix_tree_preload
  0.000060 poll_select_copy_remaining
  0.000060 pagefault_enable
  0.000060 __mark_inode_dirty
  0.000060 lru_add_drain_all
  0.000060 lock_page
  0.000060 list_replace_init
  0.000060 journal_stop
  0.000060 iowrite8
  0.000060 hrtimer_forward
  0.000060 gart_unmap_single
  0.000060 find_vma
  0.000060 __down_read_trylock
  0.000060 do_page_fault
  0.000060 do_IRQ
  0.000060 create_empty_buffers
  0.000060 constant_test_bit
  0.000060 constant_test_bit
  0.000060 alloc_iommu
  0.000060 add_to_page_cache_locked
  0.000050 zero_fd_set
  0.000050 vsnprintf
  0.000050 unlock_page
  0.000050 tty_read
  0.000050 tty_poll
  0.000050 sock_poll
  0.000050 sock_def_error_report
  0.000050 set_wq_data
  0.000050 rcu_check_callbacks
  0.000050 radix_tree_node_rcu_free
  0.000050 pipe_poll
  0.000050 opost
  0.000050 n_tty_chars_in_buffer
  0.000050 __next_cpu
  0.000050 mutex_trylock
  0.000050 msecs_to_jiffies
  0.000050 mempool_alloc_slab
  0.000050 load_elf_binary
  0.000050 __link_path_walk
  0.000050 __journal_remove_journal_head
  0.000050 journal_commit_transaction
  0.000050 journal_cancel_revoke
  0.000050 irq_complete_move
  0.000050 irq_cfg
  0.000050 fsnotify_modify
  0.000050 __first_cpu
  0.000050 file_update_time
  0.000050 filemap_fault
  0.000050 ext3_new_blocks
  0.000050 ext3_mark_inode_dirty
  0.000050 do_wp_page
  0.000050 __do_fault
  0.000050 buffer_dirty
  0.000050 anon_vma_prepare
  0.000040 yield
  0.000040 wq_per_cpu
  0.000040 walk_page_buffers
  0.000040 __wake_up_bit
  0.000040 vma_adjust
  0.000040 tty_put_char
  0.000040 tty_paranoia_check
  0.000040 tcp_current_ssthresh
  0.000040 sys_write
  0.000040 sys_rt_sigprocmask
  0.000040 sock_no_bind
  0.000040 show_stat
  0.000040 SetPageSwapBacked
  0.000040 set_irq_regs
  0.000040 set_buffer_write_io_error
  0.000040 recalc_sigpending
  0.000040 radix_tree_delete
  0.000040 queue_delayed_work_on
  0.000040 pty_write
  0.000040 __pollwait
  0.000040 physflat_send_IPI_allbutself
  0.000040 page_zone
  0.000040 page_remove_rmap
  0.000040 page_is_file_cache
  0.000040 page_evictable
  0.000040 nv_get_empty_tx_slots
  0.000040 n_tty_poll
  0.000040 next_zone
  0.000040 next_online_pgdat
  0.000040 need_resched
  0.000040 mutex_unlock
  0.000040 mpol_needs_cond_ref
  0.000040 __lookup
  0.000040 journal_invalidatepage
  0.000040 journal_dirty_metadata
  0.000040 ioread8
  0.000040 input_available_p
  0.000040 inet_csk_reset_xmit_timer
  0.000040 get_fd_set
  0.000040 generic_write_checks
  0.000040 free_poll_entry
  0.000040 fput
  0.000040 __ext3_journal_stop
  0.000040 ext3_get_group_desc
  0.000040 ext3_get_block
  0.000040 do_mpage_readpage
  0.000040 __d_lookup
  0.000040 del_page_from_lru
  0.000040 __dec_zone_state
  0.000040 copy_user_generic
  0.000040 __bitmap_and
  0.000040 add_page_to_lru_list
  0.000040 account_user_time_scaled
  0.000040 account_steal_time
  0.000030 worker_thread
  0.000030 wake_up_bit
  0.000030 vmstat_update
  0.000030 vm_normal_page
  0.000030 tty_write_unlock
  0.000030 tty_write_lock
  0.000030 tty_wakeup
  0.000030 tty_ldisc_try
  0.000030 tty_ioctl
  0.000030 tag_get
  0.000030 sys_pread64
  0.000030 submit_bh
  0.000030 stop_this_cpu
  0.000030 sock_aio_write
  0.000030 sk_mem_reclaim
  0.000030 sk_backlog_rcv
  0.000030 show_interrupts
  0.000030 sg_next
  0.000030 seq_printf
  0.000030 send_remote_softirq
  0.000030 remove_vma
  0.000030 reg_delay
  0.000030 radix_tree_lookup
  0.000030 radix_tree_insert
  0.000030 proc_lookup_de
  0.000030 pipe_write
  0.000030 __percpu_counter_add
  0.000030 pci_map_single
  0.000030 nv_napi_poll
  0.000030 __next_node
  0.000030 native_send_call_func_ipi
  0.000030 mpage_readpages
  0.000030 mix_pool_bytes_extract
  0.000030 mii_rw
  0.000030 mempool_alloc
  0.000030 __make_request
  0.000030 jbd_lock_bh_state
  0.000030 iov_iter_copy_from_user_atomic
  0.000030 insert_work
  0.000030 hrtimer_try_to_cancel
  0.000030 get_dma_ops
  0.000030 __generic_file_aio_write_nolock
  0.000030 gart_map_sg
  0.000030 __fput
  0.000030 fixup_irqs
  0.000030 __find_get_block_slow
  0.000030 filp_close
  0.000030 ext3_get_branch
  0.000030 ext3_dirty_inode
  0.000030 ext3_block_to_path
  0.000030 do_get_write_access
  0.000030 delayed_work_timer_fn
  0.000030 csum_block_add
  0.000030 copy_process
  0.000030 copy_page_range
  0.000030 constant_test_bit
  0.000030 constant_test_bit
  0.000030 check_irqs_on
  0.000030 call_rcu
  0.000030 __brelse
  0.000030 _atomic_dec_and_lock
  0.000020 __xchg
  0.000020 vm_stat_account
  0.000020 vma_prio_tree_remove
  0.000020 tty_mode_ioctl
  0.000020 tty_audit_add_data
  0.000020 try_to_free_buffers
  0.000020 truncate_inode_pages_range
  0.000020 tcp_slow_start
  0.000020 task_curr
  0.000020 sys_setpgid
  0.000020 sys_rt_sigreturn
  0.000020 sys_getppid
  0.000020 strncpy_from_user
  0.000020 sock_put
  0.000020 smp_call_function
  0.000020 __sk_mem_reclaim
  0.000020 signal_wake_up
  0.000020 signal_pending
  0.000020 set_termios
  0.000020 SetPageUptodate
  0.000020 SetPageLRU
  0.000020 set_fd_set
  0.000020 set_bit
  0.000020 __send_IPI_shortcut
  0.000020 security_inode_need_killpriv
  0.000020 scsi_request_fn
  0.000020 sb_bread
  0.000020 restore_i387_xstate
  0.000020 __qdisc_run
  0.000020 pud_alloc
  0.000020 pmd_alloc
  0.000020 pfn_pte
  0.000020 pfifo_fast_enqueue
  0.000020 pfifo_fast_dequeue
  0.000020 pci_map_page
  0.000020 path_get
  0.000020 __pagevec_free
  0.000020 pagevec_add
  0.000020 PageUnevictable
  0.000020 page_mapping
  0.000020 nv_get_hw_stats
  0.000020 number
  0.000020 normalize_rt_tasks
  0.000020 __netif_tx_lock
  0.000020 mk_pid
  0.000020 memscan
  0.000020 memcpy_c
  0.000020 __lru_cache_add
  0.000020 __lookup_mnt
  0.000020 load_balance_rt
  0.000020 kthread_should_stop
  0.000020 journal_start
  0.000020 journal_remove_journal_head
  0.000020 __journal_file_buffer
  0.000020 jbd_unlock_bh_journal_head
  0.000020 itimer_get_remtime
  0.000020 irq_to_desc
  0.000020 iowrite32
  0.000020 inotify_remove_watch_locked
  0.000020 inode_permission
  0.000020 inode_has_perm
  0.000020 init_timer
  0.000020 goal_in_my_reservation
  0.000020 get_vma_policy
  0.000020 __get_free_pages
  0.000020 generic_sync_sb_inodes
  0.000020 gart_map_single
  0.000020 freezing
  0.000020 free_pgtables
  0.000020 free_pages_and_swap_cache
  0.000020 free_buffer_head
  0.000020 __follow_mount
  0.000020 flush_tlb_page
  0.000020 find_busiest_queue
  0.000020 file_has_perm
  0.000020 ext3_try_to_allocate
  0.000020 ext3_journal_start
  0.000020 __ext3_journal_dirty_metadata
  0.000020 ext3_file_write
  0.000020 enqueue_hrtimer
  0.000020 dup_mm
  0.000020 do_wait
  0.000020 do_vfs_ioctl
  0.000020 do_path_lookup
  0.000020 do_munmap
  0.000020 do_machine_check
  0.000020 do_lookup
  0.000020 do_follow_link
  0.000020 dma_unmap_single
  0.000020 __dec_zone_page_state
  0.000020 count_vm_event
  0.000020 constant_test_bit
  0.000020 constant_test_bit
  0.000020 compound_head
  0.000020 clear_buffer_jbddirty
  0.000020 clear_buffer_delay
  0.000020 claim_block
  0.000020 cascade
  0.000020 cancel_dirty_page
  0.000020 cache_grow
  0.000020 brelse
  0.000020 __block_prepare_write
  0.000020 __blocking_notifier_call_chain
  0.000020 blk_rq_map_sg
  0.000020 __bitmap_empty
  0.000020 __bitmap_andnot
  0.000020 anon_vma_unlink
  0.000010 zone_page_state
  0.000010 zero_user_segments
  0.000010 __xchg
  0.000010 __vma_link_rb
  0.000010 vma_link
  0.000010 vfs_llseek
  0.000010 __up_write
  0.000010 update_xtime_cache
  0.000010 unmap_underlying_metadata
  0.000010 unmap_region
  0.000010 unix_poll
  0.000010 tty_write_room
  0.000010 tty_unthrottle
  0.000010 tty_ldisc_ref_wait
  0.000010 tty_ldisc_ref
  0.000010 tty_fasync
  0.000010 tty_check_change
  0.000010 tty_chars_in_buffer
  0.000010 tty_audit_fork
  0.000010 truncate_complete_page
  0.000010 test_tsk_thread_flag
  0.000010 taskstats_exit
  0.000010 sys_writev
  0.000010 sys_readahead
  0.000010 sys_poll
  0.000010 sys_newstat
  0.000010 sys_nanosleep
  0.000010 sys_ioctl
  0.000010 syscall_trace_leave
  0.000010 sync_supers
  0.000010 stub_execve
  0.000010 split_page
  0.000010 sock_kfree_s
  0.000010 __sleep_on_page_lock
  0.000010 skip_atoi
  0.000010 signal_pending
  0.000010 signal_pending
  0.000010 sg_init_table
  0.000010 set_task_cpu
  0.000010 __set_page_dirty
  0.000010 SetPageActive
  0.000010 set_bit
  0.000010 seq_puts
  0.000010 selinux_task_setpgid
  0.000010 selinux_secctx_to_secid
  0.000010 selinux_sb_show_options
  0.000010 selinux_inode_permission
  0.000010 selinux_inode_need_killpriv
  0.000010 selinux_inode_free_security
  0.000010 selinux_inode_alloc_security
  0.000010 selinux_d_instantiate
  0.000010 security_vm_enough_memory
  0.000010 second_overflow
  0.000010 scsi_run_queue
  0.000010 __scsi_put_command
  0.000010 scsi_init_sgtable
  0.000010 scsi_end_request
  0.000010 schedule_tail
  0.000010 schedule_delayed_work
  0.000010 sb_any_quota_enabled
  0.000010 rt_hash
  0.000010 round_jiffies_relative
  0.000010 remove_hrtimer
  0.000010 __remove_hrtimer
  0.000010 __remove_from_page_cache
  0.000010 rcu_bh_qsctr_inc
  0.000010 radix_tree_tag_clear
  0.000010 radix_tree_gang_lookup_tag_slot
  0.000010 radix_tree_gang_lookup_slot
  0.000010 queue_delayed_work
  0.000010 qdisc_run
  0.000010 put_tty_queue_nolock
  0.000010 put_io_context
  0.000010 pty_write_room
  0.000010 pty_open
  0.000010 ptep_set_access_flags
  0.000010 profile_munmap
  0.000010 proc_pident_lookup
  0.000010 proc_get_inode
  0.000010 prio_tree_replace
  0.000010 prio_tree_remove
  0.000010 prio_tree_insert
  0.000010 pmd_none_or_clear_bad
  0.000010 pipe_release
  0.000010 pipe_read
  0.000010 pid_revalidate
  0.000010 pgd_alloc
  0.000010 pci_unmap_single
  0.000010 pci_read_config_dword
  0.000010 pci_conf1_write
  0.000010 pci_bus_read_config_dword
  0.000010 path_walk
  0.000010 page_zone
  0.000010 PageSwapCache
  0.000010 PageSwapCache
  0.000010 PageSwapCache
  0.000010 __page_set_anon_rmap
  0.000010 PagePrivate
  0.000010 PagePrivate
  0.000010 PagePrivate
  0.000010 page_add_file_rmap
  0.000010 on_each_cpu
  0.000010 nv_do_interrupt
  0.000010 net_tx_action
  0.000010 netif_start_queue
  0.000010 netif_carrier_ok
  0.000010 need_resched
  0.000010 need_iommu
  0.000010 native_pte_clear
  0.000010 native_io_delay
  0.000010 mutex_lock
  0.000010 mprotect_fixup
  0.000010 mod_zone_page_state
  0.000010 mntput_no_expire
  0.000010 mm_init
  0.000010 mmap_region
  0.000010 mempool_free
  0.000010 memcmp
  0.000010 mcheck_check_cpu
  0.000010 may_open
  0.000010 __lookup_tag
  0.000010 locks_remove_posix
  0.000010 locks_remove_flock
  0.000010 lock_buffer
  0.000010 load_elf_binary
  0.000010 load_balance_fair
  0.000010 ll_back_merge_fn
  0.000010 kzalloc
  0.000010 ktime_add_safe
  0.000010 kill_fasync
  0.000010 __journal_temp_unlink_buffer
  0.000010 journal_switch_revoke_table
  0.000010 __journal_remove_checkpoint
  0.000010 journal_get_write_access
  0.000010 journal_get_undo_access
  0.000010 journal_get_descriptor_buffer
  0.000010 journal_bmap
  0.000010 jbd_unlock_bh_state
  0.000010 jbd_unlock_bh_state
  0.000010 IRQ0xd2_interrupt
  0.000010 ip_append_data
  0.000010 iov_iter_advance
  0.000010 iov_fault_in_pages_read
  0.000010 iommu_area_alloc
  0.000010 inode_sub_bytes
  0.000010 inode_doinit_with_dentry
  0.000010 inode_add_bytes
  0.000010 __inc_zone_page_state
  0.000010 inc_zone_page_state
  0.000010 hweight_long
  0.000010 hweight64
  0.000010 hrtimer_wakeup
  0.000010 hrtimer_init
  0.000010 hash_64
  0.000010 half_md4_transform
  0.000010 __grab_cache_page
  0.000010 get_user_pages
  0.000010 get_signal_to_deliver
  0.000010 get_random_int
  0.000010 getname
  0.000010 get_empty_filp
  0.000010 __getblk
  0.000010 generic_permission
  0.000010 generic_make_request
  0.000010 generic_fillattr
  0.000010 generic_file_open
  0.000010 generic_file_llseek_unlocked
  0.000010 generic_file_buffered_write
  0.000010 generic_file_aio_write
  0.000010 generic_cont_expand_simple
  0.000010 generic_block_bmap
  0.000010 freezing
  0.000010 free_swap_cache
  0.000010 free_pid
  0.000010 free_pgd_range
  0.000010 free_pages
  0.000010 flush_old_exec
  0.000010 first_online_pgdat
  0.000010 find_vma_prepare
  0.000010 find_task_by_pid_type_ns
  0.000010 find_next_zero_bit
  0.000010 find_inode_fast
  0.000010 file_remove_suid
  0.000010 file_mask_to_av
  0.000010 file_free_rcu
  0.000010 __FD_CLR
  0.000010 ext3_write_begin
  0.000010 ext3_try_to_allocate_with_rsv
  0.000010 ext3_ordered_write_end
  0.000010 ext3_journalled_set_page_dirty
  0.000010 ext3_invalidatepage
  0.000010 ext3_iget_acl
  0.000010 ext3_get_inode_flags
  0.000010 ext3_free_data
  0.000010 ext3_discard_reservation
  0.000010 exit_thread
  0.000010 exit_task_namespaces
  0.000010 exit_sem
  0.000010 end_that_request_last
  0.000010 end_buffer_write_sync
  0.000010 end_buffer_async_write
  0.000010 elv_rb_del
  0.000010 elv_queue_empty
  0.000010 elv_merged_request
  0.000010 elv_completed_request
  0.000010 elf_map
  0.000010 echo_char
  0.000010 e1000_watchdog
  0.000010 e1000_read_phy_reg
  0.000010 __drain_alien_cache
  0.000010 __d_path
  0.000010 __down_write_nested
  0.000010 __down_write
  0.000010 double_rq_lock
  0.000010 do_timer
  0.000010 do_sys_open
  0.000010 do_sigaltstack
  0.000010 do_sigaction
  0.000010 do_setitimer
  0.000010 do_pipe_flags
  0.000010 __do_page_cache_readahead
  0.000010 do_notify_parent
  0.000010 do_filp_open
  0.000010 do_exit
  0.000010 dnotify_flush
  0.000010 d_kill
  0.000010 destroy_inode
  0.000010 dequeue_signal
  0.000010 de_put
  0.000010 delayacct_end
  0.000010 create_write_pipe
  0.000010 create_workqueue_thread
  0.000010 __cpus_equal
  0.000010 cpu_quiet
  0.000010 __cpu_clear
  0.000010 __cpu_clear
  0.000010 count
  0.000010 copy_thread
  0.000010 copy_namespaces
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 constant_test_bit
  0.000010 __cond_resched
  0.000010 clocksource_forward_now
  0.000010 __clear_user
  0.000010 clear_inode
  0.000010 clear_buffer_new
  0.000010 clear_bit
  0.000010 clear_bit
  0.000010 check_for_bios_corruption
  0.000010 __cfq_slice_expired
  0.000010 cfq_set_request
  0.000010 cfq_dispatch_requests
  0.000010 cfq_completed_request
  0.000010 cap_set_effective
  0.000010 can_share_swap_page
  0.000010 bvec_alloc_bs
  0.000010 buffer_uptodate
  0.000010 buffer_mapped
  0.000010 buffer_locked
  0.000010 buffer_jbd
  0.000010 buffer_jbd
  0.000010 brelse
  0.000010 __bread
  0.000010 blk_invoke_request_fn
  0.000010 __blk_complete_request
  0.000010 blk_add_trace_generic
  0.000010 blk_add_trace_bio
  0.000010 bit_spin_lock
  0.000010 bio_put
  0.000010 bio_alloc_bioset
  0.000010 bdi_read_congested
  0.000010 balance_runtime
  0.000010 balance_dirty_pages_ratelimited_nr
  0.000010 audit_log_task_context
  0.000010 ata_sff_qc_prep
  0.000010 ata_scsi_queuecmd
  0.000010 ata_link_max_devices
  0.000010 ata_get_xlat_func
  0.000010 arp_process
  0.000010 arch_pick_mmap_layout
  0.000010 arch_irq_stat_cpu
  0.000010 arch_dup_task_struct
  0.000010 alloc_pid
  0.000010 alloc_fdtable
  0.000010 alloc_fd
  0.000010 add_mm_rss
  0.000010 acct_collect

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:21           ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 19:21 UTC (permalink / raw)
  To: mingo
  Cc: rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, torvalds

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 12:01:19 +0100

> The scheduler's overhead barely even registers on a 16-way x86 system 
> i'm running tbench on. Here's the NMI profile during 64 threads tbench 
> on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]:

Try a non-NMI profile.

It's the whole of the try_to_wake_up() path that's the problem.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:30                             ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> The place for the sock_rfree() hit looks a bit weird, and i'll 
>> investigate it now a bit more to place the real overhead point 
>> properly. (i already mapped the test-bit overhead: that comes from 
>> napi_disable_pending())
> 
> ok, here's a new set of profiles. (again for tbench 64-thread on a 
> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i 
> posted before.)
> 
> Here are the per major subsystem percentages:
> 
>            NET       overhead ( 5786945/10096751): 57.31%
>            security  overhead (  925933/10096751):  9.17%
>            usercopy  overhead (  837887/10096751):  8.30%
>            sched     overhead (  753662/10096751):  7.46%
>            syscall   overhead (  268809/10096751):  2.66%
>            IRQ       overhead (  266500/10096751):  2.64%
>            slab      overhead (  180258/10096751):  1.79%
>            timer     overhead (   92986/10096751):  0.92%
>            pagealloc overhead (   87381/10096751):  0.87%
>            VFS       overhead (   53295/10096751):  0.53%
>            PID       overhead (   44469/10096751):  0.44%
>            pagecache overhead (   33452/10096751):  0.33%
>            gtod      overhead (   11064/10096751):  0.11%
>            IDLE      overhead (       0/10096751):  0.00%
> ---------------------------------------------------------
>                          left (  753878/10096751):  7.47%
> 
> The breakdown is very similar to what i sent before, within noise.
> 
> [ 'left' is random overhead from all around the place - i categorized 
>   the 500 most expensive functions in the profile per subsystem.
>   I stopped short of doing it for all 1300+ functions: it's rather
>   laborious manual work even with hefty use of regex patterns.
>   It's also less meaningful in practice: the trend in the first 500
>   functions is present in the remaining 800 functions as well. I 
>   watched the breakdown evolve as i increased the coverage - in 
>   practice it is the first 100 functions that matter - it just doesn't 
>   change after that. ]
> 
> The readprofile output below seems structured in a more useful way now 
> - i tweaked compiler options to have the profiler hits spread out in a 
> more meaningful way. I collected 10 million NMI profiler hits, and 
> normalized the readprofile output up to 100%.
> 
> [ I'll post per function analysis as i complete them, as a reply to
>   this mail. ]
> 
> 	Ingo
> 
> 100.000000 total
> ................
>   7.253355 copy_user_generic_string
>   3.934833 avc_has_perm_noaudit

>   3.356152 ip_queue_xmit

>   3.038025 skb_release_data
>   2.118525 skb_release_head_state
>   1.997533 tcp_ack
>   1.833688 tcp_recvmsg

>   1.717771 eth_type_trans
Strange, in my profile eth_type_trans is not in the top 20.
Maybe an alignment problem?
Oh, I understand: you hit the netdevice->last_rx update problem, which is
already corrected in net-next-2.6.

>   1.673249 __inet_lookup_established
TCP established/timewait table is now RCUified (for linux-2.6.29), this one
should go down in profiles. 

>   1.508888 system_call

>   1.469183 tcp_current_mss
Yes, there is a divide that might be expensive; it is under discussion on netdev.

>   1.431553 tcp_transmit_skb
>   1.385125 tcp_sendmsg
>   1.327643 tcp_v4_rcv
>   1.292328 nf_hook_thresh
>   1.203205 schedule
>   1.059501 nf_hook_slow
>   1.027373 constant_test_bit
>   0.945183 sock_rfree
>   0.922748 __switch_to
>   0.911605 netif_rx
>   0.876270 register_gifconf
>   0.788200 ip_local_deliver_finish
>   0.781467 dev_queue_xmit
>   0.766530 constant_test_bit
>   0.758208 _local_bh_enable_ip
>   0.747184 load_cr3
>   0.704341 memset_c
>   0.671260 sysret_check
>   0.651845 ip_finish_output2
>   0.620204 audit_free_names
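
The per-subsystem percentages quoted above are each subsystem's hit count
divided by the total number of profiler hits. A minimal sketch of that
arithmetic, assuming nothing about the scripts actually used to produce the
table (hits_to_percent is a hypothetical helper name, not part of
readprofile):

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a raw profiler hit count into a
 * percentage of all hits collected, as in the quoted
 * "NET overhead (5786945/10096751): 57.31%" line.
 */
double hits_to_percent(unsigned long hits, unsigned long total)
{
	assert(total != 0);	/* a profile with no hits has no shares */
	return 100.0 * (double)hits / (double)total;
}
```

For example, hits_to_percent(5786945UL, 10096751UL) gives the NET share
and hits_to_percent(753662UL, 10096751UL) gives the sched share.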



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:31               ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 19:31 UTC (permalink / raw)
  To: mingo
  Cc: dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, torvalds

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 17:11:35 +0100

> Ouch, +4% from a oneliner networking change? That's a _huge_ speedup 
> compared to the things we were after in scheduler land.

The scheduler has accounted for at least 10% of the tbench
regressions at this point, what are you talking about?

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:36                   ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 19:36 UTC (permalink / raw)
  To: mingo
  Cc: dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, torvalds, shemminger

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 18:08:44 +0100

> Mike Galbraith has been spending months trying to pin down all the 
> issues.

Yes Mike has been doing tireless good work.

Another thing I noticed is that because all of the scheduler
core operations are now function pointer callbacks, the
call chain is deeper for core operations like wake_up().

Much of it used to be completely inlined into try_to_wake_up().

The addition of the RB tree stuff adds yet another
unavoidable level of function call.

wake_up() is usually at the deepest part of the call chain,
so this is a big deal.
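
To make the call-depth point concrete, here is a toy sketch of the two
shapes being contrasted. It is not scheduler code: toy_rq, toy_class_ops
and the toy_* functions are all hypothetical names invented for this
illustration. The direct helper can be folded into its caller by the
compiler, while the call through the ops table generally stays a real,
indirect function call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a run queue. */
struct toy_rq {
	int nr_running;
};

/* Hypothetical per-class callback table (illustration only). */
struct toy_class_ops {
	void (*enqueue)(struct toy_rq *rq);
};

/* Direct style: small enough for the compiler to inline into the caller. */
static inline void toy_enqueue_direct(struct toy_rq *rq)
{
	rq->nr_running++;
}

void toy_wake_direct(struct toy_rq *rq)
{
	toy_enqueue_direct(rq);		/* typically no extra call frame */
}

/* Callback style: the same work, reached through a function pointer. */
static void toy_fair_enqueue(struct toy_rq *rq)
{
	rq->nr_running++;
}

const struct toy_class_ops toy_fair_ops = {
	.enqueue = toy_fair_enqueue,
};

void toy_wake_indirect(struct toy_rq *rq, const struct toy_class_ops *ops)
{
	assert(ops != NULL && ops->enqueue != NULL);
	ops->enqueue(rq);		/* indirect call, one level deeper */
}
```

Both paths do the same bookkeeping; the difference is only in how many
real calls sit between the waker and the increment.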

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:39                             ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 19:39 UTC (permalink / raw)
  To: mingo
  Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 19:49:51 +0100

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > The place for the sock_rfree() hit looks a bit weird, and i'll 
> > investigate it now a bit more to place the real overhead point 
> > properly. (i already mapped the test-bit overhead: that comes from 
> > napi_disable_pending())
> 
> ok, here's a new set of profiles. (again for tbench 64-thread on a 
> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i 
> posted before.)

Again, do a non-NMI profile and the top (at least for me)
looks like this:

samples  %        app name                 symbol name
473       6.3928  vmlinux                  finish_task_switch
349       4.7169  vmlinux                  tcp_v4_rcv
327       4.4195  vmlinux                  U3copy_from_user
322       4.3519  vmlinux                  tl0_linux32
178       2.4057  vmlinux                  tcp_ack
170       2.2976  vmlinux                  tcp_sendmsg
167       2.2571  vmlinux                  U3copy_to_user

98% of that tcp_v4_rcv() hit is on the wake_up() call it does.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:43                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 19:43 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

David Miller a écrit :
> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 19:49:51 +0100
> 
>> * Ingo Molnar <mingo@elte.hu> wrote:
>>
>>> The place for the sock_rfree() hit looks a bit weird, and i'll
>>> investigate it now a bit more to place the real overhead point 
>>> properly. (i already mapped the test-bit overhead: that comes from 
>>> napi_disable_pending())
>> ok, here's a new set of profiles. (again for tbench 64-thread on a 
>> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i 
>> posted before.)
> 
> Again, do a non-NMI profile and the top (at least for me)
> looks like this:
> 
> samples  %        app name                 symbol name
> 473       6.3928  vmlinux                  finish_task_switch
> 349       4.7169  vmlinux                  tcp_v4_rcv
> 327       4.4195  vmlinux                  U3copy_from_user
> 322       4.3519  vmlinux                  tl0_linux32
> 178       2.4057  vmlinux                  tcp_ack
> 170       2.2976  vmlinux                  tcp_sendmsg
> 167       2.2571  vmlinux                  U3copy_to_user
> 
> That tcp_v4_rcv() hit is %98 on the wake_up() call it does.
> 
> 

Another profile from my tree (net-next-2.6 + some patches), on my machine


CPU: Core 2, speed 3000.22 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
223265    9.2711  __copy_user_zeroing_intel
87525     3.6345  __copy_user_intel
73203     3.0398  tcp_sendmsg
53229     2.2103  netif_rx
53041     2.2025  tcp_recvmsg
47241     1.9617  sysenter_past_esp
42888     1.7809  __copy_from_user_ll
40858     1.6966  tcp_transmit_skb
39390     1.6357  __switch_to
37363     1.5515  dst_release
36823     1.5291  __sk_dst_check_get
36050     1.4970  tcp_v4_rcv
35829     1.4878  __do_softirq
32333     1.3426  tcp_rcv_established
30451     1.2645  tcp_clean_rtx_queue
29758     1.2357  ip_queue_xmit
28497     1.1833  __copy_to_user_ll
28119     1.1676  release_sock
25218     1.0472  lock_sock_nested
23701     0.9842  __inet_lookup_established
23463     0.9743  tcp_ack
22989     0.9546  netif_receive_skb
21880     0.9086  sched_clock_cpu
20730     0.8608  tcp_write_xmit
20372     0.8460  ip_rcv
20336     0.8445  local_bh_enable
19153     0.7953  __update_sched_clock
18603     0.7725  skb_release_data
17020     0.7068  local_bh_enable_ip
16932     0.7031  process_backlog
16299     0.6768  ip_finish_output
16279     0.6760  dev_queue_xmit
15858     0.6585  sock_recvmsg
15641     0.6495  native_read_tsc
15454     0.6417  sock_wfree
15366     0.6381  update_curr
14585     0.6056  sys_socketcall
14564     0.6048  __alloc_skb
14519     0.6029  __tcp_select_window
14417     0.5987  tcp_current_mss
14391     0.5976  nf_iterate
14221     0.5905  page_address
14122     0.5864  local_bh_disable




^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:47                 ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 19:47 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra



On Mon, 17 Nov 2008, David Miller wrote:
> 
> The scheduler has accounted for at least %10 of the tbench
> regressions at this point, what are you talking about?

I'm wondering if you're not looking at totally different issues.

For example, if I recall correctly, David had a big hit on the hrtimers. 
And I wonder if perhaps Ingo's numbers are without hrtimers or something? 

The other possibility is that it's just a sparc suckiness issue, that 
simply doesn't show up on x86. 

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:48             ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 19:48 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra



On Mon, 17 Nov 2008, David Miller wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 12:01:19 +0100
> 
> > The scheduler's overhead barely even registers on a 16-way x86 system 
> > i'm running tbench on. Here's the NMI profile during 64 threads tbench 
> > on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]:
> 
> Try a non-NMI profile.
> 
> It's the whole of the try_to_wake_up() path that's the problem.

David, that makes no sense. A NMI profile is going to be a _lot_ more 
accurate than a non-NMI one. Asking somebody to do a clearly inferior 
profile to get "better numbers" is insane.

We've asked _you_ to do NMI profiling, it shouldn't be the other way 
around.

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:51                   ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 19:51 UTC (permalink / raw)
  To: torvalds
  Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 17 Nov 2008 11:47:24 -0800 (PST)

> For example, if I recall correctly, David had a big hit on the hrtimers. 

That got fixed, the HRTIMER bits are now disabled.

> The other possibility is that it's just a sparc suckiness issue, that 
> simply doesn't show up on x86. 

Could be and I intend to measure that to find out.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:52               ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 19:52 UTC (permalink / raw)
  To: torvalds
  Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 17 Nov 2008 11:48:33 -0800 (PST)

> We've asked _you_ to do NMI profiling, it shouldn't be the other way 
> around.

I wasn't able to on these systems, so instead I did cycle level
evaluation of the parts that have to run with interrupts disabled.

And as a result I found that wake_up() is now 4 times slower than it
was in 2.6.22, I even analyzed this for every single kernel release
till now.

It could be a sparc specific issue, because the call chain is deeper
and we eat a lot more register window spills onto the stack.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:53                   ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, dada1, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 17 Nov 2008, David Miller wrote:
> > 
> > The scheduler has accounted for at least %10 of the tbench 
> > regressions at this point, what are you talking about?
> 
> I'm wondering if you're not looking at totally different issues.
> 
> For example, if I recall correctly, David had a big hit on the 
> hrtimers. And I wonder if perhaps Ingo's numbers are without 
> hrtimers or something?

hrtimers should not be an issue anymore since this commit:

| commit 0c4b83da58ec2e96ce9c44c211d6eac5f9dae478
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Mon Oct 20 14:27:43 2008 +0200
|
|     sched: disable the hrtick for now
|    
|     David Miller reported that hrtick update overhead has tripled the
|     wakeup overhead on Sparc64.
|    
|     That is too much - disable the HRTICK feature for now by default,
|     until a faster implementation is found.
|    
|     Reported-by: David Miller <davem@davemloft.net>
|     Acked-by: Peter Zijlstra <peterz@infradead.org>
|     Signed-off-by: Ingo Molnar <mingo@elte.hu>

Which was included in v2.6.28-rc1 already.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:55                               ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 19:55 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger



On Mon, 17 Nov 2008, David Miller wrote:
> 
> Again, do a non-NMI profile and the top (at least for me)
> looks like this:

Can _you_ please do a NMI profile and see what your real problem is?

I can't imagine that Niagara (or whatever) is so weak that it can't do 
NMI's. 

The fact is, David, that Ingo just posted a profile that was _better_ than 
anything you have ever posted, and it doesn't show what you complain 
about. So he's not seeing it. Asking him to do a _stupid_ profile is just 
that: stupid.

So try to figure out why his (better) profile doesn't match your 
(inferior) one, instead of asking him to do stupid things. It's some 
difference in architectures, likely: maybe the sparc timekeeping is crap, 
maybe it's a cache issue and sparc caches are crap, maybe it's something 
where Niagara (is it niagara) has some oddness that shows up because it 
has that odd four-threads+four-cores or whatever.

			Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:57                 ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 19:57 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra



On Mon, 17 Nov 2008, David Miller wrote:
> 
> And as a result I found that wake_up() is now 4 times slower than it
> was in 2.6.22, I even analyzed this for every single kernel release
> till now.

..and that's the one where you then pointed to hrtimers, and now you claim 
that was fixed?

At least I haven't seen any new analysis since then.

> It could be a sparc specific issue, because the call chain is deeper
> and we eat a lot more register window spills onto the stack.

Oh, easily. In-order machines tend to have serious problems with indirect 
function calls in particular. x86, in contrast, tends to not even notice, 
especially if the indirect function is fairly static per call-site, and 
predicts well.

There is a reason my machine is 15-20 times faster than yours.

			Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 19:57                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 19:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


> [ I'll post per function analysis as i complete them, as a reply to
>   this mail. ]

[ i'll do a separate mail for every function analyzed, the discussion 
  spreads better that way. ]

> 100.000000 total
> ................
>   7.253355 copy_user_generic_string

This is the well-known pattern of user-copy overhead, which centers 
around this single REP MOVS instruction:

                nr-of-hits
                 .........
ffffffff80341eea:       42 	83 e2 07    		and    $0x7,%edx
ffffffff80341eed:   677398 	f3 48 a5         	rep movsq %ds:(%rsi),%es:(%rdi)
ffffffff80341ef0:     3642 	89 d1                	mov    %edx,%ecx
ffffffff80341ef2:    16260 	f3 a4                	rep movsb %ds:(%rsi),%es:(%rdi)
ffffffff80341ef4:     6554 	31 c0                	xor    %eax,%eax
ffffffff80341ef6:     1958 	c3                   	retq   
ffffffff80341ef7:        0 	90                   	nop    
ffffffff80341ef8:        0 	90                   	nop    

That's to be expected - tbench shuffles 3.5 GB of effective data 
to/from sockets. That's 7.5 GB due to double-copy. So for every 64 
bytes of data transferred we spend 1.4 CPU cycles in this specific 
function - that is OK-ish.
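
[ Editorial sketch, not part of the original mail: to put a number on
  "centers around this single REP MOVS instruction", the share of hits
  per instruction can be computed directly from the annotated listing
  above. The names below are hypothetical; the hit counts are copied
  from the listing, and only the instructions shown in that excerpt are
  counted, so the shares are relative to the excerpt and not necessarily
  to the whole function. ]

```python
# Hit counts copied from the annotated disassembly above.
# The two trailing nop rows have zero hits and are left out.
HITS = {
    "and $0x7,%edx": 42,
    "rep movsq": 677398,
    "mov %edx,%ecx": 3642,
    "rep movsb": 16260,
    "xor %eax,%eax": 6554,
    "retq": 1958,
}


def hit_share(hits):
    """Return each instruction's fraction of the total hits in the excerpt."""
    total = sum(hits.values())
    return {insn: count / total for insn, count in hits.items()}


if __name__ == "__main__":
    shares = hit_share(HITS)
    # Print instructions from hottest to coldest.
    for insn, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{insn:16s} {share * 100:6.2f}%")
```

With sampling skid, hits may be attributed to the instruction after the
one that actually stalled, so per-instruction shares like these are best
read as approximate.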

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:16                                 ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 20:16 UTC (permalink / raw)
  To: torvalds
  Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 17 Nov 2008 11:55:35 -0800 (PST)

> So try to figure out why his (better) profile doesn't match your 
> (inferior) one, instead of asking him to do stupid things. It's some 
> difference in architectures, likely: maybe the sparc timekeeping is crap, 
> maybe it's a cache issue and sparc caches are crap, maybe it's something 
> where Niagara (is it niagara) has some oddness that shows up because it 
> has that odd four-threads+four-cores or whatever.

It's on my workstation which is a much simpler 2 processor
UltraSPARC-IIIi (1.5GHz) system.

And yes I will investigate, it's all I've been doing in my
spare time these past few weeks.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:18                   ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 20:18 UTC (permalink / raw)
  To: torvalds
  Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 17 Nov 2008 11:57:55 -0800 (PST)

> On Mon, 17 Nov 2008, David Miller wrote:
> > And as a result I found that wake_up() is now 4 times slower than it
> > was in 2.6.22, I even analyzed this for every single kernel release
> > till now.
> 
> ..and that's the one where you then pointed to hrtimers, and now you claim 
> that was fixed?

That was a huge increase going from 2.6.26 to 2.6.27, and has
been fixed.

The rest of the gradual release-to-release cost increase, however,
remains.

> At least I haven't seen any new analysis since then.

I will find time to make it after I get back from Portland.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* (avc_has_perm_noaudit()) Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:20                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   3.934833 avc_has_perm_noaudit

this one seems spread out:

                      hits (total: 393483 hits)
                 .........
ffffffff80312af3:     1426 <avc_has_perm_noaudit>:
ffffffff80312af3:     1426 	41 57                	push   %r15
ffffffff80312af5:     6124 	41 56                	push   %r14
ffffffff80312af7:        0 	41 55                	push   %r13
ffffffff80312af9:     1443 	41 89 f5             	mov    %esi,%r13d
ffffffff80312afc:     1577 	41 54                	push   %r12
ffffffff80312afe:        0 	41 89 fc             	mov    %edi,%r12d
ffffffff80312b01:     1310 	55                   	push   %rbp
ffffffff80312b02:     1531 	53                   	push   %rbx
ffffffff80312b03:        3 	48 83 ec 68          	sub    $0x68,%rsp
ffffffff80312b07:     2202 	85 c9                	test   %ecx,%ecx
ffffffff80312b09:        0 	89 4c 24 0c          	mov    %ecx,0xc(%rsp)
ffffffff80312b0d:      550 	44 89 44 24 08       	mov    %r8d,0x8(%rsp)
ffffffff80312b12:     1572 	4c 89 0c 24          	mov    %r9,(%rsp)
ffffffff80312b16:        0 	66 89 54 24 12       	mov    %dx,0x12(%rsp)
ffffffff80312b1b:      588 	75 04                	jne    ffffffff80312b21 <avc_has_perm_noaudit+0x2e>
ffffffff80312b1d:        0 	0f 0b                	ud2a   
ffffffff80312b1f:        0 	eb fe                	jmp    ffffffff80312b1f <avc_has_perm_noaudit+0x2c>
ffffffff80312b21:     1646 	0f b7 44 24 12       	movzwl 0x12(%rsp),%eax
ffffffff80312b26:      829 	48 c7 c2 d0 26 93 80 	mov    $0xffffffff809326d0,%rdx
ffffffff80312b2d:      589 	89 44 24 14          	mov    %eax,0x14(%rsp)
ffffffff80312b31:      698 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff80312b38:        0 	00 
ffffffff80312b39:      791 	89 c0                	mov    %eax,%eax
ffffffff80312b3b:      549 	48 c1 e0 03          	shl    $0x3,%rax
ffffffff80312b3f:      791 	48 03 05 fa 30 5a 00 	add    0x5a30fa(%rip),%rax        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312b46:      864 	48 8b 00             	mov    (%rax),%rax
ffffffff80312b49:      533 	48 03 50 08          	add    0x8(%rax),%rdx
ffffffff80312b4d:      732 	ff 02                	incl   (%rdx)
ffffffff80312b4f:      860 	8b 54 24 14          	mov    0x14(%rsp),%edx
ffffffff80312b53:     1259 	e8 54 fc ff ff       	callq  ffffffff803127ac <avc_hash>
ffffffff80312b58:     2087 	48 98                	cltq   
ffffffff80312b5a:     1015 	48 89 44 24 18       	mov    %rax,0x18(%rsp)
ffffffff80312b5f:        0 	48 c1 e0 04          	shl    $0x4,%rax
ffffffff80312b63:     2944 	4c 8d b8 60 6b a9 80 	lea    -0x7f5694a0(%rax),%r15
ffffffff80312b6a:       71 	48 8b 80 60 6b a9 80 	mov    -0x7f5694a0(%rax),%rax
ffffffff80312b71:     3943 	eb 1a                	jmp    ffffffff80312b8d <avc_has_perm_noaudit+0x9a>
ffffffff80312b73:     5184 	44 3b 23             	cmp    (%rbx),%r12d
ffffffff80312b76:    62007 	75 11                	jne    ffffffff80312b89 <avc_has_perm_noaudit+0x96>
ffffffff80312b78:       11 	66 8b 44 24 12       	mov    0x12(%rsp),%ax
ffffffff80312b7d:        0 	66 3b 43 08          	cmp    0x8(%rbx),%ax
ffffffff80312b81:    11115 	75 06                	jne    ffffffff80312b89 <avc_has_perm_noaudit+0x96>
ffffffff80312b83:        4 	44 3b 6b 04          	cmp    0x4(%rbx),%r13d
ffffffff80312b87:    14224 	74 1a                	je     ffffffff80312ba3 <avc_has_perm_noaudit+0xb0>
ffffffff80312b89:        1 	48 8b 43 28          	mov    0x28(%rbx),%rax
ffffffff80312b8d:     6921 	48 8d 58 d8          	lea    -0x28(%rax),%rbx
ffffffff80312b91:     9654 	48 8b 43 28          	mov    0x28(%rbx),%rax
ffffffff80312b95:      414 	0f 18 08             	prefetcht0 (%rax)
ffffffff80312b98:      227 	48 8d 43 28          	lea    0x28(%rbx),%rax
ffffffff80312b9c:     9617 	4c 39 f8             	cmp    %r15,%rax
ffffffff80312b9f:     1402 	75 d2                	jne    ffffffff80312b73 <avc_has_perm_noaudit+0x80>
ffffffff80312ba1:        0 	eb 41                	jmp    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312ba3:        0 	83 7b 20 01          	cmpl   $0x1,0x20(%rbx)
ffffffff80312ba7:      671 	0f 84 70 02 00 00    	je     ffffffff80312e1d <avc_has_perm_noaudit+0x32a>
ffffffff80312bad:        0 	c7 43 20 01 00 00 00 	movl   $0x1,0x20(%rbx)
ffffffff80312bb4:        0 	e9 64 02 00 00       	jmpq   ffffffff80312e1d <avc_has_perm_noaudit+0x32a>
ffffffff80312bb9:     2118 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff80312bc0:        0 	00 
ffffffff80312bc1:     8245 	89 d2                	mov    %edx,%edx
ffffffff80312bc3:        0 	48 c7 c0 d0 26 93 80 	mov    $0xffffffff809326d0,%rax
ffffffff80312bca:      511 	48 c1 e2 03          	shl    $0x3,%rdx
ffffffff80312bce:    11308 	48 03 15 6b 30 5a 00 	add    0x5a306b(%rip),%rdx        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312bd5:        0 	48 8b 12             	mov    (%rdx),%rdx
ffffffff80312bd8:       35 	48 03 42 08          	add    0x8(%rdx),%rax
ffffffff80312bdc:     2224 	ff 40 04             	incl   0x4(%rax)
ffffffff80312bdf:        1 	e9 06 01 00 00       	jmpq   ffffffff80312cea <avc_has_perm_noaudit+0x1f7>
ffffffff80312be4:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff80312beb:        0 	00 
ffffffff80312bec:        0 	89 d2                	mov    %edx,%edx
ffffffff80312bee:        0 	48 c7 c0 d0 26 93 80 	mov    $0xffffffff809326d0,%rax
ffffffff80312bf5:        0 	48 8d 6c 24 30       	lea    0x30(%rsp),%rbp
ffffffff80312bfa:        0 	48 c1 e2 03          	shl    $0x3,%rdx
ffffffff80312bfe:        0 	48 03 15 3b 30 5a 00 	add    0x5a303b(%rip),%rdx        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312c05:        0 	44 89 ee             	mov    %r13d,%esi
ffffffff80312c08:        0 	4c 8d 45 0c          	lea    0xc(%rbp),%r8
ffffffff80312c0c:        0 	44 89 e7             	mov    %r12d,%edi
ffffffff80312c0f:        0 	48 8b 12             	mov    (%rdx),%rdx
ffffffff80312c12:        0 	48 03 42 08          	add    0x8(%rdx),%rax
ffffffff80312c16:        0 	ff 40 08             	incl   0x8(%rax)
ffffffff80312c19:        0 	8b 4c 24 0c          	mov    0xc(%rsp),%ecx
ffffffff80312c1d:        0 	8b 54 24 14          	mov    0x14(%rsp),%edx
ffffffff80312c21:        0 	e8 ee 0a 01 00       	callq  ffffffff80323714 <security_compute_av>
ffffffff80312c26:        0 	85 c0                	test   %eax,%eax
ffffffff80312c28:        0 	41 89 c6             	mov    %eax,%r14d
ffffffff80312c2b:        0 	0f 85 02 02 00 00    	jne    ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312c31:        0 	8b 7c 24 4c          	mov    0x4c(%rsp),%edi
ffffffff80312c35:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff80312c3a:        0 	e8 a5 fb ff ff       	callq  ffffffff803127e4 <avc_latest_notif_update>
ffffffff80312c3f:        0 	85 c0                	test   %eax,%eax
ffffffff80312c41:        0 	0f 85 9c 00 00 00    	jne    ffffffff80312ce3 <avc_has_perm_noaudit+0x1f0>
ffffffff80312c47:        0 	e8 23 fd ff ff       	callq  ffffffff8031296f <avc_alloc_node>
ffffffff80312c4c:        0 	48 85 c0             	test   %rax,%rax
ffffffff80312c4f:        0 	48 89 c3             	mov    %rax,%rbx
ffffffff80312c52:        0 	0f 84 8b 00 00 00    	je     ffffffff80312ce3 <avc_has_perm_noaudit+0x1f0>
ffffffff80312c58:        0 	8b 4c 24 14          	mov    0x14(%rsp),%ecx
ffffffff80312c5c:        0 	49 89 e8             	mov    %rbp,%r8
ffffffff80312c5f:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff80312c62:        0 	48 89 c7             	mov    %rax,%rdi
ffffffff80312c65:        0 	44 89 ea             	mov    %r13d,%edx
ffffffff80312c68:        0 	e8 5d fb ff ff       	callq  ffffffff803127ca <avc_node_populate>
ffffffff80312c6d:        0 	48 8b 44 24 18       	mov    0x18(%rsp),%rax
ffffffff80312c72:        0 	48 8d 2c 85 60 8b a9 	lea    -0x7f5674a0(,%rax,4),%rbp
ffffffff80312c79:        0 	80 
ffffffff80312c7a:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312c7d:        0 	e8 44 3c 20 00       	callq  ffffffff805168c6 <_spin_lock_irqsave>
ffffffff80312c82:        0 	49 8b 37             	mov    (%r15),%rsi
ffffffff80312c85:        0 	49 89 c6             	mov    %rax,%r14
ffffffff80312c88:        0 	eb 24                	jmp    ffffffff80312cae <avc_has_perm_noaudit+0x1bb>
ffffffff80312c8a:        0 	44 39 26             	cmp    %r12d,(%rsi)
ffffffff80312c8d:        0 	75 1b                	jne    ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312c8f:        0 	44 39 6e 04          	cmp    %r13d,0x4(%rsi)
ffffffff80312c93:        0 	75 15                	jne    ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312c95:        0 	66 8b 44 24 12       	mov    0x12(%rsp),%ax
ffffffff80312c9a:        0 	66 39 46 08          	cmp    %ax,0x8(%rsi)
ffffffff80312c9e:        0 	75 0a                	jne    ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312ca0:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff80312ca3:        0 	e8 9e fb ff ff       	callq  ffffffff80312846 <avc_node_replace>
ffffffff80312ca8:        0 	eb 2c                	jmp    ffffffff80312cd6 <avc_has_perm_noaudit+0x1e3>
ffffffff80312caa:        0 	48 8b 76 28          	mov    0x28(%rsi),%rsi
ffffffff80312cae:        0 	48 83 ee 28          	sub    $0x28,%rsi
ffffffff80312cb2:        0 	48 8b 56 28          	mov    0x28(%rsi),%rdx
ffffffff80312cb6:        0 	48 8d 46 28          	lea    0x28(%rsi),%rax
ffffffff80312cba:        0 	4c 39 f8             	cmp    %r15,%rax
ffffffff80312cbd:        0 	0f 18 0a             	prefetcht0 (%rdx)
ffffffff80312cc0:        0 	75 c8                	jne    ffffffff80312c8a <avc_has_perm_noaudit+0x197>
ffffffff80312cc2:        0 	48 8d 43 28          	lea    0x28(%rbx),%rax
ffffffff80312cc6:        0 	48 89 53 28          	mov    %rdx,0x28(%rbx)
ffffffff80312cca:        0 	4c 89 78 08          	mov    %r15,0x8(%rax)
ffffffff80312cce:        0 	48 89 46 28          	mov    %rax,0x28(%rsi)
ffffffff80312cd2:        0 	48 89 42 08          	mov    %rax,0x8(%rdx)
ffffffff80312cd6:        0 	4c 89 f6             	mov    %r14,%rsi
ffffffff80312cd9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312cdc:        0 	e8 20 3d 20 00       	callq  ffffffff80516a01 <_spin_unlock_irqrestore>
ffffffff80312ce1:        0 	eb 07                	jmp    ffffffff80312cea <avc_has_perm_noaudit+0x1f7>
ffffffff80312ce3:        0 	48 8d 44 24 30       	lea    0x30(%rsp),%rax
ffffffff80312ce8:        0 	eb 06                	jmp    ffffffff80312cf0 <avc_has_perm_noaudit+0x1fd>
ffffffff80312cea:     2116 	48 89 d8             	mov    %rbx,%rax
ffffffff80312ced:     7632 	45 31 f6             	xor    %r14d,%r14d
ffffffff80312cf0:        1 	48 83 3c 24 00       	cmpq   $0x0,(%rsp)
ffffffff80312cf5:      404 	74 10                	je     ffffffff80312d07 <avc_has_perm_noaudit+0x214>
ffffffff80312cf7:     1804 	48 8d 70 0c          	lea    0xc(%rax),%rsi
ffffffff80312cfb:        0 	b9 05 00 00 00       	mov    $0x5,%ecx
ffffffff80312d00:      378 	48 8b 3c 24          	mov    (%rsp),%rdi
ffffffff80312d04:     8174 	fc                   	cld    
ffffffff80312d05:    26860 	f3 a5                	rep movsl %ds:(%rsi),%es:(%rdi)
ffffffff80312d07:    11573 	8b 40 0c             	mov    0xc(%rax),%eax
ffffffff80312d0a:     1997 	f7 d0                	not    %eax
ffffffff80312d0c:        0 	85 44 24 0c          	test   %eax,0xc(%rsp)
ffffffff80312d10:        0 	0f 84 1d 01 00 00    	je     ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312d16:        0 	f6 44 24 08 01       	testb  $0x1,0x8(%rsp)
ffffffff80312d1b:        0 	0f 85 f4 00 00 00    	jne    ffffffff80312e15 <avc_has_perm_noaudit+0x322>
ffffffff80312d21:        0 	83 3d 5c 66 78 00 00 	cmpl   $0x0,0x78665c(%rip)        # ffffffff80a99384 <selinux_enforcing>
ffffffff80312d28:        0 	74 10                	je     ffffffff80312d3a <avc_has_perm_noaudit+0x247>
ffffffff80312d2a:        0 	44 89 e7             	mov    %r12d,%edi
ffffffff80312d2d:        0 	e8 87 f9 00 00       	callq  ffffffff803226b9 <security_permissive_sid>
ffffffff80312d32:        0 	85 c0                	test   %eax,%eax
ffffffff80312d34:        0 	0f 84 db 00 00 00    	je     ffffffff80312e15 <avc_has_perm_noaudit+0x322>
ffffffff80312d3a:        0 	e8 30 fc ff ff       	callq  ffffffff8031296f <avc_alloc_node>
ffffffff80312d3f:        0 	48 85 c0             	test   %rax,%rax
ffffffff80312d42:        0 	48 89 c5             	mov    %rax,%rbp
ffffffff80312d45:        0 	0f 84 e8 00 00 00    	je     ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312d4b:        0 	48 8b 44 24 18       	mov    0x18(%rsp),%rax
ffffffff80312d50:        0 	48 8d 04 85 60 8b a9 	lea    -0x7f5674a0(,%rax,4),%rax
ffffffff80312d57:        0 	80 
ffffffff80312d58:        0 	48 89 c7             	mov    %rax,%rdi
ffffffff80312d5b:        0 	48 89 44 24 28       	mov    %rax,0x28(%rsp)
ffffffff80312d60:        0 	e8 61 3b 20 00       	callq  ffffffff805168c6 <_spin_lock_irqsave>
ffffffff80312d65:        0 	49 8b 1f             	mov    (%r15),%rbx
ffffffff80312d68:        0 	48 89 44 24 20       	mov    %rax,0x20(%rsp)
ffffffff80312d6d:        0 	eb 1a                	jmp    ffffffff80312d89 <avc_has_perm_noaudit+0x296>
ffffffff80312d6f:        0 	44 3b 23             	cmp    (%rbx),%r12d
ffffffff80312d72:        0 	75 11                	jne    ffffffff80312d85 <avc_has_perm_noaudit+0x292>
ffffffff80312d74:        0 	44 3b 6b 04          	cmp    0x4(%rbx),%r13d
ffffffff80312d78:        0 	75 0b                	jne    ffffffff80312d85 <avc_has_perm_noaudit+0x292>
ffffffff80312d7a:        0 	66 8b 44 24 12       	mov    0x12(%rsp),%ax
ffffffff80312d7f:        0 	66 3b 43 08          	cmp    0x8(%rbx),%ax
ffffffff80312d83:        0 	74 1a                	je     ffffffff80312d9f <avc_has_perm_noaudit+0x2ac>
ffffffff80312d85:        0 	48 8b 5b 28          	mov    0x28(%rbx),%rbx
ffffffff80312d89:        0 	48 83 eb 28          	sub    $0x28,%rbx
ffffffff80312d8d:        0 	48 8b 43 28          	mov    0x28(%rbx),%rax
ffffffff80312d91:        0 	0f 18 08             	prefetcht0 (%rax)
ffffffff80312d94:        0 	48 8d 43 28          	lea    0x28(%rbx),%rax
ffffffff80312d98:        0 	4c 39 f8             	cmp    %r15,%rax
ffffffff80312d9b:        0 	75 d2                	jne    ffffffff80312d6f <avc_has_perm_noaudit+0x27c>
ffffffff80312d9d:        0 	eb 29                	jmp    ffffffff80312dc8 <avc_has_perm_noaudit+0x2d5>
ffffffff80312d9f:        0 	8b 4c 24 14          	mov    0x14(%rsp),%ecx
ffffffff80312da3:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff80312da6:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312da9:        0 	49 89 d8             	mov    %rbx,%r8
ffffffff80312dac:        0 	44 89 ea             	mov    %r13d,%edx
ffffffff80312daf:        0 	e8 16 fa ff ff       	callq  ffffffff803127ca <avc_node_populate>
ffffffff80312db4:        0 	8b 44 24 0c          	mov    0xc(%rsp),%eax
ffffffff80312db8:        0 	09 45 0c             	or     %eax,0xc(%rbp)
ffffffff80312dbb:        0 	48 89 de             	mov    %rbx,%rsi
ffffffff80312dbe:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312dc1:        0 	e8 80 fa ff ff       	callq  ffffffff80312846 <avc_node_replace>
ffffffff80312dc6:        0 	eb 3c                	jmp    ffffffff80312e04 <avc_has_perm_noaudit+0x311>
ffffffff80312dc8:        0 	48 8b 3d a9 65 78 00 	mov    0x7865a9(%rip),%rdi        # ffffffff80a99378 <avc_node_cachep>
ffffffff80312dcf:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff80312dd2:        0 	e8 7b c6 f7 ff       	callq  ffffffff8028f452 <kmem_cache_free>
ffffffff80312dd7:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff80312dde:        0 	00 
ffffffff80312ddf:        0 	89 c0                	mov    %eax,%eax
ffffffff80312de1:        0 	48 c7 c2 d0 26 93 80 	mov    $0xffffffff809326d0,%rdx
ffffffff80312de8:        0 	48 c1 e0 03          	shl    $0x3,%rax
ffffffff80312dec:        0 	48 03 05 4d 2e 5a 00 	add    0x5a2e4d(%rip),%rax        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312df3:        0 	48 8b 00             	mov    (%rax),%rax
ffffffff80312df6:        0 	48 03 50 08          	add    0x8(%rax),%rdx
ffffffff80312dfa:        0 	ff 42 14             	incl   0x14(%rdx)
ffffffff80312dfd:        0 	f0 ff 0d 60 65 78 00 	lock decl 0x786560(%rip)        # ffffffff80a99364 <avc_cache+0x2804>
ffffffff80312e04:        0 	48 8b 74 24 20       	mov    0x20(%rsp),%rsi
ffffffff80312e09:        0 	48 8b 7c 24 28       	mov    0x28(%rsp),%rdi
ffffffff80312e0e:        0 	e8 ee 3b 20 00       	callq  ffffffff80516a01 <_spin_unlock_irqrestore>
ffffffff80312e13:        0 	eb 1e                	jmp    ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312e15:        0 	41 be f3 ff ff ff    	mov    $0xfffffff3,%r14d
ffffffff80312e1b:        0 	eb 16                	jmp    ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312e1d:    35502 	8b 44 24 0c          	mov    0xc(%rsp),%eax
ffffffff80312e21:     4360 	23 43 10             	and    0x10(%rbx),%eax
ffffffff80312e24:        0 	3b 44 24 0c          	cmp    0xc(%rsp),%eax
ffffffff80312e28:        0 	0f 85 b6 fd ff ff    	jne    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e:   104641 	e9 86 fd ff ff       	jmpq   ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33:     2106 	48 83 c4 68          	add    $0x68,%rsp
ffffffff80312e37:        1 	44 89 f0             	mov    %r14d,%eax
ffffffff80312e3a:     2068 	5b                   	pop    %rbx
ffffffff80312e3b:        0 	5d                   	pop    %rbp
ffffffff80312e3c:        8 	41 5c                	pop    %r12
ffffffff80312e3e:     2001 	41 5d                	pop    %r13
ffffffff80312e40:        0 	41 5e                	pop    %r14
ffffffff80312e42:      162 	41 5f                	pop    %r15
ffffffff80312e44:     2107 	c3                   	retq   

its main callsite is:

  ffffffff8031368c:     2809 <avc_has_perm>:
  [...]
  ffffffff803136b6:      651 	e8 38 f4 ff ff       	callq  ffffffff80312af3 <avc_has_perm_noaudit>

avc_has_perm() usage is spread out amongst 3 callsites in 2 selinux 
functions:

selinux_ip_postroute():
  ffffffff80314d02:      491 	e8 85 e9 ff ff       	callq  ffffffff8031368c <avc_has_perm>

selinux_socket_sock_rcv_skb():
  ffffffff80314eea:      461 	e8 9d e7 ff ff       	callq  ffffffff8031368c <avc_has_perm>
  ffffffff80314faf:      476 	e8 d8 e6 ff ff       	callq  ffffffff8031368c <avc_has_perm>

related to networking.

regarding avc_has_perm_noaudit() itself, it has a couple of hot spots:

ffffffff80312b73:     5184 	44 3b 23             	cmp    (%rbx),%r12d
ffffffff80312b76:    62007 	75 11                	jne    ffffffff80312b89 <avc_has_perm_noaudit+0x96>

quick guess: cache-cold-miss site.

ffffffff80312d04:     8174 	fc                   	cld    
ffffffff80312d05:    26860 	f3 a5                	rep movsl %ds:(%rsi),%es:(%rdi)

quick guess: unnecessary copying of something largish via 
memcpy. Probably:

  security/selinux/avc.c:avc_has_perm_noaudit()'s:
  [...]
        if (avd)
                memcpy(avd, &p_ae->avd, sizeof(*avd));
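
The listing loads 5 into %ecx before the "rep movsl", so the copy moves 
five 32-bit words, i.e. 20 bytes. A minimal user-space sketch of that 
conditional copy follows. It assumes a stand-in type: struct avd_sketch 
and copy_avd() are hypothetical names, and the real structure's fields 
are not reproduced.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Stand-in for the access vector decision: five 32-bit words, matching
 * the count of 5 loaded into %ecx before "rep movsl" in the listing.
 * Hypothetical type; the real struct's field names are not shown here.
 */
struct avd_sketch {
	uint32_t word[5];
};

static_assert(sizeof(struct avd_sketch) == 20, "five dwords expected");

/*
 * Copy the cached decision out only when the caller supplied a buffer.
 * Returns the number of bytes copied (0 when dst is NULL).
 */
static size_t copy_avd(struct avd_sketch *dst, const struct avd_sketch *src)
{
	if (!dst)
		return 0;
	memcpy(dst, src, sizeof(*dst));
	return sizeof(*dst);
}
```

A caller that only needs the return code can pass NULL and skip the 
copy entirely, which is the case the "if (avd)" test covers.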

but one of the fattest ones:

ffffffff80312e28:        0 	0f 85 b6 fd ff ff    	jne    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e:   104641 	e9 86 fd ff ff       	jmpq   ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33:     2106 	48 83 c4 68          	add    $0x68,%rsp

that seems to be either a branch mispredict (seems a tad expensive for 
that though), or a cachemiss delayed to the first non-predicted 
branch. Ah, that's most likely the case, we fall through straight from 
here:

ffffffff80312dfd:        0      f0 ff 0d 60 65 78 00    lock decl 0x786560(%rip)

that's an atomic op of some global address, in the hotpath. Not good.

the wider context is:

ffffffff80312e1d:    35502 	8b 44 24 0c          	mov    0xc(%rsp),%eax
ffffffff80312e21:     4360 	23 43 10             	and    0x10(%rbx),%eax
ffffffff80312e24:        0 	3b 44 24 0c          	cmp    0xc(%rsp),%eax
ffffffff80312e28:        0 	0f 85 b6 fd ff ff    	jne    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e:   104641 	e9 86 fd ff ff       	jmpq   ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33:     2106 	48 83 c4 68          	add    $0x68,%rsp

ah, yes. My guess is that the "and 0x10(%rbx)" at ffffffff80312e21 
generated this miss, and this all is avc_update_node()'s 
for-each-list-loop, and:

        spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);

that hash doesn't seem to be working well here. It's done via:

static inline int avc_hash(u32 ssid, u32 tsid, u16 tclass)
{
        return (ssid ^ (tsid<<2) ^ (tclass<<4)) & (AVC_CACHE_SLOTS - 1);
}

AVC_CACHE_SLOTS is 512 - but my usecase likely has a much narrower 
hash key space than that. Increasing the hash size won't work; these 
kinds of things really only start scaling once some natural per-CPU 
construct is found for them.
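
To get a feel for how a narrow key space maps onto the 512 slots, the 
quoted hash can be run in user space. This is only a sketch: avc_hash() 
follows the formula quoted above, while slots_used() and the choice of 
small consecutive ssid/tsid/tclass values are assumptions made for 
illustration, not measurements from the tbench run.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define AVC_CACHE_SLOTS 512

/* Same formula as the avc_hash() quoted above, with user-space types. */
static int avc_hash(uint32_t ssid, uint32_t tsid, uint16_t tclass)
{
	return (ssid ^ (tsid << 2) ^ ((uint32_t)tclass << 4)) &
	       (AVC_CACHE_SLOTS - 1);
}

/*
 * Hypothetical helper: count the distinct slots hit when ssid and tsid
 * each range over [1, nsid] and tclass over [1, nclass].
 */
static int slots_used(uint32_t nsid, uint16_t nclass)
{
	unsigned char seen[AVC_CACHE_SLOTS];
	int used = 0;

	memset(seen, 0, sizeof(seen));
	for (uint32_t s = 1; s <= nsid; s++) {
		for (uint32_t t = 1; t <= nsid; t++) {
			for (uint16_t c = 1; c <= nclass; c++) {
				int h = avc_hash(s, t, c);

				assert(h >= 0 && h < AVC_CACHE_SLOTS);
				if (!seen[h]) {
					seen[h] = 1;
					used++;
				}
			}
		}
	}
	return used;
}
```

Calling slots_used() with small arguments shows how many of the 512 
slots such a key space can reach at all; tuples that share a slot end 
up on the same hash chain and are walked by the list loop in the 
profile above.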

And things like this:

        /* cache hit */
        if (atomic_read(&ret->ae.used) != 1)
                atomic_set(&ret->ae.used, 1);

in avc_search_node() don't really help either as they immediately dirty 
the cacheline in the cache-hit case. Hashed fastpath lookup really 
should only be used to validate security rules in a read-mostly way, 
and cachelines should never be dirtied, as long as it can be avoided.
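
The quoted snippet reads the flag first and stores only when the value 
differs. A minimal user-space sketch of that check-then-set pattern is 
below. It uses C11 atomics rather than the kernel's atomic_t API, and 
struct entry and mark_used() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache entry carrying only the "used" flag. */
struct entry {
	atomic_int used;
};

/*
 * Mark an entry as used on a cache hit. The load is read-only; the
 * store (which dirties the cacheline) happens only if the flag is not
 * already 1. Returns true when a store was performed.
 */
static bool mark_used(struct entry *e)
{
	assert(e != NULL);
	if (atomic_load_explicit(&e->used, memory_order_relaxed) != 1) {
		atomic_store_explicit(&e->used, 1, memory_order_relaxed);
		return true;
	}
	return false;
}
```

How often the store path is taken depends on how frequently something 
else clears the flag, which this sketch does not model.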

Anyway, this function needs a good scalability look as it represents 
3.9% of the total tbench cost. I'd not be surprised if it was possible 
to eliminate more than half of that cost via not too ugly changes.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* (avc_has_perm_noaudit()) Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:20                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger


* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:

> 100.000000 total
> ................
>   3.934833 avc_has_perm_noaudit

this one seems spread out:

                      hits (total: 393483 hits)
                 .........
ffffffff80312af3:     1426 <avc_has_perm_noaudit>:
ffffffff80312af3:     1426 	41 57                	push   %r15
ffffffff80312af5:     6124 	41 56                	push   %r14
ffffffff80312af7:        0 	41 55                	push   %r13
ffffffff80312af9:     1443 	41 89 f5             	mov    %esi,%r13d
ffffffff80312afc:     1577 	41 54                	push   %r12
ffffffff80312afe:        0 	41 89 fc             	mov    %edi,%r12d
ffffffff80312b01:     1310 	55                   	push   %rbp
ffffffff80312b02:     1531 	53                   	push   %rbx
ffffffff80312b03:        3 	48 83 ec 68          	sub    $0x68,%rsp
ffffffff80312b07:     2202 	85 c9                	test   %ecx,%ecx
ffffffff80312b09:        0 	89 4c 24 0c          	mov    %ecx,0xc(%rsp)
ffffffff80312b0d:      550 	44 89 44 24 08       	mov    %r8d,0x8(%rsp)
ffffffff80312b12:     1572 	4c 89 0c 24          	mov    %r9,(%rsp)
ffffffff80312b16:        0 	66 89 54 24 12       	mov    %dx,0x12(%rsp)
ffffffff80312b1b:      588 	75 04                	jne    ffffffff80312b21 <avc_has_perm_noaudit+0x2e>
ffffffff80312b1d:        0 	0f 0b                	ud2a   
ffffffff80312b1f:        0 	eb fe                	jmp    ffffffff80312b1f <avc_has_perm_noaudit+0x2c>
ffffffff80312b21:     1646 	0f b7 44 24 12       	movzwl 0x12(%rsp),%eax
ffffffff80312b26:      829 	48 c7 c2 d0 26 93 80 	mov    $0xffffffff809326d0,%rdx
ffffffff80312b2d:      589 	89 44 24 14          	mov    %eax,0x14(%rsp)
ffffffff80312b31:      698 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff80312b38:        0 	00 
ffffffff80312b39:      791 	89 c0                	mov    %eax,%eax
ffffffff80312b3b:      549 	48 c1 e0 03          	shl    $0x3,%rax
ffffffff80312b3f:      791 	48 03 05 fa 30 5a 00 	add    0x5a30fa(%rip),%rax        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312b46:      864 	48 8b 00             	mov    (%rax),%rax
ffffffff80312b49:      533 	48 03 50 08          	add    0x8(%rax),%rdx
ffffffff80312b4d:      732 	ff 02                	incl   (%rdx)
ffffffff80312b4f:      860 	8b 54 24 14          	mov    0x14(%rsp),%edx
ffffffff80312b53:     1259 	e8 54 fc ff ff       	callq  ffffffff803127ac <avc_hash>
ffffffff80312b58:     2087 	48 98                	cltq   
ffffffff80312b5a:     1015 	48 89 44 24 18       	mov    %rax,0x18(%rsp)
ffffffff80312b5f:        0 	48 c1 e0 04          	shl    $0x4,%rax
ffffffff80312b63:     2944 	4c 8d b8 60 6b a9 80 	lea    -0x7f5694a0(%rax),%r15
ffffffff80312b6a:       71 	48 8b 80 60 6b a9 80 	mov    -0x7f5694a0(%rax),%rax
ffffffff80312b71:     3943 	eb 1a                	jmp    ffffffff80312b8d <avc_has_perm_noaudit+0x9a>
ffffffff80312b73:     5184 	44 3b 23             	cmp    (%rbx),%r12d
ffffffff80312b76:    62007 	75 11                	jne    ffffffff80312b89 <avc_has_perm_noaudit+0x96>
ffffffff80312b78:       11 	66 8b 44 24 12       	mov    0x12(%rsp),%ax
ffffffff80312b7d:        0 	66 3b 43 08          	cmp    0x8(%rbx),%ax
ffffffff80312b81:    11115 	75 06                	jne    ffffffff80312b89 <avc_has_perm_noaudit+0x96>
ffffffff80312b83:        4 	44 3b 6b 04          	cmp    0x4(%rbx),%r13d
ffffffff80312b87:    14224 	74 1a                	je     ffffffff80312ba3 <avc_has_perm_noaudit+0xb0>
ffffffff80312b89:        1 	48 8b 43 28          	mov    0x28(%rbx),%rax
ffffffff80312b8d:     6921 	48 8d 58 d8          	lea    -0x28(%rax),%rbx
ffffffff80312b91:     9654 	48 8b 43 28          	mov    0x28(%rbx),%rax
ffffffff80312b95:      414 	0f 18 08             	prefetcht0 (%rax)
ffffffff80312b98:      227 	48 8d 43 28          	lea    0x28(%rbx),%rax
ffffffff80312b9c:     9617 	4c 39 f8             	cmp    %r15,%rax
ffffffff80312b9f:     1402 	75 d2                	jne    ffffffff80312b73 <avc_has_perm_noaudit+0x80>
ffffffff80312ba1:        0 	eb 41                	jmp    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312ba3:        0 	83 7b 20 01          	cmpl   $0x1,0x20(%rbx)
ffffffff80312ba7:      671 	0f 84 70 02 00 00    	je     ffffffff80312e1d <avc_has_perm_noaudit+0x32a>
ffffffff80312bad:        0 	c7 43 20 01 00 00 00 	movl   $0x1,0x20(%rbx)
ffffffff80312bb4:        0 	e9 64 02 00 00       	jmpq   ffffffff80312e1d <avc_has_perm_noaudit+0x32a>
ffffffff80312bb9:     2118 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff80312bc0:        0 	00 
ffffffff80312bc1:     8245 	89 d2                	mov    %edx,%edx
ffffffff80312bc3:        0 	48 c7 c0 d0 26 93 80 	mov    $0xffffffff809326d0,%rax
ffffffff80312bca:      511 	48 c1 e2 03          	shl    $0x3,%rdx
ffffffff80312bce:    11308 	48 03 15 6b 30 5a 00 	add    0x5a306b(%rip),%rdx        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312bd5:        0 	48 8b 12             	mov    (%rdx),%rdx
ffffffff80312bd8:       35 	48 03 42 08          	add    0x8(%rdx),%rax
ffffffff80312bdc:     2224 	ff 40 04             	incl   0x4(%rax)
ffffffff80312bdf:        1 	e9 06 01 00 00       	jmpq   ffffffff80312cea <avc_has_perm_noaudit+0x1f7>
ffffffff80312be4:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff80312beb:        0 	00 
ffffffff80312bec:        0 	89 d2                	mov    %edx,%edx
ffffffff80312bee:        0 	48 c7 c0 d0 26 93 80 	mov    $0xffffffff809326d0,%rax
ffffffff80312bf5:        0 	48 8d 6c 24 30       	lea    0x30(%rsp),%rbp
ffffffff80312bfa:        0 	48 c1 e2 03          	shl    $0x3,%rdx
ffffffff80312bfe:        0 	48 03 15 3b 30 5a 00 	add    0x5a303b(%rip),%rdx        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312c05:        0 	44 89 ee             	mov    %r13d,%esi
ffffffff80312c08:        0 	4c 8d 45 0c          	lea    0xc(%rbp),%r8
ffffffff80312c0c:        0 	44 89 e7             	mov    %r12d,%edi
ffffffff80312c0f:        0 	48 8b 12             	mov    (%rdx),%rdx
ffffffff80312c12:        0 	48 03 42 08          	add    0x8(%rdx),%rax
ffffffff80312c16:        0 	ff 40 08             	incl   0x8(%rax)
ffffffff80312c19:        0 	8b 4c 24 0c          	mov    0xc(%rsp),%ecx
ffffffff80312c1d:        0 	8b 54 24 14          	mov    0x14(%rsp),%edx
ffffffff80312c21:        0 	e8 ee 0a 01 00       	callq  ffffffff80323714 <security_compute_av>
ffffffff80312c26:        0 	85 c0                	test   %eax,%eax
ffffffff80312c28:        0 	41 89 c6             	mov    %eax,%r14d
ffffffff80312c2b:        0 	0f 85 02 02 00 00    	jne    ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312c31:        0 	8b 7c 24 4c          	mov    0x4c(%rsp),%edi
ffffffff80312c35:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff80312c3a:        0 	e8 a5 fb ff ff       	callq  ffffffff803127e4 <avc_latest_notif_update>
ffffffff80312c3f:        0 	85 c0                	test   %eax,%eax
ffffffff80312c41:        0 	0f 85 9c 00 00 00    	jne    ffffffff80312ce3 <avc_has_perm_noaudit+0x1f0>
ffffffff80312c47:        0 	e8 23 fd ff ff       	callq  ffffffff8031296f <avc_alloc_node>
ffffffff80312c4c:        0 	48 85 c0             	test   %rax,%rax
ffffffff80312c4f:        0 	48 89 c3             	mov    %rax,%rbx
ffffffff80312c52:        0 	0f 84 8b 00 00 00    	je     ffffffff80312ce3 <avc_has_perm_noaudit+0x1f0>
ffffffff80312c58:        0 	8b 4c 24 14          	mov    0x14(%rsp),%ecx
ffffffff80312c5c:        0 	49 89 e8             	mov    %rbp,%r8
ffffffff80312c5f:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff80312c62:        0 	48 89 c7             	mov    %rax,%rdi
ffffffff80312c65:        0 	44 89 ea             	mov    %r13d,%edx
ffffffff80312c68:        0 	e8 5d fb ff ff       	callq  ffffffff803127ca <avc_node_populate>
ffffffff80312c6d:        0 	48 8b 44 24 18       	mov    0x18(%rsp),%rax
ffffffff80312c72:        0 	48 8d 2c 85 60 8b a9 	lea    -0x7f5674a0(,%rax,4),%rbp
ffffffff80312c79:        0 	80 
ffffffff80312c7a:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312c7d:        0 	e8 44 3c 20 00       	callq  ffffffff805168c6 <_spin_lock_irqsave>
ffffffff80312c82:        0 	49 8b 37             	mov    (%r15),%rsi
ffffffff80312c85:        0 	49 89 c6             	mov    %rax,%r14
ffffffff80312c88:        0 	eb 24                	jmp    ffffffff80312cae <avc_has_perm_noaudit+0x1bb>
ffffffff80312c8a:        0 	44 39 26             	cmp    %r12d,(%rsi)
ffffffff80312c8d:        0 	75 1b                	jne    ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312c8f:        0 	44 39 6e 04          	cmp    %r13d,0x4(%rsi)
ffffffff80312c93:        0 	75 15                	jne    ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312c95:        0 	66 8b 44 24 12       	mov    0x12(%rsp),%ax
ffffffff80312c9a:        0 	66 39 46 08          	cmp    %ax,0x8(%rsi)
ffffffff80312c9e:        0 	75 0a                	jne    ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312ca0:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff80312ca3:        0 	e8 9e fb ff ff       	callq  ffffffff80312846 <avc_node_replace>
ffffffff80312ca8:        0 	eb 2c                	jmp    ffffffff80312cd6 <avc_has_perm_noaudit+0x1e3>
ffffffff80312caa:        0 	48 8b 76 28          	mov    0x28(%rsi),%rsi
ffffffff80312cae:        0 	48 83 ee 28          	sub    $0x28,%rsi
ffffffff80312cb2:        0 	48 8b 56 28          	mov    0x28(%rsi),%rdx
ffffffff80312cb6:        0 	48 8d 46 28          	lea    0x28(%rsi),%rax
ffffffff80312cba:        0 	4c 39 f8             	cmp    %r15,%rax
ffffffff80312cbd:        0 	0f 18 0a             	prefetcht0 (%rdx)
ffffffff80312cc0:        0 	75 c8                	jne    ffffffff80312c8a <avc_has_perm_noaudit+0x197>
ffffffff80312cc2:        0 	48 8d 43 28          	lea    0x28(%rbx),%rax
ffffffff80312cc6:        0 	48 89 53 28          	mov    %rdx,0x28(%rbx)
ffffffff80312cca:        0 	4c 89 78 08          	mov    %r15,0x8(%rax)
ffffffff80312cce:        0 	48 89 46 28          	mov    %rax,0x28(%rsi)
ffffffff80312cd2:        0 	48 89 42 08          	mov    %rax,0x8(%rdx)
ffffffff80312cd6:        0 	4c 89 f6             	mov    %r14,%rsi
ffffffff80312cd9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312cdc:        0 	e8 20 3d 20 00       	callq  ffffffff80516a01 <_spin_unlock_irqrestore>
ffffffff80312ce1:        0 	eb 07                	jmp    ffffffff80312cea <avc_has_perm_noaudit+0x1f7>
ffffffff80312ce3:        0 	48 8d 44 24 30       	lea    0x30(%rsp),%rax
ffffffff80312ce8:        0 	eb 06                	jmp    ffffffff80312cf0 <avc_has_perm_noaudit+0x1fd>
ffffffff80312cea:     2116 	48 89 d8             	mov    %rbx,%rax
ffffffff80312ced:     7632 	45 31 f6             	xor    %r14d,%r14d
ffffffff80312cf0:        1 	48 83 3c 24 00       	cmpq   $0x0,(%rsp)
ffffffff80312cf5:      404 	74 10                	je     ffffffff80312d07 <avc_has_perm_noaudit+0x214>
ffffffff80312cf7:     1804 	48 8d 70 0c          	lea    0xc(%rax),%rsi
ffffffff80312cfb:        0 	b9 05 00 00 00       	mov    $0x5,%ecx
ffffffff80312d00:      378 	48 8b 3c 24          	mov    (%rsp),%rdi
ffffffff80312d04:     8174 	fc                   	cld    
ffffffff80312d05:    26860 	f3 a5                	rep movsl %ds:(%rsi),%es:(%rdi)
ffffffff80312d07:    11573 	8b 40 0c             	mov    0xc(%rax),%eax
ffffffff80312d0a:     1997 	f7 d0                	not    %eax
ffffffff80312d0c:        0 	85 44 24 0c          	test   %eax,0xc(%rsp)
ffffffff80312d10:        0 	0f 84 1d 01 00 00    	je     ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312d16:        0 	f6 44 24 08 01       	testb  $0x1,0x8(%rsp)
ffffffff80312d1b:        0 	0f 85 f4 00 00 00    	jne    ffffffff80312e15 <avc_has_perm_noaudit+0x322>
ffffffff80312d21:        0 	83 3d 5c 66 78 00 00 	cmpl   $0x0,0x78665c(%rip)        # ffffffff80a99384 <selinux_enforcing>
ffffffff80312d28:        0 	74 10                	je     ffffffff80312d3a <avc_has_perm_noaudit+0x247>
ffffffff80312d2a:        0 	44 89 e7             	mov    %r12d,%edi
ffffffff80312d2d:        0 	e8 87 f9 00 00       	callq  ffffffff803226b9 <security_permissive_sid>
ffffffff80312d32:        0 	85 c0                	test   %eax,%eax
ffffffff80312d34:        0 	0f 84 db 00 00 00    	je     ffffffff80312e15 <avc_has_perm_noaudit+0x322>
ffffffff80312d3a:        0 	e8 30 fc ff ff       	callq  ffffffff8031296f <avc_alloc_node>
ffffffff80312d3f:        0 	48 85 c0             	test   %rax,%rax
ffffffff80312d42:        0 	48 89 c5             	mov    %rax,%rbp
ffffffff80312d45:        0 	0f 84 e8 00 00 00    	je     ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312d4b:        0 	48 8b 44 24 18       	mov    0x18(%rsp),%rax
ffffffff80312d50:        0 	48 8d 04 85 60 8b a9 	lea    -0x7f5674a0(,%rax,4),%rax
ffffffff80312d57:        0 	80 
ffffffff80312d58:        0 	48 89 c7             	mov    %rax,%rdi
ffffffff80312d5b:        0 	48 89 44 24 28       	mov    %rax,0x28(%rsp)
ffffffff80312d60:        0 	e8 61 3b 20 00       	callq  ffffffff805168c6 <_spin_lock_irqsave>
ffffffff80312d65:        0 	49 8b 1f             	mov    (%r15),%rbx
ffffffff80312d68:        0 	48 89 44 24 20       	mov    %rax,0x20(%rsp)
ffffffff80312d6d:        0 	eb 1a                	jmp    ffffffff80312d89 <avc_has_perm_noaudit+0x296>
ffffffff80312d6f:        0 	44 3b 23             	cmp    (%rbx),%r12d
ffffffff80312d72:        0 	75 11                	jne    ffffffff80312d85 <avc_has_perm_noaudit+0x292>
ffffffff80312d74:        0 	44 3b 6b 04          	cmp    0x4(%rbx),%r13d
ffffffff80312d78:        0 	75 0b                	jne    ffffffff80312d85 <avc_has_perm_noaudit+0x292>
ffffffff80312d7a:        0 	66 8b 44 24 12       	mov    0x12(%rsp),%ax
ffffffff80312d7f:        0 	66 3b 43 08          	cmp    0x8(%rbx),%ax
ffffffff80312d83:        0 	74 1a                	je     ffffffff80312d9f <avc_has_perm_noaudit+0x2ac>
ffffffff80312d85:        0 	48 8b 5b 28          	mov    0x28(%rbx),%rbx
ffffffff80312d89:        0 	48 83 eb 28          	sub    $0x28,%rbx
ffffffff80312d8d:        0 	48 8b 43 28          	mov    0x28(%rbx),%rax
ffffffff80312d91:        0 	0f 18 08             	prefetcht0 (%rax)
ffffffff80312d94:        0 	48 8d 43 28          	lea    0x28(%rbx),%rax
ffffffff80312d98:        0 	4c 39 f8             	cmp    %r15,%rax
ffffffff80312d9b:        0 	75 d2                	jne    ffffffff80312d6f <avc_has_perm_noaudit+0x27c>
ffffffff80312d9d:        0 	eb 29                	jmp    ffffffff80312dc8 <avc_has_perm_noaudit+0x2d5>
ffffffff80312d9f:        0 	8b 4c 24 14          	mov    0x14(%rsp),%ecx
ffffffff80312da3:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff80312da6:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312da9:        0 	49 89 d8             	mov    %rbx,%r8
ffffffff80312dac:        0 	44 89 ea             	mov    %r13d,%edx
ffffffff80312daf:        0 	e8 16 fa ff ff       	callq  ffffffff803127ca <avc_node_populate>
ffffffff80312db4:        0 	8b 44 24 0c          	mov    0xc(%rsp),%eax
ffffffff80312db8:        0 	09 45 0c             	or     %eax,0xc(%rbp)
ffffffff80312dbb:        0 	48 89 de             	mov    %rbx,%rsi
ffffffff80312dbe:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80312dc1:        0 	e8 80 fa ff ff       	callq  ffffffff80312846 <avc_node_replace>
ffffffff80312dc6:        0 	eb 3c                	jmp    ffffffff80312e04 <avc_has_perm_noaudit+0x311>
ffffffff80312dc8:        0 	48 8b 3d a9 65 78 00 	mov    0x7865a9(%rip),%rdi        # ffffffff80a99378 <avc_node_cachep>
ffffffff80312dcf:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff80312dd2:        0 	e8 7b c6 f7 ff       	callq  ffffffff8028f452 <kmem_cache_free>
ffffffff80312dd7:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff80312dde:        0 	00 
ffffffff80312ddf:        0 	89 c0                	mov    %eax,%eax
ffffffff80312de1:        0 	48 c7 c2 d0 26 93 80 	mov    $0xffffffff809326d0,%rdx
ffffffff80312de8:        0 	48 c1 e0 03          	shl    $0x3,%rax
ffffffff80312dec:        0 	48 03 05 4d 2e 5a 00 	add    0x5a2e4d(%rip),%rax        # ffffffff808b5c40 <_cpu_pda>
ffffffff80312df3:        0 	48 8b 00             	mov    (%rax),%rax
ffffffff80312df6:        0 	48 03 50 08          	add    0x8(%rax),%rdx
ffffffff80312dfa:        0 	ff 42 14             	incl   0x14(%rdx)
ffffffff80312dfd:        0 	f0 ff 0d 60 65 78 00 	lock decl 0x786560(%rip)        # ffffffff80a99364 <avc_cache+0x2804>
ffffffff80312e04:        0 	48 8b 74 24 20       	mov    0x20(%rsp),%rsi
ffffffff80312e09:        0 	48 8b 7c 24 28       	mov    0x28(%rsp),%rdi
ffffffff80312e0e:        0 	e8 ee 3b 20 00       	callq  ffffffff80516a01 <_spin_unlock_irqrestore>
ffffffff80312e13:        0 	eb 1e                	jmp    ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312e15:        0 	41 be f3 ff ff ff    	mov    $0xfffffff3,%r14d
ffffffff80312e1b:        0 	eb 16                	jmp    ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312e1d:    35502 	8b 44 24 0c          	mov    0xc(%rsp),%eax
ffffffff80312e21:     4360 	23 43 10             	and    0x10(%rbx),%eax
ffffffff80312e24:        0 	3b 44 24 0c          	cmp    0xc(%rsp),%eax
ffffffff80312e28:        0 	0f 85 b6 fd ff ff    	jne    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e:   104641 	e9 86 fd ff ff       	jmpq   ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33:     2106 	48 83 c4 68          	add    $0x68,%rsp
ffffffff80312e37:        1 	44 89 f0             	mov    %r14d,%eax
ffffffff80312e3a:     2068 	5b                   	pop    %rbx
ffffffff80312e3b:        0 	5d                   	pop    %rbp
ffffffff80312e3c:        8 	41 5c                	pop    %r12
ffffffff80312e3e:     2001 	41 5d                	pop    %r13
ffffffff80312e40:        0 	41 5e                	pop    %r14
ffffffff80312e42:      162 	41 5f                	pop    %r15
ffffffff80312e44:     2107 	c3                   	retq   

its main callsite is:

  ffffffff8031368c:     2809 <avc_has_perm>:
  [...]
  ffffffff803136b6:      651 	e8 38 f4 ff ff       	callq  ffffffff80312af3 <avc_has_perm_noaudit>

avc_has_perm() usage is spread out amongst 3 callsites in 2 selinux 
functions:

selinux_ip_postroute():
  ffffffff80314d02:      491 	e8 85 e9 ff ff       	callq  ffffffff8031368c <avc_has_perm>

selinux_socket_sock_rcv_skb():
  ffffffff80314eea:      461 	e8 9d e7 ff ff       	callq  ffffffff8031368c <avc_has_perm>
  ffffffff80314faf:      476 	e8 d8 e6 ff ff       	callq  ffffffff8031368c <avc_has_perm>

related to networking.

regarding avc_has_perm_noaudit() itself, it has a couple of hot spots:

ffffffff80312b73:     5184 	44 3b 23             	cmp    (%rbx),%r12d
ffffffff80312b76:    62007 	75 11                	jne    ffffffff80312b89 <avc_has_perm_noaudit+0x96>

quick guess: cache-cold-miss site.

ffffffff80312d04:     8174 	fc                   	cld    
ffffffff80312d05:    26860 	f3 a5                	rep movsl %ds:(%rsi),%es:(%rdi)

quick guess: unnecessary copy of something largish via 
memcpy. Probably:

  security/selinux/avc.c:avc_has_perm_noaudit()'s:
  [...]
        if (avd)
                memcpy(avd, &p_ae->avd, sizeof(*avd));
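
As a rough illustration of what that copy amounts to: the "mov $0x5,%ecx" 
ahead of the "rep movsl" means five 32-bit words, i.e. 20 bytes, are copied 
per call. The sketch below assumes a five-field layout for the decision 
structure. The struct name, field names and helper name are assumptions 
made for illustration; only the 20-byte size is read off the disassembly.

```c
#include <stdint.h>
#include <string.h>

/*
 * Assumed layout: five 32-bit fields, 20 bytes in total. The field names
 * are illustrative; only the total size is inferred from the disassembly
 * (mov $0x5,%ecx followed by rep movsl).
 */
struct av_decision_sketch {
        uint32_t allowed;
        uint32_t decided;
        uint32_t auditallow;
        uint32_t auditdeny;
        uint32_t seqno;
};

/* Mirrors the quoted snippet: copy the cached decision out if asked to. */
static void copy_decision(struct av_decision_sketch *avd,
                          const struct av_decision_sketch *cached)
{
        if (avd)
                memcpy(avd, cached, sizeof(*avd));
}
```

A caller that only needs the allowed bits could pass NULL and skip the 
copy entirely, which is what the "cmpq $0x0,(%rsp)" / "je" pair in the 
listing checks for.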

but one of the fattest ones:

ffffffff80312e28:        0 	0f 85 b6 fd ff ff    	jne    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e:   104641 	e9 86 fd ff ff       	jmpq   ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33:     2106 	48 83 c4 68          	add    $0x68,%rsp

that seems to be either a branch mispredict (seems a tad expensive for 
that though), or a cachemiss delayed to the first non-predicted 
branch. Ah, that's most likely the case, we fall through straight from 
here:

ffffffff80312dfd:        0      f0 ff 0d 60 65 78 00    lock decl 0x786560(%rip)

that's an atomic op of some global address, in the hotpath. Not good.

the wider context is:

ffffffff80312e1d:    35502 	8b 44 24 0c          	mov    0xc(%rsp),%eax
ffffffff80312e21:     4360 	23 43 10             	and    0x10(%rbx),%eax
ffffffff80312e24:        0 	3b 44 24 0c          	cmp    0xc(%rsp),%eax
ffffffff80312e28:        0 	0f 85 b6 fd ff ff    	jne    ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e:   104641 	e9 86 fd ff ff       	jmpq   ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33:     2106 	48 83 c4 68          	add    $0x68,%rsp

ah, yes. My guess is that the "and 0x10(%rbx),%eax" at ffffffff80312e21 
generated this miss, and this all is avc_update_node()'s 
for-each-list-loop, and:

        spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);

that hash doesn't seem to be working well here. It's done via:

static inline int avc_hash(u32 ssid, u32 tsid, u16 tclass)
{
        return (ssid ^ (tsid<<2) ^ (tclass<<4)) & (AVC_CACHE_SLOTS - 1);
}

AVC_CACHE_SLOTS is 512 - but my usecase likely has a much narrower 
hash key space than that. Increasing the hash size won't work; these 
kinds of things really only start scaling once some natural per-CPU 
construct is found for them.
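
To make the "narrow key space" point concrete, here is a small userspace 
sketch that reuses the hash quoted above and counts how many of the 512 
slots a handful of keys actually touch. The key ranges fed to it are made 
up for illustration; the real SIDs and classes in the tbench run are not 
known from this profile. count_used_slots() is a hypothetical helper, not 
something in avc.c.

```c
#include <stddef.h>
#include <stdint.h>

#define AVC_CACHE_SLOTS 512

/* Same hash as quoted above from security/selinux/avc.c. */
static inline int avc_hash(uint32_t ssid, uint32_t tsid, uint16_t tclass)
{
        return (ssid ^ (tsid << 2) ^ (tclass << 4)) & (AVC_CACHE_SLOTS - 1);
}

/*
 * Hash every (ssid, tsid, tclass) combination with ssid and tsid below
 * nr_sids and tclass below nr_classes, and count the distinct slots hit.
 */
static size_t count_used_slots(uint32_t nr_sids, uint16_t nr_classes)
{
        unsigned char used[AVC_CACHE_SLOTS] = { 0 };
        size_t n = 0;

        for (uint32_t s = 0; s < nr_sids; s++)
                for (uint32_t t = 0; t < nr_sids; t++)
                        for (uint16_t c = 0; c < nr_classes; c++)
                                used[avc_hash(s, t, c)] = 1;

        for (size_t i = 0; i < AVC_CACHE_SLOTS; i++)
                n += used[i];
        return n;
}
```

With a single (ssid, tsid, tclass) triple only one slot, and therefore one 
slots_lock entry, is ever used, no matter how large the table is. That is 
why a bigger table would not change the picture for such a workload.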

And things like this:

        /* cache hit */
        if (atomic_read(&ret->ae.used) != 1)
                atomic_set(&ret->ae.used, 1);

in avc_search_node() don't really help either as they immediately dirty 
the cacheline in the cache-hit case. Hashed fastpath lookup really 
should only be used to validate security rules in a read-mostly way, 
and cachelines should never be dirtied, as long as it can be avoided.

Anyway, this function needs a good scalability look as it represents 
3.9% of the total tbench cost. I'd not be surprised if it was possible 
to eliminate more than half of that cost via not too ugly changes.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:30                                   ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 20:30 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger



On Mon, 17 Nov 2008, David Miller wrote:
> 
> It's on my workstation which is a much simpler 2 processor
> UltraSPARC-IIIi (1.5Ghz) system.

Ok. It could easily be something like a cache footprint issue. And while I 
don't know my sparc CPUs very well, I think the UltraSPARC-IIIi is 
superscalar but does no out-of-order execution or speculation, no? So I 
could easily see that the indirect branches in the scheduler hurt much 
more, and might explain why the x86 profile looks so different.

One thing that non-NMI profiles also tend to show is "clumping", which in 
turn tends to rather excessively pinpoint code sequences that release the 
irq flag - just because those points show up in profiles, rather than 
being a spread-out-mush. So it's possible that Ingo's profile did show the 
scheduler more, but it was in the form of much more spread out "noise" 
rather than the single spike you saw. 

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:32                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   3.356152 ip_queue_xmit

                      hits (335615 total)
                 .........
ffffffff804b7045:     1001 <ip_queue_xmit>:
ffffffff804b7045:     1001 	41 57                	push   %r15
ffffffff804b7047:    36698 	41 56                	push   %r14
ffffffff804b7049:        0 	49 89 fe             	mov    %rdi,%r14
ffffffff804b704c:        0 	41 55                	push   %r13
ffffffff804b704e:      447 	41 54                	push   %r12
ffffffff804b7050:        0 	55                   	push   %rbp
ffffffff804b7051:        4 	53                   	push   %rbx
ffffffff804b7052:      465 	48 83 ec 68          	sub    $0x68,%rsp
ffffffff804b7056:        1 	89 74 24 08          	mov    %esi,0x8(%rsp)
ffffffff804b705a:      486 	48 8b 47 28          	mov    0x28(%rdi),%rax
ffffffff804b705e:        0 	48 8b 6f 10          	mov    0x10(%rdi),%rbp
ffffffff804b7062:        7 	48 85 c0             	test   %rax,%rax
ffffffff804b7065:      480 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
ffffffff804b706a:        0 	4c 8b bd 48 02 00 00 	mov    0x248(%rbp),%r15
ffffffff804b7071:        7 	0f 85 0d 01 00 00    	jne    ffffffff804b7184 <ip_queue_xmit+0x13f>
ffffffff804b7077:      452 	31 f6                	xor    %esi,%esi
ffffffff804b7079:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804b707c:        5 	e8 c1 eb fc ff       	callq  ffffffff80485c42 <__sk_dst_check>
ffffffff804b7081:      434 	48 85 c0             	test   %rax,%rax
ffffffff804b7084:       54 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
ffffffff804b7089:        0 	0f 85 e0 00 00 00    	jne    ffffffff804b716f <ip_queue_xmit+0x12a>
ffffffff804b708f:        0 	4d 85 ff             	test   %r15,%r15
ffffffff804b7092:        0 	44 8b ad 30 02 00 00 	mov    0x230(%rbp),%r13d
ffffffff804b7099:        0 	74 0a                	je     ffffffff804b70a5 <ip_queue_xmit+0x60>
ffffffff804b709b:        0 	41 80 7f 05 00       	cmpb   $0x0,0x5(%r15)
ffffffff804b70a0:        0 	74 03                	je     ffffffff804b70a5 <ip_queue_xmit+0x60>
ffffffff804b70a2:        0 	45 8b 2f             	mov    (%r15),%r13d
ffffffff804b70a5:        0 	8b 85 3c 02 00 00    	mov    0x23c(%rbp),%eax
ffffffff804b70ab:        0 	48 8d b5 10 01 00 00 	lea    0x110(%rbp),%rsi
ffffffff804b70b2:        0 	44 8b 65 04          	mov    0x4(%rbp),%r12d
ffffffff804b70b6:        0 	bf 0d 00 00 00       	mov    $0xd,%edi
ffffffff804b70bb:        0 	89 44 24 0c          	mov    %eax,0xc(%rsp)
ffffffff804b70bf:        0 	8a 9d 54 02 00 00    	mov    0x254(%rbp),%bl
ffffffff804b70c5:        0 	e8 9a df ff ff       	callq  ffffffff804b5064 <constant_test_bit>
ffffffff804b70ca:        0 	31 d2                	xor    %edx,%edx
ffffffff804b70cc:        0 	48 8d 7c 24 10       	lea    0x10(%rsp),%rdi
ffffffff804b70d1:        0 	41 89 c3             	mov    %eax,%r11d
ffffffff804b70d4:        0 	fc                   	cld    
ffffffff804b70d5:        0 	89 d0                	mov    %edx,%eax
ffffffff804b70d7:        0 	b9 10 00 00 00       	mov    $0x10,%ecx
ffffffff804b70dc:        0 	44 8a 45 39          	mov    0x39(%rbp),%r8b
ffffffff804b70e0:        0 	40 8a b5 57 02 00 00 	mov    0x257(%rbp),%sil
ffffffff804b70e7:        0 	44 8b 8d 50 02 00 00 	mov    0x250(%rbp),%r9d
ffffffff804b70ee:        0 	83 e3 1e             	and    $0x1e,%ebx
ffffffff804b70f1:        0 	44 8b 95 38 02 00 00 	mov    0x238(%rbp),%r10d
ffffffff804b70f8:        0 	44 09 db             	or     %r11d,%ebx
ffffffff804b70fb:        0 	f3 ab                	rep stos %eax,%es:(%rdi)
ffffffff804b70fd:        0 	40 c0 ee 05          	shr    $0x5,%sil
ffffffff804b7101:        0 	88 5c 24 24          	mov    %bl,0x24(%rsp)
ffffffff804b7105:        0 	48 8d 5c 24 10       	lea    0x10(%rsp),%rbx
ffffffff804b710a:        0 	83 e6 01             	and    $0x1,%esi
ffffffff804b710d:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804b7110:        0 	44 88 44 24 40       	mov    %r8b,0x40(%rsp)
ffffffff804b7115:        0 	8b 44 24 0c          	mov    0xc(%rsp),%eax
ffffffff804b7119:        0 	40 88 74 24 41       	mov    %sil,0x41(%rsp)
ffffffff804b711e:        0 	48 89 de             	mov    %rbx,%rsi
ffffffff804b7121:        0 	66 44 89 4c 24 44    	mov    %r9w,0x44(%rsp)
ffffffff804b7127:        0 	66 44 89 54 24 46    	mov    %r10w,0x46(%rsp)
ffffffff804b712d:        0 	44 89 64 24 10       	mov    %r12d,0x10(%rsp)
ffffffff804b7132:        0 	44 89 6c 24 1c       	mov    %r13d,0x1c(%rsp)
ffffffff804b7137:        0 	89 44 24 20          	mov    %eax,0x20(%rsp)
ffffffff804b713b:        0 	e8 2d 9f e5 ff       	callq  ffffffff8031106d <security_sk_classify_flow>
ffffffff804b7140:        0 	48 8d 74 24 58       	lea    0x58(%rsp),%rsi
ffffffff804b7145:        0 	45 31 c0             	xor    %r8d,%r8d
ffffffff804b7148:        0 	48 89 e9             	mov    %rbp,%rcx
ffffffff804b714b:        0 	48 89 da             	mov    %rbx,%rdx
ffffffff804b714e:        0 	48 c7 c7 d0 15 ab 80 	mov    $0xffffffff80ab15d0,%rdi
ffffffff804b7155:        0 	e8 1a 91 ff ff       	callq  ffffffff804b0274 <ip_route_output_flow>
ffffffff804b715a:        0 	85 c0                	test   %eax,%eax
ffffffff804b715c:        0 	0f 85 9f 01 00 00    	jne    ffffffff804b7301 <ip_queue_xmit+0x2bc>
ffffffff804b7162:        0 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
ffffffff804b7167:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804b716a:        0 	e8 a8 eb fc ff       	callq  ffffffff80485d17 <sk_setup_caps>
ffffffff804b716f:      441 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
ffffffff804b7174:     1388 	48 85 c0             	test   %rax,%rax
ffffffff804b7177:        0 	74 07                	je     ffffffff804b7180 <ip_queue_xmit+0x13b>
ffffffff804b7179:        0 	f0 ff 80 b0 00 00 00 	lock incl 0xb0(%rax)
ffffffff804b7180:      556 	49 89 46 28          	mov    %rax,0x28(%r14)
ffffffff804b7184:     8351 	4d 85 ff             	test   %r15,%r15
ffffffff804b7187:        0 	be 14 00 00 00       	mov    $0x14,%esi
ffffffff804b718c:      461 	74 26                	je     ffffffff804b71b4 <ip_queue_xmit+0x16f>
ffffffff804b718e:        0 	41 f6 47 08 01       	testb  $0x1,0x8(%r15)
ffffffff804b7193:        0 	74 17                	je     ffffffff804b71ac <ip_queue_xmit+0x167>
ffffffff804b7195:        0 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
ffffffff804b719a:        0 	8b 82 28 01 00 00    	mov    0x128(%rdx),%eax
ffffffff804b71a0:        0 	39 82 1c 01 00 00    	cmp    %eax,0x11c(%rdx)
ffffffff804b71a6:        0 	0f 85 55 01 00 00    	jne    ffffffff804b7301 <ip_queue_xmit+0x2bc>
ffffffff804b71ac:        0 	41 0f b6 47 04       	movzbl 0x4(%r15),%eax
ffffffff804b71b1:        0 	8d 70 14             	lea    0x14(%rax),%esi
ffffffff804b71b4:       39 	4c 89 f7             	mov    %r14,%rdi
ffffffff804b71b7:      493 	e8 f8 18 fd ff       	callq  ffffffff80488ab4 <skb_push>
ffffffff804b71bc:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804b71bf:     1701 	e8 99 df ff ff       	callq  ffffffff804b515d <skb_reset_network_header>
ffffffff804b71c4:      481 	0f b6 85 54 02 00 00 	movzbl 0x254(%rbp),%eax
ffffffff804b71cb:     4202 	41 8b 9e bc 00 00 00 	mov    0xbc(%r14),%ebx
ffffffff804b71d2:        3 	48 89 ef             	mov    %rbp,%rdi
ffffffff804b71d5:        0 	49 03 9e d0 00 00 00 	add    0xd0(%r14),%rbx
ffffffff804b71dc:      466 	80 cc 45             	or     $0x45,%ah
ffffffff804b71df:        7 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804b71e3:        0 	66 89 03             	mov    %ax,(%rbx)
ffffffff804b71e6:      492 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
ffffffff804b71eb:        3 	e8 a0 df ff ff       	callq  ffffffff804b5190 <ip_dont_fragment>
ffffffff804b71f0:     1405 	85 c0                	test   %eax,%eax
ffffffff804b71f2:     4391 	74 0f                	je     ffffffff804b7203 <ip_queue_xmit+0x1be>
ffffffff804b71f4:        0 	83 7c 24 08 00       	cmpl   $0x0,0x8(%rsp)
ffffffff804b71f9:      417 	75 08                	jne    ffffffff804b7203 <ip_queue_xmit+0x1be>
ffffffff804b71fb:      503 	66 c7 43 06 40 00    	movw   $0x40,0x6(%rbx)
ffffffff804b7201:     6743 	eb 06                	jmp    ffffffff804b7209 <ip_queue_xmit+0x1c4>
ffffffff804b7203:        0 	66 c7 43 06 00 00    	movw   $0x0,0x6(%rbx)
ffffffff804b7209:      118 	0f bf 85 40 02 00 00 	movswl 0x240(%rbp),%eax
ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
ffffffff804b7215:      340 	85 c0                	test   %eax,%eax
ffffffff804b7217:        0 	79 06                	jns    ffffffff804b721f <ip_queue_xmit+0x1da>
ffffffff804b7219:   107464 	8b 82 9c 00 00 00    	mov    0x9c(%rdx),%eax
ffffffff804b721f:     4963 	88 43 08             	mov    %al,0x8(%rbx)
ffffffff804b7222:    26297 	8a 45 39             	mov    0x39(%rbp),%al
ffffffff804b7225:    76658 	4d 85 ff             	test   %r15,%r15
ffffffff804b7228:     1712 	88 43 09             	mov    %al,0x9(%rbx)
ffffffff804b722b:      148 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
ffffffff804b7230:     2971 	8b 80 20 01 00 00    	mov    0x120(%rax),%eax
ffffffff804b7236:    14849 	89 43 0c             	mov    %eax,0xc(%rbx)
ffffffff804b7239:       84 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
ffffffff804b723e:      360 	8b 80 1c 01 00 00    	mov    0x11c(%rax),%eax
ffffffff804b7244:      174 	89 43 10             	mov    %eax,0x10(%rbx)
ffffffff804b7247:       96 	74 32                	je     ffffffff804b727b <ip_queue_xmit+0x236>
ffffffff804b7249:        0 	41 8a 57 04          	mov    0x4(%r15),%dl
ffffffff804b724d:        0 	84 d2                	test   %dl,%dl
ffffffff804b724f:        0 	74 2a                	je     ffffffff804b727b <ip_queue_xmit+0x236>
ffffffff804b7251:        0 	c0 ea 02             	shr    $0x2,%dl
ffffffff804b7254:        0 	03 13                	add    (%rbx),%edx
ffffffff804b7256:        0 	8a 03                	mov    (%rbx),%al
ffffffff804b7258:        0 	45 31 c0             	xor    %r8d,%r8d
ffffffff804b725b:        0 	4c 89 fe             	mov    %r15,%rsi
ffffffff804b725e:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804b7261:        0 	83 e0 f0             	and    $0xfffffffffffffff0,%eax
ffffffff804b7264:        0 	83 e2 0f             	and    $0xf,%edx
ffffffff804b7267:        0 	09 d0                	or     %edx,%eax
ffffffff804b7269:        0 	88 03                	mov    %al,(%rbx)
ffffffff804b726b:        0 	48 8b 4c 24 58       	mov    0x58(%rsp),%rcx
ffffffff804b7270:        0 	8b 95 30 02 00 00    	mov    0x230(%rbp),%edx
ffffffff804b7276:        0 	e8 e4 d8 ff ff       	callq  ffffffff804b4b5f <ip_options_build>
ffffffff804b727b:      541 	41 8b 86 c8 00 00 00 	mov    0xc8(%r14),%eax
ffffffff804b7282:      570 	31 d2                	xor    %edx,%edx
ffffffff804b7284:        0 	49 03 86 d0 00 00 00 	add    0xd0(%r14),%rax
ffffffff804b728b:       34 	8b 40 08             	mov    0x8(%rax),%eax
ffffffff804b728e:      496 	66 85 c0             	test   %ax,%ax
ffffffff804b7291:       11 	74 06                	je     ffffffff804b7299 <ip_queue_xmit+0x254>
ffffffff804b7293:        9 	0f b7 c0             	movzwl %ax,%eax
ffffffff804b7296:      495 	8d 50 ff             	lea    -0x1(%rax),%edx
ffffffff804b7299:        2 	f6 43 06 40          	testb  $0x40,0x6(%rbx)
ffffffff804b729d:        9 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
ffffffff804b72a2:      497 	74 34                	je     ffffffff804b72d8 <ip_queue_xmit+0x293>
ffffffff804b72a4:        8 	83 bd 30 02 00 00 00 	cmpl   $0x0,0x230(%rbp)
ffffffff804b72ab:       10 	74 23                	je     ffffffff804b72d0 <ip_queue_xmit+0x28b>
ffffffff804b72ad:     1044 	66 8b 85 52 02 00 00 	mov    0x252(%rbp),%ax
ffffffff804b72b4:        7 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804b72b8:        8 	66 89 43 04          	mov    %ax,0x4(%rbx)
ffffffff804b72bc:      432 	66 8b 85 52 02 00 00 	mov    0x252(%rbp),%ax
ffffffff804b72c3:        9 	ff c0                	inc    %eax
ffffffff804b72c5:       14 	01 d0                	add    %edx,%eax
ffffffff804b72c7:     1141 	66 89 85 52 02 00 00 	mov    %ax,0x252(%rbp)
ffffffff804b72ce:        7 	eb 10                	jmp    ffffffff804b72e0 <ip_queue_xmit+0x29b>
ffffffff804b72d0:        0 	66 c7 43 04 00 00    	movw   $0x0,0x4(%rbx)
ffffffff804b72d6:        0 	eb 08                	jmp    ffffffff804b72e0 <ip_queue_xmit+0x29b>
ffffffff804b72d8:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804b72db:        0 	e8 b7 9d ff ff       	callq  ffffffff804b1097 <__ip_select_ident>
ffffffff804b72e0:        6 	8b 85 54 01 00 00    	mov    0x154(%rbp),%eax
ffffffff804b72e6:      458 	4c 89 f7             	mov    %r14,%rdi
ffffffff804b72e9:        2 	41 89 46 78          	mov    %eax,0x78(%r14)
ffffffff804b72ed:        4 	8b 85 f0 01 00 00    	mov    0x1f0(%rbp),%eax
ffffffff804b72f3:      841 	41 89 86 b0 00 00 00 	mov    %eax,0xb0(%r14)
ffffffff804b72fa:       11 	e8 30 f2 ff ff       	callq  ffffffff804b652f <ip_local_out>
ffffffff804b72ff:        0 	eb 44                	jmp    ffffffff804b7345 <ip_queue_xmit+0x300>
ffffffff804b7301:        0 	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff804b7308:        0 	00 00 
ffffffff804b730a:        0 	8b 80 48 e0 ff ff    	mov    -0x1fb8(%rax),%eax
ffffffff804b7310:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804b7313:        0 	30 c0                	xor    %al,%al
ffffffff804b7315:        0 	66 83 f8 01          	cmp    $0x1,%ax
ffffffff804b7319:        0 	48 19 c0             	sbb    %rax,%rax
ffffffff804b731c:        0 	83 e0 08             	and    $0x8,%eax
ffffffff804b731f:        0 	48 8b 90 a8 16 ab 80 	mov    -0x7f54e958(%rax),%rdx
ffffffff804b7326:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804b732d:        0 	00 
ffffffff804b732e:        0 	89 c0                	mov    %eax,%eax
ffffffff804b7330:        0 	48 f7 d2             	not    %rdx
ffffffff804b7333:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804b7337:        0 	48 ff 40 68          	incq   0x68(%rax)
ffffffff804b733b:        0 	e8 b1 18 fd ff       	callq  ffffffff80488bf1 <kfree_skb>
ffffffff804b7340:        0 	b8 8f ff ff ff       	mov    $0xffffff8f,%eax
ffffffff804b7345:     9196 	48 83 c4 68          	add    $0x68,%rsp
ffffffff804b7349:      892 	5b                   	pop    %rbx
ffffffff804b734a:        0 	5d                   	pop    %rbp
ffffffff804b734b:      488 	41 5c                	pop    %r12
ffffffff804b734d:        0 	41 5d                	pop    %r13
ffffffff804b734f:        0 	41 5e                	pop    %r14
ffffffff804b7351:      513 	41 5f                	pop    %r15
ffffffff804b7353:        0 	c3                   	retq   

about 10% of this function's cost is artificial:

ffffffff804b7045:     1001 <ip_queue_xmit>:
ffffffff804b7045:     1001 	41 57                	push   %r15
ffffffff804b7047:    36698 	41 56                	push   %r14

there are profiler hits that leaked in via out-of-order execution from 
the callsites. The callsites are hard to map unfortunately, as this 
function is called via function pointers.

the most likely callsite is tcp_transmit_skb().

30% of the overhead of this function comes from:

ffffffff804b7203:        0 	66 c7 43 06 00 00    	movw   $0x0,0x6(%rbx)
ffffffff804b7209:      118 	0f bf 85 40 02 00 00 	movswl 0x240(%rbp),%eax
ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
ffffffff804b7215:      340 	85 c0                	test   %eax,%eax
ffffffff804b7217:        0 	79 06                	jns    ffffffff804b721f <ip_queue_xmit+0x1da>
ffffffff804b7219:   107464 	8b 82 9c 00 00 00    	mov    0x9c(%rdx),%eax
ffffffff804b721f:     4963 	88 43 08             	mov    %al,0x8(%rbx)

the 16-bit movw looks a bit weird. It comes from line 372:

 0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
 367		iph = ip_hdr(skb);
 368		*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
 369		if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
 370			iph->frag_off = htons(IP_DF);
 371		else
 372			iph->frag_off = 0;
 373		iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
 374		iph->protocol = sk->sk_protocol;
 375		iph->saddr    = rt->rt_src;
 376		iph->daddr    = rt->rt_dst;

this is the IP header's frag_off field being set to zero.

16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is 
towards eliminating them as much as possible.
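
As a side note on where those 16-bit stores come from: lines 368-372 write two 16-bit fields of the IP header in network byte order. A small user-space sketch (not kernel code; `ex_fill_header_words` and `EX_IP_DF` are names invented here, with `EX_IP_DF` assumed to carry the same 0x4000 value as the kernel's IP_DF) shows why the disassembly stores the immediate $0x40 for htons(IP_DF) on little-endian x86:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed to match the kernel's IP_DF ("don't fragment") flag value. */
#define EX_IP_DF 0x4000u

/*
 * Fill bytes 0-1 (version/IHL/TOS) and bytes 6-7 (frag_off) of a
 * 20-byte IPv4 header, both in network byte order.
 */
static void ex_fill_header_words(uint8_t hdr[20], uint8_t tos, int dont_fragment)
{
	uint16_t w;

	assert(hdr != NULL);

	/* version 4, header length 5 words, then the TOS byte */
	w = htons((uint16_t)((4u << 12) | (5u << 8) | tos));
	memcpy(&hdr[0], &w, sizeof(w));

	/* frag_off: either just the DF bit, or zero */
	w = dont_fragment ? htons(EX_IP_DF) : 0;
	memcpy(&hdr[6], &w, sizeof(w));
}
```

With tos = 0x10 and dont_fragment = 1 the buffer starts with the bytes 0x45 0x10 and holds 0x40 0x00 at offset 6, which lines up with `or $0x45,%ah` / `rol $0x8,%ax` and `movw $0x40,0x6(%rbx)` in the listing.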

_But_, the real overhead probably comes from:

 ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx

which is the next line, the ttl field:

 373             iph->ttl      = ip_select_ttl(inet, &rt->u.dst);

this shows that we are doing a hard cachemiss on the net-localhost 
route dst structure cacheline. We do a plain load instruction from it 
here and get a hefty cachemiss. (because 16 CPUs are banging on that 
single route)
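
One way to read the offsets in the disassembly, offered as an assumption rather than a confirmed diagnosis: the expensive load reads offset 0x9c of the dst, and the `lock incl 0xb0(%rax)` earlier in the function writes offset 0xb0 of the same structure. If the structure starts on a cacheline boundary and lines are 64 bytes (both assumed here), the two offsets fall in the same line, so every atomic increment on one CPU invalidates the line the other CPUs are about to read. The helpers below are hypothetical and only do the offset arithmetic:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cacheline size; not stated in the thread. */
#define EX_CACHELINE 64u

/* Index of the cacheline an offset falls into, for a line-aligned object. */
static size_t ex_cacheline_index(size_t offset)
{
	return offset / EX_CACHELINE;
}

/* True when two field offsets of the same object share a cacheline. */
static int ex_same_cacheline(size_t a, size_t b)
{
	assert(EX_CACHELINE > 0);
	return ex_cacheline_index(a) == ex_cacheline_index(b);
}
```

Here ex_same_cacheline(0x9c, 0xb0) is true: both offsets map to line index 2.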

And let's make sure we see this in perspective as well: that single 
cachemiss is _1.0 percent_ of the total tbench cost. (!) We could make 
the scheduler 10% slower straight away and it would have less of a 
real-life effect than this single iph->ttl field setting.
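
To make the arithmetic behind these percentages explicit, here is a minimal sketch in C. The hit counts are copied from the annotated ip_queue_xmit() profile; the helper names (`ex_pct`, `ex_pct_of_total`) are made up for illustration and are not part of any profiler.

```c
#include <assert.h>

/* Hit counts copied from the ip_queue_xmit() profile. */
#define EX_FUNC_HITS   335615UL            /* all hits in ip_queue_xmit() */
#define EX_ENTRY_HITS  (1001UL + 36698UL)  /* the two pushes at function entry */
#define EX_LOAD_HITS   107464UL            /* mov 0x9c(%rdx),%eax */
#define EX_FUNC_SHARE  3.356152            /* ip_queue_xmit() in percent of all hits */

/* Share of `hits` within `total`, in percent. */
static double ex_pct(unsigned long hits, unsigned long total)
{
	assert(total > 0);
	return 100.0 * (double)hits / (double)total;
}

/* Scale a share within the function to a share of the whole profile. */
static double ex_pct_of_total(unsigned long hits)
{
	return ex_pct(hits, EX_FUNC_HITS) * EX_FUNC_SHARE / 100.0;
}
```

ex_pct(EX_ENTRY_HITS, EX_FUNC_HITS) comes out at roughly 11.2 (the "about 10%" attributed to the callsites), ex_pct(EX_LOAD_HITS, EX_FUNC_HITS) at roughly 32.0 (the single load behind the "30%"), and ex_pct_of_total(EX_LOAD_HITS) at roughly 1.07, which is the "1.0 percent" of total tbench cost.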

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:47                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 20:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   3.038025 skb_release_data

                      hits (303802 total)
                 .........
ffffffff80488c7e:      780 <skb_release_data>:
ffffffff80488c7e:      780 	55                   	push   %rbp
ffffffff80488c7f:   267141 	53                   	push   %rbx
ffffffff80488c80:        0 	48 89 fb             	mov    %rdi,%rbx
ffffffff80488c83:     3552 	48 83 ec 08          	sub    $0x8,%rsp
ffffffff80488c87:      604 	8a 47 7c             	mov    0x7c(%rdi),%al
ffffffff80488c8a:     2644 	a8 02                	test   $0x2,%al
ffffffff80488c8c:       49 	74 2a                	je     ffffffff80488cb8 <skb_release_data+0x3a>
ffffffff80488c8e:        0 	83 e0 10             	and    $0x10,%eax
ffffffff80488c91:     2079 	8b 97 c8 00 00 00    	mov    0xc8(%rdi),%edx
ffffffff80488c97:       53 	3c 01                	cmp    $0x1,%al
ffffffff80488c99:        0 	19 c0                	sbb    %eax,%eax
ffffffff80488c9b:      870 	48 03 97 d0 00 00 00 	add    0xd0(%rdi),%rdx
ffffffff80488ca2:       65 	66 31 c0             	xor    %ax,%ax
ffffffff80488ca5:        0 	05 01 00 01 00       	add    $0x10001,%eax
ffffffff80488caa:      888 	f7 d8                	neg    %eax
ffffffff80488cac:       49 	89 c1                	mov    %eax,%ecx
ffffffff80488cae:        0 	f0 0f c1 0a          	lock xadd %ecx,(%rdx)
ffffffff80488cb2:     1909 	01 c8                	add    %ecx,%eax
ffffffff80488cb4:     1040 	85 c0                	test   %eax,%eax
ffffffff80488cb6:        0 	75 6d                	jne    ffffffff80488d25 <skb_release_data+0xa7>
ffffffff80488cb8:        0 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
ffffffff80488cbe:     4199 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
ffffffff80488cc5:     4995 	31 ed                	xor    %ebp,%ebp
ffffffff80488cc7:        0 	66 83 7c 10 04 00    	cmpw   $0x0,0x4(%rax,%rdx,1)
ffffffff80488ccd:      983 	75 15                	jne    ffffffff80488ce4 <skb_release_data+0x66>
ffffffff80488ccf:       15 	eb 28                	jmp    ffffffff80488cf9 <skb_release_data+0x7b>
ffffffff80488cd1:      665 	48 63 c5             	movslq %ebp,%rax
ffffffff80488cd4:      546 	ff c5                	inc    %ebp
ffffffff80488cd6:      328 	48 c1 e0 04          	shl    $0x4,%rax
ffffffff80488cda:      356 	48 8b 7c 02 20       	mov    0x20(%rdx,%rax,1),%rdi
ffffffff80488cdf:       95 	e8 be 87 de ff       	callq  ffffffff802714a2 <put_page>
ffffffff80488ce4:       66 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
ffffffff80488cea:     1321 	48 03 93 d0 00 00 00 	add    0xd0(%rbx),%rdx
ffffffff80488cf1:      439 	0f b7 42 04          	movzwl 0x4(%rdx),%eax
ffffffff80488cf5:        0 	39 c5                	cmp    %eax,%ebp
ffffffff80488cf7:     1887 	7c d8                	jl     ffffffff80488cd1 <skb_release_data+0x53>
ffffffff80488cf9:     2187 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
ffffffff80488cff:     1784 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
ffffffff80488d06:      422 	48 83 7c 10 18 00    	cmpq   $0x0,0x18(%rax,%rdx,1)
ffffffff80488d0c:      110 	74 08                	je     ffffffff80488d16 <skb_release_data+0x98>
ffffffff80488d0e:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff80488d11:        0 	e8 52 ff ff ff       	callq  ffffffff80488c68 <skb_drop_fraglist>
ffffffff80488d16:       14 	48 8b bb d0 00 00 00 	mov    0xd0(%rbx),%rdi
ffffffff80488d1d:      715 	5e                   	pop    %rsi
ffffffff80488d1e:      109 	5b                   	pop    %rbx
ffffffff80488d1f:       20 	5d                   	pop    %rbp
ffffffff80488d20:      980 	e9 b7 66 e0 ff       	jmpq   ffffffff8028f3dc <kfree>
ffffffff80488d25:        0 	59                   	pop    %rcx
ffffffff80488d26:     1948 	5b                   	pop    %rbx
ffffffff80488d27:        0 	5d                   	pop    %rbp
ffffffff80488d28:        0 	c3                   	retq   

this is a short function, and 90% of the overhead is false leaked-in 
overhead from callsites:

ffffffff80488c7f:   267141 	53                   	push   %rbx

unfortunately i have a hard time mapping its callsites. 
pskb_expand_head() is the only static callsite, but it's not active in 
the profile.

The _usual_ callsite is normally skb_release_all(), which does have 
overhead:

ffffffff80489449:      925 <skb_release_all>:
ffffffff80489449:      925 	53                   	push   %rbx
ffffffff8048944a:     5249 	48 89 fb             	mov    %rdi,%rbx
ffffffff8048944d:        4 	e8 3c ff ff ff       	callq  ffffffff8048938e <skb_release_head_state>
ffffffff80489452:     1149 	48 89 df             	mov    %rbx,%rdi
ffffffff80489455:    13163 	5b                   	pop    %rbx
ffffffff80489456:        0 	e9 23 f8 ff ff       	jmpq   ffffffff80488c7e <skb_release_data>

it is also tail-optimized, which explains why i found few 
callsites. The main callsite of skb_release_all() is:

ffffffff80488b86:       26 	e8 be 08 00 00       	callq  ffffffff80489449 <skb_release_all>

which is __kfree_skb(). That is a frequently referenced function, and 
in my profile there's a single callsite active:

ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>

which is tcp_ack() - subject of a later email. The wider context is:

ffffffff804c0ffc:      433 	41 2b 85 e0 00 00 00 	sub    0xe0(%r13),%eax
ffffffff804c1003:     4843 	89 85 f0 00 00 00    	mov    %eax,0xf0(%rbp)
ffffffff804c1009:     1730 	48 8b 45 30          	mov    0x30(%rbp),%rax
ffffffff804c100d:      311 	41 8b 95 e0 00 00 00 	mov    0xe0(%r13),%edx
ffffffff804c1014:        0 	48 83 b8 b0 00 00 00 	cmpq   $0x0,0xb0(%rax)
ffffffff804c101b:        0 	00 
ffffffff804c101c:      418 	74 06                	je     ffffffff804c1024 <tcp_ack+0x50d>
ffffffff804c101e:       37 	01 95 f4 00 00 00    	add    %edx,0xf4(%rbp)
ffffffff804c1024:        2 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>

this is a good, top-of-the-line x86 CPU with a really good BTB 
implementation that seems to be able to fall through calls and tail 
optimizations as if they weren't there.
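
The tail optimization mentioned here can be sketched in plain C. This is a user-space illustration with made-up names (`ex_buf`, `ex_release_all` and friends), mirroring only the call shape of skb_release_all(), not its actual body; whether the compiler really emits a jmp depends on optimization level and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb; it only counts what was released. */
struct ex_buf {
	int head_released;
	int data_released;
};

static void ex_release_head_state(struct ex_buf *b)
{
	assert(b != NULL);
	b->head_released++;
}

static void ex_release_data(struct ex_buf *b)
{
	assert(b != NULL);
	b->data_released++;
}

/*
 * The call to ex_release_data() is the last thing this function does and
 * nothing happens after it, so an optimizing compiler may emit a plain
 * jmp instead of call + ret. ex_release_data() then returns straight to
 * our caller, and no call instruction targeting it exists in this function.
 */
static void ex_release_all(struct ex_buf *b)
{
	ex_release_head_state(b);
	ex_release_data(b);
}
```

Searching the disassembly for `callq ... <ex_release_data>` would then miss this path entirely, which is the effect described for skb_release_data() above.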

some guesses are:

(gdb) list *0xffffffff804c1003
0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789).
784	
785	static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
786	{
787		skb_truesize_check(skb);
788		sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
789		sk->sk_wmem_queued -= skb->truesize;
790		sk_mem_uncharge(sk, skb->truesize);
791		__kfree_skb(skb);
792	}
793	

both sk and skb should be cache-hot here so this seems unlikely.

(gdb) list *0xffffffff804c1009
0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
731	}
732	
733	static inline int sk_has_account(struct sock *sk)
734	{
735		/* return true if protocol supports memory accounting */
736		return !!sk->sk_prot->memory_allocated;
737	}
738	
739	static inline int sk_wmem_schedule(struct sock *sk, int size)
740	{

this cannot be it - unless sk_prot somehow ends up being dirtied or 
false-shared?

Still, my guess would be on ffffffff804c1009 and a 
sk_prot->memory_allocated cachemiss: look at how this instruction uses 
%rbp, and the one that shows the many hits in skb_release_data() 
pushes %rbp to the stack - that's where the CPU's OOO trick ends: it 
has to compute the result and serialize on the cachemiss.

	Ingo


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:47                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 20:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger


* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:

> 100.000000 total
> ................
>   3.038025 skb_release_data

                      hits (303802 total)
                 .........
ffffffff80488c7e:      780 <skb_release_data>:
ffffffff80488c7e:      780 	55                   	push   %rbp
ffffffff80488c7f:   267141 	53                   	push   %rbx
ffffffff80488c80:        0 	48 89 fb             	mov    %rdi,%rbx
ffffffff80488c83:     3552 	48 83 ec 08          	sub    $0x8,%rsp
ffffffff80488c87:      604 	8a 47 7c             	mov    0x7c(%rdi),%al
ffffffff80488c8a:     2644 	a8 02                	test   $0x2,%al
ffffffff80488c8c:       49 	74 2a                	je     ffffffff80488cb8 <skb_release_data+0x3a>
ffffffff80488c8e:        0 	83 e0 10             	and    $0x10,%eax
ffffffff80488c91:     2079 	8b 97 c8 00 00 00    	mov    0xc8(%rdi),%edx
ffffffff80488c97:       53 	3c 01                	cmp    $0x1,%al
ffffffff80488c99:        0 	19 c0                	sbb    %eax,%eax
ffffffff80488c9b:      870 	48 03 97 d0 00 00 00 	add    0xd0(%rdi),%rdx
ffffffff80488ca2:       65 	66 31 c0             	xor    %ax,%ax
ffffffff80488ca5:        0 	05 01 00 01 00       	add    $0x10001,%eax
ffffffff80488caa:      888 	f7 d8                	neg    %eax
ffffffff80488cac:       49 	89 c1                	mov    %eax,%ecx
ffffffff80488cae:        0 	f0 0f c1 0a          	lock xadd %ecx,(%rdx)
ffffffff80488cb2:     1909 	01 c8                	add    %ecx,%eax
ffffffff80488cb4:     1040 	85 c0                	test   %eax,%eax
ffffffff80488cb6:        0 	75 6d                	jne    ffffffff80488d25 <skb_release_data+0xa7>
ffffffff80488cb8:        0 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
ffffffff80488cbe:     4199 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
ffffffff80488cc5:     4995 	31 ed                	xor    %ebp,%ebp
ffffffff80488cc7:        0 	66 83 7c 10 04 00    	cmpw   $0x0,0x4(%rax,%rdx,1)
ffffffff80488ccd:      983 	75 15                	jne    ffffffff80488ce4 <skb_release_data+0x66>
ffffffff80488ccf:       15 	eb 28                	jmp    ffffffff80488cf9 <skb_release_data+0x7b>
ffffffff80488cd1:      665 	48 63 c5             	movslq %ebp,%rax
ffffffff80488cd4:      546 	ff c5                	inc    %ebp
ffffffff80488cd6:      328 	48 c1 e0 04          	shl    $0x4,%rax
ffffffff80488cda:      356 	48 8b 7c 02 20       	mov    0x20(%rdx,%rax,1),%rdi
ffffffff80488cdf:       95 	e8 be 87 de ff       	callq  ffffffff802714a2 <put_page>
ffffffff80488ce4:       66 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
ffffffff80488cea:     1321 	48 03 93 d0 00 00 00 	add    0xd0(%rbx),%rdx
ffffffff80488cf1:      439 	0f b7 42 04          	movzwl 0x4(%rdx),%eax
ffffffff80488cf5:        0 	39 c5                	cmp    %eax,%ebp
ffffffff80488cf7:     1887 	7c d8                	jl     ffffffff80488cd1 <skb_release_data+0x53>
ffffffff80488cf9:     2187 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
ffffffff80488cff:     1784 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
ffffffff80488d06:      422 	48 83 7c 10 18 00    	cmpq   $0x0,0x18(%rax,%rdx,1)
ffffffff80488d0c:      110 	74 08                	je     ffffffff80488d16 <skb_release_data+0x98>
ffffffff80488d0e:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff80488d11:        0 	e8 52 ff ff ff       	callq  ffffffff80488c68 <skb_drop_fraglist>
ffffffff80488d16:       14 	48 8b bb d0 00 00 00 	mov    0xd0(%rbx),%rdi
ffffffff80488d1d:      715 	5e                   	pop    %rsi
ffffffff80488d1e:      109 	5b                   	pop    %rbx
ffffffff80488d1f:       20 	5d                   	pop    %rbp
ffffffff80488d20:      980 	e9 b7 66 e0 ff       	jmpq   ffffffff8028f3dc <kfree>
ffffffff80488d25:        0 	59                   	pop    %rcx
ffffffff80488d26:     1948 	5b                   	pop    %rbx
ffffffff80488d27:        0 	5d                   	pop    %rbp
ffffffff80488d28:        0 	c3                   	retq   

this is a short function, and roughly 88% of its hits (267141 of 303802) 
are false overhead leaked in from callsites:

ffffffff80488c7f:   267141 	53                   	push   %rbx

unfortunately i have a hard time mapping its callsites. 
pskb_expand_head() is the only static callsite, but it's not active in 
the profile.

The _usual_ callsite is normally skb_release_all(), which does have 
overhead:

ffffffff80489449:      925 <skb_release_all>:
ffffffff80489449:      925 	53                   	push   %rbx
ffffffff8048944a:     5249 	48 89 fb             	mov    %rdi,%rbx
ffffffff8048944d:        4 	e8 3c ff ff ff       	callq  ffffffff8048938e <skb_release_head_state>
ffffffff80489452:     1149 	48 89 df             	mov    %rbx,%rdi
ffffffff80489455:    13163 	5b                   	pop    %rbx
ffffffff80489456:        0 	e9 23 f8 ff ff       	jmpq   ffffffff80488c7e <skb_release_data>

it is also tail-optimized, which explains why i found few 
callsites. The main callsite of skb_release_all() is:

ffffffff80488b86:       26 	e8 be 08 00 00       	callq  ffffffff80489449 <skb_release_all>

which is __kfree_skb(). That is a frequently referenced function, and 
in my profile there's a single callsite active:

ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>

which is tcp_ack() - subject of a later email. The wider context is:

ffffffff804c0ffc:      433 	41 2b 85 e0 00 00 00 	sub    0xe0(%r13),%eax
ffffffff804c1003:     4843 	89 85 f0 00 00 00    	mov    %eax,0xf0(%rbp)
ffffffff804c1009:     1730 	48 8b 45 30          	mov    0x30(%rbp),%rax
ffffffff804c100d:      311 	41 8b 95 e0 00 00 00 	mov    0xe0(%r13),%edx
ffffffff804c1014:        0 	48 83 b8 b0 00 00 00 	cmpq   $0x0,0xb0(%rax)
ffffffff804c101b:        0 	00 
ffffffff804c101c:      418 	74 06                	je     ffffffff804c1024 <tcp_ack+0x50d>
ffffffff804c101e:       37 	01 95 f4 00 00 00    	add    %edx,0xf4(%rbp)
ffffffff804c1024:        2 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>

this is a good, top-of-the-line x86 CPU with a really good BTB 
implementation that seems to be able to fall through calls and tail 
optimizations as if they weren't there.

some guesses are:

(gdb) list *0xffffffff804c1003
0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789).
784	
785	static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
786	{
787		skb_truesize_check(skb);
788		sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
789		sk->sk_wmem_queued -= skb->truesize;
790		sk_mem_uncharge(sk, skb->truesize);
791		__kfree_skb(skb);
792	}
793	

both sk and skb should be cache-hot here so this seems unlikely.

(gdb) list *0xffffffff804c1009
0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
731	}
732	
733	static inline int sk_has_account(struct sock *sk)
734	{
735		/* return true if protocol supports memory accounting */
736		return !!sk->sk_prot->memory_allocated;
737	}
738	
739	static inline int sk_wmem_schedule(struct sock *sk, int size)
740	{

this cannot be it - unless sk_prot somehow ends up being dirtied or 
false-shared?

Still, my guess would be on ffffffff804c1009 and a 
sk_prot->memory_allocated cachemiss: look at how this instruction uses 
%rbp, and push %rbp sits right before the instruction that shows the 
many hits in skb_release_data() - that's where the CPU's OOO trick 
ends: it has to compute the result and serialize on the cachemiss.
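
As an aside on the false-sharing guess above, here is a minimal userspace 
sketch, assuming a 64-byte cache line, of how one can check whether a 
read-mostly pointer and a frequently written field land on the same line. 
struct sock_like, its fields and same_cache_line() are hypothetical names 
for illustration; this is not the kernel's struct sock layout.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size in bytes */

/* Hypothetical socket-like object, not the real struct sock. */
struct sock_like {
	const void *prot;			/* read-mostly pointer */
	_Alignas(CACHE_LINE) int wmem_queued;	/* hot, written per packet */
	int forward_alloc;			/* written together with wmem_queued */
};

/* Nonzero if two member offsets fall into the same cache line,
 * assuming the object itself starts on a cache line boundary. */
static int same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

Feeding offsetof() values of two members into same_cache_line() tells 
whether a write to one can invalidate the line holding the other. The 
result only holds for objects allocated with cache line alignment.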

	Ingo

* skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:55                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   2.118525 skb_release_head_state

                      hits (total: 211852)
                 .........
ffffffff8048938e:      967 <skb_release_head_state>:
ffffffff8048938e:      967 	53                   	push   %rbx
ffffffff8048938f:     3975 	48 89 fb             	mov    %rdi,%rbx
ffffffff80489392:       17 	48 8b 7f 28          	mov    0x28(%rdi),%rdi
ffffffff80489396:        0 	e8 9c 93 00 00       	callq  ffffffff80492737 <dst_release>
ffffffff8048939b:        6 	48 8b 7b 30          	mov    0x30(%rbx),%rdi
ffffffff8048939f:     2887 	48 85 ff             	test   %rdi,%rdi
ffffffff804893a2:      859 	74 0f                	je     ffffffff804893b3 <skb_release_head_state+0x25>
ffffffff804893a4:        0 	f0 ff 0f             	lock decl (%rdi)
ffffffff804893a7:        0 	0f 94 c0             	sete   %al
ffffffff804893aa:        0 	84 c0                	test   %al,%al
ffffffff804893ac:        0 	74 05                	je     ffffffff804893b3 <skb_release_head_state+0x25>
ffffffff804893ae:        0 	e8 7a 14 06 00       	callq  ffffffff804ea82d <__secpath_destroy>
ffffffff804893b3:       16 	48 83 bb 80 00 00 00 	cmpq   $0x0,0x80(%rbx)
ffffffff804893ba:        0 	00 
ffffffff804893bb:     4294 	74 31                	je     ffffffff804893ee <skb_release_head_state+0x60>
ffffffff804893bd:        0 	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff804893c4:        0 	00 00 
ffffffff804893c6:     6540 	48 63 80 48 e0 ff ff 	movslq -0x1fb8(%rax),%rax
ffffffff804893cd:       14 	a9 00 00 ff 0f       	test   $0xfff0000,%eax
ffffffff804893d2:      471 	74 11                	je     ffffffff804893e5 <skb_release_head_state+0x57>
ffffffff804893d4:        0 	be 89 01 00 00       	mov    $0x189,%esi
ffffffff804893d9:        0 	48 c7 c7 cc b1 6a 80 	mov    $0xffffffff806ab1cc,%rdi
ffffffff804893e0:        0 	e8 d0 cd da ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804893e5:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804893e8:     1733 	ff 93 80 00 00 00    	callq  *0x80(%rbx)
ffffffff804893ee:      888 	48 8b bb 88 00 00 00 	mov    0x88(%rbx),%rdi
ffffffff804893f5:     3959 	48 85 ff             	test   %rdi,%rdi
ffffffff804893f8:        0 	74 0f                	je     ffffffff80489409 <skb_release_head_state+0x7b>
ffffffff804893fa:        0 	f0 ff 0f             	lock decl (%rdi)
ffffffff804893fd:        0 	0f 94 c0             	sete   %al
ffffffff80489400:        0 	84 c0                	test   %al,%al
ffffffff80489402:        0 	74 05                	je     ffffffff80489409 <skb_release_head_state+0x7b>
ffffffff80489404:        0 	e8 48 f2 01 00       	callq  ffffffff804a8651 <nf_conntrack_destroy>
ffffffff80489409:        0 	48 8b bb 90 00 00 00 	mov    0x90(%rbx),%rdi
ffffffff80489410:     3132 	48 85 ff             	test   %rdi,%rdi
ffffffff80489413:        1 	74 05                	je     ffffffff8048941a <skb_release_head_state+0x8c>
ffffffff80489415:        0 	e8 d7 f7 ff ff       	callq  ffffffff80488bf1 <kfree_skb>
ffffffff8048941a:      958 	48 8b bb 98 00 00 00 	mov    0x98(%rbx),%rdi
ffffffff80489421:     1999 	48 85 ff             	test   %rdi,%rdi
ffffffff80489424:        0 	74 0f                	je     ffffffff80489435 <skb_release_head_state+0xa7>
ffffffff80489426:        0 	f0 ff 0f             	lock decl (%rdi)
ffffffff80489429:        0 	0f 94 c0             	sete   %al
ffffffff8048942c:        0 	84 c0                	test   %al,%al
ffffffff8048942e:        0 	74 05                	je     ffffffff80489435 <skb_release_head_state+0xa7>
ffffffff80489430:        0 	e8 a7 5f e0 ff       	callq  ffffffff8028f3dc <kfree>
ffffffff80489435:        0 	66 c7 83 a6 00 00 00 	movw   $0x0,0xa6(%rbx)
ffffffff8048943c:        0 	00 00 
ffffffff8048943e:     6503 	66 c7 83 a8 00 00 00 	movw   $0x0,0xa8(%rbx)
ffffffff80489445:        0 	00 00 
ffffffff80489447:   174101 	5b                   	pop    %rbx
ffffffff80489448:        0 	c3                   	retq   

this function _really_ hurts from a 16-bit op:

ffffffff8048943e:     6503 	66 c7 83 a8 00 00 00 	movw   $0x0,0xa8(%rbx)
ffffffff80489445:        0 	00 00 
ffffffff80489447:   174101 	5b                   	pop    %rbx

(gdb) list *0xffffffff8048943e
0xffffffff8048943e is in skb_release_head_state 
(net/core/skbuff.c:407).
402	#endif
403	/* XXX: IS this still necessary? - JHS */
404	#ifdef CONFIG_NET_SCHED
405		skb->tc_index = 0;
406	#ifdef CONFIG_NET_CLS_ACT
407		skb->tc_verd = 0;
408	#endif
409	#endif
410	}
411	

dirtying skb->tc_verd. I do have:

CONFIG_NET_CLS_ACT=y

BUT, on a second look, i don't think it's really this 16-bit op that 
hurts us. The wider context is:

ffffffff80489426:        0 	f0 ff 0f             	lock decl (%rdi)
ffffffff80489429:        0 	0f 94 c0             	sete   %al
ffffffff8048942c:        0 	84 c0                	test   %al,%al
ffffffff8048942e:        0 	74 05                	je     ffffffff80489435 <skb_release_head_state+0xa7>
ffffffff80489430:        0 	e8 a7 5f e0 ff       	callq  ffffffff8028f3dc <kfree>
ffffffff80489435:        0 	66 c7 83 a6 00 00 00 	movw   $0x0,0xa6(%rbx)
ffffffff8048943c:        0 	00 00 
ffffffff8048943e:     6503 	66 c7 83 a8 00 00 00 	movw   $0x0,0xa8(%rbx)
ffffffff80489445:        0 	00 00 
ffffffff80489447:   174101 	5b                   	pop    %rbx
ffffffff80489448:        0 	c3                   	retq   

look how we jump over the callq most of the time - so what we are 
seeing here i believe is the cost of the atomic op at 
ffffffff80489426. That comes from:

(gdb) list *0xffffffff8048942e
0xffffffff8048942e is in skb_release_head_state (include/linux/skbuff.h:1783).
1778	}
1779	#endif
1780	#ifdef CONFIG_BRIDGE_NETFILTER
1781	static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
1782	{
1783		if (nf_bridge && atomic_dec_and_test(&nf_bridge->use))
1784			kfree(nf_bridge);
1785	}
1786	static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
1787	{

and ouch does that global dec on &nf_bridge->use hurt!

i do have:

  CONFIG_BRIDGE_NETFILTER=y

(this is a Fedora distro kernel derived .config)
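
To make the cost pattern concrete, here is a minimal userspace sketch of 
the same "put" idiom using C11 atomics. struct bridge_info and the 
bridge_info_*() helpers are hypothetical stand-ins, not the kernel's 
nf_bridge API. The point it illustrates: whenever the pointer is non-NULL, 
the atomic decrement runs on every put, even though the free is skipped 
most of the time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct nf_bridge_info: only the refcount matters. */
struct bridge_info {
	atomic_int use;
};

/* Allocate an object holding one reference; returns NULL on failure. */
static struct bridge_info *bridge_info_alloc(void)
{
	struct bridge_info *b = malloc(sizeof(*b));

	if (b)
		atomic_init(&b->use, 1);
	return b;
}

/* Take one more reference. */
static void bridge_info_get(struct bridge_info *b)
{
	atomic_fetch_add(&b->use, 1);
}

/* Drop one reference. The atomic read-modify-write executes for every
 * non-NULL pointer; free() only runs when the count reaches zero.
 * Returns true if this call freed the object. */
static bool bridge_info_put(struct bridge_info *b)
{
	if (b && atomic_fetch_sub(&b->use, 1) == 1) {
		free(b);
		return true;
	}
	return false;
}
```

If the counter is shared between CPUs, each such decrement has to pull the 
cache line in exclusively, which is the kind of cost suspected above.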

	Ingo

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:56                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 20:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> 100.000000 total
>> ................
>>   3.038025 skb_release_data
> 
>                       hits (303802 total)
>                  .........
> ffffffff80488c7e:      780 <skb_release_data>:
> ffffffff80488c7e:      780 	55                   	push   %rbp
> ffffffff80488c7f:   267141 	53                   	push   %rbx
> ffffffff80488c80:        0 	48 89 fb             	mov    %rdi,%rbx
> ffffffff80488c83:     3552 	48 83 ec 08          	sub    $0x8,%rsp
> ffffffff80488c87:      604 	8a 47 7c             	mov    0x7c(%rdi),%al
> ffffffff80488c8a:     2644 	a8 02                	test   $0x2,%al
> ffffffff80488c8c:       49 	74 2a                	je     ffffffff80488cb8 <skb_release_data+0x3a>
> ffffffff80488c8e:        0 	83 e0 10             	and    $0x10,%eax
> ffffffff80488c91:     2079 	8b 97 c8 00 00 00    	mov    0xc8(%rdi),%edx
> ffffffff80488c97:       53 	3c 01                	cmp    $0x1,%al
> ffffffff80488c99:        0 	19 c0                	sbb    %eax,%eax
> ffffffff80488c9b:      870 	48 03 97 d0 00 00 00 	add    0xd0(%rdi),%rdx
> ffffffff80488ca2:       65 	66 31 c0             	xor    %ax,%ax
> ffffffff80488ca5:        0 	05 01 00 01 00       	add    $0x10001,%eax
> ffffffff80488caa:      888 	f7 d8                	neg    %eax
> ffffffff80488cac:       49 	89 c1                	mov    %eax,%ecx
> ffffffff80488cae:        0 	f0 0f c1 0a          	lock xadd %ecx,(%rdx)
> ffffffff80488cb2:     1909 	01 c8                	add    %ecx,%eax
> ffffffff80488cb4:     1040 	85 c0                	test   %eax,%eax
> ffffffff80488cb6:        0 	75 6d                	jne    ffffffff80488d25 <skb_release_data+0xa7>
> ffffffff80488cb8:        0 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
> ffffffff80488cbe:     4199 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
> ffffffff80488cc5:     4995 	31 ed                	xor    %ebp,%ebp
> ffffffff80488cc7:        0 	66 83 7c 10 04 00    	cmpw   $0x0,0x4(%rax,%rdx,1)
> ffffffff80488ccd:      983 	75 15                	jne    ffffffff80488ce4 <skb_release_data+0x66>
> ffffffff80488ccf:       15 	eb 28                	jmp    ffffffff80488cf9 <skb_release_data+0x7b>
> ffffffff80488cd1:      665 	48 63 c5             	movslq %ebp,%rax
> ffffffff80488cd4:      546 	ff c5                	inc    %ebp
> ffffffff80488cd6:      328 	48 c1 e0 04          	shl    $0x4,%rax
> ffffffff80488cda:      356 	48 8b 7c 02 20       	mov    0x20(%rdx,%rax,1),%rdi
> ffffffff80488cdf:       95 	e8 be 87 de ff       	callq  ffffffff802714a2 <put_page>
> ffffffff80488ce4:       66 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
> ffffffff80488cea:     1321 	48 03 93 d0 00 00 00 	add    0xd0(%rbx),%rdx
> ffffffff80488cf1:      439 	0f b7 42 04          	movzwl 0x4(%rdx),%eax
> ffffffff80488cf5:        0 	39 c5                	cmp    %eax,%ebp
> ffffffff80488cf7:     1887 	7c d8                	jl     ffffffff80488cd1 <skb_release_data+0x53>
> ffffffff80488cf9:     2187 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
> ffffffff80488cff:     1784 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
> ffffffff80488d06:      422 	48 83 7c 10 18 00    	cmpq   $0x0,0x18(%rax,%rdx,1)
> ffffffff80488d0c:      110 	74 08                	je     ffffffff80488d16 <skb_release_data+0x98>
> ffffffff80488d0e:        0 	48 89 df             	mov    %rbx,%rdi
> ffffffff80488d11:        0 	e8 52 ff ff ff       	callq  ffffffff80488c68 <skb_drop_fraglist>
> ffffffff80488d16:       14 	48 8b bb d0 00 00 00 	mov    0xd0(%rbx),%rdi
> ffffffff80488d1d:      715 	5e                   	pop    %rsi
> ffffffff80488d1e:      109 	5b                   	pop    %rbx
> ffffffff80488d1f:       20 	5d                   	pop    %rbp
> ffffffff80488d20:      980 	e9 b7 66 e0 ff       	jmpq   ffffffff8028f3dc <kfree>
> ffffffff80488d25:        0 	59                   	pop    %rcx
> ffffffff80488d26:     1948 	5b                   	pop    %rbx
> ffffffff80488d27:        0 	5d                   	pop    %rbp
> ffffffff80488d28:        0 	c3                   	retq   
> 
> this is a short function, and roughly 88% of its hits (267141 of 303802) 
> are false overhead leaked in from callsites:
> 
> ffffffff80488c7f:   267141 	53                   	push   %rbx
> 
> unfortunately i have a hard time mapping its callsites. 
> pskb_expand_head() is the only static callsite, but it's not active in 
> the profile.
> 
> The _usual_ callsite is normally skb_release_all(), which does have 
> overhead:
> 
> ffffffff80489449:      925 <skb_release_all>:
> ffffffff80489449:      925 	53                   	push   %rbx
> ffffffff8048944a:     5249 	48 89 fb             	mov    %rdi,%rbx
> ffffffff8048944d:        4 	e8 3c ff ff ff       	callq  ffffffff8048938e <skb_release_head_state>
> ffffffff80489452:     1149 	48 89 df             	mov    %rbx,%rdi
> ffffffff80489455:    13163 	5b                   	pop    %rbx
> ffffffff80489456:        0 	e9 23 f8 ff ff       	jmpq   ffffffff80488c7e <skb_release_data>
> 
> it is also tail-optimized, which explains why i found few 
> callsites. The main callsite of skb_release_all() is:
> 
> ffffffff80488b86:       26 	e8 be 08 00 00       	callq  ffffffff80489449 <skb_release_all>
> 
> which is __kfree_skb(). That is a frequently referenced function, and 
> in my profile there's a single callsite active:
> 
> ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>
> 
> which is tcp_ack() - subject of a later email. The wider context is:
> 
> ffffffff804c0ffc:      433 	41 2b 85 e0 00 00 00 	sub    0xe0(%r13),%eax
> ffffffff804c1003:     4843 	89 85 f0 00 00 00    	mov    %eax,0xf0(%rbp)
> ffffffff804c1009:     1730 	48 8b 45 30          	mov    0x30(%rbp),%rax
> ffffffff804c100d:      311 	41 8b 95 e0 00 00 00 	mov    0xe0(%r13),%edx
> ffffffff804c1014:        0 	48 83 b8 b0 00 00 00 	cmpq   $0x0,0xb0(%rax)
> ffffffff804c101b:        0 	00 
> ffffffff804c101c:      418 	74 06                	je     ffffffff804c1024 <tcp_ack+0x50d>
> ffffffff804c101e:       37 	01 95 f4 00 00 00    	add    %edx,0xf4(%rbp)
> ffffffff804c1024:        2 	4c 89 ef             	mov    %r13,%rdi
> ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>
> 
> this is a good, top-of-the-line x86 CPU with a really good BTB 
> implementation that seems to be able to fall through calls and tail 
> optimizations as if they weren't there.
> 
> some guesses are:
> 
> (gdb) list *0xffffffff804c1003
> 0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789).
> 784	
> 785	static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
> 786	{
> 787		skb_truesize_check(skb);
> 788		sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
> 789		sk->sk_wmem_queued -= skb->truesize;
> 790		sk_mem_uncharge(sk, skb->truesize);
> 791		__kfree_skb(skb);
> 792	}
> 793	
> 
> both sk and skb should be cache-hot here so this seems unlikely.
> 
> (gdb) list *0xffffffff804c1009
> 0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
> 731	}
> 732	
> 733	static inline int sk_has_account(struct sock *sk)
> 734	{
> 735		/* return true if protocol supports memory accounting */
> 736		return !!sk->sk_prot->memory_allocated;
> 737	}
> 738	
> 739	static inline int sk_wmem_schedule(struct sock *sk, int size)
> 740	{
> 
> this cannot be it - unless sk_prot somehow ends up being dirtied or 
> false-shared?
> 
> Still, my guess would be on ffffffff804c1009 and a 
> sk_prot->memory_allocated cachemiss: look at how this instruction uses 
> %rbp, and push %rbp sits right before the instruction that shows the 
> many hits in skb_release_data() - that's where the CPU's OOO trick 
> ends: it has to compute the result and serialize on the cachemiss.
> 

I investigated this part (memory_allocated) and found that UDP had a problem,
not TCP (which is what tbench uses).

commit 270acefafeb74ce2fe93d35b75733870bf1e11e7

net: sk_free_datagram() should use sk_mem_reclaim_partial()

I noticed a contention on udp_memory_allocated on regular UDP applications.

While tcp_memory_allocated is seldom used, it appears each incoming UDP frame
is currently touching udp_memory_allocated when queued, and when received by
application.

One possible solution is to use sk_mem_reclaim_partial() instead of
sk_mem_reclaim(), so that we keep a small reserve (less than one page)
of memory for each UDP socket.

We did something very similar on TCP side in commit
9993e7d313e80bdc005d09c7def91903e0068f07
([TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer())

A more complex solution would need to convert prot->memory_allocated to
use a percpu_counter with batches of 64 or 128 pages.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
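
The "more complex solution" mentioned in the commit text can be sketched in 
userspace as follows. This is only an illustration of the batching idea, 
not the kernel's percpu_counter API: struct local_counter, counter_add(), 
counter_read_approx() and global_pages are hypothetical names, and the 
"per-CPU" slot is simply a struct the caller owns. BATCH is set to 64, one 
of the two values the commit text suggests.

```c
#include <assert.h>
#include <stdatomic.h>

#define BATCH 64	/* pages per batch; the commit text suggests 64 or 128 */

/* Shared counter that every CPU would otherwise touch on each packet. */
static atomic_long global_pages;

/* Hypothetical per-CPU slot holding the delta not yet published globally. */
struct local_counter {
	long delta;
};

/* Add 'pages' (may be negative) to the local slot. The shared counter is
 * only touched once the local delta reaches +BATCH or -BATCH.
 * Returns 1 if the shared counter was updated, 0 otherwise. */
static int counter_add(struct local_counter *lc, long pages)
{
	lc->delta += pages;
	if (lc->delta >= BATCH || lc->delta <= -BATCH) {
		atomic_fetch_add(&global_pages, lc->delta);
		lc->delta = 0;
		return 1;
	}
	return 0;
}

/* Approximate total: it can be off by less than BATCH per local slot. */
static long counter_read_approx(void)
{
	return atomic_load(&global_pages);
}
```

The tradeoff is accuracy for contention: the shared cache line is written 
once per BATCH pages instead of once per frame, while readers see a value 
that can lag behind by up to BATCH - 1 pages per slot.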



* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:56                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 20:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:
> 
>> 100.000000 total
>> ................
>>   3.038025 skb_release_data
> 
>                       hits (303802 total)
>                  .........
> ffffffff80488c7e:      780 <skb_release_data>:
> ffffffff80488c7e:      780 	55                   	push   %rbp
> ffffffff80488c7f:   267141 	53                   	push   %rbx
> ffffffff80488c80:        0 	48 89 fb             	mov    %rdi,%rbx
> ffffffff80488c83:     3552 	48 83 ec 08          	sub    $0x8,%rsp
> ffffffff80488c87:      604 	8a 47 7c             	mov    0x7c(%rdi),%al
> ffffffff80488c8a:     2644 	a8 02                	test   $0x2,%al
> ffffffff80488c8c:       49 	74 2a                	je     ffffffff80488cb8 <skb_release_data+0x3a>
> ffffffff80488c8e:        0 	83 e0 10             	and    $0x10,%eax
> ffffffff80488c91:     2079 	8b 97 c8 00 00 00    	mov    0xc8(%rdi),%edx
> ffffffff80488c97:       53 	3c 01                	cmp    $0x1,%al
> ffffffff80488c99:        0 	19 c0                	sbb    %eax,%eax
> ffffffff80488c9b:      870 	48 03 97 d0 00 00 00 	add    0xd0(%rdi),%rdx
> ffffffff80488ca2:       65 	66 31 c0             	xor    %ax,%ax
> ffffffff80488ca5:        0 	05 01 00 01 00       	add    $0x10001,%eax
> ffffffff80488caa:      888 	f7 d8                	neg    %eax
> ffffffff80488cac:       49 	89 c1                	mov    %eax,%ecx
> ffffffff80488cae:        0 	f0 0f c1 0a          	lock xadd %ecx,(%rdx)
> ffffffff80488cb2:     1909 	01 c8                	add    %ecx,%eax
> ffffffff80488cb4:     1040 	85 c0                	test   %eax,%eax
> ffffffff80488cb6:        0 	75 6d                	jne    ffffffff80488d25 <skb_release_data+0xa7>
> ffffffff80488cb8:        0 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
> ffffffff80488cbe:     4199 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
> ffffffff80488cc5:     4995 	31 ed                	xor    %ebp,%ebp
> ffffffff80488cc7:        0 	66 83 7c 10 04 00    	cmpw   $0x0,0x4(%rax,%rdx,1)
> ffffffff80488ccd:      983 	75 15                	jne    ffffffff80488ce4 <skb_release_data+0x66>
> ffffffff80488ccf:       15 	eb 28                	jmp    ffffffff80488cf9 <skb_release_data+0x7b>
> ffffffff80488cd1:      665 	48 63 c5             	movslq %ebp,%rax
> ffffffff80488cd4:      546 	ff c5                	inc    %ebp
> ffffffff80488cd6:      328 	48 c1 e0 04          	shl    $0x4,%rax
> ffffffff80488cda:      356 	48 8b 7c 02 20       	mov    0x20(%rdx,%rax,1),%rdi
> ffffffff80488cdf:       95 	e8 be 87 de ff       	callq  ffffffff802714a2 <put_page>
> ffffffff80488ce4:       66 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
> ffffffff80488cea:     1321 	48 03 93 d0 00 00 00 	add    0xd0(%rbx),%rdx
> ffffffff80488cf1:      439 	0f b7 42 04          	movzwl 0x4(%rdx),%eax
> ffffffff80488cf5:        0 	39 c5                	cmp    %eax,%ebp
> ffffffff80488cf7:     1887 	7c d8                	jl     ffffffff80488cd1 <skb_release_data+0x53>
> ffffffff80488cf9:     2187 	8b 93 c8 00 00 00    	mov    0xc8(%rbx),%edx
> ffffffff80488cff:     1784 	48 8b 83 d0 00 00 00 	mov    0xd0(%rbx),%rax
> ffffffff80488d06:      422 	48 83 7c 10 18 00    	cmpq   $0x0,0x18(%rax,%rdx,1)
> ffffffff80488d0c:      110 	74 08                	je     ffffffff80488d16 <skb_release_data+0x98>
> ffffffff80488d0e:        0 	48 89 df             	mov    %rbx,%rdi
> ffffffff80488d11:        0 	e8 52 ff ff ff       	callq  ffffffff80488c68 <skb_drop_fraglist>
> ffffffff80488d16:       14 	48 8b bb d0 00 00 00 	mov    0xd0(%rbx),%rdi
> ffffffff80488d1d:      715 	5e                   	pop    %rsi
> ffffffff80488d1e:      109 	5b                   	pop    %rbx
> ffffffff80488d1f:       20 	5d                   	pop    %rbp
> ffffffff80488d20:      980 	e9 b7 66 e0 ff       	jmpq   ffffffff8028f3dc <kfree>
> ffffffff80488d25:        0 	59                   	pop    %rcx
> ffffffff80488d26:     1948 	5b                   	pop    %rbx
> ffffffff80488d27:        0 	5d                   	pop    %rbp
> ffffffff80488d28:        0 	c3                   	retq   
> 
> this is a short function, and roughly 88% of its hits (267141 of 303802) 
> are false overhead leaked in from callsites:
> 
> ffffffff80488c7f:   267141 	53                   	push   %rbx
> 
> unfortunately i have a hard time mapping its callsites. 
> pskb_expand_head() is the only static callsite, but it's not active in 
> the profile.
> 
> The _usual_ callsite is normally skb_release_all(), which does have 
> overhead:
> 
> ffffffff80489449:      925 <skb_release_all>:
> ffffffff80489449:      925 	53                   	push   %rbx
> ffffffff8048944a:     5249 	48 89 fb             	mov    %rdi,%rbx
> ffffffff8048944d:        4 	e8 3c ff ff ff       	callq  ffffffff8048938e <skb_release_head_state>
> ffffffff80489452:     1149 	48 89 df             	mov    %rbx,%rdi
> ffffffff80489455:    13163 	5b                   	pop    %rbx
> ffffffff80489456:        0 	e9 23 f8 ff ff       	jmpq   ffffffff80488c7e <skb_release_data>
> 
> it is also tail-optimized, which explains why i found few 
> callsites. The main callsite of skb_release_all() is:
> 
> ffffffff80488b86:       26 	e8 be 08 00 00       	callq  ffffffff80489449 <skb_release_all>
> 
> which is __kfree_skb(). That is a frequently referenced function, and 
> in my profile there's a single callsite active:
> 
> ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>
> 
> which is tcp_ack() - subject of a later email. The wider context is:
> 
> ffffffff804c0ffc:      433 	41 2b 85 e0 00 00 00 	sub    0xe0(%r13),%eax
> ffffffff804c1003:     4843 	89 85 f0 00 00 00    	mov    %eax,0xf0(%rbp)
> ffffffff804c1009:     1730 	48 8b 45 30          	mov    0x30(%rbp),%rax
> ffffffff804c100d:      311 	41 8b 95 e0 00 00 00 	mov    0xe0(%r13),%edx
> ffffffff804c1014:        0 	48 83 b8 b0 00 00 00 	cmpq   $0x0,0xb0(%rax)
> ffffffff804c101b:        0 	00 
> ffffffff804c101c:      418 	74 06                	je     ffffffff804c1024 <tcp_ack+0x50d>
> ffffffff804c101e:       37 	01 95 f4 00 00 00    	add    %edx,0xf4(%rbp)
> ffffffff804c1024:        2 	4c 89 ef             	mov    %r13,%rdi
> ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>
> 
> this is a good, top-of-the-line x86 CPU with a really good BTB 
> implementation that seems to be able to fall through calls and tail 
> optimizations as if they weren't there.
> 
> some guesses are:
> 
> (gdb) list *0xffffffff804c1003
> 0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789).
> 784	
> 785	static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
> 786	{
> 787		skb_truesize_check(skb);
> 788		sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
> 789		sk->sk_wmem_queued -= skb->truesize;
> 790		sk_mem_uncharge(sk, skb->truesize);
> 791		__kfree_skb(skb);
> 792	}
> 793	
> 
> both sk and skb should be cache-hot here so this seems unlikely.
> 
> (gdb) list *0xffffffff804c1009
> 0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
> 731	}
> 732	
> 733	static inline int sk_has_account(struct sock *sk)
> 734	{
> 735		/* return true if protocol supports memory accounting */
> 736		return !!sk->sk_prot->memory_allocated;
> 737	}
> 738	
> 739	static inline int sk_wmem_schedule(struct sock *sk, int size)
> 740	{
> 
> this cannot be it - unless sk_prot somehow ends up being dirtied or 
> false-shared?
> 
> Still, my guess would be on ffffffff804c1009 and a 
> sk_prot->memory_allocated cachemiss: look at how this instruction uses 
> %ebp, and the one that shows the many hits in skb_release_data() 
> pushes %ebp to the stack - that's where the CPU's OOO trick ends: it 
> has to compute the result and serialize on the cachemiss.
> 

I investigated this part (memory_allocated) and found that UDP had a problem,
not TCP (and therefore not tbench)

commit 270acefafeb74ce2fe93d35b75733870bf1e11e7

net: sk_free_datagram() should use sk_mem_reclaim_partial()

I noticed a contention on udp_memory_allocated on regular UDP applications.

While tcp_memory_allocated is seldom used, it appears each incoming UDP frame
is currently touching udp_memory_allocated when queued, and when received by
application.

One possible solution is to use sk_mem_reclaim_partial() instead of
sk_mem_reclaim(), so that we keep a small reserve (less than one page)
of memory for each UDP socket.

We did something very similar on TCP side in commit
9993e7d313e80bdc005d09c7def91903e0068f07
([TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer())

A more complex solution would need to convert prot->memory_allocated to
use a percpu_counter with batches of 64 or 128 pages.

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
Signed-off-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
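
To make the "batches of 64 or 128 pages" idea above concrete, here is a
minimal userspace sketch of a batched counter. It is not the kernel's
percpu_counter code: the struct, the function names, NR_CPUS_SKETCH and the
global_updates statistic are all hypothetical, and the "per-CPU" slots are
just array entries in a single-threaded simulation. It only shows why
batching keeps most updates off the shared cache line.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

struct batched_counter {
	int64_t global;			/* shared total: the contended cache line */
	int32_t local[NR_CPUS_SKETCH];	/* per-CPU deltas not yet folded in */
	int32_t batch;			/* fold threshold, e.g. 64 or 128 pages */
	unsigned long global_updates;	/* how often the shared total was written */
};

static void bc_init(struct batched_counter *c, int32_t batch)
{
	c->global = 0;
	c->batch = batch;
	c->global_updates = 0;
	for (int i = 0; i < NR_CPUS_SKETCH; i++)
		c->local[i] = 0;
}

/*
 * Add 'amount' pages on behalf of 'cpu'. The shared total is touched only
 * when the local delta reaches the batch size in either direction.
 */
static void bc_add(struct batched_counter *c, int cpu, int32_t amount)
{
	int32_t v = c->local[cpu] + amount;

	if (v >= c->batch || v <= -c->batch) {
		c->global += v;
		c->global_updates++;
		v = 0;
	}
	c->local[cpu] = v;
}

/* Exact value: shared total plus every unfolded per-CPU delta. */
static int64_t bc_sum(const struct batched_counter *c)
{
	int64_t sum = c->global;

	for (int i = 0; i < NR_CPUS_SKETCH; i++)
		sum += c->local[i];
	return sum;
}
```

With a batch of 64, a socket that charges and uncharges one page per frame
never writes the shared total at all; the price is that a cheap read of
'global' alone can be off by up to (batch - 1) pages per CPU, so limit
checks would need bc_sum() or some slack.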


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:57                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 20:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> 100.000000 total
>> ................
>>   3.356152 ip_queue_xmit
> 
>                       hits (335615 total)
>                  .........
> ffffffff804b7045:     1001 <ip_queue_xmit>:
> ffffffff804b7045:     1001 	41 57                	push   %r15
> ffffffff804b7047:    36698 	41 56                	push   %r14
> ffffffff804b7049:        0 	49 89 fe             	mov    %rdi,%r14
> ffffffff804b704c:        0 	41 55                	push   %r13
> ffffffff804b704e:      447 	41 54                	push   %r12
> ffffffff804b7050:        0 	55                   	push   %rbp
> ffffffff804b7051:        4 	53                   	push   %rbx
> ffffffff804b7052:      465 	48 83 ec 68          	sub    $0x68,%rsp
> ffffffff804b7056:        1 	89 74 24 08          	mov    %esi,0x8(%rsp)
> ffffffff804b705a:      486 	48 8b 47 28          	mov    0x28(%rdi),%rax
> ffffffff804b705e:        0 	48 8b 6f 10          	mov    0x10(%rdi),%rbp
> ffffffff804b7062:        7 	48 85 c0             	test   %rax,%rax
> ffffffff804b7065:      480 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
> ffffffff804b706a:        0 	4c 8b bd 48 02 00 00 	mov    0x248(%rbp),%r15
> ffffffff804b7071:        7 	0f 85 0d 01 00 00    	jne    ffffffff804b7184 <ip_queue_xmit+0x13f>
> ffffffff804b7077:      452 	31 f6                	xor    %esi,%esi
> ffffffff804b7079:        0 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b707c:        5 	e8 c1 eb fc ff       	callq  ffffffff80485c42 <__sk_dst_check>
> ffffffff804b7081:      434 	48 85 c0             	test   %rax,%rax
> ffffffff804b7084:       54 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
> ffffffff804b7089:        0 	0f 85 e0 00 00 00    	jne    ffffffff804b716f <ip_queue_xmit+0x12a>
> ffffffff804b708f:        0 	4d 85 ff             	test   %r15,%r15
> ffffffff804b7092:        0 	44 8b ad 30 02 00 00 	mov    0x230(%rbp),%r13d
> ffffffff804b7099:        0 	74 0a                	je     ffffffff804b70a5 <ip_queue_xmit+0x60>
> ffffffff804b709b:        0 	41 80 7f 05 00       	cmpb   $0x0,0x5(%r15)
> ffffffff804b70a0:        0 	74 03                	je     ffffffff804b70a5 <ip_queue_xmit+0x60>
> ffffffff804b70a2:        0 	45 8b 2f             	mov    (%r15),%r13d
> ffffffff804b70a5:        0 	8b 85 3c 02 00 00    	mov    0x23c(%rbp),%eax
> ffffffff804b70ab:        0 	48 8d b5 10 01 00 00 	lea    0x110(%rbp),%rsi
> ffffffff804b70b2:        0 	44 8b 65 04          	mov    0x4(%rbp),%r12d
> ffffffff804b70b6:        0 	bf 0d 00 00 00       	mov    $0xd,%edi
> ffffffff804b70bb:        0 	89 44 24 0c          	mov    %eax,0xc(%rsp)
> ffffffff804b70bf:        0 	8a 9d 54 02 00 00    	mov    0x254(%rbp),%bl
> ffffffff804b70c5:        0 	e8 9a df ff ff       	callq  ffffffff804b5064 <constant_test_bit>
> ffffffff804b70ca:        0 	31 d2                	xor    %edx,%edx
> ffffffff804b70cc:        0 	48 8d 7c 24 10       	lea    0x10(%rsp),%rdi
> ffffffff804b70d1:        0 	41 89 c3             	mov    %eax,%r11d
> ffffffff804b70d4:        0 	fc                   	cld    
> ffffffff804b70d5:        0 	89 d0                	mov    %edx,%eax
> ffffffff804b70d7:        0 	b9 10 00 00 00       	mov    $0x10,%ecx
> ffffffff804b70dc:        0 	44 8a 45 39          	mov    0x39(%rbp),%r8b
> ffffffff804b70e0:        0 	40 8a b5 57 02 00 00 	mov    0x257(%rbp),%sil
> ffffffff804b70e7:        0 	44 8b 8d 50 02 00 00 	mov    0x250(%rbp),%r9d
> ffffffff804b70ee:        0 	83 e3 1e             	and    $0x1e,%ebx
> ffffffff804b70f1:        0 	44 8b 95 38 02 00 00 	mov    0x238(%rbp),%r10d
> ffffffff804b70f8:        0 	44 09 db             	or     %r11d,%ebx
> ffffffff804b70fb:        0 	f3 ab                	rep stos %eax,%es:(%rdi)
> ffffffff804b70fd:        0 	40 c0 ee 05          	shr    $0x5,%sil
> ffffffff804b7101:        0 	88 5c 24 24          	mov    %bl,0x24(%rsp)
> ffffffff804b7105:        0 	48 8d 5c 24 10       	lea    0x10(%rsp),%rbx
> ffffffff804b710a:        0 	83 e6 01             	and    $0x1,%esi
> ffffffff804b710d:        0 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b7110:        0 	44 88 44 24 40       	mov    %r8b,0x40(%rsp)
> ffffffff804b7115:        0 	8b 44 24 0c          	mov    0xc(%rsp),%eax
> ffffffff804b7119:        0 	40 88 74 24 41       	mov    %sil,0x41(%rsp)
> ffffffff804b711e:        0 	48 89 de             	mov    %rbx,%rsi
> ffffffff804b7121:        0 	66 44 89 4c 24 44    	mov    %r9w,0x44(%rsp)
> ffffffff804b7127:        0 	66 44 89 54 24 46    	mov    %r10w,0x46(%rsp)
> ffffffff804b712d:        0 	44 89 64 24 10       	mov    %r12d,0x10(%rsp)
> ffffffff804b7132:        0 	44 89 6c 24 1c       	mov    %r13d,0x1c(%rsp)
> ffffffff804b7137:        0 	89 44 24 20          	mov    %eax,0x20(%rsp)
> ffffffff804b713b:        0 	e8 2d 9f e5 ff       	callq  ffffffff8031106d <security_sk_classify_flow>
> ffffffff804b7140:        0 	48 8d 74 24 58       	lea    0x58(%rsp),%rsi
> ffffffff804b7145:        0 	45 31 c0             	xor    %r8d,%r8d
> ffffffff804b7148:        0 	48 89 e9             	mov    %rbp,%rcx
> ffffffff804b714b:        0 	48 89 da             	mov    %rbx,%rdx
> ffffffff804b714e:        0 	48 c7 c7 d0 15 ab 80 	mov    $0xffffffff80ab15d0,%rdi
> ffffffff804b7155:        0 	e8 1a 91 ff ff       	callq  ffffffff804b0274 <ip_route_output_flow>
> ffffffff804b715a:        0 	85 c0                	test   %eax,%eax
> ffffffff804b715c:        0 	0f 85 9f 01 00 00    	jne    ffffffff804b7301 <ip_queue_xmit+0x2bc>
> ffffffff804b7162:        0 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
> ffffffff804b7167:        0 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b716a:        0 	e8 a8 eb fc ff       	callq  ffffffff80485d17 <sk_setup_caps>
> ffffffff804b716f:      441 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
> ffffffff804b7174:     1388 	48 85 c0             	test   %rax,%rax
> ffffffff804b7177:        0 	74 07                	je     ffffffff804b7180 <ip_queue_xmit+0x13b>
> ffffffff804b7179:        0 	f0 ff 80 b0 00 00 00 	lock incl 0xb0(%rax)
> ffffffff804b7180:      556 	49 89 46 28          	mov    %rax,0x28(%r14)
> ffffffff804b7184:     8351 	4d 85 ff             	test   %r15,%r15
> ffffffff804b7187:        0 	be 14 00 00 00       	mov    $0x14,%esi
> ffffffff804b718c:      461 	74 26                	je     ffffffff804b71b4 <ip_queue_xmit+0x16f>
> ffffffff804b718e:        0 	41 f6 47 08 01       	testb  $0x1,0x8(%r15)
> ffffffff804b7193:        0 	74 17                	je     ffffffff804b71ac <ip_queue_xmit+0x167>
> ffffffff804b7195:        0 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> ffffffff804b719a:        0 	8b 82 28 01 00 00    	mov    0x128(%rdx),%eax
> ffffffff804b71a0:        0 	39 82 1c 01 00 00    	cmp    %eax,0x11c(%rdx)
> ffffffff804b71a6:        0 	0f 85 55 01 00 00    	jne    ffffffff804b7301 <ip_queue_xmit+0x2bc>
> ffffffff804b71ac:        0 	41 0f b6 47 04       	movzbl 0x4(%r15),%eax
> ffffffff804b71b1:        0 	8d 70 14             	lea    0x14(%rax),%esi
> ffffffff804b71b4:       39 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b71b7:      493 	e8 f8 18 fd ff       	callq  ffffffff80488ab4 <skb_push>
> ffffffff804b71bc:        0 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b71bf:     1701 	e8 99 df ff ff       	callq  ffffffff804b515d <skb_reset_network_header>
> ffffffff804b71c4:      481 	0f b6 85 54 02 00 00 	movzbl 0x254(%rbp),%eax
> ffffffff804b71cb:     4202 	41 8b 9e bc 00 00 00 	mov    0xbc(%r14),%ebx
> ffffffff804b71d2:        3 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b71d5:        0 	49 03 9e d0 00 00 00 	add    0xd0(%r14),%rbx
> ffffffff804b71dc:      466 	80 cc 45             	or     $0x45,%ah
> ffffffff804b71df:        7 	66 c1 c0 08          	rol    $0x8,%ax
> ffffffff804b71e3:        0 	66 89 03             	mov    %ax,(%rbx)
> ffffffff804b71e6:      492 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
> ffffffff804b71eb:        3 	e8 a0 df ff ff       	callq  ffffffff804b5190 <ip_dont_fragment>
> ffffffff804b71f0:     1405 	85 c0                	test   %eax,%eax
> ffffffff804b71f2:     4391 	74 0f                	je     ffffffff804b7203 <ip_queue_xmit+0x1be>
> ffffffff804b71f4:        0 	83 7c 24 08 00       	cmpl   $0x0,0x8(%rsp)
> ffffffff804b71f9:      417 	75 08                	jne    ffffffff804b7203 <ip_queue_xmit+0x1be>
> ffffffff804b71fb:      503 	66 c7 43 06 40 00    	movw   $0x40,0x6(%rbx)
> ffffffff804b7201:     6743 	eb 06                	jmp    ffffffff804b7209 <ip_queue_xmit+0x1c4>
> ffffffff804b7203:        0 	66 c7 43 06 00 00    	movw   $0x0,0x6(%rbx)
> ffffffff804b7209:      118 	0f bf 85 40 02 00 00 	movswl 0x240(%rbp),%eax
> ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> ffffffff804b7215:      340 	85 c0                	test   %eax,%eax
> ffffffff804b7217:        0 	79 06                	jns    ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219:   107464 	8b 82 9c 00 00 00    	mov    0x9c(%rdx),%eax
> ffffffff804b721f:     4963 	88 43 08             	mov    %al,0x8(%rbx)
> ffffffff804b7222:    26297 	8a 45 39             	mov    0x39(%rbp),%al
> ffffffff804b7225:    76658 	4d 85 ff             	test   %r15,%r15
> ffffffff804b7228:     1712 	88 43 09             	mov    %al,0x9(%rbx)
> ffffffff804b722b:      148 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
> ffffffff804b7230:     2971 	8b 80 20 01 00 00    	mov    0x120(%rax),%eax
> ffffffff804b7236:    14849 	89 43 0c             	mov    %eax,0xc(%rbx)
> ffffffff804b7239:       84 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
> ffffffff804b723e:      360 	8b 80 1c 01 00 00    	mov    0x11c(%rax),%eax
> ffffffff804b7244:      174 	89 43 10             	mov    %eax,0x10(%rbx)
> ffffffff804b7247:       96 	74 32                	je     ffffffff804b727b <ip_queue_xmit+0x236>
> ffffffff804b7249:        0 	41 8a 57 04          	mov    0x4(%r15),%dl
> ffffffff804b724d:        0 	84 d2                	test   %dl,%dl
> ffffffff804b724f:        0 	74 2a                	je     ffffffff804b727b <ip_queue_xmit+0x236>
> ffffffff804b7251:        0 	c0 ea 02             	shr    $0x2,%dl
> ffffffff804b7254:        0 	03 13                	add    (%rbx),%edx
> ffffffff804b7256:        0 	8a 03                	mov    (%rbx),%al
> ffffffff804b7258:        0 	45 31 c0             	xor    %r8d,%r8d
> ffffffff804b725b:        0 	4c 89 fe             	mov    %r15,%rsi
> ffffffff804b725e:        0 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b7261:        0 	83 e0 f0             	and    $0xfffffffffffffff0,%eax
> ffffffff804b7264:        0 	83 e2 0f             	and    $0xf,%edx
> ffffffff804b7267:        0 	09 d0                	or     %edx,%eax
> ffffffff804b7269:        0 	88 03                	mov    %al,(%rbx)
> ffffffff804b726b:        0 	48 8b 4c 24 58       	mov    0x58(%rsp),%rcx
> ffffffff804b7270:        0 	8b 95 30 02 00 00    	mov    0x230(%rbp),%edx
> ffffffff804b7276:        0 	e8 e4 d8 ff ff       	callq  ffffffff804b4b5f <ip_options_build>
> ffffffff804b727b:      541 	41 8b 86 c8 00 00 00 	mov    0xc8(%r14),%eax
> ffffffff804b7282:      570 	31 d2                	xor    %edx,%edx
> ffffffff804b7284:        0 	49 03 86 d0 00 00 00 	add    0xd0(%r14),%rax
> ffffffff804b728b:       34 	8b 40 08             	mov    0x8(%rax),%eax
> ffffffff804b728e:      496 	66 85 c0             	test   %ax,%ax
> ffffffff804b7291:       11 	74 06                	je     ffffffff804b7299 <ip_queue_xmit+0x254>
> ffffffff804b7293:        9 	0f b7 c0             	movzwl %ax,%eax
> ffffffff804b7296:      495 	8d 50 ff             	lea    -0x1(%rax),%edx
> ffffffff804b7299:        2 	f6 43 06 40          	testb  $0x40,0x6(%rbx)
> ffffffff804b729d:        9 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
> ffffffff804b72a2:      497 	74 34                	je     ffffffff804b72d8 <ip_queue_xmit+0x293>
> ffffffff804b72a4:        8 	83 bd 30 02 00 00 00 	cmpl   $0x0,0x230(%rbp)
> ffffffff804b72ab:       10 	74 23                	je     ffffffff804b72d0 <ip_queue_xmit+0x28b>
> ffffffff804b72ad:     1044 	66 8b 85 52 02 00 00 	mov    0x252(%rbp),%ax
> ffffffff804b72b4:        7 	66 c1 c0 08          	rol    $0x8,%ax
> ffffffff804b72b8:        8 	66 89 43 04          	mov    %ax,0x4(%rbx)
> ffffffff804b72bc:      432 	66 8b 85 52 02 00 00 	mov    0x252(%rbp),%ax
> ffffffff804b72c3:        9 	ff c0                	inc    %eax
> ffffffff804b72c5:       14 	01 d0                	add    %edx,%eax
> ffffffff804b72c7:     1141 	66 89 85 52 02 00 00 	mov    %ax,0x252(%rbp)
> ffffffff804b72ce:        7 	eb 10                	jmp    ffffffff804b72e0 <ip_queue_xmit+0x29b>
> ffffffff804b72d0:        0 	66 c7 43 04 00 00    	movw   $0x0,0x4(%rbx)
> ffffffff804b72d6:        0 	eb 08                	jmp    ffffffff804b72e0 <ip_queue_xmit+0x29b>
> ffffffff804b72d8:        0 	48 89 df             	mov    %rbx,%rdi
> ffffffff804b72db:        0 	e8 b7 9d ff ff       	callq  ffffffff804b1097 <__ip_select_ident>
> ffffffff804b72e0:        6 	8b 85 54 01 00 00    	mov    0x154(%rbp),%eax
> ffffffff804b72e6:      458 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b72e9:        2 	41 89 46 78          	mov    %eax,0x78(%r14)
> ffffffff804b72ed:        4 	8b 85 f0 01 00 00    	mov    0x1f0(%rbp),%eax
> ffffffff804b72f3:      841 	41 89 86 b0 00 00 00 	mov    %eax,0xb0(%r14)
> ffffffff804b72fa:       11 	e8 30 f2 ff ff       	callq  ffffffff804b652f <ip_local_out>
> ffffffff804b72ff:        0 	eb 44                	jmp    ffffffff804b7345 <ip_queue_xmit+0x300>
> ffffffff804b7301:        0 	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
> ffffffff804b7308:        0 	00 00 
> ffffffff804b730a:        0 	8b 80 48 e0 ff ff    	mov    -0x1fb8(%rax),%eax
> ffffffff804b7310:        0 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b7313:        0 	30 c0                	xor    %al,%al
> ffffffff804b7315:        0 	66 83 f8 01          	cmp    $0x1,%ax
> ffffffff804b7319:        0 	48 19 c0             	sbb    %rax,%rax
> ffffffff804b731c:        0 	83 e0 08             	and    $0x8,%eax
> ffffffff804b731f:        0 	48 8b 90 a8 16 ab 80 	mov    -0x7f54e958(%rax),%rdx
> ffffffff804b7326:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
> ffffffff804b732d:        0 	00 
> ffffffff804b732e:        0 	89 c0                	mov    %eax,%eax
> ffffffff804b7330:        0 	48 f7 d2             	not    %rdx
> ffffffff804b7333:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
> ffffffff804b7337:        0 	48 ff 40 68          	incq   0x68(%rax)
> ffffffff804b733b:        0 	e8 b1 18 fd ff       	callq  ffffffff80488bf1 <kfree_skb>
> ffffffff804b7340:        0 	b8 8f ff ff ff       	mov    $0xffffff8f,%eax
> ffffffff804b7345:     9196 	48 83 c4 68          	add    $0x68,%rsp
> ffffffff804b7349:      892 	5b                   	pop    %rbx
> ffffffff804b734a:        0 	5d                   	pop    %rbp
> ffffffff804b734b:      488 	41 5c                	pop    %r12
> ffffffff804b734d:        0 	41 5d                	pop    %r13
> ffffffff804b734f:        0 	41 5e                	pop    %r14
> ffffffff804b7351:      513 	41 5f                	pop    %r15
> ffffffff804b7353:        0 	c3                   	retq   
> 
> about 10% of this function's cost is artificial:
> 
> ffffffff804b7045:     1001 <ip_queue_xmit>:
> ffffffff804b7045:     1001 	41 57                	push   %r15
> ffffffff804b7047:    36698 	41 56                	push   %r14
> 
> there are profiler hits that leaked in via out-of-order execution from 
> the callsites. The callsites are hard to map unfortunately, as this 
> function is called via function pointers.
> 
> the most likely callsite is tcp_transmit_skb().
> 
> 30% of the overhead of this function comes from:
> 
> ffffffff804b7203:        0 	66 c7 43 06 00 00    	movw   $0x0,0x6(%rbx)
> ffffffff804b7209:      118 	0f bf 85 40 02 00 00 	movswl 0x240(%rbp),%eax
> ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> ffffffff804b7215:      340 	85 c0                	test   %eax,%eax
> ffffffff804b7217:        0 	79 06                	jns    ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219:   107464 	8b 82 9c 00 00 00    	mov    0x9c(%rdx),%eax
> ffffffff804b721f:     4963 	88 43 08             	mov    %al,0x8(%rbx)
> 
> the 16-bit movw looks a bit weird. It comes from line 372:
> 
>  0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
>  367		iph = ip_hdr(skb);
>  368		*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
>  369		if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
>  370			iph->frag_off = htons(IP_DF);
>  371		else
>  372			iph->frag_off = 0;
>  373		iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
>  374		iph->protocol = sk->sk_protocol;
>  375		iph->saddr    = rt->rt_src;
>  376		iph->daddr    = rt->rt_dst;
> 
> the ip-header fragment flag setting to zero.
> 
> 16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is 
> towards eliminating them as much as possible.
> 
> _But_, the real overhead probably comes from:
> 
>  ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> 
> which is the next line, the ttl field:
> 
>  373             iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
> 
> this shows that we are doing a hard cachemiss on the net-localhost 
> route dst structure cacheline. We do a plain load instruction from it 
> here and get a hefty cachemiss. (because 16 CPUs are banging on that 
> single route)
> 
> And let's make sure we see this in perspective as well: that single 
> cachemiss is _1.0 percent_ of the total tbench cost. (!) We could make 
> the scheduler 10% slower straight away and it would have less of a 
> real-life effect than this single iph->ttl field setting.
> 
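
The percentages quoted above can be rechecked from the numbers in the
profile itself. A small sketch (the function name is made up for this
note): the hottest instruction has 107464 of the function's 335615 hits,
and the function is 3.356152% of the whole profile.

```c
/*
 * Share of the whole profile, in percent, taken by one instruction:
 * (instruction hits / function hits) * function's share of the profile.
 */
static double pct_of_total(unsigned long insn_hits, unsigned long func_hits,
			   double func_pct)
{
	return (double)insn_hits / (double)func_hits * func_pct;
}
```

pct_of_total(107464, 335615, 3.356152) comes out at roughly 1.07, which
matches the "1.0 percent of the total tbench cost" figure, and
107464 / 335615 is about 32% of the function, in line with the "30%"
estimate.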

If you applied my patch against dst_entry, then you should not see any cache
line misses when accessing the first and second cache lines of dst_entry, which
are mostly read (and contain all the metrics, like ttl at offset 0x58). Or
something is really wrong...
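
A minimal sketch of the layout idea, for readers who have not seen the
patch. This is not the real dst_entry: struct route_sketch, its fields and
the 64-byte line size are assumptions made for illustration. The point is
only that read-mostly fields and the frequently written refcount are kept
on different cache lines, so refcount traffic from other CPUs does not
evict the line the metrics live on.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumed cache line size */

struct route_sketch {
	/* read-mostly part: consulted on every transmit */
	uint32_t metrics[14];
	uint32_t flags;
	/* write-hot part: forced onto its own cache line */
	_Alignas(CACHE_LINE) int32_t refcnt;
	uint64_t lastuse;
};

/* True if two byte offsets into a line-aligned object share a cache line. */
static int same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

The same helper can be applied to the offsets in the quoted disassembly: the
ttl load reads 0x9c and the "lock incl" writes 0xb0, through what looks like
the same dst pointer. If the object starts on a 64-byte boundary, those two
offsets share a line, whereas 0x58 and 0xb0 do not.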

Now if your CPU cache is blown away because of the huge send()/receive() done
by tbench, we are stuck of course.

I don't know what you want to prove here. We already have one dst_entry per route in
the rt cache, and it can already consume a *lot* of RAM if you have 1 million entries
in the rt cache.

tbench is mostly a network benchmark (and one using the loopback device), so it's no
surprise that it stresses the network part of the kernel :)




^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:57                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 20:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:
> 
>> 100.000000 total
>> ................
>>   3.356152 ip_queue_xmit
> 
>                       hits (335615 total)
>                  .........
> ffffffff804b7045:     1001 <ip_queue_xmit>:
> ffffffff804b7045:     1001 	41 57                	push   %r15
> ffffffff804b7047:    36698 	41 56                	push   %r14
> ffffffff804b7049:        0 	49 89 fe             	mov    %rdi,%r14
> ffffffff804b704c:        0 	41 55                	push   %r13
> ffffffff804b704e:      447 	41 54                	push   %r12
> ffffffff804b7050:        0 	55                   	push   %rbp
> ffffffff804b7051:        4 	53                   	push   %rbx
> ffffffff804b7052:      465 	48 83 ec 68          	sub    $0x68,%rsp
> ffffffff804b7056:        1 	89 74 24 08          	mov    %esi,0x8(%rsp)
> ffffffff804b705a:      486 	48 8b 47 28          	mov    0x28(%rdi),%rax
> ffffffff804b705e:        0 	48 8b 6f 10          	mov    0x10(%rdi),%rbp
> ffffffff804b7062:        7 	48 85 c0             	test   %rax,%rax
> ffffffff804b7065:      480 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
> ffffffff804b706a:        0 	4c 8b bd 48 02 00 00 	mov    0x248(%rbp),%r15
> ffffffff804b7071:        7 	0f 85 0d 01 00 00    	jne    ffffffff804b7184 <ip_queue_xmit+0x13f>
> ffffffff804b7077:      452 	31 f6                	xor    %esi,%esi
> ffffffff804b7079:        0 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b707c:        5 	e8 c1 eb fc ff       	callq  ffffffff80485c42 <__sk_dst_check>
> ffffffff804b7081:      434 	48 85 c0             	test   %rax,%rax
> ffffffff804b7084:       54 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
> ffffffff804b7089:        0 	0f 85 e0 00 00 00    	jne    ffffffff804b716f <ip_queue_xmit+0x12a>
> ffffffff804b708f:        0 	4d 85 ff             	test   %r15,%r15
> ffffffff804b7092:        0 	44 8b ad 30 02 00 00 	mov    0x230(%rbp),%r13d
> ffffffff804b7099:        0 	74 0a                	je     ffffffff804b70a5 <ip_queue_xmit+0x60>
> ffffffff804b709b:        0 	41 80 7f 05 00       	cmpb   $0x0,0x5(%r15)
> ffffffff804b70a0:        0 	74 03                	je     ffffffff804b70a5 <ip_queue_xmit+0x60>
> ffffffff804b70a2:        0 	45 8b 2f             	mov    (%r15),%r13d
> ffffffff804b70a5:        0 	8b 85 3c 02 00 00    	mov    0x23c(%rbp),%eax
> ffffffff804b70ab:        0 	48 8d b5 10 01 00 00 	lea    0x110(%rbp),%rsi
> ffffffff804b70b2:        0 	44 8b 65 04          	mov    0x4(%rbp),%r12d
> ffffffff804b70b6:        0 	bf 0d 00 00 00       	mov    $0xd,%edi
> ffffffff804b70bb:        0 	89 44 24 0c          	mov    %eax,0xc(%rsp)
> ffffffff804b70bf:        0 	8a 9d 54 02 00 00    	mov    0x254(%rbp),%bl
> ffffffff804b70c5:        0 	e8 9a df ff ff       	callq  ffffffff804b5064 <constant_test_bit>
> ffffffff804b70ca:        0 	31 d2                	xor    %edx,%edx
> ffffffff804b70cc:        0 	48 8d 7c 24 10       	lea    0x10(%rsp),%rdi
> ffffffff804b70d1:        0 	41 89 c3             	mov    %eax,%r11d
> ffffffff804b70d4:        0 	fc                   	cld    
> ffffffff804b70d5:        0 	89 d0                	mov    %edx,%eax
> ffffffff804b70d7:        0 	b9 10 00 00 00       	mov    $0x10,%ecx
> ffffffff804b70dc:        0 	44 8a 45 39          	mov    0x39(%rbp),%r8b
> ffffffff804b70e0:        0 	40 8a b5 57 02 00 00 	mov    0x257(%rbp),%sil
> ffffffff804b70e7:        0 	44 8b 8d 50 02 00 00 	mov    0x250(%rbp),%r9d
> ffffffff804b70ee:        0 	83 e3 1e             	and    $0x1e,%ebx
> ffffffff804b70f1:        0 	44 8b 95 38 02 00 00 	mov    0x238(%rbp),%r10d
> ffffffff804b70f8:        0 	44 09 db             	or     %r11d,%ebx
> ffffffff804b70fb:        0 	f3 ab                	rep stos %eax,%es:(%rdi)
> ffffffff804b70fd:        0 	40 c0 ee 05          	shr    $0x5,%sil
> ffffffff804b7101:        0 	88 5c 24 24          	mov    %bl,0x24(%rsp)
> ffffffff804b7105:        0 	48 8d 5c 24 10       	lea    0x10(%rsp),%rbx
> ffffffff804b710a:        0 	83 e6 01             	and    $0x1,%esi
> ffffffff804b710d:        0 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b7110:        0 	44 88 44 24 40       	mov    %r8b,0x40(%rsp)
> ffffffff804b7115:        0 	8b 44 24 0c          	mov    0xc(%rsp),%eax
> ffffffff804b7119:        0 	40 88 74 24 41       	mov    %sil,0x41(%rsp)
> ffffffff804b711e:        0 	48 89 de             	mov    %rbx,%rsi
> ffffffff804b7121:        0 	66 44 89 4c 24 44    	mov    %r9w,0x44(%rsp)
> ffffffff804b7127:        0 	66 44 89 54 24 46    	mov    %r10w,0x46(%rsp)
> ffffffff804b712d:        0 	44 89 64 24 10       	mov    %r12d,0x10(%rsp)
> ffffffff804b7132:        0 	44 89 6c 24 1c       	mov    %r13d,0x1c(%rsp)
> ffffffff804b7137:        0 	89 44 24 20          	mov    %eax,0x20(%rsp)
> ffffffff804b713b:        0 	e8 2d 9f e5 ff       	callq  ffffffff8031106d <security_sk_classify_flow>
> ffffffff804b7140:        0 	48 8d 74 24 58       	lea    0x58(%rsp),%rsi
> ffffffff804b7145:        0 	45 31 c0             	xor    %r8d,%r8d
> ffffffff804b7148:        0 	48 89 e9             	mov    %rbp,%rcx
> ffffffff804b714b:        0 	48 89 da             	mov    %rbx,%rdx
> ffffffff804b714e:        0 	48 c7 c7 d0 15 ab 80 	mov    $0xffffffff80ab15d0,%rdi
> ffffffff804b7155:        0 	e8 1a 91 ff ff       	callq  ffffffff804b0274 <ip_route_output_flow>
> ffffffff804b715a:        0 	85 c0                	test   %eax,%eax
> ffffffff804b715c:        0 	0f 85 9f 01 00 00    	jne    ffffffff804b7301 <ip_queue_xmit+0x2bc>
> ffffffff804b7162:        0 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
> ffffffff804b7167:        0 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b716a:        0 	e8 a8 eb fc ff       	callq  ffffffff80485d17 <sk_setup_caps>
> ffffffff804b716f:      441 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
> ffffffff804b7174:     1388 	48 85 c0             	test   %rax,%rax
> ffffffff804b7177:        0 	74 07                	je     ffffffff804b7180 <ip_queue_xmit+0x13b>
> ffffffff804b7179:        0 	f0 ff 80 b0 00 00 00 	lock incl 0xb0(%rax)
> ffffffff804b7180:      556 	49 89 46 28          	mov    %rax,0x28(%r14)
> ffffffff804b7184:     8351 	4d 85 ff             	test   %r15,%r15
> ffffffff804b7187:        0 	be 14 00 00 00       	mov    $0x14,%esi
> ffffffff804b718c:      461 	74 26                	je     ffffffff804b71b4 <ip_queue_xmit+0x16f>
> ffffffff804b718e:        0 	41 f6 47 08 01       	testb  $0x1,0x8(%r15)
> ffffffff804b7193:        0 	74 17                	je     ffffffff804b71ac <ip_queue_xmit+0x167>
> ffffffff804b7195:        0 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> ffffffff804b719a:        0 	8b 82 28 01 00 00    	mov    0x128(%rdx),%eax
> ffffffff804b71a0:        0 	39 82 1c 01 00 00    	cmp    %eax,0x11c(%rdx)
> ffffffff804b71a6:        0 	0f 85 55 01 00 00    	jne    ffffffff804b7301 <ip_queue_xmit+0x2bc>
> ffffffff804b71ac:        0 	41 0f b6 47 04       	movzbl 0x4(%r15),%eax
> ffffffff804b71b1:        0 	8d 70 14             	lea    0x14(%rax),%esi
> ffffffff804b71b4:       39 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b71b7:      493 	e8 f8 18 fd ff       	callq  ffffffff80488ab4 <skb_push>
> ffffffff804b71bc:        0 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b71bf:     1701 	e8 99 df ff ff       	callq  ffffffff804b515d <skb_reset_network_header>
> ffffffff804b71c4:      481 	0f b6 85 54 02 00 00 	movzbl 0x254(%rbp),%eax
> ffffffff804b71cb:     4202 	41 8b 9e bc 00 00 00 	mov    0xbc(%r14),%ebx
> ffffffff804b71d2:        3 	48 89 ef             	mov    %rbp,%rdi
> ffffffff804b71d5:        0 	49 03 9e d0 00 00 00 	add    0xd0(%r14),%rbx
> ffffffff804b71dc:      466 	80 cc 45             	or     $0x45,%ah
> ffffffff804b71df:        7 	66 c1 c0 08          	rol    $0x8,%ax
> ffffffff804b71e3:        0 	66 89 03             	mov    %ax,(%rbx)
> ffffffff804b71e6:      492 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
> ffffffff804b71eb:        3 	e8 a0 df ff ff       	callq  ffffffff804b5190 <ip_dont_fragment>
> ffffffff804b71f0:     1405 	85 c0                	test   %eax,%eax
> ffffffff804b71f2:     4391 	74 0f                	je     ffffffff804b7203 <ip_queue_xmit+0x1be>
> ffffffff804b71f4:        0 	83 7c 24 08 00       	cmpl   $0x0,0x8(%rsp)
> ffffffff804b71f9:      417 	75 08                	jne    ffffffff804b7203 <ip_queue_xmit+0x1be>
> ffffffff804b71fb:      503 	66 c7 43 06 40 00    	movw   $0x40,0x6(%rbx)
> ffffffff804b7201:     6743 	eb 06                	jmp    ffffffff804b7209 <ip_queue_xmit+0x1c4>
> ffffffff804b7203:        0 	66 c7 43 06 00 00    	movw   $0x0,0x6(%rbx)
> ffffffff804b7209:      118 	0f bf 85 40 02 00 00 	movswl 0x240(%rbp),%eax
> ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> ffffffff804b7215:      340 	85 c0                	test   %eax,%eax
> ffffffff804b7217:        0 	79 06                	jns    ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219:   107464 	8b 82 9c 00 00 00    	mov    0x9c(%rdx),%eax
> ffffffff804b721f:     4963 	88 43 08             	mov    %al,0x8(%rbx)
> ffffffff804b7222:    26297 	8a 45 39             	mov    0x39(%rbp),%al
> ffffffff804b7225:    76658 	4d 85 ff             	test   %r15,%r15
> ffffffff804b7228:     1712 	88 43 09             	mov    %al,0x9(%rbx)
> ffffffff804b722b:      148 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
> ffffffff804b7230:     2971 	8b 80 20 01 00 00    	mov    0x120(%rax),%eax
> ffffffff804b7236:    14849 	89 43 0c             	mov    %eax,0xc(%rbx)
> ffffffff804b7239:       84 	48 8b 44 24 58       	mov    0x58(%rsp),%rax
> ffffffff804b723e:      360 	8b 80 1c 01 00 00    	mov    0x11c(%rax),%eax
> ffffffff804b7244:      174 	89 43 10             	mov    %eax,0x10(%rbx)
> ffffffff804b7247:       96 	74 32                	je     ffffffff804b727b <ip_queue_xmit+0x236>
> ffffffff804b7249:        0 	41 8a 57 04          	mov    0x4(%r15),%dl
> ffffffff804b724d:        0 	84 d2                	test   %dl,%dl
> ffffffff804b724f:        0 	74 2a                	je     ffffffff804b727b <ip_queue_xmit+0x236>
> ffffffff804b7251:        0 	c0 ea 02             	shr    $0x2,%dl
> ffffffff804b7254:        0 	03 13                	add    (%rbx),%edx
> ffffffff804b7256:        0 	8a 03                	mov    (%rbx),%al
> ffffffff804b7258:        0 	45 31 c0             	xor    %r8d,%r8d
> ffffffff804b725b:        0 	4c 89 fe             	mov    %r15,%rsi
> ffffffff804b725e:        0 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b7261:        0 	83 e0 f0             	and    $0xfffffffffffffff0,%eax
> ffffffff804b7264:        0 	83 e2 0f             	and    $0xf,%edx
> ffffffff804b7267:        0 	09 d0                	or     %edx,%eax
> ffffffff804b7269:        0 	88 03                	mov    %al,(%rbx)
> ffffffff804b726b:        0 	48 8b 4c 24 58       	mov    0x58(%rsp),%rcx
> ffffffff804b7270:        0 	8b 95 30 02 00 00    	mov    0x230(%rbp),%edx
> ffffffff804b7276:        0 	e8 e4 d8 ff ff       	callq  ffffffff804b4b5f <ip_options_build>
> ffffffff804b727b:      541 	41 8b 86 c8 00 00 00 	mov    0xc8(%r14),%eax
> ffffffff804b7282:      570 	31 d2                	xor    %edx,%edx
> ffffffff804b7284:        0 	49 03 86 d0 00 00 00 	add    0xd0(%r14),%rax
> ffffffff804b728b:       34 	8b 40 08             	mov    0x8(%rax),%eax
> ffffffff804b728e:      496 	66 85 c0             	test   %ax,%ax
> ffffffff804b7291:       11 	74 06                	je     ffffffff804b7299 <ip_queue_xmit+0x254>
> ffffffff804b7293:        9 	0f b7 c0             	movzwl %ax,%eax
> ffffffff804b7296:      495 	8d 50 ff             	lea    -0x1(%rax),%edx
> ffffffff804b7299:        2 	f6 43 06 40          	testb  $0x40,0x6(%rbx)
> ffffffff804b729d:        9 	48 8b 74 24 58       	mov    0x58(%rsp),%rsi
> ffffffff804b72a2:      497 	74 34                	je     ffffffff804b72d8 <ip_queue_xmit+0x293>
> ffffffff804b72a4:        8 	83 bd 30 02 00 00 00 	cmpl   $0x0,0x230(%rbp)
> ffffffff804b72ab:       10 	74 23                	je     ffffffff804b72d0 <ip_queue_xmit+0x28b>
> ffffffff804b72ad:     1044 	66 8b 85 52 02 00 00 	mov    0x252(%rbp),%ax
> ffffffff804b72b4:        7 	66 c1 c0 08          	rol    $0x8,%ax
> ffffffff804b72b8:        8 	66 89 43 04          	mov    %ax,0x4(%rbx)
> ffffffff804b72bc:      432 	66 8b 85 52 02 00 00 	mov    0x252(%rbp),%ax
> ffffffff804b72c3:        9 	ff c0                	inc    %eax
> ffffffff804b72c5:       14 	01 d0                	add    %edx,%eax
> ffffffff804b72c7:     1141 	66 89 85 52 02 00 00 	mov    %ax,0x252(%rbp)
> ffffffff804b72ce:        7 	eb 10                	jmp    ffffffff804b72e0 <ip_queue_xmit+0x29b>
> ffffffff804b72d0:        0 	66 c7 43 04 00 00    	movw   $0x0,0x4(%rbx)
> ffffffff804b72d6:        0 	eb 08                	jmp    ffffffff804b72e0 <ip_queue_xmit+0x29b>
> ffffffff804b72d8:        0 	48 89 df             	mov    %rbx,%rdi
> ffffffff804b72db:        0 	e8 b7 9d ff ff       	callq  ffffffff804b1097 <__ip_select_ident>
> ffffffff804b72e0:        6 	8b 85 54 01 00 00    	mov    0x154(%rbp),%eax
> ffffffff804b72e6:      458 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b72e9:        2 	41 89 46 78          	mov    %eax,0x78(%r14)
> ffffffff804b72ed:        4 	8b 85 f0 01 00 00    	mov    0x1f0(%rbp),%eax
> ffffffff804b72f3:      841 	41 89 86 b0 00 00 00 	mov    %eax,0xb0(%r14)
> ffffffff804b72fa:       11 	e8 30 f2 ff ff       	callq  ffffffff804b652f <ip_local_out>
> ffffffff804b72ff:        0 	eb 44                	jmp    ffffffff804b7345 <ip_queue_xmit+0x300>
> ffffffff804b7301:        0 	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
> ffffffff804b7308:        0 	00 00 
> ffffffff804b730a:        0 	8b 80 48 e0 ff ff    	mov    -0x1fb8(%rax),%eax
> ffffffff804b7310:        0 	4c 89 f7             	mov    %r14,%rdi
> ffffffff804b7313:        0 	30 c0                	xor    %al,%al
> ffffffff804b7315:        0 	66 83 f8 01          	cmp    $0x1,%ax
> ffffffff804b7319:        0 	48 19 c0             	sbb    %rax,%rax
> ffffffff804b731c:        0 	83 e0 08             	and    $0x8,%eax
> ffffffff804b731f:        0 	48 8b 90 a8 16 ab 80 	mov    -0x7f54e958(%rax),%rdx
> ffffffff804b7326:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
> ffffffff804b732d:        0 	00 
> ffffffff804b732e:        0 	89 c0                	mov    %eax,%eax
> ffffffff804b7330:        0 	48 f7 d2             	not    %rdx
> ffffffff804b7333:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
> ffffffff804b7337:        0 	48 ff 40 68          	incq   0x68(%rax)
> ffffffff804b733b:        0 	e8 b1 18 fd ff       	callq  ffffffff80488bf1 <kfree_skb>
> ffffffff804b7340:        0 	b8 8f ff ff ff       	mov    $0xffffff8f,%eax
> ffffffff804b7345:     9196 	48 83 c4 68          	add    $0x68,%rsp
> ffffffff804b7349:      892 	5b                   	pop    %rbx
> ffffffff804b734a:        0 	5d                   	pop    %rbp
> ffffffff804b734b:      488 	41 5c                	pop    %r12
> ffffffff804b734d:        0 	41 5d                	pop    %r13
> ffffffff804b734f:        0 	41 5e                	pop    %r14
> ffffffff804b7351:      513 	41 5f                	pop    %r15
> ffffffff804b7353:        0 	c3                   	retq   
> 
> about 10% of this function's cost is artificial:
> 
> ffffffff804b7045:     1001 <ip_queue_xmit>:
> ffffffff804b7045:     1001 	41 57                	push   %r15
> ffffffff804b7047:    36698 	41 56                	push   %r14
> 
> there are profiler hits that leaked in via out-of-order execution from 
> the callsites. The callsites are hard to map unfortunately, as this 
> function is called via function pointers.
> 
> the most likely callsite is tcp_transmit_skb().
> 
> 30% of the overhead of this function comes from:
> 
> ffffffff804b7203:        0 	66 c7 43 06 00 00    	movw   $0x0,0x6(%rbx)
> ffffffff804b7209:      118 	0f bf 85 40 02 00 00 	movswl 0x240(%rbp),%eax
> ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> ffffffff804b7215:      340 	85 c0                	test   %eax,%eax
> ffffffff804b7217:        0 	79 06                	jns    ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219:   107464 	8b 82 9c 00 00 00    	mov    0x9c(%rdx),%eax
> ffffffff804b721f:     4963 	88 43 08             	mov    %al,0x8(%rbx)
> 
> the 16-bit movw looks a bit weird. It comes from line 372:
> 
>  0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
>  367		iph = ip_hdr(skb);
>  368		*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
>  369		if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
>  370			iph->frag_off = htons(IP_DF);
>  371		else
>  372			iph->frag_off = 0;
>  373		iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
>  374		iph->protocol = sk->sk_protocol;
>  375		iph->saddr    = rt->rt_src;
>  376		iph->daddr    = rt->rt_dst;
> 
> the ip-header fragment flag setting to zero.
> 
> 16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is 
> towards eliminating them as much as possible.
> 
> _But_, the real overhead probably comes from:
> 
>  ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> 
> which is the next line, the ttl field:
> 
>  373             iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
> 
> this shows that we are doing a hard cachemiss on the net-localhost 
> route dst structure cacheline. We do a plain load instruction from it 
> here and get a hefty cachemiss. (because 16 CPUs are banging on that 
> single route)
> 
> And let's make sure we see this in perspective as well: that single 
> cachemiss is _1.0 percent_ of the total tbench cost. (!) We could make 
> the scheduler 10% slower straight away and it would have less of a 
> real-life effect than this single iph->ttl field setting.
> 
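
To put that "1.0 percent" figure in perspective, here is a minimal sketch of the arithmetic. It assumes the ip_queue_xmit() listing above and the tcp_ack() listing posted later in this thread come from the same profile run, so that tcp_ack's 199753 hits at 1.997533% can be used to estimate the total sample count. The helper names are made up for illustration.

```c
#include <assert.h>

/*
 * Estimate the total number of profile samples from one function's
 * hit count and the percentage of the profile it was reported at.
 */
static double estimate_total_hits(double func_hits, double func_percent)
{
	assert(func_percent > 0.0);
	return func_hits * 100.0 / func_percent;
}

/* Share of the whole profile, in percent, taken by one instruction. */
static double percent_of_total(double insn_hits, double total_hits)
{
	assert(total_hits > 0.0);
	return insn_hits * 100.0 / total_hits;
}
```

Calling estimate_total_hits(199753.0, 1.997533) gives a total of roughly 10 million samples, and percent_of_total(107464.0, total) for the load at ffffffff804b7219 then comes out at roughly 1.07%, which is consistent with the figure quoted above.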

If you applied my patch against dst_entry, then you should not get any cache
line miss when accessing the first and second cache lines of dst_entry, which are
mostly read (and contain all the metrics, like ttl at offset 0x58). Otherwise
something is really wrong...
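
The patch itself is not shown in this message, so the following is only a sketch of the general idea: keep the read-mostly fields and the per-packet written fields on different cache lines. The struct, its field names and the 64-byte line size are all assumptions for illustration; this is not the kernel's dst_entry.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed line size; check the target CPU */

/* Hypothetical stand-in for a route entry; NOT the kernel's dst_entry. */
struct route_entry {
	/* Read-mostly: set up once, then only loaded on the transmit path. */
	unsigned int metrics[16];
	unsigned int src_addr;
	unsigned int dst_addr;

	/*
	 * Written for every packet. Aligning refcnt starts a new cache
	 * line, so the write-hot fields sit apart from the read-mostly ones
	 * and do not invalidate them on other CPUs.
	 */
	alignas(CACHE_LINE) atomic_int refcnt;
	unsigned long last_use;
};

/* Nonzero when two byte offsets land in the same cache line. */
static int same_cache_line(size_t a, size_t b)
{
	return a / CACHE_LINE == b / CACHE_LINE;
}

/* Compile-time check that the hot counter starts a cache line. */
static_assert(offsetof(struct route_entry, refcnt) % CACHE_LINE == 0,
	      "refcnt must start a cache line");
```

With a layout like this, a plain load of a metric should only miss if the line was evicted, not because another CPU just bumped the reference count.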

Now, if your CPU cache is blown away by the huge send()/receive() calls done
by tbench, we are stuck, of course.

I don't know what you want to prove here. We already have one dst_entry per route in
the rt cache, and it can already consume a *lot* of RAM if you have 1 million entries
in the rt cache.

tbench is mostly a network benchmark (and one using the loopback device), so it is no
surprise that it can stress the network part of the kernel :)



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 20:58                                     ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 20:58 UTC (permalink / raw)
  To: torvalds
  Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST)

> On Mon, 17 Nov 2008, David Miller wrote:
> > 
> > It's on my workstation which is a much simpler 2 processor
> > UltraSPARC-IIIi (1.5GHz) system.
> 
> Ok. It could easily be something like a cache footprint issue. And while I 
> don't know my sparc CPUs very well, I think the Ultrasparc-IIIi is 
> superscalar but does no out-of-order and speculation, no?

It does only very simple speculation, but your description is accurate.

> So I could easily see that the indirect branches in the scheduler
> hurt much more, and might explain why the x86 profile looks so
> different.

Right.

> One thing that non-NMI profiles also tend to show is "clumping", which in 
> turn tends to rather excessively pinpoint code sequences that release the 
> irq flag - just because those points show up in profiles, rather than 
> being a spread-out-mush. So it's possible that Ingo's profile did show the 
> scheduler more, but it was in the form of much more spread out "noise" 
> rather than the single spike you saw. 

Sure.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:01                               ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-17 21:01 UTC (permalink / raw)
  To: mingo
  Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 21:55:30 +0100

> and ouch does that global dec on &nf_bridge->use hurt!

nf_bridge should always be NULL on your system

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:04                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 21:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> (gdb) list *0xffffffff8048942e
> 0xffffffff8048942e is in skb_release_head_state (include/linux/skbuff.h:1783).
> 1778	}
> 1779	#endif
> 1780	#ifdef CONFIG_BRIDGE_NETFILTER
> 1781	static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
> 1782	{
> 1783		if (nf_bridge && atomic_dec_and_test(&nf_bridge->use))
> 1784			kfree(nf_bridge);
> 1785	}
> 1786	static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
> 1787	{
> 
> and ouch does that global dec on &nf_bridge->use hurt!
> 
> i do have:
> 
>   CONFIG_BRIDGE_NETFILTER=y
> 
> (this is a Fedora distro kernel derived .config)

Hmm, then you should also hit this cache line at the atomic_inc() site...

Strange, I never caught this one.
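
For reference, the get/put pattern being discussed can be sketched in userspace with C11 atomics. This is a minimal sketch only: struct bridge_info and the three helpers are hypothetical stand-ins, not the kernel's nf_bridge_info API. It shows why a NULL pointer costs nothing here, and why a non-NULL one dirties the same cache line on both the get and the put side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct nf_bridge_info. */
struct bridge_info {
	atomic_int use;		/* reference count */
};

/* Allocate an object holding one reference; returns NULL on failure. */
static struct bridge_info *bridge_info_alloc(void)
{
	struct bridge_info *b = malloc(sizeof(*b));

	if (b)
		atomic_init(&b->use, 1);
	return b;
}

/* Take a reference: an atomic write to the line holding 'use'. */
static void bridge_info_get(struct bridge_info *b)
{
	if (b)
		atomic_fetch_add(&b->use, 1);
}

/* Drop a reference; returns 1 when this call freed the object. */
static int bridge_info_put(struct bridge_info *b)
{
	/* A NULL pointer short-circuits before any atomic operation. */
	if (b && atomic_fetch_sub(&b->use, 1) == 1) {
		free(b);
		return 1;
	}
	return 0;
}
```

If the pointer really is always NULL, neither helper touches shared memory, so a hot spot on the decrement would suggest the pointer is non-NULL at that point.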


^ permalink raw reply	[flat|nested] 349+ messages in thread

* tcp_ack(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:09                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.997533 tcp_ack

                      hits (total: 199753)
                 .........
ffffffff804c0b17:      452 <tcp_ack>:
ffffffff804c0b17:      452 	41 57                	push   %r15
ffffffff804c0b19:     9569 	41 56                	push   %r14
ffffffff804c0b1b:        0 	41 55                	push   %r13
ffffffff804c0b1d:        0 	49 89 f5             	mov    %rsi,%r13
ffffffff804c0b20:      493 	41 54                	push   %r12
ffffffff804c0b22:      104 	41 89 d4             	mov    %edx,%r12d
ffffffff804c0b25:        0 	55                   	push   %rbp
ffffffff804c0b26:      425 	48 89 fd             	mov    %rdi,%rbp
ffffffff804c0b29:       21 	53                   	push   %rbx
ffffffff804c0b2a:        0 	48 81 ec 88 00 00 00 	sub    $0x88,%rsp
ffffffff804c0b31:      445 	8b 87 00 04 00 00    	mov    0x400(%rdi),%eax
ffffffff804c0b37:        0 	89 44 24 18          	mov    %eax,0x18(%rsp)
ffffffff804c0b3b:      443 	48 8d 46 38          	lea    0x38(%rsi),%rax
ffffffff804c0b3f:       18 	8b 50 28             	mov    0x28(%rax),%edx
ffffffff804c0b42:     2565 	44 8b 70 18          	mov    0x18(%rax),%r14d
ffffffff804c0b46:      358 	89 54 24 1c          	mov    %edx,0x1c(%rsp)
ffffffff804c0b4a:        2 	39 97 fc 03 00 00    	cmp    %edx,0x3fc(%rdi)
ffffffff804c0b50:      368 	0f 88 af 13 00 00    	js     ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c0b56:      106 	89 d1                	mov    %edx,%ecx
ffffffff804c0b58:        2 	2b 4c 24 18          	sub    0x18(%rsp),%ecx
ffffffff804c0b5c:      328 	0f 88 83 13 00 00    	js     ffffffff804c1ee5 <tcp_ack+0x13ce>
ffffffff804c0b62:     1440 	8b 44 24 18          	mov    0x18(%rsp),%eax
ffffffff804c0b66:        2 	29 d0                	sub    %edx,%eax
ffffffff804c0b68:       77 	44 89 e2             	mov    %r12d,%edx
ffffffff804c0b6b:      398 	89 c6                	mov    %eax,%esi
ffffffff804c0b6d:        3 	80 ce 04             	or     $0x4,%dh
ffffffff804c0b70:       65 	c1 ee 1f             	shr    $0x1f,%esi
ffffffff804c0b73:      362 	44 0f 45 e2          	cmovne %edx,%r12d
ffffffff804c0b77:        1 	83 3d ea 78 3f 00 00 	cmpl   $0x0,0x3f78ea(%rip)        # ffffffff808b8468 <sysctl_tcp_abc>
ffffffff804c0b7e:       64 	74 27                	je     ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b80:        0 	8a 87 78 03 00 00    	mov    0x378(%rdi),%al
ffffffff804c0b86:        0 	3c 01                	cmp    $0x1,%al
ffffffff804c0b88:        0 	77 08                	ja     ffffffff804c0b92 <tcp_ack+0x7b>
ffffffff804c0b8a:        0 	01 8f dc 04 00 00    	add    %ecx,0x4dc(%rdi)
ffffffff804c0b90:        0 	eb 15                	jmp    ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b92:        0 	3c 04                	cmp    $0x4,%al
ffffffff804c0b94:        0 	75 11                	jne    ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b96:        0 	8b 87 4c 04 00 00    	mov    0x44c(%rdi),%eax
ffffffff804c0b9c:        0 	39 c1                	cmp    %eax,%ecx
ffffffff804c0b9e:        0 	0f 46 c1             	cmovbe %ecx,%eax
ffffffff804c0ba1:        0 	01 87 dc 04 00 00    	add    %eax,0x4dc(%rdi)
ffffffff804c0ba7:      377 	8b 9d d4 04 00 00    	mov    0x4d4(%rbp),%ebx
ffffffff804c0bad:     3672 	41 f7 c4 00 01 00 00 	test   $0x100,%r12d
ffffffff804c0bb4:      282 	89 5c 24 20          	mov    %ebx,0x20(%rsp)
ffffffff804c0bb8:        0 	8b 85 74 04 00 00    	mov    0x474(%rbp),%eax
ffffffff804c0bbe:      140 	89 44 24 30          	mov    %eax,0x30(%rsp)
ffffffff804c0bc2:     7592 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c0bc8:     1580 	89 54 24 24          	mov    %edx,0x24(%rsp)
ffffffff804c0bcc:        3 	8b 9d cc 04 00 00    	mov    0x4cc(%rbp),%ebx
ffffffff804c0bd2:       58 	89 5c 24 28          	mov    %ebx,0x28(%rsp)
ffffffff804c0bd6:      419 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c0bdc:        0 	89 44 24 2c          	mov    %eax,0x2c(%rsp)
ffffffff804c0be0:       65 	75 4f                	jne    ffffffff804c0c31 <tcp_ack+0x11a>
ffffffff804c0be2:      423 	85 f6                	test   %esi,%esi
ffffffff804c0be4:       55 	74 4b                	je     ffffffff804c0c31 <tcp_ack+0x11a>
ffffffff804c0be6:       36 	44 89 b5 40 04 00 00 	mov    %r14d,0x440(%rbp)
ffffffff804c0bed:      368 	8b 54 24 1c          	mov    0x1c(%rsp),%edx
ffffffff804c0bf1:        4 	41 83 cc 02          	or     $0x2,%r12d
ffffffff804c0bf5:       32 	be 05 00 00 00       	mov    $0x5,%esi
ffffffff804c0bfa:      392 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0bfd:        4 	89 95 00 04 00 00    	mov    %edx,0x400(%rbp)
ffffffff804c0c03:     3341 	44 89 64 24 5c       	mov    %r12d,0x5c(%rsp)
ffffffff804c0c08:      855 	e8 98 dc ff ff       	callq  ffffffff804be8a5 <tcp_ca_event>
ffffffff804c0c0d:     2018 	48 8b 05 a4 0a 5f 00 	mov    0x5f0aa4(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c0c14:      858 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c0c1b:        0 	00 
ffffffff804c0c1c:        0 	89 d2                	mov    %edx,%edx
ffffffff804c0c1e:        0 	48 f7 d0             	not    %rax
ffffffff804c0c21:      425 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c0c25:        0 	48 ff 80 e8 00 00 00 	incq   0xe8(%rax)
ffffffff804c0c2c:        0 	e9 1b 01 00 00       	jmpq   ffffffff804c0d4c <tcp_ack+0x235>
ffffffff804c0c31:       41 	45 3b 75 54          	cmp    0x54(%r13),%r14d
ffffffff804c0c35:      360 	74 06                	je     ffffffff804c0c3d <tcp_ack+0x126>
ffffffff804c0c37:        1 	41 83 cc 01          	or     $0x1,%r12d
ffffffff804c0c3b:       80 	eb 1f                	jmp    ffffffff804c0c5c <tcp_ack+0x145>
ffffffff804c0c3d:        1 	48 8b 05 74 0a 5f 00 	mov    0x5f0a74(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c0c44:      303 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c0c4b:        0 	00 
ffffffff804c0c4c:       56 	89 d2                	mov    %edx,%edx
ffffffff804c0c4e:        0 	48 f7 d0             	not    %rax
ffffffff804c0c51:        4 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c0c55:       13 	48 ff 80 e0 00 00 00 	incq   0xe0(%rax)
ffffffff804c0c5c:       12 	41 8b 95 b8 00 00 00 	mov    0xb8(%r13),%edx
ffffffff804c0c63:      300 	49 03 95 d0 00 00 00 	add    0xd0(%r13),%rdx
ffffffff804c0c6a:       17 	66 8b 42 0e          	mov    0xe(%rdx),%ax
ffffffff804c0c6e:        0 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c0c72:       22 	f6 42 0d 02          	testb  $0x2,0xd(%rdx)
ffffffff804c0c76:       13 	0f b7 d8             	movzwl %ax,%ebx
ffffffff804c0c79:        0 	75 0b                	jne    ffffffff804c0c86 <tcp_ack+0x16f>
ffffffff804c0c7b:       26 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c0c81:      343 	83 e1 0f             	and    $0xf,%ecx
ffffffff804c0c84:        0 	d3 e3                	shl    %cl,%ebx
ffffffff804c0c86:       82 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c0c8a:       18 	44 89 f2             	mov    %r14d,%edx
ffffffff804c0c8d:        0 	89 d9                	mov    %ebx,%ecx
ffffffff804c0c8f:       12 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0c92:       12 	e8 47 e6 ff ff       	callq  ffffffff804bf2de <tcp_may_update_window>
ffffffff804c0c97:       16 	31 d2                	xor    %edx,%edx
ffffffff804c0c99:       66 	85 c0                	test   %eax,%eax
ffffffff804c0c9b:        0 	74 48                	je     ffffffff804c0ce5 <tcp_ack+0x1ce>
ffffffff804c0c9d:       12 	39 9d 44 04 00 00    	cmp    %ebx,0x444(%rbp)
ffffffff804c0ca3:       29 	44 89 b5 40 04 00 00 	mov    %r14d,0x440(%rbp)
ffffffff804c0caa:        0 	74 34                	je     ffffffff804c0ce0 <tcp_ack+0x1c9>
ffffffff804c0cac:        7 	89 9d 44 04 00 00    	mov    %ebx,0x444(%rbp)
ffffffff804c0cb2:       59 	c7 85 ec 03 00 00 00 	movl   $0x0,0x3ec(%rbp)
ffffffff804c0cb9:        0 	00 00 00 
ffffffff804c0cbc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0cbf:        7 	e8 13 e8 ff ff       	callq  ffffffff804bf4d7 <tcp_fast_path_check>
ffffffff804c0cc4:       23 	3b 9d 48 04 00 00    	cmp    0x448(%rbp),%ebx
ffffffff804c0cca:       48 	76 14                	jbe    ffffffff804c0ce0 <tcp_ack+0x1c9>
ffffffff804c0ccc:        0 	8b b5 5c 03 00 00    	mov    0x35c(%rbp),%esi
ffffffff804c0cd2:        0 	89 9d 48 04 00 00    	mov    %ebx,0x448(%rbp)
ffffffff804c0cd8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0cdb:        0 	e8 40 41 00 00       	callq  ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c0ce0:        6 	ba 02 00 00 00       	mov    $0x2,%edx
ffffffff804c0ce5:      141 	8b 5c 24 1c          	mov    0x1c(%rsp),%ebx
ffffffff804c0ce9:        1 	44 09 e2             	or     %r12d,%edx
ffffffff804c0cec:        3 	89 9d 00 04 00 00    	mov    %ebx,0x400(%rbp)
ffffffff804c0cf2:       34 	89 54 24 5c          	mov    %edx,0x5c(%rsp)
ffffffff804c0cf6:        0 	41 80 7d 5d 00       	cmpb   $0x0,0x5d(%r13)
ffffffff804c0cfb:        6 	74 13                	je     ffffffff804c0d10 <tcp_ack+0x1f9>
ffffffff804c0cfd:        0 	8b 54 24 18          	mov    0x18(%rsp),%edx
ffffffff804c0d01:        0 	4c 89 ee             	mov    %r13,%rsi
ffffffff804c0d04:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0d07:        0 	e8 b4 f5 ff ff       	callq  ffffffff804c02c0 <tcp_sacktag_write_queue>
ffffffff804c0d0c:        0 	09 44 24 5c          	or     %eax,0x5c(%rsp)
ffffffff804c0d10:       29 	41 8b 85 b8 00 00 00 	mov    0xb8(%r13),%eax
ffffffff804c0d17:      128 	49 03 85 d0 00 00 00 	add    0xd0(%r13),%rax
ffffffff804c0d1e:        0 	8a 40 0d             	mov    0xd(%rax),%al
ffffffff804c0d21:       33 	83 e0 42             	and    $0x42,%eax
ffffffff804c0d24:        0 	3c 40                	cmp    $0x40,%al
ffffffff804c0d26:        0 	75 17                	jne    ffffffff804c0d3f <tcp_ack+0x228>
ffffffff804c0d28:        0 	8b 44 24 5c          	mov    0x5c(%rsp),%eax
ffffffff804c0d2c:        0 	83 c8 40             	or     $0x40,%eax
ffffffff804c0d2f:        0 	f6 85 7e 04 00 00 01 	testb  $0x1,0x47e(%rbp)
ffffffff804c0d36:        0 	0f 44 44 24 5c       	cmove  0x5c(%rsp),%eax
ffffffff804c0d3b:        0 	89 44 24 5c          	mov    %eax,0x5c(%rsp)
ffffffff804c0d3f:       36 	be 06 00 00 00       	mov    $0x6,%esi
ffffffff804c0d44:      167 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0d47:        1 	e8 59 db ff ff       	callq  ffffffff804be8a5 <tcp_ca_event>
ffffffff804c0d4c:      581 	c7 85 48 01 00 00 00 	movl   $0x0,0x148(%rbp)
ffffffff804c0d53:        0 	00 00 00 
ffffffff804c0d56:     6076 	c6 85 7d 03 00 00 00 	movb   $0x0,0x37d(%rbp)
ffffffff804c0d5d:        0 	48 8b 05 1c 8b 3f 00 	mov    0x3f8b1c(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c0d64:      443 	89 85 08 04 00 00    	mov    %eax,0x408(%rbp)
ffffffff804c0d6a:        0 	8b 85 74 04 00 00    	mov    0x474(%rbp),%eax
ffffffff804c0d70:        0 	85 c0                	test   %eax,%eax
ffffffff804c0d72:      845 	89 44 24 14          	mov    %eax,0x14(%rsp)
ffffffff804c0d76:        0 	0f 84 fb 10 00 00    	je     ffffffff804c1e77 <tcp_ack+0x1360>
ffffffff804c0d7c:        0 	48 8b 05 fd 8a 3f 00 	mov    0x3f8afd(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c0d83:      586 	8b 54 24 14          	mov    0x14(%rsp),%edx
ffffffff804c0d87:        1 	41 83 cc ff          	or     $0xffffffffffffffff,%r12d
ffffffff804c0d8b:        2 	89 44 24 48          	mov    %eax,0x48(%rsp)
ffffffff804c0d8f:      879 	89 54 24 34          	mov    %edx,0x34(%rsp)
ffffffff804c0d93:        1 	8b 9d d0 04 00 00    	mov    0x4d0(%rbp),%ebx
ffffffff804c0d99:        0 	89 5c 24 40          	mov    %ebx,0x40(%rsp)
ffffffff804c0d9d:      889 	e8 e2 e8 ff ff       	callq  ffffffff804bf684 <net_invalid_timestamp>
ffffffff804c0da2:        0 	48 89 44 24 08       	mov    %rax,0x8(%rsp)
ffffffff804c0da7:       16 	48 8d 85 c0 00 00 00 	lea    0xc0(%rbp),%rax
ffffffff804c0dae:      445 	c7 44 24 44 01 00 00 	movl   $0x1,0x44(%rsp)
ffffffff804c0db5:        0 	00 
ffffffff804c0db6:        0 	c7 44 24 50 00 00 00 	movl   $0x0,0x50(%rsp)
ffffffff804c0dbd:        0 	00 
ffffffff804c0dbe:       10 	c7 44 24 38 00 00 00 	movl   $0x0,0x38(%rsp)
ffffffff804c0dc5:        0 	00 
ffffffff804c0dc6:     1308 	44 89 64 24 4c       	mov    %r12d,0x4c(%rsp)
ffffffff804c0dcb:      225 	48 89 04 24          	mov    %rax,(%rsp)
ffffffff804c0dcf:        2 	e9 8b 02 00 00       	jmpq   ffffffff804c105f <tcp_ack+0x548>
ffffffff804c0dd4:      488 	4d 8d 7d 38          	lea    0x38(%r13),%r15
ffffffff804c0dd8:     2298 	41 8a 57 25          	mov    0x25(%r15),%dl
ffffffff804c0ddc:        0 	88 54 24 3f          	mov    %dl,0x3f(%rsp)
ffffffff804c0de0:        6 	41 8b 77 1c          	mov    0x1c(%r15),%esi
ffffffff804c0de4:      455 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c0dea:        3 	49 8b 8d d0 00 00 00 	mov    0xd0(%r13),%rcx
ffffffff804c0df1:        0 	41 8b 85 c8 00 00 00 	mov    0xc8(%r13),%eax
ffffffff804c0df8:      440 	39 f2                	cmp    %esi,%edx
ffffffff804c0dfa:        0 	79 6f                	jns    ffffffff804c0e6b <tcp_ack+0x354>
ffffffff804c0dfc:        0 	89 c0                	mov    %eax,%eax
ffffffff804c0dfe:       39 	8b 5c 08 08          	mov    0x8(%rax,%rcx,1),%ebx
ffffffff804c0e02:        0 	66 83 fb 01          	cmp    $0x1,%bx
ffffffff804c0e06:        2 	0f 84 77 02 00 00    	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e0c:        0 	41 8b 47 18          	mov    0x18(%r15),%eax
ffffffff804c0e10:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c0e12:        0 	0f 89 6b 02 00 00    	jns    ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e18:        0 	29 c2                	sub    %eax,%edx
ffffffff804c0e1a:        0 	4c 89 ee             	mov    %r13,%rsi
ffffffff804c0e1d:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0e20:        0 	e8 8f 4f 00 00       	callq  ffffffff804c5db4 <tcp_trim_head>
ffffffff804c0e25:        0 	85 c0                	test   %eax,%eax
ffffffff804c0e27:        0 	0f 85 56 02 00 00    	jne    ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e2d:        0 	41 8b 85 c8 00 00 00 	mov    0xc8(%r13),%eax
ffffffff804c0e34:        0 	0f b7 d3             	movzwl %bx,%edx
ffffffff804c0e37:        0 	49 03 85 d0 00 00 00 	add    0xd0(%r13),%rax
ffffffff804c0e3e:        0 	41 89 d6             	mov    %edx,%r14d
ffffffff804c0e41:        0 	8b 48 08             	mov    0x8(%rax),%ecx
ffffffff804c0e44:        0 	0f b7 c1             	movzwl %cx,%eax
ffffffff804c0e47:        0 	41 29 c6             	sub    %eax,%r14d
ffffffff804c0e4a:        0 	0f 84 33 02 00 00    	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e50:        0 	66 85 c9             	test   %cx,%cx
ffffffff804c0e53:        0 	75 04                	jne    ffffffff804c0e59 <tcp_ack+0x342>
ffffffff804c0e55:        0 	0f 0b                	ud2a   
ffffffff804c0e57:        0 	eb fe                	jmp    ffffffff804c0e57 <tcp_ack+0x340>
ffffffff804c0e59:        0 	41 8b 5f 1c          	mov    0x1c(%r15),%ebx
ffffffff804c0e5d:        0 	41 39 5f 18          	cmp    %ebx,0x18(%r15)
ffffffff804c0e61:        0 	0f 88 d6 10 00 00    	js     ffffffff804c1f3d <tcp_ack+0x1426>
ffffffff804c0e67:        0 	0f 0b                	ud2a   
ffffffff804c0e69:        0 	eb fe                	jmp    ffffffff804c0e69 <tcp_ack+0x352>
ffffffff804c0e6b:        0 	83 7c 24 44 00       	cmpl   $0x0,0x44(%rsp)
ffffffff804c0e70:     6326 	89 c0                	mov    %eax,%eax
ffffffff804c0e72:      348 	44 0f b7 74 08 08    	movzwl 0x8(%rax,%rcx,1),%r14d
ffffffff804c0e78:        0 	0f 84 8f 00 00 00    	je     ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e7e:      132 	83 bd a4 03 00 00 00 	cmpl   $0x0,0x3a4(%rbp)
ffffffff804c0e85:     5840 	0f 84 82 00 00 00    	je     ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e8b:        0 	3b b5 b4 05 00 00    	cmp    0x5b4(%rbp),%esi
ffffffff804c0e91:        0 	78 7a                	js     ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e93:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0e96:        0 	e8 21 da ff ff       	callq  ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c0e9b:        0 	8b b5 4c 04 00 00    	mov    0x44c(%rbp),%esi
ffffffff804c0ea1:        0 	44 8b a5 ac 04 00 00 	mov    0x4ac(%rbp),%r12d
ffffffff804c0ea8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0eab:        0 	89 85 6c 05 00 00    	mov    %eax,0x56c(%rbp)
ffffffff804c0eb1:        0 	e8 c7 3e 00 00       	callq  ffffffff804c4d7d <tcp_mss_to_mtu>
ffffffff804c0eb6:        0 	8b 9d a4 03 00 00    	mov    0x3a4(%rbp),%ebx
ffffffff804c0ebc:        0 	31 d2                	xor    %edx,%edx
ffffffff804c0ebe:        0 	c7 85 b0 04 00 00 00 	movl   $0x0,0x4b0(%rbp)
ffffffff804c0ec5:        0 	00 00 00 
ffffffff804c0ec8:        0 	41 0f af c4          	imul   %r12d,%eax
ffffffff804c0ecc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0ecf:        0 	f7 f3                	div    %ebx
ffffffff804c0ed1:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c0ed7:        0 	48 8b 05 a2 89 3f 00 	mov    0x3f89a2(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c0ede:        0 	89 85 bc 04 00 00    	mov    %eax,0x4bc(%rbp)
ffffffff804c0ee4:        0 	e8 d3 d9 ff ff       	callq  ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c0ee9:        0 	8b b5 5c 03 00 00    	mov    0x35c(%rbp),%esi
ffffffff804c0eef:        0 	89 85 54 04 00 00    	mov    %eax,0x454(%rbp)
ffffffff804c0ef5:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0ef8:        0 	89 9d a0 03 00 00    	mov    %ebx,0x3a0(%rbp)
ffffffff804c0efe:        0 	c7 85 a4 03 00 00 00 	movl   $0x0,0x3a4(%rbp)
ffffffff804c0f05:        0 	00 00 00 
ffffffff804c0f08:        0 	e8 13 3f 00 00       	callq  ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c0f0d:      945 	0f b6 44 24 3f       	movzbl 0x3f(%rsp),%eax
ffffffff804c0f12:     6361 	a8 82                	test   $0x82,%al
ffffffff804c0f14:        0 	74 30                	je     ffffffff804c0f46 <tcp_ack+0x42f>
ffffffff804c0f16:        0 	a8 02                	test   $0x2,%al
ffffffff804c0f18:        0 	74 07                	je     ffffffff804c0f21 <tcp_ack+0x40a>
ffffffff804c0f1a:        0 	44 29 b5 78 04 00 00 	sub    %r14d,0x478(%rbp)
ffffffff804c0f21:        0 	83 4c 24 50 08       	orl    $0x8,0x50(%rsp)
ffffffff804c0f26:        0 	f6 44 24 50 04       	testb  $0x4,0x50(%rsp)
ffffffff804c0f2b:        0 	75 06                	jne    ffffffff804c0f33 <tcp_ack+0x41c>
ffffffff804c0f2d:        0 	41 83 fe 01          	cmp    $0x1,%r14d
ffffffff804c0f31:        0 	76 08                	jbe    ffffffff804c0f3b <tcp_ack+0x424>
ffffffff804c0f33:        0 	81 4c 24 50 00 10 00 	orl    $0x1000,0x50(%rsp)
ffffffff804c0f3a:        0 	00 
ffffffff804c0f3b:        0 	41 83 cc ff          	or     $0xffffffffffffffff,%r12d
ffffffff804c0f3f:        0 	44 89 64 24 4c       	mov    %r12d,0x4c(%rsp)
ffffffff804c0f44:        0 	eb 38                	jmp    ffffffff804c0f7e <tcp_ack+0x467>
ffffffff804c0f46:        0 	44 8b 64 24 48       	mov    0x48(%rsp),%r12d
ffffffff804c0f4b:     5837 	45 2b 67 20          	sub    0x20(%r15),%r12d
ffffffff804c0f4f:        1 	83 7c 24 4c 00       	cmpl   $0x0,0x4c(%rsp)
ffffffff804c0f54:      167 	8b 5c 24 4c          	mov    0x4c(%rsp),%ebx
ffffffff804c0f58:      514 	49 8b 55 18          	mov    0x18(%r13),%rdx
ffffffff804c0f5c:        0 	41 0f 48 dc          	cmovs  %r12d,%ebx
ffffffff804c0f60:      164 	a8 01                	test   $0x1,%al
ffffffff804c0f62:      413 	48 89 54 24 08       	mov    %rdx,0x8(%rsp)
ffffffff804c0f67:        0 	89 5c 24 4c          	mov    %ebx,0x4c(%rsp)
ffffffff804c0f6b:      148 	75 11                	jne    ffffffff804c0f7e <tcp_ack+0x467>
ffffffff804c0f6d:     1608 	8b 54 24 38          	mov    0x38(%rsp),%edx
ffffffff804c0f71:        0 	39 54 24 34          	cmp    %edx,0x34(%rsp)
ffffffff804c0f75:      272 	0f 46 54 24 34       	cmovbe 0x34(%rsp),%edx
ffffffff804c0f7a:      266 	89 54 24 34          	mov    %edx,0x34(%rsp)
ffffffff804c0f7e:        0 	a8 01                	test   $0x1,%al
ffffffff804c0f80:      164 	74 07                	je     ffffffff804c0f89 <tcp_ack+0x472>
ffffffff804c0f82:        0 	44 29 b5 d0 04 00 00 	sub    %r14d,0x4d0(%rbp)
ffffffff804c0f89:     3955 	a8 04                	test   $0x4,%al
ffffffff804c0f8b:     8510 	74 07                	je     ffffffff804c0f94 <tcp_ack+0x47d>
ffffffff804c0f8d:        0 	44 29 b5 cc 04 00 00 	sub    %r14d,0x4cc(%rbp)
ffffffff804c0f94:       11 	44 29 b5 74 04 00 00 	sub    %r14d,0x474(%rbp)
ffffffff804c0f9b:     1426 	44 01 74 24 38       	add    %r14d,0x38(%rsp)
ffffffff804c0fa0:        6 	41 f6 47 24 02       	testb  $0x2,0x24(%r15)
ffffffff804c0fa5:      548 	75 07                	jne    ffffffff804c0fae <tcp_ack+0x497>
ffffffff804c0fa7:        2 	83 4c 24 50 04       	orl    $0x4,0x50(%rsp)
ffffffff804c0fac:        0 	eb 0f                	jmp    ffffffff804c0fbd <tcp_ack+0x4a6>
ffffffff804c0fae:        0 	83 4c 24 50 10       	orl    $0x10,0x50(%rsp)
ffffffff804c0fb3:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c0fba:        0 	00 00 00 
ffffffff804c0fbd:      517 	83 7c 24 44 00       	cmpl   $0x0,0x44(%rsp)
ffffffff804c0fc2:     6012 	0f 84 bb 00 00 00    	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0fc8:     1111 	48 8b 34 24          	mov    (%rsp),%rsi
ffffffff804c0fcc:        0 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c0fcf:      184 	e8 0d d8 ff ff       	callq  ffffffff804be7e1 <__skb_unlink>
ffffffff804c0fd4:        5 	41 8b 45 68          	mov    0x68(%r13),%eax
ffffffff804c0fd8:      517 	05 e8 00 00 00       	add    $0xe8,%eax
ffffffff804c0fdd:        0 	41 39 85 e0 00 00 00 	cmp    %eax,0xe0(%r13)
ffffffff804c0fe4:       31 	7d 08                	jge    ffffffff804c0fee <tcp_ack+0x4d7>
ffffffff804c0fe6:        0 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c0fe9:        0 	e8 d4 66 fc ff       	callq  ffffffff804876c2 <skb_truesize_bug>
ffffffff804c0fee:     1142 	0f ba ad 10 01 00 00 	btsl   $0xe,0x110(%rbp)
ffffffff804c0ff5:        0 	0e 
ffffffff804c0ff6:     2576 	8b 85 f0 00 00 00    	mov    0xf0(%rbp),%eax
ffffffff804c0ffc:      433 	41 2b 85 e0 00 00 00 	sub    0xe0(%r13),%eax
ffffffff804c1003:     4843 	89 85 f0 00 00 00    	mov    %eax,0xf0(%rbp)
ffffffff804c1009:     1730 	48 8b 45 30          	mov    0x30(%rbp),%rax
ffffffff804c100d:      311 	41 8b 95 e0 00 00 00 	mov    0xe0(%r13),%edx
ffffffff804c1014:        0 	48 83 b8 b0 00 00 00 	cmpq   $0x0,0xb0(%rax)
ffffffff804c101b:        0 	00 
ffffffff804c101c:      418 	74 06                	je     ffffffff804c1024 <tcp_ack+0x50d>
ffffffff804c101e:       37 	01 95 f4 00 00 00    	add    %edx,0xf4(%rbp)
ffffffff804c1024:        2 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>
ffffffff804c102c:       44 	4c 3b ad f0 04 00 00 	cmp    0x4f0(%rbp),%r13
ffffffff804c1033:      511 	48 c7 85 e8 04 00 00 	movq   $0x0,0x4e8(%rbp)
ffffffff804c103a:        0 	00 00 00 00 
ffffffff804c103e:        1 	75 0b                	jne    ffffffff804c104b <tcp_ack+0x534>
ffffffff804c1040:        0 	48 c7 85 f0 04 00 00 	movq   $0x0,0x4f0(%rbp)
ffffffff804c1047:        0 	00 00 00 00 
ffffffff804c104b:        0 	4c 3b ad e0 04 00 00 	cmp    0x4e0(%rbp),%r13
ffffffff804c1052:      518 	75 0b                	jne    ffffffff804c105f <tcp_ack+0x548>
ffffffff804c1054:        0 	48 c7 85 e0 04 00 00 	movq   $0x0,0x4e0(%rbp)
ffffffff804c105b:        0 	00 00 00 00 
ffffffff804c105f:      439 	4c 8b ad c0 00 00 00 	mov    0xc0(%rbp),%r13
ffffffff804c1066:     5655 	4c 3b 2c 24          	cmp    (%rsp),%r13
ffffffff804c106a:        0 	75 05                	jne    ffffffff804c1071 <tcp_ack+0x55a>
ffffffff804c106c:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c106f:      810 	eb 12                	jmp    ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c1071:        0 	4d 85 ed             	test   %r13,%r13
ffffffff804c1074:     2574 	74 0d                	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c1076:        0 	4c 3b ad d8 01 00 00 	cmp    0x1d8(%rbp),%r13
ffffffff804c107d:        0 	0f 85 51 fd ff ff    	jne    ffffffff804c0dd4 <tcp_ack+0x2bd>
ffffffff804c1083:      454 	8b 8d 00 04 00 00    	mov    0x400(%rbp),%ecx
ffffffff804c1089:      497 	8b 85 80 04 00 00    	mov    0x480(%rbp),%eax
ffffffff804c108f:        0 	2b 44 24 18          	sub    0x18(%rsp),%eax
ffffffff804c1093:        0 	89 ca                	mov    %ecx,%edx
ffffffff804c1095:      534 	2b 54 24 18          	sub    0x18(%rsp),%edx
ffffffff804c1099:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c109b:        0 	72 06                	jb     ffffffff804c10a3 <tcp_ack+0x58c>
ffffffff804c109d:      458 	89 8d 80 04 00 00    	mov    %ecx,0x480(%rbp)
ffffffff804c10a3:        0 	4d 85 ed             	test   %r13,%r13
ffffffff804c10a6:        0 	74 15                	je     ffffffff804c10bd <tcp_ack+0x5a6>
ffffffff804c10a8:        0 	8b 44 24 50          	mov    0x50(%rsp),%eax
ffffffff804c10ac:        2 	80 cc 20             	or     $0x20,%ah
ffffffff804c10af:        3 	41 f6 45 5d 01       	testb  $0x1,0x5d(%r13)
ffffffff804c10b4:        0 	0f 44 44 24 50       	cmove  0x50(%rsp),%eax
ffffffff804c10b9:        0 	89 44 24 50          	mov    %eax,0x50(%rsp)
ffffffff804c10bd:      444 	f6 44 24 50 14       	testb  $0x14,0x50(%rsp)
ffffffff804c10c2:      551 	0f 84 e1 01 00 00    	je     ffffffff804c12a9 <tcp_ack+0x792>
ffffffff804c10c8:        1 	f6 85 9c 04 00 00 01 	testb  $0x1,0x49c(%rbp)
ffffffff804c10cf:        2 	48 8b 9d 60 03 00 00 	mov    0x360(%rbp),%rbx
ffffffff804c10d6:      462 	74 17                	je     ffffffff804c10ef <tcp_ack+0x5d8>
ffffffff804c10d8:        0 	83 bd 98 04 00 00 00 	cmpl   $0x0,0x498(%rbp)
ffffffff804c10df:        0 	74 0e                	je     ffffffff804c10ef <tcp_ack+0x5d8>
ffffffff804c10e1:      451 	8b 74 24 50          	mov    0x50(%rsp),%esi
ffffffff804c10e5:       43 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c10e8:        0 	e8 ea e8 ff ff       	callq  ffffffff804bf9d7 <tcp_ack_saw_tstamp>
ffffffff804c10ed:       66 	eb 47                	jmp    ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10ef:        0 	83 7c 24 4c 00       	cmpl   $0x0,0x4c(%rsp)
ffffffff804c10f4:        0 	78 40                	js     ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10f6:        0 	f6 44 24 50 08       	testb  $0x8,0x50(%rsp)
ffffffff804c10fb:        0 	75 39                	jne    ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10fd:        0 	8b 74 24 4c          	mov    0x4c(%rsp),%esi
ffffffff804c1101:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1104:        0 	e8 b5 e7 ff ff       	callq  ffffffff804bf8be <tcp_rtt_estimator>
ffffffff804c1109:        0 	8b 85 60 04 00 00    	mov    0x460(%rbp),%eax
ffffffff804c110f:        0 	c6 85 7b 03 00 00 00 	movb   $0x0,0x37b(%rbp)
ffffffff804c1116:        0 	c1 e8 03             	shr    $0x3,%eax
ffffffff804c1119:        0 	03 85 6c 04 00 00    	add    0x46c(%rbp),%eax
ffffffff804c111f:        0 	3d 30 75 00 00       	cmp    $0x7530,%eax
ffffffff804c1124:        0 	89 85 58 03 00 00    	mov    %eax,0x358(%rbp)
ffffffff804c112a:        0 	76 0a                	jbe    ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c112c:        0 	c7 85 58 03 00 00 30 	movl   $0x7530,0x358(%rbp)
ffffffff804c1133:        0 	75 00 00 
ffffffff804c1136:      732 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c113d:     1833 	75 0f                	jne    ffffffff804c114e <tcp_ack+0x637>
ffffffff804c113f:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1144:      493 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1147:        0 	e8 07 d7 ff ff       	callq  ffffffff804be853 <inet_csk_clear_xmit_timer>
ffffffff804c114c:        0 	eb 18                	jmp    ffffffff804c1166 <tcp_ack+0x64f>
ffffffff804c114e:        0 	8b 95 58 03 00 00    	mov    0x358(%rbp),%edx
ffffffff804c1154:        0 	b9 30 75 00 00       	mov    $0x7530,%ecx
ffffffff804c1159:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c115e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1161:        0 	e8 7d e4 ff ff       	callq  ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c1166:      881 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c116c:      845 	c0 e8 04             	shr    $0x4,%al
ffffffff804c116f:        1 	75 63                	jne    ffffffff804c11d4 <tcp_ack+0x6bd>
ffffffff804c1171:        0 	83 7c 24 38 00       	cmpl   $0x0,0x38(%rsp)
ffffffff804c1176:        0 	7e 29                	jle    ffffffff804c11a1 <tcp_ack+0x68a>
ffffffff804c1178:        0 	8b 44 24 38          	mov    0x38(%rsp),%eax
ffffffff804c117c:        0 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c1182:        0 	ff c8                	dec    %eax
ffffffff804c1184:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c1186:        0 	72 0c                	jb     ffffffff804c1194 <tcp_ack+0x67d>
ffffffff804c1188:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c118f:        0 	00 00 00 
ffffffff804c1192:        0 	eb 0d                	jmp    ffffffff804c11a1 <tcp_ack+0x68a>
ffffffff804c1194:        0 	8d 42 01             	lea    0x1(%rdx),%eax
ffffffff804c1197:        0 	2b 44 24 38          	sub    0x38(%rsp),%eax
ffffffff804c119b:        0 	89 85 d0 04 00 00    	mov    %eax,0x4d0(%rbp)
ffffffff804c11a1:        0 	8b 74 24 38          	mov    0x38(%rsp),%esi
ffffffff804c11a5:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c11a8:        0 	e8 2d dd ff ff       	callq  ffffffff804beeda <tcp_check_reno_reordering>
ffffffff804c11ad:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c11b3:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c11b9:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c11bf:        0 	76 5e                	jbe    ffffffff804c121f <tcp_ack+0x708>
ffffffff804c11c1:        0 	be b0 06 00 00       	mov    $0x6b0,%esi
ffffffff804c11c6:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c11cd:        0 	e8 e3 4f d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c11d2:        0 	eb 4b                	jmp    ffffffff804c121f <tcp_ack+0x708>
ffffffff804c11d4:      414 	8b 44 24 20          	mov    0x20(%rsp),%eax
ffffffff804c11d8:     1591 	39 44 24 34          	cmp    %eax,0x34(%rsp)
ffffffff804c11dc:        2 	73 14                	jae    ffffffff804c11f2 <tcp_ack+0x6db>
ffffffff804c11de:        0 	8b b5 d4 04 00 00    	mov    0x4d4(%rbp),%esi
ffffffff804c11e4:        0 	2b 74 24 34          	sub    0x34(%rsp),%esi
ffffffff804c11e8:        0 	31 d2                	xor    %edx,%edx
ffffffff804c11ea:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c11ed:        0 	e8 9c db ff ff       	callq  ffffffff804bed8e <tcp_update_reordering>
ffffffff804c11f2:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c11f8:      865 	c0 e8 04             	shr    $0x4,%al
ffffffff804c11fb:        3 	a8 02                	test   $0x2,%al
ffffffff804c11fd:        0 	8b 85 60 05 00 00    	mov    0x560(%rbp),%eax
ffffffff804c1203:      453 	74 06                	je     ffffffff804c120b <tcp_ack+0x6f4>
ffffffff804c1205:        8 	2b 44 24 38          	sub    0x38(%rsp),%eax
ffffffff804c1209:        0 	eb 0e                	jmp    ffffffff804c1219 <tcp_ack+0x702>
ffffffff804c120b:        0 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c1211:        0 	29 54 24 40          	sub    %edx,0x40(%rsp)
ffffffff804c1215:        0 	2b 44 24 40          	sub    0x40(%rsp),%eax
ffffffff804c1219:      423 	89 85 60 05 00 00    	mov    %eax,0x560(%rbp)
ffffffff804c121f:      492 	8b 85 d4 04 00 00    	mov    0x4d4(%rbp),%eax
ffffffff804c1225:      489 	39 44 24 38          	cmp    %eax,0x38(%rsp)
ffffffff804c1229:        0 	8b 54 24 38          	mov    0x38(%rsp),%edx
ffffffff804c122d:        0 	0f 47 d0             	cmova  %eax,%edx
ffffffff804c1230:      438 	29 d0                	sub    %edx,%eax
ffffffff804c1232:        0 	89 85 d4 04 00 00    	mov    %eax,0x4d4(%rbp)
ffffffff804c1238:        1 	48 83 7b 58 00       	cmpq   $0x0,0x58(%rbx)
ffffffff804c123d:      446 	74 6a                	je     ffffffff804c12a9 <tcp_ack+0x792>
ffffffff804c123f:        0 	f6 44 24 50 08       	testb  $0x8,0x50(%rsp)
ffffffff804c1244:        3 	75 54                	jne    ffffffff804c129a <tcp_ack+0x783>
ffffffff804c1246:      441 	f6 43 10 02          	testb  $0x2,0x10(%rbx)
ffffffff804c124a:        8 	74 3f                	je     ffffffff804c128b <tcp_ack+0x774>
ffffffff804c124c:        0 	e8 33 e4 ff ff       	callq  ffffffff804bf684 <net_invalid_timestamp>
ffffffff804c1251:        0 	48 39 44 24 08       	cmp    %rax,0x8(%rsp)
ffffffff804c1256:        0 	74 33                	je     ffffffff804c128b <tcp_ack+0x774>
ffffffff804c1258:        0 	e8 17 8b d8 ff       	callq  ffffffff80249d74 <ktime_get_real>
ffffffff804c125d:        0 	48 89 c7             	mov    %rax,%rdi
ffffffff804c1260:        0 	48 2b 7c 24 08       	sub    0x8(%rsp),%rdi
ffffffff804c1265:        0 	e8 e3 8e d7 ff       	callq  ffffffff8023a14d <ns_to_timeval>
ffffffff804c126a:        0 	48 89 44 24 60       	mov    %rax,0x60(%rsp)
ffffffff804c126f:        0 	48 89 44 24 70       	mov    %rax,0x70(%rsp)
ffffffff804c1274:        0 	48 69 c0 40 42 0f 00 	imul   $0xf4240,%rax,%rax
ffffffff804c127b:        0 	48 89 54 24 78       	mov    %rdx,0x78(%rsp)
ffffffff804c1280:        0 	48 89 54 24 68       	mov    %rdx,0x68(%rsp)
ffffffff804c1285:        0 	03 44 24 78          	add    0x78(%rsp),%eax
ffffffff804c1289:        0 	eb 12                	jmp    ffffffff804c129d <tcp_ack+0x786>
ffffffff804c128b:       89 	45 85 e4             	test   %r12d,%r12d
ffffffff804c128e:      414 	7e 0a                	jle    ffffffff804c129a <tcp_ack+0x783>
ffffffff804c1290:        0 	49 63 fc             	movslq %r12d,%rdi
ffffffff804c1293:       65 	e8 a8 8b d7 ff       	callq  ffffffff80239e40 <jiffies_to_usecs>
ffffffff804c1298:        0 	eb 03                	jmp    ffffffff804c129d <tcp_ack+0x786>
ffffffff804c129a:        0 	83 c8 ff             	or     $0xffffffffffffffff,%eax
ffffffff804c129d:     1136 	89 c2                	mov    %eax,%edx
ffffffff804c129f:        7 	8b 74 24 38          	mov    0x38(%rsp),%esi
ffffffff804c12a3:      444 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c12a6:        1 	ff 53 58             	callq  *0x58(%rbx)
ffffffff804c12a9:      305 	83 bd d0 04 00 00 00 	cmpl   $0x0,0x4d0(%rbp)
ffffffff804c12b0:      518 	79 11                	jns    ffffffff804c12c3 <tcp_ack+0x7ac>
ffffffff804c12b2:        0 	be ac 0b 00 00       	mov    $0xbac,%esi
ffffffff804c12b7:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c12be:        0 	e8 f2 4e d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12c3:      415 	83 bd cc 04 00 00 00 	cmpl   $0x0,0x4cc(%rbp)
ffffffff804c12ca:     2204 	79 11                	jns    ffffffff804c12dd <tcp_ack+0x7c6>
ffffffff804c12cc:        0 	be ad 0b 00 00       	mov    $0xbad,%esi
ffffffff804c12d1:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c12d8:        0 	e8 d8 4e d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12dd:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c12e4:     1747 	79 11                	jns    ffffffff804c12f7 <tcp_ack+0x7e0>
ffffffff804c12e6:        0 	be ae 0b 00 00       	mov    $0xbae,%esi
ffffffff804c12eb:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c12f2:        0 	e8 be 4e d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12f7:        0 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c12fe:      878 	0f 85 86 00 00 00    	jne    ffffffff804c138a <tcp_ack+0x873>
ffffffff804c1304:     4721 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c130a:      968 	c0 e8 04             	shr    $0x4,%al
ffffffff804c130d:        2 	74 7b                	je     ffffffff804c138a <tcp_ack+0x873>
ffffffff804c130f:      171 	8b b5 cc 04 00 00    	mov    0x4cc(%rbp),%esi
ffffffff804c1315:      282 	85 f6                	test   %esi,%esi
ffffffff804c1317:        0 	74 1f                	je     ffffffff804c1338 <tcp_ack+0x821>
ffffffff804c1319:        0 	0f b6 95 78 03 00 00 	movzbl 0x378(%rbp),%edx
ffffffff804c1320:        0 	48 c7 c7 b2 d9 6a 80 	mov    $0xffffffff806ad9b2,%rdi
ffffffff804c1327:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1329:        0 	e8 46 5a d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c132e:        0 	c7 85 cc 04 00 00 00 	movl   $0x0,0x4cc(%rbp)
ffffffff804c1335:        0 	00 00 00 
ffffffff804c1338:      198 	8b b5 d0 04 00 00    	mov    0x4d0(%rbp),%esi
ffffffff804c133e:      257 	85 f6                	test   %esi,%esi
ffffffff804c1340:        0 	74 1f                	je     ffffffff804c1361 <tcp_ack+0x84a>
ffffffff804c1342:        0 	0f b6 95 78 03 00 00 	movzbl 0x378(%rbp),%edx
ffffffff804c1349:        0 	48 c7 c7 c3 d9 6a 80 	mov    $0xffffffff806ad9c3,%rdi
ffffffff804c1350:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1352:        0 	e8 1d 5a d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c1357:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c135e:        0 	00 00 00 
ffffffff804c1361:     2524 	8b b5 78 04 00 00    	mov    0x478(%rbp),%esi
ffffffff804c1367:     1825 	85 f6                	test   %esi,%esi
ffffffff804c1369:        0 	74 1f                	je     ffffffff804c138a <tcp_ack+0x873>
ffffffff804c136b:        0 	0f b6 95 78 03 00 00 	movzbl 0x378(%rbp),%edx
ffffffff804c1372:        0 	48 c7 c7 d4 d9 6a 80 	mov    $0xffffffff806ad9d4,%rdi
ffffffff804c1379:        0 	31 c0                	xor    %eax,%eax
ffffffff804c137b:        0 	e8 f4 59 d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c1380:        0 	c7 85 78 04 00 00 00 	movl   $0x0,0x478(%rbp)
ffffffff804c1387:        0 	00 00 00 
ffffffff804c138a:       46 	44 8b 64 24 50       	mov    0x50(%rsp),%r12d
ffffffff804c138f:     7369 	31 c9                	xor    %ecx,%ecx
ffffffff804c1391:      348 	44 0b 64 24 5c       	or     0x5c(%rsp),%r12d
ffffffff804c1396:        0 	80 bd 5e 04 00 00 00 	cmpb   $0x0,0x45e(%rbp)
ffffffff804c139d:       96 	0f 84 26 02 00 00    	je     ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c13a3:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c13a9:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c13af:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c13b5:        0 	76 11                	jbe    ffffffff804c13c8 <tcp_ack+0x8b1>
ffffffff804c13b7:        0 	be 58 0c 00 00       	mov    $0xc58,%esi
ffffffff804c13bc:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c13c3:        0 	e8 ed 4d d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c13c8:        0 	44 89 e3             	mov    %r12d,%ebx
ffffffff804c13cb:        0 	83 e3 04             	and    $0x4,%ebx
ffffffff804c13ce:        0 	74 07                	je     ffffffff804c13d7 <tcp_ack+0x8c0>
ffffffff804c13d0:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c13d7:        0 	41 f7 c4 00 10 00 00 	test   $0x1000,%r12d
ffffffff804c13de:        0 	75 0f                	jne    ffffffff804c13ef <tcp_ack+0x8d8>
ffffffff804c13e0:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c13e7:        0 	76 10                	jbe    ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13e9:        0 	41 f6 c4 08          	test   $0x8,%r12b
ffffffff804c13ed:        0 	74 0a                	je     ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13ef:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c13f6:        0 	00 00 00 
ffffffff804c13f9:        0 	8b 85 58 04 00 00    	mov    0x458(%rbp),%eax
ffffffff804c13ff:        0 	39 85 00 04 00 00    	cmp    %eax,0x400(%rbp)
ffffffff804c1405:        0 	78 12                	js     ffffffff804c1419 <tcp_ack+0x902>
ffffffff804c1407:        0 	31 f6                	xor    %esi,%esi
ffffffff804c1409:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c1410:        0 	40 0f 95 c6          	setne  %sil
ffffffff804c1414:        0 	83 c6 02             	add    $0x2,%esi
ffffffff804c1417:        0 	eb 37                	jmp    ffffffff804c1450 <tcp_ack+0x939>
ffffffff804c1419:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c141c:        0 	e8 e0 da ff ff       	callq  ffffffff804bef01 <tcp_is_sackfrto>
ffffffff804c1421:        0 	85 c0                	test   %eax,%eax
ffffffff804c1423:        0 	75 3b                	jne    ffffffff804c1460 <tcp_ack+0x949>
ffffffff804c1425:        0 	41 f7 c4 34 04 00 00 	test   $0x434,%r12d
ffffffff804c142c:        0 	75 0a                	jne    ffffffff804c1438 <tcp_ack+0x921>
ffffffff804c142e:        0 	41 f6 c4 17          	test   $0x17,%r12b
ffffffff804c1432:        0 	0f 85 8c 01 00 00    	jne    ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1438:        0 	85 db                	test   %ebx,%ebx
ffffffff804c143a:        0 	0f 85 8d 00 00 00    	jne    ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c1440:        0 	31 f6                	xor    %esi,%esi
ffffffff804c1442:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c1449:        0 	40 0f 95 c6          	setne  %sil
ffffffff804c144d:        0 	8d 34 76             	lea    (%rsi,%rsi,2),%esi
ffffffff804c1450:        0 	44 89 e2             	mov    %r12d,%edx
ffffffff804c1453:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1456:        0 	e8 b8 e7 ff ff       	callq  ffffffff804bfc13 <tcp_enter_frto_loss>
ffffffff804c145b:        0 	e9 64 01 00 00       	jmpq   ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1460:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1462:        0 	75 37                	jne    ffffffff804c149b <tcp_ack+0x984>
ffffffff804c1464:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c146b:        0 	75 2e                	jne    ffffffff804c149b <tcp_ack+0x984>
ffffffff804c146d:        0 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c1473:        0 	03 85 74 04 00 00    	add    0x474(%rbp),%eax
ffffffff804c1479:        0 	2b 85 d0 04 00 00    	sub    0x4d0(%rbp),%eax
ffffffff804c147f:        0 	8b 95 ac 04 00 00    	mov    0x4ac(%rbp),%edx
ffffffff804c1485:        0 	2b 85 cc 04 00 00    	sub    0x4cc(%rbp),%eax
ffffffff804c148b:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c148d:        0 	0f 47 c2             	cmova  %edx,%eax
ffffffff804c1490:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c1496:        0 	e9 29 01 00 00       	jmpq   ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c149b:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c14a2:        0 	76 29                	jbe    ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14a4:        0 	41 f6 c4 34          	test   $0x34,%r12b
ffffffff804c14a8:        0 	74 0f                	je     ffffffff804c14b9 <tcp_ack+0x9a2>
ffffffff804c14aa:        0 	44 89 e0             	mov    %r12d,%eax
ffffffff804c14ad:        0 	25 20 02 00 00       	and    $0x220,%eax
ffffffff804c14b2:        0 	83 f8 20             	cmp    $0x20,%eax
ffffffff804c14b5:        0 	75 16                	jne    ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14b7:        0 	eb 0a                	jmp    ffffffff804c14c3 <tcp_ack+0x9ac>
ffffffff804c14b9:        0 	41 f6 c4 17          	test   $0x17,%r12b
ffffffff804c14bd:        0 	0f 85 01 01 00 00    	jne    ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c14c3:        0 	44 89 e2             	mov    %r12d,%edx
ffffffff804c14c6:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c14cb:        0 	eb 86                	jmp    ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c14cd:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c14d4:        0 	75 45                	jne    ffffffff804c151b <tcp_ack+0xa04>
ffffffff804c14d6:        0 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c14dc:        0 	03 85 74 04 00 00    	add    0x474(%rbp),%eax
ffffffff804c14e2:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c14e5:        0 	c6 85 5e 04 00 00 02 	movb   $0x2,0x45e(%rbp)
ffffffff804c14ec:        0 	83 c0 02             	add    $0x2,%eax
ffffffff804c14ef:        0 	2b 85 cc 04 00 00    	sub    0x4cc(%rbp),%eax
ffffffff804c14f5:        0 	2b 85 d0 04 00 00    	sub    0x4d0(%rbp),%eax
ffffffff804c14fb:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c1501:        0 	e8 0a 3e 00 00       	callq  ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1506:        0 	85 c0                	test   %eax,%eax
ffffffff804c1508:        0 	0f 85 b6 00 00 00    	jne    ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c150e:        0 	44 89 e2             	mov    %r12d,%edx
ffffffff804c1511:        0 	be 02 00 00 00       	mov    $0x2,%esi
ffffffff804c1516:        0 	e9 38 ff ff ff       	jmpq   ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c151b:        0 	8b 05 3f 6f 3f 00    	mov    0x3f6f3f(%rip),%eax        # ffffffff808b8460 <sysctl_tcp_frto_response>
ffffffff804c1521:        0 	83 f8 01             	cmp    $0x1,%eax
ffffffff804c1524:        0 	74 1a                	je     ffffffff804c1540 <tcp_ack+0xa29>
ffffffff804c1526:        0 	83 f8 02             	cmp    $0x2,%eax
ffffffff804c1529:        0 	75 5d                	jne    ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c152b:        0 	41 f6 c4 40          	test   $0x40,%r12b
ffffffff804c152f:        0 	75 57                	jne    ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c1531:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1536:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1539:        0 	e8 5a db ff ff       	callq  ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c153e:        0 	eb 50                	jmp    ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1540:        0 	8b 85 ac 04 00 00    	mov    0x4ac(%rbp),%eax
ffffffff804c1546:        0 	8b 95 a8 04 00 00    	mov    0x4a8(%rbp),%edx
ffffffff804c154c:        0 	c7 85 b0 04 00 00 00 	movl   $0x0,0x4b0(%rbp)
ffffffff804c1553:        0 	00 00 00 
ffffffff804c1556:        0 	c7 85 dc 04 00 00 00 	movl   $0x0,0x4dc(%rbp)
ffffffff804c155d:        0 	00 00 00 
ffffffff804c1560:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c1562:        0 	0f 46 c2             	cmovbe %edx,%eax
ffffffff804c1565:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c156b:        0 	8a 85 7e 04 00 00    	mov    0x47e(%rbp),%al
ffffffff804c1571:        0 	a8 01                	test   $0x1,%al
ffffffff804c1573:        0 	74 09                	je     ffffffff804c157e <tcp_ack+0xa67>
ffffffff804c1575:        0 	83 c8 02             	or     $0x2,%eax
ffffffff804c1578:        0 	88 85 7e 04 00 00    	mov    %al,0x47e(%rbp)
ffffffff804c157e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1581:        0 	e8 27 da ff ff       	callq  ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1586:        0 	eb 08                	jmp    ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1588:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c158b:        0 	e8 78 dd ff ff       	callq  ffffffff804bf308 <tcp_ratehalving_spur_to_response>
ffffffff804c1590:        0 	c6 85 5e 04 00 00 00 	movb   $0x0,0x45e(%rbp)
ffffffff804c1597:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c159e:        0 	00 00 00 
ffffffff804c15a1:        0 	31 c9                	xor    %ecx,%ecx
ffffffff804c15a3:        0 	48 8b 05 0e 01 5f 00 	mov    0x5f010e(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c15aa:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c15b1:        0 	00 
ffffffff804c15b2:        0 	89 d2                	mov    %edx,%edx
ffffffff804c15b4:        0 	48 f7 d0             	not    %rax
ffffffff804c15b7:        0 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c15bb:        0 	48 ff 80 28 02 00 00 	incq   0x228(%rax)
ffffffff804c15c2:        0 	eb 05                	jmp    ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c15c4:        0 	b9 01 00 00 00       	mov    $0x1,%ecx
ffffffff804c15c9:      466 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c15cf:     5645 	39 95 58 04 00 00    	cmp    %edx,0x458(%rbp)
ffffffff804c15d5:      176 	79 0a                	jns    ffffffff804c15e1 <tcp_ack+0xaca>
ffffffff804c15d7:       24 	c7 85 58 04 00 00 00 	movl   $0x0,0x458(%rbp)
ffffffff804c15de:        0 	00 00 00 
ffffffff804c15e1:      620 	8b 54 24 2c          	mov    0x2c(%rsp),%edx
ffffffff804c15e5:      639 	03 54 24 30          	add    0x30(%rsp),%edx
ffffffff804c15e9:        2 	44 89 e3             	mov    %r12d,%ebx
ffffffff804c15ec:      283 	2b 54 24 28          	sub    0x28(%rsp),%edx
ffffffff804c15f0:      154 	2b 54 24 24          	sub    0x24(%rsp),%edx
ffffffff804c15f4:        0 	83 e3 17             	and    $0x17,%ebx
ffffffff804c15f7:      266 	89 5c 24 54          	mov    %ebx,0x54(%rsp)
ffffffff804c15fb:      168 	74 13                	je     ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c15fd:        0 	41 f6 c4 60          	test   $0x60,%r12b
ffffffff804c1601:     6575 	75 0d                	jne    ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c1603:       20 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c160a:     1417 	0f 84 3a 09 00 00    	je     ffffffff804c1f4a <tcp_ack+0x1433>
ffffffff804c1610:        0 	44 89 e0             	mov    %r12d,%eax
ffffffff804c1613:        0 	c1 e8 02             	shr    $0x2,%eax
ffffffff804c1616:        0 	88 c3                	mov    %al,%bl
ffffffff804c1618:        0 	80 e3 01             	and    $0x1,%bl
ffffffff804c161b:        0 	41 88 de             	mov    %bl,%r14b
ffffffff804c161e:        0 	74 36                	je     ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1620:        0 	85 c9                	test   %ecx,%ecx
ffffffff804c1622:        0 	75 32                	jne    ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1624:        0 	41 f6 c4 40          	test   $0x40,%r12b
ffffffff804c1628:        0 	74 0e                	je     ffffffff804c1638 <tcp_ack+0xb21>
ffffffff804c162a:        0 	8b 85 a8 04 00 00    	mov    0x4a8(%rbp),%eax
ffffffff804c1630:        0 	39 85 ac 04 00 00    	cmp    %eax,0x4ac(%rbp)
ffffffff804c1636:        0 	73 1e                	jae    ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1638:        0 	0f b6 8d 78 03 00 00 	movzbl 0x378(%rbp),%ecx
ffffffff804c163f:        0 	b8 0c 00 00 00       	mov    $0xc,%eax
ffffffff804c1644:        0 	d3 f8                	sar    %cl,%eax
ffffffff804c1646:        0 	a8 01                	test   $0x1,%al
ffffffff804c1648:        0 	75 0c                	jne    ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c164a:        0 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c164e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1651:        0 	e8 6b dc ff ff       	callq  ffffffff804bf2c1 <tcp_cong_avoid>
ffffffff804c1656:        0 	31 db                	xor    %ebx,%ebx
ffffffff804c1658:        0 	41 f7 c4 17 04 00 00 	test   $0x417,%r12d
ffffffff804c165f:        0 	44 8b bd 74 04 00 00 	mov    0x474(%rbp),%r15d
ffffffff804c1666:        0 	0f 94 c3             	sete   %bl
ffffffff804c1669:        0 	41 bd 01 00 00 00    	mov    $0x1,%r13d
ffffffff804c166f:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1671:        0 	75 21                	jne    ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c1673:        0 	45 30 ed             	xor    %r13b,%r13b
ffffffff804c1676:        0 	41 f6 c4 20          	test   $0x20,%r12b
ffffffff804c167a:        0 	74 18                	je     ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c167c:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c167f:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c1682:        0 	e8 cf d8 ff ff       	callq  ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1687:        0 	0f b6 95 7f 04 00 00 	movzbl 0x47f(%rbp),%edx
ffffffff804c168e:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c1690:        0 	41 0f 9f c5          	setg   %r13b
ffffffff804c1694:        0 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c169b:        0 	75 24                	jne    ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c169d:        0 	83 bd d0 04 00 00 00 	cmpl   $0x0,0x4d0(%rbp)
ffffffff804c16a4:        0 	74 1b                	je     ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c16a6:        0 	be 16 0a 00 00       	mov    $0xa16,%esi
ffffffff804c16ab:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c16b2:        0 	e8 fe 4a d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16b7:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c16be:        0 	00 00 00 
ffffffff804c16c1:        0 	83 bd d0 04 00 00 00 	cmpl   $0x0,0x4d0(%rbp)
ffffffff804c16c8:        0 	75 24                	jne    ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16ca:        0 	83 bd d4 04 00 00 00 	cmpl   $0x0,0x4d4(%rbp)
ffffffff804c16d1:        0 	74 1b                	je     ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16d3:        0 	be 18 0a 00 00       	mov    $0xa18,%esi
ffffffff804c16d8:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c16df:        0 	e8 d1 4a d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16e4:        0 	c7 85 d4 04 00 00 00 	movl   $0x0,0x4d4(%rbp)
ffffffff804c16eb:        0 	00 00 00 
ffffffff804c16ee:        0 	44 89 e0             	mov    %r12d,%eax
ffffffff804c16f1:        0 	83 e0 40             	and    $0x40,%eax
ffffffff804c16f4:        0 	89 44 24 58          	mov    %eax,0x58(%rsp)
ffffffff804c16f8:        0 	74 0a                	je     ffffffff804c1704 <tcp_ack+0xbed>
ffffffff804c16fa:        0 	c7 85 6c 05 00 00 00 	movl   $0x0,0x56c(%rbp)
ffffffff804c1701:        0 	00 00 00 
ffffffff804c1704:        0 	41 f7 c4 00 20 00 00 	test   $0x2000,%r12d
ffffffff804c170b:        0 	0f 84 50 08 00 00    	je     ffffffff804c1f61 <tcp_ack+0x144a>
ffffffff804c1711:        0 	48 8b 15 a0 ff 5e 00 	mov    0x5effa0(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1718:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c171b:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1720:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c1727:        0 	00 
ffffffff804c1728:        0 	89 c0                	mov    %eax,%eax
ffffffff804c172a:        0 	48 f7 d2             	not    %rdx
ffffffff804c172d:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c1731:        0 	48 ff 80 00 01 00 00 	incq   0x100(%rax)
ffffffff804c1738:        0 	e8 df e2 ff ff       	callq  ffffffff804bfa1c <tcp_enter_loss>
ffffffff804c173d:        0 	48 8b b5 c0 00 00 00 	mov    0xc0(%rbp),%rsi
ffffffff804c1744:        0 	fe 85 79 03 00 00    	incb   0x379(%rbp)
ffffffff804c174a:        0 	48 8d 85 c0 00 00 00 	lea    0xc0(%rbp),%rax
ffffffff804c1751:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1754:        0 	48 39 c6             	cmp    %rax,%rsi
ffffffff804c1757:        0 	b8 00 00 00 00       	mov    $0x0,%eax
ffffffff804c175c:        0 	48 0f 44 f0          	cmove  %rax,%rsi
ffffffff804c1760:        0 	e8 2d 4b 00 00       	callq  ffffffff804c6292 <tcp_retransmit_skb>
ffffffff804c1765:        0 	8b 95 58 03 00 00    	mov    0x358(%rbp),%edx
ffffffff804c176b:        0 	b9 30 75 00 00       	mov    $0x7530,%ecx
ffffffff804c1770:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1775:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1778:        0 	e8 66 de ff ff       	callq  ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c177d:        0 	e9 dd 06 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1782:        0 	45 84 e4             	test   %r12b,%r12b
ffffffff804c1785:        0 	79 51                	jns    ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1787:        0 	8b 95 70 05 00 00    	mov    0x570(%rbp),%edx
ffffffff804c178d:        0 	39 95 00 04 00 00    	cmp    %edx,0x400(%rbp)
ffffffff804c1793:        0 	79 43                	jns    ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1795:        0 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c179c:        0 	74 3a                	je     ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c179e:        0 	0f b6 85 7f 04 00 00 	movzbl 0x47f(%rbp),%eax
ffffffff804c17a5:        0 	8b b5 d4 04 00 00    	mov    0x4d4(%rbp),%esi
ffffffff804c17ab:        0 	39 c6                	cmp    %eax,%esi
ffffffff804c17ad:        0 	76 29                	jbe    ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c17af:        0 	29 c6                	sub    %eax,%esi
ffffffff804c17b1:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c17b4:        0 	e8 58 e6 ff ff       	callq  ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c17b9:        0 	48 8b 05 f8 fe 5e 00 	mov    0x5efef8(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c17c0:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c17c7:        0 	00 
ffffffff804c17c8:        0 	89 d2                	mov    %edx,%edx
ffffffff804c17ca:        0 	48 f7 d0             	not    %rax
ffffffff804c17cd:        0 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c17d1:        0 	48 ff 80 48 01 00 00 	incq   0x148(%rax)
ffffffff804c17d8:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c17de:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c17e4:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c17ea:        0 	76 11                	jbe    ffffffff804c17fd <tcp_ack+0xce6>
ffffffff804c17ec:        0 	be 2e 0a 00 00       	mov    $0xa2e,%esi
ffffffff804c17f1:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c17f8:        0 	e8 b8 49 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c17fd:        0 	8a 85 78 03 00 00    	mov    0x378(%rbp),%al
ffffffff804c1803:        0 	84 c0                	test   %al,%al
ffffffff804c1805:        0 	75 29                	jne    ffffffff804c1830 <tcp_ack+0xd19>
ffffffff804c1807:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c180e:        0 	74 11                	je     ffffffff804c1821 <tcp_ack+0xd0a>
ffffffff804c1810:        0 	be 33 0a 00 00       	mov    $0xa33,%esi
ffffffff804c1815:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c181c:        0 	e8 94 49 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1821:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c1828:        0 	00 00 00 
ffffffff804c182b:        0 	e9 c4 00 00 00       	jmpq   ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1830:        0 	8b 8d 70 05 00 00    	mov    0x570(%rbp),%ecx
ffffffff804c1836:        0 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c183c:        0 	39 ca                	cmp    %ecx,%edx
ffffffff804c183e:        0 	0f 88 b0 00 00 00    	js     ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1844:        0 	3c 02                	cmp    $0x2,%al
ffffffff804c1846:        0 	74 31                	je     ffffffff804c1879 <tcp_ack+0xd62>
ffffffff804c1848:        0 	77 0a                	ja     ffffffff804c1854 <tcp_ack+0xd3d>
ffffffff804c184a:        0 	fe c8                	dec    %al
ffffffff804c184c:        0 	0f 85 a2 00 00 00    	jne    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1852:        0 	eb 33                	jmp    ffffffff804c1887 <tcp_ack+0xd70>
ffffffff804c1854:        0 	3c 03                	cmp    $0x3,%al
ffffffff804c1856:        0 	74 6f                	je     ffffffff804c18c7 <tcp_ack+0xdb0>
ffffffff804c1858:        0 	3c 04                	cmp    $0x4,%al
ffffffff804c185a:        0 	0f 85 94 00 00 00    	jne    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1860:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c1867:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c186a:        0 	e8 fb d8 ff ff       	callq  ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c186f:        0 	85 c0                	test   %eax,%eax
ffffffff804c1871:        0 	0f 85 e8 05 00 00    	jne    ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1877:        0 	eb 7b                	jmp    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1879:        0 	39 ca                	cmp    %ecx,%edx
ffffffff804c187b:        0 	74 77                	je     ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c187d:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1880:        0 	e8 b8 d9 ff ff       	callq  ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c1885:        0 	eb 34                	jmp    ffffffff804c18bb <tcp_ack+0xda4>
ffffffff804c1887:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c188a:        0 	e8 63 d9 ff ff       	callq  ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c188f:        0 	83 bd 78 05 00 00 00 	cmpl   $0x0,0x578(%rbp)
ffffffff804c1896:        0 	74 19                	je     ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c1898:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c189e:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c18a1:        0 	74 0e                	je     ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c18a3:        0 	8b 85 70 05 00 00    	mov    0x570(%rbp),%eax
ffffffff804c18a9:        0 	39 85 00 04 00 00    	cmp    %eax,0x400(%rbp)
ffffffff804c18af:        0 	74 43                	je     ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18b1:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c18b8:        0 	00 00 00 
ffffffff804c18bb:        0 	31 f6                	xor    %esi,%esi
ffffffff804c18bd:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c18c0:        0 	e8 b4 cf ff ff       	callq  ffffffff804be879 <tcp_set_ca_state>
ffffffff804c18c5:        0 	eb 2d                	jmp    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18c7:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c18cd:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c18d0:        0 	75 0a                	jne    ffffffff804c18dc <tcp_ack+0xdc5>
ffffffff804c18d2:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c18d9:        0 	00 00 00 
ffffffff804c18dc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c18df:        0 	e8 86 d8 ff ff       	callq  ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c18e4:        0 	85 c0                	test   %eax,%eax
ffffffff804c18e6:        0 	0f 85 73 05 00 00    	jne    ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c18ec:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c18ef:        0 	e8 49 d9 ff ff       	callq  ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c18f4:        0 	8a 85 78 03 00 00    	mov    0x378(%rbp),%al
ffffffff804c18fa:        0 	3c 03                	cmp    $0x3,%al
ffffffff804c18fc:        0 	74 0d                	je     ffffffff804c190b <tcp_ack+0xdf4>
ffffffff804c18fe:        0 	3c 04                	cmp    $0x4,%al
ffffffff804c1900:        0 	0f 85 b8 01 00 00    	jne    ffffffff804c1abe <tcp_ack+0xfa7>
ffffffff804c1906:        0 	e9 c4 00 00 00       	jmpq   ffffffff804c19cf <tcp_ack+0xeb8>
ffffffff804c190b:        0 	41 f7 c4 00 04 00 00 	test   $0x400,%r12d
ffffffff804c1912:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1918:        0 	75 1e                	jne    ffffffff804c1938 <tcp_ack+0xe21>
ffffffff804c191a:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c191d:        0 	0f 85 fd 03 00 00    	jne    ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1923:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1925:        0 	0f 84 f5 03 00 00    	je     ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c192b:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c192e:        0 	e8 54 dd ff ff       	callq  ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1933:        0 	e9 e8 03 00 00       	jmpq   ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1938:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c193b:        0 	41 bd 01 00 00 00    	mov    $0x1,%r13d
ffffffff804c1941:        0 	74 18                	je     ffffffff804c195b <tcp_ack+0xe44>
ffffffff804c1943:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1946:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c1949:        0 	e8 08 d6 ff ff       	callq  ffffffff804bef56 <tcp_fackets_out>
ffffffff804c194e:        0 	0f b6 95 7f 04 00 00 	movzbl 0x47f(%rbp),%edx
ffffffff804c1955:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c1957:        0 	41 0f 9f c5          	setg   %r13b
ffffffff804c195b:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c195e:        0 	e8 c9 d7 ff ff       	callq  ffffffff804bf12c <tcp_may_undo>
ffffffff804c1963:        0 	85 c0                	test   %eax,%eax
ffffffff804c1965:        0 	0f 84 b5 03 00 00    	je     ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c196b:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c1972:        0 	75 0a                	jne    ffffffff804c197e <tcp_ack+0xe67>
ffffffff804c1974:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c197b:        0 	00 00 00 
ffffffff804c197e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1981:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c1984:        0 	e8 cd d5 ff ff       	callq  ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1989:        0 	44 29 7c 24 14       	sub    %r15d,0x14(%rsp)
ffffffff804c198e:        0 	ba 01 00 00 00       	mov    $0x1,%edx
ffffffff804c1993:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1996:        0 	8b 74 24 14          	mov    0x14(%rsp),%esi
ffffffff804c199a:        0 	01 c6                	add    %eax,%esi
ffffffff804c199c:        0 	e8 ed d3 ff ff       	callq  ffffffff804bed8e <tcp_update_reordering>
ffffffff804c19a1:        0 	31 f6                	xor    %esi,%esi
ffffffff804c19a3:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c19a6:        0 	e8 ed d6 ff ff       	callq  ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c19ab:        0 	48 8b 15 06 fd 5e 00 	mov    0x5efd06(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c19b2:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c19b9:        0 	00 
ffffffff804c19ba:        0 	89 c0                	mov    %eax,%eax
ffffffff804c19bc:        0 	48 f7 d2             	not    %rdx
ffffffff804c19bf:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c19c3:        0 	48 ff 80 30 01 00 00 	incq   0x130(%rax)
ffffffff804c19ca:        0 	e9 51 03 00 00       	jmpq   ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c19cf:        0 	45 84 f6             	test   %r14b,%r14b
ffffffff804c19d2:        0 	74 07                	je     ffffffff804c19db <tcp_ack+0xec4>
ffffffff804c19d4:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c19db:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c19e1:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c19e4:        0 	75 13                	jne    ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19e6:        0 	41 f7 c4 00 04 00 00 	test   $0x400,%r12d
ffffffff804c19ed:        0 	74 0a                	je     ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19ef:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c19f6:        0 	00 00 00 
ffffffff804c19f9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c19fc:        0 	e8 2b d7 ff ff       	callq  ffffffff804bf12c <tcp_may_undo>
ffffffff804c1a01:        0 	85 c0                	test   %eax,%eax
ffffffff804c1a03:        0 	0f 84 6e 05 00 00    	je     ffffffff804c1f77 <tcp_ack+0x1460>
ffffffff804c1a09:        0 	48 8b 95 c0 00 00 00 	mov    0xc0(%rbp),%rdx
ffffffff804c1a10:        0 	48 8d 8d c0 00 00 00 	lea    0xc0(%rbp),%rcx
ffffffff804c1a17:        0 	eb 10                	jmp    ffffffff804c1a29 <tcp_ack+0xf12>
ffffffff804c1a19:        0 	48 3b 95 d8 01 00 00 	cmp    0x1d8(%rbp),%rdx
ffffffff804c1a20:        0 	74 12                	je     ffffffff804c1a34 <tcp_ack+0xf1d>
ffffffff804c1a22:        0 	80 62 5d fb          	andb   $0xfb,0x5d(%rdx)
ffffffff804c1a26:        0 	48 8b 12             	mov    (%rdx),%rdx
ffffffff804c1a29:        0 	48 8b 02             	mov    (%rdx),%rax
ffffffff804c1a2c:        0 	48 39 ca             	cmp    %rcx,%rdx
ffffffff804c1a2f:        0 	0f 18 08             	prefetcht0 (%rax)
ffffffff804c1a32:        0 	75 e5                	jne    ffffffff804c1a19 <tcp_ack+0xf02>
ffffffff804c1a34:        0 	48 c7 85 e0 04 00 00 	movq   $0x0,0x4e0(%rbp)
ffffffff804c1a3b:        0 	00 00 00 00 
ffffffff804c1a3f:        0 	48 c7 85 e8 04 00 00 	movq   $0x0,0x4e8(%rbp)
ffffffff804c1a46:        0 	00 00 00 00 
ffffffff804c1a4a:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1a4f:        0 	48 c7 85 f0 04 00 00 	movq   $0x0,0x4f0(%rbp)
ffffffff804c1a56:        0 	00 00 00 00 
ffffffff804c1a5a:        0 	c7 85 cc 04 00 00 00 	movl   $0x0,0x4cc(%rbp)
ffffffff804c1a61:        0 	00 00 00 
ffffffff804c1a64:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1a67:        0 	e8 2c d6 ff ff       	callq  ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c1a6c:        0 	48 8b 15 45 fc 5e 00 	mov    0x5efc45(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1a73:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c1a7a:        0 	00 
ffffffff804c1a7b:        0 	89 c0                	mov    %eax,%eax
ffffffff804c1a7d:        0 	48 f7 d2             	not    %rdx
ffffffff804c1a80:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c1a84:        0 	48 ff 80 40 01 00 00 	incq   0x140(%rax)
ffffffff804c1a8b:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c1a92:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1a98:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c1a9f:        0 	00 00 00 
ffffffff804c1aa2:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1aa5:        0 	74 0a                	je     ffffffff804c1ab1 <tcp_ack+0xf9a>
ffffffff804c1aa7:        0 	31 f6                	xor    %esi,%esi
ffffffff804c1aa9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1aac:        0 	e8 c8 cd ff ff       	callq  ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1ab1:        0 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c1ab8:        0 	0f 85 a1 03 00 00    	jne    ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1abe:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1ac4:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1ac7:        0 	75 1f                	jne    ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ac9:        0 	41 f7 c4 00 04 00 00 	test   $0x400,%r12d
ffffffff804c1ad0:        0 	74 0a                	je     ffffffff804c1adc <tcp_ack+0xfc5>
ffffffff804c1ad2:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c1ad9:        0 	00 00 00 
ffffffff804c1adc:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1ade:        0 	74 08                	je     ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ae0:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1ae3:        0 	e8 9f db ff ff       	callq  ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1ae8:        0 	80 bd 78 03 00 00 01 	cmpb   $0x1,0x378(%rbp)
ffffffff804c1aef:        0 	75 08                	jne    ffffffff804c1af9 <tcp_ack+0xfe2>
ffffffff804c1af1:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1af4:        0 	e8 f9 d6 ff ff       	callq  ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c1af9:        0 	80 bd 5e 04 00 00 00 	cmpb   $0x0,0x45e(%rbp)
ffffffff804c1b00:        0 	0f 85 90 00 00 00    	jne    ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b06:        0 	83 bd cc 04 00 00 00 	cmpl   $0x0,0x4cc(%rbp)
ffffffff804c1b0d:        0 	0f 85 79 04 00 00    	jne    ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b13:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1b19:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1b1c:        0 	a8 02                	test   $0x2,%al
ffffffff804c1b1e:        0 	74 08                	je     ffffffff804c1b28 <tcp_ack+0x1011>
ffffffff804c1b20:        0 	8b 95 d4 04 00 00    	mov    0x4d4(%rbp),%edx
ffffffff804c1b26:        0 	eb 08                	jmp    ffffffff804c1b30 <tcp_ack+0x1019>
ffffffff804c1b28:        0 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c1b2e:        0 	ff c2                	inc    %edx
ffffffff804c1b30:        0 	0f b6 85 7f 04 00 00 	movzbl 0x47f(%rbp),%eax
ffffffff804c1b37:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c1b39:        0 	0f 8f 4d 04 00 00    	jg     ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b3f:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1b45:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1b48:        0 	a8 02                	test   $0x2,%al
ffffffff804c1b4a:        0 	74 10                	je     ffffffff804c1b5c <tcp_ack+0x1045>
ffffffff804c1b4c:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1b4f:        0 	e8 1d d4 ff ff       	callq  ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1b54:        0 	85 c0                	test   %eax,%eax
ffffffff804c1b56:        0 	0f 85 30 04 00 00    	jne    ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b5c:        0 	0f b6 85 7f 04 00 00 	movzbl 0x47f(%rbp),%eax
ffffffff804c1b63:        0 	8b 95 74 04 00 00    	mov    0x474(%rbp),%edx
ffffffff804c1b69:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c1b6b:        0 	77 29                	ja     ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b6d:        0 	89 d0                	mov    %edx,%eax
ffffffff804c1b6f:        0 	d1 e8                	shr    %eax
ffffffff804c1b71:        0 	39 05 c1 68 3f 00    	cmp    %eax,0x3f68c1(%rip)        # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b77:        0 	0f 43 05 ba 68 3f 00 	cmovae 0x3f68ba(%rip),%eax        # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b7e:        0 	39 85 d0 04 00 00    	cmp    %eax,0x4d0(%rbp)
ffffffff804c1b84:        0 	72 10                	jb     ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b86:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1b89:        0 	e8 82 37 00 00       	callq  ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1b8e:        0 	85 c0                	test   %eax,%eax
ffffffff804c1b90:        0 	0f 84 f6 03 00 00    	je     ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b96:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c1b9c:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c1ba2:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c1ba8:        0 	76 11                	jbe    ffffffff804c1bbb <tcp_ack+0x10a4>
ffffffff804c1baa:        0 	be d7 09 00 00       	mov    $0x9d7,%esi
ffffffff804c1baf:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c1bb6:        0 	e8 fa 45 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1bbb:        0 	80 bd 5e 04 00 00 00 	cmpb   $0x0,0x45e(%rbp)
ffffffff804c1bc2:        0 	75 13                	jne    ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bc4:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c1bcb:        0 	75 0a                	jne    ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bcd:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c1bd4:        0 	00 00 00 
ffffffff804c1bd7:        0 	83 7c 24 58 00       	cmpl   $0x0,0x58(%rsp)
ffffffff804c1bdc:        0 	74 0d                	je     ffffffff804c1beb <tcp_ack+0x10d4>
ffffffff804c1bde:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1be3:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1be6:        0 	e8 cf d0 ff ff       	callq  ffffffff804becba <tcp_enter_cwr>
ffffffff804c1beb:        0 	80 bd 78 03 00 00 02 	cmpb   $0x2,0x378(%rbp)
ffffffff804c1bf2:        0 	74 15                	je     ffffffff804c1c09 <tcp_ack+0x10f2>
ffffffff804c1bf4:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1bf7:        0 	e8 71 d6 ff ff       	callq  ffffffff804bf26d <tcp_try_keep_open>
ffffffff804c1bfc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1bff:        0 	e8 a9 d3 ff ff       	callq  ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1c04:        0 	e9 56 02 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c09:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff804c1c0c:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1c0f:        0 	e8 d9 d3 ff ff       	callq  ffffffff804befed <tcp_cwnd_down>
ffffffff804c1c14:        0 	e9 46 02 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c19:        0 	8b 95 a4 03 00 00    	mov    0x3a4(%rbp),%edx
ffffffff804c1c1f:        0 	85 d2                	test   %edx,%edx
ffffffff804c1c21:        0 	74 34                	je     ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c23:        0 	8b 85 b0 05 00 00    	mov    0x5b0(%rbp),%eax
ffffffff804c1c29:        0 	39 85 00 04 00 00    	cmp    %eax,0x400(%rbp)
ffffffff804c1c2f:        0 	75 26                	jne    ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c31:        0 	ff 85 ac 04 00 00    	incl   0x4ac(%rbp)
ffffffff804c1c37:        0 	8d 42 ff             	lea    -0x1(%rdx),%eax
ffffffff804c1c3a:        0 	c7 85 a4 03 00 00 00 	movl   $0x0,0x3a4(%rbp)
ffffffff804c1c41:        0 	00 00 00 
ffffffff804c1c44:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1c47:        0 	89 85 9c 03 00 00    	mov    %eax,0x39c(%rbp)
ffffffff804c1c4d:        0 	e8 86 54 00 00       	callq  ffffffff804c70d8 <tcp_simple_retransmit>
ffffffff804c1c52:        0 	e9 08 02 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c57:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1c5d:        0 	48 8b 15 54 fa 5e 00 	mov    0x5efa54(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1c64:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1c67:        0 	48 f7 d2             	not    %rdx
ffffffff804c1c6a:        0 	3c 01                	cmp    $0x1,%al
ffffffff804c1c6c:        0 	19 c9                	sbb    %ecx,%ecx
ffffffff804c1c6e:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c1c75:        0 	00 
ffffffff804c1c76:        0 	89 c0                	mov    %eax,%eax
ffffffff804c1c78:        0 	83 c1 1f             	add    $0x1f,%ecx
ffffffff804c1c7b:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c1c7f:        0 	48 63 c9             	movslq %ecx,%rcx
ffffffff804c1c82:        0 	48 ff 04 c8          	incq   (%rax,%rcx,8)
ffffffff804c1c86:        0 	c7 85 6c 05 00 00 00 	movl   $0x0,0x56c(%rbp)
ffffffff804c1c8d:        0 	00 00 00 
ffffffff804c1c90:        0 	8b 85 fc 03 00 00    	mov    0x3fc(%rbp),%eax
ffffffff804c1c96:        0 	80 bd 78 03 00 00 01 	cmpb   $0x1,0x378(%rbp)
ffffffff804c1c9d:        0 	89 85 70 05 00 00    	mov    %eax,0x570(%rbp)
ffffffff804c1ca3:        0 	8b 85 00 04 00 00    	mov    0x400(%rbp),%eax
ffffffff804c1ca9:        0 	89 85 78 05 00 00    	mov    %eax,0x578(%rbp)
ffffffff804c1caf:        0 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c1cb5:        0 	89 85 7c 05 00 00    	mov    %eax,0x57c(%rbp)
ffffffff804c1cbb:        0 	77 3b                	ja     ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cbd:        0 	83 7c 24 58 00       	cmpl   $0x0,0x58(%rsp)
ffffffff804c1cc2:        0 	75 0e                	jne    ffffffff804c1cd2 <tcp_ack+0x11bb>
ffffffff804c1cc4:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1cc7:        0 	e8 f0 cb ff ff       	callq  ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c1ccc:        0 	89 85 6c 05 00 00    	mov    %eax,0x56c(%rbp)
ffffffff804c1cd2:        0 	48 8b 85 60 03 00 00 	mov    0x360(%rbp),%rax
ffffffff804c1cd9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1cdc:        0 	ff 50 28             	callq  *0x28(%rax)
ffffffff804c1cdf:        0 	89 85 a8 04 00 00    	mov    %eax,0x4a8(%rbp)
ffffffff804c1ce5:        0 	8a 85 7e 04 00 00    	mov    0x47e(%rbp),%al
ffffffff804c1ceb:        0 	a8 01                	test   $0x1,%al
ffffffff804c1ced:        0 	74 09                	je     ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cef:        0 	83 c8 02             	or     $0x2,%eax
ffffffff804c1cf2:        0 	88 85 7e 04 00 00    	mov    %al,0x47e(%rbp)
ffffffff804c1cf8:        0 	c7 85 dc 04 00 00 00 	movl   $0x0,0x4dc(%rbp)
ffffffff804c1cff:        0 	00 00 00 
ffffffff804c1d02:        0 	c7 85 b0 04 00 00 00 	movl   $0x0,0x4b0(%rbp)
ffffffff804c1d09:        0 	00 00 00 
ffffffff804c1d0c:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c1d11:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1d14:        0 	bb 01 00 00 00       	mov    $0x1,%ebx
ffffffff804c1d19:        0 	e8 5b cb ff ff       	callq  ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1d1e:        0 	eb 02                	jmp    ffffffff804c1d22 <tcp_ack+0x120b>
ffffffff804c1d20:        0 	31 db                	xor    %ebx,%ebx
ffffffff804c1d22:        0 	45 85 ed             	test   %r13d,%r13d
ffffffff804c1d25:        0 	75 21                	jne    ffffffff804c1d48 <tcp_ack+0x1231>
ffffffff804c1d27:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1d2d:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1d30:        0 	a8 02                	test   $0x2,%al
ffffffff804c1d32:        0 	0f 84 0b 01 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d38:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1d3b:        0 	e8 31 d2 ff ff       	callq  ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1d40:        0 	85 c0                	test   %eax,%eax
ffffffff804c1d42:        0 	0f 84 fb 00 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d48:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1d4e:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1d51:        0 	75 07                	jne    ffffffff804c1d5a <tcp_ack+0x1243>
ffffffff804c1d53:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1d58:        0 	eb 31                	jmp    ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d5a:        0 	a8 02                	test   $0x2,%al
ffffffff804c1d5c:        0 	8a 85 7f 04 00 00    	mov    0x47f(%rbp),%al
ffffffff804c1d62:        0 	74 17                	je     ffffffff804c1d7b <tcp_ack+0x1264>
ffffffff804c1d64:        0 	8b b5 d4 04 00 00    	mov    0x4d4(%rbp),%esi
ffffffff804c1d6a:        0 	0f b6 c0             	movzbl %al,%eax
ffffffff804c1d6d:        0 	29 c6                	sub    %eax,%esi
ffffffff804c1d6f:        0 	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff804c1d74:        0 	85 f6                	test   %esi,%esi
ffffffff804c1d76:        0 	0f 4e f0             	cmovle %eax,%esi
ffffffff804c1d79:        0 	eb 10                	jmp    ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d7b:        0 	8b b5 d0 04 00 00    	mov    0x4d0(%rbp),%esi
ffffffff804c1d81:        0 	0f b6 c0             	movzbl %al,%eax
ffffffff804c1d84:        0 	29 c6                	sub    %eax,%esi
ffffffff804c1d86:        0 	39 f3                	cmp    %esi,%ebx
ffffffff804c1d88:        0 	0f 4d f3             	cmovge %ebx,%esi
ffffffff804c1d8b:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1d8e:        0 	e8 7e e0 ff ff       	callq  ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c1d93:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1d99:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1d9c:        0 	a8 02                	test   $0x2,%al
ffffffff804c1d9e:        0 	0f 84 9f 00 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1da4:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1da7:        0 	e8 c5 d1 ff ff       	callq  ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1dac:        0 	85 c0                	test   %eax,%eax
ffffffff804c1dae:        0 	0f 84 8f 00 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1db4:        0 	48 8b 85 e8 04 00 00 	mov    0x4e8(%rbp),%rax
ffffffff804c1dbb:        0 	48 85 c0             	test   %rax,%rax
ffffffff804c1dbe:        0 	48 89 c3             	mov    %rax,%rbx
ffffffff804c1dc1:        0 	75 42                	jne    ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dc3:        0 	48 8b 9d c0 00 00 00 	mov    0xc0(%rbp),%rbx
ffffffff804c1dca:        0 	48 8d 85 c0 00 00 00 	lea    0xc0(%rbp),%rax
ffffffff804c1dd1:        0 	48 39 c3             	cmp    %rax,%rbx
ffffffff804c1dd4:        0 	75 2f                	jne    ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dd6:        0 	31 db                	xor    %ebx,%ebx
ffffffff804c1dd8:        0 	eb 2b                	jmp    ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dda:        0 	48 3b 9d d8 01 00 00 	cmp    0x1d8(%rbp),%rbx
ffffffff804c1de1:        0 	74 34                	je     ffffffff804c1e17 <tcp_ack+0x1300>
ffffffff804c1de3:        0 	48 8b 05 96 7a 3f 00 	mov    0x3f7a96(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c1dea:        0 	2b 43 58             	sub    0x58(%rbx),%eax
ffffffff804c1ded:        0 	3b 85 58 03 00 00    	cmp    0x358(%rbp),%eax
ffffffff804c1df3:        0 	76 22                	jbe    ffffffff804c1e17 <tcp_ack+0x1300>
ffffffff804c1df5:        0 	48 89 de             	mov    %rbx,%rsi
ffffffff804c1df8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1dfb:        0 	e8 28 d0 ff ff       	callq  ffffffff804bee28 <tcp_skb_mark_lost>
ffffffff804c1e00:        0 	48 8b 1b             	mov    (%rbx),%rbx
ffffffff804c1e03:        0 	eb 07                	jmp    ffffffff804c1e0c <tcp_ack+0x12f5>
ffffffff804c1e05:        0 	4c 8d ad c0 00 00 00 	lea    0xc0(%rbp),%r13
ffffffff804c1e0c:        0 	48 8b 03             	mov    (%rbx),%rax
ffffffff804c1e0f:        0 	4c 39 eb             	cmp    %r13,%rbx
ffffffff804c1e12:        0 	0f 18 08             	prefetcht0 (%rax)
ffffffff804c1e15:        0 	75 c3                	jne    ffffffff804c1dda <tcp_ack+0x12c3>
ffffffff804c1e17:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c1e1d:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c1e23:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c1e29:        0 	48 89 9d e8 04 00 00 	mov    %rbx,0x4e8(%rbp)
ffffffff804c1e30:        0 	76 11                	jbe    ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1e32:        0 	be e5 08 00 00       	mov    $0x8e5,%esi
ffffffff804c1e37:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c1e3e:        0 	e8 72 43 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1e43:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff804c1e46:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1e49:        0 	e8 9f d1 ff ff       	callq  ffffffff804befed <tcp_cwnd_down>
ffffffff804c1e4e:        0 	e9 2c 01 00 00       	jmpq   ffffffff804c1f7f <tcp_ack+0x1468>
ffffffff804c1e53:       47 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c1e57:      513 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1e5a:        0 	e8 62 d4 ff ff       	callq  ffffffff804bf2c1 <tcp_cong_avoid>
ffffffff804c1e5f:      427 	41 80 e4 34          	and    $0x34,%r12b
ffffffff804c1e63:     1234 	75 07                	jne    ffffffff804c1e6c <tcp_ack+0x1355>
ffffffff804c1e65:        0 	83 7c 24 54 00       	cmpl   $0x0,0x54(%rsp)
ffffffff804c1e6a:        0 	75 3c                	jne    ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e6c:        0 	48 8b 7d 78          	mov    0x78(%rbp),%rdi
ffffffff804c1e70:      916 	e8 8d c9 ff ff       	callq  ffffffff804be802 <dst_confirm>
ffffffff804c1e75:        3 	eb 31                	jmp    ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e77:        0 	48 8b 95 d8 01 00 00 	mov    0x1d8(%rbp),%rdx
ffffffff804c1e7e:       99 	48 85 d2             	test   %rdx,%rdx
ffffffff804c1e81:       16 	74 25                	je     ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e83:        0 	8b 85 44 04 00 00    	mov    0x444(%rbp),%eax
ffffffff804c1e89:        0 	03 85 00 04 00 00    	add    0x400(%rbp),%eax
ffffffff804c1e8f:        0 	3b 42 54             	cmp    0x54(%rdx),%eax
ffffffff804c1e92:        0 	78 1e                	js     ffffffff804c1eb2 <tcp_ack+0x139b>
ffffffff804c1e94:        0 	c6 85 7b 03 00 00 00 	movb   $0x0,0x37b(%rbp)
ffffffff804c1e9b:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c1ea0:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1ea3:        0 	e8 ab c9 ff ff       	callq  ffffffff804be853 <inet_csk_clear_xmit_timer>
ffffffff804c1ea8:      520 	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff804c1ead:      994 	e9 ec 00 00 00       	jmpq   ffffffff804c1f9e <tcp_ack+0x1487>
ffffffff804c1eb2:        0 	0f b6 8d 7b 03 00 00 	movzbl 0x37b(%rbp),%ecx
ffffffff804c1eb9:        0 	8b 95 58 03 00 00    	mov    0x358(%rbp),%edx
ffffffff804c1ebf:        0 	b8 30 75 00 00       	mov    $0x7530,%eax
ffffffff804c1ec4:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c1ec9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1ecc:        0 	d3 e2                	shl    %cl,%edx
ffffffff804c1ece:        0 	b9 30 75 00 00       	mov    $0x7530,%ecx
ffffffff804c1ed3:        0 	81 fa 30 75 00 00    	cmp    $0x7530,%edx
ffffffff804c1ed9:        0 	0f 47 d0             	cmova  %eax,%edx
ffffffff804c1edc:        0 	89 d2                	mov    %edx,%edx
ffffffff804c1ede:        0 	e8 00 d7 ff ff       	callq  ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c1ee3:        0 	eb c3                	jmp    ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1ee5:        0 	80 78 25 00          	cmpb   $0x0,0x25(%rax)
ffffffff804c1ee9:        0 	74 1a                	je     ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c1eeb:        0 	8b 54 24 18          	mov    0x18(%rsp),%edx
ffffffff804c1eef:        0 	e8 cc e3 ff ff       	callq  ffffffff804c02c0 <tcp_sacktag_write_queue>
ffffffff804c1ef4:        0 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c1efb:        0 	75 08                	jne    ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c1efd:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f00:        0 	e8 68 d3 ff ff       	callq  ffffffff804bf26d <tcp_try_keep_open>
ffffffff804c1f05:        0 	48 85 ed             	test   %rbp,%rbp
ffffffff804c1f08:        0 	74 2f                	je     ffffffff804c1f39 <tcp_ack+0x1422>
ffffffff804c1f0a:        0 	be 0a 00 00 00       	mov    $0xa,%esi
ffffffff804c1f0f:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f12:        0 	e8 f1 d5 ff ff       	callq  ffffffff804bf508 <sock_flag>
ffffffff804c1f17:        0 	85 c0                	test   %eax,%eax
ffffffff804c1f19:        0 	74 1e                	je     ffffffff804c1f39 <tcp_ack+0x1422>
ffffffff804c1f1b:        0 	8b 8d fc 03 00 00    	mov    0x3fc(%rbp),%ecx
ffffffff804c1f21:        0 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c1f27:        0 	48 c7 c7 e5 d9 6a 80 	mov    $0xffffffff806ad9e5,%rdi
ffffffff804c1f2e:        0 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c1f32:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1f34:        0 	e8 3b 4e d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c1f39:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1f3b:        0 	eb 61                	jmp    ffffffff804c1f9e <tcp_ack+0x1487>
ffffffff804c1f3d:        0 	c7 44 24 44 00 00 00 	movl   $0x0,0x44(%rsp)
ffffffff804c1f44:        0 	00 
ffffffff804c1f45:        0 	e9 c3 ef ff ff       	jmpq   ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c1f4a:       54 	41 f6 c4 04          	test   $0x4,%r12b
ffffffff804c1f4e:      424 	0f 84 0b ff ff ff    	je     ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f54:      364 	85 c9                	test   %ecx,%ecx
ffffffff804c1f56:        0 	0f 84 f7 fe ff ff    	je     ffffffff804c1e53 <tcp_ack+0x133c>
ffffffff804c1f5c:        0 	e9 fe fe ff ff       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f61:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1f67:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1f6a:        0 	a8 02                	test   $0x2,%al
ffffffff804c1f6c:        0 	0f 85 10 f8 ff ff    	jne    ffffffff804c1782 <tcp_ack+0xc6b>
ffffffff804c1f72:        0 	e9 61 f8 ff ff       	jmpq   ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1f77:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f7a:        0 	e8 2e d0 ff ff       	callq  ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1f7f:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f82:        0 	e8 f7 47 00 00       	callq  ffffffff804c677e <tcp_xmit_retransmit_queue>
ffffffff804c1f87:        0 	e9 d3 fe ff ff       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f8c:        0 	80 bd 78 03 00 00 01 	cmpb   $0x1,0x378(%rbp)
ffffffff804c1f93:        0 	0f 87 be fc ff ff    	ja     ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1f99:        0 	e9 7b fc ff ff       	jmpq   ffffffff804c1c19 <tcp_ack+0x1102>
ffffffff804c1f9e:      493 	48 81 c4 88 00 00 00 	add    $0x88,%rsp
ffffffff804c1fa5:     1288 	5b                   	pop    %rbx
ffffffff804c1fa6:        0 	5d                   	pop    %rbp
ffffffff804c1fa7:      446 	41 5c                	pop    %r12
ffffffff804c1fa9:        0 	41 5d                	pop    %r13
ffffffff804c1fab:        2 	41 5e                	pop    %r14
ffffffff804c1fad:      447 	41 5f                	pop    %r15
ffffffff804c1faf:        0 	c3                   	retq   

No really obvious single-instruction hotspots that I can see.

But I can see another problem: the function is too large and its flow
is not fall-through in any way. As you can see from the profile
distribution, it is broken into 25-30 separate code sequences.

The function consists of more than 1200 instructions and is 5200 bytes
large. According to the profile above, only about 350 of those
instructions are used and the remaining ~850 are never executed by
this workload. So in theory this function should only take up ~1.5K
of the instruction cache.

But because execution is spread out into 25+ smaller pieces, it takes
up ~4K of the instruction cache instead (there's a single ~1.2K hole
in the middle, I subtracted that) - 2-3 times larger than it should be.
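
The estimate above can be reproduced with a little arithmetic. This is
only a rough sketch: it assumes every instruction has the function's
average encoded size (5200 / 1200 bytes), which is an approximation on
x86-64, and the helper names are made up for illustration.

```c
/*
 * Back-of-the-envelope icache footprint estimate, using the figures
 * quoted in the message (5200 bytes, ~1200 instructions, ~350 executed,
 * one ~1.2K never-touched hole). All helper names are hypothetical.
 */

/* Bytes the hot instructions would need if they were laid out contiguously,
 * assuming each instruction has the function's average encoded size. */
double ideal_footprint_bytes(unsigned func_bytes, unsigned func_insns,
                             unsigned used_insns)
{
    return (double)used_insns * func_bytes / func_insns;
}

/* Bytes actually pulled into the icache when the hot code is scattered
 * across the whole function except for one large untouched hole. */
unsigned touched_footprint_bytes(unsigned func_bytes, unsigned hole_bytes)
{
    return func_bytes - hole_bytes;
}

/* How many times larger the scattered layout is than the ideal one. */
double footprint_ratio(unsigned func_bytes, unsigned func_insns,
                       unsigned used_insns, unsigned hole_bytes)
{
    return touched_footprint_bytes(func_bytes, hole_bytes) /
           ideal_footprint_bytes(func_bytes, func_insns, used_insns);
}
```

With the numbers from the message, ideal_footprint_bytes(5200, 1200, 350)
comes out at roughly 1.5K, touched_footprint_bytes(5200, 1200) at 4000
bytes, and the ratio between the two falls inside the quoted 2-3x range.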

So this code could make good use of the (brand-new ;-) branch-tracer
ftrace plugin and grow a few well-placed likely()/unlikely()
annotations - at least for this workload. I think.
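
To illustrate what such annotations look like, here is a minimal sketch.
The likely()/unlikely() macros are the usual __builtin_expect() wrappers
(GCC/Clang); everything else (struct demo_sock, demo_ack(), the counter)
is a hypothetical, heavily simplified stand-in and not the real tcp_ack()
code. Which branches are actually rare would have to come from branch
profiling of the workload, not from guessing.

```c
/* Branch hints as wrappers around the GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical, simplified connection state; not struct tcp_sock. */
struct demo_sock {
    unsigned int snd_una;          /* oldest unacknowledged sequence number */
    unsigned int snd_nxt;          /* next sequence number to be sent */
    unsigned long slow_path_hits;  /* how often a rare branch was taken */
};

/*
 * Accept or reject an incoming ACK number.
 * Returns 1 if the ACK was accepted, 0 if it was ignored.
 * The signed differences make the comparisons safe across sequence
 * number wraparound (relies on the usual two's-complement conversion).
 */
int demo_ack(struct demo_sock *sk, unsigned int ack)
{
    /* Rare case: ACK for data we have not sent yet. */
    if (unlikely((int)(sk->snd_nxt - ack) < 0)) {
        sk->slow_path_hits++;
        return 0;
    }

    /* Rare case: stale ACK that is older than snd_una. */
    if (unlikely((int)(ack - sk->snd_una) < 0)) {
        sk->slow_path_hits++;
        return 0;
    }

    /* Common case: stays on the fall-through path. */
    sk->snd_una = ack;
    return 1;
}
```

The hint lets the compiler keep the common case as straight-line
fall-through code and move the rare blocks out of the way, which is the
layout effect the message is asking for.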

	Ingo


* tcp_ack(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:09                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger


* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:

> 100.000000 total
> ................
>   1.997533 tcp_ack

                      hits (total: 199753)
                 .........
ffffffff804c0b17:      452 <tcp_ack>:
ffffffff804c0b17:      452 	41 57                	push   %r15
ffffffff804c0b19:     9569 	41 56                	push   %r14
ffffffff804c0b1b:        0 	41 55                	push   %r13
ffffffff804c0b1d:        0 	49 89 f5             	mov    %rsi,%r13
ffffffff804c0b20:      493 	41 54                	push   %r12
ffffffff804c0b22:      104 	41 89 d4             	mov    %edx,%r12d
ffffffff804c0b25:        0 	55                   	push   %rbp
ffffffff804c0b26:      425 	48 89 fd             	mov    %rdi,%rbp
ffffffff804c0b29:       21 	53                   	push   %rbx
ffffffff804c0b2a:        0 	48 81 ec 88 00 00 00 	sub    $0x88,%rsp
ffffffff804c0b31:      445 	8b 87 00 04 00 00    	mov    0x400(%rdi),%eax
ffffffff804c0b37:        0 	89 44 24 18          	mov    %eax,0x18(%rsp)
ffffffff804c0b3b:      443 	48 8d 46 38          	lea    0x38(%rsi),%rax
ffffffff804c0b3f:       18 	8b 50 28             	mov    0x28(%rax),%edx
ffffffff804c0b42:     2565 	44 8b 70 18          	mov    0x18(%rax),%r14d
ffffffff804c0b46:      358 	89 54 24 1c          	mov    %edx,0x1c(%rsp)
ffffffff804c0b4a:        2 	39 97 fc 03 00 00    	cmp    %edx,0x3fc(%rdi)
ffffffff804c0b50:      368 	0f 88 af 13 00 00    	js     ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c0b56:      106 	89 d1                	mov    %edx,%ecx
ffffffff804c0b58:        2 	2b 4c 24 18          	sub    0x18(%rsp),%ecx
ffffffff804c0b5c:      328 	0f 88 83 13 00 00    	js     ffffffff804c1ee5 <tcp_ack+0x13ce>
ffffffff804c0b62:     1440 	8b 44 24 18          	mov    0x18(%rsp),%eax
ffffffff804c0b66:        2 	29 d0                	sub    %edx,%eax
ffffffff804c0b68:       77 	44 89 e2             	mov    %r12d,%edx
ffffffff804c0b6b:      398 	89 c6                	mov    %eax,%esi
ffffffff804c0b6d:        3 	80 ce 04             	or     $0x4,%dh
ffffffff804c0b70:       65 	c1 ee 1f             	shr    $0x1f,%esi
ffffffff804c0b73:      362 	44 0f 45 e2          	cmovne %edx,%r12d
ffffffff804c0b77:        1 	83 3d ea 78 3f 00 00 	cmpl   $0x0,0x3f78ea(%rip)        # ffffffff808b8468 <sysctl_tcp_abc>
ffffffff804c0b7e:       64 	74 27                	je     ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b80:        0 	8a 87 78 03 00 00    	mov    0x378(%rdi),%al
ffffffff804c0b86:        0 	3c 01                	cmp    $0x1,%al
ffffffff804c0b88:        0 	77 08                	ja     ffffffff804c0b92 <tcp_ack+0x7b>
ffffffff804c0b8a:        0 	01 8f dc 04 00 00    	add    %ecx,0x4dc(%rdi)
ffffffff804c0b90:        0 	eb 15                	jmp    ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b92:        0 	3c 04                	cmp    $0x4,%al
ffffffff804c0b94:        0 	75 11                	jne    ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b96:        0 	8b 87 4c 04 00 00    	mov    0x44c(%rdi),%eax
ffffffff804c0b9c:        0 	39 c1                	cmp    %eax,%ecx
ffffffff804c0b9e:        0 	0f 46 c1             	cmovbe %ecx,%eax
ffffffff804c0ba1:        0 	01 87 dc 04 00 00    	add    %eax,0x4dc(%rdi)
ffffffff804c0ba7:      377 	8b 9d d4 04 00 00    	mov    0x4d4(%rbp),%ebx
ffffffff804c0bad:     3672 	41 f7 c4 00 01 00 00 	test   $0x100,%r12d
ffffffff804c0bb4:      282 	89 5c 24 20          	mov    %ebx,0x20(%rsp)
ffffffff804c0bb8:        0 	8b 85 74 04 00 00    	mov    0x474(%rbp),%eax
ffffffff804c0bbe:      140 	89 44 24 30          	mov    %eax,0x30(%rsp)
ffffffff804c0bc2:     7592 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c0bc8:     1580 	89 54 24 24          	mov    %edx,0x24(%rsp)
ffffffff804c0bcc:        3 	8b 9d cc 04 00 00    	mov    0x4cc(%rbp),%ebx
ffffffff804c0bd2:       58 	89 5c 24 28          	mov    %ebx,0x28(%rsp)
ffffffff804c0bd6:      419 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c0bdc:        0 	89 44 24 2c          	mov    %eax,0x2c(%rsp)
ffffffff804c0be0:       65 	75 4f                	jne    ffffffff804c0c31 <tcp_ack+0x11a>
ffffffff804c0be2:      423 	85 f6                	test   %esi,%esi
ffffffff804c0be4:       55 	74 4b                	je     ffffffff804c0c31 <tcp_ack+0x11a>
ffffffff804c0be6:       36 	44 89 b5 40 04 00 00 	mov    %r14d,0x440(%rbp)
ffffffff804c0bed:      368 	8b 54 24 1c          	mov    0x1c(%rsp),%edx
ffffffff804c0bf1:        4 	41 83 cc 02          	or     $0x2,%r12d
ffffffff804c0bf5:       32 	be 05 00 00 00       	mov    $0x5,%esi
ffffffff804c0bfa:      392 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0bfd:        4 	89 95 00 04 00 00    	mov    %edx,0x400(%rbp)
ffffffff804c0c03:     3341 	44 89 64 24 5c       	mov    %r12d,0x5c(%rsp)
ffffffff804c0c08:      855 	e8 98 dc ff ff       	callq  ffffffff804be8a5 <tcp_ca_event>
ffffffff804c0c0d:     2018 	48 8b 05 a4 0a 5f 00 	mov    0x5f0aa4(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c0c14:      858 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c0c1b:        0 	00 
ffffffff804c0c1c:        0 	89 d2                	mov    %edx,%edx
ffffffff804c0c1e:        0 	48 f7 d0             	not    %rax
ffffffff804c0c21:      425 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c0c25:        0 	48 ff 80 e8 00 00 00 	incq   0xe8(%rax)
ffffffff804c0c2c:        0 	e9 1b 01 00 00       	jmpq   ffffffff804c0d4c <tcp_ack+0x235>
ffffffff804c0c31:       41 	45 3b 75 54          	cmp    0x54(%r13),%r14d
ffffffff804c0c35:      360 	74 06                	je     ffffffff804c0c3d <tcp_ack+0x126>
ffffffff804c0c37:        1 	41 83 cc 01          	or     $0x1,%r12d
ffffffff804c0c3b:       80 	eb 1f                	jmp    ffffffff804c0c5c <tcp_ack+0x145>
ffffffff804c0c3d:        1 	48 8b 05 74 0a 5f 00 	mov    0x5f0a74(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c0c44:      303 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c0c4b:        0 	00 
ffffffff804c0c4c:       56 	89 d2                	mov    %edx,%edx
ffffffff804c0c4e:        0 	48 f7 d0             	not    %rax
ffffffff804c0c51:        4 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c0c55:       13 	48 ff 80 e0 00 00 00 	incq   0xe0(%rax)
ffffffff804c0c5c:       12 	41 8b 95 b8 00 00 00 	mov    0xb8(%r13),%edx
ffffffff804c0c63:      300 	49 03 95 d0 00 00 00 	add    0xd0(%r13),%rdx
ffffffff804c0c6a:       17 	66 8b 42 0e          	mov    0xe(%rdx),%ax
ffffffff804c0c6e:        0 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c0c72:       22 	f6 42 0d 02          	testb  $0x2,0xd(%rdx)
ffffffff804c0c76:       13 	0f b7 d8             	movzwl %ax,%ebx
ffffffff804c0c79:        0 	75 0b                	jne    ffffffff804c0c86 <tcp_ack+0x16f>
ffffffff804c0c7b:       26 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c0c81:      343 	83 e1 0f             	and    $0xf,%ecx
ffffffff804c0c84:        0 	d3 e3                	shl    %cl,%ebx
ffffffff804c0c86:       82 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c0c8a:       18 	44 89 f2             	mov    %r14d,%edx
ffffffff804c0c8d:        0 	89 d9                	mov    %ebx,%ecx
ffffffff804c0c8f:       12 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0c92:       12 	e8 47 e6 ff ff       	callq  ffffffff804bf2de <tcp_may_update_window>
ffffffff804c0c97:       16 	31 d2                	xor    %edx,%edx
ffffffff804c0c99:       66 	85 c0                	test   %eax,%eax
ffffffff804c0c9b:        0 	74 48                	je     ffffffff804c0ce5 <tcp_ack+0x1ce>
ffffffff804c0c9d:       12 	39 9d 44 04 00 00    	cmp    %ebx,0x444(%rbp)
ffffffff804c0ca3:       29 	44 89 b5 40 04 00 00 	mov    %r14d,0x440(%rbp)
ffffffff804c0caa:        0 	74 34                	je     ffffffff804c0ce0 <tcp_ack+0x1c9>
ffffffff804c0cac:        7 	89 9d 44 04 00 00    	mov    %ebx,0x444(%rbp)
ffffffff804c0cb2:       59 	c7 85 ec 03 00 00 00 	movl   $0x0,0x3ec(%rbp)
ffffffff804c0cb9:        0 	00 00 00 
ffffffff804c0cbc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0cbf:        7 	e8 13 e8 ff ff       	callq  ffffffff804bf4d7 <tcp_fast_path_check>
ffffffff804c0cc4:       23 	3b 9d 48 04 00 00    	cmp    0x448(%rbp),%ebx
ffffffff804c0cca:       48 	76 14                	jbe    ffffffff804c0ce0 <tcp_ack+0x1c9>
ffffffff804c0ccc:        0 	8b b5 5c 03 00 00    	mov    0x35c(%rbp),%esi
ffffffff804c0cd2:        0 	89 9d 48 04 00 00    	mov    %ebx,0x448(%rbp)
ffffffff804c0cd8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0cdb:        0 	e8 40 41 00 00       	callq  ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c0ce0:        6 	ba 02 00 00 00       	mov    $0x2,%edx
ffffffff804c0ce5:      141 	8b 5c 24 1c          	mov    0x1c(%rsp),%ebx
ffffffff804c0ce9:        1 	44 09 e2             	or     %r12d,%edx
ffffffff804c0cec:        3 	89 9d 00 04 00 00    	mov    %ebx,0x400(%rbp)
ffffffff804c0cf2:       34 	89 54 24 5c          	mov    %edx,0x5c(%rsp)
ffffffff804c0cf6:        0 	41 80 7d 5d 00       	cmpb   $0x0,0x5d(%r13)
ffffffff804c0cfb:        6 	74 13                	je     ffffffff804c0d10 <tcp_ack+0x1f9>
ffffffff804c0cfd:        0 	8b 54 24 18          	mov    0x18(%rsp),%edx
ffffffff804c0d01:        0 	4c 89 ee             	mov    %r13,%rsi
ffffffff804c0d04:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0d07:        0 	e8 b4 f5 ff ff       	callq  ffffffff804c02c0 <tcp_sacktag_write_queue>
ffffffff804c0d0c:        0 	09 44 24 5c          	or     %eax,0x5c(%rsp)
ffffffff804c0d10:       29 	41 8b 85 b8 00 00 00 	mov    0xb8(%r13),%eax
ffffffff804c0d17:      128 	49 03 85 d0 00 00 00 	add    0xd0(%r13),%rax
ffffffff804c0d1e:        0 	8a 40 0d             	mov    0xd(%rax),%al
ffffffff804c0d21:       33 	83 e0 42             	and    $0x42,%eax
ffffffff804c0d24:        0 	3c 40                	cmp    $0x40,%al
ffffffff804c0d26:        0 	75 17                	jne    ffffffff804c0d3f <tcp_ack+0x228>
ffffffff804c0d28:        0 	8b 44 24 5c          	mov    0x5c(%rsp),%eax
ffffffff804c0d2c:        0 	83 c8 40             	or     $0x40,%eax
ffffffff804c0d2f:        0 	f6 85 7e 04 00 00 01 	testb  $0x1,0x47e(%rbp)
ffffffff804c0d36:        0 	0f 44 44 24 5c       	cmove  0x5c(%rsp),%eax
ffffffff804c0d3b:        0 	89 44 24 5c          	mov    %eax,0x5c(%rsp)
ffffffff804c0d3f:       36 	be 06 00 00 00       	mov    $0x6,%esi
ffffffff804c0d44:      167 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0d47:        1 	e8 59 db ff ff       	callq  ffffffff804be8a5 <tcp_ca_event>
ffffffff804c0d4c:      581 	c7 85 48 01 00 00 00 	movl   $0x0,0x148(%rbp)
ffffffff804c0d53:        0 	00 00 00 
ffffffff804c0d56:     6076 	c6 85 7d 03 00 00 00 	movb   $0x0,0x37d(%rbp)
ffffffff804c0d5d:        0 	48 8b 05 1c 8b 3f 00 	mov    0x3f8b1c(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c0d64:      443 	89 85 08 04 00 00    	mov    %eax,0x408(%rbp)
ffffffff804c0d6a:        0 	8b 85 74 04 00 00    	mov    0x474(%rbp),%eax
ffffffff804c0d70:        0 	85 c0                	test   %eax,%eax
ffffffff804c0d72:      845 	89 44 24 14          	mov    %eax,0x14(%rsp)
ffffffff804c0d76:        0 	0f 84 fb 10 00 00    	je     ffffffff804c1e77 <tcp_ack+0x1360>
ffffffff804c0d7c:        0 	48 8b 05 fd 8a 3f 00 	mov    0x3f8afd(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c0d83:      586 	8b 54 24 14          	mov    0x14(%rsp),%edx
ffffffff804c0d87:        1 	41 83 cc ff          	or     $0xffffffffffffffff,%r12d
ffffffff804c0d8b:        2 	89 44 24 48          	mov    %eax,0x48(%rsp)
ffffffff804c0d8f:      879 	89 54 24 34          	mov    %edx,0x34(%rsp)
ffffffff804c0d93:        1 	8b 9d d0 04 00 00    	mov    0x4d0(%rbp),%ebx
ffffffff804c0d99:        0 	89 5c 24 40          	mov    %ebx,0x40(%rsp)
ffffffff804c0d9d:      889 	e8 e2 e8 ff ff       	callq  ffffffff804bf684 <net_invalid_timestamp>
ffffffff804c0da2:        0 	48 89 44 24 08       	mov    %rax,0x8(%rsp)
ffffffff804c0da7:       16 	48 8d 85 c0 00 00 00 	lea    0xc0(%rbp),%rax
ffffffff804c0dae:      445 	c7 44 24 44 01 00 00 	movl   $0x1,0x44(%rsp)
ffffffff804c0db5:        0 	00 
ffffffff804c0db6:        0 	c7 44 24 50 00 00 00 	movl   $0x0,0x50(%rsp)
ffffffff804c0dbd:        0 	00 
ffffffff804c0dbe:       10 	c7 44 24 38 00 00 00 	movl   $0x0,0x38(%rsp)
ffffffff804c0dc5:        0 	00 
ffffffff804c0dc6:     1308 	44 89 64 24 4c       	mov    %r12d,0x4c(%rsp)
ffffffff804c0dcb:      225 	48 89 04 24          	mov    %rax,(%rsp)
ffffffff804c0dcf:        2 	e9 8b 02 00 00       	jmpq   ffffffff804c105f <tcp_ack+0x548>
ffffffff804c0dd4:      488 	4d 8d 7d 38          	lea    0x38(%r13),%r15
ffffffff804c0dd8:     2298 	41 8a 57 25          	mov    0x25(%r15),%dl
ffffffff804c0ddc:        0 	88 54 24 3f          	mov    %dl,0x3f(%rsp)
ffffffff804c0de0:        6 	41 8b 77 1c          	mov    0x1c(%r15),%esi
ffffffff804c0de4:      455 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c0dea:        3 	49 8b 8d d0 00 00 00 	mov    0xd0(%r13),%rcx
ffffffff804c0df1:        0 	41 8b 85 c8 00 00 00 	mov    0xc8(%r13),%eax
ffffffff804c0df8:      440 	39 f2                	cmp    %esi,%edx
ffffffff804c0dfa:        0 	79 6f                	jns    ffffffff804c0e6b <tcp_ack+0x354>
ffffffff804c0dfc:        0 	89 c0                	mov    %eax,%eax
ffffffff804c0dfe:       39 	8b 5c 08 08          	mov    0x8(%rax,%rcx,1),%ebx
ffffffff804c0e02:        0 	66 83 fb 01          	cmp    $0x1,%bx
ffffffff804c0e06:        2 	0f 84 77 02 00 00    	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e0c:        0 	41 8b 47 18          	mov    0x18(%r15),%eax
ffffffff804c0e10:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c0e12:        0 	0f 89 6b 02 00 00    	jns    ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e18:        0 	29 c2                	sub    %eax,%edx
ffffffff804c0e1a:        0 	4c 89 ee             	mov    %r13,%rsi
ffffffff804c0e1d:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0e20:        0 	e8 8f 4f 00 00       	callq  ffffffff804c5db4 <tcp_trim_head>
ffffffff804c0e25:        0 	85 c0                	test   %eax,%eax
ffffffff804c0e27:        0 	0f 85 56 02 00 00    	jne    ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e2d:        0 	41 8b 85 c8 00 00 00 	mov    0xc8(%r13),%eax
ffffffff804c0e34:        0 	0f b7 d3             	movzwl %bx,%edx
ffffffff804c0e37:        0 	49 03 85 d0 00 00 00 	add    0xd0(%r13),%rax
ffffffff804c0e3e:        0 	41 89 d6             	mov    %edx,%r14d
ffffffff804c0e41:        0 	8b 48 08             	mov    0x8(%rax),%ecx
ffffffff804c0e44:        0 	0f b7 c1             	movzwl %cx,%eax
ffffffff804c0e47:        0 	41 29 c6             	sub    %eax,%r14d
ffffffff804c0e4a:        0 	0f 84 33 02 00 00    	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e50:        0 	66 85 c9             	test   %cx,%cx
ffffffff804c0e53:        0 	75 04                	jne    ffffffff804c0e59 <tcp_ack+0x342>
ffffffff804c0e55:        0 	0f 0b                	ud2a   
ffffffff804c0e57:        0 	eb fe                	jmp    ffffffff804c0e57 <tcp_ack+0x340>
ffffffff804c0e59:        0 	41 8b 5f 1c          	mov    0x1c(%r15),%ebx
ffffffff804c0e5d:        0 	41 39 5f 18          	cmp    %ebx,0x18(%r15)
ffffffff804c0e61:        0 	0f 88 d6 10 00 00    	js     ffffffff804c1f3d <tcp_ack+0x1426>
ffffffff804c0e67:        0 	0f 0b                	ud2a   
ffffffff804c0e69:        0 	eb fe                	jmp    ffffffff804c0e69 <tcp_ack+0x352>
ffffffff804c0e6b:        0 	83 7c 24 44 00       	cmpl   $0x0,0x44(%rsp)
ffffffff804c0e70:     6326 	89 c0                	mov    %eax,%eax
ffffffff804c0e72:      348 	44 0f b7 74 08 08    	movzwl 0x8(%rax,%rcx,1),%r14d
ffffffff804c0e78:        0 	0f 84 8f 00 00 00    	je     ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e7e:      132 	83 bd a4 03 00 00 00 	cmpl   $0x0,0x3a4(%rbp)
ffffffff804c0e85:     5840 	0f 84 82 00 00 00    	je     ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e8b:        0 	3b b5 b4 05 00 00    	cmp    0x5b4(%rbp),%esi
ffffffff804c0e91:        0 	78 7a                	js     ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e93:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0e96:        0 	e8 21 da ff ff       	callq  ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c0e9b:        0 	8b b5 4c 04 00 00    	mov    0x44c(%rbp),%esi
ffffffff804c0ea1:        0 	44 8b a5 ac 04 00 00 	mov    0x4ac(%rbp),%r12d
ffffffff804c0ea8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0eab:        0 	89 85 6c 05 00 00    	mov    %eax,0x56c(%rbp)
ffffffff804c0eb1:        0 	e8 c7 3e 00 00       	callq  ffffffff804c4d7d <tcp_mss_to_mtu>
ffffffff804c0eb6:        0 	8b 9d a4 03 00 00    	mov    0x3a4(%rbp),%ebx
ffffffff804c0ebc:        0 	31 d2                	xor    %edx,%edx
ffffffff804c0ebe:        0 	c7 85 b0 04 00 00 00 	movl   $0x0,0x4b0(%rbp)
ffffffff804c0ec5:        0 	00 00 00 
ffffffff804c0ec8:        0 	41 0f af c4          	imul   %r12d,%eax
ffffffff804c0ecc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0ecf:        0 	f7 f3                	div    %ebx
ffffffff804c0ed1:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c0ed7:        0 	48 8b 05 a2 89 3f 00 	mov    0x3f89a2(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c0ede:        0 	89 85 bc 04 00 00    	mov    %eax,0x4bc(%rbp)
ffffffff804c0ee4:        0 	e8 d3 d9 ff ff       	callq  ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c0ee9:        0 	8b b5 5c 03 00 00    	mov    0x35c(%rbp),%esi
ffffffff804c0eef:        0 	89 85 54 04 00 00    	mov    %eax,0x454(%rbp)
ffffffff804c0ef5:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c0ef8:        0 	89 9d a0 03 00 00    	mov    %ebx,0x3a0(%rbp)
ffffffff804c0efe:        0 	c7 85 a4 03 00 00 00 	movl   $0x0,0x3a4(%rbp)
ffffffff804c0f05:        0 	00 00 00 
ffffffff804c0f08:        0 	e8 13 3f 00 00       	callq  ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c0f0d:      945 	0f b6 44 24 3f       	movzbl 0x3f(%rsp),%eax
ffffffff804c0f12:     6361 	a8 82                	test   $0x82,%al
ffffffff804c0f14:        0 	74 30                	je     ffffffff804c0f46 <tcp_ack+0x42f>
ffffffff804c0f16:        0 	a8 02                	test   $0x2,%al
ffffffff804c0f18:        0 	74 07                	je     ffffffff804c0f21 <tcp_ack+0x40a>
ffffffff804c0f1a:        0 	44 29 b5 78 04 00 00 	sub    %r14d,0x478(%rbp)
ffffffff804c0f21:        0 	83 4c 24 50 08       	orl    $0x8,0x50(%rsp)
ffffffff804c0f26:        0 	f6 44 24 50 04       	testb  $0x4,0x50(%rsp)
ffffffff804c0f2b:        0 	75 06                	jne    ffffffff804c0f33 <tcp_ack+0x41c>
ffffffff804c0f2d:        0 	41 83 fe 01          	cmp    $0x1,%r14d
ffffffff804c0f31:        0 	76 08                	jbe    ffffffff804c0f3b <tcp_ack+0x424>
ffffffff804c0f33:        0 	81 4c 24 50 00 10 00 	orl    $0x1000,0x50(%rsp)
ffffffff804c0f3a:        0 	00 
ffffffff804c0f3b:        0 	41 83 cc ff          	or     $0xffffffffffffffff,%r12d
ffffffff804c0f3f:        0 	44 89 64 24 4c       	mov    %r12d,0x4c(%rsp)
ffffffff804c0f44:        0 	eb 38                	jmp    ffffffff804c0f7e <tcp_ack+0x467>
ffffffff804c0f46:        0 	44 8b 64 24 48       	mov    0x48(%rsp),%r12d
ffffffff804c0f4b:     5837 	45 2b 67 20          	sub    0x20(%r15),%r12d
ffffffff804c0f4f:        1 	83 7c 24 4c 00       	cmpl   $0x0,0x4c(%rsp)
ffffffff804c0f54:      167 	8b 5c 24 4c          	mov    0x4c(%rsp),%ebx
ffffffff804c0f58:      514 	49 8b 55 18          	mov    0x18(%r13),%rdx
ffffffff804c0f5c:        0 	41 0f 48 dc          	cmovs  %r12d,%ebx
ffffffff804c0f60:      164 	a8 01                	test   $0x1,%al
ffffffff804c0f62:      413 	48 89 54 24 08       	mov    %rdx,0x8(%rsp)
ffffffff804c0f67:        0 	89 5c 24 4c          	mov    %ebx,0x4c(%rsp)
ffffffff804c0f6b:      148 	75 11                	jne    ffffffff804c0f7e <tcp_ack+0x467>
ffffffff804c0f6d:     1608 	8b 54 24 38          	mov    0x38(%rsp),%edx
ffffffff804c0f71:        0 	39 54 24 34          	cmp    %edx,0x34(%rsp)
ffffffff804c0f75:      272 	0f 46 54 24 34       	cmovbe 0x34(%rsp),%edx
ffffffff804c0f7a:      266 	89 54 24 34          	mov    %edx,0x34(%rsp)
ffffffff804c0f7e:        0 	a8 01                	test   $0x1,%al
ffffffff804c0f80:      164 	74 07                	je     ffffffff804c0f89 <tcp_ack+0x472>
ffffffff804c0f82:        0 	44 29 b5 d0 04 00 00 	sub    %r14d,0x4d0(%rbp)
ffffffff804c0f89:     3955 	a8 04                	test   $0x4,%al
ffffffff804c0f8b:     8510 	74 07                	je     ffffffff804c0f94 <tcp_ack+0x47d>
ffffffff804c0f8d:        0 	44 29 b5 cc 04 00 00 	sub    %r14d,0x4cc(%rbp)
ffffffff804c0f94:       11 	44 29 b5 74 04 00 00 	sub    %r14d,0x474(%rbp)
ffffffff804c0f9b:     1426 	44 01 74 24 38       	add    %r14d,0x38(%rsp)
ffffffff804c0fa0:        6 	41 f6 47 24 02       	testb  $0x2,0x24(%r15)
ffffffff804c0fa5:      548 	75 07                	jne    ffffffff804c0fae <tcp_ack+0x497>
ffffffff804c0fa7:        2 	83 4c 24 50 04       	orl    $0x4,0x50(%rsp)
ffffffff804c0fac:        0 	eb 0f                	jmp    ffffffff804c0fbd <tcp_ack+0x4a6>
ffffffff804c0fae:        0 	83 4c 24 50 10       	orl    $0x10,0x50(%rsp)
ffffffff804c0fb3:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c0fba:        0 	00 00 00 
ffffffff804c0fbd:      517 	83 7c 24 44 00       	cmpl   $0x0,0x44(%rsp)
ffffffff804c0fc2:     6012 	0f 84 bb 00 00 00    	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0fc8:     1111 	48 8b 34 24          	mov    (%rsp),%rsi
ffffffff804c0fcc:        0 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c0fcf:      184 	e8 0d d8 ff ff       	callq  ffffffff804be7e1 <__skb_unlink>
ffffffff804c0fd4:        5 	41 8b 45 68          	mov    0x68(%r13),%eax
ffffffff804c0fd8:      517 	05 e8 00 00 00       	add    $0xe8,%eax
ffffffff804c0fdd:        0 	41 39 85 e0 00 00 00 	cmp    %eax,0xe0(%r13)
ffffffff804c0fe4:       31 	7d 08                	jge    ffffffff804c0fee <tcp_ack+0x4d7>
ffffffff804c0fe6:        0 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c0fe9:        0 	e8 d4 66 fc ff       	callq  ffffffff804876c2 <skb_truesize_bug>
ffffffff804c0fee:     1142 	0f ba ad 10 01 00 00 	btsl   $0xe,0x110(%rbp)
ffffffff804c0ff5:        0 	0e 
ffffffff804c0ff6:     2576 	8b 85 f0 00 00 00    	mov    0xf0(%rbp),%eax
ffffffff804c0ffc:      433 	41 2b 85 e0 00 00 00 	sub    0xe0(%r13),%eax
ffffffff804c1003:     4843 	89 85 f0 00 00 00    	mov    %eax,0xf0(%rbp)
ffffffff804c1009:     1730 	48 8b 45 30          	mov    0x30(%rbp),%rax
ffffffff804c100d:      311 	41 8b 95 e0 00 00 00 	mov    0xe0(%r13),%edx
ffffffff804c1014:        0 	48 83 b8 b0 00 00 00 	cmpq   $0x0,0xb0(%rax)
ffffffff804c101b:        0 	00 
ffffffff804c101c:      418 	74 06                	je     ffffffff804c1024 <tcp_ack+0x50d>
ffffffff804c101e:       37 	01 95 f4 00 00 00    	add    %edx,0xf4(%rbp)
ffffffff804c1024:        2 	4c 89 ef             	mov    %r13,%rdi
ffffffff804c1027:      432 	e8 56 7b fc ff       	callq  ffffffff80488b82 <__kfree_skb>
ffffffff804c102c:       44 	4c 3b ad f0 04 00 00 	cmp    0x4f0(%rbp),%r13
ffffffff804c1033:      511 	48 c7 85 e8 04 00 00 	movq   $0x0,0x4e8(%rbp)
ffffffff804c103a:        0 	00 00 00 00 
ffffffff804c103e:        1 	75 0b                	jne    ffffffff804c104b <tcp_ack+0x534>
ffffffff804c1040:        0 	48 c7 85 f0 04 00 00 	movq   $0x0,0x4f0(%rbp)
ffffffff804c1047:        0 	00 00 00 00 
ffffffff804c104b:        0 	4c 3b ad e0 04 00 00 	cmp    0x4e0(%rbp),%r13
ffffffff804c1052:      518 	75 0b                	jne    ffffffff804c105f <tcp_ack+0x548>
ffffffff804c1054:        0 	48 c7 85 e0 04 00 00 	movq   $0x0,0x4e0(%rbp)
ffffffff804c105b:        0 	00 00 00 00 
ffffffff804c105f:      439 	4c 8b ad c0 00 00 00 	mov    0xc0(%rbp),%r13
ffffffff804c1066:     5655 	4c 3b 2c 24          	cmp    (%rsp),%r13
ffffffff804c106a:        0 	75 05                	jne    ffffffff804c1071 <tcp_ack+0x55a>
ffffffff804c106c:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c106f:      810 	eb 12                	jmp    ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c1071:        0 	4d 85 ed             	test   %r13,%r13
ffffffff804c1074:     2574 	74 0d                	je     ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c1076:        0 	4c 3b ad d8 01 00 00 	cmp    0x1d8(%rbp),%r13
ffffffff804c107d:        0 	0f 85 51 fd ff ff    	jne    ffffffff804c0dd4 <tcp_ack+0x2bd>
ffffffff804c1083:      454 	8b 8d 00 04 00 00    	mov    0x400(%rbp),%ecx
ffffffff804c1089:      497 	8b 85 80 04 00 00    	mov    0x480(%rbp),%eax
ffffffff804c108f:        0 	2b 44 24 18          	sub    0x18(%rsp),%eax
ffffffff804c1093:        0 	89 ca                	mov    %ecx,%edx
ffffffff804c1095:      534 	2b 54 24 18          	sub    0x18(%rsp),%edx
ffffffff804c1099:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c109b:        0 	72 06                	jb     ffffffff804c10a3 <tcp_ack+0x58c>
ffffffff804c109d:      458 	89 8d 80 04 00 00    	mov    %ecx,0x480(%rbp)
ffffffff804c10a3:        0 	4d 85 ed             	test   %r13,%r13
ffffffff804c10a6:        0 	74 15                	je     ffffffff804c10bd <tcp_ack+0x5a6>
ffffffff804c10a8:        0 	8b 44 24 50          	mov    0x50(%rsp),%eax
ffffffff804c10ac:        2 	80 cc 20             	or     $0x20,%ah
ffffffff804c10af:        3 	41 f6 45 5d 01       	testb  $0x1,0x5d(%r13)
ffffffff804c10b4:        0 	0f 44 44 24 50       	cmove  0x50(%rsp),%eax
ffffffff804c10b9:        0 	89 44 24 50          	mov    %eax,0x50(%rsp)
ffffffff804c10bd:      444 	f6 44 24 50 14       	testb  $0x14,0x50(%rsp)
ffffffff804c10c2:      551 	0f 84 e1 01 00 00    	je     ffffffff804c12a9 <tcp_ack+0x792>
ffffffff804c10c8:        1 	f6 85 9c 04 00 00 01 	testb  $0x1,0x49c(%rbp)
ffffffff804c10cf:        2 	48 8b 9d 60 03 00 00 	mov    0x360(%rbp),%rbx
ffffffff804c10d6:      462 	74 17                	je     ffffffff804c10ef <tcp_ack+0x5d8>
ffffffff804c10d8:        0 	83 bd 98 04 00 00 00 	cmpl   $0x0,0x498(%rbp)
ffffffff804c10df:        0 	74 0e                	je     ffffffff804c10ef <tcp_ack+0x5d8>
ffffffff804c10e1:      451 	8b 74 24 50          	mov    0x50(%rsp),%esi
ffffffff804c10e5:       43 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c10e8:        0 	e8 ea e8 ff ff       	callq  ffffffff804bf9d7 <tcp_ack_saw_tstamp>
ffffffff804c10ed:       66 	eb 47                	jmp    ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10ef:        0 	83 7c 24 4c 00       	cmpl   $0x0,0x4c(%rsp)
ffffffff804c10f4:        0 	78 40                	js     ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10f6:        0 	f6 44 24 50 08       	testb  $0x8,0x50(%rsp)
ffffffff804c10fb:        0 	75 39                	jne    ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10fd:        0 	8b 74 24 4c          	mov    0x4c(%rsp),%esi
ffffffff804c1101:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1104:        0 	e8 b5 e7 ff ff       	callq  ffffffff804bf8be <tcp_rtt_estimator>
ffffffff804c1109:        0 	8b 85 60 04 00 00    	mov    0x460(%rbp),%eax
ffffffff804c110f:        0 	c6 85 7b 03 00 00 00 	movb   $0x0,0x37b(%rbp)
ffffffff804c1116:        0 	c1 e8 03             	shr    $0x3,%eax
ffffffff804c1119:        0 	03 85 6c 04 00 00    	add    0x46c(%rbp),%eax
ffffffff804c111f:        0 	3d 30 75 00 00       	cmp    $0x7530,%eax
ffffffff804c1124:        0 	89 85 58 03 00 00    	mov    %eax,0x358(%rbp)
ffffffff804c112a:        0 	76 0a                	jbe    ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c112c:        0 	c7 85 58 03 00 00 30 	movl   $0x7530,0x358(%rbp)
ffffffff804c1133:        0 	75 00 00 
ffffffff804c1136:      732 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c113d:     1833 	75 0f                	jne    ffffffff804c114e <tcp_ack+0x637>
ffffffff804c113f:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1144:      493 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1147:        0 	e8 07 d7 ff ff       	callq  ffffffff804be853 <inet_csk_clear_xmit_timer>
ffffffff804c114c:        0 	eb 18                	jmp    ffffffff804c1166 <tcp_ack+0x64f>
ffffffff804c114e:        0 	8b 95 58 03 00 00    	mov    0x358(%rbp),%edx
ffffffff804c1154:        0 	b9 30 75 00 00       	mov    $0x7530,%ecx
ffffffff804c1159:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c115e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1161:        0 	e8 7d e4 ff ff       	callq  ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c1166:      881 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c116c:      845 	c0 e8 04             	shr    $0x4,%al
ffffffff804c116f:        1 	75 63                	jne    ffffffff804c11d4 <tcp_ack+0x6bd>
ffffffff804c1171:        0 	83 7c 24 38 00       	cmpl   $0x0,0x38(%rsp)
ffffffff804c1176:        0 	7e 29                	jle    ffffffff804c11a1 <tcp_ack+0x68a>
ffffffff804c1178:        0 	8b 44 24 38          	mov    0x38(%rsp),%eax
ffffffff804c117c:        0 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c1182:        0 	ff c8                	dec    %eax
ffffffff804c1184:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c1186:        0 	72 0c                	jb     ffffffff804c1194 <tcp_ack+0x67d>
ffffffff804c1188:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c118f:        0 	00 00 00 
ffffffff804c1192:        0 	eb 0d                	jmp    ffffffff804c11a1 <tcp_ack+0x68a>
ffffffff804c1194:        0 	8d 42 01             	lea    0x1(%rdx),%eax
ffffffff804c1197:        0 	2b 44 24 38          	sub    0x38(%rsp),%eax
ffffffff804c119b:        0 	89 85 d0 04 00 00    	mov    %eax,0x4d0(%rbp)
ffffffff804c11a1:        0 	8b 74 24 38          	mov    0x38(%rsp),%esi
ffffffff804c11a5:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c11a8:        0 	e8 2d dd ff ff       	callq  ffffffff804beeda <tcp_check_reno_reordering>
ffffffff804c11ad:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c11b3:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c11b9:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c11bf:        0 	76 5e                	jbe    ffffffff804c121f <tcp_ack+0x708>
ffffffff804c11c1:        0 	be b0 06 00 00       	mov    $0x6b0,%esi
ffffffff804c11c6:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c11cd:        0 	e8 e3 4f d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c11d2:        0 	eb 4b                	jmp    ffffffff804c121f <tcp_ack+0x708>
ffffffff804c11d4:      414 	8b 44 24 20          	mov    0x20(%rsp),%eax
ffffffff804c11d8:     1591 	39 44 24 34          	cmp    %eax,0x34(%rsp)
ffffffff804c11dc:        2 	73 14                	jae    ffffffff804c11f2 <tcp_ack+0x6db>
ffffffff804c11de:        0 	8b b5 d4 04 00 00    	mov    0x4d4(%rbp),%esi
ffffffff804c11e4:        0 	2b 74 24 34          	sub    0x34(%rsp),%esi
ffffffff804c11e8:        0 	31 d2                	xor    %edx,%edx
ffffffff804c11ea:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c11ed:        0 	e8 9c db ff ff       	callq  ffffffff804bed8e <tcp_update_reordering>
ffffffff804c11f2:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c11f8:      865 	c0 e8 04             	shr    $0x4,%al
ffffffff804c11fb:        3 	a8 02                	test   $0x2,%al
ffffffff804c11fd:        0 	8b 85 60 05 00 00    	mov    0x560(%rbp),%eax
ffffffff804c1203:      453 	74 06                	je     ffffffff804c120b <tcp_ack+0x6f4>
ffffffff804c1205:        8 	2b 44 24 38          	sub    0x38(%rsp),%eax
ffffffff804c1209:        0 	eb 0e                	jmp    ffffffff804c1219 <tcp_ack+0x702>
ffffffff804c120b:        0 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c1211:        0 	29 54 24 40          	sub    %edx,0x40(%rsp)
ffffffff804c1215:        0 	2b 44 24 40          	sub    0x40(%rsp),%eax
ffffffff804c1219:      423 	89 85 60 05 00 00    	mov    %eax,0x560(%rbp)
ffffffff804c121f:      492 	8b 85 d4 04 00 00    	mov    0x4d4(%rbp),%eax
ffffffff804c1225:      489 	39 44 24 38          	cmp    %eax,0x38(%rsp)
ffffffff804c1229:        0 	8b 54 24 38          	mov    0x38(%rsp),%edx
ffffffff804c122d:        0 	0f 47 d0             	cmova  %eax,%edx
ffffffff804c1230:      438 	29 d0                	sub    %edx,%eax
ffffffff804c1232:        0 	89 85 d4 04 00 00    	mov    %eax,0x4d4(%rbp)
ffffffff804c1238:        1 	48 83 7b 58 00       	cmpq   $0x0,0x58(%rbx)
ffffffff804c123d:      446 	74 6a                	je     ffffffff804c12a9 <tcp_ack+0x792>
ffffffff804c123f:        0 	f6 44 24 50 08       	testb  $0x8,0x50(%rsp)
ffffffff804c1244:        3 	75 54                	jne    ffffffff804c129a <tcp_ack+0x783>
ffffffff804c1246:      441 	f6 43 10 02          	testb  $0x2,0x10(%rbx)
ffffffff804c124a:        8 	74 3f                	je     ffffffff804c128b <tcp_ack+0x774>
ffffffff804c124c:        0 	e8 33 e4 ff ff       	callq  ffffffff804bf684 <net_invalid_timestamp>
ffffffff804c1251:        0 	48 39 44 24 08       	cmp    %rax,0x8(%rsp)
ffffffff804c1256:        0 	74 33                	je     ffffffff804c128b <tcp_ack+0x774>
ffffffff804c1258:        0 	e8 17 8b d8 ff       	callq  ffffffff80249d74 <ktime_get_real>
ffffffff804c125d:        0 	48 89 c7             	mov    %rax,%rdi
ffffffff804c1260:        0 	48 2b 7c 24 08       	sub    0x8(%rsp),%rdi
ffffffff804c1265:        0 	e8 e3 8e d7 ff       	callq  ffffffff8023a14d <ns_to_timeval>
ffffffff804c126a:        0 	48 89 44 24 60       	mov    %rax,0x60(%rsp)
ffffffff804c126f:        0 	48 89 44 24 70       	mov    %rax,0x70(%rsp)
ffffffff804c1274:        0 	48 69 c0 40 42 0f 00 	imul   $0xf4240,%rax,%rax
ffffffff804c127b:        0 	48 89 54 24 78       	mov    %rdx,0x78(%rsp)
ffffffff804c1280:        0 	48 89 54 24 68       	mov    %rdx,0x68(%rsp)
ffffffff804c1285:        0 	03 44 24 78          	add    0x78(%rsp),%eax
ffffffff804c1289:        0 	eb 12                	jmp    ffffffff804c129d <tcp_ack+0x786>
ffffffff804c128b:       89 	45 85 e4             	test   %r12d,%r12d
ffffffff804c128e:      414 	7e 0a                	jle    ffffffff804c129a <tcp_ack+0x783>
ffffffff804c1290:        0 	49 63 fc             	movslq %r12d,%rdi
ffffffff804c1293:       65 	e8 a8 8b d7 ff       	callq  ffffffff80239e40 <jiffies_to_usecs>
ffffffff804c1298:        0 	eb 03                	jmp    ffffffff804c129d <tcp_ack+0x786>
ffffffff804c129a:        0 	83 c8 ff             	or     $0xffffffffffffffff,%eax
ffffffff804c129d:     1136 	89 c2                	mov    %eax,%edx
ffffffff804c129f:        7 	8b 74 24 38          	mov    0x38(%rsp),%esi
ffffffff804c12a3:      444 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c12a6:        1 	ff 53 58             	callq  *0x58(%rbx)
ffffffff804c12a9:      305 	83 bd d0 04 00 00 00 	cmpl   $0x0,0x4d0(%rbp)
ffffffff804c12b0:      518 	79 11                	jns    ffffffff804c12c3 <tcp_ack+0x7ac>
ffffffff804c12b2:        0 	be ac 0b 00 00       	mov    $0xbac,%esi
ffffffff804c12b7:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c12be:        0 	e8 f2 4e d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12c3:      415 	83 bd cc 04 00 00 00 	cmpl   $0x0,0x4cc(%rbp)
ffffffff804c12ca:     2204 	79 11                	jns    ffffffff804c12dd <tcp_ack+0x7c6>
ffffffff804c12cc:        0 	be ad 0b 00 00       	mov    $0xbad,%esi
ffffffff804c12d1:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c12d8:        0 	e8 d8 4e d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12dd:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c12e4:     1747 	79 11                	jns    ffffffff804c12f7 <tcp_ack+0x7e0>
ffffffff804c12e6:        0 	be ae 0b 00 00       	mov    $0xbae,%esi
ffffffff804c12eb:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c12f2:        0 	e8 be 4e d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12f7:        0 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c12fe:      878 	0f 85 86 00 00 00    	jne    ffffffff804c138a <tcp_ack+0x873>
ffffffff804c1304:     4721 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c130a:      968 	c0 e8 04             	shr    $0x4,%al
ffffffff804c130d:        2 	74 7b                	je     ffffffff804c138a <tcp_ack+0x873>
ffffffff804c130f:      171 	8b b5 cc 04 00 00    	mov    0x4cc(%rbp),%esi
ffffffff804c1315:      282 	85 f6                	test   %esi,%esi
ffffffff804c1317:        0 	74 1f                	je     ffffffff804c1338 <tcp_ack+0x821>
ffffffff804c1319:        0 	0f b6 95 78 03 00 00 	movzbl 0x378(%rbp),%edx
ffffffff804c1320:        0 	48 c7 c7 b2 d9 6a 80 	mov    $0xffffffff806ad9b2,%rdi
ffffffff804c1327:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1329:        0 	e8 46 5a d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c132e:        0 	c7 85 cc 04 00 00 00 	movl   $0x0,0x4cc(%rbp)
ffffffff804c1335:        0 	00 00 00 
ffffffff804c1338:      198 	8b b5 d0 04 00 00    	mov    0x4d0(%rbp),%esi
ffffffff804c133e:      257 	85 f6                	test   %esi,%esi
ffffffff804c1340:        0 	74 1f                	je     ffffffff804c1361 <tcp_ack+0x84a>
ffffffff804c1342:        0 	0f b6 95 78 03 00 00 	movzbl 0x378(%rbp),%edx
ffffffff804c1349:        0 	48 c7 c7 c3 d9 6a 80 	mov    $0xffffffff806ad9c3,%rdi
ffffffff804c1350:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1352:        0 	e8 1d 5a d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c1357:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c135e:        0 	00 00 00 
ffffffff804c1361:     2524 	8b b5 78 04 00 00    	mov    0x478(%rbp),%esi
ffffffff804c1367:     1825 	85 f6                	test   %esi,%esi
ffffffff804c1369:        0 	74 1f                	je     ffffffff804c138a <tcp_ack+0x873>
ffffffff804c136b:        0 	0f b6 95 78 03 00 00 	movzbl 0x378(%rbp),%edx
ffffffff804c1372:        0 	48 c7 c7 d4 d9 6a 80 	mov    $0xffffffff806ad9d4,%rdi
ffffffff804c1379:        0 	31 c0                	xor    %eax,%eax
ffffffff804c137b:        0 	e8 f4 59 d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c1380:        0 	c7 85 78 04 00 00 00 	movl   $0x0,0x478(%rbp)
ffffffff804c1387:        0 	00 00 00 
ffffffff804c138a:       46 	44 8b 64 24 50       	mov    0x50(%rsp),%r12d
ffffffff804c138f:     7369 	31 c9                	xor    %ecx,%ecx
ffffffff804c1391:      348 	44 0b 64 24 5c       	or     0x5c(%rsp),%r12d
ffffffff804c1396:        0 	80 bd 5e 04 00 00 00 	cmpb   $0x0,0x45e(%rbp)
ffffffff804c139d:       96 	0f 84 26 02 00 00    	je     ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c13a3:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c13a9:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c13af:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c13b5:        0 	76 11                	jbe    ffffffff804c13c8 <tcp_ack+0x8b1>
ffffffff804c13b7:        0 	be 58 0c 00 00       	mov    $0xc58,%esi
ffffffff804c13bc:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c13c3:        0 	e8 ed 4d d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c13c8:        0 	44 89 e3             	mov    %r12d,%ebx
ffffffff804c13cb:        0 	83 e3 04             	and    $0x4,%ebx
ffffffff804c13ce:        0 	74 07                	je     ffffffff804c13d7 <tcp_ack+0x8c0>
ffffffff804c13d0:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c13d7:        0 	41 f7 c4 00 10 00 00 	test   $0x1000,%r12d
ffffffff804c13de:        0 	75 0f                	jne    ffffffff804c13ef <tcp_ack+0x8d8>
ffffffff804c13e0:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c13e7:        0 	76 10                	jbe    ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13e9:        0 	41 f6 c4 08          	test   $0x8,%r12b
ffffffff804c13ed:        0 	74 0a                	je     ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13ef:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c13f6:        0 	00 00 00 
ffffffff804c13f9:        0 	8b 85 58 04 00 00    	mov    0x458(%rbp),%eax
ffffffff804c13ff:        0 	39 85 00 04 00 00    	cmp    %eax,0x400(%rbp)
ffffffff804c1405:        0 	78 12                	js     ffffffff804c1419 <tcp_ack+0x902>
ffffffff804c1407:        0 	31 f6                	xor    %esi,%esi
ffffffff804c1409:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c1410:        0 	40 0f 95 c6          	setne  %sil
ffffffff804c1414:        0 	83 c6 02             	add    $0x2,%esi
ffffffff804c1417:        0 	eb 37                	jmp    ffffffff804c1450 <tcp_ack+0x939>
ffffffff804c1419:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c141c:        0 	e8 e0 da ff ff       	callq  ffffffff804bef01 <tcp_is_sackfrto>
ffffffff804c1421:        0 	85 c0                	test   %eax,%eax
ffffffff804c1423:        0 	75 3b                	jne    ffffffff804c1460 <tcp_ack+0x949>
ffffffff804c1425:        0 	41 f7 c4 34 04 00 00 	test   $0x434,%r12d
ffffffff804c142c:        0 	75 0a                	jne    ffffffff804c1438 <tcp_ack+0x921>
ffffffff804c142e:        0 	41 f6 c4 17          	test   $0x17,%r12b
ffffffff804c1432:        0 	0f 85 8c 01 00 00    	jne    ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1438:        0 	85 db                	test   %ebx,%ebx
ffffffff804c143a:        0 	0f 85 8d 00 00 00    	jne    ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c1440:        0 	31 f6                	xor    %esi,%esi
ffffffff804c1442:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c1449:        0 	40 0f 95 c6          	setne  %sil
ffffffff804c144d:        0 	8d 34 76             	lea    (%rsi,%rsi,2),%esi
ffffffff804c1450:        0 	44 89 e2             	mov    %r12d,%edx
ffffffff804c1453:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1456:        0 	e8 b8 e7 ff ff       	callq  ffffffff804bfc13 <tcp_enter_frto_loss>
ffffffff804c145b:        0 	e9 64 01 00 00       	jmpq   ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1460:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1462:        0 	75 37                	jne    ffffffff804c149b <tcp_ack+0x984>
ffffffff804c1464:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c146b:        0 	75 2e                	jne    ffffffff804c149b <tcp_ack+0x984>
ffffffff804c146d:        0 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c1473:        0 	03 85 74 04 00 00    	add    0x474(%rbp),%eax
ffffffff804c1479:        0 	2b 85 d0 04 00 00    	sub    0x4d0(%rbp),%eax
ffffffff804c147f:        0 	8b 95 ac 04 00 00    	mov    0x4ac(%rbp),%edx
ffffffff804c1485:        0 	2b 85 cc 04 00 00    	sub    0x4cc(%rbp),%eax
ffffffff804c148b:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c148d:        0 	0f 47 c2             	cmova  %edx,%eax
ffffffff804c1490:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c1496:        0 	e9 29 01 00 00       	jmpq   ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c149b:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c14a2:        0 	76 29                	jbe    ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14a4:        0 	41 f6 c4 34          	test   $0x34,%r12b
ffffffff804c14a8:        0 	74 0f                	je     ffffffff804c14b9 <tcp_ack+0x9a2>
ffffffff804c14aa:        0 	44 89 e0             	mov    %r12d,%eax
ffffffff804c14ad:        0 	25 20 02 00 00       	and    $0x220,%eax
ffffffff804c14b2:        0 	83 f8 20             	cmp    $0x20,%eax
ffffffff804c14b5:        0 	75 16                	jne    ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14b7:        0 	eb 0a                	jmp    ffffffff804c14c3 <tcp_ack+0x9ac>
ffffffff804c14b9:        0 	41 f6 c4 17          	test   $0x17,%r12b
ffffffff804c14bd:        0 	0f 85 01 01 00 00    	jne    ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c14c3:        0 	44 89 e2             	mov    %r12d,%edx
ffffffff804c14c6:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c14cb:        0 	eb 86                	jmp    ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c14cd:        0 	80 bd 5e 04 00 00 01 	cmpb   $0x1,0x45e(%rbp)
ffffffff804c14d4:        0 	75 45                	jne    ffffffff804c151b <tcp_ack+0xa04>
ffffffff804c14d6:        0 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c14dc:        0 	03 85 74 04 00 00    	add    0x474(%rbp),%eax
ffffffff804c14e2:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c14e5:        0 	c6 85 5e 04 00 00 02 	movb   $0x2,0x45e(%rbp)
ffffffff804c14ec:        0 	83 c0 02             	add    $0x2,%eax
ffffffff804c14ef:        0 	2b 85 cc 04 00 00    	sub    0x4cc(%rbp),%eax
ffffffff804c14f5:        0 	2b 85 d0 04 00 00    	sub    0x4d0(%rbp),%eax
ffffffff804c14fb:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c1501:        0 	e8 0a 3e 00 00       	callq  ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1506:        0 	85 c0                	test   %eax,%eax
ffffffff804c1508:        0 	0f 85 b6 00 00 00    	jne    ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c150e:        0 	44 89 e2             	mov    %r12d,%edx
ffffffff804c1511:        0 	be 02 00 00 00       	mov    $0x2,%esi
ffffffff804c1516:        0 	e9 38 ff ff ff       	jmpq   ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c151b:        0 	8b 05 3f 6f 3f 00    	mov    0x3f6f3f(%rip),%eax        # ffffffff808b8460 <sysctl_tcp_frto_response>
ffffffff804c1521:        0 	83 f8 01             	cmp    $0x1,%eax
ffffffff804c1524:        0 	74 1a                	je     ffffffff804c1540 <tcp_ack+0xa29>
ffffffff804c1526:        0 	83 f8 02             	cmp    $0x2,%eax
ffffffff804c1529:        0 	75 5d                	jne    ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c152b:        0 	41 f6 c4 40          	test   $0x40,%r12b
ffffffff804c152f:        0 	75 57                	jne    ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c1531:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1536:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1539:        0 	e8 5a db ff ff       	callq  ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c153e:        0 	eb 50                	jmp    ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1540:        0 	8b 85 ac 04 00 00    	mov    0x4ac(%rbp),%eax
ffffffff804c1546:        0 	8b 95 a8 04 00 00    	mov    0x4a8(%rbp),%edx
ffffffff804c154c:        0 	c7 85 b0 04 00 00 00 	movl   $0x0,0x4b0(%rbp)
ffffffff804c1553:        0 	00 00 00 
ffffffff804c1556:        0 	c7 85 dc 04 00 00 00 	movl   $0x0,0x4dc(%rbp)
ffffffff804c155d:        0 	00 00 00 
ffffffff804c1560:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c1562:        0 	0f 46 c2             	cmovbe %edx,%eax
ffffffff804c1565:        0 	89 85 ac 04 00 00    	mov    %eax,0x4ac(%rbp)
ffffffff804c156b:        0 	8a 85 7e 04 00 00    	mov    0x47e(%rbp),%al
ffffffff804c1571:        0 	a8 01                	test   $0x1,%al
ffffffff804c1573:        0 	74 09                	je     ffffffff804c157e <tcp_ack+0xa67>
ffffffff804c1575:        0 	83 c8 02             	or     $0x2,%eax
ffffffff804c1578:        0 	88 85 7e 04 00 00    	mov    %al,0x47e(%rbp)
ffffffff804c157e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1581:        0 	e8 27 da ff ff       	callq  ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1586:        0 	eb 08                	jmp    ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1588:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c158b:        0 	e8 78 dd ff ff       	callq  ffffffff804bf308 <tcp_ratehalving_spur_to_response>
ffffffff804c1590:        0 	c6 85 5e 04 00 00 00 	movb   $0x0,0x45e(%rbp)
ffffffff804c1597:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c159e:        0 	00 00 00 
ffffffff804c15a1:        0 	31 c9                	xor    %ecx,%ecx
ffffffff804c15a3:        0 	48 8b 05 0e 01 5f 00 	mov    0x5f010e(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c15aa:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c15b1:        0 	00 
ffffffff804c15b2:        0 	89 d2                	mov    %edx,%edx
ffffffff804c15b4:        0 	48 f7 d0             	not    %rax
ffffffff804c15b7:        0 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c15bb:        0 	48 ff 80 28 02 00 00 	incq   0x228(%rax)
ffffffff804c15c2:        0 	eb 05                	jmp    ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c15c4:        0 	b9 01 00 00 00       	mov    $0x1,%ecx
ffffffff804c15c9:      466 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c15cf:     5645 	39 95 58 04 00 00    	cmp    %edx,0x458(%rbp)
ffffffff804c15d5:      176 	79 0a                	jns    ffffffff804c15e1 <tcp_ack+0xaca>
ffffffff804c15d7:       24 	c7 85 58 04 00 00 00 	movl   $0x0,0x458(%rbp)
ffffffff804c15de:        0 	00 00 00 
ffffffff804c15e1:      620 	8b 54 24 2c          	mov    0x2c(%rsp),%edx
ffffffff804c15e5:      639 	03 54 24 30          	add    0x30(%rsp),%edx
ffffffff804c15e9:        2 	44 89 e3             	mov    %r12d,%ebx
ffffffff804c15ec:      283 	2b 54 24 28          	sub    0x28(%rsp),%edx
ffffffff804c15f0:      154 	2b 54 24 24          	sub    0x24(%rsp),%edx
ffffffff804c15f4:        0 	83 e3 17             	and    $0x17,%ebx
ffffffff804c15f7:      266 	89 5c 24 54          	mov    %ebx,0x54(%rsp)
ffffffff804c15fb:      168 	74 13                	je     ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c15fd:        0 	41 f6 c4 60          	test   $0x60,%r12b
ffffffff804c1601:     6575 	75 0d                	jne    ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c1603:       20 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c160a:     1417 	0f 84 3a 09 00 00    	je     ffffffff804c1f4a <tcp_ack+0x1433>
ffffffff804c1610:        0 	44 89 e0             	mov    %r12d,%eax
ffffffff804c1613:        0 	c1 e8 02             	shr    $0x2,%eax
ffffffff804c1616:        0 	88 c3                	mov    %al,%bl
ffffffff804c1618:        0 	80 e3 01             	and    $0x1,%bl
ffffffff804c161b:        0 	41 88 de             	mov    %bl,%r14b
ffffffff804c161e:        0 	74 36                	je     ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1620:        0 	85 c9                	test   %ecx,%ecx
ffffffff804c1622:        0 	75 32                	jne    ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1624:        0 	41 f6 c4 40          	test   $0x40,%r12b
ffffffff804c1628:        0 	74 0e                	je     ffffffff804c1638 <tcp_ack+0xb21>
ffffffff804c162a:        0 	8b 85 a8 04 00 00    	mov    0x4a8(%rbp),%eax
ffffffff804c1630:        0 	39 85 ac 04 00 00    	cmp    %eax,0x4ac(%rbp)
ffffffff804c1636:        0 	73 1e                	jae    ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1638:        0 	0f b6 8d 78 03 00 00 	movzbl 0x378(%rbp),%ecx
ffffffff804c163f:        0 	b8 0c 00 00 00       	mov    $0xc,%eax
ffffffff804c1644:        0 	d3 f8                	sar    %cl,%eax
ffffffff804c1646:        0 	a8 01                	test   $0x1,%al
ffffffff804c1648:        0 	75 0c                	jne    ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c164a:        0 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c164e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1651:        0 	e8 6b dc ff ff       	callq  ffffffff804bf2c1 <tcp_cong_avoid>
ffffffff804c1656:        0 	31 db                	xor    %ebx,%ebx
ffffffff804c1658:        0 	41 f7 c4 17 04 00 00 	test   $0x417,%r12d
ffffffff804c165f:        0 	44 8b bd 74 04 00 00 	mov    0x474(%rbp),%r15d
ffffffff804c1666:        0 	0f 94 c3             	sete   %bl
ffffffff804c1669:        0 	41 bd 01 00 00 00    	mov    $0x1,%r13d
ffffffff804c166f:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1671:        0 	75 21                	jne    ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c1673:        0 	45 30 ed             	xor    %r13b,%r13b
ffffffff804c1676:        0 	41 f6 c4 20          	test   $0x20,%r12b
ffffffff804c167a:        0 	74 18                	je     ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c167c:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c167f:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c1682:        0 	e8 cf d8 ff ff       	callq  ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1687:        0 	0f b6 95 7f 04 00 00 	movzbl 0x47f(%rbp),%edx
ffffffff804c168e:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c1690:        0 	41 0f 9f c5          	setg   %r13b
ffffffff804c1694:        0 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c169b:        0 	75 24                	jne    ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c169d:        0 	83 bd d0 04 00 00 00 	cmpl   $0x0,0x4d0(%rbp)
ffffffff804c16a4:        0 	74 1b                	je     ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c16a6:        0 	be 16 0a 00 00       	mov    $0xa16,%esi
ffffffff804c16ab:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c16b2:        0 	e8 fe 4a d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16b7:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c16be:        0 	00 00 00 
ffffffff804c16c1:        0 	83 bd d0 04 00 00 00 	cmpl   $0x0,0x4d0(%rbp)
ffffffff804c16c8:        0 	75 24                	jne    ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16ca:        0 	83 bd d4 04 00 00 00 	cmpl   $0x0,0x4d4(%rbp)
ffffffff804c16d1:        0 	74 1b                	je     ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16d3:        0 	be 18 0a 00 00       	mov    $0xa18,%esi
ffffffff804c16d8:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c16df:        0 	e8 d1 4a d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16e4:        0 	c7 85 d4 04 00 00 00 	movl   $0x0,0x4d4(%rbp)
ffffffff804c16eb:        0 	00 00 00 
ffffffff804c16ee:        0 	44 89 e0             	mov    %r12d,%eax
ffffffff804c16f1:        0 	83 e0 40             	and    $0x40,%eax
ffffffff804c16f4:        0 	89 44 24 58          	mov    %eax,0x58(%rsp)
ffffffff804c16f8:        0 	74 0a                	je     ffffffff804c1704 <tcp_ack+0xbed>
ffffffff804c16fa:        0 	c7 85 6c 05 00 00 00 	movl   $0x0,0x56c(%rbp)
ffffffff804c1701:        0 	00 00 00 
ffffffff804c1704:        0 	41 f7 c4 00 20 00 00 	test   $0x2000,%r12d
ffffffff804c170b:        0 	0f 84 50 08 00 00    	je     ffffffff804c1f61 <tcp_ack+0x144a>
ffffffff804c1711:        0 	48 8b 15 a0 ff 5e 00 	mov    0x5effa0(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1718:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c171b:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1720:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c1727:        0 	00 
ffffffff804c1728:        0 	89 c0                	mov    %eax,%eax
ffffffff804c172a:        0 	48 f7 d2             	not    %rdx
ffffffff804c172d:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c1731:        0 	48 ff 80 00 01 00 00 	incq   0x100(%rax)
ffffffff804c1738:        0 	e8 df e2 ff ff       	callq  ffffffff804bfa1c <tcp_enter_loss>
ffffffff804c173d:        0 	48 8b b5 c0 00 00 00 	mov    0xc0(%rbp),%rsi
ffffffff804c1744:        0 	fe 85 79 03 00 00    	incb   0x379(%rbp)
ffffffff804c174a:        0 	48 8d 85 c0 00 00 00 	lea    0xc0(%rbp),%rax
ffffffff804c1751:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1754:        0 	48 39 c6             	cmp    %rax,%rsi
ffffffff804c1757:        0 	b8 00 00 00 00       	mov    $0x0,%eax
ffffffff804c175c:        0 	48 0f 44 f0          	cmove  %rax,%rsi
ffffffff804c1760:        0 	e8 2d 4b 00 00       	callq  ffffffff804c6292 <tcp_retransmit_skb>
ffffffff804c1765:        0 	8b 95 58 03 00 00    	mov    0x358(%rbp),%edx
ffffffff804c176b:        0 	b9 30 75 00 00       	mov    $0x7530,%ecx
ffffffff804c1770:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1775:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1778:        0 	e8 66 de ff ff       	callq  ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c177d:        0 	e9 dd 06 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1782:        0 	45 84 e4             	test   %r12b,%r12b
ffffffff804c1785:        0 	79 51                	jns    ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1787:        0 	8b 95 70 05 00 00    	mov    0x570(%rbp),%edx
ffffffff804c178d:        0 	39 95 00 04 00 00    	cmp    %edx,0x400(%rbp)
ffffffff804c1793:        0 	79 43                	jns    ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1795:        0 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c179c:        0 	74 3a                	je     ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c179e:        0 	0f b6 85 7f 04 00 00 	movzbl 0x47f(%rbp),%eax
ffffffff804c17a5:        0 	8b b5 d4 04 00 00    	mov    0x4d4(%rbp),%esi
ffffffff804c17ab:        0 	39 c6                	cmp    %eax,%esi
ffffffff804c17ad:        0 	76 29                	jbe    ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c17af:        0 	29 c6                	sub    %eax,%esi
ffffffff804c17b1:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c17b4:        0 	e8 58 e6 ff ff       	callq  ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c17b9:        0 	48 8b 05 f8 fe 5e 00 	mov    0x5efef8(%rip),%rax        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c17c0:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c17c7:        0 	00 
ffffffff804c17c8:        0 	89 d2                	mov    %edx,%edx
ffffffff804c17ca:        0 	48 f7 d0             	not    %rax
ffffffff804c17cd:        0 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c17d1:        0 	48 ff 80 48 01 00 00 	incq   0x148(%rax)
ffffffff804c17d8:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c17de:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c17e4:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c17ea:        0 	76 11                	jbe    ffffffff804c17fd <tcp_ack+0xce6>
ffffffff804c17ec:        0 	be 2e 0a 00 00       	mov    $0xa2e,%esi
ffffffff804c17f1:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c17f8:        0 	e8 b8 49 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c17fd:        0 	8a 85 78 03 00 00    	mov    0x378(%rbp),%al
ffffffff804c1803:        0 	84 c0                	test   %al,%al
ffffffff804c1805:        0 	75 29                	jne    ffffffff804c1830 <tcp_ack+0xd19>
ffffffff804c1807:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c180e:        0 	74 11                	je     ffffffff804c1821 <tcp_ack+0xd0a>
ffffffff804c1810:        0 	be 33 0a 00 00       	mov    $0xa33,%esi
ffffffff804c1815:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c181c:        0 	e8 94 49 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1821:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c1828:        0 	00 00 00 
ffffffff804c182b:        0 	e9 c4 00 00 00       	jmpq   ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1830:        0 	8b 8d 70 05 00 00    	mov    0x570(%rbp),%ecx
ffffffff804c1836:        0 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c183c:        0 	39 ca                	cmp    %ecx,%edx
ffffffff804c183e:        0 	0f 88 b0 00 00 00    	js     ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1844:        0 	3c 02                	cmp    $0x2,%al
ffffffff804c1846:        0 	74 31                	je     ffffffff804c1879 <tcp_ack+0xd62>
ffffffff804c1848:        0 	77 0a                	ja     ffffffff804c1854 <tcp_ack+0xd3d>
ffffffff804c184a:        0 	fe c8                	dec    %al
ffffffff804c184c:        0 	0f 85 a2 00 00 00    	jne    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1852:        0 	eb 33                	jmp    ffffffff804c1887 <tcp_ack+0xd70>
ffffffff804c1854:        0 	3c 03                	cmp    $0x3,%al
ffffffff804c1856:        0 	74 6f                	je     ffffffff804c18c7 <tcp_ack+0xdb0>
ffffffff804c1858:        0 	3c 04                	cmp    $0x4,%al
ffffffff804c185a:        0 	0f 85 94 00 00 00    	jne    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1860:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c1867:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c186a:        0 	e8 fb d8 ff ff       	callq  ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c186f:        0 	85 c0                	test   %eax,%eax
ffffffff804c1871:        0 	0f 85 e8 05 00 00    	jne    ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1877:        0 	eb 7b                	jmp    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1879:        0 	39 ca                	cmp    %ecx,%edx
ffffffff804c187b:        0 	74 77                	je     ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c187d:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1880:        0 	e8 b8 d9 ff ff       	callq  ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c1885:        0 	eb 34                	jmp    ffffffff804c18bb <tcp_ack+0xda4>
ffffffff804c1887:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c188a:        0 	e8 63 d9 ff ff       	callq  ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c188f:        0 	83 bd 78 05 00 00 00 	cmpl   $0x0,0x578(%rbp)
ffffffff804c1896:        0 	74 19                	je     ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c1898:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c189e:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c18a1:        0 	74 0e                	je     ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c18a3:        0 	8b 85 70 05 00 00    	mov    0x570(%rbp),%eax
ffffffff804c18a9:        0 	39 85 00 04 00 00    	cmp    %eax,0x400(%rbp)
ffffffff804c18af:        0 	74 43                	je     ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18b1:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c18b8:        0 	00 00 00 
ffffffff804c18bb:        0 	31 f6                	xor    %esi,%esi
ffffffff804c18bd:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c18c0:        0 	e8 b4 cf ff ff       	callq  ffffffff804be879 <tcp_set_ca_state>
ffffffff804c18c5:        0 	eb 2d                	jmp    ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18c7:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c18cd:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c18d0:        0 	75 0a                	jne    ffffffff804c18dc <tcp_ack+0xdc5>
ffffffff804c18d2:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c18d9:        0 	00 00 00 
ffffffff804c18dc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c18df:        0 	e8 86 d8 ff ff       	callq  ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c18e4:        0 	85 c0                	test   %eax,%eax
ffffffff804c18e6:        0 	0f 85 73 05 00 00    	jne    ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c18ec:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c18ef:        0 	e8 49 d9 ff ff       	callq  ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c18f4:        0 	8a 85 78 03 00 00    	mov    0x378(%rbp),%al
ffffffff804c18fa:        0 	3c 03                	cmp    $0x3,%al
ffffffff804c18fc:        0 	74 0d                	je     ffffffff804c190b <tcp_ack+0xdf4>
ffffffff804c18fe:        0 	3c 04                	cmp    $0x4,%al
ffffffff804c1900:        0 	0f 85 b8 01 00 00    	jne    ffffffff804c1abe <tcp_ack+0xfa7>
ffffffff804c1906:        0 	e9 c4 00 00 00       	jmpq   ffffffff804c19cf <tcp_ack+0xeb8>
ffffffff804c190b:        0 	41 f7 c4 00 04 00 00 	test   $0x400,%r12d
ffffffff804c1912:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1918:        0 	75 1e                	jne    ffffffff804c1938 <tcp_ack+0xe21>
ffffffff804c191a:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c191d:        0 	0f 85 fd 03 00 00    	jne    ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1923:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1925:        0 	0f 84 f5 03 00 00    	je     ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c192b:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c192e:        0 	e8 54 dd ff ff       	callq  ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1933:        0 	e9 e8 03 00 00       	jmpq   ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1938:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c193b:        0 	41 bd 01 00 00 00    	mov    $0x1,%r13d
ffffffff804c1941:        0 	74 18                	je     ffffffff804c195b <tcp_ack+0xe44>
ffffffff804c1943:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1946:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c1949:        0 	e8 08 d6 ff ff       	callq  ffffffff804bef56 <tcp_fackets_out>
ffffffff804c194e:        0 	0f b6 95 7f 04 00 00 	movzbl 0x47f(%rbp),%edx
ffffffff804c1955:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c1957:        0 	41 0f 9f c5          	setg   %r13b
ffffffff804c195b:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c195e:        0 	e8 c9 d7 ff ff       	callq  ffffffff804bf12c <tcp_may_undo>
ffffffff804c1963:        0 	85 c0                	test   %eax,%eax
ffffffff804c1965:        0 	0f 84 b5 03 00 00    	je     ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c196b:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c1972:        0 	75 0a                	jne    ffffffff804c197e <tcp_ack+0xe67>
ffffffff804c1974:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c197b:        0 	00 00 00 
ffffffff804c197e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1981:        0 	45 31 ed             	xor    %r13d,%r13d
ffffffff804c1984:        0 	e8 cd d5 ff ff       	callq  ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1989:        0 	44 29 7c 24 14       	sub    %r15d,0x14(%rsp)
ffffffff804c198e:        0 	ba 01 00 00 00       	mov    $0x1,%edx
ffffffff804c1993:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1996:        0 	8b 74 24 14          	mov    0x14(%rsp),%esi
ffffffff804c199a:        0 	01 c6                	add    %eax,%esi
ffffffff804c199c:        0 	e8 ed d3 ff ff       	callq  ffffffff804bed8e <tcp_update_reordering>
ffffffff804c19a1:        0 	31 f6                	xor    %esi,%esi
ffffffff804c19a3:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c19a6:        0 	e8 ed d6 ff ff       	callq  ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c19ab:        0 	48 8b 15 06 fd 5e 00 	mov    0x5efd06(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c19b2:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c19b9:        0 	00 
ffffffff804c19ba:        0 	89 c0                	mov    %eax,%eax
ffffffff804c19bc:        0 	48 f7 d2             	not    %rdx
ffffffff804c19bf:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c19c3:        0 	48 ff 80 30 01 00 00 	incq   0x130(%rax)
ffffffff804c19ca:        0 	e9 51 03 00 00       	jmpq   ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c19cf:        0 	45 84 f6             	test   %r14b,%r14b
ffffffff804c19d2:        0 	74 07                	je     ffffffff804c19db <tcp_ack+0xec4>
ffffffff804c19d4:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c19db:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c19e1:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c19e4:        0 	75 13                	jne    ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19e6:        0 	41 f7 c4 00 04 00 00 	test   $0x400,%r12d
ffffffff804c19ed:        0 	74 0a                	je     ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19ef:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c19f6:        0 	00 00 00 
ffffffff804c19f9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c19fc:        0 	e8 2b d7 ff ff       	callq  ffffffff804bf12c <tcp_may_undo>
ffffffff804c1a01:        0 	85 c0                	test   %eax,%eax
ffffffff804c1a03:        0 	0f 84 6e 05 00 00    	je     ffffffff804c1f77 <tcp_ack+0x1460>
ffffffff804c1a09:        0 	48 8b 95 c0 00 00 00 	mov    0xc0(%rbp),%rdx
ffffffff804c1a10:        0 	48 8d 8d c0 00 00 00 	lea    0xc0(%rbp),%rcx
ffffffff804c1a17:        0 	eb 10                	jmp    ffffffff804c1a29 <tcp_ack+0xf12>
ffffffff804c1a19:        0 	48 3b 95 d8 01 00 00 	cmp    0x1d8(%rbp),%rdx
ffffffff804c1a20:        0 	74 12                	je     ffffffff804c1a34 <tcp_ack+0xf1d>
ffffffff804c1a22:        0 	80 62 5d fb          	andb   $0xfb,0x5d(%rdx)
ffffffff804c1a26:        0 	48 8b 12             	mov    (%rdx),%rdx
ffffffff804c1a29:        0 	48 8b 02             	mov    (%rdx),%rax
ffffffff804c1a2c:        0 	48 39 ca             	cmp    %rcx,%rdx
ffffffff804c1a2f:        0 	0f 18 08             	prefetcht0 (%rax)
ffffffff804c1a32:        0 	75 e5                	jne    ffffffff804c1a19 <tcp_ack+0xf02>
ffffffff804c1a34:        0 	48 c7 85 e0 04 00 00 	movq   $0x0,0x4e0(%rbp)
ffffffff804c1a3b:        0 	00 00 00 00 
ffffffff804c1a3f:        0 	48 c7 85 e8 04 00 00 	movq   $0x0,0x4e8(%rbp)
ffffffff804c1a46:        0 	00 00 00 00 
ffffffff804c1a4a:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1a4f:        0 	48 c7 85 f0 04 00 00 	movq   $0x0,0x4f0(%rbp)
ffffffff804c1a56:        0 	00 00 00 00 
ffffffff804c1a5a:        0 	c7 85 cc 04 00 00 00 	movl   $0x0,0x4cc(%rbp)
ffffffff804c1a61:        0 	00 00 00 
ffffffff804c1a64:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1a67:        0 	e8 2c d6 ff ff       	callq  ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c1a6c:        0 	48 8b 15 45 fc 5e 00 	mov    0x5efc45(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1a73:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c1a7a:        0 	00 
ffffffff804c1a7b:        0 	89 c0                	mov    %eax,%eax
ffffffff804c1a7d:        0 	48 f7 d2             	not    %rdx
ffffffff804c1a80:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c1a84:        0 	48 ff 80 40 01 00 00 	incq   0x140(%rax)
ffffffff804c1a8b:        0 	c6 85 79 03 00 00 00 	movb   $0x0,0x379(%rbp)
ffffffff804c1a92:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1a98:        0 	c7 85 78 05 00 00 00 	movl   $0x0,0x578(%rbp)
ffffffff804c1a9f:        0 	00 00 00 
ffffffff804c1aa2:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1aa5:        0 	74 0a                	je     ffffffff804c1ab1 <tcp_ack+0xf9a>
ffffffff804c1aa7:        0 	31 f6                	xor    %esi,%esi
ffffffff804c1aa9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1aac:        0 	e8 c8 cd ff ff       	callq  ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1ab1:        0 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c1ab8:        0 	0f 85 a1 03 00 00    	jne    ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1abe:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1ac4:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1ac7:        0 	75 1f                	jne    ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ac9:        0 	41 f7 c4 00 04 00 00 	test   $0x400,%r12d
ffffffff804c1ad0:        0 	74 0a                	je     ffffffff804c1adc <tcp_ack+0xfc5>
ffffffff804c1ad2:        0 	c7 85 d0 04 00 00 00 	movl   $0x0,0x4d0(%rbp)
ffffffff804c1ad9:        0 	00 00 00 
ffffffff804c1adc:        0 	85 db                	test   %ebx,%ebx
ffffffff804c1ade:        0 	74 08                	je     ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ae0:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1ae3:        0 	e8 9f db ff ff       	callq  ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1ae8:        0 	80 bd 78 03 00 00 01 	cmpb   $0x1,0x378(%rbp)
ffffffff804c1aef:        0 	75 08                	jne    ffffffff804c1af9 <tcp_ack+0xfe2>
ffffffff804c1af1:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1af4:        0 	e8 f9 d6 ff ff       	callq  ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c1af9:        0 	80 bd 5e 04 00 00 00 	cmpb   $0x0,0x45e(%rbp)
ffffffff804c1b00:        0 	0f 85 90 00 00 00    	jne    ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b06:        0 	83 bd cc 04 00 00 00 	cmpl   $0x0,0x4cc(%rbp)
ffffffff804c1b0d:        0 	0f 85 79 04 00 00    	jne    ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b13:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1b19:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1b1c:        0 	a8 02                	test   $0x2,%al
ffffffff804c1b1e:        0 	74 08                	je     ffffffff804c1b28 <tcp_ack+0x1011>
ffffffff804c1b20:        0 	8b 95 d4 04 00 00    	mov    0x4d4(%rbp),%edx
ffffffff804c1b26:        0 	eb 08                	jmp    ffffffff804c1b30 <tcp_ack+0x1019>
ffffffff804c1b28:        0 	8b 95 d0 04 00 00    	mov    0x4d0(%rbp),%edx
ffffffff804c1b2e:        0 	ff c2                	inc    %edx
ffffffff804c1b30:        0 	0f b6 85 7f 04 00 00 	movzbl 0x47f(%rbp),%eax
ffffffff804c1b37:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c1b39:        0 	0f 8f 4d 04 00 00    	jg     ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b3f:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1b45:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1b48:        0 	a8 02                	test   $0x2,%al
ffffffff804c1b4a:        0 	74 10                	je     ffffffff804c1b5c <tcp_ack+0x1045>
ffffffff804c1b4c:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1b4f:        0 	e8 1d d4 ff ff       	callq  ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1b54:        0 	85 c0                	test   %eax,%eax
ffffffff804c1b56:        0 	0f 85 30 04 00 00    	jne    ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b5c:        0 	0f b6 85 7f 04 00 00 	movzbl 0x47f(%rbp),%eax
ffffffff804c1b63:        0 	8b 95 74 04 00 00    	mov    0x474(%rbp),%edx
ffffffff804c1b69:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c1b6b:        0 	77 29                	ja     ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b6d:        0 	89 d0                	mov    %edx,%eax
ffffffff804c1b6f:        0 	d1 e8                	shr    %eax
ffffffff804c1b71:        0 	39 05 c1 68 3f 00    	cmp    %eax,0x3f68c1(%rip)        # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b77:        0 	0f 43 05 ba 68 3f 00 	cmovae 0x3f68ba(%rip),%eax        # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b7e:        0 	39 85 d0 04 00 00    	cmp    %eax,0x4d0(%rbp)
ffffffff804c1b84:        0 	72 10                	jb     ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b86:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1b89:        0 	e8 82 37 00 00       	callq  ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1b8e:        0 	85 c0                	test   %eax,%eax
ffffffff804c1b90:        0 	0f 84 f6 03 00 00    	je     ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b96:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c1b9c:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c1ba2:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c1ba8:        0 	76 11                	jbe    ffffffff804c1bbb <tcp_ack+0x10a4>
ffffffff804c1baa:        0 	be d7 09 00 00       	mov    $0x9d7,%esi
ffffffff804c1baf:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c1bb6:        0 	e8 fa 45 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1bbb:        0 	80 bd 5e 04 00 00 00 	cmpb   $0x0,0x45e(%rbp)
ffffffff804c1bc2:        0 	75 13                	jne    ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bc4:        0 	83 bd 78 04 00 00 00 	cmpl   $0x0,0x478(%rbp)
ffffffff804c1bcb:        0 	75 0a                	jne    ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bcd:        0 	c7 85 74 05 00 00 00 	movl   $0x0,0x574(%rbp)
ffffffff804c1bd4:        0 	00 00 00 
ffffffff804c1bd7:        0 	83 7c 24 58 00       	cmpl   $0x0,0x58(%rsp)
ffffffff804c1bdc:        0 	74 0d                	je     ffffffff804c1beb <tcp_ack+0x10d4>
ffffffff804c1bde:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1be3:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1be6:        0 	e8 cf d0 ff ff       	callq  ffffffff804becba <tcp_enter_cwr>
ffffffff804c1beb:        0 	80 bd 78 03 00 00 02 	cmpb   $0x2,0x378(%rbp)
ffffffff804c1bf2:        0 	74 15                	je     ffffffff804c1c09 <tcp_ack+0x10f2>
ffffffff804c1bf4:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1bf7:        0 	e8 71 d6 ff ff       	callq  ffffffff804bf26d <tcp_try_keep_open>
ffffffff804c1bfc:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1bff:        0 	e8 a9 d3 ff ff       	callq  ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1c04:        0 	e9 56 02 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c09:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff804c1c0c:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1c0f:        0 	e8 d9 d3 ff ff       	callq  ffffffff804befed <tcp_cwnd_down>
ffffffff804c1c14:        0 	e9 46 02 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c19:        0 	8b 95 a4 03 00 00    	mov    0x3a4(%rbp),%edx
ffffffff804c1c1f:        0 	85 d2                	test   %edx,%edx
ffffffff804c1c21:        0 	74 34                	je     ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c23:        0 	8b 85 b0 05 00 00    	mov    0x5b0(%rbp),%eax
ffffffff804c1c29:        0 	39 85 00 04 00 00    	cmp    %eax,0x400(%rbp)
ffffffff804c1c2f:        0 	75 26                	jne    ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c31:        0 	ff 85 ac 04 00 00    	incl   0x4ac(%rbp)
ffffffff804c1c37:        0 	8d 42 ff             	lea    -0x1(%rdx),%eax
ffffffff804c1c3a:        0 	c7 85 a4 03 00 00 00 	movl   $0x0,0x3a4(%rbp)
ffffffff804c1c41:        0 	00 00 00 
ffffffff804c1c44:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1c47:        0 	89 85 9c 03 00 00    	mov    %eax,0x39c(%rbp)
ffffffff804c1c4d:        0 	e8 86 54 00 00       	callq  ffffffff804c70d8 <tcp_simple_retransmit>
ffffffff804c1c52:        0 	e9 08 02 00 00       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c57:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1c5d:        0 	48 8b 15 54 fa 5e 00 	mov    0x5efa54(%rip),%rdx        # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1c64:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1c67:        0 	48 f7 d2             	not    %rdx
ffffffff804c1c6a:        0 	3c 01                	cmp    $0x1,%al
ffffffff804c1c6c:        0 	19 c9                	sbb    %ecx,%ecx
ffffffff804c1c6e:        0 	65 8b 04 25 24 00 00 	mov    %gs:0x24,%eax
ffffffff804c1c75:        0 	00 
ffffffff804c1c76:        0 	89 c0                	mov    %eax,%eax
ffffffff804c1c78:        0 	83 c1 1f             	add    $0x1f,%ecx
ffffffff804c1c7b:        0 	48 8b 04 c2          	mov    (%rdx,%rax,8),%rax
ffffffff804c1c7f:        0 	48 63 c9             	movslq %ecx,%rcx
ffffffff804c1c82:        0 	48 ff 04 c8          	incq   (%rax,%rcx,8)
ffffffff804c1c86:        0 	c7 85 6c 05 00 00 00 	movl   $0x0,0x56c(%rbp)
ffffffff804c1c8d:        0 	00 00 00 
ffffffff804c1c90:        0 	8b 85 fc 03 00 00    	mov    0x3fc(%rbp),%eax
ffffffff804c1c96:        0 	80 bd 78 03 00 00 01 	cmpb   $0x1,0x378(%rbp)
ffffffff804c1c9d:        0 	89 85 70 05 00 00    	mov    %eax,0x570(%rbp)
ffffffff804c1ca3:        0 	8b 85 00 04 00 00    	mov    0x400(%rbp),%eax
ffffffff804c1ca9:        0 	89 85 78 05 00 00    	mov    %eax,0x578(%rbp)
ffffffff804c1caf:        0 	8b 85 78 04 00 00    	mov    0x478(%rbp),%eax
ffffffff804c1cb5:        0 	89 85 7c 05 00 00    	mov    %eax,0x57c(%rbp)
ffffffff804c1cbb:        0 	77 3b                	ja     ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cbd:        0 	83 7c 24 58 00       	cmpl   $0x0,0x58(%rsp)
ffffffff804c1cc2:        0 	75 0e                	jne    ffffffff804c1cd2 <tcp_ack+0x11bb>
ffffffff804c1cc4:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1cc7:        0 	e8 f0 cb ff ff       	callq  ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c1ccc:        0 	89 85 6c 05 00 00    	mov    %eax,0x56c(%rbp)
ffffffff804c1cd2:        0 	48 8b 85 60 03 00 00 	mov    0x360(%rbp),%rax
ffffffff804c1cd9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1cdc:        0 	ff 50 28             	callq  *0x28(%rax)
ffffffff804c1cdf:        0 	89 85 a8 04 00 00    	mov    %eax,0x4a8(%rbp)
ffffffff804c1ce5:        0 	8a 85 7e 04 00 00    	mov    0x47e(%rbp),%al
ffffffff804c1ceb:        0 	a8 01                	test   $0x1,%al
ffffffff804c1ced:        0 	74 09                	je     ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cef:        0 	83 c8 02             	or     $0x2,%eax
ffffffff804c1cf2:        0 	88 85 7e 04 00 00    	mov    %al,0x47e(%rbp)
ffffffff804c1cf8:        0 	c7 85 dc 04 00 00 00 	movl   $0x0,0x4dc(%rbp)
ffffffff804c1cff:        0 	00 00 00 
ffffffff804c1d02:        0 	c7 85 b0 04 00 00 00 	movl   $0x0,0x4b0(%rbp)
ffffffff804c1d09:        0 	00 00 00 
ffffffff804c1d0c:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c1d11:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1d14:        0 	bb 01 00 00 00       	mov    $0x1,%ebx
ffffffff804c1d19:        0 	e8 5b cb ff ff       	callq  ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1d1e:        0 	eb 02                	jmp    ffffffff804c1d22 <tcp_ack+0x120b>
ffffffff804c1d20:        0 	31 db                	xor    %ebx,%ebx
ffffffff804c1d22:        0 	45 85 ed             	test   %r13d,%r13d
ffffffff804c1d25:        0 	75 21                	jne    ffffffff804c1d48 <tcp_ack+0x1231>
ffffffff804c1d27:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1d2d:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1d30:        0 	a8 02                	test   $0x2,%al
ffffffff804c1d32:        0 	0f 84 0b 01 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d38:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1d3b:        0 	e8 31 d2 ff ff       	callq  ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1d40:        0 	85 c0                	test   %eax,%eax
ffffffff804c1d42:        0 	0f 84 fb 00 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d48:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1d4e:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1d51:        0 	75 07                	jne    ffffffff804c1d5a <tcp_ack+0x1243>
ffffffff804c1d53:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c1d58:        0 	eb 31                	jmp    ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d5a:        0 	a8 02                	test   $0x2,%al
ffffffff804c1d5c:        0 	8a 85 7f 04 00 00    	mov    0x47f(%rbp),%al
ffffffff804c1d62:        0 	74 17                	je     ffffffff804c1d7b <tcp_ack+0x1264>
ffffffff804c1d64:        0 	8b b5 d4 04 00 00    	mov    0x4d4(%rbp),%esi
ffffffff804c1d6a:        0 	0f b6 c0             	movzbl %al,%eax
ffffffff804c1d6d:        0 	29 c6                	sub    %eax,%esi
ffffffff804c1d6f:        0 	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff804c1d74:        0 	85 f6                	test   %esi,%esi
ffffffff804c1d76:        0 	0f 4e f0             	cmovle %eax,%esi
ffffffff804c1d79:        0 	eb 10                	jmp    ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d7b:        0 	8b b5 d0 04 00 00    	mov    0x4d0(%rbp),%esi
ffffffff804c1d81:        0 	0f b6 c0             	movzbl %al,%eax
ffffffff804c1d84:        0 	29 c6                	sub    %eax,%esi
ffffffff804c1d86:        0 	39 f3                	cmp    %esi,%ebx
ffffffff804c1d88:        0 	0f 4d f3             	cmovge %ebx,%esi
ffffffff804c1d8b:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1d8e:        0 	e8 7e e0 ff ff       	callq  ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c1d93:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1d99:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1d9c:        0 	a8 02                	test   $0x2,%al
ffffffff804c1d9e:        0 	0f 84 9f 00 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1da4:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1da7:        0 	e8 c5 d1 ff ff       	callq  ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1dac:        0 	85 c0                	test   %eax,%eax
ffffffff804c1dae:        0 	0f 84 8f 00 00 00    	je     ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1db4:        0 	48 8b 85 e8 04 00 00 	mov    0x4e8(%rbp),%rax
ffffffff804c1dbb:        0 	48 85 c0             	test   %rax,%rax
ffffffff804c1dbe:        0 	48 89 c3             	mov    %rax,%rbx
ffffffff804c1dc1:        0 	75 42                	jne    ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dc3:        0 	48 8b 9d c0 00 00 00 	mov    0xc0(%rbp),%rbx
ffffffff804c1dca:        0 	48 8d 85 c0 00 00 00 	lea    0xc0(%rbp),%rax
ffffffff804c1dd1:        0 	48 39 c3             	cmp    %rax,%rbx
ffffffff804c1dd4:        0 	75 2f                	jne    ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dd6:        0 	31 db                	xor    %ebx,%ebx
ffffffff804c1dd8:        0 	eb 2b                	jmp    ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dda:        0 	48 3b 9d d8 01 00 00 	cmp    0x1d8(%rbp),%rbx
ffffffff804c1de1:        0 	74 34                	je     ffffffff804c1e17 <tcp_ack+0x1300>
ffffffff804c1de3:        0 	48 8b 05 96 7a 3f 00 	mov    0x3f7a96(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c1dea:        0 	2b 43 58             	sub    0x58(%rbx),%eax
ffffffff804c1ded:        0 	3b 85 58 03 00 00    	cmp    0x358(%rbp),%eax
ffffffff804c1df3:        0 	76 22                	jbe    ffffffff804c1e17 <tcp_ack+0x1300>
ffffffff804c1df5:        0 	48 89 de             	mov    %rbx,%rsi
ffffffff804c1df8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1dfb:        0 	e8 28 d0 ff ff       	callq  ffffffff804bee28 <tcp_skb_mark_lost>
ffffffff804c1e00:        0 	48 8b 1b             	mov    (%rbx),%rbx
ffffffff804c1e03:        0 	eb 07                	jmp    ffffffff804c1e0c <tcp_ack+0x12f5>
ffffffff804c1e05:        0 	4c 8d ad c0 00 00 00 	lea    0xc0(%rbp),%r13
ffffffff804c1e0c:        0 	48 8b 03             	mov    (%rbx),%rax
ffffffff804c1e0f:        0 	4c 39 eb             	cmp    %r13,%rbx
ffffffff804c1e12:        0 	0f 18 08             	prefetcht0 (%rax)
ffffffff804c1e15:        0 	75 c3                	jne    ffffffff804c1dda <tcp_ack+0x12c3>
ffffffff804c1e17:        0 	8b 85 cc 04 00 00    	mov    0x4cc(%rbp),%eax
ffffffff804c1e1d:        0 	03 85 d0 04 00 00    	add    0x4d0(%rbp),%eax
ffffffff804c1e23:        0 	3b 85 74 04 00 00    	cmp    0x474(%rbp),%eax
ffffffff804c1e29:        0 	48 89 9d e8 04 00 00 	mov    %rbx,0x4e8(%rbp)
ffffffff804c1e30:        0 	76 11                	jbe    ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1e32:        0 	be e5 08 00 00       	mov    $0x8e5,%esi
ffffffff804c1e37:        0 	48 c7 c7 9d d9 6a 80 	mov    $0xffffffff806ad99d,%rdi
ffffffff804c1e3e:        0 	e8 72 43 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1e43:        0 	44 89 e6             	mov    %r12d,%esi
ffffffff804c1e46:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1e49:        0 	e8 9f d1 ff ff       	callq  ffffffff804befed <tcp_cwnd_down>
ffffffff804c1e4e:        0 	e9 2c 01 00 00       	jmpq   ffffffff804c1f7f <tcp_ack+0x1468>
ffffffff804c1e53:       47 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c1e57:      513 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1e5a:        0 	e8 62 d4 ff ff       	callq  ffffffff804bf2c1 <tcp_cong_avoid>
ffffffff804c1e5f:      427 	41 80 e4 34          	and    $0x34,%r12b
ffffffff804c1e63:     1234 	75 07                	jne    ffffffff804c1e6c <tcp_ack+0x1355>
ffffffff804c1e65:        0 	83 7c 24 54 00       	cmpl   $0x0,0x54(%rsp)
ffffffff804c1e6a:        0 	75 3c                	jne    ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e6c:        0 	48 8b 7d 78          	mov    0x78(%rbp),%rdi
ffffffff804c1e70:      916 	e8 8d c9 ff ff       	callq  ffffffff804be802 <dst_confirm>
ffffffff804c1e75:        3 	eb 31                	jmp    ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e77:        0 	48 8b 95 d8 01 00 00 	mov    0x1d8(%rbp),%rdx
ffffffff804c1e7e:       99 	48 85 d2             	test   %rdx,%rdx
ffffffff804c1e81:       16 	74 25                	je     ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e83:        0 	8b 85 44 04 00 00    	mov    0x444(%rbp),%eax
ffffffff804c1e89:        0 	03 85 00 04 00 00    	add    0x400(%rbp),%eax
ffffffff804c1e8f:        0 	3b 42 54             	cmp    0x54(%rdx),%eax
ffffffff804c1e92:        0 	78 1e                	js     ffffffff804c1eb2 <tcp_ack+0x139b>
ffffffff804c1e94:        0 	c6 85 7b 03 00 00 00 	movb   $0x0,0x37b(%rbp)
ffffffff804c1e9b:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c1ea0:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1ea3:        0 	e8 ab c9 ff ff       	callq  ffffffff804be853 <inet_csk_clear_xmit_timer>
ffffffff804c1ea8:      520 	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff804c1ead:      994 	e9 ec 00 00 00       	jmpq   ffffffff804c1f9e <tcp_ack+0x1487>
ffffffff804c1eb2:        0 	0f b6 8d 7b 03 00 00 	movzbl 0x37b(%rbp),%ecx
ffffffff804c1eb9:        0 	8b 95 58 03 00 00    	mov    0x358(%rbp),%edx
ffffffff804c1ebf:        0 	b8 30 75 00 00       	mov    $0x7530,%eax
ffffffff804c1ec4:        0 	be 03 00 00 00       	mov    $0x3,%esi
ffffffff804c1ec9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1ecc:        0 	d3 e2                	shl    %cl,%edx
ffffffff804c1ece:        0 	b9 30 75 00 00       	mov    $0x7530,%ecx
ffffffff804c1ed3:        0 	81 fa 30 75 00 00    	cmp    $0x7530,%edx
ffffffff804c1ed9:        0 	0f 47 d0             	cmova  %eax,%edx
ffffffff804c1edc:        0 	89 d2                	mov    %edx,%edx
ffffffff804c1ede:        0 	e8 00 d7 ff ff       	callq  ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c1ee3:        0 	eb c3                	jmp    ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1ee5:        0 	80 78 25 00          	cmpb   $0x0,0x25(%rax)
ffffffff804c1ee9:        0 	74 1a                	je     ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c1eeb:        0 	8b 54 24 18          	mov    0x18(%rsp),%edx
ffffffff804c1eef:        0 	e8 cc e3 ff ff       	callq  ffffffff804c02c0 <tcp_sacktag_write_queue>
ffffffff804c1ef4:        0 	80 bd 78 03 00 00 00 	cmpb   $0x0,0x378(%rbp)
ffffffff804c1efb:        0 	75 08                	jne    ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c1efd:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f00:        0 	e8 68 d3 ff ff       	callq  ffffffff804bf26d <tcp_try_keep_open>
ffffffff804c1f05:        0 	48 85 ed             	test   %rbp,%rbp
ffffffff804c1f08:        0 	74 2f                	je     ffffffff804c1f39 <tcp_ack+0x1422>
ffffffff804c1f0a:        0 	be 0a 00 00 00       	mov    $0xa,%esi
ffffffff804c1f0f:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f12:        0 	e8 f1 d5 ff ff       	callq  ffffffff804bf508 <sock_flag>
ffffffff804c1f17:        0 	85 c0                	test   %eax,%eax
ffffffff804c1f19:        0 	74 1e                	je     ffffffff804c1f39 <tcp_ack+0x1422>
ffffffff804c1f1b:        0 	8b 8d fc 03 00 00    	mov    0x3fc(%rbp),%ecx
ffffffff804c1f21:        0 	8b 95 00 04 00 00    	mov    0x400(%rbp),%edx
ffffffff804c1f27:        0 	48 c7 c7 e5 d9 6a 80 	mov    $0xffffffff806ad9e5,%rdi
ffffffff804c1f2e:        0 	8b 74 24 1c          	mov    0x1c(%rsp),%esi
ffffffff804c1f32:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1f34:        0 	e8 3b 4e d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804c1f39:        0 	31 c0                	xor    %eax,%eax
ffffffff804c1f3b:        0 	eb 61                	jmp    ffffffff804c1f9e <tcp_ack+0x1487>
ffffffff804c1f3d:        0 	c7 44 24 44 00 00 00 	movl   $0x0,0x44(%rsp)
ffffffff804c1f44:        0 	00 
ffffffff804c1f45:        0 	e9 c3 ef ff ff       	jmpq   ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c1f4a:       54 	41 f6 c4 04          	test   $0x4,%r12b
ffffffff804c1f4e:      424 	0f 84 0b ff ff ff    	je     ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f54:      364 	85 c9                	test   %ecx,%ecx
ffffffff804c1f56:        0 	0f 84 f7 fe ff ff    	je     ffffffff804c1e53 <tcp_ack+0x133c>
ffffffff804c1f5c:        0 	e9 fe fe ff ff       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f61:        0 	8a 85 9c 04 00 00    	mov    0x49c(%rbp),%al
ffffffff804c1f67:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c1f6a:        0 	a8 02                	test   $0x2,%al
ffffffff804c1f6c:        0 	0f 85 10 f8 ff ff    	jne    ffffffff804c1782 <tcp_ack+0xc6b>
ffffffff804c1f72:        0 	e9 61 f8 ff ff       	jmpq   ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1f77:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f7a:        0 	e8 2e d0 ff ff       	callq  ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1f7f:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c1f82:        0 	e8 f7 47 00 00       	callq  ffffffff804c677e <tcp_xmit_retransmit_queue>
ffffffff804c1f87:        0 	e9 d3 fe ff ff       	jmpq   ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f8c:        0 	80 bd 78 03 00 00 01 	cmpb   $0x1,0x378(%rbp)
ffffffff804c1f93:        0 	0f 87 be fc ff ff    	ja     ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1f99:        0 	e9 7b fc ff ff       	jmpq   ffffffff804c1c19 <tcp_ack+0x1102>
ffffffff804c1f9e:      493 	48 81 c4 88 00 00 00 	add    $0x88,%rsp
ffffffff804c1fa5:     1288 	5b                   	pop    %rbx
ffffffff804c1fa6:        0 	5d                   	pop    %rbp
ffffffff804c1fa7:      446 	41 5c                	pop    %r12
ffffffff804c1fa9:        0 	41 5d                	pop    %r13
ffffffff804c1fab:        2 	41 5e                	pop    %r14
ffffffff804c1fad:      447 	41 5f                	pop    %r15
ffffffff804c1faf:        0 	c3                   	retq   

No real obvious single-instruction hotspots i can see.

But i can see another problem: the function is too large and its flow 
is not fall-through in any way. As you can see from the profile 
distribution, it is broken into 25-30 separate code sequences.

The function consists of more than 1200 instructions and is 5200 bytes 
large. According to the profile above, only about 350 of those 
instructions are executed and the remaining ~850 are never used by 
this workload. So in theory this function should only take up ~1.5K 
of the instruction cache.

But because execution is spread out into 25+ smaller pieces, it takes 
up ~4K of the instruction cache instead (there's a single ~1.2K hole 
in the middle, i subtracted that) - 2-3 times larger than it should be.
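
The footprint estimate above can be reproduced with a few lines of
arithmetic. The sketch below is not part of the original mail: it
assumes the rounded figures quoted in the text (5200 bytes, 1200
instructions, 350 executed instructions, ~4K touched) and a uniform
average instruction length, and every function name in it is made up
for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Average encoded length of one instruction, in bytes. */
double avg_insn_bytes(double func_bytes, double func_insns)
{
	assert(func_insns > 0.0);
	return func_bytes / func_insns;
}

/*
 * Bytes the hot path would occupy if all executed instructions were
 * laid out contiguously (assumes hot instructions have the same
 * average length as the rest of the function).
 */
double ideal_hot_bytes(double func_bytes, double func_insns,
		       double hot_insns)
{
	return hot_insns * avg_insn_bytes(func_bytes, func_insns);
}

/* How much larger the touched footprint is than the ideal one. */
double footprint_ratio(double touched_bytes, double ideal_bytes)
{
	assert(ideal_bytes > 0.0);
	return touched_bytes / ideal_bytes;
}

/* Print the estimate for the figures quoted in the text. */
void print_footprint_estimate(void)
{
	double ideal = ideal_hot_bytes(5200.0, 1200.0, 350.0);

	printf("ideal hot footprint: ~%.0f bytes\n", ideal);
	printf("touched / ideal:     ~%.1fx\n",
	       footprint_ratio(4096.0, ideal));
}
```

With the quoted inputs the ideal footprint comes out at roughly 1.5K
and the ratio lands between 2 and 3, which is consistent with the
estimate in the text.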

So this code could make good use of the (brand-new ;-) branch-tracer 
ftrace plugin and grow a few well-placed likely()/unlikely() places - 
at least for this workload. I think.
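
For readers unfamiliar with the annotations, here is a minimal
userspace sketch of what a likely()/unlikely() hint looks like. It is
not taken from tcp_ack(): the struct, the function names and the
condition are hypothetical, and the two macros are defined locally on
top of GCC/Clang's __builtin_expect() so that the example builds
outside the kernel tree. The hint only influences code layout (the
expected path tends to become the fall-through path); it does not
change what the code computes.

```c
#include <assert.h>

/* Local definitions for this sketch, built on __builtin_expect(). */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical connection state, for illustration only. */
struct demo_conn {
	unsigned int in_flight;
	unsigned int acked;
	int in_recovery;
};

/* Rare path: kept out of line so it does not dilute the hot code. */
__attribute__((noinline, cold))
int demo_slow_path(struct demo_conn *c)
{
	c->in_recovery = 0;
	c->in_flight = 0;
	return -1;
}

/* Common case falls straight through; the rare case branches away. */
int demo_handle_ack(struct demo_conn *c, unsigned int newly_acked)
{
	assert(c != 0);

	if (unlikely(c->in_recovery || newly_acked > c->in_flight))
		return demo_slow_path(c);

	c->in_flight -= newly_acked;
	c->acked += newly_acked;
	return 0;
}
```

Whether a given branch deserves such a hint is exactly what a branch
profile is meant to answer; a hint that is wrong for the common
workload can make the layout worse rather than better.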

	Ingo

* tcp_recvmsg(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:19                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.833688 tcp_recvmsg

                      hits (total: 183368)
                 .........
ffffffff804bd46e:      882 <tcp_recvmsg>:
ffffffff804bd46e:      882 	41 57                	push   %r15
ffffffff804bd470:    15507 	48 89 f7             	mov    %rsi,%rdi
ffffffff804bd473:      179 	41 56                	push   %r14
ffffffff804bd475:        0 	49 89 ce             	mov    %rcx,%r14
ffffffff804bd478:      744 	41 55                	push   %r13
ffffffff804bd47a:      165 	41 54                	push   %r12
ffffffff804bd47c:        0 	45 89 c4             	mov    %r8d,%r12d
ffffffff804bd47f:      692 	55                   	push   %rbp
ffffffff804bd480:      178 	44 89 cd             	mov    %r9d,%ebp
ffffffff804bd483:     3434 	53                   	push   %rbx
ffffffff804bd484:      685 	48 89 f3             	mov    %rsi,%rbx
ffffffff804bd487:       11 	48 83 ec 68          	sub    $0x68,%rsp
ffffffff804bd48b:      949 	48 89 54 24 30       	mov    %rdx,0x30(%rsp)
ffffffff804bd490:        7 	e8 e8 e8 ff ff       	callq  ffffffff804bbd7d <lock_sock>
ffffffff804bd495:     1771 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bd498:     6176 	3c 0a                	cmp    $0xa,%al
ffffffff804bd49a:        0 	0f 84 3a 06 00 00    	je     ffffffff804bdada <tcp_recvmsg+0x66c>
ffffffff804bd4a0:     3121 	31 c0                	xor    %eax,%eax
ffffffff804bd4a2:      195 	45 85 e4             	test   %r12d,%r12d
ffffffff804bd4a5:        0 	75 07                	jne    ffffffff804bd4ae <tcp_recvmsg+0x40>
ffffffff804bd4a7:      926 	48 8b 83 68 01 00 00 	mov    0x168(%rbx),%rax
ffffffff804bd4ae:      189 	40 f6 c5 01          	test   $0x1,%bpl
ffffffff804bd4b2:        0 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
ffffffff804bd4b7:      819 	0f 85 33 06 00 00    	jne    ffffffff804bdaf0 <tcp_recvmsg+0x682>
ffffffff804bd4bd:      216 	89 e8                	mov    %ebp,%eax
ffffffff804bd4bf:        0 	83 e0 02             	and    $0x2,%eax
ffffffff804bd4c2:      638 	89 44 24 3c          	mov    %eax,0x3c(%rsp)
ffffffff804bd4c6:      177 	75 0e                	jne    ffffffff804bd4d6 <tcp_recvmsg+0x68>
ffffffff804bd4c8:        0 	48 8d 93 f4 03 00 00 	lea    0x3f4(%rbx),%rdx
ffffffff804bd4cf:      661 	48 89 54 24 40       	mov    %rdx,0x40(%rsp)
ffffffff804bd4d4:      195 	eb 14                	jmp    ffffffff804bd4ea <tcp_recvmsg+0x7c>
ffffffff804bd4d6:        0 	8b 83 f4 03 00 00    	mov    0x3f4(%rbx),%eax
ffffffff804bd4dc:        0 	48 8d 4c 24 60       	lea    0x60(%rsp),%rcx
ffffffff804bd4e1:        0 	48 89 4c 24 40       	mov    %rcx,0x40(%rsp)
ffffffff804bd4e6:        0 	89 44 24 60          	mov    %eax,0x60(%rsp)
ffffffff804bd4ea:      867 	89 ee                	mov    %ebp,%esi
ffffffff804bd4ec:      210 	44 89 f2             	mov    %r14d,%edx
ffffffff804bd4ef:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd4f2:      894 	81 e6 00 01 00 00    	and    $0x100,%esi
ffffffff804bd4f8:      192 	45 31 ff             	xor    %r15d,%r15d
ffffffff804bd4fb:        0 	e8 fc df ff ff       	callq  ffffffff804bb4fc <sock_rcvlowat>
ffffffff804bd500:      853 	89 44 24 4c          	mov    %eax,0x4c(%rsp)
ffffffff804bd504:     1857 	48 8d 83 a8 00 00 00 	lea    0xa8(%rbx),%rax
ffffffff804bd50b:        0 	89 e9                	mov    %ebp,%ecx
ffffffff804bd50d:      595 	48 8d 93 10 04 00 00 	lea    0x410(%rbx),%rdx
ffffffff804bd514:      263 	83 e1 22             	and    $0x22,%ecx
ffffffff804bd517:        0 	83 e5 20             	and    $0x20,%ebp
ffffffff804bd51a:      601 	48 89 44 24 28       	mov    %rax,0x28(%rsp)
ffffffff804bd51f:      254 	48 8d 83 f8 04 00 00 	lea    0x4f8(%rbx),%rax
ffffffff804bd526:        2 	48 c7 44 24 50 00 00 	movq   $0x0,0x50(%rsp)
ffffffff804bd52d:        0 	00 00 
ffffffff804bd52f:      578 	48 89 54 24 20       	mov    %rdx,0x20(%rsp)
ffffffff804bd534:      290 	89 4c 24 1c          	mov    %ecx,0x1c(%rsp)
ffffffff804bd538:        1 	48 89 44 24 10       	mov    %rax,0x10(%rsp)
ffffffff804bd53d:      593 	89 6c 24 0c          	mov    %ebp,0xc(%rsp)
ffffffff804bd541:      568 	66 83 bb 7c 04 00 00 	cmpw   $0x0,0x47c(%rbx)
ffffffff804bd548:        0 	00 
ffffffff804bd549:     3956 	74 55                	je     ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd54b:        0 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bd550:        0 	8b 83 84 05 00 00    	mov    0x584(%rbx),%eax
ffffffff804bd556:        0 	3b 02                	cmp    (%rdx),%eax
ffffffff804bd558:        0 	75 46                	jne    ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd55a:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bd55d:        0 	0f 85 e6 04 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd563:        0 	65 48 8b 3c 25 00 00 	mov    %gs:0x0,%rdi
ffffffff804bd56a:        0 	00 00 
ffffffff804bd56c:        0 	e8 4c e1 ff ff       	callq  ffffffff804bb6bd <signal_pending>
ffffffff804bd571:        0 	85 c0                	test   %eax,%eax
ffffffff804bd573:        0 	74 2b                	je     ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd575:        0 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
ffffffff804bd57a:        0 	41 bf f5 ff ff ff    	mov    $0xfffffff5,%r15d
ffffffff804bd580:        0 	48 85 d2             	test   %rdx,%rdx
ffffffff804bd583:        0 	0f 84 c0 04 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd589:        0 	48 b8 ff ff ff ff ff 	mov    $0x7fffffffffffffff,%rax
ffffffff804bd590:        0 	ff ff 7f 
ffffffff804bd593:        0 	66 41 bf 00 fe       	mov    $0xfe00,%r15w
ffffffff804bd598:        0 	48 39 c2             	cmp    %rax,%rdx
ffffffff804bd59b:        0 	e9 89 01 00 00       	jmpq   ffffffff804bd729 <tcp_recvmsg+0x2bb>
ffffffff804bd5a0:      597 	48 8b ab a8 00 00 00 	mov    0xa8(%rbx),%rbp
ffffffff804bd5a7:     4601 	48 3b 6c 24 28       	cmp    0x28(%rsp),%rbp
ffffffff804bd5ac:        1 	b8 00 00 00 00       	mov    $0x0,%eax
ffffffff804bd5b1:     1769 	48 0f 44 e8          	cmove  %rax,%rbp
ffffffff804bd5b5:      473 	48 85 ed             	test   %rbp,%rbp
ffffffff804bd5b8:        0 	74 76                	je     ffffffff804bd630 <tcp_recvmsg+0x1c2>
ffffffff804bd5ba:      595 	48 8b 4c 24 40       	mov    0x40(%rsp),%rcx
ffffffff804bd5bf:      897 	8b 55 50             	mov    0x50(%rbp),%edx
ffffffff804bd5c2:       89 	8b 31                	mov    (%rcx),%esi
ffffffff804bd5c4:      581 	41 89 f5             	mov    %esi,%r13d
ffffffff804bd5c7:      301 	41 29 d5             	sub    %edx,%r13d
ffffffff804bd5ca:       33 	79 10                	jns    ffffffff804bd5dc <tcp_recvmsg+0x16e>
ffffffff804bd5cc:        0 	48 c7 c7 48 d9 6a 80 	mov    $0xffffffff806ad948,%rdi
ffffffff804bd5d3:        0 	31 c0                	xor    %eax,%eax
ffffffff804bd5d5:        0 	e8 9a 97 d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804bd5da:        0 	eb 54                	jmp    ffffffff804bd630 <tcp_recvmsg+0x1c2>
ffffffff804bd5dc:      584 	8b 85 b8 00 00 00    	mov    0xb8(%rbp),%eax
ffffffff804bd5e2:     1061 	48 8b 95 d0 00 00 00 	mov    0xd0(%rbp),%rdx
ffffffff804bd5e9:        1 	8a 54 02 0d          	mov    0xd(%rdx,%rax,1),%dl
ffffffff804bd5ed:        0 	88 d0                	mov    %dl,%al
ffffffff804bd5ef:      876 	83 e0 02             	and    $0x2,%eax
ffffffff804bd5f2:        0 	3c 01                	cmp    $0x1,%al
ffffffff804bd5f4:        0 	8b 45 68             	mov    0x68(%rbp),%eax
ffffffff804bd5f7:      909 	41 83 d5 ff          	adc    $0xffffffffffffffff,%r13d
ffffffff804bd5fb:        0 	41 39 c5             	cmp    %eax,%r13d
ffffffff804bd5fe:        0 	0f 82 df 02 00 00    	jb     ffffffff804bd8e3 <tcp_recvmsg+0x475>
ffffffff804bd604:        0 	80 e2 01             	and    $0x1,%dl
ffffffff804bd607:        0 	0f 85 16 04 00 00    	jne    ffffffff804bda23 <tcp_recvmsg+0x5b5>
ffffffff804bd60d:        0 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bd612:        0 	75 11                	jne    ffffffff804bd625 <tcp_recvmsg+0x1b7>
ffffffff804bd614:        0 	be 53 05 00 00       	mov    $0x553,%esi
ffffffff804bd619:        0 	48 c7 c7 13 d9 6a 80 	mov    $0xffffffff806ad913,%rdi
ffffffff804bd620:        0 	e8 90 8b d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804bd625:        0 	48 8b 6d 00          	mov    0x0(%rbp),%rbp
ffffffff804bd629:        0 	48 3b 6c 24 28       	cmp    0x28(%rsp),%rbp
ffffffff804bd62e:        0 	75 85                	jne    ffffffff804bd5b5 <tcp_recvmsg+0x147>
ffffffff804bd630:       80 	44 3b 7c 24 4c       	cmp    0x4c(%rsp),%r15d
ffffffff804bd635:     4164 	7c 0b                	jl     ffffffff804bd642 <tcp_recvmsg+0x1d4>
ffffffff804bd637:        0 	48 83 7b 68 00       	cmpq   $0x0,0x68(%rbx)
ffffffff804bd63c:        0 	0f 84 07 04 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd642:        1 	45 85 ff             	test   %r15d,%r15d
ffffffff804bd645:     3438 	74 49                	je     ffffffff804bd690 <tcp_recvmsg+0x222>
ffffffff804bd647:        0 	83 bb 44 01 00 00 00 	cmpl   $0x0,0x144(%rbx)
ffffffff804bd64e:        0 	0f 85 f5 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd654:        0 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bd657:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bd659:        0 	0f 84 ea 03 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd65f:        0 	f6 43 38 01          	testb  $0x1,0x38(%rbx)
ffffffff804bd663:        0 	0f 85 e0 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd669:        0 	48 83 7c 24 58 00    	cmpq   $0x0,0x58(%rsp)
ffffffff804bd66f:        0 	0f 84 d4 03 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd675:        0 	65 48 8b 3c 25 00 00 	mov    %gs:0x0,%rdi
ffffffff804bd67c:        0 	00 00 
ffffffff804bd67e:        0 	e8 3a e0 ff ff       	callq  ffffffff804bb6bd <signal_pending>
ffffffff804bd683:        0 	85 c0                	test   %eax,%eax
ffffffff804bd685:        0 	0f 84 ac 00 00 00    	je     ffffffff804bd737 <tcp_recvmsg+0x2c9>
ffffffff804bd68b:        0 	e9 b9 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd690:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804bd695:     4166 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd698:        0 	e8 7b de ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bd69d:        0 	85 c0                	test   %eax,%eax
ffffffff804bd69f:      276 	0f 85 a4 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6a5:      126 	83 bb 44 01 00 00 00 	cmpl   $0x0,0x144(%rbx)
ffffffff804bd6ac:        0 	74 10                	je     ffffffff804bd6be <tcp_recvmsg+0x250>
ffffffff804bd6ae:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd6b1:        0 	e8 00 df ff ff       	callq  ffffffff804bb5b6 <sock_error>
ffffffff804bd6b6:        0 	41 89 c7             	mov    %eax,%r15d
ffffffff804bd6b9:        0 	e9 8b 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6be:      112 	f6 43 38 01          	testb  $0x1,0x38(%rbx)
ffffffff804bd6c2:     3451 	0f 85 81 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6c8:      497 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bd6cb:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bd6cd:      113 	75 20                	jne    ffffffff804bd6ef <tcp_recvmsg+0x281>
ffffffff804bd6cf:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804bd6d4:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd6d7:        0 	e8 3c de ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bd6dc:        0 	85 c0                	test   %eax,%eax
ffffffff804bd6de:        0 	0f 85 65 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6e4:        0 	41 bf 95 ff ff ff    	mov    $0xffffff95,%r15d
ffffffff804bd6ea:        0 	e9 5a 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6ef:      118 	48 83 7c 24 58 00    	cmpq   $0x0,0x58(%rsp)
ffffffff804bd6f5:      398 	75 0b                	jne    ffffffff804bd702 <tcp_recvmsg+0x294>
ffffffff804bd6f7:        0 	41 bf f5 ff ff ff    	mov    $0xfffffff5,%r15d
ffffffff804bd6fd:        0 	e9 47 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd702:        0 	65 48 8b 3c 25 00 00 	mov    %gs:0x0,%rdi
ffffffff804bd709:        0 	00 00 
ffffffff804bd70b:     2993 	e8 ad df ff ff       	callq  ffffffff804bb6bd <signal_pending>
ffffffff804bd710:      200 	85 c0                	test   %eax,%eax
ffffffff804bd712:        0 	74 23                	je     ffffffff804bd737 <tcp_recvmsg+0x2c9>
ffffffff804bd714:        0 	48 b8 ff ff ff ff ff 	mov    $0x7fffffffffffffff,%rax
ffffffff804bd71b:        0 	ff ff 7f 
ffffffff804bd71e:        0 	48 39 44 24 58       	cmp    %rax,0x58(%rsp)
ffffffff804bd723:        0 	41 bf 00 fe ff ff    	mov    $0xfffffe00,%r15d
ffffffff804bd729:        0 	b8 fc ff ff ff       	mov    $0xfffffffc,%eax
ffffffff804bd72e:        0 	44 0f 45 f8          	cmovne %eax,%r15d
ffffffff804bd732:        0 	e9 12 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd737:      207 	44 89 fe             	mov    %r15d,%esi
ffffffff804bd73a:      198 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd73d:        0 	e8 cc e9 ff ff       	callq  ffffffff804bc10e <tcp_cleanup_rbuf>
ffffffff804bd742:      227 	83 3d 9b ad 3f 00 00 	cmpl   $0x0,0x3fad9b(%rip)        # ffffffff808b84e4 <sysctl_tcp_low_latency>
ffffffff804bd749:      210 	0f 85 81 00 00 00    	jne    ffffffff804bd7d0 <tcp_recvmsg+0x362>
ffffffff804bd74f:        0 	48 8b ab 28 04 00 00 	mov    0x428(%rbx),%rbp
ffffffff804bd756:        0 	48 3b 6c 24 50       	cmp    0x50(%rsp),%rbp
ffffffff804bd75b:      232 	75 73                	jne    ffffffff804bd7d0 <tcp_recvmsg+0x362>
ffffffff804bd75d:        0 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bd763:        7 	75 27                	jne    ffffffff804bd78c <tcp_recvmsg+0x31e>
ffffffff804bd765:      229 	83 7c 24 1c 00       	cmpl   $0x0,0x1c(%rsp)
ffffffff804bd76a:       30 	75 20                	jne    ffffffff804bd78c <tcp_recvmsg+0x31e>
ffffffff804bd76c:        7 	48 8b 54 24 30       	mov    0x30(%rsp),%rdx
ffffffff804bd771:      191 	65 48 8b 2c 25 00 00 	mov    %gs:0x0,%rbp
ffffffff804bd778:        0 	00 00 
ffffffff804bd77a:       12 	48 89 ab 28 04 00 00 	mov    %rbp,0x428(%rbx)
ffffffff804bd781:     2617 	48 8b 42 10          	mov    0x10(%rdx),%rax
ffffffff804bd785:      670 	48 89 83 30 04 00 00 	mov    %rax,0x430(%rbx)
ffffffff804bd78c:       11 	8b 83 f4 03 00 00    	mov    0x3f4(%rbx),%eax
ffffffff804bd792:      188 	3b 83 f0 03 00 00    	cmp    0x3f0(%rbx),%eax
ffffffff804bd798:      166 	44 89 b3 3c 04 00 00 	mov    %r14d,0x43c(%rbx)
ffffffff804bd79f:        5 	74 18                	je     ffffffff804bd7b9 <tcp_recvmsg+0x34b>
ffffffff804bd7a1:        0 	83 7c 24 1c 00       	cmpl   $0x0,0x1c(%rsp)
ffffffff804bd7a6:        0 	75 11                	jne    ffffffff804bd7b9 <tcp_recvmsg+0x34b>
ffffffff804bd7a8:        0 	be 92 05 00 00       	mov    $0x592,%esi
ffffffff804bd7ad:        0 	48 c7 c7 13 d9 6a 80 	mov    $0xffffffff806ad913,%rdi
ffffffff804bd7b4:        0 	e8 fc 89 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804bd7b9:      336 	48 8b 4c 24 20       	mov    0x20(%rsp),%rcx
ffffffff804bd7be:      302 	48 39 8b 10 04 00 00 	cmp    %rcx,0x410(%rbx)
ffffffff804bd7c5:     1176 	48 89 6c 24 50       	mov    %rbp,0x50(%rsp)
ffffffff804bd7ca:      244 	0f 85 81 00 00 00    	jne    ffffffff804bd851 <tcp_recvmsg+0x3e3>
ffffffff804bd7d0:      135 	44 3b 7c 24 4c       	cmp    0x4c(%rsp),%r15d
ffffffff804bd7d5:      112 	7c 12                	jl     ffffffff804bd7e9 <tcp_recvmsg+0x37b>
ffffffff804bd7d7:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7da:        0 	e8 57 7f fc ff       	callq  ffffffff80485736 <release_sock>
ffffffff804bd7df:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7e2:        0 	e8 96 e5 ff ff       	callq  ffffffff804bbd7d <lock_sock>
ffffffff804bd7e7:        0 	eb 0d                	jmp    ffffffff804bd7f6 <tcp_recvmsg+0x388>
ffffffff804bd7e9:      152 	48 8d 74 24 58       	lea    0x58(%rsp),%rsi
ffffffff804bd7ee:      563 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7f1:       59 	e8 83 99 fc ff       	callq  ffffffff80487179 <sk_wait_data>
ffffffff804bd7f6:       86 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bd7fc:     8550 	0f 84 8a 00 00 00    	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd802:     4038 	44 89 f1             	mov    %r14d,%ecx
ffffffff804bd805:      900 	2b 8b 3c 04 00 00    	sub    0x43c(%rbx),%ecx
ffffffff804bd80b:        5 	74 28                	je     ffffffff804bd835 <tcp_recvmsg+0x3c7>
ffffffff804bd80d:        0 	48 8b 05 ac 3e 5f 00 	mov    0x5f3eac(%rip),%rax        # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bd814:        1 	41 01 cf             	add    %ecx,%r15d
ffffffff804bd817:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804bd81e:        0 	00 
ffffffff804bd81f:        0 	89 d2                	mov    %edx,%edx
ffffffff804bd821:        0 	48 f7 d0             	not    %rax
ffffffff804bd824:        0 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804bd828:        0 	48 63 d1             	movslq %ecx,%rdx
ffffffff804bd82b:        0 	49 29 d6             	sub    %rdx,%r14
ffffffff804bd82e:        0 	48 01 90 b8 00 00 00 	add    %rdx,0xb8(%rax)
ffffffff804bd835:        4 	8b 83 f0 03 00 00    	mov    0x3f0(%rbx),%eax
ffffffff804bd83b:      373 	3b 83 f4 03 00 00    	cmp    0x3f4(%rbx),%eax
ffffffff804bd841:     3604 	75 49                	jne    ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd843:        0 	48 8b 44 24 20       	mov    0x20(%rsp),%rax
ffffffff804bd848:      971 	48 39 83 10 04 00 00 	cmp    %rax,0x410(%rbx)
ffffffff804bd84f:       11 	74 3b                	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd851:        6 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd854:      267 	e8 94 e6 ff ff       	callq  ffffffff804bbeed <tcp_prequeue_process>
ffffffff804bd859:        0 	44 89 f1             	mov    %r14d,%ecx
ffffffff804bd85c:      879 	2b 8b 3c 04 00 00    	sub    0x43c(%rbx),%ecx
ffffffff804bd862:      256 	74 28                	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd864:        0 	48 8b 05 55 3e 5f 00 	mov    0x5f3e55(%rip),%rax        # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bd86b:      116 	41 01 cf             	add    %ecx,%r15d
ffffffff804bd86e:       17 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804bd875:        0 	00 
ffffffff804bd876:        0 	89 d2                	mov    %edx,%edx
ffffffff804bd878:        1 	48 f7 d0             	not    %rax
ffffffff804bd87b:        5 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804bd87f:        0 	48 63 d1             	movslq %ecx,%rdx
ffffffff804bd882:        6 	49 29 d6             	sub    %rdx,%r14
ffffffff804bd885:        7 	48 01 90 c0 00 00 00 	add    %rdx,0xc0(%rax)
ffffffff804bd88c:       11 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bd891:      438 	0f 84 a9 01 00 00    	je     ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd897:        0 	8b 44 24 60          	mov    0x60(%rsp),%eax
ffffffff804bd89b:        0 	3b 83 f4 03 00 00    	cmp    0x3f4(%rbx),%eax
ffffffff804bd8a1:        0 	0f 84 99 01 00 00    	je     ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd8a7:        0 	e8 19 ad fd ff       	callq  ffffffff804985c5 <net_ratelimit>
ffffffff804bd8ac:        0 	85 c0                	test   %eax,%eax
ffffffff804bd8ae:        0 	74 24                	je     ffffffff804bd8d4 <tcp_recvmsg+0x466>
ffffffff804bd8b0:        0 	65 48 8b 34 25 00 00 	mov    %gs:0x0,%rsi
ffffffff804bd8b7:        0 	00 00 
ffffffff804bd8b9:        0 	8b 96 70 01 00 00    	mov    0x170(%rsi),%edx
ffffffff804bd8bf:        0 	48 c7 c7 6a d9 6a 80 	mov    $0xffffffff806ad96a,%rdi
ffffffff804bd8c6:        0 	48 81 c6 68 03 00 00 	add    $0x368,%rsi
ffffffff804bd8cd:        0 	31 c0                	xor    %eax,%eax
ffffffff804bd8cf:        0 	e8 a0 94 d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804bd8d4:        0 	8b 83 f4 03 00 00    	mov    0x3f4(%rbx),%eax
ffffffff804bd8da:        0 	89 44 24 60          	mov    %eax,0x60(%rsp)
ffffffff804bd8de:        0 	e9 5d 01 00 00       	jmpq   ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd8e3:     4077 	44 29 e8             	sub    %r13d,%eax
ffffffff804bd8e6:     6031 	4d 89 f4             	mov    %r14,%r12
ffffffff804bd8e9:        0 	4c 39 f0             	cmp    %r14,%rax
ffffffff804bd8ec:        0 	4c 0f 46 e0          	cmovbe %rax,%r12
ffffffff804bd8f0:      934 	66 83 bb 7c 04 00 00 	cmpw   $0x0,0x47c(%rbx)
ffffffff804bd8f7:        0 	00 
ffffffff804bd8f8:        0 	74 38                	je     ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd8fa:        0 	8b 83 84 05 00 00    	mov    0x584(%rbx),%eax
ffffffff804bd900:        0 	29 f0                	sub    %esi,%eax
ffffffff804bd902:        0 	89 c2                	mov    %eax,%edx
ffffffff804bd904:        0 	4c 39 e2             	cmp    %r12,%rdx
ffffffff804bd907:        0 	73 29                	jae    ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd909:        0 	85 c0                	test   %eax,%eax
ffffffff804bd90b:        0 	74 05                	je     ffffffff804bd912 <tcp_recvmsg+0x4a4>
ffffffff804bd90d:        0 	49 89 d4             	mov    %rdx,%r12
ffffffff804bd910:        0 	eb 20                	jmp    ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd912:        0 	be 02 00 00 00       	mov    $0x2,%esi
ffffffff804bd917:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd91a:        0 	e8 f9 db ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bd91f:        0 	85 c0                	test   %eax,%eax
ffffffff804bd921:        0 	75 0f                	jne    ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd923:        0 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bd928:        0 	41 ff c5             	inc    %r13d
ffffffff804bd92b:        0 	ff 02                	incl   (%rdx)
ffffffff804bd92d:        0 	49 ff cc             	dec    %r12
ffffffff804bd930:        0 	74 4c                	je     ffffffff804bd97e <tcp_recvmsg+0x510>
ffffffff804bd932:      906 	83 7c 24 0c 00       	cmpl   $0x0,0xc(%rsp)
ffffffff804bd937:     6039 	75 2f                	jne    ffffffff804bd968 <tcp_recvmsg+0x4fa>
ffffffff804bd939:       48 	48 8b 4c 24 30       	mov    0x30(%rsp),%rcx
ffffffff804bd93e:     1412 	44 89 ee             	mov    %r13d,%esi
ffffffff804bd941:     6648 	48 89 ef             	mov    %rbp,%rdi
ffffffff804bd944:        0 	48 8b 51 10          	mov    0x10(%rcx),%rdx
ffffffff804bd948:     1524 	44 89 e1             	mov    %r12d,%ecx
ffffffff804bd94b:      167 	e8 c5 d3 fc ff       	callq  ffffffff8048ad15 <skb_copy_datagram_iovec>
ffffffff804bd950:        0 	85 c0                	test   %eax,%eax
ffffffff804bd952:     1038 	74 14                	je     ffffffff804bd968 <tcp_recvmsg+0x4fa>
ffffffff804bd954:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bd957:        0 	0f 85 ec 00 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd95d:        0 	41 bf f2 ff ff ff    	mov    $0xfffffff2,%r15d
ffffffff804bd963:        0 	e9 e1 00 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd968:       28 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bd96d:     5713 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd970:      241 	45 01 e7             	add    %r12d,%r15d
ffffffff804bd973:       27 	4d 29 e6             	sub    %r12,%r14
ffffffff804bd976:      626 	44 01 22             	add    %r12d,(%rdx)
ffffffff804bd979:      221 	e8 fe 11 00 00       	callq  ffffffff804beb7c <tcp_rcv_space_adjust>
ffffffff804bd97e:     1425 	66 83 bb 7c 04 00 00 	cmpw   $0x0,0x47c(%rbx)
ffffffff804bd985:        0 	00 
ffffffff804bd986:     3430 	74 63                	je     ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd988:        0 	8b 8b f4 03 00 00    	mov    0x3f4(%rbx),%ecx
ffffffff804bd98e:        0 	39 8b 84 05 00 00    	cmp    %ecx,0x584(%rbx)
ffffffff804bd994:        0 	79 55                	jns    ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd996:        0 	48 8b 44 24 10       	mov    0x10(%rsp),%rax
ffffffff804bd99b:        0 	48 39 83 f8 04 00 00 	cmp    %rax,0x4f8(%rbx)
ffffffff804bd9a2:        0 	66 c7 83 7c 04 00 00 	movw   $0x0,0x47c(%rbx)
ffffffff804bd9a9:        0 	00 00 
ffffffff804bd9ab:        0 	75 3e                	jne    ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9ad:        0 	83 bb c0 04 00 00 00 	cmpl   $0x0,0x4c0(%rbx)
ffffffff804bd9b4:        0 	74 35                	je     ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9b6:        0 	8b 83 94 00 00 00    	mov    0x94(%rbx),%eax
ffffffff804bd9bc:        0 	3b 43 3c             	cmp    0x3c(%rbx),%eax
ffffffff804bd9bf:        0 	7d 2a                	jge    ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9c1:        0 	0f b7 83 e8 03 00 00 	movzwl 0x3e8(%rbx),%eax
ffffffff804bd9c8:        0 	8a 8b 9d 04 00 00    	mov    0x49d(%rbx),%cl
ffffffff804bd9ce:        0 	8b 93 44 04 00 00    	mov    0x444(%rbx),%edx
ffffffff804bd9d4:        0 	83 e1 0f             	and    $0xf,%ecx
ffffffff804bd9d7:        0 	c1 e0 1a             	shl    $0x1a,%eax
ffffffff804bd9da:        0 	d3 ea                	shr    %cl,%edx
ffffffff804bd9dc:        0 	09 d0                	or     %edx,%eax
ffffffff804bd9de:        0 	0d 00 00 10 00       	or     $0x100000,%eax
ffffffff804bd9e3:        0 	0f c8                	bswap  %eax
ffffffff804bd9e5:        0 	89 83 ec 03 00 00    	mov    %eax,0x3ec(%rbx)
ffffffff804bd9eb:        0 	8b 55 68             	mov    0x68(%rbp),%edx
ffffffff804bd9ee:     1655 	44 89 e8             	mov    %r13d,%eax
ffffffff804bd9f1:       32 	4c 01 e0             	add    %r12,%rax
ffffffff804bd9f4:        0 	48 39 d0             	cmp    %rdx,%rax
ffffffff804bd9f7:      847 	72 47                	jb     ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd9f9:        0 	8b 95 b8 00 00 00    	mov    0xb8(%rbp),%edx
ffffffff804bd9ff:       80 	48 8b 85 d0 00 00 00 	mov    0xd0(%rbp),%rax
ffffffff804bda06:      441 	f6 44 02 0d 01       	testb  $0x1,0xd(%rdx,%rax,1)
ffffffff804bda0b:        0 	75 16                	jne    ffffffff804bda23 <tcp_recvmsg+0x5b5>
ffffffff804bda0d:        0 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bda12:      453 	75 2c                	jne    ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bda14:        0 	31 d2                	xor    %edx,%edx
ffffffff804bda16:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff804bda19:      477 	48 89 df             	mov    %rbx,%rdi
ffffffff804bda1c:        0 	e8 0f e4 ff ff       	callq  ffffffff804bbe30 <sk_eat_skb>
ffffffff804bda21:      562 	eb 1d                	jmp    ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bda23:        0 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bda28:        0 	ff 02                	incl   (%rdx)
ffffffff804bda2a:        0 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bda2f:        0 	75 18                	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bda31:        0 	31 d2                	xor    %edx,%edx
ffffffff804bda33:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff804bda36:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bda39:        0 	e8 f2 e3 ff ff       	callq  ffffffff804bbe30 <sk_eat_skb>
ffffffff804bda3e:        0 	eb 09                	jmp    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bda40:      959 	4d 85 f6             	test   %r14,%r14
ffffffff804bda43:     4766 	0f 85 f8 fa ff ff    	jne    ffffffff804bd541 <tcp_recvmsg+0xd3>
ffffffff804bda49:      217 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bda4f:     2084 	74 71                	je     ffffffff804bdac2 <tcp_recvmsg+0x654>
ffffffff804bda51:       40 	48 8d 83 10 04 00 00 	lea    0x410(%rbx),%rax
ffffffff804bda58:      448 	48 39 83 10 04 00 00 	cmp    %rax,0x410(%rbx)
ffffffff804bda5f:        4 	74 4c                	je     ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda61:        0 	31 c0                	xor    %eax,%eax
ffffffff804bda63:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bda66:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bda69:        0 	41 0f 4f c6          	cmovg  %r14d,%eax
ffffffff804bda6d:        0 	89 83 3c 04 00 00    	mov    %eax,0x43c(%rbx)
ffffffff804bda73:        0 	e8 75 e4 ff ff       	callq  ffffffff804bbeed <tcp_prequeue_process>
ffffffff804bda78:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bda7b:        0 	7e 30                	jle    ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda7d:        0 	44 89 f1             	mov    %r14d,%ecx
ffffffff804bda80:        0 	2b 8b 3c 04 00 00    	sub    0x43c(%rbx),%ecx
ffffffff804bda86:        0 	74 25                	je     ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda88:        0 	48 8b 05 31 3c 5f 00 	mov    0x5f3c31(%rip),%rax        # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bda8f:        0 	41 01 cf             	add    %ecx,%r15d
ffffffff804bda92:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804bda99:        0 	00 
ffffffff804bda9a:        0 	89 d2                	mov    %edx,%edx
ffffffff804bda9c:        0 	48 f7 d0             	not    %rax
ffffffff804bda9f:        0 	48 8b 14 d0          	mov    (%rax,%rdx,8),%rdx
ffffffff804bdaa3:        0 	48 63 c1             	movslq %ecx,%rax
ffffffff804bdaa6:        0 	48 01 82 c0 00 00 00 	add    %rax,0xc0(%rdx)
ffffffff804bdaad:      214 	48 c7 83 28 04 00 00 	movq   $0x0,0x428(%rbx)
ffffffff804bdab4:        0 	00 00 00 00 
ffffffff804bdab8:     1530 	c7 83 3c 04 00 00 00 	movl   $0x0,0x43c(%rbx)
ffffffff804bdabf:        0 	00 00 00 
ffffffff804bdac2:     1135 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdac5:     3909 	44 89 fe             	mov    %r15d,%esi
ffffffff804bdac8:        0 	e8 41 e6 ff ff       	callq  ffffffff804bc10e <tcp_cleanup_rbuf>
ffffffff804bdacd:     1724 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdad0:      932 	e8 61 7c fc ff       	callq  ffffffff80485736 <release_sock>
ffffffff804bdad5:     4661 	e9 12 01 00 00       	jmpq   ffffffff804bdbec <tcp_recvmsg+0x77e>
ffffffff804bdada:        0 	41 bc 95 ff ff ff    	mov    $0xffffff95,%r12d
ffffffff804bdae0:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdae3:        0 	45 89 e7             	mov    %r12d,%r15d
ffffffff804bdae6:        0 	e8 4b 7c fc ff       	callq  ffffffff80485736 <release_sock>
ffffffff804bdaeb:        0 	e9 fc 00 00 00       	jmpq   ffffffff804bdbec <tcp_recvmsg+0x77e>
ffffffff804bdaf0:        0 	be 02 00 00 00       	mov    $0x2,%esi
ffffffff804bdaf5:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdaf8:        0 	e8 1b da ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bdafd:        0 	85 c0                	test   %eax,%eax
ffffffff804bdaff:        0 	0f 85 d4 00 00 00    	jne    ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb05:        0 	8b 83 7c 04 00 00    	mov    0x47c(%rbx),%eax
ffffffff804bdb0b:        0 	66 85 c0             	test   %ax,%ax
ffffffff804bdb0e:        0 	0f 84 c5 00 00 00    	je     ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb14:        0 	66 3d 00 04          	cmp    $0x400,%ax
ffffffff804bdb18:        0 	0f 84 bb 00 00 00    	je     ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb1e:        0 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bdb21:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bdb23:        0 	75 17                	jne    ffffffff804bdb3c <tcp_recvmsg+0x6ce>
ffffffff804bdb25:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804bdb2a:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdb2d:        0 	41 bc 95 ff ff ff    	mov    $0xffffff95,%r12d
ffffffff804bdb33:        0 	e8 e0 d9 ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bdb38:        0 	85 c0                	test   %eax,%eax
ffffffff804bdb3a:        0 	74 a4                	je     ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdb3c:        0 	8b 83 7c 04 00 00    	mov    0x47c(%rbx),%eax
ffffffff804bdb42:        0 	f6 c4 01             	test   $0x1,%ah
ffffffff804bdb45:        0 	74 79                	je     ffffffff804bdbc0 <tcp_recvmsg+0x752>
ffffffff804bdb47:        0 	40 f6 c5 02          	test   $0x2,%bpl
ffffffff804bdb4b:        0 	88 44 24 67          	mov    %al,0x67(%rsp)
ffffffff804bdb4f:        0 	75 09                	jne    ffffffff804bdb5a <tcp_recvmsg+0x6ec>
ffffffff804bdb51:        0 	66 c7 83 7c 04 00 00 	movw   $0x400,0x47c(%rbx)
ffffffff804bdb58:        0 	00 04 
ffffffff804bdb5a:        0 	48 8b 4c 24 30       	mov    0x30(%rsp),%rcx
ffffffff804bdb5f:        0 	45 89 f4             	mov    %r14d,%r12d
ffffffff804bdb62:        0 	8b 51 30             	mov    0x30(%rcx),%edx
ffffffff804bdb65:        0 	89 d0                	mov    %edx,%eax
ffffffff804bdb67:        0 	83 c8 01             	or     $0x1,%eax
ffffffff804bdb6a:        0 	45 85 f6             	test   %r14d,%r14d
ffffffff804bdb6d:        0 	89 41 30             	mov    %eax,0x30(%rcx)
ffffffff804bdb70:        0 	7e 33                	jle    ffffffff804bdba5 <tcp_recvmsg+0x737>
ffffffff804bdb72:        0 	40 80 e5 20          	and    $0x20,%bpl
ffffffff804bdb76:        0 	41 bc 01 00 00 00    	mov    $0x1,%r12d
ffffffff804bdb7c:        0 	0f 85 5e ff ff ff    	jne    ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdb82:        0 	48 8b 79 10          	mov    0x10(%rcx),%rdi
ffffffff804bdb86:        0 	48 8d 74 24 67       	lea    0x67(%rsp),%rsi
ffffffff804bdb8b:        0 	ba 01 00 00 00       	mov    $0x1,%edx
ffffffff804bdb90:        0 	41 bc f2 ff ff ff    	mov    $0xfffffff2,%r12d
ffffffff804bdb96:        0 	e8 8a cb fc ff       	callq  ffffffff8048a725 <memcpy_toiovec>
ffffffff804bdb9b:        0 	85 c0                	test   %eax,%eax
ffffffff804bdb9d:        0 	0f 85 3d ff ff ff    	jne    ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdba3:        0 	eb 10                	jmp    ffffffff804bdbb5 <tcp_recvmsg+0x747>
ffffffff804bdba5:        0 	48 8b 44 24 30       	mov    0x30(%rsp),%rax
ffffffff804bdbaa:        0 	83 ca 21             	or     $0x21,%edx
ffffffff804bdbad:        0 	89 50 30             	mov    %edx,0x30(%rax)
ffffffff804bdbb0:        0 	e9 2b ff ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbb5:        0 	41 bc 01 00 00 00    	mov    $0x1,%r12d
ffffffff804bdbbb:        0 	e9 20 ff ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbc0:        0 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bdbc3:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bdbc5:        0 	74 1d                	je     ffffffff804bdbe4 <tcp_recvmsg+0x776>
ffffffff804bdbc7:        0 	f6 43 38 01          	testb  $0x1,0x38(%rbx)
ffffffff804bdbcb:        0 	41 bc f5 ff ff ff    	mov    $0xfffffff5,%r12d
ffffffff804bdbd1:        0 	0f 84 09 ff ff ff    	je     ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbd7:        0 	eb 0b                	jmp    ffffffff804bdbe4 <tcp_recvmsg+0x776>
ffffffff804bdbd9:        0 	41 bc ea ff ff ff    	mov    $0xffffffea,%r12d
ffffffff804bdbdf:        0 	e9 fc fe ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbe4:        0 	45 31 e4             	xor    %r12d,%r12d
ffffffff804bdbe7:        0 	e9 f4 fe ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbec:     1206 	48 83 c4 68          	add    $0x68,%rsp
ffffffff804bdbf0:      498 	44 89 f8             	mov    %r15d,%eax
ffffffff804bdbf3:      387 	5b                   	pop    %rbx
ffffffff804bdbf4:      462 	5d                   	pop    %rbp
ffffffff804bdbf5:        0 	41 5c                	pop    %r12
ffffffff804bdbf7:      485 	41 5d                	pop    %r13
ffffffff804bdbf9:      466 	41 5e                	pop    %r14
ffffffff804bdbfb:        0 	41 5f                	pop    %r15
ffffffff804bdbfd:      796 	c3                   	retq   

no real hotspots either - but the code sequence is a bit too 
fractured, so this function's icache footprint is probably double the 
size of what it could be.

a bit of overhead (8%) leaks in from a callsite:

ffffffff804bd46e:      882 	41 57                	push   %r15
ffffffff804bd470:    15507 	48 89 f7             	mov    %rsi,%rdi

(this is used as a dynamic function pointer too so i'm just guessing 
that the common callsite would be sock_common_recvmsg().)
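for reference, the 8% figure is just the hit count on that one 
instruction divided by the function's total. a minimal sketch of the 
arithmetic, assuming the per-function total of 183368 hits from the 
listing header; the helper names are made up for illustration and are 
not part of any profiling tool:

```c
#include <stdio.h>

/*
 * Hypothetical helper: share of a function's profile hits that landed
 * on one instruction (or group of instructions), in percent.
 * Returns 0.0 for an empty profile rather than dividing by zero.
 */
static double hit_share_percent(unsigned long hits, unsigned long total)
{
	if (total == 0)
		return 0.0;
	return 100.0 * (double)hits / (double)total;
}

/* Hypothetical helper: print one line of the breakdown. */
static void print_share(const char *what, unsigned long hits,
			unsigned long total)
{
	printf("%-24s %6lu / %lu = %.1f%%\n",
	       what, hits, total, hit_share_percent(hits, total));
}
```

calling print_share("entry (mov %rsi,%rdi)", 15507, 183368) gives the 
share for the second instruction of tcp_recvmsg(), which is where the 
callsite cost shows up.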

perhaps this sequence, about 7% of the total overhead of this 
function, warrants mention:

ffffffff804bd7e2:        0 	e8 96 e5 ff ff       	callq  ffffffff804bbd7d <lock_sock>
ffffffff804bd7e7:        0 	eb 0d                	jmp    ffffffff804bd7f6 <tcp_recvmsg+0x388>
ffffffff804bd7e9:      152 	48 8d 74 24 58       	lea    0x58(%rsp),%rsi
ffffffff804bd7ee:      563 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7f1:       59 	e8 83 99 fc ff       	callq  ffffffff80487179 <sk_wait_data>
ffffffff804bd7f6:       86 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bd7fc:     8550 	0f 84 8a 00 00 00    	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd802:     4038 	44 89 f1             	mov    %r14d,%ecx

that's most likely lock_sock[_nested]()'s overhead leaking over into 
this function:

ffffffff804857cb:     9392 <lock_sock_nested>:
ffffffff804857cb:     9392 	41 55                	push   %r13
ffffffff804857cd:     4112 	41 54                	push   %r12
ffffffff804857cf:        2 	55                   	push   %rbp
ffffffff804857d0:        7 	48 8d 6f 40          	lea    0x40(%rdi),%rbp
ffffffff804857d4:     1515 	53                   	push   %rbx
ffffffff804857d5:        0 	48 89 fb             	mov    %rdi,%rbx
ffffffff804857d8:        4 	48 89 ef             	mov    %rbp,%rdi
ffffffff804857db:     1461 	48 83 ec 38          	sub    $0x38,%rsp
ffffffff804857df:        8 	e8 78 11 09 00       	callq  ffffffff8051695c <_spin_lock_bh>
ffffffff804857e4:     4827 	83 7b 44 00          	cmpl   $0x0,0x44(%rbx)
ffffffff804857e8:     2937 	74 6d                	je     ffffffff80485857 <lock_sock_nested+0x8c>
ffffffff804857ea:        0 	65 48 8b 14 25 00 00 	mov    %gs:0x0,%rdx
ffffffff804857f1:        0 	00 00 
ffffffff804857f3:        0 	fc                   	cld    
ffffffff804857f4:        0 	31 c0                	xor    %eax,%eax
ffffffff804857f6:        0 	48 89 e7             	mov    %rsp,%rdi
ffffffff804857f9:        0 	b9 0a 00 00 00       	mov    $0xa,%ecx
ffffffff804857fe:        0 	f3 ab                	rep stos %eax,%es:(%rdi)
ffffffff80485800:        0 	48 8d 44 24 18       	lea    0x18(%rsp),%rax
ffffffff80485805:        0 	4c 8d 63 48          	lea    0x48(%rbx),%r12
ffffffff80485809:        0 	48 89 54 24 08       	mov    %rdx,0x8(%rsp)
ffffffff8048580e:        0 	48 c7 44 24 10 80 78 	movq   $0xffffffff80247880,0x10(%rsp)
ffffffff80485815:        0 	24 80 
ffffffff80485817:        0 	48 89 44 24 18       	mov    %rax,0x18(%rsp)
ffffffff8048581c:        0 	48 89 44 24 20       	mov    %rax,0x20(%rsp)
ffffffff80485821:        0 	ba 02 00 00 00       	mov    $0x2,%edx
ffffffff80485826:        0 	48 89 e6             	mov    %rsp,%rsi
ffffffff80485829:        0 	4c 89 e7             	mov    %r12,%rdi
ffffffff8048582c:        0 	e8 fd 20 dc ff       	callq  ffffffff8024792e <prepare_to_wait_exclusive>
ffffffff80485831:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80485834:        0 	e8 18 11 09 00       	callq  ffffffff80516951 <_spin_unlock_bh>
ffffffff80485839:        0 	e8 52 f9 08 00       	callq  ffffffff80515190 <schedule>
ffffffff8048583e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80485841:        0 	e8 16 11 09 00       	callq  ffffffff8051695c <_spin_lock_bh>
ffffffff80485846:        0 	83 7b 44 00          	cmpl   $0x0,0x44(%rbx)
ffffffff8048584a:        0 	75 d5                	jne    ffffffff80485821 <lock_sock_nested+0x56>
ffffffff8048584c:        0 	48 89 e6             	mov    %rsp,%rsi
ffffffff8048584f:        0 	4c 89 e7             	mov    %r12,%rdi
ffffffff80485852:        0 	e8 7a 20 dc ff       	callq  ffffffff802478d1 <finish_wait>
ffffffff80485857:       88 	c7 43 44 01 00 00 00 	movl   $0x1,0x44(%rbx)
ffffffff8048585e:     3431 	fe 43 40             	incb   0x40(%rbx)
ffffffff80485861:     1568 	e8 00 4e db ff       	callq  ffffffff8023a666 <local_bh_enable>
ffffffff80485866:     1548 	48 83 c4 38          	add    $0x38,%rsp
ffffffff8048586a:       61 	5b                   	pop    %rbx
ffffffff8048586b:     1568 	5d                   	pop    %rbp
ffffffff8048586c:       36 	41 5c                	pop    %r12
ffffffff8048586e:        0 	41 5d                	pop    %r13
ffffffff80485870:     2753 	c3                   	retq   

which is:

void lock_sock_nested(struct sock *sk, int subclass)
{
	might_sleep();
	spin_lock_bh(&sk->sk_lock.slock);
	if (sk->sk_lock.owned)
		__lock_sock(sk);
	sk->sk_lock.owned = 1;
	spin_unlock(&sk->sk_lock.slock);

that branch in the middle should perhaps be:

		if (unlikely(sk->sk_lock.owned))

to make the common, uncontended path of this function fall through.
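a small userspace sketch of the idea, under stated assumptions: the 
unlikely() definition below mirrors the usual __builtin_expect() form 
(GCC/Clang), and struct toy_lock plus the toy_* functions are 
hypothetical stand-ins, not the kernel's sk_lock or __lock_sock(). the 
hint only affects code layout; behaviour is the same with or without 
it:

```c
#include <assert.h>

/* Assumption: same shape as the kernel macro, built on a GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the "owned" flag in sk->sk_lock. */
struct toy_lock {
	int owned;
	int slow_path_taken;	/* counts contended acquisitions */
};

/*
 * Hypothetical stand-in for __lock_sock(). The real function sleeps
 * until the current owner releases the lock; here we only record that
 * the slow path ran and pretend the owner went away.
 */
static void toy_lock_slow(struct toy_lock *lk)
{
	lk->slow_path_taken++;
	lk->owned = 0;
}

static void toy_lock_acquire(struct toy_lock *lk)
{
	/*
	 * The hint tells the compiler that the contended case is rare,
	 * so it can place the slow-path call out of line and let the
	 * uncontended case run straight through.
	 */
	if (unlikely(lk->owned))
		toy_lock_slow(lk);
	lk->owned = 1;
}

static void toy_lock_release(struct toy_lock *lk)
{
	assert(lk->owned);
	lk->owned = 0;
}
```

whether the generated code actually changes depends on the compiler 
and optimization level, so the disassembly would need to be rechecked 
after such a change.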

	Ingo


* tcp_recvmsg(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:19                             ` Ingo Molnar
From: Ingo Molnar @ 2008-11-17 21:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger


* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:

> 100.000000 total
> ................
>   1.833688 tcp_recvmsg

                      hits (total: 183368)
                 .........
ffffffff804bd46e:      882 <tcp_recvmsg>:
ffffffff804bd46e:      882 	41 57                	push   %r15
ffffffff804bd470:    15507 	48 89 f7             	mov    %rsi,%rdi
ffffffff804bd473:      179 	41 56                	push   %r14
ffffffff804bd475:        0 	49 89 ce             	mov    %rcx,%r14
ffffffff804bd478:      744 	41 55                	push   %r13
ffffffff804bd47a:      165 	41 54                	push   %r12
ffffffff804bd47c:        0 	45 89 c4             	mov    %r8d,%r12d
ffffffff804bd47f:      692 	55                   	push   %rbp
ffffffff804bd480:      178 	44 89 cd             	mov    %r9d,%ebp
ffffffff804bd483:     3434 	53                   	push   %rbx
ffffffff804bd484:      685 	48 89 f3             	mov    %rsi,%rbx
ffffffff804bd487:       11 	48 83 ec 68          	sub    $0x68,%rsp
ffffffff804bd48b:      949 	48 89 54 24 30       	mov    %rdx,0x30(%rsp)
ffffffff804bd490:        7 	e8 e8 e8 ff ff       	callq  ffffffff804bbd7d <lock_sock>
ffffffff804bd495:     1771 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bd498:     6176 	3c 0a                	cmp    $0xa,%al
ffffffff804bd49a:        0 	0f 84 3a 06 00 00    	je     ffffffff804bdada <tcp_recvmsg+0x66c>
ffffffff804bd4a0:     3121 	31 c0                	xor    %eax,%eax
ffffffff804bd4a2:      195 	45 85 e4             	test   %r12d,%r12d
ffffffff804bd4a5:        0 	75 07                	jne    ffffffff804bd4ae <tcp_recvmsg+0x40>
ffffffff804bd4a7:      926 	48 8b 83 68 01 00 00 	mov    0x168(%rbx),%rax
ffffffff804bd4ae:      189 	40 f6 c5 01          	test   $0x1,%bpl
ffffffff804bd4b2:        0 	48 89 44 24 58       	mov    %rax,0x58(%rsp)
ffffffff804bd4b7:      819 	0f 85 33 06 00 00    	jne    ffffffff804bdaf0 <tcp_recvmsg+0x682>
ffffffff804bd4bd:      216 	89 e8                	mov    %ebp,%eax
ffffffff804bd4bf:        0 	83 e0 02             	and    $0x2,%eax
ffffffff804bd4c2:      638 	89 44 24 3c          	mov    %eax,0x3c(%rsp)
ffffffff804bd4c6:      177 	75 0e                	jne    ffffffff804bd4d6 <tcp_recvmsg+0x68>
ffffffff804bd4c8:        0 	48 8d 93 f4 03 00 00 	lea    0x3f4(%rbx),%rdx
ffffffff804bd4cf:      661 	48 89 54 24 40       	mov    %rdx,0x40(%rsp)
ffffffff804bd4d4:      195 	eb 14                	jmp    ffffffff804bd4ea <tcp_recvmsg+0x7c>
ffffffff804bd4d6:        0 	8b 83 f4 03 00 00    	mov    0x3f4(%rbx),%eax
ffffffff804bd4dc:        0 	48 8d 4c 24 60       	lea    0x60(%rsp),%rcx
ffffffff804bd4e1:        0 	48 89 4c 24 40       	mov    %rcx,0x40(%rsp)
ffffffff804bd4e6:        0 	89 44 24 60          	mov    %eax,0x60(%rsp)
ffffffff804bd4ea:      867 	89 ee                	mov    %ebp,%esi
ffffffff804bd4ec:      210 	44 89 f2             	mov    %r14d,%edx
ffffffff804bd4ef:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd4f2:      894 	81 e6 00 01 00 00    	and    $0x100,%esi
ffffffff804bd4f8:      192 	45 31 ff             	xor    %r15d,%r15d
ffffffff804bd4fb:        0 	e8 fc df ff ff       	callq  ffffffff804bb4fc <sock_rcvlowat>
ffffffff804bd500:      853 	89 44 24 4c          	mov    %eax,0x4c(%rsp)
ffffffff804bd504:     1857 	48 8d 83 a8 00 00 00 	lea    0xa8(%rbx),%rax
ffffffff804bd50b:        0 	89 e9                	mov    %ebp,%ecx
ffffffff804bd50d:      595 	48 8d 93 10 04 00 00 	lea    0x410(%rbx),%rdx
ffffffff804bd514:      263 	83 e1 22             	and    $0x22,%ecx
ffffffff804bd517:        0 	83 e5 20             	and    $0x20,%ebp
ffffffff804bd51a:      601 	48 89 44 24 28       	mov    %rax,0x28(%rsp)
ffffffff804bd51f:      254 	48 8d 83 f8 04 00 00 	lea    0x4f8(%rbx),%rax
ffffffff804bd526:        2 	48 c7 44 24 50 00 00 	movq   $0x0,0x50(%rsp)
ffffffff804bd52d:        0 	00 00 
ffffffff804bd52f:      578 	48 89 54 24 20       	mov    %rdx,0x20(%rsp)
ffffffff804bd534:      290 	89 4c 24 1c          	mov    %ecx,0x1c(%rsp)
ffffffff804bd538:        1 	48 89 44 24 10       	mov    %rax,0x10(%rsp)
ffffffff804bd53d:      593 	89 6c 24 0c          	mov    %ebp,0xc(%rsp)
ffffffff804bd541:      568 	66 83 bb 7c 04 00 00 	cmpw   $0x0,0x47c(%rbx)
ffffffff804bd548:        0 	00 
ffffffff804bd549:     3956 	74 55                	je     ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd54b:        0 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bd550:        0 	8b 83 84 05 00 00    	mov    0x584(%rbx),%eax
ffffffff804bd556:        0 	3b 02                	cmp    (%rdx),%eax
ffffffff804bd558:        0 	75 46                	jne    ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd55a:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bd55d:        0 	0f 85 e6 04 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd563:        0 	65 48 8b 3c 25 00 00 	mov    %gs:0x0,%rdi
ffffffff804bd56a:        0 	00 00 
ffffffff804bd56c:        0 	e8 4c e1 ff ff       	callq  ffffffff804bb6bd <signal_pending>
ffffffff804bd571:        0 	85 c0                	test   %eax,%eax
ffffffff804bd573:        0 	74 2b                	je     ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd575:        0 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
ffffffff804bd57a:        0 	41 bf f5 ff ff ff    	mov    $0xfffffff5,%r15d
ffffffff804bd580:        0 	48 85 d2             	test   %rdx,%rdx
ffffffff804bd583:        0 	0f 84 c0 04 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd589:        0 	48 b8 ff ff ff ff ff 	mov    $0x7fffffffffffffff,%rax
ffffffff804bd590:        0 	ff ff 7f 
ffffffff804bd593:        0 	66 41 bf 00 fe       	mov    $0xfe00,%r15w
ffffffff804bd598:        0 	48 39 c2             	cmp    %rax,%rdx
ffffffff804bd59b:        0 	e9 89 01 00 00       	jmpq   ffffffff804bd729 <tcp_recvmsg+0x2bb>
ffffffff804bd5a0:      597 	48 8b ab a8 00 00 00 	mov    0xa8(%rbx),%rbp
ffffffff804bd5a7:     4601 	48 3b 6c 24 28       	cmp    0x28(%rsp),%rbp
ffffffff804bd5ac:        1 	b8 00 00 00 00       	mov    $0x0,%eax
ffffffff804bd5b1:     1769 	48 0f 44 e8          	cmove  %rax,%rbp
ffffffff804bd5b5:      473 	48 85 ed             	test   %rbp,%rbp
ffffffff804bd5b8:        0 	74 76                	je     ffffffff804bd630 <tcp_recvmsg+0x1c2>
ffffffff804bd5ba:      595 	48 8b 4c 24 40       	mov    0x40(%rsp),%rcx
ffffffff804bd5bf:      897 	8b 55 50             	mov    0x50(%rbp),%edx
ffffffff804bd5c2:       89 	8b 31                	mov    (%rcx),%esi
ffffffff804bd5c4:      581 	41 89 f5             	mov    %esi,%r13d
ffffffff804bd5c7:      301 	41 29 d5             	sub    %edx,%r13d
ffffffff804bd5ca:       33 	79 10                	jns    ffffffff804bd5dc <tcp_recvmsg+0x16e>
ffffffff804bd5cc:        0 	48 c7 c7 48 d9 6a 80 	mov    $0xffffffff806ad948,%rdi
ffffffff804bd5d3:        0 	31 c0                	xor    %eax,%eax
ffffffff804bd5d5:        0 	e8 9a 97 d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804bd5da:        0 	eb 54                	jmp    ffffffff804bd630 <tcp_recvmsg+0x1c2>
ffffffff804bd5dc:      584 	8b 85 b8 00 00 00    	mov    0xb8(%rbp),%eax
ffffffff804bd5e2:     1061 	48 8b 95 d0 00 00 00 	mov    0xd0(%rbp),%rdx
ffffffff804bd5e9:        1 	8a 54 02 0d          	mov    0xd(%rdx,%rax,1),%dl
ffffffff804bd5ed:        0 	88 d0                	mov    %dl,%al
ffffffff804bd5ef:      876 	83 e0 02             	and    $0x2,%eax
ffffffff804bd5f2:        0 	3c 01                	cmp    $0x1,%al
ffffffff804bd5f4:        0 	8b 45 68             	mov    0x68(%rbp),%eax
ffffffff804bd5f7:      909 	41 83 d5 ff          	adc    $0xffffffffffffffff,%r13d
ffffffff804bd5fb:        0 	41 39 c5             	cmp    %eax,%r13d
ffffffff804bd5fe:        0 	0f 82 df 02 00 00    	jb     ffffffff804bd8e3 <tcp_recvmsg+0x475>
ffffffff804bd604:        0 	80 e2 01             	and    $0x1,%dl
ffffffff804bd607:        0 	0f 85 16 04 00 00    	jne    ffffffff804bda23 <tcp_recvmsg+0x5b5>
ffffffff804bd60d:        0 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bd612:        0 	75 11                	jne    ffffffff804bd625 <tcp_recvmsg+0x1b7>
ffffffff804bd614:        0 	be 53 05 00 00       	mov    $0x553,%esi
ffffffff804bd619:        0 	48 c7 c7 13 d9 6a 80 	mov    $0xffffffff806ad913,%rdi
ffffffff804bd620:        0 	e8 90 8b d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804bd625:        0 	48 8b 6d 00          	mov    0x0(%rbp),%rbp
ffffffff804bd629:        0 	48 3b 6c 24 28       	cmp    0x28(%rsp),%rbp
ffffffff804bd62e:        0 	75 85                	jne    ffffffff804bd5b5 <tcp_recvmsg+0x147>
ffffffff804bd630:       80 	44 3b 7c 24 4c       	cmp    0x4c(%rsp),%r15d
ffffffff804bd635:     4164 	7c 0b                	jl     ffffffff804bd642 <tcp_recvmsg+0x1d4>
ffffffff804bd637:        0 	48 83 7b 68 00       	cmpq   $0x0,0x68(%rbx)
ffffffff804bd63c:        0 	0f 84 07 04 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd642:        1 	45 85 ff             	test   %r15d,%r15d
ffffffff804bd645:     3438 	74 49                	je     ffffffff804bd690 <tcp_recvmsg+0x222>
ffffffff804bd647:        0 	83 bb 44 01 00 00 00 	cmpl   $0x0,0x144(%rbx)
ffffffff804bd64e:        0 	0f 85 f5 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd654:        0 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bd657:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bd659:        0 	0f 84 ea 03 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd65f:        0 	f6 43 38 01          	testb  $0x1,0x38(%rbx)
ffffffff804bd663:        0 	0f 85 e0 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd669:        0 	48 83 7c 24 58 00    	cmpq   $0x0,0x58(%rsp)
ffffffff804bd66f:        0 	0f 84 d4 03 00 00    	je     ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd675:        0 	65 48 8b 3c 25 00 00 	mov    %gs:0x0,%rdi
ffffffff804bd67c:        0 	00 00 
ffffffff804bd67e:        0 	e8 3a e0 ff ff       	callq  ffffffff804bb6bd <signal_pending>
ffffffff804bd683:        0 	85 c0                	test   %eax,%eax
ffffffff804bd685:        0 	0f 84 ac 00 00 00    	je     ffffffff804bd737 <tcp_recvmsg+0x2c9>
ffffffff804bd68b:        0 	e9 b9 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd690:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804bd695:     4166 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd698:        0 	e8 7b de ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bd69d:        0 	85 c0                	test   %eax,%eax
ffffffff804bd69f:      276 	0f 85 a4 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6a5:      126 	83 bb 44 01 00 00 00 	cmpl   $0x0,0x144(%rbx)
ffffffff804bd6ac:        0 	74 10                	je     ffffffff804bd6be <tcp_recvmsg+0x250>
ffffffff804bd6ae:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd6b1:        0 	e8 00 df ff ff       	callq  ffffffff804bb5b6 <sock_error>
ffffffff804bd6b6:        0 	41 89 c7             	mov    %eax,%r15d
ffffffff804bd6b9:        0 	e9 8b 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6be:      112 	f6 43 38 01          	testb  $0x1,0x38(%rbx)
ffffffff804bd6c2:     3451 	0f 85 81 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6c8:      497 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bd6cb:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bd6cd:      113 	75 20                	jne    ffffffff804bd6ef <tcp_recvmsg+0x281>
ffffffff804bd6cf:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804bd6d4:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd6d7:        0 	e8 3c de ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bd6dc:        0 	85 c0                	test   %eax,%eax
ffffffff804bd6de:        0 	0f 85 65 03 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6e4:        0 	41 bf 95 ff ff ff    	mov    $0xffffff95,%r15d
ffffffff804bd6ea:        0 	e9 5a 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6ef:      118 	48 83 7c 24 58 00    	cmpq   $0x0,0x58(%rsp)
ffffffff804bd6f5:      398 	75 0b                	jne    ffffffff804bd702 <tcp_recvmsg+0x294>
ffffffff804bd6f7:        0 	41 bf f5 ff ff ff    	mov    $0xfffffff5,%r15d
ffffffff804bd6fd:        0 	e9 47 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd702:        0 	65 48 8b 3c 25 00 00 	mov    %gs:0x0,%rdi
ffffffff804bd709:        0 	00 00 
ffffffff804bd70b:     2993 	e8 ad df ff ff       	callq  ffffffff804bb6bd <signal_pending>
ffffffff804bd710:      200 	85 c0                	test   %eax,%eax
ffffffff804bd712:        0 	74 23                	je     ffffffff804bd737 <tcp_recvmsg+0x2c9>
ffffffff804bd714:        0 	48 b8 ff ff ff ff ff 	mov    $0x7fffffffffffffff,%rax
ffffffff804bd71b:        0 	ff ff 7f 
ffffffff804bd71e:        0 	48 39 44 24 58       	cmp    %rax,0x58(%rsp)
ffffffff804bd723:        0 	41 bf 00 fe ff ff    	mov    $0xfffffe00,%r15d
ffffffff804bd729:        0 	b8 fc ff ff ff       	mov    $0xfffffffc,%eax
ffffffff804bd72e:        0 	44 0f 45 f8          	cmovne %eax,%r15d
ffffffff804bd732:        0 	e9 12 03 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd737:      207 	44 89 fe             	mov    %r15d,%esi
ffffffff804bd73a:      198 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd73d:        0 	e8 cc e9 ff ff       	callq  ffffffff804bc10e <tcp_cleanup_rbuf>
ffffffff804bd742:      227 	83 3d 9b ad 3f 00 00 	cmpl   $0x0,0x3fad9b(%rip)        # ffffffff808b84e4 <sysctl_tcp_low_latency>
ffffffff804bd749:      210 	0f 85 81 00 00 00    	jne    ffffffff804bd7d0 <tcp_recvmsg+0x362>
ffffffff804bd74f:        0 	48 8b ab 28 04 00 00 	mov    0x428(%rbx),%rbp
ffffffff804bd756:        0 	48 3b 6c 24 50       	cmp    0x50(%rsp),%rbp
ffffffff804bd75b:      232 	75 73                	jne    ffffffff804bd7d0 <tcp_recvmsg+0x362>
ffffffff804bd75d:        0 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bd763:        7 	75 27                	jne    ffffffff804bd78c <tcp_recvmsg+0x31e>
ffffffff804bd765:      229 	83 7c 24 1c 00       	cmpl   $0x0,0x1c(%rsp)
ffffffff804bd76a:       30 	75 20                	jne    ffffffff804bd78c <tcp_recvmsg+0x31e>
ffffffff804bd76c:        7 	48 8b 54 24 30       	mov    0x30(%rsp),%rdx
ffffffff804bd771:      191 	65 48 8b 2c 25 00 00 	mov    %gs:0x0,%rbp
ffffffff804bd778:        0 	00 00 
ffffffff804bd77a:       12 	48 89 ab 28 04 00 00 	mov    %rbp,0x428(%rbx)
ffffffff804bd781:     2617 	48 8b 42 10          	mov    0x10(%rdx),%rax
ffffffff804bd785:      670 	48 89 83 30 04 00 00 	mov    %rax,0x430(%rbx)
ffffffff804bd78c:       11 	8b 83 f4 03 00 00    	mov    0x3f4(%rbx),%eax
ffffffff804bd792:      188 	3b 83 f0 03 00 00    	cmp    0x3f0(%rbx),%eax
ffffffff804bd798:      166 	44 89 b3 3c 04 00 00 	mov    %r14d,0x43c(%rbx)
ffffffff804bd79f:        5 	74 18                	je     ffffffff804bd7b9 <tcp_recvmsg+0x34b>
ffffffff804bd7a1:        0 	83 7c 24 1c 00       	cmpl   $0x0,0x1c(%rsp)
ffffffff804bd7a6:        0 	75 11                	jne    ffffffff804bd7b9 <tcp_recvmsg+0x34b>
ffffffff804bd7a8:        0 	be 92 05 00 00       	mov    $0x592,%esi
ffffffff804bd7ad:        0 	48 c7 c7 13 d9 6a 80 	mov    $0xffffffff806ad913,%rdi
ffffffff804bd7b4:        0 	e8 fc 89 d7 ff       	callq  ffffffff802361b5 <warn_on_slowpath>
ffffffff804bd7b9:      336 	48 8b 4c 24 20       	mov    0x20(%rsp),%rcx
ffffffff804bd7be:      302 	48 39 8b 10 04 00 00 	cmp    %rcx,0x410(%rbx)
ffffffff804bd7c5:     1176 	48 89 6c 24 50       	mov    %rbp,0x50(%rsp)
ffffffff804bd7ca:      244 	0f 85 81 00 00 00    	jne    ffffffff804bd851 <tcp_recvmsg+0x3e3>
ffffffff804bd7d0:      135 	44 3b 7c 24 4c       	cmp    0x4c(%rsp),%r15d
ffffffff804bd7d5:      112 	7c 12                	jl     ffffffff804bd7e9 <tcp_recvmsg+0x37b>
ffffffff804bd7d7:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7da:        0 	e8 57 7f fc ff       	callq  ffffffff80485736 <release_sock>
ffffffff804bd7df:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7e2:        0 	e8 96 e5 ff ff       	callq  ffffffff804bbd7d <lock_sock>
ffffffff804bd7e7:        0 	eb 0d                	jmp    ffffffff804bd7f6 <tcp_recvmsg+0x388>
ffffffff804bd7e9:      152 	48 8d 74 24 58       	lea    0x58(%rsp),%rsi
ffffffff804bd7ee:      563 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7f1:       59 	e8 83 99 fc ff       	callq  ffffffff80487179 <sk_wait_data>
ffffffff804bd7f6:       86 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bd7fc:     8550 	0f 84 8a 00 00 00    	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd802:     4038 	44 89 f1             	mov    %r14d,%ecx
ffffffff804bd805:      900 	2b 8b 3c 04 00 00    	sub    0x43c(%rbx),%ecx
ffffffff804bd80b:        5 	74 28                	je     ffffffff804bd835 <tcp_recvmsg+0x3c7>
ffffffff804bd80d:        0 	48 8b 05 ac 3e 5f 00 	mov    0x5f3eac(%rip),%rax        # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bd814:        1 	41 01 cf             	add    %ecx,%r15d
ffffffff804bd817:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804bd81e:        0 	00 
ffffffff804bd81f:        0 	89 d2                	mov    %edx,%edx
ffffffff804bd821:        0 	48 f7 d0             	not    %rax
ffffffff804bd824:        0 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804bd828:        0 	48 63 d1             	movslq %ecx,%rdx
ffffffff804bd82b:        0 	49 29 d6             	sub    %rdx,%r14
ffffffff804bd82e:        0 	48 01 90 b8 00 00 00 	add    %rdx,0xb8(%rax)
ffffffff804bd835:        4 	8b 83 f0 03 00 00    	mov    0x3f0(%rbx),%eax
ffffffff804bd83b:      373 	3b 83 f4 03 00 00    	cmp    0x3f4(%rbx),%eax
ffffffff804bd841:     3604 	75 49                	jne    ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd843:        0 	48 8b 44 24 20       	mov    0x20(%rsp),%rax
ffffffff804bd848:      971 	48 39 83 10 04 00 00 	cmp    %rax,0x410(%rbx)
ffffffff804bd84f:       11 	74 3b                	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd851:        6 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd854:      267 	e8 94 e6 ff ff       	callq  ffffffff804bbeed <tcp_prequeue_process>
ffffffff804bd859:        0 	44 89 f1             	mov    %r14d,%ecx
ffffffff804bd85c:      879 	2b 8b 3c 04 00 00    	sub    0x43c(%rbx),%ecx
ffffffff804bd862:      256 	74 28                	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd864:        0 	48 8b 05 55 3e 5f 00 	mov    0x5f3e55(%rip),%rax        # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bd86b:      116 	41 01 cf             	add    %ecx,%r15d
ffffffff804bd86e:       17 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804bd875:        0 	00 
ffffffff804bd876:        0 	89 d2                	mov    %edx,%edx
ffffffff804bd878:        1 	48 f7 d0             	not    %rax
ffffffff804bd87b:        5 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804bd87f:        0 	48 63 d1             	movslq %ecx,%rdx
ffffffff804bd882:        6 	49 29 d6             	sub    %rdx,%r14
ffffffff804bd885:        7 	48 01 90 c0 00 00 00 	add    %rdx,0xc0(%rax)
ffffffff804bd88c:       11 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bd891:      438 	0f 84 a9 01 00 00    	je     ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd897:        0 	8b 44 24 60          	mov    0x60(%rsp),%eax
ffffffff804bd89b:        0 	3b 83 f4 03 00 00    	cmp    0x3f4(%rbx),%eax
ffffffff804bd8a1:        0 	0f 84 99 01 00 00    	je     ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd8a7:        0 	e8 19 ad fd ff       	callq  ffffffff804985c5 <net_ratelimit>
ffffffff804bd8ac:        0 	85 c0                	test   %eax,%eax
ffffffff804bd8ae:        0 	74 24                	je     ffffffff804bd8d4 <tcp_recvmsg+0x466>
ffffffff804bd8b0:        0 	65 48 8b 34 25 00 00 	mov    %gs:0x0,%rsi
ffffffff804bd8b7:        0 	00 00 
ffffffff804bd8b9:        0 	8b 96 70 01 00 00    	mov    0x170(%rsi),%edx
ffffffff804bd8bf:        0 	48 c7 c7 6a d9 6a 80 	mov    $0xffffffff806ad96a,%rdi
ffffffff804bd8c6:        0 	48 81 c6 68 03 00 00 	add    $0x368,%rsi
ffffffff804bd8cd:        0 	31 c0                	xor    %eax,%eax
ffffffff804bd8cf:        0 	e8 a0 94 d7 ff       	callq  ffffffff80236d74 <printk>
ffffffff804bd8d4:        0 	8b 83 f4 03 00 00    	mov    0x3f4(%rbx),%eax
ffffffff804bd8da:        0 	89 44 24 60          	mov    %eax,0x60(%rsp)
ffffffff804bd8de:        0 	e9 5d 01 00 00       	jmpq   ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd8e3:     4077 	44 29 e8             	sub    %r13d,%eax
ffffffff804bd8e6:     6031 	4d 89 f4             	mov    %r14,%r12
ffffffff804bd8e9:        0 	4c 39 f0             	cmp    %r14,%rax
ffffffff804bd8ec:        0 	4c 0f 46 e0          	cmovbe %rax,%r12
ffffffff804bd8f0:      934 	66 83 bb 7c 04 00 00 	cmpw   $0x0,0x47c(%rbx)
ffffffff804bd8f7:        0 	00 
ffffffff804bd8f8:        0 	74 38                	je     ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd8fa:        0 	8b 83 84 05 00 00    	mov    0x584(%rbx),%eax
ffffffff804bd900:        0 	29 f0                	sub    %esi,%eax
ffffffff804bd902:        0 	89 c2                	mov    %eax,%edx
ffffffff804bd904:        0 	4c 39 e2             	cmp    %r12,%rdx
ffffffff804bd907:        0 	73 29                	jae    ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd909:        0 	85 c0                	test   %eax,%eax
ffffffff804bd90b:        0 	74 05                	je     ffffffff804bd912 <tcp_recvmsg+0x4a4>
ffffffff804bd90d:        0 	49 89 d4             	mov    %rdx,%r12
ffffffff804bd910:        0 	eb 20                	jmp    ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd912:        0 	be 02 00 00 00       	mov    $0x2,%esi
ffffffff804bd917:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd91a:        0 	e8 f9 db ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bd91f:        0 	85 c0                	test   %eax,%eax
ffffffff804bd921:        0 	75 0f                	jne    ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd923:        0 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bd928:        0 	41 ff c5             	inc    %r13d
ffffffff804bd92b:        0 	ff 02                	incl   (%rdx)
ffffffff804bd92d:        0 	49 ff cc             	dec    %r12
ffffffff804bd930:        0 	74 4c                	je     ffffffff804bd97e <tcp_recvmsg+0x510>
ffffffff804bd932:      906 	83 7c 24 0c 00       	cmpl   $0x0,0xc(%rsp)
ffffffff804bd937:     6039 	75 2f                	jne    ffffffff804bd968 <tcp_recvmsg+0x4fa>
ffffffff804bd939:       48 	48 8b 4c 24 30       	mov    0x30(%rsp),%rcx
ffffffff804bd93e:     1412 	44 89 ee             	mov    %r13d,%esi
ffffffff804bd941:     6648 	48 89 ef             	mov    %rbp,%rdi
ffffffff804bd944:        0 	48 8b 51 10          	mov    0x10(%rcx),%rdx
ffffffff804bd948:     1524 	44 89 e1             	mov    %r12d,%ecx
ffffffff804bd94b:      167 	e8 c5 d3 fc ff       	callq  ffffffff8048ad15 <skb_copy_datagram_iovec>
ffffffff804bd950:        0 	85 c0                	test   %eax,%eax
ffffffff804bd952:     1038 	74 14                	je     ffffffff804bd968 <tcp_recvmsg+0x4fa>
ffffffff804bd954:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bd957:        0 	0f 85 ec 00 00 00    	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd95d:        0 	41 bf f2 ff ff ff    	mov    $0xfffffff2,%r15d
ffffffff804bd963:        0 	e9 e1 00 00 00       	jmpq   ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd968:       28 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bd96d:     5713 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd970:      241 	45 01 e7             	add    %r12d,%r15d
ffffffff804bd973:       27 	4d 29 e6             	sub    %r12,%r14
ffffffff804bd976:      626 	44 01 22             	add    %r12d,(%rdx)
ffffffff804bd979:      221 	e8 fe 11 00 00       	callq  ffffffff804beb7c <tcp_rcv_space_adjust>
ffffffff804bd97e:     1425 	66 83 bb 7c 04 00 00 	cmpw   $0x0,0x47c(%rbx)
ffffffff804bd985:        0 	00 
ffffffff804bd986:     3430 	74 63                	je     ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd988:        0 	8b 8b f4 03 00 00    	mov    0x3f4(%rbx),%ecx
ffffffff804bd98e:        0 	39 8b 84 05 00 00    	cmp    %ecx,0x584(%rbx)
ffffffff804bd994:        0 	79 55                	jns    ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd996:        0 	48 8b 44 24 10       	mov    0x10(%rsp),%rax
ffffffff804bd99b:        0 	48 39 83 f8 04 00 00 	cmp    %rax,0x4f8(%rbx)
ffffffff804bd9a2:        0 	66 c7 83 7c 04 00 00 	movw   $0x0,0x47c(%rbx)
ffffffff804bd9a9:        0 	00 00 
ffffffff804bd9ab:        0 	75 3e                	jne    ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9ad:        0 	83 bb c0 04 00 00 00 	cmpl   $0x0,0x4c0(%rbx)
ffffffff804bd9b4:        0 	74 35                	je     ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9b6:        0 	8b 83 94 00 00 00    	mov    0x94(%rbx),%eax
ffffffff804bd9bc:        0 	3b 43 3c             	cmp    0x3c(%rbx),%eax
ffffffff804bd9bf:        0 	7d 2a                	jge    ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9c1:        0 	0f b7 83 e8 03 00 00 	movzwl 0x3e8(%rbx),%eax
ffffffff804bd9c8:        0 	8a 8b 9d 04 00 00    	mov    0x49d(%rbx),%cl
ffffffff804bd9ce:        0 	8b 93 44 04 00 00    	mov    0x444(%rbx),%edx
ffffffff804bd9d4:        0 	83 e1 0f             	and    $0xf,%ecx
ffffffff804bd9d7:        0 	c1 e0 1a             	shl    $0x1a,%eax
ffffffff804bd9da:        0 	d3 ea                	shr    %cl,%edx
ffffffff804bd9dc:        0 	09 d0                	or     %edx,%eax
ffffffff804bd9de:        0 	0d 00 00 10 00       	or     $0x100000,%eax
ffffffff804bd9e3:        0 	0f c8                	bswap  %eax
ffffffff804bd9e5:        0 	89 83 ec 03 00 00    	mov    %eax,0x3ec(%rbx)
ffffffff804bd9eb:        0 	8b 55 68             	mov    0x68(%rbp),%edx
ffffffff804bd9ee:     1655 	44 89 e8             	mov    %r13d,%eax
ffffffff804bd9f1:       32 	4c 01 e0             	add    %r12,%rax
ffffffff804bd9f4:        0 	48 39 d0             	cmp    %rdx,%rax
ffffffff804bd9f7:      847 	72 47                	jb     ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd9f9:        0 	8b 95 b8 00 00 00    	mov    0xb8(%rbp),%edx
ffffffff804bd9ff:       80 	48 8b 85 d0 00 00 00 	mov    0xd0(%rbp),%rax
ffffffff804bda06:      441 	f6 44 02 0d 01       	testb  $0x1,0xd(%rdx,%rax,1)
ffffffff804bda0b:        0 	75 16                	jne    ffffffff804bda23 <tcp_recvmsg+0x5b5>
ffffffff804bda0d:        0 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bda12:      453 	75 2c                	jne    ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bda14:        0 	31 d2                	xor    %edx,%edx
ffffffff804bda16:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff804bda19:      477 	48 89 df             	mov    %rbx,%rdi
ffffffff804bda1c:        0 	e8 0f e4 ff ff       	callq  ffffffff804bbe30 <sk_eat_skb>
ffffffff804bda21:      562 	eb 1d                	jmp    ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bda23:        0 	48 8b 54 24 40       	mov    0x40(%rsp),%rdx
ffffffff804bda28:        0 	ff 02                	incl   (%rdx)
ffffffff804bda2a:        0 	83 7c 24 3c 00       	cmpl   $0x0,0x3c(%rsp)
ffffffff804bda2f:        0 	75 18                	jne    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bda31:        0 	31 d2                	xor    %edx,%edx
ffffffff804bda33:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff804bda36:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bda39:        0 	e8 f2 e3 ff ff       	callq  ffffffff804bbe30 <sk_eat_skb>
ffffffff804bda3e:        0 	eb 09                	jmp    ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bda40:      959 	4d 85 f6             	test   %r14,%r14
ffffffff804bda43:     4766 	0f 85 f8 fa ff ff    	jne    ffffffff804bd541 <tcp_recvmsg+0xd3>
ffffffff804bda49:      217 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bda4f:     2084 	74 71                	je     ffffffff804bdac2 <tcp_recvmsg+0x654>
ffffffff804bda51:       40 	48 8d 83 10 04 00 00 	lea    0x410(%rbx),%rax
ffffffff804bda58:      448 	48 39 83 10 04 00 00 	cmp    %rax,0x410(%rbx)
ffffffff804bda5f:        4 	74 4c                	je     ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda61:        0 	31 c0                	xor    %eax,%eax
ffffffff804bda63:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bda66:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bda69:        0 	41 0f 4f c6          	cmovg  %r14d,%eax
ffffffff804bda6d:        0 	89 83 3c 04 00 00    	mov    %eax,0x43c(%rbx)
ffffffff804bda73:        0 	e8 75 e4 ff ff       	callq  ffffffff804bbeed <tcp_prequeue_process>
ffffffff804bda78:        0 	45 85 ff             	test   %r15d,%r15d
ffffffff804bda7b:        0 	7e 30                	jle    ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda7d:        0 	44 89 f1             	mov    %r14d,%ecx
ffffffff804bda80:        0 	2b 8b 3c 04 00 00    	sub    0x43c(%rbx),%ecx
ffffffff804bda86:        0 	74 25                	je     ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda88:        0 	48 8b 05 31 3c 5f 00 	mov    0x5f3c31(%rip),%rax        # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bda8f:        0 	41 01 cf             	add    %ecx,%r15d
ffffffff804bda92:        0 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804bda99:        0 	00 
ffffffff804bda9a:        0 	89 d2                	mov    %edx,%edx
ffffffff804bda9c:        0 	48 f7 d0             	not    %rax
ffffffff804bda9f:        0 	48 8b 14 d0          	mov    (%rax,%rdx,8),%rdx
ffffffff804bdaa3:        0 	48 63 c1             	movslq %ecx,%rax
ffffffff804bdaa6:        0 	48 01 82 c0 00 00 00 	add    %rax,0xc0(%rdx)
ffffffff804bdaad:      214 	48 c7 83 28 04 00 00 	movq   $0x0,0x428(%rbx)
ffffffff804bdab4:        0 	00 00 00 00 
ffffffff804bdab8:     1530 	c7 83 3c 04 00 00 00 	movl   $0x0,0x43c(%rbx)
ffffffff804bdabf:        0 	00 00 00 
ffffffff804bdac2:     1135 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdac5:     3909 	44 89 fe             	mov    %r15d,%esi
ffffffff804bdac8:        0 	e8 41 e6 ff ff       	callq  ffffffff804bc10e <tcp_cleanup_rbuf>
ffffffff804bdacd:     1724 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdad0:      932 	e8 61 7c fc ff       	callq  ffffffff80485736 <release_sock>
ffffffff804bdad5:     4661 	e9 12 01 00 00       	jmpq   ffffffff804bdbec <tcp_recvmsg+0x77e>
ffffffff804bdada:        0 	41 bc 95 ff ff ff    	mov    $0xffffff95,%r12d
ffffffff804bdae0:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdae3:        0 	45 89 e7             	mov    %r12d,%r15d
ffffffff804bdae6:        0 	e8 4b 7c fc ff       	callq  ffffffff80485736 <release_sock>
ffffffff804bdaeb:        0 	e9 fc 00 00 00       	jmpq   ffffffff804bdbec <tcp_recvmsg+0x77e>
ffffffff804bdaf0:        0 	be 02 00 00 00       	mov    $0x2,%esi
ffffffff804bdaf5:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdaf8:        0 	e8 1b da ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bdafd:        0 	85 c0                	test   %eax,%eax
ffffffff804bdaff:        0 	0f 85 d4 00 00 00    	jne    ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb05:        0 	8b 83 7c 04 00 00    	mov    0x47c(%rbx),%eax
ffffffff804bdb0b:        0 	66 85 c0             	test   %ax,%ax
ffffffff804bdb0e:        0 	0f 84 c5 00 00 00    	je     ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb14:        0 	66 3d 00 04          	cmp    $0x400,%ax
ffffffff804bdb18:        0 	0f 84 bb 00 00 00    	je     ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb1e:        0 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bdb21:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bdb23:        0 	75 17                	jne    ffffffff804bdb3c <tcp_recvmsg+0x6ce>
ffffffff804bdb25:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804bdb2a:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804bdb2d:        0 	41 bc 95 ff ff ff    	mov    $0xffffff95,%r12d
ffffffff804bdb33:        0 	e8 e0 d9 ff ff       	callq  ffffffff804bb518 <sock_flag>
ffffffff804bdb38:        0 	85 c0                	test   %eax,%eax
ffffffff804bdb3a:        0 	74 a4                	je     ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdb3c:        0 	8b 83 7c 04 00 00    	mov    0x47c(%rbx),%eax
ffffffff804bdb42:        0 	f6 c4 01             	test   $0x1,%ah
ffffffff804bdb45:        0 	74 79                	je     ffffffff804bdbc0 <tcp_recvmsg+0x752>
ffffffff804bdb47:        0 	40 f6 c5 02          	test   $0x2,%bpl
ffffffff804bdb4b:        0 	88 44 24 67          	mov    %al,0x67(%rsp)
ffffffff804bdb4f:        0 	75 09                	jne    ffffffff804bdb5a <tcp_recvmsg+0x6ec>
ffffffff804bdb51:        0 	66 c7 83 7c 04 00 00 	movw   $0x400,0x47c(%rbx)
ffffffff804bdb58:        0 	00 04 
ffffffff804bdb5a:        0 	48 8b 4c 24 30       	mov    0x30(%rsp),%rcx
ffffffff804bdb5f:        0 	45 89 f4             	mov    %r14d,%r12d
ffffffff804bdb62:        0 	8b 51 30             	mov    0x30(%rcx),%edx
ffffffff804bdb65:        0 	89 d0                	mov    %edx,%eax
ffffffff804bdb67:        0 	83 c8 01             	or     $0x1,%eax
ffffffff804bdb6a:        0 	45 85 f6             	test   %r14d,%r14d
ffffffff804bdb6d:        0 	89 41 30             	mov    %eax,0x30(%rcx)
ffffffff804bdb70:        0 	7e 33                	jle    ffffffff804bdba5 <tcp_recvmsg+0x737>
ffffffff804bdb72:        0 	40 80 e5 20          	and    $0x20,%bpl
ffffffff804bdb76:        0 	41 bc 01 00 00 00    	mov    $0x1,%r12d
ffffffff804bdb7c:        0 	0f 85 5e ff ff ff    	jne    ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdb82:        0 	48 8b 79 10          	mov    0x10(%rcx),%rdi
ffffffff804bdb86:        0 	48 8d 74 24 67       	lea    0x67(%rsp),%rsi
ffffffff804bdb8b:        0 	ba 01 00 00 00       	mov    $0x1,%edx
ffffffff804bdb90:        0 	41 bc f2 ff ff ff    	mov    $0xfffffff2,%r12d
ffffffff804bdb96:        0 	e8 8a cb fc ff       	callq  ffffffff8048a725 <memcpy_toiovec>
ffffffff804bdb9b:        0 	85 c0                	test   %eax,%eax
ffffffff804bdb9d:        0 	0f 85 3d ff ff ff    	jne    ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdba3:        0 	eb 10                	jmp    ffffffff804bdbb5 <tcp_recvmsg+0x747>
ffffffff804bdba5:        0 	48 8b 44 24 30       	mov    0x30(%rsp),%rax
ffffffff804bdbaa:        0 	83 ca 21             	or     $0x21,%edx
ffffffff804bdbad:        0 	89 50 30             	mov    %edx,0x30(%rax)
ffffffff804bdbb0:        0 	e9 2b ff ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbb5:        0 	41 bc 01 00 00 00    	mov    $0x1,%r12d
ffffffff804bdbbb:        0 	e9 20 ff ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbc0:        0 	8a 43 02             	mov    0x2(%rbx),%al
ffffffff804bdbc3:        0 	3c 07                	cmp    $0x7,%al
ffffffff804bdbc5:        0 	74 1d                	je     ffffffff804bdbe4 <tcp_recvmsg+0x776>
ffffffff804bdbc7:        0 	f6 43 38 01          	testb  $0x1,0x38(%rbx)
ffffffff804bdbcb:        0 	41 bc f5 ff ff ff    	mov    $0xfffffff5,%r12d
ffffffff804bdbd1:        0 	0f 84 09 ff ff ff    	je     ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbd7:        0 	eb 0b                	jmp    ffffffff804bdbe4 <tcp_recvmsg+0x776>
ffffffff804bdbd9:        0 	41 bc ea ff ff ff    	mov    $0xffffffea,%r12d
ffffffff804bdbdf:        0 	e9 fc fe ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbe4:        0 	45 31 e4             	xor    %r12d,%r12d
ffffffff804bdbe7:        0 	e9 f4 fe ff ff       	jmpq   ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbec:     1206 	48 83 c4 68          	add    $0x68,%rsp
ffffffff804bdbf0:      498 	44 89 f8             	mov    %r15d,%eax
ffffffff804bdbf3:      387 	5b                   	pop    %rbx
ffffffff804bdbf4:      462 	5d                   	pop    %rbp
ffffffff804bdbf5:        0 	41 5c                	pop    %r12
ffffffff804bdbf7:      485 	41 5d                	pop    %r13
ffffffff804bdbf9:      466 	41 5e                	pop    %r14
ffffffff804bdbfb:        0 	41 5f                	pop    %r15
ffffffff804bdbfd:      796 	c3                   	retq   

no real hotspots either - but a bit too fractured code sequence, so
this function's icache footprint is probably double the size of
what it could be.

a bit of overhead (8%) leaks in from a callsite:

ffffffff804bd46e:      882 	41 57                	push   %r15
ffffffff804bd470:    15507 	48 89 f7             	mov    %rsi,%rdi

(this is used as a dynamic function pointer too so i'm just guessing 
that the common callsite would be sock_common_recvmsg().)

perhaps this sequence, about 7% of the total overhead of this 
function, warrants mention:

ffffffff804bd7e2:        0 	e8 96 e5 ff ff       	callq  ffffffff804bbd7d <lock_sock>
ffffffff804bd7e7:        0 	eb 0d                	jmp    ffffffff804bd7f6 <tcp_recvmsg+0x388>
ffffffff804bd7e9:      152 	48 8d 74 24 58       	lea    0x58(%rsp),%rsi
ffffffff804bd7ee:      563 	48 89 df             	mov    %rbx,%rdi
ffffffff804bd7f1:       59 	e8 83 99 fc ff       	callq  ffffffff80487179 <sk_wait_data>
ffffffff804bd7f6:       86 	48 83 7c 24 50 00    	cmpq   $0x0,0x50(%rsp)
ffffffff804bd7fc:     8550 	0f 84 8a 00 00 00    	je     ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd802:     4038 	44 89 f1             	mov    %r14d,%ecx

that's most likely lock_sock[_nested]()'s overhead leaking over into 
this function:

ffffffff804857cb:     9392 <lock_sock_nested>:
ffffffff804857cb:     9392 	41 55                	push   %r13
ffffffff804857cd:     4112 	41 54                	push   %r12
ffffffff804857cf:        2 	55                   	push   %rbp
ffffffff804857d0:        7 	48 8d 6f 40          	lea    0x40(%rdi),%rbp
ffffffff804857d4:     1515 	53                   	push   %rbx
ffffffff804857d5:        0 	48 89 fb             	mov    %rdi,%rbx
ffffffff804857d8:        4 	48 89 ef             	mov    %rbp,%rdi
ffffffff804857db:     1461 	48 83 ec 38          	sub    $0x38,%rsp
ffffffff804857df:        8 	e8 78 11 09 00       	callq  ffffffff8051695c <_spin_lock_bh>
ffffffff804857e4:     4827 	83 7b 44 00          	cmpl   $0x0,0x44(%rbx)
ffffffff804857e8:     2937 	74 6d                	je     ffffffff80485857 <lock_sock_nested+0x8c>
ffffffff804857ea:        0 	65 48 8b 14 25 00 00 	mov    %gs:0x0,%rdx
ffffffff804857f1:        0 	00 00 
ffffffff804857f3:        0 	fc                   	cld    
ffffffff804857f4:        0 	31 c0                	xor    %eax,%eax
ffffffff804857f6:        0 	48 89 e7             	mov    %rsp,%rdi
ffffffff804857f9:        0 	b9 0a 00 00 00       	mov    $0xa,%ecx
ffffffff804857fe:        0 	f3 ab                	rep stos %eax,%es:(%rdi)
ffffffff80485800:        0 	48 8d 44 24 18       	lea    0x18(%rsp),%rax
ffffffff80485805:        0 	4c 8d 63 48          	lea    0x48(%rbx),%r12
ffffffff80485809:        0 	48 89 54 24 08       	mov    %rdx,0x8(%rsp)
ffffffff8048580e:        0 	48 c7 44 24 10 80 78 	movq   $0xffffffff80247880,0x10(%rsp)
ffffffff80485815:        0 	24 80 
ffffffff80485817:        0 	48 89 44 24 18       	mov    %rax,0x18(%rsp)
ffffffff8048581c:        0 	48 89 44 24 20       	mov    %rax,0x20(%rsp)
ffffffff80485821:        0 	ba 02 00 00 00       	mov    $0x2,%edx
ffffffff80485826:        0 	48 89 e6             	mov    %rsp,%rsi
ffffffff80485829:        0 	4c 89 e7             	mov    %r12,%rdi
ffffffff8048582c:        0 	e8 fd 20 dc ff       	callq  ffffffff8024792e <prepare_to_wait_exclusive>
ffffffff80485831:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80485834:        0 	e8 18 11 09 00       	callq  ffffffff80516951 <_spin_unlock_bh>
ffffffff80485839:        0 	e8 52 f9 08 00       	callq  ffffffff80515190 <schedule>
ffffffff8048583e:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff80485841:        0 	e8 16 11 09 00       	callq  ffffffff8051695c <_spin_lock_bh>
ffffffff80485846:        0 	83 7b 44 00          	cmpl   $0x0,0x44(%rbx)
ffffffff8048584a:        0 	75 d5                	jne    ffffffff80485821 <lock_sock_nested+0x56>
ffffffff8048584c:        0 	48 89 e6             	mov    %rsp,%rsi
ffffffff8048584f:        0 	4c 89 e7             	mov    %r12,%rdi
ffffffff80485852:        0 	e8 7a 20 dc ff       	callq  ffffffff802478d1 <finish_wait>
ffffffff80485857:       88 	c7 43 44 01 00 00 00 	movl   $0x1,0x44(%rbx)
ffffffff8048585e:     3431 	fe 43 40             	incb   0x40(%rbx)
ffffffff80485861:     1568 	e8 00 4e db ff       	callq  ffffffff8023a666 <local_bh_enable>
ffffffff80485866:     1548 	48 83 c4 38          	add    $0x38,%rsp
ffffffff8048586a:       61 	5b                   	pop    %rbx
ffffffff8048586b:     1568 	5d                   	pop    %rbp
ffffffff8048586c:       36 	41 5c                	pop    %r12
ffffffff8048586e:        0 	41 5d                	pop    %r13
ffffffff80485870:     2753 	c3                   	retq   

which is:

void lock_sock_nested(struct sock *sk, int subclass)
{
	might_sleep();
	spin_lock_bh(&sk->sk_lock.slock);
	if (sk->sk_lock.owned)
		__lock_sock(sk);
	sk->sk_lock.owned = 1;
	spin_unlock(&sk->sk_lock.slock);

that branch in the middle should perhaps be:

		if (unlikely(sk->sk_lock.owned))

to make this function fall-through.
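
As a rough illustration of that suggestion (a user-space sketch, not the
kernel code): unlikely() in the kernel boils down to GCC's
__builtin_expect(), which asks the compiler to lay out the hinted branch
off the straight-line path. The toy_lock type and the toy_* helpers below
are made-up stand-ins for sk->sk_lock and __lock_sock(); whether the
compiler actually moves the slow path is up to it.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalents of the kernel's branch hints (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for sk->sk_lock: only the "owned" flag is modelled. */
struct toy_lock {
	int owned;
	int slowpath_calls;	/* counts how often the contended path ran */
};

/* Stand-in for __lock_sock(): the real one sleeps until the owner releases. */
static void toy_lock_slowpath(struct toy_lock *l)
{
	l->slowpath_calls++;
	l->owned = 0;		/* pretend the previous owner released it */
}

/* Uncontended case falls straight through; contended case is hinted as cold. */
static void toy_lock_acquire(struct toy_lock *l)
{
	assert(l != NULL);
	if (unlikely(l->owned))
		toy_lock_slowpath(l);
	l->owned = 1;
}
```

The hint changes code placement only, never the result: a wrong hint
costs performance, not correctness.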

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:26                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.717771 eth_type_trans

                      hits (total: 171777)
                 .........
ffffffff8049e215:      457 <eth_type_trans>:
ffffffff8049e215:      457 	41 54                	push   %r12
ffffffff8049e217:     6514 	55                   	push   %rbp
ffffffff8049e218:        0 	48 89 f5             	mov    %rsi,%rbp
ffffffff8049e21b:        0 	53                   	push   %rbx
ffffffff8049e21c:      441 	48 8b 87 d8 00 00 00 	mov    0xd8(%rdi),%rax
ffffffff8049e223:        5 	48 89 fb             	mov    %rdi,%rbx
ffffffff8049e226:        0 	2b 87 d0 00 00 00    	sub    0xd0(%rdi),%eax
ffffffff8049e22c:      493 	48 89 73 20          	mov    %rsi,0x20(%rbx)
ffffffff8049e230:        2 	be 0e 00 00 00       	mov    $0xe,%esi
ffffffff8049e235:        0 	89 87 c0 00 00 00    	mov    %eax,0xc0(%rdi)
ffffffff8049e23b:      472 	e8 2c 98 fe ff       	callq  ffffffff80487a6c <skb_pull>
ffffffff8049e240:      501 	44 8b a3 c0 00 00 00 	mov    0xc0(%rbx),%r12d
ffffffff8049e247:      763 	4c 03 a3 d0 00 00 00 	add    0xd0(%rbx),%r12
ffffffff8049e24e:        0 	41 f6 04 24 01       	testb  $0x1,(%r12)
ffffffff8049e253:      497 	74 26                	je     ffffffff8049e27b <eth_type_trans+0x66>
ffffffff8049e255:        0 	48 8d b5 38 02 00 00 	lea    0x238(%rbp),%rsi
ffffffff8049e25c:        0 	4c 89 e7             	mov    %r12,%rdi
ffffffff8049e25f:        0 	e8 49 fc ff ff       	callq  ffffffff8049dead <compare_ether_addr>
ffffffff8049e264:        0 	85 c0                	test   %eax,%eax
ffffffff8049e266:        0 	8a 43 7d             	mov    0x7d(%rbx),%al
ffffffff8049e269:        0 	75 08                	jne    ffffffff8049e273 <eth_type_trans+0x5e>
ffffffff8049e26b:        0 	83 e0 f8             	and    $0xfffffffffffffff8,%eax
ffffffff8049e26e:        0 	83 c8 01             	or     $0x1,%eax
ffffffff8049e271:        0 	eb 24                	jmp    ffffffff8049e297 <eth_type_trans+0x82>
ffffffff8049e273:        0 	83 e0 f8             	and    $0xfffffffffffffff8,%eax
ffffffff8049e276:        0 	83 c8 02             	or     $0x2,%eax
ffffffff8049e279:        0 	eb 1c                	jmp    ffffffff8049e297 <eth_type_trans+0x82>
ffffffff8049e27b:       82 	48 8d b5 18 02 00 00 	lea    0x218(%rbp),%rsi
ffffffff8049e282:     8782 	4c 89 e7             	mov    %r12,%rdi
ffffffff8049e285:     1752 	e8 23 fc ff ff       	callq  ffffffff8049dead <compare_ether_addr>
ffffffff8049e28a:        0 	85 c0                	test   %eax,%eax
ffffffff8049e28c:      757 	74 0c                	je     ffffffff8049e29a <eth_type_trans+0x85>
ffffffff8049e28e:        0 	8a 43 7d             	mov    0x7d(%rbx),%al
ffffffff8049e291:        0 	83 e0 f8             	and    $0xfffffffffffffff8,%eax
ffffffff8049e294:        0 	83 c8 03             	or     $0x3,%eax
ffffffff8049e297:        0 	88 43 7d             	mov    %al,0x7d(%rbx)
ffffffff8049e29a:      107 	66 41 8b 44 24 0c    	mov    0xc(%r12),%ax
ffffffff8049e2a0:     1031 	0f b7 c8             	movzwl %ax,%ecx
ffffffff8049e2a3:      518 	66 c1 e8 08          	shr    $0x8,%ax
ffffffff8049e2a7:        0 	89 ca                	mov    %ecx,%edx
ffffffff8049e2a9:        0 	c1 e2 08             	shl    $0x8,%edx
ffffffff8049e2ac:      484 	09 d0                	or     %edx,%eax
ffffffff8049e2ae:        0 	0f b7 c0             	movzwl %ax,%eax
ffffffff8049e2b1:        0 	3d ff 05 00 00       	cmp    $0x5ff,%eax
ffffffff8049e2b6:      468 	7f 18                	jg     ffffffff8049e2d0 <eth_type_trans+0xbb>
ffffffff8049e2b8:        0 	48 8b 83 d8 00 00 00 	mov    0xd8(%rbx),%rax
ffffffff8049e2bf:        0 	b9 00 01 00 00       	mov    $0x100,%ecx
ffffffff8049e2c4:        0 	66 83 38 ff          	cmpw   $0xffffffffffffffff,(%rax)
ffffffff8049e2c8:        0 	b8 00 04 00 00       	mov    $0x400,%eax
ffffffff8049e2cd:        0 	0f 45 c8             	cmovne %eax,%ecx
ffffffff8049e2d0:        0 	5b                   	pop    %rbx
ffffffff8049e2d1:    85064 	5d                   	pop    %rbp
ffffffff8049e2d2:    63776 	41 5c                	pop    %r12
ffffffff8049e2d4:        1 	89 c8                	mov    %ecx,%eax
ffffffff8049e2d6:      474 	c3                   	retq   

small function, big bang - 1.7% of the total overhead.

90% of this function's cost is in the closing sequence. My guess would 
be that it originates from ffffffff8049e2ae (the branch after that is 
not taken), which corresponds to this source code context:

(gdb) list *0xffffffff8049e2ae
0xffffffff8049e2ae is in eth_type_trans (net/ethernet/eth.c:199).
194		if (netdev_uses_dsa_tags(dev))
195			return htons(ETH_P_DSA);
196		if (netdev_uses_trailer_tags(dev))
197			return htons(ETH_P_TRAILER);
198	
199		if (ntohs(eth->h_proto) >= 1536)
200			return eth->h_proto;
201	
202		rawp = skb->data;
203	

eth->h_proto access.
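
For orientation, the cmp $0x5ff / jg pair in the listing is the
">= 1536" test from eth.c:199, applied after the 16-bit byte swap
(shr/shl/or). A minimal user-space sketch of that check follows; the
helper names are hypothetical, and it reads the two bytes explicitly
instead of using ntohs(), so it does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read the 16-bit type/length field at byte offset 12 of an Ethernet
 * header (after the two 6-byte MAC addresses) and return it in host order.
 */
static uint16_t eth_type_or_len(const uint8_t *hdr)
{
	assert(hdr != NULL);
	return (uint16_t)((hdr[12] << 8) | hdr[13]);
}

/* Values >= 1536 (0x600) are EtherTypes; smaller values are 802.3 lengths. */
static int eth_field_is_ethertype(const uint8_t *hdr)
{
	return eth_type_or_len(hdr) >= 1536;
}
```

With an IPv4 frame (bytes 0x08 0x00 at offset 12) the field reads 0x0800
and is treated as an EtherType; 0x05ff is the largest value still treated
as a length.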

Given that this workload does localhost networking, my guess would be 
that eth->h_proto is bouncing around between 16 CPUs? At minimum this 
read-mostly field should be separated from the bouncing bits.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:34                               ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 21:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger



On Mon, 17 Nov 2008, Ingo Molnar wrote:
> 
> this function _really_ hurts from a 16-bit op:
> 
> ffffffff8048943e:     6503 	66 c7 83 a8 00 00 00 	movw   $0x0,0xa8(%rbx)
> ffffffff80489445:        0 	00 00 
> ffffffff80489447:   174101 	5b                   	pop    %rbx

I don't think that is it, actually. The 16-bit store just before it had a 
zero count, even though anything that executes the second one will always 
execute the first one too.

The fact is, x86 profiles are subtle at an instruction level, and you tend 
to get profile hits _after_ the instruction that caused the cost because 
an interrupt (even an NMI) is always delayed to the next instruction (the 
one that didn't complete). And since the core will execute out-of-order, 
you don't even know what that one is, since there could easily be 
branches, but even in the absence of branches you have many instructions
executing together.

For example, in many situations the two 16-bit stores will happily execute 
together, and what you see may simply be a cache miss on the line that was 
stored to. The store buffer needs to resolve the read of the "pop" in 
order to complete, so having a big count in between stores and a 
subsequent load is not all that unlikely.

So doing per-instruction profiling is not useful unless you start looking 
at what preceded the instruction, and because of the out-of-order nature, 
you really almost have to look for cache misses or branch mispredicts.

One common reason for such a big count on an instruction that looks 
perfectly simple is often that there is a branch to that instruction that 
was mispredicted. Or that there was an instruction that was costly _long_ 
before, and that other instructions were in the shadow of that one 
completing (ie they had actually completed first, but didn't retire until 
the earlier instruction did).

So you really should never just look at the previous instruction or 
anything as simplistic as that. The time of in-order execution is long
past.

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* __inet_lookup_established(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:35                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.673249 __inet_lookup_established

                      hits (total: 167324)
                 .........
ffffffff804b9b12:      446 <__inet_lookup_established>:
ffffffff804b9b12:      446 	41 57                	push   %r15
ffffffff804b9b14:     4810 	89 d0                	mov    %edx,%eax
ffffffff804b9b16:        0 	0f b7 c9             	movzwl %cx,%ecx
ffffffff804b9b19:        0 	41 56                	push   %r14
ffffffff804b9b1b:      456 	41 55                	push   %r13
ffffffff804b9b1d:        0 	41 54                	push   %r12
ffffffff804b9b1f:        0 	55                   	push   %rbp
ffffffff804b9b20:      427 	53                   	push   %rbx
ffffffff804b9b21:        4 	48 89 f3             	mov    %rsi,%rbx
ffffffff804b9b24:        2 	44 89 c6             	mov    %r8d,%esi
ffffffff804b9b27:      504 	41 89 c8             	mov    %ecx,%r8d
ffffffff804b9b2a:        1 	49 89 f7             	mov    %rsi,%r15
ffffffff804b9b2d:        1 	48 83 ec 08          	sub    $0x8,%rsp
ffffffff804b9b31:      462 	49 c1 e7 20          	shl    $0x20,%r15
ffffffff804b9b35:        0 	48 89 3c 24          	mov    %rdi,(%rsp)
ffffffff804b9b39:      507 	89 d7                	mov    %edx,%edi
ffffffff804b9b3b:       38 	41 0f b7 d1          	movzwl %r9w,%edx
ffffffff804b9b3f:        0 	41 89 d6             	mov    %edx,%r14d
ffffffff804b9b42:      863 	49 09 c7             	or     %rax,%r15
ffffffff804b9b45:       24 	41 c1 e6 10          	shl    $0x10,%r14d
ffffffff804b9b49:        0 	41 09 ce             	or     %ecx,%r14d
ffffffff804b9b4c:      479 	89 f9                	mov    %edi,%ecx
ffffffff804b9b4e:        8 	48 8b 3c 24          	mov    (%rsp),%rdi
ffffffff804b9b52:        0 	e8 cc f4 ff ff       	callq  ffffffff804b9023 <inet_ehashfn>
ffffffff804b9b57:      413 	48 89 df             	mov    %rbx,%rdi
ffffffff804b9b5a:      122 	41 89 c5             	mov    %eax,%r13d
ffffffff804b9b5d:        0 	89 c6                	mov    %eax,%esi
ffffffff804b9b5f:      635 	e8 3e f5 ff ff       	callq  ffffffff804b90a2 <inet_ehash_bucket>
ffffffff804b9b64:      511 	48 89 c5             	mov    %rax,%rbp
ffffffff804b9b67:        6 	44 89 e8             	mov    %r13d,%eax
ffffffff804b9b6a:        0 	23 43 14             	and    0x14(%rbx),%eax
ffffffff804b9b6d:      497 	4c 8d 24 85 00 00 00 	lea    0x0(,%rax,4),%r12
ffffffff804b9b74:        0 	00 
ffffffff804b9b75:        1 	4c 03 63 08          	add    0x8(%rbx),%r12
ffffffff804b9b79:        0 	48 8b 45 00          	mov    0x0(%rbp),%rax
ffffffff804b9b7d:      470 	0f 18 08             	prefetcht0 (%rax)
ffffffff804b9b80:        0 	4c 89 e7             	mov    %r12,%rdi
ffffffff804b9b83:     1089 	e8 32 cd 05 00       	callq  ffffffff805168ba <_read_lock>
ffffffff804b9b88:     6752 	48 8b 55 00          	mov    0x0(%rbp),%rdx
ffffffff804b9b8c:      598 	eb 2c                	jmp    ffffffff804b9bba <__inet_lookup_established+0xa8>
ffffffff804b9b8e:      447 	48 81 3c 24 d0 15 ab 	cmpq   $0xffffffff80ab15d0,(%rsp)
ffffffff804b9b95:        0 	80 
ffffffff804b9b96:     1119 	75 1f                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9b98:       21 	4c 39 b8 30 02 00 00 	cmp    %r15,0x230(%rax)
ffffffff804b9b9f:        0 	75 16                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9ba1:      492 	44 39 b0 38 02 00 00 	cmp    %r14d,0x238(%rax)
ffffffff804b9ba8:        0 	75 0d                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9baa:        0 	8b 52 fc             	mov    -0x4(%rdx),%edx
ffffffff804b9bad:      451 	85 d2                	test   %edx,%edx
ffffffff804b9baf:        0 	74 67                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bb1:        0 	3b 54 24 40          	cmp    0x40(%rsp),%edx
ffffffff804b9bb5:        0 	74 61                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bb7:        0 	48 89 ca             	mov    %rcx,%rdx
ffffffff804b9bba:      402 	48 85 d2             	test   %rdx,%rdx
ffffffff804b9bbd:     1006 	74 12                	je     ffffffff804b9bd1 <__inet_lookup_established+0xbf>
ffffffff804b9bbf:        0 	48 8d 42 f8          	lea    -0x8(%rdx),%rax
ffffffff804b9bc3:      821 	48 8b 0a             	mov    (%rdx),%rcx
ffffffff804b9bc6:       78 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
ffffffff804b9bca:        4 	0f 18 09             	prefetcht0 (%rcx)
ffffffff804b9bcd:      685 	75 e8                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9bcf:   139502 	eb bd                	jmp    ffffffff804b9b8e <__inet_lookup_established+0x7c>
ffffffff804b9bd1:        0 	48 8b 55 08          	mov    0x8(%rbp),%rdx
ffffffff804b9bd5:        0 	eb 26                	jmp    ffffffff804b9bfd <__inet_lookup_established+0xeb>
ffffffff804b9bd7:        0 	48 81 3c 24 d0 15 ab 	cmpq   $0xffffffff80ab15d0,(%rsp)
ffffffff804b9bde:        0 	80 
ffffffff804b9bdf:        0 	75 19                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9be1:        0 	4c 39 78 40          	cmp    %r15,0x40(%rax)
ffffffff804b9be5:        0 	75 13                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9be7:        0 	44 39 70 48          	cmp    %r14d,0x48(%rax)
ffffffff804b9beb:        0 	75 0d                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9bed:        0 	8b 52 fc             	mov    -0x4(%rdx),%edx
ffffffff804b9bf0:        0 	85 d2                	test   %edx,%edx
ffffffff804b9bf2:        0 	74 24                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bf4:        0 	3b 54 24 40          	cmp    0x40(%rsp),%edx
ffffffff804b9bf8:        0 	74 1e                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bfa:        0 	48 89 ca             	mov    %rcx,%rdx
ffffffff804b9bfd:        0 	48 85 d2             	test   %rdx,%rdx
ffffffff804b9c00:        0 	74 12                	je     ffffffff804b9c14 <__inet_lookup_established+0x102>
ffffffff804b9c02:        0 	48 8d 42 f8          	lea    -0x8(%rdx),%rax
ffffffff804b9c06:        0 	48 8b 0a             	mov    (%rdx),%rcx
ffffffff804b9c09:        0 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
ffffffff804b9c0d:        0 	0f 18 09             	prefetcht0 (%rcx)
ffffffff804b9c10:        0 	75 e8                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9c12:        0 	eb c3                	jmp    ffffffff804b9bd7 <__inet_lookup_established+0xc5>
ffffffff804b9c14:        0 	31 c0                	xor    %eax,%eax
ffffffff804b9c16:        0 	eb 04                	jmp    ffffffff804b9c1c <__inet_lookup_established+0x10a>
ffffffff804b9c18:      441 	f0 ff 40 28          	lock incl 0x28(%rax)
ffffffff804b9c1c:     1442 	f0 41 ff 04 24       	lock incl (%r12)
ffffffff804b9c21:      476 	41 5b                	pop    %r11
ffffffff804b9c23:        1 	5b                   	pop    %rbx
ffffffff804b9c24:        0 	5d                   	pop    %rbp
ffffffff804b9c25:      475 	41 5c                	pop    %r12
ffffffff804b9c27:        0 	41 5d                	pop    %r13
ffffffff804b9c29:        1 	41 5e                	pop    %r14
ffffffff804b9c2b:      494 	41 5f                	pop    %r15
ffffffff804b9c2d:        0 	c3                   	retq   
ffffffff804b9c2e:        0 	90                   	nop    
ffffffff804b9c2f:        0 	90                   	nop    

80% of the overhead comes from cachemisses here:

ffffffff804b9bc6:       78 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
ffffffff804b9bca:        4 	0f 18 09             	prefetcht0 (%rcx)
ffffffff804b9bcd:      685 	75 e8                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9bcf:   139502 	eb bd                	jmp    ffffffff804b9b8e <__inet_lookup_established+0x7c>

corresponding to:

(gdb) list *0xffffffff804b9bc6
0xffffffff804b9bc6 is in __inet_lookup_established (net/ipv4/inet_hashtables.c:237).
232		rwlock_t *lock = inet_ehash_lockp(hashinfo, hash);
233	
234		prefetch(head->chain.first);
235		read_lock(lock);
236		sk_for_each(sk, node, &head->chain) {
237			if (INET_MATCH(sk, net, hash, acookie,
238						saddr, daddr, ports, dif))
239				goto hit; /* You sunk my battleship! */
240		}
241	

Seeing the first hard cachemiss on hash lookups is a familiar and 
partly expected pattern - it is the first thing that touches 
cache-cold data structures.
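
The loop in question has roughly the following shape. This is a
simplified user-space sketch under assumptions: struct toy_node and
toy_chain_lookup() are illustrative names, a single hash/key pair stands
in for INET_MATCH(), and __builtin_prefetch() (GCC/Clang) stands in for
the kernel's prefetch(). The load of n->next is the first touch of each
node, which is where a cache-cold node would stall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chain node; the kernel embeds a struct hlist_node in struct sock. */
struct toy_node {
	struct toy_node *next;
	uint32_t hash;
	uint32_t key;
};

/*
 * Walk one hash chain. The next pointer is loaded and prefetched before the
 * current node is compared, so the fetch of the following node can overlap
 * with the comparison.
 */
static struct toy_node *toy_chain_lookup(struct toy_node *head,
					 uint32_t hash, uint32_t key)
{
	struct toy_node *n, *next;

	for (n = head; n != NULL; n = next) {
		next = n->next;			/* first touch of this node */
		assert(next != n);		/* a self-linked node would loop forever */
		__builtin_prefetch(next);	/* hint only; a NULL address does not fault */
		if (n->hash == hash && n->key == key)
			return n;
	}
	return NULL;
}
```

A prefetch issued one node ahead only helps when there is enough work per
node to overlap with; on short chains the first access still pays the
full miss.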

Seeing 1.4% of the total tbench overhead go into this single
cachemiss is a bit surprising to me though: tbench works via
long-lived connections (TCP establish costs are nowhere to be seen in
the profiles) so the socket hash should be relatively stable and 
read-mostly on most CPUs in theory. The CPUs here have 2MB of L2 cache 
per socket.
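
As a sanity check on that figure, the share can be recomputed from the 
sample counts in the profile above. A minimal sketch in plain C; the 
helper name is made up for illustration and is not part of any tool:

```c
#include <assert.h>

/*
 * Share of the whole profile (in percent) attributable to one hot
 * instruction, given its hit count, the hit count of the enclosing
 * function, and the function's share of the whole profile.
 */
static double hotspot_share_pct(unsigned long hot_hits,
				unsigned long func_hits,
				double func_pct)
{
	return func_pct * (double)hot_hits / (double)func_hits;
}
```

With 139502 hits on the hot instruction out of 167324 for the function, 
and the function at 1.673249% of the profile, this comes out just under 
1.4%. The hot instruction itself accounts for roughly 83% of the 
function's hits.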

Could we be somehow dirtying these cachelines perhaps, causing 
unnecessary cachemisses in hash lookups? Is the hash linkage portion 
of the socket data structure frequently dirtied? Padding that to 64 
bytes (or next to 64 bytes worth of read-mostly fields) could perhaps 
give us a +1.7% tbench speedup.
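
To illustrate the padding idea, here is a sketch of such a layout in 
userspace C11. The struct and its field names are hypothetical 
stand-ins, not the kernel's struct sock, and the 64-byte line size is 
an assumption taken from the figure above. (In the disassembly the hit 
path does a lock incl at 0x28(%rax) while the lookup loop compares 
against 0x2c(%rax), so a frequently written counter appears to sit 
right next to the hash in this build.)

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64	/* assumption: 64-byte cachelines */

/* Hypothetical stand-in for a socket; NOT the kernel's struct sock. */
struct demo_sock {
	/* Read-mostly lookup keys, grouped at the start of a cacheline. */
	alignas(CACHELINE_BYTES) struct demo_sock *next;
	uint32_t hash;
	uint32_t saddr;
	uint32_t daddr;
	uint32_t ports;

	/* Frequently written state starts on its own cacheline. */
	alignas(CACHELINE_BYTES) uint32_t refcnt;
	uint64_t rx_bytes;
};

/* Returns nonzero when two member offsets fall into the same cacheline. */
static int same_cacheline(size_t off_a, size_t off_b)
{
	return off_a / CACHELINE_BYTES == off_b / CACHELINE_BYTES;
}

/* Compile-time check: the written counter starts a fresh cacheline. */
static_assert(offsetof(struct demo_sock, refcnt) % CACHELINE_BYTES == 0,
	      "refcnt must start a new cacheline");
```

In the kernel itself one would probably reach for 
____cacheline_aligned_in_smp rather than alignas; the sketch only shows 
the intended separation of lookup keys from dirtied fields.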

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:38                                 ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 17 Nov 2008, Ingo Molnar wrote:
> > 
> > this function _really_ hurts from a 16-bit op:
> > 
> > ffffffff8048943e:     6503 	66 c7 83 a8 00 00 00 	movw   $0x0,0xa8(%rbx)
> > ffffffff80489445:        0 	00 00 
> > ffffffff80489447:   174101 	5b                   	pop    %rbx
> 
> I don't think that is it, actually. The 16-bit store just before it 
> had a zero count, even though anything that executes the second one 
> will always execute the first one too.

yeah - look at the followup bits that identify the likely real source 
of that overhead:

>> _But_, the real overhead probably comes from:
>> 
>>  ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
>> 
>> which is the next line, the ttl field:
>> 
>>  373             iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
>> 
>> this shows that we are doing a hard cachemiss on the net-localhost 
>> route dst structure cacheline. We do a plain load instruction from 
>> it here and get a hefty cachemiss. (because 16 CPUs are banging on 
>> that single route)
>> 
>> And let's make sure we see this in perspective as well: that single 
>> cachemiss is _1.0 percent_ of the total tbench cost. (!) We could 
>> make the scheduler 10% slower straight away and it would have less 
>> of a real-life effect than this single iph->ttl field setting.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:40                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 21:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> 100.000000 total
>> ................
>>   1.717771 eth_type_trans
> 
>                       hits (total: 171777)
>                  .........
> ffffffff8049e215:      457 <eth_type_trans>:
> ffffffff8049e215:      457 	41 54                	push   %r12
> ffffffff8049e217:     6514 	55                   	push   %rbp
> ffffffff8049e218:        0 	48 89 f5             	mov    %rsi,%rbp
> ffffffff8049e21b:        0 	53                   	push   %rbx
> ffffffff8049e21c:      441 	48 8b 87 d8 00 00 00 	mov    0xd8(%rdi),%rax
> ffffffff8049e223:        5 	48 89 fb             	mov    %rdi,%rbx
> ffffffff8049e226:        0 	2b 87 d0 00 00 00    	sub    0xd0(%rdi),%eax
> ffffffff8049e22c:      493 	48 89 73 20          	mov    %rsi,0x20(%rbx)
> ffffffff8049e230:        2 	be 0e 00 00 00       	mov    $0xe,%esi
> ffffffff8049e235:        0 	89 87 c0 00 00 00    	mov    %eax,0xc0(%rdi)
> ffffffff8049e23b:      472 	e8 2c 98 fe ff       	callq  ffffffff80487a6c <skb_pull>
> ffffffff8049e240:      501 	44 8b a3 c0 00 00 00 	mov    0xc0(%rbx),%r12d
> ffffffff8049e247:      763 	4c 03 a3 d0 00 00 00 	add    0xd0(%rbx),%r12
> ffffffff8049e24e:        0 	41 f6 04 24 01       	testb  $0x1,(%r12)
> ffffffff8049e253:      497 	74 26                	je     ffffffff8049e27b <eth_type_trans+0x66>
> ffffffff8049e255:        0 	48 8d b5 38 02 00 00 	lea    0x238(%rbp),%rsi
> ffffffff8049e25c:        0 	4c 89 e7             	mov    %r12,%rdi
> ffffffff8049e25f:        0 	e8 49 fc ff ff       	callq  ffffffff8049dead <compare_ether_addr>
> ffffffff8049e264:        0 	85 c0                	test   %eax,%eax
> ffffffff8049e266:        0 	8a 43 7d             	mov    0x7d(%rbx),%al
> ffffffff8049e269:        0 	75 08                	jne    ffffffff8049e273 <eth_type_trans+0x5e>
> ffffffff8049e26b:        0 	83 e0 f8             	and    $0xfffffffffffffff8,%eax
> ffffffff8049e26e:        0 	83 c8 01             	or     $0x1,%eax
> ffffffff8049e271:        0 	eb 24                	jmp    ffffffff8049e297 <eth_type_trans+0x82>
> ffffffff8049e273:        0 	83 e0 f8             	and    $0xfffffffffffffff8,%eax
> ffffffff8049e276:        0 	83 c8 02             	or     $0x2,%eax
> ffffffff8049e279:        0 	eb 1c                	jmp    ffffffff8049e297 <eth_type_trans+0x82>
> ffffffff8049e27b:       82 	48 8d b5 18 02 00 00 	lea    0x218(%rbp),%rsi
> ffffffff8049e282:     8782 	4c 89 e7             	mov    %r12,%rdi
> ffffffff8049e285:     1752 	e8 23 fc ff ff       	callq  ffffffff8049dead <compare_ether_addr>
> ffffffff8049e28a:        0 	85 c0                	test   %eax,%eax
> ffffffff8049e28c:      757 	74 0c                	je     ffffffff8049e29a <eth_type_trans+0x85>
> ffffffff8049e28e:        0 	8a 43 7d             	mov    0x7d(%rbx),%al
> ffffffff8049e291:        0 	83 e0 f8             	and    $0xfffffffffffffff8,%eax
> ffffffff8049e294:        0 	83 c8 03             	or     $0x3,%eax
> ffffffff8049e297:        0 	88 43 7d             	mov    %al,0x7d(%rbx)
> ffffffff8049e29a:      107 	66 41 8b 44 24 0c    	mov    0xc(%r12),%ax
> ffffffff8049e2a0:     1031 	0f b7 c8             	movzwl %ax,%ecx
> ffffffff8049e2a3:      518 	66 c1 e8 08          	shr    $0x8,%ax
> ffffffff8049e2a7:        0 	89 ca                	mov    %ecx,%edx
> ffffffff8049e2a9:        0 	c1 e2 08             	shl    $0x8,%edx
> ffffffff8049e2ac:      484 	09 d0                	or     %edx,%eax
> ffffffff8049e2ae:        0 	0f b7 c0             	movzwl %ax,%eax
> ffffffff8049e2b1:        0 	3d ff 05 00 00       	cmp    $0x5ff,%eax
> ffffffff8049e2b6:      468 	7f 18                	jg     ffffffff8049e2d0 <eth_type_trans+0xbb>
> ffffffff8049e2b8:        0 	48 8b 83 d8 00 00 00 	mov    0xd8(%rbx),%rax
> ffffffff8049e2bf:        0 	b9 00 01 00 00       	mov    $0x100,%ecx
> ffffffff8049e2c4:        0 	66 83 38 ff          	cmpw   $0xffffffffffffffff,(%rax)
> ffffffff8049e2c8:        0 	b8 00 04 00 00       	mov    $0x400,%eax
> ffffffff8049e2cd:        0 	0f 45 c8             	cmovne %eax,%ecx
> ffffffff8049e2d0:        0 	5b                   	pop    %rbx
> ffffffff8049e2d1:    85064 	5d                   	pop    %rbp
> ffffffff8049e2d2:    63776 	41 5c                	pop    %r12
> ffffffff8049e2d4:        1 	89 c8                	mov    %ecx,%eax
> ffffffff8049e2d6:      474 	c3                   	retq   
> 
> small function, big bang - 1.7% of the total overhead.
> 
> 90% of this function's cost is in the closing sequence. My guess would 
> be that it originates from ffffffff8049e2ae (the branch after that is 
> not taken), which corresponds to this source code context:
> 
> (gdb) list *0xffffffff8049e2ae
> 0xffffffff8049e2ae is in eth_type_trans (net/ethernet/eth.c:199).
> 194		if (netdev_uses_dsa_tags(dev))
> 195			return htons(ETH_P_DSA);
> 196		if (netdev_uses_trailer_tags(dev))
> 197			return htons(ETH_P_TRAILER);
> 198	
> 199		if (ntohs(eth->h_proto) >= 1536)
> 200			return eth->h_proto;
> 201	
> 202		rawp = skb->data;
> 203	
> 
> eth->h_proto access.
> 
> Given that this workload does localhost networking, my guess would be 
> that eth->h_proto is bouncing around between 16 CPUs? At minimum this 
> read-mostly field should be separated from the bouncing bits.
> 

"eth" is on the frame itself, so each cpu is handling an skb it owns.
If there is a cache line miss, then maybe the scheduler made a wrong
scheduling decision? (tbench server and tbench client on different cpus)

But seeing your disassembly, I can see compare_ether_addr() is not inlined.

This sucks.

/**
 * compare_ether_addr - Compare two Ethernet addresses
 * @addr1: Pointer to a six-byte array containing the Ethernet address
 * @addr2: Pointer other six-byte array containing the Ethernet address
 *
 * Compare two ethernet addresses, returns 0 if equal
 */
static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
{
        const u16 *a = (const u16 *) addr1;
        const u16 *b = (const u16 *) addr2;

        BUILD_BUG_ON(ETH_ALEN != 6);
        return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
}

On my machine/compiler it is inlined, and that makes a big difference.

c0420750 <eth_type_trans>: /* eth_type_trans total:  14417  0.4101 */ 
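
For anyone who wants to experiment outside the kernel, here is a 
userspace sketch of the same comparison with inlining forced through 
the GCC/Clang always_inline attribute. The names 
compare_ether_addr_sketch and force_inline are made up for this 
example, and memcpy replaces the u16 pointer casts so the sketch makes 
no alignment or aliasing assumptions. Whether forcing the inline is the 
right fix for the kernel build is a separate question.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Assumption: GCC or Clang, which both support always_inline. */
#define force_inline inline __attribute__((always_inline))

static_assert(ETH_ALEN == 6, "comparison below assumes 6-byte addresses");

/* Returns 0 if the two six-byte addresses are equal, 1 otherwise. */
static force_inline unsigned
compare_ether_addr_sketch(const uint8_t *addr1, const uint8_t *addr2)
{
	uint16_t a[3], b[3];

	/* Copy into aligned temporaries instead of casting the pointers. */
	memcpy(a, addr1, ETH_ALEN);
	memcpy(b, addr2, ETH_ALEN);

	return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
}
```

Disassembling a caller of this function should show no call 
instruction for the comparison, unlike the two callq sites in the 
profile quoted above.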



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:52                               ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 21:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger



On Mon, 17 Nov 2008, Ingo Molnar wrote:
> ffffffff8049e2ae:        0 	0f b7 c0             	movzwl %ax,%eax
> ffffffff8049e2b1:        0 	3d ff 05 00 00       	cmp    $0x5ff,%eax
> ffffffff8049e2b6:      468 	7f 18                	jg     ffffffff8049e2d0 <eth_type_trans+0xbb>
> ffffffff8049e2b8:        0 	48 8b 83 d8 00 00 00 	mov    0xd8(%rbx),%rax
> ffffffff8049e2bf:        0 	b9 00 01 00 00       	mov    $0x100,%ecx
> ffffffff8049e2c4:        0 	66 83 38 ff          	cmpw   $0xffffffffffffffff,(%rax)
> ffffffff8049e2c8:        0 	b8 00 04 00 00       	mov    $0x400,%eax
> ffffffff8049e2cd:        0 	0f 45 c8             	cmovne %eax,%ecx
> ffffffff8049e2d0:        0 	5b                   	pop    %rbx
> ffffffff8049e2d1:    85064 	5d                   	pop    %rbp
> ffffffff8049e2d2:    63776 	41 5c                	pop    %r12
> ffffffff8049e2d4:        1 	89 c8                	mov    %ecx,%eax
> ffffffff8049e2d6:      474 	c3                   	retq   
> 
> small function, big bang - 1.7% of the total overhead.
> 
> 90% of this function's cost is in the closing sequence. My guess would 
> be that it originates from ffffffff8049e2ae (the branch after that is 
> not taken), which corresponds to this source code context:

I would actually suspect that branch mispredicts may be an issue.

If that thing falls out of the branch prediction table (which it could 
easily do), then a forward branch will be predicted as "not taken". And if 
it then turns out that the _common_ case is the other way around, the 
incorrectly predicted destination is often the one that shows up in 
profiles.

Giving gcc likely()/unlikely() hints usually doesn't help much, I'm 
afraid. It _can_ make a difference, but often not for -Os in particular.
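
For reference, the hints in question boil down to __builtin_expect(). 
A minimal userspace sketch, assuming GCC or Clang; classify_proto() and 
DEMO_ETHERTYPE_MIN are hypothetical names, with the 1536 threshold 
taken from the eth_type_trans() source quoted earlier in the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the kernel macros, built on the compiler builtin. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

#define DEMO_ETHERTYPE_MIN 1536	/* hypothetical name for the threshold */

enum frame_kind { FRAME_ETHERTYPE, FRAME_LENGTH };

/* Takes the protocol field already converted to host byte order. */
static enum frame_kind classify_proto(uint16_t proto_host_order)
{
	/* Hint that the EtherType case is the common one. */
	if (likely(proto_host_order >= DEMO_ETHERTYPE_MIN))
		return FRAME_ETHERTYPE;

	return FRAME_LENGTH;
}
```

The hint only influences how the compiler lays out and predicts the two 
paths; the result is the same either way, and the effect may well be 
small at -Os.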

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* system_call() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 21:59                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger, H. Peter Anvin,
	Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.508888 system_call

that's an easy one:

ffffffff8020be00:    97321 <system_call>:
ffffffff8020be00:    97321 	0f 01 f8             	swapgs 
ffffffff8020be03:    53089 	66 66 66 90          	xchg   %ax,%ax
ffffffff8020be07:     1524 	66 66 90             	xchg   %ax,%ax
ffffffff8020be0a:        0 	66 66 90             	xchg   %ax,%ax
ffffffff8020be0d:        0 	66 66 90             	xchg   %ax,%ax

ffffffff8020be10:     1511 <system_call_after_swapgs>:
ffffffff8020be10:     1511 	65 48 89 24 25 18 00 	mov    %rsp,%gs:0x18
ffffffff8020be17:        0 	00 00 
ffffffff8020be19:        0 	65 48 8b 24 25 10 00 	mov    %gs:0x10,%rsp
ffffffff8020be20:        0 	00 00 
ffffffff8020be22:     1490 	fb                   	sti    

syscall entry instruction costs - unavoidable security checks, etc. - 
hardware costs.

But looking at this profile made me notice this detail:

  ENTRY(system_call_after_swapgs)

Combined with this alignment rule we have in 
arch/x86/include/asm/linkage.h on 64-bit:

  #ifdef CONFIG_X86_64
  #define __ALIGN .p2align 4,,15
  #define __ALIGN_STR ".p2align 4,,15"
  #endif

While it inserts NOP sequences, that is still +13 bytes of excessive, 
stupid alignment padding, straight in our syscall entry path.
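
The 13-byte figure can be checked by hand: swapgs ends at ...be03 and the 
next 16-byte boundary is ...be10. A minimal sketch of that arithmetic, 
assuming the usual gas meaning of ".p2align n,,max" (align to 2^n bytes 
unless that would need more than max bytes of padding); the helper names 
are made up for illustration and are not kernel functions:

```c
#include <stdint.h>

/*
 * Bytes of padding needed to bring `addr` up to the next multiple of
 * `align`. `align` must be a power of two. Hypothetical helper.
 */
static inline uint64_t align_padding(uint64_t addr, uint64_t align)
{
	return (align - (addr & (align - 1))) & (align - 1);
}

/*
 * Padding under a ".p2align log2_align,,max_skip" style rule: the
 * alignment is skipped entirely if it would need more than max_skip
 * bytes. Hypothetical helper.
 */
static inline uint64_t p2align_padding(uint64_t addr, unsigned int log2_align,
				       uint64_t max_skip)
{
	uint64_t pad = align_padding(addr, UINT64_C(1) << log2_align);

	return pad <= max_skip ? pad : 0;
}
```

With log2_align = 4 and max_skip = 15 the limit never triggers, so every 
ENTRY() label is padded to a 16-byte boundary regardless of where the 
previous instruction ended.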

system_call_after_swapgs is an utter slowpath in any case. The interim 
fix is below - although it needs more thinking and probably should be 
done via an ENTRY_UNALIGNED() method as well, for slowpath targets.

With that we get this much nicer entry sequence:

ffffffff8020be00:   544323 <system_call>:
ffffffff8020be00:   544323 	0f 01 f8             	swapgs 

ffffffff8020be03:   197954 <system_call_after_swapgs>:
ffffffff8020be03:   197954 	65 48 89 24 25 18 00 	mov    %rsp,%gs:0x18
ffffffff8020be0a:        0 	00 00 
ffffffff8020be0c:     6578 	65 48 8b 24 25 10 00 	mov    %gs:0x10,%rsp
ffffffff8020be13:        0 	00 00 
ffffffff8020be15:        0 	fb                   	sti    
ffffffff8020be16:        0 	48 83 ec 50          	sub    $0x50,%rsp

And we should probably weaken the generic code alignment rules as well 
on x86. I'll do some measurements of it.

	Ingo

Index: linux/arch/x86/kernel/entry_64.S
===================================================================
--- linux.orig/arch/x86/kernel/entry_64.S
+++ linux/arch/x86/kernel/entry_64.S
@@ -315,7 +315,8 @@ ENTRY(system_call)
 	 * after the swapgs, so that it can do the swapgs
 	 * for the guest and jump here on syscall.
 	 */
-ENTRY(system_call_after_swapgs)
+.globl system_call_after_swapgs
+system_call_after_swapgs:
 
 	movq	%rsp,%gs:pda_oldrsp 
 	movq	%gs:pda_kernelstack,%rsp

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17 18:49                           ` Ingo Molnar
                                             ` (12 preceding siblings ...)
  (?)
@ 2008-11-17 22:08                           ` Ingo Molnar
  2008-11-17 22:15                               ` Eric Dumazet
  -1 siblings, 1 reply; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.469183 tcp_current_mss

                      hits (total: 146918)
                 .........
ffffffff804c5237:      526 <tcp_current_mss>:
ffffffff804c5237:      526 	41 54                	push   %r12
ffffffff804c5239:     5929 	55                   	push   %rbp
ffffffff804c523a:       32 	53                   	push   %rbx
ffffffff804c523b:      294 	48 89 fb             	mov    %rdi,%rbx
ffffffff804c523e:      539 	48 83 ec 30          	sub    $0x30,%rsp
ffffffff804c5242:     2590 	85 f6                	test   %esi,%esi
ffffffff804c5244:      444 	48 8b 4f 78          	mov    0x78(%rdi),%rcx
ffffffff804c5248:      521 	8b af 4c 04 00 00    	mov    0x44c(%rdi),%ebp
ffffffff804c524e:      791 	74 2a                	je     ffffffff804c527a <tcp_current_mss+0x43>
ffffffff804c5250:      433 	8b 87 00 01 00 00    	mov    0x100(%rdi),%eax
ffffffff804c5256:      236 	c1 e0 10             	shl    $0x10,%eax
ffffffff804c5259:      191 	89 c2                	mov    %eax,%edx
ffffffff804c525b:      487 	23 97 fc 00 00 00    	and    0xfc(%rdi),%edx
ffffffff804c5261:      362 	39 c2                	cmp    %eax,%edx
ffffffff804c5263:      342 	75 15                	jne    ffffffff804c527a <tcp_current_mss+0x43>
ffffffff804c5265:      473 	45 31 e4             	xor    %r12d,%r12d
ffffffff804c5268:      221 	8b 87 00 04 00 00    	mov    0x400(%rdi),%eax
ffffffff804c526e:      194 	3b 87 80 04 00 00    	cmp    0x480(%rdi),%eax
ffffffff804c5274:      445 	41 0f 94 c4          	sete   %r12b
ffffffff804c5278:      261 	eb 03                	jmp    ffffffff804c527d <tcp_current_mss+0x46>
ffffffff804c527a:        0 	45 31 e4             	xor    %r12d,%r12d
ffffffff804c527d:      185 	48 85 c9             	test   %rcx,%rcx
ffffffff804c5280:      686 	74 15                	je     ffffffff804c5297 <tcp_current_mss+0x60>
ffffffff804c5282:     1806 	8b 71 7c             	mov    0x7c(%rcx),%esi
ffffffff804c5285:        1 	3b b3 5c 03 00 00    	cmp    0x35c(%rbx),%esi
ffffffff804c528b:       21 	74 0a                	je     ffffffff804c5297 <tcp_current_mss+0x60>
ffffffff804c528d:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804c5290:        0 	e8 8b fb ff ff       	callq  ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c5295:        0 	89 c5                	mov    %eax,%ebp
ffffffff804c5297:      864 	48 8d 4c 24 28       	lea    0x28(%rsp),%rcx
ffffffff804c529c:      634 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
ffffffff804c52a1:      995 	31 f6                	xor    %esi,%esi
ffffffff804c52a3:        0 	48 89 df             	mov    %rbx,%rdi
ffffffff804c52a6:        2 	e8 f2 fe ff ff       	callq  ffffffff804c519d <tcp_established_options>
ffffffff804c52ab:      859 	8b 8b e8 03 00 00    	mov    0x3e8(%rbx),%ecx
ffffffff804c52b1:      936 	83 c0 14             	add    $0x14,%eax
ffffffff804c52b4:        6 	0f b7 d1             	movzwl %cx,%edx
ffffffff804c52b7:        0 	39 d0                	cmp    %edx,%eax
ffffffff804c52b9:      911 	74 04                	je     ffffffff804c52bf <tcp_current_mss+0x88>
ffffffff804c52bb:        0 	29 d0                	sub    %edx,%eax
ffffffff804c52bd:        0 	29 c5                	sub    %eax,%ebp
ffffffff804c52bf:        0 	45 85 e4             	test   %r12d,%r12d
ffffffff804c52c2:     6894 	89 e8                	mov    %ebp,%eax
ffffffff804c52c4:        0 	74 38                	je     ffffffff804c52fe <tcp_current_mss+0xc7>
ffffffff804c52c6:      990 	48 8b 83 68 03 00 00 	mov    0x368(%rbx),%rax
ffffffff804c52cd:      642 	8b b3 04 01 00 00    	mov    0x104(%rbx),%esi
ffffffff804c52d3:        3 	48 89 df             	mov    %rbx,%rdi
ffffffff804c52d6:      240 	66 2b 70 30          	sub    0x30(%rax),%si
ffffffff804c52da:      588 	66 2b b3 7e 03 00 00 	sub    0x37e(%rbx),%si
ffffffff804c52e1:        2 	66 29 ce             	sub    %cx,%si
ffffffff804c52e4:      284 	ff ce                	dec    %esi
ffffffff804c52e6:      664 	0f b7 f6             	movzwl %si,%esi
ffffffff804c52e9:        2 	e8 0a fb ff ff       	callq  ffffffff804c4df8 <tcp_bound_to_half_wnd>
ffffffff804c52ee:       68 	0f b7 d0             	movzwl %ax,%edx
ffffffff804c52f1:     1870 	89 c1                	mov    %eax,%ecx
ffffffff804c52f3:        0 	89 d0                	mov    %edx,%eax
ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax
ffffffff804c52fb:     1670 	66 29 d0             	sub    %dx,%ax
ffffffff804c52fe:        0 	66 89 83 ea 03 00 00 	mov    %ax,0x3ea(%rbx)
ffffffff804c5305:        4 	48 83 c4 30          	add    $0x30,%rsp
ffffffff804c5309:      855 	89 e8                	mov    %ebp,%eax
ffffffff804c530b:        0 	5b                   	pop    %rbx
ffffffff804c530c:      797 	5d                   	pop    %rbp
ffffffff804c530d:        0 	41 5c                	pop    %r12
ffffffff804c530f:        0 	c3                   	retq   

apparently this division causes 1.0% of tbench overhead:

ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax

(gdb) list *0xffffffff804c52f7
0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078).
1073					  inet_csk(sk)->icsk_af_ops->net_header_len -
1074					  inet_csk(sk)->icsk_ext_hdr_len -
1075					  tp->tcp_header_len);
1076	
1077			xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
1078			xmit_size_goal -= (xmit_size_goal % mss_now);
1079		}
1080		tp->xmit_size_goal = xmit_size_goal;
1081	
1082		return mss_now;
(gdb) 

it's this division:

        if (doing_tso) {
        [...]
			xmit_size_goal -= (xmit_size_goal % mss_now);

Has no-one hit this before? Perhaps this is why switching loopback 
networking to TSO had a performance impact for others?

It's still a bit weird ... how can a single division cause this much 
overhead? tcp_bound_to_half_wnd() [which is called straight before 
this sequence] seems low-overhead.
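
For reference, the line in question rounds xmit_size_goal down to a whole 
number of MSS-sized segments. The sketch below shows that computation in 
isolation, plus one hypothetical way to avoid repeating the division when 
neither input has changed. The struct and function names are made up for 
illustration; nothing here is taken from the kernel tree, and mss is 
assumed to be non-zero:

```c
#include <stdint.h>

/* Round goal down to a whole number of mss-sized segments (mss != 0). */
static inline uint32_t round_down_to_mss(uint32_t goal, uint32_t mss)
{
	return goal - (goal % mss);
}

/* Hypothetical per-connection cache; not a kernel structure. */
struct goal_cache {
	uint32_t mss;		/* MSS the cached value was computed for */
	uint32_t raw_goal;	/* unrounded goal the cached value came from */
	uint32_t rounded;	/* cached result */
	unsigned int divisions;	/* how many times we actually divided */
};

/* Recompute (and divide) only when mss or goal differ from last time. */
static inline uint32_t cached_round_down(struct goal_cache *c,
					 uint32_t goal, uint32_t mss)
{
	if (c->mss != mss || c->raw_goal != goal) {
		c->mss = mss;
		c->raw_goal = goal;
		c->rounded = round_down_to_mss(goal, mss);
		c->divisions++;
	}
	return c->rounded;
}
```

Whether the division itself or something nearby is what the profile is 
charging is a separate question; the sketch only shows what is being 
computed.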

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: system_call() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:09                               ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-17 22:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger, H. Peter Anvin,
	Thomas Gleixner



On Mon, 17 Nov 2008, Ingo Molnar wrote:
> 
> syscall entry instruction costs - unavoidable security checks, etc. - 
> hardware costs.

Yes. One thing to look out for on x86 is the system call _return_ path. It 
doesn't show up in kernel profiles (it shows up as user costs), and we had 
a bug where auditing essentially always caused us to use 'iret' instead of 
'sysret' because it took us the long way around.

And profiling doesn't show it, but things like lmbench did, iret being 
about five times slower than sysret.

But yes:

> -ENTRY(system_call_after_swapgs)
> +.globl system_call_after_swapgs
> +system_call_after_swapgs:

This definitely makes sense. We definitely do not want to align that 
special case.

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: __inet_lookup_established(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:14                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 22:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> 100.000000 total
>> ................
>>   1.673249 __inet_lookup_established
> 
>                       hits (total: 167324)
>                  .........
> ffffffff804b9b12:      446 <__inet_lookup_established>:
> ffffffff804b9b12:      446 	41 57                	push   %r15
> ffffffff804b9b14:     4810 	89 d0                	mov    %edx,%eax
> ffffffff804b9b16:        0 	0f b7 c9             	movzwl %cx,%ecx
> ffffffff804b9b19:        0 	41 56                	push   %r14
> ffffffff804b9b1b:      456 	41 55                	push   %r13
> ffffffff804b9b1d:        0 	41 54                	push   %r12
> ffffffff804b9b1f:        0 	55                   	push   %rbp
> ffffffff804b9b20:      427 	53                   	push   %rbx
> ffffffff804b9b21:        4 	48 89 f3             	mov    %rsi,%rbx
> ffffffff804b9b24:        2 	44 89 c6             	mov    %r8d,%esi
> ffffffff804b9b27:      504 	41 89 c8             	mov    %ecx,%r8d
> ffffffff804b9b2a:        1 	49 89 f7             	mov    %rsi,%r15
> ffffffff804b9b2d:        1 	48 83 ec 08          	sub    $0x8,%rsp
> ffffffff804b9b31:      462 	49 c1 e7 20          	shl    $0x20,%r15
> ffffffff804b9b35:        0 	48 89 3c 24          	mov    %rdi,(%rsp)
> ffffffff804b9b39:      507 	89 d7                	mov    %edx,%edi
> ffffffff804b9b3b:       38 	41 0f b7 d1          	movzwl %r9w,%edx
> ffffffff804b9b3f:        0 	41 89 d6             	mov    %edx,%r14d
> ffffffff804b9b42:      863 	49 09 c7             	or     %rax,%r15
> ffffffff804b9b45:       24 	41 c1 e6 10          	shl    $0x10,%r14d
> ffffffff804b9b49:        0 	41 09 ce             	or     %ecx,%r14d
> ffffffff804b9b4c:      479 	89 f9                	mov    %edi,%ecx
> ffffffff804b9b4e:        8 	48 8b 3c 24          	mov    (%rsp),%rdi
> ffffffff804b9b52:        0 	e8 cc f4 ff ff       	callq  ffffffff804b9023 <inet_ehashfn>
> ffffffff804b9b57:      413 	48 89 df             	mov    %rbx,%rdi
> ffffffff804b9b5a:      122 	41 89 c5             	mov    %eax,%r13d
> ffffffff804b9b5d:        0 	89 c6                	mov    %eax,%esi
> ffffffff804b9b5f:      635 	e8 3e f5 ff ff       	callq  ffffffff804b90a2 <inet_ehash_bucket>
> ffffffff804b9b64:      511 	48 89 c5             	mov    %rax,%rbp
> ffffffff804b9b67:        6 	44 89 e8             	mov    %r13d,%eax
> ffffffff804b9b6a:        0 	23 43 14             	and    0x14(%rbx),%eax
> ffffffff804b9b6d:      497 	4c 8d 24 85 00 00 00 	lea    0x0(,%rax,4),%r12
> ffffffff804b9b74:        0 	00 
> ffffffff804b9b75:        1 	4c 03 63 08          	add    0x8(%rbx),%r12
> ffffffff804b9b79:        0 	48 8b 45 00          	mov    0x0(%rbp),%rax
> ffffffff804b9b7d:      470 	0f 18 08             	prefetcht0 (%rax)
> ffffffff804b9b80:        0 	4c 89 e7             	mov    %r12,%rdi
> ffffffff804b9b83:     1089 	e8 32 cd 05 00       	callq  ffffffff805168ba <_read_lock>
> ffffffff804b9b88:     6752 	48 8b 55 00          	mov    0x0(%rbp),%rdx
> ffffffff804b9b8c:      598 	eb 2c                	jmp    ffffffff804b9bba <__inet_lookup_established+0xa8>
> ffffffff804b9b8e:      447 	48 81 3c 24 d0 15 ab 	cmpq   $0xffffffff80ab15d0,(%rsp)
> ffffffff804b9b95:        0 	80 
> ffffffff804b9b96:     1119 	75 1f                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9b98:       21 	4c 39 b8 30 02 00 00 	cmp    %r15,0x230(%rax)
> ffffffff804b9b9f:        0 	75 16                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9ba1:      492 	44 39 b0 38 02 00 00 	cmp    %r14d,0x238(%rax)
> ffffffff804b9ba8:        0 	75 0d                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9baa:        0 	8b 52 fc             	mov    -0x4(%rdx),%edx
> ffffffff804b9bad:      451 	85 d2                	test   %edx,%edx
> ffffffff804b9baf:        0 	74 67                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bb1:        0 	3b 54 24 40          	cmp    0x40(%rsp),%edx
> ffffffff804b9bb5:        0 	74 61                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bb7:        0 	48 89 ca             	mov    %rcx,%rdx
> ffffffff804b9bba:      402 	48 85 d2             	test   %rdx,%rdx
> ffffffff804b9bbd:     1006 	74 12                	je     ffffffff804b9bd1 <__inet_lookup_established+0xbf>
> ffffffff804b9bbf:        0 	48 8d 42 f8          	lea    -0x8(%rdx),%rax
> ffffffff804b9bc3:      821 	48 8b 0a             	mov    (%rdx),%rcx
> ffffffff804b9bc6:       78 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
> ffffffff804b9bca:        4 	0f 18 09             	prefetcht0 (%rcx)
> ffffffff804b9bcd:      685 	75 e8                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9bcf:   139502 	eb bd                	jmp    ffffffff804b9b8e <__inet_lookup_established+0x7c>
> ffffffff804b9bd1:        0 	48 8b 55 08          	mov    0x8(%rbp),%rdx
> ffffffff804b9bd5:        0 	eb 26                	jmp    ffffffff804b9bfd <__inet_lookup_established+0xeb>
> ffffffff804b9bd7:        0 	48 81 3c 24 d0 15 ab 	cmpq   $0xffffffff80ab15d0,(%rsp)
> ffffffff804b9bde:        0 	80 
> ffffffff804b9bdf:        0 	75 19                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9be1:        0 	4c 39 78 40          	cmp    %r15,0x40(%rax)
> ffffffff804b9be5:        0 	75 13                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9be7:        0 	44 39 70 48          	cmp    %r14d,0x48(%rax)
> ffffffff804b9beb:        0 	75 0d                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9bed:        0 	8b 52 fc             	mov    -0x4(%rdx),%edx
> ffffffff804b9bf0:        0 	85 d2                	test   %edx,%edx
> ffffffff804b9bf2:        0 	74 24                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bf4:        0 	3b 54 24 40          	cmp    0x40(%rsp),%edx
> ffffffff804b9bf8:        0 	74 1e                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bfa:        0 	48 89 ca             	mov    %rcx,%rdx
> ffffffff804b9bfd:        0 	48 85 d2             	test   %rdx,%rdx
> ffffffff804b9c00:        0 	74 12                	je     ffffffff804b9c14 <__inet_lookup_established+0x102>
> ffffffff804b9c02:        0 	48 8d 42 f8          	lea    -0x8(%rdx),%rax
> ffffffff804b9c06:        0 	48 8b 0a             	mov    (%rdx),%rcx
> ffffffff804b9c09:        0 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
> ffffffff804b9c0d:        0 	0f 18 09             	prefetcht0 (%rcx)
> ffffffff804b9c10:        0 	75 e8                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9c12:        0 	eb c3                	jmp    ffffffff804b9bd7 <__inet_lookup_established+0xc5>
> ffffffff804b9c14:        0 	31 c0                	xor    %eax,%eax
> ffffffff804b9c16:        0 	eb 04                	jmp    ffffffff804b9c1c <__inet_lookup_established+0x10a>
> ffffffff804b9c18:      441 	f0 ff 40 28          	lock incl 0x28(%rax)
> ffffffff804b9c1c:     1442 	f0 41 ff 04 24       	lock incl (%r12)
> ffffffff804b9c21:      476 	41 5b                	pop    %r11
> ffffffff804b9c23:        1 	5b                   	pop    %rbx
> ffffffff804b9c24:        0 	5d                   	pop    %rbp
> ffffffff804b9c25:      475 	41 5c                	pop    %r12
> ffffffff804b9c27:        0 	41 5d                	pop    %r13
> ffffffff804b9c29:        1 	41 5e                	pop    %r14
> ffffffff804b9c2b:      494 	41 5f                	pop    %r15
> ffffffff804b9c2d:        0 	c3                   	retq   
> ffffffff804b9c2e:        0 	90                   	nop    
> ffffffff804b9c2f:        0 	90                   	nop    
> 
> 80% of the overhead comes from cachemisses here:
> 
> ffffffff804b9bc6:       78 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
> ffffffff804b9bca:        4 	0f 18 09             	prefetcht0 (%rcx)
> ffffffff804b9bcd:      685 	75 e8                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9bcf:   139502 	eb bd                	jmp    ffffffff804b9b8e <__inet_lookup_established+0x7c>
> 
> corresponding to:
> 
> (gdb) list *0xffffffff804b9bc6
> 0xffffffff804b9bc6 is in __inet_lookup_established (net/ipv4/inet_hashtables.c:237).
> 232		rwlock_t *lock = inet_ehash_lockp(hashinfo, hash);
> 233	
> 234		prefetch(head->chain.first);
> 235		read_lock(lock);
> 236		sk_for_each(sk, node, &head->chain) {
> 237			if (INET_MATCH(sk, net, hash, acookie,
> 238						saddr, daddr, ports, dif))
> 239				goto hit; /* You sunk my battleship! */
> 240		}
> 241	
> 
> Seeing the first hard cachemiss on hash lookups is a familiar and 
> partly expected pattern - it is the first thing that touches 
> cache-cold data structures.
> 
> Seeing 1.4% of the total tbench overhead go into this single 
> cachemiss is a bit surprising to me though: tbench works via 
> long-lived connections (TCP establish costs are nowhere to be seen in 
> the profiles) so the socket hash should be relatively stable and 
> read-mostly on most CPUs in theory. The CPUs here have 2MB of L2 cache 
> per socket.
> 
> Could we be somehow dirtying these cachelines perhaps, causing 
> unnecessary cachemisses in hash lookups? Is the hash linkage portion 
> of the socket data structure frequently dirtied? Padding that to 64 
> bytes (or next to 64 bytes worth of read-mostly fields) could perhaps 
> give us a +1.7% tbench speedup.
> 

I am not seeing this, of course, on net-next-2.6, thanks to RCU.

Could it be that several tbench sockets are hashed on the same chain?

tbench uses 127.0.0.1 as both the dst and src address for its sockets.
The server binds to port 7003.


static inline unsigned int inet_ehashfn(struct net *net,
                                        const __be32 laddr, const __u16 lport,
                                        const __be32 faddr, const __be16 fport)
{
        return jhash_3words((__force __u32) laddr,
                            (__force __u32) faddr,
                            ((__u32) lport) << 16 | (__force __u32)fport,
                            inet_ehash_secret + net_hash_mix(net));
}

Hum... should be OK, thanks to jhash.
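
To illustrate why identical addresses alone should not put every tbench 
socket on one chain: the client port still differs per connection, and it 
is folded into the third hashed word. A minimal sketch of that key 
construction and of bucket selection follows. toy_mix() is a stand-in 
mixer written for this example, NOT jhash_3words(); the power-of-two 
table size is an assumption; byte-order handling is ignored:

```c
#include <stdint.h>

/* Pack both 16-bit ports into one 32-bit word, local port in the top half. */
static inline uint32_t pack_ports(uint16_t lport, uint16_t fport)
{
	return ((uint32_t)lport << 16) | fport;
}

/* Stand-in mixing function for illustration only. This is NOT jhash. */
static inline uint32_t toy_mix(uint32_t a, uint32_t b, uint32_t c,
			       uint32_t seed)
{
	uint32_t h = seed;

	h = (h ^ a) * 0x9e3779b1u;
	h = (h ^ b) * 0x9e3779b1u;
	h = (h ^ c) * 0x9e3779b1u;
	return h ^ (h >> 16);
}

/* Select a bucket; table_size is assumed to be a power of two. */
static inline uint32_t bucket_index(uint32_t hash, uint32_t table_size)
{
	return hash & (table_size - 1);
}
```

Two loopback connections to port 7003 that differ only in the client port 
produce different packed words, so a reasonable mixer spreads them over 
different chains.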

Maybe it is the same problem as in eth_type_trans:

You have a cache line miss because the socket we handle in the chain was previously
handled by another CPU (sk->refcnt being dirtied by this other CPU).


ffffffff804b9bc6:       78 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
ffffffff804b9bca:        4 	0f 18 09             	prefetcht0 (%rcx)

ffffffff804b9bcd:      685 	75 e8                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
< "jne" stalls because the CPU must bring 0x2c(%rax) into its cache to perform the compare >

ffffffff804b9bcf:   139502 	eb bd                	jmp    ffffffff804b9b8e <__inet_lookup_established+0x7c>

Even if you pad/move refcnt somewhere else in sk, you'll need to take a reference on it,
so it won't help very much.
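
A sketch of the layout question, assuming 64-byte cache lines and a 
line-aligned object. Both structs are hypothetical toys, not struct sock. 
In the first, the field compared during the chain walk shares a line with 
the refcount that another CPU keeps writing; in the second, the refcount 
sits on its own line:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64	/* assumed cache line size */

/* True if two byte offsets in a line-aligned object share a cache line. */
static inline int same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE_SIZE == off_b / CACHE_LINE_SIZE;
}

/* Hypothetical layout: lookup fields and refcount are adjacent. */
struct toy_sock_packed {
	void *next;		/* hash chain linkage, read during lookup */
	uint32_t hash;		/* compared during lookup */
	int refcnt;		/* written on every successful lookup */
};

/* Hypothetical layout: refcount pushed onto its own cache line (C11). */
struct toy_sock_padded {
	void *next;
	uint32_t hash;
	_Alignas(CACHE_LINE_SIZE) int refcnt;
};
```

The padded layout would only keep the compare in the chain walk from 
missing. The lookup that finds its socket still has to increment refcnt, 
so that line still bounces between CPUs, which is the point made above.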



^ permalink raw reply	[flat|nested] 349+ messages in thread

> ffffffff804b9be7:        0 	44 39 70 48          	cmp    %r14d,0x48(%rax)
> ffffffff804b9beb:        0 	75 0d                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9bed:        0 	8b 52 fc             	mov    -0x4(%rdx),%edx
> ffffffff804b9bf0:        0 	85 d2                	test   %edx,%edx
> ffffffff804b9bf2:        0 	74 24                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bf4:        0 	3b 54 24 40          	cmp    0x40(%rsp),%edx
> ffffffff804b9bf8:        0 	74 1e                	je     ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bfa:        0 	48 89 ca             	mov    %rcx,%rdx
> ffffffff804b9bfd:        0 	48 85 d2             	test   %rdx,%rdx
> ffffffff804b9c00:        0 	74 12                	je     ffffffff804b9c14 <__inet_lookup_established+0x102>
> ffffffff804b9c02:        0 	48 8d 42 f8          	lea    -0x8(%rdx),%rax
> ffffffff804b9c06:        0 	48 8b 0a             	mov    (%rdx),%rcx
> ffffffff804b9c09:        0 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
> ffffffff804b9c0d:        0 	0f 18 09             	prefetcht0 (%rcx)
> ffffffff804b9c10:        0 	75 e8                	jne    ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9c12:        0 	eb c3                	jmp    ffffffff804b9bd7 <__inet_lookup_established+0xc5>
> ffffffff804b9c14:        0 	31 c0                	xor    %eax,%eax
> ffffffff804b9c16:        0 	eb 04                	jmp    ffffffff804b9c1c <__inet_lookup_established+0x10a>
> ffffffff804b9c18:      441 	f0 ff 40 28          	lock incl 0x28(%rax)
> ffffffff804b9c1c:     1442 	f0 41 ff 04 24       	lock incl (%r12)
> ffffffff804b9c21:      476 	41 5b                	pop    %r11
> ffffffff804b9c23:        1 	5b                   	pop    %rbx
> ffffffff804b9c24:        0 	5d                   	pop    %rbp
> ffffffff804b9c25:      475 	41 5c                	pop    %r12
> ffffffff804b9c27:        0 	41 5d                	pop    %r13
> ffffffff804b9c29:        1 	41 5e                	pop    %r14
> ffffffff804b9c2b:      494 	41 5f                	pop    %r15
> ffffffff804b9c2d:        0 	c3                   	retq   
> ffffffff804b9c2e:        0 	90                   	nop    
> ffffffff804b9c2f:        0 	90                   	nop    
> 
> 80% of the overhead comes from cachemisses here:
> 
> ffffffff804b9bc6:       78 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
> ffffffff804b9bca:        4 	0f 18 09             	prefetcht0 (%rcx)
> ffffffff804b9bcd:      685 	75 e8                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9bcf:   139502 	eb bd                	jmp    ffffffff804b9b8e <__inet_lookup_established+0x7c>
> 
> corresponding to:
> 
> (gdb) list *0xffffffff804b9bc6
> 0xffffffff804b9bc6 is in __inet_lookup_established (net/ipv4/inet_hashtables.c:237).
> 232		rwlock_t *lock = inet_ehash_lockp(hashinfo, hash);
> 233	
> 234		prefetch(head->chain.first);
> 235		read_lock(lock);
> 236		sk_for_each(sk, node, &head->chain) {
> 237			if (INET_MATCH(sk, net, hash, acookie,
> 238						saddr, daddr, ports, dif))
> 239				goto hit; /* You sunk my battleship! */
> 240		}
> 241	
> 
> Seeing the first hard cachemiss on hash lookups is a familiar and 
> partly expected pattern - it is the first thing that touches 
> cache-cold data structures.
> 
> Seeing 1.4% of the total tbench overhead go into this single 
> cachemiss is a bit surprising to me though: tbench works via 
> long-lived connections (TCP establish costs are nowhere to be seen in 
> the profiles) so the socket hash should be relatively stable and 
> read-mostly on most CPUs in theory. The CPUs here have 2MB of L2 cache 
> per socket.
> 
> Could we be somehow dirtying these cachelines perhaps, causing 
> unnecessary cachemisses in hash lookups? Is the hash linkage portion 
> of the socket data structure frequently dirtied? Padding that to 64 
> bytes (or next to 64 bytes worth of read-mostly fields) could perhaps 
> give us a +1.7% tbench speedup.
> 

I am not seeing this on net-next-2.6, of course, thanks to RCU.

Could it be that several tbench sockets are hashed onto the same chain?

tbench uses 127.0.0.1 as both source and destination address for its sockets.
The server binds to port 7003.


static inline unsigned int inet_ehashfn(struct net *net,
                                        const __be32 laddr, const __u16 lport,
                                        const __be32 faddr, const __be16 fport)
{
        return jhash_3words((__force __u32) laddr,
                            (__force __u32) faddr,
                            ((__u32) lport) << 16 | (__force __u32)fport,
                            inet_ehash_secret + net_hash_mix(net));
}

Hum... should be OK, thanks to jhash.

Maybe it is the same problem as with eth_type_trans:

You take a cache line miss because the socket we look at in the chain was
previously handled by another CPU (sk->refcnt was dirtied by that other CPU).


ffffffff804b9bc6:       78 	44 39 68 2c          	cmp    %r13d,0x2c(%rax)
ffffffff804b9bca:        4 	0f 18 09             	prefetcht0 (%rcx)

ffffffff804b9bcd:      685 	75 e8                	jne    ffffffff804b9bb7 <__inet_lookup_established+0xa5>
< "jne" stalls because the CPU must bring 0x2c(%rax) into its cache to perform the compare >

ffffffff804b9bcf:   139502 	eb bd                	jmp    ffffffff804b9b8e <__inet_lookup_established+0x7c>

Even if you pad or move refcnt somewhere else in sk, you'll still need to take
a reference on it, so it won't help very much.



* tcp_transmit_skb() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:14                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.431553 tcp_transmit_skb

                      hits (total: 143155)
                 .........
ffffffff804c550e:      485 <tcp_transmit_skb>:
ffffffff804c550e:      485 	41 57                	push   %r15
ffffffff804c5510:     5692 	41 56                	push   %r14
ffffffff804c5512:      390 	49 89 f6             	mov    %rsi,%r14
ffffffff804c5515:        0 	41 55                	push   %r13
ffffffff804c5517:       69 	41 54                	push   %r12
ffffffff804c5519:      388 	41 89 d4             	mov    %edx,%r12d
ffffffff804c551c:        0 	55                   	push   %rbp
ffffffff804c551d:       66 	48 89 fd             	mov    %rdi,%rbp
ffffffff804c5520:      405 	53                   	push   %rbx
ffffffff804c5521:        0 	89 cb                	mov    %ecx,%ebx
ffffffff804c5523:       75 	48 83 ec 38          	sub    $0x38,%rsp
ffffffff804c5527:      396 	48 85 f6             	test   %rsi,%rsi
ffffffff804c552a:       51 	74 15                	je     ffffffff804c5541 <tcp_transmit_skb+0x33>
ffffffff804c552c:      396 	8b 96 c8 00 00 00    	mov    0xc8(%rsi),%edx
ffffffff804c5532:        1 	48 8b 86 d0 00 00 00 	mov    0xd0(%rsi),%rax
ffffffff804c5539:       63 	66 83 7c 02 08 00    	cmpw   $0x0,0x8(%rdx,%rax,1)
ffffffff804c553f:      417 	75 04                	jne    ffffffff804c5545 <tcp_transmit_skb+0x37>
ffffffff804c5541:        0 	0f 0b                	ud2a   
ffffffff804c5543:        0 	eb fe                	jmp    ffffffff804c5543 <tcp_transmit_skb+0x35>
ffffffff804c5545:     3719 	48 8b 87 60 03 00 00 	mov    0x360(%rdi),%rax
ffffffff804c554c:     2873 	f6 40 10 02          	testb  $0x2,0x10(%rax)
ffffffff804c5550:        1 	74 09                	je     ffffffff804c555b <tcp_transmit_skb+0x4d>
ffffffff804c5552:        0 	e8 1d 48 d8 ff       	callq  ffffffff80249d74 <ktime_get_real>
ffffffff804c5557:        0 	49 89 46 18          	mov    %rax,0x18(%r14)
ffffffff804c555b:      487 	45 85 e4             	test   %r12d,%r12d
ffffffff804c555e:      456 	74 33                	je     ffffffff804c5593 <tcp_transmit_skb+0x85>
ffffffff804c5560:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c5563:      482 	e8 28 f4 ff ff       	callq  ffffffff804c4990 <skb_cloned>
ffffffff804c5568:     1469 	85 c0                	test   %eax,%eax
ffffffff804c556a:     1085 	74 0c                	je     ffffffff804c5578 <tcp_transmit_skb+0x6a>
ffffffff804c556c:        0 	89 de                	mov    %ebx,%esi
ffffffff804c556e:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c5571:        0 	e8 47 41 fc ff       	callq  ffffffff804896bd <pskb_copy>
ffffffff804c5576:        0 	eb 0a                	jmp    ffffffff804c5582 <tcp_transmit_skb+0x74>
ffffffff804c5578:        0 	89 de                	mov    %ebx,%esi
ffffffff804c557a:      906 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c557d:        0 	e8 ab 35 fc ff       	callq  ffffffff80488b2d <skb_clone>
ffffffff804c5582:        0 	48 85 c0             	test   %rax,%rax
ffffffff804c5585:        7 	49 89 c6             	mov    %rax,%r14
ffffffff804c5588:      576 	bb 97 ff ff ff       	mov    $0xffffff97,%ebx
ffffffff804c558d:        0 	0f 84 59 05 00 00    	je     ffffffff804c5aec <tcp_transmit_skb+0x5de>
ffffffff804c5593:        0 	49 8d 46 38          	lea    0x38(%r14),%rax
ffffffff804c5597:      699 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
ffffffff804c559c:        1 	fc                   	cld    
ffffffff804c559d:      452 	48 89 04 24          	mov    %rax,(%rsp)
ffffffff804c55a1:       40 	48 89 d7             	mov    %rdx,%rdi
ffffffff804c55a4:        1 	31 c0                	xor    %eax,%eax
ffffffff804c55a6:      432 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55a7:      956 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55a8:      959 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55a9:      910 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55aa:      943 	48 8b 0c 24          	mov    (%rsp),%rcx
ffffffff804c55ae:      455 	f6 41 24 02          	testb  $0x2,0x24(%rcx)
ffffffff804c55b2:        0 	0f 84 b7 00 00 00    	je     ffffffff804c566f <tcp_transmit_skb+0x161>
ffffffff804c55b8:        0 	48 8b 85 b8 05 00 00 	mov    0x5b8(%rbp),%rax
ffffffff804c55bf:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff804c55c2:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c55c5:        0 	ff 10                	callq  *(%rax)
ffffffff804c55c7:        0 	31 f6                	xor    %esi,%esi
ffffffff804c55c9:        0 	48 85 c0             	test   %rax,%rax
ffffffff804c55cc:        0 	48 89 44 24 28       	mov    %rax,0x28(%rsp)
ffffffff804c55d1:        0 	74 08                	je     ffffffff804c55db <tcp_transmit_skb+0xcd>
ffffffff804c55d3:        0 	80 4c 24 10 04       	orb    $0x4,0x10(%rsp)
ffffffff804c55d8:        0 	40 b6 14             	mov    $0x14,%sil
ffffffff804c55db:        0 	48 8b 55 78          	mov    0x78(%rbp),%rdx
ffffffff804c55df:        0 	0f b7 85 5c 04 00 00 	movzwl 0x45c(%rbp),%eax
ffffffff804c55e6:        0 	48 85 d2             	test   %rdx,%rdx
ffffffff804c55e9:        0 	74 13                	je     ffffffff804c55fe <tcp_transmit_skb+0xf0>
ffffffff804c55eb:        0 	8b 92 94 00 00 00    	mov    0x94(%rdx),%edx
ffffffff804c55f1:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c55f3:        0 	73 09                	jae    ffffffff804c55fe <tcp_transmit_skb+0xf0>
ffffffff804c55f5:        0 	89 d0                	mov    %edx,%eax
ffffffff804c55f7:        0 	66 89 95 5c 04 00 00 	mov    %dx,0x45c(%rbp)
ffffffff804c55fe:        0 	83 3d 23 2e 3f 00 00 	cmpl   $0x0,0x3f2e23(%rip)        # ffffffff808b8428 <sysctl_tcp_timestamps>
ffffffff804c5605:        0 	66 89 44 24 14       	mov    %ax,0x14(%rsp)
ffffffff804c560a:        0 	8d 4e 04             	lea    0x4(%rsi),%ecx
ffffffff804c560d:        0 	74 25                	je     ffffffff804c5634 <tcp_transmit_skb+0x126>
ffffffff804c560f:        0 	48 83 7c 24 28 00    	cmpq   $0x0,0x28(%rsp)
ffffffff804c5615:        0 	75 1d                	jne    ffffffff804c5634 <tcp_transmit_skb+0x126>
ffffffff804c5617:        0 	48 8b 14 24          	mov    (%rsp),%rdx
ffffffff804c561b:        0 	80 4c 24 10 02       	orb    $0x2,0x10(%rsp)
ffffffff804c5620:        0 	8d 4e 10             	lea    0x10(%rsi),%ecx
ffffffff804c5623:        0 	8b 42 20             	mov    0x20(%rdx),%eax
ffffffff804c5626:        0 	89 44 24 18          	mov    %eax,0x18(%rsp)
ffffffff804c562a:        0 	8b 85 90 04 00 00    	mov    0x490(%rbp),%eax
ffffffff804c5630:        0 	89 44 24 1c          	mov    %eax,0x1c(%rsp)
ffffffff804c5634:        0 	83 3d f1 2d 3f 00 00 	cmpl   $0x0,0x3f2df1(%rip)        # ffffffff808b842c <sysctl_tcp_window_scaling>
ffffffff804c563b:        0 	74 15                	je     ffffffff804c5652 <tcp_transmit_skb+0x144>
ffffffff804c563d:        0 	8a 85 9d 04 00 00    	mov    0x49d(%rbp),%al
ffffffff804c5643:        0 	8d 51 04             	lea    0x4(%rcx),%edx
ffffffff804c5646:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c5649:        0 	84 c0                	test   %al,%al
ffffffff804c564b:        0 	88 44 24 11          	mov    %al,0x11(%rsp)
ffffffff804c564f:        0 	0f 45 ca             	cmovne %edx,%ecx
ffffffff804c5652:        0 	83 3d d7 2d 3f 00 00 	cmpl   $0x0,0x3f2dd7(%rip)        # ffffffff808b8430 <sysctl_tcp_sack>
ffffffff804c5659:        0 	74 26                	je     ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c565b:        0 	8a 44 24 10          	mov    0x10(%rsp),%al
ffffffff804c565f:        0 	83 c8 01             	or     $0x1,%eax
ffffffff804c5662:        0 	a8 02                	test   $0x2,%al
ffffffff804c5664:        0 	88 44 24 10          	mov    %al,0x10(%rsp)
ffffffff804c5668:        0 	75 17                	jne    ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c566a:        0 	83 c1 04             	add    $0x4,%ecx
ffffffff804c566d:        0 	eb 12                	jmp    ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c566f:      502 	48 8d 4c 24 28       	lea    0x28(%rsp),%rcx
ffffffff804c5674:      638 	4c 89 f6             	mov    %r14,%rsi
ffffffff804c5677:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c567a:        0 	e8 1e fb ff ff       	callq  ffffffff804c519d <tcp_established_options>
ffffffff804c567f:      468 	89 c1                	mov    %eax,%ecx
ffffffff804c5681:     1605 	8b 85 74 04 00 00    	mov    0x474(%rbp),%eax
ffffffff804c5687:      307 	03 85 78 04 00 00    	add    0x478(%rbp),%eax
ffffffff804c568d:        0 	44 8d 69 14          	lea    0x14(%rcx),%r13d
ffffffff804c5691:      409 	2b 85 d0 04 00 00    	sub    0x4d0(%rbp),%eax
ffffffff804c5697:       89 	3b 85 cc 04 00 00    	cmp    0x4cc(%rbp),%eax
ffffffff804c569d:        0 	75 0a                	jne    ffffffff804c56a9 <tcp_transmit_skb+0x19b>
ffffffff804c569f:      415 	31 f6                	xor    %esi,%esi
ffffffff804c56a1:      210 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c56a4:        0 	e8 b0 f3 ff ff       	callq  ffffffff804c4a59 <tcp_ca_event>
ffffffff804c56a9:     1050 	44 89 ee             	mov    %r13d,%esi
ffffffff804c56ac:     1063 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c56af:        0 	e8 00 34 fc ff       	callq  ffffffff80488ab4 <skb_push>
ffffffff804c56b4:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c56b7:      789 	e8 4f f3 ff ff       	callq  ffffffff804c4a0b <skb_reset_transport_header>
ffffffff804c56bc:      509 	f0 ff 45 28          	lock incl 0x28(%rbp)
ffffffff804c56c0:      494 	49 89 6e 10          	mov    %rbp,0x10(%r14)
ffffffff804c56c4:     3510 	49 c7 86 80 00 00 00 	movq   $0xffffffff80486679,0x80(%r14)
ffffffff804c56cb:        0 	79 66 48 80 
ffffffff804c56cf:      102 	41 8b 86 e0 00 00 00 	mov    0xe0(%r14),%eax
ffffffff804c56d6:      155 	f0 01 85 98 00 00 00 	lock add %eax,0x98(%rbp)
ffffffff804c56dd:      437 	41 8b 9e b8 00 00 00 	mov    0xb8(%r14),%ebx
ffffffff804c56e4:      219 	8b 85 50 02 00 00    	mov    0x250(%rbp),%eax
ffffffff804c56ea:       71 	49 03 9e d0 00 00 00 	add    0xd0(%r14),%rbx
ffffffff804c56f1:      735 	66 89 03             	mov    %ax,(%rbx)
ffffffff804c56f4:        0 	8b 85 38 02 00 00    	mov    0x238(%rbp),%eax
ffffffff804c56fa:       75 	66 89 43 02          	mov    %ax,0x2(%rbx)
ffffffff804c56fe:      720 	48 8b 0c 24          	mov    (%rsp),%rcx
ffffffff804c5702:     5992 	8b 41 18             	mov    0x18(%rcx),%eax
ffffffff804c5705:     1460 	0f c8                	bswap  %eax
ffffffff804c5707:       60 	89 43 04             	mov    %eax,0x4(%rbx)
ffffffff804c570a:       69 	8b 85 f0 03 00 00    	mov    0x3f0(%rbp),%eax
ffffffff804c5710:      374 	0f c8                	bswap  %eax
ffffffff804c5712:       43 	89 43 08             	mov    %eax,0x8(%rbx)
ffffffff804c5715:       76 	0f b6 51 24          	movzbl 0x24(%rcx),%edx
ffffffff804c5719:      337 	44 89 e8             	mov    %r13d,%eax
ffffffff804c571c:       36 	c1 e8 02             	shr    $0x2,%eax
ffffffff804c571f:       76 	c1 e0 0c             	shl    $0xc,%eax
ffffffff804c5722:      476 	09 d0                	or     %edx,%eax
ffffffff804c5724:       48 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c5728:       51 	66 89 43 0c          	mov    %ax,0xc(%rbx)
ffffffff804c572c:      370 	0f b6 41 24          	movzbl 0x24(%rcx),%eax
ffffffff804c5730:      137 	89 c2                	mov    %eax,%edx
ffffffff804c5732:      118 	83 e2 02             	and    $0x2,%edx
ffffffff804c5735:      377 	74 1b                	je     ffffffff804c5752 <tcp_transmit_skb+0x244>
ffffffff804c5737:        0 	81 bd c0 04 00 00 ff 	cmpl   $0xffff,0x4c0(%rbp)
ffffffff804c573e:        0 	ff 00 00 
ffffffff804c5741:        0 	b8 ff ff 00 00       	mov    $0xffff,%eax
ffffffff804c5746:        0 	0f 46 85 c0 04 00 00 	cmovbe 0x4c0(%rbp),%eax
ffffffff804c574d:        0 	e9 a0 00 00 00       	jmpq   ffffffff804c57f2 <tcp_transmit_skb+0x2e4>
ffffffff804c5752:       34 	8b 85 f8 03 00 00    	mov    0x3f8(%rbp),%eax
ffffffff804c5758:     5610 	03 85 c0 04 00 00    	add    0x4c0(%rbp),%eax
ffffffff804c575e:       44 	41 89 d4             	mov    %edx,%r12d
ffffffff804c5761:      539 	2b 85 f0 03 00 00    	sub    0x3f0(%rbp),%eax
ffffffff804c5767:        1 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c576a:       51 	44 0f 49 e0          	cmovns %eax,%r12d
ffffffff804c576e:      495 	e8 7e f8 ff ff       	callq  ffffffff804c4ff1 <__tcp_select_window>
ffffffff804c5773:      484 	44 39 e0             	cmp    %r12d,%eax
ffffffff804c5776:      244 	89 c2                	mov    %eax,%edx
ffffffff804c5778:        0 	73 19                	jae    ffffffff804c5793 <tcp_transmit_skb+0x285>
ffffffff804c577a:        0 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c5780:        0 	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff804c5785:        0 	c0 e9 04             	shr    $0x4,%cl
ffffffff804c5788:        0 	d3 e0                	shl    %cl,%eax
ffffffff804c578a:        0 	42 8d 54 20 ff       	lea    -0x1(%rax,%r12,1),%edx
ffffffff804c578f:        0 	f7 d8                	neg    %eax
ffffffff804c5791:        0 	21 c2                	and    %eax,%edx
ffffffff804c5793:      217 	f6 85 9d 04 00 00 f0 	testb  $0xf0,0x49d(%rbp)
ffffffff804c579a:     2014 	8b 85 f0 03 00 00    	mov    0x3f0(%rbp),%eax
ffffffff804c57a0:        0 	89 95 c0 04 00 00    	mov    %edx,0x4c0(%rbp)
ffffffff804c57a6:      490 	89 85 f8 03 00 00    	mov    %eax,0x3f8(%rbp)
ffffffff804c57ac:        1 	75 16                	jne    ffffffff804c57c4 <tcp_transmit_skb+0x2b6>
ffffffff804c57ae:        0 	83 3d bb 2c 3f 00 00 	cmpl   $0x0,0x3f2cbb(%rip)        # ffffffff808b8470 <sysctl_tcp_workaround_signed_windows>
ffffffff804c57b5:        0 	74 0d                	je     ffffffff804c57c4 <tcp_transmit_skb+0x2b6>
ffffffff804c57b7:        0 	b8 ff 7f 00 00       	mov    $0x7fff,%eax
ffffffff804c57bc:        0 	81 fa ff 7f 00 00    	cmp    $0x7fff,%edx
ffffffff804c57c2:        0 	eb 12                	jmp    ffffffff804c57d6 <tcp_transmit_skb+0x2c8>
ffffffff804c57c4:        0 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c57ca:     7025 	b8 ff ff 00 00       	mov    $0xffff,%eax
ffffffff804c57cf:        0 	c0 e9 04             	shr    $0x4,%cl
ffffffff804c57d2:      418 	d3 e0                	shl    %cl,%eax
ffffffff804c57d4:      102 	39 c2                	cmp    %eax,%edx
ffffffff804c57d6:        0 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c57dc:      424 	0f 46 c2             	cmovbe %edx,%eax
ffffffff804c57df:      105 	c0 e9 04             	shr    $0x4,%cl
ffffffff804c57e2:        9 	d3 e8                	shr    %cl,%eax
ffffffff804c57e4:      389 	85 c0                	test   %eax,%eax
ffffffff804c57e6:       76 	75 0a                	jne    ffffffff804c57f2 <tcp_transmit_skb+0x2e4>
ffffffff804c57e8:        0 	c7 85 ec 03 00 00 00 	movl   $0x0,0x3ec(%rbp)
ffffffff804c57ef:        0 	00 00 00 
ffffffff804c57f2:        2 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c57f6:     1657 	66 c7 43 10 00 00    	movw   $0x0,0x10(%rbx)
ffffffff804c57fc:       35 	66 c7 43 12 00 00    	movw   $0x0,0x12(%rbx)
ffffffff804c5802:     4377 	66 89 43 0e          	mov    %ax,0xe(%rbx)
ffffffff804c5806:      954 	8b 95 80 04 00 00    	mov    0x480(%rbp),%edx
ffffffff804c580c:       31 	39 95 00 04 00 00    	cmp    %edx,0x400(%rbp)
ffffffff804c5812:      186 	74 27                	je     ffffffff804c583b <tcp_transmit_skb+0x32d>
ffffffff804c5814:        0 	48 8b 34 24          	mov    (%rsp),%rsi
ffffffff804c5818:        0 	8b 4e 18             	mov    0x18(%rsi),%ecx
ffffffff804c581b:        0 	89 d6                	mov    %edx,%esi
ffffffff804c581d:        0 	8d 41 01             	lea    0x1(%rcx),%eax
ffffffff804c5820:        0 	29 c6                	sub    %eax,%esi
ffffffff804c5822:        0 	81 fe fe ff 00 00    	cmp    $0xfffe,%esi
ffffffff804c5828:        0 	77 11                	ja     ffffffff804c583b <tcp_transmit_skb+0x32d>
ffffffff804c582a:        0 	89 d0                	mov    %edx,%eax
ffffffff804c582c:        0 	80 4b 0d 20          	orb    $0x20,0xd(%rbx)
ffffffff804c5830:        0 	66 29 c8             	sub    %cx,%ax
ffffffff804c5833:        0 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c5837:        0 	66 89 43 12          	mov    %ax,0x12(%rbx)
ffffffff804c583b:      268 	48 8d 7b 14          	lea    0x14(%rbx),%rdi
ffffffff804c583f:      187 	48 8d 4c 24 20       	lea    0x20(%rsp),%rcx
ffffffff804c5844:     4006 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
ffffffff804c5849:     1117 	48 89 ee             	mov    %rbp,%rsi
ffffffff804c584c:        0 	e8 a9 fb ff ff       	callq  ffffffff804c53fa <tcp_options_write>
ffffffff804c5851:     1285 	48 8b 04 24          	mov    (%rsp),%rax
ffffffff804c5855:      727 	f6 40 24 02          	testb  $0x2,0x24(%rax)
ffffffff804c5859:        0 	0f 85 8f 00 00 00    	jne    ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c585f:        0 	f6 85 7e 04 00 00 01 	testb  $0x1,0x47e(%rbp)
ffffffff804c5866:      456 	0f 84 82 00 00 00    	je     ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c586c:        0 	45 39 6e 68          	cmp    %r13d,0x68(%r14)
ffffffff804c5870:        0 	74 53                	je     ffffffff804c58c5 <tcp_transmit_skb+0x3b7>
ffffffff804c5872:        0 	8b 95 fc 03 00 00    	mov    0x3fc(%rbp),%edx
ffffffff804c5878:        0 	39 50 18             	cmp    %edx,0x18(%rax)
ffffffff804c587b:        0 	78 48                	js     ffffffff804c58c5 <tcp_transmit_skb+0x3b7>
ffffffff804c587d:        0 	8a 85 7e 04 00 00    	mov    0x47e(%rbp),%al
ffffffff804c5883:        0 	80 8d 54 02 00 00 02 	orb    $0x2,0x254(%rbp)
ffffffff804c588a:        0 	a8 02                	test   $0x2,%al
ffffffff804c588c:        0 	74 3e                	je     ffffffff804c58cc <tcp_transmit_skb+0x3be>
ffffffff804c588e:        0 	83 e0 fd             	and    $0xfffffffffffffffd,%eax
ffffffff804c5891:        0 	88 85 7e 04 00 00    	mov    %al,0x47e(%rbp)
ffffffff804c5897:        0 	41 8b 8e b8 00 00 00 	mov    0xb8(%r14),%ecx
ffffffff804c589e:        0 	49 8b 96 d0 00 00 00 	mov    0xd0(%r14),%rdx
ffffffff804c58a5:        0 	8a 44 11 0d          	mov    0xd(%rcx,%rdx,1),%al
ffffffff804c58a9:        0 	83 c8 80             	or     $0xffffffffffffff80,%eax
ffffffff804c58ac:        0 	88 44 0a 0d          	mov    %al,0xd(%rdx,%rcx,1)
ffffffff804c58b0:        0 	41 8b 86 c8 00 00 00 	mov    0xc8(%r14),%eax
ffffffff804c58b7:        0 	49 03 86 d0 00 00 00 	add    0xd0(%r14),%rax
ffffffff804c58be:        0 	66 83 48 0a 08       	orw    $0x8,0xa(%rax)
ffffffff804c58c3:        0 	eb 07                	jmp    ffffffff804c58cc <tcp_transmit_skb+0x3be>
ffffffff804c58c5:        0 	80 a5 54 02 00 00 fc 	andb   $0xfc,0x254(%rbp)
ffffffff804c58cc:        0 	f6 85 7e 04 00 00 04 	testb  $0x4,0x47e(%rbp)
ffffffff804c58d3:        0 	74 19                	je     ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c58d5:        0 	41 8b 8e b8 00 00 00 	mov    0xb8(%r14),%ecx
ffffffff804c58dc:        0 	49 8b 96 d0 00 00 00 	mov    0xd0(%r14),%rdx
ffffffff804c58e3:        0 	8a 44 11 0d          	mov    0xd(%rcx,%rdx,1),%al
ffffffff804c58e7:        0 	83 c8 40             	or     $0x40,%eax
ffffffff804c58ea:        0 	88 44 0a 0d          	mov    %al,0xd(%rdx,%rcx,1)
ffffffff804c58ee:        0 	48 83 7c 24 28 00    	cmpq   $0x0,0x28(%rsp)
ffffffff804c58f4:     9425 	74 26                	je     ffffffff804c591c <tcp_transmit_skb+0x40e>
ffffffff804c58f6:        0 	48 8b 85 b8 05 00 00 	mov    0x5b8(%rbp),%rax
ffffffff804c58fd:        0 	81 a5 fc 00 00 00 ff 	andl   $0xffff,0xfc(%rbp)
ffffffff804c5904:        0 	ff 00 00 
ffffffff804c5907:        0 	4d 89 f0             	mov    %r14,%r8
ffffffff804c590a:        0 	48 8b 74 24 28       	mov    0x28(%rsp),%rsi
ffffffff804c590f:        0 	48 8b 7c 24 20       	mov    0x20(%rsp),%rdi
ffffffff804c5914:        0 	31 c9                	xor    %ecx,%ecx
ffffffff804c5916:        0 	48 89 ea             	mov    %rbp,%rdx
ffffffff804c5919:        0 	ff 50 08             	callq  *0x8(%rax)
ffffffff804c591c:        0 	48 8b 85 68 03 00 00 	mov    0x368(%rbp),%rax
ffffffff804c5923:     2344 	41 8b 76 68          	mov    0x68(%r14),%esi
ffffffff804c5927:        1 	4c 89 f2             	mov    %r14,%rdx
ffffffff804c592a:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c592d:      486 	ff 50 08             	callq  *0x8(%rax)
ffffffff804c5930:       44 	48 8b 0c 24          	mov    (%rsp),%rcx
ffffffff804c5934:      836 	f6 41 24 10          	testb  $0x10,0x24(%rcx)
ffffffff804c5938:        0 	74 4f                	je     ffffffff804c5989 <tcp_transmit_skb+0x47b>
ffffffff804c593a:       75 	41 8b 96 c8 00 00 00 	mov    0xc8(%r14),%edx
ffffffff804c5941:     8600 	49 8b 86 d0 00 00 00 	mov    0xd0(%r14),%rax
ffffffff804c5948:     1667 	8b 44 10 08          	mov    0x8(%rax,%rdx,1),%eax
ffffffff804c594c:       13 	8a 95 81 03 00 00    	mov    0x381(%rbp),%dl
ffffffff804c5952:       24 	84 d2                	test   %dl,%dl
ffffffff804c5954:      429 	74 25                	je     ffffffff804c597b <tcp_transmit_skb+0x46d>
ffffffff804c5956:        0 	0f b7 c8             	movzwl %ax,%ecx
ffffffff804c5959:        3 	0f b6 c2             	movzbl %dl,%eax
ffffffff804c595c:        0 	39 c1                	cmp    %eax,%ecx
ffffffff804c595e:        0 	72 13                	jb     ffffffff804c5973 <tcp_transmit_skb+0x465>
ffffffff804c5960:        0 	c6 85 81 03 00 00 00 	movb   $0x0,0x381(%rbp)
ffffffff804c5967:        1 	c7 85 84 03 00 00 0a 	movl   $0xa,0x384(%rbp)
ffffffff804c596e:        0 	00 00 00 
ffffffff804c5971:        0 	eb 08                	jmp    ffffffff804c597b <tcp_transmit_skb+0x46d>
ffffffff804c5973:        1 	28 ca                	sub    %cl,%dl
ffffffff804c5975:        0 	88 95 81 03 00 00    	mov    %dl,0x381(%rbp)
ffffffff804c597b:       11 	c6 85 80 03 00 00 00 	movb   $0x0,0x380(%rbp)
ffffffff804c5982:     4553 	c6 85 83 03 00 00 00 	movb   $0x0,0x383(%rbp)
ffffffff804c5989:      714 	45 39 6e 68          	cmp    %r13d,0x68(%r14)
ffffffff804c598d:        1 	0f 84 e2 00 00 00    	je     ffffffff804c5a75 <tcp_transmit_skb+0x567>
ffffffff804c5993:      288 	83 3d e6 2a 3f 00 00 	cmpl   $0x0,0x3f2ae6(%rip)        # ffffffff808b8480 <sysctl_tcp_slow_start_after_idle>
ffffffff804c599a:      247 	48 8b 05 df 3e 3f 00 	mov    0x3f3edf(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c59a1:      711 	41 89 c7             	mov    %eax,%r15d
ffffffff804c59a4:        0 	0f 84 ad 00 00 00    	je     ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59aa:      159 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c59b1:      311 	0f 85 a0 00 00 00    	jne    ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59b7:        0 	44 8b ad 0c 04 00 00 	mov    0x40c(%rbp),%r13d
ffffffff804c59be:      183 	44 29 e8             	sub    %r13d,%eax
ffffffff804c59c1:      475 	3b 85 58 03 00 00    	cmp    0x358(%rbp),%eax
ffffffff804c59c7:       54 	0f 86 8a 00 00 00    	jbe    ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59cd:        0 	48 8b 75 78          	mov    0x78(%rbp),%rsi
ffffffff804c59d1:        1 	48 8b 05 a8 3e 3f 00 	mov    0x3f3ea8(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c59d8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c59db:        0 	48 89 44 24 08       	mov    %rax,0x8(%rsp)
ffffffff804c59e0:        0 	e8 9c 92 ff ff       	callq  ffffffff804bec81 <tcp_init_cwnd>
ffffffff804c59e5:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c59ea:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c59ed:        0 	41 89 c4             	mov    %eax,%r12d
ffffffff804c59f0:        0 	8b 9d ac 04 00 00    	mov    0x4ac(%rbp),%ebx
ffffffff804c59f6:        0 	e8 5e f0 ff ff       	callq  ffffffff804c4a59 <tcp_ca_event>
ffffffff804c59fb:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c59fe:        0 	e8 6d f0 ff ff       	callq  ffffffff804c4a70 <tcp_current_ssthresh>
ffffffff804c5a03:        0 	89 85 a8 04 00 00    	mov    %eax,0x4a8(%rbp)
ffffffff804c5a09:        4 	8b 85 58 03 00 00    	mov    0x358(%rbp),%eax
ffffffff804c5a0f:        0 	41 39 dc             	cmp    %ebx,%r12d
ffffffff804c5a12:        0 	8b 54 24 08          	mov    0x8(%rsp),%edx
ffffffff804c5a16:        0 	89 d9                	mov    %ebx,%ecx
ffffffff804c5a18:        0 	41 0f 46 cc          	cmovbe %r12d,%ecx
ffffffff804c5a1c:        0 	89 c6                	mov    %eax,%esi
ffffffff804c5a1e:        0 	44 29 ea             	sub    %r13d,%edx
ffffffff804c5a21:        0 	f7 de                	neg    %esi
ffffffff804c5a23:        0 	29 c2                	sub    %eax,%edx
ffffffff804c5a25:        0 	89 d8                	mov    %ebx,%eax
ffffffff804c5a27:        0 	eb 02                	jmp    ffffffff804c5a2b <tcp_transmit_skb+0x51d>
ffffffff804c5a29:        0 	d1 e8                	shr    %eax
ffffffff804c5a2b:        0 	85 d2                	test   %edx,%edx
ffffffff804c5a2d:        1 	7e 06                	jle    ffffffff804c5a35 <tcp_transmit_skb+0x527>
ffffffff804c5a2f:        0 	01 f2                	add    %esi,%edx
ffffffff804c5a31:        0 	39 c8                	cmp    %ecx,%eax
ffffffff804c5a33:        0 	77 f4                	ja     ffffffff804c5a29 <tcp_transmit_skb+0x51b>
ffffffff804c5a35:        0 	39 c8                	cmp    %ecx,%eax
ffffffff804c5a37:        1 	0f 43 c8             	cmovae %eax,%ecx
ffffffff804c5a3a:        0 	89 8d ac 04 00 00    	mov    %ecx,0x4ac(%rbp)
ffffffff804c5a40:        0 	48 8b 05 39 3e 3f 00 	mov    0x3f3e39(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c5a47:        0 	c7 85 b8 04 00 00 00 	movl   $0x0,0x4b8(%rbp)
ffffffff804c5a4e:        0 	00 00 00 
ffffffff804c5a51:        0 	89 85 bc 04 00 00    	mov    %eax,0x4bc(%rbp)
ffffffff804c5a57:      173 	44 89 bd 0c 04 00 00 	mov    %r15d,0x40c(%rbp)
ffffffff804c5a5e:     5224 	44 2b bd 90 03 00 00 	sub    0x390(%rbp),%r15d
ffffffff804c5a65:      478 	44 3b bd 84 03 00 00 	cmp    0x384(%rbp),%r15d
ffffffff804c5a6c:        0 	73 07                	jae    ffffffff804c5a75 <tcp_transmit_skb+0x567>
ffffffff804c5a6e:       38 	c6 85 82 03 00 00 01 	movb   $0x1,0x382(%rbp)
ffffffff804c5a75:      452 	48 8b 14 24          	mov    (%rsp),%rdx
ffffffff804c5a79:      312 	8b 42 1c             	mov    0x1c(%rdx),%eax
ffffffff804c5a7c:       33 	39 85 fc 03 00 00    	cmp    %eax,0x3fc(%rbp)
ffffffff804c5a82:     4768 	78 05                	js     ffffffff804c5a89 <tcp_transmit_skb+0x57b>
ffffffff804c5a84:        0 	39 42 18             	cmp    %eax,0x18(%rdx)
ffffffff804c5a87:       20 	75 37                	jne    ffffffff804c5ac0 <tcp_transmit_skb+0x5b2>
ffffffff804c5a89:       30 	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff804c5a90:        0 	00 00 
ffffffff804c5a92:     1059 	8b 80 48 e0 ff ff    	mov    -0x1fb8(%rax),%eax
ffffffff804c5a98:       21 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c5a9f:        0 	00 
ffffffff804c5aa0:       14 	89 d2                	mov    %edx,%edx
ffffffff804c5aa2:      471 	30 c0                	xor    %al,%al
ffffffff804c5aa4:        3 	66 83 f8 01          	cmp    $0x1,%ax
ffffffff804c5aa8:       21 	48 19 c0             	sbb    %rax,%rax
ffffffff804c5aab:      433 	83 e0 08             	and    $0x8,%eax
ffffffff804c5aae:        2 	48 8b 80 98 16 ab 80 	mov    -0x7f54e968(%rax),%rax
ffffffff804c5ab5:       16 	48 f7 d0             	not    %rax
ffffffff804c5ab8:      457 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c5abc:        3 	48 ff 40 58          	incq   0x58(%rax)
ffffffff804c5ac0:       20 	48 8b 85 68 03 00 00 	mov    0x368(%rbp),%rax
ffffffff804c5ac7:      424 	31 f6                	xor    %esi,%esi
ffffffff804c5ac9:        2 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c5acc:       20 	ff 10                	callq  *(%rax)
ffffffff804c5ace:        0 	85 c0                	test   %eax,%eax
ffffffff804c5ad0:     9596 	89 c3                	mov    %eax,%ebx
ffffffff804c5ad2:        0 	7e 18                	jle    ffffffff804c5aec <tcp_transmit_skb+0x5de>
ffffffff804c5ad4:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c5ad9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c5adc:        0 	e8 d9 91 ff ff       	callq  ffffffff804becba <tcp_enter_cwr>
ffffffff804c5ae1:        0 	83 fb 02             	cmp    $0x2,%ebx
ffffffff804c5ae4:        0 	b8 00 00 00 00       	mov    $0x0,%eax
ffffffff804c5ae9:        0 	0f 44 d8             	cmove  %eax,%ebx
ffffffff804c5aec:      457 	48 83 c4 38          	add    $0x38,%rsp
ffffffff804c5af0:     1473 	89 d8                	mov    %ebx,%eax
ffffffff804c5af2:        0 	5b                   	pop    %rbx
ffffffff804c5af3:      480 	5d                   	pop    %rbp
ffffffff804c5af4:        0 	41 5c                	pop    %r12
ffffffff804c5af6:        0 	41 5d                	pop    %r13
ffffffff804c5af8:      449 	41 5e                	pop    %r14
ffffffff804c5afa:        0 	41 5f                	pop    %r15
ffffffff804c5afc:        0 	c3                   	retq   

Looks like spread-out overhead with no particular bad spike. The
function is just called a lot.

	Ingo
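
A minimal sketch of how one might check the "spread-out, no spike" reading of a listing like the one above. It assumes the line layout shown in this message (16-digit hex address, colon, hit count, opcode bytes, mnemonic). The helper names `parse_hits` and `hottest` are hypothetical, not part of any profiling tool. The function total of 143155 is taken from the "hits (total: ...)" header of the listing.

```python
import re

# Address, hit count, then at least one opcode byte. Requiring the opcode
# byte skips the "<symbol>:" label line, which repeats the first line's count.
LINE_RE = re.compile(r"^\s*([0-9a-f]{16}):\s+(\d+)\s+[0-9a-f]{2}\s")

# Taken from the "hits (total: 143155)" header of the listing.
FUNCTION_TOTAL = 143155


def parse_hits(listing):
    """Return (address, hits) pairs for each annotated instruction line."""
    pairs = []
    for line in listing.splitlines():
        m = LINE_RE.match(line)
        if m:
            pairs.append((int(m.group(1), 16), int(m.group(2))))
    return pairs


def hottest(pairs, total, n=3):
    """Return the n lines with the most hits as (address, hits, share)."""
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
    return [(hex(addr), hits, hits / total) for addr, hits in ranked]


# A few lines copied from the listing, for illustration only.
EXCERPT = """\
ffffffff804c5ace:        0 	85 c0                	test   %eax,%eax
ffffffff804c5ad0:     9596 	89 c3                	mov    %eax,%ebx
ffffffff804c58f4:     9425 	74 26                	je     ffffffff804c591c <tcp_transmit_skb+0x40e>
ffffffff804c5941:     8600 	49 8b 86 d0 00 00 00 	mov    0xd0(%r14),%rax
"""

if __name__ == "__main__":
    for addr, hits, share in hottest(parse_hits(EXCERPT), FUNCTION_TOTAL):
        print(f"{addr}  {hits:6d}  {share:6.2%}")
```

If no single line takes a large share of the function total, that is consistent with overhead spread across the function rather than one hot instruction. Counts sitting right after a call or a load may partly reflect sampling skid, so a per-line number is best treated as approximate.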


* tcp_transmit_skb() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:14                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger


* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:

> 100.000000 total
> ................
>   1.431553 tcp_transmit_skb

                      hits (total: 143155)
                 .........
ffffffff804c550e:      485 <tcp_transmit_skb>:
ffffffff804c550e:      485 	41 57                	push   %r15
ffffffff804c5510:     5692 	41 56                	push   %r14
ffffffff804c5512:      390 	49 89 f6             	mov    %rsi,%r14
ffffffff804c5515:        0 	41 55                	push   %r13
ffffffff804c5517:       69 	41 54                	push   %r12
ffffffff804c5519:      388 	41 89 d4             	mov    %edx,%r12d
ffffffff804c551c:        0 	55                   	push   %rbp
ffffffff804c551d:       66 	48 89 fd             	mov    %rdi,%rbp
ffffffff804c5520:      405 	53                   	push   %rbx
ffffffff804c5521:        0 	89 cb                	mov    %ecx,%ebx
ffffffff804c5523:       75 	48 83 ec 38          	sub    $0x38,%rsp
ffffffff804c5527:      396 	48 85 f6             	test   %rsi,%rsi
ffffffff804c552a:       51 	74 15                	je     ffffffff804c5541 <tcp_transmit_skb+0x33>
ffffffff804c552c:      396 	8b 96 c8 00 00 00    	mov    0xc8(%rsi),%edx
ffffffff804c5532:        1 	48 8b 86 d0 00 00 00 	mov    0xd0(%rsi),%rax
ffffffff804c5539:       63 	66 83 7c 02 08 00    	cmpw   $0x0,0x8(%rdx,%rax,1)
ffffffff804c553f:      417 	75 04                	jne    ffffffff804c5545 <tcp_transmit_skb+0x37>
ffffffff804c5541:        0 	0f 0b                	ud2a   
ffffffff804c5543:        0 	eb fe                	jmp    ffffffff804c5543 <tcp_transmit_skb+0x35>
ffffffff804c5545:     3719 	48 8b 87 60 03 00 00 	mov    0x360(%rdi),%rax
ffffffff804c554c:     2873 	f6 40 10 02          	testb  $0x2,0x10(%rax)
ffffffff804c5550:        1 	74 09                	je     ffffffff804c555b <tcp_transmit_skb+0x4d>
ffffffff804c5552:        0 	e8 1d 48 d8 ff       	callq  ffffffff80249d74 <ktime_get_real>
ffffffff804c5557:        0 	49 89 46 18          	mov    %rax,0x18(%r14)
ffffffff804c555b:      487 	45 85 e4             	test   %r12d,%r12d
ffffffff804c555e:      456 	74 33                	je     ffffffff804c5593 <tcp_transmit_skb+0x85>
ffffffff804c5560:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c5563:      482 	e8 28 f4 ff ff       	callq  ffffffff804c4990 <skb_cloned>
ffffffff804c5568:     1469 	85 c0                	test   %eax,%eax
ffffffff804c556a:     1085 	74 0c                	je     ffffffff804c5578 <tcp_transmit_skb+0x6a>
ffffffff804c556c:        0 	89 de                	mov    %ebx,%esi
ffffffff804c556e:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c5571:        0 	e8 47 41 fc ff       	callq  ffffffff804896bd <pskb_copy>
ffffffff804c5576:        0 	eb 0a                	jmp    ffffffff804c5582 <tcp_transmit_skb+0x74>
ffffffff804c5578:        0 	89 de                	mov    %ebx,%esi
ffffffff804c557a:      906 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c557d:        0 	e8 ab 35 fc ff       	callq  ffffffff80488b2d <skb_clone>
ffffffff804c5582:        0 	48 85 c0             	test   %rax,%rax
ffffffff804c5585:        7 	49 89 c6             	mov    %rax,%r14
ffffffff804c5588:      576 	bb 97 ff ff ff       	mov    $0xffffff97,%ebx
ffffffff804c558d:        0 	0f 84 59 05 00 00    	je     ffffffff804c5aec <tcp_transmit_skb+0x5de>
ffffffff804c5593:        0 	49 8d 46 38          	lea    0x38(%r14),%rax
ffffffff804c5597:      699 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
ffffffff804c559c:        1 	fc                   	cld    
ffffffff804c559d:      452 	48 89 04 24          	mov    %rax,(%rsp)
ffffffff804c55a1:       40 	48 89 d7             	mov    %rdx,%rdi
ffffffff804c55a4:        1 	31 c0                	xor    %eax,%eax
ffffffff804c55a6:      432 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55a7:      956 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55a8:      959 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55a9:      910 	ab                   	stos   %eax,%es:(%rdi)
ffffffff804c55aa:      943 	48 8b 0c 24          	mov    (%rsp),%rcx
ffffffff804c55ae:      455 	f6 41 24 02          	testb  $0x2,0x24(%rcx)
ffffffff804c55b2:        0 	0f 84 b7 00 00 00    	je     ffffffff804c566f <tcp_transmit_skb+0x161>
ffffffff804c55b8:        0 	48 8b 85 b8 05 00 00 	mov    0x5b8(%rbp),%rax
ffffffff804c55bf:        0 	48 89 ee             	mov    %rbp,%rsi
ffffffff804c55c2:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c55c5:        0 	ff 10                	callq  *(%rax)
ffffffff804c55c7:        0 	31 f6                	xor    %esi,%esi
ffffffff804c55c9:        0 	48 85 c0             	test   %rax,%rax
ffffffff804c55cc:        0 	48 89 44 24 28       	mov    %rax,0x28(%rsp)
ffffffff804c55d1:        0 	74 08                	je     ffffffff804c55db <tcp_transmit_skb+0xcd>
ffffffff804c55d3:        0 	80 4c 24 10 04       	orb    $0x4,0x10(%rsp)
ffffffff804c55d8:        0 	40 b6 14             	mov    $0x14,%sil
ffffffff804c55db:        0 	48 8b 55 78          	mov    0x78(%rbp),%rdx
ffffffff804c55df:        0 	0f b7 85 5c 04 00 00 	movzwl 0x45c(%rbp),%eax
ffffffff804c55e6:        0 	48 85 d2             	test   %rdx,%rdx
ffffffff804c55e9:        0 	74 13                	je     ffffffff804c55fe <tcp_transmit_skb+0xf0>
ffffffff804c55eb:        0 	8b 92 94 00 00 00    	mov    0x94(%rdx),%edx
ffffffff804c55f1:        0 	39 c2                	cmp    %eax,%edx
ffffffff804c55f3:        0 	73 09                	jae    ffffffff804c55fe <tcp_transmit_skb+0xf0>
ffffffff804c55f5:        0 	89 d0                	mov    %edx,%eax
ffffffff804c55f7:        0 	66 89 95 5c 04 00 00 	mov    %dx,0x45c(%rbp)
ffffffff804c55fe:        0 	83 3d 23 2e 3f 00 00 	cmpl   $0x0,0x3f2e23(%rip)        # ffffffff808b8428 <sysctl_tcp_timestamps>
ffffffff804c5605:        0 	66 89 44 24 14       	mov    %ax,0x14(%rsp)
ffffffff804c560a:        0 	8d 4e 04             	lea    0x4(%rsi),%ecx
ffffffff804c560d:        0 	74 25                	je     ffffffff804c5634 <tcp_transmit_skb+0x126>
ffffffff804c560f:        0 	48 83 7c 24 28 00    	cmpq   $0x0,0x28(%rsp)
ffffffff804c5615:        0 	75 1d                	jne    ffffffff804c5634 <tcp_transmit_skb+0x126>
ffffffff804c5617:        0 	48 8b 14 24          	mov    (%rsp),%rdx
ffffffff804c561b:        0 	80 4c 24 10 02       	orb    $0x2,0x10(%rsp)
ffffffff804c5620:        0 	8d 4e 10             	lea    0x10(%rsi),%ecx
ffffffff804c5623:        0 	8b 42 20             	mov    0x20(%rdx),%eax
ffffffff804c5626:        0 	89 44 24 18          	mov    %eax,0x18(%rsp)
ffffffff804c562a:        0 	8b 85 90 04 00 00    	mov    0x490(%rbp),%eax
ffffffff804c5630:        0 	89 44 24 1c          	mov    %eax,0x1c(%rsp)
ffffffff804c5634:        0 	83 3d f1 2d 3f 00 00 	cmpl   $0x0,0x3f2df1(%rip)        # ffffffff808b842c <sysctl_tcp_window_scaling>
ffffffff804c563b:        0 	74 15                	je     ffffffff804c5652 <tcp_transmit_skb+0x144>
ffffffff804c563d:        0 	8a 85 9d 04 00 00    	mov    0x49d(%rbp),%al
ffffffff804c5643:        0 	8d 51 04             	lea    0x4(%rcx),%edx
ffffffff804c5646:        0 	c0 e8 04             	shr    $0x4,%al
ffffffff804c5649:        0 	84 c0                	test   %al,%al
ffffffff804c564b:        0 	88 44 24 11          	mov    %al,0x11(%rsp)
ffffffff804c564f:        0 	0f 45 ca             	cmovne %edx,%ecx
ffffffff804c5652:        0 	83 3d d7 2d 3f 00 00 	cmpl   $0x0,0x3f2dd7(%rip)        # ffffffff808b8430 <sysctl_tcp_sack>
ffffffff804c5659:        0 	74 26                	je     ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c565b:        0 	8a 44 24 10          	mov    0x10(%rsp),%al
ffffffff804c565f:        0 	83 c8 01             	or     $0x1,%eax
ffffffff804c5662:        0 	a8 02                	test   $0x2,%al
ffffffff804c5664:        0 	88 44 24 10          	mov    %al,0x10(%rsp)
ffffffff804c5668:        0 	75 17                	jne    ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c566a:        0 	83 c1 04             	add    $0x4,%ecx
ffffffff804c566d:        0 	eb 12                	jmp    ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c566f:      502 	48 8d 4c 24 28       	lea    0x28(%rsp),%rcx
ffffffff804c5674:      638 	4c 89 f6             	mov    %r14,%rsi
ffffffff804c5677:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c567a:        0 	e8 1e fb ff ff       	callq  ffffffff804c519d <tcp_established_options>
ffffffff804c567f:      468 	89 c1                	mov    %eax,%ecx
ffffffff804c5681:     1605 	8b 85 74 04 00 00    	mov    0x474(%rbp),%eax
ffffffff804c5687:      307 	03 85 78 04 00 00    	add    0x478(%rbp),%eax
ffffffff804c568d:        0 	44 8d 69 14          	lea    0x14(%rcx),%r13d
ffffffff804c5691:      409 	2b 85 d0 04 00 00    	sub    0x4d0(%rbp),%eax
ffffffff804c5697:       89 	3b 85 cc 04 00 00    	cmp    0x4cc(%rbp),%eax
ffffffff804c569d:        0 	75 0a                	jne    ffffffff804c56a9 <tcp_transmit_skb+0x19b>
ffffffff804c569f:      415 	31 f6                	xor    %esi,%esi
ffffffff804c56a1:      210 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c56a4:        0 	e8 b0 f3 ff ff       	callq  ffffffff804c4a59 <tcp_ca_event>
ffffffff804c56a9:     1050 	44 89 ee             	mov    %r13d,%esi
ffffffff804c56ac:     1063 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c56af:        0 	e8 00 34 fc ff       	callq  ffffffff80488ab4 <skb_push>
ffffffff804c56b4:        0 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c56b7:      789 	e8 4f f3 ff ff       	callq  ffffffff804c4a0b <skb_reset_transport_header>
ffffffff804c56bc:      509 	f0 ff 45 28          	lock incl 0x28(%rbp)
ffffffff804c56c0:      494 	49 89 6e 10          	mov    %rbp,0x10(%r14)
ffffffff804c56c4:     3510 	49 c7 86 80 00 00 00 	movq   $0xffffffff80486679,0x80(%r14)
ffffffff804c56cb:        0 	79 66 48 80 
ffffffff804c56cf:      102 	41 8b 86 e0 00 00 00 	mov    0xe0(%r14),%eax
ffffffff804c56d6:      155 	f0 01 85 98 00 00 00 	lock add %eax,0x98(%rbp)
ffffffff804c56dd:      437 	41 8b 9e b8 00 00 00 	mov    0xb8(%r14),%ebx
ffffffff804c56e4:      219 	8b 85 50 02 00 00    	mov    0x250(%rbp),%eax
ffffffff804c56ea:       71 	49 03 9e d0 00 00 00 	add    0xd0(%r14),%rbx
ffffffff804c56f1:      735 	66 89 03             	mov    %ax,(%rbx)
ffffffff804c56f4:        0 	8b 85 38 02 00 00    	mov    0x238(%rbp),%eax
ffffffff804c56fa:       75 	66 89 43 02          	mov    %ax,0x2(%rbx)
ffffffff804c56fe:      720 	48 8b 0c 24          	mov    (%rsp),%rcx
ffffffff804c5702:     5992 	8b 41 18             	mov    0x18(%rcx),%eax
ffffffff804c5705:     1460 	0f c8                	bswap  %eax
ffffffff804c5707:       60 	89 43 04             	mov    %eax,0x4(%rbx)
ffffffff804c570a:       69 	8b 85 f0 03 00 00    	mov    0x3f0(%rbp),%eax
ffffffff804c5710:      374 	0f c8                	bswap  %eax
ffffffff804c5712:       43 	89 43 08             	mov    %eax,0x8(%rbx)
ffffffff804c5715:       76 	0f b6 51 24          	movzbl 0x24(%rcx),%edx
ffffffff804c5719:      337 	44 89 e8             	mov    %r13d,%eax
ffffffff804c571c:       36 	c1 e8 02             	shr    $0x2,%eax
ffffffff804c571f:       76 	c1 e0 0c             	shl    $0xc,%eax
ffffffff804c5722:      476 	09 d0                	or     %edx,%eax
ffffffff804c5724:       48 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c5728:       51 	66 89 43 0c          	mov    %ax,0xc(%rbx)
ffffffff804c572c:      370 	0f b6 41 24          	movzbl 0x24(%rcx),%eax
ffffffff804c5730:      137 	89 c2                	mov    %eax,%edx
ffffffff804c5732:      118 	83 e2 02             	and    $0x2,%edx
ffffffff804c5735:      377 	74 1b                	je     ffffffff804c5752 <tcp_transmit_skb+0x244>
ffffffff804c5737:        0 	81 bd c0 04 00 00 ff 	cmpl   $0xffff,0x4c0(%rbp)
ffffffff804c573e:        0 	ff 00 00 
ffffffff804c5741:        0 	b8 ff ff 00 00       	mov    $0xffff,%eax
ffffffff804c5746:        0 	0f 46 85 c0 04 00 00 	cmovbe 0x4c0(%rbp),%eax
ffffffff804c574d:        0 	e9 a0 00 00 00       	jmpq   ffffffff804c57f2 <tcp_transmit_skb+0x2e4>
ffffffff804c5752:       34 	8b 85 f8 03 00 00    	mov    0x3f8(%rbp),%eax
ffffffff804c5758:     5610 	03 85 c0 04 00 00    	add    0x4c0(%rbp),%eax
ffffffff804c575e:       44 	41 89 d4             	mov    %edx,%r12d
ffffffff804c5761:      539 	2b 85 f0 03 00 00    	sub    0x3f0(%rbp),%eax
ffffffff804c5767:        1 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c576a:       51 	44 0f 49 e0          	cmovns %eax,%r12d
ffffffff804c576e:      495 	e8 7e f8 ff ff       	callq  ffffffff804c4ff1 <__tcp_select_window>
ffffffff804c5773:      484 	44 39 e0             	cmp    %r12d,%eax
ffffffff804c5776:      244 	89 c2                	mov    %eax,%edx
ffffffff804c5778:        0 	73 19                	jae    ffffffff804c5793 <tcp_transmit_skb+0x285>
ffffffff804c577a:        0 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c5780:        0 	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff804c5785:        0 	c0 e9 04             	shr    $0x4,%cl
ffffffff804c5788:        0 	d3 e0                	shl    %cl,%eax
ffffffff804c578a:        0 	42 8d 54 20 ff       	lea    -0x1(%rax,%r12,1),%edx
ffffffff804c578f:        0 	f7 d8                	neg    %eax
ffffffff804c5791:        0 	21 c2                	and    %eax,%edx
ffffffff804c5793:      217 	f6 85 9d 04 00 00 f0 	testb  $0xf0,0x49d(%rbp)
ffffffff804c579a:     2014 	8b 85 f0 03 00 00    	mov    0x3f0(%rbp),%eax
ffffffff804c57a0:        0 	89 95 c0 04 00 00    	mov    %edx,0x4c0(%rbp)
ffffffff804c57a6:      490 	89 85 f8 03 00 00    	mov    %eax,0x3f8(%rbp)
ffffffff804c57ac:        1 	75 16                	jne    ffffffff804c57c4 <tcp_transmit_skb+0x2b6>
ffffffff804c57ae:        0 	83 3d bb 2c 3f 00 00 	cmpl   $0x0,0x3f2cbb(%rip)        # ffffffff808b8470 <sysctl_tcp_workaround_signed_windows>
ffffffff804c57b5:        0 	74 0d                	je     ffffffff804c57c4 <tcp_transmit_skb+0x2b6>
ffffffff804c57b7:        0 	b8 ff 7f 00 00       	mov    $0x7fff,%eax
ffffffff804c57bc:        0 	81 fa ff 7f 00 00    	cmp    $0x7fff,%edx
ffffffff804c57c2:        0 	eb 12                	jmp    ffffffff804c57d6 <tcp_transmit_skb+0x2c8>
ffffffff804c57c4:        0 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c57ca:     7025 	b8 ff ff 00 00       	mov    $0xffff,%eax
ffffffff804c57cf:        0 	c0 e9 04             	shr    $0x4,%cl
ffffffff804c57d2:      418 	d3 e0                	shl    %cl,%eax
ffffffff804c57d4:      102 	39 c2                	cmp    %eax,%edx
ffffffff804c57d6:        0 	8a 8d 9d 04 00 00    	mov    0x49d(%rbp),%cl
ffffffff804c57dc:      424 	0f 46 c2             	cmovbe %edx,%eax
ffffffff804c57df:      105 	c0 e9 04             	shr    $0x4,%cl
ffffffff804c57e2:        9 	d3 e8                	shr    %cl,%eax
ffffffff804c57e4:      389 	85 c0                	test   %eax,%eax
ffffffff804c57e6:       76 	75 0a                	jne    ffffffff804c57f2 <tcp_transmit_skb+0x2e4>
ffffffff804c57e8:        0 	c7 85 ec 03 00 00 00 	movl   $0x0,0x3ec(%rbp)
ffffffff804c57ef:        0 	00 00 00 
ffffffff804c57f2:        2 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c57f6:     1657 	66 c7 43 10 00 00    	movw   $0x0,0x10(%rbx)
ffffffff804c57fc:       35 	66 c7 43 12 00 00    	movw   $0x0,0x12(%rbx)
ffffffff804c5802:     4377 	66 89 43 0e          	mov    %ax,0xe(%rbx)
ffffffff804c5806:      954 	8b 95 80 04 00 00    	mov    0x480(%rbp),%edx
ffffffff804c580c:       31 	39 95 00 04 00 00    	cmp    %edx,0x400(%rbp)
ffffffff804c5812:      186 	74 27                	je     ffffffff804c583b <tcp_transmit_skb+0x32d>
ffffffff804c5814:        0 	48 8b 34 24          	mov    (%rsp),%rsi
ffffffff804c5818:        0 	8b 4e 18             	mov    0x18(%rsi),%ecx
ffffffff804c581b:        0 	89 d6                	mov    %edx,%esi
ffffffff804c581d:        0 	8d 41 01             	lea    0x1(%rcx),%eax
ffffffff804c5820:        0 	29 c6                	sub    %eax,%esi
ffffffff804c5822:        0 	81 fe fe ff 00 00    	cmp    $0xfffe,%esi
ffffffff804c5828:        0 	77 11                	ja     ffffffff804c583b <tcp_transmit_skb+0x32d>
ffffffff804c582a:        0 	89 d0                	mov    %edx,%eax
ffffffff804c582c:        0 	80 4b 0d 20          	orb    $0x20,0xd(%rbx)
ffffffff804c5830:        0 	66 29 c8             	sub    %cx,%ax
ffffffff804c5833:        0 	66 c1 c0 08          	rol    $0x8,%ax
ffffffff804c5837:        0 	66 89 43 12          	mov    %ax,0x12(%rbx)
ffffffff804c583b:      268 	48 8d 7b 14          	lea    0x14(%rbx),%rdi
ffffffff804c583f:      187 	48 8d 4c 24 20       	lea    0x20(%rsp),%rcx
ffffffff804c5844:     4006 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
ffffffff804c5849:     1117 	48 89 ee             	mov    %rbp,%rsi
ffffffff804c584c:        0 	e8 a9 fb ff ff       	callq  ffffffff804c53fa <tcp_options_write>
ffffffff804c5851:     1285 	48 8b 04 24          	mov    (%rsp),%rax
ffffffff804c5855:      727 	f6 40 24 02          	testb  $0x2,0x24(%rax)
ffffffff804c5859:        0 	0f 85 8f 00 00 00    	jne    ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c585f:        0 	f6 85 7e 04 00 00 01 	testb  $0x1,0x47e(%rbp)
ffffffff804c5866:      456 	0f 84 82 00 00 00    	je     ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c586c:        0 	45 39 6e 68          	cmp    %r13d,0x68(%r14)
ffffffff804c5870:        0 	74 53                	je     ffffffff804c58c5 <tcp_transmit_skb+0x3b7>
ffffffff804c5872:        0 	8b 95 fc 03 00 00    	mov    0x3fc(%rbp),%edx
ffffffff804c5878:        0 	39 50 18             	cmp    %edx,0x18(%rax)
ffffffff804c587b:        0 	78 48                	js     ffffffff804c58c5 <tcp_transmit_skb+0x3b7>
ffffffff804c587d:        0 	8a 85 7e 04 00 00    	mov    0x47e(%rbp),%al
ffffffff804c5883:        0 	80 8d 54 02 00 00 02 	orb    $0x2,0x254(%rbp)
ffffffff804c588a:        0 	a8 02                	test   $0x2,%al
ffffffff804c588c:        0 	74 3e                	je     ffffffff804c58cc <tcp_transmit_skb+0x3be>
ffffffff804c588e:        0 	83 e0 fd             	and    $0xfffffffffffffffd,%eax
ffffffff804c5891:        0 	88 85 7e 04 00 00    	mov    %al,0x47e(%rbp)
ffffffff804c5897:        0 	41 8b 8e b8 00 00 00 	mov    0xb8(%r14),%ecx
ffffffff804c589e:        0 	49 8b 96 d0 00 00 00 	mov    0xd0(%r14),%rdx
ffffffff804c58a5:        0 	8a 44 11 0d          	mov    0xd(%rcx,%rdx,1),%al
ffffffff804c58a9:        0 	83 c8 80             	or     $0xffffffffffffff80,%eax
ffffffff804c58ac:        0 	88 44 0a 0d          	mov    %al,0xd(%rdx,%rcx,1)
ffffffff804c58b0:        0 	41 8b 86 c8 00 00 00 	mov    0xc8(%r14),%eax
ffffffff804c58b7:        0 	49 03 86 d0 00 00 00 	add    0xd0(%r14),%rax
ffffffff804c58be:        0 	66 83 48 0a 08       	orw    $0x8,0xa(%rax)
ffffffff804c58c3:        0 	eb 07                	jmp    ffffffff804c58cc <tcp_transmit_skb+0x3be>
ffffffff804c58c5:        0 	80 a5 54 02 00 00 fc 	andb   $0xfc,0x254(%rbp)
ffffffff804c58cc:        0 	f6 85 7e 04 00 00 04 	testb  $0x4,0x47e(%rbp)
ffffffff804c58d3:        0 	74 19                	je     ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c58d5:        0 	41 8b 8e b8 00 00 00 	mov    0xb8(%r14),%ecx
ffffffff804c58dc:        0 	49 8b 96 d0 00 00 00 	mov    0xd0(%r14),%rdx
ffffffff804c58e3:        0 	8a 44 11 0d          	mov    0xd(%rcx,%rdx,1),%al
ffffffff804c58e7:        0 	83 c8 40             	or     $0x40,%eax
ffffffff804c58ea:        0 	88 44 0a 0d          	mov    %al,0xd(%rdx,%rcx,1)
ffffffff804c58ee:        0 	48 83 7c 24 28 00    	cmpq   $0x0,0x28(%rsp)
ffffffff804c58f4:     9425 	74 26                	je     ffffffff804c591c <tcp_transmit_skb+0x40e>
ffffffff804c58f6:        0 	48 8b 85 b8 05 00 00 	mov    0x5b8(%rbp),%rax
ffffffff804c58fd:        0 	81 a5 fc 00 00 00 ff 	andl   $0xffff,0xfc(%rbp)
ffffffff804c5904:        0 	ff 00 00 
ffffffff804c5907:        0 	4d 89 f0             	mov    %r14,%r8
ffffffff804c590a:        0 	48 8b 74 24 28       	mov    0x28(%rsp),%rsi
ffffffff804c590f:        0 	48 8b 7c 24 20       	mov    0x20(%rsp),%rdi
ffffffff804c5914:        0 	31 c9                	xor    %ecx,%ecx
ffffffff804c5916:        0 	48 89 ea             	mov    %rbp,%rdx
ffffffff804c5919:        0 	ff 50 08             	callq  *0x8(%rax)
ffffffff804c591c:        0 	48 8b 85 68 03 00 00 	mov    0x368(%rbp),%rax
ffffffff804c5923:     2344 	41 8b 76 68          	mov    0x68(%r14),%esi
ffffffff804c5927:        1 	4c 89 f2             	mov    %r14,%rdx
ffffffff804c592a:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c592d:      486 	ff 50 08             	callq  *0x8(%rax)
ffffffff804c5930:       44 	48 8b 0c 24          	mov    (%rsp),%rcx
ffffffff804c5934:      836 	f6 41 24 10          	testb  $0x10,0x24(%rcx)
ffffffff804c5938:        0 	74 4f                	je     ffffffff804c5989 <tcp_transmit_skb+0x47b>
ffffffff804c593a:       75 	41 8b 96 c8 00 00 00 	mov    0xc8(%r14),%edx
ffffffff804c5941:     8600 	49 8b 86 d0 00 00 00 	mov    0xd0(%r14),%rax
ffffffff804c5948:     1667 	8b 44 10 08          	mov    0x8(%rax,%rdx,1),%eax
ffffffff804c594c:       13 	8a 95 81 03 00 00    	mov    0x381(%rbp),%dl
ffffffff804c5952:       24 	84 d2                	test   %dl,%dl
ffffffff804c5954:      429 	74 25                	je     ffffffff804c597b <tcp_transmit_skb+0x46d>
ffffffff804c5956:        0 	0f b7 c8             	movzwl %ax,%ecx
ffffffff804c5959:        3 	0f b6 c2             	movzbl %dl,%eax
ffffffff804c595c:        0 	39 c1                	cmp    %eax,%ecx
ffffffff804c595e:        0 	72 13                	jb     ffffffff804c5973 <tcp_transmit_skb+0x465>
ffffffff804c5960:        0 	c6 85 81 03 00 00 00 	movb   $0x0,0x381(%rbp)
ffffffff804c5967:        1 	c7 85 84 03 00 00 0a 	movl   $0xa,0x384(%rbp)
ffffffff804c596e:        0 	00 00 00 
ffffffff804c5971:        0 	eb 08                	jmp    ffffffff804c597b <tcp_transmit_skb+0x46d>
ffffffff804c5973:        1 	28 ca                	sub    %cl,%dl
ffffffff804c5975:        0 	88 95 81 03 00 00    	mov    %dl,0x381(%rbp)
ffffffff804c597b:       11 	c6 85 80 03 00 00 00 	movb   $0x0,0x380(%rbp)
ffffffff804c5982:     4553 	c6 85 83 03 00 00 00 	movb   $0x0,0x383(%rbp)
ffffffff804c5989:      714 	45 39 6e 68          	cmp    %r13d,0x68(%r14)
ffffffff804c598d:        1 	0f 84 e2 00 00 00    	je     ffffffff804c5a75 <tcp_transmit_skb+0x567>
ffffffff804c5993:      288 	83 3d e6 2a 3f 00 00 	cmpl   $0x0,0x3f2ae6(%rip)        # ffffffff808b8480 <sysctl_tcp_slow_start_after_idle>
ffffffff804c599a:      247 	48 8b 05 df 3e 3f 00 	mov    0x3f3edf(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c59a1:      711 	41 89 c7             	mov    %eax,%r15d
ffffffff804c59a4:        0 	0f 84 ad 00 00 00    	je     ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59aa:      159 	83 bd 74 04 00 00 00 	cmpl   $0x0,0x474(%rbp)
ffffffff804c59b1:      311 	0f 85 a0 00 00 00    	jne    ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59b7:        0 	44 8b ad 0c 04 00 00 	mov    0x40c(%rbp),%r13d
ffffffff804c59be:      183 	44 29 e8             	sub    %r13d,%eax
ffffffff804c59c1:      475 	3b 85 58 03 00 00    	cmp    0x358(%rbp),%eax
ffffffff804c59c7:       54 	0f 86 8a 00 00 00    	jbe    ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59cd:        0 	48 8b 75 78          	mov    0x78(%rbp),%rsi
ffffffff804c59d1:        1 	48 8b 05 a8 3e 3f 00 	mov    0x3f3ea8(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c59d8:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c59db:        0 	48 89 44 24 08       	mov    %rax,0x8(%rsp)
ffffffff804c59e0:        0 	e8 9c 92 ff ff       	callq  ffffffff804bec81 <tcp_init_cwnd>
ffffffff804c59e5:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c59ea:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c59ed:        0 	41 89 c4             	mov    %eax,%r12d
ffffffff804c59f0:        0 	8b 9d ac 04 00 00    	mov    0x4ac(%rbp),%ebx
ffffffff804c59f6:        0 	e8 5e f0 ff ff       	callq  ffffffff804c4a59 <tcp_ca_event>
ffffffff804c59fb:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c59fe:        0 	e8 6d f0 ff ff       	callq  ffffffff804c4a70 <tcp_current_ssthresh>
ffffffff804c5a03:        0 	89 85 a8 04 00 00    	mov    %eax,0x4a8(%rbp)
ffffffff804c5a09:        4 	8b 85 58 03 00 00    	mov    0x358(%rbp),%eax
ffffffff804c5a0f:        0 	41 39 dc             	cmp    %ebx,%r12d
ffffffff804c5a12:        0 	8b 54 24 08          	mov    0x8(%rsp),%edx
ffffffff804c5a16:        0 	89 d9                	mov    %ebx,%ecx
ffffffff804c5a18:        0 	41 0f 46 cc          	cmovbe %r12d,%ecx
ffffffff804c5a1c:        0 	89 c6                	mov    %eax,%esi
ffffffff804c5a1e:        0 	44 29 ea             	sub    %r13d,%edx
ffffffff804c5a21:        0 	f7 de                	neg    %esi
ffffffff804c5a23:        0 	29 c2                	sub    %eax,%edx
ffffffff804c5a25:        0 	89 d8                	mov    %ebx,%eax
ffffffff804c5a27:        0 	eb 02                	jmp    ffffffff804c5a2b <tcp_transmit_skb+0x51d>
ffffffff804c5a29:        0 	d1 e8                	shr    %eax
ffffffff804c5a2b:        0 	85 d2                	test   %edx,%edx
ffffffff804c5a2d:        1 	7e 06                	jle    ffffffff804c5a35 <tcp_transmit_skb+0x527>
ffffffff804c5a2f:        0 	01 f2                	add    %esi,%edx
ffffffff804c5a31:        0 	39 c8                	cmp    %ecx,%eax
ffffffff804c5a33:        0 	77 f4                	ja     ffffffff804c5a29 <tcp_transmit_skb+0x51b>
ffffffff804c5a35:        0 	39 c8                	cmp    %ecx,%eax
ffffffff804c5a37:        1 	0f 43 c8             	cmovae %eax,%ecx
ffffffff804c5a3a:        0 	89 8d ac 04 00 00    	mov    %ecx,0x4ac(%rbp)
ffffffff804c5a40:        0 	48 8b 05 39 3e 3f 00 	mov    0x3f3e39(%rip),%rax        # ffffffff808b9880 <jiffies>
ffffffff804c5a47:        0 	c7 85 b8 04 00 00 00 	movl   $0x0,0x4b8(%rbp)
ffffffff804c5a4e:        0 	00 00 00 
ffffffff804c5a51:        0 	89 85 bc 04 00 00    	mov    %eax,0x4bc(%rbp)
ffffffff804c5a57:      173 	44 89 bd 0c 04 00 00 	mov    %r15d,0x40c(%rbp)
ffffffff804c5a5e:     5224 	44 2b bd 90 03 00 00 	sub    0x390(%rbp),%r15d
ffffffff804c5a65:      478 	44 3b bd 84 03 00 00 	cmp    0x384(%rbp),%r15d
ffffffff804c5a6c:        0 	73 07                	jae    ffffffff804c5a75 <tcp_transmit_skb+0x567>
ffffffff804c5a6e:       38 	c6 85 82 03 00 00 01 	movb   $0x1,0x382(%rbp)
ffffffff804c5a75:      452 	48 8b 14 24          	mov    (%rsp),%rdx
ffffffff804c5a79:      312 	8b 42 1c             	mov    0x1c(%rdx),%eax
ffffffff804c5a7c:       33 	39 85 fc 03 00 00    	cmp    %eax,0x3fc(%rbp)
ffffffff804c5a82:     4768 	78 05                	js     ffffffff804c5a89 <tcp_transmit_skb+0x57b>
ffffffff804c5a84:        0 	39 42 18             	cmp    %eax,0x18(%rdx)
ffffffff804c5a87:       20 	75 37                	jne    ffffffff804c5ac0 <tcp_transmit_skb+0x5b2>
ffffffff804c5a89:       30 	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff804c5a90:        0 	00 00 
ffffffff804c5a92:     1059 	8b 80 48 e0 ff ff    	mov    -0x1fb8(%rax),%eax
ffffffff804c5a98:       21 	65 8b 14 25 24 00 00 	mov    %gs:0x24,%edx
ffffffff804c5a9f:        0 	00 
ffffffff804c5aa0:       14 	89 d2                	mov    %edx,%edx
ffffffff804c5aa2:      471 	30 c0                	xor    %al,%al
ffffffff804c5aa4:        3 	66 83 f8 01          	cmp    $0x1,%ax
ffffffff804c5aa8:       21 	48 19 c0             	sbb    %rax,%rax
ffffffff804c5aab:      433 	83 e0 08             	and    $0x8,%eax
ffffffff804c5aae:        2 	48 8b 80 98 16 ab 80 	mov    -0x7f54e968(%rax),%rax
ffffffff804c5ab5:       16 	48 f7 d0             	not    %rax
ffffffff804c5ab8:      457 	48 8b 04 d0          	mov    (%rax,%rdx,8),%rax
ffffffff804c5abc:        3 	48 ff 40 58          	incq   0x58(%rax)
ffffffff804c5ac0:       20 	48 8b 85 68 03 00 00 	mov    0x368(%rbp),%rax
ffffffff804c5ac7:      424 	31 f6                	xor    %esi,%esi
ffffffff804c5ac9:        2 	4c 89 f7             	mov    %r14,%rdi
ffffffff804c5acc:       20 	ff 10                	callq  *(%rax)
ffffffff804c5ace:        0 	85 c0                	test   %eax,%eax
ffffffff804c5ad0:     9596 	89 c3                	mov    %eax,%ebx
ffffffff804c5ad2:        0 	7e 18                	jle    ffffffff804c5aec <tcp_transmit_skb+0x5de>
ffffffff804c5ad4:        0 	be 01 00 00 00       	mov    $0x1,%esi
ffffffff804c5ad9:        0 	48 89 ef             	mov    %rbp,%rdi
ffffffff804c5adc:        0 	e8 d9 91 ff ff       	callq  ffffffff804becba <tcp_enter_cwr>
ffffffff804c5ae1:        0 	83 fb 02             	cmp    $0x2,%ebx
ffffffff804c5ae4:        0 	b8 00 00 00 00       	mov    $0x0,%eax
ffffffff804c5ae9:        0 	0f 44 d8             	cmove  %eax,%ebx
ffffffff804c5aec:      457 	48 83 c4 38          	add    $0x38,%rsp
ffffffff804c5af0:     1473 	89 d8                	mov    %ebx,%eax
ffffffff804c5af2:        0 	5b                   	pop    %rbx
ffffffff804c5af3:      480 	5d                   	pop    %rbp
ffffffff804c5af4:        0 	41 5c                	pop    %r12
ffffffff804c5af6:        0 	41 5d                	pop    %r13
ffffffff804c5af8:      449 	41 5e                	pop    %r14
ffffffff804c5afa:        0 	41 5f                	pop    %r15
ffffffff804c5afc:        0 	c3                   	retq   

looks like spread-out overhead with no particular bad spike. The 
function is just called a lot.

	Ingo

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:15                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 22:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> 100.000000 total
>> ................
>>   1.469183 tcp_current_mss
> 
>                       hits (total: 146918)
>                  .........
> ffffffff804c5237:      526 <tcp_current_mss>:
> ffffffff804c5237:      526 	41 54                	push   %r12
> ffffffff804c5239:     5929 	55                   	push   %rbp
> ffffffff804c523a:       32 	53                   	push   %rbx
> ffffffff804c523b:      294 	48 89 fb             	mov    %rdi,%rbx
> ffffffff804c523e:      539 	48 83 ec 30          	sub    $0x30,%rsp
> ffffffff804c5242:     2590 	85 f6                	test   %esi,%esi
> ffffffff804c5244:      444 	48 8b 4f 78          	mov    0x78(%rdi),%rcx
> ffffffff804c5248:      521 	8b af 4c 04 00 00    	mov    0x44c(%rdi),%ebp
> ffffffff804c524e:      791 	74 2a                	je     ffffffff804c527a <tcp_current_mss+0x43>
> ffffffff804c5250:      433 	8b 87 00 01 00 00    	mov    0x100(%rdi),%eax
> ffffffff804c5256:      236 	c1 e0 10             	shl    $0x10,%eax
> ffffffff804c5259:      191 	89 c2                	mov    %eax,%edx
> ffffffff804c525b:      487 	23 97 fc 00 00 00    	and    0xfc(%rdi),%edx
> ffffffff804c5261:      362 	39 c2                	cmp    %eax,%edx
> ffffffff804c5263:      342 	75 15                	jne    ffffffff804c527a <tcp_current_mss+0x43>
> ffffffff804c5265:      473 	45 31 e4             	xor    %r12d,%r12d
> ffffffff804c5268:      221 	8b 87 00 04 00 00    	mov    0x400(%rdi),%eax
> ffffffff804c526e:      194 	3b 87 80 04 00 00    	cmp    0x480(%rdi),%eax
> ffffffff804c5274:      445 	41 0f 94 c4          	sete   %r12b
> ffffffff804c5278:      261 	eb 03                	jmp    ffffffff804c527d <tcp_current_mss+0x46>
> ffffffff804c527a:        0 	45 31 e4             	xor    %r12d,%r12d
> ffffffff804c527d:      185 	48 85 c9             	test   %rcx,%rcx
> ffffffff804c5280:      686 	74 15                	je     ffffffff804c5297 <tcp_current_mss+0x60>
> ffffffff804c5282:     1806 	8b 71 7c             	mov    0x7c(%rcx),%esi
> ffffffff804c5285:        1 	3b b3 5c 03 00 00    	cmp    0x35c(%rbx),%esi
> ffffffff804c528b:       21 	74 0a                	je     ffffffff804c5297 <tcp_current_mss+0x60>
> ffffffff804c528d:        0 	48 89 df             	mov    %rbx,%rdi
> ffffffff804c5290:        0 	e8 8b fb ff ff       	callq  ffffffff804c4e20 <tcp_sync_mss>
> ffffffff804c5295:        0 	89 c5                	mov    %eax,%ebp
> ffffffff804c5297:      864 	48 8d 4c 24 28       	lea    0x28(%rsp),%rcx
> ffffffff804c529c:      634 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
> ffffffff804c52a1:      995 	31 f6                	xor    %esi,%esi
> ffffffff804c52a3:        0 	48 89 df             	mov    %rbx,%rdi
> ffffffff804c52a6:        2 	e8 f2 fe ff ff       	callq  ffffffff804c519d <tcp_established_options>
> ffffffff804c52ab:      859 	8b 8b e8 03 00 00    	mov    0x3e8(%rbx),%ecx
> ffffffff804c52b1:      936 	83 c0 14             	add    $0x14,%eax
> ffffffff804c52b4:        6 	0f b7 d1             	movzwl %cx,%edx
> ffffffff804c52b7:        0 	39 d0                	cmp    %edx,%eax
> ffffffff804c52b9:      911 	74 04                	je     ffffffff804c52bf <tcp_current_mss+0x88>
> ffffffff804c52bb:        0 	29 d0                	sub    %edx,%eax
> ffffffff804c52bd:        0 	29 c5                	sub    %eax,%ebp
> ffffffff804c52bf:        0 	45 85 e4             	test   %r12d,%r12d
> ffffffff804c52c2:     6894 	89 e8                	mov    %ebp,%eax
> ffffffff804c52c4:        0 	74 38                	je     ffffffff804c52fe <tcp_current_mss+0xc7>
> ffffffff804c52c6:      990 	48 8b 83 68 03 00 00 	mov    0x368(%rbx),%rax
> ffffffff804c52cd:      642 	8b b3 04 01 00 00    	mov    0x104(%rbx),%esi
> ffffffff804c52d3:        3 	48 89 df             	mov    %rbx,%rdi
> ffffffff804c52d6:      240 	66 2b 70 30          	sub    0x30(%rax),%si
> ffffffff804c52da:      588 	66 2b b3 7e 03 00 00 	sub    0x37e(%rbx),%si
> ffffffff804c52e1:        2 	66 29 ce             	sub    %cx,%si
> ffffffff804c52e4:      284 	ff ce                	dec    %esi
> ffffffff804c52e6:      664 	0f b7 f6             	movzwl %si,%esi
> ffffffff804c52e9:        2 	e8 0a fb ff ff       	callq  ffffffff804c4df8 <tcp_bound_to_half_wnd>
> ffffffff804c52ee:       68 	0f b7 d0             	movzwl %ax,%edx
> ffffffff804c52f1:     1870 	89 c1                	mov    %eax,%ecx
> ffffffff804c52f3:        0 	89 d0                	mov    %edx,%eax
> ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
> ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
> ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax
> ffffffff804c52fb:     1670 	66 29 d0             	sub    %dx,%ax
> ffffffff804c52fe:        0 	66 89 83 ea 03 00 00 	mov    %ax,0x3ea(%rbx)
> ffffffff804c5305:        4 	48 83 c4 30          	add    $0x30,%rsp
> ffffffff804c5309:      855 	89 e8                	mov    %ebp,%eax
> ffffffff804c530b:        0 	5b                   	pop    %rbx
> ffffffff804c530c:      797 	5d                   	pop    %rbp
> ffffffff804c530d:        0 	41 5c                	pop    %r12
> ffffffff804c530f:        0 	c3                   	retq   
> 
> apparently this division causes 1.0% of tbench overhead:
> 
> ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
> ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
> ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax
> 
> (gdb) list *0xffffffff804c52f7
> 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078).
> 1073					  inet_csk(sk)->icsk_af_ops->net_header_len -
> 1074					  inet_csk(sk)->icsk_ext_hdr_len -
> 1075					  tp->tcp_header_len);
> 1076	
> 1077			xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
> 1078			xmit_size_goal -= (xmit_size_goal % mss_now);
> 1079		}
> 1080		tp->xmit_size_goal = xmit_size_goal;
> 1081	
> 1082		return mss_now;
> (gdb) 
> 
> it's this division:
> 
>         if (doing_tso) {
>         [...]
> 			xmit_size_goal -= (xmit_size_goal % mss_now);
> 
> Has no-one hit this before? Perhaps this is why switching loopback 
> networking to TSO had a performance impact for others?

Yes, I mentioned it later. But apparently you don't read my mails, so
I will just stop now.

> 
> It's still a bit weird ... how can a single division cause this much 
> overhead? tcp_bound_to_half_wnd() [which is called straight before 
> this sequence] seems low-overhead.
> 
> 	Ingo
> 
> 



* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:19                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
>   1.385125 tcp_sendmsg

this too is spread out, with no spikes that i noticed.

The subsequent functions seem to be spread out pretty evenly too, 
with no particular spikes visible.

	Ingo

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:26                                 ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Ingo Molnar wrote:
>> * Ingo Molnar <mingo@elte.hu> wrote:
>>
>>> 100.000000 total
>>> ................
>>>   1.469183 tcp_current_mss
>>
>>                       hits (total: 146918)
>>                  .........
>> ffffffff804c5237:      526 <tcp_current_mss>:
>> ffffffff804c5237:      526 	41 54                	push   %r12
>> ffffffff804c5239:     5929 	55                   	push   %rbp
>> ffffffff804c523a:       32 	53                   	push   %rbx
>> ffffffff804c523b:      294 	48 89 fb             	mov    %rdi,%rbx
>> ffffffff804c523e:      539 	48 83 ec 30          	sub    $0x30,%rsp
>> ffffffff804c5242:     2590 	85 f6                	test   %esi,%esi
>> ffffffff804c5244:      444 	48 8b 4f 78          	mov    0x78(%rdi),%rcx
>> ffffffff804c5248:      521 	8b af 4c 04 00 00    	mov    0x44c(%rdi),%ebp
>> ffffffff804c524e:      791 	74 2a                	je     ffffffff804c527a <tcp_current_mss+0x43>
>> ffffffff804c5250:      433 	8b 87 00 01 00 00    	mov    0x100(%rdi),%eax
>> ffffffff804c5256:      236 	c1 e0 10             	shl    $0x10,%eax
>> ffffffff804c5259:      191 	89 c2                	mov    %eax,%edx
>> ffffffff804c525b:      487 	23 97 fc 00 00 00    	and    0xfc(%rdi),%edx
>> ffffffff804c5261:      362 	39 c2                	cmp    %eax,%edx
>> ffffffff804c5263:      342 	75 15                	jne    ffffffff804c527a <tcp_current_mss+0x43>
>> ffffffff804c5265:      473 	45 31 e4             	xor    %r12d,%r12d
>> ffffffff804c5268:      221 	8b 87 00 04 00 00    	mov    0x400(%rdi),%eax
>> ffffffff804c526e:      194 	3b 87 80 04 00 00    	cmp    0x480(%rdi),%eax
>> ffffffff804c5274:      445 	41 0f 94 c4          	sete   %r12b
>> ffffffff804c5278:      261 	eb 03                	jmp    ffffffff804c527d <tcp_current_mss+0x46>
>> ffffffff804c527a:        0 	45 31 e4             	xor    %r12d,%r12d
>> ffffffff804c527d:      185 	48 85 c9             	test   %rcx,%rcx
>> ffffffff804c5280:      686 	74 15                	je     ffffffff804c5297 <tcp_current_mss+0x60>
>> ffffffff804c5282:     1806 	8b 71 7c             	mov    0x7c(%rcx),%esi
>> ffffffff804c5285:        1 	3b b3 5c 03 00 00    	cmp    0x35c(%rbx),%esi
>> ffffffff804c528b:       21 	74 0a                	je     ffffffff804c5297 <tcp_current_mss+0x60>
>> ffffffff804c528d:        0 	48 89 df             	mov    %rbx,%rdi
>> ffffffff804c5290:        0 	e8 8b fb ff ff       	callq  ffffffff804c4e20 <tcp_sync_mss>
>> ffffffff804c5295:        0 	89 c5                	mov    %eax,%ebp
>> ffffffff804c5297:      864 	48 8d 4c 24 28       	lea    0x28(%rsp),%rcx
>> ffffffff804c529c:      634 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
>> ffffffff804c52a1:      995 	31 f6                	xor    %esi,%esi
>> ffffffff804c52a3:        0 	48 89 df             	mov    %rbx,%rdi
>> ffffffff804c52a6:        2 	e8 f2 fe ff ff       	callq  ffffffff804c519d <tcp_established_options>
>> ffffffff804c52ab:      859 	8b 8b e8 03 00 00    	mov    0x3e8(%rbx),%ecx
>> ffffffff804c52b1:      936 	83 c0 14             	add    $0x14,%eax
>> ffffffff804c52b4:        6 	0f b7 d1             	movzwl %cx,%edx
>> ffffffff804c52b7:        0 	39 d0                	cmp    %edx,%eax
>> ffffffff804c52b9:      911 	74 04                	je     ffffffff804c52bf <tcp_current_mss+0x88>
>> ffffffff804c52bb:        0 	29 d0                	sub    %edx,%eax
>> ffffffff804c52bd:        0 	29 c5                	sub    %eax,%ebp
>> ffffffff804c52bf:        0 	45 85 e4             	test   %r12d,%r12d
>> ffffffff804c52c2:     6894 	89 e8                	mov    %ebp,%eax
>> ffffffff804c52c4:        0 	74 38                	je     ffffffff804c52fe <tcp_current_mss+0xc7>
>> ffffffff804c52c6:      990 	48 8b 83 68 03 00 00 	mov    0x368(%rbx),%rax
>> ffffffff804c52cd:      642 	8b b3 04 01 00 00    	mov    0x104(%rbx),%esi
>> ffffffff804c52d3:        3 	48 89 df             	mov    %rbx,%rdi
>> ffffffff804c52d6:      240 	66 2b 70 30          	sub    0x30(%rax),%si
>> ffffffff804c52da:      588 	66 2b b3 7e 03 00 00 	sub    0x37e(%rbx),%si
>> ffffffff804c52e1:        2 	66 29 ce             	sub    %cx,%si
>> ffffffff804c52e4:      284 	ff ce                	dec    %esi
>> ffffffff804c52e6:      664 	0f b7 f6             	movzwl %si,%esi
>> ffffffff804c52e9:        2 	e8 0a fb ff ff       	callq  ffffffff804c4df8 <tcp_bound_to_half_wnd>
>> ffffffff804c52ee:       68 	0f b7 d0             	movzwl %ax,%edx
>> ffffffff804c52f1:     1870 	89 c1                	mov    %eax,%ecx
>> ffffffff804c52f3:        0 	89 d0                	mov    %edx,%eax
>> ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
>> ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
>> ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax
>> ffffffff804c52fb:     1670 	66 29 d0             	sub    %dx,%ax
>> ffffffff804c52fe:        0 	66 89 83 ea 03 00 00 	mov    %ax,0x3ea(%rbx)
>> ffffffff804c5305:        4 	48 83 c4 30          	add    $0x30,%rsp
>> ffffffff804c5309:      855 	89 e8                	mov    %ebp,%eax
>> ffffffff804c530b:        0 	5b                   	pop    %rbx
>> ffffffff804c530c:      797 	5d                   	pop    %rbp
>> ffffffff804c530d:        0 	41 5c                	pop    %r12
>> ffffffff804c530f:        0 	c3                   	retq   
>>
>> apparently this division causes 1.0% of tbench overhead:
>>
>> ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
>> ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
>> ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax
>>
>> (gdb) list *0xffffffff804c52f7
>> 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078).
>> 1073					  inet_csk(sk)->icsk_af_ops->net_header_len -
>> 1074					  inet_csk(sk)->icsk_ext_hdr_len -
>> 1075					  tp->tcp_header_len);
>> 1076	
>> 1077			xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>> 1078			xmit_size_goal -= (xmit_size_goal % mss_now);
>> 1079		}
>> 1080		tp->xmit_size_goal = xmit_size_goal;
>> 1081	
>> 1082		return mss_now;
>> (gdb) 
>>
>> it's this division:
>>
>>         if (doing_tso) {
>>         [...]
>> 			xmit_size_goal -= (xmit_size_goal % mss_now);
>>
>> Has no-one hit this before? Perhaps this is why switching loopback  
>> networking to TSO had a performance impact for others?
>
> Yes, I mentioned it later. [...]

i see - i just caught up with some of my inbox from today.

> [...] But apparently you don't read my mails, so I will just stop 
> now.

Sorry, i spent my time looking at the profile output.

	Ingo

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -&gt; 2.6.28
@ 2008-11-17 22:26                                 ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, David Miller, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Stephen Hemminger


* Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> wrote:

> Ingo Molnar a écrit :
>> * Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:
>>
>>> 100.000000 total
>>> ................
>>>   1.469183 tcp_current_mss
>>
>>                       hits (total: 146918)
>>                  .........
>> ffffffff804c5237:      526 <tcp_current_mss>:
>> ffffffff804c5237:      526 	41 54                	push   %r12
>> ffffffff804c5239:     5929 	55                   	push   %rbp
>> ffffffff804c523a:       32 	53                   	push   %rbx
>> ffffffff804c523b:      294 	48 89 fb             	mov    %rdi,%rbx
>> ffffffff804c523e:      539 	48 83 ec 30          	sub    $0x30,%rsp
>> ffffffff804c5242:     2590 	85 f6                	test   %esi,%esi
>> ffffffff804c5244:      444 	48 8b 4f 78          	mov    0x78(%rdi),%rcx
>> ffffffff804c5248:      521 	8b af 4c 04 00 00    	mov    0x44c(%rdi),%ebp
>> ffffffff804c524e:      791 	74 2a                	je     ffffffff804c527a <tcp_current_mss+0x43>
>> ffffffff804c5250:      433 	8b 87 00 01 00 00    	mov    0x100(%rdi),%eax
>> ffffffff804c5256:      236 	c1 e0 10             	shl    $0x10,%eax
>> ffffffff804c5259:      191 	89 c2                	mov    %eax,%edx
>> ffffffff804c525b:      487 	23 97 fc 00 00 00    	and    0xfc(%rdi),%edx
>> ffffffff804c5261:      362 	39 c2                	cmp    %eax,%edx
>> ffffffff804c5263:      342 	75 15                	jne    ffffffff804c527a <tcp_current_mss+0x43>
>> ffffffff804c5265:      473 	45 31 e4             	xor    %r12d,%r12d
>> ffffffff804c5268:      221 	8b 87 00 04 00 00    	mov    0x400(%rdi),%eax
>> ffffffff804c526e:      194 	3b 87 80 04 00 00    	cmp    0x480(%rdi),%eax
>> ffffffff804c5274:      445 	41 0f 94 c4          	sete   %r12b
>> ffffffff804c5278:      261 	eb 03                	jmp    ffffffff804c527d <tcp_current_mss+0x46>
>> ffffffff804c527a:        0 	45 31 e4             	xor    %r12d,%r12d
>> ffffffff804c527d:      185 	48 85 c9             	test   %rcx,%rcx
>> ffffffff804c5280:      686 	74 15                	je     ffffffff804c5297 <tcp_current_mss+0x60>
>> ffffffff804c5282:     1806 	8b 71 7c             	mov    0x7c(%rcx),%esi
>> ffffffff804c5285:        1 	3b b3 5c 03 00 00    	cmp    0x35c(%rbx),%esi
>> ffffffff804c528b:       21 	74 0a                	je     ffffffff804c5297 <tcp_current_mss+0x60>
>> ffffffff804c528d:        0 	48 89 df             	mov    %rbx,%rdi
>> ffffffff804c5290:        0 	e8 8b fb ff ff       	callq  ffffffff804c4e20 <tcp_sync_mss>
>> ffffffff804c5295:        0 	89 c5                	mov    %eax,%ebp
>> ffffffff804c5297:      864 	48 8d 4c 24 28       	lea    0x28(%rsp),%rcx
>> ffffffff804c529c:      634 	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
>> ffffffff804c52a1:      995 	31 f6                	xor    %esi,%esi
>> ffffffff804c52a3:        0 	48 89 df             	mov    %rbx,%rdi
>> ffffffff804c52a6:        2 	e8 f2 fe ff ff       	callq  ffffffff804c519d <tcp_established_options>
>> ffffffff804c52ab:      859 	8b 8b e8 03 00 00    	mov    0x3e8(%rbx),%ecx
>> ffffffff804c52b1:      936 	83 c0 14             	add    $0x14,%eax
>> ffffffff804c52b4:        6 	0f b7 d1             	movzwl %cx,%edx
>> ffffffff804c52b7:        0 	39 d0                	cmp    %edx,%eax
>> ffffffff804c52b9:      911 	74 04                	je     ffffffff804c52bf <tcp_current_mss+0x88>
>> ffffffff804c52bb:        0 	29 d0                	sub    %edx,%eax
>> ffffffff804c52bd:        0 	29 c5                	sub    %eax,%ebp
>> ffffffff804c52bf:        0 	45 85 e4             	test   %r12d,%r12d
>> ffffffff804c52c2:     6894 	89 e8                	mov    %ebp,%eax
>> ffffffff804c52c4:        0 	74 38                	je     ffffffff804c52fe <tcp_current_mss+0xc7>
>> ffffffff804c52c6:      990 	48 8b 83 68 03 00 00 	mov    0x368(%rbx),%rax
>> ffffffff804c52cd:      642 	8b b3 04 01 00 00    	mov    0x104(%rbx),%esi
>> ffffffff804c52d3:        3 	48 89 df             	mov    %rbx,%rdi
>> ffffffff804c52d6:      240 	66 2b 70 30          	sub    0x30(%rax),%si
>> ffffffff804c52da:      588 	66 2b b3 7e 03 00 00 	sub    0x37e(%rbx),%si
>> ffffffff804c52e1:        2 	66 29 ce             	sub    %cx,%si
>> ffffffff804c52e4:      284 	ff ce                	dec    %esi
>> ffffffff804c52e6:      664 	0f b7 f6             	movzwl %si,%esi
>> ffffffff804c52e9:        2 	e8 0a fb ff ff       	callq  ffffffff804c4df8 <tcp_bound_to_half_wnd>
>> ffffffff804c52ee:       68 	0f b7 d0             	movzwl %ax,%edx
>> ffffffff804c52f1:     1870 	89 c1                	mov    %eax,%ecx
>> ffffffff804c52f3:        0 	89 d0                	mov    %edx,%eax
>> ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
>> ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
>> ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax
>> ffffffff804c52fb:     1670 	66 29 d0             	sub    %dx,%ax
>> ffffffff804c52fe:        0 	66 89 83 ea 03 00 00 	mov    %ax,0x3ea(%rbx)
>> ffffffff804c5305:        4 	48 83 c4 30          	add    $0x30,%rsp
>> ffffffff804c5309:      855 	89 e8                	mov    %ebp,%eax
>> ffffffff804c530b:        0 	5b                   	pop    %rbx
>> ffffffff804c530c:      797 	5d                   	pop    %rbp
>> ffffffff804c530d:        0 	41 5c                	pop    %r12
>> ffffffff804c530f:        0 	c3                   	retq   
>>
>> apparently this division causes 1.0% of tbench overhead:
>>
>> ffffffff804c52f5:        0 	31 d2                	xor    %edx,%edx
>> ffffffff804c52f7:     2135 	f7 f5                	div    %ebp
>> ffffffff804c52f9:   107010 	89 c8                	mov    %ecx,%eax
>>
>> (gdb) list *0xffffffff804c52f7
>> 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078).
>> 1073					  inet_csk(sk)->icsk_af_ops->net_header_len -
>> 1074					  inet_csk(sk)->icsk_ext_hdr_len -
>> 1075					  tp->tcp_header_len);
>> 1076	
>> 1077			xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>> 1078			xmit_size_goal -= (xmit_size_goal % mss_now);
>> 1079		}
>> 1080		tp->xmit_size_goal = xmit_size_goal;
>> 1081	
>> 1082		return mss_now;
>> (gdb) 
>>
>> it's this division:
>>
>>         if (doing_tso) {
>>         [...]
>> 			xmit_size_goal -= (xmit_size_goal % mss_now);
>>
>> Has no-one hit this before? Perhaps this is why switching loopback  
>> networking to TSO had a performance impact for others?
>
> Yes, I mentioned it later. [...]

i see - i just caught up with some of my inbox from today.

> [...] But apparently you don't read my mails, so I will just stop 
> now.

Sorry, i spent my time looking at the profile output.

	Ingo
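
The statement flagged in the profile rounds xmit_size_goal down to a whole number of MSS-sized segments, which costs one integer division per call. A minimal userspace sketch of that rounding follows, together with a hypothetical cache that only divides when the inputs change. The names round_down_to_mss(), struct goal_cache and round_down_cached() are illustrative assumptions and do not come from the thread or the kernel.

```c
#include <stdint.h>

/* Round goal down to a multiple of mss, as the quoted statement does.
 * mss must be nonzero. Costs one integer division per call. */
static inline uint32_t round_down_to_mss(uint32_t goal, uint32_t mss)
{
	return goal - (goal % mss);
}

/* Hypothetical cache (an assumption, not the thread's fix): remember the
 * last inputs and result so the division runs only when they change. */
struct goal_cache {
	uint32_t goal;    /* last unrounded goal seen */
	uint32_t mss;     /* last MSS seen; 0 means the cache is empty */
	uint32_t rounded; /* cached rounded value */
};

static inline uint32_t round_down_cached(struct goal_cache *c,
					 uint32_t goal, uint32_t mss)
{
	if (c->mss != mss || c->goal != goal) {
		c->goal = goal;
		c->mss = mss;
		c->rounded = round_down_to_mss(goal, mss);
	}
	return c->rounded;
}
```

Whether such a cache pays off depends on how often the goal and MSS actually change per connection, which the thread does not measure.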

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:39                                   ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 22:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Ingo Molnar wrote:

>>> it's this division:
>>>
>>>         if (doing_tso) {
>>>         [...]
>>> 			xmit_size_goal -= (xmit_size_goal % mss_now);
>>>
>>> Has no-one hit this before? Perhaps this is why switching loopback  
>>> networking to TSO had a performance impact for others?
>> Yes, I mentioned it later. [...]
> 
> i see - i just caught up with some of my inbox from today.
> 
>> [...] But apparently you don't read my mails, so I will just stop 
>> now.
> 
> Sorry, i spent my time looking at the profile output.
> 

No problem Ingo, I am very glad you take so much time to profile the kernel ;)

I have had too many problems with profilers on my dev machine lately :(



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 22:47                 ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:47 UTC (permalink / raw)
  To: David Miller
  Cc: dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, torvalds


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 17:11:35 +0100
> 
> > Ouch, +4% from a oneliner networking change? That's a _huge_ speedup 
> > compared to the things we were after in scheduler land.
> 
> The scheduler has accounted for at least 10% of the tbench 
> regressions at this point, what are you talking about?

yeah, you are probably right when it comes to task migration policy 
impact - that can have effects in that range. (and that, you have to 
accept, is a fundamentally hard and fragile job to get right, as it 
involves observing the past and predicting the future out of it - at 
1.3 million events per second)

So above i was just talking about straight scheduling code overhead. 
(that cannot have been +10% of the total - as the whole scheduler only 
takes 7% total - TLB flush and FPU restore overhead included. Even the 
hrtimer bits were about 1% of the total.)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-17 23:41                                 ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-17 23:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, Stephen Hemminger

[-- Attachment #1: Type: text/plain, Size: 1648 bytes --]

Eric Dumazet wrote:
> 
> But seeing your disassembly, I can see compare_ether_addr() is not inlined.
> 
> This sucks.
> 
> /**
> * compare_ether_addr - Compare two Ethernet addresses
> * @addr1: Pointer to a six-byte array containing the Ethernet address
> * @addr2: Pointer other six-byte array containing the Ethernet address
> *
> * Compare two ethernet addresses, returns 0 if equal
> */
> static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
> {
>        const u16 *a = (const u16 *) addr1;
>        const u16 *b = (const u16 *) addr2;
> 
>        BUILD_BUG_ON(ETH_ALEN != 6);
>        return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
> }
> 
> On my machine/compiler, it is inlined, that makes a big difference.

old gcc compiler... OK understood...

> 
> c0420750 <eth_type_trans>: /* eth_type_trans total:  14417  0.4101 */
> 
> 

Could you try this patch Ingo ?

Thanks

[PATCH] net: eth_type_trans() should be a leaf function

In the old days, eth_type_trans() was a leaf function. This is no longer the case.

eth_type_trans() is a critical network function, called for each incoming packet.

We should make sure it does not call other functions, especially trivial ones.

1) Add __always_inline to compare_ether_addr(): it was created to be faster
   than memcmp(), so it really should be faster (and inlined).

2) Hand-code the skb_pull() call in eth_type_trans()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/etherdevice.h |    2 +-
 net/ethernet/eth.c          |    7 ++++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

[-- Attachment #2: eth_type_trans_speedup.patch --]
[-- Type: text/plain, Size: 1053 bytes --]

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 25d62e6..94af6a7 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -128,7 +128,7 @@ static inline void random_ether_addr(u8 *addr)
  *
  * Compare two ethernet addresses, returns 0 if equal
  */
-static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
+static __always_inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
 {
 	const u16 *a = (const u16 *) addr1;
 	const u16 *b = (const u16 *) addr2;
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index b9d85af..30b60b2 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -162,7 +162,12 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
 
 	skb->dev = dev;
 	skb_reset_mac_header(skb);
-	skb_pull(skb, ETH_HLEN);
+	/*
+	 * Hand coded skb_pull(skb, ETH_HLEN) to avoid a function call
+	 */
+	if (likely(skb->len >= ETH_HLEN))
+		__skb_pull(skb, ETH_HLEN);
+
 	eth = eth_hdr(skb);
 
 	if (is_multicast_ether_addr(eth->h_dest)) {
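
The second hunk replaces an out-of-line, length-checked pull with an inline length check followed by the unchecked variant. The relationship between the two can be sketched in userspace with a toy buffer. struct toy_buf, toy_pull() and toy_pull_unchecked() are hypothetical stand-ins for struct sk_buff, skb_pull() and __skb_pull(); this is an illustration of the pattern, not the kernel's code.

```c
#include <stddef.h>

#define HDR_LEN 14 /* Ethernet header length (ETH_HLEN in the kernel) */

/* Toy stand-in for struct sk_buff: only the fields a pull touches. */
struct toy_buf {
	unsigned char *data; /* start of the unconsumed bytes */
	unsigned int len;    /* number of unconsumed bytes */
};

/* Unchecked pull: the caller guarantees b->len >= n. */
static inline unsigned char *toy_pull_unchecked(struct toy_buf *b,
						unsigned int n)
{
	b->len -= n;
	b->data += n;
	return b->data;
}

/* Checked pull: returns NULL and leaves the buffer untouched
 * when fewer than n bytes remain. */
static inline unsigned char *toy_pull(struct toy_buf *b, unsigned int n)
{
	if (b->len < n)
		return NULL;
	return toy_pull_unchecked(b, n);
}
```

When both helpers are inline, the check and the pointer adjustment compile into the caller, which is the property the patch is after for eth_type_trans().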

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  0:01                                   ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-18  0:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, Stephen Hemminger



On Tue, 18 Nov 2008, Eric Dumazet wrote:
> > *
> > * Compare two ethernet addresses, returns 0 if equal
> > */
> > static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
> > {
> >        const u16 *a = (const u16 *) addr1;
> >        const u16 *b = (const u16 *) addr2;
> > 
> >        BUILD_BUG_ON(ETH_ALEN != 6);
> >        return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;

Btw, at least on some Intel CPU's, it would be faster to do this as a 
32-bit xor and a 16-bit xor. And if we can know that there is always 2 
bytes at the end (because of how the thing was allocated), it's faster 
still to do it as a 64-bit xor and a mask.

And that's true even if the addresses are only 2-byte aligned.

The code that gcc generates for "memcmp()" for a constant-size small data 
thing is sadly crap. It always generates a "rep cmpsb", even if the size 
is something really trivial like 4 bytes, and even if you compare for 
exact equality rather than a smaller/greater-than. Gaah.

		Linus
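
The first suggestion, one 32-bit xor plus one 16-bit xor, can be sketched in portable userspace C as follows. The function name ether_addr_differs_32_16() is an assumption for illustration, and memcpy() is used for the loads so the sketch stays valid on targets without cheap unaligned access; on x86 a compiler will typically turn each memcpy() into a single load.

```c
#include <stdint.h>
#include <string.h>

/* Returns 0 if the two 6-byte addresses are equal, 1 otherwise.
 * Compares bytes 0..3 with one 32-bit xor and bytes 4..5 with one
 * 16-bit xor, instead of three 16-bit xors. */
static inline unsigned int ether_addr_differs_32_16(const uint8_t *a,
						    const uint8_t *b)
{
	uint32_t a32, b32;
	uint16_t a16, b16;

	memcpy(&a32, a, sizeof(a32));
	memcpy(&b32, b, sizeof(b32));
	memcpy(&a16, a + 4, sizeof(a16));
	memcpy(&b16, b + 4, sizeof(b16));

	return ((a32 ^ b32) | (uint32_t)(a16 ^ b16)) != 0;
}
```

Because only equality matters, the result is independent of byte order.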

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  5:16                               ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-18  5:16 UTC (permalink / raw)
  To: mingo
  Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 22:26:57 +0100

> eth->h_proto access.

Yes, this is the first time a packet is touched on receive.

> Given that this workload does localhost networking, my guess would be 
> that eth->h_proto is bouncing around between 16 CPUs? At minimum this 
> read-mostly field should be separated from the bouncing bits.

It's the packet contents, there is no way to "separate it".

And it is unlikely to be bouncing on your system under tbench:
the senders and receivers should hang out on the same cpu unless
something completely stupid is happening.

That's why I like running tbench with a num_threads command
line argument equal to the number of cpus: every cpu gets
the two threads talking to each other over the TCP socket.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  5:23                                 ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-18  5:23 UTC (permalink / raw)
  To: dada1
  Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 17 Nov 2008 23:15:50 +0100

> Yes, I mentioned it later. But apparently you don't read my mails, so
> I will just stop now.

Yeah I was going to mention this too :-/

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-18  5:16                               ` David Miller
  (?)
@ 2008-11-18  5:35                               ` Eric Dumazet
  2008-11-18  7:00                                   ` David Miller
  -1 siblings, 1 reply; 349+ messages in thread
From: Eric Dumazet @ 2008-11-18  5:35 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

David Miller wrote:
> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 22:26:57 +0100
> 
>> eth->h_proto access.
> 
> Yes, this is the first time a packet is touched on receive.

Well, not exactly, since we do a 

if (is_multicast_ether_addr(eth->h_dest)) {
...}

and one of the
	compare_ether_addr(eth->h_dest, {dev->dev_addr | dev->broadcast})

it's probably a profiling effect...
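
The order of reads described here can be sketched in userspace: the destination address is examined for the multicast bit and compared against the device or broadcast address before the protocol field is ever looked at. The enum values and the function classify_dest() are hypothetical names for illustration; memcmp() stands in for compare_ether_addr().

```c
#include <stdint.h>
#include <string.h>

enum pkt_type { PKT_HOST, PKT_BROADCAST, PKT_MULTICAST, PKT_OTHERHOST };

/* A destination is multicast when the low bit of its first byte is set. */
static inline int is_multicast(const uint8_t *addr)
{
	return addr[0] & 0x01;
}

/* Classify a frame by destination address only; the protocol field
 * that follows the addresses is not read here. */
static inline enum pkt_type classify_dest(const uint8_t dest[6],
					  const uint8_t dev_addr[6],
					  const uint8_t broadcast[6])
{
	if (is_multicast(dest))
		return memcmp(dest, broadcast, 6) == 0 ? PKT_BROADCAST
						       : PKT_MULTICAST;
	if (memcmp(dest, dev_addr, 6) != 0)
		return PKT_OTHERHOST;
	return PKT_HOST;
}
```

Since the destination bytes sit in the same header as the protocol field, a cache miss charged to the later read may really belong to the earlier one, which is consistent with the "profiling effect" remark.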




^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  7:00                                   ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-18  7:00 UTC (permalink / raw)
  To: dada1
  Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 18 Nov 2008 06:35:46 +0100

> David Miller wrote:
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Mon, 17 Nov 2008 22:26:57 +0100
> > 
> >> eth->h_proto access.
> > Yes, this is the first time a packet is touched on receive.
> 
> Well, not exactly, since we do a 
> 
> if (is_multicast_ether_addr(eth->h_dest)) {
> ...}
> 
> and one of the
> 	compare_ether_addr(eth->h_dest, {dev->dev_addr | dev->broadcast})
> 
> it's probably a profiling effect...

True.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  8:30                                 ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-18  8:30 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 22:26:57 +0100
> 
> > eth->h_proto access.
> 
> Yes, this is the first time a packet is touched on receive.
> 
> > Given that this workload does localhost networking, my guess would be 
> > that eth->h_proto is bouncing around between 16 CPUs? At minimum this 
> > read-mostly field should be separated from the bouncing bits.
> 
> It's the packet contents, there is no way to "separate it".
> 
> And it is unlikely to be bouncing on your system under tbench: the 
> senders and receivers should hang out on the same cpu unless 
> something completely stupid is happening.
> 
> That's why I like running tbench with a num_threads command line 
> argument equal to the number of cpus: every cpu gets the two threads 
> talking to each other over the TCP socket.

yeah - and i posted the numbers for that too - it's the same 
throughput, within ~1% of noise.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-18  0:01                                   ` Linus Torvalds
  (?)
@ 2008-11-18  8:35                                   ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-18  8:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, Stephen Hemminger

[-- Attachment #1: Type: text/plain, Size: 1937 bytes --]

Linus Torvalds wrote:
> 
> On Tue, 18 Nov 2008, Eric Dumazet wrote:
>>> *
>>> * Compare two ethernet addresses, returns 0 if equal
>>> */
>>> static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
>>> {
>>>        const u16 *a = (const u16 *) addr1;
>>>        const u16 *b = (const u16 *) addr2;
>>>
>>>        BUILD_BUG_ON(ETH_ALEN != 6);
>>>        return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
> 
> Btw, at least on some Intel CPU's, it would be faster to do this as a 
> 32-bit xor and a 16-bit xor. And if we can know that there is always 2 
> bytes at the end (because of how the thing was allocated), it's faster 
> still to do it as a 64-bit xor and a mask.
> 
> And that's true even if the addresses are only 2-byte aligned.
> 

Yes, this is allowed, we always have at least 8 bytes for both arrays,
when called from eth_type_trans() at least.

I tried this idea and got nice assembly on 32 bits:

 158:   33 82 38 01 00 00       xor    0x138(%edx),%eax
 15e:   33 8a 34 01 00 00       xor    0x134(%edx),%ecx
 164:   c1 e0 10                shl    $0x10,%eax
 167:   09 c1                   or     %eax,%ecx
 169:   74 0b                   je     176 <eth_type_trans+0x87>

And very nice assembly on 64 bits of course (one xor, one shl)

About alignments, we have aligned addr2, but not addr1

Nice oprofile improvement in eth_type_trans(), 0.17 % instead of 0.41 %

opreport -l vmlinux | grep eth_type_trans
38797     0.1710  eth_type_trans
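
The 64-bit variant relies on both addresses being followed by two readable padding bytes: load eight bytes from each, xor them, and shift the padding out of the result. A portable userspace sketch of that idea follows. The names zap_last_2bytes64() and ether_addr_differs_64() are illustrative, the byte-order test uses the GCC/Clang predefined __BYTE_ORDER__ macro (an assumption about the compiler), and memcpy() replaces the pointer casts so the sketch does not depend on unaligned access being legal.

```c
#include <stdint.h>
#include <string.h>

/* Discard the two bytes that come last in memory (the padding
 * after the 6-byte address) from a 64-bit value loaded from it. */
static inline uint64_t zap_last_2bytes64(uint64_t v)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	return v >> 16;
#else
	return v << 16;
#endif
}

/* Returns 0 if the first six bytes of a and b are equal, 1 otherwise.
 * Both arrays must have at least 8 readable bytes; the last two are
 * ignored. */
static inline unsigned int ether_addr_differs_64(const uint8_t a[8],
						 const uint8_t b[8])
{
	uint64_t x, y;

	memcpy(&x, a, sizeof(x));
	memcpy(&y, b, sizeof(y));
	return zap_last_2bytes64(x ^ y) != 0;
}
```

The padding requirement is the caller's responsibility; passing a bare 6-byte array would read past its end.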



[PATCH] eth: Declare an optimized compare_ether_addr_64bits() function

Linus mentioned we could try to perform long word operations, even
on potentially unaligned addresses, on x86 at least.

This patch implements a compare_ether_addr_64bits() function,
that handles the case of x86 cpus, but might be used on other arches as well.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---

[-- Attachment #2: compare_ether_addr_64bits.patch --]
[-- Type: text/plain, Size: 2438 bytes --]

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 25d62e6..ee0df09 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -136,6 +136,47 @@ static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
 	BUILD_BUG_ON(ETH_ALEN != 6);
 	return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
 }
+
+static inline unsigned long zap_last_2bytes(unsigned long value)
+{
+#ifdef __BIG_ENDIAN
+	return value >> 16;
+#else
+	return value << 16;
+#endif
+}
+
+/**
+ * compare_ether_addr_64bits - Compare two Ethernet addresses
+ * @addr1: Pointer to an array of 8 bytes
+ * @addr2: Pointer to another array of 8 bytes
+ *
+ * Compare two ethernet addresses, returns 0 if equal.
+ * Same result as "memcmp(addr1, addr2, ETH_ALEN)" but without conditional
+ * branches, and possibly long word memory accesses on CPU allowing cheap
+ * unaligned memory reads.
+ * arrays = { byte1, byte2, byte3, byte4, byte5, byte6, pad1, pad2}
+ * 
+ * Please note that alignment of addr1 & addr2 is only guaranteed to be 16 bits.
+ */
+
+static inline unsigned compare_ether_addr_64bits(const u8 addr1[6+2],
+						 const u8 addr2[6+2])
+{
+#if defined(CONFIG_X86)
+	unsigned long fold = *(const unsigned long *)addr1 ^
+			     *(const unsigned long *)addr2;
+
+	if (sizeof(fold) == 8)
+		return zap_last_2bytes(fold) != 0;
+
+	fold |= zap_last_2bytes(*(const unsigned long *)(addr1 + 4) ^
+				*(const unsigned long *)(addr2 + 4));
+	return fold != 0;
+#else
+	return compare_ether_addr(addr1, addr2);
+#endif
+}
 #endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_ETHERDEVICE_H */
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index b9d85af..dcfeb9b 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -166,7 +166,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
 	eth = eth_hdr(skb);
 
 	if (is_multicast_ether_addr(eth->h_dest)) {
-		if (!compare_ether_addr(eth->h_dest, dev->broadcast))
+		if (!compare_ether_addr_64bits(eth->h_dest, dev->broadcast))
 			skb->pkt_type = PACKET_BROADCAST;
 		else
 			skb->pkt_type = PACKET_MULTICAST;
@@ -181,7 +181,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
 	 */
 
 	else if (1 /*dev->flags&IFF_PROMISC */ ) {
-		if (unlikely(compare_ether_addr(eth->h_dest, dev->dev_addr)))
+		if (unlikely(compare_ether_addr_64bits(eth->h_dest, dev->dev_addr)))
 			skb->pkt_type = PACKET_OTHERHOST;
 	}
 

^ permalink raw reply related	[flat|nested] 349+ messages in thread
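
The branch-free comparison in the patch above can be tried outside the
kernel. What follows is a minimal userspace sketch, not the patch's code.
It assumes both buffers really have 8 readable bytes, uses memcpy() for the
64-bit loads so that no alignment is assumed, and swaps the endian-dependent
shift of zap_last_2bytes() for a byte mask that is built the same way on
either byte order. The name ether_addr_differs_pad8() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/*
 * Hypothetical userspace analogue of compare_ether_addr_64bits().
 * Returns 0 if the first ETH_ALEN bytes are equal, 1 otherwise.
 * Assumption: a and b each point to at least ETH_ALEN + 2 readable bytes.
 */
static unsigned ether_addr_differs_pad8(const uint8_t *a, const uint8_t *b)
{
	/* 0xff over the six address bytes, 0x00 over the two pad bytes */
	static const uint8_t mask_bytes[ETH_ALEN + 2] = {
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00
	};
	uint64_t x, y, mask;

	assert(a != NULL && b != NULL);

	/* memcpy() keeps the loads well defined for any alignment */
	memcpy(&x, a, sizeof(x));
	memcpy(&y, b, sizeof(y));
	memcpy(&mask, mask_bytes, sizeof(mask));

	/* XOR folds every difference into one word; the mask drops the pad */
	return ((x ^ y) & mask) != 0;
}
```

Two buffers that differ only in the two pad bytes compare as equal here,
which is why reading past the 6-byte address is harmless as long as those
extra bytes are readable.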

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  8:45                                   ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-18  8:45 UTC (permalink / raw)
  To: David Miller
  Cc: dada1, torvalds, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger


* David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 17 Nov 2008 23:15:50 +0100
> 
> > Yes, I mentioned it later. But apparently you don't read my mails, 
> > so I will just stop now.
> 
> Yeah I was going to mention this too :-/

I spent hours profiling the networking code, and no, i didn't read all 
the incoming emails in parallel - i read them after that.

I have established it beyond reasonable doubt that the scheduler is 
doing the right thing with the config i've posted. Your "wakeup is two 
orders of magnitude more expensive" claim, which got me to measure and 
profile this stuff, is not reproducible here and this regression 
should not be listed as a scheduler regression.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  8:49                                   ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-18  8:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, torvalds, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, shemminger

Ingo Molnar wrote:
> * David Miller <davem@davemloft.net> wrote:
> 
>> From: Ingo Molnar <mingo@elte.hu>
>> Date: Mon, 17 Nov 2008 22:26:57 +0100
>>
>>> eth->h_proto access.
>> Yes, this is the first time a packet is touched on receive.
>>
>>> Given that this workload does localhost networking, my guess would be 
>>> that eth->h_proto is bouncing around between 16 CPUs? At minimum this 
>>> read-mostly field should be separated from the bouncing bits.
>> It's the packet contents, there is no way to "separate it".
>>
>> And it should be unlikely bouncing on your system under tbench, the 
>> senders and receivers should hang out on the same cpu unless 
>> something completely stupid is happening.
>>
>> That's why I like running tbench with a num_threads command line 
>> argument equal to the number of cpus, every cpu gets the two threads 
>> talking to each other over the TCP socket.
> 
> yeah - and i posted the numbers for that too - it's the same 
> throughput, within ~1% of noise.

Thinking once again about loopback driver, I recall a previous attempt
to call netif_receive_skb() instead of netif_rx() and pay the price
of cache line ping-pongs between cpus.

http://kerneltrap.org/mailarchive/linux-netdev/2008/2/21/939644

Maybe we could do that, with a temporary percpu stack, like we do in softirq
when CONFIG_4KSTACKS=y

(arch/x86/kernel/irq_32.c: call_on_stack(func, stack))

And do this only if the current cpu doesn't already use its softirq_stack
(think about loopback re-entering loopback xmit because of TCP ACK for example)

Oh well... black magic, you are going to kill me :)



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17 20:32                             ` Ingo Molnar
  (?)
  (?)
@ 2008-11-18  9:12                             ` Nick Piggin
  -1 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2008-11-18  9:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Eric Dumazet, David Miller, rjw, linux-kernel,
	kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger

On Tuesday 18 November 2008 07:32, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> > 100.000000 total
> > ................
> >   3.356152 ip_queue_xmit

> 30% of the overhead of this function comes from:
>
> ffffffff804b7203:        0 	66 c7 43 06 00 00    	movw   $0x0,0x6(%rbx)
> ffffffff804b7209:      118 	0f bf 85 40 02 00 00 	movswl 0x240(%rbp),%eax
> ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
> ffffffff804b7215:      340 	85 c0                	test   %eax,%eax
> ffffffff804b7217:        0 	79 06                	jns    ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219:   107464 	8b 82 9c 00 00 00    	mov    0x9c(%rdx),%eax
> ffffffff804b721f:     4963 	88 43 08             	mov    %al,0x8(%rbx)
>
> the 16-bit movw looks a bit weird. It comes from line 372:
>
>  0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
>  367		iph = ip_hdr(skb);
>  368		*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
>  369		if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
>  370			iph->frag_off = htons(IP_DF);
>  371		else
>  372			iph->frag_off = 0;
>  373		iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
>  374		iph->protocol = sk->sk_protocol;
>  375		iph->saddr    = rt->rt_src;
>  376		iph->daddr    = rt->rt_dst;
>
> the ip-header fragment flag setting to zero.
>
> 16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is
> towards eliminating them as much as possible.
>
> _But_, the real overhead probably comes from:
>
>  ffffffff804b7210:    10867 	48 8b 54 24 58       	mov    0x58(%rsp),%rdx
>
> which is the next line, the ttl field:
>
>  373             iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
>
> this shows that we are doing a hard cachemiss on the net-localhost
> route dst structure cacheline. We do a plain load instruction from it
> here and get a hefty cachemiss. (because 16 CPUs are banging on that
> single route)

Why would that show up right there, though? An instruction like this should
be non-blocking. Shouldn't the cost show up at some point where the
CPU executes an instruction depending on rdx? (and good luck working out
when that happens!)
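
For reference, the store at line 368 of the quoted listing packs the IP
version (4), the header length in 32-bit words (5) and the TOS byte into the
first 16 bits of the header. A small userspace sketch of that packing
follows. set_ver_ihl_tos() is a hypothetical helper, not a kernel function,
and it uses memcpy() where the kernel code uses a pointer cast.

```c
#include <arpa/inet.h>	/* htons(), POSIX */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: fill the first two bytes of an IPv4 header buffer
 * with version = 4, IHL = 5 and the given TOS, using one 16-bit value.
 */
static void set_ver_ihl_tos(uint8_t *hdr, uint8_t tos)
{
	/* Same expression as the quoted line, converted to network order */
	uint16_t first = htons((uint16_t)((4 << 12) | (5 << 8) | tos));

	assert(hdr != NULL);
	memcpy(hdr, &first, sizeof(first));
}
```

On the wire the first byte is then 0x45 and the second byte is the TOS
value, regardless of host byte order.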


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18  9:44                                       ` Nick Piggin
  0 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2008-11-18  9:44 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, mingo, dada1, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, shemminger

On Tuesday 18 November 2008 07:58, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST)
>
> > On Mon, 17 Nov 2008, David Miller wrote:
> > > It's on my workstation which is a much simpler 2 processor
> > > UltraSPARC-IIIi (1.5GHz) system.
> >
> > Ok. It could easily be something like a cache footprint issue. And while
> > I don't know my sparc cpus very well, I think the Ultrasparc-IIIi is
> > superscalar but does no out-of-order and speculation, no?
>
> It does only very simple speculation, but your description is accurate.

Surely it would do branch prediction, but maybe not indirect branch?
I did wonder why those indirect function calls were added everywhere
in the scheduler...

They didn't show up in the newest generation of x86 CPUs, but simpler
implementations won't handle them as well.

I wouldn't expect that to cause such a big regression on its own, but
it would still be interesting to test changing them to direct calls.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18 12:29                               ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2008-11-18 12:29 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, torvalds, dada1, rjw, linux-kernel, kernel-testers, cl,
	a.p.zijlstra, shemminger

On Mon, 2008-11-17 at 11:39 -0800, David Miller wrote:
> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 19:49:51 +0100
> 
> > 
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > The place for the sock_rfree() hit looks a bit weird, and i'll 
> > > investigate it now a bit more to place the real overhead point 
> > > properly. (i already mapped the test-bit overhead: that comes from 
> > > napi_disable_pending())
> > 
> > ok, here's a new set of profiles. (again for tbench 64-thread on a 
> > 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i 
> > posted before.)
> 
> Again, do a non-NMI profile and the top (at least for me)
> looks like this:
> 
> samples  %        app name                 symbol name
> 473       6.3928  vmlinux                  finish_task_switch
> 349       4.7169  vmlinux                  tcp_v4_rcv
> 327       4.4195  vmlinux                  U3copy_from_user
> 322       4.3519  vmlinux                  tl0_linux32
> 178       2.4057  vmlinux                  tcp_ack
> 170       2.2976  vmlinux                  tcp_sendmsg
> 167       2.2571  vmlinux                  U3copy_to_user
> 
> That tcp_v4_rcv() hit is 98% on the wake_up() call it does.

Easy enough, since i don't know how to do spiffy NMI profile.. yet ;-) 

I revived the 2.6.25 kernel where I tested back-ports of recent sched
fixes, and did a non-NMI profile of 2.6.22.19 and the back-port kernel.

The test kernel has all clock fixes 25->.git, the min_vruntime accuracy fix,
the native_read_tsc() fix, and back looking buddy.  No knobs turned, and
only testing one pair per CPU, so as not to take unfair advantage of back
looking buddy.  Netperf TCP_RR (hits sched harder) looks about the same.

Tbench 4 throughput was so close you would call these two twins.

2.6.22.19-smp
CPU: Core 2, speed 2400 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
vma      samples  %        symbol name
ffffffff802e6670 575909   13.7425  copy_user_generic_string
ffffffff80422ad8 175649    4.1914  schedule
ffffffff803a522d 133152    3.1773  tcp_sendmsg
ffffffff803a9387 128911    3.0761  tcp_ack
ffffffff803b65f7 116562    2.7814  tcp_v4_rcv
ffffffff803aeac8 116541    2.7809  tcp_transmit_skb
ffffffff8039eb95 112133    2.6757  ip_queue_xmit
ffffffff80209e20 110945    2.6474  system_call
ffffffff8037b720 108277    2.5837  __kfree_skb
ffffffff803a65cd 105493    2.5173  tcp_recvmsg
ffffffff80210f87 97947     2.3372  read_tsc
ffffffff802085b6 95255     2.2730  __switch_to
ffffffff803803f1 82069     1.9584  netif_rx
ffffffff8039f645 80937     1.9313  ip_output
ffffffff8027617d 74585     1.7798  __slab_alloc
ffffffff803824a0 70928     1.6925  process_backlog
ffffffff803ad9a5 69574     1.6602  tcp_rcv_established
ffffffff80399d40 55453     1.3232  ip_rcv
ffffffff803b07d1 53256     1.2708  __tcp_push_pending_frames
ffffffff8037b49c 52565     1.2543  skb_clone
ffffffff80276e97 49690     1.1857  __kmalloc_track_caller
ffffffff80379d05 45450     1.0845  sock_wfree
ffffffff80223d82 44851     1.0702  effective_prio
ffffffff803826b6 42417     1.0122  net_rx_action
ffffffff8027684c 42341     1.0104  kfree

2.6.25.20-test-smp
CPU: Core 2, speed 2400 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
vma      samples  %        symbol name
ffffffff80301450 576125   14.0874  copy_user_generic_string
ffffffff803cf8d9 127997    3.1298  tcp_transmit_skb
ffffffff803c9eac 125402    3.0663  tcp_ack
ffffffff80454da3 122337    2.9914  schedule
ffffffff803c673c 120401    2.9440  tcp_sendmsg
ffffffff8039aa9e 116554    2.8500  skb_release_all
ffffffff803c5abb 104840    2.5635  tcp_recvmsg
ffffffff8020a63d 92180     2.2540  __switch_to
ffffffff8020be20 79703     1.9489  system_call
ffffffff803bf460 79384     1.9411  ip_queue_xmit
ffffffff803a005c 78035     1.9081  netif_rx
ffffffff803ce56b 71223     1.7415  tcp_rcv_established
ffffffff8039ff70 66493     1.6259  process_backlog
ffffffff803d5a2d 61635     1.5071  tcp_v4_rcv
ffffffff803c1dae 60889     1.4889  __inet_lookup_established
ffffffff802126bc 54711     1.3378  native_read_tsc
ffffffff803d23bc 51843     1.2677  __tcp_push_pending_frames
ffffffff803bfb24 51821     1.2671  ip_finish_output
ffffffff8023700c 48248     1.1798  local_bh_enable
ffffffff803979bc 42221     1.0324  sock_wfree
ffffffff8039b12c 41279     1.0094  __alloc_skb



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-18 15:58                                         ` Linus Torvalds
  0 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2008-11-18 15:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Miller, mingo, dada1, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, shemminger



On Tue, 18 Nov 2008, Nick Piggin wrote:

> On Tuesday 18 November 2008 07:58, David Miller wrote:
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> > >
> > > Ok. It could easily be something like a cache footprint issue. And while
> > > I don't know my sparc cpus very well, I think the Ultrasparc-IIIi is
> > > superscalar but does no out-of-order and speculation, no?
> >
> > It does only very simple speculation, but your description is accurate.
> 
> Surely it would do branch prediction, but maybe not indirect branch?

That would be "branch target prediction" (and a BTB - "Branch Target 
Buffer" to hold it), and no, I don't think Sparc does that. You can 
certainly do it for in-order machines too, but I think it's fairly rare.

It's sufficiently different from the regular "pick up the address from the 
static instruction stream, and also yank the kill-chain on mispredicted 
direction" to be real work to do. Unlike a compare or test instruction, 
it's not at all likely that you can resolve the final address in just a 
single pipeline stage, and without that, it's usually too late to yank the 
kill-chain.

(And perhaps equally importantly, indirect branches are relatively rare on 
old-style Unix benchmarks - ie SpecInt/FP - or in databases. So it's not 
something that Sparc would necessarily have spent the effort on.)

There is obviously one very special indirect jump: "ret". That's the one 
that is common, and that tends to have a special branch target buffer that 
is a pure stack. And for that, there is usually a special branch target 
register that needs to be set up 'x' cycles before the ret in order to 
avoid the stall (then the prediction is checking that register against the 
branch target stack, which is somewhat akin to a regular conditional 
branch comparison).

So I strongly suspect that an indirect (non-ret) branch flushes the 
pipeline on sparc. It is possible that there is a "prepare to jump" 
instruction that prepares the indirect branch stack (kind of a "push 
prediction information"). I suspect Java sees a lot more indirect 
branches than traditional Unix loads, so maybe Sun did do that.

			Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-18 15:58                                         ` Linus Torvalds
  (?)
@ 2008-11-19  4:31                                         ` Nick Piggin
  -1 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2008-11-19  4:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, mingo, dada1, rjw, linux-kernel, kernel-testers,
	cl, efault, a.p.zijlstra, shemminger

On Wednesday 19 November 2008 02:58, Linus Torvalds wrote:
> On Tue, 18 Nov 2008, Nick Piggin wrote:
> > On Tuesday 18 November 2008 07:58, David Miller wrote:
> > > From: Linus Torvalds <torvalds@linux-foundation.org>
> > >
> > > > Ok. It could easily be something like a cache footprint issue. And
> > > > while I don't know my sparc cpus very well, I think the
> > > > Ultrasparc-IIIi is superscalar but does no out-of-order and
> > > > speculation, no?
> > >
> > > It does only very simple speculation, but your description is
> > > accurate.
> >
> > Surely it would do branch prediction, but maybe not indirect branch?
>
> That would be "branch target prediction" (and a BTB - "Branch Target
> Buffer" to hold it), and no, I don't think Sparc does that. You can
> certainly do it for in-order machines too, but I think it's fairly rare.
>
> It's sufficiently different from the regular "pick up the address from the
> static instruction stream, and also yank the kill-chain on mispredicted
> direction" to be real work to do. Unlike a compare or test instruction,
> it's not at all likely that you can resolve the final address in just a
> single pipeline stage, and without that, it's usually too late to yank the
> kill-chain.
>
> (And perhaps equally importantly, indirect branches are relatively rare on
> old-style Unix benchmarks - ie SpecInt/FP - or in databases. So it's not
> something that Sparc would necessarily have spent the effort on.)
>
> There is obviously one very special indirect jump: "ret". That's the one
> that is common, and that tends to have a special branch target buffer that
> is a pure stack. And for that, there is usually a special branch target
> register that needs to be set up 'x' cycles before the ret in order to
> avoid the stall (then the prediction is checking that register against the
> branch target stack, which is somewhat akin to a regular conditional
> branch comparison).
>
> So I strongly suspect that an indirect (non-ret) branch flushes the
> pipeline on sparc. It is possible that there is a "prepare to jump"
> instruction that prepares the indirect branch stack (kind of a "push
> prediction information"). I suspect Java sees a lot more indirect
> branches than traditional Unix loads, so maybe Sun did do that.

Probably true. OTOH, I've seen indirect branches get compiled to direct
branches, or the common case special-cased into a direct branch:

if (object->fn == default_object_fn)
  default_object_fn();
else
  object->fn();

That might be an easy way to test suspicions about CPU scheduler
slowdowns... (adding a likely() there, and using likely profiling would
help ensure you got the default case right).
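
The special-casing described above can be written out as a complete
function. This is only a userspace sketch under stated assumptions: struct
object, default_object_fn(), other_object_fn() and call_object_fn() are
hypothetical names, and likely() is defined locally on top of GCC's
__builtin_expect() rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's likely(); plain pass-through elsewhere */
#if defined(__GNUC__)
#define likely(x) __builtin_expect(!!(x), 1)
#else
#define likely(x) (x)
#endif

/* Hypothetical object with one method pointer */
struct object {
	int (*fn)(struct object *obj);
	int value;
};

static int default_object_fn(struct object *obj)
{
	return obj->value + 1;
}

static int other_object_fn(struct object *obj)
{
	return obj->value * 2;
}

static int call_object_fn(struct object *obj)
{
	assert(obj != NULL && obj->fn != NULL);

	/* Common case: a direct call the compiler can predict or inline */
	if (likely(obj->fn == default_object_fn))
		return default_object_fn(obj);

	/* Rare case: fall back to the indirect call */
	return obj->fn(obj);
}
```

The trade-off is one extra compare on every call in exchange for avoiding
the indirect branch in the common case, so it only pays off when the default
really dominates.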

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-19 19:43       ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-11-19 19:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra

On Mon, 17 Nov 2008, Ingo Molnar wrote:

> Christoph, as per the recent analysis of Mike:
>
>  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
>
> all scheduler components of this regression have been eliminated.
>
> In fact his numbers show that scheduler speedups since 2.6.22 have
> offset and hidden most other sources of tbench regression. (i.e. the
> scheduler portion got 5% faster, hence it was able to offset a
> slowdown of 5% in other areas of the kernel that tbench triggers)

OK, will rerun the tests tomorrow. Just got back from SC08; need some time
to catch up.

Looks like a lot of work was done on this issue. Thanks!


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-19 20:14         ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-19 20:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra


* Christoph Lameter <cl@linux-foundation.org> wrote:

> On Mon, 17 Nov 2008, Ingo Molnar wrote:
> 
> > Christoph, as per the recent analysis of Mike:
> >
> >  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> >
> > all scheduler components of this regression have been eliminated.
> >
> > In fact his numbers show that scheduler speedups since 2.6.22 have
> > offset and hidden most other sources of tbench regression. (i.e. the
> > scheduler portion got 5% faster, hence it was able to offset a
> > slowdown of 5% in other areas of the kernel that tbench triggers)
> 
> Ok will rerun the tests tomorrow. Just got back from SC08 need some 
> time to catch up.
> 
> Looks like a lot of work was done on this issue. Thanks!

You might also want to try net-next:

 [remote "net-next"]
        url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
        fetch = +refs/heads/*:refs/remotes/net-next/*

Some good stuff is in there too, impacting this workload.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-20  9:06                                         ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-20  9:06 UTC (permalink / raw)
  To: nickpiggin
  Cc: torvalds, mingo, dada1, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, shemminger

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Tue, 18 Nov 2008 20:44:10 +1100

> On Tuesday 18 November 2008 07:58, David Miller wrote:
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> > Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST)
> >
> > > On Mon, 17 Nov 2008, David Miller wrote:
> > > > It's on my workstation which is a much simpler 2 processor
> > > > UltraSPARC-IIIi (1.5Ghz) system.
> > >
> > > Ok. It could easily be something like a cache footprint issue. And while
> > > I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is
> > > super- scalar but does no out-of-order and speculation, no?
> >
> > It does only very simple speculation, but your description is accurate.
> 
> Surely it would do branch prediction, but maybe not indirect branch?

Right.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-20  9:14                                           ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-20  9:14 UTC (permalink / raw)
  To: torvalds
  Cc: nickpiggin, mingo, dada1, rjw, linux-kernel, kernel-testers, cl,
	efault, a.p.zijlstra, shemminger

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 18 Nov 2008 07:58:49 -0800 (PST)

> There is obviously one very special indirect jump: "ret". That's the one 
> that is common, and that tends to have a special branch target buffer that 
> is a pure stack. And for that, there is usually a special branch target 
> register that needs to be set up 'x' cycles before the ret in order to 
> avoid the stall (then the prediction is checking that register against the 
> branch target stack, which is somewhat akin to a regular conditional 
> branch comparison).

Yes, UltraSPARC has a RAS or Return Address Stack.  I think it has
effectively zero latency (ie. you can call some function, immediately
"ret" and it hits the RAS).  This is probably because, due to delay slots,
there is always going to be one instruction in between anyways. :)

> So I strongly suspect that an indirect (non-ret) branch flushes the 
> pipeline on sparc. It is possible that there is a "prepare to jump" 
> instruction that prepares the indirect branch stack (kind of a "push 
> prediction information").

It doesn't flush the pipeline, it just stalls it waiting for the
address computation.

Branches are predicted and can execute in the same cycle as the
condition-code setting instruction they depend upon.

> I suspect Java sees a lot more indirect branches than traditional
> Unix loads, so maybe Sun did do that.

There really isn't anything special done here for indirect jumps,
other than pushing onto the RAS.  Indirects just suck :)


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-20 23:52         ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-11-20 23:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra

hmmm... Well we are almost there.

2.6.22:

Throughput 2526.15 MB/sec 8 procs

2.6.28-rc5:

Throughput 2486.2 MB/sec 8 procs

Measured on an 8p Dell 1950, with the number of processors specified on
the tbench command line.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21  8:30           ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-21  8:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	David S. Miller


* Christoph Lameter <cl@linux-foundation.org> wrote:

> hmmm... Well we are almost there.
> 
> 2.6.22:
> 
> Throughput 2526.15 MB/sec 8 procs
> 
> 2.6.28-rc5:
> 
> Throughput 2486.2 MB/sec 8 procs
> 
> 8p Dell 1950 and the number of processors specified on the tbench 
> command line.

And with net-next we might even be able to get past that magic limit? 
net-next is linus-latest plus the latest and greatest networking bits:

 $ cat .git/config

 [remote "net-next"]
	url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
	fetch = +refs/heads/*:refs/remotes/net-next/*

... so might be worth a test. Just to satisfy our curiosity and to 
possibly close the entry :-)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21  8:51             ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-21  8:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	David S. Miller

Ingo Molnar a écrit :
> * Christoph Lameter <cl@linux-foundation.org> wrote:
> 
>> hmmm... Well we are almost there.
>>
>> 2.6.22:
>>
>> Throughput 2526.15 MB/sec 8 procs
>>
>> 2.6.28-rc5:
>>
>> Throughput 2486.2 MB/sec 8 procs
>>
>> 8p Dell 1950 and the number of processors specified on the tbench 
>> command line.
> 
> And with net-next we might even be able to get past that magic limit? 
> net-next is linus-latest plus the latest and greatest networking bits:
> 
>  $ cat .git/config
> 
>  [remote "net-next"]
> 	url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
> 	fetch = +refs/heads/*:refs/remotes/net-next/*
> 
> ... so might be worth a test. Just to satisfy our curiosity and to 
> possibly close the entry :-)
> 

Well, bits in net-next are new stuff for 2.6.29, not really regression fixes,
but yes, they should give nice tbench speedups.


Now, I wish sockets and pipes did not go through the dcache. That is not a
tbench matter, of course, but it affects real workloads...

Running 8 processes on an 8-way machine, each doing

for (;;)
	close(socket(AF_INET, SOCK_STREAM, 0));

is slow as hell; we hit so many contended cache lines...

Ticket spin locks are slower in this case (dcache_lock, for example,
is taken twice when we allocate a socket: once in d_alloc() and once
in d_instantiate()).
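
For anyone wanting to reproduce this, here is a rough sketch of such a test,
assuming one forked process per CPU and no CPU pinning. socket_loop() and
run_workers() are hypothetical helper names, and the original test program
may well have been structured differently:

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One worker: repeatedly create and close a TCP socket. */
static void socket_loop(long iterations)
{
	for (long i = 0; i < iterations; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd >= 0)
			close(fd);
	}
}

/*
 * Fork nproc workers and wait for all of them.
 * Returns the number of workers that exited cleanly,
 * or -1 if not every worker could be started.
 */
static int run_workers(int nproc, long iterations)
{
	int started = 0, clean = 0;

	assert(nproc > 0 && iterations >= 0);

	for (int i = 0; i < nproc; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			socket_loop(iterations);
			_exit(0);
		}
		started++;
	}

	/* Reap every child that was actually started. */
	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			clean++;
	}
	return started == nproc ? clean : -1;
}
```

Calling run_workers(8, 1000000) from main() and running the binary under
time(1), once with 1 worker and once with 8, should show how the per-process
cost changes as the shared locks and counters become contended.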



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21  9:03             ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-21  9:03 UTC (permalink / raw)
  To: mingo; +Cc: cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 21 Nov 2008 09:30:44 +0100

> 
> * Christoph Lameter <cl@linux-foundation.org> wrote:
> 
> > hmmm... Well we are almost there.
> > 
> > 2.6.22:
> > 
> > Throughput 2526.15 MB/sec 8 procs
> > 
> > 2.6.28-rc5:
> > 
> > Throughput 2486.2 MB/sec 8 procs
> > 
> > 8p Dell 1950 and the number of processors specified on the tbench 
> > command line.
> 
> And with net-next we might even be able to get past that magic limit? 
> net-next is linus-latest plus the latest and greatest networking bits:

In any event I'm happy to toss this from the regression list.

My sparc still shows the issues and I'll profile that independently.
I'm pretty sure it's the indirect calls and the deeper stack frames
(which == 128 bytes of extra stores at each level to save the register
window), but I need to prove that first.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21  9:05               ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-21  9:05 UTC (permalink / raw)
  To: dada1; +Cc: mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 21 Nov 2008 09:51:32 +0100

> Now, I wish sockets and pipes not going through dcache, not tbench affair
> of course but real workloads...
> 
> running 8 processes on a 8 way machine doing a 
> 
> for (;;)
> 	close(socket(AF_INET, SOCK_STREAM, 0));
> 
> is slow as hell, we hit so many contended cache lines ...
> 
> ticket spin locks are slower in this case (dcache_lock for example
> is taken twice when we allocate a socket(), once in d_alloc(), another one
> in d_instantiate())

As you of course know, this used to be a ton worse.  At least now
these things are unhashed. :)


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21  9:18               ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-21  9:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	David S. Miller


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Ingo Molnar a écrit :
>> * Christoph Lameter <cl@linux-foundation.org> wrote:
>>
>>> hmmm... Well we are almost there.
>>>
>>> 2.6.22:
>>>
>>> Throughput 2526.15 MB/sec 8 procs
>>>
>>> 2.6.28-rc5:
>>>
>>> Throughput 2486.2 MB/sec 8 procs
>>>
>>> 8p Dell 1950 and the number of processors specified on the tbench  
>>> command line.
>>
>> And with net-next we might even be able to get past that magic limit?  
>> net-next is linus-latest plus the latest and greatest networking bits:
>>
>>  $ cat .git/config
>>
>>  [remote "net-next"]
>> 	url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>> 	fetch = +refs/heads/*:refs/remotes/net-next/*
>>
>> ... so might be worth a test. Just to satisfy our curiosity and to 
>> possibly close the entry :-)
>>
>
> Well, bits in net-next are new stuff for 2.6.29, not really 
> regression fixes, but yes, they should give nice tbench speedups.

Yeah, I know - technically these are lots-of-kernel-releases effects,
so not bona fide latest-cycle regressions anyway. But it doesn't matter
what we call them, we want improvement in these metrics.

> Now, I wish sockets and pipes not going through dcache, not tbench 
> affair of course but real workloads...
>
> running 8 processes on a 8 way machine doing a 
>
> for (;;)
> 	close(socket(AF_INET, SOCK_STREAM, 0));
>
> is slow as hell, we hit so many contended cache lines ...
>
> ticket spin locks are slower in this case (dcache_lock for example 
> is taken twice when we allocate a socket(), once in d_alloc(), 
> another one in d_instantiate())

Hm, weird - since there's no real VFS namespace impact, I fail to
see the fundamental need that causes us to hit the dcache_lock.
(Perhaps there's none and this is fixable.)

The general concept of mapping sockets to fds is a fundamental and 
powerful abstraction. There are APIs that also connect them to the VFS 
namespace (such as unix domain sockets) - but those should be special 
cases, not impacting normal TCP sockets.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21 12:51                 ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-21 12:51 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 21 Nov 2008 09:51:32 +0100
> 
>> Now, I wish sockets and pipes not going through dcache, not tbench affair
>> of course but real workloads...
>>
>> running 8 processes on a 8 way machine doing a 
>>
>> for (;;)
>> 	close(socket(AF_INET, SOCK_STREAM, 0));
>>
>> is slow as hell, we hit so many contended cache lines ...
>>
>> ticket spin locks are slower in this case (dcache_lock for example
>> is taken twice when we allocate a socket(), once in d_alloc(), another one
>> in d_instantiate())
> 
> As you of course know, this used to be a ton worse.  At least now
> these things are unhashed. :)

Well, this is dust compared to what we currently have.

To allocate a socket we:
0) Do the usual file manipulation (pretty scalable these days,
   but the recent drop_file_write_access() and co slow it down a bit)
1) Allocate an inode with new_inode()
    This function:
     - locks inode_lock
     - dirties the nr_inodes counter
     - dirties the inode_in_use list (for sockets, I doubt it is useful)
     - dirties the superblock s_inodes list
     - dirties the last_ino counter
    All of these are in different cache lines, of course.
2) Allocate a dentry
   d_alloc() takes dcache_lock,
   inserts the dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root),
   and dirties nr_dentry
3) d_instantiate() the dentry (dcache_lock taken again)
4) init_file() -> atomic_inc on sock_mnt->refcount (in case we want to umount this vfs ...)
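
One common way to make counters like nr_inodes or nr_dentry cheaper is to
batch updates per CPU, so that the shared cache line is written only once per
batch. The following is a single-threaded user-space model of that idea, not
the kernel's percpu_counter code; struct batched_counter, MODEL_NR_CPUS and
the helper names are all made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 8	/* hypothetical CPU count for the model */

struct batched_counter {
	long global;			/* shared value: the contended cache line */
	long local[MODEL_NR_CPUS];	/* per-CPU deltas; real code pads each slot */
	long batch;			/* fold threshold */
	long global_updates;		/* how often the shared value was written */
};

static void batched_counter_init(struct batched_counter *c, long batch)
{
	assert(batch > 0);
	c->global = 0;
	c->batch = batch;
	c->global_updates = 0;
	for (int i = 0; i < MODEL_NR_CPUS; i++)
		c->local[i] = 0;
}

/* Add to one CPU's slot; touch the shared value only when a batch fills. */
static void batched_counter_add(struct batched_counter *c, int cpu, long amount)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	c->local[cpu] += amount;
	if (labs(c->local[cpu]) >= c->batch) {
		c->global += c->local[cpu];
		c->local[cpu] = 0;
		c->global_updates++;
	}
}

/* Exact value: shared part plus every CPU's pending delta. */
static long batched_counter_sum(const struct batched_counter *c)
{
	long sum = c->global;

	for (int i = 0; i < MODEL_NR_CPUS; i++)
		sum += c->local[i];
	return sum;
}
```

The trade-off is that reading the exact value means walking every CPU's slot,
so this only pays off for counters that are updated far more often than they
are read.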



At close() time, we must undo all of this. It's even more expensive because
of the _atomic_dec_and_lock() calls, which are heavily contended, and because
two cache lines are touched when an element is deleted from a list.

for (i = 0; i < 1000*1000; i++)
	close(socket(AF_INET, SOCK_STREAM, 0));

Cost if run on one CPU:

real    0m1.561s
user    0m0.092s
sys     0m1.469s

If run on 8 CPUs:

real    0m27.496s
user    0m0.657s
sys     3m39.092s
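
To put the two runs on the same scale, the system time can be divided by the
number of socket()+close() pairs. The helper below is only a sketch of that
arithmetic; usec_per_call() is a hypothetical name, and the 8-CPU figure
assumes each of the 8 processes ran the full 1,000,000 iterations, which is
not stated explicitly above:

```c
#include <assert.h>

/* Average cost, in microseconds, of one socket()+close() pair. */
static double usec_per_call(double seconds, double calls)
{
	assert(calls > 0);
	return seconds * 1e6 / calls;
}

/* sys time of the single-CPU run: 1.469 s for 1,000,000 pairs. */
static double one_cpu_cost(void)
{
	return usec_per_call(1.469, 1e6);
}

/* sys time of the 8-CPU run: 3m39.092s, assumed to cover 8,000,000 pairs. */
static double eight_cpu_cost(void)
{
	return usec_per_call(3 * 60 + 39.092, 8e6);
}
```

Under that assumption the cost goes from about 1.5 us per pair on one CPU to
about 27 us per pair on eight, i.e. each call burns roughly 18 to 19 times
more CPU time once the cache lines are contended.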


CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164211   164211        10.9678  10.9678    init_file
155663   319874        10.3969  21.3647    d_alloc
147596   467470         9.8581  31.2228    _atomic_dec_and_lock
92993    560463         6.2111  37.4339    inet_create
73495    633958         4.9088  42.3427    kmem_cache_alloc
46353    680311         3.0960  45.4387    dentry_iput
46042    726353         3.0752  48.5139    tcp_close
42784    769137         2.8576  51.3715    kmem_cache_free
37074    806211         2.4762  53.8477    wake_up_inode
36375    842586         2.4295  56.2772    tcp_v4_init_sock
35212    877798         2.3518  58.6291    inotify_d_instantiate
33199    910997         2.2174  60.8465    sysenter_past_esp
31161    942158         2.0813  62.9277    d_instantiate
31000    973158         2.0705  64.9983    generic_forget_inode
28020    1001178        1.8715  66.8698    vfs_dq_drop
19007    1020185        1.2695  68.1393    __copy_from_user_ll
17513    1037698        1.1697  69.3090    new_inode
16957    1054655        1.1326  70.4415    __init_timer
16897    1071552        1.1286  71.5701    discard_slab
16115    1087667        1.0763  72.6464    d_kill
15542    1103209        1.0381  73.6845    __percpu_counter_add
13562    1116771        0.9058  74.5903    __slab_free
13276    1130047        0.8867  75.4771    __fput
12423    1142470        0.8297  76.3068    new_slab
11976    1154446        0.7999  77.1067    tcp_v4_destroy_sock
10889    1165335        0.7273  77.8340    inet_csk_destroy_sock
10516    1175851        0.7024  78.5364    alloc_inode
9979     1185830        0.6665  79.2029    sock_attach_fd
7980     1193810        0.5330  79.7359    drop_file_write_access
7609     1201419        0.5082  80.2441    alloc_fd
7584     1209003        0.5065  80.7506    sock_init_data
7164     1216167        0.4785  81.2291    add_partial
7107     1223274        0.4747  81.7038    sys_close
6997     1230271        0.4673  82.1711    mwait_idle


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [PATCH] fs: pipe/sockets/anon dentries should not have a parent
@ 2008-11-21 15:13                   ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-21 15:13 UTC (permalink / raw)
  To: David Miller, mingo
  Cc: cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra,
	Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 5668 bytes --]

Eric Dumazet a écrit :
> David Miller a écrit :
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 21 Nov 2008 09:51:32 +0100
>>
>>> Now, I wish sockets and pipes not going through dcache, not tbench 
>>> affair
>>> of course but real workloads...
>>>
>>> running 8 processes on a 8 way machine doing a
>>> for (;;)
>>>     close(socket(AF_INET, SOCK_STREAM, 0));
>>>
>>> is slow as hell, we hit so many contended cache lines ...
>>>
>>> ticket spin locks are slower in this case (dcache_lock for example
>>> is taken twice when we allocate a socket(), once in d_alloc(), 
>>> another one
>>> in d_instantiate())
>>
>> As you of course know, this used to be a ton worse.  At least now
>> these things are unhashed. :)
> 
> Well, this is dust compared to what we currently have.
> 
> To allocate a socket we :
> 0) Do the usual file manipulation (pretty scalable these days)
>   (but recent drop_file_write_access() and co slow down a bit)
> 1) allocate an inode with new_inode()
>    This function :
>     - locks inode_lock,
>     - dirties nr_inodes counter
>     - dirties inode_in_use list  (for sockets, I doubt it is useful)
>     - dirties superblock s_inodes.
>     - dirties last_ino counter
> All these are in different cache lines of course.
> 2) allocate a dentry
>   d_alloc() takes dcache_lock,
>   insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
>   dirties nr_dentry
> 3) d_instantiate() dentry  (dcache_lock taken again)
> 4) init_file() -> atomic_inc on sock_mnt->refcount (in case we want to 
> umount this vfs ...)
> 
> 
> 
> At close() time, we must undo all of this. It's even more expensive because
> _atomic_dec_and_lock() is heavily contended, and because two cache lines
> are touched when an element is deleted from a list.
> 
> for (i = 0; i < 1000*1000; i++)
>     close(socket(AF_INET, SOCK_STREAM, 0));
> 
> Cost if run on one cpu :
> 
> real    0m1.561s
> user    0m0.092s
> sys     0m1.469s
> 
> If run on 8 CPUS :
> 
> real    0m27.496s
> user    0m0.657s
> sys     3m39.092s
> 
> 

[PATCH] fs: pipe/sockets/anon dentries should not have a parent

Linking pipe/sockets/anon dentries to one root 'parent' has no functional
benefit, but it does have a scalability cost.

Without a parent, we avoid touching a cache line at allocation time (inside
d_alloc(), no need to touch root->d_count) and again at freeing time (in
d_kill(), decrementing d_count). We also avoid an expensive
atomic_dec_and_lock() call on the root dentry.

If we correct dnotify_parent() and inotify_d_instantiate() to take into account
a NULL d_parent, we can call d_alloc() with a NULL parent instead of the root dentry.

Before the patch, the time to run 8 million close(socket()) calls on 8 CPUs was:

real    0m27.496s
user    0m0.657s
sys     3m39.092s

After patch :

real    0m23.997s
user    0m0.682s
sys     3m11.193s
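
To put the timings above in proportion, here is a small arithmetic sketch. The
helper names relative_reduction() and slowdown_factor() are made up for this
illustration, and the 1.561s one-CPU figure is taken from the quoted text
earlier in this mail.

```c
#include <assert.h>

/* Fractional reduction going from `before` to `after` (0.10 means 10%). */
double relative_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before;
}

/* How many times slower `measured` is than `ideal`. */
double slowdown_factor(double measured, double ideal)
{
	assert(ideal > 0.0);
	return measured / ideal;
}
```

With the numbers reported here, relative_reduction(27.496, 23.997) is about
0.127 for wall-clock time and relative_reduction(219.092, 191.193) is about
0.127 for system time, so the patch saves roughly 12.7% on both.
slowdown_factor(27.496, 1.561) is about 17.6 before the patch and
slowdown_factor(23.997, 1.561) is about 15.4 after it, taking the one-CPU
run as the ideal because each CPU executes its own one million iterations.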


Old oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252         6.1493  37.8039    inet_create
74245    637497         4.9831  42.7871    kmem_cache_alloc
46702    684199         3.1345  45.9216    dentry_iput
46186    730385         3.0999  49.0215    tcp_close
42824    773209         2.8742  51.8957    kmem_cache_free
37275    810484         2.5018  54.3975    wake_up_inode
36553    847037         2.4533  56.8508    tcp_v4_init_sock
35661    882698         2.3935  59.2443    inotify_d_instantiate
32998    915696         2.2147  61.4590    sysenter_past_esp
31442    947138         2.1103  63.5693    d_instantiate
31303    978441         2.1010  65.6703    generic_forget_inode
27533    1005974        1.8479  67.5183    vfs_dq_drop
24237    1030211        1.6267  69.1450    sock_attach_fd
19290    1049501        1.2947  70.4397    __copy_from_user_ll


New oprofile :
CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
147287   147287        10.3984  10.3984    new_inode
144884   292171        10.2287  20.6271    inet_create
93670    385841         6.6131  27.2402    init_file
89852    475693         6.3435  33.5837    wake_up_inode
80910    556603         5.7122  39.2959    kmem_cache_alloc
53588    610191         3.7833  43.0792    _atomic_dec_and_lock
44341    654532         3.1305  46.2096    generic_forget_inode
38710    693242         2.7329  48.9425    kmem_cache_free
37605    730847         2.6549  51.5974    tcp_v4_init_sock
37228    768075         2.6283  54.2257    d_alloc
34085    802160         2.4064  56.6321    tcp_close
32550    834710         2.2980  58.9301    sysenter_past_esp
25931    860641         1.8307  60.7608    vfs_dq_drop
24458    885099         1.7267  62.4875    d_kill
22015    907114         1.5542  64.0418    dentry_iput
18877    925991         1.3327  65.3745    __copy_from_user_ll
17873    943864         1.2618  66.6363    mwait_idle

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c |    2 +-
 fs/dnotify.c     |    2 +-
 fs/inotify.c     |    2 +-
 fs/pipe.c        |    2 +-
 net/socket.c     |    2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

[-- Attachment #2: null_parent.patch --]
[-- Type: text/plain, Size: 2076 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..22cce87 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -92,7 +92,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(NULL, &this);
 	if (!dentry)
 		goto err_put_unused_fd;
 
diff --git a/fs/dnotify.c b/fs/dnotify.c
index 676073b..66066a3 100644
--- a/fs/dnotify.c
+++ b/fs/dnotify.c
@@ -173,7 +173,7 @@ void dnotify_parent(struct dentry *dentry, unsigned long event)
 
 	spin_lock(&dentry->d_lock);
 	parent = dentry->d_parent;
-	if (parent->d_inode->i_dnotify_mask & event) {
+	if (parent && parent->d_inode->i_dnotify_mask & event) {
 		dget(parent);
 		spin_unlock(&dentry->d_lock);
 		__inode_dir_notify(parent->d_inode, event);
diff --git a/fs/inotify.c b/fs/inotify.c
index 7bbed1b..9f051bb 100644
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -270,7 +270,7 @@ void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
 
 	spin_lock(&entry->d_lock);
 	parent = entry->d_parent;
-	if (parent->d_inode && inotify_inode_watched(parent->d_inode))
+	if (parent && parent->d_inode && inotify_inode_watched(parent->d_inode))
 		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
 	spin_unlock(&entry->d_lock);
 }
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4b961bc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -926,7 +926,7 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (!dentry)
 		goto err_inode;
 
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b84de7d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -373,7 +373,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (unlikely(!dentry))
 		return -ENOMEM;
 

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
@ 2008-11-21 15:21                     ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-21 15:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, cl, rjw, linux-kernel, kernel-testers, efault,
	a.p.zijlstra, Linux Netdev List


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Before patch, time to run 8 millions of close(socket()) calls on 8 
> CPUS was :
>
> real    0m27.496s
> user    0m0.657s
> sys     3m39.092s
>
> After patch :
>
> real    0m23.997s
> user    0m0.682s
> sys     3m11.193s

cool :-)

What would it take to get it down to:

>> Cost if run on one cpu :
>>
>> real    0m1.561s
>> user    0m0.092s
>> sys     0m1.469s

i guess asking for a wall-clock cost of 1.561/8 would be too much? :)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
@ 2008-11-21 15:28                       ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-21 15:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, cl, rjw, linux-kernel, kernel-testers, efault,
	a.p.zijlstra, Linux Netdev List

Ingo Molnar a écrit :
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Before patch, time to run 8 millions of close(socket()) calls on 8 
>> CPUS was :
>>
>> real    0m27.496s
>> user    0m0.657s
>> sys     3m39.092s
>>
>> After patch :
>>
>> real    0m23.997s
>> user    0m0.682s
>> sys     3m11.193s
> 
> cool :-)
> 
> What would it take to get it down to:
> 
>>> Cost if run on one cpu :
>>>
>>> real    0m1.561s
>>> user    0m0.092s
>>> sys     0m1.469s
> 
> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
> 

It might be possible, depending on the level of hackery I am allowed to inject
in fs/dcache.c and fs/inode.c :)

The wall-clock target would be 1.56 s rather than 1.56/8 s, since each CPU
runs its own loop of one million iterations.



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
@ 2008-11-21 15:34                         ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-21 15:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, cl, rjw, linux-kernel, kernel-testers, efault,
	a.p.zijlstra, Linux Netdev List


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Ingo Molnar a écrit :
>> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>>
>>> Before patch, time to run 8 millions of close(socket()) calls on 8  
>>> CPUS was :
>>>
>>> real    0m27.496s
>>> user    0m0.657s
>>> sys     3m39.092s
>>>
>>> After patch :
>>>
>>> real    0m23.997s
>>> user    0m0.682s
>>> sys     3m11.193s
>>
>> cool :-)
>>
>> What would it take to get it down to:
>>
>>>> Cost if run on one cpu :
>>>>
>>>> real    0m1.561s
>>>> user    0m0.092s
>>>> sys     0m1.469s
>>
>> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
>>
>
> It might be possible, depending on the level of hackery I am allowed 
> to inject in fs/dcache.c and fs/inode.c :)

I think being able to open+close sockets in a scalable way is an 
undisputed prime-time workload on Linux. The numbers you showed look 
horrible.

Once you can show how much faster it could go via hacks, it should 
only be a matter of time to achieve that safely and cleanly.

> wall cost of 1.56 (each cpu runs one loop of one million iterations)

(indeed.)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
  2008-11-21 15:13                   ` Eric Dumazet
  (?)
  (?)
@ 2008-11-21 15:36                   ` Christoph Hellwig
  2008-11-21 17:58                     ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet
  -1 siblings, 1 reply; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-21 15:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers,
	efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel

On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote:
> [PATCH] fs: pipe/sockets/anon dentries should not have a parent
>
> Linking pipe/sockets/anon dentries to one root 'parent' has no functional
> impact at all, but a scalability one.
>
> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
> We avoid an expensive atomic_dec_and_lock() call on the root dentry.
>
> If we correct dnotify_parent() and inotify_d_instantiate() to take into account
> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry.

Sorry folks, but a NULL d_parent is a no-go from the VFS perspective,
but you can set d_parent to the dentry itself which is the magic used
for root of tree dentries.  They should also be marked
DCACHE_DISCONNECTED to make sure this is not unexpected.

And this kind of stuff really needs to go through -fsdevel.
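
A minimal userspace sketch of the self-parent convention described above. It assumes a stripped-down struct dentry; make_self_rooted() and depth_to_root() are hypothetical helpers for illustration, and only the IS_ROOT() test mirrors the kernel macro of the same name. The point is that a tree walk terminates on "parent == self" and never has to check for NULL:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct dentry (illustration only). */
struct dentry {
	struct dentry *d_parent;
	const char *d_name;
};

/* Same test as the kernel's IS_ROOT(): a root dentry is its own parent. */
#define IS_ROOT(x) ((x) == (x)->d_parent)

/* Hypothetical helper: make a dentry the root of its own one-node tree. */
static void make_self_rooted(struct dentry *d, const char *name)
{
	assert(d != NULL);
	d->d_name = name;
	d->d_parent = d;
}

/* Hypothetical helper: count the steps up to the root of the tree. */
static unsigned int depth_to_root(const struct dentry *d)
{
	unsigned int depth = 0;

	/* The walk stops at the self-parented dentry; no NULL test needed. */
	while (!IS_ROOT(d)) {
		d = d->d_parent;
		depth++;
	}
	return depth;
}
```

With this convention, a pipe or socket dentry set up by make_self_rooted() simply looks like the root of a tiny tree to any code that walks d_parent.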

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21 16:11             ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-11-21 16:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	David S. Miller

On Fri, 21 Nov 2008, Ingo Molnar wrote:

> > 2.6.22:
> > Throughput 2526.15 MB/sec 8 procs
> > 2.6.28-rc5:
> > Throughput 2486.2 MB/sec 8 procs
> >
> > 8p Dell 1950 and the number of processors specified on the tbench
> > command line.
>
> ... so might be worth a test. Just to satisfy our curiosity and to
> possibly close the entry :-)

Ahh.. Wow.... net-next gets us:

Throughput 2685.17 MB/sec 8 procs



^ permalink raw reply	[flat|nested] 349+ messages in thread

* [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
  2008-11-21 15:36                   ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig
@ 2008-11-21 17:58                     ` Eric Dumazet
  2008-11-21 18:43                         ` Matthew Wilcox
  0 siblings, 1 reply; 349+ messages in thread
From: Eric Dumazet @ 2008-11-21 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers,
	efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5101 bytes --]

Christoph Hellwig wrote:
> On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote:
>> [PATCH] fs: pipe/sockets/anon dentries should not have a parent
>>
>> Linking pipe/sockets/anon dentries to one root 'parent' has no functional
>> impact at all, but a scalability one.
>>
>> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
>> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
>> We avoid an expensive atomic_dec_and_lock() call on the root dentry.
>>
>> If we correct dnotify_parent() and inotify_d_instantiate() to take into account
>> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry.
> 
> Sorry folks, but a NULL d_parent is a no-go from the VFS perspective,
> but you can set d_parent to the dentry itself which is the magic used
> for root of tree dentries.  They should also be marked
> DCACHE_DISCONNECTED to make sure this is not unexpected.
> 
> And this kind of stuff really needs to go through -fsdevel.

Thanks, Christoph, for your review, and sorry for forgetting fsdevel.

d_alloc_root() is not an option here, since we also want such dentries
to be unhashed. So here is a second version, which introduces a new
helper, d_alloc_unhashed(), to be used by pipes, sockets and anon inodes.

I got even better numbers, probably because dnotify/inotify no longer
need the NULL d_parent test.

[PATCH] fs: pipe/sockets/anon dentries should have themselves as parent


Linking pipe/sockets/anon dentries to one root 'parent' has no functional
impact at all, but a scalability one.

We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count).
We avoid an expensive atomic_dec_and_lock() call on the root dentry.

We add d_alloc_unhashed(const char *name, struct inode *inode) helper
to be used by pipes/socket/anon. This function is about the same as
d_alloc_root() but for unhashed entries.
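
A hedged userspace model of the scalability argument above, using C11 atomics. The struct and the mock_* functions are hypothetical stand-ins, not kernel code: allocating under a shared root writes to the root's reference count on every allocation and free, while a self-parented dentry touches only its own memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, heavily reduced model of a dentry. */
struct mock_dentry {
	atomic_int d_count;
	struct mock_dentry *d_parent;
};

/* Models allocation under a shared root: the parent is pinned by
 * incrementing its counter, a write to a cache line every CPU shares. */
static void mock_alloc_with_parent(struct mock_dentry *d,
				   struct mock_dentry *parent)
{
	assert(parent != NULL);
	atomic_init(&d->d_count, 1);
	atomic_fetch_add(&parent->d_count, 1);
	d->d_parent = parent;
}

/* Models the self-parent scheme: no other object is written. */
static void mock_alloc_self_parent(struct mock_dentry *d)
{
	atomic_init(&d->d_count, 1);
	d->d_parent = d;
}

/* Models freeing: only a dentry with a distinct parent drops a
 * reference on that parent. */
static void mock_free(struct mock_dentry *d)
{
	if (d->d_parent != d)
		atomic_fetch_sub(&d->d_parent->d_count, 1);
}
```

In this model the root's counter changes on every mock_alloc_with_parent()/mock_free() pair and never changes for self-parented dentries, which is the cache-line traffic the patch aims to remove.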

Before the patch, the time to run 8 * 1 million close(socket()) calls on 8 CPUs was:

real    0m27.496s
user    0m0.657s
sys     3m39.092s

After the patch:

real    0m23.843s
user    0m0.616s
sys     3m9.732s


Old oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252         6.1493  37.8039    inet_create
74245    637497         4.9831  42.7871    kmem_cache_alloc
46702    684199         3.1345  45.9216    dentry_iput
46186    730385         3.0999  49.0215    tcp_close
42824    773209         2.8742  51.8957    kmem_cache_free
37275    810484         2.5018  54.3975    wake_up_inode
36553    847037         2.4533  56.8508    tcp_v4_init_sock
35661    882698         2.3935  59.2443    inotify_d_instantiate
32998    915696         2.2147  61.4590    sysenter_past_esp
31442    947138         2.1103  63.5693    d_instantiate
31303    978441         2.1010  65.6703    generic_forget_inode
27533    1005974        1.8479  67.5183    vfs_dq_drop
24237    1030211        1.6267  69.1450    sock_attach_fd
19290    1049501        1.2947  70.4397    __copy_from_user_ll


New oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
148703   148703        10.8581  10.8581    inet_create
116680   265383         8.5198  19.3779    new_inode
108912   374295         7.9526  27.3306    init_file
82911    457206         6.0541  33.3846    kmem_cache_alloc
65690    522896         4.7966  38.1812    wake_up_inode
53286    576182         3.8909  42.0721    _atomic_dec_and_lock
43814    619996         3.1992  45.2713    generic_forget_inode
41993    661989         3.0663  48.3376    d_alloc
41244    703233         3.0116  51.3492    kmem_cache_free
39244    742477         2.8655  54.2148    tcp_v4_init_sock
37402    779879         2.7310  56.9458    tcp_close
33336    813215         2.4342  59.3800    sysenter_past_esp
28596    841811         2.0880  61.4680    inode_has_buffers
25769    867580         1.8816  63.3496    d_kill
22606    890186         1.6507  65.0003    dentry_iput
20224    910410         1.4767  66.4770    vfs_dq_drop
19800    930210         1.4458  67.9228    __copy_from_user_ll

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |    9 +--------
 fs/dcache.c            |   31 +++++++++++++++++++++++++++++++
 fs/pipe.c              |   10 +---------
 include/linux/dcache.h |    1 +
 net/socket.c           |   10 +---------
 5 files changed, 35 insertions(+), 26 deletions(-)

[-- Attachment #2: d_alloc_unhashed.patch --]
[-- Type: text/plain, Size: 4728 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags)
 {
-	struct qstr this;
 	struct dentry *dentry;
 	struct file *file;
 	int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	 * using the inode sequence number.
 	 */
 	error = -ENOMEM;
-	this.name = name;
-	this.len = strlen(name);
-	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_unhashed(name, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..a5477fd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,37 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @inode: inode to allocate the dentry for
+ * @name: dentry name
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+ 
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+	struct qstr q = { .name = name, .len = strlen(name) };
+	struct dentry *res;
+
+	res = d_alloc(NULL, &q);
+	if (res) {
+		res->d_sb = inode->i_sb;
+		res->d_parent = res;
+		/*
+		 * We dont want to push this dentry into global dentry hash table.
+		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
+		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
+		 */
+		res->d_flags &= ~DCACHE_UNHASHED;
+		res->d_flags |= DCACHE_DISCONNECTED;
+		d_instantiate(res, inode);
+	}
+	return res;
+}
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);
 
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21 18:06               ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-11-21 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	David S. Miller

AIM9 results:
		TCP		UDP
2.6.22		104868.00	489970.03
2.6.28-rc5	110007.00	518640.00
net-next	108207.00	514790.00

net-next loses here for some reason against 2.6.28-rc5. But the numbers
are better than 2.6.22 in any case.





^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21 18:16                 ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-21 18:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	David S. Miller

Christoph Lameter wrote:
> AIM9 results:
> 		TCP		UDP
> 2.6.22		104868.00	489970.03
> 2.6.28-rc5	110007.00	518640.00
> net-next	108207.00	514790.00
> 
> net-next loses here for some reason against 2.6.28-rc5. But the numbers
> are better than 2.6.22 in any case.
> 

I found that on current net-next, running oprofile in the background can give
better bench results. That's really curious... no?


So the single loop on close(socket()), on all my 8 cpus, is almost 10% faster if oprofile
is running... (20 secs instead of 23 secs)



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
@ 2008-11-21 18:19                   ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-21 18:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	David S. Miller

Eric Dumazet wrote:
> Christoph Lameter wrote:
>> AIM9 results:
>>         TCP        UDP
>> 2.6.22        104868.00    489970.03
>> 2.6.28-rc5    110007.00    518640.00
>> net-next    108207.00    514790.00
>>
>> net-next loses here for some reason against 2.6.28-rc5. But the numbers
>> are better than 2.6.22 in any case.
>>
> 
> I found that on current net-next, running oprofile in background can 
> give better bench
> results. Thats really curious... no ?
> 
> 
> So the single loop on close(socket()), on all my 8 cpus is almost 10% 
> faster if oprofile
> is running... (20 secs instead of 23 secs)
> 

Oh well, that's normal: when a cpu is interrupted by an NMI and
distracted by oprofile code, it doesn't fight with other cpus on dcache_lock
and other contended cache lines...



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
@ 2008-11-21 18:43                         ` Matthew Wilcox
  0 siblings, 0 replies; 349+ messages in thread
From: Matthew Wilcox @ 2008-11-21 18:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, David Miller, mingo, cl, rjw, linux-kernel,
	kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro,
	linux-fsdevel

On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote:
> +/**
> + * d_alloc_unhashed - allocate unhashed dentry
> + * @inode: inode to allocate the dentry for
> + * @name: dentry name

It's normal to list the parameters in the order they're passed to the
function.  Not sure if we have a tool that checks for this or not --
Randy?

> + *
> + * Allocate an unhashed dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory. Unhashed dentries have themselves as a parent.
> + */
> + 
> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
> +{
> +	struct qstr q = { .name = name, .len = strlen(name) };
> +	struct dentry *res;
> +
> +	res = d_alloc(NULL, &q);
> +	if (res) {
> +		res->d_sb = inode->i_sb;
> +		res->d_parent = res;
> +		/*
> +		 * We dont want to push this dentry into global dentry hash table.
> +		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> +		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
> +		 */

Line length ... as checkpatch would have warned you ;-)

And there are several other grammatical nitpicks with this comment.  Try
this:

		/*
		 * We don't want to put this dentry in the global dentry
		 * hash table, so we pretend the dentry is already hashed
		 * by unsetting DCACHE_UNHASHED.  This permits
		 * /proc/$pid/fd/XXX to work for sockets, pipes and
		 * anonymous files (signalfd, timerfd, etc).
		 */

> +		res->d_flags &= ~DCACHE_UNHASHED;
> +		res->d_flags |= DCACHE_DISCONNECTED;

Is this really better than:

		res->d_flags = res->d_flags & ~DCACHE_UNHASHED |
						DCACHE_DISCONNECTED;

Anyway, nice cleanup.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
  2008-11-21 18:43                         ` Matthew Wilcox
  (?)
@ 2008-11-23  3:53                         ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-23  3:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, David Miller, mingo, cl, rjw, linux-kernel,
	kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5733 bytes --]

Matthew Wilcox wrote:
> On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote:
>> +/**
>> + * d_alloc_unhashed - allocate unhashed dentry
>> + * @inode: inode to allocate the dentry for
>> + * @name: dentry name
> 
> It's normal to list the parameters in the order they're passed to the
> function.  Not sure if we have a tool that checks for this or not --
> Randy?

Yes, no problem, better to have the same order.

> 
>> + *
>> + * Allocate an unhashed dentry for the inode given. The inode is
>> + * instantiated and returned. %NULL is returned if there is insufficient
>> + * memory. Unhashed dentries have themselves as a parent.
>> + */
>> + 
>> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
>> +{
>> +	struct qstr q = { .name = name, .len = strlen(name) };
>> +	struct dentry *res;
>> +
>> +	res = d_alloc(NULL, &q);
>> +	if (res) {
>> +		res->d_sb = inode->i_sb;
>> +		res->d_parent = res;
>> +		/*
>> +		 * We dont want to push this dentry into global dentry hash table.
>> +		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
>> +		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
>> +		 */
> 
> Line length ... as checkpatch would have warned you ;-)
> 
> And there are several other grammatical nitpicks with this comment.  Try
> this:
> 
> 		/*
> 		 * We don't want to put this dentry in the global dentry
> 		 * hash table, so we pretend the dentry is already hashed
> 		 * by unsetting DCACHE_UNHASHED.  This permits
> 		 * /proc/$pid/fd/XXX to work for sockets, pipes and
> 		 * anonymous files (signalfd, timerfd, etc).
> 		 */

Yes, this is better.

> 
>> +		res->d_flags &= ~DCACHE_UNHASHED;
>> +		res->d_flags |= DCACHE_DISCONNECTED;
> 
> Is this really better than:
> 
> 		res->d_flags = res->d_flags & ~DCACHE_UNHASHED |
> 						DCACHE_DISCONNECTED;

Well, I personally prefer the two lines, intention is more readable :)
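
For reference, a small sketch showing that the two spellings discussed above compute the same value. The flag values below are assumed for illustration (the authoritative definitions are in include/linux/dcache.h), and both function names are hypothetical:

```c
#include <assert.h>

/* Illustrative flag values only; see include/linux/dcache.h. */
#define DCACHE_DISCONNECTED 0x0004u
#define DCACHE_UNHASHED     0x0010u

/* Two-statement form, as in the patch. */
static unsigned int flags_two_statements(unsigned int d_flags)
{
	d_flags &= ~DCACHE_UNHASHED;
	d_flags |= DCACHE_DISCONNECTED;
	return d_flags;
}

/* Single-expression form, as suggested in review. */
static unsigned int flags_one_expression(unsigned int d_flags)
{
	/* '&' binds tighter than '|' in C, so the parentheses change
	 * nothing; they only make the grouping explicit. */
	return (d_flags & ~DCACHE_UNHASHED) | DCACHE_DISCONNECTED;
}
```

Since the results are identical, the choice is purely one of readability.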

> 
> Anyway, nice cleanup.
> 

Thanks Matthew, here is an updated version of the patch.

[PATCH] fs: pipe/sockets/anon dentries should have themselves as parent


Linking pipe/sockets/anon dentries to one root 'parent' has no functional
impact at all, but a scalability one.

We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count).
We avoid an expensive atomic_dec_and_lock() call on the root dentry.

We add d_alloc_unhashed(const char *name, struct inode *inode) helper
to be used by pipes/socket/anon. This function is about the same as
d_alloc_root() but for unhashed entries.

Before the patch, the time to run 8 * 1 million close(socket()) calls on 8 CPUs was:

real    0m27.496s
user    0m0.657s
sys     3m39.092s

After the patch:

real    0m23.843s
user    0m0.616s
sys     3m9.732s
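
The workload timed above can be approximated with a sketch like the following. This is not the program used for the measurements: socket_churn() and run_workers() are hypothetical names, AF_INET/SOCK_STREAM is an assumption based on the TCP symbols in the profiles, and CPU pinning is left out.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Open and immediately close a TCP socket `iterations` times.
 * Returns the number of successful open/close pairs. */
static long socket_churn(long iterations)
{
	long ok = 0;

	assert(iterations >= 0);
	for (long i = 0; i < iterations; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;
		if (close(fd) == 0)
			ok++;
	}
	return ok;
}

/* Fork `nproc` workers that each run socket_churn(iterations).
 * Returns how many workers completed every iteration. */
static int run_workers(int nproc, long iterations)
{
	int started = 0, good = 0;

	for (int i = 0; i < nproc; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(socket_churn(iterations) == iterations ? 0 : 1);
		if (pid > 0)
			started++;
	}
	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			good++;
	}
	return good;
}
```

Calling run_workers(8, 1000000) from a small main() and running the binary under time(1) gives figures of the same kind as the real/user/sys numbers above, though absolute values depend on the machine and kernel.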


Old oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252         6.1493  37.8039    inet_create
74245    637497         4.9831  42.7871    kmem_cache_alloc
46702    684199         3.1345  45.9216    dentry_iput
46186    730385         3.0999  49.0215    tcp_close
42824    773209         2.8742  51.8957    kmem_cache_free
37275    810484         2.5018  54.3975    wake_up_inode
36553    847037         2.4533  56.8508    tcp_v4_init_sock
35661    882698         2.3935  59.2443    inotify_d_instantiate
32998    915696         2.2147  61.4590    sysenter_past_esp
31442    947138         2.1103  63.5693    d_instantiate
31303    978441         2.1010  65.6703    generic_forget_inode
27533    1005974        1.8479  67.5183    vfs_dq_drop
24237    1030211        1.6267  69.1450    sock_attach_fd
19290    1049501        1.2947  70.4397    __copy_from_user_ll


New oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
148703   148703        10.8581  10.8581    inet_create
116680   265383         8.5198  19.3779    new_inode
108912   374295         7.9526  27.3306    init_file
82911    457206         6.0541  33.3846    kmem_cache_alloc
65690    522896         4.7966  38.1812    wake_up_inode
53286    576182         3.8909  42.0721    _atomic_dec_and_lock
43814    619996         3.1992  45.2713    generic_forget_inode
41993    661989         3.0663  48.3376    d_alloc
41244    703233         3.0116  51.3492    kmem_cache_free
39244    742477         2.8655  54.2148    tcp_v4_init_sock
37402    779879         2.7310  56.9458    tcp_close
33336    813215         2.4342  59.3800    sysenter_past_esp
28596    841811         2.0880  61.4680    inode_has_buffers
25769    867580         1.8816  63.3496    d_kill
22606    890186         1.6507  65.0003    dentry_iput
20224    910410         1.4767  66.4770    vfs_dq_drop
19800    930210         1.4458  67.9228    __copy_from_user_ll

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |    9 +--------
 fs/dcache.c            |   33 +++++++++++++++++++++++++++++++++
 fs/pipe.c              |   10 +---------
 include/linux/dcache.h |    1 +
 net/socket.c           |   10 +---------
 5 files changed, 37 insertions(+), 26 deletions(-)

[-- Attachment #2: d_alloc_unhashed2.patch --]
[-- Type: text/plain, Size: 4788 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags)
 {
-	struct qstr this;
 	struct dentry *dentry;
 	struct file *file;
 	int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	 * using the inode sequence number.
 	 */
 	error = -ENOMEM;
-	this.name = name;
-	this.len = strlen(name);
-	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_unhashed(name, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..43ef88d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,39 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @name: dentry name
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+ 
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+	struct qstr q = { .name = name, .len = strlen(name) };
+	struct dentry *res;
+
+	res = d_alloc(NULL, &q);
+	if (res) {
+		res->d_sb = inode->i_sb;
+		res->d_parent = res;
+		/*
+		 * We dont want to push this dentry into global dentry
+		 * hash table, so we pretend the dentry is already hashed
+		 * by unsetting DCACHE_UNHASHED. This permits
+		 * /proc/$pid/fd/XXX to work for sockets, pipes, and
+		 * anonymous files (signalfd, timerfd, ...)
+		 */
+		res->d_flags &= ~DCACHE_UNHASHED;
+		res->d_flags |= DCACHE_DISCONNECTED;
+		d_instantiate(res, inode);
+	}
+	return res;
+}
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);
 
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,


* [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
  2008-11-21 15:34                         ` Ingo Molnar
  (?)
@ 2008-11-26 23:27                         ` Eric Dumazet
  2008-11-27  1:37                             ` Christoph Lameter
                                             ` (8 more replies)
  -1 siblings, 9 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-26 23:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 1.6 seconds)

Long version :

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
 but would be faster if 'struct file' were using SLAB_DESTROY_BY_RCU
 and avoided the call_rcu() cache killer)

1) allocate an inode with new_inode()
 This function :
  - locks inode_lock,
  - dirties nr_inodes counter
  - dirties inode_in_use list  (for sockets/pipes, this is useless)
  - dirties superblock s_inodes. 
  - dirties last_ino counter
All these are in different cache lines unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
inserts the dentry on its parent's list (dirtying sock_mnt->mnt_sb->s_root),
and dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo all of this. It's even more expensive because
of the _atomic_dec_and_lock() calls, which are heavily contended, and because
two extra cache lines are touched when an element is deleted from a list
(the previous and next items).

This is really bad, since sockets/pipes don't need to be visible in the dcache
or on a per-superblock inode list.

This patch series gets rid of all contended cache lines for sockets, pipes
and anonymous fds (signalfd, timerfd, ...)

Sample program :

for (i = 0; i < 1000000; i++)
  close(socket(AF_INET, SOCK_STREAM, 0));
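
For readers who want to reproduce the loop above, a self-contained version
could look like the sketch below. The helper name socket_bench() and the
counting of successful opens are illustrative additions, not part of the
original benchmark, which ignores errors.

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Open and immediately close `iterations` TCP sockets.
 * Returns how many sockets were successfully created
 * (hypothetical helper; the original loop does not check errors).
 */
long socket_bench(long iterations)
{
	long ok = 0;
	long i;

	assert(iterations >= 0);
	for (i = 0; i < iterations; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;	/* e.g. EMFILE, or no network support */
		close(fd);
		ok++;
	}
	return ok;
}
```

Calling socket_bench(1000000) from main() in 8 concurrent processes, under
time(1), would correspond to the "socket8" run described below.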

Cost if one cpu runs the program :

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on an 8 CPU machine
(benchmark named socket8) :

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close


This patch series avoids all contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real    1.325s   (instead of 1.561s)
user    0.091s
sys     1.234s


If run on 8 CPUS :

real    2.229s     <<<< instead of 27.496s >>>
user    0.695s
sys     16.903s

Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 2999.74 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
143791   143791        11.7849  11.7849    __slab_free
128404   272195        10.5238  22.3087    add_partial
99150    371345         8.1262  30.4349    kmem_cache_alloc
52031    423376         4.2644  34.6993    sysenter_past_esp
47752    471128         3.9137  38.6130    kmem_cache_free
47429    518557         3.8872  42.5002    tcp_close
34376    552933         2.8174  45.3176    __percpu_counter_add
29046    581979         2.3806  47.6982    copy_from_user
28249    610228         2.3152  50.0134    init_timer
26220    636448         2.1490  52.1624    __slab_alloc
23402    659850         1.9180  54.0803    discard_slab
20560    680410         1.6851  55.7654    __call_rcu
18288    698698         1.4989  57.2643    d_alloc
16425    715123         1.3462  58.6104    get_empty_filp
16237    731360         1.3308  59.9412    __fput
15729    747089         1.2891  61.2303    alloc_fd
15021    762110         1.2311  62.4614    alloc_inode
14690    776800         1.2040  63.6654    sys_close
14666    791466         1.2020  64.8674    inet_create
13638    805104         1.1178  65.9852    dput
12503    817607         1.0247  67.0099    iput_special
12231    829838         1.0024  68.0123    lock_sock_nested
12210    842048         1.0007  69.0130    fd_install
12137    854185         0.9947  70.0078    d_alloc_special
12058    866243         0.9883  70.9960    sock_init_data
11200    877443         0.9179  71.9140    release_sock
11114    888557         0.9109  72.8248    inotify_d_instantiate

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 at boot, or we use Christoph Lameter's
patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.

New cost if run on one cpu :

real    1.307s
user    0.094s
sys     1.214s

If run on 8 CPUS :

real    1.625s     <<<< instead of 27.496s or 2.229s >>>
user    0.771s
sys     12.061s

Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
108005   108005        11.0758  11.0758    kmem_cache_alloc
52023    160028         5.3349  16.4107    sysenter_past_esp
47363    207391         4.8570  21.2678    tcp_close
45430    252821         4.6588  25.9266    kmem_cache_free
36566    289387         3.7498  29.6764    __percpu_counter_add
36085    325472         3.7005  33.3769    __slab_free
29185    354657         2.9929  36.3698    copy_from_user
28210    382867         2.8929  39.2627    init_timer
25663    408530         2.6317  41.8944    d_alloc_special
22360    430890         2.2930  44.1874    cap_file_alloc_security
19237    450127         1.9727  46.1601    __call_rcu
19097    469224         1.9584  48.1185    d_alloc
16962    486186         1.7394  49.8580    alloc_fd
16315    502501         1.6731  51.5311    __fput
16102    518603         1.6512  53.1823    get_empty_filp
14954    533557         1.5335  54.7158    inet_create
14468    548025         1.4837  56.1995    alloc_inode
14198    562223         1.4560  57.6555    sys_close
13905    576128         1.4259  59.0814    dput
12262    588390         1.2575  60.3389    lock_sock_nested
12203    600593         1.2514  61.5903    sock_attach_fd
12147    612740         1.2457  62.8360    iput_special
12049    624789         1.2356  64.0716    fd_install
12033    636822         1.2340  65.3056    sock_init_data
11999    648821         1.2305  66.5361    release_sock
11231    660052         1.1517  67.6878    inotify_d_instantiate
11068    671120         1.1350  68.8228    inet_csk_destroy_sock


This patch series contains 6 patches, against the net-next-2.6 tree
(because this tree already contains network improvements on this
subject, but it should apply on other trees)

[PATCH 1/6] fs: Introduce a per_cpu nr_dentry

Adding a per_cpu nr_dentry avoids cache line ping pongs between
cpus to maintain this metric.

We centralize decrements of nr_dentry in d_free(),
and increments in d_alloc().

d_alloc() can avoid taking dcache_lock if parent is NULL
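
The per-cpu counter idea can be sketched in plain C as follows. This is a
userspace illustration only: NR_SLOTS stands in for the number of possible
CPUs, CACHE_LINE is an assumed line size, and the counter_*() names are
hypothetical. The kernel patch uses DEFINE_PER_CPU() and get_cpu_var()
instead of an explicit array.

```c
#include <assert.h>

#define NR_SLOTS   8	/* stand-in for the number of possible CPUs */
#define CACHE_LINE 64	/* assumed cache line size */

/* One counter per slot, padded so that slots do not share a cache line. */
struct padded_counter {
	int value;
	char pad[CACHE_LINE - sizeof(int)];
};

static struct padded_counter nr_objects[NR_SLOTS];

/* Fast path on allocation: touches only the caller's own slot. */
void counter_inc(int slot)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	nr_objects[slot].value++;
}

/* Fast path on free: may run on a different slot than the matching inc. */
void counter_dec(int slot)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	nr_objects[slot].value--;
}

/*
 * Slow path (reading the statistic): sum all slots.  A single slot can
 * be negative, and an unlocked sum can be transiently negative, so the
 * result is clamped at 0.
 */
int counter_read(void)
{
	int slot, sum = 0;

	for (slot = 0; slot < NR_SLOTS; slot++)
		sum += nr_objects[slot].value;
	return sum < 0 ? 0 : sum;
}
```

The trade-off is that writers never share a cache line, while the reader
pays for a loop over all slots and gets only an approximate value.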


[PATCH 2/6] fs: Introduce special dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for this kind of file are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SPECIAL flag, to mark a dentry as
a special one (for sockets, pipes, anonymous fd), and a new
d_alloc_special(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_special() for
special dentries.

Differences between a special dentry and a normal one are :

1) A special dentry has the DCACHE_SPECIAL flag
2) A special dentry is its own parent.
  This avoids taking a reference on the 'root' dentry, shared
  by too many dentries.
3) It is not hashed into the global hash table
4) Its d_alias list is empty

Internally, dput() can avoid an expensive atomic_dec_and_lock()
for special dentries.


(socket8 bench result : from 27.5s to 25.5s) 
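
To make the fast path concrete, here is a toy model of the idea in plain C.
Everything in it (struct toy_dentry, TOY_SPECIAL, toy_put()) is hypothetical
and single-threaded; it only illustrates why a self-parented, unhashed
dentry can drop its last reference without a global lock, and is not the
code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SPECIAL 0x2	/* stand-in for the proposed DCACHE_SPECIAL */

struct toy_dentry {
	unsigned int flags;
	int count;			/* reference count */
	struct toy_dentry *parent;
};

/* Set up a dentry with properties 1) and 2) from the list above. */
void toy_init_special(struct toy_dentry *d)
{
	d->flags = TOY_SPECIAL;
	d->count = 1;
	d->parent = d;	/* own parent: no reference on a shared root */
}

/*
 * Drop a reference.  Returns true when the caller must free the dentry.
 * A special dentry is on no shared list, so nothing else needs to be
 * unlinked under a global lock.  For a normal dentry we only record that
 * the slow path (atomic_dec_and_lock() in the kernel) would be taken.
 */
bool toy_put(struct toy_dentry *d, int *slow_path_taken)
{
	assert(d->count > 0);
	if (!(d->flags & TOY_SPECIAL))
		(*slow_path_taken)++;
	return --d->count == 0;
}
```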

[PATCH 3/6] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get inode numbers.

Solve this problem by giving each cpu a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino.

Note : last_ino_get() method must be called with preemption disabled.

(socket8 bench result : 25.5s to 25s almost no differences, but 
this is because inode_lock cost is too heavy for the moment)
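
The batching scheme can be sketched in userspace C11 as below. This is an
illustration under assumptions: the names ino_get(), slot_next and slot_left
are hypothetical, NR_SLOTS stands in for the CPUs, and the sketch hands out
numbers in ascending order within a block, whereas the patch counts down
inside each block.

```c
#include <assert.h>
#include <stdatomic.h>

#define INO_BATCH 1024	/* block size from the patch description */
#define NR_SLOTS  4	/* stand-in for the number of CPUs */

static atomic_uint shared_last_ino;		/* the contended global counter */
static unsigned int slot_next[NR_SLOTS];	/* next number to hand out */
static unsigned int slot_left[NR_SLOTS];	/* numbers left in the block */

/*
 * Return a unique number.  The caller must own `slot` for the duration
 * of the call, which in the kernel corresponds to running with
 * preemption disabled.
 */
unsigned int ino_get(int slot)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	if (slot_left[slot] == 0) {
		/* Refill: one atomic operation per INO_BATCH allocations. */
		slot_next[slot] =
			atomic_fetch_add(&shared_last_ino, INO_BATCH) + 1;
		slot_left[slot] = INO_BATCH;
	}
	slot_left[slot]--;
	return slot_next[slot]++;
}
```

With this layout the first slot to allocate gets 1..1024, the next slot
gets 1025..2048, and so on; numbers are unique but no longer globally
sequential.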

[PATCH 4/6] fs: Introduce a per_cpu nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

(socket8 bench result : 25s to 20.5s)

[PATCH 5/6] fs: Introduce special inodes

 Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
 inodes allocation/freeing.

 In new_inode(), we test if the super block has the MS_SPECIAL flag set.
 If yes, we don't put the inode in the "inode_in_use" list nor the
 "sb->s_inodes" list. As inode_lock was taken only to protect these lists,
 we avoid it as well.

 Using iput_special() from dput_special() avoids taking inode_lock
 at freeing time.

 This patch has a very noticeable effect, because we avoid dirtying 
 of three contended cache lines in new_inode(), and five cache lines
 in iput()

Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we
really need a different flag.

(socket8 bench result : from 20.5s to 2.94s) 

[PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
refcounting on permanent system vfs.
Use this function for sockets, pipes, anonymous fds.

(socket8 bench result : from 2.94s to 2.23s)
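
The refcount bypass can be pictured with the following toy model. The names
(struct toy_mount, TOY_MNT_SPECIAL, toy_mntget(), toy_mntput()) are
hypothetical and the counter is a plain int rather than an atomic_t; it is
a sketch of the idea, assuming the flagged mount is never unmounted, not the
code from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MNT_SPECIAL 0x1	/* stand-in for the proposed MNT_SPECIAL */

struct toy_mount {
	int flags;
	int count;	/* would be an atomic_t in the kernel */
};

/* Take a reference, unless the mount is permanent. */
struct toy_mount *toy_mntget(struct toy_mount *mnt)
{
	if (mnt && !(mnt->flags & TOY_MNT_SPECIAL))
		mnt->count++;	/* the shared cache line is dirtied only here */
	return mnt;
}

/* Drop a reference, unless the mount is permanent. */
void toy_mntput(struct toy_mount *mnt)
{
	if (mnt && !(mnt->flags & TOY_MNT_SPECIAL)) {
		assert(mnt->count > 0);
		mnt->count--;
	}
}
```

For a flagged mount both calls only read the flags field, so every
open/close on any CPU stops writing to the same counter.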

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
Overall diffstat :

 fs/anon_inodes.c       |   19 +-----
 fs/dcache.c            |  106 ++++++++++++++++++++++++++++++++-------
 fs/fs-writeback.c      |    2
 fs/inode.c             |  101 +++++++++++++++++++++++++++++++------
 fs/pipe.c              |   28 +---------
 fs/super.c             |    9 +++
 include/linux/dcache.h |    2
 include/linux/fs.h     |    8 ++
 include/linux/mount.h  |    5 +
 kernel/sysctl.c        |    6 +-
 mm/page-writeback.c    |    2
 net/socket.c           |   27 +--------
 12 files changed, 212 insertions(+), 103 deletions(-)



* [PATCH 1/6] fs: Introduce a per_cpu nr_dentry
@ 2008-11-26 23:30                           ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-26 23:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 469 bytes --]

Adding a per_cpu nr_dentry avoids cache line ping pongs between
cpus to maintain this metric.

We centralize decrements of nr_dentry in d_free(),
and increments in d_alloc().

d_alloc() can avoid taking dcache_lock if parent is NULL

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/dcache.c        |   55 ++++++++++++++++++++++++++++---------------
 include/linux/fs.h |    2 +
 kernel/sysctl.c    |    2 -
 3 files changed, 40 insertions(+), 19 deletions(-)

[-- Attachment #2: per_cpu_nr_dentry.patch --]
[-- Type: text/plain, Size: 4782 bytes --]

diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..42ed9fc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,38 @@ static struct kmem_cache *dentry_cache __read_mostly;
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
 static struct hlist_head *dentry_hashtable __read_mostly;
+static DEFINE_PER_CPU(int, nr_dentry);
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int cpu;
+	int counter = 0;
+
+	for_each_possible_cpu(cpu)
+		counter += per_cpu(nr_dentry, cpu);
+	if (counter < 0)
+		counter = 0;
+	dentry_stat.nr_dentry = counter;
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void __d_free(struct dentry *dentry)
 {
 	WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +108,7 @@ static void d_callback(struct rcu_head *head)
 }
 
 /*
- * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
  */
 static void d_free(struct dentry *dentry)
 {
@@ -94,6 +119,8 @@ static void d_free(struct dentry *dentry)
 		__d_free(dentry);
 	else
 		call_rcu(&dentry->d_u.d_rcu, d_callback);
+	get_cpu_var(nr_dentry)--;
+	put_cpu_var(nr_dentry);
 }
 
 /*
@@ -172,7 +199,6 @@ static struct dentry *d_kill(struct dentry *dentry)
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	if (IS_ROOT(dentry))
@@ -619,7 +645,6 @@ void shrink_dcache_sb(struct super_block * sb)
 static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 {
 	struct dentry *parent;
-	unsigned detached = 0;
 
 	BUG_ON(!IS_ROOT(dentry));
 
@@ -678,7 +703,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			}
 
 			list_del(&dentry->d_u.d_child);
-			detached++;
 
 			inode = dentry->d_inode;
 			if (inode) {
@@ -696,7 +720,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			 * otherwise we ascend to the parent and move to the
 			 * next sibling if there is one */
 			if (!parent)
-				goto out;
+				return;
 
 			dentry = parent;
 
@@ -705,11 +729,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 		dentry = list_entry(dentry->d_subdirs.next,
 				    struct dentry, d_u.d_child);
 	}
-out:
-	/* several dentries were freed, need to correct nr_dentry */
-	spin_lock(&dcache_lock);
-	dentry_stat.nr_dentry -= detached;
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -943,8 +962,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
 	dentry->d_inode = NULL;
-	dentry->d_parent = NULL;
-	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
 	dentry->d_mounted = 0;
@@ -959,15 +976,17 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	if (parent) {
 		dentry->d_parent = dget(parent);
 		dentry->d_sb = parent->d_sb;
+		spin_lock(&dcache_lock);
+		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+		spin_unlock(&dcache_lock);
 	} else {
+		dentry->d_parent = NULL;
+		dentry->d_sb = NULL;
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
 
-	spin_lock(&dcache_lock);
-	if (parent)
-		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-	dentry_stat.nr_dentry++;
-	spin_unlock(&dcache_lock);
+	get_cpu_var(nr_dentry)++;
+	put_cpu_var(nr_dentry);
 
 	return dentry;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0dcdd94..c5e7aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata)
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9d048fa..eebddef 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1243,7 +1243,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &dentry_stat,
 		.maxlen		= 6*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_dentry,
 	},
 	{
 		.ctl_name	= FS_OVERFLOWUID,


* [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator
  2008-11-21 15:34                         ` Ingo Molnar
                                           ` (2 preceding siblings ...)
  (?)
@ 2008-11-26 23:32                         ` Eric Dumazet
  2008-11-27  9:46                             ` Christoph Hellwig
  -1 siblings, 1 reply; 349+ messages in thread
From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 565 bytes --]

new_inode() dirties a contended cache line to get inode numbers.

Solve this problem by giving each cpu a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino.

Note : last_ino_get() method must be called with preemption
disabled on SMP.


(socket8 bench result : no differences, but this is because inode_lock
cost is too heavy)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/inode.c |   27 +++++++++++++++++++++++++--
 1 files changed, 25 insertions(+), 2 deletions(-)

[-- Attachment #2: last_ino.patch --]
[-- Type: text/plain, Size: 1308 bytes --]

diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..d850050 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,30 @@ repeat:
 	return node ? inode : NULL;
 }
 
+#ifdef CONFIG_SMP
+/*
+ * each cpu owns a block of 1024 numbers.
+ * The global 'last_ino' is dirtied once every 1024 allocations
+ */
+static DEFINE_PER_CPU(int, cpu_ino_alloc) = {0};
+static int last_ino_get(void)
+{
+	static atomic_t last_ino;
+	int *ptr = &__raw_get_cpu_var(cpu_ino_alloc);
+
+	if (unlikely((*ptr & 1023) == 0))
+		*ptr = atomic_add_return(1024, &last_ino);
+	return --(*ptr);
+}
+#else
+static int last_ino_get(void)
+{
+	static int last_ino;
+
+	return ++last_ino;
+}
+#endif
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -553,7 +577,6 @@ struct inode *new_inode(struct super_block *sb)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
 	struct inode * inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -564,7 +587,7 @@ struct inode *new_inode(struct super_block *sb)
 		inodes_stat.nr_inodes++;
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
-		inode->i_ino = ++last_ino;
+		inode->i_ino = last_ino_get();
 		inode->i_state = 0;
 		spin_unlock(&inode_lock);
 	}


* [PATCH 4/6] fs: Introduce a per_cpu nr_inodes
@ 2008-11-26 23:32                           ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 473 bytes --]

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of the nr_inodes metric don't need inode_lock anymore.

(socket8 bench result : 25s to 20.5s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/fs-writeback.c   |    2 -
 fs/inode.c          |   51 +++++++++++++++++++++++++++++++++++-------
 include/linux/fs.h  |    3 ++
 kernel/sysctl.c     |    4 +--
 mm/page-writeback.c |    2 -
 5 files changed, 50 insertions(+), 12 deletions(-)


[-- Attachment #2: per_cpu_nr_inodes.patch --]
[-- Type: text/plain, Size: 5705 bytes --]

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
 	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 
 	wbc.nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+			(get_nr_inodes() - inodes_stat.nr_unused) +
 			nr_dirty + nr_unstable;
 	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
 	sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index d850050..8d8d40e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
  * Statistics gathering..
  */
 struct inodes_stat_t inodes_stat;
+static DEFINE_PER_CPU(int, nr_inodes);
 
 static struct kmem_cache * inode_cachep __read_mostly;
 
+int get_nr_inodes(void)
+{
+	int cpu;
+	int counter = 0;
+
+	for_each_possible_cpu(cpu)
+		counter += per_cpu(nr_inodes, cpu);
+	if (counter < 0)
+		counter = 0;
+	return counter;
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -306,9 +337,8 @@ static void dispose_list(struct list_head *head)
 		destroy_inode(inode);
 		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
+	get_cpu_var(nr_inodes) -= nr_disposed;
+	put_cpu_var(nr_inodes);
 }
 
 /*
@@ -584,10 +614,11 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
-		inodes_stat.nr_inodes++;
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
+		get_cpu_var(nr_inodes)++;
 		inode->i_ino = last_ino_get();
+		put_cpu_var(nr_inodes);
 		inode->i_state = 0;
 		spin_unlock(&inode_lock);
 	}
@@ -645,7 +676,8 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
 			if (set(inode, data))
 				goto set_failed;
 
-			inodes_stat.nr_inodes++;
+			get_cpu_var(nr_inodes)++;
+			put_cpu_var(nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -694,7 +726,8 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			inodes_stat.nr_inodes++;
+			get_cpu_var(nr_inodes)++;
+			put_cpu_var(nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -1065,8 +1098,9 @@ void generic_delete_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	get_cpu_var(nr_inodes)--;
+	put_cpu_var(nr_inodes);
 
 	security_inode_delete(inode);
 
@@ -1116,8 +1150,9 @@ static void generic_forget_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	get_cpu_var(nr_inodes)--;
+	put_cpu_var(nr_inodes);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c5e7aa5..2482977 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);
 
 extern int leases_enable, lease_break_time;
 
@@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index eebddef..eebed01 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.ctl_name	= FS_STATINODE,
@@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
 	next_jif = start_jif + dirty_writeback_interval;
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(get_nr_inodes() - inodes_stat.nr_unused);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
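
For readers following along, the per-cpu counting idea in this patch can be sketched in plain userspace C. This is only an illustration under stated assumptions, not kernel code: the slot array stands in for DEFINE_PER_CPU(), the caller passes its slot number explicitly instead of using get_cpu_var()/put_cpu_var(), and the names NR_SLOTS, nr_inodes_add() and nr_inodes_sum() are hypothetical. It shows why a single slot may go negative (an inode counted on one CPU and released on another) while the sum stays meaningful.

```c
#include <assert.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of possible CPUs */

/* One counter per slot, padded so that each sits on its own 64-byte line. */
struct slot_counter {
	int count;
	char pad[64 - sizeof(int)];
};

static struct slot_counter nr_inodes_slots[NR_SLOTS];

/* Touch only the caller's slot: no shared lock, no shared cache line. */
static void nr_inodes_add(int slot, int delta)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	nr_inodes_slots[slot].count += delta;
}

/* Sum every slot and clamp at zero, as get_nr_inodes() does in the patch. */
static int nr_inodes_sum(void)
{
	int total = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		total += nr_inodes_slots[i].count;
	return total < 0 ? 0 : total;
}
```

The read side walks every slot, so it is slower than reading one global integer. That trade-off fits a counter that is updated on every inode allocation and free but read only by the sysctl handler and the writeback heuristics.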

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH 5/6] fs: Introduce special inodes
@ 2008-11-26 23:32                           ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 995 bytes --]

The goal of this patch is to avoid touching inode_lock when allocating
or freeing socket/pipe/anonfd inodes.

In new_inode(), we test whether the super block has the MS_SPECIAL flag
set. If so, we don't put the inode on the "inode_in_use" list or the
"sb->s_inodes" list. As inode_lock was taken only to protect these
lists, we avoid taking it as well.

Using iput_special() from dput_special() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines
in iput().

Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we
really need a different flag.

(socket8 bench result: from 20.5s to 2.94s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---

 fs/anon_inodes.c   |    1 +
 fs/dcache.c        |    2 +-
 fs/inode.c         |   25 ++++++++++++++++++-------
 fs/pipe.c          |    3 ++-
 include/linux/fs.h |    2 ++
 net/socket.c       |    1 +
 6 files changed, 25 insertions(+), 9 deletions(-)

[-- Attachment #2: special_inodes.patch --]
[-- Type: text/plain, Size: 3551 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4f20d48..a0212b3 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -158,6 +158,7 @@ static int __init anon_inode_init(void)
 		error = PTR_ERR(anon_inode_mnt);
 		goto err_unregister_filesystem;
 	}
+	anon_inode_mnt->mnt_sb->s_flags |= MS_SPECIAL;
 	anon_inode_inode = anon_inode_mkinode();
 	if (IS_ERR(anon_inode_inode)) {
 		error = PTR_ERR(anon_inode_inode);
diff --git a/fs/dcache.c b/fs/dcache.c
index d73763b..bade7d7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -239,7 +239,7 @@ static void dput_special(struct dentry *dentry)
 		return;
 	inode = dentry->d_inode;
 	if (inode)
-		iput(inode);
+		iput_special(inode);
 	d_free(dentry);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 8d8d40e..1bb6553 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -228,6 +228,14 @@ void destroy_inode(struct inode *inode)
 		kmem_cache_free(inode_cachep, (inode));
 }
 
+void iput_special(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_count)) {
+		destroy_inode(inode);
+		get_cpu_var(nr_inodes)--;
+		put_cpu_var(nr_inodes);
+	}
+}
 
 /*
  * These are initializations that only need to be done
@@ -609,18 +617,21 @@ struct inode *new_inode(struct super_block *sb)
 	 */
 	struct inode * inode;
 
-	spin_lock_prefetch(&inode_lock);
-	
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&inode_lock);
-		list_add(&inode->i_list, &inode_in_use);
-		list_add(&inode->i_sb_list, &sb->s_inodes);
+		inode->i_state = 0;
+		if (sb->s_flags & MS_SPECIAL) {
+			INIT_LIST_HEAD(&inode->i_list);
+			INIT_LIST_HEAD(&inode->i_sb_list);
+		} else {
+			spin_lock(&inode_lock);
+			list_add(&inode->i_list, &inode_in_use);
+			list_add(&inode->i_sb_list, &sb->s_inodes);
+			spin_unlock(&inode_lock);
+		}
 		get_cpu_var(nr_inodes)++;
 		inode->i_ino = last_ino_get();
 		put_cpu_var(nr_inodes);
-		inode->i_state = 0;
-		spin_unlock(&inode_lock);
 	}
 	return inode;
 }
diff --git a/fs/pipe.c b/fs/pipe.c
index 5cc132a..6fca681 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void)
 		if (IS_ERR(pipe_mnt)) {
 			err = PTR_ERR(pipe_mnt);
 			unregister_filesystem(&pipe_fs_type);
-		}
+		} else
+			pipe_mnt->mnt_sb->s_flags |= MS_SPECIAL;
 	}
 	return err;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2482977..dd0e8a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -136,6 +136,7 @@ extern int dir_notify_enable;
 #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+#define MS_SPECIAL	(1<<24) /* special fs (inodes not in sb->s_inodes) */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
@@ -1898,6 +1899,7 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
 extern void destroy_inode(struct inode *);
+extern void iput_special(struct inode *inode);
 extern struct inode *new_inode(struct super_block *);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
diff --git a/net/socket.c b/net/socket.c
index f41b6c6..4177456 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2205,6 +2205,7 @@ static int __init sock_init(void)
 	init_inodecache();
 	register_filesystem(&sock_fs_type);
 	sock_mnt = kern_mount(&sock_fs_type);
+	sock_mnt->mnt_sb->s_flags |= MS_SPECIAL;
 
 	/* The real protocol initialization is performed in later initcalls.
 	 */
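
The control flow this patch gives new_inode() can be sketched in userspace C as follows. It is a rough model under stated assumptions: TOY_SB_SPECIAL stands in for MS_SPECIAL, a C11 atomic_flag spinlock stands in for inode_lock, a singly linked list stands in for inode_in_use, and all toy_* names are hypothetical. The acquisition counter exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_SB_SPECIAL 0x1u	/* hypothetical stand-in for MS_SPECIAL */

struct toy_sb {
	unsigned int flags;
};

struct toy_inode {
	struct toy_inode *next;	/* link in the global in-use list */
	int on_list;
};

static atomic_flag toy_inode_lock = ATOMIC_FLAG_INIT;
static struct toy_inode *toy_in_use;	/* protected by toy_inode_lock */
static int toy_lock_acquisitions;	/* instrumentation for the sketch */

static void toy_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&toy_inode_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
	toy_lock_acquisitions++;
}

static void toy_unlock(void)
{
	atomic_flag_clear_explicit(&toy_inode_lock, memory_order_release);
}

/* Special super blocks never list their inodes, so they skip the lock. */
static void toy_new_inode(struct toy_sb *sb, struct toy_inode *inode)
{
	assert(sb && inode);
	inode->next = NULL;
	inode->on_list = 0;
	if (sb->flags & TOY_SB_SPECIAL)
		return;

	toy_lock();
	inode->next = toy_in_use;
	toy_in_use = inode;
	inode->on_list = 1;
	toy_unlock();
}
```

The point of the design is that the lock protects the lists and nothing else. An inode that is never put on a list has no reason to take the lock, which is what removes the contended cache lines for sockets and pipes.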

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
  2008-11-21 15:34                         ` Ingo Molnar
                                           ` (5 preceding siblings ...)
  (?)
@ 2008-11-26 23:32                         ` Eric Dumazet
  2008-11-27  8:21                             ` David Miller
                                             ` (2 more replies)
  -1 siblings, 3 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 511 bytes --]

This function sets a flag (MNT_SPECIAL) on the vfsmount, so that
mntget()/mntput() skip refcounting on permanent system mounts.
Use this function for sockets, pipes and anonymous fds.

(socket8 bench result : from 2.94s to 2.23s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c      |    2 +-
 fs/pipe.c             |    2 +-
 fs/super.c            |    9 +++++++++
 include/linux/fs.h    |    1 +
 include/linux/mount.h |    5 +++--
 net/socket.c          |    2 +-
 6 files changed, 16 insertions(+), 5 deletions(-)


[-- Attachment #2: mnt_special.patch --]
[-- Type: text/plain, Size: 3352 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index a0212b3..42dfe28 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -153,7 +153,7 @@ static int __init anon_inode_init(void)
 	error = register_filesystem(&anon_inode_fs_type);
 	if (error)
 		goto err_exit;
-	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
+	anon_inode_mnt = kern_mount_special(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
 		goto err_unregister_filesystem;
diff --git a/fs/pipe.c b/fs/pipe.c
index 6fca681..391d4fe 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1074,7 +1074,7 @@ static int __init init_pipe_fs(void)
 	int err = register_filesystem(&pipe_fs_type);
 
 	if (!err) {
-		pipe_mnt = kern_mount(&pipe_fs_type);
+		pipe_mnt = kern_mount_special(&pipe_fs_type);
 		if (IS_ERR(pipe_mnt)) {
 			err = PTR_ERR(pipe_mnt);
 			unregister_filesystem(&pipe_fs_type);
diff --git a/fs/super.c b/fs/super.c
index 400a760..a8e14f7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -982,3 +982,12 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
 }
 
 EXPORT_SYMBOL_GPL(kern_mount_data);
+
+struct vfsmount *kern_mount_special(struct file_system_type *type)
+{
+	struct vfsmount *res = kern_mount_data(type, NULL);
+
+	if (!IS_ERR(res))
+		res->mnt_flags |= MNT_SPECIAL;
+	return res;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd0e8a5..a92544a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1591,6 +1591,7 @@ extern int register_filesystem(struct file_system_type *);
 extern int unregister_filesystem(struct file_system_type *);
 extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
 #define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount_special(struct file_system_type *);
 extern int may_umount_tree(struct vfsmount *);
 extern int may_umount(struct vfsmount *);
 extern long do_mount(char *, char *, char *, unsigned long, void *);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..cb4fa90 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct mnt_namespace;
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_IMBALANCED_WRITE_COUNT	0x200 /* just for debugging */
+#define MNT_SPECIAL	0x400	/* special mount (pipes,sockets,...) */
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
@@ -73,7 +74,7 @@ struct vfsmount {
 
 static inline struct vfsmount *mntget(struct vfsmount *mnt)
 {
-	if (mnt)
+	if (mnt && !(mnt->mnt_flags & MNT_SPECIAL))
 		atomic_inc(&mnt->mnt_count);
 	return mnt;
 }
@@ -87,7 +88,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 static inline void mntput(struct vfsmount *mnt)
 {
-	if (mnt) {
+	if (mnt && !(mnt->mnt_flags & MNT_SPECIAL)) {
 		mnt->mnt_expiry_mark = 0;
 		mntput_no_expire(mnt);
 	}
diff --git a/net/socket.c b/net/socket.c
index 4177456..2857d70 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2204,7 +2204,7 @@ static int __init sock_init(void)
 
 	init_inodecache();
 	register_filesystem(&sock_fs_type);
-	sock_mnt = kern_mount(&sock_fs_type);
+	sock_mnt = kern_mount_special(&sock_fs_type);
 	sock_mnt->mnt_sb->s_flags |= MS_SPECIAL;
 
 	/* The real protocol initialization is performed in later initcalls.
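
The effect of the mntget()/mntput() change can be sketched in userspace C with C11 atomics. This is an illustration only: struct toy_mount and the toy_* functions are hypothetical, and the sketch assumes, as the patch does, that a flagged mount is permanent and that the flag is never cleared while the mount is in use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_MNT_SPECIAL 0x400	/* same bit value as MNT_SPECIAL in the patch */

struct toy_mount {
	int flags;
	atomic_int count;
};

/* Take a reference, unless the mount is flagged as permanent. */
static struct toy_mount *toy_mntget(struct toy_mount *mnt)
{
	if (mnt && !(mnt->flags & TOY_MNT_SPECIAL))
		atomic_fetch_add(&mnt->count, 1);
	return mnt;
}

/* Drop a reference, unless the mount is flagged as permanent. */
static void toy_mntput(struct toy_mount *mnt)
{
	if (mnt && !(mnt->flags & TOY_MNT_SPECIAL))
		atomic_fetch_sub(&mnt->count, 1);
}
```

For a flagged mount both calls reduce to a read of the flags word, so the shared counter is never written and its cache line can stay shared between CPUs.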

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
@ 2008-11-27  1:37                             ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-11-27  1:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Hellwig

On Thu, 27 Nov 2008, Eric Dumazet wrote:

> The last point is about SLUB being hit hard, unless we
> use slub_min_order=3 at boot, or we use Christoph Lameter
> patch (struct file RCU optimizations)
> http://thread.gmane.org/gmane.linux.kernel/418615
>
> If we boot machine with slub_min_order=3, SLUB overhead disappears.


I'd rather not be that drastic. Did you try increasing slub_min_objects
instead? Try 40-100. If we find the right number then we should update
the tuning to make sure that it picks the right slab page sizes.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
@ 2008-11-27  6:27                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-27  6:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Hellwig

Christoph Lameter wrote:
> On Thu, 27 Nov 2008, Eric Dumazet wrote:
> 
>> The last point is about SLUB being hit hard, unless we
>> use slub_min_order=3 at boot, or we use Christoph Lameter
>> patch (struct file RCU optimizations)
>> http://thread.gmane.org/gmane.linux.kernel/418615
>>
>> If we boot machine with slub_min_order=3, SLUB overhead disappears.
> 
> 
> I'd rather not be that drastic. Did you try increasing slub_min_objects
> instead? Try 40-100. If we find the right number then we should update
> the tuning to make sure that it picks the right slab page sizes.
> 
> 

4096/192 = 21
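
The same arithmetic can be extended to higher slab orders with a few lines of C. This only reproduces the division above, assuming a 4096-byte page and a 192-byte object. It does not model how SLUB actually derives an order from slub_min_objects, and objs_per_slab() is a hypothetical helper name.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u	/* page size assumed in "4096/192" */

/* Number of obj_size-byte objects that fit in a slab of 2^order pages. */
static unsigned int objs_per_slab(unsigned int obj_size, unsigned int order)
{
	assert(obj_size > 0);
	return (TOY_PAGE_SIZE << order) / obj_size;
}
```

For a 192-byte object this gives 21, 42, 85 and 170 objects at orders 0 to 3.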

with slub_min_objects=22 :

# cat /sys/kernel/slab/filp/order
1
# time ./socket8
real    0m1.725s
user    0m0.685s
sys     0m12.955s

with slub_min_objects=45 :

# cat /sys/kernel/slab/filp/order
2
# time ./socket8
real    0m1.652s
user    0m0.694s
sys     0m12.367s

with slub_min_objects=80 :

# cat /sys/kernel/slab/filp/order
3
# time ./socket8
real    0m1.642s
user    0m0.719s
sys     0m12.315s

I would say slub_min_objects=45 is the optimal value on 32bit arches to
get acceptable performance on this workload (order=2 for filp kmem_cache)

Note : SLAB here is disastrous, but you already knew that :)

real    0m8.128s
user    0m0.748s
sys     1m3.467s


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 5/6] fs: Introduce special inodes
@ 2008-11-27  8:20                             ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-27  8:20 UTC (permalink / raw)
  To: dada1
  Cc: mingo, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra,
	netdev, cl, hch

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 27 Nov 2008 00:32:41 +0100

> Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
>  inodes allocation/freeing.
> 
>  In new_inode(), we test if super block has MS_SPECIAL flag set.
>  If yes, we dont put inode in "inode_in_use" list nor "sb->s_inodes" list
>  As inode_lock was taken only to protect these lists, we avoid it as well
> 
>  Using iput_special() from dput_special() avoids taking inode_lock
>  at freeing time.
> 
>  This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines
>  in iput()
> 
> Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we
> really need a different flag.
> 
> (socket8 bench result : from 20.5s to 2.94s) 
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

No problem with networking part:

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
@ 2008-11-27  8:21                             ` David Miller
  0 siblings, 0 replies; 349+ messages in thread
From: David Miller @ 2008-11-27  8:21 UTC (permalink / raw)
  To: dada1
  Cc: mingo, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra,
	netdev, cl, hch

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 27 Nov 2008 00:32:59 +0100

> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.
> 
> (socket8 bench result : from 2.94s to 2.23s)
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

For networking bits:

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes
  2008-11-26 23:32                           ` Eric Dumazet
  (?)
@ 2008-11-27  9:32                           ` Peter Zijlstra
  2008-11-27  9:39                               ` Peter Zijlstra
                                               ` (3 more replies)
  -1 siblings, 4 replies; 349+ messages in thread
From: Peter Zijlstra @ 2008-11-27  9:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig, travis

On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote:
> Avoids cache line ping pongs between cpus and prepare next patch,
> because updates of nr_inodes metric dont need inode_lock anymore.
> 
> (socket8 bench result : 25s to 20.5s)
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---

> @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
>   * Statistics gathering..
>   */
>  struct inodes_stat_t inodes_stat;
> +static DEFINE_PER_CPU(int, nr_inodes);
>  
>  static struct kmem_cache * inode_cachep __read_mostly;
>  
> +int get_nr_inodes(void)
> +{
> +	int cpu;
> +	int counter = 0;
> +
> +	for_each_possible_cpu(cpu)
> +	    counter += per_cpu(nr_inodes, cpu);
> +	if (counter < 0)
> +		counter = 0;
> +	return counter;
> +}

It would be good to get a cpu hotplug handler here and move to
for_each_online_cpu(). People want distros to be built with
NR_CPUS=4096.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes
  2008-11-27  9:32                           ` Peter Zijlstra
@ 2008-11-27  9:39                               ` Peter Zijlstra
  2008-11-27 10:01                               ` Eric Dumazet
                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 349+ messages in thread
From: Peter Zijlstra @ 2008-11-27  9:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig, travis

On Thu, 2008-11-27 at 10:33 +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote:
> > Avoids cache line ping pongs between cpus and prepare next patch,
> > because updates of nr_inodes metric dont need inode_lock anymore.
> > 
> > (socket8 bench result : 25s to 20.5s)
> > 
> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> > ---
> 
> > @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
> >   * Statistics gathering..
> >   */
> >  struct inodes_stat_t inodes_stat;
> > +static DEFINE_PER_CPU(int, nr_inodes);
> >  
> >  static struct kmem_cache * inode_cachep __read_mostly;
> >  
> > +int get_nr_inodes(void)
> > +{
> > +	int cpu;
> > +	int counter = 0;
> > +
> > +	for_each_possible_cpu(cpu)
> > +	    counter += per_cpu(nr_inodes, cpu);
> > +	if (counter < 0)
> > +		counter = 0;
> > +	return counter;
> > +}
> 
> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distro's to be build with
> NR_CPUS=4096.

Also, this trade-off between global vs per_cpu only works if
get_nr_inodes() is called significantly less often than nr_inodes is changed.

With it being called from writeback that might not be true for all
workloads. One thing you can do about it is use the regular per-cpu
counter stuff, which allows you to do an approximation of the global
number (it also does all the hotplug stuff for you already).
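
A rough userspace sketch of the batching idea behind such an approximate
counter follows. It is a single-threaded model, not the kernel's
percpu_counter code; NR_SLOTS, BATCH and all function names are assumptions
made for illustration only.

```c
#include <assert.h>

#define NR_SLOTS 4	/* stand-in for the number of CPUs (assumption) */
#define BATCH    32	/* local delta allowed before folding (assumption) */

struct approx_counter {
	long global;		/* shared total, touched rarely */
	long local[NR_SLOTS];	/* per-slot deltas, touched on every update */
};

/* Add delta on behalf of one slot; fold into the shared total only
 * when the local delta reaches the batch size. */
static void approx_add(struct approx_counter *c, int slot, long delta)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	c->local[slot] += delta;
	if (c->local[slot] >= BATCH || c->local[slot] <= -BATCH) {
		c->global += c->local[slot];
		c->local[slot] = 0;
	}
}

/* Cheap read: looks at the shared total only, so it can be stale by
 * whatever is still parked in the per-slot deltas. */
static long approx_read(const struct approx_counter *c)
{
	return c->global;
}

/* Exact read: walks every slot, like the loop in get_nr_inodes(). */
static long exact_sum(const struct approx_counter *c)
{
	long sum = c->global;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->local[i];
	return sum;
}
```

A frequent caller would use approx_read() and accept the error; only a
caller that needs the precise value pays for the walk in exact_sum().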



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
  2008-11-26 23:27                         ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
  2008-11-27  1:37                             ` Christoph Lameter
@ 2008-11-27  9:39                           ` Christoph Hellwig
  2008-11-28 18:03                           ` Ingo Molnar
                                             ` (6 subsequent siblings)
  8 siblings, 0 replies; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-27  9:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig


As I told you before, you absolutely must include the fsdevel list and
the VFS maintainer for a patchset like this.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 1/6] fs: Introduce a per_cpu nr_dentry
@ 2008-11-27  9:41                             ` Christoph Hellwig
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-27  9:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig

Looks good modulo the exact version of the for_each_cpu loops that the
experts in that area can help with.  Same for the per_cpu nr_inodes
patch.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator
@ 2008-11-27  9:46                             ` Christoph Hellwig
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-27  9:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig

On Thu, Nov 27, 2008 at 12:32:24AM +0100, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get inode numbers.
>
> Solve this problem by providing to each cpu a per_cpu variable,
> fed by the shared last_ino, but only once every 1024 allocations.
>
> This reduces contention on the shared last_ino.
>
> Note : last_ino_get() method must be called with preemption
> disabled on SMP.

Looks a little clumsy.  One idea might be to have a special slab for
synthetic inodes using new_inode, and to assign the inode number only on
the first allocation and re-use it after that.
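
For reference, the batching scheme described in the quoted patch text can be
modelled in userspace roughly as below. This is a single-threaded sketch and
not the patch itself: NR_SLOTS, the array names and ino_get() are assumptions,
and a kernel version would need an atomic update of the shared variable and
preemption disabled around the per-cpu part, as the quoted note says.

```c
#include <assert.h>

#define NR_SLOTS  4	/* stand-in for the number of CPUs (assumption) */
#define INO_BATCH 1024	/* batch size quoted in the patch description */

static unsigned long shared_last_ino;		/* the contended variable */
static unsigned long slot_next[NR_SLOTS];	/* next number to hand out */
static unsigned long slot_left[NR_SLOTS];	/* numbers left in the batch */

/* Hand out the next inode number for one slot. The shared variable is
 * touched only when the slot's private batch is exhausted. */
static unsigned long ino_get(int slot)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	if (slot_left[slot] == 0) {
		/* reserve INO_BATCH consecutive numbers for this slot */
		slot_next[slot] = shared_last_ino + 1;
		shared_last_ino += INO_BATCH;
		slot_left[slot] = INO_BATCH;
	}
	slot_left[slot]--;
	return slot_next[slot]++;
}
```

Numbers stay unique across slots because each refill reserves a disjoint
range, but they are no longer handed out in global order.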


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes
  2008-11-27  9:39                               ` Peter Zijlstra
  (?)
@ 2008-11-27  9:48                               ` Christoph Hellwig
  -1 siblings, 0 replies; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-27  9:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki,
	linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig, travis

On Thu, Nov 27, 2008 at 10:39:31AM +0100, Peter Zijlstra wrote:
> With it being called from writeback that might not be true for all
> workloads. One thing you can do about it is use the regular per-cpu
> counter stuff, which allows you to do an approximation of the global
> number (it also does all the hotplug stuff for you already).

The way it's used in writeback is utterly stupid and should be fixed :)

But otherwise agreed.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
  2008-11-26 23:32                         ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet
  2008-11-27  8:21                             ` David Miller
@ 2008-11-27  9:53                           ` Christoph Hellwig
  2008-11-27 10:04                               ` Eric Dumazet
  2008-11-28  9:26                             ` Al Viro
  2 siblings, 1 reply; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-27  9:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig

On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.

special is not a useful name for a flag, by definition everything that
needs a flag is special compared to the version that doesn't need a
flag.

The general idea of skipping the writer counts makes sense, but please
give it a descriptive name that explains the not unmountable thing.
And please kill your kern_mount wrapper and just set the flag manually.

Also I think it should be a superblock flag, not a mount flag as you
don't want these to differ for multiple mounts of the same filesystem.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes
  2008-11-27  9:32                           ` Peter Zijlstra
@ 2008-11-27 10:01                               ` Eric Dumazet
  2008-11-27 10:01                               ` Eric Dumazet
                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-27 10:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig, travis

Peter Zijlstra wrote:
> On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote:
>> Avoids cache line ping pongs between cpus and prepare next patch,
>> because updates of nr_inodes metric dont need inode_lock anymore.
>>
>> (socket8 bench result : 25s to 20.5s)
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> ---
> 
>> @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
>>   * Statistics gathering..
>>   */
>>  struct inodes_stat_t inodes_stat;
>> +static DEFINE_PER_CPU(int, nr_inodes);
>>  
>>  static struct kmem_cache * inode_cachep __read_mostly;
>>  
>> +int get_nr_inodes(void)
>> +{
>> +	int cpu;
>> +	int counter = 0;
>> +
>> +	for_each_possible_cpu(cpu)
>> +	    counter += per_cpu(nr_inodes, cpu);
>> +	if (counter < 0)
>> +		counter = 0;
>> +	return counter;
>> +}
> 
> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distro's to be build with
> NR_CPUS=4096.

Hum, I guess we can use regular percpu_counter for this...



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
@ 2008-11-27 10:04                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-27 10:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter

Christoph Hellwig wrote:
> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>> refcounting on permanent system vfs.
>> Use this function for sockets, pipes, anonymous fds.
> 
> special is not a useful name for a flag, by definition everything that
> needs a flag is special compared to the version that doesn't need a
> flag.
> 
> The general idea of skipping the writer counts makes sense, but please
> give it a descriptive name that explains the not unmountable thing.
> And please kill your kern_mount wrapper and just set the flag manually.
> 
> Also I think it should be a superblock flag, not a mount flag as you
> don't want these to differ for multiple mounts of the same filesystem.
> 
> 

Hum.. we have a superblock flag already, but testing it in mntput()/mntget()
is going to be a little bit expensive if we add a dereference ?

if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) {
   ...
}


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes
  2008-11-27  9:32                           ` Peter Zijlstra
  2008-11-27  9:39                               ` Peter Zijlstra
  2008-11-27 10:01                               ` Eric Dumazet
@ 2008-11-27 10:07                             ` Andi Kleen
  2008-11-27 14:46                             ` Christoph Lameter
  3 siblings, 0 replies; 349+ messages in thread
From: Andi Kleen @ 2008-11-27 10:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki,
	linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig, travis

Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
>>  
>> +int get_nr_inodes(void)
>> +{
>> +	int cpu;
>> +	int counter = 0;
>> +
>> +	for_each_possible_cpu(cpu)
>> +	    counter += per_cpu(nr_inodes, cpu);
>> +	if (counter < 0)
>> +		counter = 0;
>> +	return counter;
>> +}
>
> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distro's to be build with
> NR_CPUS=4096.

Doesn't matter, possible cpus is always only set to what the
machine supports.

-Andi
-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
@ 2008-11-27 10:10                                 ` Christoph Hellwig
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-27 10:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, Ingo Molnar, David Miller, Rafael J. Wysocki,
	linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter

On Thu, Nov 27, 2008 at 11:04:38AM +0100, Eric Dumazet wrote:
> Hum.. we have a superblock flag already, but testing it in mntput()/mntget()
> is going to be a little bit expensive if we add a dereference ?
>
> if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) {
>   ...
> }

Well, run a benchmark to see if it makes any difference.  And when it
does please always set the mount flag from the common mount code when
it's set on the superblock, and document that this is the only valid way
to set it.
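
Roughly, the arrangement suggested here could look like the toy model below:
the mount flag is derived from the superblock flag in one place, so the hot
path tests only the mount's own flags and never follows the superblock
pointer. All toy_* types, TOY_* flags and functions are hypothetical stand-ins,
not the kernel's structures or API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SB_SPECIAL  0x1u	/* hypothetical superblock flag */
#define TOY_MNT_SPECIAL 0x1u	/* hypothetical mount flag */

struct toy_super_block {
	unsigned int s_flags;
};

struct toy_mount {
	struct toy_super_block *sb;
	unsigned int mnt_flags;
	int count;
};

/* Common mount path: the only place that sets the mount flag, and it
 * does so by mirroring the superblock flag. */
static void toy_mount_init(struct toy_mount *mnt, struct toy_super_block *sb)
{
	mnt->sb = sb;
	mnt->count = 1;
	mnt->mnt_flags = 0;
	if (sb->s_flags & TOY_SB_SPECIAL)
		mnt->mnt_flags |= TOY_MNT_SPECIAL;
}

/* Hot path: no mnt->sb dereference, only the mount's own flags. */
static void toy_mntget(struct toy_mount *mnt)
{
	if (mnt && !(mnt->mnt_flags & TOY_MNT_SPECIAL))
		mnt->count++;
}
```

Whether the extra dereference matters at all is still something only the
benchmark can answer.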


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
@ 2008-11-27 14:44                                 ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-11-27 14:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Hellwig, Pekka Enberg

On Thu, 27 Nov 2008, Eric Dumazet wrote:

> with slub_min_objects=45 :
>
> # cat /sys/kernel/slab/filp/order
> 2
> # time ./socket8
> real    0m1.652s
> user    0m0.694s
> sys     0m12.367s

That may be a good value. How many processors do you have? Look at
calculate_order() in mm/slub.c:

	if (!min_objects)
		min_objects = 4 * (fls(nr_cpu_ids) + 1);

We could increase the scaling factor there or start
with a minimum of 20 objects?


Try

	min_objects = 20 + 4 * (fls(nr_cpu_ids) + 1);

> I would say slub_min_objects=45 is the optimal value on 32bit arches to
> get acceptable performance on this workload (order=2 for filp kmem_cache)
>
> Note : SLAB here is disastrous, but you already knew that :)

It's good, though, to have examples where the queue management gets in the
way of performance.
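
To see what the two expressions above yield, here is a small userspace
sketch. fls_sketch() is an assumed stand-in for the kernel's fls() (position
of the most significant set bit, counting from 1), and the two function names
are made up for illustration.

```c
#include <assert.h>

/* Userspace stand-in for fls(): fls(0) = 0, fls(1) = 1, fls(8) = 4. */
static int fls_sketch(unsigned int x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Default quoted from calculate_order(). */
static int min_objects_current(unsigned int nr_cpu_ids)
{
	return 4 * (fls_sketch(nr_cpu_ids) + 1);
}

/* Variant suggested above: same scaling, but starting from 20. */
static int min_objects_proposed(unsigned int nr_cpu_ids)
{
	return 20 + 4 * (fls_sketch(nr_cpu_ids) + 1);
}
```

With nr_cpu_ids = 8, for instance, the two expressions evaluate to 20 and 40.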

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes
  2008-11-27  9:32                           ` Peter Zijlstra
                                               ` (2 preceding siblings ...)
  2008-11-27 10:07                             ` Andi Kleen
@ 2008-11-27 14:46                             ` Christoph Lameter
  3 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-11-27 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki,
	linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Hellwig, travis

On Thu, 27 Nov 2008, Peter Zijlstra wrote:

> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distro's to be build with
> NR_CPUS=4096.

NR_CPUS=4096 does not necessarily increase the number of possible cpus.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
@ 2008-11-28  9:26                             ` Al Viro
  0 siblings, 0 replies; 349+ messages in thread
From: Al Viro @ 2008-11-28  9:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth,
	ink

On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.

IMO that's pushing it past the point of usefulness; unless you can show
that this really gives a considerable win on pipes et al. *AND* that it
doesn't hurt other loads...

dput() part: again, I want to see what happens on other loads; it's probably
fine (and win is certainly more than from mntput() change), but...  The
thing is, atomic_dec_and_lock() in there is often done on dentries with
d_count > 1 and that's fairly cheap (and doesn't involve contention on
dcache_lock on sane targets).

FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
in a special way, I'd try to compare with
        if (atomic_add_unless(&dentry->d_count, -1, 1))
                return;
	if (your flag)
		sod off to special
	spin_lock(&dcache_lock);
	if (!atomic_dec_and_test(&dentry->d_count)) {
		spin_unlock(&dcache_lock);
		return;
	}
	the rest as usual

	As for the alpha... unless I'm misreading the assembler in
arch/alpha/lib/dec_and_lock.c, it looks like we have essentially an
implementation of atomic_add_unless() in there and one that just
might be better than what we've got in arch/alpha/include/asm/atomic.h.
How about
1:	ldl_l	x, addr
	cmpne	x, u, y	/* y = x != u */
	beq	y, 3f	/* if !y -> bugger off, return 0 */
	addl	x, a, y
	stl_c	y, addr	/* y <- *addr has not changed since ldl_l */
	beq	y, 2f
3:	/* return value is in y */
.subsection 2 /* out of the way */
2:	br	1b
.previous
for atomic_add_unless() guts?  With that we are rid of HAVE_DEC_LOCK and
get a uniform implementation of atomic_dec_and_lock() for all targets...

AFAICS, that would be
static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:     ldl_l %0,%1\n"
	"       cmpne %0,%4,%2\n"
	"       beq %4,3f\n"
	"       addl %0,%3,%4\n"
	"       stl_c %2,%1\n"
	"       beq %2,2f\n"
	"3:\n"
        ".subsection 2\n"
        "2:     br 1b\n"
        ".previous"
        :"=&r" (temp), "=m" (v->counter), "=&r" (res)
        :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}

static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:     ldq_l %0,%1\n"
	"       cmpne %0,%4,%2\n"
	"       beq %4,3f\n"
	"       addq %0,%3,%4\n"
	"       stq_c %2,%1\n"
	"       beq %2,2f\n"
	"3:\n"
        ".subsection 2\n"
        "2:     br 1b\n"
        ".previous"
        :"=&r" (temp), "=m" (v->counter), "=&r" (res)
        :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}
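
For readers without an Alpha at hand, the intended semantics can be
sketched in portable C11. This is a userspace approximation under stated
assumptions, not the kernel implementation: add_unless() and
dec_and_lock() are hypothetical names, a compare-and-swap loop stands in
for the ldl_l/stl_c pair, and a pthread mutex stands in for the spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add a to *v unless *v == u; returns true if the add happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/*
 * Drop one reference.  Returns true with the lock held if the count
 * reached zero, false with the lock not held otherwise.
 */
static bool dec_and_lock(atomic_int *v, pthread_mutex_t *lock)
{
	int prev;

	/* Fast path: the count is above 1, so no lock is needed. */
	if (add_unless(v, -1, 1))
		return false;

	/* Slow path: the count was 1, so decide under the lock. */
	pthread_mutex_lock(lock);
	prev = atomic_fetch_sub(v, 1);
	assert(prev > 0);		/* the refcount must not underflow */
	if (prev == 1)
		return true;		/* last reference dropped */
	pthread_mutex_unlock(lock);
	return false;
}
```

The fast path touches only the counter, which is why a put on an object
with a count above 1 never contends on the lock.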

Comments?

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
@ 2008-11-28  9:34                               ` Al Viro
  0 siblings, 0 replies; 349+ messages in thread
From: Al Viro @ 2008-11-28  9:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth,
	ink

On Fri, Nov 28, 2008 at 09:26:04AM +0000, Al Viro wrote:

gyah...  That would be

> static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
> {
> 	unsigned long temp, res;
> 	__asm__ __volatile__(
> 	"1:     ldl_l %0,%1\n"
> 	"       cmpne %0,%4,%2\n"
 	"       beq %2,3f\n"
 	"       addl %0,%3,%2\n"
> 	"       stl_c %2,%1\n"
> 	"       beq %2,2f\n"
> 	"3:\n"
>         ".subsection 2\n"
>         "2:     br 1b\n"
>         ".previous"
>         :"=&r" (temp), "=m" (v->counter), "=&r" (res)
>         :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> 	smp_mb();
> 	return res;
> }
> 
> static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
> {
> 	unsigned long temp, res;
> 	__asm__ __volatile__(
> 	"1:     ldq_l %0,%1\n"
> 	"       cmpne %0,%4,%2\n"
 	"       beq %2,3f\n"
 	"       addq %0,%3,%2\n"
> 	"       stq_c %2,%1\n"
> 	"       beq %2,2f\n"
> 	"3:\n"
>         ".subsection 2\n"
>         "2:     br 1b\n"
>         ".previous"
>         :"=&r" (temp), "=m" (v->counter), "=&r" (res)
>         :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> 	smp_mb();
> 	return res;
> }
> 
> Comments?
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
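
The operand roles in the corrected loop can be checked with a toy model.
The sketch below is single-threaded C, not Alpha code: struct llsc_word,
load_locked(), store_conditional() and the fail_next knob are all
hypothetical, and the trailing smp_mb() is not modelled. Each statement
in model_add_unless() is annotated with the instruction it mirrors.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one memory word with a load-locked reservation. */
struct llsc_word {
	int value;
	bool reserved;	/* set by load_locked(), cleared by any store */
	int fail_next;	/* how many store_conditional() calls should fail */
};

static int load_locked(struct llsc_word *w)
{
	w->reserved = true;
	return w->value;
}

/* Returns 1 and stores if the reservation still holds, otherwise 0. */
static int store_conditional(struct llsc_word *w, int val)
{
	assert(w->fail_next >= 0);
	if (w->fail_next > 0) {	/* simulate interference from another cpu */
		w->fail_next--;
		w->reserved = false;
	}
	if (!w->reserved)
		return 0;
	w->value = val;
	w->reserved = false;
	return 1;
}

static int model_add_unless(struct llsc_word *w, int a, int u)
{
	int temp, res;

	for (;;) {
		temp = load_locked(w);			/* 1: ldl_l %0,%1    */
		res = (temp != u);			/*    cmpne %0,%4,%2 */
		if (res == 0)				/*    beq %2,3f      */
			break;
		res = temp + a;				/*    addl %0,%3,%2  */
		res = store_conditional(w, res);	/*    stl_c %2,%1    */
		if (res != 0)				/*    beq %2,2f (retry on 0) */
			break;
	}
	return res;					/* 3: result is in res */
}
```

In this model the function returns 0 only when the value equals u, and
otherwise retries until the conditional store succeeds and leaves 1 in
res.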

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
@ 2008-11-28 18:02                               ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-28 18:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth,
	ink


* Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> > refcounting on permanent system vfs.
> > Use this function for sockets, pipes, anonymous fds.
> 
> IMO that's pushing it past the point of usefulness; unless you can show
> that this really gives considerable win on pipes et al. *AND* that it
> doesn't hurt other loads...

The numbers look pretty convincing:

> >  (socket8 bench result : from 2.94s to 2.23s)

And i wouldn't expect it to hurt real-filesystem workloads.

Here's a contemporary trace of a typical ext3 sys_open():

 0)               |  sys_open() {
 0)               |    do_sys_open() {
 0)               |      getname() {
 0)      0.367 us |        kmem_cache_alloc();
 0)               |        strncpy_from_user() {
 0)               |          _cond_resched() {
 0)               |            need_resched() {
 0)      0.363 us |              constant_test_bit();
 0)      1.047 us |            }
 0)      1.815 us |          }
 0)      2.587 us |        }
 0)      4.022 us |      }
 0)               |      alloc_fd() {
 0)      0.480 us |        _spin_lock();
 0)      0.487 us |        expand_files();
 0)      2.356 us |      }
 0)               |      do_filp_open() {
 0)               |        path_lookup_open() {
 0)               |          get_empty_filp() {
 0)      0.439 us |            kmem_cache_alloc();
 0)               |            security_file_alloc() {
 0)      0.316 us |              cap_file_alloc_security();
 0)      1.087 us |            }
 0)      3.189 us |          }
 0)               |          do_path_lookup() {
 0)      0.366 us |            _read_lock();
 0)               |            path_walk() {
 0)               |              __link_path_walk() {
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)      0.441 us |                    generic_permission();
 0)      1.247 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.411 us |                    cap_inode_permission();
 0)      1.186 us |                  }
 0)      3.555 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.486 us |                    _spin_lock();
 0)      1.369 us |                  }
 0)      0.442 us |                  __follow_mount();
 0)      3.014 us |                }
 0)               |                path_to_nameidata() {
 0)      0.476 us |                  dput();
 0)      1.235 us |                }
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)               |                    generic_permission() {
 0)               |                      in_group_p() {
 0)      0.410 us |                        groups_search();
 0)      1.172 us |                      }
 0)      1.994 us |                    }
 0)      2.789 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.454 us |                    cap_inode_permission();
 0)      1.238 us |                  }
 0)      5.262 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.480 us |                    _spin_lock();
 0)      1.621 us |                  }
 0)      0.456 us |                  __follow_mount();
 0)      3.215 us |                }
 0)               |                path_to_nameidata() {
 0)      0.420 us |                  dput();
 0)      1.193 us |                }
 0) +   23.551 us |              }
 0)               |              path_put() {
 0)      0.420 us |                dput();
 0)               |                mntput() {
 0)      0.359 us |                  mntput_no_expire();
 0)      1.050 us |                }
 0)      2.544 us |              }
 0) +   27.253 us |            }
 0) +   28.850 us |          }
 0) +   33.217 us |        }
 0)               |        may_open() {
 0)               |          inode_permission() {
 0)               |            ext3_permission() {
 0)      0.480 us |              generic_permission();
 0)      1.229 us |            }
 0)               |            security_inode_permission() {
 0)      0.405 us |              cap_inode_permission();
 0)      1.196 us |            }
 0)      3.589 us |          }
 0)      4.600 us |        }
 0)               |        nameidata_to_filp() {
 0)               |          __dentry_open() {
 0)               |            file_move() {
 0)      0.470 us |              _spin_lock();
 0)      1.243 us |            }
 0)               |            security_dentry_open() {
 0)      0.344 us |              cap_dentry_open();
 0)      1.139 us |            }
 0)      0.412 us |            generic_file_open();
 0)      0.561 us |            file_ra_state_init();
 0)      5.714 us |          }
 0)      6.483 us |        }
 0) +   46.494 us |      }
 0)      0.453 us |      inotify_dentry_parent_queue_event();
 0)      0.403 us |      inotify_inode_queue_event();
 0)               |      fd_install() {
 0)      0.440 us |        _spin_lock();
 0)      1.247 us |      }
 0)               |      putname() {
 0)               |        kmem_cache_free() {
 0)               |          virt_to_head_page() {
 0)      0.369 us |            constant_test_bit();
 0)      1.023 us |          }
 0)      1.738 us |        }
 0)      2.422 us |      }
 0) +   60.560 us |    }
 0) +   61.368 us |  }

and here's a sys_close():

 0)               |  sys_close() {
 0)      0.540 us |    _spin_lock();
 0)               |    filp_close() {
 0)      0.437 us |      dnotify_flush();
 0)      0.401 us |      locks_remove_posix();
 0)      0.349 us |      fput();
 0)      2.679 us |    }
 0)      4.452 us |  }

i'd be surprised to see a flag show up in that codepath. Eric, does
your testing confirm that?

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
  2008-11-26 23:27                         ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
  2008-11-27  1:37                             ` Christoph Lameter
  2008-11-27  9:39                           ` Christoph Hellwig
@ 2008-11-28 18:03                           ` Ingo Molnar
  2008-11-28 18:47                               ` Peter Zijlstra
  2008-11-29  8:43                             ` Eric Dumazet
                                             ` (5 subsequent siblings)
  8 siblings, 1 reply; 349+ messages in thread
From: Ingo Molnar @ 2008-11-28 18:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Hi all
>
> Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> (From 27.5 seconds to 1.6 seconds)

Wow, that's incredibly impressive! :-)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
@ 2008-11-28 18:47                               ` Peter Zijlstra
  0 siblings, 0 replies; 349+ messages in thread
From: Peter Zijlstra @ 2008-11-28 18:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

On Fri, 2008-11-28 at 19:03 +0100, Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> > Hi all
> >
> > Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> > (From 27.5 seconds to 1.6 seconds)
> 
> Wow, that's incredibly impressive! :-)

Yeah, we got a similar speedup on -rt by pushing those super-block file
lists into per-cpu lists and doing crazy locking on them.

Of course, avoiding them altogether, as done here, is a nicer option,
but it is sadly not a possibility for regular files (until hch gets
around to removing the need for the list).
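
The general shape of such a split can be sketched in userspace C. This
is not the -rt implementation, which the thread does not show; it is a
minimal illustration assuming one list and one lock per shard, with
NR_SHARDS standing in for the number of cpus and all names (struct
shard, shard_add(), shard_del(), shard_count_all()) hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_SHARDS 4	/* stand-in for the number of cpus */

struct node {
	struct node *prev, *next;
	unsigned int shard;	/* which list (and lock) holds this node */
};

struct shard {
	pthread_mutex_t lock;
	struct node head;	/* sentinel of a circular list */
};

static struct shard shards[NR_SHARDS];

static void shards_init(void)
{
	for (unsigned int i = 0; i < NR_SHARDS; i++) {
		pthread_mutex_init(&shards[i].lock, NULL);
		shards[i].head.prev = shards[i].head.next = &shards[i].head;
	}
}

/* Add on the caller's shard; only that shard's lock is taken. */
static void shard_add(struct node *n, unsigned int cpu)
{
	struct shard *s;

	assert(cpu < NR_SHARDS);
	s = &shards[cpu];
	n->shard = cpu;
	pthread_mutex_lock(&s->lock);
	n->next = s->head.next;
	n->prev = &s->head;
	s->head.next->prev = n;
	s->head.next = n;
	pthread_mutex_unlock(&s->lock);
}

/* Removal may run anywhere; the node remembers its shard. */
static void shard_del(struct node *n)
{
	struct shard *s = &shards[n->shard];

	pthread_mutex_lock(&s->lock);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
	pthread_mutex_unlock(&s->lock);
}

/* Walking every entry now has to visit each shard in turn. */
static size_t shard_count_all(void)
{
	size_t total = 0;

	for (unsigned int i = 0; i < NR_SHARDS; i++) {
		struct node *n;

		pthread_mutex_lock(&shards[i].lock);
		for (n = shards[i].head.next; n != &shards[i].head; n = n->next)
			total++;
		pthread_mutex_unlock(&shards[i].lock);
	}
	return total;
}
```

The trade-off is visible in shard_count_all(): adds and removes stop
contending on a single lock, but anything that needs the whole list has
to walk every shard.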

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
  2008-11-28 18:02                               ` Ingo Molnar
  (?)
@ 2008-11-28 18:58                               ` Ingo Molnar
  -1 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2008-11-28 18:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth,
	ink


* Ingo Molnar <mingo@elte.hu> wrote:

> And i wouldn't expect it to hurt real-filesystem workloads.
> 
> Here's a contemporary trace of a typical ext3 sys_open():

here's a sys_open() that has to touch atime:

 0)               |  sys_open() {
 0)               |    do_sys_open() {
 0)               |      getname() {
 0)      0.377 us |        kmem_cache_alloc();
 0)               |        strncpy_from_user() {
 0)               |          _cond_resched() {
 0)               |            need_resched() {
 0)      0.353 us |              constant_test_bit();
 0)      1.045 us |            }
 0)      1.739 us |          }
 0)      2.492 us |        }
 0)      3.934 us |      }
 0)               |      alloc_fd() {
 0)      0.374 us |        _spin_lock();
 0)      0.447 us |        expand_files();
 0)      2.124 us |      }
 0)               |      do_filp_open() {
 0)               |        path_lookup_open() {
 0)               |          get_empty_filp() {
 0)      0.689 us |            kmem_cache_alloc();
 0)               |            security_file_alloc() {
 0)      0.327 us |              cap_file_alloc_security();
 0)      1.071 us |            }
 0)      2.869 us |          }
 0)               |          do_path_lookup() {
 0)      0.460 us |            _read_lock();
 0)               |            path_walk() {
 0)               |              __link_path_walk() {
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)      0.434 us |                    generic_permission();
 0)      1.191 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.400 us |                    cap_inode_permission();
 0)      1.130 us |                  }
 0)      3.453 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.489 us |                    _spin_lock();
 0)      1.525 us |                  }
 0)      0.449 us |                  __follow_mount();
 0)      3.115 us |                }
 0)               |                path_to_nameidata() {
 0)      0.422 us |                  dput();
 0)      1.204 us |                }
 0)               |                inode_permission() {
 0)               |                  ext3_permission() {
 0)      0.391 us |                    generic_permission();
 0)      1.223 us |                  }
 0)               |                  security_inode_permission() {
 0)      0.406 us |                    cap_inode_permission();
 0)      1.189 us |                  }
 0)      3.565 us |                }
 0)               |                do_lookup() {
 0)               |                  __d_lookup() {
 0)      0.527 us |                    _spin_lock();
 0)      1.633 us |                  }
 0)      0.440 us |                  __follow_mount();
 0)      3.223 us |                }
 0)               |                do_follow_link() {
 0)               |                  _cond_resched() {
 0)               |                    need_resched() {
 0)      0.361 us |                      constant_test_bit();
 0)      1.064 us |                    }
 0)      1.749 us |                  }
 0)               |                  security_inode_follow_link() {
 0)      0.390 us |                    cap_inode_follow_link();
 0)      1.260 us |                  }
 0)               |                  touch_atime() {
 0)               |                    mnt_want_write() {
 0)      0.360 us |                      _spin_lock();
 0)      1.137 us |                    }
 0)               |                    mnt_drop_write() {
 0)      0.348 us |                      _spin_lock();
 0)      1.102 us |                    }
 0)      3.402 us |                  }
 0)      0.446 us |                  ext3_follow_link();
 0)               |                  __link_path_walk() {
 0)               |                    inode_permission() {
 0)               |                      ext3_permission() {
 0)               |                        generic_permission() {
 0)      4.481 us |                      }
 0)               |                      security_inode_permission() {
 0)      0.402 us |                        cap_inode_permission();
 0)      1.127 us |                      }
 0)      6.747 us |                    }
 0)               |                    do_lookup() {
 0)               |                      __d_lookup() {
 0)      0.547 us |                        _spin_lock();
 0)      1.758 us |                      }
 0)      0.465 us |                      __follow_mount();
 0)      3.368 us |                    }
 0)               |                    path_to_nameidata() {
 0)      0.419 us |                      dput();
 0)      1.203 us |                    }
 0) +   13.040 us |                  }
 0)               |                  path_put() {
 0)      0.429 us |                    dput();
 0)               |                    mntput() {
 0)      0.367 us |                      mntput_no_expire();
 0)      1.130 us |                    }
 0)      2.660 us |                  }
 0)               |                  path_put() {
 0)               |                    dput() {
 0)               |                      _cond_resched() {
 0)               |                        need_resched() {
 0)      0.382 us |                          constant_test_bit();
 0)      1.067 us |                        }
 0)      1.808 us |                      }
 0)      0.399 us |                      _spin_lock();
 0)      0.452 us |                      _spin_lock();
 0)      4.270 us |                    }
 0)               |                    mntput() {
 0)      0.375 us |                      mntput_no_expire();
 0)      1. 62 us |                    }
 0)      6.547 us |                  }
 0) +   32.702 us |                }
 0) +   50.413 us |              }
 0)               |              path_put() {
 0)      0.421 us |                dput();
 0)               |                mntput() {
 0)      0.364 us |                  mntput_no_expire();
 0)      1. 64 us |                }
 0)      2.545 us |              }
 0) +   54.147 us |            }
 0) +   55.780 us |          }
 0) +   59.714 us |        }
 0)               |        may_open() {
 0)               |          inode_permission() {
 0)               |            ext3_permission() {
 0)      0.406 us |              generic_permission();
 0)      1.189 us |            }
 0)               |            security_inode_permission() {
 0)      0.388 us |              cap_inode_permission();
 0)      1.175 us |            }
 0)      3.498 us |          }
 0)      4.328 us |        }
 0)               |        nameidata_to_filp() {
 0)               |          __dentry_open() {
 0)               |            file_move() {
 0)      0.361 us |              _spin_lock();
 0)      1.102 us |            }
 0)               |            security_dentry_open() {
 0)      0.356 us |              cap_dentry_open();
 0)      1.121 us |            }
 0)      0.400 us |            generic_file_open();
 0)      0.544 us |            file_ra_state_init();
 0)      5. 11 us |          }
 0)      5.709 us |        }
 0) +   71.181 us |      }
 0)      0.453 us |      inotify_dentry_parent_queue_event();
 0)      0.403 us |      inotify_inode_queue_event();
 0)               |      fd_install() {
 0)      0.411 us |        _spin_lock();
 0)      1.217 us |      }
 0)               |      putname() {
 0)               |        kmem_cache_free() {
 0)               |          virt_to_head_page() {
 0)      0.371 us |            constant_test_bit();
 0)      1. 47 us |          }
 0)      1.752 us |        }
 0)      2.446 us |      }
 0) +   84.676 us |    }
 0) +   85.365 us |  }

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
@ 2008-11-28 22:20                                 ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-28 22:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Al Viro, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth,
	ink

Ingo Molnar wrote:
> * Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
>> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>>> refcounting on permanent system vfs.
>>> Use this function for sockets, pipes, anonymous fds.
>> IMO that's pushing it past the point of usefulness; unless you can show
>> that this really gives considerable win on pipes et.al. *AND* that it
>> doesn't hurt other loads...
> 
> The numbers look pretty convincing:
> 
>>>  (socket8 bench result : from 2.94s to 2.23s)
> 
> And i wouldnt expect it to hurt real-filesystem workloads.
> 
> Here's the contemporary trace of a typical ext3- sys_open():
> 
>  0)               |  sys_open() {
>  0)               |    do_sys_open() {
>  0)               |      getname() {
>  0)      0.367 us |        kmem_cache_alloc();
>  0)               |        strncpy_from_user(); {
>  0)               |          _cond_resched() {
>  0)               |            need_resched() {
>  0)      0.363 us |              constant_test_bit();
>  0)      1. 47 us |            }
>  0)      1.815 us |          }
>  0)      2.587 us |        }
>  0)      4. 22 us |      }
>  0)               |      alloc_fd() {
>  0)      0.480 us |        _spin_lock();
>  0)      0.487 us |        expand_files();
>  0)      2.356 us |      }
>  0)               |      do_filp_open() {
>  0)               |        path_lookup_open() {
>  0)               |          get_empty_filp() {
>  0)      0.439 us |            kmem_cache_alloc();
>  0)               |            security_file_alloc() {
>  0)      0.316 us |              cap_file_alloc_security();
>  0)      1. 87 us |            }
>  0)      3.189 us |          }
>  0)               |          do_path_lookup() {
>  0)      0.366 us |            _read_lock();
>  0)               |            path_walk() {
>  0)               |              __link_path_walk() {
>  0)               |                inode_permission() {
>  0)               |                  ext3_permission() {
>  0)      0.441 us |                    generic_permission();
>  0)      1.247 us |                  }
>  0)               |                  security_inode_permission() {
>  0)      0.411 us |                    cap_inode_permission();
>  0)      1.186 us |                  }
>  0)      3.555 us |                }
>  0)               |                do_lookup() {
>  0)               |                  __d_lookup() {
>  0)      0.486 us |                    _spin_lock();
>  0)      1.369 us |                  }
>  0)      0.442 us |                  __follow_mount();
>  0)      3. 14 us |                }
>  0)               |                path_to_nameidata() {
>  0)      0.476 us |                  dput();
>  0)      1.235 us |                }
>  0)               |                inode_permission() {
>  0)               |                  ext3_permission() {
>  0)               |                    generic_permission() {
>  0)               |                      in_group_p() {
>  0)      0.410 us |                        groups_search();
>  0)      1.172 us |                      }
>  0)      1.994 us |                    }
>  0)      2.789 us |                  }
>  0)               |                  security_inode_permission() {
>  0)      0.454 us |                    cap_inode_permission();
>  0)      1.238 us |                  }
>  0)      5.262 us |                }
>  0)               |                do_lookup() {
>  0)               |                  __d_lookup() {
>  0)      0.480 us |                    _spin_lock();
>  0)      1.621 us |                  }
>  0)      0.456 us |                  __follow_mount();
>  0)      3.215 us |                }
>  0)               |                path_to_nameidata() {
>  0)      0.420 us |                  dput();
>  0)      1.193 us |                }
>  0) +   23.551 us |              }
>  0)               |              path_put() {
>  0)      0.420 us |                dput();
>  0)               |                mntput() {
>  0)      0.359 us |                  mntput_no_expire();
>  0)      1. 50 us |                }
>  0)      2.544 us |              }
>  0) +   27.253 us |            }
>  0) +   28.850 us |          }
>  0) +   33.217 us |        }
>  0)               |        may_open() {
>  0)               |          inode_permission() {
>  0)               |            ext3_permission() {
>  0)      0.480 us |              generic_permission();
>  0)      1.229 us |            }
>  0)               |            security_inode_permission() {
>  0)      0.405 us |              cap_inode_permission();
>  0)      1.196 us |            }
>  0)      3.589 us |          }
>  0)      4.600 us |        }
>  0)               |        nameidata_to_filp() {
>  0)               |          __dentry_open() {
>  0)               |            file_move() {
>  0)      0.470 us |              _spin_lock();
>  0)      1.243 us |            }
>  0)               |            security_dentry_open() {
>  0)      0.344 us |              cap_dentry_open();
>  0)      1.139 us |            }
>  0)      0.412 us |            generic_file_open();
>  0)      0.561 us |            file_ra_state_init();
>  0)      5.714 us |          }
>  0)      6.483 us |        }
>  0) +   46.494 us |      }
>  0)      0.453 us |      inotify_dentry_parent_queue_event();
>  0)      0.403 us |      inotify_inode_queue_event();
>  0)               |      fd_install() {
>  0)      0.440 us |        _spin_lock();
>  0)      1.247 us |      }
>  0)               |      putname() {
>  0)               |        kmem_cache_free() {
>  0)               |          virt_to_head_page() {
>  0)      0.369 us |            constant_test_bit();
>  0)      1. 23 us |          }
>  0)      1.738 us |        }
>  0)      2.422 us |      }
>  0) +   60.560 us |    }
>  0) +   61.368 us |  }
> 
> and here's a sys_close():
> 
>  0)               |  sys_close() {
>  0)      0.540 us |    _spin_lock();
>  0)               |    filp_close() {
>  0)      0.437 us |      dnotify_flush();
>  0)      0.401 us |      locks_remove_posix();
>  0)      0.349 us |      fput();
>  0)      2.679 us |    }
>  0)      4.452 us |  }
> 
> i'd be surprised to see a flag to show up in that codepath. Eric, does 
> your testing confirm that?

On a socket/pipe, definitely not, because inode->i_sb->s_flags is not contended.

But on a shared inode, it might hurt:

offsetof(struct inode, i_count)=0x24
offsetof(struct inode, i_lock)=0x70
offsetof(struct inode, i_sb)=0x9c
offsetof(struct inode, i_writecount)=0x144

So i_sb probably sits in a contended cache line.

I wonder why i_writecount sits so far from i_count; that doesn't make sense.
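
As a rough check of the offsets above, the cache line each field falls in
can be computed directly. This is a minimal sketch assuming 64-byte cache
lines and a struct inode that starts on a cache-line boundary; neither is
stated in this message, and the helper names are made up for illustration.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines, struct inode aligned to a line. */
#define ASSUMED_LINE_SIZE 64

/* Index of the cache line (counted from the start of the struct)
 * that contains the byte at the given offset. */
static size_t cache_line_index(size_t offset)
{
	return offset / ASSUMED_LINE_SIZE;
}

/* Non-zero if two field offsets fall into the same cache line. */
static int same_cache_line(size_t a, size_t b)
{
	return cache_line_index(a) == cache_line_index(b);
}
```

Under those assumptions i_count (0x24), i_lock (0x70), i_sb (0x9c) and
i_writecount (0x144) land in lines 0, 1, 2 and 5 respectively, so whether
the line holding i_sb is contended depends on the neighbouring fields,
which are not listed here.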



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
  2008-11-28  9:26                             ` Al Viro
                                               ` (2 preceding siblings ...)
  (?)
@ 2008-11-28 22:37                             ` Eric Dumazet
  2008-11-28 22:43                               ` Eric Dumazet
  -1 siblings, 1 reply; 349+ messages in thread
From: Eric Dumazet @ 2008-11-28 22:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth,
	ink

Al Viro wrote:
> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>> refcounting on permanent system vfs.
>> Use this function for sockets, pipes, anonymous fds.
> 
> IMO that's pushing it past the point of usefulness; unless you can show
> that this really gives considerable win on pipes et.al. *AND* that it
> doesn't hurt other loads...

Well, if this is the last cache line that might be shared, then yes, numbers can talk.
But going from 10 contended cache lines to 1, instead of 0, is OK I guess.

> 
> dput() part: again, I want to see what happens on other loads; it's probably
> fine (and win is certainly more than from mntput() change), but...  The
> thing is, atomic_dec_and_lock() in there is often done on dentries with
> d_count > 1 and that's fairly cheap (and doesn't involve contention on
> dcache_lock on sane targets).
> 
> FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
> in a special way, I'd try to compare with

>         if (atomic_add_unless(&dentry->d_count, -1, 1))
>                 return;

I don't know, but *reading* d_count before trying to write it is expensive
on modern CPUs. Oprofile clearly shows that on Intel Core2.

Then, *testing* the flag before doing the atomic_something() has the same
problem. Or we should put flag in a different cache line.
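
To make the read-before-write point concrete, here is a user-space sketch
of what an "add unless" operation has to do. It uses C11 atomics rather
than the kernel's atomic_t, and add_unless() and dput_fast_path() are
illustrative names, not the kernel implementation: the value must be loaded
first, and only then can the compare-and-swap attempt the write.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Add 'a' to *v unless *v == u. Returns true if the add happened.
 * The initial load is the "read before write" discussed above. */
static bool add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Fast path in the spirit of the suggestion quoted above: drop one
 * reference unless it is the last one (count == 1). */
static bool dput_fast_path(atomic_int *d_count)
{
	return add_unless(d_count, -1, 1);
}
```

With a count of 3, two calls succeed and leave the count at 1; a third call
returns false and leaves the count untouched, which is where the slow path
under the lock would take over.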

I am lazy (time to sleep here); maybe we are already smart here and use a trick like this?

int atomic_read_with_write_intent(atomic_t *v)
{
        int val = 0;
	/*
	 * No LOCK prefix here, we only give a write intent hint to cpu
	 */
        asm volatile("xaddl %0, %1"
                     : "+r" (val), "+m" (v->counter)
                     : : "memory");
        return val;
}



> 	if (your flag)
> 		sod off to special
> 	spin_lock(&dcache_lock);
> 	if (atomic_dec_and_test(&dentry->d_count)) {
> 		spin_unlock(&dcache_lock);
> 		return;
> 	}
> 	the rest as usual
> 


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
  2008-11-28 22:37                             ` Eric Dumazet
@ 2008-11-28 22:43                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-28 22:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth,
	ink

Eric Dumazet wrote:
> Al Viro wrote:
>> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>>> refcounting on permanent system vfs.
>>> Use this function for sockets, pipes, anonymous fds.
>>
>> IMO that's pushing it past the point of usefulness; unless you can show
>> that this really gives considerable win on pipes et.al. *AND* that it
>> doesn't hurt other loads...
> 
> Well, if this is the last cache line that might be shared, then yes, 
> numbers can talk.
> But coming from 10 to 1 instead of 0 is OK I guess
> 
>>
>> dput() part: again, I want to see what happens on other loads; it's 
>> probably
>> fine (and win is certainly more than from mntput() change), but...  The
>> thing is, atomic_dec_and_lock() in there is often done on dentries with
>> d_count > 1 and that's fairly cheap (and doesn't involve contention on
>> dcache_lock on sane targets).
>>
>> FWIW, unless there's a really good reason to do alpha 
>> atomic_dec_and_lock()
>> in a special way, I'd try to compare with
> 
>>         if (atomic_add_unless(&dentry->d_count, -1, 1))
>>                 return;
> 
> I dont know, but *reading* d_count before trying to write it is expensive
> on modern cpus. Oprofile clearly show that on Intel Core2.
> 
> Then, *testing* the flag before doing the atomic_something() has the same
> problem. Or we should put flag in a different cache line.
> 
> I am lazy (time for a sleep here), maybe we are smart here and use a 
> trick like that already ?
> 
> int atomic_read_with_write_intent(atomic_t *v)
> {
>        int val = 0;
>     /*
>      * No LOCK prefix here, we only give a write intent hint to cpu
>      */
>        asm volatile("xaddl %0, %1"
>                     : "+r" (val), "+m" (v->counter)
>                     : : "memory");
>        return val;
> }

Forget it, it's wrong... I really need to sleep :)



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
  2008-11-28 18:47                               ` Peter Zijlstra
@ 2008-11-29  6:38                                 ` Christoph Hellwig
  -1 siblings, 0 replies; 349+ messages in thread
From: Christoph Hellwig @ 2008-11-29  6:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Eric Dumazet, David Miller, Rafael J. Wysocki,
	linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig

On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
> > Wow, that's incredibly impressive! :-)
> 
> Yeah, we got a similar speedup on -rt by pushing those super-block files
> list into per-cpu lists and doing crazy locking on them.
> 
> Of course avoiding them all together, like done here is a nicer option
> but is sadly not a possibility for regular files (until hch gets around
> to removing the need for the list).

We should have finished this long ago, thanks for the reminder.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
@ 2008-11-29  8:07                                   ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Ingo Molnar, David Miller, Rafael J. Wysocki,
	linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List,
	Christoph Lameter

Christoph Hellwig wrote:
> On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
>>> Wow, that's incredibly impressive! :-)
>> Yeah, we got a similar speedup on -rt by pushing those super-block files
>> list into per-cpu lists and doing crazy locking on them.
>>
>> Of course avoiding them all together, like done here is a nicer option
>> but is sadly not a possibility for regular files (until hch gets around
>> to removing the need for the list).
> 
> We should have finished this long ago, thanks for the reminder.
> 
> 

inode_in_use could be percpu, at least.

Or just zap it, since we never have to scan it.


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP
  2008-11-26 23:27                         ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
@ 2008-11-29  8:43                             ` Eric Dumazet
  2008-11-27  9:39                           ` Christoph Hellwig
                                               ` (7 subsequent siblings)
  8 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:43 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 2.9 seconds (2.3 seconds with SLUB tweaks))

Long version :

For this second version, I removed the mntput()/mntget() optimization
since most reviewers are not convinced it is useful.
This is a four-line patch that can be reconsidered later.

I chose the name SINGLE instead of SPECIAL to name
isolated dentries (for sockets, pipes, anonymous fd) that
have no parent and no relationship in the vfs.

Thanks all

To allocate a socket or a pipe we:

0) Do the usual file table manipulation (pretty scalable these days,
but it would be faster if 'struct files' used SLAB_DESTROY_BY_RCU
and avoided the call_rcu() cache killer)

1) allocate an inode with new_inode()
This function:
- locks inode_lock,
- dirties nr_inodes counter
- dirties inode_in_use list  (for sockets/pipes, this is useless)
- dirties superblock s_inodes.
- dirties last_ino counter
All these are in different cache lines, unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
inserts the dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root),
and dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount
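
Several of the steps above dirty one shared counter from every CPU. As an
illustration of the alternative, here is a minimal user-space sketch of a
sharded counter in which each CPU updates its own cache line. It is not the
kernel's percpu_counter; the 64-byte line size, the slot count and all
names are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size */
#define NR_SLOTS   8	/* assumed: one slot per CPU */

/* Each slot is aligned to its own cache line, so an update from one
 * CPU does not invalidate the line used by another CPU. */
struct counter_slot {
	_Alignas(CACHE_LINE) atomic_long value;
};

struct sharded_counter {
	struct counter_slot slot[NR_SLOTS];
};

/* Update only the slot belonging to 'cpu'. */
static void sharded_add(struct sharded_counter *c, unsigned int cpu, long delta)
{
	atomic_fetch_add_explicit(&c->slot[cpu % NR_SLOTS].value, delta,
				  memory_order_relaxed);
}

/* Reading the total touches every slot, so it is the slow operation. */
static long sharded_sum(struct sharded_counter *c)
{
	long sum = 0;

	for (size_t i = 0; i < NR_SLOTS; i++)
		sum += atomic_load_explicit(&c->slot[i].value,
					    memory_order_relaxed);
	return sum;
}
```

The trade-off is that updates become cheap and local while an exact read
has to walk all slots, which suits counters that are written far more often
than they are read.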


At close() time, we must undo all of this. It's even more expensive because
of _atomic_dec_and_lock(), which is stressed a lot, and because two cache
lines are touched when an element is deleted from a list
(previous and next items).

This is really bad, since sockets/pipes don't need to be visible in dcache
or an inode list per super block.

This patch series gets rid of all but one contended cache line for
sockets, pipes and anonymous fd  (signalfd, timerfd, ...)

Sample program:

for (i = 0; i < 1000000; i++)
	close(socket(AF_INET, SOCK_STREAM, 0));
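
The loop above can be wrapped into a self-contained helper along these
lines. This is a sketch only: the function name and the success counter are
additions for illustration, and the original two-line loop does not check
for errors.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Create and immediately close a TCP socket 'iterations' times.
 * Returns how many sockets were successfully created and closed. */
static long socket_open_close_loop(long iterations)
{
	long ok = 0;

	for (long i = 0; i < iterations; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;	/* creation failed, nothing to close */
		close(fd);
		ok++;
	}
	return ok;
}
```

Calling it with 1000000 corresponds to the loop quoted above; how the 8
concurrent copies were launched for the socket8 runs is not shown here.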

Cost if one CPU runs the program:

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on an 8-CPU machine
(benchmark named socket8):

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close


This patch series avoids all contended cache lines and makes this "bench"
pretty fast.


New cost if run on one CPU:

real    1.325s   (instead of 1.561s)
user    0.091s
sys     1.234s


If run on 8 CPUs:

real    0m2.971s
user    0m0.726s
sys     0m21.310s

CPU: Core 2, speed 3000.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
189772   189772        12.7205  12.7205    _atomic_dec_and_lock
140467   330239         9.4155  22.1360    __slab_free
128210   458449         8.5940  30.7300    add_partial
121578   580027         8.1494  38.8794    kmem_cache_alloc
72626    652653         4.8681  43.7475    init_file
62720    715373         4.2041  47.9517    __percpu_counter_add
51632    767005         3.4609  51.4126    sysenter_past_esp
49196    816201         3.2976  54.7102    tcp_close
47933    864134         3.2130  57.9231    kmem_cache_free
29628    893762         1.9860  59.9091    copy_from_user
28443    922205         1.9065  61.8157    init_timer
25602    947807         1.7161  63.5318    __slab_alloc
22139    969946         1.4840  65.0158    discard_slab
20428    990374         1.3693  66.3851    __call_rcu
18174    1008548        1.2182  67.6033    alloc_fd
17643    1026191        1.1826  68.7859    __fput
17374    1043565        1.1646  69.9505    d_alloc
17196    1060761        1.1527  71.1031    sys_close
17024    1077785        1.1411  72.2442    inet_create
15208    1092993        1.0194  73.2636    alloc_inode
12201    1105194        0.8178  74.0815    fd_install
12167    1117361        0.8156  74.8970    lock_sock_nested
12123    1129484        0.8126  75.7096    get_empty_filp
11648    1141132        0.7808  76.4904    release_sock
11509    1152641        0.7715  77.2619    dput
11335    1163976        0.7598  78.0216    sock_init_data
11038    1175014        0.7399  78.7615    inet_csk_destroy_sock
10880    1185894        0.7293  79.4908    drop_file_write_access
10083    1195977        0.6759  80.1667    inotify_d_instantiate
9216     1205193        0.6178  80.7844    local_bh_enable_ip
8881     1214074        0.5953  81.3797    sysenter_do_call
8759     1222833        0.5871  81.9668    setup_object
8489     1231322        0.5690  82.5359    iput_single

So we now hit mntput()/mntget() and SLUB.

SLUB is hit hard unless we boot with slub_min_order=3 (or
slub_min_objects=45), or apply Christoph Lameter's patch
(struct file RCU optimizations):
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot the machine with slub_min_order=3, the SLUB overhead disappears.

If run on 8 CPUS :

real    0m2.315s
user    0m0.752s
sys     0m17.324s

CPU: Core 2, speed 3000.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
199409   199409        15.6440  15.6440    _atomic_dec_and_lock    (mntput())
141606   341015        11.1092  26.7532    kmem_cache_alloc
76071    417086         5.9679  32.7211    init_file
70595    487681         5.5383  38.2595    __percpu_counter_add
51595    539276         4.0477  42.3072    sysenter_past_esp
49313    588589         3.8687  46.1759    tcp_close
45503    634092         3.5698  49.7457    kmem_cache_free
41413    675505         3.2489  52.9946    __slab_free
29911    705416         2.3466  55.3412    copy_from_user
28979    734395         2.2735  57.6146    init_timer
22251    756646         1.7456  59.3602    get_empty_filp
19942    776588         1.5645  60.9247    __call_rcu
18348    794936         1.4394  62.3642    __fput
18328    813264         1.4379  63.8020    alloc_fd
17395    830659         1.3647  65.1667    sys_close
17301    847960         1.3573  66.5240    d_alloc
16570    864530         1.2999  67.8239    inet_create
15522    880052         1.2177  69.0417    alloc_inode
13185    893237         1.0344  70.0761    setup_object
12359    905596         0.9696  71.0456    fd_install
12275    917871         0.9630  72.0086    lock_sock_nested
11924    929795         0.9355  72.9441    release_sock
11790    941585         0.9249  73.8690    sock_init_data
11310    952895         0.8873  74.7563    dput
10924    963819         0.8570  75.6133    drop_file_write_access
10903    974722         0.8554  76.4687    inet_csk_destroy_sock
10184    984906         0.7990  77.2676    inotify_d_instantiate
9372     994278         0.7353  78.0029    local_bh_enable_ip
8901     1003179        0.6983  78.7012    sysenter_do_call
8569     1011748        0.6723  79.3735    iput_single
8194     1019942        0.6428  80.0163    inet_release


This patch series contains 5 patches against the net-next-2.6 tree
(because that tree already contains network improvements on this
subject; it should apply to other trees as well)

[PATCH 1/5] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

(socket8 bench result : 27.5s to 25s)
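
As a rough illustration of why a per-cpu counter removes the ping pong,
here is a minimal single-threaded userspace sketch. It is not the kernel's
percpu_counter implementation (which also folds per-cpu deltas into a shared
count in batches); NR_CPUS_SKETCH, CACHE_LINE and the sketch_* names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 8   /* hypothetical CPU count for the sketch */
#define CACHE_LINE     64  /* assumed cache line size in bytes */

/* One slot per CPU, padded so that two CPUs never share a cache line. */
struct pcpu_slot {
	int64_t count;
	char pad[CACHE_LINE - sizeof(int64_t)];
};

struct sketch_counter {
	struct pcpu_slot slot[NR_CPUS_SKETCH];
};

/* Update only the caller's own slot: no lock, no shared cache line. */
static void sketch_counter_add(struct sketch_counter *c, int cpu, int64_t n)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	c->slot[cpu].count += n;
}

/*
 * Fold all slots into one value. A single slot can be negative (object
 * allocated on one CPU, freed on another), so only the sum is meaningful;
 * it is clamped at zero in the spirit of percpu_counter_sum_positive().
 */
static int64_t sketch_counter_sum_positive(const struct sketch_counter *c)
{
	int64_t sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += c->slot[cpu].count;
	return sum < 0 ? 0 : sum;
}
```

The design choice is that updates are cheap and local, while the rare reader
(here the sysctl handler) pays for walking every slot.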

[PATCH 2/5] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes no longer need inode_lock.

(socket8 bench result : no difference at this point)

[PATCH 3/5] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by giving each cpu a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of ino numbers as before.
(same wraparound after 2^32 allocations)

(socket8 bench result : no difference)
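
The batching idea can be sketched in userspace C as below. This is only an
illustration under stated assumptions, not the code of the patch: the
sketch_next_ino() helper, the fixed NR_CPUS_SKETCH array and the use of C11
atomics are all hypothetical stand-ins for the per_cpu machinery, and a
32-bit unsigned int is assumed for the wraparound.

```c
#include <assert.h>
#include <stdatomic.h>

#define INO_BATCH      1024u  /* numbers reserved per refill */
#define NR_CPUS_SKETCH 8      /* hypothetical CPU count for the sketch */

static atomic_uint shared_last_ino;                /* contended, touched rarely */
static unsigned int cpu_last_ino[NR_CPUS_SKETCH];  /* private cursor per CPU */

/* Return the next inode number for the given (hypothetical) CPU index. */
static unsigned int sketch_next_ino(int cpu)
{
	unsigned int res = cpu_last_ino[cpu];

	/*
	 * The private range is exhausted (or was never filled) when the
	 * cursor sits on a multiple of the batch size: reserve the next
	 * 1024 numbers from the shared counter in one atomic operation.
	 */
	if ((res & (INO_BATCH - 1)) == 0)
		res = atomic_fetch_add(&shared_last_ino, INO_BATCH);

	/* Common case: only the CPU-local cursor is written. */
	cpu_last_ino[cpu] = ++res;
	return res;
}
```

With this layout the shared cache line is written once per 1024 allocations
per CPU instead of on every allocation, and unsigned arithmetic keeps the
wraparound behaviour.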


[PATCH 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
   This avoids taking a reference on sb 'root' dentry, shared
   by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)
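
To make the dput() fast path more concrete, here is a hedged userspace
sketch of the "decrement, and take the lock only if the count may reach
zero" pattern next to a lock-free release. It is not the kernel's
_atomic_dec_and_lock() nor the dput_single() of this patch: the sketch_*
names, the atomic_flag spinlock standing in for dcache_lock, and the
acquisition counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag sketch_lock = ATOMIC_FLAG_INIT; /* stands in for dcache_lock */
static int sketch_lock_acquisitions;               /* instrumentation only */

static void sketch_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&sketch_lock))
		;  /* spin until the flag was clear */
	sketch_lock_acquisitions++;
}

static void sketch_lock_release(void)
{
	atomic_flag_clear(&sketch_lock);
}

/*
 * Drop one reference. Returns true, with the lock held, only when the
 * count reached zero; the caller then tears the object down and releases
 * the lock.
 */
static bool sketch_dec_and_lock(atomic_int *ref)
{
	int old = atomic_load(ref);

	/* Fast path: no lock while the count stays above zero. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(ref, &old, old - 1))
			return false;
	}

	/* Slow path: the count may reach zero, so serialize on the lock. */
	sketch_lock_acquire();
	if (atomic_fetch_sub(ref, 1) == 1)
		return true;
	sketch_lock_release();
	return false;
}

/* SINGLE-style release: nothing global to unlink, so no lock at all. */
static bool sketch_dec_single(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) == 1;
}
```

In this sketch an object whose count goes from 1 to 0 on every release
always lands in the slow path, which is presumably why the global lock
dominates a benchmark that creates and destroys one object per iteration.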

[PATCH 5/5] fs: new_inode_single() and iput_single()

The goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inode allocation/freeing.

SINGLE dentries are attached to inodes that don't need to be linked
into a list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we avoid
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines
in iput()

(socket8 bench result : from 19.9s to 2.3s)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
Overall diffstat :

 fs/anon_inodes.c       |   18 ------
 fs/dcache.c            |  100 ++++++++++++++++++++++++++++++--------
 fs/fs-writeback.c      |    2
 fs/inode.c             |  101 +++++++++++++++++++++++++++++++--------
 fs/pipe.c              |   25 +--------
 include/linux/dcache.h |    9 +++
 include/linux/fs.h     |   17 ++++++
 kernel/sysctl.c        |    6 +-
 mm/page-writeback.c    |    2
 net/socket.c           |   26 +---------
 10 files changed, 200 insertions(+), 106 deletions(-)


^ permalink raw reply	[flat|nested] 349+ messages in thread

* [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry
@ 2008-11-29  8:43                             ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:43 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 606 bytes --]

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

(socket8 bench result : 27.5s to 25s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/dcache.c        |   49 +++++++++++++++++++++++++------------------
 include/linux/fs.h |    2 +
 kernel/sysctl.c    |    2 -
 3 files changed, 32 insertions(+), 21 deletions(-)

[-- Attachment #2: nr_dentry.patch --]
[-- Type: text/plain, Size: 4891 bytes --]

diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..46d5d1e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
 static struct hlist_head *dentry_hashtable __read_mostly;
+static struct percpu_counter nr_dentry;
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void __d_free(struct dentry *dentry)
 {
 	WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
 }
 
 /*
- * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
  */
 static void d_free(struct dentry *dentry)
 {
@@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
 		__d_free(dentry);
 	else
 		call_rcu(&dentry->d_u.d_rcu, d_callback);
+	percpu_counter_dec(&nr_dentry);
 }
 
 /*
@@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	if (IS_ROOT(dentry))
@@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
 static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 {
 	struct dentry *parent;
-	unsigned detached = 0;
 
 	BUG_ON(!IS_ROOT(dentry));
 
@@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			}
 
 			list_del(&dentry->d_u.d_child);
-			detached++;
 
 			inode = dentry->d_inode;
 			if (inode) {
@@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			 * otherwise we ascend to the parent and move to the
 			 * next sibling if there is one */
 			if (!parent)
-				goto out;
+				return;
 
 			dentry = parent;
 
@@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 		dentry = list_entry(dentry->d_subdirs.next,
 				    struct dentry, d_u.d_child);
 	}
-out:
-	/* several dentries were freed, need to correct nr_dentry */
-	spin_lock(&dcache_lock);
-	dentry_stat.nr_dentry -= detached;
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
 	dentry->d_inode = NULL;
-	dentry->d_parent = NULL;
-	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
 	dentry->d_mounted = 0;
@@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	if (parent) {
 		dentry->d_parent = dget(parent);
 		dentry->d_sb = parent->d_sb;
+		spin_lock(&dcache_lock);
+		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+		spin_unlock(&dcache_lock);
 	} else {
+		dentry->d_parent = NULL;
+		dentry->d_sb = NULL;
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
-
-	spin_lock(&dcache_lock);
-	if (parent)
-		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-	dentry_stat.nr_dentry++;
-	spin_unlock(&dcache_lock);
-
+	percpu_counter_inc(&nr_dentry);
 	return dentry;
 }
 
@@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_dentry, 0);
 	/* 
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0dcdd94..c5e7aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata)
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9d048fa..eebddef 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1243,7 +1243,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &dentry_stat,
 		.maxlen		= 6*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_dentry,
 	},
 	{
 		.ctl_name	= FS_OVERFLOWUID,

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes
@ 2008-11-29  8:43                             ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:43 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 481 bytes --]

Avoids cache line ping-pongs between CPUs and prepares for the next patch,
because updates of nr_inodes no longer need inode_lock.

(socket8 bench result : no difference at this point)
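
For readers who want the idea in isolation, here is a minimal user-space
sketch of a per-CPU style counter. It is not the kernel's percpu_counter
implementation (which batches per-CPU deltas into a global count); every
name prefixed with sketch_ or SKETCH_ is hypothetical, and the CPU count
and cache line size are assumptions. It only illustrates why writers that
touch separate cache lines stop bouncing a shared line, while a reader
pays for a walk over all slots.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NR_CPUS 4	/* hypothetical number of CPUs */
#define SKETCH_CACHE_LINE 64	/* assumed cache line size in bytes */

/* One slot per CPU, aligned so that each slot sits on its own cache line. */
struct sketch_slot {
	_Alignas(SKETCH_CACHE_LINE) atomic_long count;
};

struct sketch_counter {
	struct sketch_slot slot[SKETCH_NR_CPUS];
};

static void sketch_counter_init(struct sketch_counter *c)
{
	for (int i = 0; i < SKETCH_NR_CPUS; i++)
		atomic_init(&c->slot[i].count, 0);
}

/* Writers only touch the slot of the CPU they run on. */
static void sketch_counter_add(struct sketch_counter *c, unsigned int cpu,
			       long delta)
{
	assert(cpu < SKETCH_NR_CPUS);
	atomic_fetch_add_explicit(&c->slot[cpu].count, delta,
				  memory_order_relaxed);
}

/*
 * Readers sum every slot. The result is approximate while writers are
 * active, and a transiently negative sum is clamped to zero.
 */
static long sketch_counter_sum_positive(struct sketch_counter *c)
{
	long sum = 0;

	for (int i = 0; i < SKETCH_NR_CPUS; i++)
		sum += atomic_load_explicit(&c->slot[i].count,
					    memory_order_relaxed);
	return sum < 0 ? 0 : sum;
}
```

The trade-off is the same one the patch accepts: increments and
decrements become cheap and lock-free, and the rare reader (here the
sysctl handler and the writeback heuristics) does the summing.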

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/fs-writeback.c   |    2 +-
 fs/inode.c          |   39 +++++++++++++++++++++++++++++++--------
 include/linux/fs.h  |    3 +++
 kernel/sysctl.c     |    4 ++--
 mm/page-writeback.c |    2 +-
 5 files changed, 38 insertions(+), 12 deletions(-)

[-- Attachment #2: nr_inodes.patch --]
[-- Type: text/plain, Size: 5626 bytes --]

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
 	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 
 	wbc.nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+			(get_nr_inodes() - inodes_stat.nr_unused) +
 			nr_dirty + nr_unstable;
 	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
 	sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..f94f889 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
  * Statistics gathering..
  */
 struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes;
 
 static struct kmem_cache * inode_cachep __read_mostly;
 
+int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
 		destroy_inode(inode);
 		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
+	percpu_counter_sub(&nr_inodes, nr_disposed);
 }
 
 /*
@@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)
 	
 	inode = alloc_inode(sb);
 	if (inode) {
+		percpu_counter_inc(&nr_inodes);
 		spin_lock(&inode_lock);
-		inodes_stat.nr_inodes++;
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
 		inode->i_ino = ++last_ino;
@@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
 			if (set(inode, data))
 				goto set_failed;
 
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 
 	security_inode_delete(inode);
 
@@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
@@ -1394,6 +1416,7 @@ void __init inode_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_inodes, 0);
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c5e7aa5..2482977 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);
 
 extern int leases_enable, lease_break_time;
 
@@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index eebddef..eebed01 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.ctl_name	= FS_STATINODE,
@@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
 	next_jif = start_jif + dirty_writeback_interval;
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(get_nr_inodes() - inodes_stat.nr_unused);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator
  2008-11-26 23:27                         ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
@ 2008-11-29  8:44                             ` Eric Dumazet
  2008-11-27  9:39                           ` Christoph Hellwig
                                               ` (7 subsequent siblings)
  8 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:44 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 505 bytes --]

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by giving each CPU a per_cpu variable that is
refilled from the shared last_ino only once every 1024 allocations.

This reduces contention on the shared last_ino and gives the same
spread of inode numbers as before
(same wraparound after 2^32 allocations).
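
The batching scheme can be tried outside the kernel with the following
sketch. It is an illustration under stated assumptions, not the patch
itself: a C11 thread-local variable stands in for the per-CPU variable,
the counter is unsigned so that wraparound is well defined, and all
sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_INO_BATCH 1024u	/* range size used by the patch */

static_assert((SKETCH_INO_BATCH & (SKETCH_INO_BATCH - 1)) == 0,
	      "batch size must be a power of two for the mask test below");

/* Shared counter: written only once per batch of allocations. */
static atomic_uint sketch_shared_last_ino;

/* Stand-in for the per-CPU variable: one private copy per thread. */
static _Thread_local unsigned int sketch_local_last_ino;

static unsigned int sketch_last_ino_get(void)
{
	unsigned int res = sketch_local_last_ino;

	/*
	 * A multiple of the batch size means the private range is used up
	 * (or was never filled), so reserve the next range. The old value
	 * returned by atomic_fetch_add() is the base of the new range.
	 */
	if ((res & (SKETCH_INO_BATCH - 1)) == 0)
		res = atomic_fetch_add(&sketch_shared_last_ino,
				       SKETCH_INO_BATCH);

	sketch_local_last_ino = ++res;
	return res;
}
```

With a single thread the numbers come out as 1, 2, 3, ... exactly as
with the old global counter; with several threads each one hands out
numbers from its own range of 1024, and the shared counter is touched
once per range instead of once per allocation.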

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)


[-- Attachment #2: last_ino.patch --]
[-- Type: text/plain, Size: 1511 bytes --]

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
 	return node ? inode : NULL;
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+
+static int last_ino_get(void)
+{
+	static atomic_t shared_last_ino;
+	int *p = &get_cpu_var(last_ino);
+	int res = *p;
+
+	if (unlikely((res & 1023) == 0))
+		res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+	*p = ++res;
+	put_cpu_var(last_ino);
+	return res;
+}
+#else
+static int last_ino_get(void)
+{
+	static int last_ino;
+
+	return ++last_ino;
+}
+#endif
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
 	struct inode * inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
+		inode->i_state = 0;
+		inode->i_ino = last_ino_get();
 		spin_lock(&inode_lock);
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
-		inode->i_ino = ++last_ino;
-		inode->i_state = 0;
 		spin_unlock(&inode_lock);
 	}
 	return inode;

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
@ 2008-11-29  8:44                             ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:44 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1602 bytes --]


Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way to look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is heavily contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fds), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
function, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are:

1) A SINGLE dentry has the DCACHE_SINGLE flag.
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED).
   This avoids taking a reference on the sb 'root' dentry, which is
   shared by too many dentries.
3) It is not hashed into the global hash table (DCACHE_UNHASHED).
4) Its d_alias list is empty.

(socket8 bench result : from 25s to 19.9s)
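
The fast path described above can be pictured with a small user-space
sketch. It is single-threaded and purely illustrative: the struct, the
sketch_ names and the counter that stands in for taking dcache_lock are
all hypothetical, and the real dput() does considerably more work on its
slow path. The point is only that an object that is never reachable
through a shared structure can be released with a plain atomic
decrement-and-test.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_obj {
	atomic_int refcount;
	bool single;		/* stands in for DCACHE_SINGLE */
};

/* Counts how often the (imaginary) global lock would have been taken. */
static int sketch_lock_acquisitions;

static struct sketch_obj *sketch_alloc(bool single)
{
	struct sketch_obj *obj = malloc(sizeof(*obj));

	if (obj) {
		atomic_init(&obj->refcount, 1);
		obj->single = single;
	}
	return obj;
}

static void sketch_get(struct sketch_obj *obj)
{
	atomic_fetch_add(&obj->refcount, 1);
}

/* Drop one reference; returns true when the object was freed. */
static bool sketch_put(struct sketch_obj *obj)
{
	assert(obj != NULL);

	/* Decrement and test: only the last reference continues. */
	if (atomic_fetch_sub(&obj->refcount, 1) != 1)
		return false;

	/*
	 * An ordinary object sits on shared lists, so its teardown must be
	 * serialized by the global lock. A "single" object is on no shared
	 * list and skips the lock entirely.
	 */
	if (!obj->single)
		sketch_lock_acquisitions++;

	free(obj);
	return true;
}
```

Since a socket or pipe dentry is usually dropped by its only holder,
nearly every close of such a file would otherwise reach the locked
path, which is consistent with the benchmark improvement quoted above.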

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

[-- Attachment #2: dcache_single.patch --]
[-- Type: text/plain, Size: 7886 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			     mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
 static struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
 };
 
 /**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_single(&this, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index 46d5d1e..35d4a25 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
  */
 
 /*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	if (!atomic_dec_and_test(&dentry->d_count))
+		return;
+	inode = dentry->d_inode;
+	if (inode)
+		iput(inode);
+	d_free(dentry);
+}
+
+/*
  * dput - release a dentry
  * @dentry: dentry to release 
  *
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
+	/*
+	 * single dentries (sockets/pipes/anon) fast path
+	 */
+	if (dentry->d_flags & DCACHE_SINGLE)
+		return dput_single(dentry);
 
 repeat:
 	if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate a SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+	struct dentry *entry;
+
+	entry = d_alloc(NULL, name);
+	if (entry) {
+		entry->d_sb = inode->i_sb;
+		entry->d_parent = entry;
+		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+		entry->d_inode = inode;
+		fsnotify_d_instantiate(entry, inode);
+		security_d_instantiate(entry, inode);
+	}
+	return entry;
+}
+
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
 
 /*
  * pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };
 
@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
 #define DCACHE_UNHASHED		0x0010	
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE		0x0040
+	/*
+	 * socket, pipe or anonymous fd dentry
+	 * - SINGLE dentries have themselves as a parent.
+	 * - SINGLE dentries are not hashed into global hash table
+	 * - Their d_alias list is empty
+	 * - They don't need dcache_lock synchronization
+	 */
 
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..231cd66 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -307,18 +307,6 @@ static struct file_system_type sock_fs_type = {
 	.kill_sb =	kill_anon_super,
 };
 
-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -329,7 +317,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname  = sockfs_dname,
 };
 
@@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
@ 2008-11-29  8:44                             ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:44 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1628 bytes --]


Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kind of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are:

1) A SINGLE dentry has the DCACHE_SINGLE flag set.
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED).
   This avoids taking a reference on the sb 'root' dentry, which is
   shared by too many dentries.
3) It is not hashed into the global hash table (DCACHE_UNHASHED).
4) Its d_alias list is empty.

(socket8 bench result : from 25s to 19.9s)
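
The fast path described above can be illustrated with a small userspace
model. This is only a sketch, not the kernel code: the model_* names, the
lock counter and the flag values are assumptions made for the illustration,
and freeing is reduced to free(). It shows a self-parented, unhashed dentry
whose last reference is dropped with a plain atomic decrement, while the
ordinary path is the only one that touches the (modelled) global lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Flag values are illustrative; only the names mirror the patch. */
#define MODEL_DCACHE_DISCONNECTED 0x0004u
#define MODEL_DCACHE_UNHASHED     0x0010u
#define MODEL_DCACHE_SINGLE       0x0040u

struct model_inode {
	atomic_int i_count;
};

struct model_dentry {
	atomic_int d_count;
	unsigned int d_flags;
	struct model_dentry *d_parent;
	struct model_inode *d_inode;
};

/* Counts how often the modelled global dcache lock would be taken. */
static int model_dcache_lock_taken;

/* Allocate a SINGLE dentry: self-parented, unhashed, inode attached. */
static struct model_dentry *model_d_alloc_single(struct model_inode *inode)
{
	struct model_dentry *dentry = calloc(1, sizeof(*dentry));

	if (!dentry)
		return NULL;
	atomic_init(&dentry->d_count, 1);
	dentry->d_parent = dentry;
	dentry->d_flags = MODEL_DCACHE_SINGLE | MODEL_DCACHE_DISCONNECTED |
			  MODEL_DCACHE_UNHASHED;
	dentry->d_inode = inode;
	return dentry;
}

/* Returns true when the last reference was dropped and the dentry freed. */
static bool model_dput(struct model_dentry *dentry)
{
	if (!dentry)
		return false;
	assert(atomic_load(&dentry->d_count) > 0);

	if (dentry->d_flags & MODEL_DCACHE_SINGLE) {
		/* Fast path: atomic decrement only, no global lock. */
		if (atomic_fetch_sub(&dentry->d_count, 1) != 1)
			return false;
		if (dentry->d_inode)
			atomic_fetch_sub(&dentry->d_inode->i_count, 1);
		free(dentry);
		return true;
	}

	/* Slow path stand-in: the regular dput() would take dcache_lock. */
	model_dcache_lock_taken++;
	if (atomic_fetch_sub(&dentry->d_count, 1) != 1)
		return false;
	free(dentry);
	return true;
}
```

In this model, dropping both references of a SINGLE dentry leaves
model_dcache_lock_taken at zero, which is the property the patch is after.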

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

[-- Attachment #2: dcache_single.patch --]
[-- Type: text/plain, Size: 7886 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			     mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
 static struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
 };
 
 /**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_single(&this, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index 46d5d1e..35d4a25 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
  */
 
 /*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	if (!atomic_dec_and_test(&dentry->d_count))
+		return;
+	inode = dentry->d_inode;
+	if (inode)
+		iput(inode);
+	d_free(dentry);
+}
+
+/*
  * dput - release a dentry
  * @dentry: dentry to release 
  *
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
+	/*
+	 * single dentries (sockets/pipes/anon) fast path
+	 */
+	if (dentry->d_flags & DCACHE_SINGLE)
+		return dput_single(dentry);
 
 repeat:
 	if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+	struct dentry *entry;
+
+	entry = d_alloc(NULL, name);
+	if (entry) {
+		entry->d_sb = inode->i_sb;
+		entry->d_parent = entry;
+		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+		entry->d_inode = inode;
+		fsnotify_d_instantiate(entry, inode);
+		security_d_instantiate(entry, inode);
+	}
+	return entry;
+}
+
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
 
 /*
  * pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };
 
@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
 #define DCACHE_UNHASHED		0x0010	
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE		0x0040
+	/*
+	 * socket, pipe or anonymous fd dentry
+	 * - SINGLE dentries have themselves as a parent.
+	 * - SINGLE dentries are not hashed into global hash table
+	 * - Their d_alias list is empty
+	 * - They dont need dcache_lock synchronization
+	 */
 
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..231cd66 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -307,18 +307,6 @@ static struct file_system_type sock_fs_type = {
 	.kill_sb =	kill_anon_super,
 };
 
-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -329,7 +317,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname  = sockfs_dname,
 };
 
@@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v2 5/5] fs: new_inode_single() and iput_single()
@ 2008-11-29  8:45                             ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29  8:45 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 905 bytes --]

The goal of this patch is to avoid touching inode_lock when allocating
and freeing socket/pipe/anonfd inodes.

SINGLE dentries are attached to inodes that don't need to be linked
into any list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we can avoid
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines
in iput().

(socket8 bench result : from 19.9s to 2.3s)
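
The allocation-side idea can be sketched in userspace as follows. This is
an illustrative model only: the sketch_* names, the lock counter and the
hand-rolled list are assumptions, standing in for inode_lock,
inode_in_use and the kernel's list_head. A "single" inode gets a
self-linked list head and never touches the global list or its lock; a
regular inode is linked under the (modelled) lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, in the style of list_head. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *head)
{
	head->next = head;
	head->prev = head;
}

static void sketch_list_add(struct sketch_list *entry, struct sketch_list *head)
{
	assert(entry != head);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static bool sketch_list_empty(const struct sketch_list *head)
{
	return head->next == head;
}

struct sketch_inode {
	struct sketch_list i_list;
};

/* Stand-ins for the global inode_in_use list and for inode_lock. */
static struct sketch_list sketch_inode_in_use = {
	&sketch_inode_in_use, &sketch_inode_in_use
};
static int sketch_inode_lock_taken;

static struct sketch_inode *sketch_new_inode(bool single)
{
	struct sketch_inode *inode = malloc(sizeof(*inode));

	if (!inode)
		return NULL;
	if (single) {
		/* Self-linked head: valid for list helpers, on no global list. */
		sketch_list_init(&inode->i_list);
	} else {
		sketch_inode_lock_taken++;	/* models spin_lock(&inode_lock) */
		sketch_list_add(&inode->i_list, &sketch_inode_in_use);
	}
	return inode;
}
```

Initialising the list heads rather than leaving them untouched keeps any
later list operation on such an inode harmless, which is presumably why
the patch uses INIT_LIST_HEAD() in the single case.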

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c   |    2 +-
 fs/dcache.c        |    2 +-
 fs/inode.c         |   29 ++++++++++++++++++++---------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |   12 +++++++++++-
 net/socket.c       |    2 +-
 6 files changed, 35 insertions(+), 14 deletions(-)

[-- Attachment #2: new_inode_single.patch --]
[-- Type: text/plain, Size: 4080 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 8bf83cb..89fd36d 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/dcache.c b/fs/dcache.c
index 35d4a25..3aa9ed5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
 		return;
 	inode = dentry->d_inode;
 	if (inode)
-		iput(inode);
+		iput_single(inode);
 	d_free(dentry);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index dc8e72a..0fdfe1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
 		kmem_cache_free(inode_cachep, (inode));
 }
 
+void iput_single(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_count)) {
+		destroy_inode(inode);
+		percpu_counter_dec(&nr_inodes);
+	}
+}
 
 /*
  * These are initializations that only need to be done
@@ -587,8 +594,9 @@ static int last_ino_get(void)
 #endif
 
 /**
- *	new_inode 	- obtain an inode
+ *	__new_inode 	- obtain an inode
  *	@sb: superblock
+ *  @single: if true, dont link new inode in a list
  *
  *	Allocates a new inode for given superblock. The default gfp_mask
  *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
@@ -598,7 +606,7 @@ static int last_ino_get(void)
  *	newly created inode's mapping
  *
  */
-struct inode *new_inode(struct super_block *sb)
+struct inode *__new_inode(struct super_block *sb, int single)
 {
 	/*
 	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
@@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
 	 */
 	struct inode * inode;
 
-	spin_lock_prefetch(&inode_lock);
-	
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
 		inode->i_state = 0;
 		inode->i_ino = last_ino_get();
-		spin_lock(&inode_lock);
-		list_add(&inode->i_list, &inode_in_use);
-		list_add(&inode->i_sb_list, &sb->s_inodes);
-		spin_unlock(&inode_lock);
+ 		if (single) {
+  			INIT_LIST_HEAD(&inode->i_list);
+  			INIT_LIST_HEAD(&inode->i_sb_list);
+ 		} else {
+			spin_lock(&inode_lock);
+			list_add(&inode->i_list, &inode_in_use);
+			list_add(&inode->i_sb_list, &sb->s_inodes);
+			spin_unlock(&inode_lock);
+		}
 	}
 	return inode;
 }
 
-EXPORT_SYMBOL(new_inode);
+EXPORT_SYMBOL(__new_inode);
 
 void unlock_new_inode(struct inode *inode)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 4de6dd5..8c51a0d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2482977..b3daffc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1898,7 +1898,17 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
 extern void destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *, int);
+static inline struct inode *new_inode(struct super_block *sb)
+{
+	return __new_inode(sb, 0);
+}
+static inline struct inode *new_inode_single(struct super_block *sb)
+{
+	return __new_inode(sb, 1);
+}
+extern void iput_single(struct inode *);
+
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
diff --git a/net/socket.c b/net/socket.c
index 231cd66..f1e656c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -463,7 +463,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = new_inode_single(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
@ 2008-11-29 10:38                               ` Jörn Engel
  0 siblings, 0 replies; 349+ messages in thread
From: Jörn Engel @ 2008-11-29 10:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> +	struct dentry *entry;
> +
> +	entry = d_alloc(NULL, name);
> +	if (entry) {
> +		entry->d_sb = inode->i_sb;
> +		entry->d_parent = entry;
> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> +		entry->d_inode = inode;
> +		fsnotify_d_instantiate(entry, inode);
> +		security_d_instantiate(entry, inode);
> +	}
> +	return entry;

Calling the struct dentry "entry" had me a bit confused.  I believe
everyone else (including the code you removed) uses "dentry".

> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>  	struct inode *inode;
>  	struct file *f;
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
>  
>  	err = -ENFILE;
>  	inode = get_pipe_inode();
...
> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>  {
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };

These two could even be combined.

And of course I realize that I comment on absolute trivialities.  On the
whole, I couldn't spot a real problem in your patches.

Jörn

-- 
Public Domain  - Free as in Beer
General Public - Free as in Speech
BSD License    - Free as in Enterprise
Shared Source  - Free as in "Work will make you..."

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
@ 2008-11-29 11:14                                 ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-11-29 11:14 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

Jörn Engel wrote:
> On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
>> +{
>> +	struct dentry *entry;
>> +
>> +	entry = d_alloc(NULL, name);
>> +	if (entry) {
>> +		entry->d_sb = inode->i_sb;
>> +		entry->d_parent = entry;
>> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
>> +		entry->d_inode = inode;
>> +		fsnotify_d_instantiate(entry, inode);
>> +		security_d_instantiate(entry, inode);
>> +	}
>> +	return entry;
> 
> Calling the struct dentry "entry" had me a bit confused.  I believe
> everyone else (including the code you removed) uses "dentry".

Ah yes, it seems I took it from d_instantiate(). I guess a cleanup
patch would be nice.

> 
>> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>>  	struct inode *inode;
>>  	struct file *f;
>>  	struct dentry *dentry;
>> -	struct qstr name = { .name = "" };
>> +	static const struct qstr name = { .name = "" };
>>  
>>  	err = -ENFILE;
>>  	inode = get_pipe_inode();
> ...
>> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>>  {
>>  	struct dentry *dentry;
>> -	struct qstr name = { .name = "" };
>> +	static const struct qstr name = { .name = "" };
> 
> These two could even be combined.
> 
> And of course I realize that I comment on absolute trivialities.  On the
> whole, I couldn't spot a real problem in your patches.

Well, at least you reviewed it, that's the important point!

Thanks Jörn


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v2 5/5] fs: new_inode_single() and iput_single()
  2008-11-29  8:45                             ` Eric Dumazet
  (?)
@ 2008-11-29 11:14                               ` Jörn Engel
  -1 siblings, 0 replies; 349+ messages in thread
From: Jörn Engel @ 2008-11-29 11:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Sat, 29 November 2008 09:45:09 +0100, Eric Dumazet wrote:
>  
> +void iput_single(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_count)) {
> +		destroy_inode(inode);
> +		percpu_counter_dec(&nr_inodes);
> +	}
> +}

I wonder if it is possible to avoid the atomic_dec_and_test() here, at
least in the common case, and combine it with the atomic_dec_and_test()
of the dentry.  A quick look at fs/inode.c indicates that inode->i_count
may never get changed for a SINGLE inode, except during creation or
deletion.

It might be worthwhile to
- remove the conditional from iput_single() and measure whether it makes a
  difference,
- poison SINGLE inodes with some value and
- put a BUG_ON() in __iget() that checks for the poison value.

I _think_ the BUG_ON() is unnecessary, but at least my brain is not
sufficient to convince me.  Can inotify somehow get a hold of a socket?
Or dquot (how insane would that be?)
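
The three steps listed above could be sketched roughly as follows. This is a
userspace illustration only: the struct, the poison value and every sketch_*
name are hypothetical, assert() stands in for BUG_ON(), and nothing here is
taken from fs/inode.c.

```c
#include <assert.h>

#define SKETCH_SINGLE_POISON 0x5a5a5a5a	/* hypothetical poison value */

struct sketch_inode {
	int i_count;	/* plain int here; the kernel field is an atomic_t */
	int destroyed;	/* set when the inode would have been destroyed */
};

/* Creation: a SINGLE inode never gets a real refcount, only the poison. */
static void sketch_new_inode_single(struct sketch_inode *inode)
{
	inode->i_count = SKETCH_SINGLE_POISON;
	inode->destroyed = 0;
}

/* Stand-in for __iget(): trips if anyone tries to grab a SINGLE inode. */
static void sketch_iget(struct sketch_inode *inode)
{
	assert(inode->i_count != SKETCH_SINGLE_POISON);	/* the proposed BUG_ON() */
	inode->i_count++;
}

/*
 * Unconditional variant of iput_single(): the dentry refcount already
 * decided that we are the last user, so no atomic_dec_and_test() here.
 */
static void sketch_iput_single(struct sketch_inode *inode)
{
	inode->destroyed = 1;
}
```

Whether the assertion in sketch_iget() could ever fire is exactly the open
question about inotify and dquot.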

Jörn

-- 
Mac is for working,
Linux is for Networking,
Windows is for Solitaire!
-- stolen from dc

^ permalink raw reply	[flat|nested] 349+ messages in thread

* [PATCH v3 0/7] fs: Scalability of sockets/pipes allocation/deallocation on SMP
  2008-11-29  8:43                             ` Eric Dumazet
@ 2008-12-11 22:38                               ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Hi Andrew

Since v2 of this patch series got no new feedback, maybe it's time for mm
inclusion for a while?

In this third version I added the last two patches, one initially from Christoph
Lameter, and one to avoid dirtying mnt->mnt_count on hardwired fs.

Many thanks to Christoph and Paul for this SLAB_DESTROY_BY_RCU work done
on "struct file".

Thank you

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(from 27.5 seconds to 1.62 s, on an 8-CPU machine)

Long version :

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct file' were using SLAB_DESTROY_BY_RCU
and avoided the call_rcu() cache killer). This point is addressed by the 6th
patch.

1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties nr_inodes counter
- dirties inode_in_use list  (for sockets/pipes, this is useless)
- dirties superblock s_inodes.
- dirties last_ino counter
All these are in different cache lines, unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
inserts the dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root),
dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo all of this. It's even more expensive because
of the _atomic_dec_and_lock() that is heavily stressed, and because of two cache
lines that are touched when an element is deleted from a list
(previous and next items)

This is really bad, since sockets/pipes don't need to be visible in dcache
or an inode list per super block.

This patch series gets rid of all but one contended cache line for
sockets, pipes and anonymous fd  (signalfd, timerfd, ...)

socketallocbench is a very simple program (attached to this mail) that makes
a loop :

for (i = 0; i < 1000000; i++)
    close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on a 8 CPU machine
(socketallocbench -n 8) :

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close


This patch series avoids all contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real    1.245s (instead of 1.561s)
user    0.074s
sys     1.161s


If run on 8 CPUS :

real    1.624s
user    0.580s
sys     12.296s


On oprofile, we finally can see network stuff coming at the front of
expensive stuff. (with the exception of kmem_cache_[z]alloc(), because
it has to clear 192 bytes of file structures, this takes half of the time)

CPU: Core 2, speed 3000.09 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
176586   176586        10.9376  10.9376    kmem_cache_alloc
169838   346424        10.5196  21.4572    tcp_close
105331   451755         6.5241  27.9813    tcp_v4_init_sock
105146   556901         6.5126  34.4939    tcp_v4_destroy_sock
83307    640208         5.1600  39.6539    sysenter_past_esp
80241    720449         4.9701  44.6239    inet_csk_destroy_sock
74263    794712         4.5998  49.2237    kmem_cache_free
56806    851518         3.5185  52.7422    __percpu_counter_add
48619    900137         3.0114  55.7536    copy_from_user
44803    944940         2.7751  58.5287    init_timer
28539    973479         1.7677  60.2964    d_alloc
27795    1001274        1.7216  62.0180    alloc_fd
26747    1028021        1.6567  63.6747    __fput
24312    1052333        1.5059  65.1805    sys_close
24205    1076538        1.4992  66.6798    inet_create
22409    1098947        1.3880  68.0677    alloc_inode
21359    1120306        1.3230  69.3907    release_sock
19865    1140171        1.2304  70.6211    fd_install
19472    1159643        1.2061  71.8272    lock_sock_nested
18956    1178599        1.1741  73.0013    sock_init_data
17301    1195900        1.0716  74.0729    drop_file_write_access
17113    1213013        1.0600  75.1329    inotify_d_instantiate
16384    1229397        1.0148  76.1477    dput
15173    1244570        0.9398  77.0875    local_bh_enable_ip
15017    1259587        0.9301  78.0176    local_bh_enable
13354    1272941        0.8271  78.8448    __sock_create
13139    1286080        0.8138  79.6586    inet_release
13062    1299142        0.8090  80.4676    sysenter_do_call
11935    1311077        0.7392  81.2069    iput_single


This patch series contains 7 patches, against linux-2.6 tree,
plus one patch in mm (fs: filp_cachep can be static in fs/file_table.c)

[PATCH 1/7] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

("socketallocbench -n 8" bench result : 27.5s to 25s)
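
To illustrate why a per-cpu counter avoids the ping pong, here is a minimal
userspace sketch. It is not the kernel's percpu_counter implementation (which
also keeps a shared, batched count): NR_SLOTS, CACHE_LINE and all sketch_*
names are assumptions made for the example. Each "cpu" only ever writes its
own padded slot, and the rare reader (the sysctl handler in the patch) walks
all slots.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SLOTS   8	/* stand-in for the number of CPUs (assumption) */
#define CACHE_LINE 64	/* assumed cache line size in bytes */

/* One counter per slot, padded so two slots never share a cache line. */
struct slot_counter {
	int64_t count;
	char pad[CACHE_LINE - sizeof(int64_t)];
};

struct sketch_percpu_counter {
	struct slot_counter slots[NR_SLOTS];
};

/* Fast path: touch only the caller's own slot, no shared cache line. */
static void sketch_counter_add(struct sketch_percpu_counter *c, int slot,
			       int64_t delta)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	c->slots[slot].count += delta;
}

/* Slow path: sum every slot; clamp at 0 as a "sum_positive" read would. */
static int64_t sketch_counter_sum_positive(const struct sketch_percpu_counter *c)
{
	int64_t sum = 0;
	int i;

	for (i = 0; i < NR_SLOTS; i++)
		sum += c->slots[i].count;
	return sum < 0 ? 0 : sum;
}
```

A single slot can go negative (an object allocated on one cpu and freed on
another), which is why only the sum is meaningful.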

[PATCH 2/7] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

("socketallocbench -n 8" bench result : no difference at this point)

[PATCH 3/7] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by giving each cpu a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of ino numbers as before.
(same wraparound after 2^32 allocations)

("socketallocbench -n 8" result : no difference)
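
The batching idea can be sketched in userspace as below. This is an
illustration under stated assumptions, not the code from the patch: slots
stand in for cpus, the shared add would have to be atomic (or otherwise
protected) in real code, and the sketch_* and slot_* names are hypothetical.
Only the batch size of 1024 comes from the description above.

```c
#include <assert.h>

#define INO_BATCH 1024u	/* batch size named in the patch description */
#define NR_SLOTS  8	/* stand-in for the number of CPUs (assumption) */

static unsigned int shared_last_ino;		/* contended, touched once per batch */
static unsigned int slot_last_ino[NR_SLOTS];	/* private to each slot */

static unsigned int sketch_next_ino(int slot)
{
	unsigned int res;

	assert(slot >= 0 && slot < NR_SLOTS);
	res = slot_last_ino[slot];
	if ((res & (INO_BATCH - 1)) == 0) {
		/* Local range used up (or never filled): reserve a new batch. */
		shared_last_ino += INO_BATCH;	/* an atomic add in real code */
		res = shared_last_ino - INO_BATCH;
	}
	slot_last_ino[slot] = ++res;
	return res;
}
```

With a 32-bit unsigned int the numbers still wrap after 2^32 allocations,
and the shared variable is written once per 1024 allocations instead of on
every one.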


[PATCH 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This is to avoid taking a reference on sb 'root' dentry, shared
by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)
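
The dput() fast path described above can be sketched like this. It is a
userspace model only: the flag value, the struct layout and the
sketch_lock_taken counter (which stands in for taking dcache_lock) are
assumptions for illustration, and C11 atomics replace the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_DCACHE_SINGLE 0x1u	/* hypothetical flag value */

struct sketch_dentry {
	unsigned int d_flags;
	atomic_int d_count;
	struct sketch_dentry *d_parent;
	int freed;
};

static int sketch_lock_taken;	/* counts slow-path "dcache_lock" acquisitions */

/* A SINGLE dentry is its own parent, so no reference on a shared root. */
static void sketch_d_init_single(struct sketch_dentry *d)
{
	d->d_flags = SKETCH_DCACHE_SINGLE;
	atomic_init(&d->d_count, 1);
	d->d_parent = d;
	d->freed = 0;
}

static void sketch_dput(struct sketch_dentry *d)
{
	if (d->d_flags & SKETCH_DCACHE_SINGLE) {
		/* Fast path: plain atomic decrement, no global lock. */
		if (atomic_fetch_sub(&d->d_count, 1) == 1)
			d->freed = 1;
		return;
	}
	/* Slow path: models atomic_dec_and_lock() on the global lock. */
	sketch_lock_taken++;
	if (atomic_fetch_sub(&d->d_count, 1) == 1)
		d->freed = 1;
}
```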

[PATCH 5/7] fs: new_inode_single() and iput_single()

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.

SINGLE dentries are attached to inodes that don't need to be linked
in a list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we avoid taking it
as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines
in iput()

("socketallocbench -n 8" result : from 19.9s to 3.01s)


[PATCH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

From: Christoph Lameter <cl@linux-foundation.org>

Currently we schedule RCU frees for each file we free separately. That has
several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
did not require RCU callbacks:

1. Excessive number of RCU callbacks can be generated causing long RCU
  queues that in turn cause long latencies. We hit SLUB page allocation
  more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
  followed by another open is very fast with the RCUless approach because
  the last freed object is returned by the slab allocator that is
  still cache hot. RCU free means that the object is not immediately
  available again. The new object is cache cold and therefore open/close
  performance tests show a significant degradation with the RCU
  implementation.

One solution to this problem is to move the RCU freeing into the Slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).


Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

("socketallocbench -n 8" result : from 3.01s to 2.20s)
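
The "take a reference, then verify" rule from the description above can be
sketched in userspace with C11 atomics. This is not the code of the patch:
the slot pointer stands in for an fd table entry, rcu_read_lock() is omitted,
and all sketch_* names are hypothetical. The point is only the order of
operations: pin the object first, then check that the slot still points at it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_file {
	atomic_long refcount;	/* 0 means "freed, may be reused at any time" */
};

/* Take a reference only if the object is still live (refcount != 0). */
static int sketch_get_unless_zero(struct sketch_file *f)
{
	long old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return 1;
	}
	return 0;
}

static void sketch_put(struct sketch_file *f)
{
	atomic_fetch_sub(&f->refcount, 1);
}

/* Look up the file in *slot; the memory may be recycled while we look. */
static struct sketch_file *sketch_fget(struct sketch_file *_Atomic *slot)
{
	for (;;) {
		struct sketch_file *f = atomic_load(slot);

		if (f == NULL)
			return NULL;
		if (!sketch_get_unless_zero(f))
			return NULL;	/* object is on its way out */
		if (atomic_load(slot) == f)
			return f;	/* same object: the reference is valid */
		sketch_put(f);		/* recycled under us: drop it and retry */
	}
}
```

Without the second load of the slot, a reader could end up holding a
reference on an object that was freed and reallocated for a different file.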

[PATCH 7/7] fs: MS_NOREFCOUNT

Some fs are hardwired into kernel, and mntput()/mntget() hit a contended
cache line. We define a new superblock flag, MS_NOREFCOUNT, that is set
on socket, pipes and anonymous fd superblocks. mntput()/mntget() become
null ops on these fs.

("socketallocbench -n 8" result : from 2.20s to 1.64s)
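
A rough userspace model of the idea, assuming a hypothetical flag value and
struct layout (only the MS_NOREFCOUNT name comes from the patch): when the
flag is set, get/put return without touching the shared counter, so the
contended cache line is never written.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_MS_NOREFCOUNT 0x1u	/* hypothetical bit, chosen for illustration */

struct sketch_mount {
	unsigned int sb_flags;	/* stands in for the superblock flags */
	atomic_int count;	/* stands in for mnt->mnt_count */
};

static struct sketch_mount *sketch_mntget(struct sketch_mount *m)
{
	/* Hardwired fs: skip the atomic op, the mount can never go away. */
	if (m && !(m->sb_flags & SKETCH_MS_NOREFCOUNT))
		atomic_fetch_add(&m->count, 1);
	return m;
}

static void sketch_mntput(struct sketch_mount *m)
{
	if (m && !(m->sb_flags & SKETCH_MS_NOREFCOUNT))
		atomic_fetch_sub(&m->count, 1);
}
```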

cat socketallocbench.c
/*
 * socketallocbench benchmark
 *
 * Usage : socketallocbench [-n procs]  [-l loops]
 */
#include <sys/socket.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/wait.h>

void dowork(int loops)
{
        int i;

        for (i = 0; i < loops; i++)
                close(socket(AF_INET, SOCK_STREAM, 0));
}

int main(int argc, char *argv[])
{
        int i;
        int n = 1;
        int loops = 1000000;
        pid_t *pidtable;

        while ((i = getopt(argc, argv, "n:l:")) != -1) {
                if (i == 'n')
                        n = atoi(optarg);
                if (i == 'l')
                        loops = atoi(optarg);
        }
        pidtable = malloc(n * sizeof(pid_t));
        for (i = 1; i < n; i++) {
                pidtable[i] = fork();
                if (pidtable[i] == 0) {
                        dowork(loops);
                        _exit(0);
                }
                if (pidtable[i] == -1) {
                        perror("fork");
                        n = i;
                        break;
                }
        }
        dowork(loops);
        for (i = 1; i < n; i++) {
                int status;

                wait(&status);
        }
        return 0;
}

^ permalink raw reply	[flat|nested] 349+ messages in thread

* [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry
  2008-11-29  8:43                             ` Eric Dumazet
@ 2008-12-11 22:38                               ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

("socketallocbench -n8" result : 27.5s to 25s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/dcache.c        |   49 +++++++++++++++++++++++++------------------
 include/linux/fs.h |    2 +
 kernel/sysctl.c    |    2 -
 3 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fa1ba03..f463a81 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
 static struct hlist_head *dentry_hashtable __read_mostly;
+static struct percpu_counter nr_dentry;
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void __d_free(struct dentry *dentry)
 {
 	WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
 }
 
 /*
- * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
  */
 static void d_free(struct dentry *dentry)
 {
@@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
 		__d_free(dentry);
 	else
 		call_rcu(&dentry->d_u.d_rcu, d_callback);
+	percpu_counter_dec(&nr_dentry);
 }
 
 /*
@@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	if (IS_ROOT(dentry))
@@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
 static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 {
 	struct dentry *parent;
-	unsigned detached = 0;
 
 	BUG_ON(!IS_ROOT(dentry));
 
@@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			}
 
 			list_del(&dentry->d_u.d_child);
-			detached++;
 
 			inode = dentry->d_inode;
 			if (inode) {
@@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			 * otherwise we ascend to the parent and move to the
 			 * next sibling if there is one */
 			if (!parent)
-				goto out;
+				return;
 
 			dentry = parent;
 
@@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 		dentry = list_entry(dentry->d_subdirs.next,
 				    struct dentry, d_u.d_child);
 	}
-out:
-	/* several dentries were freed, need to correct nr_dentry */
-	spin_lock(&dcache_lock);
-	dentry_stat.nr_dentry -= detached;
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
 	dentry->d_inode = NULL;
-	dentry->d_parent = NULL;
-	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
 	dentry->d_mounted = 0;
@@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	if (parent) {
 		dentry->d_parent = dget(parent);
 		dentry->d_sb = parent->d_sb;
+		spin_lock(&dcache_lock);
+		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+		spin_unlock(&dcache_lock);
 	} else {
+		dentry->d_parent = NULL;
+		dentry->d_sb = NULL;
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
-
-	spin_lock(&dcache_lock);
-	if (parent)
-		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-	dentry_stat.nr_dentry++;
-	spin_unlock(&dcache_lock);
-
+	percpu_counter_inc(&nr_dentry);
 	return dentry;
 }
 
@@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_dentry, 0);
 	/* 
 	 * A constructor could be added for stable state like the lists,
 	 * but it is probably not worth it because of the cache nature
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a853ef..114cb65 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata)
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3d56fe7..777bee7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &dentry_stat,
 		.maxlen		= 6*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_dentry,
 	},
 	{
 		.ctl_name	= FS_OVERFLOWUID,

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
  2008-11-29  8:43                             ` Eric Dumazet
@ 2008-12-11 22:39                               ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Avoid cache line ping-pong between CPUs and prepare for the next patch:
updates of nr_inodes no longer need inode_lock.

(socket8 bench result: no difference at this point)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/fs-writeback.c   |    2 +-
 fs/inode.c          |   39 +++++++++++++++++++++++++++++++--------
 include/linux/fs.h  |    3 +++
 kernel/sysctl.c     |    4 ++--
 mm/page-writeback.c |    2 +-
 5 files changed, 38 insertions(+), 12 deletions(-)


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
 	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 
 	wbc.nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+			(get_nr_inodes() - inodes_stat.nr_unused) +
 			nr_dirty + nr_unstable;
 	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
 	sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..f94f889 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
  * Statistics gathering..
  */
 struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes;
 
 static struct kmem_cache * inode_cachep __read_mostly;
 
+int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
 		destroy_inode(inode);
 		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
+	percpu_counter_sub(&nr_inodes, nr_disposed);
 }
 
 /*
@@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)
 	
 	inode = alloc_inode(sb);
 	if (inode) {
+		percpu_counter_inc(&nr_inodes);
 		spin_lock(&inode_lock);
-		inodes_stat.nr_inodes++;
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
 		inode->i_ino = ++last_ino;
@@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
 			if (set(inode, data))
 				goto set_failed;
 
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 
 	security_inode_delete(inode);
 
@@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
@@ -1394,6 +1416,7 @@ void __init inode_init(void)
 {
 	int loop;
 
+	percpu_counter_init(&nr_inodes, 0);
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 114cb65..a789346 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);
 
 extern int leases_enable, lease_break_time;
 
@@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 
 int get_filesystem_list(char * buf);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 777bee7..b705f3a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.ctl_name	= FS_STATINODE,
@@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
 	next_jif = start_jif + dirty_writeback_interval;
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(get_nr_inodes() - inodes_stat.nr_unused);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator
  2008-11-29  8:43                             ` Eric Dumazet
@ 2008-12-11 22:39                               ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by giving each CPU a per_cpu variable that is
refilled from the shared last_ino only once every 1024 allocations.

This reduces contention on the shared last_ino and gives the same
spread of inode numbers as before
(same wraparound after 2^32 allocations).
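
For illustration only, the batching scheme can be sketched in user space
as below. This is an analogue under stated assumptions, not the kernel
code: a C11 thread-local variable stands in for the per_cpu variable, and
the names ino_get, local_ino and INO_BATCH are hypothetical. Each thread
hands out numbers from a private range and touches the shared atomic
only when that range is exhausted.

```c
#include <assert.h>
#include <stdatomic.h>

#define INO_BATCH 1024u	/* numbers reserved per refill */

/* The mask test below only works for power-of-two batch sizes. */
static_assert((INO_BATCH & (INO_BATCH - 1)) == 0,
	      "INO_BATCH must be a power of two");

/* Shared counter: written once per INO_BATCH allocations per thread. */
static atomic_uint shared_last_ino;

/* Stand-in for the per-cpu variable: last number handed out locally. */
static _Thread_local unsigned int local_ino;

static unsigned int ino_get(void)
{
	unsigned int res = local_ino;

	/* Range exhausted (or first use): reserve the next batch. */
	if ((res & (INO_BATCH - 1)) == 0)
		res = atomic_fetch_add(&shared_last_ino, INO_BATCH);

	local_ino = ++res;
	return res;
}
```

With a single caller the sequence is 1, 2, 3, ... exactly as with a plain
++last_ino; with several callers each one draws from its own block of
1024 numbers, so values are unique but not globally ordered.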

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
 	return node ? inode : NULL;
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+
+static int last_ino_get(void)
+{
+	static atomic_t shared_last_ino;
+	int *p = &get_cpu_var(last_ino);
+	int res = *p;
+
+	if (unlikely((res & 1023) == 0))
+		res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+	*p = ++res;
+	put_cpu_var(last_ino);
+	return res;
+}
+#else
+static int last_ino_get(void)
+{
+	static int last_ino;
+
+	return ++last_ino;
+}
+#endif
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
 	struct inode * inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
+		inode->i_state = 0;
+		inode->i_ino = last_ino_get();
 		spin_lock(&inode_lock);
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
-		inode->i_ino = ++last_ino;
-		inode->i_state = 0;
 		spin_unlock(&inode_lock);
 	}
 	return inode;

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
  2008-11-29  8:43                             ` Eric Dumazet
@ 2008-12-11 22:39                               ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is heavily contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are:

1) A SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This avoids taking a reference on the sb 'root' dentry, which is
shared by too many dentries.
3) They are not hashed into the global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

("socketallocbench -n 8" bench result: from 25s to 19.9s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			     mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
 static struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
 };
 
 /**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_single(&this, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index f463a81..af3bfb3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
  */
 
 /*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	if (!atomic_dec_and_test(&dentry->d_count))
+		return;
+	inode = dentry->d_inode;
+	if (inode)
+		iput(inode);
+	d_free(dentry);
+}
+
+/*
  * dput - release a dentry
  * @dentry: dentry to release 
  *
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
+	/*
+	 * single dentries (sockets/pipes/anon) fast path
+	 */
+	if (dentry->d_flags & DCACHE_SINGLE)
+		return dput_single(dentry);
 
 repeat:
 	if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate a SINGLE dentry for the given inode. The dentry is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+	struct dentry *entry;
+
+	entry = d_alloc(NULL, name);
+	if (entry) {
+		entry->d_sb = inode->i_sb;
+		entry->d_parent = entry;
+		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+		entry->d_inode = inode;
+		fsnotify_d_instantiate(entry, inode);
+		security_d_instantiate(entry, inode);
+	}
+	return entry;
+}
+
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
 
 /*
  * pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };
 
@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
 #define DCACHE_UNHASHED		0x0010	
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE		0x0040
+	/*
+	 * socket, pipe or anonymous fd dentry
+	 * - SINGLE dentries have themselves as a parent.
+	 * - SINGLE dentries are not hashed into global hash table
+	 * - Their d_alias list is empty
+	 * - They dont need dcache_lock synchronization
+	 */
 
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index 92764d8..353c928 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
 	.kill_sb =	kill_anon_super,
 };
 
-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname  = sockfs_dname,
 };
 
@@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v3 5/7] fs: new_inode_single() and iput_single()
  2008-11-29  8:43                             ` Eric Dumazet
@ 2008-12-11 22:40                               ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

The goal of this patch is to avoid touching inode_lock when allocating
or freeing socket, pipe and anonymous fd inodes.

SINGLE dentries are attached to inodes that don't need to be linked
into any inode list, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we can avoid
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines in iput().
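
The idea of skipping the list lock for inodes that are never published
can be sketched in userspace C. Everything named model_* here is a
hypothetical stand-in chosen for illustration (an atomic_flag spin lock
for inode_lock, a single list for inode_in_use); it is a simplified
model, not the code from this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Minimal circular list, in the spirit of the kernel's struct list_head. */
struct model_list {
	struct model_list *next, *prev;
};

static void model_list_init(struct model_list *h)
{
	h->next = h;
	h->prev = h;
}

static void model_list_add(struct model_list *n, struct model_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Hypothetical stand-ins for inode_lock and inode_in_use. */
static atomic_flag model_lock = ATOMIC_FLAG_INIT;
static unsigned long model_lock_taken;	/* how often the lock was acquired */
static struct model_list model_in_use = { &model_in_use, &model_in_use };

struct model_inode {
	struct model_list i_list;
};

static struct model_inode *model_new_inode(int single)
{
	struct model_inode *inode = malloc(sizeof(*inode));

	if (!inode)
		return NULL;
	if (single) {
		/* never published on a global list: no lock needed */
		model_list_init(&inode->i_list);
		return inode;
	}
	while (atomic_flag_test_and_set(&model_lock))
		;	/* spin until the flag is ours */
	model_lock_taken++;
	model_list_add(&inode->i_list, &model_in_use);
	atomic_flag_clear(&model_lock);
	return inode;
}

/* Self-check: only the non-single allocation touches shared state. */
static void model_inode_selfcheck(void)
{
	unsigned long before = model_lock_taken;
	struct model_list *head_before = model_in_use.next;
	struct model_inode *s = model_new_inode(1);
	struct model_inode *n;

	assert(s && s->i_list.next == &s->i_list);
	assert(model_lock_taken == before && model_in_use.next == head_before);
	n = model_new_inode(0);
	assert(n && model_lock_taken == before + 1);
	assert(model_in_use.next == &n->i_list);
	free(s);	/* n stays on the list in this sketch */
}
```

Initialising the list heads to point at themselves keeps later list
operations on the unpublished inode harmless, which is presumably why the
patch uses INIT_LIST_HEAD() on the single path.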

("socketallocbench -n 8" result : from 19.9s to 3.01s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c   |    2 +-
 fs/dcache.c        |    2 +-
 fs/inode.c         |   29 ++++++++++++++++++++---------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |   12 +++++++++++-
 net/socket.c       |    2 +-
 6 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 8bf83cb..89fd36d 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/dcache.c b/fs/dcache.c
index af3bfb3..3363853 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
 		return;
 	inode = dentry->d_inode;
 	if (inode)
-		iput(inode);
+		iput_single(inode);
 	d_free(dentry);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index dc8e72a..0fdfe1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
 		kmem_cache_free(inode_cachep, (inode));
 }
 
+void iput_single(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_count)) {
+		destroy_inode(inode);
+		percpu_counter_dec(&nr_inodes);
+	}
+}
 
 /*
  * These are initializations that only need to be done
@@ -587,8 +594,9 @@ static int last_ino_get(void)
 #endif
 
 /**
- *	new_inode 	- obtain an inode
+ *	__new_inode 	- obtain an inode
  *	@sb: superblock
+ *	@single: if true, don't link the new inode into any list
  *
  *	Allocates a new inode for given superblock. The default gfp_mask
  *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
@@ -598,7 +606,7 @@ static int last_ino_get(void)
  *	newly created inode's mapping
  *
  */
-struct inode *new_inode(struct super_block *sb)
+struct inode *__new_inode(struct super_block *sb, int single)
 {
 	/*
 	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
@@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
 	 */
 	struct inode * inode;
 
-	spin_lock_prefetch(&inode_lock);
-	
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
 		inode->i_state = 0;
 		inode->i_ino = last_ino_get();
-		spin_lock(&inode_lock);
-		list_add(&inode->i_list, &inode_in_use);
-		list_add(&inode->i_sb_list, &sb->s_inodes);
-		spin_unlock(&inode_lock);
+		if (single) {
+			INIT_LIST_HEAD(&inode->i_list);
+			INIT_LIST_HEAD(&inode->i_sb_list);
+		} else {
+			spin_lock(&inode_lock);
+			list_add(&inode->i_list, &inode_in_use);
+			list_add(&inode->i_sb_list, &sb->s_inodes);
+			spin_unlock(&inode_lock);
+		}
 	}
 	return inode;
 }
 
-EXPORT_SYMBOL(new_inode);
+EXPORT_SYMBOL(__new_inode);
 
 void unlock_new_inode(struct inode *inode)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 4de6dd5..8c51a0d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a789346..a702d81 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
 extern void destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *, int);
+static inline struct inode *new_inode(struct super_block *sb)
+{
+	return __new_inode(sb, 0);
+}
+static inline struct inode *new_inode_single(struct super_block *sb)
+{
+	return __new_inode(sb, 1);
+}
+extern void iput_single(struct inode *);
+
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
diff --git a/net/socket.c b/net/socket.c
index 353c928..4017409 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -464,7 +464,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = new_inode_single(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
@ 2008-12-11 22:40                               ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

From: Christoph Lameter <cl@linux-foundation.org>

[PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

Currently we schedule RCU frees for each file we free separately. That has
several drawbacks compared with the earlier file handling (in 2.6.5, for
example), which did not require RCU callbacks:

1. Excessive number of RCU callbacks can be generated causing long RCU
  queues that in turn cause long latencies. We hit SLUB page allocation
  more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
  followed by another open is very fast with the RCUless approach because
  the last freed object is returned by the slab allocator that is
  still cache hot. RCU free means that the object is not immediately
  available again. The new object is cache cold and therefore open/close
  performance tests show a significant degradation with the RCU
  implementation.

One solution to this problem is to move the RCU freeing into the Slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).
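
The "pin, then re-verify" lookup this requires can be sketched in
userspace C with C11 atomics. The names below (struct model_file,
model_slot, model_fget_from() and friends) are hypothetical and exist
only for illustration; the one-slot table stands in for the fd table,
and a plain decrement stands in for the put_filp() call used by the
patch. Treat it as a model of the pattern, not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_file {
	atomic_long f_count;
};

/* Hypothetical one-slot descriptor table. */
static _Atomic(struct model_file *) model_slot;

/* Take a reference unless the count has already dropped to zero. */
static bool model_inc_not_zero(atomic_long *v)
{
	long old = atomic_load(v);

	while (old != 0) {
		/* on failure, old is refreshed with the current value */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* 'seen' is the pointer read from the slot earlier; it may be stale. */
static struct model_file *model_fget_from(struct model_file *seen)
{
	if (!seen || !model_inc_not_zero(&seen->f_count))
		return NULL;
	/*
	 * The reference is stable now, but the object may have been freed
	 * and reused for another file in between. Confirm that the slot
	 * still points at the object we pinned.
	 */
	if (seen != atomic_load(&model_slot)) {
		atomic_fetch_sub(&seen->f_count, 1);	/* undo our pin */
		return NULL;
	}
	return seen;
}

static struct model_file *model_fget(void)
{
	return model_fget_from(atomic_load(&model_slot));
}

/* Self-check: a stale pointer is rejected and its count is restored. */
static void model_fget_selfcheck(void)
{
	struct model_file a, b;

	atomic_init(&a.f_count, 1);
	atomic_init(&b.f_count, 1);
	atomic_store(&model_slot, &a);
	assert(model_fget() == &a);		/* count goes 1 -> 2 */
	atomic_store(&model_slot, &b);		/* slot now names another file */
	assert(model_fget_from(&a) == NULL);	/* stale pointer is rejected */
	assert(atomic_load(&a.f_count) == 2);	/* extra reference was undone */
	atomic_store(&model_slot, NULL);
}
```

The re-check is only meaningful because reused memory is still a struct
file: the refcount field stays valid even if the object now belongs to a
different descriptor.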


Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 Documentation/filesystems/files.txt |   21 ++++++++++++++--
 fs/file_table.c                     |   33 ++++++++++++++++++--------
 include/linux/fs.h                  |    5 ---
 3 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
index ac2facc..6916baa 100644
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.txt
@@ -78,13 +78,28 @@ the fdtable structure -
    that look-up may race with the last put() operation on the
    file structure. This is avoided using atomic_long_inc_not_zero()
    on ->f_count :
+   As file structures are allocated with SLAB_DESTROY_BY_RCU,
+   they can also be freed before a RCU grace period, and reused,
+   but still as a struct file.
+   It is necessary to check again after getting
+   a stable reference (ie after atomic_long_inc_not_zero()),
+   that fcheck_files(files, fd) points to the same file.
 
 	rcu_read_lock();
 	file = fcheck_files(files, fd);
 	if (file) {
-		if (atomic_long_inc_not_zero(&file->f_count))
+		if (atomic_long_inc_not_zero(&file->f_count)) {
 			*fput_needed = 1;
-		else
+			/*
+			 * Now we have a stable reference to an object.
+			 * Check if other threads freed file and reallocated it.
+			 */
+			if (file != fcheck_files(files, fd)) {
+				*fput_needed = 0;
+				put_filp(file);
+				file = NULL;
+			}
+		} else
 		/* Didn't get the reference, someone's freed */
 			file = NULL;
 	}
@@ -95,6 +110,8 @@ the fdtable structure -
    atomic_long_inc_not_zero() detects if refcounts is already zero or
    goes to zero during increment. If it does, we fail
    fget()/fget_light().
+   The second call to fcheck_files(files, fd) checks that this filp
+   was not freed, then reused by an other thread.
 
 6. Since both fdtable and file structures can be looked up
    lock-free, they must be installed using rcu_assign_pointer()
diff --git a/fs/file_table.c b/fs/file_table.c
index a46e880..3e9259d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
-static inline void file_free_rcu(struct rcu_head *head)
-{
-	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
-	kmem_cache_free(filp_cachep, f);
-}
-
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
 	file_check_state(f);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+	kmem_cache_free(filp_cachep, f);
 }
 
 /*
@@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
 			rcu_read_unlock();
 			return NULL;
 		}
+		/*
+		 * Now we have a stable reference to an object.
+		 * Check if other threads freed file and re-allocated it.
+		 */
+		if (unlikely(file != fcheck_files(files, fd))) {
+			put_filp(file);
+			file = NULL;
+		}
 	}
 	rcu_read_unlock();
 
@@ -333,9 +335,19 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 		rcu_read_lock();
 		file = fcheck_files(files, fd);
 		if (file) {
-			if (atomic_long_inc_not_zero(&file->f_count))
+			if (atomic_long_inc_not_zero(&file->f_count)) {
 				*fput_needed = 1;
-			else
+				/*
+				 * Now we have a stable reference to an object.
+				 * Check if other threads freed this file and
+				 * re-allocated it.
+				 */
+				if (unlikely(file != fcheck_files(files, fd))) {
+					*fput_needed = 0;
+					put_filp(file);
+					file = NULL;
+				}
+			} else
 				/* Didn't get the reference, someone's freed */
 				file = NULL;
 		}
@@ -402,7 +414,8 @@ void __init files_init(unsigned long mempages)
 	int n; 
 
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU | SLAB_PANIC,
+			NULL);
 
 	/*
 	 * One file with associated inode and dcache is very roughly 1K. 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a702d81..a1f56d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -811,13 +811,8 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 #define FILE_MNT_WRITE_RELEASED	2
 
 struct file {
-	/*
-	 * fu_list becomes invalid after file_free is called and queued via
-	 * fu_rcuhead for RCU freeing
-	 */
 	union {
 		struct list_head	fu_list;
-		struct rcu_head 	fu_rcuhead;
 	} f_u;
 	struct path		f_path;
 #define f_dentry	f_path.dentry

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* [PATCH v3 7/7] fs: MS_NOREFCOUNT
  2008-11-29  8:43                             ` Eric Dumazet
@ 2008-12-11 22:41                               ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Some filesystems are hardwired into the kernel, and mntput()/mntget() on
them hit a contended cache line. We define a new superblock flag,
MS_NOREFCOUNT, that is set on the socket, pipe and anonymous fd
superblocks. mntput()/mntget() become no-ops on these filesystems.

("socketallocbench -n 8" result : from 2.20s to 1.64s)
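
A minimal user-space sketch of the flag-gated refcounting idea, for
illustration only. The names struct sb, struct mnt, SB_NOREFCOUNT,
mnt_get() and mnt_put() are hypothetical and merely mirror the shape of
the patch below; the return value of mnt_put() is an addition made so the
behaviour can be observed.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag, mirroring MS_NOREFCOUNT from the patch. */
#define SB_NOREFCOUNT (1u << 29)

struct sb {
	unsigned int flags;
};

struct mnt {
	struct sb *sb;
	atomic_int count;
};

/* Take a reference, unless the superblock opted out of refcounting. */
static struct mnt *mnt_get(struct mnt *m)
{
	if (m && !(m->sb->flags & SB_NOREFCOUNT))
		atomic_fetch_add(&m->count, 1);
	return m;
}

/* Drop a reference; returns 1 only when the last counted one went away. */
static int mnt_put(struct mnt *m)
{
	if (m && !(m->sb->flags & SB_NOREFCOUNT))
		return atomic_fetch_sub(&m->count, 1) == 1;
	return 0;
}
```

In this sketch the flag has to be set once, before get/put pairs start
running, and never cleared afterwards. If it changed between a get and
the matching put, the count would become unbalanced.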

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c      |    1 +
 fs/pipe.c             |    3 ++-
 include/linux/fs.h    |    2 ++
 include/linux/mount.h |    8 +++-----
 net/socket.c          |    1 +
 5 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 89fd36d..de0ec3b 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -158,6 +158,7 @@ static int __init anon_inode_init(void)
 		error = PTR_ERR(anon_inode_mnt);
 		goto err_unregister_filesystem;
 	}
+	anon_inode_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
 	anon_inode_inode = anon_inode_mkinode();
 	if (IS_ERR(anon_inode_inode)) {
 		error = PTR_ERR(anon_inode_inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index 8c51a0d..f547432 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void)
 		if (IS_ERR(pipe_mnt)) {
 			err = PTR_ERR(pipe_mnt);
 			unregister_filesystem(&pipe_fs_type);
-		}
+		} else
+			pipe_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
 	}
 	return err;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a1f56d4..11b0452 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -137,6 +137,8 @@ extern int dir_notify_enable;
 #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+
+#define MS_NOREFCOUNT	(1<<29) /* kernel static mnt : no refcounting needed */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..51418b5 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -14,10 +14,8 @@
 #include <linux/nodemask.h>
 #include <linux/spinlock.h>
 #include <asm/atomic.h>
+#include <linux/fs.h>
 
-struct super_block;
-struct vfsmount;
-struct dentry;
 struct mnt_namespace;
 
 #define MNT_NOSUID	0x01
@@ -73,7 +71,7 @@ struct vfsmount {
 
 static inline struct vfsmount *mntget(struct vfsmount *mnt)
 {
-	if (mnt)
+	if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT))
 		atomic_inc(&mnt->mnt_count);
 	return mnt;
 }
@@ -87,7 +85,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 static inline void mntput(struct vfsmount *mnt)
 {
-	if (mnt) {
+	if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT)) {
 		mnt->mnt_expiry_mark = 0;
 		mntput_no_expire(mnt);
 	}
diff --git a/net/socket.c b/net/socket.c
index 4017409..2534dbc 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2206,6 +2206,7 @@ static int __init sock_init(void)
 	init_inodecache();
 	register_filesystem(&sock_fs_type);
 	sock_mnt = kern_mount(&sock_fs_type);
+	sock_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
 
 	/* The real protocol initialization is performed in later initcalls.
 	 */

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
  2007-07-24  1:13                                 ` Nick Piggin
@ 2008-12-12  2:50                                   ` Nick Piggin
  -1 siblings, 0 replies; 349+ messages in thread
From: Nick Piggin @ 2008-12-12  2:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Tuesday 24 July 2007 11:13, Nick Piggin wrote:
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> > From: Christoph Lameter <cl@linux-foundation.org>
> >
> > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
> >
> > Currently we schedule RCU frees for each file we free separately. That
> > has several drawbacks against the earlier file handling (in 2.6.5 f.e.),
> > which did not require RCU callbacks:
> >
> > 1. Excessive number of RCU callbacks can be generated causing long RCU
> >   queues that in turn cause long latencies. We hit SLUB page allocation
> >   more often than necessary.
> >
> > 2. The cache hot object is not preserved between free and realloc. A
> > close followed by another open is very fast with the RCUless approach
> > because the last freed object is returned by the slab allocator that is
> > still cache hot. RCU free means that the object is not immediately
> > available again. The new object is cache cold and therefore open/close
> > performance tests show a significant degradation with the RCU
> >   implementation.
> >
> > One solution to this problem is to move the RCU freeing into the Slab
> > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
> > time. The slab allocator will do RCU frees only when it is necessary
> > to dispose of slabs of objects (rare). So with that approach we can cut
> > out the RCU overhead significantly.
> >
> > However, the slab allocator may return the object for another use even
> > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
> > there is the (unlikely) possibility that the object is going to be
> > switched under us in sections protected by rcu_read_lock() and
> > rcu_read_unlock(). So we need to verify that we have acquired the correct
> > object after establishing a stable object reference (incrementing the
> > refcounter does that).
> >
> >
> > Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  Documentation/filesystems/files.txt |   21 ++++++++++++++--
> >  fs/file_table.c                     |   33 ++++++++++++++++++--------
> >  include/linux/fs.h                  |    5 ---
> >  3 files changed, 42 insertions(+), 17 deletions(-)
> >
> > diff --git a/Documentation/filesystems/files.txt
> > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
> > --- a/Documentation/filesystems/files.txt
> > +++ b/Documentation/filesystems/files.txt
> > @@ -78,13 +78,28 @@ the fdtable structure -
> >     that look-up may race with the last put() operation on the
> >     file structure. This is avoided using atomic_long_inc_not_zero()
> >     on ->f_count :
> > +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
> > +   they can also be freed before a RCU grace period, and reused,
> > +   but still as a struct file.
> > +   It is necessary to check again after getting
> > +   a stable reference (ie after atomic_long_inc_not_zero()),
> > +   that fcheck_files(files, fd) points to the same file.
> >
> >  	rcu_read_lock();
> >  	file = fcheck_files(files, fd);
> >  	if (file) {
> > -		if (atomic_long_inc_not_zero(&file->f_count))
> > +		if (atomic_long_inc_not_zero(&file->f_count)) {
> >  			*fput_needed = 1;
> > -		else
> > +			/*
> > +			 * Now we have a stable reference to an object.
> > +			 * Check if other threads freed file and reallocated it.
> > +			 */
> > +			if (file != fcheck_files(files, fd)) {
> > +				*fput_needed = 0;
> > +				put_filp(file);
> > +				file = NULL;
> > +			}
> > +		} else
> >  		/* Didn't get the reference, someone's freed */
> >  			file = NULL;
> >  	}
> > @@ -95,6 +110,8 @@ the fdtable structure -
> >     atomic_long_inc_not_zero() detects if refcounts is already zero or
> >     goes to zero during increment. If it does, we fail
> >     fget()/fget_light().
> > +   The second call to fcheck_files(files, fd) checks that this filp
> > +   was not freed, then reused by an other thread.
> >
> >  6. Since both fdtable and file structures can be looked up
> >     lock-free, they must be installed using rcu_assign_pointer()
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index a46e880..3e9259d 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
> >
> >  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
> >
> > -static inline void file_free_rcu(struct rcu_head *head)
> > -{
> > -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
> > -	kmem_cache_free(filp_cachep, f);
> > -}
> > -
> >  static inline void file_free(struct file *f)
> >  {
> >  	percpu_counter_dec(&nr_files);
> >  	file_check_state(f);
> > -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> > +	kmem_cache_free(filp_cachep, f);
> >  }
> >
> >  /*
> > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
> >  			rcu_read_unlock();
> >  			return NULL;
> >  		}
> > +		/*
> > +		 * Now we have a stable reference to an object.
> > +		 * Check if other threads freed file and re-allocated it.
> > +		 */
> > +		if (unlikely(file != fcheck_files(files, fd))) {
> > +			put_filp(file);
> > +			file = NULL;
> > +		}
>
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.
>
> From a quick glance through the callchains, I can't see an obvious
> problem. But it needs to have documentation in put_filp, or at least
> a mention in the changelog, and should also be cc'ed to the security lists.
>
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case that it helps?
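
The interleaving behind the quoted concern about put_filp() can be
replayed in a single thread. This is an illustrative sketch, not kernel
code: struct obj, put() and replay() are hypothetical, and the three
steps stand for a lookup that wins the refcount race on a recycled
object, the object's real owner closing it, and the lookup then dropping
its reference after the re-check fails.

```c
/* Who ended up dropping the last reference. */
enum who { NOBODY, OWNER, LOOKUP };

/* Hypothetical stand-in for struct file. */
struct obj {
	long count;
	enum who released_by;
};

/* Drop one reference and record who dropped the final one. */
static void put(struct obj *o, enum who w)
{
	if (--o->count == 0)
		o->released_by = w;
}

/* Replay the interleaving; returns who performed the final put. */
static enum who replay(void)
{
	struct obj f = { 1, NOBODY };	/* the owner holds one reference */

	f.count++;		/* lookup: inc-not-zero succeeds (1 -> 2) */
	put(&f, OWNER);		/* owner closes meanwhile (2 -> 1), no free */
	put(&f, LOOKUP);	/* re-check failed: lookup drops it (1 -> 0) */
	return f.released_by;
}
```

With this ordering the final put, and hence the release of the object,
happens in the lookup path rather than in the context that opened the
file.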

Sorry, my clock screwed up and I didn't notice :(

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
  2007-07-24  1:13                                 ` Nick Piggin
@ 2008-12-12  4:45                                   ` Eric Dumazet
  -1 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-12  4:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Nick Piggin wrote:
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
>> From: Christoph Lameter <cl@linux-foundation.org>
>>
>> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
>>
>> Currently we schedule RCU frees for each file we free separately. That has
>> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
>> did not require RCU callbacks:
>>
>> 1. Excessive number of RCU callbacks can be generated causing long RCU
>>   queues that in turn cause long latencies. We hit SLUB page allocation
>>   more often than necessary.
>>
>> 2. The cache hot object is not preserved between free and realloc. A close
>>   followed by another open is very fast with the RCUless approach because
>>   the last freed object is returned by the slab allocator that is
>>   still cache hot. RCU free means that the object is not immediately
>>   available again. The new object is cache cold and therefore open/close
>>   performance tests show a significant degradation with the RCU
>>   implementation.
>>
>> One solution to this problem is to move the RCU freeing into the Slab
>> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
>> time. The slab allocator will do RCU frees only when it is necessary
>> to dispose of slabs of objects (rare). So with that approach we can cut
>> out the RCU overhead significantly.
>>
>> However, the slab allocator may return the object for another use even
>> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
>> there is the (unlikely) possibility that the object is going to be
>> switched under us in sections protected by rcu_read_lock() and
>> rcu_read_unlock(). So we need to verify that we have acquired the correct
>> object after establishing a stable object reference (incrementing the
>> refcounter does that).
>>
>>
>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> ---
>>  Documentation/filesystems/files.txt |   21 ++++++++++++++--
>>  fs/file_table.c                     |   33 ++++++++++++++++++--------
>>  include/linux/fs.h                  |    5 ---
>>  3 files changed, 42 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/filesystems/files.txt
>> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
>> --- a/Documentation/filesystems/files.txt
>> +++ b/Documentation/filesystems/files.txt
>> @@ -78,13 +78,28 @@ the fdtable structure -
>>     that look-up may race with the last put() operation on the
>>     file structure. This is avoided using atomic_long_inc_not_zero()
>>     on ->f_count :
>> +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
>> +   they can also be freed before a RCU grace period, and reused,
>> +   but still as a struct file.
>> +   It is necessary to check again after getting
>> +   a stable reference (ie after atomic_long_inc_not_zero()),
>> +   that fcheck_files(files, fd) points to the same file.
>>
>>  	rcu_read_lock();
>>  	file = fcheck_files(files, fd);
>>  	if (file) {
>> -		if (atomic_long_inc_not_zero(&file->f_count))
>> +		if (atomic_long_inc_not_zero(&file->f_count)) {
>>  			*fput_needed = 1;
>> -		else
>> +			/*
>> +			 * Now we have a stable reference to an object.
>> +			 * Check if other threads freed file and reallocated it.
>> +			 */
>> +			if (file != fcheck_files(files, fd)) {
>> +				*fput_needed = 0;
>> +				put_filp(file);
>> +				file = NULL;
>> +			}
>> +		} else
>>  		/* Didn't get the reference, someone's freed */
>>  			file = NULL;
>>  	}
>> @@ -95,6 +110,8 @@ the fdtable structure -
>>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>>     goes to zero during increment. If it does, we fail
>>     fget()/fget_light().
>> +   The second call to fcheck_files(files, fd) checks that this filp
>> +   was not freed, then reused by an other thread.
>>
>>  6. Since both fdtable and file structures can be looked up
>>     lock-free, they must be installed using rcu_assign_pointer()
>> diff --git a/fs/file_table.c b/fs/file_table.c
>> index a46e880..3e9259d 100644
>> --- a/fs/file_table.c
>> +++ b/fs/file_table.c
>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>>
>>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>>
>> -static inline void file_free_rcu(struct rcu_head *head)
>> -{
>> -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
>> -	kmem_cache_free(filp_cachep, f);
>> -}
>> -
>>  static inline void file_free(struct file *f)
>>  {
>>  	percpu_counter_dec(&nr_files);
>>  	file_check_state(f);
>> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
>> +	kmem_cache_free(filp_cachep, f);
>>  }
>>
>>  /*
>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>>  			rcu_read_unlock();
>>  			return NULL;
>>  		}
>> +		/*
>> +		 * Now we have a stable reference to an object.
>> +		 * Check if other threads freed file and re-allocated it.
>> +		 */
>> +		if (unlikely(file != fcheck_files(files, fd))) {
>> +			put_filp(file);
>> +			file = NULL;
>> +		}
> 
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.

If we get to this point, then:

We found a non-NULL pointer in our fd table.
Then another thread came along and closed the file before we had added our reference.
This file was freed (kmem_cache_free(filp_cachep, file)).
This file was reused and inserted into another thread's fd table.
We took our reference on f_count.
We checked whether this file is still ours (still in our fd table).
We found that this file is no longer the file we wanted.

Calling put_filp() here is our only choice to safely drop the reference.
At this point the file is a truly allocated file, but it is no longer ours.
Unfortunately we took a reference on it, so we must release it.
If the other thread already called put_filp() because it wanted to close its new file,
we will see f_count drop to zero, and we must call __fput() to perform
all the relevant file cleanup ourselves.
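
To make the sequence above concrete, here is a minimal userspace sketch of
the lookup / take-reference / re-check pattern, using C11 atomics. It is not
the kernel code: struct obj, obj_get_not_zero(), obj_put(), slot_pin(),
slot_confirm() and slot_get() are hypothetical names standing in for
struct file, atomic_long_inc_not_zero(), the put side and fget(). RCU and
the slab allocator are left out entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct obj {
	atomic_long count;	/* 0 means "free, may be reused at any time" */
};

/* Userspace analogue of atomic_long_inc_not_zero(). */
static bool obj_get_not_zero(struct obj *o)
{
	long old = atomic_load(&o->count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; true means the caller dropped the last one. */
static bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->count, 1) == 1;
}

/* Step 1: read the slot and try to pin whatever object it points at. */
static struct obj *slot_pin(_Atomic(struct obj *) *slot)
{
	struct obj *o = atomic_load(slot);

	if (!o || !obj_get_not_zero(o))
		return NULL;
	return o;
}

/*
 * Step 2: with the object pinned, check that the slot still points at it.
 * If not, the object was recycled for someone else: give the reference
 * back and report whether that was the final reference.
 */
static struct obj *slot_confirm(_Atomic(struct obj *) *slot, struct obj *o,
				bool *dropped_last)
{
	*dropped_last = false;
	if (o != atomic_load(slot)) {
		*dropped_last = obj_put(o);
		return NULL;
	}
	return o;
}

/* Both steps together, in the order fget() performs them. */
static struct obj *slot_get(_Atomic(struct obj *) *slot, bool *dropped_last)
{
	struct obj *o = slot_pin(slot);

	*dropped_last = false;
	return o ? slot_confirm(slot, o, dropped_last) : NULL;
}
```

slot_pin() and slot_confirm() are split only so that the window between them
can be exercised by hand. When *dropped_last comes back true, the caller is
the one that has to perform the final cleanup, even though it never opened
the object, which is the situation Nick is pointing at.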


> 
> From a quick glance though the callchains, I can't seen an obvious
> problem. But it needs to have documentation in put_filp, or at least
> a mention in the changelog, and also cc'ed to the security lists.

I see your point. But currently, any thread can already be the one "releasing
the last reference on a file"; that is not always the thread that called
close(fd). We extend this to "any thread of any process", so it might have
a security effect, you are absolutely right.

> 
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case that it helps?

Any real-world program that opens and closes files, or better said,
that closes and opens files :)

sizeof(struct file) is 192 bytes. That's three cache lines.
Being able to reuse a hot "struct file" avoids three cache line misses.

That's about 120 ns.
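
The arithmetic behind these numbers, as a small sketch. The 64-byte cache
line and the roughly 40 ns per miss are assumptions implied by the figures
above (192 bytes is three lines, 120 ns divided by 3), not measured values,
and the helper names are made up for illustration.

```c
#include <stddef.h>

/* Number of cache lines spanned by a line-aligned object of obj_size bytes. */
static size_t cache_lines_spanned(size_t obj_size, size_t line_size)
{
	return (obj_size + line_size - 1) / line_size;
}

/* Cost in ns of touching the object when every one of its lines is cold. */
static size_t cold_cost_ns(size_t obj_size, size_t line_size, size_t miss_ns)
{
	return cache_lines_spanned(obj_size, line_size) * miss_ns;
}
```

With cold_cost_ns(192, 64, 40) this gives the 120 ns quoted above.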

Then, using call_rcu() is also a latency killer, since we explicitly say:
"I don't want to free this file right now, I delegate this job to another layer,
to be done in two or three milliseconds (or more)."

A final point is that SLUB doesn't need to allocate or free a slab in many cases.
(This is probably why Christoph needed this patch in 2006 :) )
In my case, I need all these patches to speed up HTTP servers.
They obviously open and close many files per second.

The added code has a cost of less than 3 ns, but I suspect we can cut it to less than 1 ns.
Christoph, Paul and I preferred to keep the patch as short as possible, to focus
on the essential points.

               :c0287656:       mov    -0x14(%ebp),%esi
               :c0287659:       mov    -0x24(%ebp),%edi
               :c028765c:       mov    0x4(%esi),%eax
               :c028765f:       cmp    (%eax),%edi
               :c0287661:       jb     c0287678 <fget+0xc8>
               :c0287663:       mov    %ebx,%eax
               :c0287665:       xor    %ebx,%ebx
               :c0287667:       call   c0287450 <put_filp>
               :c028766c:       jmp    c02875ec <fget+0x3c>
               :c0287671:       lea    0x0(%esi,%eiz,1),%esi
               :c0287678:       mov    0x4(%eax),%edi
               :c028767b:       add    %edi,-0x10(%ebp)
               :c028767e:       mov    -0x10(%ebp),%edx
     1 8.8e-05 :c0287681:       mov    (%edx),%eax
               :c0287683:       cmp    %eax,%ebx
               :c0287685:       je     c02875ec <fget+0x3c>
               :c028768b:       jmp    c0287663 <fget+0xb3>

We could avoid doing the full test, because there is no way fdt->max_fds could
become lower under us, nor could fdt itself or fdt->fd change under us.

So instead of calling this function twice:

static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
{
        struct file * file = NULL;
        struct fdtable *fdt = files_fdtable(files);

        if (fd < fdt->max_fds)
                file = rcu_dereference(fdt->fd[fd]);
        return file;
}

We could use the attached patch


This becomes a matter of three instructions, including a 99.99% predicted branch :

c0287646:       8b 03                   mov    (%ebx),%eax
c0287648:       39 45 e4                cmp    %eax,-0x1c(%ebp)
c028764b:       74 a1                   je     c02875ee <fget+0x3e>

c028764d:       8b 45 e4                mov    -0x1c(%ebp),%eax
c0287650:       e8 fb fd ff ff          call   c0287450 <put_filp>
c0287655:       31 c0                   xor    %eax,%eax
c0287657:       eb 98                   jmp    c02875f1 <fget+0x41>
	

At the time Christoph sent his patch (in 2006), nobody cared, because
we had no benchmark or real-world workload that demonstrated the gain
of his patch, only intuitions.
We had too many contended cache lines that slowed down the whole process.

SLAB_DESTROY_BY_RCU is a must on current hardware, where the cost of memory
cache line misses becomes really problematic. This patch series clearly
demonstrates it.

Thanks Nick for your feedback and comments.

Eric

[PATCH] fs: optimize fget() & fget_light()

Instead of calling fcheck_files() a second time, we can take into account that
we already did part of the job, in an RCU read-locked section. We keep a
struct file **filp pointer to the fd table slot, so that the second check only
has to dereference it again.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/file_table.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 3e9259d..4bc019f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -289,11 +289,16 @@ void __fput(struct file *file)
 
 struct file *fget(unsigned int fd)
 {
-	struct file *file;
+	struct file *file = NULL, **filp;
 	struct files_struct *files = current->files;
+	struct fdtable *fdt;
 
 	rcu_read_lock();
-	file = fcheck_files(files, fd);
+	fdt = files_fdtable(files);
+	if (likely(fd < fdt->max_fds)) {
+		filp = &fdt->fd[fd];
+		file = rcu_dereference(*filp);
+	}
 	if (file) {
 		if (!atomic_long_inc_not_zero(&file->f_count)) {
 			/* File object ref couldn't be taken */
@@ -304,7 +309,7 @@ struct file *fget(unsigned int fd)
 		 * Now we have a stable reference to an object.
 		 * Check if other threads freed file and re-allocated it.
 		 */
-		if (unlikely(file != fcheck_files(files, fd))) {
+		if (unlikely(file != rcu_dereference(*filp))) {
 			put_filp(file);
 			file = NULL;
 		}
@@ -325,15 +330,21 @@ EXPORT_SYMBOL(fget);
  */
 struct file *fget_light(unsigned int fd, int *fput_needed)
 {
-	struct file *file;
+	struct file *file, **filp;
 	struct files_struct *files = current->files;
+	struct fdtable *fdt;
 
 	*fput_needed = 0;
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
 		rcu_read_lock();
-		file = fcheck_files(files, fd);
+		fdt = files_fdtable(files);
+		file = NULL;
+		if (likely(fd < fdt->max_fds)) {
+			filp = &fdt->fd[fd];
+			file = rcu_dereference(*filp);
+		}
 		if (file) {
 			if (atomic_long_inc_not_zero(&file->f_count)) {
 				*fput_needed = 1;
@@ -342,7 +353,7 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 				 * Check if other threads freed this file and
 				 * re-allocated it.
 				 */
-				if (unlikely(file != fcheck_files(files, fd))) {
+				if (unlikely(file != rcu_dereference(*filp))) {
 					*fput_needed = 0;
 					put_filp(file);
 					file = NULL;


^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
@ 2008-12-12  5:11                                   ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-12  5:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Nick Piggin wrote:
> On Friday 12 December 2008 09:39, Eric Dumazet wrote:
>> Avoids cache line ping pongs between cpus and prepare next patch,
>> because updates of nr_inodes dont need inode_lock anymore.
>>
>> (socket8 bench result : no difference at this point)
> 
> Looks good.
> 
> But.... If we never actually need fast access to the approximate
> total, (which seems to apply to this and the previous patch) we
> could use something much simpler which does not have the spinlock
> or all this batching stuff that percpu counters have. I'd prefer
> that because it will be faster in a straight line...

Well, using a non-batching mode could be really easy: just
call __percpu_counter_add(&counter, inc, 1<<30);

Or define a new percpu_counter_fastadd(&counter, inc);

percpu_counters are nice because they handle the CPU hotplug problem,
if we want to use for_each_online_cpu() instead of
for_each_possible_cpu().
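
As a rough illustration of why a huge batch value amounts to "non-batching",
here is a single-threaded userspace sketch of the batching logic. It only
models the idea: struct pcpu_counter, pcpu_counter_add(), pcpu_counter_sum()
and NR_CPUS_SKETCH are hypothetical names, there is no real per-CPU storage
and no locking, and the kernel's percpu_counter differs in its details.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

struct pcpu_counter {
	int64_t global;			/* shared total (lock-protected in the kernel) */
	int32_t local[NR_CPUS_SKETCH];	/* per-CPU deltas not yet folded in */
};

/*
 * Add amount on behalf of cpu. The local delta is folded into the shared
 * total only once its magnitude reaches batch. Returns 1 if the shared
 * total was touched, 0 if the update stayed CPU-local.
 */
static int pcpu_counter_add(struct pcpu_counter *c, int cpu,
			    int32_t amount, int32_t batch)
{
	int64_t count = (int64_t)c->local[cpu] + amount;

	if (count >= batch || count <= -batch) {
		c->global += count;	/* the contended, locked path */
		c->local[cpu] = 0;
		return 1;
	}
	c->local[cpu] = (int32_t)count;
	return 0;
}

/* Exact value: shared total plus every CPU's pending delta. */
static int64_t pcpu_counter_sum(const struct pcpu_counter *c)
{
	int64_t sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

With batch set to 1<<30 the shared total is practically never written, so
updates stay on the local cache line, and only the slow exact sum sees the
real value. That fits the case Nick describes, where fast access to an
approximate total is not needed.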

> 
> (BTW. percpu counters can't be used in interrupt context? That's
> nice.)
> 
> 

Not sure why you said this.

I would like to have an irq-safe percpu_counter; I was preparing such a
patch because we need it for net-next.




^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
@ 2008-12-12 16:48                                     ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-12 16:48 UTC (permalink / raw)
  To: Christoph Lameter, Paul E. McKenney
  Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig,
	David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel,
	Al Viro

Eric Dumazet wrote:
> Nick Piggin wrote:
>> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
>>> From: Christoph Lameter <cl@linux-foundation.org>
>>>
>>> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
>>>
>>> Currently we schedule RCU frees for each file we free separately. That has
>>> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
>>> did not require RCU callbacks:
>>>
>>> 1. Excessive number of RCU callbacks can be generated causing long RCU
>>>   queues that in turn cause long latencies. We hit SLUB page allocation
>>>   more often than necessary.
>>>
>>> 2. The cache hot object is not preserved between free and realloc. A close
>>>   followed by another open is very fast with the RCUless approach because
>>>   the last freed object is returned by the slab allocator that is
>>>   still cache hot. RCU free means that the object is not immediately
>>>   available again. The new object is cache cold and therefore open/close
>>>   performance tests show a significant degradation with the RCU
>>>   implementation.
>>>
>>> One solution to this problem is to move the RCU freeing into the Slab
>>> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
>>> time. The slab allocator will do RCU frees only when it is necessary
>>> to dispose of slabs of objects (rare). So with that approach we can cut
>>> out the RCU overhead significantly.
>>>
>>> However, the slab allocator may return the object for another use even
>>> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
>>> there is the (unlikely) possibility that the object is going to be
>>> switched under us in sections protected by rcu_read_lock() and
>>> rcu_read_unlock(). So we need to verify that we have acquired the correct
>>> object after establishing a stable object reference (incrementing the
>>> refcounter does that).
>>>
>>>
>>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>> ---
>>>  Documentation/filesystems/files.txt |   21 ++++++++++++++--
>>>  fs/file_table.c                     |   33 ++++++++++++++++++--------
>>>  include/linux/fs.h                  |    5 ---
>>>  3 files changed, 42 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/Documentation/filesystems/files.txt
>>> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
>>> --- a/Documentation/filesystems/files.txt
>>> +++ b/Documentation/filesystems/files.txt
>>> @@ -78,13 +78,28 @@ the fdtable structure -
>>>     that look-up may race with the last put() operation on the
>>>     file structure. This is avoided using atomic_long_inc_not_zero()
>>>     on ->f_count :
>>> +   As file structures are allocated with SLAB_DESTROY_BY_RCU,
>>> +   they can also be freed before a RCU grace period, and reused,
>>> +   but still as a struct file.
>>> +   It is necessary to check again after getting
>>> +   a stable reference (ie after atomic_long_inc_not_zero()),
>>> +   that fcheck_files(files, fd) points to the same file.
>>>
>>>  	rcu_read_lock();
>>>  	file = fcheck_files(files, fd);
>>>  	if (file) {
>>> -		if (atomic_long_inc_not_zero(&file->f_count))
>>> +		if (atomic_long_inc_not_zero(&file->f_count)) {
>>>  			*fput_needed = 1;
>>> -		else
>>> +			/*
>>> +			 * Now we have a stable reference to an object.
>>> +			 * Check if other threads freed file and reallocated it.
>>> +			 */
>>> +			if (file != fcheck_files(files, fd)) {
>>> +				*fput_needed = 0;
>>> +				put_filp(file);
>>> +				file = NULL;
>>> +			}
>>> +		} else
>>>  		/* Didn't get the reference, someone's freed */
>>>  			file = NULL;
>>>  	}
>>> @@ -95,6 +110,8 @@ the fdtable structure -
>>>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>>>     goes to zero during increment. If it does, we fail
>>>     fget()/fget_light().
>>> +   The second call to fcheck_files(files, fd) checks that this filp
>>> +   was not freed, then reused by an other thread.
>>>
>>>  6. Since both fdtable and file structures can be looked up
>>>     lock-free, they must be installed using rcu_assign_pointer()
>>> diff --git a/fs/file_table.c b/fs/file_table.c
>>> index a46e880..3e9259d 100644
>>> --- a/fs/file_table.c
>>> +++ b/fs/file_table.c
>>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>>>
>>>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>>>
>>> -static inline void file_free_rcu(struct rcu_head *head)
>>> -{
>>> -	struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
>>> -	kmem_cache_free(filp_cachep, f);
>>> -}
>>> -
>>>  static inline void file_free(struct file *f)
>>>  {
>>>  	percpu_counter_dec(&nr_files);
>>>  	file_check_state(f);
>>> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
>>> +	kmem_cache_free(filp_cachep, f);
>>>  }
>>>
>>>  /*
>>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>>>  			rcu_read_unlock();
>>>  			return NULL;
>>>  		}
>>> +		/*
>>> +		 * Now we have a stable reference to an object.
>>> +		 * Check if other threads freed file and re-allocated it.
>>> +		 */
>>> +		if (unlikely(file != fcheck_files(files, fd))) {
>>> +			put_filp(file);
>>> +			file = NULL;
>>> +		}
>> This is a non-trivial change, because that put_filp may drop the last
>> reference to the file. So now we have the case where we free the file
>> from a context in which it had never been allocated.
> 
> If we get to this point, then:
> 
> We found a non-NULL pointer in our fd table.
> Then another thread came and closed the file before we had added our reference.
> This file was freed (kmem_cache_free(filp_cachep, file)).
> This file was reused and inserted into another thread's fd table.
> We added our reference to the refcount.
> We checked whether this file is still ours (in our fd table).
> We found that this file is no longer the file we wanted.
> Calling put_filp() here is our only choice to safely remove the reference on
> a truly allocated file. At this point the file is
> a truly allocated file, but no longer ours.
> Unfortunately we added a reference to it: we must release it.
> If the other thread already called put_filp() because it wanted to close its new file,
> we must see f_count going to zero, and we must call __fput() to perform
> all the relevant file cleanup ourselves.

Reading this mail again, I realise we call put_filp(file) here, while it should
be either fput(file) or put_filp(file); we don't know which.

Damn, this patch is wrong as is.

Christoph, Paul, do you see the problem?

In fget()/fget_light() we don't know whether the other thread (the one that re-allocated the file
and tried to close it while we held a reference on it) had to call put_filp() or fput()
to release its own reference. So we call atomic_long_dec_and_test() and cannot
take the appropriate action (calling the full __fput() version or the small one
that some systems use to 'close' a not really opened file).

void put_filp(struct file *file)
{
        if (atomic_long_dec_and_test(&file->f_count)) {
                security_file_free(file);
                file_kill(file);
                file_free(file);
        }
}

void fput(struct file *file)
{
        if (atomic_long_dec_and_test(&file->f_count))
                __fput(file);
}

I believe put_filp() is only called on the slow path (error cases).

Should we just zap it and always call fput()?
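
For readers following the race described above, here is a minimal user-space sketch of the lookup-and-recheck pattern, written with C11 atomics rather than kernel primitives. The names struct obj, struct slot, ref_get_not_zero(), ref_put(), slot_peek(), slot_confirm() and slot_lookup() are hypothetical stand-ins (for struct file, an fd table entry, atomic_long_inc_not_zero() and so on), not kernel APIs. It only illustrates why the back-out path may end up dropping the last reference, which is where the put_filp()/fput() question comes from.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file; only the refcount matters here. */
struct obj {
	atomic_long refcount;	/* 0 means free; memory may be reused as another obj */
};

/* Hypothetical stand-in for one fd table entry. */
struct slot {
	_Atomic(struct obj *) ptr;
};

/* Same idea as atomic_long_inc_not_zero(): never revive a zero count. */
static bool ref_get_not_zero(struct obj *o)
{
	long old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; true means the caller dropped the last one. */
static bool ref_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcount, 1) == 1;
}

/* Step 1: read the pointer without holding any reference. */
static struct obj *slot_peek(struct slot *s)
{
	return atomic_load(&s->ptr);
}

/*
 * Step 2: pin the object, then check that the slot still points to it.
 * If not, the object was recycled: back out. *last_put tells the caller
 * whether backing out dropped the final reference (teardown is then ours).
 */
static struct obj *slot_confirm(struct slot *s, struct obj *o, bool *last_put)
{
	*last_put = false;
	if (!o || !ref_get_not_zero(o))
		return NULL;
	if (o != atomic_load(&s->ptr)) {
		*last_put = ref_put(o);
		return NULL;
	}
	return o;
}

/* Both steps back to back, as a lock-free lookup would run them. */
static struct obj *slot_lookup(struct slot *s, bool *last_put)
{
	return slot_confirm(s, slot_peek(s), last_put);
}
```

Splitting the lookup into slot_peek() and slot_confirm() is only there so the window between the two steps can be exercised by hand; when last_put comes back true, the reader is the one who has to run whatever teardown the recycled object needs, without knowing how that object was set up.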




^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
@ 2008-12-13  1:41                                     ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-12-13  1:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig,
	David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel,
	Al Viro, Paul E. McKenney

On Fri, 12 Dec 2008, Eric Dumazet wrote:


> > This is a non-trivial change, because that put_filp may drop the last
> > reference to the file. So now we have the case where we free the file
> > from a context in which it had never been allocated.
>
> If we get to this point, then:
>
> We found a non-NULL pointer in our fd table.
> Then another thread came and closed the file before we had added our reference.
> This file was freed (kmem_cache_free(filp_cachep, file)).
> This file was reused and inserted into another thread's fd table.
> We added our reference to the refcount.
> We checked whether this file is still ours (in our fd table).
> We found that this file is no longer the file we wanted.
> Calling put_filp() here is our only choice to safely remove the reference on
> a truly allocated file. At this point the file is
> a truly allocated file, but no longer ours.
> Unfortunately we added a reference to it: we must release it.
> If the other thread already called put_filp() because it wanted to close its new file,
> we must see f_count going to zero, and we must call __fput() to perform
> all the relevant file cleanup ourselves.

Correct. That was the idea.

> A final point is that SLUB doesn't need to allocate or free a slab in many cases.
> (This is probably why Christoph needed this patch in 2006 :) )

We needed this patch in 2006 because the AIM9 creat-clo test showed
regressions after the RCU free was put in (discovered during the SLES11
verification cycle). All slab allocators at least defer frees until all
objects in the page are freed, if not longer.

> In my case, I need all these patches to speed up HTTP servers.
> They obviously open and close many files per second.

Run the AIM9 creat-clo tests....

> SLAB_DESTROY_BY_RCU is a must on current hardware, where the cost of memory
> cache line misses becomes really problematic. This patch series clearly
> demonstrates it.

Well, the issue becomes more severe as accesses to cold memory become more
extensive. Thanks for your work on this.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
@ 2008-12-13  2:07                                       ` Christoph Lameter
  0 siblings, 0 replies; 349+ messages in thread
From: Christoph Lameter @ 2008-12-13  2:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar,
	Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel,
	Al Viro

On Fri, 12 Dec 2008, Eric Dumazet wrote:

> > a truly allocated file. At this point the file is
> > a truly allocated file, but no longer ours.

It's a valid file. Does ownership matter here?

> Reading this mail again, I realise we call put_filp(file) here, while it should
> be either fput(file) or put_filp(file); we don't know which.
>
> Damn, this patch is wrong as is.
>
> Christoph, Paul, do you see the problem?

Yes.

> In fget()/fget_light() we don't know whether the other thread (the one that re-allocated the file
> and tried to close it while we held a reference on it) had to call put_filp() or fput()
> to release its own reference. So we call atomic_long_dec_and_test() and cannot
> take the appropriate action (calling the full __fput() version or the small one
> that some systems use to 'close' a not really opened file).

The difference is mainly that fput() does full processing whereas
put_filp() is used when we know that the file was not fully operational.
If the checks in __fput are able to handle the put_filp() situation by not
releasing resources that were not allocated then we should be fine.

> I believe put_filp() is only called on the slow path (error cases).

Looks like it. It seems to assume that no dentry is associated.

> Should we just zap it and always call fput()?

Only if fput() can handle partially set-up files.
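
To make that condition concrete, here is a small user-space sketch of a single put path that tolerates partially set-up objects by releasing only what was actually acquired. Everything in it (struct res, struct handle, handle_release(), handle_put()) is a hypothetical illustration under that assumption; it is not what __fput() or put_filp() do today.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource that just counts how often it was released. */
struct res {
	int released;
};

/* Hypothetical handle; this is not the layout of struct file. */
struct handle {
	atomic_long refcount;
	struct res *security;	/* acquired at allocation time */
	struct res *dentry;	/* NULL until the open has fully succeeded */
	bool freed;
};

/* Release only what was actually set up; safe for half-built handles. */
static void handle_release(struct handle *h)
{
	if (h->dentry) {
		h->dentry->released++;
		h->dentry = NULL;
	}
	if (h->security) {
		h->security->released++;
		h->security = NULL;
	}
	h->freed = true;
}

/* Single put path: whoever drops the last reference runs the release. */
static bool handle_put(struct handle *h)
{
	if (atomic_fetch_sub(&h->refcount, 1) != 1)
		return false;
	handle_release(h);
	return true;
}
```

With one put path of this shape, a reader that backs out of a lookup does not need to know whether the object it pinned was fully opened; the per-resource checks decide what to tear down.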


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry
@ 2008-12-16 21:04                                 ` Paul E. McKenney
  0 siblings, 0 replies; 349+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:38:56PM +0100, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping-pong
> between cpus to maintain this metric, and dcache_lock is
> no longer needed to protect dentry_stat.nr_dentry
> 
> We centralize nr_dentry updates at the right place :
> - increments in d_alloc()
> - decrements in d_free()
> 
> d_alloc() can avoid taking dcache_lock if parent is NULL
> 
> ("socketallocbench -n8" result : 27.5s to 25s)

Looks good!  (At least once I realised that nr_dentry was global rather
than per-dentry!!!)

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/dcache.c        |   49 +++++++++++++++++++++++++------------------
>  include/linux/fs.h |    2 +
>  kernel/sysctl.c    |    2 -
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index fa1ba03..f463a81 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
>  static unsigned int d_hash_mask __read_mostly;
>  static unsigned int d_hash_shift __read_mostly;
>  static struct hlist_head *dentry_hashtable __read_mostly;
> +static struct percpu_counter nr_dentry;
> 
>  /* Statistics gathering. */
>  struct dentry_stat_t dentry_stat = {
>  	.age_limit = 45,
>  };
> 
> +/*
> + * Handle nr_dentry sysctl
> + */
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
> +	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +}
> +#else
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	return -ENOSYS;
> +}
> +#endif
> +
>  static void __d_free(struct dentry *dentry)
>  {
>  	WARN_ON(!list_empty(&dentry->d_alias));
> @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
>  }
> 
>  /*
> - * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
> - * inside dcache_lock.
> + * no dcache_lock, please.
>   */
>  static void d_free(struct dentry *dentry)
>  {
> @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
>  		__d_free(dentry);
>  	else
>  		call_rcu(&dentry->d_u.d_rcu, d_callback);
> +	percpu_counter_dec(&nr_dentry);
>  }
> 
>  /*
> @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
>  	struct dentry *parent;
> 
>  	list_del(&dentry->d_u.d_child);
> -	dentry_stat.nr_dentry--;	/* For d_free, below */
>  	/*drops the locks, at that point nobody can reach this dentry */
>  	dentry_iput(dentry);
>  	if (IS_ROOT(dentry))
> @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
>  static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  {
>  	struct dentry *parent;
> -	unsigned detached = 0;
> 
>  	BUG_ON(!IS_ROOT(dentry));
> 
> @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  			}
> 
>  			list_del(&dentry->d_u.d_child);
> -			detached++;
> 
>  			inode = dentry->d_inode;
>  			if (inode) {
> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  			 * otherwise we ascend to the parent and move to the
>  			 * next sibling if there is one */
>  			if (!parent)
> -				goto out;
> +				return;
> 
>  			dentry = parent;
> 
> @@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  		dentry = list_entry(dentry->d_subdirs.next,
>  				    struct dentry, d_u.d_child);
>  	}
> -out:
> -	/* several dentries were freed, need to correct nr_dentry */
> -	spin_lock(&dcache_lock);
> -	dentry_stat.nr_dentry -= detached;
> -	spin_unlock(&dcache_lock);
>  }
> 
>  /*
> @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
>  	dentry->d_flags = DCACHE_UNHASHED;
>  	spin_lock_init(&dentry->d_lock);
>  	dentry->d_inode = NULL;
> -	dentry->d_parent = NULL;
> -	dentry->d_sb = NULL;
>  	dentry->d_op = NULL;
>  	dentry->d_fsdata = NULL;
>  	dentry->d_mounted = 0;
> @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
>  	if (parent) {
>  		dentry->d_parent = dget(parent);
>  		dentry->d_sb = parent->d_sb;
> +		spin_lock(&dcache_lock);
> +		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> +		spin_unlock(&dcache_lock);
>  	} else {
> +		dentry->d_parent = NULL;
> +		dentry->d_sb = NULL;
>  		INIT_LIST_HEAD(&dentry->d_u.d_child);
>  	}
> -
> -	spin_lock(&dcache_lock);
> -	if (parent)
> -		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> -	dentry_stat.nr_dentry++;
> -	spin_unlock(&dcache_lock);
> -
> +	percpu_counter_inc(&nr_dentry);
>  	return dentry;
>  }
> 
> @@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
>  {
>  	int loop;
> 
> +	percpu_counter_init(&nr_dentry, 0);
>  	/* 
>  	 * A constructor could be added for stable state like the lists,
>  	 * but it is probably not worth it because of the cache nature
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4a853ef..114cb65 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata)
>  struct ctl_table;
>  int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
>  		  void __user *buffer, size_t *lenp, loff_t *ppos);
> +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos);
> 
>  int get_filesystem_list(char * buf);
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 3d56fe7..777bee7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = {
>  		.data		= &dentry_stat,
>  		.maxlen		= 6*sizeof(int),
>  		.mode		= 0444,
> -		.proc_handler	= &proc_dointvec,
> +		.proc_handler	= &proc_nr_dentry,
>  	},
>  	{
>  		.ctl_name	= FS_OVERFLOWUID,
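
As a rough illustration of why a per-cpu counter avoids the cache line ping-pong mentioned in the changelog, here is a user-space sketch: each slot accumulates a private delta and folds it into the shared count only once it reaches a batch size, and a reader that wants an accurate value sums everything and clamps at zero. The names pc_counter, pc_init(), pc_add(), pc_sum_positive(), PC_SLOTS and PC_BATCH are hypothetical, and the sketch assumes each slot has a single writer; it is not the kernel's percpu_counter implementation.

```c
#include <assert.h>
#include <stdatomic.h>

#define PC_SLOTS 4	/* hypothetical number of "cpus" */
#define PC_BATCH 32	/* fold a slot into the shared count at this magnitude */

struct pc_counter {
	atomic_long count;		/* shared, approximate value */
	atomic_long local[PC_SLOTS];	/* per-slot deltas, one writer per slot */
};

static void pc_init(struct pc_counter *c, long value)
{
	atomic_init(&c->count, value);
	for (int i = 0; i < PC_SLOTS; i++)
		atomic_init(&c->local[i], 0);
}

/* Fast path touches only the caller's slot; the shared word is hit rarely. */
static void pc_add(struct pc_counter *c, int slot, long delta)
{
	long v = atomic_load_explicit(&c->local[slot], memory_order_relaxed) + delta;

	if (v >= PC_BATCH || v <= -PC_BATCH) {
		atomic_fetch_add(&c->count, v);
		v = 0;
	}
	atomic_store_explicit(&c->local[slot], v, memory_order_relaxed);
}

/* Slow, more accurate read: shared count plus every slot, never negative. */
static long pc_sum_positive(struct pc_counter *c)
{
	long sum = atomic_load(&c->count);

	for (int i = 0; i < PC_SLOTS; i++)
		sum += atomic_load_explicit(&c->local[i], memory_order_relaxed);
	return sum < 0 ? 0 : sum;
}
```

The trade-off is the one the patch accepts: updates become cheap and mostly local, while an exact read (here pc_sum_positive(), used for the sysctl in the patch via percpu_counter_sum_positive()) has to walk all slots.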

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry
@ 2008-12-16 21:04                                 ` Paul E. McKenney
  0 siblings, 0 replies; 349+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, Christoph Lameter,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Thu, Dec 11, 2008 at 11:38:56PM +0100, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping pongs
> between cpus to maintain this metric, and dcache_lock is
> no more needed to protect dentry_stat.nr_dentry
> 
> We centralize nr_dentry updates at the right place :
> - increments in d_alloc()
> - decrements in d_free()
> 
> d_alloc() can avoid taking dcache_lock if parent is NULL
> 
> ("socketallocbench -n8" result : 27.5s to 25s)

Looks good!  (At least once I realised that nr_dentry was global rather
than per-dentry!!!)

Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
> ---
>  fs/dcache.c        |   49 +++++++++++++++++++++++++------------------
>  include/linux/fs.h |    2 +
>  kernel/sysctl.c    |    2 -
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index fa1ba03..f463a81 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
>  static unsigned int d_hash_mask __read_mostly;
>  static unsigned int d_hash_shift __read_mostly;
>  static struct hlist_head *dentry_hashtable __read_mostly;
> +static struct percpu_counter nr_dentry;
> 
>  /* Statistics gathering. */
>  struct dentry_stat_t dentry_stat = {
>  	.age_limit = 45,
>  };
> 
> +/*
> + * Handle nr_dentry sysctl
> + */
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
> +	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +}
> +#else
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	return -ENOSYS;
> +}
> +#endif
> +
>  static void __d_free(struct dentry *dentry)
>  {
>  	WARN_ON(!list_empty(&dentry->d_alias));
> @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
>  }
> 
>  /*
> - * no dcache_lock, please.  The caller must decrement dentry_stat.nr_dentry
> - * inside dcache_lock.
> + * no dcache_lock, please.
>   */
>  static void d_free(struct dentry *dentry)
>  {
> @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
>  		__d_free(dentry);
>  	else
>  		call_rcu(&dentry->d_u.d_rcu, d_callback);
> +	percpu_counter_dec(&nr_dentry);
>  }
> 
>  /*
> @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
>  	struct dentry *parent;
> 
>  	list_del(&dentry->d_u.d_child);
> -	dentry_stat.nr_dentry--;	/* For d_free, below */
>  	/*drops the locks, at that point nobody can reach this dentry */
>  	dentry_iput(dentry);
>  	if (IS_ROOT(dentry))
> @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
>  static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  {
>  	struct dentry *parent;
> -	unsigned detached = 0;
> 
>  	BUG_ON(!IS_ROOT(dentry));
> 
> @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  			}
> 
>  			list_del(&dentry->d_u.d_child);
> -			detached++;
> 
>  			inode = dentry->d_inode;
>  			if (inode) {
> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  			 * otherwise we ascend to the parent and move to the
>  			 * next sibling if there is one */
>  			if (!parent)
> -				goto out;
> +				return;
> 
>  			dentry = parent;
> 
> @@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  		dentry = list_entry(dentry->d_subdirs.next,
>  				    struct dentry, d_u.d_child);
>  	}
> -out:
> -	/* several dentries were freed, need to correct nr_dentry */
> -	spin_lock(&dcache_lock);
> -	dentry_stat.nr_dentry -= detached;
> -	spin_unlock(&dcache_lock);
>  }
> 
>  /*
> @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
>  	dentry->d_flags = DCACHE_UNHASHED;
>  	spin_lock_init(&dentry->d_lock);
>  	dentry->d_inode = NULL;
> -	dentry->d_parent = NULL;
> -	dentry->d_sb = NULL;
>  	dentry->d_op = NULL;
>  	dentry->d_fsdata = NULL;
>  	dentry->d_mounted = 0;
> @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
>  	if (parent) {
>  		dentry->d_parent = dget(parent);
>  		dentry->d_sb = parent->d_sb;
> +		spin_lock(&dcache_lock);
> +		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> +		spin_unlock(&dcache_lock);
>  	} else {
> +		dentry->d_parent = NULL;
> +		dentry->d_sb = NULL;
>  		INIT_LIST_HEAD(&dentry->d_u.d_child);
>  	}
> -
> -	spin_lock(&dcache_lock);
> -	if (parent)
> -		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> -	dentry_stat.nr_dentry++;
> -	spin_unlock(&dcache_lock);
> -
> +	percpu_counter_inc(&nr_dentry);
>  	return dentry;
>  }
> 
> @@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
>  {
>  	int loop;
> 
> +	percpu_counter_init(&nr_dentry, 0);
>  	/* 
>  	 * A constructor could be added for stable state like the lists,
>  	 * but it is probably not worth it because of the cache nature
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4a853ef..114cb65 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata)
>  struct ctl_table;
>  int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
>  		  void __user *buffer, size_t *lenp, loff_t *ppos);
> +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos);
> 
>  int get_filesystem_list(char * buf);
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 3d56fe7..777bee7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = {
>  		.data		= &dentry_stat,
>  		.maxlen		= 6*sizeof(int),
>  		.mode		= 0444,
> -		.proc_handler	= &proc_dointvec,
> +		.proc_handler	= &proc_nr_dentry,
>  	},
>  	{
>  		.ctl_name	= FS_OVERFLOWUID,

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
@ 2008-12-16 21:10                                 ` Paul E. McKenney
  0 siblings, 0 replies; 349+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:39:10PM +0100, Eric Dumazet wrote:
> Avoids cache line ping-pongs between cpus and prepares the next patch,
> because updates of nr_inodes don't need inode_lock anymore.
> 
> (socket8 bench result : no difference at this point)

I do like this per-CPU counter infrastructure!

One small comment change noted below.  Other than that:

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/fs-writeback.c   |    2 +-
>  fs/inode.c          |   39 +++++++++++++++++++++++++++++++--------
>  include/linux/fs.h  |    3 +++
>  kernel/sysctl.c     |    4 ++--
>  mm/page-writeback.c |    2 +-
>  5 files changed, 38 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d0ff0b8..b591cdd 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
>  	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> 
>  	wbc.nr_to_write = nr_dirty + nr_unstable +
> -			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
> +			(get_nr_inodes() - inodes_stat.nr_unused) +
>  			nr_dirty + nr_unstable;
>  	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
>  	sync_sb_inodes(sb, &wbc);
> diff --git a/fs/inode.c b/fs/inode.c
> index 0487ddb..f94f889 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
>   * Statistics gathering..
>   */
>  struct inodes_stat_t inodes_stat;
> +static struct percpu_counter nr_inodes;
> 
>  static struct kmem_cache * inode_cachep __read_mostly;
> 
> +int get_nr_inodes(void)
> +{
> +	return percpu_counter_sum_positive(&nr_inodes);
> +}
> +
> +/*
> + * Handle nr_dentry sysctl

That would be "nr_inode", right?

> + */
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	inodes_stat.nr_inodes = get_nr_inodes();
> +	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +}
> +#else
> +int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	return -ENOSYS;
> +}
> +#endif
> +
>  static void wake_up_inode(struct inode *inode)
>  {
>  	/*
> @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
>  		destroy_inode(inode);
>  		nr_disposed++;
>  	}
> -	spin_lock(&inode_lock);
> -	inodes_stat.nr_inodes -= nr_disposed;
> -	spin_unlock(&inode_lock);
> +	percpu_counter_sub(&nr_inodes, nr_disposed);
>  }
> 
>  /*
> @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)
>  	
>  	inode = alloc_inode(sb);
>  	if (inode) {
> +		percpu_counter_inc(&nr_inodes);
>  		spin_lock(&inode_lock);
> -		inodes_stat.nr_inodes++;
>  		list_add(&inode->i_list, &inode_in_use);
>  		list_add(&inode->i_sb_list, &sb->s_inodes);
>  		inode->i_ino = ++last_ino;
> @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
>  			if (set(inode, data))
>  				goto set_failed;
> 
> -			inodes_stat.nr_inodes++;
> +			percpu_counter_inc(&nr_inodes);
>  			list_add(&inode->i_list, &inode_in_use);
>  			list_add(&inode->i_sb_list, &sb->s_inodes);
>  			hlist_add_head(&inode->i_hash, head);
> @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
>  		old = find_inode_fast(sb, head, ino);
>  		if (!old) {
>  			inode->i_ino = ino;
> -			inodes_stat.nr_inodes++;
> +			percpu_counter_inc(&nr_inodes);
>  			list_add(&inode->i_list, &inode_in_use);
>  			list_add(&inode->i_sb_list, &sb->s_inodes);
>  			hlist_add_head(&inode->i_hash, head);
> @@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
>  	list_del_init(&inode->i_list);
>  	list_del_init(&inode->i_sb_list);
>  	inode->i_state |= I_FREEING;
> -	inodes_stat.nr_inodes--;
>  	spin_unlock(&inode_lock);
> +	percpu_counter_dec(&nr_inodes);
> 
>  	security_inode_delete(inode);
> 
> @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
>  	list_del_init(&inode->i_list);
>  	list_del_init(&inode->i_sb_list);
>  	inode->i_state |= I_FREEING;
> -	inodes_stat.nr_inodes--;
>  	spin_unlock(&inode_lock);
> +	percpu_counter_dec(&nr_inodes);
>  	if (inode->i_data.nrpages)
>  		truncate_inode_pages(&inode->i_data, 0);
>  	clear_inode(inode);
> @@ -1394,6 +1416,7 @@ void __init inode_init(void)
>  {
>  	int loop;
> 
> +	percpu_counter_init(&nr_inodes, 0);
>  	/* inode slab cache */
>  	inode_cachep = kmem_cache_create("inode_cache",
>  					 sizeof(struct inode),
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 114cb65..a789346 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -47,6 +47,7 @@ struct inodes_stat_t {
>  	int dummy[5];		/* padding for sysctl ABI compatibility */
>  };
>  extern struct inodes_stat_t inodes_stat;
> +extern int get_nr_inodes(void);
> 
>  extern int leases_enable, lease_break_time;
> 
> @@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
>  		  void __user *buffer, size_t *lenp, loff_t *ppos);
>  int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
>  		   void __user *buffer, size_t *lenp, loff_t *ppos);
> +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
> +		   void __user *buffer, size_t *lenp, loff_t *ppos);
> 
>  int get_filesystem_list(char * buf);
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 777bee7..b705f3a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = {
>  		.data		= &inodes_stat,
>  		.maxlen		= 2*sizeof(int),
>  		.mode		= 0444,
> -		.proc_handler	= &proc_dointvec,
> +		.proc_handler	= &proc_nr_inodes,
>  	},
>  	{
>  		.ctl_name	= FS_STATINODE,
> @@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = {
>  		.data		= &inodes_stat,
>  		.maxlen		= 7*sizeof(int),
>  		.mode		= 0444,
> -		.proc_handler	= &proc_dointvec,
> +		.proc_handler	= &proc_nr_inodes,
>  	},
>  	{
>  		.procname	= "file-nr",
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2970e35..a71a922 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
>  	next_jif = start_jif + dirty_writeback_interval;
>  	nr_to_write = global_page_state(NR_FILE_DIRTY) +
>  			global_page_state(NR_UNSTABLE_NFS) +
> -			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
> +			(get_nr_inodes() - inodes_stat.nr_unused);
>  	while (nr_to_write > 0) {
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator
  2008-12-11 22:39                               ` Eric Dumazet
@ 2008-12-16 21:26                                 ` Paul E. McKenney
  -1 siblings, 0 replies; 349+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:39:18PM +0100, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get increasing
> inode numbers.
> 
> Solve this problem by giving each cpu a per_cpu variable,
> fed from the shared last_ino, but only once every 1024 allocations.
> 
> This reduces contention on the shared last_ino, and gives the same
> spread of ino numbers as before.
> (same wraparound after 2^32 allocations)

One question below, but just a clarification.  Works correctly as is,
though a bit strangely.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
>  1 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index f94f889..dc8e72a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -556,6 +556,36 @@ repeat:
>  	return node ? inode : NULL;
>  }
> 
> +#ifdef CONFIG_SMP
> +/*
> + * Each cpu owns a range of 1024 numbers.
> + * 'shared_last_ino' is dirtied only once out of 1024 allocations,
> + * to renew the exhausted range.
> + */
> +static DEFINE_PER_CPU(int, last_ino);
> +
> +static int last_ino_get(void)
> +{
> +	static atomic_t shared_last_ino;
> +	int *p = &get_cpu_var(last_ino);
> +	int res = *p;
> +
> +	if (unlikely((res & 1023) == 0))
> +		res = atomic_add_return(1024, &shared_last_ino) - 1024;
> +
> +	*p = ++res;

So the first CPU gets the range [1:1024], the second [1025:2048], and
so on, eventually wrapping to [4294966273:0].  Is that the intent?

(I don't see a problem with this, just seems a bit strange.)
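
The ranges above can be checked with a small userspace model of the
quoted function.  This is only a sketch under stated assumptions: the
names are hypothetical, the CPU index is passed in instead of using
get_cpu_var(), a plain variable replaces the atomic_t, and uint32_t
replaces int so that the wrap to 0 is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS	2
#define SKETCH_BATCH	1024u

/* Models the shared atomic_t; single-threaded here, so no atomics. */
static uint32_t sketch_shared_last_ino;
/* Models the per-CPU last_ino variable. */
static uint32_t sketch_cpu_last_ino[SKETCH_NR_CPUS];

static uint32_t sketch_last_ino_get(int cpu)
{
	uint32_t res;

	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	res = sketch_cpu_last_ino[cpu];

	/* Range exhausted (or never filled): take the next 1024 numbers. */
	if ((res & (SKETCH_BATCH - 1)) == 0) {
		/* Old value, i.e. atomic_add_return(1024, ...) - 1024. */
		res = sketch_shared_last_ino;
		sketch_shared_last_ino += SKETCH_BATCH;
	}

	sketch_cpu_last_ino[cpu] = ++res;
	return res;
}
```

The first caller on CPU 0 gets 1 and keeps going up to 1024 before it
touches the shared counter again; a first caller on CPU 1 in between
starts at 1025.  The last number of the final batch is 0, since the
increment wraps at 2^32.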

> +	put_cpu_var(last_ino);
> +	return res;
> +}
> +#else
> +static int last_ino_get(void)
> +{
> +	static int last_ino;
> +
> +	return ++last_ino;
> +}
> +#endif
> +
>  /**
>   *	new_inode 	- obtain an inode
>   *	@sb: superblock
> @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
>  	 * error if st_ino won't fit in target struct field. Use 32bit counter
>  	 * here to attempt to avoid that.
>  	 */
> -	static unsigned int last_ino;
>  	struct inode * inode;
> 
>  	spin_lock_prefetch(&inode_lock);
> @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
>  	inode = alloc_inode(sb);
>  	if (inode) {
>  		percpu_counter_inc(&nr_inodes);
> +		inode->i_state = 0;
> +		inode->i_ino = last_ino_get();
>  		spin_lock(&inode_lock);
>  		list_add(&inode->i_list, &inode_in_use);
>  		list_add(&inode->i_sb_list, &sb->s_inodes);
> -		inode->i_ino = ++last_ino;
> -		inode->i_state = 0;
>  		spin_unlock(&inode_lock);
>  	}
>  	return inode;

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
@ 2008-12-16 21:40                                 ` Paul E. McKenney
  0 siblings, 0 replies; 349+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:39:38PM +0100, Eric Dumazet wrote:
> Sockets, pipes and anonymous fds have interesting properties.
> 
> Like other files, they use a dentry and an inode.
> 
> But dentries for these kinds of files are not hashed into dcache,
> since there is no way someone can lookup such a file in the vfs tree.
> (/proc/{pid}/fd/{number} uses a different mechanism)
> 
> Still, allocating and freeing such dentries are expensive processes,
> because we currently take dcache_lock inside d_alloc(), d_instantiate(),
> and dput(). This lock is very contended on SMP machines.
> 
> This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
> a single one (for sockets, pipes, anonymous fd), and a new
> d_alloc_single(const struct qstr *name, struct inode *inode)
> method, called by the three subsystems.
> 
> Internally, dput() can take a fast path to dput_single() for
> SINGLE dentries. No more atomic_dec_and_lock()
> for such dentries.
> 
> 
> Differences between a SINGLE dentry and a normal one are:
> 
> 1) SINGLE dentry has the DCACHE_SINGLE flag
> 2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED).
>    This is to avoid taking a reference on the sb 'root' dentry, shared
>    by too many dentries.
> 3) They are not hashed into global hash table (DCACHE_UNHASHED)
> 4) Their d_alias list is empty
> 
> ("socketallocbench -n 8" bench result : from 25s to 19.9s)

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
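
As I read it, the fast path needs no lock because a SINGLE dentry sits
on no hash chain and no parent list, so the last reference holder has
nothing to unlink.  A minimal userspace sketch of that pattern, using
C11 atomics and hypothetical names (sketch_obj, sketch_put), with a
flag standing in for the iput() and d_free() calls:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_obj {
	atomic_int count;	/* reference count, like d_count */
	bool freed;		/* stands in for iput() + d_free() */
};

/* Drop one reference; returns true if this call released the object. */
static bool sketch_put(struct sketch_obj *obj)
{
	/* atomic_fetch_sub() returns the old value: 1 means we hit zero. */
	if (atomic_fetch_sub(&obj->count, 1) != 1)
		return false;

	/* Last reference: nothing else can reach obj, so no lock needed. */
	obj->freed = true;
	return true;
}
```

With two references held, the first sketch_put() only decrements and
the second one releases the object.  The regular dput() path cannot
be this simple because it may have to unhash the dentry under
dcache_lock.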

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/anon_inodes.c       |   16 ------------
>  fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
>  fs/pipe.c              |   23 +----------------
>  include/linux/dcache.h |    9 ++++++
>  net/socket.c           |   24 +-----------------
>  5 files changed, 65 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 3662dd4..8bf83cb 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
>  			     mnt);
>  }
> 
> -static int anon_inodefs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * We faked vfs to believe the dentry was hashed when we created it.
> -	 * Now we restore the flag so that dput() will work correctly.
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 1;
> -}
> -
>  static struct file_system_type anon_inode_fs_type = {
>  	.name		= "anon_inodefs",
>  	.get_sb		= anon_inodefs_get_sb,
>  	.kill_sb	= kill_anon_super,
>  };
>  static struct dentry_operations anon_inodefs_dentry_operations = {
> -	.d_delete	= anon_inodefs_delete_dentry,
>  };
> 
>  /**
> @@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
>  	this.name = name;
>  	this.len = strlen(name);
>  	this.hash = 0;
> -	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
> +	dentry = d_alloc_single(&this, anon_inode_inode);
>  	if (!dentry)
>  		goto err_put_unused_fd;
> 
> @@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
>  	atomic_inc(&anon_inode_inode->i_count);
> 
>  	dentry->d_op = &anon_inodefs_dentry_operations;
> -	/* Do not publish this dentry inside the global dentry hash table */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, anon_inode_inode);
> 
>  	error = -ENFILE;
>  	file = alloc_file(anon_inode_mnt, dentry,
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f463a81..af3bfb3 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
>   */
> 
>  /*
> + * special version of dput() for pipes/sockets/anon.
> + * These dentries are not present in hash table, we can avoid
> + * taking/dirtying dcache_lock
> + */
> +static void dput_single(struct dentry *dentry)
> +{
> +	struct inode *inode;
> +
> +	if (!atomic_dec_and_test(&dentry->d_count))
> +		return;
> +	inode = dentry->d_inode;
> +	if (inode)
> +		iput(inode);
> +	d_free(dentry);
> +}
> +
> +/*
>   * dput - release a dentry
>   * @dentry: dentry to release 
>   *
> @@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
>  {
>  	if (!dentry)
>  		return;
> +	/*
> +	 * single dentries (sockets/pipes/anon) fast path
> +	 */
> +	if (dentry->d_flags & DCACHE_SINGLE)
> +		return dput_single(dentry);
> 
>  repeat:
>  	if (atomic_read(&dentry->d_count) == 1)
> @@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
>  	return res;
>  }
> 
> +/**
> + * d_alloc_single - allocate SINGLE dentry
> + * @name: dentry name, given in a qstr structure
> + * @inode: inode to allocate the dentry for
> + *
> + * Allocate an SINGLE dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory.
> + * - SINGLE dentries have themselves as a parent.
> + * - SINGLE dentries are not hashed into global hash table
> + * - their d_alias list is empty
> + */
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> +	struct dentry *entry;
> +
> +	entry = d_alloc(NULL, name);
> +	if (entry) {
> +		entry->d_sb = inode->i_sb;
> +		entry->d_parent = entry;
> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> +		entry->d_inode = inode;
> +		fsnotify_d_instantiate(entry, inode);
> +		security_d_instantiate(entry, inode);
> +	}
> +	return entry;
> +}
> +
> +
>  static inline struct hlist_head *d_hash(struct dentry *parent,
>  					unsigned long hash)
>  {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 7aea8b8..4de6dd5 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
>  }
> 
>  static struct vfsmount *pipe_mnt __read_mostly;
> -static int pipefs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * At creation time, we pretended this dentry was hashed
> -	 * (by clearing DCACHE_UNHASHED bit in d_flags)
> -	 * At delete time, we restore the truth : not hashed.
> -	 * (so that dput() can proceed correctly)
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 0;
> -}
> 
>  /*
>   * pipefs_dname() is called from d_path().
> @@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
>  }
> 
>  static struct dentry_operations pipefs_dentry_operations = {
> -	.d_delete	= pipefs_delete_dentry,
>  	.d_dname	= pipefs_dname,
>  };
> 
> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>  	struct inode *inode;
>  	struct file *f;
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
> 
>  	err = -ENFILE;
>  	inode = get_pipe_inode();
> @@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
>  		goto err;
> 
>  	err = -ENOMEM;
> -	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
> +	dentry = d_alloc_single(&name, inode);
>  	if (!dentry)
>  		goto err_inode;
> 
>  	dentry->d_op = &pipefs_dentry_operations;
> -	/*
> -	 * We dont want to publish this dentry into global dentry hash table.
> -	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> -	 * This permits a working /proc/$pid/fd/XXX on pipes
> -	 */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, inode);
> 
>  	err = -ENFILE;
>  	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index a37359d..ca8d269 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
>  #define DCACHE_UNHASHED		0x0010	
> 
>  #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
> +#define DCACHE_SINGLE		0x0040
> +	/*
> +	 * socket, pipe or anonymous fd dentry
> +	 * - SINGLE dentries have themselves as a parent.
> +	 * - SINGLE dentries are not hashed into global hash table
> +	 * - Their d_alias list is empty
> +	 * - They dont need dcache_lock synchronization
> +	 */
> 
>  extern spinlock_t dcache_lock;
>  extern seqlock_t rename_lock;
> @@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
>  extern void shrink_dcache_parent(struct dentry *);
>  extern void shrink_dcache_for_umount(struct super_block *);
>  extern int d_invalidate(struct dentry *);
> +extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
> 
>  /* only used at mount-time */
>  extern struct dentry * d_alloc_root(struct inode *);
> diff --git a/net/socket.c b/net/socket.c
> index 92764d8..353c928 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
>  	.kill_sb =	kill_anon_super,
>  };
> 
> -static int sockfs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * At creation time, we pretended this dentry was hashed
> -	 * (by clearing DCACHE_UNHASHED bit in d_flags)
> -	 * At delete time, we restore the truth : not hashed.
> -	 * (so that dput() can proceed correctly)
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 0;
> -}
> -
>  /*
>   * sockfs_dname() is called from d_path().
>   */
> @@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
>  }
> 
>  static struct dentry_operations sockfs_dentry_operations = {
> -	.d_delete = sockfs_delete_dentry,
>  	.d_dname  = sockfs_dname,
>  };
> 
> @@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>  {
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
> 
> -	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
> +	dentry = d_alloc_single(&name, SOCK_INODE(sock));
>  	if (unlikely(!dentry))
>  		return -ENOMEM;
> 
>  	dentry->d_op = &sockfs_dentry_operations;
> -	/*
> -	 * We dont want to push this dentry into global dentry hash table.
> -	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> -	 * This permits a working /proc/$pid/fd/XXX on sockets
> -	 */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, SOCK_INODE(sock));
> 
>  	sock->file = file;
>  	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
@ 2008-12-16 21:40                                 ` Paul E. McKenney
  0 siblings, 0 replies; 349+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:39:38PM +0100, Eric Dumazet wrote:
> Sockets, pipes and anonymous fds have interesting properties.
> 
> Like other files, they use a dentry and an inode.
> 
> But dentries for these kinds of files are not hashed into the dcache,
> since there is no way someone can look up such a file in the vfs tree.
> (/proc/{pid}/fd/{number} uses a different mechanism)
> 
> Still, allocating and freeing such dentries are expensive processes,
> because we currently take dcache_lock inside d_alloc(), d_instantiate(),
> and dput(). This lock is very contended on SMP machines.
> 
> This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
> a single one (for sockets, pipes, anonymous fd), and a new
> d_alloc_single(const struct qstr *name, struct inode *inode)
> method, called by the three subsystems.
> 
> Internally, dput() can take a fast path to dput_single() for
> SINGLE dentries. No more atomic_dec_and_lock()
> for such dentries.
> 
> 
> Differences between a SINGLE dentry and a normal one are:
> 
> 1) A SINGLE dentry has the DCACHE_SINGLE flag set.
> 2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED).
>    This avoids taking a reference on the sb 'root' dentry, which is
>    shared by too many dentries.
> 3) They are not hashed into the global hash table (DCACHE_UNHASHED).
> 4) Their d_alias list is empty.
> 
> ("socketallocbench -n 8" bench result : from 25s to 19.9s)

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
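
To make the fast path described above concrete, here is a minimal
user-space sketch of the reference-drop logic. It is written under stated
assumptions: C11 atomics stand in for the kernel's atomic_t, and
toy_dentry, toy_alloc_single() and toy_dput_single() are hypothetical
names, not kernel APIs. It only illustrates why an object that sits on no
shared list can be freed by whoever drops the last reference without a
global lock; it is not the patch's implementation.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dentry; only the refcount matters here. */
struct toy_dentry {
	atomic_int count;
};

static int toy_freed;	/* counts frees, for demonstration only */

/* Allocate an object holding one reference, owned by the caller. */
static struct toy_dentry *toy_alloc_single(void)
{
	struct toy_dentry *d = malloc(sizeof(*d));

	if (d)
		atomic_init(&d->count, 1);
	return d;
}

/*
 * Drop one reference. The object is on no shared list, so the thread
 * that brings the count to zero may free it without any global lock.
 */
static void toy_dput_single(struct toy_dentry *d)
{
	/* atomic_fetch_sub() returns the value held before the decrement */
	if (atomic_fetch_sub(&d->count, 1) != 1)
		return;
	toy_freed++;
	free(d);
}
```

With two references held, the first toy_dput_single() call leaves the
object alive and the second one frees it. Neither path takes a lock, which
is the contrast with atomic_dec_and_lock() drawn in the commit message.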

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/anon_inodes.c       |   16 ------------
>  fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
>  fs/pipe.c              |   23 +----------------
>  include/linux/dcache.h |    9 ++++++
>  net/socket.c           |   24 +-----------------
>  5 files changed, 65 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 3662dd4..8bf83cb 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
>  			     mnt);
>  }
> 
> -static int anon_inodefs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * We faked vfs to believe the dentry was hashed when we created it.
> -	 * Now we restore the flag so that dput() will work correctly.
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 1;
> -}
> -
>  static struct file_system_type anon_inode_fs_type = {
>  	.name		= "anon_inodefs",
>  	.get_sb		= anon_inodefs_get_sb,
>  	.kill_sb	= kill_anon_super,
>  };
>  static struct dentry_operations anon_inodefs_dentry_operations = {
> -	.d_delete	= anon_inodefs_delete_dentry,
>  };
> 
>  /**
> @@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
>  	this.name = name;
>  	this.len = strlen(name);
>  	this.hash = 0;
> -	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
> +	dentry = d_alloc_single(&this, anon_inode_inode);
>  	if (!dentry)
>  		goto err_put_unused_fd;
> 
> @@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
>  	atomic_inc(&anon_inode_inode->i_count);
> 
>  	dentry->d_op = &anon_inodefs_dentry_operations;
> -	/* Do not publish this dentry inside the global dentry hash table */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, anon_inode_inode);
> 
>  	error = -ENFILE;
>  	file = alloc_file(anon_inode_mnt, dentry,
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f463a81..af3bfb3 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
>   */
> 
>  /*
> + * special version of dput() for pipes/sockets/anon.
> + * These dentries are not present in hash table, we can avoid
> + * taking/dirtying dcache_lock
> + */
> +static void dput_single(struct dentry *dentry)
> +{
> +	struct inode *inode;
> +
> +	if (!atomic_dec_and_test(&dentry->d_count))
> +		return;
> +	inode = dentry->d_inode;
> +	if (inode)
> +		iput(inode);
> +	d_free(dentry);
> +}
> +
> +/*
>   * dput - release a dentry
>   * @dentry: dentry to release 
>   *
> @@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
>  {
>  	if (!dentry)
>  		return;
> +	/*
> +	 * single dentries (sockets/pipes/anon) fast path
> +	 */
> +	if (dentry->d_flags & DCACHE_SINGLE)
> +		return dput_single(dentry);
> 
>  repeat:
>  	if (atomic_read(&dentry->d_count) == 1)
> @@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
>  	return res;
>  }
> 
> +/**
> + * d_alloc_single - allocate SINGLE dentry
> + * @name: dentry name, given in a qstr structure
> + * @inode: inode to allocate the dentry for
> + *
> + * Allocate an SINGLE dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory.
> + * - SINGLE dentries have themselves as a parent.
> + * - SINGLE dentries are not hashed into global hash table
> + * - their d_alias list is empty
> + */
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> +	struct dentry *entry;
> +
> +	entry = d_alloc(NULL, name);
> +	if (entry) {
> +		entry->d_sb = inode->i_sb;
> +		entry->d_parent = entry;
> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> +		entry->d_inode = inode;
> +		fsnotify_d_instantiate(entry, inode);
> +		security_d_instantiate(entry, inode);
> +	}
> +	return entry;
> +}
> +
> +
>  static inline struct hlist_head *d_hash(struct dentry *parent,
>  					unsigned long hash)
>  {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 7aea8b8..4de6dd5 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
>  }
> 
>  static struct vfsmount *pipe_mnt __read_mostly;
> -static int pipefs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * At creation time, we pretended this dentry was hashed
> -	 * (by clearing DCACHE_UNHASHED bit in d_flags)
> -	 * At delete time, we restore the truth : not hashed.
> -	 * (so that dput() can proceed correctly)
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 0;
> -}
> 
>  /*
>   * pipefs_dname() is called from d_path().
> @@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
>  }
> 
>  static struct dentry_operations pipefs_dentry_operations = {
> -	.d_delete	= pipefs_delete_dentry,
>  	.d_dname	= pipefs_dname,
>  };
> 
> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>  	struct inode *inode;
>  	struct file *f;
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
> 
>  	err = -ENFILE;
>  	inode = get_pipe_inode();
> @@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
>  		goto err;
> 
>  	err = -ENOMEM;
> -	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
> +	dentry = d_alloc_single(&name, inode);
>  	if (!dentry)
>  		goto err_inode;
> 
>  	dentry->d_op = &pipefs_dentry_operations;
> -	/*
> -	 * We dont want to publish this dentry into global dentry hash table.
> -	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> -	 * This permits a working /proc/$pid/fd/XXX on pipes
> -	 */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, inode);
> 
>  	err = -ENFILE;
>  	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index a37359d..ca8d269 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -176,6 +176,14 @@ d_iput:		no		no		no       yes
>  #define DCACHE_UNHASHED		0x0010	
> 
>  #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
> +#define DCACHE_SINGLE		0x0040
> +	/*
> +	 * socket, pipe or anonymous fd dentry
> +	 * - SINGLE dentries have themselves as a parent.
> +	 * - SINGLE dentries are not hashed into global hash table
> +	 * - Their d_alias list is empty
> +	 * - They dont need dcache_lock synchronization
> +	 */
> 
>  extern spinlock_t dcache_lock;
>  extern seqlock_t rename_lock;
> @@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
>  extern void shrink_dcache_parent(struct dentry *);
>  extern void shrink_dcache_for_umount(struct super_block *);
>  extern int d_invalidate(struct dentry *);
> +extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
> 
>  /* only used at mount-time */
>  extern struct dentry * d_alloc_root(struct inode *);
> diff --git a/net/socket.c b/net/socket.c
> index 92764d8..353c928 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
>  	.kill_sb =	kill_anon_super,
>  };
> 
> -static int sockfs_delete_dentry(struct dentry *dentry)
> -{
> -	/*
> -	 * At creation time, we pretended this dentry was hashed
> -	 * (by clearing DCACHE_UNHASHED bit in d_flags)
> -	 * At delete time, we restore the truth : not hashed.
> -	 * (so that dput() can proceed correctly)
> -	 */
> -	dentry->d_flags |= DCACHE_UNHASHED;
> -	return 0;
> -}
> -
>  /*
>   * sockfs_dname() is called from d_path().
>   */
> @@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
>  }
> 
>  static struct dentry_operations sockfs_dentry_operations = {
> -	.d_delete = sockfs_delete_dentry,
>  	.d_dname  = sockfs_dname,
>  };
> 
> @@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>  {
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
> 
> -	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
> +	dentry = d_alloc_single(&name, SOCK_INODE(sock));
>  	if (unlikely(!dentry))
>  		return -ENOMEM;
> 
>  	dentry->d_op = &sockfs_dentry_operations;
> -	/*
> -	 * We dont want to push this dentry into global dentry hash table.
> -	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> -	 * This permits a working /proc/$pid/fd/XXX on sockets
> -	 */
> -	dentry->d_flags &= ~DCACHE_UNHASHED;
> -	d_instantiate(dentry, SOCK_INODE(sock));
> 
>  	sock->file = file;
>  	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

* Re: [PATCH v3 5/7] fs: new_inode_single() and iput_single()
  2008-12-11 22:40                               ` Eric Dumazet
@ 2008-12-16 21:41                                 ` Paul E. McKenney
  -1 siblings, 0 replies; 349+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:40:07PM +0100, Eric Dumazet wrote:
> The goal of this patch is to avoid touching inode_lock when allocating or
> freeing socket/pipe/anonfd inodes.
> 
> SINGLE dentries are attached to inodes that don't need to be linked
> in a list of inodes, be it "inode_in_use" or "sb->s_inodes".
> As inode_lock was taken only to protect these lists, we avoid taking it
> as well.
> 
> Using iput_single() from dput_single() avoids taking inode_lock
> at freeing time.
> 
> This patch has a very noticeable effect, because we avoid dirtying
> three contended cache lines in new_inode(), and five cache lines in iput().
> 
> ("socketallocbench -n 8" result : from 19.9s to 3.01s)

Nice!

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
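
The single/non-single split in __new_inode() can be sketched in user space
as follows. This is an illustration under assumptions, not kernel code:
toy_list mimics the kernel's list_head, toy_in_use plays the role of
inode_in_use, and toy_lock_taken is a hypothetical counter that stands in
for acquiring inode_lock so the difference between the two paths is
visible.

```c
/* Minimal doubly linked list, modelled on the kernel's list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

/* Make the node point at itself: an empty, unlinked list. */
static void toy_list_init(struct toy_list *l)
{
	l->next = l;
	l->prev = l;
}

/* Insert item right after head. */
static void toy_list_add(struct toy_list *item, struct toy_list *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

static int toy_list_empty(const struct toy_list *l)
{
	return l->next == l;
}

/* Hypothetical globals: a shared in-use list and a lock-acquisition counter. */
static struct toy_list toy_in_use = { &toy_in_use, &toy_in_use };
static int toy_lock_taken;

struct toy_inode {
	struct toy_list i_list;
};

/* single != 0: leave the inode unlinked and never touch the shared lock. */
static void toy_inode_link(struct toy_inode *inode, int single)
{
	if (single) {
		toy_list_init(&inode->i_list);
	} else {
		toy_lock_taken++;	/* stands in for spin_lock(&inode_lock) */
		toy_list_add(&inode->i_list, &toy_in_use);
		/* the matching spin_unlock() would go here */
	}
}
```

A "single" inode ends up with a self-pointing list node and the shared
list and lock are never written, which is where the avoided cache-line
dirtying comes from. The regular path still links the inode under the
lock.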

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/anon_inodes.c   |    2 +-
>  fs/dcache.c        |    2 +-
>  fs/inode.c         |   29 ++++++++++++++++++++---------
>  fs/pipe.c          |    2 +-
>  include/linux/fs.h |   12 +++++++++++-
>  net/socket.c       |    2 +-
>  6 files changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 8bf83cb..89fd36d 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
>   */
>  static struct inode *anon_inode_mkinode(void)
>  {
> -	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
> +	struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
> 
>  	if (!inode)
>  		return ERR_PTR(-ENOMEM);
> diff --git a/fs/dcache.c b/fs/dcache.c
> index af3bfb3..3363853 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
>  		return;
>  	inode = dentry->d_inode;
>  	if (inode)
> -		iput(inode);
> +		iput_single(inode);
>  	d_free(dentry);
>  }
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index dc8e72a..0fdfe1b 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
>  		kmem_cache_free(inode_cachep, (inode));
>  }
> 
> +void iput_single(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_count)) {
> +		destroy_inode(inode);
> +		percpu_counter_dec(&nr_inodes);
> +	}
> +}
> 
>  /*
>   * These are initializations that only need to be done
> @@ -587,8 +594,9 @@ static int last_ino_get(void)
>  #endif
> 
>  /**
> - *	new_inode 	- obtain an inode
> + *	__new_inode 	- obtain an inode
>   *	@sb: superblock
> + *  @single: if true, dont link new inode in a list
>   *
>   *	Allocates a new inode for given superblock. The default gfp_mask
>   *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
> @@ -598,7 +606,7 @@ static int last_ino_get(void)
>   *	newly created inode's mapping
>   *
>   */
> -struct inode *new_inode(struct super_block *sb)
> +struct inode *__new_inode(struct super_block *sb, int single)
>  {
>  	/*
>  	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> @@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
>  	 */
>  	struct inode * inode;
> 
> -	spin_lock_prefetch(&inode_lock);
> -	
>  	inode = alloc_inode(sb);
>  	if (inode) {
>  		percpu_counter_inc(&nr_inodes);
>  		inode->i_state = 0;
>  		inode->i_ino = last_ino_get();
> -		spin_lock(&inode_lock);
> -		list_add(&inode->i_list, &inode_in_use);
> -		list_add(&inode->i_sb_list, &sb->s_inodes);
> -		spin_unlock(&inode_lock);
> + 		if (single) {
> +  			INIT_LIST_HEAD(&inode->i_list);
> +  			INIT_LIST_HEAD(&inode->i_sb_list);
> + 		} else {
> +			spin_lock(&inode_lock);
> +			list_add(&inode->i_list, &inode_in_use);
> +			list_add(&inode->i_sb_list, &sb->s_inodes);
> +			spin_unlock(&inode_lock);
> +		}
>  	}
>  	return inode;
>  }
> 
> -EXPORT_SYMBOL(new_inode);
> +EXPORT_SYMBOL(__new_inode);
> 
>  void unlock_new_inode(struct inode *inode)
>  {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 4de6dd5..8c51a0d 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
> 
>  static struct inode * get_pipe_inode(void)
>  {
> -	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
> +	struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
>  	struct pipe_inode_info *pipe;
> 
>  	if (!inode)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a789346..a702d81 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode);
>  extern void iget_failed(struct inode *);
>  extern void clear_inode(struct inode *);
>  extern void destroy_inode(struct inode *);
> -extern struct inode *new_inode(struct super_block *);
> +extern struct inode *__new_inode(struct super_block *, int);
> +static inline struct inode *new_inode(struct super_block *sb)
> +{
> +	return __new_inode(sb, 0);
> +}
> +static inline struct inode *new_inode_single(struct super_block *sb)
> +{
> +	return __new_inode(sb, 1);
> +}
> +extern void iput_single(struct inode *);
> +
>  extern int should_remove_suid(struct dentry *);
>  extern int file_remove_suid(struct file *);
> 
> diff --git a/net/socket.c b/net/socket.c
> index 353c928..4017409 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -464,7 +464,7 @@ static struct socket *sock_alloc(void)
>  	struct inode *inode;
>  	struct socket *sock;
> 
> -	inode = new_inode(sock_mnt->mnt_sb);
> +	inode = new_inode_single(sock_mnt->mnt_sb);
>  	if (!inode)
>  		return NULL;
> 

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
@ 2008-12-17 20:25                                         ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-17 20:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar,
	Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel,
	Al Viro

Christoph Lameter wrote:
> On Fri, 12 Dec 2008, Eric Dumazet wrote:
> 
>>> a truly allocated file. At this point the file is
>>> a truly allocated file but not anymore ours.
> 
> It's a valid file. Does ownership matter here?
> 
>> Reading this mail again, I realise we call put_filp(file), while this should
>> be either fput(file) or put_filp(file); we don't know which.
>>
>> Damn, this patch is wrong as is.
>>
>> Christoph, Paul, do you see the problem?
> 
> Yes.
> 
>> In fget()/fget_light() we don't know whether the other thread (the one that re-allocated the file
>> and tried to close it while we held a reference on it) had to call put_filp() or fput()
>> to release its own reference. So we call atomic_long_dec_and_test() and cannot
>> take the appropriate action (calling the full __fput() version or the small one
>> that some systems use to 'close' a not really opened file).
> 
> The difference is mainly that fput() does full processing whereas
> put_filp() is used when we know that the file was not fully operational.
> If the checks in __fput are able to handle the put_filp() situation by not
> releasing resources that were not allocated then we should be fine.
> 
>> I believe put_filp() is only called on slowpath (error cases).
> 
> Looks like it. It seems to assume that no dentry is associated.
> 
>> Should we just zap it and always call fput() ?
> 
> Only if fput() can handle partially setup files.

It can do that if we add a check for NULL dentry in __fput(), so put_filp() can disappear.

But there is a remaining point where we do an atomic_long_dec_and_test(&...->f_count),
in fs/aio.c, function __aio_put_req(). This one is tricky :(
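
The idea of folding put_filp() into fput() by checking for a NULL dentry
can be sketched in user space like this. It is a hedged illustration only:
toy_file, toy_file_alloc() and toy_fput() are hypothetical names, C11
atomics stand in for atomic_long_t, and the counters exist purely to make
the behaviour observable. It does not reproduce the real __fput().

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct file. */
struct toy_file {
	atomic_long f_count;
	void *dentry;	/* NULL while the file is only partially set up */
};

static int toy_dentry_puts;	/* how often a dentry was released */
static int toy_file_frees;	/* how often a file was freed */

/* Allocate a file holding one reference; dentry may be NULL. */
static struct toy_file *toy_file_alloc(void *dentry)
{
	struct toy_file *f = malloc(sizeof(*f));

	if (f) {
		atomic_init(&f->f_count, 1);
		f->dentry = dentry;
	}
	return f;
}

/*
 * Single release path: whoever drops the last reference tears the file
 * down, skipping the dentry work when no dentry was ever attached.
 */
static void toy_fput(struct toy_file *f)
{
	if (atomic_fetch_sub(&f->f_count, 1) != 1)
		return;
	if (f->dentry)
		toy_dentry_puts++;	/* stands in for releasing the dentry */
	toy_file_frees++;
	free(f);
}
```

With one release function that tolerates partially set-up files, a thread
that took a reference on a recycled object no longer needs to know which
of the two release variants the other owner would have used.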



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
@ 2008-12-17 20:25                                         ` Eric Dumazet
  0 siblings, 0 replies; 349+ messages in thread
From: Eric Dumazet @ 2008-12-17 20:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar,
	Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>
	Kernel Testers List, Mike Galbraith, Peter Zijlstra,
	Linux Netdev List, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

Christoph Lameter a écrit :
> On Fri, 12 Dec 2008, Eric Dumazet wrote:
> 
>>> a truly allocated file. At this point the file is
>>> a truly allocated file but not anymore ours.
> 
> Its a valid file. Does ownership matter here?
> 
>> Reading this mail again I realise we call put_filp(file), while this should
>> be fput(file) or put_filp(file); we don't know.
>>
>> Damned, this patch is wrong as is.
>>
>> Christoph, Paul, do you see the problem ?
> 
> Yes.
> 
>> In fget()/fget_light() we don't know if the other thread (the one who re-allocated the file,
>> and tried to close it while we got a reference on file) had to call put_filp() or fput()
>> to release its own reference. So we call atomic_long_dec_and_test() and cannot
>> take the appropriate action (calling the full __fput() version or the small one
>> that some systems use to 'close' a not really opened file).
> 
> The difference is mainly that fput() does full processing whereas
> put_filp() is used when we know that the file was not fully operational.
> If the checks in __fput are able to handle the put_filp() situation by not
> releasing resources that were not allocated then we should be fine.
> 
>> I believe put_filp() is only called on slowpath (error cases).
> 
> Looks like it. It seems to assume that no dentry is associated.
> 
>> Should we just zap it and always call fput() ?
> 
> Only if fput() can handle partially setup files.

It can do that if we add a check for NULL dentry in __fput(), so put_filp() can disappear.

But there is a remaining point where we do an atomic_long_dec_and_test(&...->f_count),
in fs/aio.c, function __aio_put_req(). This one is tricky :(
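
As a rough illustration of the single-release-path idea above (not the actual
kernel change), here is a small userspace C model. The names model_file,
model_dentry, model_final_put() and model_fput() are hypothetical stand-ins,
and C11 atomics take the place of atomic_long_dec_and_test(). The only
assumption taken from the discussion is that the final put skips dentry
teardown when no dentry was ever attached, so one function can serve both
fully opened and partially set up files.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's definitions. */
struct model_dentry {
    int released;                   /* set once the dentry has been dropped */
};

struct model_file {
    atomic_long f_count;            /* reference count, in the role of file->f_count */
    struct model_dentry *f_dentry;  /* NULL while the file is only partially set up */
    int freed;                      /* set when the object is handed back */
};

/*
 * Final teardown, in the role of __fput(). The NULL test is the check
 * proposed in the mail: resources that were never set up are skipped.
 */
static void model_final_put(struct model_file *f)
{
    if (f->f_dentry != NULL) {
        f->f_dentry->released = 1;
        f->f_dentry = NULL;
    }
    f->freed = 1;
}

/*
 * Single release path for every caller.
 * Returns 1 if this call dropped the last reference, 0 otherwise.
 */
static int model_fput(struct model_file *f)
{
    /* atomic_fetch_sub() returns the previous value; 1 means we reached zero. */
    long old = atomic_fetch_sub(&f->f_count, 1);

    assert(old > 0);
    if (old == 1) {
        model_final_put(f);
        return 1;
    }
    return 0;
}
```

With this shape, whichever thread happens to drop the last reference does not
need to know whether the file was ever fully opened. The __aio_put_req() case
mentioned above is not modelled here.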


end of thread, other threads:[~2008-12-17 20:27 UTC | newest]

Thread overview: 349+ messages
2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
2008-11-16 17:38 ` Rafael J. Wysocki
2008-11-16 17:38 ` [Bug #11207] VolanoMark regression with 2.6.27-rc1 Rafael J. Wysocki
2008-11-16 17:38   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11215] INFO: possible recursive locking detected ps2_command Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-17  9:06   ` Ingo Molnar
2008-11-17  9:06     ` Ingo Molnar
2008-11-17  9:14     ` David Miller
2008-11-17  9:14       ` David Miller
2008-11-17 11:01       ` Ingo Molnar
2008-11-17 11:01         ` Ingo Molnar
2008-11-17 11:20         ` Eric Dumazet
2008-11-17 16:11           ` Ingo Molnar
2008-11-17 16:11             ` Ingo Molnar
2008-11-17 16:35             ` Eric Dumazet
2008-11-17 16:35               ` Eric Dumazet
2008-11-17 17:08               ` Ingo Molnar
2008-11-17 17:08                 ` Ingo Molnar
2008-11-17 17:25                 ` Ingo Molnar
2008-11-17 17:25                   ` Ingo Molnar
2008-11-17 17:33                   ` Eric Dumazet
2008-11-17 17:33                     ` Eric Dumazet
2008-11-17 17:38                     ` Linus Torvalds
2008-11-17 17:38                       ` Linus Torvalds
2008-11-17 17:42                       ` Eric Dumazet
2008-11-17 17:42                         ` Eric Dumazet
2008-11-17 18:23                       ` Ingo Molnar
2008-11-17 18:23                         ` Ingo Molnar
2008-11-17 18:33                         ` Linus Torvalds
2008-11-17 18:33                           ` Linus Torvalds
2008-11-17 18:49                         ` Ingo Molnar
2008-11-17 18:49                           ` Ingo Molnar
2008-11-17 19:30                           ` Eric Dumazet
2008-11-17 19:30                             ` Eric Dumazet
2008-11-17 19:39                           ` David Miller
2008-11-17 19:39                             ` David Miller
2008-11-17 19:43                             ` Eric Dumazet
2008-11-17 19:43                               ` Eric Dumazet
2008-11-17 19:55                             ` Linus Torvalds
2008-11-17 19:55                               ` Linus Torvalds
2008-11-17 20:16                               ` David Miller
2008-11-17 20:16                                 ` David Miller
2008-11-17 20:30                                 ` Linus Torvalds
2008-11-17 20:30                                   ` Linus Torvalds
2008-11-17 20:58                                   ` David Miller
2008-11-17 20:58                                     ` David Miller
2008-11-18  9:44                                     ` Nick Piggin
2008-11-18  9:44                                       ` Nick Piggin
2008-11-18 15:58                                       ` Linus Torvalds
2008-11-18 15:58                                         ` Linus Torvalds
2008-11-19  4:31                                         ` Nick Piggin
2008-11-20  9:14                                         ` David Miller
2008-11-20  9:14                                           ` David Miller
2008-11-20  9:06                                       ` David Miller
2008-11-20  9:06                                         ` David Miller
2008-11-18 12:29                             ` Mike Galbraith
2008-11-18 12:29                               ` Mike Galbraith
2008-11-17 19:57                           ` Ingo Molnar
2008-11-17 19:57                             ` Ingo Molnar
2008-11-17 20:20                           ` (avc_has_perm_noaudit()) " Ingo Molnar
2008-11-17 20:20                             ` Ingo Molnar
2008-11-17 20:32                           ` ip_queue_xmit(): " Ingo Molnar
2008-11-17 20:32                             ` Ingo Molnar
2008-11-17 20:57                             ` Eric Dumazet
2008-11-17 20:57                               ` Eric Dumazet
2008-11-18  9:12                             ` Nick Piggin
2008-11-17 20:47                           ` Ingo Molnar
2008-11-17 20:47                             ` Ingo Molnar
2008-11-17 20:56                             ` Eric Dumazet
2008-11-17 20:56                               ` Eric Dumazet
2008-11-17 20:55                           ` skb_release_head_state(): " Ingo Molnar
2008-11-17 20:55                             ` Ingo Molnar
2008-11-17 21:01                             ` David Miller
2008-11-17 21:01                               ` David Miller
2008-11-17 21:04                             ` Eric Dumazet
2008-11-17 21:04                               ` Eric Dumazet
2008-11-17 21:34                             ` Linus Torvalds
2008-11-17 21:34                               ` Linus Torvalds
2008-11-17 21:38                               ` Ingo Molnar
2008-11-17 21:38                                 ` Ingo Molnar
2008-11-17 21:09                           ` tcp_ack(): " Ingo Molnar
2008-11-17 21:09                             ` Ingo Molnar
2008-11-17 21:19                           ` tcp_recvmsg(): " Ingo Molnar
2008-11-17 21:19                             ` Ingo Molnar
2008-11-17 21:26                           ` eth_type_trans(): " Ingo Molnar
2008-11-17 21:26                             ` Ingo Molnar
2008-11-17 21:40                             ` Eric Dumazet
2008-11-17 21:40                               ` Eric Dumazet
2008-11-17 23:41                               ` Eric Dumazet
2008-11-17 23:41                                 ` Eric Dumazet
2008-11-18  0:01                                 ` Linus Torvalds
2008-11-18  0:01                                   ` Linus Torvalds
2008-11-18  8:35                                   ` Eric Dumazet
2008-11-17 21:52                             ` Linus Torvalds
2008-11-17 21:52                               ` Linus Torvalds
2008-11-18  5:16                             ` David Miller
2008-11-18  5:16                               ` David Miller
2008-11-18  5:35                               ` Eric Dumazet
2008-11-18  7:00                                 ` David Miller
2008-11-18  7:00                                   ` David Miller
2008-11-18  8:30                               ` Ingo Molnar
2008-11-18  8:30                                 ` Ingo Molnar
2008-11-18  8:49                                 ` Eric Dumazet
2008-11-18  8:49                                   ` Eric Dumazet
2008-11-17 21:35                           ` __inet_lookup_established(): " Ingo Molnar
2008-11-17 21:35                             ` Ingo Molnar
2008-11-17 22:14                             ` Eric Dumazet
2008-11-17 22:14                               ` Eric Dumazet
2008-11-17 21:59                           ` system_call() - " Ingo Molnar
2008-11-17 21:59                             ` Ingo Molnar
2008-11-17 22:09                             ` Linus Torvalds
2008-11-17 22:09                               ` Linus Torvalds
2008-11-17 22:08                           ` Ingo Molnar
2008-11-17 22:15                             ` Eric Dumazet
2008-11-17 22:15                               ` Eric Dumazet
2008-11-17 22:26                               ` Ingo Molnar
2008-11-17 22:26                                 ` Ingo Molnar
2008-11-17 22:39                                 ` Eric Dumazet
2008-11-17 22:39                                   ` Eric Dumazet
2008-11-18  5:23                               ` David Miller
2008-11-18  5:23                                 ` David Miller
2008-11-18  8:45                                 ` Ingo Molnar
2008-11-18  8:45                                   ` Ingo Molnar
2008-11-17 22:14                           ` tcp_transmit_skb() - " Ingo Molnar
2008-11-17 22:14                             ` Ingo Molnar
2008-11-17 22:19                           ` Ingo Molnar
2008-11-17 22:19                             ` Ingo Molnar
2008-11-17 19:36                 ` David Miller
2008-11-17 19:36                   ` David Miller
2008-11-17 19:31             ` David Miller
2008-11-17 19:31               ` David Miller
2008-11-17 19:47               ` Linus Torvalds
2008-11-17 19:47                 ` Linus Torvalds
2008-11-17 19:51                 ` David Miller
2008-11-17 19:51                   ` David Miller
2008-11-17 19:53                 ` Ingo Molnar
2008-11-17 19:53                   ` Ingo Molnar
2008-11-17 22:47               ` Ingo Molnar
2008-11-17 22:47                 ` Ingo Molnar
2008-11-17 19:21         ` David Miller
2008-11-17 19:21           ` David Miller
2008-11-17 19:48           ` Linus Torvalds
2008-11-17 19:48             ` Linus Torvalds
2008-11-17 19:52             ` David Miller
2008-11-17 19:52               ` David Miller
2008-11-17 19:57               ` Linus Torvalds
2008-11-17 19:57                 ` Linus Torvalds
2008-11-17 20:18                 ` David Miller
2008-11-17 20:18                   ` David Miller
2008-11-19 19:43     ` Christoph Lameter
2008-11-19 19:43       ` Christoph Lameter
2008-11-19 20:14       ` Ingo Molnar
2008-11-19 20:14         ` Ingo Molnar
2008-11-20 23:52       ` Christoph Lameter
2008-11-20 23:52         ` Christoph Lameter
2008-11-21  8:30         ` Ingo Molnar
2008-11-21  8:30           ` Ingo Molnar
2008-11-21  8:51           ` Eric Dumazet
2008-11-21  8:51             ` Eric Dumazet
2008-11-21  9:05             ` David Miller
2008-11-21  9:05               ` David Miller
2008-11-21 12:51               ` Eric Dumazet
2008-11-21 12:51                 ` Eric Dumazet
2008-11-21 15:13                 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Eric Dumazet
2008-11-21 15:13                   ` Eric Dumazet
2008-11-21 15:21                   ` Ingo Molnar
2008-11-21 15:21                     ` Ingo Molnar
2008-11-21 15:28                     ` Eric Dumazet
2008-11-21 15:28                       ` Eric Dumazet
2008-11-21 15:34                       ` Ingo Molnar
2008-11-21 15:34                         ` Ingo Molnar
2008-11-26 23:27                         ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
2008-11-27  1:37                           ` Christoph Lameter
2008-11-27  1:37                             ` Christoph Lameter
2008-11-27  6:27                             ` Eric Dumazet
2008-11-27  6:27                               ` Eric Dumazet
2008-11-27 14:44                               ` Christoph Lameter
2008-11-27 14:44                                 ` Christoph Lameter
2008-11-27  9:39                           ` Christoph Hellwig
2008-11-28 18:03                           ` Ingo Molnar
2008-11-28 18:47                             ` Peter Zijlstra
2008-11-28 18:47                               ` Peter Zijlstra
2008-11-29  6:38                               ` Christoph Hellwig
2008-11-29  6:38                                 ` Christoph Hellwig
2008-11-29  8:07                                 ` Eric Dumazet
2008-11-29  8:07                                   ` Eric Dumazet
2008-11-29  8:43                           ` [PATCH v2 0/5] " Eric Dumazet
2008-11-29  8:43                             ` Eric Dumazet
2008-12-11 22:38                             ` [PATCH v3 0/7] " Eric Dumazet
2008-12-11 22:38                               ` Eric Dumazet
2008-12-11 22:38                             ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
2008-12-11 22:38                               ` Eric Dumazet
2007-07-24  1:24                               ` Nick Piggin
2007-07-24  1:24                                 ` Nick Piggin
2008-12-16 21:04                               ` Paul E. McKenney
2008-12-16 21:04                                 ` Paul E. McKenney
2008-12-11 22:39                             ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
2008-12-11 22:39                               ` Eric Dumazet
2007-07-24  1:30                               ` Nick Piggin
2007-07-24  1:30                                 ` Nick Piggin
2008-12-12  5:11                                 ` Eric Dumazet
2008-12-12  5:11                                   ` Eric Dumazet
2008-12-16 21:10                               ` Paul E. McKenney
2008-12-16 21:10                                 ` Paul E. McKenney
2008-12-11 22:39                             ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
2008-12-11 22:39                               ` Eric Dumazet
2007-07-24  1:34                               ` Nick Piggin
2007-07-24  1:34                                 ` Nick Piggin
2008-12-16 21:26                               ` Paul E. McKenney
2008-12-16 21:26                                 ` Paul E. McKenney
2008-12-11 22:39                             ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
2008-12-11 22:39                               ` Eric Dumazet
2008-12-16 21:40                               ` Paul E. McKenney
2008-12-16 21:40                                 ` Paul E. McKenney
2008-12-11 22:40                             ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet
2008-12-11 22:40                               ` Eric Dumazet
2008-12-16 21:41                               ` Paul E. McKenney
2008-12-16 21:41                                 ` Paul E. McKenney
2008-12-11 22:40                             ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet
2008-12-11 22:40                               ` Eric Dumazet
2007-07-24  1:13                               ` Nick Piggin
2007-07-24  1:13                                 ` Nick Piggin
2007-07-24  1:13                                 ` Nick Piggin
2008-12-12  2:50                                 ` Nick Piggin
2008-12-12  2:50                                   ` Nick Piggin
2008-12-12  4:45                                 ` Eric Dumazet
2008-12-12  4:45                                   ` Eric Dumazet
2008-12-12 16:48                                   ` Eric Dumazet
2008-12-12 16:48                                     ` Eric Dumazet
2008-12-13  2:07                                     ` Christoph Lameter
2008-12-13  2:07                                       ` Christoph Lameter
2008-12-17 20:25                                       ` Eric Dumazet
2008-12-17 20:25                                         ` Eric Dumazet
2008-12-13  1:41                                   ` Christoph Lameter
2008-12-13  1:41                                     ` Christoph Lameter
2008-12-11 22:41                             ` [PATCH v3 7/7] fs: MS_NOREFCOUNT Eric Dumazet
2008-12-11 22:41                               ` Eric Dumazet
2008-11-29  8:43                           ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
2008-11-29  8:43                             ` Eric Dumazet
2008-11-29  8:43                           ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
2008-11-29  8:43                             ` Eric Dumazet
2008-11-29  8:44                           ` [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
2008-11-29  8:44                             ` Eric Dumazet
2008-11-29  8:44                           ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
2008-11-29  8:44                             ` Eric Dumazet
2008-11-29 10:38                             ` Jörn Engel
2008-11-29 10:38                               ` Jörn Engel
2008-11-29 10:38                               ` Jörn Engel
2008-11-29 11:14                               ` Eric Dumazet
2008-11-29 11:14                                 ` Eric Dumazet
2008-11-29  8:45                           ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet
2008-11-29  8:45                             ` Eric Dumazet
2008-11-29 11:14                             ` Jörn Engel
2008-11-29 11:14                               ` Jörn Engel
2008-11-29 11:14                               ` Jörn Engel
2008-11-26 23:30                         ` [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Eric Dumazet
2008-11-26 23:30                           ` Eric Dumazet
2008-11-27  9:41                           ` Christoph Hellwig
2008-11-27  9:41                             ` Christoph Hellwig
2008-11-26 23:32                         ` [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
2008-11-27  9:46                           ` Christoph Hellwig
2008-11-27  9:46                             ` Christoph Hellwig
2008-11-26 23:32                         ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet
2008-11-26 23:32                           ` Eric Dumazet
2008-11-27  9:32                           ` Peter Zijlstra
2008-11-27  9:39                             ` Peter Zijlstra
2008-11-27  9:39                               ` Peter Zijlstra
2008-11-27  9:48                               ` Christoph Hellwig
2008-11-27 10:01                             ` Eric Dumazet
2008-11-27 10:01                               ` Eric Dumazet
2008-11-27 10:07                             ` Andi Kleen
2008-11-27 14:46                             ` Christoph Lameter
2008-11-26 23:32                         ` [PATCH 5/6] fs: Introduce special inodes Eric Dumazet
2008-11-26 23:32                           ` Eric Dumazet
2008-11-27  8:20                           ` David Miller
2008-11-27  8:20                             ` David Miller
2008-11-26 23:32                         ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet
2008-11-27  8:21                           ` David Miller
2008-11-27  8:21                             ` David Miller
2008-11-27  9:53                           ` Christoph Hellwig
2008-11-27 10:04                             ` Eric Dumazet
2008-11-27 10:04                               ` Eric Dumazet
2008-11-27 10:10                               ` Christoph Hellwig
2008-11-27 10:10                                 ` Christoph Hellwig
2008-11-28  9:26                           ` Al Viro
2008-11-28  9:26                             ` Al Viro
2008-11-28  9:34                             ` Al Viro
2008-11-28  9:34                               ` Al Viro
2008-11-28 18:02                             ` Ingo Molnar
2008-11-28 18:02                               ` Ingo Molnar
2008-11-28 18:58                               ` Ingo Molnar
2008-11-28 22:20                               ` Eric Dumazet
2008-11-28 22:20                                 ` Eric Dumazet
2008-11-28 22:37                             ` Eric Dumazet
2008-11-28 22:43                               ` Eric Dumazet
2008-11-21 15:36                   ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig
2008-11-21 17:58                     ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet
2008-11-21 18:43                       ` Matthew Wilcox
2008-11-21 18:43                         ` Matthew Wilcox
2008-11-23  3:53                         ` Eric Dumazet
2008-11-21  9:18             ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Ingo Molnar
2008-11-21  9:18               ` Ingo Molnar
2008-11-21  9:03           ` David Miller
2008-11-21  9:03             ` David Miller
2008-11-21 16:11           ` Christoph Lameter
2008-11-21 16:11             ` Christoph Lameter
2008-11-21 18:06             ` Christoph Lameter
2008-11-21 18:06               ` Christoph Lameter
2008-11-21 18:16               ` Eric Dumazet
2008-11-21 18:16                 ` Eric Dumazet
2008-11-21 18:19                 ` Eric Dumazet
2008-11-21 18:19                   ` Eric Dumazet
2008-11-16 17:40 ` [Bug #11664] acpi errors and random freeze on sony vaio sr Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-17 16:19   ` Randy Dunlap
2008-11-16 17:40 ` [Bug #11569] Panic stop CPUs regression Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11543] kernel panic: softlockup in tick_periodic() ??? Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11836] Scheduler on C2D CPU and latest 2.6.27 kernel Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-17 14:44   ` Christoph Hellwig
2008-11-17 14:44     ` Christoph Hellwig
2008-11-16 17:40 ` [Bug #11795] ks959-sir dongle no longer works under 2.6.27 (REGRESSION) Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11865] WOL for E100 Doesn't Work Anymore Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11843] usb hdd problems with 2.6.27.2 Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 21:37   ` Luciano Rocha
2008-11-16 17:40 ` [Bug #11876] RCU hang on cpu re-hotplug with 2.6.27rc8 Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11886] without serial console system doesn't poweroff Rafael J. Wysocki
2008-11-16 17:40   ` Rafael J. Wysocki
2008-11-16 17:41 ` [Bug #12039] Regression: USB/DVB 2.6.26.8 --> 2.6.27.6 Rafael J. Wysocki
2008-11-16 17:41   ` Rafael J. Wysocki
2008-11-16 17:41 ` [Bug #11983] iwlagn: wrong command queue 31, command id 0x0 Rafael J. Wysocki
2008-11-16 17:41   ` Rafael J. Wysocki
2008-11-16 17:41 ` [Bug #12048] Regression in bonding between 2.6.26.8 and 2.6.27.6 Rafael J. Wysocki
2008-11-16 17:41   ` Rafael J. Wysocki
