* [RFC] test_bit before clear files_struct bits
@ 2015-02-10  7:06 Wang, Yalin
  2015-02-10  7:11 ` [RFC V2] " Wang, Yalin
  0 siblings, 1 reply; 8+ messages in thread
From: Wang, Yalin @ 2015-02-10  7:06 UTC (permalink / raw)
  To: 'viro@zeniv.linux.org.uk',
	'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	Gao, Neil

Add a test_bit() check before clearing close_on_exec and open_fds.
Tracing __clear_bit() shows the bit is already clear at these two
sites most of the time, so testing first avoids dirtying the
cacheline and wins in the common case.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 fs/file.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index ee738ea..2fd296b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
 
 static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->close_on_exec);
+	if (test_bit(fd, fdt->close_on_exec))
+		__clear_bit(fd, fdt->close_on_exec);
 }
 
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
@@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		struct file *f = *old_fds++;
 		if (f) {
 			get_file(f);
-		} else {
+		} else if (test_bit(open_files - i, new_fdt->open_fds)) {
 			/*
 			 * The fd may be claimed in the fd bitmap but not yet
 			 * instantiated in the files array if a sibling thread
-- 
2.2.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC V2] test_bit before clear files_struct bits
  2015-02-10  7:06 [RFC] test_bit before clear files_struct bits Wang, Yalin
@ 2015-02-10  7:11 ` Wang, Yalin
  2015-02-10 20:22   ` Andrew Morton
  2015-02-15  8:27   ` [RFC V3] test bit " Wang, Yalin
  0 siblings, 2 replies; 8+ messages in thread
From: Wang, Yalin @ 2015-02-10  7:11 UTC (permalink / raw)
  To: 'viro@zeniv.linux.org.uk',
	'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	Gao, Neil

Add a test_bit() check before clearing close_on_exec and open_fds.
Tracing __clear_bit() shows the bit is already clear at these two
sites most of the time, so testing first lets us skip the clear_bit()
and win in the common case.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 fs/file.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index ee738ea..b0e059c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
 
 static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->close_on_exec);
+	if (test_bit(fd, fdt->close_on_exec))
+		__clear_bit(fd, fdt->close_on_exec);
 }
 
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
@@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		struct file *f = *old_fds++;
 		if (f) {
 			get_file(f);
-		} else {
+		} else if (test_bit(open_files - i, new_fdt->open_fds)) {
 			/*
 			 * The fd may be claimed in the fd bitmap but not yet
 			 * instantiated in the files array if a sibling thread
-- 
2.2.2


* Re: [RFC V2] test_bit before clear files_struct bits
  2015-02-10  7:11 ` [RFC V2] " Wang, Yalin
@ 2015-02-10 20:22   ` Andrew Morton
  2015-02-10 20:49     ` Linus Torvalds
  2015-02-15  8:27   ` [RFC V3] test bit " Wang, Yalin
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2015-02-10 20:22 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'viro@zeniv.linux.org.uk',
	'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	Gao, Neil, Linus Torvalds


(cc Linus for CPU-fu)

On Tue, 10 Feb 2015 15:11:37 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> Add a test_bit() check before clearing close_on_exec and open_fds.
> Tracing __clear_bit() shows the bit is already clear at these two
> sites most of the time, so testing first lets us skip the clear_bit()
> and win in the common case.
> 
> ...
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
>  
>  static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>  {
> -	__clear_bit(fd, fdt->close_on_exec);
> +	if (test_bit(fd, fdt->close_on_exec))
> +		__clear_bit(fd, fdt->close_on_exec);
>  }
>  
>  static inline void __set_open_fd(int fd, struct fdtable *fdt)
> @@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
>  		struct file *f = *old_fds++;
>  		if (f) {
>  			get_file(f);
> -		} else {
> +		} else if (test_bit(open_files - i, new_fdt->open_fds)) {
>  			/*
>  			 * The fd may be claimed in the fd bitmap but not yet
>  			 * instantiated in the files array if a sibling thread

The patch is good but I'm still wondering if any CPUs can do this
speedup for us.  The CPU has to pull in the target word to modify the
bit and what it *could* do is to avoid dirtying the cacheline if it
sees that the bit is already in the desired state.

However, some elapsed-time testing I did on a couple of Intel
machines indicates that these CPUs don't perform that optimisation.
Perhaps there's some reason why they don't, dunno.


Still, I think we should encapsulate the above (common) pattern into
helper functions in include/linux/bitops.h because

- it's cleaner

- it's self-documenting

- it permits us to eliminate the if(test_bit) on any CPU which does
  perform the optimisation internally, if such exists.


You actually have measurement results for these (and other)
set-bit-on-already-set-bit call sites.  Please include all of that info
in the changelog.
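
Outside the kernel tree, the pattern Andrew is asking to encapsulate can be sketched in plain C (a userspace approximation; the helper name here is illustrative, and the V3 patch later in the thread adds essentially this to include/asm-generic/bitops/non-atomic.h):

```c
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Read-only test: never needs the cacheline in an exclusive state. */
static inline int test_bit(unsigned int nr, const unsigned long *addr)
{
	return 1UL & (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG));
}

/*
 * Clear the bit only when it is set, so that clearing an
 * already-clear bit costs a read instead of a read-modify-write
 * that dirties the cacheline.
 */
static inline void clear_bit_if_set(unsigned int nr, unsigned long *addr)
{
	if (test_bit(nr, addr))
		addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}
```

On a CPU that ever learns to suppress the redundant write itself, such a helper could compile back down to a plain clear, which is the point of Andrew's third bullet.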



* Re: [RFC V2] test_bit before clear files_struct bits
  2015-02-10 20:22   ` Andrew Morton
@ 2015-02-10 20:49     ` Linus Torvalds
  2015-02-10 22:46       ` Kirill A. Shutemov
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2015-02-10 20:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wang, Yalin, viro, linux-fsdevel, linux-kernel, Gao, Neil

On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> The patch is good but I'm still wondering if any CPUs can do this
> speedup for us.  The CPU has to pull in the target word to modify the
> bit and what it *could* do is to avoid dirtying the cacheline if it
> sees that the bit is already in the desired state.

Sadly, no CPU I know of actually does this.  Probably because it would
take a bit more core resources, and conditional writes to memory are
not normally part of an x86 core (it might be more natural for
something like old-style ARM that has conditional writes).

Also, even if the store were to be conditional, the cacheline would
have been acquired in exclusive state, and in many cache protocols the
state machine is from exclusive to dirty (since normally the only
reason to get a cacheline for exclusive use is in order to write to
it). So a "read, test, conditional write" ends up actually being more
complicated than you'd think - because you *want* that
exclusive->dirty state for the case where you really are going to
change the bit, and to avoid extra cache protocol stages you don't
generally want to read the cacheline into a shared read mode first
(only to then have to turn it into exclusive/dirty as a second stage).

So at least on current x86 (and for reasons above, likely in the
future, including other architectures with read-modify-write memory
access models), the default assumption is that the bit operations will
actually change the bit, and unlikely bit setting/clearing for
cachelines that are very likely to otherwise stay clean should
probably be conditionalized in software. Like in this patch.

                         Linus


* Re: [RFC V2] test_bit before clear files_struct bits
  2015-02-10 20:49     ` Linus Torvalds
@ 2015-02-10 22:46       ` Kirill A. Shutemov
  2015-02-10 23:29         ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Kirill A. Shutemov @ 2015-02-10 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Wang, Yalin, viro, linux-fsdevel, linux-kernel, Gao, Neil

On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > The patch is good but I'm still wondering if any CPUs can do this
> > speedup for us.  The CPU has to pull in the target word to modify the
> > bit and what it *could* do is to avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
> 
> Sadly, no CPU I know of actually does this.  Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM that has conditional writes).
> 
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine is from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty state for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second state)

That all sounds reasonable.

But I still fail to understand why my micro-benchmark is faster with a
branch before the store compared to a plain store.

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case we would not have an intermediate shared state, because we
don't have anybody else to share the cache line with. So with the branch
we would have the same E->M transition and write-back to memory as
without it. But that doesn't explain why the branch makes the code faster.

Any ideas?

-- 
 Kirill A. Shutemov


* Re: [RFC V2] test_bit before clear files_struct bits
  2015-02-10 22:46       ` Kirill A. Shutemov
@ 2015-02-10 23:29         ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2015-02-10 23:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Wang, Yalin, viro, linux-fsdevel, linux-kernel, Gao, Neil

On Tue, Feb 10, 2015 at 2:46 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> But I still fail to understand why my micro-benchmark is faster with a
> branch before the store compared to a plain store.

Very tight artificial loops like that tend to be horrible for
performance analysis on modern cores, because you end up seeing mostly
random microarchitectural details rather than any real performance.

At a guess, since you write just one word per cacheline, what happens
is that the store buffer continually fills up faster than the stores
get drained to cache. So then the stores start stalling. The extra load
- that you expect to slow things down - likely ends up effectively just
prefetching the hot L2 cacheline into L1 so that the store buffer then
drains more cleanly.

And both the load and the branch are effectively free, because the
branch predicts perfectly, and the load just prefetches a cacheline
that will have to be fetched for the subsequent store buffer drain
anyway.

And as you say, there is no cacheline bouncing issues, and the working
set presumably fits in the caches - even if it doesn't fit in the L1.

But that's all just a wild guess.

It could equally easily be some very specific microarchitectural store
buffer stall due to the simpler loop hitting just the right cycle
count between stores.  There are all kinds of random odd small
corner-cases that are generally very very rare and hidden in the
noise, but then a loop with just the right strides can happen to just
hit them. It used to be *trivial* to hit things like address
generation stalls, and even though modern Intel CPU's tend to be quite
robust performance-wise, it's not true that they always handle any
code sequence "perfectly".

                            Linus


* [RFC V3] test bit before clear files_struct bits
  2015-02-10  7:11 ` [RFC V2] " Wang, Yalin
  2015-02-10 20:22   ` Andrew Morton
@ 2015-02-15  8:27   ` Wang, Yalin
  2015-02-18 21:27     ` Andrew Morton
  1 sibling, 1 reply; 8+ messages in thread
From: Wang, Yalin @ 2015-02-15  8:27 UTC (permalink / raw)
  To: 'viro@zeniv.linux.org.uk',
	'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	Gao, Neil, 'Andrew Morton'

Test the bit before clearing close_on_exec and open_fds.
Tracing __clear_bit() shows the bit is already clear at these two
sites most of the time, so testing first lets us skip the clear_bit()
and win in the common case.
Add non-atomic *_if_need bitop variants.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 fs/file.c                               |  9 +++++++--
 include/asm-generic/bitops/non-atomic.h | 35 +++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index ee738ea..2e08e6f 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -209,7 +209,7 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
 
 static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->close_on_exec);
+	__clear_bit_if_need(fd, fdt->close_on_exec);
 }
 
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
@@ -222,6 +222,11 @@ static inline void __clear_open_fd(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->open_fds);
 }
 
+static inline void __clear_open_fd_if_need(int fd, struct fdtable *fdt)
+{
+	__clear_bit_if_need(fd, fdt->open_fds);
+}
+
 static int count_open_files(struct fdtable *fdt)
 {
 	int size = fdt->max_fds;
@@ -316,7 +321,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 			 * is partway through open().  So make sure that this
 			 * fd is available to the new process.
 			 */
-			__clear_open_fd(open_files - i, new_fdt);
+			__clear_open_fd_if_need(open_files - i, new_fdt);
 		}
 		rcu_assign_pointer(*new_fds++, f);
 	}
diff --git a/include/asm-generic/bitops/non-atomic.h b/include/asm-generic/bitops/non-atomic.h
index 697cc2b..0713e3b 100644
--- a/include/asm-generic/bitops/non-atomic.h
+++ b/include/asm-generic/bitops/non-atomic.h
@@ -105,4 +105,39 @@ static inline int test_bit(int nr, const volatile unsigned long *addr)
 	return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
 }
 
+/*
+ * The __*_if_need variants are for callers that do not want to write a bit
+ * that is already in the desired state, to avoid dirtying the cacheline.
+ */
+static inline void __set_bit_if_need(int nr, volatile unsigned long *addr)
+{
+	if (!test_bit(nr, addr))
+		__set_bit(nr, addr);
+}
+
+static inline void __clear_bit_if_need(int nr, volatile unsigned long *addr)
+{
+	if (test_bit(nr, addr))
+		__clear_bit(nr, addr);
+}
+
+static inline int __test_and_set_bit_if_need(int nr, volatile unsigned long *addr)
+{
+	if (!test_bit(nr, addr)) {
+		__set_bit(nr, addr);
+		return false;
+	} else {
+		return true;
+	}
+}
+
+static inline int __test_and_clear_bit_if_need(int nr, volatile unsigned long *addr)
+{
+	if (test_bit(nr, addr)) {
+		__clear_bit(nr, addr);
+		return true;
+	} else {
+		return false;
+	}
+}
 #endif /* _ASM_GENERIC_BITOPS_NON_ATOMIC_H_ */
-- 
2.2.2


* Re: [RFC V3] test bit before clear files_struct bits
  2015-02-15  8:27   ` [RFC V3] test bit " Wang, Yalin
@ 2015-02-18 21:27     ` Andrew Morton
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2015-02-18 21:27 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'viro@zeniv.linux.org.uk',
	'linux-fsdevel@vger.kernel.org',
	'linux-kernel@vger.kernel.org',
	Gao, Neil

On Sun, 15 Feb 2015 16:27:27 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> +/*
> + * The __*_if_need variants are for callers that do not want to write a bit
> + * that is already in the desired state, to avoid dirtying the cacheline.
> + */
> +static inline void __set_bit_if_need(int nr, volatile unsigned long *addr)

"if_need" is grammatically incorrect - it should be "if_needed".

And __set_bit_if_needed is too long.  But I can't think of anything
shorter, unless we do something lame like __set_bit2(). 
__set_bit_if_clear() isn't too bad?

Also, your changelog still doesn't include the results of the
quantitative testing which you performed to demonstrate that these code
sites will benefit from this conversion.  Please copy all that info
into the changelog!


end of thread, other threads:[~2015-02-18 21:27 UTC | newest]

Thread overview: 8+ messages
-- links below jump to the message on this page --
2015-02-10  7:06 [RFC] test_bit before clear files_struct bits Wang, Yalin
2015-02-10  7:11 ` [RFC V2] " Wang, Yalin
2015-02-10 20:22   ` Andrew Morton
2015-02-10 20:49     ` Linus Torvalds
2015-02-10 22:46       ` Kirill A. Shutemov
2015-02-10 23:29         ` Linus Torvalds
2015-02-15  8:27   ` [RFC V3] test bit " Wang, Yalin
2015-02-18 21:27     ` Andrew Morton
