* Name hashing function causing a perf regression
@ 2014-09-09 19:30 Josef Bacik
  2014-09-12 19:11 ` Andi Kleen
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2014-09-09 19:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, viro, hch, clm

Hello,

I'm debugging a performance regression between 3.2 and 3.10 and I 
narrowed it down to this patch

commit bfcfaa77bdf0f775263e906015982a608df01c76
vfs: use 'unsigned long' accesses for dcache name comparison and hashing

The test case is essentially

for (i = 0; i < 1000000; i++) {
	snprintf(name, sizeof(name), "a%d", i);
	mkdir(name, 0755);
}

On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k 
dir/sec with 3.10.  This is because we spend waaaaay more time in 
__d_lookup on 3.10 than in 3.2.  The new hashing function for strings is 
suboptimal for < sizeof(unsigned long) string names (and hell even > 
sizeof(unsigned long) string names that I've tested).  I broke out the 
old hashing function and the new one into a userspace helper to get real 
numbers and this is what I'm getting

Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 12628 entries, 987372 dupes, 900 max dupes
We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 
567 dupes for the new hash

My test does the hash, and then does the d_hash into an integer pointer
array the same size as the dentry hash table on my system, and then just
increments the value at the address we got to see how many entries we
overlap with.  As you can see, the old hash function ended up with all 1
million entries in their own buckets, whereas with the new one they are
distributed among only ~12.5k buckets, which is why we're using so much
more CPU in __d_lookup.
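
A minimal sketch of this kind of collision counter (illustrative only, not
the actual helper; the table size and the FNV-1a example hash are
assumptions for the sake of a self-contained program):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TABLE_BITS 20	/* illustrative; the real test mirrored the system's dentry table size */
#define TABLE_SIZE (1u << TABLE_BITS)

/* Example hash to feed the counter: FNV-1a, purely illustrative. */
static uint32_t fnv1a(const char *s, size_t len)
{
	uint32_t h = 2166136261u;
	while (len--)
		h = (h ^ (unsigned char)*s++) * 16777619u;
	return h;
}

/* Hash each generated name, bucket it, and count how many names land in
 * an already-occupied bucket. */
static unsigned int count_dupes(uint32_t (*hashfn)(const char *, size_t),
				unsigned int names)
{
	static unsigned int table[TABLE_SIZE];
	unsigned int i, dupes = 0;
	char name[32];

	memset(table, 0, sizeof(table));
	for (i = 0; i < names; i++) {
		int len = snprintf(name, sizeof(name), "a%u", i);
		if (table[hashfn(name, (size_t)len) & (TABLE_SIZE - 1)]++)
			dupes++;
	}
	return dupes;
}
```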

The new hash function basically just uses the unsigned long value of the
string as the hash if the string is shorter than sizeof(unsigned long).
If the string is longer than that, we still only hash the parts that
fill a full word, and then simply add the remaining bytes to the hash.
This seems to make strings that are close together create hashes that
are close together as well, which when coupled with d_hash ends up
putting them into existing buckets.
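
As a rough model of the scheme just described (simplified, little-endian
assumed, and not the exact kernel implementation):

```c
#include <stdint.h>
#include <string.h>

/* Fold 64 bits down to 32 the way the pre-fix fs/namei.c did: a single
 * shift-and-add, then truncate. */
static uint32_t fold_hash_old(uint64_t hash)
{
	hash += hash >> 32;
	return (uint32_t)hash;
}

/* Simplified model of the word-at-a-time name hash: whole words are
 * mixed with "hash = hash*9 + word", but the trailing partial word is
 * merely added in at the end. */
static uint32_t name_hash_old(const char *name, size_t len)
{
	uint64_t hash = 0, word;

	while (len >= sizeof(word)) {
		memcpy(&word, name, sizeof(word));
		hash = hash * 9 + word;
		name += sizeof(word);
		len -= sizeof(word);
	}
	if (len) {
		word = 0;
		memcpy(&word, name, len);	/* zero-padded remainder */
		hash += word;
	}
	return fold_hash_old(hash);
}
```

With this model, "a1" and "a2" hash to values exactly 0x100 apart on a
little-endian machine: nearby names give nearby hashes, which is the
clustering described above.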

So the question is what do we do here?  I tested other random strings 
and every one of them ended up worse as far as collisions go with the 
new function vs the old one.  I assume we want to keep the word at a 
time functionality, so should we switch to a different hashing scheme, 
like murmur3/fnv/xxhash/crc32c/whatever?  Or should we just go back to 
what we had and maybe just use the word at a time stuff for comparisons? 
  Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Name hashing function causing a perf regression
  2014-09-09 19:30 Name hashing function causing a perf regression Josef Bacik
@ 2014-09-12 19:11 ` Andi Kleen
  2014-09-12 19:21   ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2014-09-12 19:11 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Linus Torvalds, linux-fsdevel, viro, hch, clm

Josef Bacik <jbacik@fb.com> writes:
>
> So the question is what do we do here?  I tested other random strings
> and every one of them ended up worse as far as collisions go with the
> new function vs the old one.  I assume we want to keep the word at a
> time functionality, so should we switch to a different hashing scheme,
> like murmur3/fnv/xxhash/crc32c/whatever?  Or should we just go back to

Would be interesting to try murmur3.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Name hashing function causing a perf regression
  2014-09-12 19:11 ` Andi Kleen
@ 2014-09-12 19:21   ` Linus Torvalds
  2014-09-12 19:52     ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-09-12 19:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Josef Bacik, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On Fri, Sep 12, 2014 at 12:11 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Josef Bacik <jbacik@fb.com> writes:
>>
>> So the question is what do we do here?  I tested other random strings
>> and every one of them ended up worse as far as collisions go with the
>> new function vs the old one.  I assume we want to keep the word at a
>> time functionality, so should we switch to a different hashing scheme,
>> like murmur3/fnv/xxhash/crc32c/whatever?  Or should we just go back to
>
> Would be interesting to try murmur3.

I seriously doubt it's the word-at-a-time part, since Josef reports
that it's "suboptimal for < sizeof(unsigned long) string names", and
for those, there is no data loss at all.

The main difference is that the new hash doesn't try to finish the
hash particularly well. Nobody complained up until now.

The old hash kept mixing up the bits for each byte it encounters,
while the new hash really only does that mixing at the end. And its
mixing is particularly stupid and weak: see fold_hash() (and then
d_hash() does something very similar).

So the _first_ thing to test would be to try making "fold_hash()"
smarter. Perhaps using "hash_long(hash, 32)" instead?
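
For context, hash_long()/hash_64() amount to a multiplicative hash:
multiply by a large odd constant and keep the top bits. A sketch (the
constant shown is the golden-ratio value later kernels standardized on;
the kernels of this era used a prime and open-coded shift-and-adds):

```c
#include <stdint.h>

/* Sketch of hash_64(hash, 32): multiply by a 64-bit golden-ratio-based
 * odd constant so every input bit influences the high output bits, then
 * keep the top 32 bits. */
static inline uint32_t fold_hash_mult(uint64_t hash)
{
	return (uint32_t)((hash * 0x61c8864680b583ebULL) >> 32);
}
```

Unlike a plain shift-and-add fold, inputs that differ only in a few low
bits now land in distant buckets.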

                Linus


* Re: Name hashing function causing a perf regression
  2014-09-12 19:21   ` Linus Torvalds
@ 2014-09-12 19:52     ` Josef Bacik
  2014-09-12 20:39       ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2014-09-12 19:52 UTC (permalink / raw)
  To: Linus Torvalds, Andi Kleen
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On 09/12/2014 03:21 PM, Linus Torvalds wrote:
> On Fri, Sep 12, 2014 at 12:11 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Josef Bacik <jbacik@fb.com> writes:
>>>
>>> So the question is what do we do here?  I tested other random strings
>>> and every one of them ended up worse as far as collisions go with the
>>> new function vs the old one.  I assume we want to keep the word at a
>>> time functionality, so should we switch to a different hashing scheme,
>>> like murmur3/fnv/xxhash/crc32c/whatever?  Or should we just go back to
>>
>> Would be interesting to try murmur3.
>
> I seriously doubt it's the word-at-a-time part, since Josef reports
> that it's "suboptimal for < sizeof(unsigned long) string names", and
> for those, there is no data loss at all.
>
> The main difference is that the new hash doesn't try to finish the
> hash particularly well. Nobody complained up until now.
>
> The old hash kept mixing up the bits for each byte it encounters,
> while the new hash really only does that mixing at the end. And its
> mixing is particularly stupid and weak: see fold_hash() (and then
> d_hash() does something very similar).
>
> So the _first_ thing to test would be to try making "fold_hash()"
> smarter. Perhaps using "hash_long(hash, 32)" instead?
>
>                  Linus
>

Ok with the hash_long(hash, 32) change I get this

[jbacik@devbig005 ~/local] ./hash
Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 331504 entries, 668496 dupes, 5 max dupes
We had 292735 buckets with a p50 of 3 dupes, p90 of 4 dupes, p99 of 5 
dupes for the new hash

So that looks much better, not perfect but hlist_for_each through 5 
entries isn't going to kill us, I'll build a kernel with this and get 
back shortly with real numbers.  Thanks,

Josef


* Re: Name hashing function causing a perf regression
  2014-09-12 19:52     ` Josef Bacik
@ 2014-09-12 20:39       ` Linus Torvalds
  2014-09-12 21:25         ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-09-12 20:39 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On Fri, Sep 12, 2014 at 12:52 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> So that looks much better, not perfect but hlist_for_each through 5 entries
> isn't going to kill us, I'll build a kernel with this and get back shortly
> with real numbers.

So one thing to look out for is that the fold_hash() is only effective
for 64-bit architectures, and I'm pretty sure that we end up having a
very similar issue on 32-bit architectures that have

 - two 32-bit words with the fairly simplistic mixing (basically
"first word * 9 + second word")

   This is probably ok'ish, in that you still end up having a 32-bit
number with *most* of the bits being relevant, but we should think
about this too.

 - the real problem is likely the "d_hash()" function that narrows the
32-bit hash down to d_hash_mask bits, and drops a lot of bits as it
does so (ie d_hash_mask is usually on the order of 12 bits, so it
really only uses 24 bits of the 32-bit hash value, and 12 bits is
particularly bad anyway, since it's a multiple of three, which is what
the '9' in the first simplistic mixing used too).

So it *might* be better to fix this all at d_hash() instead, and make
that use "hash_32(hash, d_hash_shift)" instead. And maybe keep the
fold_hash() stage fairly stupid.

Anyway, what I'm trying to say is that your numbers are certainly
*much* better than the hash bucket list used to be, but we should
probably tweak it a bit more. I'm not sure you really need to go to a
real hash function like murmur, but it would also be a good idea to at
least look at making some initial value be randomized so that people
also can't quite as easily perhaps try to hit worst-case situations
where they control the hash and can make really long buckets (the fact
that we mix in the parent pointer and many filesystems have some
directory size limit might make that harder anyway, but..)
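
The randomized-initial-value idea might look something like this sketch
(entirely hypothetical; the seed variable and its initialization are
assumptions, not anything in the kernel):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: a per-boot secret mixed into the initial hash state so
 * userspace can't precompute names that all land in one bucket. */
static uint64_t name_hash_seed;

static void name_hash_seed_init(void)
{
	/* stand-in for a real boot-time RNG */
	name_hash_seed = ((uint64_t)rand() << 32) | (unsigned int)rand();
}

/* Start each name's hash from the secret instead of from zero. */
static inline uint64_t init_name_hash(void)
{
	return name_hash_seed;
}
```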

           Linus


* Re: Name hashing function causing a perf regression
  2014-09-12 20:39       ` Linus Torvalds
@ 2014-09-12 21:25         ` Josef Bacik
  2014-09-12 22:01           ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2014-09-12 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On 09/12/2014 04:39 PM, Linus Torvalds wrote:
> On Fri, Sep 12, 2014 at 12:52 PM, Josef Bacik <jbacik@fb.com> wrote:
>>
>> So that looks much better, not perfect but hlist_for_each through 5 entries
>> isn't going to kill us, I'll build a kernel with this and get back shortly
>> with real numbers.
>
> So one thing to look out for is that the fold_hash() is only effective
> for 64-bit architectures, and I'm pretty sure that we end up having a
> very similar issue on 32-bit architectures that have
>
>   - two 32-bit words with the fairly simplistic mixing (basically
> "first word * 9 + second word")
>
>     This is probably ok'ish, in that you still end up having a 32-bit
> number with *most* of the bits being relevant, but we should think
> about this too.
>
>   - the real problem is likely the "d_hash()" function that narrows the
> 32-bit hash down to d_hash_mask bits, and drops a lot of bits as it
> does so (ie d_hash_mask is usually on the order of 12 bits, so it
> really only uses 24 bits of the 32-bit hash value, and 12 bits is
> particularly bad anyway, since it's a multiple of three, which is what
> the '9' in the first simplistic mixing used too).
>
> So it *might* be better to fix this all at d_hash() instead, and make
> that use "hash_32(hash, d_hash_shift)" instead. And maybe keep the
> fold_hash() stage fairly stupid.
>
> Anyway, what I'm trying to say is that your numbers are certainly
> *much* better than the hash bucket list used to be, but we should
> probably tweak it a bit more. I'm not sure you really need to go to a
> real hash function like murmur, but it would also be a good idea to at
> least look at making some initial value be randomized so that people
> also can't quite as easily perhaps try to hit worst-case situations
> where they control the hash and can make really long buckets (the fact
> that we mix in the parent pointer and many filesystems have some
> directory size limit might make that harder anyway, but..)
>

Ok I have a good direction to take this in, so far the best results have 
been to change fold_hash() to hash_64(hash, 32) and change d_hash to do this

static int *d_hash(int *table, unsigned long parent, unsigned int hash)
{
         hash += (unsigned long) parent / L1_CACHE_BYTES;
         hash += hash_32(hash, d_hash_shift);
         return table + (hash & d_hash_mask);
}

which has gotten us 890k buckets.  I'll futz with it some more on Monday 
and send out a patch with whatever gives me the best numbers.  Thanks 
for the help, hashing is not my forte,

Josef


* Re: Name hashing function causing a perf regression
  2014-09-12 21:25         ` Josef Bacik
@ 2014-09-12 22:01           ` Linus Torvalds
  2014-09-12 22:08             ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-09-12 22:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On Fri, Sep 12, 2014 at 2:25 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> Ok I have a good direction to take this in, so far the best results have
> been to change fold_hash() to hash_64(hash, 32) and change d_hash to do this
>
> static int *d_hash(int *table, unsigned long parent, unsigned int hash)
> {
>         hash += (unsigned long) parent / L1_CACHE_BYTES;
>         hash += hash_32(hash, d_hash_shift);
>         return table + (hash & d_hash_mask);
> }
>

It really should be sufficient to just do

  static int *d_hash(int *table, unsigned long parent, unsigned int hash)
  {
        hash += (unsigned long) parent / L1_CACHE_BYTES;
        return table + hash_32(hash, d_hash_shift);
  }

because the hash_32() function should already reduce the hash to its
second argument (d_hash_shift) and mix around the bits sufficiently.
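
For reference, hash_32() is essentially a 32-bit multiplicative hash; a
sketch (the constant shown is the one later kernels settled on, not
necessarily this era's):

```c
#include <stdint.h>

/* Sketch of hash_32(val, bits): multiply by a 32-bit golden-ratio
 * constant and keep the top 'bits' bits, so the result is already a
 * well-mixed table index in [0, 2^bits). */
static inline uint32_t hash_32_sketch(uint32_t val, unsigned int bits)
{
	return (val * 0x61C88647u) >> (32 - bits);
}
```

So `table + hash_32(hash, d_hash_shift)` needs no separate masking step.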

I'm a *bit* nervous about the cost of this all, especially on CPUs
where integer multiplies are expensive, but obviously we need to
improve on the final hashing.

Just out of interest, how good/bad does the hash look if the *only*
change you do is the above d_hash() thing (ie just leave the
fold_hash() thing alone?)

            Linus


* Re: Name hashing function causing a perf regression
  2014-09-12 22:01           ` Linus Torvalds
@ 2014-09-12 22:08             ` Josef Bacik
  2014-09-12 22:25               ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2014-09-12 22:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On 09/12/2014 06:01 PM, Linus Torvalds wrote:
> On Fri, Sep 12, 2014 at 2:25 PM, Josef Bacik <jbacik@fb.com> wrote:
>>
>> Ok I have a good direction to take this in, so far the best results have
>> been to change fold_hash() to hash_64(hash, 32) and change d_hash to do this
>>
>> static int *d_hash(int *table, unsigned long parent, unsigned int hash)
>> {
>>          hash += (unsigned long) parent / L1_CACHE_BYTES;
>>          hash += hash_32(hash, d_hash_shift);
>>          return table + (hash & d_hash_mask);
>> }
>>
>
> It really should be sufficient to just do
>
>    static int *d_hash(int *table, unsigned long parent, unsigned int hash)
>    {
>          hash += (unsigned long) parent / L1_CACHE_BYTES;
>          return table + hash_32(hash, d_hash_shift);
>    }
>
> because the hash_32() function should already reduce the hash to its
> second argument (d_hash_shift) and mix around the bits sufficiently.
>
> I'm a *bit* nervous about the cost of this all, especially on CPUs
> where integer multiplies are expensive, but obviously we need to
> improve on the final hashing.
>
> Just out of interest, how good/bad does the hash look if the *only*
> change you do is the above d_hash() thing (ie just leave the
> fold_hash() thing alone?)
>
>


[jbacik@devbig005 ~/local] ./hash
Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 62200 entries, 937800 dupes, 90 max dupes
Fnv hash table had 988406 entries, 11594 dupes, 3 max dupes
We had 49800 buckets with a p50 of 8 dupes, p90 of 45 dupes, p99 of 80 
dupes for the new hash

"New hash" is the number you want.  Not as bad as originally, not as 
good as replacing fold_hash with hash_64 and not changing d_hash.  Thanks,

Josef


* Re: Name hashing function causing a perf regression
  2014-09-12 22:08             ` Josef Bacik
@ 2014-09-12 22:25               ` Linus Torvalds
  2014-09-13 18:58                 ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-09-12 22:25 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

On Fri, Sep 12, 2014 at 3:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> "New hash" is the number you want.  Not as bad as originally, not as good as
> replacing fold_hash with hash_64 and not changing d_hash.  Thanks,

Ok, so that hash folding really does hurt. And now that I look at it,
I understand quite well. It's extra horrible, because it does the
addition of bits on a byte boundary. So it very easily causes bits to
basically cancel out when the first four bytes have a high commonality
with the next four bytes (like your numeric names do - lots of ascii
decimal numbers). So yeah, we need to fix fold_hash(), no question
about it.
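
A tiny self-contained model of that fold (mirroring the old fs/namei.c
code, not the kernel source itself) shows how lossy it is:

```c
#include <stdint.h>

/* The old fold: add the top half onto the bottom half, then truncate.
 * Any two 64-bit hashes whose halves sum to the same 32-bit value
 * collide outright. */
static uint32_t fold_hash_shift_add(uint64_t hash)
{
	hash += hash >> 32;	/* 8 * sizeof(int) == 32 */
	return (uint32_t)hash;
}
```

For example, 0x100000000 and 0x1 fold to the same value, so the
collision happens before d_hash() ever runs.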

I get the feeling that "hash_long()" is a bit excessive and we end up
mixing bits twice, but I guess it's the simple approach.

So with that, the kernel patch would look something like the attached.
Totally untested. Does it fix the regression you saw in addition to
looking more reasonable in your test-program?

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1127 bytes --]

 fs/dcache.c | 3 +--
 fs/namei.c  | 4 ++--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d30ce699ae4b..4023e77b800e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -106,8 +106,7 @@ static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
 					unsigned int hash)
 {
 	hash += (unsigned long) parent / L1_CACHE_BYTES;
-	hash = hash + (hash >> d_hash_shift);
-	return dentry_hashtable + (hash & d_hash_mask);
+	return dentry_hashtable + hash_32(hash, d_hash_shift);
 }
 
 /* Statistics gathering. */
diff --git a/fs/namei.c b/fs/namei.c
index a996bb48dfab..229235862e50 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -34,6 +34,7 @@
 #include <linux/device_cgroup.h>
 #include <linux/fs_struct.h>
 #include <linux/posix_acl.h>
+#include <linux/hash.h>
 #include <asm/uaccess.h>
 
 #include "internal.h"
@@ -1634,8 +1635,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
 
 static inline unsigned int fold_hash(unsigned long hash)
 {
-	hash += hash >> (8*sizeof(int));
-	return hash;
+	return hash_64(hash, 32);
 }
 
 #else	/* 32-bit case */


* Re: Name hashing function causing a perf regression
  2014-09-12 22:25               ` Linus Torvalds
@ 2014-09-13 18:58                 ` Linus Torvalds
  2014-09-15  1:32                   ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-09-13 18:58 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On Fri, Sep 12, 2014 at 3:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So with that, the kernel patch would look something like the attached.
> Totally untested.

Tested here, and it looks ok. The hash_64() implementation meant that
even on x86-64, where multiplies are generally fast, we'd do manual
shift-and-adds. Which annoyed me, although it didn't show up as a huge
problem in profiles. So I cleaned some of that up too.

I'm not going to wait for the next merge window, since we still have a
couple of weeks to go for 3.17. But I also didn't necessarily mark it
for stable, since I want to hear about how this affects performance at
fb first.

So if this all helps you guys, and after sufficient testing, just the
current top commit should be self-sufficient (with two other commits
before *if* that multiply-vs-shift makes a difference)

    99d263d4c5b2 (HEAD, master) vfs: fix bad hashing of dentries

and if it *doesn't* help, and you still see problems, please holler.

            Linus


* Re: Name hashing function causing a perf regression
  2014-09-13 18:58                 ` Linus Torvalds
@ 2014-09-15  1:32                   ` Linus Torvalds
  2014-09-15  2:49                     ` Tetsuo Handa
  2014-09-15 15:55                     ` Josef Bacik
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2014-09-15  1:32 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On Sat, Sep 13, 2014 at 11:58 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So if this all helps you guys, and after sufficient testing, just the
> current top commit should be self-sufficient (with two other commits
> before *if* that multiply-vs-shift makes a difference)
>
>     99d263d4c5b2 (HEAD, master) vfs: fix bad hashing of dentries
>
> and if it *doesn't* help, and you still see problems, please holler.

Btw, please if at all possible, give the 3.17-rc5 release a test on
the load that you saw problems with.

Because of the whole dentry hashing issue, I took another look at name
lookup performance, and found a few more problems in this area.

The biggest problem only affected the fairly unusual case of a
directory that was mounted differently in different namespaces, and it
may well be that you don't actually have that situation at all. I
found it almost by mistake when checking performance consistency and
noticing that my "/tmp" directory lookups were much slower than
everything else. The pathname lookup incorrectly dropped out of RCU
mode for that case due to two independent bugs (one hit normal lookups
of such directories, the other hit just the ".." case).

I also hit a small CPU pipeline hiccup in link_path_walk() that is
probably specific to just the store buffer forwarding of x86-64, but
could possibly hit other 64-bit cases too. I doubt it's noticeable for
your case, but it showed up pretty clearly in the profiles when I was
checking that everything looked ok.

All of them should be fixed in the -rc5 I just pushed out. At least I
hope so.

I do have another case I'm not entirely happy about - our negative
lookups (ie looking up a pathname that doesn't exist) hit in the
dcache for real filesystems and perform really well, but they suck for
tmpfs. Al, we turn off negative dentry caches for tmpfs because
simple_dentry_operations uses

        .d_delete = always_delete_dentry,

Do we care? It's noticeable in benchmarks: it's almost an order of
magnitude difference when looking up non-existent files. I can look up
a non-existent file 23M times per second on ext4, but only 3.3M on
/tmp.

Anyway, I'm not sure FB does a lot of lookups of nonexistent files,
but there are some loads that really do that a lot.  And we're in the
odd situation that it's actually *much* faster on a real filesystem
than it is on a RAM filesystem like /tmp.

                   Linus


* Re: Name hashing function causing a perf regression
  2014-09-15  1:32                   ` Linus Torvalds
@ 2014-09-15  2:49                     ` Tetsuo Handa
  2014-09-15  3:37                       ` Linus Torvalds
  2014-09-15 15:55                     ` Josef Bacik
  1 sibling, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2014-09-15  2:49 UTC (permalink / raw)
  To: torvalds; +Cc: jbacik, andi, linux-fsdevel, viro, hch, clm

Linus Torvalds wrote:
> I do have another case I'm not entirely happy about - our negative
> lookups (ie looking up a pathname that doesn't exist) hit in the
> dcache for real filesystems and perform really well, but they suck for
> tmpfs. Al, we turn off negative dentry caches for tmpfs because
> simple_dentry_operations uses
> 
>         .d_delete = always_delete_dentry,
> 
> Do we care? It's noticeable in benchmarks: it's almost an order of
> magnitude difference when looking up non-existent files. I can look up
> a non-existent file 23M times per second on ext4, but only 3.3M on
> /tmp.

I care, at least as of RHEL6 (2.6.32) kernels. Some users are using tmpfs
in order to avoid the dentry cache bouncing bomb (a series of simple stat()
calls takes about 10 seconds when there are 300,000,000+ unused dentries;
see P31 of http://I-love.SAKURA.ne.jp/tomoyo/LCJ2014-en.pdf ). I don't know
the performance on recent kernels.


* Re: Name hashing function causing a perf regression
  2014-09-15  2:49                     ` Tetsuo Handa
@ 2014-09-15  3:37                       ` Linus Torvalds
  2014-09-15  4:58                         ` Tetsuo Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-09-15  3:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Josef Bacik, Andi Kleen, linux-fsdevel, Al Viro,
	Christoph Hellwig, Chris Mason

On Sun, Sep 14, 2014 at 7:49 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> I care, at least as of RHEL6 (2.6.32) kernels. Some users are using tmpfs
> in order to avoid dentry cache bouncing bomb (a series of simple stat() calls
> take about 10 seconds when there are 300,000,000+ unused dentries, P31 of
> http://I-love.SAKURA.ne.jp/tomoyo/LCJ2014-en.pdf ). I don't know performance
> on recent kernels.

Hmm. Probably not hugely different. It's still fairly easy to generate
a lot of negative dentries. Do you have an actual test program that
does this and is annoying? Do people actually do that?

It shouldn't be impossible to keep negative dentries in check without
getting rid of them *entirely*. They do speed up lots of common
operations (things like $PATH lookups etc, negative caches really are
useful).

            Linus


* Re: Name hashing function causing a perf regression
  2014-09-15  3:37                       ` Linus Torvalds
@ 2014-09-15  4:58                         ` Tetsuo Handa
  2014-09-15 14:17                           ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2014-09-15  4:58 UTC (permalink / raw)
  To: torvalds; +Cc: jbacik, andi, linux-fsdevel, viro, hch, clm

Linus Torvalds wrote:
> On Sun, Sep 14, 2014 at 7:49 PM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > I care, at least as of RHEL6 (2.6.32) kernels. Some users are using tmpfs
> > in order to avoid dentry cache bouncing bomb (a series of simple stat() calls
> > take about 10 seconds when there are 300,000,000+ unused dentries, P31 of
> > http://I-love.SAKURA.ne.jp/tomoyo/LCJ2014-en.pdf ). I don't know performance
> > on recent kernels.
> 
> Hmm. Probably not hugely different. It's still fairly easy to generate
> a lot of negative dentries. Do you have an actual test program that
> does this and is annoying? Do people actually do that?

OK. I quote comment #5 and #9 of RHBZ 1061562 below.
The test program I used (source code of a.out) is in comment #5. The customer
in comment #9 is now using tmpfs and so far I have heard no problems.

----- comment #5 start -----

At least two enterprise systems which create a lot of dentries are suffering
unexpected system reboots/hangups. One of the systems has about 100GB of RAM,
and the sar report showed that the number of unused dentries was over
300,000,000 just before an unexpected reboot happened. Also, by using the
dentry producer program described in comment #0, the system successfully
triggered another unexpected reboot.

Therefore, I am examining the cause of unexpected reboots. Since I had a
chance to access a machine with 64GB of RAM this week, I ran a test using
SystemTap and an updated version of dentry producer program described below.

---------- a SystemTap command line start ----------
stap -e '
global stat_start;
global shrink_start;
global counter;
probe kernel.function("vfs_stat") {
  counter++;
  stat_start[tid()] = gettimeofday_ns();
}
probe kernel.function("vfs_stat").return {
  t = gettimeofday_ns() - stat_start[tid()];
  if (t >= 1000000)
    printf("%s: vfs_stat(%u) took %u ms\n", execname(), counter, t / 1000000);
}
probe kernel.function("shrink_slab") {
  printf("%s: shrink_slab() started\n", execname());
  shrink_start[tid()] = gettimeofday_ns();
}
probe kernel.function("shrink_slab").return {
  t = gettimeofday_ns() - shrink_start[tid()];
  printf("%s: shrink_slab() took %u ms\n", execname(), t / 1000000);
}
'
---------- a SystemTap command line end ----------

---------- an example of very-fast dentry producer start ----------
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
        unsigned int i = 0;
        struct stat buf;
        char buffer[128] = { 0 };
        while (1) {
                snprintf(buffer, sizeof(buffer) - 1, "%u", i++);
                stat(buffer, &buf);
        }
        return 0;
}
---------- an example of very-fast dentry producer end ----------

The test result showed that a single stat() call takes a few seconds if kswapd
is reclaiming memory at shrink_slab(), and a series of a few stat() calls takes
more than 10 seconds.

---------- SystemTap output start ----------
bash: shrink_slab() started
bash: shrink_slab() took 18 ms
bash: shrink_slab() started
bash: shrink_slab() took 0 ms
crond: vfs_stat(9964303) took 10 ms
crond: vfs_stat(79137289) took 6 ms
sa1: vfs_stat(79200631) took 9 ms
bash: vfs_stat(98784249) took 10 ms
run-parts: vfs_stat(98787397) took 5 ms
run-parts: vfs_stat(98791464) took 3 ms
0anacron: vfs_stat(98801077) took 15 ms
0anacron: vfs_stat(98804416) took 10 ms
kswapd0: shrink_slab() started
kswapd1: shrink_slab() started
a.out: vfs_stat(305840783) took 282 ms
a.out: vfs_stat(305840784) took 999 ms
a.out: vfs_stat(305840785) took 335 ms
a.out: vfs_stat(305840786) took 3654 ms
a.out: vfs_stat(305840787) took 3009 ms
a.out: vfs_stat(305840788) took 999 ms
a.out: vfs_stat(305840790) took 1228 ms
a.out: vfs_stat(305847338) took 2 ms
a.out: vfs_stat(305849179) took 3 ms
a.out: vfs_stat(305849870) took 2 ms
a.out: vfs_stat(305850478) took 1 ms
a.out: vfs_stat(305854046) took 1 ms
a.out: vfs_stat(305856251) took 1 ms
a.out: vfs_stat(305864002) took 5 ms
a.out: vfs_stat(305876155) took 6 ms
a.out: vfs_stat(305885622) took 12 ms
a.out: vfs_stat(305886187) took 9 ms
a.out: vfs_stat(305888365) took 2 ms
a.out: vfs_stat(305892807) took 2 ms
a.out: vfs_stat(305896127) took 2 ms
a.out: vfs_stat(305897223) took 2 ms
a.out: vfs_stat(305910449) took 1 ms
a.out: vfs_stat(305911688) took 8 ms
a.out: vfs_stat(305914482) took 12 ms
kswapd1: shrink_slab() took 11810 ms
kswapd1: shrink_slab() started
a.out: vfs_stat(305939079) took 7 ms
a.out: vfs_stat(305943797) took 21 ms
a.out: vfs_stat(305944628) took 12 ms
a.out: vfs_stat(305964155) took 3 ms
a.out: vfs_stat(305965994) took 15 ms
a.out: vfs_stat(305966685) took 15 ms
a.out: vfs_stat(305967124) took 1 ms
a.out: vfs_stat(305973015) took 2 ms
a.out: vfs_stat(305975511) took 4 ms
a.out: vfs_stat(305978080) took 3 ms
a.out: vfs_stat(305982128) took 3 ms
a.out: vfs_stat(305988765) took 4 ms
a.out: vfs_stat(305991667) took 3 ms
a.out: vfs_stat(305995990) took 3 ms
a.out: vfs_stat(306000642) took 1 ms
a.out: vfs_stat(306010369) took 3 ms
a.out: vfs_stat(306023386) took 10 ms
a.out: vfs_stat(306023881) took 9 ms
a.out: vfs_stat(306024374) took 3 ms
a.out: vfs_stat(306033201) took 5 ms
kswapd1: shrink_slab() took 1373 ms
a.out: vfs_stat(306038979) took 4 ms
a.out: vfs_stat(306045321) took 1 ms
a.out: vfs_stat(306048306) took 1 ms
a.out: vfs_stat(306050274) took 5 ms
a.out: vfs_stat(306053855) took 4 ms
a.out: vfs_stat(306054220) took 1 ms
a.out: vfs_stat(306075005) took 8 ms
a.out: vfs_stat(306076840) took 17 ms
a.out: vfs_stat(306085567) took 3 ms
a.out: vfs_stat(306101199) took 1 ms
a.out: vfs_stat(306102299) took 20 ms
a.out: vfs_stat(306109915) took 2 ms
a.out: vfs_stat(306111825) took 14 ms
a.out: vfs_stat(306114245) took 3 ms
a.out: vfs_stat(306114878) took 4 ms
kswapd0: shrink_slab() took 14025 ms
kswapd0: shrink_slab() started
a.out: vfs_stat(306133947) took 29 ms
a.out: vfs_stat(306138301) took 5 ms
a.out: vfs_stat(306144529) took 5 ms
a.out: vfs_stat(306149390) took 5 ms
a.out: vfs_stat(306164696) took 8 ms
a.out: vfs_stat(306165424) took 5 ms
a.out: vfs_stat(306176530) took 1 ms
a.out: vfs_stat(306180342) took 20 ms
a.out: vfs_stat(306187037) took 7 ms
a.out: vfs_stat(306192183) took 12 ms
a.out: vfs_stat(306201618) took 9 ms
a.out: vfs_stat(306203887) took 2 ms
a.out: vfs_stat(306204688) took 4 ms
a.out: vfs_stat(306206134) took 1 ms
a.out: vfs_stat(306207974) took 2 ms
a.out: vfs_stat(306220292) took 12 ms
a.out: vfs_stat(306224972) took 6 ms
a.out: vfs_stat(306231423) took 6 ms
a.out: vfs_stat(306232489) took 1 ms
a.out: vfs_stat(306236307) took 4 ms
a.out: vfs_stat(306238302) took 1 ms
a.out: vfs_stat(306245840) took 1 ms
a.out: vfs_stat(306249541) took 14 ms
a.out: vfs_stat(306251562) took 6 ms
a.out: vfs_stat(306260444) took 2 ms
a.out: vfs_stat(306275088) took 4 ms
a.out: vfs_stat(306278510) took 2 ms
a.out: vfs_stat(306278795) took 3 ms
kswapd0: shrink_slab() took 1693 ms
kswapd0: shrink_slab() started
a.out: vfs_stat(306295322) took 22 ms
a.out: vfs_stat(306307052) took 17 ms
a.out: vfs_stat(306309420) took 1 ms
a.out: vfs_stat(306315293) took 4 ms
a.out: vfs_stat(306325204) took 10 ms
a.out: vfs_stat(306325362) took 17 ms
a.out: vfs_stat(306329728) took 2 ms
a.out: vfs_stat(306334637) took 5 ms
a.out: vfs_stat(306350149) took 7 ms
a.out: vfs_stat(306352355) took 1 ms
a.out: vfs_stat(306354047) took 1 ms
a.out: vfs_stat(306356301) took 1 ms
a.out: vfs_stat(306384456) took 1 ms
a.out: vfs_stat(306388446) took 10 ms
a.out: vfs_stat(306391249) took 2 ms
a.out: vfs_stat(306393632) took 1 ms
a.out: vfs_stat(306407549) took 6 ms
a.out: vfs_stat(306414708) took 2 ms
a.out: vfs_stat(306438538) took 1 ms
a.out: vfs_stat(306447413) took 9 ms
a.out: vfs_stat(306453184) took 2 ms
a.out: vfs_stat(306459036) took 1 ms
kswapd0: shrink_slab() took 1691 ms
---------- SystemTap output end ----------

If a watchdog program does something like

  while (1) {
    read from /proc/$pid/status for process check
    sleep for 1 second for CPU saving
    write to /dev/watchdog for resetting timer
  }

the watchdog program can be blocked long enough that it fails to write to
/dev/watchdog within the watchdog timeout.

So far, I'm suspecting that the cause of unexpected reboots is the watchdog
program which is unexpectedly blocked by dcache_lock contention due to kswapd
calling cond_resched_lock(&dcache_lock) from __shrink_dcache_sb() from
prune_dcache() from shrink_dcache_memory() from shrink_slab().

If my guess is correct, this is a time bomb for systems with a huge amount of
RAM and a huge amount of dentries.

----- comment #5 end -----

----- comment #9 start -----

If I remember correctly, the system is an active/standby cluster, and this
problem occurs on the standby machine which is almost idle.

On the standby machine, "service drbd status" is issued by the heartbeat
software twice per second. /etc/init.d/drbd uses /bin/sort, and /bin/sort
creates temporary files. Since the standby machine is nearly idle, nothing
but dentries consumes the RAM. Therefore, by the time uptime reached 120
days, the number of unused dentries (for the temporary files created by
/bin/sort) exceeded 300,000,000.

Since the cause of these unused dentries turned out to be /etc/init.d/drbd,
I suggested that the customer create temporary files on tmpfs (rather than
ext4) by doing

  # echo "export TMPDIR=/dev/shm" >> /etc/default/drbd

as a workaround.

Also, I suggested that the customer use a serial console or netconsole to
catch kernel messages (if any) in case of unexpected reboots/hangups, since
I currently suspect a watchdog-style timeout when kswapd starts memory
reclaim.

----- comment #9 end -----

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Name hashing function causing a perf regression
  2014-09-15  4:58                         ` Tetsuo Handa
@ 2014-09-15 14:17                           ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2014-09-15 14:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Josef Bacik, Andi Kleen, linux-fsdevel, Al Viro,
	Christoph Hellwig, Chris Mason

[-- Attachment #1: Type: text/plain, Size: 1472 bytes --]

On Sun, Sep 14, 2014 at 9:58 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> OK. I quote comment #5 and #9 of RHBZ 1061562 below.

Looks like that BZ entry is somehow restricted, so I can't see the
details, but it turns out that I think we've fixed this issue a long
time ago for other reasons.

We definitely can use a lot of memory on negative dentries (see my own
test program attached) and that will inevitably lead to *some* issues
(because we do get memory pressure and then have to get to
shrink_slabs etc to get rid of them).

But the negative dentries should be fairly easy to get rid of, and the
real problem in that bugzilla seems to be the dcache_lock. And that
dcache_lock doesn't exist at all any more, and dentries should scale
almost infinitely.

So I'd like to have some way to limit excessive negative dentries
anyway, because they obviously do fill up the hash lists and use up
memory, so they can certainly be problematic, but I don't think they
are necessarily any worse than streaming a large file and filling up
memory that way. Except, likely, that the "streaming a large file" is
so much more common that we may well have more problems with negative
dentries just because nobody actually does this in practice. I didn't
see anything obvious from my test-program, but I didn't run any real
latency test or anything, just a "can't tell a difference in normal
use" aside from the normal "ok, no free memory".

                 Linus

[-- Attachment #2: negative.c --]
[-- Type: text/x-csrc, Size: 215 bytes --]

#include <sys/stat.h>

int main(int argc, char *argv[])
{
	char name[] = "a0000000000";
	struct stat st;

	for (;;) {
		/* treat name[1..10] as a base-10 counter: a0000000000 .. a9999999999 */
		char *p = name + 10;
		while (++*p > '9') {
			if (*p == 'b')	/* carry ran past the leading 'a': all done */
				return 0;
			*p = '0';
			--p;
		}
		/* each failed lookup leaves a negative dentry behind */
		stat(name, &st);
	}
	return 0;
}


* Re: Name hashing function causing a perf regression
  2014-09-15  1:32                   ` Linus Torvalds
  2014-09-15  2:49                     ` Tetsuo Handa
@ 2014-09-15 15:55                     ` Josef Bacik
  2014-09-15 16:22                       ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2014-09-15 15:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On 09/14/2014 09:32 PM, Linus Torvalds wrote:
> On Sat, Sep 13, 2014 at 11:58 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So if this all helps you guys, and after sufficient testing, just the
>> current top commit should be self-sufficient (with two other commits
>> before *if* that multiply-vs-shift makes a difference)
>>
>>      99d263d4c5b2 (HEAD, master) vfs: fix bad hashing of dentries
>>
>> and if it *doesn't* help, and you still see problems, please holler.
>
> Btw, please if at all possible, give the 3.17-rc5 release a test on
> the load that you saw problems with.
>
> Because of the whole dentry hashing issue, I took another look at name
> lookup performance, and found a few more problems in this area.
>
> The biggest problem only affected the fairly unusual case of a
> directory that was mounted differently in different namespaces, and it
> may well be that you don't actually have that situation at all. I
> found it almost by mistake when checking performance consistency and
> noticing that my "/tmp" directory lookups were much slower than
> everything else. The pathname lookup incorrectly dropped out of RCU
> mode for that case due to two independent bugs (one hit normal lookups
> of such directories, the other hit just the ".." case).
>
> I also hit a small CPU pipeline hiccup in link_path_walk() that is
> probably specific to just the store buffer forwarding of x86-64, but
> could possibly hit other 64-bit cases too. I doubt it's noticeable for
> your case, but it showed up pretty clearly in the profiles when I was
> checking that everything looked ok.
>
> All of them should be fixed in the -rc5 I just pushed out. At least I
>
> I do have another case I'm not entirely happy about - our negative
> lookups (ie looking up a pathname that doesn't exist) hit in the
> dcache for real filesystems and perform really well, but they suck for
> tmpfs. Al, we turn off negative dentry caches for tmpfs because
> simple_dentry_operations uses
>
>          .d_delete = always_delete_dentry,
>
> Do we care? It's noticeable in benchmarks: it's almost an order of
> magnitude difference when looking up non-existent files. I can look up
> a non-existent file 23M times per second on ext4, but only 3.3M on
> /tmp.
>
> Anyway, I'm not sure FB does a lot of lookups of nonexistent files,
> but there are some loads that really do that a lot.  And we're in the
> odd situation that it's actually *much* faster on a real filesystem
> than it is on a RAM filesystem like /tmp.
>

Sorry, after my last reply I promptly went out of town for the weekend.
Ran everything again; here are the numbers (avg ops/sec):

		3.2 hashing	3.10 hashing	3.10+Your fix
XFS-mkdir	18425.083984	13783.275391	18638.523438
Ext4-mkdir	16834.097656	13801.648438	16805.552734
Btrfs-mkdir	48617.929688	25772.544922	47906.093750

XFS-rename	17321.025391	10300.987305	16647.900391
Ext4-rename	17619.650391	11027.894531	17456.312500
Btrfs-rename	26019.062500	9964.170898	25311.501953

I can't test on 3.17 proper since the Fusion IO driver doesn't build 
properly there and I'm not being paid to work on it anymore so I'm not 
fixing it ;).  Thanks for fixing this, I've pulled back 99d263d4c5b2 
which will do us just fine.

Josef


* Re: Name hashing function causing a perf regression
  2014-09-15 15:55                     ` Josef Bacik
@ 2014-09-15 16:22                       ` Linus Torvalds
  2014-09-15 16:25                         ` Al Viro
  2014-09-15 16:35                         ` Greg KH
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2014-09-15 16:22 UTC (permalink / raw)
  To: Josef Bacik, stable
  Cc: Andi Kleen, linux-fsdevel, Al Viro, Christoph Hellwig, Chris Mason

On Mon, Sep 15, 2014 at 8:55 AM, Josef Bacik <jbacik@fb.com> wrote:
>
> I can't test on 3.17 proper since the Fusion IO driver doesn't build
> properly there and I'm not being paid to work on it anymore so I'm not
> fixing it ;).  Thanks for fixing this, I've pulled back 99d263d4c5b2 which
> will do us just fine.

Ok. I'm cc'ing stable so they know that

    99d263d4c5b2 ("vfs: fix bad hashing of dentries")

should be added to the queues for 3.10+.

Greg&co - it's a simple fix for a performance regression. Not
end-of-the-world, but if it ends up being in the FB kernel trees,
might as well get it back-ported to the other stable trees too.

The other performance issues I found are actually potentially worse,
but they aren't regressions and they hit only when using namespaces.
Which mostly nobody does on old kernels anyway. More of a "systemd
uses namespaces for /tmp, and then it's quite noticeable as slowing
down pathname lookups there if you benchmark it".

So the single commit Josef mentions is likely sufficient for stable.

               Linus


* Re: Name hashing function causing a perf regression
  2014-09-15 16:22                       ` Linus Torvalds
@ 2014-09-15 16:25                         ` Al Viro
  2014-09-15 16:33                           ` Linus Torvalds
  2014-09-15 16:35                         ` Greg KH
  1 sibling, 1 reply; 23+ messages in thread
From: Al Viro @ 2014-09-15 16:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, stable, Andi Kleen, linux-fsdevel,
	Christoph Hellwig, Chris Mason

On Mon, Sep 15, 2014 at 09:22:57AM -0700, Linus Torvalds wrote:

> The other performance issues I found are actually potentially worse,
> but they aren't regressions and they hit only when using namespaces.
> Which mostly nobody does on old kernels anyway. More of a "systemd
> uses namespaces for /tmp, and then it's quite noticeable as slowing
> down pathname lookups there if you benchmark it".

Umm...  They are easy to trigger with just mount --bind; no namespaces
needed, actually.


* Re: Name hashing function causing a perf regression
  2014-09-15 16:25                         ` Al Viro
@ 2014-09-15 16:33                           ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2014-09-15 16:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Josef Bacik, stable, Andi Kleen, linux-fsdevel,
	Christoph Hellwig, Chris Mason

On Mon, Sep 15, 2014 at 9:25 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  They are easy to trigger with just mount --bind; no namespaces
> needed, actually.

Fair enough, but it's still a "non-traditional" situation, ie I would
expect that most older installs don't necessarily do it.

That said, I certainly do not have anything against fixing the other
performance problems too, and I think you did mark them all for
stable. So they'll be picked up automatically as long as they still
apply (and I suspect they all do - this is not a high-churn area, and
they are all old problems).

              Linus


* Re: Name hashing function causing a perf regression
  2014-09-15 16:22                       ` Linus Torvalds
  2014-09-15 16:25                         ` Al Viro
@ 2014-09-15 16:35                         ` Greg KH
  2014-09-15 16:45                           ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Greg KH @ 2014-09-15 16:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, stable, Andi Kleen, linux-fsdevel, Al Viro,
	Christoph Hellwig, Chris Mason

On Mon, Sep 15, 2014 at 09:22:57AM -0700, Linus Torvalds wrote:
> On Mon, Sep 15, 2014 at 8:55 AM, Josef Bacik <jbacik@fb.com> wrote:
> >
> > I can't test on 3.17 proper since the Fusion IO driver doesn't build
> > properly there and I'm not being paid to work on it anymore so I'm not
> > fixing it ;).  Thanks for fixing this, I've pulled back 99d263d4c5b2 which
> > will do us just fine.
> 
> Ok. I'm cc'ing stable to know that
> 
>     99d263d4c5b2 ("vfs: fix bad hashing of dentries")
> 
> should be added to the queues for 3.10+.
> 
> Greg&co - it's a simple fix for a performance regression. Not
> end-of-the-world, but if it ends up being in the FB kernel trees,
> might as well get it back-ported to the other stable trees too.
> 
> The other performance issues I found are actually potentially worse,
> but they aren't regressions and they hit only when using namespaces.
> Which mostly nobody does on old kernels anyway. More of a "systemd
> uses namespaces for /tmp, and then it's quite noticeable as slowing
> down pathname lookups there if you benchmark it".
> 
> So the single commit Josef mentions is likely sufficient for stable.

Thanks, now queued up.  Note, I had to change the 3.10-stable patch a
bit due to some rejects.

Here's a diff of the diffs to show the difference if anyone wants to
check it.

thanks,

greg k-h


--- queue-3.16/vfs-fix-bad-hashing-of-dentries.patch	2014-09-15 09:27:57.072950053 -0700
+++ queue-3.10/vfs-fix-bad-hashing-of-dentries.patch	2014-09-15 09:29:06.851839582 -0700
@@ -76,13 +76,13 @@
 
 --- a/fs/dcache.c
 +++ b/fs/dcache.c
-@@ -106,8 +106,7 @@ static inline struct hlist_bl_head *d_ha
+@@ -108,8 +108,7 @@ static inline struct hlist_bl_head *d_ha
  					unsigned int hash)
  {
  	hash += (unsigned long) parent / L1_CACHE_BYTES;
--	hash = hash + (hash >> d_hash_shift);
--	return dentry_hashtable + (hash & d_hash_mask);
-+	return dentry_hashtable + hash_32(hash, d_hash_shift);
+-	hash = hash + (hash >> D_HASHBITS);
+-	return dentry_hashtable + (hash & D_HASHMASK);
++	return dentry_hashtable + hash_32(hash, D_HASHBITS);
  }
  
  /* Statistics gathering. */
@@ -96,7 +96,7 @@
  #include <asm/uaccess.h>
  
  #include "internal.h"
-@@ -1629,8 +1630,7 @@ static inline int nested_symlink(struct
+@@ -1647,8 +1648,7 @@ static inline int can_lookup(struct inod
  
  static inline unsigned int fold_hash(unsigned long hash)
  {


* Re: Name hashing function causing a perf regression
  2014-09-15 16:35                         ` Greg KH
@ 2014-09-15 16:45                           ` Linus Torvalds
  2014-09-15 16:53                             ` Jiri Slaby
  2014-09-15 17:31                             ` Greg KH
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2014-09-15 16:45 UTC (permalink / raw)
  To: Greg KH
  Cc: Josef Bacik, stable, Andi Kleen, linux-fsdevel, Al Viro,
	Christoph Hellwig, Chris Mason

On Mon, Sep 15, 2014 at 9:35 AM, Greg KH <greg@kroah.com> wrote:
>
> Thanks, now queued up.  Note, I had to change the 3.10-stable patch a
> bit due to some rejects.

Ahh. Yes. It still uses the old D_HASHBITS/D_HASHMASK macros up until
and including 3.12.

You could have just backported 482db9066199 ("dcache.c: get rid of
pointless macros") too, but your resolution looks perfectly fine too.

                Linus


* Re: Name hashing function causing a perf regression
  2014-09-15 16:45                           ` Linus Torvalds
@ 2014-09-15 16:53                             ` Jiri Slaby
  2014-09-15 17:31                             ` Greg KH
  1 sibling, 0 replies; 23+ messages in thread
From: Jiri Slaby @ 2014-09-15 16:53 UTC (permalink / raw)
  To: Linus Torvalds, Greg KH
  Cc: Josef Bacik, stable, Andi Kleen, linux-fsdevel, Al Viro,
	Christoph Hellwig, Chris Mason

On 09/15/2014, 06:45 PM, Linus Torvalds wrote:
> On Mon, Sep 15, 2014 at 9:35 AM, Greg KH <greg@kroah.com> wrote:
>>
>> Thanks, now queued up.  Note, I had to change the 3.10-stable patch a
>> bit due to some rejects.
> 
> Ahh. Yes. it still uses the old D_HASHBITS/HASHMASK macros in up until
> and including 3.12.
> 
> You could have just backported 482db9066199 ("dcache.c: get rid of
> pointless macros") too

Ok, I went this path for 3.12 as 482db9066199 is simple enough. Thanks!

-- 
js
suse labs


* Re: Name hashing function causing a perf regression
  2014-09-15 16:45                           ` Linus Torvalds
  2014-09-15 16:53                             ` Jiri Slaby
@ 2014-09-15 17:31                             ` Greg KH
  1 sibling, 0 replies; 23+ messages in thread
From: Greg KH @ 2014-09-15 17:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, stable, Andi Kleen, linux-fsdevel, Al Viro,
	Christoph Hellwig, Chris Mason

On Mon, Sep 15, 2014 at 09:45:40AM -0700, Linus Torvalds wrote:
> On Mon, Sep 15, 2014 at 9:35 AM, Greg KH <greg@kroah.com> wrote:
> >
> > Thanks, now queued up.  Note, I had to change the 3.10-stable patch a
> > bit due to some rejects.
> 
> Ahh. Yes. it still uses the old D_HASHBITS/HASHMASK macros in up until
> and including 3.12.
> 
> You could have just backported 482db9066199 ("dcache.c: get rid of
> pointless macros") too, but your resolution looks perfectly fine too.

Ah, much easier, I've gone and done that now, thanks,

greg k-h


end of thread, other threads:[~2014-09-15 17:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-09 19:30 Name hashing function causing a perf regression Josef Bacik
2014-09-12 19:11 ` Andi Kleen
2014-09-12 19:21   ` Linus Torvalds
2014-09-12 19:52     ` Josef Bacik
2014-09-12 20:39       ` Linus Torvalds
2014-09-12 21:25         ` Josef Bacik
2014-09-12 22:01           ` Linus Torvalds
2014-09-12 22:08             ` Josef Bacik
2014-09-12 22:25               ` Linus Torvalds
2014-09-13 18:58                 ` Linus Torvalds
2014-09-15  1:32                   ` Linus Torvalds
2014-09-15  2:49                     ` Tetsuo Handa
2014-09-15  3:37                       ` Linus Torvalds
2014-09-15  4:58                         ` Tetsuo Handa
2014-09-15 14:17                           ` Linus Torvalds
2014-09-15 15:55                     ` Josef Bacik
2014-09-15 16:22                       ` Linus Torvalds
2014-09-15 16:25                         ` Al Viro
2014-09-15 16:33                           ` Linus Torvalds
2014-09-15 16:35                         ` Greg KH
2014-09-15 16:45                           ` Linus Torvalds
2014-09-15 16:53                             ` Jiri Slaby
2014-09-15 17:31                             ` Greg KH
