* Horrible L2 cache effects from kernel compile @ 2003-02-25 0:41 Dave Hansen 2003-02-25 0:59 ` William Lee Irwin III 2003-02-25 2:31 ` Linus Torvalds 0 siblings, 2 replies; 27+ messages in thread From: Dave Hansen @ 2003-02-25 0:41 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Martin J. Bligh, Gerrit Huizenga I was testing Martin Bligh's kernbench (a kernel compile with -j(2*NR_CPUS)) and using DCU_MISS_OUTSTANDING as the counter event. The surprising thing? d_lookup() accounts for 8% of the time spent waiting for an L2 miss. __copy_to_user_ll should be trashing a lot of cachelines, but d_lookup() is strange. Counter 0 counted DCU_MISS_OUTSTANDING events (number of cycles while DCU miss outstanding) with a unit mask of 0x00 (Not set) count 10000 c017d78c 72929 1.92175 start_this_handle c0139b60 75317 1.98468 vm_enough_memory c0138cd4 75367 1.986 do_no_page c01ccae0 79475 2.09425 atomic_dec_and_lock c0117320 80918 2.13227 scheduler_tick c0176338 90851 2.39402 ext3_dirty_inode c013cc38 132228 3.48434 page_remove_rmap c0176557 148116 3.90301 .text.lock.inode c013cae0 156345 4.11985 page_add_rmap c012c964 157716 4.15598 find_get_page c015797c 314165 8.27857 d_lookup c01cc948 334779 8.82177 __copy_to_user_ll a snippet from d_lookup(), annotated. I've seen oprofile be off by a line here, but we can be pretty sure it is in this area. ... smp_read_barrier_depends(); /* 106 0.002793% */ dentry = list_entry(tmp, struct dentry, d_hash); /* if lookup ends up in a different bucket * due to concurrent rename, fail it */ /* 154991 4.084% */ if (unlikely(dentry->d_bucket != head)) break; /* to avoid race if dentry keep coming back to original * bucket due to double moves */ /* 634 0.01671% */ if (unlikely(++lookup_count > max_dentries)) break; ... -- Dave Hansen haveblue@us.ibm.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 0:41 Horrible L2 cache effects from kernel compile Dave Hansen @ 2003-02-25 0:59 ` William Lee Irwin III 2003-02-25 1:01 ` Dave Hansen 2003-02-25 3:15 ` John Levon 2003-02-25 2:31 ` Linus Torvalds 1 sibling, 2 replies; 27+ messages in thread From: William Lee Irwin III @ 2003-02-25 0:59 UTC (permalink / raw) To: Dave Hansen; +Cc: Linux Kernel Mailing List, Martin J. Bligh, Gerrit Huizenga On Mon, Feb 24, 2003 at 04:41:37PM -0800, Dave Hansen wrote: > c01cc948 334779 8.82177 __copy_to_user_ll > c015797c 314165 8.27857 d_lookup > c012c964 157716 4.15598 find_get_page > c013cae0 156345 4.11985 page_add_rmap > c0176557 148116 3.90301 .text.lock.inode > c013cc38 132228 3.48434 page_remove_rmap > c0176338 90851 2.39402 ext3_dirty_inode > c0117320 80918 2.13227 scheduler_tick > c01ccae0 79475 2.09425 atomic_dec_and_lock > c0138cd4 75367 1.986 do_no_page > c0139b60 75317 1.98468 vm_enough_memory > c017d78c 72929 1.92175 start_this_handle Your profile was upside down. I've re-sorted it. It probably confused people who were wondering why the numbers at the top of the profile were lower than the ones below them. On Mon, Feb 24, 2003 at 04:41:37PM -0800, Dave Hansen wrote: > a snippet from d_lookup(), annotated. I've seen oprofile be off by a > line here, but we can be pretty sure it is in this area. > smp_read_barrier_depends(); > /* 106 0.002793% */ > dentry = list_entry(tmp, struct dentry, d_hash); > /* if lookup ends up in a different bucket > * due to concurrent rename, fail it > */ > /* 154991 4.084% */ > if (unlikely(dentry->d_bucket != head)) > break; > /* to avoid race if dentry keep coming back to original > * bucket due to double moves > */ > /* 634 0.01671% */ > if (unlikely(++lookup_count > max_dentries)) > break; > ... Well, you're walking hash chains, you're going to get a lot of cache misses, about the whole length of the chain's worth. 
You could try changing the dcache to use something like a B+ tree or some such nonsense to reduce its cache footprint. I'd collect statistics to see if the dcache hash function is badly distributing things. That's relatively easily fixable if so. -- wli ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 0:59 ` William Lee Irwin III @ 2003-02-25 1:01 ` Dave Hansen 2003-02-25 3:15 ` John Levon 1 sibling, 0 replies; 27+ messages in thread From: Dave Hansen @ 2003-02-25 1:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Linux Kernel Mailing List, Martin J. Bligh, Gerrit Huizenga William Lee Irwin III wrote: > Your profile was upside down. I've re-sorted it. > It probably confused people who were wondering why the numbers > at the top of the profile were lower than the ones below them. Thanks, Bill. oprofile's default output is useful when you're viewing it from the cmdline, but it looks bad in email. -- Dave Hansen haveblue@us.ibm.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 0:59 ` William Lee Irwin III 2003-02-25 1:01 ` Dave Hansen @ 2003-02-25 3:15 ` John Levon 2003-02-25 3:15 ` William Lee Irwin III 2003-02-25 3:35 ` Andrew Morton 1 sibling, 2 replies; 27+ messages in thread From: John Levon @ 2003-02-25 3:15 UTC (permalink / raw) To: William Lee Irwin III, Dave Hansen, Linux Kernel Mailing List, Martin J. Bligh, Gerrit Huizenga On Mon, Feb 24, 2003 at 04:59:22PM -0800, William Lee Irwin III wrote: > Your profile was upside down. I've re-sorted it. > It probably confused people who were wondering why the numbers > at the top of the profile were lower than the ones below them. Would people generally prefer things to be sorted so the most important stuff was at the top ? We're considering such a change ... regards john ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 3:15 ` John Levon @ 2003-02-25 3:15 ` William Lee Irwin III 2003-02-25 3:35 ` Andrew Morton 1 sibling, 0 replies; 27+ messages in thread From: William Lee Irwin III @ 2003-02-25 3:15 UTC (permalink / raw) To: John Levon Cc: Dave Hansen, Linux Kernel Mailing List, Martin J. Bligh, Gerrit Huizenga On Mon, Feb 24, 2003 at 04:59:22PM -0800, William Lee Irwin III wrote: >> Your profile was upside down. I've re-sorted it. >> It probably confused people who were wondering why the numbers >> at the top of the profile were lower than the ones below them. On Tue, Feb 25, 2003 at 03:15:19AM +0000, John Levon wrote: > Would people generally prefer things to be sorted so the most important > stuff was at the top ? We're considering such a change ... I would, I have to do a double take to catch it. -- wli ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 3:15 ` John Levon 2003-02-25 3:15 ` William Lee Irwin III @ 2003-02-25 3:35 ` Andrew Morton 2003-02-25 4:13 ` Martin J. Bligh 1 sibling, 1 reply; 27+ messages in thread From: Andrew Morton @ 2003-02-25 3:35 UTC (permalink / raw) To: John Levon; +Cc: wli, haveblue, linux-kernel, mbligh, gh John Levon <levon@movementarian.org> wrote: > > On Mon, Feb 24, 2003 at 04:59:22PM -0800, William Lee Irwin III wrote: > > > Your profile was upside down. I've re-sorted it. > > It probably confused people who were wondering why the numbers > > at the top of the profile were lower than the ones below them. > > Would people generally prefer things to be sorted so the most important > stuff was at the top ? We're considering such a change ... > I prefer it the way it is, so unimportant stuff can scroll away. Bill can just turn his monitor upside down. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 3:35 ` Andrew Morton @ 2003-02-25 4:13 ` Martin J. Bligh 2003-02-25 11:57 ` John Levon 0 siblings, 1 reply; 27+ messages in thread From: Martin J. Bligh @ 2003-02-25 4:13 UTC (permalink / raw) To: Andrew Morton, John Levon; +Cc: wli, haveblue, linux-kernel, gh >> > Your profile was upside down. I've re-sorted it. >> > It probably confused people who were wondering why the numbers >> > at the top of the profile were lower than the ones below them. >> >> Would people generally prefer things to be sorted so the most important >> stuff was at the top ? We're considering such a change ... >> > > I prefer it the way it is, so unimportant stuff can scroll away. > > Bill can just turn his monitor upside down. You're Australian though, so you get that by default ;-) John, how about a "-r" flag or something to sort the output biggest first? Might keep the dissenting masses happy ;-) M. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 4:13 ` Martin J. Bligh @ 2003-02-25 11:57 ` John Levon 0 siblings, 0 replies; 27+ messages in thread From: John Levon @ 2003-02-25 11:57 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, wli, haveblue, linux-kernel, gh On Mon, Feb 24, 2003 at 08:13:25PM -0800, Martin J. Bligh wrote: > John, how about a "-r" flag or something to sort the output biggest first? [moz@lambent src]$ oprofpp --help | grep -- reverse -r, --reverse reverse sort order [moz@lambent src]$ op_time --help | grep -- reverse -r, --reverse reverse sort order Requests like this are my favourite ;) john ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 0:41 Horrible L2 cache effects from kernel compile Dave Hansen 2003-02-25 0:59 ` William Lee Irwin III @ 2003-02-25 2:31 ` Linus Torvalds 2003-02-25 17:05 ` John W. M. Stevens 1 sibling, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2003-02-25 2:31 UTC (permalink / raw) To: linux-kernel In article <3E5ABBC1.8050203@us.ibm.com>, Dave Hansen <haveblue@us.ibm.com> wrote: >I was testing Martin Bligh's kernbench (a kernel compile with >-j(2*NR_CPUS)) and using DCU_MISS_OUTSTANDING as the counter event. > >The surprising thing? d_lookup() accounts for 8% of the time spent >waiting for an L2 miss. > >__copy_to_user_ll should be trashing a lot of cachelines, but d_lookup() >is strange. I wouldn't call it that strange. It _is_ one of the most critical areas of the FS code, and hashes (which it uses) are inherently bad for caches. The instruction you point to as being the most likely suspect: if (unlikely(dentry->d_bucket != head)) is the first instruction that actually looks at the dentry chain information, so sure as hell, that's the one you'd expect to show up as the cache miss. There's no question that the dcache is very effective at caching, but I also think it's pretty clear that especially since we allow it to grow pretty much as big as we have memory, it _is_ going to cause cache misses. I don't know what to even suggest doing about it - it may be one of those things where we just have to live with it. I don't see any alternate good data structures that would be more cache-friendly. Unlike some of our other data structures (the page cache, for example) which have been converted to more cache-friendly RB-trees, the name lookup is fundamentally not very well-behaved. 
(Ie with the page cache there is a lot of locality within one file, while with name lookup the indexing is a string, which pretty much implies that we have to hash it and thus lose all locality) Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile 2003-02-25 2:31 ` Linus Torvalds @ 2003-02-25 17:05 ` John W. M. Stevens 0 siblings, 0 replies; 27+ messages in thread From: John W. M. Stevens @ 2003-02-25 17:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel http://www.sourcejudy.com/downloads/10minutes.htm ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <3E5ABBC1.8050203@us.ibm.com.suse.lists.linux.kernel>]
* Re: Horrible L2 cache effects from kernel compile [not found] <3E5ABBC1.8050203@us.ibm.com.suse.lists.linux.kernel> @ 2003-02-25 16:16 ` Andi Kleen 2003-02-27 3:24 ` Dave Hansen 2003-02-27 16:36 ` Jan Harkes [not found] ` <b3ekil$1cp$1@penguin.transmeta.com.suse.lists.linux.kernel> 1 sibling, 2 replies; 27+ messages in thread From: Andi Kleen @ 2003-02-25 16:16 UTC (permalink / raw) To: Dave Hansen; +Cc: linux-kernel, lse-tech, akpm Dave Hansen <haveblue@us.ibm.com> writes: > The surprising thing? d_lookup() accounts for 8% of the time spent > waiting for an L2 miss. The reason: Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes) Inode cache hash table entries: 65536 (order: 7, 524288 bytes) (1GB) I bet on your big memory box it is even worse. No cache in the world can cache that. If the hash function works well it'll guarantee little if any cache locality. Try the appended experimental patch. It replaces the hash table madness with relatively small fixed tables. The actual size is probably left for experimentation. I chose 64K for inode and 128K for dcache for now. That's more on the small side, but should work on all Linux boxes except very small embedded systems. Actually I suspect the inode hash could be made even smaller (iget should be rather infrequent even for NFS server loads because nfsd has its own cache) and everybody else should mostly only use the dcache. Other numbers may work better for your workload. Please test. It also replaces the cache-wasting double-pointer list_head hash buckets with single-pointer hash buckets. This cuts the hash table overhead in half. Also adds some prefetches which may help a bit for walking the lists. Work left to do: try to find a better hash function that isn't afraid of prime numbers. The patch is still experimental; in particular I had to change a few details from dcache-rcu (removed the is_bucket check, which apparently wasn't needed anymore). 
I have only tested on UP/x86-64 so far; it is possible that I added some new RCU races. Review needed. Also I replaced the max_dentries RCU race break hack with a fixed number now. Unfortunately, putting any number there is wrong: if I choose a big number (like I do currently) it can loop for a long time; if I choose a small number there is an artificial limit on hash table bucket lengths that may be hit in some extreme workloads. -Andi --- linux-2.5.63/include/linux/list.h-HASH 2003-02-17 23:56:16.000000000 +0100 +++ linux-2.5.63/include/linux/list.h 2003-02-25 16:03:09.000000000 +0100 @@ -319,6 +319,95 @@ for (pos = (head)->next, n = pos->next; pos != (head); \ pos = n, ({ read_barrier_depends(); 0;}), n = pos->next) +/* + * Double linked lists with a single pointer list head. + * Mostly useful for hash tables where the two pointer list head is + * too wasteful. + * You lose the ability to access the tail in O(1). + */ + +struct hlist_head { + struct hlist_node *first; +}; + +struct hlist_node { + struct hlist_node *next, **pprev; +}; + +#define HLIST_HEAD_INIT { first: NULL } +#define HLIST_HEAD(name) struct hlist_head name = { first: NULL } +#define INIT_HLIST_HEAD(ptr) ((ptr)->first = NULL) +#define INIT_HLIST_NODE(ptr) ((ptr)->next = NULL, (ptr)->pprev = NULL) + +/* This is really misnamed */ +static __inline__ int hnode_empty(struct hlist_node *h) +{ + return h->pprev==0; +} + +static __inline__ int hlist_empty(struct hlist_head *h) +{ + return !h->first; +} + +static __inline__ void hlist_del(struct hlist_node *n) +{ + /* The if could be avoided by a common dummy pprev target. */ + if (!n->pprev) + return; + *(n->pprev) = n->next; + if (n->next) + n->next->pprev = n->pprev; +} + +#define hlist_del_rcu hlist_del /* list_del_rcu is identical too? */ + +static __inline__ void hlist_del_init(struct hlist_node *n) +{ + /* The if could be avoided by a common dummy pprev target. 
*/ + if (!n->pprev) + return; + *(n->pprev) = n->next; + if (n->next) + n->next->pprev = n->pprev; + INIT_HLIST_NODE(n); +} + +static __inline__ void hlist_add_head(struct hlist_node *n, struct hlist_head *h) +{ + n->next = h->first; + if (h->first) + h->first->pprev = &n->next; + h->first = n; + n->pprev = &h->first; +} + +static __inline__ void hlist_add_head_rcu(struct hlist_node *n, struct hlist_head *h) +{ + n->next = h->first; + wmb(); + if (h->first) + h->first->pprev = &n->next; + h->first = n; + n->pprev = &h->first; +} + +/* next must be != NULL */ +static __inline__ void hlist_add_before(struct hlist_node *n, struct hlist_node *next) +{ + n->pprev = next->pprev; + n->next = next; + next->pprev = &n->next; + *(n->pprev) = n; +} + +#define hlist_entry(ptr, type, member) container_of(ptr,type,member) + +/* Cannot easily do prefetch unfortunately */ +#define hlist_for_each(pos, head) \ + for (pos = (head)->first; pos; \ + pos = pos->next) + #else #warning "don't include kernel headers in userspace" #endif /* __KERNEL__ */ --- linux-2.5.63/include/linux/dcache.h-HASH 2003-02-17 23:57:19.000000000 +0100 +++ linux-2.5.63/include/linux/dcache.h 2003-02-25 14:00:51.000000000 +0100 @@ -80,8 +80,8 @@ unsigned long d_move_count; /* to indicated moved dentry while lockless lookup */ struct inode * d_inode; /* Where the name belongs to - NULL is negative */ struct dentry * d_parent; /* parent directory */ - struct list_head * d_bucket; /* lookup hash bucket */ - struct list_head d_hash; /* lookup hash list */ + struct hlist_head * d_bucket; /* lookup hash bucket */ + struct hlist_node d_hash; /* lookup hash list */ struct list_head d_lru; /* LRU list */ struct list_head d_child; /* child of parent list */ struct list_head d_subdirs; /* our children */ @@ -171,7 +171,7 @@ static __inline__ void __d_drop(struct dentry * dentry) { dentry->d_vfs_flags |= DCACHE_UNHASHED; - list_del_rcu(&dentry->d_hash); + hlist_del_rcu(&dentry->d_hash); } static __inline__ void 
d_drop(struct dentry * dentry) @@ -198,7 +198,7 @@ extern struct dentry * d_splice_alias(struct inode *, struct dentry *); extern void shrink_dcache_sb(struct super_block *); extern void shrink_dcache_parent(struct dentry *); -extern void shrink_dcache_anon(struct list_head *); +extern void shrink_dcache_anon(struct hlist_head *); extern int d_invalidate(struct dentry *); /* only used at mount-time */ --- linux-2.5.63/include/linux/fs.h-HASH 2003-02-24 21:23:11.000000000 +0100 +++ linux-2.5.63/include/linux/fs.h 2003-02-25 13:52:49.000000000 +0100 @@ -353,7 +353,7 @@ }; struct inode { - struct list_head i_hash; + struct hlist_node i_hash; struct list_head i_list; struct list_head i_dentry; unsigned long i_ino; @@ -601,7 +601,7 @@ struct list_head s_dirty; /* dirty inodes */ struct list_head s_io; /* parked for writeback */ - struct list_head s_anon; /* anonymous dentries for (nfs) exporting */ + struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */ struct list_head s_files; struct block_device *s_bdev; --- linux-2.5.63/fs/hugetlbfs/inode.c-HASH 2003-02-17 23:56:55.000000000 +0100 +++ linux-2.5.63/fs/hugetlbfs/inode.c 2003-02-25 16:49:05.000000000 +0100 @@ -189,7 +189,7 @@ static void hugetlbfs_delete_inode(struct inode *inode) { - list_del_init(&inode->i_hash); + hlist_del_init(&inode->i_hash); list_del_init(&inode->i_list); inode->i_state |= I_FREEING; inodes_stat.nr_inodes--; @@ -208,7 +208,7 @@ { struct super_block *super_block = inode->i_sb; - if (list_empty(&inode->i_hash)) + if (hlist_empty(&inode->i_hash)) goto out_truncate; if (!(inode->i_state & (I_DIRTY|I_LOCK))) { @@ -223,7 +223,7 @@ /* write_inode_now() ? 
*/ inodes_stat.nr_unused--; - list_del_init(&inode->i_hash); + hlist_del_init(&inode->i_hash); out_truncate: list_del_init(&inode->i_list); inode->i_state |= I_FREEING; --- linux-2.5.63/fs/fs-writeback.c-HASH 2003-02-17 23:57:22.000000000 +0100 +++ linux-2.5.63/fs/fs-writeback.c 2003-02-25 16:45:19.000000000 +0100 @@ -90,7 +90,7 @@ * Only add valid (hashed) inodes to the superblock's * dirty list. Add blockdev inodes as well. */ - if (list_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode)) + if (hlist_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode)) goto out; /* --- linux-2.5.63/fs/inode.c-HASH 2003-02-17 23:57:19.000000000 +0100 +++ linux-2.5.63/fs/inode.c 2003-02-25 17:02:30.581636072 +0100 @@ -17,7 +17,7 @@ #include <linux/wait.h> #include <linux/hash.h> #include <linux/swap.h> -#include <linux/security.h> +#include <linux/security.h> /* * This is needed for the following functions: @@ -49,11 +49,9 @@ * Inode lookup is no longer as critical as it used to be: * most of the lookups are going to be through the dcache. */ -#define I_HASHBITS i_hash_shift -#define I_HASHMASK i_hash_mask - -static unsigned int i_hash_mask; -static unsigned int i_hash_shift; +#define I_NUMBUCKETS (8*1024) +#define I_HASHBITS 13 /* = log2(I_NUMBUCKETS) */ +#define I_HASHMASK (I_NUMBUCKETS-1) /* * Each inode can be on two separate lists. One is @@ -69,8 +67,8 @@ LIST_HEAD(inode_in_use); LIST_HEAD(inode_unused); -static struct list_head *inode_hashtable; -static LIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */ +static struct hlist_head *inode_hashtable; +static HLIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */ /* * A simple spinlock to protect the list manipulations. 
@@ -162,7 +160,7 @@ void inode_init_once(struct inode *inode) { memset(inode, 0, sizeof(*inode)); - INIT_LIST_HEAD(&inode->i_hash); + INIT_HLIST_NODE(&inode->i_hash); INIT_LIST_HEAD(&inode->i_data.clean_pages); INIT_LIST_HEAD(&inode->i_data.dirty_pages); INIT_LIST_HEAD(&inode->i_data.locked_pages); @@ -284,7 +282,7 @@ continue; invalidate_inode_buffers(inode); if (!atomic_read(&inode->i_count)) { - list_del_init(&inode->i_hash); + hlist_del_init(&inode->i_hash); list_del(&inode->i_list); list_add(&inode->i_list, dispose); inode->i_state |= I_FREEING; @@ -422,7 +420,7 @@ if (!can_unuse(inode)) continue; } - list_del_init(&inode->i_hash); + hlist_del_init(&inode->i_hash); list_move(&inode->i_list, &freeable); inode->i_state |= I_FREEING; nr_pruned++; @@ -460,50 +458,42 @@ * by hand after calling find_inode now! This simplifies iunique and won't * add any additional branch in the common code. */ -static struct inode * find_inode(struct super_block * sb, struct list_head *head, int (*test)(struct inode *, void *), void *data) +static struct inode * find_inode(struct super_block * sb, struct hlist_head *head, int (*test)(struct inode *, void *), void *data) { - struct list_head *tmp; + struct hlist_node *node; struct inode * inode; - tmp = head; - for (;;) { - tmp = tmp->next; - inode = NULL; - if (tmp == head) - break; - inode = list_entry(tmp, struct inode, i_hash); + hlist_for_each (node, head) { + prefetch(node->next); + inode = hlist_entry(node, struct inode, i_hash); if (inode->i_sb != sb) continue; if (!test(inode, data)) continue; break; - } - return inode; + } + return node ? inode : NULL; } /* * find_inode_fast is the fast path version of find_inode, see the comment at * iget_locked for details. 
*/ -static struct inode * find_inode_fast(struct super_block * sb, struct list_head *head, unsigned long ino) +static struct inode * find_inode_fast(struct super_block * sb, struct hlist_head *head, unsigned long ino) { - struct list_head *tmp; + struct hlist_node *node; struct inode * inode; - tmp = head; - for (;;) { - tmp = tmp->next; - inode = NULL; - if (tmp == head) - break; - inode = list_entry(tmp, struct inode, i_hash); + hlist_for_each (node, head) { + prefetch(node->next); + inode = list_entry(node, struct inode, i_hash); if (inode->i_ino != ino) continue; if (inode->i_sb != sb) continue; break; } - return inode; + return node ? inode : NULL; } /** @@ -553,7 +543,7 @@ * We no longer cache the sb_flags in i_flags - see fs.h * -- rmk@arm.uk.linux.org */ -static struct inode * get_new_inode(struct super_block *sb, struct list_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data) +static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data) { struct inode * inode; @@ -570,7 +560,7 @@ inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); - list_add(&inode->i_hash, head); + hlist_add_head(&inode->i_hash, head); inode->i_state = I_LOCK|I_NEW; spin_unlock(&inode_lock); @@ -603,7 +593,7 @@ * get_new_inode_fast is the fast path version of get_new_inode, see the * comment at iget_locked for details. 
*/ -static struct inode * get_new_inode_fast(struct super_block *sb, struct list_head *head, unsigned long ino) +static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_head *head, unsigned long ino) { struct inode * inode; @@ -618,7 +608,7 @@ inode->i_ino = ino; inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); - list_add(&inode->i_hash, head); + hlist_add_head(&inode->i_hash, head); inode->i_state = I_LOCK|I_NEW; spin_unlock(&inode_lock); @@ -670,7 +660,7 @@ { static ino_t counter = 0; struct inode *inode; - struct list_head * head; + struct hlist_head * head; ino_t res; spin_lock(&inode_lock); retry: @@ -724,7 +714,7 @@ * Note, @test is called with the inode_lock held, so can't sleep. */ static inline struct inode *ifind(struct super_block *sb, - struct list_head *head, int (*test)(struct inode *, void *), + struct hlist_head *head, int (*test)(struct inode *, void *), void *data) { struct inode *inode; @@ -756,7 +746,7 @@ * Otherwise NULL is returned. 
*/ static inline struct inode *ifind_fast(struct super_block *sb, - struct list_head *head, unsigned long ino) + struct hlist_head *head, unsigned long ino) { struct inode *inode; @@ -794,7 +784,7 @@ struct inode *ilookup5(struct super_block *sb, unsigned long hashval, int (*test)(struct inode *, void *), void *data) { - struct list_head *head = inode_hashtable + hash(sb, hashval); + struct hlist_head *head = inode_hashtable + hash(sb, hashval); return ifind(sb, head, test, data); } @@ -816,7 +806,7 @@ */ struct inode *ilookup(struct super_block *sb, unsigned long ino) { - struct list_head *head = inode_hashtable + hash(sb, ino); + struct hlist_head *head = inode_hashtable + hash(sb, ino); return ifind_fast(sb, head, ino); } @@ -848,7 +838,7 @@ int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data) { - struct list_head *head = inode_hashtable + hash(sb, hashval); + struct hlist_head *head = inode_hashtable + hash(sb, hashval); struct inode *inode; inode = ifind(sb, head, test, data); @@ -881,7 +871,7 @@ */ struct inode *iget_locked(struct super_block *sb, unsigned long ino) { - struct list_head *head = inode_hashtable + hash(sb, ino); + struct hlist_head *head = inode_hashtable + hash(sb, ino); struct inode *inode; inode = ifind_fast(sb, head, ino); @@ -907,11 +897,11 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval) { - struct list_head *head = &anon_hash_chain; + struct hlist_head *head = &anon_hash_chain; if (inode->i_sb) head = inode_hashtable + hash(inode->i_sb, hashval); spin_lock(&inode_lock); - list_add(&inode->i_hash, head); + hlist_add_head(&inode->i_hash, head); spin_unlock(&inode_lock); } @@ -925,7 +915,7 @@ void remove_inode_hash(struct inode *inode) { spin_lock(&inode_lock); - list_del_init(&inode->i_hash); + hlist_del_init(&inode->i_hash); spin_unlock(&inode_lock); } @@ -933,7 +923,7 @@ { struct super_operations *op = inode->i_sb->s_op; - list_del_init(&inode->i_hash); + 
hlist_del_init(&inode->i_hash); list_del_init(&inode->i_list); inode->i_state|=I_FREEING; inodes_stat.nr_inodes--; @@ -962,7 +952,7 @@ { struct super_block *sb = inode->i_sb; - if (!list_empty(&inode->i_hash)) { + if (!hnode_empty(&inode->i_hash)) { if (!(inode->i_state & (I_DIRTY|I_LOCK))) { list_del(&inode->i_list); list_add(&inode->i_list, &inode_unused); @@ -974,7 +964,7 @@ write_inode_now(inode, 1); spin_lock(&inode_lock); inodes_stat.nr_unused--; - list_del_init(&inode->i_hash); + hlist_del_init(&inode->i_hash); } list_del_init(&inode->i_list); inode->i_state|=I_FREEING; @@ -1220,48 +1210,18 @@ */ void __init inode_init(unsigned long mempages) { - struct list_head *head; - unsigned long order; - unsigned int nr_hash; int i; - for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++) init_waitqueue_head(&i_wait_queue_heads[i].wqh); - mempages >>= (14 - PAGE_SHIFT); - mempages *= sizeof(struct list_head); - for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++) - ; - - do { - unsigned long tmp; - - nr_hash = (1UL << order) * PAGE_SIZE / - sizeof(struct list_head); - i_hash_mask = (nr_hash - 1); - - tmp = nr_hash; - i_hash_shift = 0; - while ((tmp >>= 1UL) != 0UL) - i_hash_shift++; - - inode_hashtable = (struct list_head *) - __get_free_pages(GFP_ATOMIC, order); - } while (inode_hashtable == NULL && --order >= 0); - - printk("Inode-cache hash table entries: %d (order: %ld, %ld bytes)\n", - nr_hash, order, (PAGE_SIZE << order)); - + inode_hashtable = (struct hlist_head *) + __get_free_pages(GFP_ATOMIC, + get_order(I_NUMBUCKETS*sizeof(struct hlist_head))); if (!inode_hashtable) panic("Failed to allocate inode hash table\n"); - head = inode_hashtable; - i = nr_hash; - do { - INIT_LIST_HEAD(head); - head++; - i--; - } while (i); + for (i = 0; i < I_NUMBUCKETS; i++) + INIT_HLIST_HEAD(&inode_hashtable[i]); /* inode slab cache */ inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode), --- linux-2.5.63/fs/dcache.c-HASH 2003-02-17 23:55:55.000000000 
+0100 +++ linux-2.5.63/fs/dcache.c 2003-02-25 17:02:12.772343496 +0100 @@ -41,22 +41,32 @@ * * This hash-function tries to avoid losing too many bits of hash * information, yet avoid using a prime hash-size or similar. - */ -#define D_HASHBITS d_hash_shift -#define D_HASHMASK d_hash_mask + * + * AK: using a prime hash with a prime near some power-of-two would be + * likely better. Any hash guru interested? Same for the inode hash. + * + * We probably should have CONFIG_SMALL and CONFIG_LARGE for this. + * Don't scale it by memory size, otherwise big systems are eaten + * by the cache misses. + * + * Sizes need to be power-of-two for now. + */ +#define D_NUMBUCKETS (16*1024) /* 64K RAM on a 32bit machine */ +#define D_HASHBITS 14 /* = log2(D_NUMBUCKETS) */ +#define D_HASHMASK (D_NUMBUCKETS-1) -static unsigned int d_hash_mask; -static unsigned int d_hash_shift; -static struct list_head *dentry_hashtable; +static struct hlist_head *dentry_hashtable; static LIST_HEAD(dentry_unused); static int max_dentries; static void * hashtable_end; +#if 0 static inline int is_bucket(void * addr) { return ((addr < (void *)dentry_hashtable) || (addr > hashtable_end) ? 0 : 1); } +#endif /* Statistics gathering. */ struct dentry_stat_t dentry_stat = { @@ -292,6 +302,7 @@ while (next != head) { tmp = next; next = tmp->next; + prefetch(next); alias = list_entry(tmp, struct dentry, d_alias); if (!d_unhashed(alias)) { if (alias->d_flags & DCACHE_DISCONNECTED) @@ -378,6 +389,7 @@ if (tmp == &dentry_unused) break; list_del_init(tmp); + prefetch(dentry_unused.prev); dentry_stat.nr_unused--; dentry = list_entry(tmp, struct dentry, d_lru); @@ -603,15 +615,15 @@ * done under dcache_lock. 
* */ -void shrink_dcache_anon(struct list_head *head) +void shrink_dcache_anon(struct hlist_head *head) { - struct list_head *lp; + struct hlist_node *lp; int found; do { found = 0; spin_lock(&dcache_lock); - list_for_each(lp, head) { - struct dentry *this = list_entry(lp, struct dentry, d_hash); + hlist_for_each(lp, head) { + struct dentry *this = hlist_entry(lp, struct dentry, d_hash); list_del(&this->d_lru); /* don't add non zero d_count dentries @@ -727,7 +739,7 @@ dentry->d_mounted = 0; dentry->d_cookie = NULL; dentry->d_bucket = NULL; - INIT_LIST_HEAD(&dentry->d_hash); + INIT_HLIST_NODE(&dentry->d_hash); INIT_LIST_HEAD(&dentry->d_lru); INIT_LIST_HEAD(&dentry->d_subdirs); INIT_LIST_HEAD(&dentry->d_alias); @@ -797,7 +809,7 @@ return res; } -static inline struct list_head * d_hash(struct dentry * parent, unsigned long hash) +static inline struct hlist_head * d_hash(struct dentry * parent, unsigned long hash) { hash += (unsigned long) parent / L1_CACHE_BYTES; hash = hash ^ (hash >> D_HASHBITS); @@ -860,7 +872,7 @@ res->d_flags |= DCACHE_DISCONNECTED; res->d_vfs_flags &= ~DCACHE_UNHASHED; list_add(&res->d_alias, &inode->i_dentry); - list_add(&res->d_hash, &inode->i_sb->s_anon); + hlist_add_head(&res->d_hash, &inode->i_sb->s_anon); spin_unlock(&res->d_lock); } inode = NULL; /* don't drop reference */ @@ -947,21 +959,24 @@ unsigned int len = name->len; unsigned int hash = name->hash; const unsigned char *str = name->name; - struct list_head *head = d_hash(parent,hash); + struct hlist_head *head = d_hash(parent,hash); struct dentry *found = NULL; - struct list_head *tmp; + struct hlist_node *node; int lookup_count = 0; rcu_read_lock(); /* lookup is terminated when flow reaches any bucket head */ - for(tmp = head->next; !is_bucket(tmp); tmp = tmp->next) { + /* RED-PEN: is_bucket removed for now. hopefully not needed anymore. 
*/ + hlist_for_each (node, head) { struct dentry *dentry; unsigned long move_count; struct qstr * qstr; + prefetch(node->next); + smp_read_barrier_depends(); - dentry = list_entry(tmp, struct dentry, d_hash); + dentry = hlist_entry(node, struct dentry, d_hash); /* if lookup ends up in a different bucket * due to concurrent rename, fail it @@ -1031,7 +1046,8 @@ unsigned long dent_addr = (unsigned long) dentry; unsigned long min_addr = PAGE_OFFSET; unsigned long align_mask = 0x0F; - struct list_head *base, *lhp; + struct hlist_head *base; + struct hlist_node *lhp; if (dent_addr < min_addr) goto out; @@ -1047,12 +1063,13 @@ goto out; spin_lock(&dcache_lock); - lhp = base = d_hash(dparent, dentry->d_name.hash); - while ((lhp = lhp->next) != base) { + base = d_hash(dparent, dentry->d_name.hash); + hlist_for_each(lhp,base) { + prefetch(lhp->next); /* read_barrier_depends() not required for d_hash list * as it is parsed under dcache_lock */ - if (dentry == list_entry(lhp, struct dentry, d_hash)) { + if (dentry == hlist_entry(lhp, struct dentry, d_hash)) { dget(dentry); spin_unlock(&dcache_lock); return 1; @@ -1113,12 +1130,11 @@ void d_rehash(struct dentry * entry) { - struct list_head *list = d_hash(entry->d_parent, entry->d_name.hash); + struct hlist_head *list = d_hash(entry->d_parent, entry->d_name.hash); spin_lock(&dcache_lock); - if (!list_empty(&entry->d_hash) && !d_unhashed(entry)) BUG(); entry->d_vfs_flags &= ~DCACHE_UNHASHED; entry->d_bucket = list; - list_add_rcu(&entry->d_hash, list); + hlist_add_head_rcu(&entry->d_hash, list); spin_unlock(&dcache_lock); } @@ -1171,10 +1187,6 @@ * We could be nicer about the deleted file, and let it show * up under the name it got deleted rather than the name that * deleted it. - * - * Careful with the hash switch. The hash switch depends on - * the fact that any list-entry can be a head of the list. - * Think about it. 
*/ /** @@ -1197,8 +1209,8 @@ /* Move the dentry to the target hash queue, if on different bucket */ if (dentry->d_bucket != target->d_bucket) { dentry->d_bucket = target->d_bucket; - list_del_rcu(&dentry->d_hash); - list_add_rcu(&dentry->d_hash, &target->d_hash); + hlist_del(&dentry->d_hash); + hlist_add_head_rcu(&dentry->d_hash, target->d_bucket); } /* Unhash the target: dput() will then get rid of it */ @@ -1281,6 +1293,7 @@ continue; } parent = dentry->d_parent; + prefetch(parent); namelen = dentry->d_name.len; buflen -= namelen + 1; if (buflen < 0) @@ -1500,9 +1513,6 @@ static void __init dcache_init(unsigned long mempages) { - struct list_head *d; - unsigned long order; - unsigned int nr_hash; int i; /* @@ -1522,48 +1532,20 @@ panic("Cannot create dentry cache"); /* approximate maximum number of dentries in one hash bucket */ - max_dentries = (mempages * (PAGE_SIZE / sizeof(struct dentry))); + /* RED-PEN rather dubious. looping that long won't be good. */ + max_dentries = 0xfffff; /* XXX */ set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory); -#if PAGE_SHIFT < 13 - mempages >>= (13 - PAGE_SHIFT); -#endif - mempages *= sizeof(struct list_head); - for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++) - ; - - do { - unsigned long tmp; - - nr_hash = (1UL << order) * PAGE_SIZE / - sizeof(struct list_head); - d_hash_mask = (nr_hash - 1); - - tmp = nr_hash; - d_hash_shift = 0; - while ((tmp >>= 1UL) != 0UL) - d_hash_shift++; - - dentry_hashtable = (struct list_head *) - __get_free_pages(GFP_ATOMIC, order); - } while (dentry_hashtable == NULL && --order >= 0); - - printk(KERN_INFO "Dentry cache hash table entries: %d (order: %ld, %ld bytes)\n", - nr_hash, order, (PAGE_SIZE << order)); - + dentry_hashtable = (struct hlist_head *) + __get_free_pages(GFP_ATOMIC, + get_order(D_NUMBUCKETS * sizeof(struct hlist_head))); if (!dentry_hashtable) panic("Failed to allocate dcache hash table\n"); - hashtable_end = dentry_hashtable + nr_hash; - - d = dentry_hashtable; 
- i = nr_hash; - do { - INIT_LIST_HEAD(d); - d++; - i--; - } while (i); + for (i = 0; i < D_NUMBUCKETS; i++) + INIT_HLIST_HEAD(&dentry_hashtable[i]); + hashtable_end = dentry_hashtable + i; } /* SLAB cache for __getname() consumers */ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 16:16 ` Andi Kleen
@ 2003-02-27  3:24   ` Dave Hansen
  2003-02-27 16:36   ` Jan Harkes
  1 sibling, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2003-02-27  3:24 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, lse-tech, akpm

Andi Kleen wrote:
> Other numbers may work better for your workload. Please test.

The whole post:
http://marc.theaimsgroup.com/?l=linux-kernel&m=104619044800608&w=2

The number of cycles in d_lookup increased by 10% with your patch.
But it looks like lots of stuff was moving around, so I wouldn't put
too much stock in it.

 % change    2.5.62           2.5.62-ak-dc
 in cycles   cycles(%oftot)   cycles(%oftot)
 +20.685%     6251(0.397%)     9834(0.603%)  ext3_get_inode_loc
 +13.296%    16183(1.027%)    21863(1.341%)  try_to_wake_up
 +10.560%   130566(8.283%)   166867(10.239%) d_lookup
 +10.202%    12525(0.795%)    15892(0.975%)  filemap_nopage
  +5.991%    22127(1.404%)    25793(1.583%)  schedule
  +5.581%     8705(0.552%)    10064(0.618%)  run_timer_softirq
  +4.246%     7922(0.503%)     8917(0.547%)  current_kernel_time
  +4.116%     9196(0.583%)    10324(0.633%)  __wake_up
  +3.739%    21650(1.373%)    24123(1.480%)  __find_get_block
  +3.631%     7226(0.458%)     8034(0.493%)  do_page_cache_readahead
  +3.287%    72668(4.610%)    80239(4.923%)  find_get_page
  +2.857%    11435(0.725%)    12518(0.768%)  strnlen_user
  -2.184%    35107(2.227%)    34745(2.132%)  .text.lock.inode
  -2.200%    17129(1.087%)    16947(1.040%)  __fput
  -2.706%    28213(1.790%)    27632(1.695%)  start_this_handle
  -2.823%    36539(2.318%)    35703(2.191%)  smp_apic_timer_interrupt
  -3.668%    26074(1.654%)    25050(1.537%)  load_balance
  -3.844%     8367(0.531%)     8010(0.491%)  find_vma
  -4.340%    46642(2.959%)    44211(2.713%)  scheduler_tick
  -4.650%     9965(0.632%)     9387(0.576%)  default_idle
  -4.667%    14213(0.902%)    13384(0.821%)  do_wp_page
  -4.728%    13603(0.863%)    12794(0.785%)  mark_offset_tsc
  -4.815%    39748(2.522%)    37319(2.290%)  __blk_queue_bounce
  -6.545%    44437(2.819%)    40298(2.473%)  x86_profile_hook
  -6.991%    36113(2.291%)    32457(1.991%)  __copy_from_user_ll
  -7.660%    16275(1.032%)    14432(0.886%)  may_open
 -13.539%    28492(1.807%)    22432(1.376%)  get_empty_filp
 -16.257%    10334(0.656%)     7696(0.472%)  file_move

-- 
Dave Hansen
haveblue@us.ibm.com
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 16:16 ` Andi Kleen
  2003-02-27  3:24   ` Dave Hansen
@ 2003-02-27 16:36   ` Jan Harkes
  1 sibling, 0 replies; 27+ messages in thread
From: Jan Harkes @ 2003-02-27 16:36 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel

On Tue, Feb 25, 2003 at 05:16:31PM +0100, Andi Kleen wrote:
> Dave Hansen <haveblue@us.ibm.com> writes:
>
> > The surprising thing?  d_lookup() accounts for 8% of the time spent
> > waiting for an L2 miss.
>
> The reason:
>
> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
> Inode cache hash table entries: 65536 (order: 7, 524288 bytes)
>
> (1GB) I bet on your big memory box it is even worse. No cache
> in the world can cache that. If the hash function works well it'll
> guarantee few if any cache locality.

Ehh, I read that as 1MB for the dentry cache and 0.5MB for the inode
cache, which is several orders of magnitude less than 1GB. Of course
these are only the pointers to the chains of dentries and inodes, which
themselves take up far more than just 8 bytes per entry. And once you
have to start traversing those hash chains you're toast.

> Try the appended experimental patch. It replaces the hash table madness
> with relatively small fixed tables. The actual size is probably left
> for experimentation. I choose 64K for inode and 128K for dcache for now.

And as a result you are probably just making any given hash chain
longer, and walking the chain is painful because the referenced objects
are scattered throughout memory.

Another optimization is to leave the tables big (scaled up with memory
size), but try to keep the chains as short as possible; f.i. when adding
a new entry to a non-empty chain, drop the old entry if it isn't used.
That would give a lot more control over the actual size of the dentry
and inode caches, so that updatedb runs won't push these caches to
completely take over the VM.

Jan
* Re: Horrible L2 cache effects from kernel compile
       [not found] ` <20030225170546.GA23772@morningstar.nowhere.lie.suse.lists.linux.kernel>
@ 2003-02-25 17:20   ` Andi Kleen
  2003-02-26 18:22     ` Jamie Lokier
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2003-02-25 17:20 UTC (permalink / raw)
To: John W. M. Stevens; +Cc: linux-kernel, lse-tech

"John W. M. Stevens" <john@betelgeuse.us> writes:

> http://www.sourcejudy.com/downloads/10minutes.htm

Feel free to code it up. If you did, I'm sure someone would be willing
to test it on large boxes too.

However, with RCU in the equation, locking may get very interesting...
Hash tables have the advantage that they're simple enough for lockless
tricks; balanced trees are likely not so lucky.

-Andi (who took a look at Judy some time ago, but it looked horribly
complicated, even more so than skiplists)
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 17:20 ` Andi Kleen
@ 2003-02-26 18:22   ` Jamie Lokier
  0 siblings, 0 replies; 27+ messages in thread
From: Jamie Lokier @ 2003-02-26 18:22 UTC (permalink / raw)
To: Andi Kleen; +Cc: John W. M. Stevens, linux-kernel, lse-tech

Andi Kleen wrote:
> "John W. M. Stevens" <john@betelgeuse.us> writes:
>
> > http://www.sourcejudy.com/downloads/10minutes.htm
>
> Feel free to code it up. If you did, I'm sure someone would be willing
> to test it on large boxes too.
>
> However, with RCU in the equation, locking may get very interesting...
> Hash tables have the advantage that they're simple enough for lockless
> tricks; balanced trees are likely not so lucky.
>
> -Andi (who took a look at Judy some time ago, but it looked horribly
> complicated, even more so than skiplists)

Note that the Judy trees paper has some text deleted, which begins
"{some Judy patent applications}... {Remainder deleted, sorry, you
don't need to read this to understand the released software.}".

Maybe some parts of the algorithm are patent-pending in the US?

-- Jamie
* Re: Horrible L2 cache effects from kernel compile
@ 2003-02-25 18:37 Manfred Spraul
  2003-02-25 18:41 ` Dave Hansen
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Manfred Spraul @ 2003-02-25 18:37 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, Dave Hansen

Andi wrote:
> The reason:
>
> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
> Inode cache hash table entries: 65536 (order: 7, 524288 bytes)
>
> (1GB) I bet on your big memory box it is even worse. No cache
> in the world can cache that.
[snip]
> Try the appended experimental patch. It replaces the hash table madness
> with relatively small fixed tables.

Are you sure that this will help?

With a smaller table you might cause fewer cache misses for the table
lookup itself, but instead you get longer hash chains. Walking linked
lists probably causes more cache line misses than the single array
lookup.

Dave, how many entries are in the dcache?

Btw, has anyone tried to replace the global dcache with something
local, perhaps a tree instead of d_child, and then lookup in
d_child_tree?

-- 
    Manfred
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 18:37 Manfred Spraul
@ 2003-02-25 18:41 ` Dave Hansen
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2003-02-25 18:41 UTC (permalink / raw)
To: Manfred Spraul; +Cc: Andi Kleen, linux-kernel

Manfred Spraul wrote:
> Dave, how many entries are in the dcache?

from slabinfo:
dentry_cache    35157  35352   160  1473  1473  1 : 248 124

The full result set:
http://www.sr71.net/prof/kernbench/run-kernbench-2.5.62.13/

-- 
Dave Hansen
haveblue@us.ibm.com
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 18:37 Manfred Spraul
  2003-02-25 18:41 ` Dave Hansen
@ 2003-02-25 18:42 ` Andi Kleen
  2003-02-25 19:29 ` Martin J. Bligh
  2003-02-25 22:18 ` Linus Torvalds
  3 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2003-02-25 18:42 UTC (permalink / raw)
To: Manfred Spraul; +Cc: Andi Kleen, linux-kernel, Dave Hansen

> Are you sure that this will help?
> With a smaller table, you might cause fewer cache misses for the table
> lookup. Instead you get longer hash chains. Walking linked lists
> probably causes more cache line misses than the single array lookup.

That's true when the hash table has a reasonable size. But with 1MB and
bigger hash tables you are guaranteed to get a cache miss on most head
bucket accesses, no matter how many dentries you have. The hash
function is actively working against your cache here.

The dentries themselves are actually more likely to fit into the cache
because they don't get artificially spread out over the cache space.

-Andi
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 18:37 Manfred Spraul
  2003-02-25 18:41 ` Dave Hansen
  2003-02-25 18:42 ` Andi Kleen
@ 2003-02-25 19:29 ` Martin J. Bligh
  2003-02-25 22:18 ` Linus Torvalds
  3 siblings, 0 replies; 27+ messages in thread
From: Martin J. Bligh @ 2003-02-25 19:29 UTC (permalink / raw)
To: Manfred Spraul, Andi Kleen; +Cc: linux-kernel, Dave Hansen

> Btw, has anyone tried to replace the global dcache with something
> local, perhaps a tree instead of d_child, and then lookup in
> d_child_tree?

I don't think anyone does this, but I've been considering doing it for
over a year now ... I just feel there's bound to be some inherent
problem, or someone would have done it already. But there's only one
real way to find that out ;-)
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 18:37 Manfred Spraul
                   ` (2 preceding siblings ...)
  2003-02-25 19:29 ` Martin J. Bligh
@ 2003-02-25 22:18 ` Linus Torvalds
  2003-03-03 19:03   ` Benjamin LaHaise
  3 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2003-02-25 22:18 UTC (permalink / raw)
To: linux-kernel

In article <3E5BB7EE.5090301@colorfullife.com>,
Manfred Spraul <manfred@colorfullife.com> wrote:
>
> Are you sure that this will help?

It might, under some loads. However, I don't think it's a long-term
solution, since the hashing will mean that for any reasonably
spread-out load you _will_ always walk all the dentries.

So the long-term solution is to either use a local lookup (which we
ended up doing for the page cache) _or_ to limit the number of dentries
themselves some way. The latter sounds like a bad idea.

> Btw, has anyone tried to replace the global dcache with something
> local, perhaps a tree instead of d_child, and then lookup in
> d_child_tree?

I'd love to see somebody try. The main worry is the overhead required
per directory dentry and keeping it scalable. The dentry tree _will_ be
quickly populated, and one common case is a few huge directories, yet
at the same time most dentries won't have any children at all (or very
few of them).

Right now the "child" list is just a simple linked list, and changing
that to something more complex might make it possible to get rid of the
hash entirely. But increasing the size of individual dentries is a bad
idea, so it would have to be something fairly smart.

		Linus
* Re: Horrible L2 cache effects from kernel compile
  2003-02-25 22:18 ` Linus Torvalds
@ 2003-03-03 19:03   ` Benjamin LaHaise
  2003-03-03 19:13     ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Benjamin LaHaise @ 2003-03-03 19:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Tue, Feb 25, 2003 at 10:18:09PM +0000, Linus Torvalds wrote:
> Right now the "child" list is just a simple linked list, and changing
> that to something more complex might make it possible to get rid of the
> hash entirely. But increasing the size of individual dentries is a bad
> idea, so it would have to be something fairly smart.

Part of it is that some of the dentry is simply too bloated. At 160
bytes, there must be something we can prune:

	qstr.len	- if anyone cares about 4GB-long dentry names,
			  they probably have other problems; could be
			  a short
	d_lock		- 1 bit out of 4 bytes
	d_vfs_flags	- 2 bits out of 4 bytes
	d_flags		- 3 bits out of 4 bytes
	d_move_count	- rcu; is it ever used on UP?
	d_time		- only used by network filesystems
	*d_sb		- rarely used, accessible via inode
	*d_fsdata	- mostly used by exotic/network filesystems
	*d_cookie	- only used when oprofile is active

In short, almost a third can be configured out of existence for some
setups, and others are candidates for being moved into filesystem
specific data.

		-ben
-- 
Junk email? <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
* Re: Horrible L2 cache effects from kernel compile
  2003-03-03 19:03 ` Benjamin LaHaise
@ 2003-03-03 19:13   ` Linus Torvalds
  2003-03-03 23:58     ` Alan Cox
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2003-03-03 19:13 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: linux-kernel

On Mon, 3 Mar 2003, Benjamin LaHaise wrote:
>
> Part of it is that some of the dentry is simply too bloated. At
> 160 bytes, there must be something we can prune:

The thing is, the size of it doesn't really matter. The bad effects
come not from the size, but from the bad behaviour of the lookup
algorithm, which would be exactly the same even if the dentry was
_half_ the size it is now.

In other words, the size of the dentry only matters from a memory usage
standpoint, and I don't think we have any cause to believe that
dentries really hurt our global memory usage (we've _often_ had the
bug that we don't shrink the dentry cache quickly enough, which is a
different problem: keeping too many of them around. That should be
largely fixed in current kernels).

So I don't think there is any real reason to worry about the size of
the dentry itself. Yes, you could make it smaller (you could remove the
inline string from it, for example, and you could avoid allocating it
at cacheline boundaries, both of which make it a _lot_ smaller than
just trying to save a few bits), but it's not clear either of those
bigger decisions is a worthwhile tradeoff.

		Linus
* Re: Horrible L2 cache effects from kernel compile
  2003-03-03 19:13 ` Linus Torvalds
@ 2003-03-03 23:58   ` Alan Cox
  2003-03-03 22:57     ` Andrew Morton
                       ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Alan Cox @ 2003-03-03 23:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Benjamin LaHaise, Linux Kernel Mailing List

On Mon, 2003-03-03 at 19:13, Linus Torvalds wrote:
> dentry itself. Yes, you could make it smaller (you could remove the
> inline string from it, for example, and you could avoid allocating it at

How about at least making the inline string align to the slab alignment
so we don't waste space?
* Re: Horrible L2 cache effects from kernel compile
  2003-03-03 23:58 ` Alan Cox
@ 2003-03-03 22:57   ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-03-03 22:57 UTC (permalink / raw)
To: Alan Cox; +Cc: torvalds, bcrl, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> On Mon, 2003-03-03 at 19:13, Linus Torvalds wrote:
> > dentry itself. Yes, you could make it smaller (you could remove the
> > inline string from it, for example, and you could avoid allocating it at
>
> How about at least making the inline string align to the slab alignment
> so we don't waste space?

That change was made a couple of months back.
* Re: Horrible L2 cache effects from kernel compile
  2003-03-03 23:58 ` Alan Cox
  2003-03-03 22:57   ` Andrew Morton
@ 2003-03-03 23:01   ` Benjamin LaHaise
  2003-03-03 23:09   ` Linus Torvalds
  2003-03-05 20:19   ` dean gaudet
  3 siblings, 0 replies; 27+ messages in thread
From: Benjamin LaHaise @ 2003-03-03 23:01 UTC (permalink / raw)
To: Alan Cox; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, Mar 03, 2003 at 11:58:27PM +0000, Alan Cox wrote:
> On Mon, 2003-03-03 at 19:13, Linus Torvalds wrote:
> > dentry itself. Yes, you could make it smaller (you could remove the
> > inline string from it, for example, and you could avoid allocating it at
>
> How about at least making the inline string align to the slab alignment
> so we don't waste space?

Also, making the qstr length an unsigned short or u8 would gain 2 or 3
bytes of inline string length essentially for free.

		-ben
-- 
Junk email? <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
* Re: Horrible L2 cache effects from kernel compile
  2003-03-03 23:58 ` Alan Cox
  2003-03-03 22:57   ` Andrew Morton
  2003-03-03 23:01   ` Benjamin LaHaise
@ 2003-03-03 23:09   ` Linus Torvalds
  2003-03-05 20:19   ` dean gaudet
  3 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2003-03-03 23:09 UTC (permalink / raw)
To: Alan Cox; +Cc: Benjamin LaHaise, Linux Kernel Mailing List

On 3 Mar 2003, Alan Cox wrote:
> On Mon, 2003-03-03 at 19:13, Linus Torvalds wrote:
> > dentry itself. Yes, you could make it smaller (you could remove the
> > inline string from it, for example, and you could avoid allocating it at
>
> How about at least making the inline string align to the slab alignment
> so we don't waste space?

2.5.x does that already:

	#define DNAME_INLINE_LEN \
		(sizeof(struct dentry)-offsetof(struct dentry,d_iname))

with DNAME_INLINE_LEN_MIN being exactly what the name says: the minimum
size.

		Linus
* Re: Horrible L2 cache effects from kernel compile
  2003-03-03 23:58 ` Alan Cox
                     ` (2 preceding siblings ...)
  2003-03-03 23:09   ` Linus Torvalds
@ 2003-03-05 20:19   ` dean gaudet
  3 siblings, 0 replies; 27+ messages in thread
From: dean gaudet @ 2003-03-05 20:19 UTC (permalink / raw)
To: Alan Cox; +Cc: Linus Torvalds, Benjamin LaHaise, Linux Kernel Mailing List

On Mon, 3 Mar 2003, Alan Cox wrote:
> On Mon, 2003-03-03 at 19:13, Linus Torvalds wrote:
> > dentry itself. Yes, you could make it smaller (you could remove the
> > inline string from it, for example, and you could avoid allocating it at
>
> How about at least making the inline string align to the slab alignment
> so we don't waste space?

What would be best is the hash chaining pointers and the first N bytes
of the inline name in the same cacheline. For 32B cache lines it looks
like it's at least two misses per chain entry now. (er, well, I'm
looking at 2.4 source.)

-dean
end of thread, other threads:[~2003-03-05 20:09 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-25  0:41 Horrible L2 cache effects from kernel compile Dave Hansen
2003-02-25  0:59 ` William Lee Irwin III
2003-02-25  1:01   ` Dave Hansen
2003-02-25  3:15   ` John Levon
2003-02-25  3:15     ` William Lee Irwin III
2003-02-25  3:35       ` Andrew Morton
2003-02-25  4:13         ` Martin J. Bligh
2003-02-25 11:57     ` John Levon
2003-02-25  2:31 ` Linus Torvalds
2003-02-25 17:05   ` John W. M. Stevens
     [not found] <3E5ABBC1.8050203@us.ibm.com.suse.lists.linux.kernel>
2003-02-25 16:16 ` Andi Kleen
2003-02-27  3:24   ` Dave Hansen
2003-02-27 16:36   ` Jan Harkes
     [not found] <b3ekil$1cp$1@penguin.transmeta.com.suse.lists.linux.kernel>
     [not found] ` <20030225170546.GA23772@morningstar.nowhere.lie.suse.lists.linux.kernel>
2003-02-25 17:20   ` Andi Kleen
2003-02-26 18:22     ` Jamie Lokier
2003-02-25 18:37 Manfred Spraul
2003-02-25 18:41 ` Dave Hansen
2003-02-25 18:42 ` Andi Kleen
2003-02-25 19:29 ` Martin J. Bligh
2003-02-25 22:18 ` Linus Torvalds
2003-03-03 19:03   ` Benjamin LaHaise
2003-03-03 19:13     ` Linus Torvalds
2003-03-03 23:58       ` Alan Cox
2003-03-03 22:57         ` Andrew Morton
2003-03-03 23:01         ` Benjamin LaHaise
2003-03-03 23:09         ` Linus Torvalds
2003-03-05 20:19         ` dean gaudet