linux-kernel.vger.kernel.org archive mirror
* 10.31 second kernel compile
@ 2002-03-13  8:52 Anton Blanchard
  2002-03-13 14:44 ` Martin J. Bligh
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
  0 siblings, 2 replies; 137+ messages in thread
From: Anton Blanchard @ 2002-03-13  8:52 UTC (permalink / raw)
  To: lse-tech; +Cc: linux-kernel


Let the kernel compile benchmarks continue!

hardware: 24 way logical partition, 1.1GHz POWER4, 60G RAM

kernel: 2.5.6 + ppc64 pagetable rework

kernel compiled: 2.4.18 x86 with Martin's config

compiler: gcc 2.95.3 x86 cross compiler


# MAKE="make -j14" /usr/bin/time make -j14 bzImage
...
make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
130.63user 71.31system 0:10.31elapsed 1957%CPU (0avgtext+0avgdata 0maxresident)k


Due to the final link and compress stage, there is a fair amount of idle
time at the end of the run. It's going to be hard to push that number
lower by adding cpus.

The profile results below show that kernel time is dominated by the low
level ppc64 pagetable management. We are working to correct this; a lot
of the overhead in __hash_page should be gone soon. The rest of the
profile looks pretty good, do_anonymous_page and lru_cache_add show
up high as they did in Martin's results.

Thanks to Milton Miller who helped with the benchmarking, and the ppc64
team!

Anton
--
anton@samba.org
anton@au.ibm.com

201150 total                                      0.0668
129051 .idled                                  

 43586 .__hash_page                            ppc64 specific
  6714 .local_flush_tlb_range                  ppc64 specific
  2773 .local_flush_tlb_page                   ppc64 specific

  2203 .do_anonymous_page                      
  2059 .lru_cache_add                          
  1379 .__copy_tofrom_user                     

  1220 .hpte_create_valid_pSeriesLP            ppc64 LPAR specific

  1039 .save_remaining_regs                    
   871 .do_page_fault                          

   575 .plpar_hcall                            ppc64 LPAR specific

   554 .d_lookup                               
   545 .rmqueue                                
   482 .copy_page                              
   475 .__strnlen_user                         
   391 .__free_pages_ok                        
   389 .zap_page_range                         
   366 .atomic_dec_and_lock                    
   296 .__find_get_page                        
   287 .set_page_dirty                         
   278 .page_cache_release                     
   218 .handle_mm_fault                        
   199 .__flush_dcache_icache                  
   175 .schedule                               
   173 .sys_brk                                
   163 .exit_notify                            
   156 .do_no_page                             
   152 .lru_cache_del                          
   147 .__wake_up                              
   146 .copy_page_range                        
   139 .ppc_irq_dispatch_handler               
   135 .find_vma                               
   131 .__lru_cache_del                        
   128 .fget                                   
   126 .link_path_walk                         
   115 .do_generic_file_read                   
   113 .pte_alloc_map                          
   113 .filemap_nopage                         
   109 .clear_user_page                        
   106 .fput                                   
   104 .__alloc_pages                          
   101 .nr_free_pages                          


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 10.31 second kernel compile
  2002-03-13  8:52 10.31 second kernel compile Anton Blanchard
@ 2002-03-13 14:44 ` Martin J. Bligh
  2002-03-13 21:44   ` [Lse-tech] " Dave Hansen
                     ` (3 more replies)
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
  1 sibling, 4 replies; 137+ messages in thread
From: Martin J. Bligh @ 2002-03-13 14:44 UTC (permalink / raw)
  To: Anton Blanchard, lse-tech; +Cc: linux-kernel

> make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
> 130.63user 71.31system 0:10.31elapsed 1957%CPU (0avgtext+0avgdata 0maxresident)k

Wow! Is this box NUMA (what latency ratio, mem access speeds, etc.?), or can you really build a straight SMP that big?

OK, now I'm going to have to build a bigger system ;-)

> Due to the final link and compress stage, there is a fair amount of idle
> time at the end of the run. Its going to be hard to push that number
> lower by adding cpus.

I think we need to fix the final phase .... anyone got any ideas
on parallelizing that?
 
> The profile results below show that kernel time is dominated by the low
> level ppc64 pagetable management. We are working to correct this, a lot
> of the overhead in __hash_page should be gone soon. The rest of the
> profile looks pretty good, do_anonymous_page and lru_cache_add show
> up high as they did in Martin's results.

I have some strange plans for the lru stuff, but it'll take me a while.
I'm curious as to why lru_cache_del is so much lower in your list than
add, whereas the ratio for me is about:

   719 lru_cache_add                              7.8152
   477 lru_cache_del                             21.6818

>   6714 .local_flush_tlb_range                  ppc64 specific
>   2773 .local_flush_tlb_page                   ppc64 specific

Do you know what's causing the tlb flushes? Just context switches?
 
>    554 .d_lookup                               

Did you try the dcache patches?

Can you publish lockmeter stats?

Thanks,

M.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-13 14:44 ` Martin J. Bligh
@ 2002-03-13 21:44   ` Dave Hansen
  2002-03-14  1:07     ` Keith Owens
  2002-03-14 11:27   ` Anton Blanchard
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 137+ messages in thread
From: Dave Hansen @ 2002-03-13 21:44 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Anton Blanchard, lse-tech, linux-kernel

Martin J. Bligh wrote:
>>Due to the final link and compress stage, there is a fair amount of idle
>>time at the end of the run. Its going to be hard to push that number
>>lower by adding cpus.
> 
> I think we need to fix the final phase .... anyone got any ideas
> on parallelizing that?
The final linking stage in the makefile looks like this:

vmlinux: piggy.o $(OBJECTS)
	$(LD) $(ZLINKFLAGS) -o vmlinux $(OBJECTS) piggy.o

ld has a "-r" option
        `--relocateable'
            Generate relocatable output---i.e., generate an output
            file that can in turn serve as input to `ld'.  This is
            often  called  partial  linking.  As a side effect, in
            environments that support standard Unix magic numbers,
            this  option  also sets the output file's magic number
            to `OMAGIC'.  If this  option  is  not  specified,  an
            absolute file is produced.  When linking C++ programs,
            this option will not resolve references  to  construc-
            tors; to do that, use -Ur.

If we link in chunks, we can parallelize this.
Imagine 26 object files: [a-z].o

ld -r -o abcd.o [abcd].o
ld -r -o efgh.o [efgh].o
...
ld -r -o abcdefgh.o {abcd,efgh,...}.o

then, instead of the old final link stage:
$(LD) $(ZLINKFLAGS) -o vmlinux {abcdefgh,...}.o piggy.o

The final link will still take a while, but we will have at least broken 
up SOME of the work.  I'm going to see if this will actually work now. 
Any comments?

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-13 21:44   ` [Lse-tech] " Dave Hansen
@ 2002-03-14  1:07     ` Keith Owens
  0 siblings, 0 replies; 137+ messages in thread
From: Keith Owens @ 2002-03-14  1:07 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Martin J. Bligh, Anton Blanchard, lse-tech, linux-kernel

On Wed, 13 Mar 2002 13:44:43 -0800, 
Dave Hansen <haveblue@us.ibm.com> wrote:
>The final linking stage in the makefile looks like this:
>
>vmlinux: piggy.o $(OBJECTS)
>	$(LD) $(ZLINKFLAGS) -o vmlinux $(OBJECTS) piggy.o
>
>If we link in chunks, we can parallelize this.
>Imagine 26 object files: [a-z].o
>
>ld -r -o abcd.o [abcd].o
>ld -r -o efgh.o [efgh].o
>...
>ld -r -o abcdefgh.o {abcd,efgh,...}.o
>
>then, instead of the old final link stage:
>$(LD) $(ZLINKFLAGS) -o vmlinux {abcdefgh,...}.o piggy.o
>
>The final link will still take a while, but we will have at least broken 
>up SOME of the work.  I'm going to see if this will actually work now. 
>Any comments?

I'm sorry Dave, you can't do that ;) The initcall order is controlled
by link order; change the link order and you corrupt the kernel
initialization order, double plus ungood.  The link of vmlinux requires
that $(OBJECTS) be exactly as coded.
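
For readers wondering what ties initialization to link order: the initcalls
are just function pointers collected into a dedicated ELF section and walked
from start to end, so the call sequence is whatever order the objects were
handed to ld.  A minimal userspace sketch of that mechanism follows (the
macro and section names are invented for the example, they are not the
kernel's):

/* initcall_order.c -- build with a GNU toolchain: gcc -O2 initcall_order.c */
#include <stdio.h>

typedef void (*initcall_t)(void);

/* Drop a function pointer into a named section ("demo_initcalls" is a
 * made-up name, not the kernel's .initcall.init). */
#define DEMO_INITCALL(fn) \
	static initcall_t __initcall_##fn \
	__attribute__((used, section("demo_initcalls"))) = fn

static void init_console(void) { puts("init_console"); }
static void init_net(void)     { puts("init_net"); }

DEMO_INITCALL(init_console);
DEMO_INITCALL(init_net);

/* GNU ld provides these for any section whose name is a valid C identifier. */
extern initcall_t __start_demo_initcalls[], __stop_demo_initcalls[];

int main(void)
{
	/* Walk the table in section order.  With the entries spread over
	 * several .o files this is the order the files were given to ld:
	 * reshuffle the link and you reshuffle initialization. */
	for (initcall_t *fn = __start_demo_initcalls;
	     fn != __stop_demo_initcalls; fn++)
		(*fn)();
	return 0;
}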


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 10.31 second kernel compile
  2002-03-13 14:44 ` Martin J. Bligh
  2002-03-13 21:44   ` [Lse-tech] " Dave Hansen
@ 2002-03-14 11:27   ` Anton Blanchard
  2002-03-14 13:16     ` [Lse-tech] " Dipankar Sarma
  2002-03-14 13:21     ` [Lse-tech] Re: 10.31 second kernel compile Momchil Velikov
  2002-03-14 18:21   ` Hanna Linder
  2002-03-15  7:12   ` Chris Wedgwood
  3 siblings, 2 replies; 137+ messages in thread
From: Anton Blanchard @ 2002-03-14 11:27 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: lse-tech, linux-kernel

 
> Wow! Is this box NUMA (what latency ratio, mem access speeds, etc?), or
> can you really build a straight SMP that big?

The system has two cpus on a chip sharing an L2 cache, 4 of these chips
built into a multichip module, and up to 4 of these modules connected
together. The L3 and memory are distributed amongst the modules.

As you can imagine there is a hierarchy of latencies but the ratio is
quite low.

> OK, now I'm going to have to build a bigger system ;-)

I've got 8 more cpus in store too :)

> I have some strange plans for the lru stuff, but it'll take me a while.
> I'm curious as to why lru_cache_del is so much lower in your list than
> add, whereas the ratio for me is about:
> 
>    719 lru_cache_add                              7.8152
>    477 lru_cache_del                             21.6818

Not sure about that.

> >   6714 .local_flush_tlb_range                  ppc64 specific
> >   2773 .local_flush_tlb_page                   ppc64 specific
> 
> Do you know what's causing the tlb flushes? Just context switches?

That's due to the way we manipulate the ppc hashed page table. Every
time we update the Linux page tables we have to update the hashed
page table. There are some obvious optimisations we need to make;
hopefully then this will go away. The TLB flushes here are probably
from process exits and things like COW faults.
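
A rough, self-contained model of that double bookkeeping (all names, sizes
and the hash function below are invented; this is not the ppc64 code): the
Linux page table is the authoritative mapping, the hashed table behaves like
a big in-memory TLB, and every store into the former drags an insert or an
invalidate of the latter along with it.

/* toy_hashpt.c -- illustrative only */
#include <stdio.h>

#define NPAGES   64          /* toy virtual address space, in pages */
#define NBUCKETS 8           /* toy hashed page table               */
#define SLOTS    4           /* entries per bucket ("PTEG")         */

static unsigned long linux_pte[NPAGES];          /* authoritative mapping */

struct hpte { int valid; unsigned vpn; unsigned long pfn; };
static struct hpte hash_table[NBUCKETS][SLOTS];  /* the "big TLB" */
static int hash_inserts, hash_invalidates;

static struct hpte *bucket(unsigned vpn) { return hash_table[vpn % NBUCKETS]; }

static void hash_insert(unsigned vpn, unsigned long pfn)
{
	struct hpte *b = bucket(vpn);
	int victim = 0;                        /* full bucket: evict slot 0 */
	for (int i = 0; i < SLOTS; i++)
		if (!b[i].valid) { victim = i; break; }
	b[victim] = (struct hpte){ 1, vpn, pfn };
	hash_inserts++;
}

static void hash_invalidate(unsigned vpn)
{
	struct hpte *b = bucket(vpn);
	for (int i = 0; i < SLOTS; i++)
		if (b[i].valid && b[i].vpn == vpn) {
			b[i].valid = 0;
			hash_invalidates++;
		}
}

/* Every update of the Linux page table drags the hashed table along. */
static void set_pte(unsigned vpn, unsigned long pfn)
{
	linux_pte[vpn] = pfn;
	hash_insert(vpn, pfn);
}

static void clear_pte(unsigned vpn)      /* e.g. process exit or COW break */
{
	linux_pte[vpn] = 0;
	hash_invalidate(vpn);
}

int main(void)
{
	for (unsigned v = 0; v < 16; v++) set_pte(v, 0x1000 + v);
	for (unsigned v = 0; v < 16; v++) clear_pte(v);
	printf("hash inserts: %d, hash invalidates: %d\n",
	       hash_inserts, hash_invalidates);
	return 0;
}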

> >    554 .d_lookup                               
> 
> Did you try the dcache patches?

Not for this; I did do some benchmarking of the RCU dcache patches a
while ago, which I should post.

> Can you publish lockmeter stats?

I didn't get a chance to run lockmeter; I tend to use the kernel profiler
and a hacked readprofile (originally from tridge) that displays
profile hits against assembly instructions. That's usually good enough to
work out which spinlocks are a problem.

Anton

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 11:27   ` Anton Blanchard
@ 2002-03-14 13:16     ` Dipankar Sarma
  2002-03-17 13:12       ` some RCU dcache and ratcache results Anton Blanchard
  2002-03-14 13:21     ` [Lse-tech] Re: 10.31 second kernel compile Momchil Velikov
  1 sibling, 1 reply; 137+ messages in thread
From: Dipankar Sarma @ 2002-03-14 13:16 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Martin J. Bligh, lse-tech, linux-kernel

On Thu, Mar 14, 2002 at 10:27:26PM +1100, Anton Blanchard wrote:
>  
> > >    554 .d_lookup                               
> > 
> > Did you try the dcache patches?
> 
> Not for this, I did do some benchmarking of the RCU dcache patches a
> while ago which I should post.

Please do ;-) This shows why we need to ease the pressure on dcache_lock.

> 
> > Can you publish lockmeter stats?
> 
> I didnt get a chance to run lockmeter, I tend to use the kernel profiler
> and use a hacked readprofile (originally from tridge) that displays
> profile hits vs assembly instruction. Thats usually good enough to work
> out which spinlocks are a problem.

Is this a PPC-only hack? Also, where can I get it?

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 11:27   ` Anton Blanchard
  2002-03-14 13:16     ` [Lse-tech] " Dipankar Sarma
@ 2002-03-14 13:21     ` Momchil Velikov
  2002-03-14 18:33       ` Daniel Phillips
  2002-03-14 19:05       ` Linus Torvalds
  1 sibling, 2 replies; 137+ messages in thread
From: Momchil Velikov @ 2002-03-14 13:21 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Martin J. Bligh, lse-tech, linux-kernel

>>>>> "Anton" == Anton Blanchard <anton@samba.org> writes:
Anton> Thats due to the way we manipulate the ppc hashed page table. Every
Anton> time we update the linux page tables we have to update the hashed
Anton> page table. There are some obvious optimisations we need to make,

Out of curiosity, why is there a need to update the Linux page tables?
Don't the pte/pmd/pgd family of functions provide enough abstraction
to maintain _only_ the hashed page table?

Regards,
-velco

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-13 14:44 ` Martin J. Bligh
  2002-03-13 21:44   ` [Lse-tech] " Dave Hansen
  2002-03-14 11:27   ` Anton Blanchard
@ 2002-03-14 18:21   ` Hanna Linder
  2002-03-16  5:27     ` Anton Blanchard
  2002-03-15  7:12   ` Chris Wedgwood
  3 siblings, 1 reply; 137+ messages in thread
From: Hanna Linder @ 2002-03-14 18:21 UTC (permalink / raw)
  To: Anton Blanchard, Martin J. Bligh; +Cc: lse-tech, linux-kernel, hannal


--On Thursday, March 14, 2002 22:27:26 +1100 Anton Blanchard <anton@samba.org> wrote:

>> >    554 .d_lookup                               
>> 
>> Did you try the dcache patches?
> 
> Not for this, I did do some benchmarking of the RCU dcache patches a
> while ago which I should post.
> 

	There are two dcache patches. The one I wrote based on Al Viro's
	suggestion for fast path walking is especially good for NUMA
	systems hit by cache bouncing (d_lookup is the main culprit in 
	the dcache). Martin had some initial results that looked very
	good. 

	Following is the clean 2.5.6 version, which is also available at
	http://sf.net/projects/lse under the Read Copy Update section.
	
	Hanna Linder (hannal@us.ibm.com)
	IBM Linux Technology Center

-----------
diff -Nru -X dontdiff linux-2.5.6/fs/dcache.c linux-2.5.6-fw/fs/dcache.c
--- linux-2.5.6/fs/dcache.c	Thu Mar  7 18:18:13 2002
+++ linux-2.5.6-fw/fs/dcache.c	Fri Mar  8 13:50:43 2002
@@ -704,13 +704,22 @@
  
 struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
 {
+	struct dentry * dentry;
+	spin_lock(&dcache_lock);
+	dentry = __d_lookup(parent,name);
+	spin_unlock(&dcache_lock);
+	return dentry;
+}
+
+struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)  
+{
+
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
 	const unsigned char *str = name->name;
 	struct list_head *head = d_hash(parent,hash);
 	struct list_head *tmp;
 
-	spin_lock(&dcache_lock);
 	tmp = head->next;
 	for (;;) {
 		struct dentry * dentry = list_entry(tmp, struct dentry, d_hash);
@@ -732,10 +741,8 @@
 		}
 		__dget_locked(dentry);
 		dentry->d_vfs_flags |= DCACHE_REFERENCED;
-		spin_unlock(&dcache_lock);
 		return dentry;
 	}
-	spin_unlock(&dcache_lock);
 	return NULL;
 }
 
diff -Nru -X dontdiff linux-2.5.6/fs/namei.c linux-2.5.6-fw/fs/namei.c
--- linux-2.5.6/fs/namei.c	Thu Mar  7 18:18:24 2002
+++ linux-2.5.6-fw/fs/namei.c	Fri Mar  8 13:56:25 2002
@@ -268,8 +268,41 @@
 static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, int flags)
 {
 	struct dentry * dentry = d_lookup(parent, name);
+	
+	if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
+		if (!dentry->d_op->d_revalidate(dentry, flags) && !d_invalidate(dentry)) {
+			dput(dentry);
+			dentry = NULL;
+		}
+	}
+	return dentry;
+}
 
+/*for fastwalking*/
+static inline void undo_locked(struct nameidata *nd)
+{
+	if(nd->flags & LOOKUP_LOCKED){
+		dget(nd->dentry);
+		mntget(nd->mnt);
+		spin_unlock(&dcache_lock);
+		nd->flags &= ~LOOKUP_LOCKED;
+	}
+}
+
+/*
+ * For fast path lookup while holding the dcache_lock. 
+ * SMP-safe
+ */
+static struct dentry * cached_lookup_nd(struct nameidata * nd, struct qstr * name, int flags)
+{
+	struct dentry * dentry = NULL;
+	if(!(nd->flags & LOOKUP_LOCKED))
+		return cached_lookup(nd->dentry, name, flags);
+	
+	dentry = __d_lookup(nd->dentry, name);
+	
 	if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
+		undo_locked(nd);
 		if (!dentry->d_op->d_revalidate(dentry, flags) && !d_invalidate(dentry)) {
 			dput(dentry);
 			dentry = NULL;
@@ -279,6 +312,34 @@
 }
 
 /*
+ * Short-cut version of permission(), for calling by
+ * path_walk(), when dcache lock is held.  Combines parts
+ * of permission() and vfs_permission(), and tests ONLY for
+ * MAY_EXEC permission.
+ *
+ * If appropriate, check DAC only.  If not appropriate, or
+ * short-cut DAC fails, then call permission() to do more
+ * complete permission check.
+ */
+static inline int exec_permission_lite(struct inode *inode)
+{
+	umode_t	mode = inode->i_mode;
+
+	if ((inode->i_op && inode->i_op->permission))
+		return -EACCES;
+
+	if (current->fsuid == inode->i_uid)
+		mode >>= 6;
+	else if (in_group_p(inode->i_gid))
+		mode >>= 3;
+
+	if (mode & MAY_EXEC)
+		return 0;
+
+	return -EACCES;
+}
+
+/*
  * This is called when everything else fails, and we actually have
  * to go to the low-level filesystem to find out what we should do..
  *
@@ -472,7 +533,9 @@
 		struct qstr this;
 		unsigned int c;
 
-		err = permission(inode, MAY_EXEC);
+		err = exec_permission_lite(inode);
+		if(err)
+			err = permission(inode, MAY_EXEC);
 		dentry = ERR_PTR(err);
  		if (err)
 			break;
@@ -507,6 +570,7 @@
 			case 2:	
 				if (this.name[1] != '.')
 					break;
+				undo_locked(nd);
 				follow_dotdot(nd);
 				inode = nd->dentry->d_inode;
 				/* fallthrough */
@@ -523,16 +587,20 @@
 				break;
 		}
 		/* This does the actual lookups.. */
-		dentry = cached_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
+		dentry = cached_lookup_nd(nd, &this, LOOKUP_CONTINUE);
 		if (!dentry) {
+			undo_locked(nd);
 			dentry = real_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
 			err = PTR_ERR(dentry);
 			if (IS_ERR(dentry))
 				break;
 		}
 		/* Check mountpoints.. */
-		while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
-			;
+		if(d_mountpoint(dentry)){
+			undo_locked(nd);
+			while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
+				;
+		}
 
 		err = -ENOENT;
 		inode = dentry->d_inode;
@@ -543,6 +611,7 @@
 			goto out_dput;
 
 		if (inode->i_op->follow_link) {
+			undo_locked(nd);
 			err = do_follow_link(dentry, nd);
 			dput(dentry);
 			if (err)
@@ -555,7 +624,8 @@
 			if (!inode->i_op)
 				break;
 		} else {
-			dput(nd->dentry);
+			if (!(nd->flags & LOOKUP_LOCKED))
+				dput(nd->dentry);
 			nd->dentry = dentry;
 		}
 		err = -ENOTDIR; 
@@ -575,6 +645,7 @@
 			case 2:	
 				if (this.name[1] != '.')
 					break;
+				undo_locked(nd);
 				follow_dotdot(nd);
 				inode = nd->dentry->d_inode;
 				/* fallthrough */
@@ -586,7 +657,8 @@
 			if (err < 0)
 				break;
 		}
-		dentry = cached_lookup(nd->dentry, &this, 0);
+		dentry = cached_lookup_nd(nd, &this, 0);
+		undo_locked(nd); 
 		if (!dentry) {
 			dentry = real_lookup(nd->dentry, &this, 0);
 			err = PTR_ERR(dentry);
@@ -626,11 +698,14 @@
 		else if (this.len == 2 && this.name[1] == '.')
 			nd->last_type = LAST_DOTDOT;
 return_base:
+		undo_locked(nd);
 		return 0;
 out_dput:
+		undo_locked(nd);
 		dput(dentry);
 		break;
 	}
+	undo_locked(nd);
 	path_release(nd);
 return_err:
 	return err;
@@ -734,6 +809,36 @@
 	nd->dentry = dget(current->fs->pwd);
 	read_unlock(&current->fs->lock);
 	return 1;
+}
+
+int path_lookup(const char *name, unsigned int flags, struct nameidata *nd)
+{
+	nd->last_type = LAST_ROOT; /* if there are only slashes... */
+	nd->flags = flags;
+	if (*name=='/'){
+		read_lock(&current->fs->lock);
+		if (current->fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
+			nd->mnt = mntget(current->fs->altrootmnt);
+			nd->dentry = dget(current->fs->altroot);
+			read_unlock(&current->fs->lock);
+			if (__emul_lookup_dentry(name,nd))
+				return 0;
+			read_lock(&current->fs->lock);
+		}
+		spin_lock(&dcache_lock); /*to avoid cacheline bouncing with d_count*/
+		nd->mnt = current->fs->rootmnt;
+		nd->dentry = current->fs->root;
+		read_unlock(&current->fs->lock);
+	}
+	else{
+		read_lock(&current->fs->lock);
+		spin_lock(&dcache_lock);
+		nd->mnt = current->fs->pwdmnt;
+		nd->dentry = current->fs->pwd;
+		read_unlock(&current->fs->lock);
+	}
+	nd->flags |= LOOKUP_LOCKED;
+	return (path_walk(name, nd));
 }
 
 /*
diff -Nru -X dontdiff linux-2.5.6/include/linux/dcache.h linux-2.5.6-fw/include/linux/dcache.h
--- linux-2.5.6/include/linux/dcache.h	Thu Mar  7 18:18:30 2002
+++ linux-2.5.6-fw/include/linux/dcache.h	Fri Mar  8 13:50:43 2002
@@ -220,6 +220,7 @@
 
 /* appendix may either be NULL or be used for transname suffixes */
 extern struct dentry * d_lookup(struct dentry *, struct qstr *);
+extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
 
 /* validate "insecure" dentry pointer */
 extern int d_validate(struct dentry *, struct dentry *);
diff -Nru -X dontdiff linux-2.5.6/include/linux/fs.h linux-2.5.6-fw/include/linux/fs.h
--- linux-2.5.6/include/linux/fs.h	Thu Mar  7 18:18:19 2002
+++ linux-2.5.6-fw/include/linux/fs.h	Fri Mar  8 13:50:43 2002
@@ -1273,12 +1273,15 @@
  *  - require a directory
  *  - ending slashes ok even for nonexistent files
  *  - internal "there are more path compnents" flag
+ *  - locked when lookup done with dcache_lock held
  */
 #define LOOKUP_FOLLOW		(1)
 #define LOOKUP_DIRECTORY	(2)
 #define LOOKUP_CONTINUE		(4)
 #define LOOKUP_PARENT		(16)
 #define LOOKUP_NOALT		(32)
+#define LOOKUP_LOCKED		(64)
+
 /*
  * Type of the last component on LOOKUP_PARENT
  */
@@ -1309,13 +1312,7 @@
 extern int FASTCALL(path_init(const char *, unsigned, struct nameidata *));
 extern int FASTCALL(path_walk(const char *, struct nameidata *));
 extern int FASTCALL(link_path_walk(const char *, struct nameidata *));
-static inline int path_lookup(const char *path, unsigned flags, struct nameidata *nd)
-{
-	int error = 0;
-	if (path_init(path, flags, nd))
-		error = path_walk(path, nd);
-	return error;
-}
+extern int FASTCALL(path_lookup(const char *, unsigned, struct nameidata *));
 extern void path_release(struct nameidata *);
 extern int follow_down(struct vfsmount **, struct dentry **);
 extern int follow_up(struct vfsmount **, struct dentry **);


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 13:21     ` [Lse-tech] Re: 10.31 second kernel compile Momchil Velikov
@ 2002-03-14 18:33       ` Daniel Phillips
  2002-03-15 12:16         ` Chris Wedgwood
                           ` (2 more replies)
  2002-03-14 19:05       ` Linus Torvalds
  1 sibling, 3 replies; 137+ messages in thread
From: Daniel Phillips @ 2002-03-14 18:33 UTC (permalink / raw)
  To: Momchil Velikov, Anton Blanchard; +Cc: Martin J. Bligh, lse-tech, linux-kernel

On March 14, 2002 02:21 pm, Momchil Velikov wrote:
> >>>>> "Anton" == Anton Blanchard <anton@samba.org> writes:
> Anton> Thats due to the way we manipulate the ppc hashed page table. Every
> Anton> time we update the linux page tables we have to update the hashed
> Anton> page table. There are some obvious optimisations we need to make,
> 
> Out of curiousity, why there's a need to update the linux page tables ?
> Doesn't pte/pmd/pgd family functions provide enough abstraction in
> order to maintain _only_ the hashed page table ?

No, it's hardwired to the x86 tree view of page translation.

-- 
Daniel

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 13:21     ` [Lse-tech] Re: 10.31 second kernel compile Momchil Velikov
  2002-03-14 18:33       ` Daniel Phillips
@ 2002-03-14 19:05       ` Linus Torvalds
  2002-03-19 16:40         ` Bill Davidsen
  1 sibling, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-14 19:05 UTC (permalink / raw)
  To: linux-kernel

In article <87wuwfxp25.fsf@fadata.bg>,
Momchil Velikov  <velco@fadata.bg> wrote:
>
>Out of curiousity, why there's a need to update the linux page tables ?
>Doesn't pte/pmd/pgd family functions provide enough abstraction in
>order to maintain _only_ the hashed page table ?

No.  The IBM hashed page tables are not page tables at all, they are
really just a bigger 16-way set-associative in-memory TLB. 

You can't actually sanely keep track of VM layout in them.

Those POWER4 machines are wonderful things, but they have a few quirks:

 - it's so expensive that anybody who is slightly price-conscious gets a
   farm of PC's instead. Oh, well.

 - the CPU module alone is something like .5 kilowatts (translation:
   don't expect it in a nice desktop factor, even if you could afford
   it). 

 - IBM nomenclature really is broken. They call disks DASD devices, and
   they call their hash table a page table, and they just confuse
   themselves and everybody else for no good reason.  They number bits
   the wrong way around, for example (and big-endian bitordering really
   _is_ clearly inferior to little-endian, unlike byte-ordering.  Watch
   the _same_ bits in the _same_ register change name in the 32 vs
   64-bit architecture manuals, and puke)

But with all their faults, they do have this really studly setup with 8
big, fast CPU's on a single module. A few of those modules and you get
some ass-kick performance numbers. As you can see.

		Linus

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 10.31 second kernel compile
  2002-03-13 14:44 ` Martin J. Bligh
                     ` (2 preceding siblings ...)
  2002-03-14 18:21   ` Hanna Linder
@ 2002-03-15  7:12   ` Chris Wedgwood
  3 siblings, 0 replies; 137+ messages in thread
From: Chris Wedgwood @ 2002-03-15  7:12 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Anton Blanchard, lse-tech, linux-kernel

On Wed, Mar 13, 2002 at 06:44:56AM -0800, Martin J. Bligh wrote:

    I think we need to fix the final phase .... anyone got any ideas
    on parallelizing that?

Redefine the benchmark not to include the final link :)



  --cw

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 18:33       ` Daniel Phillips
@ 2002-03-15 12:16         ` Chris Wedgwood
  2002-03-16  5:12           ` Anton Blanchard
  2002-03-15 18:20         ` Linus Torvalds
  2002-03-16 11:55         ` Paul Mackerras
  2 siblings, 1 reply; 137+ messages in thread
From: Chris Wedgwood @ 2002-03-15 12:16 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Momchil Velikov, Anton Blanchard, Martin J. Bligh, lse-tech,
	linux-kernel

On Thu, Mar 14, 2002 at 07:33:40PM +0100, Daniel Phillips wrote:

    On March 14, 2002 02:21 pm, Momchil Velikov wrote:

    > Out of curiousity, why there's a need to update the linux page
    > tables ?  Doesn't pte/pmd/pgd family functions provide enough
    > abstraction in order to maintain _only_ the hashed page table ?

    No, it's hardwired to the x86 tree view of page translation.

What about doing soft TLB reloads then?


  --cw

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 18:33       ` Daniel Phillips
  2002-03-15 12:16         ` Chris Wedgwood
@ 2002-03-15 18:20         ` Linus Torvalds
  2002-03-16 15:24           ` Daniel Phillips
  2002-03-18  3:07           ` David S. Miller
  2002-03-16 11:55         ` Paul Mackerras
  2 siblings, 2 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-15 18:20 UTC (permalink / raw)
  To: linux-kernel

In article <E16la2m-0000SX-00@starship>,
Daniel Phillips  <phillips@bonn-fries.net> wrote:
>On March 14, 2002 02:21 pm, Momchil Velikov wrote:
>> 
>> Out of curiousity, why there's a need to update the linux page tables ?
>> Doesn't pte/pmd/pgd family functions provide enough abstraction in
>> order to maintain _only_ the hashed page table ?
>
>No, it's hardwired to the x86 tree view of page translation.

No no no.

If you think that, then you don't see the big picture.

In fact, when I did the 3-level page tables for Linux, no x86 chips that
could _use_ three levels actually existed.

The Linux MM was actually _designed_ for portability when I did the port
to alpha (oh, that's a long time ago). I even wrote my masters thesis on
why it was done the way it was done (the only actual academic use I ever
got out of the whole Linux exercise ;)

Yes a tree-based page table matches a lot of hardware architectures very
well.  And it's _not_ just x86: it also matches soft-fill TLB's better
than alternatives (newer sparcs and MIPS), and matches a number of other
architecture specifications (eg alpha, m68k). 

So on about 50% of architectures (and 99.9% of machines), the Linux MM
data structures can be made to map 1:1 to the hardware constructs, so
that you avoid duplicate information. 

But more importantly than that, the whole point really is that the page
table tree as far as Linux is concerned is nothing but an _abstraction_
of the VM mapping hardware. It so happens that a tree format is the only
sane format to keep full VM information that works well with real loads.
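
As a concrete picture of that abstraction, here is a minimal toy of the
three-level pgd/pmd/pte walk (toy sizes and invented helpers, not the
kernel's macros); on two-level hardware the middle level is simply folded
away as a single-entry table.

/* toy_walk.c -- a three-level page table with invented sizes */
#include <stdio.h>
#include <stdlib.h>

#define SHIFT      4                      /* 16 entries per level (toy) */
#define ENTRIES    (1u << SHIFT)
#define PAGE_SHIFT 12

typedef struct { unsigned long pfn; int present; } pte_t;
typedef struct { pte_t *ptes; } pmd_t;
typedef struct { pmd_t *pmds; } pgd_t;

static unsigned pgd_index(unsigned long va)
{ return (va >> (PAGE_SHIFT + 2 * SHIFT)) & (ENTRIES - 1); }
static unsigned pmd_index(unsigned long va)
{ return (va >> (PAGE_SHIFT + SHIFT)) & (ENTRIES - 1); }
static unsigned pte_index(unsigned long va)
{ return (va >> PAGE_SHIFT) & (ENTRIES - 1); }

/* Allocate intermediate levels on demand and install a translation. */
static void map_page(pgd_t *pgd, unsigned long va, unsigned long pfn)
{
	pgd_t *g = &pgd[pgd_index(va)];
	if (!g->pmds) g->pmds = calloc(ENTRIES, sizeof(pmd_t));
	pmd_t *m = &g->pmds[pmd_index(va)];
	if (!m->ptes) m->ptes = calloc(ENTRIES, sizeof(pte_t));
	m->ptes[pte_index(va)] = (pte_t){ pfn, 1 };
}

/* The walk itself: three indexed loads, no searching, no collisions. */
static pte_t *lookup(pgd_t *pgd, unsigned long va)
{
	pgd_t *g = &pgd[pgd_index(va)];
	if (!g->pmds) return NULL;
	pmd_t *m = &g->pmds[pmd_index(va)];
	if (!m->ptes) return NULL;
	pte_t *p = &m->ptes[pte_index(va)];
	return p->present ? p : NULL;
}

int main(void)
{
	pgd_t *pgd = calloc(ENTRIES, sizeof(pgd_t));
	map_page(pgd, 0x123000UL, 0x99);
	pte_t *p = lookup(pgd, 0x123000UL);
	printf("pfn = 0x%lx\n", p ? p->pfn : 0UL);
	return 0;
}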

Whatever the hardware actually does, Linux considers that to be nothing
but an extended TLB.  When you can make the MM software tree map 1:1
with the extended TLB (as on x86), you win in memory usage and in
cheaper TLB invalidates, but you _could_ (if you wanted to) just keep
two separate trees.  In fact, with the rmap patches, that's exactly what
you see: the software tree is _not_ 1:1 with the hardware tree any more
(but it _is_ a proper superset, so that you can still get partial
sharing and still get the cheaper TLB updates). 

Are there machines where the sharing between the software abstraction
and the hardware isn't as total? Sure. But if you actually know how
hashed page tables work on ppc, you'd be aware of the fact that they
aren't actually able to do a full VM mapping - when a hash chain gets too
long, the hardware is no longer able to look it up ("too long" being 16
entries on a PPC, for example).

And that's a common situation with non-tree VM representations - they
aren't actually VM representations, they are just caches of what the
_real_ representation is.  And what do we call such caches? Right: they
are nothing but a TLB. 

So the fact is, the Linux tree-based VM has _nothing_ to do with x86
tree-basedness, and everything to do with the fact that it's the only
sane way to keep VM information. 

The fact that it maps 1:1 to the x86 trees with the "folding" of the mid
layer was a design consideration, for sure.  Being efficient and clever
is always good.  But the basic reason for tree-ness lies elsewhere. 
(The basic reason for tree-ness is why so many architectures _do_ use a
tree-based page table - you should think of PPC and ia64 as the sick
puppies who didn't understand.  Read the PPC documentation on virtual
memory, and you'll see just _how_ sick they are). 

			Linus

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-15 12:16         ` Chris Wedgwood
@ 2002-03-16  5:12           ` Anton Blanchard
  0 siblings, 0 replies; 137+ messages in thread
From: Anton Blanchard @ 2002-03-16  5:12 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Daniel Phillips, Momchil Velikov, Martin J. Bligh, lse-tech,
	linux-kernel

 
> What about doing soft TLB reloads then?

ppc32 Linux preloads entries into the hashed pagetable in
update_mmu_cache. I'm about to commit a patch to do the same thing
in ppc64; at the moment we take two exceptions per page fault, which
is pretty ugly.

Some ppc32 hardware does allow you to take an exception for a TLB miss
(ie bypass the hashed pagetable completely).

Anton

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 18:21   ` Hanna Linder
@ 2002-03-16  5:27     ` Anton Blanchard
  0 siblings, 0 replies; 137+ messages in thread
From: Anton Blanchard @ 2002-03-16  5:27 UTC (permalink / raw)
  To: Hanna Linder; +Cc: Martin J. Bligh, lse-tech, linux-kernel

 
Hi,

> 	There are two dcache patches. The one I wrote based on Al Viro's
> 	suggestion for fast path walking is especially good for NUMA
> 	systems hit by cache bouncing (d_lookup is the main culprit in 
> 	the dcache). Martin had some initial results that looked very
> 	good. 

I gave the patch a go; here is the before and after for the kernel
compile benchmark. As you can see, d_lookup and atomic_dec_and_lock
have both dropped. Since the main bottleneck for us is still the
ppc64 mm code, we didn't see a noticeable drop in wall clock time.

It would be interesting to try the patch on a large specweb run;
I've seen the dcache lock become a problem when running 8-way specweb.

Anton

before:
155912 total                                      0.0550
114562 .cpu_idle                               
 12615 .local_flush_tlb_range                  
  8476 .local_flush_tlb_page                   
  2576 .insert_hpte_into_group                 
  1980 .do_anonymous_page                      
  1813 .lru_cache_add                          
  1390 .d_lookup                               
  1320 .__copy_tofrom_user                     
  1140 .save_remaining_regs                    
   612 .rmqueue                                
   517 .atomic_dec_and_lock                    
   492 .do_page_fault                          
   444 .copy_page                              
   438 .__free_pages_ok                        
   375 .set_page_dirty                         
   350 .zap_page_range                         
   314 .schedule                               
   270 .__find_get_page                        
   245 .page_cache_release                     
   233 .lru_cache_del                          
   231 .hvc_poll                               
   215 .sys_brk                                

after:
152844 total                                      0.0539
113527 .cpu_idle                               
 12740 .local_flush_tlb_range                  
  7701 .local_flush_tlb_page                   
  2564 .insert_hpte_into_group                 
  2099 .do_anonymous_page                      
  1780 .lru_cache_add                          
  1230 .__copy_tofrom_user                     
  1082 .save_remaining_regs                    
   581 .rmqueue                                
   486 .__free_pages_ok                        
   479 .do_page_fault                          
   465 .copy_page                              
   371 .zap_page_range                         
   333 .atomic_dec_and_lock                    
   332 .set_page_dirty                         
   286 .__find_get_page                        
   275 .__d_lookup                             
   263 .path_lookup                            
   250 .page_cache_release                     
   221 .lru_cache_del                          
   218 .sys_brk                                
   215 .__flush_dcache_icache                  

^ permalink raw reply	[flat|nested] 137+ messages in thread

* 7.52 second kernel compile
  2002-03-13  8:52 10.31 second kernel compile Anton Blanchard
  2002-03-13 14:44 ` Martin J. Bligh
@ 2002-03-16  6:15 ` Anton Blanchard
  2002-03-16  6:42   ` [Lse-tech] " Gerrit Huizenga
                     ` (4 more replies)
  1 sibling, 5 replies; 137+ messages in thread
From: Anton Blanchard @ 2002-03-16  6:15 UTC (permalink / raw)
  To: lse-tech; +Cc: linux-kernel


> Let the kernel compile benchmarks continue!

I think I'm addicted. I need help!

In this update we added 8 cpus and rewrote the ppc64 pagetable management
code to do lockless inserts and removals (there is still locking at
the pte level to avoid races).

hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM

kernel: 2.5.7-pre1 + ppc64 pagetable rework

kernel compiled: 2.4.18 x86 with Martin's config

compiler: gcc 2.95.3 x86 cross compiler

make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
128.89user 40.23system 0:07.52elapsed 2246%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (437084major+572835minor)pagefaults 0swaps

7.52 seconds is not a bad result for something running under a hypervisor.
The profile looks much better now. We still spend a lot of time flushing TLB
entries, but we can look into batching them.
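
The batching idea, sketched in the abstract (invented names, not the ppc64
code): accumulate the addresses that need invalidating while tearing a
range down, then issue one flush pass instead of one flush per PTE.

/* toy_tlbbatch.c -- illustrative only */
#include <stdio.h>

#define BATCH 16

struct tlb_batch {
	unsigned long addrs[BATCH];
	int nr;
	int flushes;                  /* how many real flush operations */
};

static void tlb_flush_now(struct tlb_batch *b)
{
	if (!b->nr)
		return;
	/* A real implementation would invalidate b->addrs[0..nr-1] here,
	 * or fall back to a full flush if that is cheaper. */
	b->flushes++;
	b->nr = 0;
}

static void tlb_remove_page(struct tlb_batch *b, unsigned long addr)
{
	b->addrs[b->nr++] = addr;
	if (b->nr == BATCH)
		tlb_flush_now(b);
}

int main(void)
{
	struct tlb_batch b = { .nr = 0, .flushes = 0 };
	for (unsigned long a = 0; a < 100; a++)
		tlb_remove_page(&b, a << 12);   /* unmapping 100 pages */
	tlb_flush_now(&b);                      /* flush the tail */
	printf("100 pages unmapped with %d flush operations\n", b.flushes);
	return 0;
}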

Anton
--
anton@samba.org
anton@au.ibm.com

155912 total                                      0.0550
114562 .cpu_idle                               

 12615 .local_flush_tlb_range                  
  8476 .local_flush_tlb_page                   
  2576 .insert_hpte_into_group                 

  1980 .do_anonymous_page                      
  1813 .lru_cache_add                          
  1390 .d_lookup                               
  1320 .__copy_tofrom_user                     
  1140 .save_remaining_regs                    
   612 .rmqueue                                
   517 .atomic_dec_and_lock                    
   492 .do_page_fault                          
   444 .copy_page                              
   438 .__free_pages_ok                        
   375 .set_page_dirty                         
   350 .zap_page_range                         
   314 .schedule                               
   270 .__find_get_page                        
   245 .page_cache_release                     
   233 .lru_cache_del                          
   231 .hvc_poll                               
   215 .sys_brk                                

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
@ 2002-03-16  6:42   ` Gerrit Huizenga
  2002-03-17 12:34     ` Anton Blanchard
  2002-03-16  8:05   ` Linus Torvalds
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 137+ messages in thread
From: Gerrit Huizenga @ 2002-03-16  6:42 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: lse-tech, linux-kernel


And this *without* the dcache_lock?  Hmm.  So you are saying there
may still be room for improvement?

BTW, are you doing this all out of cache/memory or do you have a
disk/controller quick enough to do the initial little file reads that
fast?

gerrit

In message <20020316061535.GA16653@krispykreme>, Anton Blanchard writes:
> 
> > Let the kernel compile benchmarks continue!
> 
> I think Im addicted. I need help!
> 
> In this update we added 8 cpus and rewrote the ppc64 pagetable management
> code to do lockless inserts and removals (there is still locking at
> the pte level to avoid races).
> 
> hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM
> 
> kernel: 2.5.7-pre1 + ppc64 pagetable rework
> 
> kernel compiled: 2.4.18 x86 with Martin's config
> 
> compiler: gcc 2.95.3 x86 cross compiler
> 
> make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
> 128.89user 40.23system 0:07.52elapsed 2246%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (437084major+572835minor)pagefaults 0swaps
> 
> 7.52 seconds is not a bad result for something running under a hypervisor.
> The profile looks much better now. We still spend a lot of time flushing tlb
> entries but we can look into batching them.
> 
> Anton
> --
> anton@samba.org
> anton@au.ibm.com
> 
> 155912 total                                      0.0550
> 114562 .cpu_idle                               
> 
>  12615 .local_flush_tlb_range                  
>   8476 .local_flush_tlb_page                   
>   2576 .insert_hpte_into_group                 
> 
>   1980 .do_anonymous_page                      
>   1813 .lru_cache_add                          
>   1390 .d_lookup                               
>   1320 .__copy_tofrom_user                     
>   1140 .save_remaining_regs                    
>    612 .rmqueue                                
>    517 .atomic_dec_and_lock                    
>    492 .do_page_fault                          
>    444 .copy_page                              
>    438 .__free_pages_ok                        
>    375 .set_page_dirty                         
>    350 .zap_page_range                         
>    314 .schedule                               
>    270 .__find_get_page                        
>    245 .page_cache_release                     
>    233 .lru_cache_del                          
>    231 .hvc_poll                               
>    215 .sys_brk                                
> 
> _______________________________________________
> Lse-tech mailing list
> Lse-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lse-tech

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
  2002-03-16  6:42   ` [Lse-tech] " Gerrit Huizenga
@ 2002-03-16  8:05   ` Linus Torvalds
  2002-03-16 11:54     ` yodaiken
  2002-03-16 11:04   ` Paul Mackerras
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16  8:05 UTC (permalink / raw)
  To: linux-kernel

In article <20020316061535.GA16653@krispykreme>,
Anton Blanchard  <anton@samba.org> wrote:
>
>hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM

It's interesting to see that scalability doesn't seem to be the #1
problem by a long shot. 

>7.52 seconds is not a bad result for something running under a hypervisor.
>The profile looks much better now. We still spend a lot of time flushing tlb
>entries but we can look into batching them.

I wonder if you wouldn't be better off just getting rid of the TLB range
flush altogether, and instead making it select a new VSID in the segment
register, and just forgetting about the old TLB contents entirely.

Then, when you do a TLB miss, you just re-use any hash table entries
that have a stale VSID.

It seems that you spend _way_ too much time actually trying to
physically invalidate the hashtables, which sounds like a total waste to
me. Especially as going through them to see whether they need to be
invalidated has to be a horrible thing for the dcache.

It would also be interesting to hear if you can just make the hash table
smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or
just bypass it altogether (at least the 604e used to be able to just
disable the stupid hashing altogether and make the whole thing much
saner). 

Note that the official IBM "minimum recommended page table sizes" stuff
looks like total and utter crap.  Those tables have nothing to do with
sanity, and everything to do with a crappy OS called AIX that takes
forever to fill the hashes.  You should probably make them the minimum
size (which, if I remember correctly, is still quite a large amount of
memory thrown away on a TLB) if you can't just disable them altogether. 

			Linus

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
  2002-03-16  6:42   ` [Lse-tech] " Gerrit Huizenga
  2002-03-16  8:05   ` Linus Torvalds
@ 2002-03-16 11:04   ` Paul Mackerras
  2002-03-16 18:32     ` Linus Torvalds
                       ` (2 more replies)
  2002-03-16 17:37   ` [Lse-tech] " Martin J. Bligh
  2002-03-16 18:57   ` Daniel Egger
  4 siblings, 3 replies; 137+ messages in thread
From: Paul Mackerras @ 2002-03-16 11:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> I wonder if you wouldn't be better off just getting rid of the TLB range
> flush altogether, and instead making it select a new VSID in the segment
> register, and just forgetting about the old TLB contents entirely.
> 
> Then, when you do a TLB miss, you just re-use any hash table entries
> that have a stale VSID.

We used to do something a bit like that on ppc32 - flush_tlb_mm would
just assign a new mmu context number to the task, which translates
into a new set of VSIDs.  We didn't do the second part, reusing hash
table entries with stale VSIDs, because we couldn't see a good fast
way to tell whether a given VSID was stale.  Instead, when the hash
bucket filled up, we just picked an entry to overwrite semi-randomly.
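
In the abstract, that context-number trick is the familiar generation/ASID
scheme.  A minimal sketch of just the idea (invented names, no PPC
specifics): a flush bumps the mm's context number, entries tagged with an
old number are treated as misses and refilled lazily, and the hard part is
what to do when the numbers wrap.

/* toy_ctx.c -- illustrative only */
#include <stdio.h>

struct mm   { unsigned long ctx; };
struct tlbe { unsigned long ctx, vpn, pfn; int valid; };

#define TLB_SIZE 8
static struct tlbe tlb[TLB_SIZE];

static unsigned long lookup(struct mm *mm, unsigned long vpn)
{
	for (int i = 0; i < TLB_SIZE; i++)
		if (tlb[i].valid && tlb[i].ctx == mm->ctx && tlb[i].vpn == vpn)
			return tlb[i].pfn;
	/* miss: refill from the real page tables (faked here) */
	int slot = vpn % TLB_SIZE;
	tlb[slot] = (struct tlbe){ mm->ctx, vpn, vpn + 0x1000, 1 };
	return tlb[slot].pfn;
}

/* "flush" without touching the cached entries at all */
static void flush_mm(struct mm *mm)
{
	mm->ctx++;      /* a real version must deal with ctx wrap-around,
	                   which is exactly where the stale entries bite */
}

int main(void)
{
	struct mm mm = { .ctx = 1 };
	printf("pfn %lx\n", lookup(&mm, 5));   /* miss, refill */
	printf("pfn %lx\n", lookup(&mm, 5));   /* hit */
	flush_mm(&mm);
	printf("pfn %lx\n", lookup(&mm, 5));   /* stale ctx -> miss, refill */
	return 0;
}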

It turned out that the stale VSIDs were causing us various problems,
particularly on SMP, so I tried a solution that always cleared all the
hash table entries, using a bit in the linux pte to say whether there
was (or had ever been) a hash table entry corresponding to that pte as
an optimization to avoid doing unnecessary hash lookups.  To my
surprise, that turned out to be faster, so that's what we do now.

Your suggestion has the problem that when you get to needing to reuse
one of the VSIDs that you have thrown away, it becomes very difficult
and expensive to ensure that there aren't any stale hash table entries
left around for that VSID - particularly on a system with logical
partitioning where we don't control the size of the hash table.  And
there is a finite number of VSIDs so you have to reuse them sooner or
later.

[For those not familiar with the PPC MMU, think of the VSID as an MMU
context number, but separately settable for each 256MB of the virtual
address space.]
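
For concreteness, a back-of-the-envelope of that 256MB granularity (exact
field widths are left out; they differ between the 32-bit and 64-bit MMUs):

/* segment.c -- which 256MB segment, and hence which VSID, an effective
 * address falls in */
#include <stdio.h>

#define SEGMENT_SHIFT 28                            /* 256MB = 1 << 28 */

int main(void)
{
	unsigned long ea = 0x30001000UL;        /* some effective address */
	unsigned long esid   = ea >> SEGMENT_SHIFT;
	unsigned long offset = ea & ((1UL << SEGMENT_SHIFT) - 1);

	/* A flush_tlb_mm-style "flush" can swap the VSID behind a segment
	 * instead of walking and invalidating individual hash entries. */
	printf("ea 0x%lx -> segment %lu, offset 0x%lx\n", ea, esid, offset);
	return 0;
}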

> It would also be interesting to hear if you can just make the hash table
> smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or

On ppc32 we use a hash table 1/4 of the recommended size and it works
fine.

> just bypass it altogether (at least the 604e used to be able to just
> disable the stupid hashing altogether and make the whole thing much
> saner). 

That was the 603, actually.  In fact the newest G4 processors also let
you do this.  When I get hold of a machine with one of these new G4
chips I'm going to try it again and see how much faster it goes
without the hash table.

One other thing - I would *love* it if we could get rid of
flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
_all_ of the flush_tlb_* functions tell us what address(es) we need to
invalidate, and let the architecture code decide whether a complete
TLB flush is justified.

Paul.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16  8:05   ` Linus Torvalds
@ 2002-03-16 11:54     ` yodaiken
  0 siblings, 0 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 11:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sat, Mar 16, 2002 at 08:05:14AM +0000, Linus Torvalds wrote:
> It would also be interesting to hear if you can just make the hash table
> smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or
> just bypass it altogether (at least the 604e used to be able to just
> disable the stupid hashing altogether and make the whole thing much
> saner). 

Reference:
Cort Dougan, Paul Mackerras, Victor Yodaiken, "Optimizing the Idle Task
and Other MMU Tricks", OSDI '99.
http://www.usenix.org/publications/library/proceedings/osdi99/full_papers/dougan/dougan.pdf


Cort's MS thesis was on this topic. IBM seems reluctant to give up on 
hardware page tables though.



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 18:33       ` Daniel Phillips
  2002-03-15 12:16         ` Chris Wedgwood
  2002-03-15 18:20         ` Linus Torvalds
@ 2002-03-16 11:55         ` Paul Mackerras
  2002-03-16 17:25           ` Rik van Riel
                             ` (2 more replies)
  2 siblings, 3 replies; 137+ messages in thread
From: Paul Mackerras @ 2002-03-16 11:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> But more importantly than that, the whole point really is that the page
> table tree as far as Linux is concerned is nothing but an _abstraction_
> of the VM mapping hardware. It so happens that a tree format is the only
> sane format to keep full VM information that works well with real loads.

Is that still true when we get to wanting to support a full 64-bit
address space?  Given that we can already tolerate losing PTEs for
resident pages from the page tables quite happily (since they can be
reconstructed from the information in the vm_area_structs and the page
cache), I don't see that the fact that a hash table will sometimes
lose PTEs because of a hash bucket filling up is all that much of a
problem.  (We would need to find some other way of dealing with swap
entries of course.)

IMHO it would be interesting to compare the size and complexity of
using a hash table for the page tables with a 5-level tree.  For a
32-bit address space I think the tree wins hands down but for a full
64-bit address space I am not convinced either way at present.

Paul.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-15 18:20         ` Linus Torvalds
@ 2002-03-16 15:24           ` Daniel Phillips
  2002-03-16 19:01             ` Linus Torvalds
  2002-03-18  3:07           ` David S. Miller
  1 sibling, 1 reply; 137+ messages in thread
From: Daniel Phillips @ 2002-03-16 15:24 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

On March 15, 2002 07:20 pm, Linus Torvalds wrote:
> In article <E16la2m-0000SX-00@starship>,
> Daniel Phillips  <phillips@bonn-fries.net> wrote:
> >On March 14, 2002 02:21 pm, Momchil Velikov wrote:
> >> 
> >> Out of curiousity, why there's a need to update the linux page tables ?
> >> Doesn't pte/pmd/pgd family functions provide enough abstraction in
> >> order to maintain _only_ the hashed page table ?
> >
> >No, it's hardwired to the x86 tree view of page translation.
> 
> No no no.
> 
> If you think that, then you don't see the big picture.

The statement itself is correct, however as for negative connotations, there 
aren't any intended.

I meant that the functions are hardwired to the tree structure, which they 
certainly are - each flavor of traversal is written out in full as a series 
of nested loops across the various tree levels.  Taking inventory of the 
existing abstractions:

  - Low level page table entry operations are per-architecture

  - Sizes of tables and entries are per-architecture

  - Two-level tables are cleverly made to appear as three-level tables

  - Hooks are sprinkled through the code as necessary to accommodate
    special per-architecture requirements, including possibly mapping 
    operations on the generic (x86-style) page table onto the arch's
    hardware page tables if necessary.

It could be a lot more abstract than that.  Chuck Cranor's UVM (which seems 
to bear some sort of filial relationship to the FreeBSD VM) buries all 
accesses to the page table behind a 'pmap' API, and implements the standard 
Unix VM semantics at the 'memory object' level.  In UVM the page table is 
just a cache of state encoded in memory objects and a means of communicating 
with the hardware.

In contrast, your design dives exuberantly straight into the page table tree 
to manipulate it directly, and relies on it as the primary repository of VM 
state information.  Unix VM semantics are implemented by a seemingly-naive 
combination of pte-copying and reference counting.  This approach is simple 
and robust, and in some cases outperforms UVM's structured high level 
approach, e.g., since there is less structural manipulation to do at fork 
time, Linux forking is faster, at least in the case that there are not too 
many mapped page tables in the parent.

When we get into forks from large memory processes the UVM approach beats us, 
as page table copying costs start to predominate.  At this point your 
approach of letting VM state live in the page table comes to the rescue, as I 
was able to extend it into a new way of implementing unix VM semantics 
efficiently, by sharing page tables instead of relying on memory objects.  
This is far simpler than the (to my mind, terrifying) memory object approach. 
I would probably not have had this insight if it were not for the way you 
designed the page table abstraction.  Or should I say, nonabstraction, 
because it's precisely the simplicity that allowed it to be extended in an 
interesting way.

So yes, I appreciate the elegance of the existing design.

> In fact, when I did the 3-level page tables for Linux, no x86 chips that
> could _use_ three levels actually existed.
> 
> The Linux MM was actually _designed_ for portability when I did the port
> to alpha (oh, that's a long time ago). I even wrote my masters thesis on
> why it was done the way it was done (the only actual academic use I ever
> got out of the whole Linux exercise ;)

Honorary doctorates don't count I suppose ;)

> Yes a tree-based page table matches a lot of hardware architectures very
> well.  And it's _not_ just x86: it also matches soft-fill TLB's better
> than alternatives (newer sparcs and MIPS), and matches a number of other
> architecture specifications (eg alpha, m68k). 
> 
> So on about 50% of architectures (and 99.9% of machines), the Linux MM
> data structures can be made to map 1:1 to the hardware constructs, so
> that you avoid duplicate information. 
> 
> But more importantly than that, the whole point really is that the page
> table tree as far as Linux is concerned is nothing but an _abstraction_
> of the VM mapping hardware. It so happens that a tree format is the only
> sane format to keep full VM information that works well with real loads.
> 
> Whatever the hardware actually does, Linux considers that to be noting
> but an extended TLB.  When you can make the MM software tree map 1:1
> with the extended TLB (as on x86), you win in memory usage and in
> cheaper TLB invalidates, but you _could_ (if you wanted to) just keep
> two separate trees.  In fact, with the rmap patches, that's exactly what
> you see: the software tree is _not_ 1:1 with the hardare tree any more
> (but it _is_ a proper superset, so that you can still get partial
> sharing and still get the cheaper TLB updates). 

I don't quite get your point.  Rmap is just an inverted index on the page 
table tree, not a separate tree.

> Are there machines where the sharing between the software abstraction
> and the hardware isn't as total? Sure. But if you actually know how
> hashed page tables work on ppc, you'd be aware of the fact that they
> aren't actually able to do a full VM mapping - when a hash chain gets too
> long, the hardware is no longer able to look it up ("too long" being 16
> entries on a PPC, for example).

And I suppose you have to take extra faults then to sort this out, and evict 
something to make room in the address space.  But if more than seven 
colliding entries are in the working set, ick.

I hadn't looked at PPC VM architecture before, and now I've taken a cursory 
look.  I understand the motivation for hashed page tables, that is, to 
restrict the mapping overhead to what is actually mapped, flattening out the 
structure and speeding up tlb reloads.

In ten years when terabyte memories are common at the high end (fifteen years 
for Joe Average's PC) we really will have to worry about such things, not so 
much because it won't fit into the existing model - it will - but because the 
existing model is not necessarily optimal.  Somehow I don't think hashing is 
either, personally I hold out more hope for an extent-based approach as the 
ultimate winner.

The problem being solved in any case is: how to massage the page table tree 
without hitting too many cache lines.  I guess we've got a few years to think 
about that one, and for the time being, the current approach is perfectly 
serviceable.

> And that's a common situation with non-tree VM representations - they
> aren't actually VM representations, they are just caches of what the
> _real_ representation is.  And what do we call such caches? Right: they
> are nothing but a TLB. 
> 
> So the fact is, the Linux tree-based VM has _nothing_ to do with x86
> tree-basedness, and everything to do with the fact that it's the only
> sane way to keep VM information. 
> 
> The fact that it maps 1:1 to the x86 trees with the "folding" of the mid
> layer was a design consideration, for sure.  Being efficient and clever
> is always good.  But the basic reason for tree-ness lies elsewhere. 
> (The basic reason for tree-ness is why so many architectures _do_ use a
> tree-based page table - you should think of PPC and ia64 as the sick
> puppies who didn't understand.  Read the PPC documentation on virtual
> memory, and you'll see just _how_ sick they are). 
>
> 			Linus

/me adds 'read the PPC VM specs' to his too-long list of things to do

-- 
Daniel

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 11:55         ` Paul Mackerras
@ 2002-03-16 17:25           ` Rik van Riel
  2002-03-16 17:57           ` yodaiken
  2002-03-16 18:06           ` Linus Torvalds
  2 siblings, 0 replies; 137+ messages in thread
From: Rik van Riel @ 2002-03-16 17:25 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linus Torvalds, linux-kernel

On Sat, 16 Mar 2002, Paul Mackerras wrote:
> Linus Torvalds writes:
>
> > But more importantly than that, the whole point really is that the page
> > table tree as far as Linux is concerned is nothing but an _abstraction_
> > of the VM mapping hardware. It so happens that a tree format is the only
> > sane format to keep full VM information that works well with real loads.
>
> Is that still true when we get to wanting to support a full 64-bit
> address space?  Given that we can already tolerate losing PTEs for
> resident pages from the page tables quite happily (since they can be
> reconstructed from the information in the vm_area_structs and the page
> cache), I don't see that the fact that a hash table will sometimes
> lose PTEs because of a hash bucket filling up is all that much of a
> problem.

Indeed, the VM basically has 2 components in this area:

1) the TLB information and possibly an extended TLB in RAM

2) the information needed to construct (1), which could be
   either page tables or VMAs and page cache metadata

On most architectures there is some overlap between (1) and (2),
but on eg. mmap() we never store the complete info on the file
in the page tables but build that _on the fly_ as we page fault
along.

Having said that, obviously we do need a way to store the info
needed to construct (1) somewhere, and anonymous pages don't fit
into the pagecache cleanly because of COW and MAP_PRIVATE semantics.

> IMHO it would be interesting to compare the size and complexity of
> using a hash table for the page tables with a 5-level tree.  For a
> 32-bit address space I think the tree wins hands down but for a full
> 64-bit address space I am not convinced either way at present.

This is a good question, especially considering the fact that for
databases page table overhead is already bogging us down on 32-bit
systems.

Reconstructing hash table entries directly from VMA + page cache
might just be more efficient for PPC in this scenario; what would
be best for other architectures I really don't know.

regards,

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
                     ` (2 preceding siblings ...)
  2002-03-16 11:04   ` Paul Mackerras
@ 2002-03-16 17:37   ` Martin J. Bligh
  2002-03-17  1:45     ` Keith Owens
                       ` (2 more replies)
  2002-03-16 18:57   ` Daniel Egger
  4 siblings, 3 replies; 137+ messages in thread
From: Martin J. Bligh @ 2002-03-16 17:37 UTC (permalink / raw)
  To: Anton Blanchard, lse-tech; +Cc: linux-kernel

> I think I'm addicted. I need help!

Well, you're not going to get much competition, so maybe help
would be more in order ;-) ;-)

Are you still doing something like this?
# MAKE="make -j14" /usr/bin/time make -j14 bzImage

I tried setting the MAKE variable as well as doing the -j,
but it actually made kernel compile time slower - what difference
does it make on your machine? Can somebody clarify what this
actually does, as opposed to the -j on the command line?

BTW - the other tip that was in the big book of whizzy kernel
compiles was to set gcc to use -pipe ... you might want to try
that.

How much of that 7.52 seconds are you spending in the final
single-threaded link & compress phase?

Thanks,

M.



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 11:55         ` Paul Mackerras
  2002-03-16 17:25           ` Rik van Riel
@ 2002-03-16 17:57           ` yodaiken
  2002-03-16 18:06           ` Linus Torvalds
  2 siblings, 0 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 17:57 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linus Torvalds, linux-kernel

On Sat, Mar 16, 2002 at 10:55:40PM +1100, Paul Mackerras wrote:
> Is that still true when we get to wanting to support a full 64-bit
> address space?  Given that we can already tolerate losing PTEs for
> resident pages from the page tables quite happily (since they can be
> reconstructed from the information in the vm_area_structs and the page
> cache), I don't see that the fact that a hash table will sometimes
> lose PTEs because of a hash bucket filling up is all that much of a
> problem.  (We would need to find some other way of dealing with swap
> entries of course.)

The basic problem with hash tables is not that they lose data, even though
that is disgusting; the problem is that they delocalize mm data:

	reference page n
	reference page n+1
	...
	reference page n+k

With a page table design, referencing page n's PTE on a miss almost
certainly brings PTE n+1 into the cache, so the next TLB miss does
not cause a cache miss: the PTE list is a nice chunk of related information.
But the hash table design means that consecutive TLB misses scatter over
a hash table, and the cache is filled with page entries that are not
useful. Even uglier:
	TLB miss
	hash table walk where every reference is a cache miss
	and we fill the cache up with crap.
	the pte is not in the hash table, now "reconstruct"

I've yet to see any plausible argument that going to 64 bit can do anything
but make this a whole lot worse. Maybe you've seen one?
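
To make the locality point concrete, here is a toy comparison (the hash
mixer and the sizes are made up for illustration, not the real PPC
hash): in a linear or tree-backed table the PTEs for pages n and n+1 sit
8 bytes apart, so one cache line covers a whole run of them, while in a
hashed table the slot for page n+1 is essentially a random other cache
line:

	#include <stdint.h>

	#define PTE_SIZE     8            /* assumed 8-byte PTEs        */
	#define CACHE_LINE   128          /* assumed cache line size    */
	#define HASH_GROUPS  (1u << 20)   /* assumed hash table size    */

	/* Linear/tree layout: consecutive pages have adjacent PTEs, so
	 * CACHE_LINE / PTE_SIZE = 16 of them share one cache line. */
	static uint64_t pte_addr_linear(uint64_t base, uint64_t vpn)
	{
		return base + vpn * PTE_SIZE;
	}

	/* Hashed layout: the group index is a hash of the virtual page
	 * number, so vpn and vpn+1 usually land in unrelated lines and
	 * each miss drags in a cache line full of strangers. */
	static uint64_t pte_addr_hashed(uint64_t base, uint64_t vpn)
	{
		uint64_t h = vpn * 0x9E3779B97F4A7C15ull;
		return base + (h % HASH_GROUPS) * CACHE_LINE;
	}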

In fact, I'm waiting for some hardware engineer to finally realize that
with extent file systems/disks and huge memories, the PDP11 base/limit
architecture is going to have a good chance of outperforming pages.
Pages and hash tables are a solution to the problem of memory fragmentation.
When you have a 4G memory, do you really care so much?


> IMHO it would be interesting to compare the size and complexity of
> using a hash table for the page tables with a 5-level tree.  For a

Why would you need a 5-level tree? Even three levels seems like overdoing it
to me. Make the directory pages and target pages bigger. With 4M pages,
each process may waste 12M (if it's using 1 byte each on the last page
of stack, code, and data). If that's a problem, you don't need a 64-bit
memory space.


> 32-bit address space I think the tree wins hands down but for a full
> 64-bit address space I am not convinced either way at present.


> 
> Paul.

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 11:55         ` Paul Mackerras
  2002-03-16 17:25           ` Rik van Riel
  2002-03-16 17:57           ` yodaiken
@ 2002-03-16 18:06           ` Linus Torvalds
  2002-03-16 18:35             ` yodaiken
  2002-03-16 20:53             ` Alan Cox
  2 siblings, 2 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 18:06 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel


On Sat, 16 Mar 2002, Paul Mackerras wrote:
> 
> > But more importantly than that, the whole point really is that the page
> > table tree as far as Linux is concerned is nothing but an _abstraction_
> > of the VM mapping hardware. It so happens that a tree format is the only
> > sane format to keep full VM information that works well with real loads.
> 
> Is that still true when we get to wanting to support a full 64-bit
> address space?

Yup.

We'll end up (probably five years from now) re-doing the thing to allow 
four levels (so a tired old x86 would fold _two_ levels instead of just 
one, but I bet they'll still be the majority), simply because with three 
levels you reasonably reach only about 41 bits of VM space.

With four levels you get 50+ bits of VM space, and since the kernel etc 
wants a few bits, we're just in the right range.

>		  Given that we can already tolerate losing PTEs for
> resident pages from the page tables quite happily (since they can be
> reconstructed from the information in the vm_area_structs and the page
> cache)

Wrong. Look again. The most common case of all (anonymous pages) can NOT 
be reconstructed.

You're making the same mistake IBM did originally.

If it needs reconstructing, it's a TLB. 

And if it is a TLB, then it shouldn't be so damn big in the first place, 
because then you get horrible overhead for flushing.

An in-memory TLB is fine, but it should be understood that that is _all_
that it is. You can make the in-memory TLB be a tree if you want to, but
if it depends on reconstructing then the tree is pointless - you might as
well use something that isn't able to hold the whole address space in the 
first place.

And a big TLB (whether tree-based or hashed or whatever) is bad if it is so 
big that building it up and tearing it down takes a noticeable amount of 
time. Which it obviously does on PPC64 - numbers talk.

What IBM should do is 

 - face up to their hashes being so big that building them up is a real 
   performance problem. It was ok for long-running fortran and database 
   programs, but it _sucks_ for any other load.

 - make a nice big on-chip L2 TLB to make their legacy stuff happy (the
   same legacy stuff that is so slow at filling the TLB in software that
   they needed the humungous hashtables in the first place).

Repeat after me: there are TLB's (reconstructive caches) and there are 
page tables (real VM information). Get them straight.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16 11:04   ` Paul Mackerras
@ 2002-03-16 18:32     ` Linus Torvalds
  2002-03-17  2:00     ` Paul Mackerras
  2002-03-18 19:37     ` Cort Dougan
  2 siblings, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 18:32 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel


On Sat, 16 Mar 2002, Paul Mackerras wrote:
> 
> Your suggestion has the problem that when you get to needing to reuse
> one of the VSIDs that you have thrown away, it becomes very difficult
> and expensive to ensure that there aren't any stale hash table entries
> left around for that VSID - particularly on a system with logical
> partitioning where we don't control the size of the hash table.

But the VSID is something like 20 bits, no? So the re-use is a fairly 
uncommon thing, in the end.

Remember: think about the hashes as just TLB's, and the VSID's are just 
address space identifiers (yeah, yeah, you can have several VSID's per 
process at least in 32-bit mode, I don't remember the 64-bit thing). So 
what you do is the same thing alpha does with its 6-bit ASN thing: when 
you wrap around, you blast the whole TLB to kingdom come.

The alpha wraps around a lot more often with just 6 bits, but on the other 
hand it's a lot cheaper to get rid of the TLB too, so it evens out.

Yeah, there are latency issues, but that can be handled by just switching
the hash table base: you have two hash tables, and whenever you increment
the VSID you clear a small part of the other table, designed so that when
the VSID wraps around the other table is 100% clear, and you just switch
the two.

You _can_ switch the hash table base around on ppc64, can't you?

So now the VM invalidate becomes

	vsid++;
	partial_clear_secondary_hash();
	if (vsid > MAXVSID) {
		vsid = 0;
		switch_hashes();
	}

> > just bypass it altogether (at least the 604e used to be able to just
> > disable the stupid hashing altogether and make the whole thing much
> > saner). 
> 
> That was the 603, actually.

Ahh, my mind is going.

>			  In fact the newest G4 processors also let
> you do this.  When I get hold of a machine with one of these new G4
> chips I'm going to try it again and see how much faster it goes
> without the hash table.

Maybe somebody is seeing the light.

> One other thing - I would *love* it if we could get rid of
> flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
> _all_ of the flush_tlb_* functions tell us what address(es) we need to
> invalidate, and let the architecture code decide whether a complete
> TLB flush is justified.

Sure, sounds reasonable.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 18:06           ` Linus Torvalds
@ 2002-03-16 18:35             ` yodaiken
  2002-03-16 18:45               ` Linus Torvalds
  2002-03-16 20:53             ` Alan Cox
  1 sibling, 1 reply; 137+ messages in thread
From: yodaiken @ 2002-03-16 18:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 10:06:06AM -0800, Linus Torvalds wrote:
> We'll end up (probably five years from now) re-doing the thing to allow 
> four levels (so a tired old x86 would fold _two_ levels instead of just 
> one, but I bet they'll still be the majority), simply because with three 
> levels you reasonably reach only about 41 bits of VM space.

Why so few bits per level? Don't you want bigger pages or page clusters?


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 18:35             ` yodaiken
@ 2002-03-16 18:45               ` Linus Torvalds
  2002-03-16 18:57                 ` yodaiken
  0 siblings, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 18:45 UTC (permalink / raw)
  To: yodaiken; +Cc: Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:

> On Sat, Mar 16, 2002 at 10:06:06AM -0800, Linus Torvalds wrote:
> > We'll end up (probably five years from now) re-doing the thing to allow 
> > four levels (so a tired old x86 would fold _two_ levels instead of just 
> > one, but I bet they'll still be the majority), simply because with three 
> > levels you reasonably reach only about 41 bits of VM space.
> 
> Why so few bits per level? Don't you want bigger pages or page clusters?

Simply because I want to be able to share the software page tables with 
the hardware page tables.

Not sharing the page tables implies that you have to have two copies and 
keep them in sync, which is almost certainly going to make fork/exec just 
suck badly.

Now I agree with you that in the long range all the hardware will just use
a 64k pagesize or we'll have extents or whatever.  I'm not saying that
trees are the only good way to populate a VM, it's just the currently
dominant way, and as long as it's the dominant way I want to have the 
common mapping be the 1:1 mapping.

			Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 18:45               ` Linus Torvalds
@ 2002-03-16 18:57                 ` yodaiken
  2002-03-16 19:16                   ` Linus Torvalds
  2002-03-16 19:43                   ` David Mosberger
  0 siblings, 2 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 18:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 10:45:46AM -0800, Linus Torvalds wrote:
> 
> On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> 
> > On Sat, Mar 16, 2002 at 10:06:06AM -0800, Linus Torvalds wrote:
> > > We'll end up (probably five years from now) re-doing the thing to allow 
> > > four levels (so a tired old x86 would fold _two_ levels instead of just 
> > > one, but I bet they'll still be the majority), simply because with three 
> > > levels you reasonably reach only about 41 bits of VM space.
> > 
> > Why so few bits per level? Don't you want bigger pages or page clusters?
> 
> Simply because I want to be able to share the software page tables with 
> the hardware page tables.

Isn't this only an issue when the hardware wants to search the tables?
So for a semi-sane architecture, the hardware idea of a PTE is only important
in the TLB.

Is there a 64-bit machine with hardware search of page tables? Even IBM
only has a hardware search of hash tables - which we agree are simply
a means of making your hardware TLB larger and slower.





-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16  6:15 ` 7.52 " Anton Blanchard
                     ` (3 preceding siblings ...)
  2002-03-16 17:37   ` [Lse-tech] " Martin J. Bligh
@ 2002-03-16 18:57   ` Daniel Egger
  2002-03-17  8:18     ` Mike Galbraith
  4 siblings, 1 reply; 137+ messages in thread
From: Daniel Egger @ 2002-03-16 18:57 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

On Sat, 2002-03-16 at 18:37, Martin J. Bligh wrote:

> BTW - the other tip that was in the big book of whizzy kernel
> compiles was to set gcc to use -pipe ... you might want to try
> that.

Interestingly -pipe doesn't give any measurable performance increases or
even leads to a minor decrease in compile speed in my latest tests on
bigger projects like the linux kernel or GIMP. I suspect that's because
of the caching nature of today's systems: the temporary products are
cached in memory and likely never end up on a drive because they're
read and removed before the filesystem decides to physically
write the data.

I also benchmarked tmpfs mounts and it demonstrated - to my surprise -
small advantages slightly above the noise range; I suspect this is due
to the way it handles files in memory.
 
-- 
Servus,
       Daniel


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 15:24           ` Daniel Phillips
@ 2002-03-16 19:01             ` Linus Torvalds
  2002-03-16 22:25               ` Daniel Phillips
  0 siblings, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 19:01 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel


On Sat, 16 Mar 2002, Daniel Phillips wrote:
> 
> I meant that the functions are hardwired to the tree structure, which they 
> certainly are

Oh yes.

Sure, you can abstract the VM stuff much more - and many people do, to the 
point of actually having a per-architecture VM with very little shared 
information.

The thing I like about the explicit tree is that while it _is_ an abstract 
data structure, it's also a data structure that people are very aware of 
how it maps to the actual hardware, which means that the abstraction 
doesn't come with a performance penalty. 

(There are two kinds of performance penalties in abstractions: (a) just 
the translation overhead for compilers etc, and (b) the _mental_ overhead 
of doing the wrong thing because you don't think of what it actually means 
in terms of hardware).

Now, the linux tree abstraction is obviously _so_ close to a common set of
hardware that many people don't realize at all that it's really meant to
be an abstraction (albeit one with a good mapping to reality).

> It could be a lot more abstract than that.  Chuck Cranor's UVM (which seems 
> to bear some sort of filial relationship to the FreeBSD VM) buries all 
> accesses to the page table behind a 'pmap' API, and implements the standard 
> Unix VM semantics at the 'memory object' level.

Who knows, maybe we'll change the abstraction in Linux some day too.. 
However, I personally tend to prefer "thin" abstractions that don't hide 
details.

The problem with the thick abstractions ("high level") is that they often
lead you down the wrong path. You start thinking that it's really cheap to
share partial address spaces etc ("hey, I just map this 'memory object'
into another process, and it's just a matter of one linked list operation
and incrementing a reference count").

Until you realize that the actual sharing still implies a TLB switch 
between the two "threads", and that you need to instantiate the TLB in 
both processes etc. And suddenly that instantiation is actually the _real_ 
cost - and your clever highlevel abstraction was actually a lot more 
expensive than you realized.

[ Side note: I'm very biased by reality. In theory, a non-page-table based 
  approach which used only a front-side TLB and a fast lookup into higher- 
  level abstractions might be a really nice setup. However, in practice, 
  the world is 99%+ based on hardware that natively looks up the TLB in a 
  tree, and is really good at it too.  So I'm biased. I'd rather do good 
  on the 99% than care about some theoretical 1% ]

			Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 18:57                 ` yodaiken
@ 2002-03-16 19:16                   ` Linus Torvalds
  2002-03-16 19:53                     ` yodaiken
  2002-03-27  1:07                     ` Richard Henderson
  2002-03-16 19:43                   ` David Mosberger
  1 sibling, 2 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 19:16 UTC (permalink / raw)
  To: yodaiken; +Cc: Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> > 
> > Simply because I want to be able to share the software page tables with 
> > the hardware page tables.
> 
> Isn't this only an issue when the hardware wants to search the tables?
> So for a semi-sane architecture, the hardware idea of pte is only important
> in the tlb.

Show me a semi-sane architecture that _matters_ from a commercial angle.

The x86 is actually fairly good: a sane data structure that allows it to
fill multiple pages in one go (the page size may be just 4kB, but the x86
TLB fill rate is pretty impressive - I _think_ Intel actually fills a
whole cacheline worth of tlb entries - 8 pages - per miss).

But the x86 page table structure is fairly rigid, and is in practice 
limited to 4kB entries for normal user pages, and 4kB page table entries.

> is there a 64 bit machine with hardware search of pagetables? Even ibm
> only has a hardware search of hash tables - which we agree are simply
> a means of making your hardware TLB larger and slower.

ia64 does the same mistake, I think. 

But alpha does a "pseudo-hardware" fill of page tables, ie as far as the
OS is concerned you might as well consider it hardware. And that is
actually limited to 8kB pages (with a "fast fill" feature in the form of
page size hints - a cheaper version of what Intel seems to do).

The upcoming hammer stuff from AMD is also 64-bit, and apparently a
four-level page table, each with 512 entries and 4kB pages. So there you
get 9+9+9+9+12=48 bits of VM space, which is plenty. Linux won't be able
to take advantage of more than 39 bits of it until we switch to four
levels, of course (39 bits is plenty good enough too, for the next few
years, and we'll have no pain in expanding to 48 when that day comes).
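
As a back-of-envelope sketch of that split (the field names are
illustrative; the assumption is 4kB pages and 512 eight-byte entries per
level, i.e. 9+9+9+9+12 = 48 bits):

	#include <stdint.h>

	#define PAGE_SHIFT  12
	#define LEVEL_BITS   9
	#define LEVEL_MASK ((1u << LEVEL_BITS) - 1)

	struct va_split {
		unsigned l4, l3, l2, l1;  /* indices into the four levels */
		unsigned offset;          /* offset within the 4kB page   */
	};

	static struct va_split split_va(uint64_t va)
	{
		struct va_split s;
		s.offset = va & ((1u << PAGE_SHIFT) - 1);
		s.l1 = (va >> PAGE_SHIFT) & LEVEL_MASK;
		s.l2 = (va >> (PAGE_SHIFT + 1 * LEVEL_BITS)) & LEVEL_MASK;
		s.l3 = (va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK;
		s.l4 = (va >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK;
		return s;
	}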

So yes, there are 64-bit chips with hardware (or architecture-specified) 
page tables. And I personally like how Hammer looks more than the ia64 VM 
horror.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 18:57                 ` yodaiken
  2002-03-16 19:16                   ` Linus Torvalds
@ 2002-03-16 19:43                   ` David Mosberger
  2002-03-16 19:58                     ` Linus Torvalds
  2002-03-16 20:36                     ` David Mosberger
  1 sibling, 2 replies; 137+ messages in thread
From: David Mosberger @ 2002-03-16 19:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Paul Mackerras, linux-kernel

>>>>> On Sat, 16 Mar 2002 11:16:16 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:

  >> is there a 64 bit machine with hardware search of pagetables?
  >> Even ibm only has a hardware search of hash tables - which we
  >> agree are simply a means of making your hardware TLB larger and
  >> slower.

  Linus> ia64 does the same mistake, I think.

ia64 has an optional hardware walker which can operate in "hashed"
mode or in "virtually mapped linear page table mode".  If you think
you can do a TLB lookup faster in software, you can turn the walker
off.  Our experience so far is that the hw walker does help
performance significantly.  This is partly because it allows CPU
designers to play some nice tricks, which you can't do once the miss
is exposed to software.  Also, since it's defined as an optional
feature, the hardware doesn't have to deal with the difficult corner
cases.  If it gets "overwhelmed" for one reason or another, it can
simply throw up its hands and raise a TLB miss fault.

Anyhow, at the moment ia64 linux operates the hardware walker in the
virtually mapped linear page table mode, which allows us to use the
normal Linux page tables for the hardware walker.  However, I think
it's quite possible (perhaps even quite likely) that at some time
during the 2.5 cycle we'll switch the hardware walker into hashed
mode.  At that point, the hardware walker would simply operate as
large in-core TLB.  If Linux had a more flexible page table
abstraction, we could treat the in-core TLB as the primary page table,
but quite frankly, it's not clear at all to me whether and how much of
a win this would be.

	--david
--
Interested in learning more about IA-64 Linux?  Try http://www.lia64.org/book/

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:16                   ` Linus Torvalds
@ 2002-03-16 19:53                     ` yodaiken
  2002-03-16 20:02                       ` Linus Torvalds
  2002-03-27  1:07                     ` Richard Henderson
  1 sibling, 1 reply; 137+ messages in thread
From: yodaiken @ 2002-03-16 19:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 11:16:16AM -0800, Linus Torvalds wrote:
> Show me a semi-sane architecture that _matters_ from a commercial angle.

I thought we were into this for the pure technical thrill :-)

> > is there a 64 bit machine with hardware search of pagetables? Even ibm
> > only has a hardware search of hash tables - which we agree are simply
> > a means of making your hardware TLB larger and slower.
> 
> ia64 does the same mistake, I think. 

I finally let myself read part of the Hammer spec - and it's got that 4-level
setup, except for 2MB pages where it is 3-level.

> page tables. And I personally like how Hammer looks more than the ia64 VM 
> horror.

No kidding. But  I want TLB load instructions. 


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:43                   ` David Mosberger
@ 2002-03-16 19:58                     ` Linus Torvalds
  2002-03-16 20:08                       ` yodaiken
  2002-03-16 20:36                     ` David Mosberger
  1 sibling, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 19:58 UTC (permalink / raw)
  To: davidm; +Cc: yodaiken, Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002, David Mosberger wrote:
> 
> ia64 has an optional hardware walker which can operate in "hashed"
> mode or in "virtually mapped linear page table mode".  If you think
> you can do a TLB lookup faster in software, you can turn the walker
> off.

I used to be a sw fill proponent, but I've grown personally convinced that 
while sw fill is good, it needs a few things:

 - large on-chip TLB to avoid excessive thrashing (ie preferably thousands
   of entries)

   This implies that the TLB should be split into a L1 and a L2, for all 
   the same reasons you split other caches that way (and with the L1
   probably being duplicated among all memory units)

 - ability to fill multiple entries in one go to offset the cost of taking 
   the trap.

Without that kind of support, the flexibility advantages of a sw fill just 
isn't enough to offset the advantage you can get from doing it in 
hardware (mainly the ability to not have to break your pipeline).

An in-memory hash table can of course be that L2, but I have this strong
suspicion that a forward-looking chip engineer would just have put the L2
on the die and made it architecturally invisible (so that moore's law can
trivially make it bigger in years to come).

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:53                     ` yodaiken
@ 2002-03-16 20:02                       ` Linus Torvalds
  2002-03-16 20:25                         ` yodaiken
  0 siblings, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 20:02 UTC (permalink / raw)
  To: yodaiken; +Cc: Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> On Sat, Mar 16, 2002 at 11:16:16AM -0800, Linus Torvalds wrote:
> > Show me a semi-sane architecture that _matters_ from a commercial angle.
> 
> I thought we were into this for the pure technical thrill :-)

I don't know about you, but to me the difference between technological
thrill and masturbation is that real technology actually matters to real
people.

I'm not in it for some theoretical good. I want my code to make _sense_.

> > page tables. And I personally like how Hammer looks more than the ia64 VM 
> > horror.
> 
> No kidding. But  I want TLB load instructions. 

TLB load instructions + hardware walking just do not make much sense if 
you allow the loaded entries to be victimized.

Of course, you can have a separate "lock this TLB entry that I give you" 
thing, which can be useful for real-time, and can also be useful for 
having per-CPU data areas. 

But then you might as well consider that a BAT register ("block address 
translation", ppc has those too), and separate from the TLB.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:58                     ` Linus Torvalds
@ 2002-03-16 20:08                       ` yodaiken
  2002-03-16 20:23                         ` Linus Torvalds
  0 siblings, 1 reply; 137+ messages in thread
From: yodaiken @ 2002-03-16 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: davidm, yodaiken, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 11:58:22AM -0800, Linus Torvalds wrote:
>    This implies that the TLB should be split into a L1 and a L2, for all 
>    the same reasons you split other caches that way (and with the L1
>    probably being duplicated among all memory units)

AMD claims L1, L2 and with hammer an
I/D split as well. But no TLB load instruction as
far as I can tell.


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:08                       ` yodaiken
@ 2002-03-16 20:23                         ` Linus Torvalds
  0 siblings, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 20:23 UTC (permalink / raw)
  To: yodaiken; +Cc: davidm, Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> 
> AMD claims L1, L2 and with hammer an I/D split as well.

Oh, people have done L1/L2 TLB splits for a long time. The two-level TLB
exists in Athlon (and I think nexgen did it in the x86 space almost 10
years ago, and that's probably what got AMD into that game). Others have 
done it too.

And people have done split TLB's (I/D split is quite common, duplicated by
memory unit is getting so).

But multiple entries loaded at a time?

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:02                       ` Linus Torvalds
@ 2002-03-16 20:25                         ` yodaiken
  0 siblings, 0 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 20:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 12:02:59PM -0800, Linus Torvalds wrote:
> 
> On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> > On Sat, Mar 16, 2002 at 11:16:16AM -0800, Linus Torvalds wrote:
> > > Show me a semi-sane architecture that _matters_ from a commercial angle.
> > 
> > I thought we were into this for the pure technical thrill-)
> 
> I don't know about you, but to me the difference between technological
> thrill and masturbation is that real technology actually matters to real
> people.

Beyond me. Some kind of sophisticated California thing that us
poor folks in New Mexico can hardly imagine, I suppose. 

> 
> I'm not in it for some theoretical good. I want my code to make _sense_.
> 
> > > page tables. And I personally like how Hammer looks more than the ia64 VM 
> > > horror.
> > 
> > No kidding. But  I want TLB load instructions. 
> 
> TLB load instructions + hardware walking just do not make much sense if 
> you allow the loaded entries to be victimized.

If you have TLB load, you can sabotage hw walking and at least see whether
you can beat it. I think it could be done, because the OS could adapt to
the characteristics of the process - using perhaps one mm layout for 
kde applets and a different one for oracle ...


> Of course, you can have a separate "lock this TLB entry that I give you" 
> thing, which can be useful for real-time, and can also be useful for 
> having per-CPU data areas. 
> 
> But then you might as well consider that a BAT register ("block address 
> translation", ppc has those too), and separate from the TLB.

Bats are a good start. What I'd like is also a "small unpaged
process base/limit" set of registers or two.


---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:43                   ` David Mosberger
  2002-03-16 19:58                     ` Linus Torvalds
@ 2002-03-16 20:36                     ` David Mosberger
  2002-03-16 20:46                       ` Linus Torvalds
  2002-03-17  1:09                       ` Paul Mackerras
  1 sibling, 2 replies; 137+ messages in thread
From: David Mosberger @ 2002-03-16 20:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: davidm, yodaiken, Paul Mackerras, linux-kernel

>>>>> On Sat, 16 Mar 2002 11:58:22 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:

  Linus> I used to be a sw fill proponent, but I've grown personally
  Linus> convinced that while sw fill is good, it needs a few things:

Glad to see you're coming around! ;-)

  Linus>  - large on-chip TLB to avoid excessive thrashing (ie
  Linus> preferably thousands of entries)

  Linus>    This implies that the TLB should be split into a L1 and a
  Linus> L2, for all the same reasons you split other caches that way
  Linus> (and with the L1 probably being duplicated among all memory
  Linus> units)

Yes, Itanium has a two-level DTLB, McKinley has both ITLB and DTLB
split into two levels.  Not quite as big though: "only" on the order
of hundreds of entries (partially offset by larger page sizes).  Of
course, operating the hardware walker in hashed mode can give you an
L3 TLB as large as you want it to be.

  Linus>  - ability to fill multiple entries in one go to offset the
  Linus> cost of taking the trap.

The software fill can definitely do that.  I think it's one area where
some interesting experimentation could happen.

	--david

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:36                     ` David Mosberger
@ 2002-03-16 20:46                       ` Linus Torvalds
  2002-03-17  1:09                       ` Paul Mackerras
  1 sibling, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 20:46 UTC (permalink / raw)
  To: davidm; +Cc: yodaiken, Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002, David Mosberger wrote:
> 
> Yes, Itanium has a two-level DTLB, McKinley has both ITLB and DTLB
> split into two levels.  Not quite as big though: "only" on the order
> of hundreds of entries (partially offset by larger page sizes).  Of
> course, operating the hardware walker in hashed mode can give you an
> L3 TLB as large as you want it to be.

The problem with caches is that if they are not coherent (and TLB's
generally aren't) you need to invalidate them by hand. And if they are 
in main memory, that invalidation can be expensive.

Which brings us back to the whole reason for the discussion: this is not a 
theoretical argument. Look at the POWER4 numbers, and _shudder_ at the 
expense of cache invalidation.

NOTE! The goodness of a cache is not in its size, but how quickly you can 
fill it, and what the hitrate is. I'd be very surprised if you get 
noticeably higher hitrates from "as large as you want it to be" than from 
"a few thousand entries that trivially fit on the die".

And I will guarantee that the on-die ones are faster to fill, and much
faster to invalidate (on-die it is fairly easy to do content-
addressability if you limit the addressing to just a few ways - off-chip 
memory is not).

>   Linus>  - ability to fill multiple entries in one go to offset the
>   Linus> cost of taking the trap.
> 
> The software fill can definitely do that.  I think it's one area where
> some interesting experimentation could happen.

If you can do it, and you don't do it already, you're just throwing away
cycles. If that was your comparison with the "superior hardware fill", it
really wasn't very fair.

Note that by "multiple entry support" I don't mean just a loop that adds 
noticeable overhead for each entry - I mean something which can fairly 
efficiently load contiguous entries pretty much in "one go". A TLB fill 
routine can't afford to spend time setting up tag registers etc.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 18:06           ` Linus Torvalds
  2002-03-16 18:35             ` yodaiken
@ 2002-03-16 20:53             ` Alan Cox
  1 sibling, 0 replies; 137+ messages in thread
From: Alan Cox @ 2002-03-16 20:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul Mackerras, linux-kernel

> We'll end up (probably five years from now) re-doing the thing to allow 
> four levels (so a tired old x86 would fold _two_ levels instead of just 
> one, but I bet they'll still be the majority), simply because with three 
> levels you reasonably reach only about 41 bits of VM space.

If you use ridiculously small page sizes. If your page size is 64K, which
is an awful lot saner for a big machine, then three levels is just fine.
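
A quick back-of-envelope check of that, assuming 8-byte PTEs: a 64K page
holds 8192 entries, i.e. 13 bits of index per level, so three levels
plus the 16-bit page offset reach 16 + 3*13 = 55 bits of virtual space.

	#include <stdio.h>

	int main(void)
	{
		int page_shift = 16;              /* 64K pages           */
		int pte_shift  = 3;               /* assumed 8-byte PTEs */
		int level_bits = page_shift - pte_shift;   /* 13 bits    */
		int levels     = 3;

		/* page offset bits + three levels of index bits */
		printf("%d bits of VM space\n",
		       page_shift + levels * level_bits);     /* 55      */
		return 0;
	}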

Alan

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:01             ` Linus Torvalds
@ 2002-03-16 22:25               ` Daniel Phillips
  2002-03-19 16:35                 ` Bill Davidsen
  0 siblings, 1 reply; 137+ messages in thread
From: Daniel Phillips @ 2002-03-16 22:25 UTC (permalink / raw)
  To: Linus Torvalds, Daniel Phillips; +Cc: linux-kernel

On March 16, 2002 08:01 pm, Linus Torvalds wrote:
> On Sat, 16 Mar 2002, Daniel Phillips wrote:
> > It could be a lot more abstract than that.  Chuck Cranor's UVM (which 
> > seems to bear some sort of filial relationship to the FreeBSD VM) buries 
> > all  accesses to the page table behind a 'pmap' API, and implements the 
> > standard Unix VM semantics at the 'memory object' level.
> 
> Who knows, maybe we'll change the abstraction in Linux some day too.. 
> However, I personally tend to prefer "thin" abstractions that don't hide 
> details.
> 
> The problem with the thick abstractions ("high level") is that they often
> lead you down the wrong path. You start thinking that it's really cheap to
> share partial address spaces etc ("hey, I just map this 'memory object'
> into another process, and it's just a matter of one linked list operation
> and incrementing a reference count").

My opinion, which I implied in the previous post but didn't state in so many 
words, is that the whole Real Unix crowd - Chuck Cranor and Matt Dillon, Sun, 
SGI and IBM etc - got off on the wrong track with respect to implementing 
Unix VM semantics, and that we will achieve all the design goals they set for 
themselves in a simpler, more efficient way.  (That is, assuming I ever 
finish debugging the page table sharing[1] and extend it to shared mmaps.)  I 
attribute that whole wrong turn to a too-heavy abstraction of the page table, 
distracting the eye from the observation that the page table itself provides 
sufficient state to do the same job as memory objects.  I'm curious to hear 
Matt's opinion on that by the way, I have to go bother him about this.

> Until you realize that the actual sharing still implies a TLB switch 
> between the two "threads", and that you need to instantiate the TLB in 
> both processes etc. And suddenly that instantiation is actually the _real_ 
> cost - and your clever highlevel abstraction was actually a lot more 
> expensive than you realized.

Well, I don't have any problem with the TLB cost being hidden; what bothers me 
is the complexity of the mechanism required to make the abstraction work.  
Sort-of work, I mean - just google 'all-shadowed case' to see one nasty 
difficulty.

> [ Side note: I'm very biased by reality. In theory, a non-page-table based 
>   approach which used only a front-side TLB and a fast lookup into higher- 
>   level abstractions might be a really nice setup. However, in practice, 
>   the world is 99%+ based on hardware that natively looks up the TLB in a 
>   tree, and is really good at it too.  So I'm biased. I'd rather do good 
>   on the 99% than care about some theoretical 1% ]

It breaks down somewhat as virtual memory range goes way beyond 4GB.  
There's the relatively minor issue of extra levels of tree traversal, 
currently limited to 4 by AMD's architecture but not so limited on other 
architectures.  A bigger problem is what to do about internal fragmentation 
in the page table tree, say if somebody mmaps a 2 TB sparse file, then writes 
one byte every 2 meg.  Bang: 4 gig worth of page tables - this is probably not 
what we want.  IMHO, 'don't do that then' isn't a reasonable response.
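
The arithmetic behind that, assuming 4kB page-table pages of 512
eight-byte PTEs (so each page-table page maps 2MB of virtual space, and
touching one byte per 2MB instantiates every one of them):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long mapped   = 2ULL << 40;  /* 2 TB mmap      */
		unsigned long long ptp_span = 2ULL << 20;  /* VM per table   */
		unsigned long long ptp_size = 4096;        /* one table page */

		unsigned long long tables = mapped / ptp_span;   /* ~1M     */
		printf("%llu page-table pages, %llu GB of them\n",
		       tables, tables * ptp_size >> 30);         /* ~4 GB   */
		return 0;
	}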

What we might want to do there is evict some page tables when they start 
proliferating too much, and that's when we find out we have no good model for 
doing that.  I think this needs to be looked at.

[1] I finally got a little more work done on it today

-- 
Daniel

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:36                     ` David Mosberger
  2002-03-16 20:46                       ` Linus Torvalds
@ 2002-03-17  1:09                       ` Paul Mackerras
  2002-03-17  2:08                         ` Linus Torvalds
  1 sibling, 1 reply; 137+ messages in thread
From: Paul Mackerras @ 2002-03-17  1:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> Which brings us back to the whole reason for the discussion: this is not a 
> theoretical argument. Look at the POWER4 numbers, and _shudder_ at the 
> expense of cache invalidation.

Go a little easy, the ppc64 port is still young and there are still
lots of places where it can use some serious optimization.  This is
one of them.

In principle the expense of invalidating the hash-table entries should
be able to be reduced to at most one store for every time we write to
a PTE in the linux page tables.  We currently don't have quite enough
information made available to the architecture code to achieve that.
In particular I think it would help if set_pte could be given the
mm_struct and the virtual address, then set_pte could fairly easily
invalidate the hash-table entry (if any) corresponding to the PTE
being changed.  Would you consider a patch along these lines?
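
A minimal sketch of the interface change being asked for - the types and
the flush helper are illustrative stand-ins, not the real kernel
definitions:

	#include <stdint.h>

	typedef uint64_t pte_t;          /* stand-in for the real pte_t */
	struct mm_struct;                /* opaque here                 */

	/* Hypothetical arch hook: find and invalidate the hash-table
	 * entry (if any) shadowing the Linux PTE for (mm, addr). */
	void flush_hash_one_pte(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep);

	/* A set_pte variant that carries enough context to evict the
	 * stale HPTE the moment the Linux PTE changes, instead of
	 * leaving it for a later flush_tlb_* pass. */
	static inline void set_pte_with_context(struct mm_struct *mm,
						unsigned long addr,
						pte_t *ptep, pte_t val)
	{
		*ptep = val;
		flush_hash_one_pte(mm, addr, ptep);
	}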

Another alternative would be to make flush_tlb_mm do the
change-the-VSIDs trick and then get the idle task to flush the stale
hash table entries.  We would need something like a bitmap showing
which PTEs had corresponding hash-table entries so that we didn't
waste time searching for hash-table entries that weren't there.

Paul.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16 17:37   ` [Lse-tech] " Martin J. Bligh
@ 2002-03-17  1:45     ` Keith Owens
  2002-03-17 13:54     ` David Woodhouse
  2002-03-19 16:49     ` Bill Davidsen
  2 siblings, 0 replies; 137+ messages in thread
From: Keith Owens @ 2002-03-17  1:45 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Anton Blanchard, lse-tech, linux-kernel

On Sat, 16 Mar 2002 09:37:00 -0800, 
"Martin J. Bligh" <Martin.Bligh@us.ibm.com> wrote:
>Are you still doing something like this?
># MAKE="make -j14" /usr/bin/time make -j14 bzImage
>
>I tried setting the MAKE variable as well as doing the -j,
>but it actually made kernel compile time slower - what difference
>does it make on your machine? Can somebody clarify what this
>actually does, as opposed to the -j on the command line?

It depends on which version of make you are using.  make 3.78 onwards
has a built in job scheduler which shares the value of -j across its
children, yea unto the nth generation.  Earlier versions of make did a
simplistic 'run -j copies of make at the top level' and did not
propagate -j to the lower levels.

With the recursive makefiles and make < 3.78 you need MAKE="make -j" to
get a decent speedup because of the lack of choices at the top-level
Makefile.  With make >= 3.79 you do not need MAKE="make -j14"; it can
interfere with make's own scheduler.  See also make -l LOAD.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16 11:04   ` Paul Mackerras
  2002-03-16 18:32     ` Linus Torvalds
@ 2002-03-17  2:00     ` Paul Mackerras
  2002-03-17  2:40       ` Linus Torvalds
  2002-03-18 19:42       ` 7.52 " Cort Dougan
  2002-03-18 19:37     ` Cort Dougan
  2 siblings, 2 replies; 137+ messages in thread
From: Paul Mackerras @ 2002-03-17  2:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> Remember: think about the hashes as just TLB's, and the VSID's are just 
> address space identifiers (yeah, yeah, you can have several VSID's per 
> process at least in 32-bit mode, I don't remember the 64-bit thing). So 
> what you do is the same thing alpha does with it's 6-bit ASN thing: when 
> you wrap around, you blast the whole TLB to kingdom come.

I have performance measurements that show that having stale hash-table
entries cluttering up the hash table hurts performance more than
taking the time to get rid of them does.  This is on ppc32 using
kernel compiles and lmbench as the performance measures.

> You _can_ switch the hash table base around on ppc64, can't you?

Not when running under a hypervisor (i.e. on a logically-partitioned
system), unfortunately.  It _may_ be possible to choose the VSIDs so
that we only use half (or less) of the hash table at any time.

> Maybe somebody is seeing the light.

Maybe.  Whenever I have been asked what hardware features should be
added to PPC chips to make Linux run better, I usually put having an
option for software loading of the TLB pretty high on the list.

However, one good argument against software TLB loading that I have
heard (and which you mentioned in another message) is that loading a
TLB entry in software requires taking an exception, which requires
synchronizing the pipeline, which is expensive.  With hardware TLB
reload you can just freeze the pipeline while the hardware does a
couple of fetches from memory.  And PPC64 remains the only
architecture I know of that supports a full 64-bit virtual address
space _and_ can do hardware TLB reload.

I would be interested to see measurements of how many cache misses on
average each hardware TLB reload takes; for a hash table I expect it
would be about 1, for a 3-level tree I expect it would be very
dependent on access pattern but I wouldn't be surprised if it averaged
about 1 also on real workloads.

But this is all a bit academic, the real question is how do we deal
most efficiently with the real hardware that is out there.  And if you
want a 7.5 second kernel compile the only option currently available
is a machine whose MMU uses a hash table. :)

Paul.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17  1:09                       ` Paul Mackerras
@ 2002-03-17  2:08                         ` Linus Torvalds
  0 siblings, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-17  2:08 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel


On Sun, 17 Mar 2002, Paul Mackerras wrote:
> 
> In particular I think it would help if set_pte could be given the
> mm_struct and the virtual address, then set_pte could fairly easily
> invalidate the hash-table entry (if any) corresponding to the PTE
> being changed.  Would you consider a patch along these lines?

How about just doing a few more "update_mmu_cache()" type of things?

This is actually why update_mmu_cache() exists in the first place - to be 
able to proactively fill in shadow information like hashed page tables.

Adding a "clear_mmu_cache()"?

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-17  2:00     ` Paul Mackerras
@ 2002-03-17  2:40       ` Linus Torvalds
  2002-03-17  2:50         ` M. Edward Borasky
  2002-03-18 19:42       ` 7.52 " Cort Dougan
  1 sibling, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-17  2:40 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel


On Sun, 17 Mar 2002, Paul Mackerras wrote:
> 
> But this is all a bit academic, the real question is how do we deal
> most efficiently with the real hardware that is out there.  And if you
> want a 7.5 second kernel compile the only option currently available
> is a machine whose MMU uses a hash table. :)

Yeah, at a cost of $2M+, if I'm not mistaken. I think I'll settle for my 2
minute time that is actually available to mere mortals at a small fraction
of one percent of that ;)

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* RE: 7.52 second kernel compile
  2002-03-17  2:40       ` Linus Torvalds
@ 2002-03-17  2:50         ` M. Edward Borasky
  2002-03-18 15:08           ` 0.73 " snpe
  0 siblings, 1 reply; 137+ messages in thread
From: M. Edward Borasky @ 2002-03-17  2:50 UTC (permalink / raw)
  To: linux-kernel

Well ... along those lines ... I'll settle for my $1500US 5 GFLOP Athlon for
sound processing instead of the 12 MFLOP FPS AP120B I always dreamed of
owning :). We've sure come a long way in 20 years, eh?

M. Edward Borasky
The COUGAR Project

znmeb@borasky-research.net
http://www.borasky-research.com/Cougar.htm

> -----Original Message-----
> Yeah, at a cost of $2M+, if I'm not mistaken. I think I'll settle for my 2
> minute time that is actually available to mere mortals at a small fraction
> of one percent of that ;)
>
> 		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16 18:57   ` Daniel Egger
@ 2002-03-17  8:18     ` Mike Galbraith
  2002-03-17 15:29       ` Martin J. Bligh
  0 siblings, 1 reply; 137+ messages in thread
From: Mike Galbraith @ 2002-03-17  8:18 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Martin J. Bligh, linux-kernel

On 16 Mar 2002, Daniel Egger wrote:

> Am Sam, 2002-03-16 um 18.37 schrieb Martin J. Bligh:
> 
> > BTW - the other tip that was in the big book of whizzy kernel
> > compiles was to set gcc to use -pipe ... you might want to try
> > that.
> 
> Interestingly -pipe doesn't give any measurable performance increases or
> even leads to a minor decrease in compile speed in my latest tests on
> bigger projects like the linux kernel or GIMP. I suspect that's because
> of the caching nature of today's systems: the temporary products are
> cached in memory and likely never end up on a drive because they're
> read and removed before the filesystem decides to physically
> write the data.
> 
> I also benchmarked tmpfs mounts and it demonstrated - to my surprise -
> small advantages slightly above the noise range; I suspect this is due
> to the way it handles files in memory.

Yes.  Last time I tested, -pipe was _always_ a loser, and writing to
swap was measurably faster than writing to fs.

	-Mike


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16  6:42   ` [Lse-tech] " Gerrit Huizenga
@ 2002-03-17 12:34     ` Anton Blanchard
  2002-03-17 22:09       ` Theodore Tso
  0 siblings, 1 reply; 137+ messages in thread
From: Anton Blanchard @ 2002-03-17 12:34 UTC (permalink / raw)
  To: Gerrit Huizenga; +Cc: lse-tech, linux-kernel


> And this *without* the dcache_lock?  Hmm.  So you are saying there
> may still be room for improvement?

I tried the dcache lock patches but found it hard to see a difference,
for us the mm stuff still seems to be the bottleneck.

> BTW, are you doing this all out of cache/memory or do you have a
> disk/controller quick enough to do the initial little file reads that
> fast?

Yep, it's all out of cache; it's slower on the first run.

Anton

^ permalink raw reply	[flat|nested] 137+ messages in thread

* some RCU dcache and ratcache results
  2002-03-14 13:16     ` [Lse-tech] " Dipankar Sarma
@ 2002-03-17 13:12       ` Anton Blanchard
  0 siblings, 0 replies; 137+ messages in thread
From: Anton Blanchard @ 2002-03-17 13:12 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: lse-tech, linux-kernel


> > Not for this, I did do some benchmarking of the RCU dcache patches a
> > while ago which I should post.
> 
> Please do ;-) This shows why we need to ease the pressure on dcache_lock.

OK :) Here is a graph I made a while ago. It is on a 32 way ppc64 box
running dbench.

http://samba.org/~anton/linux/dcache/summary.png

rat - radix-tree pagecache patch
dcache - RCU dcache patch
ext2 - rusty's BKL removal from ext2 patch

Not surprisingly the RCU dcache patch gave a large improvement in
dbench. While dbench may not be the greatest of benchmarks I am also
seeing a lot of dcache_lock contention on large zero copy workloads (eg 8
way specweb).

> > I didnt get a chance to run lockmeter, I tend to use the kernel profiler
> > and use a hacked readprofile (originally from tridge) that displays
> > profile hits vs assembly instruction. Thats usually good enough to work
> > out which spinlocks are a problem.
> 
> Is this a PPC only hack ? Also, where can I get it ?

I thought tridge put it into cvs somewhere, I'll find out the details
from him.

Anton

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16 17:37   ` [Lse-tech] " Martin J. Bligh
  2002-03-17  1:45     ` Keith Owens
@ 2002-03-17 13:54     ` David Woodhouse
  2002-03-19 16:49     ` Bill Davidsen
  2 siblings, 0 replies; 137+ messages in thread
From: David Woodhouse @ 2002-03-17 13:54 UTC (permalink / raw)
  To: Daniel Egger; +Cc: Martin J. Bligh, linux-kernel


degger@fhm.edu said:
>  Interestingly -pipe doesn't give any measurable performance increases
> or even leads to a minor decrease in compile speed in my latest tests
> on bigger projects like the linux kernel or GIMP.

I believe that newer versions of GCC have a builtin preprocessor, and -pipe 
forces them to use the old, slower, external cpp. 

--
dwmw2



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-17  8:18     ` Mike Galbraith
@ 2002-03-17 15:29       ` Martin J. Bligh
  0 siblings, 0 replies; 137+ messages in thread
From: Martin J. Bligh @ 2002-03-17 15:29 UTC (permalink / raw)
  To: Mike Galbraith, Daniel Egger; +Cc: linux-kernel

> Yes.  Last time I tested, -pipe was _always_ a loser, and writing to
> swap was measurably faster than writing to fs.

Yup, it was a loser for me too. I have a vague random theory that
things get blocked on pipe writes, thus causing more context switches.
I have plans at some point to try the improved pipe stuff that 
Hubertus and others were working on, and see if that helps.

M.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-17 12:34     ` Anton Blanchard
@ 2002-03-17 22:09       ` Theodore Tso
  2002-03-18  7:04         ` Jeff Garzik
  0 siblings, 1 reply; 137+ messages in thread
From: Theodore Tso @ 2002-03-17 22:09 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Gerrit Huizenga, lse-tech, linux-kernel

On Sun, Mar 17, 2002 at 11:34:34PM +1100, Anton Blanchard wrote:
> 
> > And this *without* the dcache_lock?  Hmm.  So you are saying there
> > may still be room for improvement?
> 
> I tried the dcache lock patches but found it hard to see a difference,
> for us the mm stuff still seems to be the bottleneck.

Try the patch which gets rid of the BKL in ext2_get_block() --- if you
don't have that, let me know, I've got one kicking around that mostly
works except I haven't validated that it does the right thing if
quotas are enabled.  If you're running with a cold page cache, I
suspect that will help out much more.  If the numbers assume an already
preloaded page cache, then getting rid of the BKL in ext2_get_block()
will help somewhat, but maybe not enough to be significant.

						- Ted

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-15 18:20         ` Linus Torvalds
  2002-03-16 15:24           ` Daniel Phillips
@ 2002-03-18  3:07           ` David S. Miller
  1 sibling, 0 replies; 137+ messages in thread
From: David S. Miller @ 2002-03-18  3:07 UTC (permalink / raw)
  To: paulus; +Cc: torvalds, linux-kernel

   From: Paul Mackerras <paulus@samba.org>
   Date: Sat, 16 Mar 2002 22:55:40 +1100 (EST)
   
   IMHO it would be interesting to compare the size and complexity of
   using a hash table for the page tables with a 5-level tree.  For a
   32-bit address space I think the tree wins hands down but for a full
   64-bit address space I am not convinced either way at present.

You only need a 4-level tree for a full 64-bit address space as long
as you can guarantee less than (32 + PAGE_SHIFT) bits of physical
addressing (ie. you can use 32-bit pmd_t's and pgd_t's in that case).

At least this is how I remember the numbers working out.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-17 22:09       ` Theodore Tso
@ 2002-03-18  7:04         ` Jeff Garzik
  2002-03-19 18:28           ` Theodore Tso
  0 siblings, 1 reply; 137+ messages in thread
From: Jeff Garzik @ 2002-03-18  7:04 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Anton Blanchard, Gerrit Huizenga, lse-tech, linux-kernel

Theodore Tso wrote:

>On Sun, Mar 17, 2002 at 11:34:34PM +1100, Anton Blanchard wrote:
>
>>>And this *without* the dcache_lock?  Hmm.  So you are saying there
>>>may still be room for improvement?
>>>
>>I tried the dcache lock patches but found it hard to see a difference,
>>for us the mm stuff still seems to be the bottleneck.
>>
>
>Try the patch which gets rid of the BKL in ext2_get_block() --- if you
>don't have that, let me know, I've got one kicking around that mostly
>works except I haven't validated that it does the right thing if
>quotas are enabled.
>

Is yours different from what's in 2.5.x?

    Jeff







^ permalink raw reply	[flat|nested] 137+ messages in thread

* 0.73 second kernel compile
  2002-03-17  2:50         ` M. Edward Borasky
@ 2002-03-18 15:08           ` snpe
  0 siblings, 0 replies; 137+ messages in thread
From: snpe @ 2002-03-18 15:08 UTC (permalink / raw)
  To: linux-kernel

Just kidding !

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-16 11:04   ` Paul Mackerras
  2002-03-16 18:32     ` Linus Torvalds
  2002-03-17  2:00     ` Paul Mackerras
@ 2002-03-18 19:37     ` Cort Dougan
  2 siblings, 0 replies; 137+ messages in thread
From: Cort Dougan @ 2002-03-18 19:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linus Torvalds, linux-kernel

In fact we _did_ do the second part.  Rather, I did anyway.  The zombie
reclaim code (used to live in idle.c before it was removed) would run much
like the zero-paged code I put in there.  It ran with the cache off to
avoid blowing the entire contents of the L1/L2 in the idle task.  It would
just invalidate (genuinely clearing the valid bit) for any hash table entry
that was stale (zombie was the term I used).

That method was a definite win on UP but didn't help much on SMP, since
once a processor bogs down it no longer gets the advantage of an
easy-to-find empty slot in the hash replacement code.

At this point, I think it would be worth throwing out the tlb invalidate
optimization (by changing VSID's) and benchmarking that against the code
with the optimization.  A test a year ago that I did showed that they were
pretty much even.  I'm betting the latest changes have made that
optimization an actual loss now.

Linus, shrinking that hash table was a very very bad thing.  Early on we
used a very small hash table and it really put too much pressure on the
entries and we were throwing them out nearly constantly.  Adding some code
to scatter the entries and use the table more efficiently helped, but a
large hash table is a necessity, unfortunately.

The ultimate solution was actually not using the hash table on the 603's
that I added a few years ago.  I documented how doing this actually
improved performance in an OSDI paper from '99 that I have on my web page.
Linus, it's worth a look - it actually supports most of your opinions of
the PPC MMU.
 
} > I wonder if you wouldn't be better off just getting rid of the TLB range
} > flush altogether, and instead making it select a new VSID in the segment
} > register, and just forgetting about the old TLB contents entirely.
} > 
} > Then, when you do a TLB miss, you just re-use any hash table entries
} > that have a stale VSID.
} 
} We used to do something a bit like that on ppc32 - flush_tlb_mm would
} just assign a new mmu context number to the task, which translates
} into a new set of VSIDs.  We didn't do the second part, reusing hash
} table entries with stale VSIDs, because we couldn't see a good fast
} way to tell whether a given VSID was stale.  Instead, when the hash
} bucket filled up, we just picked an entry to overwrite semi-randomly.
} 
} It turned out that the stale VSIDs were causing us various problems,
} particularly on SMP, so I tried a solution that always cleared all the
} hash table entries, using a bit in the linux pte to say whether there
} was (or had ever been) a hash table entry corresponding to that pte as
} an optimization to avoid doing unnecessary hash lookups.  To my
} surprise, that turned out to be faster, so that's what we do now.
} 
} Your suggestion has the problem that when you get to needing to reuse
} one of the VSIDs that you have thrown away, it becomes very difficult
} and expensive to ensure that there aren't any stale hash table entries
} left around for that VSID - particularly on a system with logical
} partitioning where we don't control the size of the hash table.  And
} there is a finite number of VSIDs so you have to reuse them sooner or
} later.
} 
} [For those not familiar with the PPC MMU, think of the VSID as an MMU
} context number, but separately settable for each 256MB of the virtual
} address space.]
} 
} > It would also be interesting to hear if you can just make the hash table
} > smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or
} 
} On ppc32 we use a hash table 1/4 of the recommended size and it works
} fine.
} 
} > just bypass it altogether (at least the 604e used to be able to just
} > disable the stupid hashing altogether and make the whole thing much
} > saner). 
} 
} That was the 603, actually.  In fact the newest G4 processors also let
} you do this.  When I get hold of a machine with one of these new G4
} chips I'm going to try it again and see how much faster it goes
} without the hash table.
} 
} One other thing - I would *love* it if we could get rid of
} flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
} _all_ of the flush_tlb_* functions tell us what address(es) we need to
} invalidate, and let the architecture code decide whether a complete
} TLB flush is justified.
} 
} Paul.
} -
} To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
} the body of a message to majordomo@vger.kernel.org
} More majordomo info at  http://vger.kernel.org/majordomo-info.html
} Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-17  2:00     ` Paul Mackerras
  2002-03-17  2:40       ` Linus Torvalds
@ 2002-03-18 19:42       ` Cort Dougan
  2002-03-18 20:04         ` Linus Torvalds
  1 sibling, 1 reply; 137+ messages in thread
From: Cort Dougan @ 2002-03-18 19:42 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linus Torvalds, linux-kernel

I have a counter-proposal.  How about a hardware tlb load (if we must have
one) that does the right thing?  I don't think the PPC is a good example of
a well-managed hardware TLB load process.  Software loads show up so well
on the PPC because, I suspect, it does some very foolish things.  I've
had some conversations with Moto engineers who have suggested that my
suspicion is right: the TLB loads are actually cached when the hardware does
them, so we waste cache space on a line that we had better not be loading
again (otherwise we've thrown out our TLB entry way too early).

I still think there are some clever tricks one could do with the VSID's to
get a much saner system than the current hash table.  It would take some
serious work I think but the results could be worthwhile.  By carefully
choosing the VSID scatter algorithm and the size of the hash table I think
one could get a much better access method.

} However, one good argument against software TLB loading that I have
} heard (and which you mentioned in another message) is that loading a
} TLB entry in software requires taking an exception, which requires
} synchronizing the pipeline, which is expensive.  With hardware TLB
} reload you can just freeze the pipeline while the hardware does a
} couple of fetches from memory.  And PPC64 remains the only
} architecture I know of that supports a full 64-bit virtual address
} space _and_ can do hardware TLB reload.
} 
} I would be interested to see measurements of how many cache misses on
} average each hardware TLB reload takes; for a hash table I expect it
} would be about 1, for a 3-level tree I expect it would be very
} dependent on access pattern but I wouldn't be surprised if it averaged
} about 1 also on real workloads.
} 
} But this is all a bit academic, the real question is how do we deal
} most efficiently with the real hardware that is out there.  And if you
} want a 7.5 second kernel compile the only option currently available
} is a machine whose MMU uses a hash table. :)
} 
} Paul.
} -
} To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
} the body of a message to majordomo@vger.kernel.org
} More majordomo info at  http://vger.kernel.org/majordomo-info.html
} Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 19:42       ` 7.52 " Cort Dougan
@ 2002-03-18 20:04         ` Linus Torvalds
  2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:34           ` Cort Dougan
  0 siblings, 2 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-18 20:04 UTC (permalink / raw)
  To: Cort Dougan; +Cc: Paul Mackerras, linux-kernel



On Mon, 18 Mar 2002, Cort Dougan wrote:
>
> I have a counter-proposal.  How about a hardware tlb load (if we must have
> one) that does the right thing?

Well, I actually think that an x86 comes fairly close.

Hashes simply do not do the right thing. You cannot do a speculative load
from a hash, and the hash overhead gets _bigger_ for TLB loads that miss
(ie they optimize for the hit case, which is the wrong optimization if the
on-chip TLB is big enough - and since Moore's law says that the on-chip
TLB _will_ be big enough, that's just stupid).

Basic premise in caching: hardware gets better, and misses go down.

Which implies that misses due to cache contention are misses that go away
over time, while forced misses (due to program startup etc) matter more
and more over time.

Ergo, you want to make build-up fast, because that's where you can't avoid
the work by trivially just making your caches bigger. So you want to have
architecture support for aggressive TLB pre-loading.

> I still think there are some clever tricks one could do with the VSID's to
> get a much saner system than the current hash table.  It would take some
> serious work I think but the results could be worthwhile.  By carefully
> choosing the VSID scatter algorithm and the size of the hash table I think
> one could get a much better access method.

But the whole point of _scattering_ is so incredibly broken in itself!
Don't do it.

You can load many TLB entries in one go, if you just keep them close-by to
each other. Load them into a prefetch-buffer (so that you don't dirty your
real TLB with speculative TLB loads), and since there tends to be locality
to TLB's, you've just automatically speeded up your hardware walker.
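
(As a back-of-the-envelope sketch of that locality argument, assuming 4-byte
non-PAE x86 PTEs and a 32-byte cache line; one line fill then covers 8
adjacent entries, which lines up with the every-8th-miss pattern measured
below.)

#include <stdio.h>

int main(void)
{
        int pte_size = 4;                       /* non-PAE x86 page table entry */
        int line_size = 32;                     /* P6-era L1 data cache line */
        int page_size = 4096;

        int ptes_per_line = line_size / pte_size;       /* 8 entries per fill */
        int span = ptes_per_line * page_size;           /* VA covered per fill */

        printf("%d PTEs per cache line, covering %d KB of virtual address space\n",
               ptes_per_line, span / 1024);             /* 8 PTEs, 32 KB */
        return 0;
}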

In contrast, a hash algorithm automatically means that you cannot sanely
do this _trivial_ optimization.

Face it, hashes are BAD for things like this.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:04         ` Linus Torvalds
@ 2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:50             ` Rene Herman
                               ` (4 more replies)
  2002-03-18 21:34           ` Cort Dougan
  1 sibling, 5 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-18 20:23 UTC (permalink / raw)
  To: Cort Dougan; +Cc: Paul Mackerras, linux-kernel



On Mon, 18 Mar 2002, Linus Torvalds wrote:
>
> Well, I actually think that an x86 comes fairly close.

Btw, here's a program that does a simple histogram of TLB miss cost, and
shows the interesting pattern on intel I was talking about: every 8th miss
is most costly, apparently because Intel pre-fetches 8 TLB entries at a
time.

So on a PII core, you'll see something like

	  87.50: 36
	  12.39: 40

ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4% (ie
1/8) takes 40 cycles (and I assume that the extra 4 cycles is due to
actually loading the thing from the data cache).

Yeah, my program might be buggy, so take the numbers with a pinch of salt.
But it's interesting to see how on an athlon the numbers are

	   3.17: 59
	  34.94: 62
	   4.71: 85
	  54.83: 88

ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
where that pattern would come from..

What are the ppc numbers like (after modifying the rdtsc implementation,
of course)? I suspect you'll get a less clear distribution depending on
whether the hash lookup ends up hitting in the primary or secondary hash,
and where in the list it hits, but..

			Linus

-----
#include <stdio.h>
#include <stdlib.h>

#define rdtsc(low) \
   __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")

#define MAXTIMES 1000
#define BUFSIZE (128*1024*1024)
#define access(x) (*(volatile unsigned int *)&(x))

int main()
{
	unsigned int i, j;
	static int times[MAXTIMES];
	char *buffer = malloc(BUFSIZE);

	for (i = 0; i < BUFSIZE; i += 4096)
		access(buffer[i]);
	for (i = 0; i < MAXTIMES; i++)
		times[i] = 0;
	for (j = 0; j < 100; j++) {
		for (i = 0; i < BUFSIZE ; i+= 4096) {
			unsigned long start, end;

			rdtsc(start);
			access(buffer[i]);
			rdtsc(end);
			end -= start;
			if (end >= MAXTIMES)
				end = MAXTIMES-1;
			times[end]++;
		}
	}
	for (i = 0; i < MAXTIMES; i++) {
		int count = times[i];
		double percent = (double)count / (BUFSIZE/4096);
		if (percent < 1)
			continue;
		printf("%7.2f: %d\n", percent, i);
	}
	return 0;
}


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:04         ` Linus Torvalds
  2002-03-18 20:23           ` Linus Torvalds
@ 2002-03-18 21:34           ` Cort Dougan
  2002-03-18 22:00             ` Linus Torvalds
  1 sibling, 1 reply; 137+ messages in thread
From: Cort Dougan @ 2002-03-18 21:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul Mackerras, linux-kernel

I agree with you there.  On many PowerPC's, we're screwed.  The best thing
I can think of is the clever VSID allocation and trying to make a sane data
structure out of the hash table but that would involve a _lot_ of work with
very likely no reward.

} Hashes simply do not do the right thing. You cannot do a speculative load
} from a hash, and the hash overhead gets _bigger_ for TLB loads that miss
} (ie they optimize for the hit case, which is the wrong optimization if the
} on-chip TLB is big enough - and since Moore's law says that the on-chip
} TLB _will_ be big enough, that's just stupid).

What's the alternative for some PowerPC's?  Every shared library program
likes to use the exact same addresses which load (and thus create htab
entries) at exactly the same location.  A machine running 100+ processes is
not going to be usable because every process is sharing the same 8 PTE
slots.

} But the whole point of _scattering_ is so incredibly broken in itself!
} Don't do it.

Yes, that is indeed correct theoretically.  The problem is that we actually
measured it and there was very little locality.  When I added some
multiple-tlb loads it actually decreased wall-clock performance for nearly
every user load I put on the machine.  Common apps nowadays are using
tens of shared libs, so that would make it even worse.

} You can load many TLB entries in one go, if you just keep them close-by to
} each other. Load them into a prefetch-buffer (so that you don't dirty your
} real TLB with speculative TLB loads), and since there tends to be locality
} to TLB's, you've just automatically speeded up your hardware walker.
} 
} In contrast, a hash algorithm automatically means that you cannot sanely
} do this _trivial_ optimization.

Linus, I knew that deep in my heart 8 years ago when I started in on all
this.  I'm with you but I'm not good enough with a soldering iron to fix
every powerpc out there that forces that crappy IBM spawned madness upon
us.

I even wrote a paper about how bad a design it is and how the designers should
be whipped for their foolish choices on the PPC.  I'll hold the torch if
you knock on the castle door...

} Face it, hashes are BAD for things like this.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
@ 2002-03-18 21:50             ` Rene Herman
  2002-03-18 22:36             ` Cort Dougan
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 137+ messages in thread
From: Rene Herman @ 2002-03-18 21:50 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:

> So on a PII core, you'll see something like
> 
> 87.50: 36
> 12.39: 40
> 
> ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4%
> (ie 1/8) takes 40 cycles (and I assume that the extra 4 cycles is due
> to actually loading the thing from the data cache).
> 
> Yeah, my program might be buggy, so take the numbers with a pinch of
> salt. But it's interesting to see how on an athlon the numbers are
> 
>  3.17: 59
> 34.94: 62
>  4.71: 85
> 54.83: 88
> 
> ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't
> know where that pattern would come from..

You scared me, so I ran the program on my AMD Duron.  Results are
completely repeatable (4 runs):

   4.17: 20
  92.89: 21
   1.17: 26

   4.17: 20
  93.00: 21
   1.18: 26

   4.17: 20
  92.86: 21
   1.18: 26

   4.16: 20
  92.78: 21
   1.16: 26

Ie, rather violently different from the numbers you quoted for the 
Athlon...

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 3
model name      : AMD Duron(tm) Processor 
stepping        : 1
cpu MHz         : 757.472
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat 
pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 1510.60

Rene.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 21:34           ` Cort Dougan
@ 2002-03-18 22:00             ` Linus Torvalds
  0 siblings, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-18 22:00 UTC (permalink / raw)
  To: Cort Dougan; +Cc: Paul Mackerras, linux-kernel



On Mon, 18 Mar 2002, Cort Dougan wrote:
>
> } But the whole point of _scattering_ is so incredibly broken in itself!
> } Don't do it.
>
> Yes, that is indeed correct theoretically.  The problem is that we actually
> measured it and there was very little locality.  When I added some
> multiple-tlb loads it actually decreased wall-clock performance for nearly
> every user load I put on the machine.

This is what I meant by hardware support for multiple loads - you mustn't
let speculative TLB loads displace real TLB entries, for example.

> Linus, I knew that deep in my heart 8 years ago when I started in on all
> this.  I'm with you but I'm not good enough with a soldering iron to fix
> every powerpc out there that forces that crappy IBM spawned madness upon
> us.

Oh, I agree, we can't fix existing broken hardware, we'll have to just live
with it.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:50             ` Rene Herman
@ 2002-03-18 22:36             ` Cort Dougan
  2002-03-18 22:47               ` Linus Torvalds
  2002-03-19  2:42             ` Paul Mackerras
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 137+ messages in thread
From: Cort Dougan @ 2002-03-18 22:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Here's the version modified for PPC, and the results.

The cycle timer in this case is about 16.6MHz.

# ./foo
  92.01: 1
   7.98: 2
# ./foo
   3.71: 0
  92.30: 1
   3.99: 2
# ./foo
  92.01: 1
   7.97: 2
# ./foo
  92.01: 1
   7.97: 2
# ./foo
   3.71: 0
  92.30: 1
   3.99: 2
# ./foo
   3.71: 0
  92.30: 1
   3.99: 2

#include <stdio.h>
#include <stdlib.h>

#if defined(__powerpc__)
#define rdtsc(low) \
   __asm__ __volatile__ ("mftb %0": "=r" (low))
#else
#define rdtsc(low) \
  __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")
#endif

#define MAXTIMES 1000
#define BUFSIZE (128*1024*1024)
#define access(x) (*(volatile unsigned int *)&(x))

int main()
{
	unsigned int i, j;
	static int times[MAXTIMES];
	char *buffer = malloc(BUFSIZE);

	for (i = 0; i < BUFSIZE; i += 4096)
		access(buffer[i]);
	for (i = 0; i < MAXTIMES; i++)
		times[i] = 0;
	for (j = 0; j < 100; j++) {
		for (i = 0; i < BUFSIZE ; i+= 4096) {
			unsigned long start, end;

			rdtsc(start);
			access(buffer[i]);
			rdtsc(end);
			end -= start;
			if (end >= MAXTIMES)
				end = MAXTIMES-1;
			times[end]++;
		}
	}
	for (i = 0; i < MAXTIMES; i++) {
		int count = times[i];
		double percent = (double)count / (BUFSIZE/4096);
		if (percent < 1)
			continue;
		printf("%7.2f: %d\n", percent, i);
	}
	return 0;
}


} Btw, here's a program that does a simple histogram of TLB miss cost, and
} shows the interesting pattern on intel I was talking about: every 8th miss
} is most costly, apparently because Intel pre-fetches 8 TLB entries at a
} time.
} 
} So on a PII core, you'll see something like
} 
} 	  87.50: 36
} 	  12.39: 40
} 
} ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4% (ie
} 1/8) takes 40 cycles (and I assume that the extra 4 cycles is due to
} actually loading the thing from the data cache).
} 
} Yeah, my program might be buggy, so take the numbers with a pinch of salt.
} But it's interesting to see how on an athlon the numbers are
} 
} 	   3.17: 59
} 	  34.94: 62
} 	   4.71: 85
} 	  54.83: 88
} 
} ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
} where that pattern would come from..
} 
} What are the ppc numbers like (after modifying the rdtsc implementation,
} of course)? I suspect you'll get a less clear distribution depending on
} whether the hash lookup ends up hitting in the primary or secondary hash,
} and where in the list it hits, but..
} 
} 			Linus

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:36             ` Cort Dougan
@ 2002-03-18 22:47               ` Linus Torvalds
  2002-03-18 22:56                 ` Cort Dougan
                                   ` (2 more replies)
  0 siblings, 3 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-18 22:47 UTC (permalink / raw)
  To: Cort Dougan; +Cc: linux-kernel


On Mon, 18 Mar 2002, Cort Dougan wrote:
> 
> The cycle timer in this case is about 16.6MHz.

Oh, your cycle timer is too slow to be interesting, apparently ;(

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:47               ` Linus Torvalds
@ 2002-03-18 22:56                 ` Cort Dougan
  2002-03-18 23:52                 ` Paul Mackerras
  2002-03-19  0:22                 ` David S. Miller
  2 siblings, 0 replies; 137+ messages in thread
From: Cort Dougan @ 2002-03-18 22:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Unfortunately so.  I have some boards here that have higher precision
timers but nothing approaching the clock rate of the chip.  I don't think
there are any PPC boards with timers at that rate.

Some of the 6xx or 74xx model debug registers may have something useful
here, though.

} Oh, your cycle timer is too slow to be interesting, apparently ;(

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:47               ` Linus Torvalds
  2002-03-18 22:56                 ` Cort Dougan
@ 2002-03-18 23:52                 ` Paul Mackerras
  2002-03-19  0:57                   ` Dave Jones
  2002-03-19  0:22                 ` David S. Miller
  2 siblings, 1 reply; 137+ messages in thread
From: Paul Mackerras @ 2002-03-18 23:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds writes:

> Oh, your cycle timer is too slow to be interesting, apparently ;(

The G4 has 4 performance monitor counters that you can set up to
measure things like ITLB misses, DTLB misses, cycles spent doing
tablewalks for ITLB misses and DTLB misses, etc.  I hacked up a
measurement of the misses and total cycles doing tablewalks during a
kernel compile and got an average of 36 cycles per DTLB miss and 40
cycles per ITLB miss on a 500MHz G4 machine.  What I need to do now is
to put some better infrastructure for using those counters in place
and try your program using those counters instead of the timebase.

Paul.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 22:47               ` Linus Torvalds
  2002-03-18 22:56                 ` Cort Dougan
  2002-03-18 23:52                 ` Paul Mackerras
@ 2002-03-19  0:22                 ` David S. Miller
  2002-03-19  0:27                   ` Cort Dougan
  2 siblings, 1 reply; 137+ messages in thread
From: David S. Miller @ 2002-03-19  0:22 UTC (permalink / raw)
  To: torvalds; +Cc: cort, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Mon, 18 Mar 2002 14:47:19 -0800 (PST)

   On Mon, 18 Mar 2002, Cort Dougan wrote:
   > The cycle timer in this case is about 16.6MHz.
   
   Oh, your cycle timer is too slow to be interesting, apparently ;(

We could modify the test program to use more portable timing functions
and do the TLB accesses several times over.  While this would get
us something more reasonable on PPC, and be more portable, the results
would be a bit less accurate because we'd be dealing effectively with
averages instead of real cycle count samples.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-19  0:22                 ` David S. Miller
@ 2002-03-19  0:27                   ` Cort Dougan
  2002-03-19  0:27                     ` David S. Miller
  0 siblings, 1 reply; 137+ messages in thread
From: Cort Dougan @ 2002-03-19  0:27 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-kernel

It would be easy to do with the debug registers on PPC but they're
supervisor level only.  Users have no need to profile their code, after
all.

A logic analyzer would be really handy here.  Dave, think you can swing
one? :)

I ended up using averages for my tests with the PPC when doing the MM
optimizations.  Wall-clock time tells you if you did a good thing or not,
but not what it was that you actually did :)

Any suggestions for a structure, Dave?

}    On Mon, 18 Mar 2002, Cort Dougan wrote:
}    > The cycle timer in this case is about 16.6MHz.
}    
}    Oh, your cycle timer is too slow to be interesting, apparently ;(
} 
} We could modify the test program to use more portable timing functions
} and do the TLB accesses several times over.  While this would get
} us something more reasonable on PPC, and be more portable, the results
} would be a bit less accurate because we'd be dealing effectively with
} averages instead of real cycle count samples.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-19  0:27                   ` Cort Dougan
@ 2002-03-19  0:27                     ` David S. Miller
  2002-03-19  0:36                       ` Cort Dougan
  0 siblings, 1 reply; 137+ messages in thread
From: David S. Miller @ 2002-03-19  0:27 UTC (permalink / raw)
  To: cort; +Cc: torvalds, linux-kernel

   From: Cort Dougan <cort@fsmlabs.com>
   Date: Mon, 18 Mar 2002 17:27:05 -0700
   
   Any suggestions for a structure, Dave?

Structure?  Of what?

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-19  0:27                     ` David S. Miller
@ 2002-03-19  0:36                       ` Cort Dougan
  2002-03-19  0:38                         ` David S. Miller
  0 siblings, 1 reply; 137+ messages in thread
From: Cort Dougan @ 2002-03-19  0:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-kernel

The structure of the program you suggested with more portable timing.

}    Any suggestions for a structure, Dave?
} 
} Structure?  Of what?

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-19  0:36                       ` Cort Dougan
@ 2002-03-19  0:38                         ` David S. Miller
  2002-03-19  1:28                           ` Davide Libenzi
  0 siblings, 1 reply; 137+ messages in thread
From: David S. Miller @ 2002-03-19  0:38 UTC (permalink / raw)
  To: cort; +Cc: torvalds, linux-kernel

   From: Cort Dougan <cort@fsmlabs.com>
   Date: Mon, 18 Mar 2002 17:36:35 -0700

   The structure of the program you suggested with more portable timing.
   
Oh, just something like:


	gettimeofday(&stamp1);
	for (A MILLION TIMES) {
		TLB miss;
	}
	gettimeofday(&stamp2);

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 23:52                 ` Paul Mackerras
@ 2002-03-19  0:57                   ` Dave Jones
  2002-03-19  3:35                     ` Jeff Garzik
  0 siblings, 1 reply; 137+ messages in thread
From: Dave Jones @ 2002-03-19  0:57 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-kernel

On Tue, Mar 19, 2002 at 10:52:40AM +1100, Paul Mackerras wrote:
 > The G4 has 4 performance monitor counters that you can set up to
 > measure things like ITLB misses, DTLB misses, cycles spent doing
 > tablewalks for ITLB misses and DTLB misses, etc.
 > What I need to do now is
 > to put some better infrastructure for using those counters in place
 > and try your program using those counters instead of the timebase.

 Sounds like a good candidate for the first non-x86 port of oprofile[1].
 Write the kernel part, and all the nice userspace tools come for free.
 There are also a few other perfctr abstraction projects, which are
 linked off the oprofile pages somewhere iirc.

[1] http://oprofile.sf.net

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-19  0:38                         ` David S. Miller
@ 2002-03-19  1:28                           ` Davide Libenzi
  0 siblings, 0 replies; 137+ messages in thread
From: Davide Libenzi @ 2002-03-19  1:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: cort, torvalds, linux-kernel

On Mon, 18 Mar 2002, David S. Miller wrote:

>    From: Cort Dougan <cort@fsmlabs.com>
>    Date: Mon, 18 Mar 2002 17:36:35 -0700
>
>    The structure of the program you suggested with more portable timing.
>
> Oh, just something like:
>
>
> 	gettimeofday(&stamp1);
> 	for (A MILLION TIMES) {
> 		TLB miss;
> 	}
> 	gettimeofday(&stamp2);

This makes the measurement stable on my machine:

#define rdtsc(low) \
   __asm__ __volatile__("rdtsc" : "=A" (low) : )


            unsigned long long start, end;

            rdtsc(start);
            access(buffer[i]);
            rdtsc(end);



processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 999.561
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
		pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 1992.29



$ gcc -o tlb_test tlb_test.c

#APP
    rdtsc
#NO_APP
    movl    %eax, -24(%ebp)
    movl    %edx, -20(%ebp)
    movl    -4(%ebp), %eax
    addl    -12(%ebp), %eax
    movl    (%eax), %eax
#APP
    rdtsc


  11.89: 18
   4.70: 20
  81.90: 23



$ gcc -O2 -o tlb_test tlb_test.c

#APP
    rdtsc
#NO_APP
    movl    %edx, -28(%ebp)
    movl    -24(%ebp), %edx
    movl    %eax, -32(%ebp)
    movl    (%esi,%edx), %ecx
#APP
    rdtsc
#NO_APP


  87.70: 20
  11.24: 25




- Davide



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
  2002-03-18 21:50             ` Rene Herman
  2002-03-18 22:36             ` Cort Dougan
@ 2002-03-19  2:42             ` Paul Mackerras
  2002-03-27  2:53             ` Richard Henderson
  2002-04-02 10:50             ` Pablo Alcaraz
  4 siblings, 0 replies; 137+ messages in thread
From: Paul Mackerras @ 2002-03-19  2:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Cort Dougan, linux-kernel

Linus Torvalds writes:

> Btw, here's a program that does a simple histogram of TLB miss cost, and
> shows the interesting pattern on intel I was talking about: every 8th miss
> is most costly, apparently because Intel pre-fetches 8 TLB entries at a
> time.

Here are the results on my 500Mhz G4 laptop:

   1.85: 22
  17.86: 26
  14.41: 28
  16.88: 42
  34.03: 46
   9.61: 48
   2.07: 88
   1.04: 90

The numbers are fairly repeatable except that the last two tend to
wobble around a little.  These are numbers of cycles obtained using
one of the performance monitor counters set to count every cycle.
The average is 40.6 cycles.

This was with a 512kB MMU hash table, which translates to 8192 hash
buckets each holding 8 ptes.  The machine has 1MB of L2 cache.

Paul.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-19  0:57                   ` Dave Jones
@ 2002-03-19  3:35                     ` Jeff Garzik
  0 siblings, 0 replies; 137+ messages in thread
From: Jeff Garzik @ 2002-03-19  3:35 UTC (permalink / raw)
  To: Dave Jones; +Cc: Paul Mackerras, linux-kernel

Dave Jones wrote:

>On Tue, Mar 19, 2002 at 10:52:40AM +1100, Paul Mackerras wrote:
> > The G4 has 4 performance monitor counters that you can set up to
> > measure things like ITLB misses, DTLB misses, cycles spent doing
> > tablewalks for ITLB misses and DTLB misses, etc.
> > What I need to do now is
> > to put some better infrastructure for using those counters in place
> > and try your program using those counters instead of the timebase.
>
> Sounds like a good candidate for the first non-x86 port of oprofile[1].
> Write the kernel part, and all the nice userspace tools come for free.
> There are also a few other perfctr abstraction projects, which are
> linked off the oprofile pages somewhere iirc.
>

Maybe this is why drepper doesn't like threaded profiling... he wants us 
all to use oprofile.

/me ducks and runs....





^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 22:25               ` Daniel Phillips
@ 2002-03-19 16:35                 ` Bill Davidsen
  0 siblings, 0 replies; 137+ messages in thread
From: Bill Davidsen @ 2002-03-19 16:35 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

On Sat, 16 Mar 2002, Daniel Phillips wrote:

> It breaks down somewhat as virtual memory range goes way beyond 4GB.  
> There's the relatively minor issue of extra levels of tree traversal, 
> currently limited to 4 by AMD's architecture but not so limited on other 
> architectures.  A bigger problem is what to do about internal fragmentation 
> in the page table tree, say if somebody mmaps a 2 TB sparse file, then writes 
> one byte every 2 meg.  Bang, 4 gig worth of page tables, this is probably not 
> what we want.  IMHO, 'don't do that then' isn't a reasonable response.

  Perhaps not, but "if you do that it will be slow" is a reasonable
response when any operation requires an unusual resource to complete. The
best solution is to reduce the resources needed by being clever, but the
next best is to prevent one process from beating the machine to death for
all others (if any).
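
For reference, the arithmetic behind the quoted "4 gig worth of page tables"
figure, as a sketch (assuming 4KB pages and 8-byte PTEs, so one page-table
page maps 512 * 4KB = 2MB, and one byte is written in every 2MB of a 2TB
sparse mapping):

#include <stdio.h>

int main(void)
{
        unsigned long long map_size = 2ULL << 40;       /* 2TB mapping */
        unsigned long long pt_page  = 4096;             /* one page-table page */
        unsigned long long entries  = pt_page / 8;      /* 512 PTEs per page */
        unsigned long long span     = entries * 4096;   /* 2MB mapped per page */
        unsigned long long touched  = map_size / span;  /* regions written to */
        unsigned long long pt_bytes = touched * pt_page;

        /* prints: 1048576 page-table pages, 4096 MB of page tables */
        printf("%llu page-table pages, %llu MB of page tables\n",
               touched, pt_bytes >> 20);
        return 0;
}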

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-14 19:05       ` Linus Torvalds
@ 2002-03-19 16:40         ` Bill Davidsen
  0 siblings, 0 replies; 137+ messages in thread
From: Bill Davidsen @ 2002-03-19 16:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, 14 Mar 2002, Linus Torvalds wrote:

>  - IBM nomenclature really is broken. They call disks DASD devices, and
>    they call their hash table a page table, and they just confuse
>    themselves and everybody else for no good reason.

Actually, no on DASD.  DASD = "Direct Access Storage Device", and while disk
is the most common implementation of that, it is not the only one.  Likewise,
Windows is the most common implementation of "operating system," but not
the only one, thankfully.

Think drum, solid state storage, optical, etc... all DASD.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-16 17:37   ` [Lse-tech] " Martin J. Bligh
  2002-03-17  1:45     ` Keith Owens
  2002-03-17 13:54     ` David Woodhouse
@ 2002-03-19 16:49     ` Bill Davidsen
  2 siblings, 0 replies; 137+ messages in thread
From: Bill Davidsen @ 2002-03-19 16:49 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: lse-tech, Linux Kernel Mailing List

On Sat, 16 Mar 2002, Martin J. Bligh wrote:

> > I think Im addicted. I need help!
> 
> Well, you're not going to get much competition, so maybe help
> would be more in order ;-) ;-)
> 
> Are you still doing something like this?
> # MAKE="make -j14" /usr/bin/time make -j14 bzImage
> 
> I tried setting the MAKE variable as well as doing the -j,
> but it actually made kernel compile time slower - what difference
> does it make on your machine? Can somebody clarify what this
> actually does, as opposed to the -j on the command line?

  Passing the -j option to make either (a) starts N processes at the
initial level and implies -j1 for submakes, (b) starts N processes at base
level each of which get the -jN and use it, or (c) -jN means run a total
of N processes shared between everything running. The [abc] depends on the
make you run, BSD, xmake, old GNU, new GNU, etc.

  No, that doesn't clarify things; the correct answer is "it depends."  I
have always used the environment variable with older GNU make, and haven't
rethought it on very recent systems. I suggest that N be Nproc+1 for best
results, but I've never had more than four CPUs with a build large enough
to measure.
 
> BTW - the other tip that was in the big book of whizzy kernel
> compiles was to set gcc to use -pipe ... you might want to try
> that.

  In general, -pipe is a bad thing for uni and a non-win for SMP (for any -j),
although I have often thought that making the pipe buffer larger might
change that.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] 7.52 second kernel compile
  2002-03-18  7:04         ` Jeff Garzik
@ 2002-03-19 18:28           ` Theodore Tso
  0 siblings, 0 replies; 137+ messages in thread
From: Theodore Tso @ 2002-03-19 18:28 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Anton Blanchard, Gerrit Huizenga, lse-tech, linux-kernel

On Mon, Mar 18, 2002 at 02:04:18AM -0500, Jeff Garzik wrote:
> Theodore Tso wrote:
> 
> >On Sun, Mar 17, 2002 at 11:34:34PM +1100, Anton Blanchard wrote:
> >
> >>>And this *without* the dcache_lock?  Hmm.  So you are saying there
> >>>may still be room for improvement?
> >>>
> >>I tried the dcache lock patches but found it hard to see a difference,
> >>for us the mm stuff still seems to be the bottleneck.
> >>
> >
> >Try the patch which gets rid of the BKL in ext2_get_block() --- if you
> >don't have that, let me know, I've got one kicking around that mostly
> >works except I haven't validated that it does the right thing if
> >quotas are enabled.
> >
> 
> Is yours different from what's in 2.5.x?

Yes it is, but it looks like Al's is better in any case.  (I hadn't
realized that Al's changes had gone into 2.5.recent; I've been
distracted recently with a few other things.)
	
						- Ted



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:16                   ` Linus Torvalds
  2002-03-16 19:53                     ` yodaiken
@ 2002-03-27  1:07                     ` Richard Henderson
  1 sibling, 0 replies; 137+ messages in thread
From: Richard Henderson @ 2002-03-27  1:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 11:16:16AM -0800, Linus Torvalds wrote:
> But alpha does a "pseudo-hardware" fill of page tables, ie as far as the
> OS is concerned you might as well consider it hardware. And that is
> actually limited to 8kB pages

Actually, it can be frobbed up to 64k with a pal call.  Not that
we've ever arranged for the alpha backend to allow for a page size
not equal to 8k...

> The upcoming hammer stuff from AMD is also 64-bit, and apparently a
> four-level page table, each with 512 entries and 4kB pages.

And FWIW, ev6 also has an option to do 4 level page tables.


r~

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
                               ` (2 preceding siblings ...)
  2002-03-19  2:42             ` Paul Mackerras
@ 2002-03-27  2:53             ` Richard Henderson
  2002-04-02  4:32               ` Linus Torvalds
  2002-04-02 10:50             ` Pablo Alcaraz
  4 siblings, 1 reply; 137+ messages in thread
From: Richard Henderson @ 2002-03-27  2:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

For the record, Alpha timings:

pca164 @ 533MHz:
  72.79: 19
   1.50: 20
  21.30: 35
   1.50: 36
   1.30: 105

ev6 @ 500MHz:
   2.43: 78
  72.13: 84
   2.55: 89
   5.87: 90
   1.38: 105
   5.94: 108
   1.36: 112

I wonder how much of that ev6 slowdown is due to an SRM that
has to handle both 3 and 4 level page tables, and how much is
due to the more expensive syncing of the OOO pipeline...


r~

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-27  2:53             ` Richard Henderson
@ 2002-04-02  4:32               ` Linus Torvalds
  0 siblings, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-04-02  4:32 UTC (permalink / raw)
  To: Richard Henderson; +Cc: linux-kernel


On Tue, 26 Mar 2002, Richard Henderson wrote:
>
> For the record, Alpha timings:
> 
> pca164 @ 533MHz:
>   72.79: 19
>    1.50: 20
>   21.30: 35
>    1.50: 36
>    1.30: 105

Interesting. There seem to be three peaks: a big 4/1 split at 19-20 vs
35-36 cycles, which is probably just the L1 cache (8 bytes per entry,
32-byte cachelines on the EV5 gives 4 entries per cache load), while the
much smaller peak at 105 cycles might possibly be due to the virtual
lookup miss, causing a double TLB miss and a real walk every 8kB entries
(actually, much more often than that, since there's TLB pressure and the
virtual PTE mappings get thrown out faster than the theoretical numbers
would indicate)

It also shows how pretty studly it is to take a sw TLB miss quite that
quickly. Getting in and out of PAL-mode that fast is rather fast.

> ev6 @ 500MHz:
>    2.43: 78
>   72.13: 84
>    2.55: 89
>    5.87: 90
>    1.38: 105
>    5.94: 108
>    1.36: 112
> 
> I wonder how much of that ev6 slowdown is due to an SRM that
> has to handle both 3 and 4 level page tables, and how much is
> due to the more expensive syncing of the OOO pipeline...

The multi-level page table shouldn't hurt at all for the common case (ie
the virtual PTE lookup success), so my money would be on the pipeline
flush.

The other profile difference seems to be due to the 64-byte cacheline (ie
a cacheline now holds 8 entries, so 7/8th can be filled that way).

However, I doubt whether that third peak could be a double PTE fault, it
seems too big and too close in cycles to the others. So maybe the third
peak at 108 cycles is something else... As it seems to balance out very
nicely with the second peak, I wonder if there might not be something
making every other cache fill faster - like a 128-byte prefetch or an
external 128-byte line on the L2/L3? (Ie the third peak would be really
just the "other half" of the second peak).

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: 7.52 second kernel compile
  2002-03-18 20:23           ` Linus Torvalds
                               ` (3 preceding siblings ...)
  2002-03-27  2:53             ` Richard Henderson
@ 2002-04-02 10:50             ` Pablo Alcaraz
  4 siblings, 0 replies; 137+ messages in thread
From: Pablo Alcaraz @ 2002-04-02 10:50 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:

>
>But it's interesting to see how on an athlon the numbers are
>
>	   3.17: 59
>	  34.94: 62
>	   4.71: 85
>	  54.83: 88
>
>ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
>where that pattern would come from..
>
On a 1GHz Athlon the numbers are:

94.49: 20
 2.51: 21

I don't know why the numbers are so different.

Pablo


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-24 21:35             ` Andrew Morton
  2002-03-24 22:54               ` Nick Craig-Wood
@ 2002-03-25  6:40               ` Martin J. Bligh
  1 sibling, 0 replies; 137+ messages in thread
From: Martin J. Bligh @ 2002-03-25  6:40 UTC (permalink / raw)
  To: Andrew Morton, Rogier Wolff
  Cc: Linus Torvalds, yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

> Frankly, all the discussion I've seen about altering page sizes
> threatens to add considerable complexity for very dubious gains.

If we don't mix page sizes, but just increase the default from
4k, does this still add a lot of complexity in your eyes? I can't
see why it would ... ?

> If someone can point at a real-world workload and say "we suck",
> and we can't fix that suckage without altering the page size then
> would that person please come forth.

I believe one of the traditional problems stated for this case is
the amount of virtual address space taken up by all the struct pages
for a machine with large amounts of memory (32-64Gb). At the moment, 
the obvious choice of architecture is still 32 bit, but maybe AMD 
Hammer will fix this ... Unless someone has a plan to move all those
up into highmem as well ....

M.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-24 22:54               ` Nick Craig-Wood
@ 2002-03-24 23:41                 ` Andi Kleen
  0 siblings, 0 replies; 137+ messages in thread
From: Andi Kleen @ 2002-03-24 23:41 UTC (permalink / raw)
  To: Nick Craig-Wood
  Cc: Andrew Morton, Rogier Wolff, Linus Torvalds, yodaiken,
	Andi Kleen, Paul Mackerras, linux-kernel

> If there was some hack where 4MB pages could be allocated for
> applications like this then I'd be very happy!

You could always run it as a kernel module.
Just need to add schedule points or use a preemptive kernel. 
When you allocate data using get_free_pages() it'll return a pointer
into the kernel's direct mapping, which is mapped with 2MB or 4MB pages.
It'll only work when your memory is not fragmented.
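
(A minimal sketch of the allocation half of this, untested and with a
made-up module name; it assumes a 2.4-era module build and uses order 9,
i.e. 2MB, since larger orders are unlikely to succeed even on an
unfragmented box.)

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm.h>

static unsigned long buf;
static int order = 9;           /* 2^9 pages * 4KB = 2MB */

static int __init bigbuf_init(void)
{
        buf = __get_free_pages(GFP_KERNEL, order);
        if (!buf)
                return -ENOMEM;
        /* buf lies in the kernel's direct mapping, so it is backed by
         * the large-page mappings described above */
        printk(KERN_INFO "bigbuf: 2MB at 0x%lx\n", buf);
        return 0;
}

static void __exit bigbuf_exit(void)
{
        free_pages(buf, order);
}

module_init(bigbuf_init);
module_exit(bigbuf_exit);
MODULE_LICENSE("GPL");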

-Andi
> 

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-24 21:35             ` Andrew Morton
@ 2002-03-24 22:54               ` Nick Craig-Wood
  2002-03-24 23:41                 ` Andi Kleen
  2002-03-25  6:40               ` Martin J. Bligh
  1 sibling, 1 reply; 137+ messages in thread
From: Nick Craig-Wood @ 2002-03-24 22:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rogier Wolff, Linus Torvalds, yodaiken, Andi Kleen,
	Paul Mackerras, linux-kernel

On Sun, Mar 24, 2002 at 01:35:57PM -0800, Andrew Morton wrote:
> Frankly, all the discussion I've seen about altering page sizes
> threatens to add considerable complexity for very dubious gains.
> The only place where I've seen a solid justification is for
> scientific applications which have a huge working set, and need
> large pages to save on TLB thrashing.

A widely used example is mprime - the mersenne prime finding program (
http://www.mersenne.org/ ).  This typically uses 8 or more MBytes of
RAM which it completely thrashes.

The program is written in very efficient assembler code and has been
designed not to thrash the TLB as much as possible, but with a working
set of > 8 MBs (which is iterated through many times a second at
maximum memory bandwidth) large pages would make a real improvement to
it.  Since each run takes weeks any improvement would be eagerly
snatched at by the 1000s of people running this program ;-)

If there was some hack where 4MB pages could be allocated for
applications like this then I'd be very happy!

-- 
Nick Craig-Wood
ncw@axis.demon.co.uk

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-24 21:12           ` Rogier Wolff
@ 2002-03-24 21:35             ` Andrew Morton
  2002-03-24 22:54               ` Nick Craig-Wood
  2002-03-25  6:40               ` Martin J. Bligh
  0 siblings, 2 replies; 137+ messages in thread
From: Andrew Morton @ 2002-03-24 21:35 UTC (permalink / raw)
  To: Rogier Wolff
  Cc: Linus Torvalds, yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

Rogier Wolff wrote:
> 
> ...
> So we have a "PAGE_SIZE" define all around the kernel. Keep that the
> same (for compatibility), but make a "REAL_PAGE_SIZE" that governs the
> loop that actually sets the page table (or tlb) entries.... Note that
> a first implementation may actually effectivly reduce the size of the
> TLB on machines with a software loaded TLB....
> 
> Why would I want this? Well, suppose I have a machine that unavoidably
> has to swap on some of its workload. In practice you will almost
> double the disk throughput by increasing the page size by a factor of
> two.

swapin and swapout already perform multipage clustering - you'd get
the same benefits from increasing SWAP_CLUSTER_MAX and page_cluster.

Which is a three-line patch.

Frankly, all the discussion I've seen about altering page sizes
threatens to add considerable complexity for very dubious gains.
The only place where I've seen a solid justification is for
scientific applications which have a huge working set, and need
large pages to save on TLB thrashing.

For everything else, I believe we can get the efficiencies
which we need by writing efficient code; no need to go playing
with page sizes.

If someone can point at a real-world workload and say "we suck",
and we can't fix that suckage without altering the page size then
would that person please come forth.
 
-

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:14         ` Linus Torvalds
  2002-03-16 20:22           ` Andi Kleen
  2002-03-17 13:23           ` Rik van Riel
@ 2002-03-24 21:12           ` Rogier Wolff
  2002-03-24 21:35             ` Andrew Morton
  2 siblings, 1 reply; 137+ messages in thread
From: Rogier Wolff @ 2002-03-24 21:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

Linus Torvalds wrote:
> We may get there some day, but right now 2M pages are not usable for use 
> access.
> 
> 64kB would be fine, though.

[...]

> Give up on large pages - it's just not happening. Even when a 64kB page 
> would make sense from a technology standpoint these days, backwards 
> compatibility makes people stay at 4kB.

I would think that "large page support" that the processors give you
is indeed unusable, but what do you think about "software large(r)
pages"?

What I mean is that instead of doing the 4k that the ia32 hardware
gives us, we pretend that pages are (e.g.) 8k. Thus we always load a
pair of page table entries. memmap ends up having 8k granularity, IO
is done on page-sized (i.e. 8k in this case) chunks etc etc. (%)

So we have a "PAGE_SIZE" define all around the kernel. Keep that the
same (for compatibility), but make a "REAL_PAGE_SIZE" that governs the
loop that actually sets the page table (or tlb) entries.... Note that
a first implementation may actually effectivly reduce the size of the
TLB on machines with a software loaded TLB....

Why would I want this? Well, suppose I have a machine that unavoidably
has to swap on some of its workload. In practice you will almost
double the disk throughput by increasing the page size by a factor of
two.  If the hit rate on the "extra page" that you swap in by
pretending pages are 8k and not 4k is over a couple of percent (*),
then this is advantageous: A seek plus transfer of 4k costs say 10ms +
0.16ms while a seek plus transfer of 8k costs 10ms + 0.33ms (#). Thus
the "penalty" of the extra 4k transfer is very, very small.
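
Plugging those numbers in shows where the "almost double" comes from;
a throwaway calculation, one seek per page-in, using the figures above:

#include <stdio.h>

int main(void)
{
        double seek_ms = 10.0;
        double xfer_4k_ms = 0.16, xfer_8k_ms = 0.33;

        /* effective swap-in bandwidth in KB/s */
        printf("4k pages: %.0f KB/s\n", 4.0 / ((seek_ms + xfer_4k_ms) / 1000.0));
        printf("8k pages: %.0f KB/s\n", 8.0 / ((seek_ms + xfer_8k_ms) / 1000.0));
        return 0;
}

which works out to about 394 KB/s vs 774 KB/s of useful data moved -
not quite double only because of the slightly longer transfer.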

Now, for all the reasons you mention, keeping 4k as the "default" on
Intel is good. But a config option:

                 on       on
                 intel    alpha
page size:       4k
                 8k       8k
                 16k      16k
                 32k      32k
                 64k      64k
                 128k     128k

would also be good for some people. 

				Roger. 

(*) Actually it has to be a couple of percent better than the 4k
page that we had to evict to be able to accommodate the current extra
4k...

(#) A 10k RPM disk rotates in 6ms, so average rotational latency is
about 3ms, and with an average seek time of 7ms, that comes to 10ms.

(%) And we get ext2 8k block size support on ia32!

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-19 21:12                   ` yodaiken
  2002-03-19 22:09                     ` Chris Friesen
@ 2002-03-20  4:25                     ` Bill Davidsen
  1 sibling, 0 replies; 137+ messages in thread
From: Bill Davidsen @ 2002-03-20  4:25 UTC (permalink / raw)
  To: yodaiken; +Cc: Pavel Machek, Linux Kernel Mailing List

On Tue, 19 Mar 2002 yodaiken@fsmlabs.com wrote:

> On Tue, Mar 19, 2002 at 01:06:19PM +0100, Pavel Machek wrote:
> > Hammer is designed for desktop, AFAICT. [Its slightly modified athlon,
> > you see?]
> 
> Thanks for the insight. Only by reading LKM could
> I have found out that AMD doesn't care about server space.

You can get BS from Intel lovers in the Intel group as well. If you
believe that 8-way SMP doesn't indicate an interest in server space, you
see a larger market for games and DSI than I do.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-19 22:09                     ` Chris Friesen
@ 2002-03-19 22:15                       ` yodaiken
  0 siblings, 0 replies; 137+ messages in thread
From: yodaiken @ 2002-03-19 22:15 UTC (permalink / raw)
  To: Chris Friesen; +Cc: yodaiken, Pavel Machek, linux-kernel

On Tue, Mar 19, 2002 at 05:09:37PM -0500, Chris Friesen wrote:
> yodaiken@fsmlabs.com wrote:
> > 
> > On Tue, Mar 19, 2002 at 01:06:19PM +0100, Pavel Machek wrote:
> > > Hammer is designed for desktop, AFAICT. [Its slightly modified athlon,
> > > you see?]
> > 
> > Thanks for the insight. Only by reading LKM could
> > I have found out that AMD doesn't care about server space.
> 
> The sledgehammer is a bit more than a slightly modified athlon...
> 
> up to 8 way multiprocessing
> 64-bit addressing
> hypertransport
> integrated DDR memory controller

Look at www.amd.com front page  for some subtle hints on where
the margin is.


> 
> 
> -- 
> Chris Friesen                    | MailStop: 043/33/F10  
> Nortel Networks                  | work: (613) 765-0557
> 3500 Carling Avenue              | fax:  (613) 765-2986
> Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-19 21:12                   ` yodaiken
@ 2002-03-19 22:09                     ` Chris Friesen
  2002-03-19 22:15                       ` yodaiken
  2002-03-20  4:25                     ` Bill Davidsen
  1 sibling, 1 reply; 137+ messages in thread
From: Chris Friesen @ 2002-03-19 22:09 UTC (permalink / raw)
  To: yodaiken; +Cc: Pavel Machek, linux-kernel

yodaiken@fsmlabs.com wrote:
> 
> On Tue, Mar 19, 2002 at 01:06:19PM +0100, Pavel Machek wrote:
> > Hammer is designed for desktop, AFAICT. [Its slightly modified athlon,
> > you see?]
> 
> Thanks for the insight. Only by reading LKM could
> I have found out that AMD doesn't care about server space.

The sledgehammer is a bit more than a slightly modified athlon...

up to 8 way multiprocessing
64-bit addressing
hypertransport
integrated DDR memory controller


-- 
Chris Friesen                    | MailStop: 043/33/F10  
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-19 12:06                 ` Pavel Machek
@ 2002-03-19 21:12                   ` yodaiken
  2002-03-19 22:09                     ` Chris Friesen
  2002-03-20  4:25                     ` Bill Davidsen
  0 siblings, 2 replies; 137+ messages in thread
From: yodaiken @ 2002-03-19 21:12 UTC (permalink / raw)
  To: Pavel Machek
  Cc: yodaiken, Linus Torvalds, Andi Kleen, Paul Mackerras, linux-kernel

On Tue, Mar 19, 2002 at 01:06:19PM +0100, Pavel Machek wrote:
> Hammer is designed for desktop, AFAICT. [Its slightly modified athlon,
> you see?]

Thanks for the insight. Only by reading LKM could
I have found out that AMD doesn't care about server space.


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 21:39               ` yodaiken
                                   ` (2 preceding siblings ...)
  2002-03-17 14:52                 ` Kai Henningsen
@ 2002-03-19 12:06                 ` Pavel Machek
  2002-03-19 21:12                   ` yodaiken
  3 siblings, 1 reply; 137+ messages in thread
From: Pavel Machek @ 2002-03-19 12:06 UTC (permalink / raw)
  To: yodaiken; +Cc: Linus Torvalds, Andi Kleen, Paul Mackerras, linux-kernel

Hi!

> > > To me, once you have a G of memory, wasting a few meg on unused process 
> > > memory seems no big deal.
> > 
> > It's not the process memory, and it is a whole lot more than a "few meg" if 
> > your page size is 2M.
> 
> I forget what an extremist you are. My claim is that
> some processes benefit from big pages, some do not.
> A 16G process needs 2^25 bytes of PTE at 4kbytes/page if I
> did the numbers right. Just populating 4 million odd page-table entries is a 
> pain. I might be wrong about it, but I wonder if just scaling
> up from a working 32 bit strategy gets you anywhere.
> If you want to optimize for gnome, you get a very different
> layout. But Hammer and ia64 are supposedly designed for huge
> databases, routing tables, and images. Our good friends at Intel

Hammer is designed for desktop, AFAICT. [Its slightly modified athlon,
you see?]
									Pavel
-- 
(about SSSCA) "I don't say this lightly.  However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:22           ` Andi Kleen
@ 2002-03-19  4:34             ` Rusty Russell
  0 siblings, 0 replies; 137+ messages in thread
From: Rusty Russell @ 2002-03-19  4:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: torvalds, yodaiken, ak, paulus, linux-kernel

On Sat, 16 Mar 2002 21:22:29 +0100
Andi Kleen <ak@suse.de> wrote:

> On Sat, Mar 16, 2002 at 12:14:06PM -0800, Linus Torvalds wrote:
> > Give up on large pages - it's just not happening. Even when a 64kB page 
> > would make sense from a technology standpoint these days, backwards 
> > compatibility makes people stay at 4kB.
> 
> Yes the 4KB page has to be kept at least for now. 

We have sysconf(_SC_PAGESIZE).  I say, introduce an experimental CONFIG for
64k pagesize in 2.5, so we can start to weed out the problem apps NOW.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-18  1:31                     ` Linus Torvalds
@ 2002-03-18  1:56                       ` Davide Libenzi
  0 siblings, 0 replies; 137+ messages in thread
From: Davide Libenzi @ 2002-03-18  1:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Linux Kernel Mailing List

On Sun, 17 Mar 2002, Linus Torvalds wrote:

> On Sun, 17 Mar 2002, Davide Libenzi wrote:
> >
> > What's the reason that would make more convenient for us, upon receiving a
> > request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?
>
> Ehh.. Let me count the ways:
>  - reliable allocation of 4MB of contiguous data
>  - graceful fallback when you need to start paging
>  - sane coherency with somebody who mapped the same file/segment in a much
>    smaller chunk
>
> Guys, 4MB pages are always going to be a special case. There's no sane
> way to make them automatic, for the simple reason that they are USELESS
> for "normal" work, and they have tons of problems that are quite
> fundamental and just aren't going away and cannot be worked around.
>
> The only sane way to use 4MB segments is:
>
>  - the application does a special system call (or special flag to mmap)
>    saying that it wants a big page and doesn't care about coherency with
>    anybody else that didn't set the flag (and realize that that probably
>    includes things like read/write)
>
>  - the machine has sufficiently enough memory that the user can be allowed
>    to _lock_ the area down, so that you don't have to worry about
>    swapping out that thing in 4M pieces. (This of course implies that
>    per-user memory counters have to work too, or we have to limit it by
>    default with a rlimit or something to zero).
>
> In short, very much a special case.
>
> (There are two reasons you don't want to handle paging on 4M chunks: (a)
> they may be wonderful for IO throughput, but they are horrible for latency
> for other people and (b) you now have basically just a few bits of usage
> information for 4M worth of memory, as opposed to a finer granularity view
> of which parts are actually _used_).
>
> Once you can count on having memory sizes in the hundreds of Gigs, and
> disk throughput speeds in the hundreds of megs a second, and there are
> enough of these machines to _matter_ (and reliably 64-bit address spaces
> so that virtual fragmentation doesn't matter), we might make 4MB the
> regular mapping entity.
>
> That's probably at least a decade away.

Plenty of reason thanks Linus :) ... even if workstations with Gigs of RAM
and no swap are not so uncommon nowadays. Anyway i agree about the flag
driven activation ...




- Davide



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-18  1:40                     ` Mike Fedyk
@ 2002-03-18  1:48                       ` Davide Libenzi
  0 siblings, 0 replies; 137+ messages in thread
From: Davide Libenzi @ 2002-03-18  1:48 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Rik van Riel, Linus Torvalds, Linux Kernel Mailing List

On Sun, 17 Mar 2002, Mike Fedyk wrote:

> On Sun, Mar 17, 2002 at 05:13:16PM -0800, Davide Libenzi wrote:

> > What's the reason that would make more convenient for us, upon receiving a
> > request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?
>
> ... the VM chooses to unmap a mmaped page, it chooses a 4mb page, later it
> needs just a few bytes from that unmaped page and free 4mb instead of 4kb (worst
> case) to map that page again...

The big-page property should be vma related and should be obviously
handled correctly ...



- Davide



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-18  1:13                   ` Davide Libenzi
  2002-03-18  1:31                     ` Linus Torvalds
@ 2002-03-18  1:40                     ` Mike Fedyk
  2002-03-18  1:48                       ` Davide Libenzi
  1 sibling, 1 reply; 137+ messages in thread
From: Mike Fedyk @ 2002-03-18  1:40 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Rik van Riel, Linus Torvalds, Linux Kernel Mailing List

On Sun, Mar 17, 2002 at 05:13:16PM -0800, Davide Libenzi wrote:
> On Sun, 17 Mar 2002, Rik van Riel wrote:
> 
> > On Sun, 17 Mar 2002, Davide Libenzi wrote:
> > > On Sun, 17 Mar 2002, Linus Torvalds wrote:
> > >
> > > > In article <Pine.LNX.4.44L.0203171021090.2181-100000@imladris.surriel.com>,
> > > > Rik van Riel  <riel@conectiva.com.br> wrote:
> > > > >
> > > > >In other words, large pages should be a "special hack" for
> > > > >special applications, like Oracle and maybe some scientific
> > > > >calculations ?
> > > >
> > > > Yes, I think so.
> >
> > > Couldn't we choose the page size depending on the map size ?
> >
> > For on-disk files I guess this is better as an mmap flag,
> > but for shared memory segments we could try to do this
> > automagically.
> 
> What's the reason that would make more convenient for us, upon receiving a
> request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?

... the VM chooses to unmap a mmaped page, it chooses a 4mb page, later it
needs just a few bytes from that unmaped page and free 4mb instead of 4kb (worst
case) to map that page again...

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-18  1:13                   ` Davide Libenzi
@ 2002-03-18  1:31                     ` Linus Torvalds
  2002-03-18  1:56                       ` Davide Libenzi
  2002-03-18  1:40                     ` Mike Fedyk
  1 sibling, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-18  1:31 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Rik van Riel, Linux Kernel Mailing List



On Sun, 17 Mar 2002, Davide Libenzi wrote:
>
> What's the reason that would make more convenient for us, upon receiving a
> request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?

Ehh.. Let me count the ways:
 - reliable allocation of 4MB of contiguous data
 - graceful fallback when you need to start paging
 - sane coherency with somebody who mapped the same file/segment in a much
   smaller chunk

Guys, 4MB pages are always going to be a special case. There's no sane
way to make them automatic, for the simple reason that they are USELESS
for "normal" work, and they have tons of problems that are quite
fundamental and just aren't going away and cannot be worked around.

The only sane way to use 4MB segments is:

 - the application does a special system call (or special flag to mmap)
   saying that it wants a big page and doesn't care about coherency with
   anybody else that didn't set the flag (and realize that that probably
   includes things like read/write)

 - the machine has sufficiently enough memory that the user can be allowed
   to _lock_ the area down, so that you don't have to worry about
   swapping out that thing in 4M pieces. (This of course implies that
   per-user memory counters have to work too, or we have to limit it by
   default with a rlimit or something to zero).

In short, very much a special case.
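
As a sketch only, the opt-in interface might look something like this
from user space; MAP_BIGPAGE is a made-up name and value (much later
kernels grew MAP_HUGETLB for essentially this idea), so treat it as an
illustration of the model rather than an existing API:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical flag: "back this with big pages, lock it down, and don't
 * promise coherency with small-page mappers of the same object". */
#define MAP_BIGPAGE     0x100000

int main(void)
{
        size_t len = 64UL << 20;        /* 64MB, database-style consumer */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_BIGPAGE, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* ... use the region ... */
        munmap(p, len);
        return 0;
}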

(There are two reasons you don't want to handle paging on 4M chunks: (a)
they may be wonderful for IO throughput, but they are horrible for latency
for other people and (b) you now have basically just a few bits of usage
information for 4M worth of memory, as opposed to a finer granularity view
of which parts are actually _used_).

Once you can count on having memory sizes in the hundreds of Gigs, and
disk throughput speeds in the hundreds of megs a second, and there are
enough of these machines to _matter_ (and reliably 64-bit address spaces
so that virtual fragmentation doesn't matter), we might make 4MB the
regular mapping entity.

That's probably at least a decade away.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-18  0:53                 ` Rik van Riel
@ 2002-03-18  1:13                   ` Davide Libenzi
  2002-03-18  1:31                     ` Linus Torvalds
  2002-03-18  1:40                     ` Mike Fedyk
  0 siblings, 2 replies; 137+ messages in thread
From: Davide Libenzi @ 2002-03-18  1:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Sun, 17 Mar 2002, Rik van Riel wrote:

> On Sun, 17 Mar 2002, Davide Libenzi wrote:
> > On Sun, 17 Mar 2002, Linus Torvalds wrote:
> >
> > > In article <Pine.LNX.4.44L.0203171021090.2181-100000@imladris.surriel.com>,
> > > Rik van Riel  <riel@conectiva.com.br> wrote:
> > > >
> > > >In other words, large pages should be a "special hack" for
> > > >special applications, like Oracle and maybe some scientific
> > > >calculations ?
> > >
> > > Yes, I think so.
>
> > Couldn't we choose the page size depending on the map size ?
>
> For on-disk files I guess this is better as an mmap flag,
> but for shared memory segments we could try to do this
> automagically.

What's the reason that would make more convenient for us, upon receiving a
request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?




- Davide



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17 23:01               ` Davide Libenzi
@ 2002-03-18  0:53                 ` Rik van Riel
  2002-03-18  1:13                   ` Davide Libenzi
  0 siblings, 1 reply; 137+ messages in thread
From: Rik van Riel @ 2002-03-18  0:53 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Sun, 17 Mar 2002, Davide Libenzi wrote:
> On Sun, 17 Mar 2002, Linus Torvalds wrote:
>
> > In article <Pine.LNX.4.44L.0203171021090.2181-100000@imladris.surriel.com>,
> > Rik van Riel  <riel@conectiva.com.br> wrote:
> > >
> > >In other words, large pages should be a "special hack" for
> > >special applications, like Oracle and maybe some scientific
> > >calculations ?
> >
> > Yes, I think so.

> Couldn't we choose the page size depending on the map size ?

For on-disk files I guess this is better as an mmap flag,
but for shared memory segments we could try to do this
automagically.

> If we start mixing page sizes, what about kernel code that assumes
> PAGE_SIZE ?

We fix it.

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17 18:16             ` Linus Torvalds
@ 2002-03-17 23:01               ` Davide Libenzi
  2002-03-18  0:53                 ` Rik van Riel
  0 siblings, 1 reply; 137+ messages in thread
From: Davide Libenzi @ 2002-03-17 23:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On Sun, 17 Mar 2002, Linus Torvalds wrote:

> In article <Pine.LNX.4.44L.0203171021090.2181-100000@imladris.surriel.com>,
> Rik van Riel  <riel@conectiva.com.br> wrote:
> >
> >In other words, large pages should be a "special hack" for
> >special applications, like Oracle and maybe some scientific
> >calculations ?
>
> Yes, I think so.
>
> That said, a 64kB page would be useful for generic use.
>
> >Grabbing some bitflags in generic datastructures shouldn't
> >be an issue since free bits are available.
>
> I had large-page-support working in the VM a long time ago, back when I
> did the original VM portability rewrite.  I actually exposed the kernel
> large pages to the VM, and it worked fine - I didn't even need a new
> bit, since the code just used the "large page" bit in the page table
> directly.
>
> But it wasn't ever exposed to user space, and in the end I just made the
> kernel mapping just not visible to the VM and simplified the x86
> pmd_xxx() macros.  The approach definitely worked, though.

Couldn't we choose the page size depending on the map size ?
If we start mixing page sizes, what about kernel code that assumes PAGE_SIZE ?



- Davide



^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17 14:52                 ` Kai Henningsen
@ 2002-03-17 21:00                   ` yodaiken
  0 siblings, 0 replies; 137+ messages in thread
From: yodaiken @ 2002-03-17 21:00 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: linux-kernel

On Sun, Mar 17, 2002 at 04:52:00PM +0200, Kai Henningsen wrote:
> yodaiken@fsmlabs.com  wrote on 16.03.02 in <20020316161057.A23495@hq.fsmlabs.com>:
> 
> > On Sat, Mar 16, 2002 at 10:00:07PM +0000, Alan Cox wrote:
> > > > databases, routing tables, and images. Our good friends at Intel
> > > > claim "carrier grade" Linux  needs to run threaded apps
> > > > with 10,000 threads to depose Solaris in telecom - all sharing the
> > > > same monster address space.=20
> > >
> > > Thats intel though. The same people who seem to think that hyperthreading
> > > in the CPU is required for carrier grade work 8)
> >
> > I love the whole sound of "carrier grade" though: Do you use "carrier grade"
> > Linux or just the "recreational boating" version?
> 
> Wrong carrier, though. It's not US Navy carriers (those people use NT,  
> after all, and this was "depose Solaris"), it's carriers like AT&T - phone  

Really? So why are they always talking about that ship SS7 and the 
sonar network - SONET ? I think Alan may have something there about the
carrier pigeon angle, though. Needs more study.

Actually, it makes me think of "as big, manoeuvrable, and 
low cost as an aircraft carrier", although that is certainly unfair.

> companies. And I suspect many of those 10,000 threads are handling one  
> phone conversation each. Or maybe one half of one.
> 
> In fact, that's a problem space I find much more interesting than the  
> military. *These* people need to be robust in peacetime. They can't afford  
> a big showy piece of hardware that breaks down when it's finally needed,  
> because "finally" is a very short-term goal.

But in my, as always ever so humble, opinion, 10,000 threads is a programming
error based on the incorrect Solaris theory that threads were a good
substitute for thinking about scheduling operations. So making Linux
handle 10K threads is not necessarily an appealing idea unless you can
think of some very clever way to do it.
On the other hand, if I just wanted to sell chips, I might think 
differently.



> 
> MfG Kai

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17 14:38                   ` Kai Henningsen
@ 2002-03-17 18:20                     ` Alan Cox
  0 siblings, 0 replies; 137+ messages in thread
From: Alan Cox @ 2002-03-17 18:20 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: torvalds, linux-kernel

> > That's not an umlaut, that's an "ae", which is a real letter in Finnish and
> > Swedish (it just _looks_ like an a with an umlaut to you uncultured
> > people), and it happens to be a letter that is just left of the ' mark on
> > a Finnish keyboard.
> 
> Hey, careful there! Those English speakers stole that name from German,  
> and in German those umlauts are real letters, too. Incidentally, my ae is  
> next to the '# key ...

There are still a couple of places you can legitimately use an ae symbol in
English. It's not quite dead yet 8)

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17 13:23           ` Rik van Riel
@ 2002-03-17 18:16             ` Linus Torvalds
  2002-03-17 23:01               ` Davide Libenzi
  0 siblings, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-17 18:16 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.44L.0203171021090.2181-100000@imladris.surriel.com>,
Rik van Riel  <riel@conectiva.com.br> wrote:
>
>In other words, large pages should be a "special hack" for
>special applications, like Oracle and maybe some scientific
>calculations ?

Yes, I think so.

That said, a 64kB page would be useful for generic use. 

>Grabbing some bitflags in generic datastructures shouldn't
>be an issue since free bits are available.

I had large-page-support working in the VM a long time ago, back when I
did the original VM portability rewrite.  I actually exposed the kernel
large pages to the VM, and it worked fine - I didn't even need a new
bit, since the code just used the "large page" bit in the page table
directly. 

But it wasn't ever exposed to user space, and in the end I just made the
kernel mapping just not visible to the VM and simplified the x86
pmd_xxx() macros.  The approach definitely worked, though. 
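
For the curious: the "large page" bit here is the PSE bit, bit 7, of an
ia32 page-directory entry.  A simplified stand-in for the kind of check
involved - not the actual code, and the pmd_t below is faked up so the
snippet stands alone:

/* On ia32, bit 7 (PSE) of a page-directory entry marks a 4MB page. */
#define _PAGE_PSE       0x080UL

/* Simplified stand-ins for the kernel's pmd_t / pmd_val(). */
typedef struct { unsigned long pmd; } pmd_t;
#define pmd_val(x)      ((x).pmd)

static inline int pmd_large(pmd_t pmd)
{
        return (pmd_val(pmd) & _PAGE_PSE) != 0;
}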

			Linus

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 21:39               ` yodaiken
  2002-03-16 21:49                 ` Linus Torvalds
  2002-03-16 22:00                 ` Alan Cox
@ 2002-03-17 14:52                 ` Kai Henningsen
  2002-03-17 21:00                   ` yodaiken
  2002-03-19 12:06                 ` Pavel Machek
  3 siblings, 1 reply; 137+ messages in thread
From: Kai Henningsen @ 2002-03-17 14:52 UTC (permalink / raw)
  To: linux-kernel

yodaiken@fsmlabs.com  wrote on 16.03.02 in <20020316161057.A23495@hq.fsmlabs.com>:

> On Sat, Mar 16, 2002 at 10:00:07PM +0000, Alan Cox wrote:
> > > databases, routing tables, and images. Our good friends at Intel
> > > claim "carrier grade" Linux  needs to run threaded apps
> > > with 10,000 threads to depose Solaris in telecom - all sharing the
> > > same monster address space.=20
> >
> > Thats intel though. The same people who seem to think that hyperthreading
> > in the CPU is required for carrier grade work 8)
>
> I love the whole sound of "carrier grade" though: Do you use "carrier grade"
> Linux or just the "recreational boating" version?

Wrong carrier, though. It's not US Navy carriers (those people use NT,  
after all, and this was "depose Solaris"), it's carriers like AT&T - phone  
companies. And I suspect many of those 10,000 threads are handling one  
phone conversation each. Or maybe one half of one.

In fact, that's a problem space I find much more interesting than the  
military. *These* people need to be robust in peacetime. They can't afford  
a big showy piece of hardware that breaks down when it's finally needed,  
because "finally" is a very short-term goal.

MfG Kai

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 21:49                 ` Linus Torvalds
@ 2002-03-17 14:38                   ` Kai Henningsen
  2002-03-17 18:20                     ` Alan Cox
  0 siblings, 1 reply; 137+ messages in thread
From: Kai Henningsen @ 2002-03-17 14:38 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 16.03.02 in <Pine.LNX.4.33.0203161342190.24457-100000@home.transmeta.com>:

> On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:

> > > In short, youaere
> >
> > Don't use umlauts unless you are ready to back it up.
>
> That's not an umlaut, that's an "ae", which is a real letter in Finnish and
> Swedish (it just _looks_ like an a with an umlaut to you uncultured
> people), and it happens to be a letter that is just left of the ' mark on
> a Finnish keyboard.

Hey, careful there! Those English speakers stole that name from German,  
and in German those umlauts are real letters, too. Incidentally, my ae is  
next to the '# key ...

MfG Kai

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 21:05             ` Richard Gooch
  2002-03-16 23:34               ` yodaiken
@ 2002-03-17 13:48               ` Rik van Riel
  1 sibling, 0 replies; 137+ messages in thread
From: Rik van Riel @ 2002-03-17 13:48 UTC (permalink / raw)
  To: Richard Gooch
  Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel, torvalds

On Sat, 16 Mar 2002, Richard Gooch wrote:

> And I can afford a few MiB of RAM for PTE's and such for *the one
> process which is mapping my huge data files*!

Once you have this, you might as well make that granularity
per VMA.

This gives you the advantage of being able to share the
mapping for libc.so ;)

From what I can see, you'll basically want large pages
for:

1) Oracle and maybe other large shared memory situations
   where the page table overhead would otherwise be
   prohibitively high

2) scientific calculations and other programs with a huge
   dataset where TLB misses would be prohibitively slow

regards,

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:14         ` Linus Torvalds
  2002-03-16 20:22           ` Andi Kleen
@ 2002-03-17 13:23           ` Rik van Riel
  2002-03-17 18:16             ` Linus Torvalds
  2002-03-24 21:12           ` Rogier Wolff
  2 siblings, 1 reply; 137+ messages in thread
From: Rik van Riel @ 2002-03-17 13:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

On Sat, 16 Mar 2002, Linus Torvalds wrote:
> On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> >
> > What about 2M pages?
>
> Not useful for generic loads right now, and the latencies for clearing or
> copying them etc (ie single page faults - nopage or COW) are still
> big enough that it would likely be a performance problem at that level.
> And while doing IO in 2MB chunks sounds like fun, since most files are
> still just a few kB,

In other words, large pages should be a "special hack" for
special applications, like Oracle and maybe some scientific
calculations ?

Grabbing some bitflags in generic datastructures shouldn't
be an issue since free bits are available.

regards,

Rik
-- 
<insert bitkeeper endorsement here>

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17  4:12               ` Chris Wedgwood
@ 2002-03-17  4:31                 ` Alan Cox
  0 siblings, 0 replies; 137+ messages in thread
From: Alan Cox @ 2002-03-17  4:31 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Alan Cox, Andi Kleen, yodaiken, Paul Mackerras, linux-kernel, torvalds

> Either way, we have tens of MB of ram where we either put textures,
> options or whatever --- the CPU has to meddle with it one way or
> another.

The disk DMA's it to RAM, the graphics card fetches it via the GART
mappings. We shouldn't be touching a lot of it. 


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17  3:43             ` Alan Cox
@ 2002-03-17  4:12               ` Chris Wedgwood
  2002-03-17  4:31                 ` Alan Cox
  0 siblings, 1 reply; 137+ messages in thread
From: Chris Wedgwood @ 2002-03-17  4:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, yodaiken, Paul Mackerras, linux-kernel, torvalds

On Sun, Mar 17, 2002 at 03:43:29AM +0000, Alan Cox wrote:

    You are labouring under the belief that processors touch the frame
    buffer nowadays. For a current accelerated frame buffer that isn't
    very true.

/s/frame-buffer/hunk-of-memory/

Either way, we have tens of MB of ram where we either put textures,
options or whatever --- the CPU has to meddle with it one way or
another.



  --cw

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-17  2:50           ` Chris Wedgwood
@ 2002-03-17  3:43             ` Alan Cox
  2002-03-17  4:12               ` Chris Wedgwood
  0 siblings, 1 reply; 137+ messages in thread
From: Alan Cox @ 2002-03-17  3:43 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Andi Kleen, yodaiken, Paul Mackerras, linux-kernel, torvalds

>     They are not supported for user space, but used in private
>     mappings for kernel text and direct memory mappings. Generic code
>     never sees them.
> 
> Is there any reason we couldn't use them for mapping large
> frame-buffers and similar?

You are labouring under the belief that processors touch the frame buffer
nowadays. For a current accelerated frame buffer that isn't very true.

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 23:10                   ` yodaiken
  2002-03-17  1:17                     ` rddunlap
@ 2002-03-17  3:34                     ` Alan Cox
  1 sibling, 0 replies; 137+ messages in thread
From: Alan Cox @ 2002-03-17  3:34 UTC (permalink / raw)
  To: yodaiken
  Cc: Alan Cox, Linus Torvalds, Andi Kleen, Paul Mackerras, linux-kernel

> I love the whole sound of "carrier grade" though: Do you use "carrier grade"
> Linux or just the "recreational boating" version?

Linux is already carrier (pigeon) grade
		-> www.blug.linux.no/rfc1149/


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:05         ` Andi Kleen
  2002-03-16 20:12           ` yodaiken
  2002-03-16 20:27           ` Richard Gooch
@ 2002-03-17  2:50           ` Chris Wedgwood
  2002-03-17  3:43             ` Alan Cox
  2 siblings, 1 reply; 137+ messages in thread
From: Chris Wedgwood @ 2002-03-17  2:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: yodaiken, Paul Mackerras, linux-kernel, torvalds

On Sat, Mar 16, 2002 at 09:05:04PM +0100, Andi Kleen wrote:

    On Sat, Mar 16, 2002 at 12:57:11PM -0700, yodaiken@fsmlabs.com
    wrote:

    >
    > What about 2M pages?

    They are not supported for user space, but used in private
    mappings for kernel text and direct memory mappings. Generic code
    never sees them.

Is there any reason we couldn't use them for mapping large
frame-buffers and similar?



  --cw

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 23:10                   ` yodaiken
@ 2002-03-17  1:17                     ` rddunlap
  2002-03-17  3:34                     ` Alan Cox
  1 sibling, 0 replies; 137+ messages in thread
From: rddunlap @ 2002-03-17  1:17 UTC (permalink / raw)
  To: yodaiken
  Cc: Alan Cox, Linus Torvalds, Andi Kleen, Paul Mackerras, linux-kernel

On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:

| On Sat, Mar 16, 2002 at 10:00:07PM +0000, Alan Cox wrote:
| > > databases, routing tables, and images. Our good friends at Intel
| > > claim "carrier grade" Linux  needs to run threaded apps
| > > with 10,000 threads to depose Solaris in telecom - all sharing the
| > > same monster address space.=20
| >
| > Thats intel though. The same people who seem to think that hyperthreading
| > in the CPU is required for carrier grade work 8)
|
| I love the whole sound of "carrier grade" though: Do you use "carrier grade"
| Linux or just the "recreational boating" version?

Don't know what Intel is claiming, but OSDL is
sponsoring a "Carrier Grade Linux Working Group".
See info at www.osdl.org and/or at cglinux.sf.net .

-- 
~Randy


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 21:05             ` Richard Gooch
@ 2002-03-16 23:34               ` yodaiken
  2002-03-17 13:48               ` Rik van Riel
  1 sibling, 0 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 23:34 UTC (permalink / raw)
  To: Richard Gooch
  Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel, torvalds

On Sat, Mar 16, 2002 at 02:05:17PM -0700, Richard Gooch wrote:
> > Why not?  If you just ran vim on console you'd be more productive and
> > not need all those worthless processes. 
> 
> Yeah, right.

I was just trying to be nice.

> > At 4KB/page and 8bytes/pte a
> > 1G process will need at least 2MB of pte alone ! Add in the 4 layers,
> > the software VM struct, ...
> 
> This isn't a dedicated bigass-image display box. It's a workstation.
> It's where I read email, hack kernels, write visualisation tools and
> stuff like that.
> 
> And I can afford a few MiB of RAM for PTE's and such for *the one
> process which is mapping my huge data files*! That's effectively a
> small, one-time cost. Every other process doesn't have a significant
> PTE cost.

Well, it's a matter of target. I'm thinking about our customers who
do high grade image processing on a stream of gig+ bitmaps. They need
64 bits (some are already painfully stranded on Alphas) and they don't use these
boxes for email. 

> > But sure, big pages are not always good.
> 
> Hm. With wide TLB's, what are the benefits to big pages? One

tlb miss rates, mm structure overhead and setup/teardown, swap speed,

---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 22:00                 ` Alan Cox
  2002-03-16 21:49                   ` Linus Torvalds
@ 2002-03-16 23:10                   ` yodaiken
  2002-03-17  1:17                     ` rddunlap
  2002-03-17  3:34                     ` Alan Cox
  1 sibling, 2 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 23:10 UTC (permalink / raw)
  To: Alan Cox
  Cc: yodaiken, Linus Torvalds, Andi Kleen, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 10:00:07PM +0000, Alan Cox wrote:
> > databases, routing tables, and images. Our good friends at Intel
> > claim "carrier grade" Linux  needs to run threaded apps
> > with 10,000 threads to depose Solaris in telecom - all sharing the
> > same monster address space.=20
> 
> Thats intel though. The same people who seem to think that hyperthreading
> in the CPU is required for carrier grade work 8)

I love the whole sound of "carrier grade" though: Do you use "carrier grade"
Linux or just the "recreational boating" version?


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 21:39               ` yodaiken
  2002-03-16 21:49                 ` Linus Torvalds
@ 2002-03-16 22:00                 ` Alan Cox
  2002-03-16 21:49                   ` Linus Torvalds
  2002-03-16 23:10                   ` yodaiken
  2002-03-17 14:52                 ` Kai Henningsen
  2002-03-19 12:06                 ` Pavel Machek
  3 siblings, 2 replies; 137+ messages in thread
From: Alan Cox @ 2002-03-16 22:00 UTC (permalink / raw)
  To: yodaiken; +Cc: Linus Torvalds, Andi Kleen, Paul Mackerras, linux-kernel

> databases, routing tables, and images. Our good friends at Intel
> claim "carrier grade" Linux  needs to run threaded apps
> with 10,000 threads to depose Solaris in telecom - all sharing the
> same monster address space.=20

Thats intel though. The same people who seem to think that hyperthreading
in the CPU is required for carrier grade work 8)

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 22:00                 ` Alan Cox
@ 2002-03-16 21:49                   ` Linus Torvalds
  2002-03-16 23:10                   ` yodaiken
  1 sibling, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 21:49 UTC (permalink / raw)
  To: Alan Cox; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel



On Sat, 16 Mar 2002, Alan Cox wrote:
>
> Thats intel though. The same people who seem to think that hyperthreading
> in the CPU is required for carrier grade work 8)

Well, to be fair, they do seem to be making inroads against Sun.

Maybe that factor-of-five performance edge has something to do with that
part, though ;)

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 21:39               ` yodaiken
@ 2002-03-16 21:49                 ` Linus Torvalds
  2002-03-17 14:38                   ` Kai Henningsen
  2002-03-16 22:00                 ` Alan Cox
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 21:49 UTC (permalink / raw)
  To: yodaiken; +Cc: Andi Kleen, Paul Mackerras, linux-kernel



On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
>
> If you want to optimize for gnome, you get a very different
> layout. But Hammer and ia64 are supposedly designed for huge
> databases, routing tables, and images.

Yeah, and I'm claiming that databases etc count for a whole lot less than
most other apps.

> What's the "common case" for 64 bit ? Do you really think it will
> be on desktop soon?

Oh, not Itanium for sure. But if AMD succeeds even reasonably with Hammer,
Intel will certainly see the error of its ways (except they won't admit
it) and make their version of a 64-bit P4 core available. They're
reportedly working on it already, just in case the Itanic sinks (and
judging by current market behaviour it certainly isn't flying).

> > In short, youäre
>
> Don't use umlauts unless you are ready to back it up.

That's not an umlaut, that's an "ä", which is a real letter in Finnish and
Swedish (it just _looks_ like an a with an umlaut to you uncultured
people), and it happens to be a letter that is just left of the ' mark on
a Finnish keyboard.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:34             ` Linus Torvalds
@ 2002-03-16 21:39               ` yodaiken
  2002-03-16 21:49                 ` Linus Torvalds
                                   ` (3 more replies)
  0 siblings, 4 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 21:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 12:34:29PM -0800, Linus Torvalds wrote:
> 
> On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> > 
> > To me, once you have a G of memory, wasting a few meg on unused process 
> > memory seems no big deal.
> 
> It's not the process memory, and it is a whole lot more than a "few meg" if 
> your page size is 2M.

I forget what an extremist you are. My claim is that
some processes benefit from big pages, some do not.
A 16G process needs 2^25 bytes of PTE at 4kbytes/page if I
did the numbers right. Just populating 4 million odd page-table entries is a 
pain. I might be wrong about it, but I wonder if just scaling
up from a working 32 bit strategy gets you anywhere.
If you want to optimize for gnome, you get a very different
layout. But Hammer and ia64 are supposedly designed for huge
databases, routing tables, and images. Our good friends at Intel
claim "carrier grade" Linux  needs to run threaded apps
with 10,000 threads to depose Solaris in telecom - all sharing the
same monster address space. 
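
The arithmetic behind that, as a quick sanity check (4k pages, 8 bytes
per pte, ignoring the upper levels of the page table tree):

#include <stdio.h>

int main(void)
{
        unsigned long long vm_bytes  = 16ULL << 30;     /* 16G mapping     */
        unsigned long long page_size = 4096;            /* 4k pages        */
        unsigned long long pte_size  = 8;               /* 8 bytes per pte */

        unsigned long long ptes  = vm_bytes / page_size;  /* ~4.2 million   */
        unsigned long long bytes = ptes * pte_size;       /* 32M, i.e. 2^25 */

        printf("%llu ptes, %llu bytes of pte\n", ptes, bytes);
        return 0;
}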

> Admit it, you're just wrong. 2M page sizes are _not_ useful for the common
> case, and won't be for years to come.

What's the "common case" for 64 bit ? Do you really think it will
be on desktop soon?

> 
> In short, youäre 

Don't use umlauts unless you are ready to back it up.


> 
> > They say:
> > 	Hammer microarchitecture features a flush filter allowing multiple
> > 	processes to share TLB without SW intervention.
> > 
> > Not a lot of technical detail in that.
> 
> I suspect it's some special case for windows with a special MSR that 
> enables something illegal that just works well for whatever patterns 
> windows does.

sounds like it from what Andi wrote. disappointing.

> 
> 		Linus

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:27           ` Richard Gooch
  2002-03-16 20:47             ` yodaiken
@ 2002-03-16 21:05             ` Richard Gooch
  2002-03-16 23:34               ` yodaiken
  2002-03-17 13:48               ` Rik van Riel
  1 sibling, 2 replies; 137+ messages in thread
From: Richard Gooch @ 2002-03-16 21:05 UTC (permalink / raw)
  To: yodaiken; +Cc: Andi Kleen, Paul Mackerras, linux-kernel, torvalds

yodaiken@fsmlabs.com writes:
> On Sat, Mar 16, 2002 at 01:27:52PM -0700, Richard Gooch wrote:
> > yodaiken@fsmlabs.com writes:
> > > On Sat, Mar 16, 2002 at 09:05:04PM +0100, Andi Kleen wrote:
> > > > That will hopefully change eventually because 2M pages are a big help for
> > > > a lot of applications that are limited by TLB thrashing, but needs some 
> > > > thinking on how to avoid the fragmentation trap (e.g. I'm considering
> > > > adding a highmem zone again just for that and using rmap with targeted
> > > > physical freeing to allocate them) 
> > > 
> > > To me, once you have a G of memory, wasting a few meg on unused
> > > process memory seems no big deal.
> > 
> > I'm not happy to throw away 2 MiB per process. My workstation has 1
> > GiB of RAM, and 65 processes (and that's fairly low compared to your
> > average desktop these days, because I just use olwm and don't have a
> > fancy desktop or lots of windows). You want me to throw over 1/8th of
> > my RAM away?!?
> 
> Why not?  If you just ran vim on console you'd be more productive and
> not need all those worthless processes. 

Yeah, right.

> At 4KB/page and 8bytes/pte a
> 1G process will need at least 2MB of pte alone ! Add in the 4 layers,
> the software VM struct, ...

This isn't a dedicated bigass-image display box. It's a workstation.
It's where I read email, hack kernels, write visualisation tools and
stuff like that.

And I can afford a few MiB of RAM for PTE's and such for *the one
process which is mapping my huge data files*! That's effectively a
small, one-time cost. Every other process doesn't have a significant
PTE cost.

I'm not using my kernel as a device driver for an image display
programme. I'm using it run a box that's generally useful to me.

> > And in fact, isn't it going to be more than 2 MiB wasted per process?
> > For each shared object loaded, only partial pages are going to be
> > used. *My* libc is less than 700 KiB, so I'd be wasting most of a page
> > to map it in.
> 
> You're using a politically incorrect libc.

Yeah :-) Man it feels good.

> But sure, big pages are not always good.

Hm. With wide TLB's, what are the benefits to big pages? One
pathological case that hit me a few years back was a workload which
bounced around in VM in a pattern that really thrash the cache due to
aliasing. It wouldn't have been a problem if we had truly fully
set-associative caches, rather than N-way (where N is 2, 4 or 8
usually). But big pages won't help that much here (it's just a way of
reducing TLB thrash, but doesn't help with cache thrashing).

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:36         ` Richard Gooch
  2002-03-16 20:38           ` Linus Torvalds
@ 2002-03-16 20:51           ` Richard Gooch
  1 sibling, 0 replies; 137+ messages in thread
From: Richard Gooch @ 2002-03-16 20:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

Linus Torvalds writes:
> 
> On Sat, 16 Mar 2002, Richard Gooch wrote:
> > 
> > These are contiguous physical pages, or just logical (virtual) pages?
> 
> Contiguous virtual pages, but discontiguous physical pages.

That's what I was hoping. Having both contiguous would be of some
benefit, but of course at the cost of having to unfragment physical
pages. Even if Andi can cleanly do that with rmap, it's still going to
cost (page copies, Dcache footprint, locking and more). I like the
"wide" TLB approach much more.

> The advantage being that you only need one set of virtual tags per
> "wide" entry, and you just fill the whole wide entry directly from
> the cacheline (ie the TLB entry is not really 32 bits any more, it's
> a full cacheline).
> 
> The _real_ advantage being that it should be totally invisible to
> software. I think Intel does something like this, but the point is,
> I don't even have to know, and it still works.

Completely behind the kernel's back? Even so, is there some hint we
can give to the CPU to help? Or perhaps a hint an application can give
to the kernel to specify better alignment of mappings? The latter
would require a way for the kernel to find out the preferred alignment
from the CPU. Is this information available?

Anyone know if AMD does this as well?

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:27           ` Richard Gooch
@ 2002-03-16 20:47             ` yodaiken
  2002-03-16 21:05             ` Richard Gooch
  1 sibling, 0 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 20:47 UTC (permalink / raw)
  To: Richard Gooch
  Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel, torvalds

On Sat, Mar 16, 2002 at 01:27:52PM -0700, Richard Gooch wrote:
> yodaiken@fsmlabs.com writes:
> > On Sat, Mar 16, 2002 at 09:05:04PM +0100, Andi Kleen wrote:
> > > That will hopefully change eventually because 2M pages are a big help for
> > > a lot of applications that are limited by TLB thrashing, but needs some 
> > > thinking on how to avoid the fragmentation trap (e.g. I'm considering
> > > adding a highmem zone again just for that and using rmap with targeted
> > > physical freeing to allocate them) 
> > 
> > To me, once you have a G of memory, wasting a few meg on unused
> > process memory seems no big deal.
> 
> I'm not happy to throw away 2 MiB per process. My workstation has 1
> GiB of RAM, and 65 processes (and that's fairly low compared to your
> average desktop these days, because I just use olwm and don't have a
> fancy desktop or lots of windows). You want me to throw over 1/8th of
> my RAM away?!?

Why not?  If you just ran vim on console you'd be more productive and
not need all those worthless processes. 

At 4KB/page and 8bytes/pte a
1G process will need at least 2MB of pte alone ! Add in the 4 layers,
the software VM struct, ...


> 
> And in fact, isn't it going to be more than 2 MiB wasted per process?
> For each shared object loaded, only partial pages are going to be
> used. *My* libc is less than 700 KiB, so I'd be wasting most of a page
> to map it in.

You're using a politically incorrect libc. 
But sure, big pages are not always good.


> I want that 1 GiB of RAM to be used to cache most of my data. Those
> NASA 1km/pixel satellite mosaics of the world are pretty big, you know
> (21600x21600x3 per hemisphere:-).


> 
> 				Regards,
> 
> 					Richard....
> Permanent: rgooch@atnf.csiro.au
> Current:   rgooch@ras.ucalgary.ca

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:36         ` Richard Gooch
@ 2002-03-16 20:38           ` Linus Torvalds
  2002-03-16 20:51           ` Richard Gooch
  1 sibling, 0 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 20:38 UTC (permalink / raw)
  To: Richard Gooch; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002, Richard Gooch wrote:
> 
> These are contiguous physical pages, or just logical (virtual) pages?

Contiguous virtual pages, but discontiguous physical pages.

The advantage being that you only need one set of virtual tags per "wide" 
entry, and you just fill the whole wide entry directly from the cacheline 
(ie the TLB entry is not really 32 bits any more, it's a full cacheline).

The _real_ advantage being that it should be totally invisible to 
software. I think Intel does something like this, but the point is, I 
don't even have to know, and it still works.
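
Conceptually it might look like the struct below - this is hardware, so
the C is purely illustrative: one virtual tag covers a group of aligned
4k pages, each subpage with its own frame number and valid bit, and the
whole entry is filled from one cacheline of ptes.

/* Illustrative only: a "wide" TLB entry covering 8 aligned 4k pages.
 * One virtual tag; the physical frames need not be contiguous. */
#define SUBPAGES_PER_ENTRY      8

struct wide_tlb_entry {
        unsigned long vpn_base;                    /* virtual page number of the aligned group */
        unsigned long pfn[SUBPAGES_PER_ENTRY];     /* one physical frame per 4k subpage        */
        unsigned char valid[SUBPAGES_PER_ENTRY];   /* per-subpage present bit                  */
        unsigned char attrs[SUBPAGES_PER_ENTRY];   /* per-subpage protection/dirty bits        */
};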

			Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:57       ` yodaiken
  2002-03-16 20:05         ` Andi Kleen
  2002-03-16 20:14         ` Linus Torvalds
@ 2002-03-16 20:36         ` Richard Gooch
  2002-03-16 20:38           ` Linus Torvalds
  2002-03-16 20:51           ` Richard Gooch
  2 siblings, 2 replies; 137+ messages in thread
From: Richard Gooch @ 2002-03-16 20:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

Linus Torvalds writes:
> Instead of large pages, you should be asking for larger and wider TLB's
> (for example, nothign says that a TLB entry has to be a single page:
> people already do the kind of "super-entries", where one TLB entry
> actually contains data for 4 or 8 aligned pages, so you get the _effect_
> of a 32kB page that really is 8 consecutive 4kB pages).
> 
> Such a "wide" TLB entry has all the advantages of small pages (no
> memory fragmentation, backwards compatibility etc), while still
> being able to load 64kB worth of translations in one go.

These are contiguous physical pages, or just logical (virtual) pages?

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:12           ` yodaiken
@ 2002-03-16 20:34             ` Linus Torvalds
  2002-03-16 21:39               ` yodaiken
  0 siblings, 1 reply; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 20:34 UTC (permalink / raw)
  To: yodaiken; +Cc: Andi Kleen, Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> 
> To me, once you have a G of memory, wasting a few meg on unused process 
> memory seems no big deal.

It's not the process memory, and it is a whole lot more than a "few meg" if 
your page size is 2M.

Look at "free" output one day, and notice that "cached" line? On my 2G 
machine, I usually have about a gig cached or so. Guess what the most 
common thing in that case is? Yeah, the kernel. 

And my kernel tree (with bk overhead etc) is right now about 25,000 files. 
That's without object files etc. At 2M a pop in the page cache, that's a 
whole lot more memory for caching than I have in my machine.

Ok, so assume you compress that, and you only actually use the full 2M 
when mapping into user space, you now added a lot of complexity, but at 
least you make the ridiculous memory use go down. 

But even in the process space, I've got about 150 processes quite 
normally, and while most of them are idle, if we had 2M pages most of them 
would waste at least 2M of memory (probably more - the stack doesn't even 
need half a page, and the data section would probably waste half a page on 
average).

That's 300M just wasted.

Tell me that's peanuts even if you've got a few gigs of ram on your 
machine. 

Admit it, you're just wrong. 2M page sizes are _not_ useful for the common
case, and won't be for years to come.

In short, youäre 

> They say:
> 	Hammer microarchitecture features a flush filter allowing multiple
> 	processes to share TLB without SW intervention.
> 
> Not a lot of technical detail in that.

I suspect it's some special case for windows with a special MSR that 
enables something illegal that just works well for whatever patterns 
windows does.

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:05         ` Andi Kleen
  2002-03-16 20:12           ` yodaiken
@ 2002-03-16 20:27           ` Richard Gooch
  2002-03-16 20:47             ` yodaiken
  2002-03-16 21:05             ` Richard Gooch
  2002-03-17  2:50           ` Chris Wedgwood
  2 siblings, 2 replies; 137+ messages in thread
From: Richard Gooch @ 2002-03-16 20:27 UTC (permalink / raw)
  To: yodaiken; +Cc: Andi Kleen, Paul Mackerras, linux-kernel, torvalds

yodaiken@fsmlabs.com writes:
> On Sat, Mar 16, 2002 at 09:05:04PM +0100, Andi Kleen wrote:
> > That will hopefully change eventually because 2M pages are a big help for
> > a lot of applications that are limited by TLB thrashing, but it needs some 
> > thinking on how to avoid the fragmentation trap (e.g. I'm considering
> > adding a highmem zone again just for that and using rmap with targeted
> > physical freeing to allocate them) 
> 
> To me, once you have a G of memory, wasting a few meg on unused
> process memory seems no big deal.

I'm not happy to throw away 2 MiB per process. My workstation has 1
GiB of RAM, and 65 processes (and that's fairly low compared to your
average desktop these days, because I just use olwm and don't have a
fancy desktop or lots of windows). You want me to throw over 1/8th of
my RAM away?!?

And in fact, isn't it going to be more than 2 MiB wasted per process?
For each shared object loaded, only partial pages are going to be
used. *My* libc is less than 700 KiB, so I'd be wasting most of a page
to map it in.

I want that 1 GiB of RAM to be used to cache most of my data. Those
NASA 1km/pixel satellite mosaics of the world are pretty big, you know
(21600x21600x3 per hemisphere:-).
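
For scale, that's

  21600 x 21600 x 3 bytes ~= 1.3 GiB per hemisphere,

so about 2.6 GiB for the pair - already bigger than the whole 1 GiB of RAM
before a single byte gets wasted on large-page padding.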

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:14         ` Linus Torvalds
@ 2002-03-16 20:22           ` Andi Kleen
  2002-03-19  4:34             ` Rusty Russell
  2002-03-17 13:23           ` Rik van Riel
  2002-03-24 21:12           ` Rogier Wolff
  2 siblings, 1 reply; 137+ messages in thread
From: Andi Kleen @ 2002-03-16 20:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: yodaiken, Andi Kleen, Paul Mackerras, linux-kernel

On Sat, Mar 16, 2002 at 12:14:06PM -0800, Linus Torvalds wrote:
> Oh, and in the specific case of hammer, one of the main advantages of the 
> thing is of course running old binaries unchanged. And old binaries 
> certainly do mmap's at smaller granularity than 2M (and have to, because a 
> 3G user address space won't fit all that many 2M chunks).

The idea was to map only selected mappings with large pages, e.g. shared
memory mappings to help all the databases, or to use a special mmap flag 
for the Beowulf people. 
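
A minimal sketch of what that opt-in could look like from user space (the
flag is an assumption here - nothing like it existed at the time of this
thread, though much later Linux did grow a MAP_HUGETLB flag of roughly
this shape):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB               /* stand-in for the hypothetical flag */
#define MAP_HUGETLB 0x40000
#endif

int main(void)
{
        size_t len = 2UL * 1024 * 1024;   /* one 2M page */

        /* Opt in explicitly: only this mapping gets a large page;
         * every ordinary 4kB mmap() elsewhere is untouched. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");   /* e.g. no large pages available */
                return 1;
        }
        memset(p, 0, len);        /* touch it: one TLB entry now covers 2M */
        munmap(p, len);
        return 0;
}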

> Give up on large pages - it's just not happening. Even when a 64kB page 
> would make sense from a technology standpoint these days, backwards 
> compatibility makes people stay at 4kB.

Yes, the 4KB page has to be kept, at least for now. 

-Andi

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:57       ` yodaiken
  2002-03-16 20:05         ` Andi Kleen
@ 2002-03-16 20:14         ` Linus Torvalds
  2002-03-16 20:22           ` Andi Kleen
                             ` (2 more replies)
  2002-03-16 20:36         ` Richard Gooch
  2 siblings, 3 replies; 137+ messages in thread
From: Linus Torvalds @ 2002-03-16 20:14 UTC (permalink / raw)
  To: yodaiken; +Cc: Andi Kleen, Paul Mackerras, linux-kernel


On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> 
> What about 2M pages?

Not useful for generic loads right now, and the latencies for clearing or
copying them etc (ie single page faults - nopage or COW) are still
big enough that it would likely be a performance problem at that level.  
And while doing IO in 2MB chunks sounds like fun, since most files are
still just a few kB, your page cache memory overhead would be prohibitive
(ie even if you had 64GB of memory, you might want to cache more than a
few thousand files at the same time).
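
Quick arithmetic on that:

  64GB of RAM / 2MB per cached file = 32,768 files, tops -
  and a few-kB file still burns its whole 2MB slot.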

So then you'd need to do page caching at a finer granularity than you do
mmap, which implies absolutely horrible things from a coherency standpoint
(mmap/read/write are supposed to be coherent in a nice UNIX - even if
there are some non-nice unixes still around).

We may get there some day, but right now 2M pages are not usable for user 
access.

64kB would be fine, though.

Oh, and in the specific case of hammer, one of the main advantages of the 
thing is of course running old binaries unchanged. And old binaries 
certainly do mmap's at smaller granularity than 2M (and have to, because a 
3G user address space won't fit all that many 2M chunks).

Give up on large pages - it's just not happening. Even when a 64kB page 
would make sense from a technology standpoint these days, backwards 
compatibility makes people stay at 4kB.

Instead of large pages, you should be asking for larger and wider TLB's
(for example, nothing says that a TLB entry has to be a single page:
people already do this kind of "super-entries", where one TLB entry
actually contains data for 4 or 8 aligned pages, so you get the _effect_
of a 32kB page that really is 8 consecutive 4kB pages).

Such a "wide" TLB entry has all the advantages of small pages (no 
memory fragmentation, backwards compatibility etc), while still being able 
to load 64kB worth of translations in one go.

(One of the advantages of a page table tree over a hashed setup becomes
clear in this kind of situation: you cannot usefully load multiple entries
from the same cacheline into one TLB entry in a hashed table, while in a
tree it's truly trivial)
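
As a conceptual sketch (pure illustration - the structures and names below
are invented for this description, not anything a real CPU or Linux
exposes): the bottom level of the tree hands the walker a cacheline of 8
PTEs, and the "wide" entry simply keeps all 8 translations.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                     /* 4kB base pages */

/* One cacheline of the bottom level of the tree: 8 consecutive 64-bit PTEs.
 * This is exactly what a walker pulls in on a miss. */
struct pte_line {
        uint64_t pte[8];
};

/* A "wide" TLB entry: one tag for a 32kB-aligned virtual region, carrying all
 * 8 individual 4kB translations from that cacheline.  The 8 pages need not be
 * physically contiguous, which is why there is no fragmentation problem. */
struct wide_tlb_entry {
        uint64_t vtag;                    /* virtual address >> 15 */
        uint64_t pte[8];                  /* the cacheline, loaded verbatim */
};

/* Refill: one page-table walk loads eight translations at once. */
static void fill_wide_entry(struct wide_tlb_entry *e, uint64_t vaddr,
                            const struct pte_line *line)
{
        e->vtag = vaddr >> 15;            /* 8 x 4kB = one 32kB region */
        for (int i = 0; i < 8; i++)
                e->pte[i] = line->pte[i];
}

/* Lookup: pick the sub-entry with bits 14..12 of the virtual address. */
static bool wide_lookup(const struct wide_tlb_entry *e, uint64_t vaddr,
                        uint64_t *pte_out)
{
        if ((vaddr >> 15) != e->vtag)
                return false;             /* miss: walk the tree and refill */
        *pte_out = e->pte[(vaddr >> PAGE_SHIFT) & 7];
        return (*pte_out & 1) != 0;       /* present bit, x86-style */
}

In a hashed table the 8 neighbouring virtual pages land in unrelated
buckets, so no single cacheline holds them - hence the tree's advantage.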

		Linus


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 20:05         ` Andi Kleen
@ 2002-03-16 20:12           ` yodaiken
  2002-03-16 20:34             ` Linus Torvalds
  2002-03-16 20:27           ` Richard Gooch
  2002-03-17  2:50           ` Chris Wedgwood
  2 siblings, 1 reply; 137+ messages in thread
From: yodaiken @ 2002-03-16 20:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: yodaiken, Paul Mackerras, linux-kernel, torvalds

On Sat, Mar 16, 2002 at 09:05:04PM +0100, Andi Kleen wrote:
> On Sat, Mar 16, 2002 at 12:57:11PM -0700, yodaiken@fsmlabs.com wrote:
> > On Sat, Mar 16, 2002 at 08:32:26PM +0100, Andi Kleen wrote:
> > > x86-64 aka AMD Hammer does hardware (or more likely microcode) search of 
> > > page tables.
> > > It has a 4 level page table with 4K pages. Generic Linux MM code only sees
> > > the first slot in the 4th level, which limits user space to 512GB with 3 levels. 
> > 
> > What about 2M pages?
> 
> They are not supported for user space, but used in private mappings
> for kernel text and direct memory mappings. Generic code never sees them.
> 
> That will hopefully change eventually because 2M pages are a big help for
> a lot of applications that are limited by TLB thrashing, but it needs some 
> thinking on how to avoid the fragmentation trap (e.g. I'm considering
> adding a highmem zone again just for that and using rmap with targeted
> physical freeing to allocate them) 

To me, once you have a G of memory, wasting a few meg on unused process 
memory seems no big deal.

> > There was something in some AMD doc about preventing a TLB flush on process
> > switch - through a context-like thing perhaps? Any idea?
> 
> There are global pages which are normally not flushed across context
> switches. Those are used for all kernel mappings.
> 
> There is also some optimization in the CPU that tries to do a selective
> flush only when you reload CR3, but as far as I can see it doesn't help
> for the Linux context switch. It only works around broken TLB flushing
> algorithms in some Windows version.

They say:
	Hammer microarchitecture features a flush filter allowing multiple
	processes to share TLB without SW intervention.

Not a lot of technical detail in that.

> 
> -Andi

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:57       ` yodaiken
@ 2002-03-16 20:05         ` Andi Kleen
  2002-03-16 20:12           ` yodaiken
                             ` (2 more replies)
  2002-03-16 20:14         ` Linus Torvalds
  2002-03-16 20:36         ` Richard Gooch
  2 siblings, 3 replies; 137+ messages in thread
From: Andi Kleen @ 2002-03-16 20:05 UTC (permalink / raw)
  To: yodaiken; +Cc: Andi Kleen, Paul Mackerras, linux-kernel, torvalds

On Sat, Mar 16, 2002 at 12:57:11PM -0700, yodaiken@fsmlabs.com wrote:
> On Sat, Mar 16, 2002 at 08:32:26PM +0100, Andi Kleen wrote:
> > x86-64 aka AMD Hammer does hardware (or more likely microcode) search of 
> > page tables.
> > It has a 4 level page table with 4K pages. Generic Linux MM code only sees
> > the first slot in the 4th level, which limits user space to 512GB with 3 levels. 
> 
> What about 2M pages?

They are not supported for user space, but used in private mappings
for kernel text and direct memory mappings. Generic code never sees them.

That will hopefully change eventually because 2M pages are a big help for
a lot of applications that are limited by TLB thrashing, but it needs some 
thinking on how to avoid the fragmentation trap (e.g. I'm considering
adding a highmem zone again just for that and using rmap with targeted
physical freeing to allocate them) 

> 
> > Direct mappings and kernel mappings are handled specially by architecture 
> > specific code outside that first slot.
> > 
> > The CPU itself has I/D TLBs split into L1 and L2.
> 
> There was something in some AMD doc about preventing a TLB flush on process
> switch - through a context-like thing perhaps? Any idea?

There are global pages which are normally not flushed across context
switches. Those are used for all kernel mappings.

There is also some optimization in the CPU that tries to do a selective
flush only when you reload CR3, but as far as I can see it doesn't help
for the Linux context switch. It only works around broken TLB flushing
algorithms in some Windows version.
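
For the x86 case the mechanism looks roughly like this (a sketch using the
architectural bit positions; the helpers are made up for illustration, this
is not actual kernel code):

#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_USER     (1ULL << 2)
#define PTE_GLOBAL   (1ULL << 8)   /* x86 "G" bit */
#define CR4_PGE      (1UL  << 7)   /* CR4 bit that enables global pages */

/* Kernel mappings are identical in every address space, so they get the G
 * bit: a CR3 write (i.e. a context switch) leaves those TLB entries alone.
 * Flushing them requires toggling CR4.PGE or an explicit INVLPG. */
static inline uint64_t mk_kernel_pte(uint64_t pfn)
{
        return (pfn << 12) | PTE_PRESENT | PTE_RW | PTE_GLOBAL;
}

/* User mappings are per-process, so no G bit: the next CR3 reload drops
 * their TLB entries automatically. */
static inline uint64_t mk_user_pte(uint64_t pfn)
{
        return (pfn << 12) | PTE_PRESENT | PTE_RW | PTE_USER;
}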

-Andi

^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
  2002-03-16 19:32     ` [Lse-tech] Re: 10.31 " Andi Kleen
@ 2002-03-16 19:57       ` yodaiken
  2002-03-16 20:05         ` Andi Kleen
                           ` (2 more replies)
  0 siblings, 3 replies; 137+ messages in thread
From: yodaiken @ 2002-03-16 19:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: yodaiken, Paul Mackerras, linux-kernel, torvalds

On Sat, Mar 16, 2002 at 08:32:26PM +0100, Andi Kleen wrote:
> x86-64 aka AMD Hammer does hardware (or more likely microcode) search of 
> page tables.
> It has a 4 level page table with 4K pages. Generic Linux MM code only sees
> the first slot in the 4th level, which limits user space to 512GB with 3 levels. 

What about 2M pages?

> Direct mappings and kernel mappings are handled specially by architecture 
> specific code outside that first slot.
> 
> The CPU itself has I/D TLBs split into L1 and L2.

There was something in some AMD doc about preventing a TLB flush on process
switch - through a context-like thing perhaps? Any idea?


> 
> -Andi

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 137+ messages in thread

* Re: [Lse-tech] Re: 10.31 second kernel compile
       [not found]   ` <20020316115726.B19495@hq.fsmlabs.com.suse.lists.linux.kernel>
@ 2002-03-16 19:32     ` Andi Kleen
  2002-03-16 19:57       ` yodaiken
  0 siblings, 1 reply; 137+ messages in thread
From: Andi Kleen @ 2002-03-16 19:32 UTC (permalink / raw)
  To: yodaiken; +Cc: Paul Mackerras, linux-kernel, torvalds

yodaiken@fsmlabs.com writes:

> Is there a 64-bit machine with hardware search of pagetables? Even IBM
> only has a hardware search of hash tables - which we agree are simply
> a means of making your hardware TLB larger and slower.

x86-64 aka AMD Hammer does hardware (or more likely microcode) search of 
page tables.
It has a 4 level page table with 4K pages. Generic Linux MM code only sees
the first slot in the 4th level, which limits user space to 512GB with 3 levels. 
Direct mappings and kernel mappings are handled specially by architecture 
specific code outside that first slot.
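
The 512GB figure falls out of the page-table geometry:

  4kB pages                       -> 12 offset bits
  4kB / 8-byte PTEs = 512 entries ->  9 index bits per level
  3 levels: 9+9+9+12 = 39 bits    -> 512GB per top-level slot
  (all 4 levels: 48 bits          -> 256TB of virtual address space)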

The CPU itself has I/D TLBs split into L1 and L2.

-Andi

^ permalink raw reply	[flat|nested] 137+ messages in thread

end of thread, other threads:[~2002-04-02 12:39 UTC | newest]

Thread overview: 137+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-03-13  8:52 10.31 second kernel compile Anton Blanchard
2002-03-13 14:44 ` Martin J. Bligh
2002-03-13 21:44   ` [Lse-tech] " Dave Hansen
2002-03-14  1:07     ` Keith Owens
2002-03-14 11:27   ` Anton Blanchard
2002-03-14 13:16     ` [Lse-tech] " Dipankar Sarma
2002-03-17 13:12       ` some RCU dcache and ratcache results Anton Blanchard
2002-03-14 13:21     ` [Lse-tech] Re: 10.31 second kernel compile Momchil Velikov
2002-03-14 18:33       ` Daniel Phillips
2002-03-15 12:16         ` Chris Wedgwood
2002-03-16  5:12           ` Anton Blanchard
2002-03-15 18:20         ` Linus Torvalds
2002-03-16 15:24           ` Daniel Phillips
2002-03-16 19:01             ` Linus Torvalds
2002-03-16 22:25               ` Daniel Phillips
2002-03-19 16:35                 ` Bill Davidsen
2002-03-18  3:07           ` David S. Miller
2002-03-16 11:55         ` Paul Mackerras
2002-03-16 17:25           ` Rik van Riel
2002-03-16 17:57           ` yodaiken
2002-03-16 18:06           ` Linus Torvalds
2002-03-16 18:35             ` yodaiken
2002-03-16 18:45               ` Linus Torvalds
2002-03-16 18:57                 ` yodaiken
2002-03-16 19:16                   ` Linus Torvalds
2002-03-16 19:53                     ` yodaiken
2002-03-16 20:02                       ` Linus Torvalds
2002-03-16 20:25                         ` yodaiken
2002-03-27  1:07                     ` Richard Henderson
2002-03-16 19:43                   ` David Mosberger
2002-03-16 19:58                     ` Linus Torvalds
2002-03-16 20:08                       ` yodaiken
2002-03-16 20:23                         ` Linus Torvalds
2002-03-16 20:36                     ` David Mosberger
2002-03-16 20:46                       ` Linus Torvalds
2002-03-17  1:09                       ` Paul Mackerras
2002-03-17  2:08                         ` Linus Torvalds
2002-03-16 20:53             ` Alan Cox
2002-03-14 19:05       ` Linus Torvalds
2002-03-19 16:40         ` Bill Davidsen
2002-03-14 18:21   ` Hanna Linder
2002-03-16  5:27     ` Anton Blanchard
2002-03-15  7:12   ` Chris Wedgwood
2002-03-16  6:15 ` 7.52 " Anton Blanchard
2002-03-16  6:42   ` [Lse-tech] " Gerrit Huizenga
2002-03-17 12:34     ` Anton Blanchard
2002-03-17 22:09       ` Theodore Tso
2002-03-18  7:04         ` Jeff Garzik
2002-03-19 18:28           ` Theodore Tso
2002-03-16  8:05   ` Linus Torvalds
2002-03-16 11:54     ` yodaiken
2002-03-16 11:04   ` Paul Mackerras
2002-03-16 18:32     ` Linus Torvalds
2002-03-17  2:00     ` Paul Mackerras
2002-03-17  2:40       ` Linus Torvalds
2002-03-17  2:50         ` M. Edward Borasky
2002-03-18 15:08           ` 0.73 " snpe
2002-03-18 19:42       ` 7.52 " Cort Dougan
2002-03-18 20:04         ` Linus Torvalds
2002-03-18 20:23           ` Linus Torvalds
2002-03-18 21:50             ` Rene Herman
2002-03-18 22:36             ` Cort Dougan
2002-03-18 22:47               ` Linus Torvalds
2002-03-18 22:56                 ` Cort Dougan
2002-03-18 23:52                 ` Paul Mackerras
2002-03-19  0:57                   ` Dave Jones
2002-03-19  3:35                     ` Jeff Garzik
2002-03-19  0:22                 ` David S. Miller
2002-03-19  0:27                   ` Cort Dougan
2002-03-19  0:27                     ` David S. Miller
2002-03-19  0:36                       ` Cort Dougan
2002-03-19  0:38                         ` David S. Miller
2002-03-19  1:28                           ` Davide Libenzi
2002-03-19  2:42             ` Paul Mackerras
2002-03-27  2:53             ` Richard Henderson
2002-04-02  4:32               ` Linus Torvalds
2002-04-02 10:50             ` Pablo Alcaraz
2002-03-18 21:34           ` Cort Dougan
2002-03-18 22:00             ` Linus Torvalds
2002-03-18 19:37     ` Cort Dougan
2002-03-16 17:37   ` [Lse-tech] " Martin J. Bligh
2002-03-17  1:45     ` Keith Owens
2002-03-17 13:54     ` David Woodhouse
2002-03-19 16:49     ` Bill Davidsen
2002-03-16 18:57   ` Daniel Egger
2002-03-17  8:18     ` Mike Galbraith
2002-03-17 15:29       ` Martin J. Bligh
     [not found] <20020316113536.A19495@hq.fsmlabs.com.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.33.0203161037160.31913-100000@penguin.transmeta.com.suse.lists.linux.kernel>
     [not found]   ` <20020316115726.B19495@hq.fsmlabs.com.suse.lists.linux.kernel>
2002-03-16 19:32     ` [Lse-tech] Re: 10.31 " Andi Kleen
2002-03-16 19:57       ` yodaiken
2002-03-16 20:05         ` Andi Kleen
2002-03-16 20:12           ` yodaiken
2002-03-16 20:34             ` Linus Torvalds
2002-03-16 21:39               ` yodaiken
2002-03-16 21:49                 ` Linus Torvalds
2002-03-17 14:38                   ` Kai Henningsen
2002-03-17 18:20                     ` Alan Cox
2002-03-16 22:00                 ` Alan Cox
2002-03-16 21:49                   ` Linus Torvalds
2002-03-16 23:10                   ` yodaiken
2002-03-17  1:17                     ` rddunlap
2002-03-17  3:34                     ` Alan Cox
2002-03-17 14:52                 ` Kai Henningsen
2002-03-17 21:00                   ` yodaiken
2002-03-19 12:06                 ` Pavel Machek
2002-03-19 21:12                   ` yodaiken
2002-03-19 22:09                     ` Chris Friesen
2002-03-19 22:15                       ` yodaiken
2002-03-20  4:25                     ` Bill Davidsen
2002-03-16 20:27           ` Richard Gooch
2002-03-16 20:47             ` yodaiken
2002-03-16 21:05             ` Richard Gooch
2002-03-16 23:34               ` yodaiken
2002-03-17 13:48               ` Rik van Riel
2002-03-17  2:50           ` Chris Wedgwood
2002-03-17  3:43             ` Alan Cox
2002-03-17  4:12               ` Chris Wedgwood
2002-03-17  4:31                 ` Alan Cox
2002-03-16 20:14         ` Linus Torvalds
2002-03-16 20:22           ` Andi Kleen
2002-03-19  4:34             ` Rusty Russell
2002-03-17 13:23           ` Rik van Riel
2002-03-17 18:16             ` Linus Torvalds
2002-03-17 23:01               ` Davide Libenzi
2002-03-18  0:53                 ` Rik van Riel
2002-03-18  1:13                   ` Davide Libenzi
2002-03-18  1:31                     ` Linus Torvalds
2002-03-18  1:56                       ` Davide Libenzi
2002-03-18  1:40                     ` Mike Fedyk
2002-03-18  1:48                       ` Davide Libenzi
2002-03-24 21:12           ` Rogier Wolff
2002-03-24 21:35             ` Andrew Morton
2002-03-24 22:54               ` Nick Craig-Wood
2002-03-24 23:41                 ` Andi Kleen
2002-03-25  6:40               ` Martin J. Bligh
2002-03-16 20:36         ` Richard Gooch
2002-03-16 20:38           ` Linus Torvalds
2002-03-16 20:51           ` Richard Gooch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).