linux-kernel.vger.kernel.org archive mirror
* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
@ 2004-12-03 14:49 Sebastien Decugis
  0 siblings, 0 replies; 55+ messages in thread
From: Sebastien Decugis @ 2004-12-03 14:49 UTC (permalink / raw)
  To: linux-ia64, linux-mm, linux-kernel

[Gerrit Huizenga, 2004-12-02 16:24:04]
> Towards that end, there
> was a recent effort at Bull on the NPTL work which serves as a very
> interesting model:

> http://nptl.bullopensource.org/Tests/results/run-browse.php

> Basically, you can compare results from any test run with any other
> and get a summary of differences.  That helps give a quick status
> check and helps you focus on the correct issues when tracking down
> defects.

Thanks Gerrit for mentioning this :)

One additional piece of information -- the tool used to build this reporting
system is OSS and can be found here:
http://tslogparser.sourceforge.net

This tool is not mature yet, but it gives an overview of how useful a
test suite can be, when the results are easy to analyse...

It currently supports only the Open POSIX Test Suite, but I'd be happy
to work on enlarging its scope.

Regards, 
Seb.

PS: please include me in replies, as I'm not subscribed to the list...
-------------------------------
Sebastien DECUGIS
NPTL Test & Trace Project
http://nptl.bullopensource.org/

"You may fail if you try.
You -will- fail if you don't."



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 14:30                   ` Andy Warner
@ 2005-01-06 23:40                     ` Jeff Garzik
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff Garzik @ 2005-01-06 23:40 UTC (permalink / raw)
  To: Andy Warner; +Cc: Andrew Morton, torvalds, benh, linux-kernel, linux-ide

Andy Warner wrote:
> Jeff Garzik wrote:
> 
>>[...]
>>I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes
>>ata_piix (Intel ICH5/6/7) to fail to find some SATA devices on x86-64
>>SMP but works on UP.  Potentially related to >=4GB of RAM.
>>
>>
>>
>>Details, in case anyone is interested:
>>Unless my code is screwed up (certainly possible), PIO data-in [using 
>>the insw() call] seems to return all zeroes on a true-blue SMP machine, 
>>for the identify-device command.  When this happens, libata (correctly) 
>>detects a bad id page and bails.  (problem doesn't show up on single CPU 
>>w/ HT)
> 
> 
> Ah, I might have been here recently, with the pass-thru stuff.
> 
> What I saw was that in an SMP machine:
> 
> 1. queue_work() can result in the work running (on another
>    CPU) instantly.
> 
> 2. Having one CPU beat on PIO registers reading data from one port
>    would significantly alter the timing of the CMD->BSY->DRQ sequence
>    used in PIO. This behaviour was far worse for competing ports
>    within one chip, which I put down to arbitration problems.
> 
> 3. CPU utilisation would go through the roof. Effectively the
>    entire pio_task state machine reduced to a busy spin loop.
> 
> 4. The state machine needed some tweaks, especially in error
>    handling cases.
> 
> I made some changes, which effectively solved the problem for Promise
> TX4-150 cards, and was going to test the results on other chipsets
> next week before speaking up. Specifically, I have seen some
> issues with SiI 3114 cards.
> 
> I was trying to explore using interrupts instead of polling the state,
> but for some reason I was not getting them for PIO data operations
> after removing ata_qc_set_polling() - or I misunderstand the spec.
> Again, I saw a difference in behaviour between the Promise & SiI
> cards here.
> 
> I'm about to go offline for 3 days, and hadn't prepared for this
> yet. The best I can do is provide a patch (attached) that applies
> against 2.6.9. It also seems to apply against libata-2.6, but
> barfs a bit against libata-dev-2.6.
> 
> The changes boil down to these:
> 
> 1. Minor changes in how status/error regs are read.
>    Including attempts to use altstatus, while I was
>    exploring interrupts.
> 
> 2. State machine logic changes.
> 
> 3. Replace calls to queue_work() with queue_delayed_work()
>    to stop SMP machines going crazy.
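
A minimal sketch of the change in item 3, using the 2.6-era workqueue API
(queue_delayed_work() still took a plain work_struct then); ata_wq and the
pio_task field are illustrative stand-ins here, not a quote of the patch:

	/* before: with zero delay the work may run again immediately,
	 * possibly on another CPU, turning the poll into a busy spin */
	queue_work(ata_wq, &ap->pio_task);

	/* after: back off for at least one tick between polls */
	queue_delayed_work(ata_wq, &ap->pio_task, 1);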
> 
> With these changes, on a platform consisting of 2.6.9 and
> Promise TX4-150 cards, I can move terabytes of parallel
> PIO data, without error.
> 
> My gut says that the PIO mechanism should be overhauled. I composed
> a "how much should we pay for this muffler" email to linux-ide at
> least twice while working on this, but never sent it - wanting to
> send in a solution rather than just making more comments from the
> peanut gallery.
> 
> I'll pick up the thread on this next week, when I'm back online.
> I hope this helps.

Please let me know if you still have problems.

The PIO SMP problems seem to be fixed here.

	Jeff





* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-12 21:24                   ` William Lee Irwin III
@ 2004-12-17  3:31                     ` Christoph Lameter
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Lameter @ 2004-12-17  3:31 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Hugh Dickins, Nick Piggin, Linus Torvalds, Andrew Morton,
	Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel

On Sun, 12 Dec 2004, William Lee Irwin III wrote:

> On Sun, Dec 12, 2004 at 09:33:11AM +0000, Hugh Dickins wrote:
> > Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock
> > similarly racy, in both the 64-on-32 cases, and on architectures
> > which have a more complex pmd_t (sparc, m68k, h8300)?  Sigh.
>
> yes.

Those may fall back to using the page_table_lock for individual operations
that cannot be realized atomically.
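
A sketch of what such a fallback could look like, using 2.6-era mm
interfaces (pte_same, set_pte, mm->page_table_lock); this illustrates the
idea and is not code from the patchset:

	static inline int ptep_cmpxchg(struct vm_area_struct *vma,
				       unsigned long address, pte_t *ptep,
				       pte_t oldval, pte_t newval)
	{
		struct mm_struct *mm = vma->vm_mm;
		int ret = 0;

		/* emulate compare-and-exchange under the ptl on
		 * architectures whose ptes cannot be updated atomically */
		spin_lock(&mm->page_table_lock);
		if (pte_same(*ptep, oldval)) {
			set_pte(ptep, newval);
			ret = 1;
		}
		spin_unlock(&mm->page_table_lock);
		return ret;
	}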



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-12  9:33                 ` Hugh Dickins
  2004-12-12  9:48                   ` Nick Piggin
@ 2004-12-12 21:24                   ` William Lee Irwin III
  2004-12-17  3:31                     ` Christoph Lameter
  1 sibling, 1 reply; 55+ messages in thread
From: William Lee Irwin III @ 2004-12-12 21:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Christoph Lameter, Linus Torvalds, Andrew Morton,
	Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel

On Sun, Dec 12, 2004 at 09:33:11AM +0000, Hugh Dickins wrote:
> Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock
> similarly racy, in both the 64-on-32 cases, and on architectures
> which have a more complex pmd_t (sparc, m68k, h8300)?  Sigh.

yes.


-- wli


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-12  9:33                 ` Hugh Dickins
@ 2004-12-12  9:48                   ` Nick Piggin
  2004-12-12 21:24                   ` William Lee Irwin III
  1 sibling, 0 replies; 55+ messages in thread
From: Nick Piggin @ 2004-12-12  9:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton,
	Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel

Hugh Dickins wrote:
> On Sun, 12 Dec 2004, Nick Piggin wrote:
> 
>>Christoph Lameter wrote:
>>
>>>On Thu, 9 Dec 2004, Hugh Dickins wrote:
>>
>>>>probably others (harder to think through).  Your 4/7 patch for i386 has
>>>>an unused atomic get_64bit function from Nick, I think you'll have to
>>>>define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
>>>
>>>That would be a performance issue.
>>
>>Problems were pretty trivial to reproduce here with non-atomic 64-bit
>>loads being cut in half by atomic 64-bit stores. I don't see a way
>>around them, unfortunately.
> 
> 
> Of course, it'll only be a performance issue in the 64-on-32 cases:
> the 64-on-64 and 32-on-32 macro should reduce to exactly the present
> "entry = *pte".
> 

That's right, yep. There is no ordering requirement, only that
the actual store and load be atomic.

> I've had the impression that Christoph and SGI have to care a great
> deal more about ia64 than the others; and as x86_64 advances, so
> i386 PAE grows less important.  Just so long as a get_64bit there
> isn't a serious degradation from present behaviour, it's okay.
> 

I don't think it was particularly serious for PAE. Probably not
worth holding off until 2.7. We'll see.

> Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock
> similarly racy, in both the 64-on-32 cases, and on architectures
> which have a more complex pmd_t (sparc, m68k, h8300)?  Sigh.
> 

Can't comment on a specific architecture... some may have problems.
I think i386 prepopulates pmds, so it is no problem; but generally:

I think you can get away with it if you write the "unimportant"
word(s) first, do a wmb(), then write the word containing the
present bit. I guess this has to be done this way otherwise the
hardware walker will blow up...

Of course, the hardware walker would be doing either atomic or
correctly ordered reads, while a plain dereference doesn't
guarantee anything.

I'm not sure of the history behind the code, but I would be in
favour of making _all_ pagetable access go through accessor
functions, even if nobody quite needs them yet.
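
As a concrete example of the ordering above, the i386 PAE set_pte already
follows this pattern (sketched from memory, so treat it as illustrative
rather than a quote of the source):

	static inline void set_pte(pte_t *ptep, pte_t pte)
	{
		ptep->pte_high = pte.pte_high;	/* "unimportant" word first */
		smp_wmb();			/* order against any walker */
		ptep->pte_low = pte.pte_low;	/* the present bit lives here */
	}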


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-12  7:54               ` Nick Piggin
@ 2004-12-12  9:33                 ` Hugh Dickins
  2004-12-12  9:48                   ` Nick Piggin
  2004-12-12 21:24                   ` William Lee Irwin III
  0 siblings, 2 replies; 55+ messages in thread
From: Hugh Dickins @ 2004-12-12  9:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton,
	Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel

On Sun, 12 Dec 2004, Nick Piggin wrote:
> Christoph Lameter wrote:
> > On Thu, 9 Dec 2004, Hugh Dickins wrote:
> 
> >>probably others (harder to think through).  Your 4/7 patch for i386 has
> >>an unused atomic get_64bit function from Nick, I think you'll have to
> >>define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
> > 
> > That would be a performance issue.
> 
> Problems were pretty trivial to reproduce here with non-atomic 64-bit
> loads being cut in half by atomic 64-bit stores. I don't see a way
> around them, unfortunately.

Of course, it'll only be a performance issue in the 64-on-32 cases:
the 64-on-64 and 32-on-32 macro should reduce to exactly the present
"entry = *pte".

I've had the impression that Christoph and SGI have to care a great
deal more about ia64 than the others; and as x86_64 advances, so
i386 PAE grows less important.  Just so long as a get_64bit there
isn't a serious degradation from present behaviour, it's okay.

Oh, hold on, isn't handle_mm_fault's pmd without page_table_lock
similarly racy, in both the 64-on-32 cases, and on architectures
which have a more complex pmd_t (sparc, m68k, h8300)?  Sigh.

Hugh



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10 18:43             ` Christoph Lameter
  2004-12-10 21:43               ` Hugh Dickins
@ 2004-12-12  7:54               ` Nick Piggin
  2004-12-12  9:33                 ` Hugh Dickins
  1 sibling, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-12-12  7:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton,
	Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel

Christoph Lameter wrote:
> Thank you for the thorough review of my patches. Comments below
> 
> On Thu, 9 Dec 2004, Hugh Dickins wrote:
> 
> 
>>Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
>>before) your mailer or whatever is eating trailing whitespace: trivial
>>patch attached to apply before yours, removing that whitespace so yours
>>apply.  But what your patches need to apply to would be 2.6.10-mm.
> 
> 
> I am still mystified as to why this is an issue at all. The patches apply
> just fine to the kernel sources as is. I have patched kernels numerous
> times with this patchset and never ran into any issue. quilt removes trailing
> whitespace from patches when they are generated as far as I can tell.
> 
> Patches will be made against mm after Nick's modifications to the 4 level
> patches are in.
> 

I've been a bit slow with them, sorry.... but there hasn't been a hard
decision to go one way or the other with the 4level patches yet.
Fortunately, it looks like 2.6.10 is having a longish drying out period,
so I should have something before it is released.

I would just sit on them for a while, and submit them to -mm when the
4level patches get merged / ready to merge into 2.6. It shouldn't slow
down the progress of your patch too much - they may have to wait until
after 2.6.11 anyway I'd say (probably depends on the progress of other
changes going in).


>>probably others (harder to think through).  Your 4/7 patch for i386 has
>>an unused atomic get_64bit function from Nick, I think you'll have to
>>define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
> 
> 
> That would be a performance issue.
> 
> 

Problems were pretty trivial to reproduce here with non-atomic 64-bit
loads being cut in half by atomic 64-bit stores. I don't see a way
around them, unfortunately.

The test case is to run with CONFIG_HIGHMEM (you needn't have > 4 GB of
memory in the system, of course), and run 2-4 threads on a dual-CPU
system, doing parallel faulting of the *same* anonymous pages.

What happens is that the load (`entry = *pte`) in handle_pte_fault
gets cut in half, handle_pte_fault drops down to do_swap_page,
and you get an infinite loop trying to read in a non-existent swap
entry, IIRC.
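
A sketch of that reproducer (illustrative; not Christoph's actual
benchmark program): every thread writes the same anonymous pages, so
first-touch faults race on each pte:

	#include <pthread.h>
	#include <sys/mman.h>

	#define SIZE		(64UL << 20)	/* 64MB of anonymous memory */
	#define NTHREADS	4

	static char *mem;

	static void *toucher(void *arg)
	{
		unsigned long i;

		/* all threads fault the *same* pages in parallel */
		for (i = 0; i < SIZE; i += 4096)
			mem[i] = 1;
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];
		int i;

		mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (mem == MAP_FAILED)
			return 1;
		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, toucher, NULL);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);
		return 0;
	}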

>>Hmm, that will only work if you're using atomic set_64bit rather than
>>relying on page_table_lock in the complementary places which matter.
>>Which I believe you are indeed doing in your 3level set_pte.  Shouldn't
>>__set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?
> 
> 
>>But by making every set_pte use set_64bit, you are significantly slowing
>>down many operations which do not need that atomicity.  This is quite
>>visible in the fork/exec/shell results from lmbench on i386 PAE (and is
>>the only interesting difference, for good or bad, that I noticed with
>>your patches in lmbench on 2*HT*P4), which run 5-20% slower.  There are
>>no faults on dst mm (nor on src mm) while copy_page_range is copying,
>>so its set_ptes don't need to be atomic; likewise during zap_pte_range
>>(either mmap_sem is held exclusively, or it's in the final exit_mmap).
>>Probably revert set_pte and set_pte_atomic to what they were, and use
>>set_pte_atomic where it's needed.
> 
> 
> Good suggestions. Will see what I can do, but I will need some assistance;
> my main platform is ia64, and the hardware and opportunities for testing on
> i386 are limited.
> 

I think I (and/or others) should be able to help with i386 if you are
having trouble :)

Nick


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-11  0:57                         ` Andrew Morton
@ 2004-12-11  9:23                           ` Hugh Dickins
  0 siblings, 0 replies; 55+ messages in thread
From: Hugh Dickins @ 2004-12-11  9:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

On Fri, 10 Dec 2004, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> > 
> > My inclination would be simply to remove the mark_page_accessed
> > from do_anonymous_page; but I have no numbers to back that hunch.
> 
> With the current implementation of page_referenced() the
> software-referenced bit doesn't matter anyway, as long as the pte's
> referenced bit got set.  So as long as the thing is on the active list, we
> can simply remove the mark_page_accessed() call.

Yes, you're right.  So we don't need numbers, can just delete that line.

> Except one day the VM might get smarter about pages which are both
> software-referenced and pte-referenced.

And on that day, we'd be making other changes, which might well
involve restoring the mark_page_accessed to do_anonymous_page
and adding it in the similar places which currently lack it.

But for now...

--- 2.6.10-rc3/mm/memory.c	2004-12-05 12:56:12.000000000 +0000
+++ linux/mm/memory.c	2004-12-11 09:18:39.000000000 +0000
@@ -1464,7 +1464,6 @@ do_anonymous_page(struct mm_struct *mm, 
 							 vma->vm_page_prot)),
 				      vma);
 		lru_cache_add_active(page);
-		mark_page_accessed(page);
 		page_add_anon_rmap(page, vma, addr);
 	}
 



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-11  0:44                       ` Hugh Dickins
@ 2004-12-11  0:57                         ` Andrew Morton
  2004-12-11  9:23                           ` Hugh Dickins
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2004-12-11  0:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> On Fri, 10 Dec 2004, Andrew Morton wrote:
> > Hugh Dickins <hugh@veritas.com> wrote:
> > > But why is do_anonymous_page adding anything to lru_cache_add_active,
> > > when its other callers leave it at that?  What's special about the
> > > do_anonymous_page case?
> > 
> > do_swap_page() is effectively doing the same as do_anonymous_page(). 
> > do_wp_page() and do_no_page() appear to be errant.
> 
> Demur.  do_swap_page has to mark_page_accessed because the page from
> the swap cache is already on the LRU, and for who knows how long.

Well.  Some of the time.  If the page was just read from swap, it's known
to be on the active list.

> The others (and count in fs/exec.c's install_arg_page) are dealing
> with a freshly allocated page they are putting onto the active LRU.
> 
> My inclination would be simply to remove the mark_page_accessed
> from do_anonymous_page; but I have no numbers to back that hunch.
> 

With the current implementation of page_referenced() the
software-referenced bit doesn't matter anyway, as long as the pte's
referenced bit got set.  So as long as the thing is on the active list, we
can simply remove the mark_page_accessed() call.

Except one day the VM might get smarter about pages which are both
software-referenced and pte-referenced.


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-11  0:18                     ` Andrew Morton
@ 2004-12-11  0:44                       ` Hugh Dickins
  2004-12-11  0:57                         ` Andrew Morton
  0 siblings, 1 reply; 55+ messages in thread
From: Hugh Dickins @ 2004-12-11  0:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

On Fri, 10 Dec 2004, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> > But why is do_anonymous_page adding anything to lru_cache_add_active,
> > when its other callers leave it at that?  What's special about the
> > do_anonymous_page case?
> 
> do_swap_page() is effectively doing the same as do_anonymous_page(). 
> do_wp_page() and do_no_page() appear to be errant.

Demur.  do_swap_page has to mark_page_accessed because the page from
the swap cache is already on the LRU, and for who knows how long.
The others (and count in fs/exec.c's install_arg_page) are dealing
with a freshly allocated page they are putting onto the active LRU.

My inclination would be simply to remove the mark_page_accessed
from do_anonymous_page; but I have no numbers to back that hunch.

Hugh



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10 23:52                   ` Hugh Dickins
@ 2004-12-11  0:18                     ` Andrew Morton
  2004-12-11  0:44                       ` Hugh Dickins
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2004-12-11  0:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> On Fri, 10 Dec 2004, Andrew Morton wrote:
> > Hugh Dickins <hugh@veritas.com> wrote:
> > >
> > > > > (I do wonder why do_anonymous_page calls mark_page_accessed as well as
> > > > > lru_cache_add_active.  The other instances of lru_cache_add_active for
> > > > > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
> > > > > why here?  But that's nothing new with your patch, and although you've
> > > > > reordered the calls, the final page state is the same as before.)
> > 
> > The point is a good one - I guess that code is a holdover from earlier
> > implementations.
> > 
> > This is equivalent, no?
> 
> Yes, it is equivalent to use SetPageReferenced(page) there instead.
> But why is do_anonymous_page adding anything to lru_cache_add_active,
> when its other callers leave it at that?  What's special about the
> do_anonymous_page case?

do_swap_page() is effectively doing the same as do_anonymous_page(). 
do_wp_page() and do_no_page() appear to be errant.


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10 22:12                 ` Andrew Morton
@ 2004-12-10 23:52                   ` Hugh Dickins
  2004-12-11  0:18                     ` Andrew Morton
  0 siblings, 1 reply; 55+ messages in thread
From: Hugh Dickins @ 2004-12-10 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

On Fri, 10 Dec 2004, Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> >
> > > > (I do wonder why do_anonymous_page calls mark_page_accessed as well as
> > > > lru_cache_add_active.  The other instances of lru_cache_add_active for
> > > > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
> > > > why here?  But that's nothing new with your patch, and although you've
> > > > reordered the calls, the final page state is the same as before.)
> 
> The point is a good one - I guess that code is a holdover from earlier
> implementations.
> 
> This is equivalent, no?

Yes, it is equivalent to use SetPageReferenced(page) there instead.
But why is do_anonymous_page adding anything to lru_cache_add_active,
when its other callers leave it at that?  What's special about the
do_anonymous_page case?

Hugh



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10 21:43               ` Hugh Dickins
@ 2004-12-10 22:12                 ` Andrew Morton
  2004-12-10 23:52                   ` Hugh Dickins
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2004-12-10 22:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: clameter, torvalds, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> > > (I do wonder why do_anonymous_page calls mark_page_accessed as well as
> > > lru_cache_add_active.  The other instances of lru_cache_add_active for
> > > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
> > > why here?  But that's nothing new with your patch, and although you've
> > > reordered the calls, the final page state is the same as before.)
> > 
> > The mark_page_accessed is likely there to avoid a future fault just to set
> > the accessed bit.
> 
> No, mark_page_accessed is an operation on the struct page
> (and the accessed bit of the pte is preset too anyway).

The point is a good one - I guess that code is a holdover from earlier
implementations.

This is equivalent, no?

--- 25/mm/memory.c~do_anonymous_page-use-setpagereferenced	Fri Dec 10 14:11:32 2004
+++ 25-akpm/mm/memory.c	Fri Dec 10 14:11:42 2004
@@ -1464,7 +1464,7 @@ do_anonymous_page(struct mm_struct *mm, 
 							 vma->vm_page_prot)),
 				      vma);
 		lru_cache_add_active(page);
-		mark_page_accessed(page);
+		SetPageReferenced(page);
 		page_add_anon_rmap(page, vma, addr);
 	}
 
_



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10 18:43             ` Christoph Lameter
@ 2004-12-10 21:43               ` Hugh Dickins
  2004-12-10 22:12                 ` Andrew Morton
  2004-12-12  7:54               ` Nick Piggin
  1 sibling, 1 reply; 55+ messages in thread
From: Hugh Dickins @ 2004-12-10 21:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt,
	Nick Piggin, linux-mm, linux-ia64, linux-kernel

On Fri, 10 Dec 2004, Christoph Lameter wrote:
> On Thu, 9 Dec 2004, Hugh Dickins wrote:
> 
> > Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
> > before) your mailer or whatever is eating trailing whitespace: trivial
> > patch attached to apply before yours, removing that whitespace so yours
> > apply.  But what your patches need to apply to would be 2.6.10-mm.
> 
> I am still mystified as to why this is an issue at all. The patches apply
> just fine to the kernel sources as is. I have patched kernels numerous
> times with this patchset and never ran into any issue. quilt removes trailing
> whitespace from patches when they are generated as far as I can tell.

Perhaps you've only tried applying your original patches, not the ones
as received through the mail.  It discourages people from trying them
when "patch -p1" fails with rejects, however trivial.  Or am I alone
in seeing this?  I never had such a problem with other patches before.

> > Your scalability figures show a superb improvement.  But they are (I
> > presume) for the best case: intense initial faulting of distinct areas
> > of anonymous memory by parallel cpus running a multithreaded process.
> > This is not a common case: how much do real-world apps benefit?
> 
> This is common during the startup of distributed applications on our large
> machines. They seem to freeze for minutes on bootup. I am not sure how
> much real-world apps benefit. The numbers show that the benefit would
> mostly be for SMP applications. UP has only very minor improvements.

How much do your patches speed the startup of these applications?
Can you name them?

> I have worked with a couple of arches and received feedback that was
> integrated. I certainly welcome more feedback. A vague idea if there is
> more trouble on that front: One could take the ptl in the cmpxchg
emulation and then unlock in update_mmu_cache.

Or move the update_mmu_cache into the ptep_cmpxchg emulation perhaps.
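
Either variant amounts to roughly the following sketch (illustrative
only, not code from the patchset): the emulated ptep_cmpxchg holds the
ptl across both the pte store and update_mmu_cache, so the two cannot
be separated on architectures that care:

	static inline int ptep_cmpxchg(struct vm_area_struct *vma,
				       unsigned long addr, pte_t *ptep,
				       pte_t oldval, pte_t newval)
	{
		struct mm_struct *mm = vma->vm_mm;
		int ret = 0;

		spin_lock(&mm->page_table_lock);
		if (pte_same(*ptep, oldval)) {
			set_pte(ptep, newval);
			/* still under the ptl, per the suggestion above */
			update_mmu_cache(vma, addr, newval);
			ret = 1;
		}
		spin_unlock(&mm->page_table_lock);
		return ret;
	}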

> > (I do wonder why do_anonymous_page calls mark_page_accessed as well as
> > lru_cache_add_active.  The other instances of lru_cache_add_active for
> > an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
> > why here?  But that's nothing new with your patch, and although you've
> > reordered the calls, the final page state is the same as before.)
> 
> The mark_page_accessed is likely there to avoid a future fault just to set
> the accessed bit.

No, mark_page_accessed is an operation on the struct page
(and the accessed bit of the pte is preset too anyway).

> > Where handle_pte_fault does "entry = *pte" without page_table_lock:
> > you're quite right to pass down precisely that entry to the fault
> > handlers below, but there's still a problem on the 32bit architectures
> > supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
> > of entry may be out of synch.  Not a problem for do_anonymous_page, or
> > anything else relying on ptep_cmpxchg to check; but a problem for
> > do_wp_page (which could find !pfn_valid and kill the process) and
> > probably others (harder to think through).  Your 4/7 patch for i386 has
> > an unused atomic get_64bit function from Nick, I think you'll have to
> > define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
> 
> That would be a performance issue.

Sadly, yes, but correctness must take precedence over performance.
It may be possible to avoid it in most cases, doing the atomic
later when in doubt: but would need careful thought.

> Good suggestions. Will see what I can do, but I will need some assistance;
> my main platform is ia64, and the hardware and opportunities for testing on
> i386 are limited.

There are plenty of us who can be trying i386.  It's the other arches worrying me.

Hugh



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-09 18:37           ` Hugh Dickins
  2004-12-10  4:26             ` Nick Piggin
@ 2004-12-10 18:43             ` Christoph Lameter
  2004-12-10 21:43               ` Hugh Dickins
  2004-12-12  7:54               ` Nick Piggin
  1 sibling, 2 replies; 55+ messages in thread
From: Christoph Lameter @ 2004-12-10 18:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt,
	Nick Piggin, linux-mm, linux-ia64, linux-kernel

Thank you for the thorough review of my patches. Comments below

On Thu, 9 Dec 2004, Hugh Dickins wrote:

> Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
> before) your mailer or whatever is eating trailing whitespace: trivial
> patch attached to apply before yours, removing that whitespace so yours
> apply.  But what your patches need to apply to would be 2.6.10-mm.

I am still mystified as to why this is an issue at all. The patches apply
just fine to the kernel sources as is. I have patched kernels numerous
times with this patchset and never ran into any issue. quilt removes trailing
whitespace from patches when they are generated as far as I can tell.

Patches will be made against mm after Nick's modifications to the 4 level
patches are in.

> Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
> have tested out okay up to 4GB but not above: trivial patch attached.

Thanks for the patch.

> Your scalability figures show a superb improvement.  But they are (I
> presume) for the best case: intense initial faulting of distinct areas
> of anonymous memory by parallel cpus running a multithreaded process.
> This is not a common case: how much do real-world apps benefit?

This is common during the startup of distributed applications on our large
machines. They seem to freeze for minutes on bootup. I am not sure how
much real-world apps benefit. The numbers show that the benefit would
mostly be for SMP applications. UP has only very minor improvements.

> Since you also avoid taking the page_table_lock in handle_pte_fault,
> there should be some scalability benefit to all kinds of page fault:
> do you have any results to show how much (perhaps hard to quantify,
> since even tmpfs file faults introduce other scalability issues)?

I have not done such tests (yet).

> The split rss patch, if it stays, needs some work.  For example,
> task_statm uses "get_shared" to total up rss-anon_rss from the tasks,
> but assumes mm->rss is already accurate.  Scrap the separate get_rss,
> get_anon_rss, get_shared functions: just one get_rss to make a single
> pass through the tasks adding up both rss and anon_rss at the same time.

Next rev will have that.

> Updating current->rss in do_anonymous_page, current->anon_rss in
> page_add_anon_rmap, is not always correct: ptrace's access_process_vm
> uses get_user_pages on another task.  You need check that current->mm ==
> mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss,
> fall back to mm (or vma->vm_mm) in rare case not (taking page_table_lock
> for that).  You'll also need to check !(current->flags & PF_BORROWED_MM),
> to guard against use_mm.  Or... just go back to sloppy rss.

I will look into this issue.

> Moving to the main patch, 1/7, the major issue I see there is the way
> do_anonymous_page does update_mmu_cache after setting the pte, without
> any page_table_lock to bracket them together.  Obviously no problem on
> architectures where update_mmu_cache is a no-op!  But although there's
> been plenty of discussion, particularly with Ben and Nick, I've not
> noticed anything to guarantee that as safe on all architectures.  I do
> think it's fine for you to post your patches before completing hooks in
> all the arches, but isn't this a significant issue which needs to be
> sorted before your patches go into -mm?  You hazily refer to such issues
> in 0/7, but now you need to work with arch maintainers to settle them
> and show the patches.

I have worked with a couple of arches and received feedback that was
integrated. I certainly welcome more feedback. A vague idea if there is
more trouble on that front: One could take the ptl in the cmpxchg
emulation and then unlock in update_mmu_cache.

> A lesser issue with the reordering in do_anonymous_page: don't you need
> to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
> the very slight chance that vmscan will pick the page off the LRU and
> unmap it before you've counted it in, hitting page_remove_rmap's
> BUG_ON(page_mapcount(page) < 0)?

Changed.

> (I do wonder why do_anonymous_page calls mark_page_accessed as well as
> lru_cache_add_active.  The other instances of lru_cache_add_active for
> an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
> why here?  But that's nothing new with your patch, and although you've
> reordered the calls, the final page state is the same as before.)

The mark_page_accessed is likely there to avoid a future fault just to set
the accessed bit.

> Where handle_pte_fault does "entry = *pte" without page_table_lock:
> you're quite right to pass down precisely that entry to the fault
> handlers below, but there's still a problem on the 32bit architectures
> supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
> of entry may be out of synch.  Not a problem for do_anonymous_page, or
> anything else relying on ptep_cmpxchg to check; but a problem for
> do_wp_page (which could find !pfn_valid and kill the process) and
> probably others (harder to think through).  Your 4/7 patch for i386 has
> an unused atomic get_64bit function from Nick, I think you'll have to
> define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.

That would be a performance issue.

> Hmm, that will only work if you're using atomic set_64bit rather than
> relying on page_table_lock in the complementary places which matter.
> Which I believe you are indeed doing in your 3level set_pte.  Shouldn't
> __set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?

> But by making every set_pte use set_64bit, you are significantly slowing
> down many operations which do not need that atomicity.  This is quite
> visible in the fork/exec/shell results from lmbench on i386 PAE (and is
> the only interesting difference, for good or bad, that I noticed with
> your patches in lmbench on 2*HT*P4), which run 5-20% slower.  There are
> no faults on dst mm (nor on src mm) while copy_page_range is copying,
> so its set_ptes don't need to be atomic; likewise during zap_pte_range
> (either mmap_sem is held exclusively, or it's in the final exit_mmap).
> Probably revert set_pte and set_pte_atomic to what they were, and use
> set_pte_atomic where it's needed.

Good suggestions. Will see what I can do, but I will need some assistance;
my main platform is ia64, and the hardware and opportunities for testing on
i386 are limited.

Again thanks for the detailed review.



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10  5:19                   ` Nick Piggin
@ 2004-12-10 12:30                     ` Hugh Dickins
  0 siblings, 0 replies; 55+ messages in thread
From: Hugh Dickins @ 2004-12-10 12:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Benjamin Herrenschmidt, Christoph Lameter, Linus Torvalds,
	Andrew Morton, linux-mm, linux-ia64, Linux Kernel list

On Fri, 10 Dec 2004, Nick Piggin wrote:
> Benjamin Herrenschmidt wrote:
> > On Fri, 2004-12-10 at 15:54 +1100, Nick Piggin wrote:
> >>
> >>The page-freed-before-update_mmu_cache issue can be solved in that way,
> >>not the set_pte and update_mmu_cache not performed under the same ptl
> >>section issue that you raised.
> > 
> > What is the problem with update_mmu_cache ? It doesn't need to be done
> > in the same lock section since it's approx. equivalent to a HW fault,
> > which doesn't take the ptl...
> 
> I don't think a problem has been observed, I think Hugh was just raising
> it as a general issue.

That's right, I know little of the arches on which update_mmu_cache does
something, so cannot say that separation is a problem.  And I did see mail
from Ben a month ago in which he arrived at the conclusion that it's not a
problem - but assumed he was speaking for ppc and ppc64.  (He was also
writing in the context of your patches rather than Christoph's.)

Perhaps Ben has in mind a logical argument that if update_mmu_cache does
just what its name implies, then doing it under a separate acquisition
of page_table_lock cannot introduce incorrectness on any architecture.
Maybe, but I'd still rather we heard that from an expert in each of the
affected architectures.

As it stands in Christoph's patches, update_mmu_cache is sometimes
called inside page_table_lock and sometimes outside: I'd be surprised
if that doesn't require adjustment for some architecture.

Your idea to raise do_anonymous_page's update_mmu_cache before the
lru_cache_add_active sounds just right; perhaps it should then even be
subsumed into the architectural ptep_cmpxchg.  But once we get this far,
I do wonder again whether it's right to be changing the rules in
do_anonymous_page alone (Christoph's patches) rather than all the
other faults together (your patches).

But there's no doubt that the do_anonymous_page case is easier,
or more obviously easy, to deal with - it helps a lot to know
that the page cannot yet be exposed to vmscan.c and rmap.c.
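
Put together, the reordered tail of do_anonymous_page would look roughly
like this (a sketch of the ordering under discussion, not a merged patch;
the orig_entry/entry naming is illustrative):

	/* pte is installed first, via ptep_cmpxchg in these patches */
	if (ptep_cmpxchg(vma, addr, ptep, orig_entry, entry)) {
		page_add_anon_rmap(page, vma, addr);	/* count it in rmap */
		update_mmu_cache(vma, addr, entry);	/* prime the MMU cache */
		lru_cache_add_active(page);	/* last: only now may vmscan see it */
	}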

Hugh



* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10  5:06                 ` Benjamin Herrenschmidt
@ 2004-12-10  5:19                   ` Nick Piggin
  2004-12-10 12:30                     ` Hugh Dickins
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-12-10  5:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Hugh Dickins, Christoph Lameter, Linus Torvalds, Andrew Morton,
	linux-mm, linux-ia64, Linux Kernel list

Benjamin Herrenschmidt wrote:
> On Fri, 2004-12-10 at 15:54 +1100, Nick Piggin wrote:
> 
>>Nick Piggin wrote:
>>
>>The page-freed-before-update_mmu_cache issue can be solved in that way,
>>not the set_pte and update_mmu_cache not performed under the same ptl
>>section issue that you raised.
> 
> 
> What is the problem with update_mmu_cache ? It doesn't need to be done
> in the same lock section since it's approx. equivalent to a HW fault,
> which doesn't take the ptl...
> 

I don't think a problem has been observed, I think Hugh was just raising
it as a general issue.


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10  4:54               ` Nick Piggin
@ 2004-12-10  5:06                 ` Benjamin Herrenschmidt
  2004-12-10  5:19                   ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Benjamin Herrenschmidt @ 2004-12-10  5:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Christoph Lameter, Linus Torvalds, Andrew Morton,
	linux-mm, linux-ia64, Linux Kernel list

On Fri, 2004-12-10 at 15:54 +1100, Nick Piggin wrote:
> Nick Piggin wrote:
> 
> > Yep, the update_mmu_cache issue is real. There is a parallel problem:
> > update_mmu_cache can be called on a pte whose page has since
> > been evicted and reused. Again, that looks safe on IA64, but maybe
> > not on other architectures.
> > 
> > It can be solved by moving lru_cache_add to after update_mmu_cache in
> > all cases but the "update accessed bit" type fault. I solved that by
> > simply defining that out for architectures that don't need it - a raced
> > fault will simply get repeated if need be.
> > 
> 
> The page-freed-before-update_mmu_cache issue can be solved in that way,
> not the set_pte and update_mmu_cache not performed under the same ptl
> section issue that you raised.

What is the problem with update_mmu_cache ? It doesn't need to be done
in the same lock section since it's approx. equivalent to a HW fault,
which doesn't take the ptl...

Ben.




* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-10  4:26             ` Nick Piggin
@ 2004-12-10  4:54               ` Nick Piggin
  2004-12-10  5:06                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-12-10  4:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Christoph Lameter, Linus Torvalds, Andrew Morton,
	Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel

Nick Piggin wrote:

> Yep, the update_mmu_cache issue is real. There is a parallel problem:
> update_mmu_cache can be called on a pte whose page has since
> been evicted and reused. Again, that looks safe on IA64, but maybe
> not on other architectures.
> 
> It can be solved by moving lru_cache_add to after update_mmu_cache in
> all cases but the "update accessed bit" type fault. I solved that by
> simply defining that out for architectures that don't need it - a raced
> fault will simply get repeated if need be.
> 

The page-freed-before-update_mmu_cache issue can be solved in that way,
not the set_pte and update_mmu_cache not performed under the same ptl
section issue that you raised.


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-09 17:03             ` Christoph Lameter
@ 2004-12-10  4:30               ` Nick Piggin
  0 siblings, 0 replies; 55+ messages in thread
From: Nick Piggin @ 2004-12-10  4:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt,
	linux-mm, linux-ia64, linux-kernel

Christoph Lameter wrote:
> On Thu, 9 Dec 2004, Nick Piggin wrote:
> 
> 
>>>For more than 8 cpus the page fault rate increases by orders
>>>of magnitude. For more than 64 cpus the improvement in performance
>>>is 10 times better.
>>
>>Those numbers are pretty impressive. I thought you'd said with earlier
>>patches that performance was about doubled from 8 to 512 CPUS. Did I
>>remember correctly? If so, where is the improvement coming from? The
>>per-thread RSS I guess?
> 
> 
> Right. The per-thread RSS seems to have made a big difference for high CPU
counts. Also I was conservative in the estimates in the earlier post since I
> did not have the numbers for the very high cpu counts.
> 

Ah OK.

> 
>>On another note, these patches are basically only helpful to new
>>anonymous page faults. I guess this is the main thing you are concerned
>>about at the moment, but I wonder if you would see improvements with
>>my patch to remove the ptl from the other types of faults as well?
> 
> 
> I can try that but I am frankly a bit sceptical, since the ptl protects
> many other variables. It may be more efficient to have the ptl in these
> cases than doing the atomic ops all over the place. Do you have any numbers
> you could post? I believe I sent you a copy of the code that I use for
> performance tests last week or so.
> 

Yep I have your test program. No real numbers because the biggest thing
I have to test on is a 4-way - there is improvement, but it is not so
impressive as your 512 way tests! :)

> 
>>The downside of my patch - well the main downsides - compared to yours
>>are its intrusiveness, and the extra cost involved in copy_page_range
>>which yours appears not to require.
> 
> 
> Is the patch known to be okay for ia64? I can try to see how it
> does.
> 

I think it just needs one small fix to the swapping code, and it should
be pretty stable. So in fact it would probably work for you as is (if you
don't swap), but I'd rather have something more stable before I ask you
to test. I'll try to find time to do that in the next few days.

> 
>>As I've said earlier though, I wouldn't mind your patches going in. At
>>least they should probably get into -mm soon, when Andrew has time (and
>>after the 4level patches are sorted out). That wouldn't stop my patch
>>(possibly) being merged some time after that if and when it was found
>>worthy...
> 
> 
> I'd certainly be willing to poke around and see how beneficial this is. If
> it turns out to accellerate other functionality of the vm then you
> have my full support.
> 

Great, thanks.


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-09 18:37           ` Hugh Dickins
@ 2004-12-10  4:26             ` Nick Piggin
  2004-12-10  4:54               ` Nick Piggin
  2004-12-10 18:43             ` Christoph Lameter
  1 sibling, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-12-10  4:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton,
	Benjamin Herrenschmidt, linux-mm, linux-ia64, linux-kernel

Hugh Dickins wrote:
> On Wed, 1 Dec 2004, Christoph Lameter wrote:
> 
>>Changes from V11->V12 of this patch:
>>- dump sloppy_rss in favor of list_rss (Linus' proposal)
>>- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
>>
>>This is a series of patches that increases the scalability of
>>the page fault handler for SMP. Here are some performance results
>>on a machine with 512 processors allocating 32 GB with an increasing
>>number of threads (that are assigned a processor each).
> 
> 
> Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
> before) your mailer or whatever is eating trailing whitespace: trivial
> patch attached to apply before yours, removing that whitespace so yours
> apply.  But what your patches need to apply to would be 2.6.10-mm.
> 
> Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
> have tested out okay up to 4GB but not above: trivial patch attached.
> 

That looks obviously correct. Probably the reason why Martin was
getting crashes.

[snip]

> Moving to the main patch, 1/7, the major issue I see there is the way
> do_anonymous_page does update_mmu_cache after setting the pte, without
> any page_table_lock to bracket them together.  Obviously no problem on
> architectures where update_mmu_cache is a no-op!  But although there's
> been plenty of discussion, particularly with Ben and Nick, I've not
> noticed anything to guarantee that as safe on all architectures.  I do
> think it's fine for you to post your patches before completing hooks in
> all the arches, but isn't this a significant issue which needs to be
> sorted before your patches go into -mm?  You hazily refer to such issues
> in 0/7, but now you need to work with arch maintainers to settle them
> and show the patches.
> 

Yep, the update_mmu_cache issue is real. There is a parallel problem:
update_mmu_cache can be called on a pte whose page has since
been evicted and reused. Again, that looks safe on IA64, but maybe
not on other architectures.

It can be solved by moving lru_cache_add to after update_mmu_cache in
all cases but the "update accessed bit" type fault. I solved that by
simply defining that out for architectures that don't need it - a raced
fault will simply get repeated if need be.

> A lesser issue with the reordering in do_anonymous_page: don't you need
> to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
> the very slight chance that vmscan will pick the page off the LRU and
> unmap it before you've counted it in, hitting page_remove_rmap's
> BUG_ON(page_mapcount(page) < 0)?
> 

That's what I had been doing too. Seems to be the right way to go.

> (I do wonder why do_anonymous_page calls mark_page_accessed as well as
> lru_cache_add_active.  The other instances of lru_cache_add_active for
> an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
> why here?  But that's nothing new with your patch, and although you've
> reordered the calls, the final page state is the same as before.)
> 
> Where handle_pte_fault does "entry = *pte" without page_table_lock:
> you're quite right to pass down precisely that entry to the fault
> handlers below, but there's still a problem on the 32bit architectures
> supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
> of entry may be out of synch.  Not a problem for do_anonymous_page, or
> anything else relying on ptep_cmpxchg to check; but a problem for
> do_wp_page (which could find !pfn_valid and kill the process) and
> probably others (harder to think through).  Your 4/7 patch for i386 has
> an unused atomic get_64bit function from Nick, I think you'll have to
> define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
> 

Indeed. This was a real problem for my patch, definitely.

> Hmm, that will only work if you're using atomic set_64bit rather than
> relying on page_table_lock in the complementary places which matter.
> Which I believe you are indeed doing in your 3level set_pte.  Shouldn't
> __set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?
> 

That's what I was wondering. It could be that the actual 64-bit store is
still atomic without the lock prefix (just not the entire rmw), which I
think would be sufficient.

In that case, get_64bit may be able to drop the lock prefix as well.


* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-01 23:41         ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
  2004-12-02  0:10           ` Linus Torvalds
  2004-12-09  8:00           ` Nick Piggin
@ 2004-12-09 18:37           ` Hugh Dickins
  2004-12-10  4:26             ` Nick Piggin
  2004-12-10 18:43             ` Christoph Lameter
  2 siblings, 2 replies; 55+ messages in thread
From: Hugh Dickins @ 2004-12-09 18:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Andrew Morton, Benjamin Herrenschmidt,
	Nick Piggin, linux-mm, linux-ia64, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6655 bytes --]

On Wed, 1 Dec 2004, Christoph Lameter wrote:
> 
> Changes from V11->V12 of this patch:
> - dump sloppy_rss in favor of list_rss (Linus' proposal)
> - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
> 
> This is a series of patches that increases the scalability of
> the page fault handler for SMP. Here are some performance results
> on a machine with 512 processors allocating 32 GB with an increasing
> number of threads (that are assigned a processor each).

Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
before) your mailer or whatever is eating trailing whitespace: trivial
patch attached to apply before yours, removing that whitespace so yours
apply.  But what your patches need to apply to would be 2.6.10-mm.

Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
have tested out okay up to 4GB but not above: trivial patch attached.

Your scalability figures show a superb improvement.  But they are (I
presume) for the best case: intense initial faulting of distinct areas
of anonymous memory by parallel cpus running a multithreaded process.
This is not a common case: how much do real-world apps benefit?

Since you also avoid taking the page_table_lock in handle_pte_fault,
there should be some scalability benefit to all kinds of page fault:
do you have any results to show how much (perhaps hard to quantify,
since even tmpfs file faults introduce other scalability issues)?

How do the scalability figures compare if you omit patch 7/7 i.e. revert
the per-task rss complications you added in for Linus?  I remain a fan
of sloppy rss, which you earlier showed to be accurate enough (I'd say),
though I guess should be checked on other architectures than your ia64.
I can't see the point of all that added ugliness for numbers which don't
need to be precise - but perhaps there's no way of rearranging fields,
and the point at which mm->(anon_)rss is updated (near up of mmap_sem?),
to avoid destructive cacheline bounce.  What I'm asking is, do you have
numbers to support 7/7?  Perhaps it's the fact you showed up to 512 cpus
this time, but only up to 32 with sloppy rss?  The ratios do look better
with the latest, but the numbers are altogether lower so we don't know.

The split rss patch, if it stays, needs some work.  For example,
task_statm uses "get_shared" to total up rss-anon_rss from the tasks,
but assumes mm->rss is already accurate.  Scrap the separate get_rss,
get_anon_rss, get_shared functions: just one get_rss to make a single
pass through the tasks adding up both rss and anon_rss at the same time.
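
A sketch of that single-pass get_rss (this assumes 7/7 keeps the tasks
sharing an mm on a per-mm list; mm->task_list and tsk->rss_list are
illustrative names, not necessarily the patch's own):

	static void get_rss(struct mm_struct *mm, unsigned long *rss,
			    unsigned long *anon_rss)
	{
		struct task_struct *tsk;

		*rss = mm->rss;
		*anon_rss = mm->anon_rss;
		read_lock(&tasklist_lock);
		list_for_each_entry(tsk, &mm->task_list, rss_list) {
			*rss += tsk->rss;
			*anon_rss += tsk->anon_rss;
		}
		read_unlock(&tasklist_lock);
	}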

I am bothered that every read of /proc/<pid>/status or /proc/<pid>/statm
is going to reread through all of that task_list each time; yet in that
massively parallel case that concerns you, there should be little change
to rss after startup.  Perhaps a later optimization would be to avoid
task_list completely for singly threaded processes.  I'd like get_rss to
update mm->rss and mm->anon_rss and flag it uptodate to avoid subsequent
task_list iterations, but the locking might defeat your whole purpose.

Updating current->rss in do_anonymous_page, current->anon_rss in
page_add_anon_rmap, is not always correct: ptrace's access_process_vm
uses get_user_pages on another task.  You need check that current->mm ==
mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss,
fall back to mm (or vma->vm_mm) in rare case not (taking page_table_lock
for that).  You'll also need to check !(current->flags & PF_BORROWED_MM),
to guard against use_mm.  Or... just go back to sloppy rss.
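
The check described above would look something like this sketch (inc_rss
is a made-up helper name, shown for the rss counter only):

	static inline void inc_rss(struct mm_struct *mm)
	{
		/* fast path: a fault against our own mm */
		if (current->mm == mm && !(current->flags & PF_BORROWED_MM)) {
			current->rss++;
		} else {
			/* get_user_pages() on another task, or use_mm() */
			spin_lock(&mm->page_table_lock);
			mm->rss++;
			spin_unlock(&mm->page_table_lock);
		}
	}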

Moving to the main patch, 1/7, the major issue I see there is the way
do_anonymous_page does update_mmu_cache after setting the pte, without
any page_table_lock to bracket them together.  Obviously no problem on
architectures where update_mmu_cache is a no-op!  But although there's
been plenty of discussion, particularly with Ben and Nick, I've not
noticed anything to guarantee that as safe on all architectures.  I do
think it's fine for you to post your patches before completing hooks in
all the arches, but isn't this a significant issue which needs to be
sorted before your patches go into -mm?  You hazily refer to such issues
in 0/7, but now you need to work with arch maintainers to settle them
and show the patches.

A lesser issue with the reordering in do_anonymous_page: don't you need
to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
the very slight chance that vmscan will pick the page off the LRU and
unmap it before you've counted it in, hitting page_remove_rmap's
BUG_ON(page_mapcount(page) < 0)?

(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active.  The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here?  But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)

Where handle_pte_fault does "entry = *pte" without page_table_lock:
you're quite right to pass down precisely that entry to the fault
handlers below, but there's still a problem on the 32bit architectures
supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
of entry may be out of synch.  Not a problem for do_anonymous_page, or
anything else relying on ptep_cmpxchg to check; but a problem for
do_wp_page (which could find !pfn_valid and kill the process) and
probably others (harder to think through).  Your 4/7 patch for i386 has
an unused atomic get_64bit function from Nick, I think you'll have to
define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.

Hmm, that will only work if you're using atomic set_64bit rather than
relying on page_table_lock in the complementary places which matter.
Which I believe you are indeed doing in your 3level set_pte.  Shouldn't
__set_64bit be using LOCK_PREFIX like __get_64bit, instead of a
hardcoded lock prefix?

But by making every set_pte use set_64bit, you are significantly slowing
down many operations which do not need that atomicity.  This is quite
visible in the fork/exec/shell results from lmbench on i386 PAE (and is
the only interesting difference, for good or bad, that I noticed with
your patches in lmbench on 2*HT*P4), which run 5-20% slower.  There are
no faults on dst mm (nor on src mm) while copy_page_range is copying,
so its set_ptes don't need to be atomic; likewise during zap_pte_range
(either mmap_sem is held exclusively, or it's in the final exit_mmap).
Probably revert set_pte and set_pte_atomic to what they were, and use
set_pte_atomic where it's needed.
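
That is, keep the cheap non-atomic write for the no-fault paths and
reserve set_64bit for the racy ones, roughly as the unpatched
pgtable-3level.h had it (sketch):

/* safe where no fault can observe a half-written pte: write the
 * high word first so the low (present) word appears last */
static inline void set_pte(pte_t *ptep, pte_t pte)
{
	ptep->pte_high = pte.pte_high;
	smp_wmb();
	ptep->pte_low = pte.pte_low;
}

/* for paths racing against lockless fault handlers */
#define set_pte_atomic(ptep, pte) \
	set_64bit((unsigned long long *)(ptep), pte_val(pte))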

Hugh

[-- Attachment #2: Remove trailing whitespace before C.L. patches --]
[-- Type: TEXT/PLAIN, Size: 1736 bytes --]

--- 2.6.10-rc3/include/asm-i386/system.h	2004-11-15 16:21:12.000000000 +0000
+++ linux/include/asm-i386/system.h	2004-11-22 14:44:30.761904592 +0000
@@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo
 #define cmpxchg(ptr,o,n)\
 	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
 					(unsigned long)(n),sizeof(*(ptr))))
-    
+
 #ifdef __KERNEL__
-struct alt_instr { 
+struct alt_instr {
 	__u8 *instr; 		/* original instruction */
 	__u8 *replacement;
 	__u8  cpuid;		/* cpuid bit set for replacement */
--- 2.6.10-rc3/include/asm-s390/pgalloc.h	2004-05-10 03:33:39.000000000 +0100
+++ linux/include/asm-s390/pgalloc.h	2004-11-22 14:54:43.704723120 +0000
@@ -99,7 +99,7 @@ static inline void pgd_populate(struct m
 
 #endif /* __s390x__ */
 
-static inline void 
+static inline void
 pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
 {
 #ifndef __s390x__
--- 2.6.10-rc3/mm/memory.c	2004-11-18 17:56:11.000000000 +0000
+++ linux/mm/memory.c	2004-11-22 14:39:33.924030808 +0000
@@ -1424,7 +1424,7 @@ out:
 /*
  * We are called with the MM semaphore and page_table_lock
  * spinlock held to protect against concurrent faults in
- * multithreaded programs. 
+ * multithreaded programs.
  */
 static int
 do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct
 	 * Fall back to the linear mapping if the fs does not support
 	 * ->populate:
 	 */
-	if (!vma->vm_ops || !vma->vm_ops->populate || 
+	if (!vma->vm_ops || !vma->vm_ops->populate ||
 			(write_access && !(vma->vm_flags & VM_SHARED))) {
 		pte_clear(pte);
 		return do_no_page(mm, vma, address, write_access, pte, pmd);


[-- Attachment #3: 3level ptep_cmpxchg use cmpxchg8b --]
[-- Type: TEXT/PLAIN, Size: 570 bytes --]

--- 2.6.10-rc3-cl/include/asm-i386/pgtable-3level.h	2004-12-05 14:01:11.000000000 +0000
+++ linux/include/asm-i386/pgtable-3level.h	2004-12-09 13:17:44.000000000 +0000
@@ -147,7 +147,7 @@ static inline pmd_t pfn_pmd(unsigned lon
 
 static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
 {
-	return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+	return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
 }
 
 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-09  8:00           ` Nick Piggin
@ 2004-12-09 17:03             ` Christoph Lameter
  2004-12-10  4:30               ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Lameter @ 2004-12-09 17:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt,
	linux-mm, linux-ia64, linux-kernel

On Thu, 9 Dec 2004, Nick Piggin wrote:

> > For more than 8 cpus the page fault rate increases by orders
> > of magnitude. For more than 64 cpus the improvement in performance
> > is 10 times better.
>
> Those numbers are pretty impressive. I thought you'd said with earlier
> patches that performance was about doubled from 8 to 512 CPUS. Did I
> remember correctly? If so, where is the improvement coming from? The
> per-thread RSS I guess?

Right. The per-thread RSS seems to have made a big difference for high CPU
counts. Also I was conservative in the estimates in the earlier post since I
did not have the numbers for the very high cpu counts.

> On another note, these patches are basically only helpful to new
> anonymous page faults. I guess this is the main thing you are concerned
> about at the moment, but I wonder if you would see improvements with
> my patch to remove the ptl from the other types of faults as well?

I can try that but I am frankly a bit sceptical since the ptl protects
many other variables. It may be more efficient to have the ptl in these
cases than doing the atomic ops all over the place. Do you have any numbers
you could post? I believe I sent you a copy of the code that I use for
performance tests a week or so ago.

> The downside of my patch - well the main downsides - compared to yours
> are its intrusiveness, and the extra cost involved in copy_page_range
> which yours appears not to require.

Is the patch known to be okay for ia64? I can try to see how it
does.

> As I've said earlier though, I wouldn't mind your patches going in. At
> least they should probably get into -mm soon, when Andrew has time (and
> after the 4level patches are sorted out). That wouldn't stop my patch
> (possibly) being merged some time after that if and when it was found
> worthy...

I'd certainly be willing to poke around and see how beneficial this is. If
it turns out to accelerate other functionality of the vm then you
have my full support.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-01 23:41         ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
  2004-12-02  0:10           ` Linus Torvalds
@ 2004-12-09  8:00           ` Nick Piggin
  2004-12-09 17:03             ` Christoph Lameter
  2004-12-09 18:37           ` Hugh Dickins
  2 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2004-12-09  8:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Hugh Dickins, akpm, Benjamin Herrenschmidt,
	linux-mm, linux-ia64, linux-kernel

Christoph Lameter wrote:
> Changes from V11->V12 of this patch:
> - dump sloppy_rss in favor of list_rss (Linus' proposal)
> - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
> 

[snip]

> For more than 8 cpus the page fault rate increases by orders
> of magnitude. For more than 64 cpus the improvement in performance
> is 10 times better.

Those numbers are pretty impressive. I thought you'd said with earlier
patches that performance was about doubled from 8 to 512 CPUS. Did I
remember correctly? If so, where is the improvement coming from? The
per-thread RSS I guess?


On another note, these patches are basically only helpful to new
anonymous page faults. I guess this is the main thing you are concerned
about at the moment, but I wonder if you would see improvements with
my patch to remove the ptl from the other types of faults as well?

The downside of my patch - well the main downsides - compared to yours
are its intrusiveness, and the extra cost involved in copy_page_range
which yours appears not to require.

As I've said earlier though, I wouldn't mind your patches going in. At
least they should probably get into -mm soon, when Andrew has time (and
after the 4level patches are sorted out). That wouldn't stop my patch
(possibly) being merged some time after that if and when it was found
worthy...


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  6:34               ` Andrew Morton
                                   ` (2 preceding siblings ...)
  2004-12-02 18:27                 ` Grant Grundler
@ 2004-12-07 10:51                 ` Pavel Machek
  3 siblings, 0 replies; 55+ messages in thread
From: Pavel Machek @ 2004-12-07 10:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, torvalds, clameter, hugh, benh, nickpiggin,
	linux-mm, linux-ia64, linux-kernel

Hi!

> Or start alternating between stable and flakey releases, so 2.6.11 will be
> a feature release with a 2-month development period and 2.6.12 will be a
> bugfix-only release, with perhaps a 2-week development period, so people
> know that the even-numbered releases are better stabilised.

If you expect "feature 2.6.11", you might as well call it 2.7.0, 
followed by 2.8.0.

								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 18:43                       ` cliff white
@ 2004-12-06 19:33                         ` Marcelo Tosatti
  0 siblings, 0 replies; 55+ messages in thread
From: Marcelo Tosatti @ 2004-12-06 19:33 UTC (permalink / raw)
  To: cliff white
  Cc: Martin J. Bligh, akpm, jgarzik, torvalds, clameter, hugh, benh,
	nickpiggin, linux-mm, linux-ia64, linux-kernel

On Thu, Dec 02, 2004 at 10:43:30AM -0800, cliff white wrote:
> On Wed, 01 Dec 2004 23:26:59 -0800
> "Martin J. Bligh" <mbligh@aracnet.com> wrote:
> 
> > --Andrew Morton <akpm@osdl.org> wrote (on Wednesday, December 01, 2004 23:02:17 -0800):
> > 
> > > Jeff Garzik <jgarzik@pobox.com> wrote:
> > >> 
> > >> Andrew Morton wrote:
> > >> > We need to be achieving higher-quality major releases than we did in
> > >> > 2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
> > >> > stabilisation periods.
> > >> 
> > >> 
> > >> I'm still hoping that distros (like my employer) and orgs like OSDL will 
> > >> step up, and hook 2.6.x BK snapshots into daily test harnesses.
> > > 
> > > I believe that both IBM and OSDL are doing this, or are getting geared up
> > > to do this.  With both Linus bk and -mm.
> > 
> > I already run a bunch of tests on a variety of machines for every new 
> > kernel ... but don't have an automated way to compare the results as yet, 
> > so don't actually look at them much ;-(. Sometime soon (quite possibly over 
> > Christmas) things will calm down enough I'll get a couple of days to write 
> > the appropriate perl script, and start publishing stuff.
> 
> We've had the most success when one person has an itch to scratch, and works
> with us to scratch it. We (OSDL) worked with Sebastien at Bull, and we're very 
> glad he had the time to do such excellent work. We worked with Con Kolivas, likewise.
> 
> We've built tools to automate LTP comparisons (bryce@osdl.org has posted results)
> and reaim; we've been able to post some regressions to lkml, and tied in with developers
> to get bugs fixed. But OSDL has been limited by manpower.
>  
> One of the issues with the performance tests is the amount of data produced - 
>  for example, the deep IO tests produce ton's o'  numbers, but the developer community wants
> a single "+/- 5%" type response-  we need some opinions and help on how to do the data reduction 
> necessary. 

Yep, reaim produces a single "global throughput" result in MB/s, which is wonderful 
for readability.

Now iozone, at the other extreme, produces output for each kind of operation
(read, write, rw, and the sync versions of those) for each client IIRC. tiobench also
has detailed output for each operation.

We ought to reduce all benchmark results to "read", "write" and "global"
((read+write)/2) numbers.

I'm willing to work on the data reduction and graphic generation scripts
for STP results. I think I can do that.

> 
> What would be really kewl is some test/analysis code that could be re-used, so the Martins of the future
> have a good starting place. 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 20:12                     ` Jeff Garzik
  2004-12-02 20:30                       ` Diego Calleja
  2004-12-02 21:08                       ` Wichert Akkerman
@ 2004-12-03  0:07                       ` Francois Romieu
  2 siblings, 0 replies; 55+ messages in thread
From: Francois Romieu @ 2004-12-03  0:07 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Diego Calleja, linux-kernel

Jeff Garzik <jgarzik@pobox.com> :
[...]
> Should be simple for rpm at least, given the "make rpm" target.  I 
> wonder if we have, or could add, a 'make deb' target.

http://www.wiggy.net/files/kerneldeb-1.2.ptc ?

--
Ueimor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 20:12                     ` Jeff Garzik
  2004-12-02 20:30                       ` Diego Calleja
@ 2004-12-02 21:08                       ` Wichert Akkerman
  2004-12-03  0:07                       ` Francois Romieu
  2 siblings, 0 replies; 55+ messages in thread
From: Wichert Akkerman @ 2004-12-02 21:08 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Diego Calleja, linux-kernel

Previously Jeff Garzik wrote:
> Should be simple for rpm at least, given the "make rpm" target.  I 
> wonder if we have, or could add, a 'make deb' target.

make deb-pkg has been there for a while.

Wichert.

-- 
Wichert Akkerman <wichert@wiggy.net>    It is simple to make things.
http://www.wiggy.net/                   It is hard to make things simple.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 20:12                     ` Jeff Garzik
@ 2004-12-02 20:30                       ` Diego Calleja
  2004-12-02 21:08                       ` Wichert Akkerman
  2004-12-03  0:07                       ` Francois Romieu
  2 siblings, 0 replies; 55+ messages in thread
From: Diego Calleja @ 2004-12-02 20:30 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel, sam

On Thu, 02 Dec 2004 15:12:22 -0500, Jeff Garzik <jgarzik@pobox.com>
wrote:

> > Automated .deb's and .rpm's for the -bk snapshots (and yum/apt
> > repositories) would be nice for all those people who run unsupported
> > distros.
> 
> Now, that's a darned good idea...
> 
> Should be simple for rpm at least, given the "make rpm" target.  I 
> wonder if we have, or could add, a 'make deb' target.


There was a patch for that a long time ago, before 2.6 was out, IIRC. I don't
know where it went (CC'ing Sam, who should know ;)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 18:10                         ` cliff white
  2004-12-02 18:17                           ` Gerrit Huizenga
@ 2004-12-02 20:25                           ` linux-os
  1 sibling, 0 replies; 55+ messages in thread
From: linux-os @ 2004-12-02 20:25 UTC (permalink / raw)
  To: cliff white
  Cc: Jeff Garzik, mbligh, akpm, torvalds, clameter, hugh, benh,
	nickpiggin, linux-mm, linux-ia64, linux-kernel

On Thu, 2 Dec 2004, cliff white wrote:

> On Thu, 02 Dec 2004 02:31:35 -0500
> Jeff Garzik <jgarzik@pobox.com> wrote:
>
>> Martin J. Bligh wrote:
>>> Yeah, probably. Though the stress tests catch a lot more than the
>>> functionality ones. The big pain in the ass is drivers, because I don't
>>> have a hope in hell of testing more than 1% of them.
>>
>> My dream is that hardware vendors rotate their current machines through
>> a test shop :)  It would be nice to make sure that the popular drivers
>> get daily test coverage.
>>
>> 	Jeff, dreaming on
>

It isn't going to happen until the time when the vendors
call somebody a liar, try to get them fired, and then
that somebody takes them to court and they lose 100
million dollars or so.

Until that happens, vendors will continue to make junk
and they will continue to lie about the performance of
that junk. It doesn't help that Software Engineering has
become a "hardware junk fixing" job.

Basically many vendors in the PC and PC peripheral
business are, for lack of a better word, liars who
are in the business of perpetrating fraud upon the
unsuspecting PC user.

We have vendors who convincingly change mega-bits
to mega-bytes, improving performance 8-fold without
any expense at all. We have vendors reducing the
size of a kilobyte and a megabyte, then getting
the new lies entered into dictionaries, etc. The
scheme goes on.

In the meantime, if you try to perform DMA
across a PCI/Bus at or near the specified rates,
you will learn that the specifications are
for "this chip" or "that chip", and have nothing
to do with the performance when these chips
get connected together. You will find that real
performance is about 20 percent of the specification.

Occasionally you find a vendor that doesn't lie and
the same chip-set magically performs close to
the published specifications. This is becoming
rare because it costs money to build motherboards
that work. This might require two or more
prototypes to get the timing just right so the
artificial delays and re-clocking, used to make
junk work, isn't required.

Once the PC (and not just the desk-top PC) became
a commodity, everything points to the bottom-line.
You get into the business by making something that
looks and smells new. Then you sell it by writing
specifications that are better than the most
expensive on the market. Your sales-price is
set below average market so you can unload this
junk as rapidly as possible.

Then, you do this over again, claiming that your
equipment is "state-of-the-art"! And if anybody
ever tests the junk and claims that it doesn't
work as specified, you contact the president of
his company and try to kill the messenger.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 19:48                   ` Diego Calleja
@ 2004-12-02 20:12                     ` Jeff Garzik
  2004-12-02 20:30                       ` Diego Calleja
                                         ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Jeff Garzik @ 2004-12-02 20:12 UTC (permalink / raw)
  To: Diego Calleja; +Cc: linux-kernel

Diego Calleja wrote:
On Thu, 02 Dec 2004 01:48:25 -0500, Jeff Garzik <jgarzik@pobox.com>
wrote:
> 
> 
> 
>>I'm still hoping that distros (like my employer) and orgs like OSDL will 
>>step up, and hook 2.6.x BK snapshots into daily test harnesses.
> 
> 
> Automated .deb's and .rpm's for the -bk snapshots (and yum/apt repositories)
> would be nice for all those people who run unsupported distros.

Now, that's a darned good idea...

Should be simple for rpm at least, given the "make rpm" target.  I 
wonder if we have, or could add, a 'make deb' target.

	Jeff




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  6:48                 ` Jeff Garzik
  2004-12-02  7:02                   ` Andrew Morton
@ 2004-12-02 19:48                   ` Diego Calleja
  2004-12-02 20:12                     ` Jeff Garzik
  1 sibling, 1 reply; 55+ messages in thread
From: Diego Calleja @ 2004-12-02 19:48 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

On Thu, 02 Dec 2004 01:48:25 -0500, Jeff Garzik <jgarzik@pobox.com>
wrote:


> I'm still hoping that distros (like my employer) and orgs like OSDL will 
> step up, and hook 2.6.x BK snapshots into daily test harnesses.

Automated .deb's and .rpm's for the -bk snapshots (and yum/apt repositories)
would be nice for all those people who run unsupported distros.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:26                     ` Martin J. Bligh
  2004-12-02  7:31                       ` Jeff Garzik
@ 2004-12-02 18:43                       ` cliff white
  2004-12-06 19:33                         ` Marcelo Tosatti
  1 sibling, 1 reply; 55+ messages in thread
From: cliff white @ 2004-12-02 18:43 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: akpm, jgarzik, torvalds, clameter, hugh, benh, nickpiggin,
	linux-mm, linux-ia64, linux-kernel

On Wed, 01 Dec 2004 23:26:59 -0800
"Martin J. Bligh" <mbligh@aracnet.com> wrote:

> --Andrew Morton <akpm@osdl.org> wrote (on Wednesday, December 01, 2004 23:02:17 -0800):
> 
> > Jeff Garzik <jgarzik@pobox.com> wrote:
> >> 
> >> Andrew Morton wrote:
> >> > We need to be achieving higher-quality major releases than we did in
> >> > 2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
> >> > stabilisation periods.
> >> 
> >> 
> >> I'm still hoping that distros (like my employer) and orgs like OSDL will 
> >> step up, and hook 2.6.x BK snapshots into daily test harnesses.
> > 
> > I believe that both IBM and OSDL are doing this, or are getting geared up
> > to do this.  With both Linus bk and -mm.
> 
> I already run a bunch of tests on a variety of machines for every new 
> kernel ... but don't have an automated way to compare the results as yet, 
> so don't actually look at them much ;-(. Sometime soon (quite possibly over 
> Christmas) things will calm down enough I'll get a couple of days to write 
> the appropriate perl script, and start publishing stuff.

We've had the most success when one person has an itch to scratch, and works
with us to scratch it. We (OSDL) worked with Sebastien at Bull, and we're very 
glad he had the time to do such excellent work. We worked with Con Kolivas, likewise.

We've built tools to automate LTP comparisons (bryce@osdl.org has posted results)
and reaim; we've been able to post some regressions to lkml, and tied in with developers
to get bugs fixed. But OSDL has been limited by manpower.
 
One of the issues with the performance tests is the amount of data produced -
 for example, the deep IO tests produce tons o' numbers, but the developer community wants
a single "+/- 5%" type response - we need some opinions and help on how to do the
data reduction necessary.

What would be really kewl is some test/analysis code that could be re-used, so the Martins of the future
have a good starting place. 
cliffw
OSDL




> 
> > However I have my doubts about how useful it will end up being.  These test
> > suites don't seem to pick up many regressions.  I've challenged Gerrit to
> > go back through a release cycle's bugfixes and work out how many of those
> > bugs would have been detected by the test suite.
> > 
> > My suspicion is that the answer will be "a very small proportion", and that
> > really is the bottom line.
> 
> Yeah, probably. Though the stress tests catch a lot more than the 
> functionality ones. The big pain in the ass is drivers, because I don't
> have a hope in hell of testing more than 1% of them.
> 
> M.
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
> 


-- 
The church is near, but the road is icy.
The bar is far, but i will walk carefully. - Russian proverb

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 18:27                 ` Grant Grundler
  2004-12-02 18:33                   ` Andrew Morton
@ 2004-12-02 18:36                   ` Christoph Hellwig
  1 sibling, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2004-12-02 18:36 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Andrew Morton, Jeff Garzik, torvalds, clameter, hugh, benh,
	nickpiggin, linux-mm, linux-ia64, linux-kernel

On Thu, Dec 02, 2004 at 10:27:16AM -0800, Grant Grundler wrote:
> Also need to think about how well any scheme aligns with what distros
> need to support releases. Like the "Adopt-a-Highway" program in
> California to pick up trash along highways, I'm wondering if distros
> would be willing/interested in adopting a particular release
> and maintaining it in bk.  e.g. SuSE clearly has interest in some sort
> of a 2.6.5.n series for SLES9; ditto for RHEL4 (but for 2.6.9.n).

Unfortunately the SLES9 kernels don't really look anything like 2.6.5
except from the version number.  There's far too much trash from Business
Partners in there.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 18:27                 ` Grant Grundler
@ 2004-12-02 18:33                   ` Andrew Morton
  2004-12-02 18:36                   ` Christoph Hellwig
  1 sibling, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2004-12-02 18:33 UTC (permalink / raw)
  To: Grant Grundler
  Cc: jgarzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm,
	linux-ia64, linux-kernel

Grant Grundler <iod00d@hp.com> wrote:
>
> 2.6.odd/.even release described above is a variant of 2.6.10.n releases
>  where n = {0, 1}. The question is how many parallel releases do people
>  (you and linus) want us to keep "alive" at the same time?

2.6.odd/.even is actually a significantly different process.  a) because
there's only one tree, linearly growing.  That's considerably simpler than
maintaining a branch.  And b) because everyone knows that there won't be a
new development tree opened until we've all knuckled down and fixed the
bugs which we put into the previous one, dammit.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  6:34               ` Andrew Morton
  2004-12-02  6:48                 ` Jeff Garzik
  2004-12-02  7:00                 ` Jeff Garzik
@ 2004-12-02 18:27                 ` Grant Grundler
  2004-12-02 18:33                   ` Andrew Morton
  2004-12-02 18:36                   ` Christoph Hellwig
  2004-12-07 10:51                 ` Pavel Machek
  3 siblings, 2 replies; 55+ messages in thread
From: Grant Grundler @ 2004-12-02 18:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, torvalds, clameter, hugh, benh, nickpiggin,
	linux-mm, linux-ia64, linux-kernel

On Wed, Dec 01, 2004 at 10:34:41PM -0800, Andrew Morton wrote:
> Of course, nobody will test -rc3 and a zillion people will test final
> 2.6.10, which is when we get lots of useful bug reports.  If this keeps on
> happening then we'll need to get more serious about the 2.6.10.n process.
> 
> Or start alternating between stable and flakey releases, so 2.6.11 will be
> a feature release with a 2-month development period and 2.6.12 will be a
> bugfix-only release, with perhaps a 2-week development period, so people
> know that the even-numbered releases are better stabilised.

No matter what scheme you adopt, I (and others) will adapt as well.
When working on a new feature or bug fix, I don't chase -bk releases
since I don't want to find new, unrelated issues that interfere with
the issue I was originally chasing. I roll to a new release when
the issue I care about is "cooked". Anything that takes longer than
a month or so is just hopeless since I fall too far behind.

(e.g. IRQ handling in parisc-linux needs to be completely rewritten
to pick up irq_affinity support - I just don't have enough time to get
it done in < 2 months. We started on this last year and gave up.)

I see "2.6.10.n process" as the right way to handle bug fix only releases.
I'm happy to work on 2.6.10.0 and understand the initial release was a
"best effort".

2.6.odd/.even release described above is a variant of 2.6.10.n releases
where n = {0, 1}. The question is how many parallel releases do people
(you and linus) want us to keep "alive" at the same time?
odd/even implies only one, vs. several if the 2.6.X.n scheme is continued
beyond 2.6.8.1.

Also need to think about how well any scheme aligns with what distros
need to support releases. Like the "Adopt-a-Highway" program in
California to pick up trash along highways, I'm wondering if distros
would be willing/interested in adopting a particular release
and maintaining it in bk.  e.g. SuSE clearly has interest in some sort
of a 2.6.5.n series for SLES9; ditto for RHEL4 (but for 2.6.9.n).
The question of *who* (at the respective distro) would be the release
maintainer is a titanic-sized rathole. But there is a release manager
today at each distro and perhaps it's easier if s/he remains invisible
to us.

hth,
grant

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 18:10                         ` cliff white
@ 2004-12-02 18:17                           ` Gerrit Huizenga
  2004-12-02 20:25                           ` linux-os
  1 sibling, 0 replies; 55+ messages in thread
From: Gerrit Huizenga @ 2004-12-02 18:17 UTC (permalink / raw)
  To: cliff white
  Cc: Jeff Garzik, mbligh, akpm, torvalds, clameter, hugh, benh,
	nickpiggin, linux-mm, linux-ia64, linux-kernel


On Thu, 02 Dec 2004 10:10:29 PST, cliff white wrote:
> On Thu, 02 Dec 2004 02:31:35 -0500
> Jeff Garzik <jgarzik@pobox.com> wrote:
> 
> > Martin J. Bligh wrote:
> > > Yeah, probably. Though the stress tests catch a lot more than the 
> > > functionality ones. The big pain in the ass is drivers, because I don't
> > > have a hope in hell of testing more than 1% of them.
> > 
> > My dream is that hardware vendors rotate their current machines through 
> > a test shop :)  It would be nice to make sure that the popular drivers 
> > get daily test coverage.
> > 
> > 	Jeff, dreaming on
> 
> OSDL has recently re-done the donation policy, and we're much better positioned
> to support that sort of thing now - contact Tom Hanrahan at OSDL if you
> are a vendor, or know a vendor. (Or you can become a vendor.)

Specifically Tom Hanrahan == hanrahat@osdl.org

gerrit

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:31                       ` Jeff Garzik
@ 2004-12-02 18:10                         ` cliff white
  2004-12-02 18:17                           ` Gerrit Huizenga
  2004-12-02 20:25                           ` linux-os
  0 siblings, 2 replies; 55+ messages in thread
From: cliff white @ 2004-12-02 18:10 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: mbligh, akpm, torvalds, clameter, hugh, benh, nickpiggin,
	linux-mm, linux-ia64, linux-kernel

On Thu, 02 Dec 2004 02:31:35 -0500
Jeff Garzik <jgarzik@pobox.com> wrote:

> Martin J. Bligh wrote:
> > Yeah, probably. Though the stress tests catch a lot more than the 
> > functionality ones. The big pain in the ass is drivers, because I don't
> > have a hope in hell of testing more than 1% of them.
> 
> My dream is that hardware vendors rotate their current machines through 
> a test shop :)  It would be nice to make sure that the popular drivers 
> get daily test coverage.
> 
> 	Jeff, dreaming on

OSDL has recently re-done the donation policy, and we're much better positioned
to support that sort of thing now - contact Tom Hanrahan at OSDL if you
are a vendor, or know a vendor. (Or you can become a vendor.)

cliffw

> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
> 


-- 
The church is near, but the road is icy.
The bar is far, but i will walk carefully. - Russian proverb

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:02                   ` Andrew Morton
  2004-12-02  7:26                     ` Martin J. Bligh
  2004-12-02 16:24                     ` Gerrit Huizenga
@ 2004-12-02 17:34                     ` cliff white
  2 siblings, 0 replies; 55+ messages in thread
From: cliff white @ 2004-12-02 17:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: jgarzik, torvalds, clameter, hugh, benh, nickpiggin, linux-mm,
	linux-ia64, linux-kernel

On Wed, 1 Dec 2004 23:02:17 -0800
Andrew Morton <akpm@osdl.org> wrote:

> Jeff Garzik <jgarzik@pobox.com> wrote:
> >
> > Andrew Morton wrote:
> > > We need to be achieving higher-quality major releases than we did in
> > > 2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
> > > stabilisation periods.
> > 
> > 
> > I'm still hoping that distros (like my employer) and orgs like OSDL will 
> > step up, and hook 2.6.x BK snapshots into daily test harnesses.
> 
> I believe that both IBM and OSDL are doing this, or are getting geared up
> to do this.  With both Linus bk and -mm.

Gee, OSDL has been doing this sort of testing for > 1 year now. Getting
bandwidth to look at the results has been a problem. We need more eyeballs
and community support badly; I'm very glad Marcelo has shown recent interest.
> 
> However I have my doubts about how useful it will end up being.  These test
> suites don't seem to pick up many regressions.  I've challenged Gerrit to
> go back through a release cycle's bugfixes and work out how many of those
> bugs would have been detected by the test suite.

> 
> My suspicion is that the answer will be "a very small proportion", and that
> really is the bottom line.
> 
> We simply get far better coverage testing by releasing code, because of all
> the wild, whacky and weird things which people do with their computers. 
> Bless them.
> 
> > Something like John Cherry's reports to lkml on warnings and errors 
> > would be darned useful.  His reports are IMO an ideal model:  show 
> > day-to-day _changes_ in test results.  Don't just dump a huge list of 
> > testsuite results, results which are often clogged with expected 
> > failures and testsuite bug noise.
> > 
> 
> Yes, we need humans between the tests and the developers.  Someone who has
> good experience with the tests and who can say "hey, something changed
> when I do X".  If nothing changed, we don't hear anything.

I would agree, and would do almost anything to help/assist/enable any humans 
interested. We need some expertise on when to run certain tests, to avoid
data overload. 
I've noticed that when developers submit test results with a patch, it sometimes
helps in the decision on patch acceptance. Is there a way to promote this sort of
behaviour?
cliffw
OSDL
> 
> It's a developer role, not a testing role.   All testing is, really.
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
> 


-- 
The church is near, but the road is icy.
The bar is far, but i will walk carefully. - Russian proverb

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:02                   ` Andrew Morton
  2004-12-02  7:26                     ` Martin J. Bligh
@ 2004-12-02 16:24                     ` Gerrit Huizenga
  2004-12-02 17:34                     ` cliff white
  2 siblings, 0 replies; 55+ messages in thread
From: Gerrit Huizenga @ 2004-12-02 16:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, torvalds, clameter, hugh, benh, nickpiggin,
	linux-mm, linux-ia64, linux-kernel


On Wed, 01 Dec 2004 23:02:17 PST, Andrew Morton wrote:
> Jeff Garzik <jgarzik@pobox.com> wrote:
> >
> > Andrew Morton wrote:
> > > We need to be achieving higher-quality major releases than we did in
> > > 2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
> > > stabilisation periods.
> > 
> > 
> > I'm still hoping that distros (like my employer) and orgs like OSDL will 
> > step up, and hook 2.6.x BK snapshots into daily test harnesses.
> 
> I believe that both IBM and OSDL are doing this, or are getting geared up
> to do this.  With both Linus bk and -mm.
> 
> However I have my doubts about how useful it will end up being.  These test
> suites don't seem to pick up many regressions.  I've challenged Gerrit to
> go back through a release cycle's bugfixes and work out how many of those
> bugs would have been detected by the test suite.
> 
> My suspicion is that the answer will be "a very small proportion", and that
> really is the bottom line.
 
Yeah, sort of what Martin said.  LTP, for instance, doesn't find a lot
of what is in our internal bugzilla or the bugme database.  Automated
testing tends not to cover the full range of desktop peripherals and
drivers, which make up a large quantity of the code but get very little
coverage.  Our stress testing is extensive and was finding 3-year-old
problems when we first ran it, but it is pretty expensive to run those
types of tests (machines, people, data analysis) so we typically run
those tests on distros rather than mainline to help validate distro
quality.

However, that said, the LTP stuff is still *necessary* - it would
catch quite a number of regressions if we were to regress.  The good
thing is that most changes today haven't been leading to regressions.
That could change at any time, and one of the keys is to make sure that
when we do find regressions we get a test into LTP to make sure that
that particular regression never happens again.

I haven't looked at the code coverage for LTP in a while but it is
actually a high-line-count coverage test for the core kernel.  I don't remember
if it was over 80% or not, but usually 85-88% is the point of diminishing
returns for a regression suite.  I think a more important proactive
step here is to understand what regressions we *do* have and whether
or not we can construct a test that in the future will catch that
regression (or better, a class of regressions).

And, maybe we need some kind of filter person or group for lkml that
can see what the key regressions are (e.g. akpm, if you know of a set
of regressions that you are working, maybe periodically sending those
to the ltp mailing list) we could focus on creating tests for those
regressions.

We are also working to set up large ISV applications in a couple of
spots - both inside IBM and there is a similar effort underway at OSDL.
Those ISV applications will catch a class of real world usage models
and also check for regressions.  I don't know if it is possible to set
up a better testing environment for the wild, whacky and weird things
that people do but, yes, Bless them.  ;-)

> We simply get far better coverage testing by releasing code, because of all
> the wild, whacky and weird things which people do with their computers. 
> Bless them.
> 
> > Something like John Cherry's reports to lkml on warnings and errors 
> > would be darned useful.  His reports are IMO an ideal model:  show 
> > day-to-day _changes_ in test results.  Don't just dump a huge list of 
> > testsuite results, results which are often clogged with expected 
> > failures and testsuite bug noise.
> 
> Yes, we need humans between the tests and the developers.  Someone who has
> good experience with the tests and who can say "hey, something changed
> when I do X".  If nothing changed, we don't hear anything.
> 
> It's a developer role, not a testing role.   All testing is, really.

Yep.  However, smart developers continue to write scripts to automate
the rote and mundane tasks that they hate doing.  Towards that end, there
was a recent effort at Bull on the NPTL work which serves as a very
interesting model:

http://nptl.bullopensource.org/Tests/results/run-browse.php

Basically, you can compare results from any test run with any other
and get a summary of differences.  That helps give a quick status
check and helps you focus on the correct issues when tracking down
defects.

gerrit

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:00                 ` Jeff Garzik
  2004-12-02  7:05                   ` Benjamin Herrenschmidt
@ 2004-12-02 14:30                   ` Andy Warner
  2005-01-06 23:40                     ` Jeff Garzik
  1 sibling, 1 reply; 55+ messages in thread
From: Andy Warner @ 2004-12-02 14:30 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, torvalds, benh, linux-kernel, linux-ide

[-- Attachment #1: Type: text/plain, Size: 2854 bytes --]

Jeff Garzik wrote:
> [...]
> I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes 
> ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, 
> but works on UP.  Potentially related to >=4GB of RAM.
> 
> 
> 
> Details, in case anyone is interested:
> Unless my code is screwed up (certainly possible), PIO data-in [using 
> the insw() call] seems to return all zeroes on a true-blue SMP machine, 
> for the identify-device command.  When this happens, libata (correctly) 
> detects a bad id page and bails.  (problem doesn't show up on single CPU 
> w/ HT)

Ah, I might have been here recently, with the pass-thru stuff.

What I saw was that in an SMP machine:

1. queue_work() can result in the work running (on another
   CPU) instantly.

2. Having one CPU beat on PIO registers reading data from one port
   would significantly alter the timing of the CMD->BSY->DRQ sequence
   used in PIO. This behaviour was far worse for competing ports
   within one chip, which I put down to arbitration problems.

3. CPU utilisation would go through the roof. Effectively the
   entire pio_task state machine reduced to a busy spin loop.

4. The state machine needed some tweaks, especially in error
   handling cases.

I made some changes, which effectively solved the problem for promise
TX4-150 cards, and was going to test the results on other chipsets
next week before speaking up. Specifically, I have seen some
issues with SiI 3114 cards.

I was trying to explore using interrupts instead of polling state
but for some reason, I was not getting them for PIO data operations,
or I misunderstand the spec; after removing ata_qc_set_polling(), again
I saw a difference in behaviour between the Promise & SiI cards
here.

I'm about to go offline for 3 days, and hadn't prepared for this
yet. The best I can do is provide a patch (attached) that applies
against 2.6.9. It also seems to apply against libata-2.6, but
barfs a bit against libata-dev-2.6.

The changes boil down to these:

1. Minor changes in how status/error regs are read.
   Including attempts to use altstatus, while I was
   exploring interrupts.

2. State machine logic changes.

3. Replace calls to queue_work() with queue_delayed_work()
   to stop SMP machines going crazy.

With these changes, on a platform consisting of 2.6.9 and
Promise TX4-150 cards, I can move terabytes of parallel
PIO data, without error.

My gut says that the PIO mechanism should be overhauled; I
composed a "how much should we pay for this muffler" email
to linux-ide at least twice while working on this, but never
sent it - wanting to send a solution in rather than just
making more comments from the peanut gallery.

I'll pick up the thread on this next week, when I'm back online.
I hope this helps.
-- 
andyw@pobox.com

Andy Warner		Voice: (612) 801-8549	Fax: (208) 575-5634

[-- Attachment #2: 2.6.9-pio-smp.patch --]
[-- Type: text/plain, Size: 1862 bytes --]

diff -r -u -X dontdiff linux-2.6.9-vanilla/drivers/scsi/libata-core.c linux-2.6.9/drivers/scsi/libata-core.c
--- linux-2.6.9-vanilla/drivers/scsi/libata-core.c	2004-10-18 16:53:06.000000000 -0500
+++ linux-2.6.9/drivers/scsi/libata-core.c	2004-11-24 11:01:40.000000000 -0600
@@ -2099,7 +2099,7 @@
 	}
 
 	drv_stat = ata_wait_idle(ap);
-	if (!ata_ok(drv_stat)) {
+	if (drv_stat & (ATA_ERR | ATA_DF)) {
 		ap->pio_task_state = PIO_ST_ERR;
 		return;
 	}
@@ -2254,23 +2254,17 @@
 	 * chk-status again.  If still busy, fall back to
 	 * PIO_ST_POLL state.
 	 */
-	status = ata_busy_wait(ap, ATA_BUSY, 5);
-	if (status & ATA_BUSY) {
+	status = ata_altstatus(ap) ;
+	if (!(status & ATA_DRQ)) {
 		msleep(2);
-		status = ata_busy_wait(ap, ATA_BUSY, 10);
-		if (status & ATA_BUSY) {
+		status = ata_altstatus(ap) ;
+		if (!(status & ATA_DRQ)) {
 			ap->pio_task_state = PIO_ST_POLL;
 			ap->pio_task_timeout = jiffies + ATA_TMOUT_PIO;
 			return;
 		}
 	}
 
-	/* handle BSY=0, DRQ=0 as error */
-	if ((status & ATA_DRQ) == 0) {
-		ap->pio_task_state = PIO_ST_ERR;
-		return;
-	}
-
 	qc = ata_qc_from_tag(ap, ap->active_tag);
 	assert(qc != NULL);
 
@@ -2321,17 +2315,15 @@
 	case PIO_ST_TMOUT:
 	case PIO_ST_ERR:
 		ata_pio_error(ap);
-		break;
+		return ;
 	}
 
-	if ((ap->pio_task_state != PIO_ST_IDLE) &&
-	    (ap->pio_task_state != PIO_ST_TMOUT) &&
-	    (ap->pio_task_state != PIO_ST_ERR)) {
+	if (ap->pio_task_state != PIO_ST_IDLE) {
 		if (timeout)
 			queue_delayed_work(ata_wq, &ap->pio_task,
 					   timeout);
 		else
-			queue_work(ata_wq, &ap->pio_task);
+			queue_delayed_work(ata_wq, &ap->pio_task, 2);
 	}
 }
 
@@ -2624,7 +2616,7 @@
 		ata_qc_set_polling(qc);
 		ata_tf_to_host_nolock(ap, &qc->tf);
 		ap->pio_task_state = PIO_ST;
-		queue_work(ata_wq, &ap->pio_task);
+		queue_delayed_work(ata_wq, &ap->pio_task, 2);
 		break;
 
 	case ATA_PROT_ATAPI:

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:11                     ` Jeff Garzik
@ 2004-12-02 11:16                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 55+ messages in thread
From: Benjamin Herrenschmidt @ 2004-12-02 11:16 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide

On Thu, 2004-12-02 at 02:11 -0500, Jeff Garzik wrote:
> Benjamin Herrenschmidt wrote:
> > They may not end up in order if they are stores (the stores to the
> > taskfile may be out of order vs; the loads/stores to/from the data
> > register) unless you have a spinlock protecting both or a full sync (on
> > ppc), but then, I don't know the ordering things on x86_64. This could
> > certainly be a problem on ppc & ppc64 too.
> 
> 
> Is synchronization beyond in[bwl] needed, do you think?

Yes, when potentially hopping between CPUs, definitely.

> This specific problem is only on Intel ICHx AFAICS, which is PIO not 
> MMIO and x86-only.  I presumed insw() by its very nature already has 
> synchronization, but perhaps not...

Hrm... on "pure" x86, I would expect so at the HW level, not sure about
x86_64... but there would definitely be an issue on ppc with your
scheme. You need at least a full barrier before you trigger the
workqueue. That may not be the problem you are facing now, but it would
become one.
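
Roughly (sketch only, against the 2.6.9 submission path being
discussed; whether a full mb() is already implied on x86/x86_64 is
exactly the open question):

	/* bitbang the taskfile registers on this CPU */
	ata_tf_to_host_nolock(ap, &qc->tf);
	ap->pio_task_state = PIO_ST;
	/* make those stores visible before another CPU's workqueue
	 * thread starts poking the data register */
	mb();
	queue_work(ata_wq, &ap->pio_task);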

Ben.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:26                     ` Martin J. Bligh
@ 2004-12-02  7:31                       ` Jeff Garzik
  2004-12-02 18:10                         ` cliff white
  2004-12-02 18:43                       ` cliff white
  1 sibling, 1 reply; 55+ messages in thread
From: Jeff Garzik @ 2004-12-02  7:31 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, torvalds, clameter, hugh, benh, nickpiggin,
	linux-mm, linux-ia64, linux-kernel

Martin J. Bligh wrote:
> Yeah, probably. Though the stress tests catch a lot more than the 
> functionality ones. The big pain in the ass is drivers, because I don't
> have a hope in hell of testing more than 1% of them.

My dream is that hardware vendors rotate their current machines through 
a test shop :)  It would be nice to make sure that the popular drivers 
get daily test coverage.

	Jeff, dreaming on



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:02                   ` Andrew Morton
@ 2004-12-02  7:26                     ` Martin J. Bligh
  2004-12-02  7:31                       ` Jeff Garzik
  2004-12-02 18:43                       ` cliff white
  2004-12-02 16:24                     ` Gerrit Huizenga
  2004-12-02 17:34                     ` cliff white
  2 siblings, 2 replies; 55+ messages in thread
From: Martin J. Bligh @ 2004-12-02  7:26 UTC (permalink / raw)
  To: Andrew Morton, Jeff Garzik
  Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
	linux-kernel

--Andrew Morton <akpm@osdl.org> wrote (on Wednesday, December 01, 2004 23:02:17 -0800):

> Jeff Garzik <jgarzik@pobox.com> wrote:
>> 
>> Andrew Morton wrote:
>> > We need to be achieving higher-quality major releases than we did in
>> > 2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
>> > stabilisation periods.
>> 
>> 
>> I'm still hoping that distros (like my employer) and orgs like OSDL will 
>> step up, and hook 2.6.x BK snapshots into daily test harnesses.
> 
> I believe that both IBM and OSDL are doing this, or are getting geared up
> to do this.  With both Linus bk and -mm.

I already run a bunch of tests on a variety of machines for every new 
kernel ... but don't have an automated way to compare the results as yet, 
so don't actually look at them much ;-(. Sometime soon (quite possibly over 
Christmas) things will calm down enough I'll get a couple of days to write 
the appropriate perl script, and start publishing stuff.

> However I have my doubts about how useful it will end up being.  These test
> suites don't seem to pick up many regressions.  I've challenged Gerrit to
> go back through a release cycle's bugfixes and work out how many of those
> bugs would have been detected by the test suite.
> 
> My suspicion is that the answer will be "a very small proportion", and that
> really is the bottom line.

Yeah, probably. Though the stress tests catch a lot more than the 
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.

M.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:05                   ` Benjamin Herrenschmidt
@ 2004-12-02  7:11                     ` Jeff Garzik
  2004-12-02 11:16                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 55+ messages in thread
From: Jeff Garzik @ 2004-12-02  7:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide

Benjamin Herrenschmidt wrote:
> They may not end up in order if they are stores (the stores to the
> taskfile may be out of order vs; the loads/stores to/from the data
> register) unless you have a spinlock protecting both or a full sync (on
> ppc), but then, I don't know the ordering things on x86_64. This could
> certainly be a problem on ppc & ppc64 too.


Is synchronization beyond in[bwl] needed, do you think?

This specific problem is only on Intel ICHx AFAICS, which is PIO not 
MMIO and x86-only.  I presumed insw() by its very nature already has 
synchronization, but perhaps not...

	Jeff



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:00                 ` Jeff Garzik
@ 2004-12-02  7:05                   ` Benjamin Herrenschmidt
  2004-12-02  7:11                     ` Jeff Garzik
  2004-12-02 14:30                   ` Andy Warner
  1 sibling, 1 reply; 55+ messages in thread
From: Benjamin Herrenschmidt @ 2004-12-02  7:05 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide

On Thu, 2004-12-02 at 02:00 -0500, Jeff Garzik wrote:

> 
> 2.6.9:
> 	bitbang ATA taskfile registers
> 	queue_work()
> 	workqueue thread bitbangs ATA data register (read id page)
> 
> So I wonder if <something> doesn't like CPU 0 sending I/O traffic to the 
> on-board SATA PCI device, then immediately after that, CPU 1 sending I/O 
> traffic.
> 
> Anyway, back to debugging...  :)

They may not end up in order if they are stores (the stores to the
taskfile may be out of order vs. the loads/stores to/from the data
register) unless you have a spinlock protecting both or a full sync (on
ppc), but then, I don't know the ordering things on x86_64. This could
certainly be a problem on ppc & ppc64 too.

Ben.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  6:48                 ` Jeff Garzik
@ 2004-12-02  7:02                   ` Andrew Morton
  2004-12-02  7:26                     ` Martin J. Bligh
                                       ` (2 more replies)
  2004-12-02 19:48                   ` Diego Calleja
  1 sibling, 3 replies; 55+ messages in thread
From: Andrew Morton @ 2004-12-02  7:02 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
	linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Andrew Morton wrote:
> > We need to be achieving higher-quality major releases than we did in
> > 2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
> > stabilisation periods.
> 
> 
> I'm still hoping that distros (like my employer) and orgs like OSDL will 
> step up, and hook 2.6.x BK snapshots into daily test harnesses.

I believe that both IBM and OSDL are doing this, or are getting geared up
to do this.  With both Linus bk and -mm.

However I have my doubts about how useful it will end up being.  These test
suites don't seem to pick up many regressions.  I've challenged Gerrit to
go back through a release cycle's bugfixes and work out how many of those
bugs would have been detected by the test suite.

My suspicion is that the answer will be "a very small proportion", and that
really is the bottom line.

We simply get far better coverage testing by releasing code, because of all
the wild, whacky and weird things which people do with their computers. 
Bless them.

> Something like John Cherry's reports to lkml on warnings and errors 
> would be darned useful.  His reports are IMO an ideal model:  show 
> day-to-day _changes_ in test results.  Don't just dump a huge list of 
> testsuite results, results which are often clogged with expected 
> failures and testsuite bug noise.
> 

Yes, we need humans between the tests and the developers.  Someone who has
good experience with the tests and who can say "hey, something changed
when I do X".  If nothing changed, we don't hear anything.

It's a developer role, not a testing role.   All testing is, really.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  6:34               ` Andrew Morton
  2004-12-02  6:48                 ` Jeff Garzik
@ 2004-12-02  7:00                 ` Jeff Garzik
  2004-12-02  7:05                   ` Benjamin Herrenschmidt
  2004-12-02 14:30                   ` Andy Warner
  2004-12-02 18:27                 ` Grant Grundler
  2004-12-07 10:51                 ` Pavel Machek
  3 siblings, 2 replies; 55+ messages in thread
From: Jeff Garzik @ 2004-12-02  7:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, benh, linux-kernel, linux-ide

Andrew Morton wrote:
> We need an -rc3 yet.  And I need to do another pass through the
> regressions-since-2.6.9 list.  We've made pretty good progress there
> recently.  Mid to late December is looking like the 2.6.10 date.


another for that list, BTW:

I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes 
ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, 
but works on UP.  Potentially related to >=4GB of RAM.



Details, in case anyone is interested:
Unless my code is screwed up (certainly possible), PIO data-in [using 
the insw() call] seems to return all zeroes on a true-blue SMP machine, 
for the identify-device command.  When this happens, libata (correctly) 
detects a bad id page and bails.  (problem doesn't show up on single CPU 
w/ HT)

What changed from 2.6.8 to 2.6.9 is

2.6.8:
	bitbang ATA taskfile registers (loads command)
	bitbang ATA data register (read id page)

2.6.9:
	bitbang ATA taskfile registers
	queue_work()
	workqueue thread bitbangs ATA data register (read id page)

So I wonder if <something> doesn't like CPU 0 sending I/O traffic to the 
on-board SATA PCI device, then immediately after that, CPU 1 sending I/O 
traffic.
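
A rough sketch of the 2.6.9-era sequence (illustrative only -- the
struct, function and workqueue names below are made up, not the real
libata code):

#include <linux/workqueue.h>
#include <asm/io.h>

static struct workqueue_struct *pio_wq;	/* stand-in for libata's queue */

struct pio_ctx {
	struct work_struct work;
	unsigned long data_addr;	/* ATA data register */
	u16 buf[256];			/* identify-device page */
};

/* 2.6-era workqueue handlers take the void * given to INIT_WORK(). */
static void pio_data_in(void *arg)
{
	struct pio_ctx *ctx = arg;

	/* Runs in a workqueue thread -- potentially on a different CPU
	 * than the one that bitbanged the taskfile registers. */
	insw(ctx->data_addr, ctx->buf, 256);
}

static void issue_identify(struct pio_ctx *ctx)
{
	/* ... bitbang taskfile registers on the submitting CPU ... */
	INIT_WORK(&ctx->work, pio_data_in, ctx);
	queue_work(pio_wq, &ctx->work);	/* data phase deferred */
}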

Anyway, back to debugging...  :)

	Jeff


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  6:34               ` Andrew Morton
@ 2004-12-02  6:48                 ` Jeff Garzik
  2004-12-02  7:02                   ` Andrew Morton
  2004-12-02 19:48                   ` Diego Calleja
  2004-12-02  7:00                 ` Jeff Garzik
                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 55+ messages in thread
From: Jeff Garzik @ 2004-12-02  6:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
	linux-kernel

Andrew Morton wrote:
> We need to be achieving higher-quality major releases than we did in
> 2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
> stabilisation periods.


I'm still hoping that distros (like my employer) and orgs like OSDL will 
step up, and hook 2.6.x BK snapshots into daily test harnesses.

Something like John Cherry's reports to lkml on warnings and errors 
would be darned useful.  His reports are IMO an ideal model:  show 
day-to-day _changes_ in test results.  Don't just dump a huge list of 
testsuite results, results which are often clogged with expected 
failures and testsuite bug noise.

	Jeff



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  6:21             ` Jeff Garzik
@ 2004-12-02  6:34               ` Andrew Morton
  2004-12-02  6:48                 ` Jeff Garzik
                                   ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Andrew Morton @ 2004-12-02  6:34 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: torvalds, clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64,
	linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Linus Torvalds wrote:
> > Ok, consider me convinced. I don't want to apply this before I get 2.6.10 
> > out the door, but I'm happy with it. I assume Andrew has already picked up 
> > the previous version.
> 
> 
> Does that mean that 2.6.10 is actually close to the door?
> 

We need an -rc3 yet.  And I need to do another pass through the
regressions-since-2.6.9 list.  We've made pretty good progress there
recently.  Mid to late December is looking like the 2.6.10 date.

We need to be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9.  Really the only tool we have to ensure this is longer
stabilisation periods.

Of course, nobody will test -rc3 and a zillion people will test final
2.6.10, which is when we get lots of useful bug reports.  If this keeps on
happening then we'll need to get more serious about the 2.6.10.n process.

Or start alternating between stable and flakey releases, so 2.6.11 will be
a feature release with a 2-month development period and 2.6.12 will be a
bugfix-only release, with perhaps a 2-week development period, so people
know that the even-numbered releases are better stabilised.

We'll see.  It all depends on how many bugs you can fix in the next two
weeks ;)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  0:10           ` Linus Torvalds
  2004-12-02  0:55             ` Andrew Morton
@ 2004-12-02  6:21             ` Jeff Garzik
  2004-12-02  6:34               ` Andrew Morton
  1 sibling, 1 reply; 55+ messages in thread
From: Jeff Garzik @ 2004-12-02  6:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Hugh Dickins, akpm, Benjamin Herrenschmidt,
	Nick Piggin, linux-mm, linux-ia64, linux-kernel

Linus Torvalds wrote:
> Ok, consider me convinced. I don't want to apply this before I get 2.6.10 
> out the door, but I'm happy with it. I assume Andrew has already picked up 
> the previous version.


Does that mean that 2.6.10 is actually close to the door?

/me runs...


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  0:55             ` Andrew Morton
@ 2004-12-02  1:46               ` Christoph Lameter
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Lameter @ 2004-12-02  1:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, hugh, benh, nickpiggin, linux-mm, linux-ia64,
	linux-kernel

On Wed, 1 Dec 2004, Andrew Morton wrote:

> > Ok, consider me convinced. I don't want to apply this before I get 2.6.10
> > out the door, but I'm happy with it.
>
> There were concerns about some architectures relying upon page_table_lock
> for exclusivity within their own pte handling functions.  Have they all
> been resolved?

The patch will fall back on the page_table_lock if an architecture cannot
provide atomic pte operations.
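
Roughly, the generic emulation amounts to something like this (a
sketch of the idea, not the patch text; the exact helper may differ):

#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
/* Fallback: fake the atomic compare-and-exchange with a very short
 * page_table_lock critical section. */
static inline int ptep_cmpxchg(struct mm_struct *mm, unsigned long addr,
			       pte_t *ptep, pte_t oldval, pte_t newval)
{
	int ret = 0;

	spin_lock(&mm->page_table_lock);
	if (pte_same(*ptep, oldval)) {
		set_pte(ptep, newval);
		ret = 1;
	}
	spin_unlock(&mm->page_table_lock);
	return ret;
}
#endif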


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  0:10           ` Linus Torvalds
@ 2004-12-02  0:55             ` Andrew Morton
  2004-12-02  1:46               ` Christoph Lameter
  2004-12-02  6:21             ` Jeff Garzik
  1 sibling, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2004-12-02  0:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: clameter, hugh, benh, nickpiggin, linux-mm, linux-ia64, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:
>
> 
> 
> On Wed, 1 Dec 2004, Christoph Lameter wrote:
> >
> > Changes from V11->V12 of this patch:
> > - dump sloppy_rss in favor of list_rss (Linus' proposal)
> > - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
> > 
> > This is a series of patches that increases the scalability of
> > the page fault handler for SMP. Here are some performance results
> > on a machine with 512 processors allocating 32 GB with an increasing
> > number of threads (that are assigned a processor each).
> 
> Ok, consider me convinced. I don't want to apply this before I get 2.6.10 
> out the door, but I'm happy with it.

There were concerns about some architectures relying upon page_table_lock
for exclusivity within their own pte handling functions.  Have they all
been resolved?

> I assume Andrew has already picked up the previous version.

Nope.  It has major clashes with the 4-level-pagetable work.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-01 23:41         ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
@ 2004-12-02  0:10           ` Linus Torvalds
  2004-12-02  0:55             ` Andrew Morton
  2004-12-02  6:21             ` Jeff Garzik
  2004-12-09  8:00           ` Nick Piggin
  2004-12-09 18:37           ` Hugh Dickins
  2 siblings, 2 replies; 55+ messages in thread
From: Linus Torvalds @ 2004-12-02  0:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin,
	linux-mm, linux-ia64, linux-kernel



On Wed, 1 Dec 2004, Christoph Lameter wrote:
>
> Changes from V11->V12 of this patch:
> - dump sloppy_rss in favor of list_rss (Linus' proposal)
> - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
> 
> This is a series of patches that increases the scalability of
> the page fault handler for SMP. Here are some performance results
> on a machine with 512 processors allocating 32 GB with an increasing
> number of threads (that are assigned a processor each).

Ok, consider me convinced. I don't want to apply this before I get 2.6.10 
out the door, but I'm happy with it. I assume Andrew has already picked up 
the previous version.

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-11-22 22:40       ` Linus Torvalds
@ 2004-12-01 23:41         ` Christoph Lameter
  2004-12-02  0:10           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Christoph Lameter @ 2004-12-01 23:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, akpm, Benjamin Herrenschmidt, Nick Piggin,
	linux-mm, linux-ia64, linux-kernel

Changes from V11->V12 of this patch:
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)

This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).

Without the patches:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 32   3    1    1.416s    138.165s 139.050s 45073.831  45097.498
 32   3    2    1.397s    148.523s  78.044s 41965.149  80201.646
 32   3    4    1.390s    152.618s  44.044s 40851.258 141545.239
 32   3    8    1.500s    374.008s  53.001s 16754.519 118671.950
 32   3   16    1.415s   1051.759s  73.094s  5973.803  85087.358
 32   3   32    1.867s   3400.417s 117.003s  1849.186  53754.928
 32   3   64    5.361s  11633.040s 197.034s   540.577  31881.112
 32   3  128   23.387s  39386.390s 332.055s   159.642  18918.599
 32   3  256   15.409s  20031.450s 168.095s   313.837  37237.918
 32   3  512   18.720s  10338.511s  86.047s   607.446  72752.686

With the patches:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 32   3    1    1.451s    140.151s 141.060s 44430.367  44428.115
 32   3    2    1.399s    136.349s  73.041s 45673.303  85699.793
 32   3    4    1.321s    129.760s  39.027s 47996.303 160197.217
 32   3    8    1.279s    100.648s  20.039s 61724.641 308454.557
 32   3   16    1.414s    153.975s  15.090s 40488.236 395681.716
 32   3   32    2.534s    337.021s  17.016s 18528.487 366445.400
 32   3   64    4.271s    709.872s  18.057s  8809.787 338656.440
 32   3  128   18.734s   1805.094s  21.084s  3449.586 288005.644
 32   3  256   14.698s    963.787s  11.078s  6429.787 534077.540
 32   3  512   15.299s    453.990s   5.098s 13406.321 1050416.414

For more than 8 CPUs the page fault rate increases by orders of
magnitude; at 64 CPUs and above the fault rate is roughly ten times
that of the unpatched kernel.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them, and the
swapper code has been changed to never clear a pte until the page has
been evicted. Populating an empty pte is frequent when a process
touches newly allocated memory (see the sketch after case 2 below).

2. Modifications of flags in a pte entry (write/accessed).

These modifications are already done by the CPU, or by low level
handlers on various platforms, without holding the page_table_lock.
So this seems to be safe too.
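
For case 1, the idea looks roughly like this (a sketch;
pgd_test_and_populate is named in this overview, but its exact
signature and the helper wrapping it are assumptions):

/* Populate an empty pgd entry with a new pmd without taking the
 * page_table_lock.  The populate succeeds only if the entry is still
 * empty, so two racing faults cannot both install a page table. */
static pmd_t *pmd_alloc_nolock(struct mm_struct *mm, pgd_t *pgd,
			       unsigned long address)
{
	pmd_t *new = pmd_alloc_one(mm, address);

	if (!new)
		return NULL;			/* caller handles OOM */
	if (!pgd_test_and_populate(mm, pgd, new))
		pmd_free(new);			/* lost the race */
	return pmd_offset(pgd, address);	/* use whoever won */
}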

One essential change in the VM is the use of ptep_cmpxchg (or its
generic emulation) on page table entries before doing an
update_mmu_cache without holding the page table lock. However, we do
similar things now with other atomic pte operations such as
ptep_get_and_clear and ptep_test_and_clear_dirty. These operations
clear a pte *after* doing an operation on it. The ptep_cmpxchg as used
in this patch operates on a *cleared* pte and replaces it with a pte
pointing to valid memory. The effect of this change on various
architectures has to be thought through. Local definitions of
ptep_cmpxchg and ptep_xchg may be necessary.

For IA64 an icache coherency issue may arise that potentially requires
the flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.
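
Put together, the anonymous-fault path then does something like the
following (a sketch using the fallback signature above; the function
name and details are illustrative, not the patch verbatim):

static int install_anon_page(struct mm_struct *mm,
			     struct vm_area_struct *vma,
			     unsigned long address, pte_t *ptep,
			     pte_t orig_pte, struct page *page)
{
	pte_t entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
						vma->vm_page_prot)));

	/* Install only if the pte still holds the cleared value we
	 * sampled -- no page_table_lock.  Per the note above, ia64 may
	 * need its icache flush *before* this point. */
	if (!ptep_cmpxchg(mm, address, ptep, orig_pte, entry)) {
		page_cache_release(page);	/* raced with another CPU */
		return 0;
	}
	update_mmu_cache(vma, address, entry);
	return 1;
}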

The patch introduces a split counter for rss handling to avoid the
atomic operations and locks currently necessary for rss modifications.
In addition to mm->rss, tsk->rss is introduced. tsk->rss is placed in
the same cache line as tsk->mm (which the fault handler already
touches), so tsk->rss can be incremented cheaply without locks and the
cache line need not bounce between processors in the page fault
handler.

An RCU-based task list is maintained for each mm. The values in that
list are summed up to calculate the rss and anon_rss values.
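
In sketch form (field and list names here are illustrative, not
necessarily the patch's):

#include <linux/sched.h>
#include <linux/rcupdate.h>

/* Fault path: bump the per-task counter; no lock, no atomic op. */
static inline void rss_inc(struct task_struct *tsk)
{
	tsk->rss++;		/* same cache line as tsk->mm */
}

/* Reader path: fold the per-task counters into the mm-wide value. */
static unsigned long mm_rss(struct mm_struct *mm)
{
	struct task_struct *t;
	unsigned long rss = mm->rss;	/* e.g. counts from exited tasks */

	rcu_read_lock();
	list_for_each_entry_rcu(t, &mm->task_list, mm_tasks)
		rss += t->rss;
	rcu_read_unlock();
	return rss;
}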

The patchset is composed of 7 patches:

1/7: Avoid page_table_lock in handle_mm_fault

   This patch defers the acquisition of the page_table_lock as much as
   possible and uses atomic operations for allocating anonymous memory.
   These atomic operations are simulated by acquiring the page_table_lock
   for very small time frames if an architecture does not define
   __HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
   pte will not be set to empty if a page is in transition to swap.

   If only the first two patches are applied then the time that the
   page_table_lock is held is simply reduced. The lock may then be
   acquired multiple times during a page fault.

2/7: Atomic pte operations for ia64

3/7: Make cmpxchg generally available on i386

   The atomic operations on the page table rely heavily on cmpxchg
   instructions. This patch adds emulations of cmpxchg and cmpxchg8b
   for old 80386 and 80486 CPUs. The emulations are only included if
   the kernel is built for these old CPUs; if a kernel built for a
   386 or 486 is then run on a more recent CPU, the real cmpxchg
   instructions are used instead. (A sketch of such an emulation
   follows the patch list.)

   This patch may be used independently of the other patches.

4/7: Atomic pte operations for i386

   The generally available cmpxchg (previous patch) is required for
   this patch to preserve the ability to build kernels for 386 and 486.

5/7: Atomic pte operation for x86_64

6/7: Atomic pte operations for s390

7/7: Split counter implementation for rss
  Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic
  to calculate rss from tasklist.
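
For illustration, the 3/7 emulation could look like this on a CPU
without the instruction (a sketch, not the patch text; on a UP
386-class machine disabling interrupts suffices to make the sequence
atomic):

#include <asm/system.h>		/* local_irq_save() in the 2.6 era */

/* cmpxchg emulation for CPUs lacking the instruction (i.e. a plain
 * 80386).  Such machines are UP, so disabling interrupts makes the
 * read-compare-write sequence effectively atomic. */
static inline unsigned long cmpxchg_386(volatile unsigned long *ptr,
					unsigned long old,
					unsigned long new)
{
	unsigned long prev, flags;

	local_irq_save(flags);
	prev = *ptr;
	if (prev == old)
		*ptr = new;
	local_irq_restore(flags);
	return prev;
}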

There are some additional outstanding performance enhancements that
are not available yet but which require this patch. Those
modifications push the maximum page fault rate from the ~1 million
faults per second shown above to over 3 million faults per second.

The final versions of the sloppy rss, atomic rss and deferred rss
patches will be posted to linux-ia64 for archival purposes.

Signed-off-by: Christoph Lameter <clameter@sgi.com>


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2005-01-06 23:50 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-03 14:49 page fault scalability patch V12 [0/7]: Overview and performance tests Sebastien Decugis
  -- strict thread matches above, loose matches on Subject: below --
2004-11-22 15:00 page fault scalability patch V11 [1/7]: sloppy rss Hugh Dickins
2004-11-22 21:50 ` deferred rss update instead of " Christoph Lameter
2004-11-22 22:22   ` Linus Torvalds
2004-11-22 22:27     ` Christoph Lameter
2004-11-22 22:40       ` Linus Torvalds
2004-12-01 23:41         ` page fault scalability patch V12 [0/7]: Overview and performance tests Christoph Lameter
2004-12-02  0:10           ` Linus Torvalds
2004-12-02  0:55             ` Andrew Morton
2004-12-02  1:46               ` Christoph Lameter
2004-12-02  6:21             ` Jeff Garzik
2004-12-02  6:34               ` Andrew Morton
2004-12-02  6:48                 ` Jeff Garzik
2004-12-02  7:02                   ` Andrew Morton
2004-12-02  7:26                     ` Martin J. Bligh
2004-12-02  7:31                       ` Jeff Garzik
2004-12-02 18:10                         ` cliff white
2004-12-02 18:17                           ` Gerrit Huizenga
2004-12-02 20:25                           ` linux-os
2004-12-02 18:43                       ` cliff white
2004-12-06 19:33                         ` Marcelo Tosatti
2004-12-02 16:24                     ` Gerrit Huizenga
2004-12-02 17:34                     ` cliff white
2004-12-02 19:48                   ` Diego Calleja
2004-12-02 20:12                     ` Jeff Garzik
2004-12-02 20:30                       ` Diego Calleja
2004-12-02 21:08                       ` Wichert Akkerman
2004-12-03  0:07                       ` Francois Romieu
2004-12-02  7:00                 ` Jeff Garzik
2004-12-02  7:05                   ` Benjamin Herrenschmidt
2004-12-02  7:11                     ` Jeff Garzik
2004-12-02 11:16                       ` Benjamin Herrenschmidt
2004-12-02 14:30                   ` Andy Warner
2005-01-06 23:40                     ` Jeff Garzik
2004-12-02 18:27                 ` Grant Grundler
2004-12-02 18:33                   ` Andrew Morton
2004-12-02 18:36                   ` Christoph Hellwig
2004-12-07 10:51                 ` Pavel Machek
2004-12-09  8:00           ` Nick Piggin
2004-12-09 17:03             ` Christoph Lameter
2004-12-10  4:30               ` Nick Piggin
2004-12-09 18:37           ` Hugh Dickins
2004-12-10  4:26             ` Nick Piggin
2004-12-10  4:54               ` Nick Piggin
2004-12-10  5:06                 ` Benjamin Herrenschmidt
2004-12-10  5:19                   ` Nick Piggin
2004-12-10 12:30                     ` Hugh Dickins
2004-12-10 18:43             ` Christoph Lameter
2004-12-10 21:43               ` Hugh Dickins
2004-12-10 22:12                 ` Andrew Morton
2004-12-10 23:52                   ` Hugh Dickins
2004-12-11  0:18                     ` Andrew Morton
2004-12-11  0:44                       ` Hugh Dickins
2004-12-11  0:57                         ` Andrew Morton
2004-12-11  9:23                           ` Hugh Dickins
2004-12-12  7:54               ` Nick Piggin
2004-12-12  9:33                 ` Hugh Dickins
2004-12-12  9:48                   ` Nick Piggin
2004-12-12 21:24                   ` William Lee Irwin III
2004-12-17  3:31                     ` Christoph Lameter
