linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-07 10:51 Joerg Platte
  2008-01-07 11:19 ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Joerg Platte @ 2008-01-07 10:51 UTC (permalink / raw)
  To: linux-kernel

Hi,

when booting kernel 2.6.24-rc{4,5,6,7}, top reports up to 100% iowait, even if 
no program accesses the disk on my Thinkpad T40p. Kernel 2.6.23.12 does not 
suffer from this. Is there anything I can do to find out which process or 
which part of the kernel is responsible for this? I can try to bisect it, but 
maybe there are other possibilities to debug this, since I cannot boot this 
computer frequently. I discovered that there is no iowait within the first 
few seconds after waking up from suspend-to-RAM...

regards,
Jörg


^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-16  9:26 Martin Knoblauch
       [not found] ` <E1JF6w8-0000vs-HM@localhost.localdomain>
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-16  9:26 UTC (permalink / raw)
  To: Mike Snitzer, Fengguang Wu
  Cc: Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel, linux-ext4,
	Linus Torvalds

----- Original Message ----
> From: Mike Snitzer <snitzer@gmail.com>
> To: Fengguang Wu <wfg@mail.ustc.edu.cn>
> Cc: Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>; Andrew Morton <akpm@li>
> Sent: Tuesday, January 15, 2008 10:13:22 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Jan 14, 2008 7:50 AM, Fengguang Wu  wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed
> > > > the bug! With previous kernels the bug showed up after each
> > > > reboot. Now, when booting the patched kernel everything is fine
> > > > and there is no longer any suspicious iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did
> > > > somebody change the ext2 code or is it related to the changes in
> > > > the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess
> > > the new and improved code was less able to deal with these funny
> > > corner cases. But he has been very good at tracking them down and
> > > solving them, kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> >         "writeback: introduce writeback_control.more_io to indicate
> >          more io"
> > That patch means to speed up writeback, but unfortunately its
> > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b).
> > Let's push it back to the -mm tree for more testing?
> 
> Fengguang,
> 
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm.  To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22, CFS could be
> helping, etc but...).  Very impressive!
> 
> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
> you're not confident in it for 2.6.24.
> 
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
> 
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have?  Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?
> 
Hi Fengguang, Mike,

 I can add myself to Mike's question. It would be good to know a "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has shown quite a nice improvement of the overall writeback situation, and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently has already reverted "...2250b". I will definitely repeat my tests with -rc8 and report.

 Cheers
Martin





^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-16 14:15 Martin Knoblauch
  2008-01-16 16:27 ` Mike Snitzer
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-16 14:15 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
	linux-ext4, Linus Torvalds

----- Original Message ----
> From: Fengguang Wu <wfg@mail.ustc.edu.cn>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > your evolving "close, but not ready for merge" -mm writeback
> > > patchset?
> > > 
> > Hi Fengguang, Mike,
> > 
> >  I can add myself to Mike's question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > been showing quite nice improvement of the overall writeback
> > situation and it would be sad to see this [partially] gone in
> > 2.6.24-final. Linus apparently has already reverted "...2250b". I
> > will definitely repeat my tests with -rc8 and report.
> 
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
> 

 Will do tomorrow or Friday. Actually, a patch against -rc8 would be nicer for me, as I have not looked at -rc7 due to holidays and some of the reported problems with it.

Cheers
Martin

> Fengguang
> ---
>  fs/fs-writeback.c         |   17 +++++++++++++++--
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++++++---
>  3 files changed, 22 insertions(+), 5 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
>                   * soon as the queue becomes uncongested.
>                   */
>                  inode->i_state |= I_DIRTY_PAGES;
> -                requeue_io(inode);
> +                if (wbc->nr_to_write <= 0)
> +                    /*
> +                     * slice used up: queue for next turn
> +                     */
> +                    requeue_io(inode);
> +                else
> +                    /*
> +                     * somehow blocked: retry later
> +                     */
> +                    redirty_tail(inode);
>              } else {
>                  /*
>                   * Otherwise fully redirty the inode so that
> @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
>          iput(inode);
>          cond_resched();
>          spin_lock(&inode_lock);
> -        if (wbc->nr_to_write <= 0)
> +        if (wbc->nr_to_write <= 0) {
> +            wbc->more_io = 1;
>              break;
> +        }
> +        if (!list_empty(&sb->s_more_io))
> +            wbc->more_io = 1;
>      }
>      return;        /* Leave any unwritten inodes on s_io */
>  }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
>      unsigned for_reclaim:1;        /* Invoked from the page allocator */
>      unsigned for_writepages:1;    /* This is a writepages() call */
>      unsigned range_cyclic:1;    /* range_start is cyclic */
> +    unsigned more_io:1;        /* more io to be dispatched */
>  };
>  
>  /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
>              global_page_state(NR_UNSTABLE_NFS) < background_thresh
>                  && min_pages <= 0)
>              break;
> +        wbc.more_io = 0;
>          wbc.encountered_congestion = 0;
>          wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>          wbc.pages_skipped = 0;
> @@ -565,8 +566,9 @@ static void background_writeout(unsigned
>          min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>          if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>              /* Wrote less than expected */
> -            congestion_wait(WRITE, HZ/10);
> -            if (!wbc.encountered_congestion)
> +            if (wbc.encountered_congestion || wbc.more_io)
> +                congestion_wait(WRITE, HZ/10);
> +            else
>                  break;
>          }
>      }
> @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
>              global_page_state(NR_UNSTABLE_NFS) +
>              (inodes_stat.nr_inodes - inodes_stat.nr_unused);
>      while (nr_to_write > 0) {
> +        wbc.more_io = 0;
>          wbc.encountered_congestion = 0;
>          wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>          writeback_inodes(&wbc);
>          if (wbc.nr_to_write > 0) {
> -            if (wbc.encountered_congestion)
> +            if (wbc.encountered_congestion || wbc.more_io)
>                  congestion_wait(WRITE, HZ/10);
>              else
>                  break;    /* All the old data is written */
> 
> 
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 13:52 Martin Knoblauch
  2008-01-17 16:11 ` Mike Snitzer
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-17 13:52 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
	linux-ext4, Linus Torvalds

----- Original Message ----
> From: Fengguang Wu <wfg@mail.ustc.edu.cn>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > your evolving "close, but not ready for merge" -mm writeback
> > > patchset?
> > > 
> > Hi Fengguang, Mike,
> > 
> >  I can add myself to Mike's question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > been showing quite nice improvement of the overall writeback
> > situation and it would be sad to see this [partially] gone in
> > 2.6.24-final. Linus apparently has already reverted "...2250b". I
> > will definitely repeat my tests with -rc8 and report.
> 
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
> 
Hi Fengguang,

 something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-(

 Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre-2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff.

 At the moment I am frantically trying to find when things went down. I did run -rc8 and -rc8+your patch; no difference from what I see with -rc6. Sorry that I cannot provide any input on your patch.

Depressed
Martin


^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 17:44 Martin Knoblauch
  2008-01-17 20:23 ` Mel Gorman
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-17 17:44 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
	linux-ext4, Linus Torvalds

----- Original Message ----
> From: Martin Knoblauch <spamtrap@knobisoft.de>
> To: Fengguang Wu <wfg@mail.ustc.edu.cn>
> Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Thursday, January 17, 2008 2:52:58 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> ----- Original Message ----
> > From: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > To: Martin Knoblauch <knobi@knobisoft.de>
> > Cc: Mike Snitzer; Peter Zijlstra; jplatte@naasa.net; Ingo Molnar;
> >     linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org";
> >     Linus Torvalds
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> > 
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > > your evolving "close, but not ready for merge" -mm writeback
> > > > patchset?
> > > > 
> > > Hi Fengguang, Mike,
> > > 
> > >  I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > been showing quite nice improvement of the overall writeback
> > > situation and it would be sad to see this [partially] gone in
> > > 2.6.24-final. Linus apparently has already reverted "...2250b". I
> > > will definitely repeat my tests with -rc8 and report.
> > 
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> > 
> Hi Fengguang,
> 
>  something really bad has happened between -rc3 and -rc6.
> Embarrassingly I did not catch that earlier :-(
> 
>  Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 ,
> dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way
> as in pre-2.6.24. The only test that is still good is mix3, which I
> attribute to the per-BDI stuff.
> 
>  At the moment I am frantically trying to find when things went down. I
> did run -rc8 and -rc8+your patch; no difference from what I see with
> -rc6. Sorry that I cannot provide any input on your patch.
> 

 OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted

#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman <mel@csn.ul.ie>
#Date:   Mon Dec 17 16:20:05 2007 -0800
#
#    mm: fix page allocation for larger I/O segments
#    
#    In some cases the IO subsystem is able to merge requests if the pages are
#    adjacent in physical memory.  This was achieved in the allocator by having
#    expand() return pages in physically contiguous order in situations were a
#    large buddy was split.  However, list-based anti-fragmentation changed the
#    order pages were returned in to avoid searching in buffered_rmqueue() for a
#    page of the appropriate migrate type.
#    
#    This patch restores behaviour of rmqueue_bulk() preserving the physical
#    order of pages returned by the allocator without incurring increased search
#    costs for anti-fragmentation.
#    
#    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
#    Cc: James Bottomley <James.Bottomley@steeleye.com>
#    Cc: Jens Axboe <jens.axboe@oracle.com>
#    Cc: Mark Lord <mlord@pobox.com>
#    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
#    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c    2007-12-21 04:14:11.305633890 +0000
+++ linux-2.6.24-rc6/mm/page_alloc.c    2007-12-21 04:14:17.746985697 +0000
@@ -847,8 +847,19 @@
                struct page *page = __rmqueue(zone, order, migratetype);
                if (unlikely(page == NULL))
                        break;
+
+               /*
+                * Split buddy pages returned by expand() are received here
+                * in physical page order. The page is added to the callers and
+                * list and the list head then moves forward. From the callers
+                * perspective, the linked list is ordered by page number in
+                * some conditions. This is useful for IO devices that can
+                * merge IO requests if the physical pages are ordered
+                * properly.
+                */
                list_add(&page->lru, list);
                set_page_private(page, migratetype);
+               list = &page->lru;
        }
        spin_unlock(&zone->lock);
        return i;



This has brought back the good results I observed and reported.
I do not know what to make of this. At least on the systems I care
about (HP DL380g4, dual CPUs, HT enabled, 8 GB memory, SmartArray 6i
controller with 4x72GB SCSI disks as RAID5 (battery-protected writeback
cache enabled) and gigabit networking (tg3)), this optimisation is a disaster.

 On the other hand, it is not a regression against 2.6.22/23. Those had
bad IO scaling too. It would just be a shame to lose an apparently great
performance win.

Cheers
Martin



^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 17:51 Martin Knoblauch
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-17 17:51 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Fengguang Wu, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
	linux-ext4, Linus Torvalds

----- Original Message ----
> From: Mike Snitzer <snitzer@gmail.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Thursday, January 17, 2008 5:11:50 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> I've backported Peter's perbdi patchset to 2.6.22.x.  I can share it
> with anyone who might be interested.
> 
> As expected, it has yielded 2.6.24-rcX level scaling.  Given the test
> result matrix you previously posted, 2.6.22.x+perbdi might give you
> what you're looking for (sans improved writeback that 2.6.24 was
> thought to be providing).  That is, much improved scaling with better
> O_DIRECT and network throughput.  Just a thought...
> 
> Unfortunately, my priorities (and computing resources) have shifted
> and I won't be able to thoroughly test Fengguang's new writeback patch
> on 2.6.24-rc8... whereby missing out on providing
> justification/testing to others on _some_ improved writeback being
> included in 2.6.24 final.
> 
> Not to mention the window for writeback improvement is all but closed
> considering the 2.6.24-rc8 announcement's 2.6.24 final release
> timetable.
> 
Mike,

 thanks for the offer, but the improved throughput is my #1 priority nowadays.
And while the better scaling for different targets is nothing to frown upon, the
much better scaling when writing to the same target would have been the big
winner for me.

 Anyway, I located the "offending" commit. Let's see what the experts say.


Cheers
Martin




^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 21:50 Martin Knoblauch
  2008-01-17 22:12 ` Mel Gorman
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-17 21:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar,
	linux-kernel, linux-ext4, Linus Torvalds, James.Bottomley

----- Original Message ----
> From: Mel Gorman <mel@csn.ul.ie>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>; James.Bottomley@steeleye.com
> Sent: Thursday, January 17, 2008 9:23:57 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > > For those interested in using your writeback improvements in
> > > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > > > > > your evolving "close, but not ready for merge" -mm writeback
> > > > > > > patchset?
> > > > > > > 
> > > > > > 
> > > > >  I can add myself to Mike's question. It would be good to know a
> > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > > been showing quite nice improvement of the overall writeback
> > > > > situation and it would be sad to see this [partially] gone in
> > > > > 2.6.24-final. Linus apparently has already reverted "...2250b". I
> > > > > will definitely repeat my tests with -rc8 and report.
> > > > > 
> > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > > Maybe we can push it to 2.6.24 after your testing.
> > > > 
> > > Hi Fengguang,
> > > 
> > > something really bad has happened between -rc3 and -rc6.
> > > Embarrassingly I did not catch that earlier :-(
> > > Compared to the numbers I posted in
> > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > > The only test that is still good is mix3, which I attribute to
> > > the per-BDI stuff.
> 
> I suspect that the IO hardware you have is very sensitive to the
> color of the physical page. I wonder, do you boot the system cleanly
> and then run these tests? If so, it would be interesting to know what
> happens if you stress the system first (many kernel compiles for example,
> basically anything that would use a lot of memory in different ways for some
> time) to randomise the free lists a bit and then run your test. You'd need to run
> the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you
> identified reverted.
>

 The effect is definitely dependent on the IO hardware. I performed the same tests
on a different box with an AACRAID controller and there things look different. Basically
the "offending" commit helps single-stream performance on that box, while dual/triple
stream are not affected. So I suspect that the CCISS is just not behaving well.

 And yes, the tests are usually done on a freshly booted box. Of course, I repeat them
a few times. On the CCISS box the numbers are very constant. On the AACRAID box
they vary quite a bit.

 I can certainly stress the box before doing the tests. Please define "many" for the kernel
compiles :-)

> > 
> >  OK, the change happened between rc5 and rc6. Just following a
> > gut feeling, I reverted
> > 
> > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > #Author: Mel Gorman 
> > #Date:   Mon Dec 17 16:20:05 2007 -0800
> > #

> > 
> > This has brought back the good results I observed and reported.
> > I do not know what to make out of this. At least on the systems
> > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > SmartaArray6i controller with 4x72GB SCSI disks as RAID5 (battery
> > protected writeback cache enabled) and gigabit networking (tg3)) this
> > optimisation is a dissaster.
> > 
> 
> That patch was not an optimisation, it was a regression fix
> against 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.

 I think this late in the 2.6.24 game we just should leave things as they are. But
we should try to find a way to make CCISS faster, as it apparently can be faster.

> Your controller would seem to suffer when presented with the same situation
> but I don't know why that is. I've added James to the cc in case he has seen this
> sort of situation before.
> 
> > On the other hand, it is not a regression against 2.6.22/23. Those
> > had bad IO scaling too. It would just be a shame to lose an apparently
> > great performance win.
> 
> Could you try running your tests again when the system has been
> stressed with some other workload first?
> 

 Will do.

Cheers
Martin




^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-18  8:19 Martin Knoblauch
  2008-01-18 16:01 ` Mel Gorman
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-18  8:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar,
	linux-kernel, linux-ext4, Linus Torvalds, James.Bottomley

----- Original Message ----
> From: Mel Gorman <mel@csn.ul.ie>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>; James.Bottomley@steeleye.com
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > > 
> > 
> > The effect is definitely dependent on the IO hardware. I performed
> > the same tests on a different box with an AACRAID controller and
> > there things look different.
> 
> I take it different also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?
>

Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:

Test       2.6.19.2   2.6.24-rc6   2.6.24-rc6 minus 81eabcbe0b99 (reverted)
dd1        325        350          290
dd1-dir    180        160          160
dd2        2x90       2x113        2x110
dd2-dir    2x120      2x92         2x93
dd3        3x54       3x70         3x70
dd3-dir    3x83       3x64         3x64
mix3       55,2x30    400,2x25     310,2x25

 What we are seeing here is that:

a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
b) Reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved over 2.6.19
d) the mix3 performance is improved compared to 2.6.19.
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS.

 So, on this box your patch is definitely needed to get the pre-2.6.24 performance
when writing a single big file.

 Actually things on the CCISS box might be even more complicated. I forgot the fact
that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have
ext2/Hardware. Do you think that the LVM/MD are sensitive to the page order/coloring?

 Anyway: does your patch only address this performance issue, or are there also
data integrity concerns without it? I may consider reverting the patch for my
production environment. It really helps two thirds of my boxes big time, while it does
not hurt the other third that much :-)

> > 
> >  I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
> > 
> 
> With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> all simultaneously. Running that for 20-30 minutes should be enough
> to randomise the freelists, affecting what color of page is used for the
> dd test.
> 

 ouch :-) OK, I will try that.

Martin




^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-19 10:24 Martin Knoblauch
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-19 10:24 UTC (permalink / raw)
  To: Mike Snitzer, Linus Torvalds
  Cc: Mel Gorman, Fengguang Wu, Peter Zijlstra, jplatte, Ingo Molnar,
	linux-kernel, linux-ext4, James.Bottomley

----- Original Message ----
> From: Mike Snitzer <snitzer@gmail.com>
> To: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Mel Gorman <mel@csn.ul.ie>; Martin Knoblauch <spamtrap@knobisoft.de>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com
> Sent: Friday, January 18, 2008 11:47:02 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> > I can fire up 2.6.24-rc8 in short order to see if things are vastly
> > improved (as Martin seems to indicate that he is happy with
> > AACRAID on 2.6.24-rc8).  Although even Martin's AACRAID
> > numbers from 2.6.19.2 are still quite good (relative to mine).
> > Martin, can you share any tuning you may have done to get AACRAID
> > to where it is for you right now?
Mike,

 I have always been happy with the AACRAID box compared to the CCISS system. Even with the "regression" in 2.6.24-rc1..rc5 it was more than acceptable to me. For me the differences between 2.6.19 and 2.6.24-rc8 on the AACRAID setup are:

- 11% (single stream) to 25% (dual/triple stream) regression in DIO. Something I do not care much about; I just measure it for reference.
+ the very nice behaviour when writing to different targets (mix3), which I attribute to Peter's per-BDI stuff.

 And until -rc6 I was extremely pleased with the cool speedup I saw on my CCISS boxes. This would have been the next "production" kernel for me. But let's discuss this under a separate topic; it has nothing to do with the original wait-io issue.

 Oh, before I forget: there has been no tuning for the AACRAID. The system is an IBM x3650 with built-in AACRAID and battery-backed write cache. The disks are 6x142GB/15krpm in a RAID5 setup. I see one big difference between your and my tests: I do 1MB writes to simulate the behaviour of the real applications, while yours seem to be much smaller.
 
Cheers
Martin




^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-22 15:25 Martin Knoblauch
  2008-01-22 23:40 ` Alasdair G Kergon
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-22 15:25 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
	Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel, linux-ext4,
	James.Bottomley, Jens Axboe, Milan Broz, Neil Brown


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

----- Original Message ----
> From: Alasdair G Kergon <agk@redhat.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Mel Gorman <mel@csn.ul.ie>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com; Jens Axboe <jens.axboe@oracle.com>; Milan Broz <mbroz@redhat.com>; Neil Brown <neilb@suse.de>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> >  At least, rc1-rc5 have shown that the CCISS system can do well. Now
> > the question is which part of the system does not cope well with the
> > larger IO sizes? Is it the CCISS controller, LVM or both? I am open
> > to suggestions on how to debug that.
> 
> What is your LVM device configuration?
>   E.g. 'dmsetup table' and 'dmsetup info -c' output.
> Some configurations lead to large IOs getting split up on the way
> through device-mapper.
>
Hi Alasdair,

 Here is the output; the filesystem in question is on LogVol02:

  [root@lpsdm52 ~]# dmsetup table
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384
[root@lpsdm52 ~]# dmsetup info -c
Name             Maj Min Stat Open Targ Event  UUID
VolGroup00-LogVol02 253   1 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmOZ4OzOgGQIdF3qDx6fJmlZukXXLIy39R
VolGroup00-LogVol01 253   2 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4Ogmfn2CcAd2Fh7i48twe8PZc2XK5bSOe1Fq
VolGroup00-LogVol00 253   0 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmfYjxQKFP3zw2fGsezJN7ypSrfmP7oSvE
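[Editor's note: the linear-target lines above decode mechanically — the columns are start sector, length in 512-byte sectors, target type, backing device (major:minor), and offset on that device. A small sketch converting the lengths shown to GiB (2097152 sectors per GiB):]

```shell
# Decode the 'dmsetup table' output above: column 3 is the target length
# in 512-byte sectors, column 6 the starting offset on the backing
# device 104:2.
awk '{ printf "%s %.1f GiB at offset %s\n", $1, $3/2097152, $6 }' <<'EOF'
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384
EOF
```

All three volumes are simple linear mappings onto contiguous regions of the same backing device, which is why Alasdair later concludes device-mapper should not be splitting the IO.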

> See if these patches make any difference:
> 
>   http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
> 
>     dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
>     dm-introduce-merge_bvec_fn.patch
>     dm-linear-add-merge.patch
>     dm-table-remove-merge_bvec-sector-restriction.patch
>

 Thanks for the suggestion. Are they supposed to apply to mainline?

Cheers
Martin




^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-22 18:51 Martin Knoblauch
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-22 18:51 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
	Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel, linux-ext4,
	James.Bottomley, Jens Axboe, Milan Broz, Neil Brown

----- Original Message ----
> From: Alasdair G Kergon <agk@redhat.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Mel Gorman <mel@csn.ul.ie>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com; Jens Axboe <jens.axboe@oracle.com>; Milan Broz <mbroz@redhat.com>; Neil Brown <neilb@suse.de>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> See if these patches make any difference:
> 
>   http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
> 
>     dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
>     dm-introduce-merge_bvec_fn.patch
>     dm-linear-add-merge.patch
>     dm-table-remove-merge_bvec-sector-restriction.patch
>


 Nope, exactly the same poor results. To rule out LVM/DM I really have to see what happens if I set up a system with filesystems directly on partitions. That might take some time, though.

Cheers
Martin




^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-23 11:12 Martin Knoblauch
  0 siblings, 0 replies; 59+ messages in thread
From: Martin Knoblauch @ 2008-01-23 11:12 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
	Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel, linux-ext4,
	James.Bottomley, Jens Axboe, Milan Broz, Neil Brown

----- Original Message ----
> From: Alasdair G Kergon <agk@redhat.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Mel Gorman <mel@csn.ul.ie>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com; Jens Axboe <jens.axboe@oracle.com>; Milan Broz <mbroz@redhat.com>; Neil Brown <neilb@suse.de>
> Sent: Wednesday, January 23, 2008 12:40:52 AM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> >   [root@lpsdm52 ~]# dmsetup table
> > VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> > VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> > VolGroup00-LogVol00: 0 67108864 linear 104:2 384
> 
> The IO should pass straight through simple linear targets like
> that without needing to get broken up, so I wouldn't expect those patches to
> make any difference in this particular case.
> 

Alasdair,

 LVM/DM are off the hook :-) I converted one box to use partitions directly, and the performance is the same disappointment as with LVM/DM. Thanks anyway for looking at my problem.

 I will now move the discussion to a new thread, targeting CCISS directly.

Cheers
Martin



^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2008-01-23 11:12 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-07 10:51 regression: 100% io-wait with 2.6.24-rcX Joerg Platte
2008-01-07 11:19 ` Ingo Molnar
2008-01-07 13:24   ` Joerg Platte
2008-01-07 13:32     ` Peter Zijlstra
2008-01-07 13:40       ` Joerg Platte
     [not found]         ` <E1JCRbA-0002bh-3c@localhost.localdomain>
2008-01-09  3:27           ` Fengguang Wu
2008-01-09  6:13             ` Joerg Platte
     [not found]               ` <E1JCZg2-0001DE-RP@localhost.localdomain>
2008-01-09 12:04                 ` Fengguang Wu
2008-01-09 12:22                   ` Joerg Platte
     [not found]                     ` <E1JCaUd-0001Ko-Tt@localhost.localdomain>
2008-01-09 12:57                       ` Fengguang Wu
2008-01-09 13:04                         ` Joerg Platte
     [not found]                           ` <E1JCrMj-0001HR-SZ@localhost.localdomain>
2008-01-10  6:58                             ` Fengguang Wu
     [not found]                           ` <E1JCrsE-0000v4-Dz@localhost.localdomain>
2008-01-10  7:30                             ` Fengguang Wu
     [not found]                           ` <20080110073046.GA3432@mail.ustc.edu.cn>
     [not found]                             ` <E1JCsDr-0002cl-0e@localhost.localdomain>
2008-01-10  7:53                               ` Fengguang Wu
2008-01-10  8:37                                 ` Joerg Platte
     [not found]                                   ` <E1JCt0n-00048n-AD@localhost.localdomain>
2008-01-10  8:43                                     ` Fengguang Wu
2008-01-10 10:03                                       ` Joerg Platte
     [not found]                                         ` <E1JDBk4-0000UF-03@localhost.localdomain>
2008-01-11  4:43                                           ` Fengguang Wu
2008-01-11  5:29                                             ` Joerg Platte
2008-01-11  6:41                                               ` Joerg Platte
2008-01-12 23:32                                             ` Joerg Platte
     [not found]                                               ` <E1JDwaA-00017Q-W6@localhost.localdomain>
2008-01-13  6:44                                                 ` Fengguang Wu
2008-01-13  8:05                                                   ` Joerg Platte
     [not found]                                                     ` <E1JDy5a-0001al-Tk@localhost.localdomain>
2008-01-13  8:21                                                       ` Fengguang Wu
2008-01-13  9:49                                                         ` Joerg Platte
     [not found]                                                           ` <E1JE1Uz-0002w5-6z@localhost.localdomain>
2008-01-13 11:59                                                             ` Fengguang Wu
     [not found]                                                           ` <20080113115933.GA11045@mail.ustc.edu.cn>
     [not found]                                                             ` <E1JEGPH-0001uw-Df@localhost.localdomain>
2008-01-14  3:54                                                               ` Fengguang Wu
     [not found]                                                             ` <20080114035439.GA7330@mail.ustc.edu.cn>
     [not found]                                                               ` <E1JEM2I-00010S-5U@localhost.localdomain>
2008-01-14  9:55                                                                 ` Fengguang Wu
2008-01-14 11:30                                                                   ` Joerg Platte
2008-01-14 11:41                                                                     ` Peter Zijlstra
     [not found]                                                                       ` <E1JEOmD-0001Ap-U7@localhost.localdomain>
2008-01-14 12:50                                                                         ` Fengguang Wu
2008-01-15 21:13                                                                           ` Mike Snitzer
     [not found]                                                                             ` <E1JF0m1-000101-OK@localhost.localdomain>
2008-01-16  5:25                                                                               ` Fengguang Wu
2008-01-15 21:42                                                                         ` Ingo Molnar
     [not found]                                                                           ` <E1JF0bJ-0000zU-FG@localhost.localdomain>
2008-01-16  5:14                                                                             ` Fengguang Wu
2008-01-16  9:26 Martin Knoblauch
     [not found] ` <E1JF6w8-0000vs-HM@localhost.localdomain>
2008-01-16 12:00   ` Fengguang Wu
2008-01-16 14:15 Martin Knoblauch
2008-01-16 16:27 ` Mike Snitzer
2008-01-17 13:52 Martin Knoblauch
2008-01-17 16:11 ` Mike Snitzer
2008-01-17 17:44 Martin Knoblauch
2008-01-17 20:23 ` Mel Gorman
2008-01-17 17:51 Martin Knoblauch
2008-01-17 21:50 Martin Knoblauch
2008-01-17 22:12 ` Mel Gorman
2008-01-18  8:19 Martin Knoblauch
2008-01-18 16:01 ` Mel Gorman
2008-01-18 17:46   ` Linus Torvalds
2008-01-18 19:01     ` Martin Knoblauch
2008-01-18 19:23       ` Linus Torvalds
2008-01-22 14:39       ` Alasdair G Kergon
2008-01-18 20:00     ` Mike Snitzer
2008-01-18 22:47       ` Mike Snitzer
2008-01-19 10:24 Martin Knoblauch
2008-01-22 15:25 Martin Knoblauch
2008-01-22 23:40 ` Alasdair G Kergon
2008-01-22 18:51 Martin Knoblauch
2008-01-23 11:12 Martin Knoblauch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).