Re: MD write performance issue

* Re: MD write performance issue - found Catalyst patches
@ 2009-10-18 10:00 mark delfman
  2009-10-18 22:39 ` NeilBrown
  2009-10-29  6:41 ` Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: mark delfman @ 2009-10-18 10:00 UTC (permalink / raw)
  To: Mattias Hellström, Linux RAID Mailing List, NeilBrown

[-- Attachment #1: Type: text/plain, Size: 2020 bytes --]

We have tracked the performance drop to the attached two commits in
2.6.28.6.    The performance never fully recovers in later kernels so
I presuming that the change in the write cache is still affecting MD
today.

The problem for us is that although we have slowly tracked it down, we
have no understanding of linux at this level and simply wouldn’t know
where go from this point.

Considering this seems to only effect MD and not hardware based RAID
(in our tests) I thought that this would be an appropriate place to
post these patches and findings.

There are 2 patches which impact MD performance via a filesystem:

a) commit 66c85494570396661479ba51e17964b2c82b6f39 - write-back: fix
nr_to_write counter
b) commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 - Fix page
writeback thinko, causing Berkeley DB slowdown


1) no patches applied into 2.6.28.5 kernel: write speed is 1.1 GB/s via xfs
2) both patches are applied into 2.6.28.5 kernel: xfs drops to circa:
680 MB/s (like in kernel 2.6.28.6 and later)
3) put only one patch: 66c85494570396661479ba51e17964b2c82b6f39
(write-back: fix nr_to_write counter) - performance goes down to circa
780 MB/s
4) put only one patch: fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 (Fix
page writeback thinko) - the performance is good: 1.1 GB/s (on XFS)

change log for 28.6
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.28.6


Hopefully this helps to resolve this....


Mark

2009/10/17 Mattias Hellström <hellstrom.mattias@gmail.com>:
> (If I were you) I would further test the revisions between the
> following and then look at the changelog for the culprit. Looks like
> versions after this are just trying to regain the missing speed.
>
> Linux linux-tlfp 2.6.27.14-vanilla #1 SMP Fri Oct 16 00:56:25 BST 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> RAW: 1.1
> XFS 1.1
>
> Linux linux-tlfp 2.6.27.20-vanilla #1 SMP Thu Oct 15 23:59:32 BST 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> RAw 1.1 GB/s
> XFS: 487 MB/s
>

[-- Attachment #2: fa76ac6cbeb58256cf7de97a75d5d7f838a80b32.patch --]
[-- Type: application/octet-stream, Size: 2047 bytes --]

commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32
Author: Nick Piggin <npiggin@suse.de>
Date:   Thu Feb 12 04:34:23 2009 +0100

    Fix page writeback thinko, causing Berkeley DB slowdown
    
    commit 3a4c6800f31ea8395628af5e7e490270ee5d0585 upstream.
    
    A bug was introduced into write_cache_pages cyclic writeout by commit
    31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
    fix").  The intention (and comments) is that we should cycle back and
    look for more dirty pages at the beginning of the file if there is no
    more work to be done.
    
    But the !done condition was dropped from the test.  This means that any
    time the page writeout loop breaks (eg.  due to nr_to_write == 0), we
    will set index to 0, then goto again.  This will set done_index to
    index, then find done is set, so will proceed to the end of the
    function.  When updating mapping->writeback_index for cyclic writeout,
    we now use done_index == 0, so we're always cycling back to 0.
    
    This seemed to be causing random mmap writes (slapadd and iozone) to
    start writing more pages from the LRU and writeout would slowdown, and
    caused bugzilla entry
    
    	http://bugzilla.kernel.org/show_bug.cgi?id=12604
    
    about Berkeley DB slowing down dramatically.
    
    With this patch, iozone random write performance is increased nearly
    5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
    
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reported-and-tested-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 08d2b96..11400ed 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -997,7 +997,7 @@ continue_unlock:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-	if (!cycled) {
+	if (!cycled && !done) {
 		/*
 		 * range_cyclic:
 		 * We hit the last page and there is more work to be done: wrap

[-- Attachment #3: 66c85494570396661479ba51e17964b2c82b6f39.patch --]
[-- Type: application/octet-stream, Size: 2347 bytes --]

commit 66c85494570396661479ba51e17964b2c82b6f39
Author: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Date:   Mon Feb 2 18:33:49 2009 +0200

    write-back: fix nr_to_write counter
    
    commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa upstream.
    
    Commit 05fe478dd04e02fa230c305ab9b5616669821dd3 introduced some
    @wbc->nr_to_write breakage.
    
    It made the following changes:
     1. Decrement wbc->nr_to_write instead of nr_to_write
     2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
     3. If synced nr_to_write pages, stop only if if wbc->sync_mode ==
        WB_SYNC_NONE, otherwise keep going.
    
    However, according to the commit message, the intention was to only make
    change 3.  Change 1 is a bug.  Change 2 does not seem to be necessary,
    and it breaks UBIFS expectations, so if needed, it should be done
    separately later.  And change 2 does not seem to be documented in the
    commit message.
    
    This patch does the following:
     1. Undo changes 1 and 2
     2. Add a comment explaining change 3 (it very useful to have comments
        in _code_, not only in the commit).
    
    Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    Acked-by: Nick Piggin <npiggin@suse.de>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11400ed..0c4100e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -981,13 +981,22 @@ continue_unlock:
 				}
  			}
 
-			if (wbc->sync_mode == WB_SYNC_NONE) {
-				wbc->nr_to_write--;
-				if (wbc->nr_to_write <= 0) {
-					done = 1;
-					break;
-				}
+			if (nr_to_write > 0)
+				nr_to_write--;
+			else if (wbc->sync_mode == WB_SYNC_NONE) {
+				/*
+				 * We stop writing back only if we are not
+				 * doing integrity sync. In case of integrity
+				 * sync we have to keep going because someone
+				 * may be concurrently dirtying pages, and we
+				 * might have synced a lot of newly appeared
+				 * dirty pages, but have not synced all of the
+				 * old dirty pages.
+				 */
+				done = 1;
+				break;
 			}
+
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;
 				done = 1;

^ permalink raw reply related	[flat|nested] 18+ messages in thread