linux-kernel.vger.kernel.org archive mirror
* Major mke2fs slowdown (reproducable, bisected)
@ 2007-11-12 18:25 Alexey Dobriyan
  2007-11-12 18:39 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Alexey Dobriyan @ 2007-11-12 18:25 UTC
  To: mel, akpm; +Cc: torvalds, linux-kernel, linux-mm

Cross-compile farm here migrated to .ccache and build dir on separate
disks and now I have a way to blow up .ccache without waiting half an
hour for rm(1) to finish. It's called mke2fs(8).

However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
several fat cross-compile builds. Normally it takes ~11 seconds to
finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
"Bias the placement of kernel pages at lower PFNs" it takes several
minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
I'm pretty sure bisection wasn't screwed up.


Details:

	Core 2 Duo on x86_64, no debugging
	4G RAM
	30G ext2 .ccache partition with noatime
	100G ext2 partition with source tree and build dirs with noatime

I build alpha-allnoconfig, alpha-defconfig and 4 allmodconfigs
(SMP=y/n x DEBUG_KERNEL=y/n).

Right after compilation finishes, free(1) reports more or less the same
picture (VM hackers, please, tell me which info you need):

             total       used       free     shared    buffers     cached
Mem:       4032320    2802604    1229716          0      97160    2424816
-/+ buffers/cache:     280628    3751692
Swap:      7823644          0    7823644

Last steps of build script are:

	umount /home/ad/.ccache
	sudo mkfs.ext2 -m 0 	<=== this is slow

I can prepare standalone script with more affordable x86_64 configs if
needed.


commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Tue Oct 16 01:25:54 2007 -0700

    Bias the placement of kernel pages at lower PFNs
    
    This patch chooses blocks with lower PFNs when placing kernel allocations.
    This is particularly important during fallback in low memory situations to
    stop unmovable pages being placed throughout the entire address space.
    
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 676aec9..e1d87ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -765,6 +765,23 @@ int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
 	return move_freepages(zone, start_page, end_page, migratetype);
 }
 
+/* Return the page with the lowest PFN in the list */
+static struct page *min_page(struct list_head *list)
+{
+	unsigned long min_pfn = -1UL;
+	struct page *min_page = NULL, *page;;
+
+	list_for_each_entry(page, list, lru) {
+		unsigned long pfn = page_to_pfn(page);
+		if (pfn < min_pfn) {
+			min_pfn = pfn;
+			min_page = page;
+		}
+	}
+
+	return min_page;
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 						int start_migratetype)
@@ -795,8 +812,11 @@ retry:
 			if (list_empty(&area->free_list[migratetype]))
 				continue;
 
+			/* Bias kernel allocations towards low pfns */
 			page = list_entry(area->free_list[migratetype].next,
 					struct page, lru);
+			if (unlikely(start_migratetype != MIGRATE_MOVABLE))
+				page = min_page(&area->free_list[migratetype]);
 			area->nr_free--;
 
 			/*


* Re: Major mke2fs slowdown (reproducable, bisected)
  2007-11-12 18:25 Major mke2fs slowdown (reproducable, bisected) Alexey Dobriyan
@ 2007-11-12 18:39 ` Linus Torvalds
  2007-11-12 21:34   ` Alexey Dobriyan
  2007-11-13 16:25 ` Andi Kleen
  2007-11-13 16:54 ` Mel Gorman
  2 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2007-11-12 18:39 UTC
  To: Alexey Dobriyan; +Cc: mel, akpm, linux-kernel, linux-mm



On Mon, 12 Nov 2007, Alexey Dobriyan wrote:
>
> Cross-compile farm here migrated to .ccache and build dir on separate
> disks and now I have a way to blow up .ccache without waiting half an
> hour for rm(1) to finish. It's called mke2fs(8).
> 
> However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
> several fat cross-compile builds. Normally it takes ~11 seconds to
> finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
> "Bias the placement of kernel pages at lower PFNs" it takes several
> minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
> I'm pretty sure bisection wasn't screwed up.

Can you (just to make sure) do a "git revert" of this commit on top of the 
current tree, and verify that that makes it all work fine again too? If 
so, let's just revert it.

I just want to make sure that there isn't some subtle interaction with 
anything else in there.

		Linus


* Re: Major mke2fs slowdown (reproducable, bisected)
  2007-11-12 18:39 ` Linus Torvalds
@ 2007-11-12 21:34   ` Alexey Dobriyan
  2007-11-12 22:15     ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Alexey Dobriyan @ 2007-11-12 21:34 UTC
  To: Linus Torvalds; +Cc: mel, akpm, linux-kernel, linux-mm

On Mon, Nov 12, 2007 at 10:39:15AM -0800, Linus Torvalds wrote:
> On Mon, 12 Nov 2007, Alexey Dobriyan wrote:
> >
> > Cross-compile farm here migrated to .ccache and build dir on separate
> > disks and now I have a way to blow up .ccache without waiting half an
> > hour for rm(1) to finish. It's called mke2fs(8).
> > 
> > However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
> > several fat cross-compile builds. Normally it takes ~11 seconds to
> > finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
> > "Bias the placement of kernel pages at lower PFNs" it takes several
> > minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
> > I'm pretty sure bisection wasn't screwed up.
> 
> Can you (just to make sure) do a "git revert" of this commit on top of the 
> current tree, and verify that that makes it all work fine again too? If 
> so, let's just revert it.
> 
> I just want to make sure that there isn't some subtle interaction with 
> anything else in there.

OK, with 2.6.24-rc2-6e800af233e0bdf108efb7bd23c11ea6fa34cdeb
mkfs took 4m6.915s. With just the "lower PFNs" patch reverted it's
back to 10 seconds.


* Re: Major mke2fs slowdown (reproducable, bisected)
  2007-11-12 21:34   ` Alexey Dobriyan
@ 2007-11-12 22:15     ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2007-11-12 22:15 UTC
  To: Alexey Dobriyan; +Cc: mel, akpm, linux-kernel, linux-mm



On Tue, 13 Nov 2007, Alexey Dobriyan wrote:
> 
> OK, with 2.6.24-rc2-6e800af233e0bdf108efb7bd23c11ea6fa34cdeb
> mkfs took 4m6.915s. With just the "lower PFNs" patch reverted it's
> back to 10 seconds.

Ok, I reverted it. Thanks for double-checking.

Mel, I guess it's in your court.

		Linus


* Re: Major mke2fs slowdown (reproducable, bisected)
  2007-11-12 18:25 Major mke2fs slowdown (reproducable, bisected) Alexey Dobriyan
  2007-11-12 18:39 ` Linus Torvalds
@ 2007-11-13 16:25 ` Andi Kleen
  2007-11-14 11:25   ` Mel Gorman
  2007-11-13 16:54 ` Mel Gorman
  2 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2007-11-13 16:25 UTC
  To: Alexey Dobriyan; +Cc: mel, akpm, torvalds, linux-kernel, linux-mm

Alexey Dobriyan <adobriyan@gmail.com> writes:
>  
> +/* Return the page with the lowest PFN in the list */
> +static struct page *min_page(struct list_head *list)
> +{
> +	unsigned long min_pfn = -1UL;
> +	struct page *min_page = NULL, *page;;
> +
> +	list_for_each_entry(page, list, lru) {
> +		unsigned long pfn = page_to_pfn(page);
> +		if (pfn < min_pfn) {
> +			min_pfn = pfn;
> +			min_page = page;
> +		}
> +	}
> +
> +	return min_page;
> +}
> +
>  /* Remove an element from the buddy allocator from the fallback list */
>  static struct page *__rmqueue_fallback(struct zone *zone, int order,
>  						int start_migratetype)
> @@ -795,8 +812,11 @@ retry:
>  			if (list_empty(&area->free_list[migratetype]))
>  				continue;
>  
> +			/* Bias kernel allocations towards low pfns */
>  			page = list_entry(area->free_list[migratetype].next,
>  					struct page, lru);
> +			if (unlikely(start_migratetype != MIGRATE_MOVABLE))
> +				page = min_page(&area->free_list[migratetype]);

Do I misread this, or does it really turn the O(1) buddy allocation into
a "search whole free list" algorithm?  Even as fallback that looks like
a quite extreme thing to do.
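
To make the cost difference concrete, here is a minimal userspace sketch,
not kernel code. struct fake_page, pick_head() and pick_min_pfn() are
hypothetical names invented for illustration; only the shape of the scan
follows min_page() in the patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a free page: only the PFN and a list link. */
struct fake_page {
	unsigned long pfn;
	struct fake_page *next;
};

/* O(1): take whatever sits at the head of the free list. */
static struct fake_page *pick_head(struct fake_page *head)
{
	return head;
}

/*
 * O(n): walk the whole list to find the lowest PFN, which is the shape
 * of the scan min_page() performs. *visited reports how many entries
 * were examined.
 */
static struct fake_page *pick_min_pfn(struct fake_page *head, size_t *visited)
{
	struct fake_page *min = NULL;
	size_t n = 0;

	assert(visited != NULL);
	for (struct fake_page *p = head; p != NULL; p = p->next) {
		n++;
		if (min == NULL || p->pfn < min->pfn)
			min = p;
	}
	*visited = n;
	return min;
}
```

pick_head() costs the same regardless of list length, while pick_min_pfn()
touches every entry, so its cost grows with the number of free blocks on
that list.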

-Andi


* Re: Major mke2fs slowdown (reproducable, bisected)
  2007-11-12 18:25 Major mke2fs slowdown (reproducable, bisected) Alexey Dobriyan
  2007-11-12 18:39 ` Linus Torvalds
  2007-11-13 16:25 ` Andi Kleen
@ 2007-11-13 16:54 ` Mel Gorman
  2007-11-14 11:17   ` Mel Gorman
  2 siblings, 1 reply; 8+ messages in thread
From: Mel Gorman @ 2007-11-13 16:54 UTC
  To: Alexey Dobriyan; +Cc: akpm, torvalds, linux-kernel, linux-mm, apw

Hi Alexey,

On (12/11/07 21:25), Alexey Dobriyan didst pronounce:
> Cross-compile farm here migrated to .ccache and build dir on separate
> disks and now I have a way to blow up .ccache without waiting half an
> hour for rm(1) to finish. It's called mke2fs(8).
> 
> However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
> several fat cross-compile builds. Normally it takes ~11 seconds to
> finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
> "Bias the placement of kernel pages at lower PFNs" it takes several
> minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
> I'm pretty sure bisection wasn't screwed up.
> 
> 
> Details:
> 
> 	Core 2 Duo on x86_64, no debugging
> 	4G RAM
> 	30G ext2 .ccache partition with noatime
> 	100G ext2 partition with source tree and build dirs with noatime
> 
> I build alpha-allnoconfig, alpha-defconfig and 4 allmodconfigs
> (SMP=y/n x DEBUG_KERNEL=y/n).
> 
> Right after compilation finishes, free(1) reports more or less the same
> picture (VM hackers, please, tell me which info you need):
> 
>              total       used       free     shared    buffers     cached
> Mem:       4032320    2802604    1229716          0      97160    2424816
> -/+ buffers/cache:     280628    3751692
> Swap:      7823644          0    7823644
> 
> Last steps of build script are:
> 
> 	umount /home/ad/.ccache
> 	sudo mkfs.ext2 -m 0 	<=== this is slow
> 

Thanks very much for the report and the bisect. I spent the day trying
to reproduce it but I'm having trouble seeing the same problem using just
mke2fs. I've tried

Pentium III x86 machine with 1GB of RAM, 9GB partition
4-way Opteron with 8GB RAM, 10GB partition
2-way Opteron with 2GB RAM, 10GB partition
Pentium D (dual core) with 2GB RAM, 128GB partition

In all cases, the results for 2.6.23, latest git and latest git with the
patch reverted were the same. For example, on the Pentium D, I got

2.6.23:			95.672 real, 0.068 user, 10.334 sys
2.6.24-rc2-git:		96.112 real, 0.08 user, 10.664 sys
2.6.24-rc2-revert:	96.182 real, 0.072 user, 10.602 sys

This is an average of 5 runs on a 128GB partition. Somewhat unexpectedly,
the revert was fractionally slower. The deviation between runs was around
the 0.4 second mark so the differences appear to be in the noise.

On the other machines, the reverted version was slightly faster, but the
difference was about 0.5% of overall running time, not the massive differences
you were seeing.  Clearly there is still a problem because reverting the patch
fixes your problem.... As I write this, it occurs to me that it might be
because your compile-job has created very long free-lists and searching them
is causing problems.

Can you post the contents of /proc/buddyinfo before and after you run
mke2fs? It will give an indication of how long the linked lists being
searched are. After I push send here, I'll be trying the tests after running
compile-tests similar to yours to see if that reproduces the problem.
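
For anyone who wants to compare the before/after output mechanically, the
following is a rough sketch of a parser for one /proc/buddyinfo line. It
assumes the usual "Node N, zone NAME count count ..." layout, where column
k after the zone name is the number of free blocks of order k. The function
name parse_buddyinfo_line() is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse one /proc/buddyinfo line of the assumed form
 *   "Node 0, zone   Normal   12   7   3 ..."
 * and store the per-order free block counts in counts[0..max-1].
 * Returns the number of orders parsed, or -1 if the prefix does not match.
 */
static int parse_buddyinfo_line(const char *line, unsigned long *counts, int max)
{
	int consumed = 0, n = 0;
	char zone[32];

	assert(line != NULL && counts != NULL);
	/* %*d skips the node number; %n records where the counts begin. */
	if (sscanf(line, "Node %*d, zone %31s%n", zone, &consumed) != 1)
		return -1;

	const char *p = line + consumed;
	while (n < max) {
		char *end;
		unsigned long v = strtoul(p, &end, 10);

		if (end == p)	/* no more numbers on this line */
			break;
		counts[n++] = v;
		p = end;
	}
	return n;
}
```

Feeding it each line read from /proc/buddyinfo before and after mke2fs
gives the length of every per-order free list, which bounds how many
entries a search of that list would have to visit.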

Here are some other questions I hope you can answer just to eliminate them
as possibilities. Can you tell me what sort of disk driver you are using
(results here are for sata_nv)? Are you using RAID or MD? Is anything running
in the background while mke2fs is running? What is the output of mke2fs -V,
gcc --version and ld -v? Finally, can you mail me your .config and I'll try
it on my machine here.

In the meantime, it is safe to revert this patch. Andy Whitcroft tested
the behaviour of anti-fragmentation on a number of machines and while the
results are adversely affected in terms of hugepage allocation success rates,
they are still pretty decent. We will investigate a less expensive way of
achieving the same effect of the patch without the potentially long searches.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: Major mke2fs slowdown (reproducable, bisected)
  2007-11-13 16:54 ` Mel Gorman
@ 2007-11-14 11:17   ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2007-11-14 11:17 UTC
  To: Alexey Dobriyan; +Cc: akpm, torvalds, linux-kernel, linux-mm, apw

On (13/11/07 16:54), Mel Gorman didst pronounce:

> fixes your problem.... As I write this, it occurs to me that it might be
> because your compile-job has created very long free-lists and searching them
> is causing problems.

This indeed did appear to be the problem. When a basic compile-job was
run first, the mke2fs times on a vanilla kernel were

119.96 real, 0.08 user, 19.85 sys
194.91 real, 0.11 user, 24.71 sys
102.24 real, 0.08 user, 10.47 sys
104.66 real, 0.08 user, 10.45 sys
100.77 real, 0.13 user, 10.54 sys

and with the patch reverted were

121.83 real, 0.15 user, 11.16 sys
126.68 real, 0.10 user, 10.78 sys
104.47 real, 0.09 user, 10.48 sys
104.75 real, 0.10 user, 10.66 sys
106.06 real, 0.09 user, 10.55 sys

The initially high sys times are due to the number of fallbacks that
take place, as creating a filesystem creates an unusually large number of
pinned pages for a short period of time. The search times, usually
unmeasurable, then show up in the profiles.
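
As a quick arithmetic check on the sys columns above, here is a small
sketch that averages the five posted samples per kernel. The array and
function names are invented for the example; the numbers are copied from
the two tables.

```c
#include <assert.h>
#include <stddef.h>

/* sys times in seconds, copied from the five runs posted above */
const double vanilla_sys[5]  = { 19.85, 24.71, 10.47, 10.45, 10.54 };
const double reverted_sys[5] = { 11.16, 10.78, 10.48, 10.66, 10.55 };

/* Arithmetic mean of n samples; n must be non-zero. */
static double mean(const double *samples, size_t n)
{
	double sum = 0.0;

	assert(samples != NULL && n > 0);
	for (size_t i = 0; i < n; i++)
		sum += samples[i];
	return sum / (double)n;
}
```

mean(vanilla_sys, 5) comes to roughly 15.20 seconds and
mean(reverted_sys, 5) to roughly 10.73 seconds. Nearly all of that gap
comes from the first two vanilla runs; the last three are within noise of
the reverted kernel.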

Reverting the patch is still the right solution. We will find an alternative
way of biasing the placement of pages without the expensive search.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: Major mke2fs slowdown (reproducable, bisected)
  2007-11-13 16:25 ` Andi Kleen
@ 2007-11-14 11:25   ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2007-11-14 11:25 UTC
  To: Andi Kleen; +Cc: Alexey Dobriyan, akpm, torvalds, linux-kernel, linux-mm

On (13/11/07 17:25), Andi Kleen didst pronounce:
> Alexey Dobriyan <adobriyan@gmail.com> writes:
> >  
> > +/* Return the page with the lowest PFN in the list */
> > +static struct page *min_page(struct list_head *list)
> > +{
> > +	unsigned long min_pfn = -1UL;
> > +	struct page *min_page = NULL, *page;;
> > +
> > +	list_for_each_entry(page, list, lru) {
> > +		unsigned long pfn = page_to_pfn(page);
> > +		if (pfn < min_pfn) {
> > +			min_pfn = pfn;
> > +			min_page = page;
> > +		}
> > +	}
> > +
> > +	return min_page;
> > +}
> > +
> >  /* Remove an element from the buddy allocator from the fallback list */
> >  static struct page *__rmqueue_fallback(struct zone *zone, int order,
> >  						int start_migratetype)
> > @@ -795,8 +812,11 @@ retry:
> >  			if (list_empty(&area->free_list[migratetype]))
> >  				continue;
> >  
> > +			/* Bias kernel allocations towards low pfns */
> >  			page = list_entry(area->free_list[migratetype].next,
> >  					struct page, lru);
> > +			if (unlikely(start_migratetype != MIGRATE_MOVABLE))
> > +				page = min_page(&area->free_list[migratetype]);
> 
> Do I misread this, or does it really turn the O(1) buddy allocation into
> a "search whole free list" algorithm?  Even as fallback that looks like
> a quite extreme thing to do.
> 

It's extreme but not *quite* as extreme as you imply. The whole free-lists are
not searched, just one set at a specific order so it's "search a portion of
the free-lists". It happens for non-movable allocations (usually the minority)
and only then in fallback (in itself quite rare in almost all cases I've seen).

The problem was not detected before by me because it wasn't just a case of
creating a large number of pinned allocations but also depended on the type
of workload preceding it. If mke2fs was long-lived, it might not even have
been noticed. When run more than once, the fallbacks have all been dealt
with and it goes back to normal times.

The patch is now reverted and I don't expect to try bringing it back.
There are ways to bias the placement of pages as the patch intended without
doing an expensive search.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

