* Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
@ 2009-12-07 14:39 Christian Ehrhardt
  2009-12-07 15:09 ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2009-12-07 14:39 UTC (permalink / raw)
  To: Mel Gorman, Narayananu Gopalakrishnan, KAMEZAWA Hiroyuki,
	Andrew Morton, linux-kernel
  Cc: epasch, SCHILLIG, Martin Schwidefsky, Heiko Carstens, christof.schmitt

Hi,
I tracked a huge performance regression for a while and got it bisected 
down to commit "e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4 - 
page-allocator: preserve PFN ordering when __GFP_COLD is set".

The scenario I'm running is a low-memory system (256M total) that does 
sequential I/O with parallel iozone processes.
One process per disk, each process reading a 2GB file. The disks I use 
are FCP SCSI disks attached to an s390 host; the file system is ext2.

The regression appears as up to a 76% loss in throughput at 16 processes 
(processes are scaled from 1 to 64, performance is bad everywhere - 16 
is just the peak - the average loss is about 40% throughput).
I already know that giving the system just a bit (~64M+) more memory 
solves the issue almost completely, probably because there is almost no 
"memory pressure" left in that case.
I also know that using direct I/O instead of I/O through the page cache 
doesn't have the problem at all.
Comparing sysstat data taken while running with kernels just with & 
without the bisected patch shows nothing obvious, except that I/O seems 
to take much longer (lower interrupt rate etc.).

The patch alone looks very reasonable, so I'd prefer to understand and 
fix the real issue instead of eventually getting it reverted because 
this regression is larger than the one it was intended to fix.
From the patch it is clear that hot pages (cold==0) freed in rmqueue_bulk 
should behave as before. So maybe the question is: "are our pages cold 
while they shouldn't be"?
I don't really have a clue yet how patch e084b exactly causes that big 
regression - ideas welcome :-)

Kind regards,
Christian


P.S.: The original patch is concise enough to include here as a 
reference without bloating the mail too much:
commit e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Wed Jul 29 15:02:04 2009 -0700

    page-allocator: preserve PFN ordering when __GFP_COLD is set
   
    Fix a post-2.6.24 performance regression caused by
    3dfa5721f12c3d5a441448086bee156887daa961 ("Page allocator: get rid of
    the list of cold pages").
   
    Narayanan reports "The regression is around 15%.  There is no disk
    controller as our setup is based on Samsung OneNAND used as a memory
    mapped device on a OMAP2430 based board."
   
    The page allocator tries to preserve contiguous PFN ordering when
    returning pages such that repeated callers to the allocator have a
    strong chance of getting physically contiguous pages, particularly
    when external fragmentation is low.  However, if the bulk of the
    allocations have __GFP_COLD set, as they are due to aio_read() for
    example, then the PFNs are in reverse PFN order.  This can cause
    performance degradation when used with IO controllers that could
    have merged the requests.
   
    This patch attempts to preserve the contiguous ordering of PFNs for
    users of __GFP_COLD.
   
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Reported-by: Narayananu Gopalakrishnan <narayanan.g@samsung.com>
    Tested-by: Narayanan Gopalakrishnan <narayanan.g@samsung.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: <stable@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index caa9268..ae28c22 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -882,7 +882,7 @@ retry_reserve:
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
             unsigned long count, struct list_head *list,
-            int migratetype)
+            int migratetype, int cold)
 {
     int i;
    
@@ -901,7 +901,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
          * merge IO requests if the physical pages are ordered
          * properly.
          */
-        list_add(&page->lru, list);
+        if (likely(cold == 0))
+            list_add(&page->lru, list);
+        else
+            list_add_tail(&page->lru, list);
         set_page_private(page, migratetype);
         list = &page->lru;
     }
@@ -1119,7 +1122,8 @@ again:
         local_irq_save(flags);
         if (!pcp->count) {
             pcp->count = rmqueue_bulk(zone, 0,
-                    pcp->batch, &pcp->list, migratetype);
+                    pcp->batch, &pcp->list,
+                    migratetype, cold);
             if (unlikely(!pcp->count))
                 goto failed;
         }
@@ -1138,7 +1142,8 @@ again:
         /* Allocate more to the pcp list if necessary */
         if (unlikely(&page->lru == &pcp->list)) {
             pcp->count += rmqueue_bulk(zone, 0,
-                    pcp->batch, &pcp->list, migratetype);
+                    pcp->batch, &pcp->list,
+                    migratetype, cold);
             page = list_entry(pcp->list.next, struct page, lru);
         }
 


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization



* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-07 14:39 Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set" Christian Ehrhardt
@ 2009-12-07 15:09 ` Mel Gorman
  2009-12-08 17:59   ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2009-12-07 15:09 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Narayananu Gopalakrishnan, KAMEZAWA Hiroyuki, Andrew Morton,
	linux-kernel, epasch, SCHILLIG, Martin Schwidefsky,
	Heiko Carstens, christof.schmitt

On Mon, Dec 07, 2009 at 03:39:49PM +0100, Christian Ehrhardt wrote:
> Hi,
> I tracked a huge performance regression for a while and got it bisected  
> down to commit "e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4 -  
> page-allocator: preserve PFN ordering when __GFP_COLD is set".
>

Darn. That is related to IO controllers being able to automatically merge
requests. The problem it was fixing was that pages were arriving in reverse
PFN order, the controller was unable to merge and performance was impaired. Any
controller that can merge should be faster as a result of the patch.

> The scenario I'm running is a low memory system (256M total), that does  
> sequential I/O with parallel iozone processes.
> One process per disk, each process reading a 2Gb file. The disks I use  
> are fcp scsi disks attached to a s390 host. File system is ext2.
>

I don't know what controller is in use there but does it
opportunistically merge requests if they are on physically contiguous
pages? If so, can it be disabled?

> The regression appears as up to 76% loss in throughput at 16 processes  
> (processes are scaled from 1 to 64, performance is bad everywhere - 16  
> is just the peak - avg loss is about 40% throughput).
> I already know that giving the system just a bit (~64m+) more memory  
> solves the issue almost completely, probably because there is almost no  
> "memory pressure" left in that cases.
> I also know that using direct-I/O instead of I/O through page cache  
> doesn't have the problem at all.

This makes sense because it's a sequential read load, so readahead is a
factor and that is why __GFP_COLD is used - the data is not for
immediate use so doesn't need to be cache hot.
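
For reference, the readahead/page-cache read path requests cache-cold
pages explicitly. A minimal sketch of the helper involved (paraphrased
from the 2.6.31-era include/linux/pagemap.h, illustrative rather than a
verbatim quote):

	/* page-cache allocations for readahead ask for cache-cold pages */
	static inline struct page *page_cache_alloc_cold(struct address_space *x)
	{
		return __page_cache_alloc(mapping_gfp_mask(x) | __GFP_COLD);
	}

do_generic_file_read() ends up in this helper whenever the wanted page is
not yet in the page cache, so in this workload nearly every allocation
carries __GFP_COLD.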

> Comparing sysstat data taken while running with the kernels just with &  
> without the bisected patch shows nothing obvious except that I/O seems  
> to take much longer (lower interrupt ratio etc).
>

Maybe the controller is spending an age trying to merge requests because
it should be able to but takes a long time figuring it out?

> The patch alone looks very reasonable, so I'd prefer understanding and  
> fixing the real issue instead of getting it eventually reverted due to  
> this regression being larger than the one it was intended to fix.
> In the patch it is clear that hot pages (cold==0) freed in rmqueue_bulk  
> should behave like before. So maybe the question is "are our pages cold  
> while they shouldn't be"?
> Well I don't really have a clue yet to explain how patch e084b exactly  
> causes that big regression, ideas welcome :-)
>

Only theory I have at the moment is that the controller notices it can
merge requests and either spends a long time figuring out how to do the
merging or performs worse with merged IO requests.

If the problem is in the driver, oprofile might show where the problem lies.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-07 15:09 ` Mel Gorman
@ 2009-12-08 17:59   ` Christian Ehrhardt
  2009-12-10 14:36     ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2009-12-08 17:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Narayananu Gopalakrishnan, KAMEZAWA Hiroyuki, Andrew Morton,
	linux-kernel, epasch, SCHILLIG, Martin Schwidefsky,
	Heiko Carstens, christof.schmitt

Mel Gorman wrote:
> On Mon, Dec 07, 2009 at 03:39:49PM +0100, Christian Ehrhardt wrote:
>   
> [...]
>
> I don't know what controller is in use there but does it
> opportunistically merge requests if they are on physically contiguous
> pages? If so, can it be disabled?
>   
As far as I could determine, our controllers don't support such 
opportunistic merging.
>> The regression appears as up to 76% loss in throughput at 16 processes  
>> (processes are scaled from 1 to 64, performance is bad everywhere - 16  
>> is just the peak - avg loss is about 40% throughput).
>> I already know that giving the system just a bit (~64m+) more memory  
>> solves the issue almost completely, probably because there is almost no  
>> "memory pressure" left in that cases.
>> I also know that using direct-I/O instead of I/O through page cache  
>> doesn't have the problem at all.
>>     
>
> This makes sense because it's a sequentual read load, so readahead is a
> factor and that is why __GFP_COLD is used - the data is not for
> immediate use so doesn't need to be cache hot.
>   
In the meantime I was able to verify that this also applies to random 
reads, which are still reads but issue zero readahead requests.
I attached more regression data in the postscript at the end of the mail.
>   
>> Comparing sysstat data taken while running with the kernels just with &  
>> without the bisected patch shows nothing obvious except that I/O seems  
>> to take much longer (lower interrupt ratio etc).
>>
>>     
>
> Maybe the controller is spending an age trying to merge requests because
> it should be able to but takes a long time figuring it out?
>   
I thought that too, but now comes the funny part.
I gathered HW statistics from our I/O controllers, and the latency 
statistics clearly show that your patch is working as intended - the 
latency from entering the controller until the interrupt to the Linux 
device driver is ~30% lower!
Remember, as I stated above, that they don't support that opportunistic 
merging, so I will have some fun finding out why it is faster in HW now :-)

>> The patch alone looks very reasonable, so I'd prefer understanding and  
>> fixing the real issue instead of getting it eventually reverted due to  
>> this regression being larger than the one it was intended to fix.
>> In the patch it is clear that hot pages (cold==0) freed in rmqueue_bulk  
>> should behave like before. So maybe the question is "are our pages cold  
>> while they shouldn't be"?
>> Well I don't really have a clue yet to explain how patch e084b exactly  
>> causes that big regression, ideas welcome :-)
>>
>>     
>
> Only theory I have at the moment is that the controller notices it can
> merge requests and either spends a long time figuring out how to do the
> merging or performs worse with merged IO requests.
>
> If the problem is in the driver, oprofile might show where the problem lies
With the effective throughput dropping by such a large amount while 
hardware latency improves by 30%, I agree and suspect the issue is in 
the driver.
I'll do some research to break down where in our drivers time is lost 
and reply here for advice and comments regarding what general memory 
management could/should/might do.

Kind regards,
Christian

P.S.
FYI, a bit more regression data: now that I had the patch identified, I 
made a full regression test scaling from 1 to 64 processes, comparing 
the kernel just without / with commit e084b.
It surprised me, but obviously random read is also suffering from that 
patch.

Sequential Read
Procs Deviation in %
1              -4.9
2               5.2
4             -82.6
8             -65.6
16            -44.2
32            -30.0
64            -37.7

Random Read
Procs Deviation in %
1             -48.3
2             -45.7
4             -50.5
8             -59.0
16            -61.8
32            -48.3
64            -21.0

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 



* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-08 17:59   ` Christian Ehrhardt
@ 2009-12-10 14:36     ` Christian Ehrhardt
  2009-12-11 11:20       ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2009-12-10 14:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Narayananu Gopalakrishnan, KAMEZAWA Hiroyuki, Andrew Morton,
	linux-kernel, epasch, SCHILLIG, Martin Schwidefsky,
	Heiko Carstens, christof.schmitt, thoss

Keeping the old discussion in the mail tail and adding the new information 
up here where everyone finds it :-)

Summarizing the things I was able to confirm so far:
- The controller doesn't care about PFN ordering in any way (proven by 
HW statistics)
- The regression appears in sequential AND random workloads -> so also 
without readahead
- oprofile & co are no option at the moment.
  The effective consumed cpu cycles per transferred kb are almost the 
same, so I would not expect sampling to give us huge insights.
  Therefore I expect that it is more a matter of lost time (latency) 
than of more expensive tasks (cpu consumption).
  I don't want to preclude it completely, but sampling has to wait as 
long as we have better tracks to follow.

So the question is where the time is lost in Linux. I used blktrace to 
create latency summaries.
I only list the random case for discussion as the effects are clearer 
in that data.
Abbreviations are (as the blkparse man page explains them) - sorted in 
the order they would appear per request:
       A -- remap: For stacked devices, incoming I/O is remapped to the 
device below it in the I/O stack. The remap action details what exactly 
is being remapped to what.
       G -- get request: To send any type of request to a block device, a 
struct request container must be allocated first.
       I -- inserted: A request is being sent to the I/O scheduler for 
addition to the internal queue and later service by the driver. The 
request is fully formed at this time.
       D -- issued: A request that previously resided on the block layer 
queue or in the I/O scheduler has been sent to the driver.
       C -- complete: A previously issued request has been completed. The 
output will detail the sector and size of that request, as well as its 
success or failure.

The following table shows the average latencies from A to G, G to I and 
so on.
C2A is special and tries to summarize how long it takes after completing 
an I/O until the next one arrives in the block device layer.

                      avg-A2G   avg-G2I   avg-I2D   avg-D2C   avg-C2A-in-avg+-stddev   %C2A-in-avg+-stddev
deviation good->bad    -3.48%    -0.56%    -1.57%    -1.31%                  128.69%               97.26%

It clearly shows that all latencies, once the block device layer and 
device driver are involved, are almost equal. Remember that the 
throughput of the good vs. the bad case differs by more than 3x.
But we can also see that the value of C2A increases by a huge amount.
That huge C2A increase lets me assume that the time is actually lost 
"above" the block device layer.

I don't expect the execution speed of iozone as a user process itself to 
be affected by commit e084b, so the question is where the time is lost 
between the "read" issued by iozone and entering the block device layer.
Actually I expect it somewhere in the area of getting a page cache page 
for the I/O: on one hand page handling is what commit e084b changes, and 
on the other hand pages are under pressure (sysstat vm effectiveness 
~100%, >40% scanned directly in both cases).

I'll continue hunting down the lost time - maybe with ftrace, if it 
doesn't conceal the effect through its invasiveness. Any further 
ideas/comments welcome.

kind regards,
Christian

Christian Ehrhardt wrote:
> Mel Gorman wrote:
>> On Mon, Dec 07, 2009 at 03:39:49PM +0100, Christian Ehrhardt wrote:
>>   [...]
>>
>> I don't know what controller is in use there but does it
>> opportunistically merge requests if they are on physically contiguous
>> pages? If so, can it be disabled?
>>   
> As far as i could clarify it our controllers don't support such a 
> opportunistic merging.
>>> The regression appears as up to 76% loss in throughput at 16 
>>> processes  (processes are scaled from 1 to 64, performance is bad 
>>> everywhere - 16  is just the peak - avg loss is about 40% throughput).
>>> I already know that giving the system just a bit (~64m+) more 
>>> memory  solves the issue almost completely, probably because there 
>>> is almost no  "memory pressure" left in that cases.
>>> I also know that using direct-I/O instead of I/O through page cache  
>>> doesn't have the problem at all.
>>>     
>>
>> This makes sense because it's a sequentual read load, so readahead is a
>> factor and that is why __GFP_COLD is used - the data is not for
>> immediate use so doesn't need to be cache hot.
>>   
> In the meanwhile I was able to verify that this also applies to random 
> reads which are still reads but have zero read ahead requests.
> I attached more regression data in the post scriptum at the end of the 
> mail.
>>  
>>> Comparing sysstat data taken while running with the kernels just 
>>> with &  without the bisected patch shows nothing obvious except that 
>>> I/O seems  to take much longer (lower interrupt ratio etc).
>>>
>>>     
>>
>> Maybe the controller is spending an age trying to merge requests because
>> it should be able to but takes a long time figuring it out?
>>   
> I thought that too, but now comes the funny part.
> I gathered HW statistics from our I/O controllers and latency 
> statistics clearly state that your patch is working as intended - the 
> latency from entering the controller until the interrupt to linux 
> device driver is ~30% lower!.
> Remember as I stated above that they don't support that opportunistic 
> merging so I will have some fun finding out why it is faster in HW now 
> :-)
>
>>> The patch alone looks very reasonable, so I'd prefer understanding 
>>> and  fixing the real issue instead of getting it eventually reverted 
>>> due to  this regression being larger than the one it was intended to 
>>> fix.
>>> In the patch it is clear that hot pages (cold==0) freed in 
>>> rmqueue_bulk  should behave like before. So maybe the question is 
>>> "are our pages cold  while they shouldn't be"?
>>> Well I don't really have a clue yet to explain how patch e084b 
>>> exactly  causes that big regression, ideas welcome :-)
>>>
>>>     
>>
>> Only theory I have at the moment is that the controller notices it can
>> merge requests and either spends a long time figuring out how to do the
>> merging or performs worse with merged IO requests.
>>
>> If the problem is in the driver, oprofile might show where the 
>> problem lies
> With the effective throughput dropping by such a large amount while 
> hardware latency improves by 30% I agree and suggest the issue is in 
> the driver.
> I'll do some research in breaking down where in our drivers time is 
> lost and reply here for advises and comments in regard to what general 
> memory management could/should/might do.
>
> Kind regards,
> Christian
>
> p.s.
> FYI a bit more regression data, now I had the patch identified I made 
> a full regression test scaling from 1 to 64 processes.
> Comparing just without / with the commit e084b
> I wondered, but obviously random read is also suffering from that patch.
>
> Sequential Read
> Procs Deviation in %
> 1              -4.9
> 2               5.2
> 4             -82.6
> 8             -65.6
> 16            -44.2
> 32            -30.0
> 64            -37.7
>
> Random Read
> Procs Deviation in %
> 1             -48.3
> 2             -45.7
> 4             -50.5
> 8             -59.0
> 16            -61.8
> 32            -48.3
> 64            -21.0
>


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 



* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-10 14:36     ` Christian Ehrhardt
@ 2009-12-11 11:20       ` Mel Gorman
  2009-12-11 14:47         ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2009-12-11 11:20 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Narayananu Gopalakrishnan, KAMEZAWA Hiroyuki, Andrew Morton,
	linux-kernel, epasch, SCHILLIG, Martin Schwidefsky,
	Heiko Carstens, christof.schmitt, thoss

On Thu, Dec 10, 2009 at 03:36:04PM +0100, Christian Ehrhardt wrote:
> Keeping the old discussion in the mail tail, adding the new information  
> up here were everyone finds them :-)
>
> Things I was able to confirm so far summarized:
> - The controller doesn't care about pfn ordering in any way (proved by  
> HW statistics)
> - regression appears in sequential AND random workloads -> also without  
> readahead
> - oprofile & co are no option atm.
>  The effective consumed cpu cycles per transferred kb are almost the  
> same so I would not expect sampling would give us huge insights.
>  Therefore I expect that it is more a matter of lost time (latency) than 
> more expensive tasks (cpu consumption)

But earlier, you said that latency was lower - "latency statistics clearly
state that your patch is working as intended - the latency from entering
the controller until the interrupt to linux device driver is ~30% lower!."

Also, if the controller is doing no merging of IO requests, why is the
interrupt rate lower?

>  I don't want to preclude it completely, but sampling has to wait as  
> long as we have better tracks to follow
>
> So the question is where time is lost in Linux. I used blktrace to  
> create latency summaries.
> I only list the random case for discussion as the effects are more clear  
> int hat data.
> Abbreviations are (like the blkparse man page explains) - sorted in  
> order it would appear per request:
>       A -- remap For stacked devices, incoming i/o is remapped to device 
> below it in the i/o stack. The remap action details what exactly is being 
> remapped to what.
>       G -- get request To send any type of request to a block device, a  
> struct request container must be allocated first.
>       I -- inserted A request is being sent to the i/o scheduler for  
> addition to the internal queue and later service by the driver. The  
> request is fully formed at this time.
>       D -- issued A request that previously resided on the block layer  
> queue or in the i/o scheduler has been sent to the driver.
>       C -- complete A previously issued request has been completed.  The 
> output will detail the sector and size of that request, as well as the 
> success or failure of it.
>
> The following table shows the average latencies from A to G, G to I and  
> so on.
> C2A is special and tries to summarize how long it takes after completing  
> an I/O until the next one arrives in the block device layer.
>
>                     avg-A2G    avg-G2I    avg-I2D   avg-D2C    avg-C2A-in-avg+-stddev    %C2A-in-avg+-stddev
> deviation good->bad    -3.48%    -0.56%    -1.57%    -1.31%         128.69%                 97.26%
>
> It clearly shows that all latencies once block device layer and device  
> driver are involved are almost equal. Remember that the throughput of  
> good vs. bad case is more than x3.
> But we can also see that the value of C2A increases by a huge amount.  
> That huge C2A increase let me assume that the time is actually lost  
> "above" the block device layer.
>

To be clear. As C is "completion" and "A" is remapping new IO, it
implies that time is being lost between when one IO completes and
another starts, right?

> I don't expect the execution speed of iozone as user process itself is  
> affected by commit e084b,

Not by this much anyway. Let's say cache hotness is a problem; I would
expect some performance loss but not this much.
> so the question is where the time is lost  
> between the "read" issued by iozone and entering the block device layer.
> Actually I expect it somewhere in the area of getting a page cache page  
> for the I/O. On one hand page handling is what commit e084b changes and  
> on the other hand pages are under pressure (systat vm effectiveness  
> ~100%, >40% scanned directly in both cases).
>
> I'll continue hunting down the lost time - maybe with ftrace if it is  
> not concealing the effect by its invasiveness -, any further  
> ideas/comments welcome.
>

One way of finding out if cache hotness was the problem would be to profile
for cache misses and see if there are massive differences with and without
the patch. Is that an option?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-11 11:20       ` Mel Gorman
@ 2009-12-11 14:47         ` Christian Ehrhardt
  2009-12-18 13:38           ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2009-12-11 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Narayananu Gopalakrishnan, KAMEZAWA Hiroyuki, Andrew Morton,
	linux-kernel, epasch, SCHILLIG, Martin Schwidefsky,
	Heiko Carstens, christof.schmitt, thoss

Mel Gorman wrote:
> On Thu, Dec 10, 2009 at 03:36:04PM +0100, Christian Ehrhardt wrote:
>   
>> Keeping the old discussion in the mail tail, adding the new information  
>> up here were everyone finds them :-)
>>
>> Things I was able to confirm so far summarized:
>> - The controller doesn't care about pfn ordering in any way (proved by  
>> HW statistics)
>> - regression appears in sequential AND random workloads -> also without  
>> readahead
>> - oprofile & co are no option atm.
>>  The effective consumed cpu cycles per transferred kb are almost the  
>> same so I would not expect sampling would give us huge insights.
>>  Therefore I expect that it is more a matter of lost time (latency) than 
>> more expensive tasks (cpu consumption)
>>     
>
> But earlier, you said that latency was lower - "latency statistics clearly
> state that your patch is working as intended - the latency from entering
> the controller until the interrupt to linux device driver is ~30% lower!."
>   
That's right, but the pure hardware time is only lower because there is 
less I/O in flight.
As there is less concurrency and contention that way, a single I/O in HW 
is faster.
But in HW it's so fast (in both cases) that, as verified in the Linux 
layers, it doesn't matter.
Both cases take more or less the same time from an I/O entering the 
block device layer until completion.
> Also, if the controller is doing no merging of IO requests, why is the
> interrupt rate lower?
>   
I was wondering about that when I started to work on this too, but the 
answer is simply that there are fewer requests coming in per second - and 
that implies a lower interrupt rate too.
>>  I don't want to preclude it completely, but sampling has to wait as  
>> long as we have better tracks to follow
>>
>> So the question is where time is lost in Linux. I used blktrace to  
>> create latency summaries.
>> I only list the random case for discussion as the effects are more clear  
>> int hat data.
>> Abbreviations are (like the blkparse man page explains) - sorted in  
>> order it would appear per request:
>>       A -- remap For stacked devices, incoming i/o is remapped to device 
>> below it in the i/o stack. The remap action details what exactly is being 
>> remapped to what.
>>       G -- get request To send any type of request to a block device, a  
>> struct request container must be allocated first.
>>       I -- inserted A request is being sent to the i/o scheduler for  
>> addition to the internal queue and later service by the driver. The  
>> request is fully formed at this time.
>>       D -- issued A request that previously resided on the block layer  
>> queue or in the i/o scheduler has been sent to the driver.
>>       C -- complete A previously issued request has been completed.  The 
>> output will detail the sector and size of that request, as well as the 
>> success or failure of it.
>>
>> The following table shows the average latencies from A to G, G to I and  
>> so on.
>> C2A is special and tries to summarize how long it takes after completing  
>> an I/O until the next one arrives in the block device layer.
>>
>>                     avg-A2G    avg-G2I    avg-I2D   avg-D2C    avg-C2A-in-avg+-stddev    %C2A-in-avg+-stddev
>> deviation good->bad    -3.48%    -0.56%    -1.57%    -1.31%         128.69%                 97.26%
>>
>> It clearly shows that all latencies once block device layer and device  
>> driver are involved are almost equal. Remember that the throughput of  
>> good vs. bad case is more than x3.
>> But we can also see that the value of C2A increases by a huge amount.  
>> That huge C2A increase let me assume that the time is actually lost  
>> "above" the block device layer.
>>
>>     
>
> To be clear. As C is "completion" and "A" is remapping new IO, it
> implies that time is being lost between when one IO completes and
> another starts, right?
>
>   
Absolutely correct
>> I don't expect the execution speed of iozone as user process itself is  
>> affected by commit e084b,
>>     
> Not by this much anyway. Lets say cache hotness is a problem, I would
> expect some performance loss but not this much.
>   
I agree; even if it were cache hotness it wouldn't be that much.
And cold caches would appear as "longer" instructions because they would 
need more cycles due to e.g. dcache misses.
But as mentioned before, the cycles per transferred amount of data are 
the same, so I don't expect it is due to cache hot/cold.

>> so the question is where the time is lost  
>> between the "read" issued by iozone and entering the block device layer.
>> Actually I expect it somewhere in the area of getting a page cache page  
>> for the I/O. On one hand page handling is what commit e084b changes and  
>> on the other hand pages are under pressure (systat vm effectiveness  
>> ~100%, >40% scanned directly in both cases).
>>
>> I'll continue hunting down the lost time - maybe with ftrace if it is  
>> not concealing the effect by its invasiveness -, any further  
>> ideas/comments welcome.
>>
>>     
>
> One way of finding out if cache hottness was the problem would be to profile
> for cache misses and see if there are massive differences with and without
> the patch. Is that an option?
>   
It is an option to verify things, but as mentioned above I would expect 
an increased amount of consumed cycles per kb, which I don't see.
I'll track caches anyway to be sure.

My personal current assumption is that either there is some time lost 
between the read syscall and the A event that blktrace tracks, or I'm 
wrong with my assumption about user processes and iozone itself runs 
"slower".

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 



* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-11 14:47         ` Christian Ehrhardt
@ 2009-12-18 13:38           ` Christian Ehrhardt
  2009-12-18 17:42             ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2009-12-18 13:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss

[-- Attachment #1: Type: text/plain, Size: 3437 bytes --]

Christian Ehrhardt wrote:
> Mel Gorman wrote:
> [...]
>>>  
>>
>> One way of finding out if cache hottness was the problem would be to 
>> profile
>> for cache misses and see if there are massive differences with and 
>> without
>> the patch. Is that an option?
>>   
> It is an option to verify things, but as mentioned above I would 
> expect an increase amounts of consumed cycles per kb which I don't see.
> I'll track caches anyway to be sure.
>
> My personal current assumption is that either there is some time lost 
> from the read syscall until the A event blocktrace tracks or I'm wrong 
> with my assumption about user processes and iozone runs "slower".
>
After I was able to verify that the time is lost above the block device 
layer, I ran further tests to check out where the additional latency due 
to commit e084b occurs.
With "strace -frT -e trace=open,close,read,write" I could isolate the 
continuous stream of read calls done by the iozone child process.
That data revealed two things:
a) The time is lost between read and entering the block device layer 
(not e.g. due to slow userspace execution or scheduling effects).
     The time spent in userspace until the next read call is almost the 
same, but the average time spent "inside" the read call increases from 
788µs to 3278µs.
b) The time is lost in a few big peaks.
    The majority of read calls take ~500-750µs in both cases, but there 
is a shift of approximately 8% of the calls that now show up as 
durations between 15000 and 60000µs.

As mentioned before, there is no significant increase in spent cpu cycles 
- therefore I looked at how the read call could end up sleeping 
(latency, but no used cycles).
I found that there is a chance to run into "congestion_wait" in case 
__alloc_pages_direct_reclaim() was able to free some pages with 
try_to_free_pages() but doesn't get a page in the subsequent 
get_page_from_freelist().
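
To illustrate (this is only a rough sketch of the 2.6.31 allocator slow 
path, simplified and paraphrased from mm/page_alloc.c, not a verbatim 
quote - treat details like the exact timeout as assumptions), the flow I 
mean looks roughly like this:

	/* direct reclaim made progress ... */
	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
	if (did_some_progress)
		/* ... but the freed pages may still not be allocatable for us */
		page = get_page_from_freelist(gfp_mask, nodemask, order,
					zonelist, high_zoneidx,
					alloc_flags, preferred_zone,
					migratetype);
	if (!page && should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
		/* progress but no page -> sleep before retrying the allocation */
		congestion_wait(BLK_RW_ASYNC, HZ/50);
		goto rebalance;
	}

So whenever try_to_free_pages() reports progress but the subsequent 
get_page_from_freelist() still comes back empty, the reader ends up 
sleeping in congestion_wait() before the allocation is retried.
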
To get details about that I patched in some simple counters and 
ktime_get based timings. The results are impressive and point exactly 
towards the congestion_wait theory.

The following data is again taken with iozone, 1 thread, random read, in 
a very low memory environment; the counters were zeroed before starting 
iozone.
Bad = the kernel just with commit e084b, good = the kernel one commit 
before e084b. The throughput in the good case while taking this data was 
approximately ~168% of that of the bad case.

                                         GOOD     BAD
perf_count_congestion_wait                245    1388
perf_count_pages_direct_reclaim         10993    5365
perf_count_failed_pages_direct_reclaim      0    1090
Note - the data also shows that the average time spent in the 
f_op->aio_read call issued by do_sync_read increases by 68.5%, which is 
inversely proportional to the average loss in throughput.
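(As a quick sanity check: 1 / 1.685 ≈ 0.59, i.e. a 68.5% longer average 
read call corresponds to the bad case reaching roughly 59% of the 
good-case throughput, matching the ~168% good vs. bad ratio quoted above.)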

This lets me conclude that patch e084b causes the system to run into 
congestion_wait after the semi-failed call to __alloc_pages_direct_reclaim.
Unfortunately I don't understand memory management well enough to find 
the relation between e084b actually changing the order and position of 
freed pages in the LRU list and __alloc_pages_direct_reclaim not getting 
pages as effectively as before.
Ideas and comments welcome!

Kind regards,
Christian

P.S: attached are the three patches I used to get those new numbers


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Linux Performance on System z 


[-- Attachment #2: perf-count-and-time-do-sync-read.diff --]
[-- Type: text/x-patch, Size: 3683 bytes --]

Index: linux-2.6-git-2.6.31-bisect2/fs/read_write.c
===================================================================
--- linux-2.6-git-2.6.31-bisect2.orig/fs/read_write.c
+++ linux-2.6-git-2.6.31-bisect2/fs/read_write.c
@@ -249,25 +249,37 @@ static void wait_on_retry_sync_kiocb(str
 	__set_current_state(TASK_RUNNING);
 }
 
+unsigned long perf_count_do_sync_read_calls_aio_read = 0;
+unsigned long perf_tsum_do_sync_read_calls_aio_read = 0;
+unsigned long perf_count_do_sync_read_calls_wait_on_sync_kiocb = 0;
+unsigned long perf_tsum_do_sync_read_calls_wait_on_sync_kiocb = 0;
 ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
 	struct iovec iov = { .iov_base = buf, .iov_len = len };
 	struct kiocb kiocb;
 	ssize_t ret;
+	ktime_t tv64_pre;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
 	kiocb.ki_left = len;
 
 	for (;;) {
+		perf_count_do_sync_read_calls_aio_read++;
+		tv64_pre = ktime_get();
 		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
+		perf_tsum_do_sync_read_calls_aio_read += ktime_sub(ktime_get(),tv64_pre).tv64;
 		if (ret != -EIOCBRETRY)
 			break;
 		wait_on_retry_sync_kiocb(&kiocb);
 	}
 
-	if (-EIOCBQUEUED == ret)
+	if (-EIOCBQUEUED == ret) {
+		perf_count_do_sync_read_calls_wait_on_sync_kiocb++;
+		tv64_pre = ktime_get();
 		ret = wait_on_sync_kiocb(&kiocb);
+		perf_tsum_do_sync_read_calls_wait_on_sync_kiocb += ktime_sub(ktime_get(),tv64_pre).tv64;
+	}
 	*ppos = kiocb.ki_pos;
 	return ret;
 }
Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
===================================================================
--- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
+++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
@@ -257,6 +257,10 @@ extern unsigned long perf_count_pages_di
 extern unsigned long perf_count_failed_pages_direct_reclaim;
 extern unsigned long perf_count_do_generic_file_read_calls_page_cache_alloc_cold;
 extern unsigned long perf_tsum_do_generic_file_read_calls_page_cache_alloc_cold;
+extern unsigned long perf_count_do_sync_read_calls_aio_read;
+extern unsigned long perf_tsum_do_sync_read_calls_aio_read;
+extern unsigned long perf_count_do_sync_read_calls_wait_on_sync_kiocb;
+extern unsigned long perf_tsum_do_sync_read_calls_wait_on_sync_kiocb;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
@@ -297,6 +301,38 @@ static struct ctl_table perf_table[] = {
 		.maxlen		= sizeof(unsigned long),
 		.mode           = 0666,
 		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_do_sync_read_calls_aio_read",
+		.data           = &perf_count_do_sync_read_calls_aio_read,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_tsum_do_sync_read_calls_aio_read",
+		.data           = &perf_tsum_do_sync_read_calls_aio_read,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_do_sync_read_calls_wait_on_sync_kiocb",
+		.data           = &perf_count_do_sync_read_calls_wait_on_sync_kiocb,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_tsum_do_sync_read_calls_wait_on_sync_kiocb",
+		.data           = &perf_tsum_do_sync_read_calls_wait_on_sync_kiocb,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
 	},
 	{ .ctl_name = 0 }
 };

[-- Attachment #3: perf-count-congestion-wait.diff --]
[-- Type: text/x-patch, Size: 2750 bytes --]

Index: linux-2.6-git-2.6.31-bisect2/include/linux/sysctl.h
===================================================================
--- linux-2.6-git-2.6.31-bisect2.orig/include/linux/sysctl.h
+++ linux-2.6-git-2.6.31-bisect2/include/linux/sysctl.h
@@ -69,6 +69,7 @@ enum
 	CTL_BUS=8,		/* Busses */
 	CTL_ABI=9,		/* Binary emulation */
 	CTL_CPU=10,		/* CPU stuff (speed scaling, etc) */
+	CTL_PERF=11,		/* Performance counters and timer sums for debugging */
 	CTL_ARLAN=254,		/* arlan wireless driver */
 	CTL_S390DBF=5677,	/* s390 debug */
 	CTL_SUNRPC=7249,	/* sunrpc debug */
Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
===================================================================
--- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
+++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
@@ -177,6 +177,7 @@ static struct ctl_table_root sysctl_tabl
 	.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
 };
 
+static struct ctl_table perf_table[];
 static struct ctl_table kern_table[];
 static struct ctl_table vm_table[];
 static struct ctl_table fs_table[];
@@ -230,6 +231,13 @@ static struct ctl_table root_table[] = {
 		.mode		= 0555,
 		.child		= dev_table,
 	},
+	{
+		.ctl_name	= CTL_PERF,
+		.procname	= "perf",
+		.mode		= 0555,
+		.child		= perf_table,
+	},
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
@@ -244,6 +252,19 @@ static int min_wakeup_granularity_ns;
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 #endif
 
+extern unsigned long perf_count_congestion_wait;
+static struct ctl_table perf_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_congestion_wait",
+		.data           = &perf_count_congestion_wait,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{ .ctl_name = 0 }
+};
+
 static struct ctl_table kern_table[] = {
 #ifdef CONFIG_SCHED_DEBUG
 	{
Index: linux-2.6-git-2.6.31-bisect2/mm/backing-dev.c
===================================================================
--- linux-2.6-git-2.6.31-bisect2.orig/mm/backing-dev.c
+++ linux-2.6-git-2.6.31-bisect2/mm/backing-dev.c
@@ -305,6 +305,7 @@ void set_bdi_congested(struct backing_de
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+unsigned long perf_count_congestion_wait = 0;
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -320,6 +321,7 @@ long congestion_wait(int sync, long time
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
+	perf_count_congestion_wait++;
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);

[-- Attachment #4: perf-count-failed-direct-recalims.diff --]
[-- Type: text/x-patch, Size: 2064 bytes --]

Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
===================================================================
--- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
+++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
@@ -253,6 +253,8 @@ static int max_wakeup_granularity_ns = N
 #endif
 
 extern unsigned long perf_count_congestion_wait;
+extern unsigned long perf_count_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
@@ -262,6 +264,22 @@ static struct ctl_table perf_table[] = {
 		.maxlen		= sizeof(unsigned long),
 		.proc_handler   = &proc_doulongvec_minmax,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_pages_direct_reclaim",
+		.data           = &perf_count_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_failed_pages_direct_reclaim",
+		.data           = &perf_count_failed_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6-git-2.6.31-bisect2/mm/page_alloc.c
===================================================================
--- linux-2.6-git-2.6.31-bisect2.orig/mm/page_alloc.c
+++ linux-2.6-git-2.6.31-bisect2/mm/page_alloc.c
@@ -1605,6 +1605,9 @@ out:
 	return page;
 }
 
+unsigned long perf_count_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim = 0;
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1645,6 +1648,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	perf_count_pages_direct_reclaim++;
+	if (!page)
+		perf_count_failed_pages_direct_reclaim++;
+
 	return page;
 }
 


* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-18 13:38           ` Christian Ehrhardt
@ 2009-12-18 17:42             ` Mel Gorman
  2010-01-14 12:30               ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2009-12-18 17:42 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss

On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
>> Mel Gorman wrote:
>> [...]
>>>>  
>>>
>>> One way of finding out if cache hottness was the problem would be to  
>>> profile
>>> for cache misses and see if there are massive differences with and  
>>> without
>>> the patch. Is that an option?
>>>   
>> It is an option to verify things, but as mentioned above I would  
>> expect an increase amounts of consumed cycles per kb which I don't see.
>> I'll track caches anyway to be sure.
>>
>> My personal current assumption is that either there is some time lost  
>> from the read syscall until the A event blocktrace tracks or I'm wrong  
>> with my assumption about user processes and iozone runs "slower".
>>
>
> After I was able to verify that the time is lost above the block device  
> layer I ran further tests to check out where the additional latency due  
> to commit e084b occurs.
> With "strace -frT -e trace=open,close,read,write" I could isolate the  
> continuous stream of read calls done by the iozone child process.
> In those data two things were revealed:
> a) The time is lost between read and Block device enter (not e.g. due to  
> slow userspace execution or scheduling effects)
>     Time spent in userspace until the next read call is almost the same, 
> but the average time spent "inside" the read call increases  from 788µs 
> to 3278µs
> b) The time is lost in a few big peaks
>    The majority of read calls are ~500-750µs in both cases, but there is 
> a slight shift of approximately 8% that occur now as durations between 
> 15000 and 60000µs
>
> As mentioned before there is no significant increase of spent cpu cycles  
> - therefore I watched out how the read call could end up in sleeps  
> (latency but not used cycles).
> I found that there is a chance to run into "congestion_wait" in case  
> __alloc_pages_direct_reclaim() was able to free some pages with  
> try_to_free_pages() but doesn't get a page in the subsequent  
> get_page_from_freelist().

This is true although it was also the case before the patch.

> To get details about that I patched in some simple counters and  
> ktime_get based timings. The results are impressive and point exactly  
> towards the congestion_wait theory
>
> The following data is again taken with iozone 1 thread random read in a  
> very low memory environment, the counters were zeroed before starting 
> iozone
> Bad = the kernel just with commit e084b, good = kernel one commit before  
> e084b. The throughput in good case while taking this data was  
> approximately ~168% of that of the bad case.
>
>                                         GOOD     BAD
> perf_count_congestion_wait                245    1388
> perf_count_pages_direct_reclaim         10993    5365
> perf_count_failed_pages_direct_reclaim      0    1090
> Note - the data also shows that the average time spent in the  
> f_op->aio_read call issued by do_sync_read increases by 68.5% which is  
> directly inversely proportional to the average loss in throughput.
>
> This lets me conclude that patch e084b causes the system to run into  
> congestion_wait after the semi-failed call to 
> __alloc_pages_direct_reclaim.

You mention that the "GOOD" kernel is one commit before e084b but can
you test with just that patch reverted please? As the patch is only
about list manipulation, I'm failing to see how it can affect
congestion_wait().

> Unfortunately I don't understand memory management enough to find the  
> relation between e084b actually changing the order and position of freed  
> pages in the LRU list to __alloc_pages_direct_reclaim not getting pages  
> as effectively as before.

Off-hand, neither can I. I regret that I'm going offline for Christmas 
very soon and I'll be in the middle of nowhere with no internet connection 
or test machines in the interim. It's going to be the new year before I 
get to look at this properly.
the new year before I get to look at this properly.

Even if this patch is not the real problem, your instrumentation patches
will help pin down the real problem. Thanks and sorry I'll be delayed in
getting back to you properly.

> Ideas and comments welcome!
>
> Kind regards,
> Christian
>
> P.S: attached are the three patches I used to get those new numbers
>
>
> -- 
>
> Grüsse / regards, Christian Ehrhardt
> IBM Linux Technology Center, Linux Performance on System z 
>

> [...]


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2009-12-18 17:42             ` Mel Gorman
@ 2010-01-14 12:30               ` Christian Ehrhardt
  2010-01-19 11:33                 ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-01-14 12:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss

Mel Gorman wrote:
> On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
>   
>> Christian Ehrhardt wrote:
>>     
>> [...]
> You mention that the "GOOD" kernel is one commit before e084b but can
> you test with just that patch reverted please? As the patch is only
> about list manipulation, I'm failing to see how it can affect
> congestion_wait().
>
>   
>> Unfortunately I don't understand memory management enough to find the  
>> relation between e084b actually changing the order and position of freed  
>> pages in the LRU list to __alloc_pages_direct_reclaim not getting pages  
>> as effectively as before.
>>     
>
> Off-hand, neither can I. I regret that I'm going off-line for the
> Christmas very soon and I'll be in the middle of nowhere with no
> internet connection or test machines in the interim. It's going to be
> the new year before I get to look at this properly.
>
> Even if this patch is not the real problem, your instrumentation patches
> will help pin down the real problem. Thanks and sorry I'll be delayed in
> getting back to you properly.
>   

Back to business :-)
First - yes, it is reproducible in 2.6.32 final and fixable by unapplying 
e084b.
But I don't think unapplying it is really a fix, as no one really 
understands how e084b causes this regression (it might just work around 
whatever goes on in the background, and someone would still hit the hidden issue).

Let's instead look at what we know, summarized:
- the patch e084b causes this regression
- it causes it only under very rare, if not purely theoretical, high 
memory constraints
- time is lost between the read system call and entry into the block device layer
- via debugging and discussion we found that the lost time is spent in 
congestion_wait

I did not yet check if it is possible, but we might have a spot that 
might fix the issue.
congestion_wait's documentation states: "Waits for up to @timeout jiffies for a 
backing_dev (any backing_dev) to exit write congestion.  If no 
backing_devs are congested then just wait for the next write to be 
completed.".
Now in some cases (including my test) there is absolutely no write to do 
or in flight. So maybe it would be possible to either detect this before 
calling congestion_wait from page_alloc.c, or let congestion_wait return 
an error code or something similar.

I mean, as far as I can see the kernel currently does a loop like this 
(pseudo code and simplified):
1. get a page <- fail
2. try_to_free page <- returns >=1 pages freed
3. get_page_from_freelist <- fails
4. should_alloc_retry <- true
5. congestion_wait <- FUTILE - just losing time, nothing will be freed 
by writes
6. goto 1
-> So maybe we should detect the futility of that congestion_wait call 
and not do it at all - retry instantly, go to oom, .., whatever (that's 
out for discussion).
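
To make that loop a bit more tangible, here is a tiny user-space sketch 
(just an illustration with stubbed functions, not kernel code; the 20ms 
corresponds to the HZ/50 timeout used in the slowpath) of the control 
flow described above, including the proposed "is there anything to wait 
for at all" check:

/*
 * User-space illustration only - models the simplified allocation retry
 * loop from above to show where the time goes when congestion_wait()
 * has nothing to wait for.  All functions are stubs.
 */
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

static bool get_page_from_freelist(void) { return false; } /* steps 1/3: always fails here */
static int try_to_free_pages(void)       { return 1; }     /* step 2: reports progress */
static bool writes_in_flight(void)       { return false; } /* nothing queued for writeback */

/* step 5: congestion_wait(BLK_RW_ASYNC, HZ/50) degenerates into a 20ms sleep */
static void congestion_wait(void) { usleep(20000); }

int main(void)
{
	int retries = 0, futile_waits = 0;

	while (!get_page_from_freelist() && retries < 10) {	/* 1 + 3 */
		try_to_free_pages();				/* 2 */
		/* 4: should_alloc_retry() said yes, so ... */
		if (!writes_in_flight())
			futile_waits++;	/* proposed check: nothing to wait for */
		congestion_wait();				/* 5 */
		retries++;					/* 6: goto 1 */
	}
	printf("%d retries, %d futile 20ms waits (~%dms lost)\n",
	       retries, futile_waits, futile_waits * 20);
	return 0;
}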

If we can't fix it that way I would propose we keep the code as is, 
hoping that it is a theoretical case that never hits a customer system. 
But in that case we should definitely add a userspace-readable counter 
to congestion_wait, similar to my debug patches. This would allow 
everyone who ever suspects an issue might be related to this to verify 
it by checking the congestion_wait counter.
This should not be too hard technically, and it is not performance 
critical either, as congestion_wait is going to wait anyway. On the other hand 
everyone hates introducing new kernel interfaces that need to be kept 
compatible until nirvana - especially for bug tracking it's not the best 
idea :-) So it should be our very last resort.

Comments/Opinions to those ideas ?

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-01-14 12:30               ` Christian Ehrhardt
@ 2010-01-19 11:33                 ` Mel Gorman
  2010-02-05 15:51                   ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2010-01-19 11:33 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss

On Thu, Jan 14, 2010 at 01:30:24PM +0100, Christian Ehrhardt wrote:
> Mel Gorman wrote:
>> On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
>>   
>>> Christian Ehrhardt wrote:
>>>     [...]
>> You mention that the "GOOD" kernel is one commit before e084b but can
>> you test with just that patch reverted please? As the patch is only
>> about list manipulation, I'm failing to see how it can affect
>> congestion_wait().
>>
>>   
>>> Unfortunately I don't understand memory management enough to find the 
>>>  relation between e084b actually changing the order and position of 
>>> freed  pages in the LRU list to __alloc_pages_direct_reclaim not 
>>> getting pages  as effectively as before.
>>>     
>>
>> Off-hand, neither can I. I regret that I'm going off-line for the
>> Christmas very soon and I'll be in the middle of nowhere with no
>> internet connection or test machines in the interim. It's going to be
>> the new year before I get to look at this properly.
>>
>> Even if this patch is not the real problem, your instrumentation patches
>> will help pin down the real problem. Thanks and sorry I'll be delayed in
>> getting back to you properly.
>>   
>
> Back to business :-)
> First - yes it is reproducible in 2.6.32 final and fixable by unapplying  
> e084b.
> But I don't think un-applying it is really a fix as no one really  
> understands how e084b cause this regression (it might just work around  
> whatever goes on in the background and someone still hits the hidden 
> issue).
>

Agreed that reverting is not an option. While it is not understood why it
causes your regression, it's known to fix a regression for hardware that
does do merging.

> Lets better look what we know summarized:
> - the patch e084b causes this regression
> - it causes it only in very rare if not theoretically high memory  
> constraints
> - time is lost between read system call and the block device layer enter
> - via debugging and discussion we found that time lost is spent in  
> congestion_wait
>

Thanks for the summary.

There is an outside chance it's due to a difference in free page availability
and the associated counters, particularly if some subsystem is optimistically
trying high-order allocations depending on availability. If there are many
allocation failures, the counters get skewed due to a bug, watermarks are
not met and direct reclaim is used more. A patch to fix this is on its way
upstream: http://www.spinics.net/lists/mm-commits/msg75671.html . Can you
try it please?

There is a report on swap-based workloads being impacted on 2.6.32 and a fix
exists but it wouldn't have affected 2.6.31. If you are testing on 2.6.32
though, this patch should be applied http://lkml.org/lkml/2009/12/21/245

> I did not yet check if it is possible, but we might have a spot tho  
> might fix the issue.
> Congestion_wait states it would "Waits for up to @timeout jiffies for a  
> backing_dev (any backing_dev) to exit write congestion.  If no  
> backing_devs are congested then just wait for the next write to be  
> completed.".
> Now in some case (and also my test) there is absolutely no write to do  
> or in flight. So maybe it would be possible to either detect this before  
> calling congestion_wait from page_alloc.c or let congestion_wait return  
> an error code or something similar.
>

While I think that idea has merit, it's fixing the symptons and not the
underlying problem. There is a growing opinion that congestion_wait() being
used in the VM at all is a mistake. If the VM really needs to wait on pages
being written out, it should have been on a waitqueue for a number of pages
rather than a time-based method.

In other words, I'm more interested in figuring out *why* we enter
congestion_wait() at all with the patch and avoid it without it, rather than
what congestion_wait() does when it gets called.

Also, are you sure about the "waiting for the next write to complete"? In
2.6.31 at least, it's waiting on the async write queue. If it's the case that
it waits the full timeout when there is no async traffic then that's obviously
bad as well, but it still doesn't explain why the problem patch makes a difference.

> I mean as far as I can see the kernel currently does a loop like that  
> (pseudo code and simplified):
> 1. get a page <- fail
> 2. try_to_free page <- returns >=1 pages freed
> 3. get_page_from_freelist <- fails
> 4. should_alloc_retry <- true
> 5. congestion_wait <- FUTILE - just loosing time, nothing will be freed  
> by writes
> 6. goto 1
> -> So maybe we should detect the futility of that congestion_wait call  
> and not do it at all repeating instantly, go to oom, .., whatever (thats  
> out for discussion).
>
> If we can't fix it that way I would propose we keep the code as is,  

It's not my preferred way to fix this although I'm willing to explore it.
I'd much prefer an explanation as to why your testcase is sensitive to the
PFN ordering of the pages it gets back.

One possibility is that your testcase requires a number of
high-order allocations and the patch is preventing them being
merged. If that was the case, the following patch would also help
http://www.spinics.net/lists/mm-commits/msg75672.html but it's less than ideal.

> hoping that it is a theoretical case that never hits a customer system.  
> But in that case we should definitely add a userspace readable counter  
> to congestion_wait similar to my debug patches. This would allow  
> everyone that ever assumes an issue might be related to this to verify  
> it by checking the congestion_wait counter.

Would the accounting features available with
CONFIG_TASK_DELAY_ACCT be sufficient?  There is a description in
Documentation/accounting/delay-accounting.txt on how to use it. An abnormal
amount of time spent on IO might explain the problem.

> This should not be too hard in a technical way, but also not performance  
> critical as congestion_wait is going to wait anyway. On the other hand  
> everyone hates introducing new kernel interfaces that need to be kept  
> compatible until nirvana - especially for bug tracking its not the best  
> idea :-) So it should be our very last resort.
>

Agreed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-01-19 11:33                 ` Mel Gorman
@ 2010-02-05 15:51                   ` Christian Ehrhardt
  2010-02-05 17:49                     ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-05 15:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare

[-- Attachment #1: Type: text/plain, Size: 10339 bytes --]

I'll keep the old thread below as reference.

After taking a round of ensuring reproducibility and a pile of new 
measurements I can now come back with several new insights.

FYI - I'm now running iozone triplets (4, then 8, then 16 parallel 
threads) with sequential read load, and all of that 4 times to find 
potential noise. Since I changed to that load instead of random read 
with one thread, and ensure that most memory is cleared (sync + echo 3 > 
/proc/sys/vm/drop_caches + a few sleeps), the noise is now down to <2%. 
For detailed questions about the setup feel free to ask me directly, as I 
won't flood this thread too much with such details.

So in the past I identified git id e084b as bringing a huge 
degradation; that is still true, but I need to retract my old statement 
that unapplying e084b on v2.6.32 helps - it simply doesn't.

Another bisect (keeping e084b reverted) brought up git id 5f8dcc21 which 
came in later. Each patch reverted individually doesn't improve 
anything. But both patches reverted at the same time on git v2.6.32 
bring us back to our old values (remember that is more than 2Gb/s 
improvement in throughput)!

Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how it 
can cause so much trouble.

In the past I identified congestion_wait as the point where the "time is 
lost" when comparing good and bad case. Fortunately at least this is 
still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
So I upgraded my old counter patches a bit and can now report the following:

BAD CASE                                                           4 THREAD READ    8 THREAD READ    16 THREAD READ    16THR % portions
perf_count_congestion_wait                                                   305             1970              8980
perf_count_call_congestion_wait_from_alloc_pages_high_priority                 0                0                 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath                    305             1970              8979            100.00%
perf_count_pages_direct_reclaim                                              1153             6818              3221
perf_count_failed_pages_direct_reclaim                                        305             1556              8979
perf_count_failed_pages_direct_reclaim_but_progress                           305             1478              8979             27.87%

GOOD CASE WITH REVERTS                                             4 THREAD READ    8 THREAD READ    16 THREAD READ    16THR % portions
perf_count_congestion_wait                                                    25               76              1114
perf_count_call_congestion_wait_from_alloc_pages_high_priority                  0                0                 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath                      25               76              1114             99.98%
perf_count_pages_direct_reclaim                                              1054             9150             62297
perf_count_failed_pages_direct_reclaim                                         25               64              1114
perf_count_failed_pages_direct_reclaim_but_progress                            25               57              1114              1.79%


I hope the format is kept; it should be fine in every monospace viewer.
You can clearly see that the patches somehow affect the ratio at 
which __alloc_pages_direct_reclaim runs into the race where try_to_free_pages 
actually frees something, but the allocation a few lines below still can't 
get a page from the free list.
Further out, in __alloc_pages_slowpath, this leads to a call to 
congestion_wait, which is absolutely futile as there are no 
writes in flight to wait for.

This effectively kills up to 80% of our throughput - any idea that leads to a 
better understanding of the link between the patches and the effect is 
welcome and might lead to a solution.

FYI - I tried all patches you suggested - none of them affects this.

Updated perf counter patches are attached.
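
For reference, a minimal user-space sketch (only an illustration, not how 
the numbers above were gathered) of how the counters can be sampled around 
a run - with the patches applied the values show up under /proc/sys/perf/:

/* Sample the debug counters before and after a test run and print deltas. */
#include <stdio.h>

static const char *counters[] = {
	"perf_count_congestion_wait",
	"perf_count_call_congestion_wait_from_alloc_pages_slowpath",
	"perf_count_pages_direct_reclaim",
	"perf_count_failed_pages_direct_reclaim",
	"perf_count_failed_pages_direct_reclaim_but_progress",
};
#define NR_COUNTERS (sizeof(counters) / sizeof(counters[0]))

static unsigned long read_counter(const char *name)
{
	char path[256];
	unsigned long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/perf/%s", name);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%lu", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	unsigned long before[NR_COUNTERS];
	unsigned int i;

	for (i = 0; i < NR_COUNTERS; i++)
		before[i] = read_counter(counters[i]);

	/* ... run the iozone workload here ... */

	for (i = 0; i < NR_COUNTERS; i++)
		printf("%-60s %lu\n", counters[i],
		       read_counter(counters[i]) - before[i]);
	return 0;
}
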
As for some questions in the old thread:

>> Back to business :-)
>> First - yes it is reproducible in 2.6.32 final and fixable by unapplying  
>> e084b.
>> But I don't think un-applying it is really a fix as no one really  
>> understands how e084b cause this regression (it might just work around  
>> whatever goes on in the background and someone still hits the hidden 
>> issue).
>>
>>     
>
> Agreed on reverting is not an option. While it is not understood why it
> causes your regression, it's known to fix a regression for hardware that
> does do merging.
>   

I'm not disputing that point, but fairly compared it is +15% vs -80% - so it 
is not "only" good :-)

>> Lets better look what we know summarized:
>> - the patch e084b causes this regression
>> - it causes it only in very rare if not theoretically high memory  
>> constraints
>> - time is lost between read system call and the block device layer enter
>> - via debugging and discussion we found that time lost is spent in  
>> congestion_wait
>>
>>     
>
> Thanks for the summary.
>
> There is an outside chance it's due to a difference in free page availablity
> and the associated counters. Particularly if some subsystem is optimistally
> trying high-order allocations depending on availability. If there are many
> allocation failures, the counters gets skewed due to a bug, watermarks are
> not met and direct reclaim is used more. A patch is on its way upstream to
> fix this is http://www.spinics.net/lists/mm-commits/msg75671.html . Can you
> try it please?
>
> There is a report on swap-based workloads being impacted on 2.6.32 and a fix
> exists but it wouldn't have affected 2.6.31. If you are testing on 2.6.32
> though, this patch should be applied http://lkml.org/lkml/2009/12/21/245
>   
Neither of them affects my scenario at all.
>   
>> I did not yet check if it is possible, but we might have a spot tho  
>> might fix the issue.
>> Congestion_wait states it would "Waits for up to @timeout jiffies for a  
>> backing_dev (any backing_dev) to exit write congestion.  If no  
>> backing_devs are congested then just wait for the next write to be  
>> completed.".
>> Now in some case (and also my test) there is absolutely no write to do  
>> or in flight. So maybe it would be possible to either detect this before  
>> calling congestion_wait from page_alloc.c or let congestion_wait return  
>> an error code or something similar.
>>
>>     
>
> While I think that idea has merit, it's fixing the symptons and not the
> underlying problem. There is a growing opinion that congestion_wait() being
> used in the VM at all is a mistake. If the VM really needs to wait on pages
> being written out, it should have been on a waitqueue for a number of pages
> rather than a time-based method.
>
> In other words, I'm more interesting in figuring out *why* we enter
> congestion_wait() at all with the patch and it's avoided without rather than
> what congestion_wait() does when it gets called.
>
> Also, are you sure about thw "waiting for the next write to complete"? In
> 2.6.31 at least, it's waiting on the async write queue. If it's a case that
> it waits the full timeout if there is no async traffic then that's obviously
> bad as well but still doesn't explain why the problem patch makes a difference.
>   
I agree that this part of the discussion is about symptoms, and for now 
we should try to focus on the cause that actually brings up the race in 
__alloc_pages_direct_reclaim.
>> I mean as far as I can see the kernel currently does a loop like that  
>> (pseudo code and simplified):
>> 1. get a page <- fail
>> 2. try_to_free page <- returns >=1 pages freed
>> 3. get_page_from_freelist <- fails
>> 4. should_alloc_retry <- true
>> 5. congestion_wait <- FUTILE - just loosing time, nothing will be freed  
>> by writes
>> 6. goto 1
>> -> So maybe we should detect the futility of that congestion_wait call  
>> and not do it at all repeating instantly, go to oom, .., whatever (thats  
>> out for discussion).
>>
>> If we can't fix it that way I would propose we keep the code as is,  
>>     
>
> It's not my preferred way to fix this although I'm willing to explore it.
> I'd much prefer an explanation as to why your testcase is sensitive to the
> PFN ordering of the pages it gets back.
>
> One possibility is that your testcase requires a number of
> high-order allocations and the patch is preventing them being
> merged. If that was the case, the following patch would also help
> http://www.spinics.net/lists/mm-commits/msg75672.html but it's less than ideal.
>
>   
Again, this does not affect my workload at all.

>> hoping that it is a theoretical case that never hits a customer system.  
>> But in that case we should definitely add a userspace readable counter  
>> to congestion_wait similar to my debug patches. This would allow  
>> everyone that ever assumes an issue might be related to this to verify  
>> it by checking the congestion_wait counter.
>>     
>
> Would the accounting features available with
> CONFIG_TASK_DELAY_ACCT be sufficient?  There is a description in
> Documentation/accounting/delay-accounting.txt on how to use it. An abnormal
> amount of time spent on IO might explain the problem.
>
>   
Eventually it will appear as a huge increase in I/O wait for no obvious 
reason - DELAY_ACCT is fine, actually even plain iostat would be enough.
The congestion_wait counter would just help to distinguish this issue from 
others that also increase I/O wait - but that is not too important.
I think we can all live without a counter and rather fix the cause :-)
>> This should not be too hard in a technical way, but also not performance  
>> critical as congestion_wait is going to wait anyway. On the other hand  
>> everyone hates introducing new kernel interfaces that need to be kept  
>> compatible until nirvana - especially for bug tracking its not the best  
>> idea :-) So it should be our very last resort.
>>
>>     
>
> Agreed.
>
>   


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 


[-- Attachment #2: perf-count-congestion-wait.diff --]
[-- Type: text/x-patch, Size: 2630 bytes --]

Index: linux-2.6-git/include/linux/sysctl.h
===================================================================
--- linux-2.6-git.orig/include/linux/sysctl.h	2010-02-05 16:48:08.000000000 +0100
+++ linux-2.6-git/include/linux/sysctl.h	2010-02-05 16:48:50.000000000 +0100
@@ -68,6 +68,7 @@
 	CTL_BUS=8,		/* Busses */
 	CTL_ABI=9,		/* Binary emulation */
 	CTL_CPU=10,		/* CPU stuff (speed scaling, etc) */
+  	CTL_PERF=11,		/* Performance counters and timer sums for debugging */
 	CTL_ARLAN=254,		/* arlan wireless driver */
 	CTL_S390DBF=5677,	/* s390 debug */
 	CTL_SUNRPC=7249,	/* sunrpc debug */
Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c	2010-02-05 16:48:08.000000000 +0100
+++ linux-2.6-git/kernel/sysctl.c	2010-02-05 16:48:27.000000000 +0100
@@ -183,6 +183,7 @@
 	.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
 };
 
+static struct ctl_table perf_table[];
 static struct ctl_table kern_table[];
 static struct ctl_table vm_table[];
 static struct ctl_table fs_table[];
@@ -236,6 +237,13 @@
 		.mode		= 0555,
 		.child		= dev_table,
 	},
+	{
+		.ctl_name	= CTL_PERF,
+		.procname	= "perf",
+		.mode		= 0555,
+		.child		= perf_table,
+	},
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
@@ -250,6 +258,19 @@
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 #endif
 
+extern unsigned long perf_count_congestion_wait;
+static struct ctl_table perf_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_congestion_wait",
+		.data           = &perf_count_congestion_wait,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{ .ctl_name = 0 }
+};
+
 static struct ctl_table kern_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
Index: linux-2.6-git/mm/backing-dev.c
===================================================================
--- linux-2.6-git.orig/mm/backing-dev.c	2010-02-05 16:48:08.000000000 +0100
+++ linux-2.6-git/mm/backing-dev.c	2010-02-05 16:48:27.000000000 +0100
@@ -739,6 +739,7 @@
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+unsigned long perf_count_congestion_wait = 0;
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -754,6 +755,7 @@
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
+	perf_count_congestion_wait++;
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);

[-- Attachment #3: perf-count-congestion-wait-calls.diff --]
[-- Type: text/x-patch, Size: 3604 bytes --]

Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c	2010-02-05 16:48:58.000000000 +0100
+++ linux-2.6-git/kernel/sysctl.c	2010-02-05 16:49:03.000000000 +0100
@@ -262,6 +262,8 @@
 extern unsigned long perf_count_pages_direct_reclaim;
 extern unsigned long perf_count_failed_pages_direct_reclaim;
 extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
+extern unsigned long perf_count_call_congestion_wait_from_alloc_pages_high_priority;
+extern unsigned long perf_count_call_congestion_wait_from_alloc_pages_slowpath;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
@@ -270,6 +272,22 @@
 		.mode           = 0666,
 		.maxlen		= sizeof(unsigned long),
 		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_call_congestion_wait_from_alloc_pages_high_priority",
+		.data           = &perf_count_call_congestion_wait_from_alloc_pages_high_priority,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_call_congestion_wait_from_alloc_pages_slowpath",
+		.data           = &perf_count_call_congestion_wait_from_alloc_pages_slowpath,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2010-02-05 16:48:58.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2010-02-05 16:49:03.000000000 +0100
@@ -1663,6 +1663,7 @@
 
 unsigned long perf_count_pages_direct_reclaim = 0;
 unsigned long perf_count_failed_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim_but_progress = 0;
 
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
@@ -1704,12 +1705,13 @@
 	perf_count_pages_direct_reclaim++;
 	if (!page)
 		perf_count_failed_pages_direct_reclaim++;
-	if (!page && !(*did_some_progress))
-		perf_count_failed_pages_direct_reclaim_but_progress++
+	if (!page && (*did_some_progress))
+		perf_count_failed_pages_direct_reclaim_but_progress++;
 
 	return page;
 }
 
+unsigned long perf_count_call_congestion_wait_from_alloc_pages_high_priority = 0;
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -1727,8 +1729,10 @@
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
 			preferred_zone, migratetype);
 
-		if (!page && gfp_mask & __GFP_NOFAIL)
+		if (!page && gfp_mask & __GFP_NOFAIL) {
+			perf_count_call_congestion_wait_from_alloc_pages_high_priority++;
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
+		}
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -1783,6 +1787,7 @@
 	return alloc_flags;
 }
 
+unsigned long perf_count_call_congestion_wait_from_alloc_pages_slowpath = 0;
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1899,6 +1904,7 @@
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
+		perf_count_call_congestion_wait_from_alloc_pages_slowpath++;
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
 	}

[-- Attachment #4: perf-count-failed-direct-recalims.diff --]
[-- Type: text/x-patch, Size: 2461 bytes --]

Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c	2010-02-05 16:48:27.000000000 +0100
+++ linux-2.6-git/kernel/sysctl.c	2010-02-05 16:48:58.000000000 +0100
@@ -259,6 +259,9 @@
 #endif
 
 extern unsigned long perf_count_congestion_wait;
+extern unsigned long perf_count_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
@@ -268,6 +271,30 @@
 		.maxlen		= sizeof(unsigned long),
 		.proc_handler   = &proc_doulongvec_minmax,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_pages_direct_reclaim",
+		.data           = &perf_count_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_failed_pages_direct_reclaim",
+		.data           = &perf_count_failed_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_failed_pages_direct_reclaim_but_progress",
+		.data           = &perf_count_failed_pages_direct_reclaim_but_progress,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2010-02-05 16:48:07.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2010-02-05 16:48:58.000000000 +0100
@@ -1661,6 +1661,9 @@
 	return page;
 }
 
+unsigned long perf_count_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim = 0;
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1697,6 +1700,13 @@
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	perf_count_pages_direct_reclaim++;
+	if (!page)
+		perf_count_failed_pages_direct_reclaim++;
+	if (!page && !(*did_some_progress))
+		perf_count_failed_pages_direct_reclaim_but_progress++
+
 	return page;
 }
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-05 15:51                   ` Christian Ehrhardt
@ 2010-02-05 17:49                     ` Mel Gorman
  2010-02-08 14:01                       ` Christian Ehrhardt
  2010-02-08 15:02                       ` Christian Ehrhardt
  0 siblings, 2 replies; 41+ messages in thread
From: Mel Gorman @ 2010-02-05 17:49 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare

On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
> I'll keep the old thread below as reference.
>
> After taking a round of ensuring reproducibility and a pile of new  
> measurements I now can come back with several new insights.
>
> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel  
> threads) with sequential read load and all that 4 times to find  
> potential noise. But since I changed to that load instead of random read  
> wit hone thread and ensuring the most memory is cleared (sync + echo 3 >  
> /proc/sys/vm/drop_caches + a few sleeps) . The noise is now down <2%.  
> For detailed questions about the setup feel free to ask me directly as I  
> won't flood this thread too much with such details.
> 

Is there any chance you have a driver script for the test that you could send
me? I'll then try reproducing based on that script and see what happens. I'm
not optimistic I'll be able to reproduce the problem because I think
it's specific to your setup but you never know.

> So in the past I identified git id e084b for bringing a huge  
> degradation, that is still true, but I need to revert my old statement  
> that unapplying e084b on v2.6.32 helps - it simply doesn't.
>

Hmm, ok. Odd that it used to work but now doesn't.

How reproducible are these results with patch e084b reverted? i.e. I know
it does not work now, but did reverting on the old kernel always fix it
or were there occasions where the figures would be still bad?

> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21 which  
> came in later. Both patches unapplied individually don't improve  
> anything. But both patches reverted at the same time on git v2.6.32  
> bring us back to our old values (remember that is more than 2Gb/s  
> improvement in throughput)!
>
> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this  
> can cause so much trouble.
>

There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
workloads can suffer. A second is that users of high-order allocations can
enter direct reclaim a lot more than previously. This was fixed in commit
a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
your case. 

The only other interesting thing in commit 5f8dcc21 is that it increases
the memory overhead of a per-cpu structure. If your memory usage is really
borderline, it might have been just enough to push it over the edge.
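
Roughly (from memory, so treat the field names as approximate), the 
per-cpu structure change looks like the compilable sketch below, just to 
illustrate what grew:

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
#define MIGRATE_PCPTYPES 3

struct per_cpu_pages_2_6_31 {			/* before 5f8dcc21 */
	int count, high, batch;
	struct list_head list;			/* single hot/cold list */
};

struct per_cpu_pages_2_6_32 {			/* after 5f8dcc21 */
	int count, high, batch;
	struct list_head lists[MIGRATE_PCPTYPES];	/* one list per migrate type */
};

int main(void)
{
	printf("per-cpu pageset growth: %zu -> %zu bytes\n",
	       sizeof(struct per_cpu_pages_2_6_31),
	       sizeof(struct per_cpu_pages_2_6_32));
	return 0;
}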

> In the past I identified congestion_wait as the point where the "time is  
> lost" when comparing good and bad case. Fortunately at least this is  
> still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
> So I upgraded my old counter patches a bit and can now report the following:
>
> BAD CASE
>                                                                4 THREAD  READ               8 THREAD READ               16 THREAD READ         16THR % portions
> perf_count_congestion_wait                                        305                        1970                        8980
> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                           0
> perf_count_call_congestion_wait_from_alloc_pages_slowpath         305                        1970                        8979                   100.00%
> perf_count_pages_direct_reclaim                                  1153                        6818                        3221
> perf_count_failed_pages_direct_reclaim                            305                        1556                        8979

Something is wrong with the counters there. For 16 threads, it says direct
reclaim was entered 3221 but it failed 8979 times. How does that work? The
patch you supplied below looks like a patch on top of another debugging patch.
Can you send the full debugging patch please?

> perf_count_failed_pages_direct_reclaim_but_progress               305                        1478                        8979                   27.87%
>
> GOOD CASE WITH REVERTS                                          4 THREAD  READ               8 THREAD READ               16 THREAD READ        16THR % portions
> perf_count_congestion_wait                                         25                          76                         1114
> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                            0
> perf_count_call_congestion_wait_from_alloc_pages_slowpath          25                          76                         1114               99.98%
> perf_count_pages_direct_reclaim                                  1054                        9150                        62297
> perf_count_failed_pages_direct_reclaim                             25                          64                         1114
> perf_count_failed_pages_direct_reclaim_but_progress                25                          57                         1114                1.79%
>
>
> I hope the format is kept, it should be good with every monospace viewer.

It got mangled but I think I've fixed it above. The biggest thing I can see
is that direct reclaim is a lot more successful with the patches reverted, but
that in itself doesn't make sense. Neither patch affects how many pages should
be free or reclaimable - just what order they are allocated and freed in.

With both patches reverted, is the performance 100% reliable or does it
sometimes vary?

If reverting 5f8dcc21 is required, I'm leaning towards believing that the
additional memory overhead with that patch is enough to push this workload
over the edge where entering direct reclaim is necessary a lot more to keep
the zones over the min watermark. You mentioned early on that adding 64MB
to the machine makes the problem go away. Do you know what the cut-off point
is? For example, is adding 4MB enough?

If a small amount of memory does help, I'd also test with min_free_kbytes
at a lower value? If reducing it restores the performance, it again implies
that memory usage is the real problem.

Thing is, even if adding small amounts of memory or reducing
min_free_kbytes helps, it does not explain why e084b ever made a
difference because all that patch does is alter the ordering of pages on
the free list. That might have some cache consequences but it would not
impact direct reclaim like this.

> You can all clearly see that the patches somehow affect the ratio at  
> which __alloc_pages_direct_reclaim runs into a race between try_to_free  
> pages that could actually free something, but a few lines below can't  
> get a page from the free list.

There actually is potentially a significant delay between when direct
reclaim returns and when a new allocation attempt happens. However, it's been
like this for a long time so I don't think it's the issue.

I have a prototype patch that removes usage of congestion_wait() altogether
and puts processes to sleep on a waitqueue waiting for watermarks to restore
or a timeout. However, I think it would be just treating the symptoms here
rather than the real issue.

> Outside in the function alloc_pages_slowpath this leads to a call to  
> congestion_wait which is absolutely futile as there are absolutely no  
> writes in flight to wait for.
>
> Now this kills effectively up to 80% of our throughput - Any idea of  
> better understanding the link between the patches and the effect is  
> welcome and might lead to a solution.
>
> FYI - I tried all patches you suggested - none of them affects this.
>

I'm still at a total loss to explain this. Memory overhead of the second
patch is a vague possibility and worth checking out by slightly increasing
available memory on the partition or reducing min_free_kbytes. It does not
explain why the first patch makes a difference.

> <SNIP>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-05 17:49                     ` Mel Gorman
@ 2010-02-08 14:01                       ` Christian Ehrhardt
  2010-02-08 15:21                         ` Mel Gorman
  2010-02-08 15:02                       ` Christian Ehrhardt
  1 sibling, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-08 14:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin



Mel Gorman wrote:
> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>   
>> I'll keep the old thread below as reference.
>>
>> After taking a round of ensuring reproducibility and a pile of new  
>> measurements I now can come back with several new insights.
>>
>> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel  
>> threads) with sequential read load and all that 4 times to find  
>> potential noise. But since I changed to that load instead of random read  
>> wit hone thread and ensuring the most memory is cleared (sync + echo 3 >  
>> /proc/sys/vm/drop_caches + a few sleeps) . The noise is now down <2%.  
>> For detailed questions about the setup feel free to ask me directly as I  
>> won't flood this thread too much with such details.
>>
>>     
>
> Is there any chance you have a driver script for the test that you could send
> me? I'll then try reproducing based on that script and see what happens. I'm
> not optimistic I'll be able to reproduce the problem because I think
> it's specific to your setup but you never know.
>   
I don't have one as it runs in a bigger automated test environment, but 
it is easy enough to write down something comparable.
>> So in the past I identified git id e084b for bringing a huge  
>> degradation, that is still true, but I need to revert my old statement  
>> that unapplying e084b on v2.6.32 helps - it simply doesn't.
>>
>>     
>
> Hmm, ok. Odd that it used to work but now doesn't.
>
> How reproducible are these results with patch e084b reverted? i.e. I know
> it does not work now, but did reverting on the old kernel always fix it
> or were there occasions where the figures would be still bad?
>   
Reverting e084b in the past showed something like +40%, so I thought it 
helped.
Afterwards I found out it wasn't a good testcase for reproducibility; 
when looking at a series it had +5%, -7%, +20%, ...
So it was by far too noisy to be useful for bisecting, discussion or fix testing.
That's why I made my big round of reinventing the testcases in a more 
reproducible way, as described above.
With those the original issue is still very visible - meaning 4 runs 
comparing 2.6.32 vs 2.6.32 with e084b reverted each stay within +/-1%,
while git id e084b vs. the one just before (51fbb) gives ~-60% all the time.
>   
>> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21 which  
>> came in later. Both patches unapplied individually don't improve  
>> anything. But both patches reverted at the same time on git v2.6.32  
>> bring us back to our old values (remember that is more than 2Gb/s  
>> improvement in throughput)!
>>
>> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this  
>> can cause so much trouble.
>>     
>
> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
> workloads can suffer. A second is that users of high-order allocations can
> enter direct reclaim a lot more than previously. This was fixed in commit
> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
> your case. 
>
> The only other interesting thing in commit 5f8dcc21 is that it increases
> the memory overhead of a per-cpu structure. If your memory usage is really
> borderline, it might have been just enough to push it over the edge.
>   
I had these thoughts too, but as it only shows an effect with 5f8dcc21 
and e084b reverted at the same time I'm wondering if it can be that.
But I agree that this should be considered as well until it can be 
verified.
>   
>> In the past I identified congestion_wait as the point where the "time is  
>> lost" when comparing good and bad case. Fortunately at least this is  
>> still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
>> So I upgraded my old counter patches a bit and can now report the following:
>>
>> BAD CASE
>>                                                                4 THREAD  READ               8 THREAD READ               16 THREAD READ         16THR % portions
>> perf_count_congestion_wait                                        305                        1970                        8980
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                           0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath         305                        1970                        8979                   100.00%
>> perf_count_pages_direct_reclaim                                  1153                        6818                        3221
>> perf_count_failed_pages_direct_reclaim                            305                        1556                        8979
>>     
>
> Something is wrong with the counters there. For 16 threads, it says direct
> reclaim was entered 3221 but it failed 8979 times. How does that work? The
> patch you supplied below looks like a patch on top of another debugging patch.
> Can you send the full debugging patch please?
>   
This is just a single 7 that was accidentally deleted when bringing it to a plain 
text format for this mail. The percentages are fine, and properly formatted it would 
look like:

                                                                   4 THREAD READ    8 THREAD READ    16 THREAD READ    16THR % portions
perf_count_congestion_wait                                                   305             1970              8980
perf_count_call_congestion_wait_from_alloc_pages_high_priority                 0                0                 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath                    305             1970              8979            100.00%
perf_count_pages_direct_reclaim                                              1153             6818             32217
perf_count_failed_pages_direct_reclaim                                        305             1556              8979
perf_count_failed_pages_direct_reclaim_but_progress                           305             1478              8979             27.87%


There were three patches attached in the last mail which should work in 
this order:

~/kernels/linux-2.6$ git checkout -f v2.6.32
~/kernels/linux-2.6$ patch -p1 < ~/Desktop/patches-git/perf-count-congestion-wait.diff
patching file include/linux/sysctl.h
patching file kernel/sysctl.c
patching file mm/backing-dev.c
~/kernels/linux-2.6$ patch -p1 < ~/Desktop/patches-git/perf-count-failed-direct-recalims.diff
patching file kernel/sysctl.c
patching file mm/page_alloc.c
Hunk #1 succeeded at 1670 (offset 9 lines).
Hunk #2 succeeded at 1709 (offset 9 lines).
~/kernels/linux-2.6$ patch -p1 < ~/Desktop/patches-git/perf-count-congestion-wait-calls.diff
patching file kernel/sysctl.c
patching file mm/page_alloc.c
Hunk #1 succeeded at 1672 (offset 9 lines).
Hunk #2 succeeded at 1714 (offset 9 lines).
Hunk #3 succeeded at 1738 (offset 9 lines).
Hunk #4 succeeded at 1796 (offset 9 lines).
Hunk #5 succeeded at 1913 (offset 9 lines).

Offsets are due to e084b/5f8dcc21 being reverted or not, it works in 
both cases.
Let me know if they work out for you that way.
>   
>> perf_count_failed_pages_direct_reclaim_but_progress               305                        1478                        8979                   27.87%
>>
>> GOOD CASE WITH REVERTS                                          4 THREAD  READ               8 THREAD READ               16 THREAD READ        16THR % portions
>> perf_count_congestion_wait                                         25                          76                         1114
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                            0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath          25                          76                         1114               99.98%
>> perf_count_pages_direct_reclaim                                  1054                        9150                        62297
>> perf_count_failed_pages_direct_reclaim                             25                          64                         1114
>> perf_count_failed_pages_direct_reclaim_but_progress                25                          57                         1114                1.79%
>>
>>
>> I hope the format is kept, it should be good with every monospace viewer.
>>     
>
> It got manged but I think I've it fixed above. The biggest thing I can see
> is that direct reclaim is a lot more successful with the patches reverted but
> that in itself doesn't make sense. Neither patch affects how many pages should
> be free or reclaimable - just what order they are allocated and freed in.
>
> With botgh patches reverted, is the performance 100% reliable or does it
> sometimes vary?
>   
It is 100% reliable now - reliably bad with plain 2.6.32, as well 
as reliably much better (e.g. +40% @ 16 threads) with both reverted.

> If reverting 5f8dcc21 is required, I'm leaning towards believing that the
> additional memory overhead with that patch is enough to push this workload
> over the edge where entering direct reclaim is necessary a lot more to keep
> the zones over the min watermark. You mentioned early on that adding 64MB
> to the machine makes the problem go away. Do you know what the cut-off point
> is? For example, is adding 4MB enough?
>   
That was another observation probably only seen due to the lack of good 
reproducibility in my old setup.
I made larger memory scaling tests with the new setup: I ran the 
workload with 160 to 356 megabytes in 32mb steps (256 is the default 
in all other runs).
The result is that more than 256m of memory only brings a slight benefit 
(which is ok as the load doesn't reread the data already in page cache 
and just doesn't need much more).
Here is some data about scaling memory, normalized to the 256m memory values:
- deviation to 256m case -                 160m     192m     224m    256m    288m    320m    356m
plain 2.6.32                             +9.12%  -55.11%  -17.15%   =100%  +1.46%  +1.46%  +1.46%
2.6.32 - 5f8dcc21 and e084b reverted     +6.95%   -6.95%   -0.80%   =100%  +2.14%  +2.67%  +2.67%
--------------------------------------------------------------------------------------------------
deviation between each other (all +)     60.64%  182.93%   63.44%  36.50%  37.41%  38.13%  38.13%
What can be seen:
a) more memory brings up to +2.67%/+1.46%, but no more when further 
increasing memory (reasonable as we don't reread cached files)
b) decreasing memory drops performance by up to -6.95% @ -64mb with both 
patches reverted, but by down to -55% @ -64mb with plain 2.6.32 (compared 
to its already much lower 256m throughput)
   -> plain 2.6.32 is much more sensitive to lower available memory AND 
always a level below
c) there is a weird but very reproducible improvement with even lower 
memory - very probably another issue, or rather another solution, not 
related here - but who knows :-)
   -> still, both reverted is always much better than plain 2.6.32

> If a small amount of memory does help, I'd also test with min_free_kbytes
> at a lower value? If reducing it restores the performance, it again implies
> that memory usage is the real problem.
>   

I'll run a few tests with bigger/smaller numbers of min_free_kbytes just 
to be sure.
> Thing is, even if adding small amounts of memory or reducing
> min_free_kbytes helps, it does not explain why e084b ever made a
> difference because all that patch does is alter the ordering of pages on
> the free list. That might have some cache consequences but it would not
> impact direct reclaim like this.
>
>   
>> You can all clearly see that the patches somehow affect the ratio at  
>> which __alloc_pages_direct_reclaim runs into a race between try_to_free  
>> pages that could actually free something, but a few lines below can't  
>> get a page from the free list.
>>     
>
> There actually is potentially a significant delay from when direct
> reclaim returns and a new allocation attempt happens. However, it's been
> like this for a long time so I don't think it's the issue.
>   
Due to the increased percentage of "progress, but !page" in direct reclaim 
I still think it is related.
I agree that "having" that potential race between try_to_free and 
get_page is not the issue, but I assume it is very likely that the 
patches change the timing there so that it happens much more often.
> I have a prototype patch that removes usage of congestion_wait() altogether
> and puts processes to sleep on a waitqueue waiting for watermarks to restore
> or a timeout. However, I think it would be just treating the symptoms here
> rather than the real issue.
>   
Anyway I'd be happy to test this patch if you could send it to me, 
because as long as we don't have a "real" solution I like such 
symptom-fixes as a start :-)
At least it would fix going into congestion_wait even without dirty 
pages or I/Os in flight - waiting for watermarks sounds much better to 
me independent of the current issue.
>> Outside in the function alloc_pages_slowpath this leads to a call to  
>> congestion_wait which is absolutely futile as there are absolutely no  
>> writes in flight to wait for.
>>
>> Now this kills effectively up to 80% of our throughput - Any idea of  
>> better understanding the link between the patches and the effect is  
>> welcome and might lead to a solution.
>>
>> FYI - I tried all patches you suggested - none of them affects this.
>>
>>     
>
> I'm still at a total loss to explain this. Memory overhead of the second
> patch is a vague possibility and worth checking out by slightly increasing
> available memory on the partition or reducing min_free_kbytes. It does not
> explain why the first patch makes a difference.
>   
In a discussion with Hannes Reinecke (hare@suse.de) he brought up that 
in a low memory scenario the ordering of the pages might twist.
Today we have the single list of pages - add hot ones to the head and 
cold to the tail. Both patches affect that list management - e084b 
changes the order of some, and 5f8dcc is splitting it per migrate type.
What now could happen is that you free pages and get a list like:
H1 - H2 - H3 - C3 - C2 - C1 (H is hot and C is cold)
In low mem scenarios it could now happen that an allocation for hot pages 
falls back to cold pages (as usual fallback), but couldn't it be that it 
then also gets the cold pages in reverse order again?
Without e084b it would be like:
H1 - H2 - H3 - C1 - C2 - C3 and therefore all at least in order (with 
the issue that cold allocations from the right are reverse -> e084b)
5f8dcc is now lowering the size of those lists by splitting it into 
different migration types, so maybe both together are increasing the 
chance to get such an issue.

I hope I explained that idea right, Hannes please feel free to correct 
and extend if needed.
I also added Nick on his request.
>> <SNIP>
>>     
>
>   

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-05 17:49                     ` Mel Gorman
  2010-02-08 14:01                       ` Christian Ehrhardt
@ 2010-02-08 15:02                       ` Christian Ehrhardt
  1 sibling, 0 replies; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-08 15:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin

*resend with fixed line wrap in the tables*

Mel Gorman wrote:
> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>   
>> I'll keep the old thread below as reference.
>>
>> After taking a round of ensuring reproducibility and a pile of new  
>> measurements I now can come back with several new insights.
>>
>> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel  
>> threads) with sequential read load and all that 4 times to find  
>> potential noise. But since I changed to that load instead of random read  
>> with one thread and ensuring the most memory is cleared (sync + echo 3 >  
>> /proc/sys/vm/drop_caches + a few sleeps) . The noise is now down <2%.  
>> For detailed questions about the setup feel free to ask me directly as I  
>> won't flood this thread too much with such details.
>>
>>     
>
> Is there any chance you have a driver script for the test that you could send
> me? I'll then try reproducing based on that script and see what happens. I'm
> not optimistic I'll be able to reproduce the problem because I think
> it's specific to your setup but you never know.
>   
I don't have one as it runs in a bigger automated test environment, but 
it is easy enough to write down something comparable.
>> So in the past I identified git id e084b for bringing a huge  
>> degradation, that is still true, but I need to revert my old statement  
>> that unapplying e084b on v2.6.32 helps - it simply doesn't.
>>
>>     
>
> Hmm, ok. Odd that it used to work but now doesn't.
>
> How reproducible are these results with patch e084b reverted? i.e. I know
> it does not work now, but did reverting on the old kernel always fix it
> or were there occasions where the figures would be still bad?
>   
Reverting e084b in the past showed something like +40%, so I thought it 
helps.
Afterwards I found out it wasn't a good testcase for reproducibility and 
when looking at a series it had +5%,-7%,+20%,...
So by far too noisy to be useful for bisect, discussion or fix testing.
That's why I made my big round reinventing the testcases in a more 
reproducible way like described above.
In those the original issue is still very visible - which means 4 runs 
comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
>   
>> Another bisect (keeping e084b reverted) brought up git id 5f8dcc21 which  
>> came in later. Both patches unapplied individually don't improve  
>> anything. But both patches reverted at the same time on git v2.6.32  
>> bring us back to our old values (remember that is more than 2Gb/s  
>> improvement in throughput)!
>>
>> Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how this  
>> can cause so much trouble.
>>     
>
> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
> workloads can suffer. A second is that users of high-order allocations can
> enter direct reclaim a lot more than previously. This was fixed in commit
> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
> your case. 
>
> The only other interesting thing in commit 5f8dcc21 is that it increases
> the memory overhead of a per-cpu structure. If your memory usage is really
> borderline, it might have been just enough to push it over the edge.
>   
I had these thoughts too, but as it only shows an effect with 5f8dcc21 
and e084b reverted at the same time I'm wondering if it can be that.
But I agree that this should be considered too until a verification can 
be done.
>   
>> In the past I identified congestion_wait as the point where the "time is  
>> lost" when comparing good and bad case. Fortunately at least this is  
>> still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
>> So I upgraded my old counter patches a bit and can now report the following:
>>
>> BAD CASE
>>                                                                4 THREAD  READ               8 THREAD READ               16 THREAD READ         16THR % portions
>> perf_count_congestion_wait                                        305                        1970                        8980
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                           0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath         305                        1970                        8979                   100.00%
>> perf_count_pages_direct_reclaim                                  1153                        6818                        3221
>> perf_count_failed_pages_direct_reclaim                            305                        1556                        8979
>>     
>
> Something is wrong with the counters there. For 16 threads, it says direct
> reclaim was entered 3221 but it failed 8979 times. How does that work? The
> patch you supplied below looks like a patch on top of another debugging patch.
> Can you send the full debugging patch please?
>   
This is just a single '7' that got deleted accidentally (32217 became 3221) 
when bringing it into a plain text format for this mail. The percentages are 
fine and properly formatted it would look like:

                                                              4 THREAD  READ               8 THREAD READ               16 THREAD READ         16THR % portions

perf_count_congestion_wait                                        305                        1970                         8980
perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                            0
perf_count_call_congestion_wait_from_alloc_pages_slowpath         305                        1970                         8979               100.00%
perf_count_pages_direct_reclaim                                  1153                        6818                        32217
perf_count_failed_pages_direct_reclaim                            305                        1556                         8979
perf_count_failed_pages_direct_reclaim_but_progress               305                        1478                         8979               27.87%
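
As a side note on how to read the "% portions" column: the value in the last
row relates the direct reclaims that made progress but still got no page to
all direct reclaims of the 16 thread case - the same calculation also
reproduces the 1.79% of the good case quoted further down:

awk 'BEGIN { printf "bad:  %.2f%%\n", 100 * 8979 / 32217;
             printf "good: %.2f%%\n", 100 * 1114 / 62297 }'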


There were three patches attached in the last mail which should work in 
this order:

~/kernels/linux-2.6$ git checkout -f v2.6.32
~/kernels/linux-2.6$ patch -p1 < 
~/Desktop/patches-git/perf-count-congestion-wait.diff
patching file include/linux/sysctl.h
patching file kernel/sysctl.c
patching file mm/backing-dev.c
~/kernels/linux-2.6$ patch -p1 < 
~/Desktop/patches-git/perf-count-failed-direct-recalims.diff
patching file kernel/sysctl.c
patching file mm/page_alloc.c
Hunk #1 succeeded at 1670 (offset 9 lines).
Hunk #2 succeeded at 1709 (offset 9 lines).
~/kernels/linux-2.6$ patch -p1 < 
~/Desktop/patches-git/perf-count-congestion-wait-calls.diff
patching file kernel/sysctl.c
patching file mm/page_alloc.c
Hunk #1 succeeded at 1672 (offset 9 lines).
Hunk #2 succeeded at 1714 (offset 9 lines).
Hunk #3 succeeded at 1738 (offset 9 lines).
Hunk #4 succeeded at 1796 (offset 9 lines).
Hunk #5 succeeded at 1913 (offset 9 lines).

Offsets are due to e084b/5f8dcc21 being reverted or not, it works in 
both cases.
Let me know if they work out for you that way.
>   
>> perf_count_failed_pages_direct_reclaim_but_progress               305                        1478                        8979                   27.87%
>>
>> GOOD CASE WITH REVERTS                                          4 THREAD  READ               8 THREAD READ               16 THREAD READ        16THR % portions
>> perf_count_congestion_wait                                         25                          76                         1114
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                            0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath          25                          76                         1114               99.98%
>> perf_count_pages_direct_reclaim                                  1054                        9150                        62297
>> perf_count_failed_pages_direct_reclaim                             25                          64                         1114
>> perf_count_failed_pages_direct_reclaim_but_progress                25                          57                         1114                1.79%
>>
>>
>> I hope the format is kept, it should be good with every monospace viewer.
>>     
>
> It got mangled but I think I've fixed it above. The biggest thing I can see 
> is that direct reclaim is a lot more successful with the patches reverted but
> that in itself doesn't make sense. Neither patch affects how many pages should
> be free or reclaimable - just what order they are allocated and freed in.
>
> With both patches reverted, is the performance 100% reliable or does it 
> sometimes vary?
>   
It is 100% reliable now - reliably bad with plain 2.6.32 as well 
as reliably much better (e.g. +40% @ 16threads) with both reverted.

> If reverting 5f8dcc21 is required, I'm leaning towards believing that the
> additional memory overhead with that patch is enough to push this workload
> over the edge where entering direct reclaim is necessary a lot more to keep
> the zones over the min watermark. You mentioned early on that adding 64MB
> to the machine makes the problem go away. Do you know what the cut-off point
> is? For example, is adding 4MB enough?
>   
That was another one probably only seen due to the lack of good 
reproducibility in my old setup.
I made bigger memory scaling tests with the new setup. Therefore I ran 
the workload with 160 to 356 megabytes in 32mb steps (256 is the default 
in all other runs).
The result is that more than 256m memory only brings a slight benefit 
(which is ok as the load doesn't reread the data read into page cache 
and it just doesn't need much more).

Here some data about scaling memory normalized to the 256m memory values:
- deviation to 256m case -                  160m    192m    224m    256m    288m    320m    356m
plain 2.6.32                              +9.12% -55.11% -17.15%   =100%  +1.46%  +1.46%  +1.46%
2.6.32 - 5f8dcc21 and e084b reverted      +6.95%  -6.95%  -0.80%   =100%  +2.14%  +2.67%  +2.67%
------------------------------------------------------------------------------------------------
deviation between each other (all +)      60.64% 182.93%  63.44%  36.50%  37.41%  38.13%  38.13%
What can be seen:
a) more memory brings up to +2.67%/+1.46%, but not more when further 
increasing memory (reasonable as we don't reread cached files)
b) decreasing memory drops performance by up to -6.95% @ -64mb with both 
reverted, but down to -55% @ -64mb (compared to the already much lower 
256m throughput)
  -> plain 2.6.32 is much more sensitive to lower available memory AND 
always a level below
c) there is a weird but very reproducible improvement with even lower 
memory - very probably another issue or better another solution and not 
related here - but who knows :-)
  -> still both reverted is always much better than plain 2.6.32

> If a small amount of memory does help, I'd also test with min_free_kbytes
> at a lower value? If reducing it restores the performance, it again implies
> that memory usage is the real problem.
>   

I'll run a few tests with bigger/smaller numbers of min_free_kbytes just 
to be sure.
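For reference, this is just the normal sysctl knob - the values here are only
examples and not necessarily the ones I'll end up testing with:

cat /proc/sys/vm/min_free_kbytes            # current value
echo 1024 > /proc/sys/vm/min_free_kbytes    # example: lower it
echo 8192 > /proc/sys/vm/min_free_kbytes    # example: raise it
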
> Thing is, even if adding small amounts of memory or reducing
> min_free_kbytes helps, it does not explain why e084b ever made a
> difference because all that patch does is alter the ordering of pages on
> the free list. That might have some cache consequences but it would not
> impact direct reclaim like this.
>
>   
>> You can all clearly see that the patches somehow affect the ratio at  
>> which __alloc_pages_direct_reclaim runs into a race between try_to_free  
>> pages that could actually free something, but a few lines below can't  
>> get a page from the free list.
>>     
>
> There actually is potentially a significant delay from when direct
> reclaim returns and a new allocation attempt happens. However, it's been
> like this for a long time so I don't think it's the issue.
>   
Due to the increased percentage of progress, but !page in direct_reclaim 
I still think it is related.
I agree that "having" that potential race between try_to_free and 
get_page is not the issue, but I assume it is very likely that the 
patches change the timing there so that it happens much more often.
> I have a prototype patch that removes usage of congestion_wait() altogether
> and puts processes to sleep on a waitqueue waiting for watermarks to restore
> or a timeout. However, I think it would be just treating the symptoms here
> rather than the real issue.
>   
Anyway I'd be happy to test this patch if you could send it to me, 
because as long as we don't have a "real" solution I like such 
symptom-fixes as a start :-)
At least it would fix going into congestion_wait even without dirty 
pages or I/Os in flight - waiting for watermarks sounds much better to 
me independent of the current issue.
>> Outside in the function alloc_pages_slowpath this leads to a call to  
>> congestion_wait which is absolutely futile as there are absolutely no  
>> writes in flight to wait for.
>>
>> Now this kills effectively up to 80% of our throughput - Any idea of  
>> better understanding the link between the patches and the effect is  
>> welcome and might lead to a solution.
>>
>> FYI - I tried all patches you suggested - none of them affects this.
>>
>>     
>
> I'm still at a total loss to explain this. Memory overhead of the second
> patch is a vague possibility and worth checking out by slightly increasing
> available memory on the partition or reducing min_free_kbytes. It does not
> explain why the first patch makes a difference.
>   
In a discussion with Hannes Reinecke (hare@suse.de) he brought up that 
in a low memory scenario the ordering of the pages might twist.
Today we have the single list of pages - add hot ones to the head and 
cold to the tail. Both patches affect that list management - e084b 
changes the order of some, and 5f8dcc is splitting it per migrate type.
What now could happen is that you free pages and get a list like:
H1 - H2 - H3 - C3 - C2 - C1 (H is hot and C is cold)
In low mem scenarios it could now happen that an allocation for hot pages 
falls back to cold pages (as usual fallback), but couldn't it be that it 
then also gets the cold pages in reverse order again?
Without e084b it would be like:
H1 - H2 - H3 - C1 - C2 - C3 and therefore all at least in order (with 
the issue that cold allocations from the right are reverse -> e084b)
5f8dcc is now lowering the size of those lists by splitting it into 
different migration types, so maybe both together are increasing the 
chance to get such an issue.
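
To spell the ordering argument out, here is a purely illustrative shell toy
(not kernel code - head is on the left, tail on the right, and whether a hot
allocation really keeps taking from the head once the hot pages are used up
is exactly the hypothesis in question here):

# H1..H3 / C1..C3 stand for ascending PFNs within the hot and the cold group
with_e084b=(H1 H2 H3 C3 C2 C1)
without_e084b=(H1 H2 H3 C1 C2 C3)
# six hot allocations in a row - the last three fall back to cold pages
echo "with e084b:    ${with_e084b[*]}"     # ends C3 C2 C1 -> reverse PFN order
echo "without e084b: ${without_e084b[*]}"  # ends C1 C2 C3 -> PFNs stay in order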

I hope I explained that idea right, Hannes please feel free to correct 
and extend if needed.
I also added Nick on his request.
>> <SNIP>
>>     
>
>   

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-08 14:01                       ` Christian Ehrhardt
@ 2010-02-08 15:21                         ` Mel Gorman
  2010-02-08 16:55                           ` Mel Gorman
                                             ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Mel Gorman @ 2010-02-08 15:21 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin

On Mon, Feb 08, 2010 at 03:01:16PM +0100, Christian Ehrhardt wrote:
>
>
> Mel Gorman wrote:
>> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>>   
>>> I'll keep the old thread below as reference.
>>>
>>> After taking a round of ensuring reproducibility and a pile of new   
>>> measurements I now can come back with several new insights.
>>>
>>> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel   
>>> threads) with sequential read load and all that 4 times to find   
>>> potential noise. But since I changed to that load instead of random 
>>> read with one thread and ensuring the most memory is cleared (sync + 
>>> echo 3 >  /proc/sys/vm/drop_caches + a few sleeps) . The noise is now 
>>> down <2%.  For detailed questions about the setup feel free to ask me 
>>> directly as I  won't flood this thread too much with such details.
>>>
>>>     
>>
>> Is there any chance you have a driver script for the test that you could send
>> me? I'll then try reproducing based on that script and see what happens. I'm
>> not optimistic I'll be able to reproduce the problem because I think
>> it's specific to your setup but you never know.
>>   
>
> I don't have one as it runs in a bigger automated test environment, but  
> it is easy enough to write down something comparable.

I'd appreciate it, thanks.

>>> So in the past I identified git id e084b for bringing a huge   
>>> degradation, that is still true, but I need to revert my old 
>>> statement  that unapplying e084b on v2.6.32 helps - it simply 
>>> doesn't.
>>>
>>>     
>>
>> Hmm, ok. Odd that it used to work but now doesn't.
>>
>> How reproducible are these results with patch e084b reverted? i.e. I know
>> it does not work now, but did reverting on the old kernel always fix it
>> or were there occasions where the figures would be still bad?
>>   
>
> Reverting e084b in the past showed something like +40%, so I thought it  
> helps.
> Afterwards I found out it wasn't a good testcase for reproducibility and  
> when looking at a series it had +5%,-7%,+20%,...
> So by far too noisy to be useful for bisect, discussion or fix testing.
> Thats why I made my big round reinventing the testcases in a more  
> reproducible way like described above.
> In those the original issue is still very visible - which means 4 runs  
> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.

About all e084b can be affecting is cache hotness when fallback is occurring
but 60% regression still seems way too high for just that.

>>>> Another bisect (keeping e084b reverted) brought up git id 5f8dcc21 
>>> which  came in later. Both patches unapplied individually don't 
>>> improve  anything. But both patches reverted at the same time on git 
>>> v2.6.32  bring us back to our old values (remember that is more than 
>>> 2Gb/s  improvement in throughput)!
>>>
>>>> Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how this 
>>>  can cause so much trouble.
>>>     
>>
>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
>> workloads can suffer. A second is that users of high-order allocations can
>> enter direct reclaim a lot more than previously. This was fixed in commit
>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
>> your case. 
>>
>> The only other interesting thing in commit 5f8dcc21 is that it increases
>> the memory overhead of a per-cpu structure. If your memory usage is really
>> borderline, it might have been just enough to push it over the edge.
>>   
>
> I had this thoughts too, but as it only shows an effect with 5f8dcc21  
> and e084b reverted at the same time I'm wondering if it can be that.
> But I agree that this should be considered too until a verification can  
> be done.
>>   

Another consequence of 5f8dcc is that it is possible for pages to be on the
per-cpu lists of another migratetype when the page allocator is called. There
is potential that page reclaim is being entered as a result even though pages
were free on the per-cpu lists.

It would not explain why reverting e084b makes such a difference but I
can prototype a patch that falls back to other per-cpu lists before
calling the page allocator as it used to do but without linear searching
the per-cpu lists.

>>> In the past I identified congestion_wait as the point where the "time 
>>> is  lost" when comparing good and bad case. Fortunately at least this 
>>> is  still true when comparing 2.6.32 vs 2.6.32 with both patches 
>>> reverted.
>>> So I upgraded my old counter patches a bit and can now report the following:
>>>
>>> BAD CASE
>>>                                                                4 THREAD  READ               8 THREAD READ               16 THREAD READ         16THR % portions
>>> perf_count_congestion_wait                                        305                        1970                        8980
>>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                           0
>>> perf_count_call_congestion_wait_from_alloc_pages_slowpath         305                        1970                        8979                   100.00%
>>> perf_count_pages_direct_reclaim                                  1153                        6818                        3221
>>> perf_count_failed_pages_direct_reclaim                            305                        1556                        8979
>>>     
>>
>> Something is wrong with the counters there. For 16 threads, it says direct
>> reclaim was entered 3221 but it failed 8979 times. How does that work? The
>> patch you supplied below looks like a patch on top of another debugging patch.
>> Can you send the full debugging patch please?
>>   
>
> This is just a single 7 deleted accidentally when bringing it to a plain  
> text format for this mail. The percentage is fine and properly it would  
> look like:
>
>                                                               4 THREAD  READ               8 THREAD READ               16 THREAD READ         16THR % portions
>
> perf_count_congestion_wait                                        305                               1970                         8980
> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                                  0                            0
> perf_count_call_congestion_wait_from_alloc_pages_slowpath         305                               1970                         8979               100.00%
> perf_count_pages_direct_reclaim                                  1153                               6818                        32217
> perf_count_failed_pages_direct_reclaim                            305                               1556                         8979
> perf_count_failed_pages_direct_reclaim_but_progress               305                               1478                         8979               27.87%
>
>
> There were three patches attached in the last mail which should work in  
> this order:
>

Bah. I was reading just an automatically-inlined version and didn't notice
the other two attachments. Sorry.

> ~/kernels/linux-2.6$ git checkout -f v2.6.32
> ~/kernels/linux-2.6$ patch -p1 <  
> ~/Desktop/patches-git/perf-count-congestion-wait.diff
> patching file include/linux/sysctl.h
> patching file kernel/sysctl.c
> patching file mm/backing-dev.c
> ~/kernels/linux-2.6$ patch -p1 <  
> ~/Desktop/patches-git/perf-count-failed-direct-recalims.diff
> patching file kernel/sysctl.c
> patching file mm/page_alloc.c
> Hunk #1 succeeded at 1670 (offset 9 lines).
> Hunk #2 succeeded at 1709 (offset 9 lines).
> ~/kernels/linux-2.6$ patch -p1 <  
> ~/Desktop/patches-git/perf-count-congestion-wait-calls.diff
> patching file kernel/sysctl.c
> patching file mm/page_alloc.c
> Hunk #1 succeeded at 1672 (offset 9 lines).
> Hunk #2 succeeded at 1714 (offset 9 lines).
> Hunk #3 succeeded at 1738 (offset 9 lines).
> Hunk #4 succeeded at 1796 (offset 9 lines).
> Hunk #5 succeeded at 1913 (offset 9 lines).
>
> Offsets are due to e084b/5f8dcc21 being reverted or not, it works in  
> both cases.
> Let me know if they work out for you that way.
>>   
>>> perf_count_failed_pages_direct_reclaim_but_progress               305                        1478                        8979                   27.87%
>>>
>>> GOOD CASE WITH REVERTS                                          4 THREAD  READ               8 THREAD READ               16 THREAD READ        16THR % portions
>>> perf_count_congestion_wait                                         25                          76                         1114
>>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                            0
>>> perf_count_call_congestion_wait_from_alloc_pages_slowpath          25                          76                         1114               99.98%
>>> perf_count_pages_direct_reclaim                                  1054                        9150                        62297
>>> perf_count_failed_pages_direct_reclaim                             25                          64                         1114
>>> perf_count_failed_pages_direct_reclaim_but_progress                25                          57                         1114                1.79%
>>>
>>>
>>> I hope the format is kept, it should be good with every monospace viewer.
>>>     
>>
>>> It got mangled but I think I've fixed it above. The biggest thing I can see
>> is that direct reclaim is a lot more successful with the patches reverted but
>> that in itself doesn't make sense. Neither patch affects how many pages should
>> be free or reclaimable - just what order they are allocated and freed in.
>>
>>> With both patches reverted, is the performance 100% reliable or does it
>> sometimes vary?
>>   
>
> It is 100% percent reliable now - reliably bad with plain 2.6.32 as well  
> as reliably much better (e.g. +40% @ 16threads) with both reverted.
>

Ok.

>> If reverting 5f8dcc21 is required, I'm leaning towards believing that the
>> additional memory overhead with that patch is enough to push this workload
>> over the edge where entering direct reclaim is necessary a lot more to keep
>> the zones over the min watermark. You mentioned early on that adding 64MB
>> to the machine makes the problem go away. Do you know what the cut-off point
>> is? For example, is adding 4MB enough?
>>   
> That was another one probably only seen due to the lack of good  
> reproducibility in my old setup.

Ok.

> I made bigger memory scaling tests with the new setup. Therefore I ran  
> the workload with 160 to 356 megabytes in 32mb steps (256 is the default  
> in all other runs).
> The result is that more than 256m memory only brings a slight benefit  
> (which is ok as the load doesn't reread the data read into page cache  
> and it just doesn't need much more).
>
> Here some data about scaling memory normalized to the 256m memory values:
> - deviation to 256m case -                  160m    192m    224m    256m     288m    320m    356m
> plain 2.6.32                              +9.12% -55.11% -17.15%   =100%   +1.46%  +1.46%  +1.46%
> 2.6.32 - 5f8dcc21 and e084b reverted      +6.95%  -6.95%  -0.80%   =100%   +2.14%  +2.67%  +2.67%
> ------------------------------------------------------------------------------------------------
> deviation between each other (all +)      60.64% 182.93%  63.44%  36.50%   37.41%  38.13%  38.13%
> What can be seen:
> a) more memory brings up to +2.67%/+1.46%, but not more when further  
> increasing memory (reasonable as we don't reread cached files)
> b) decreasing memory drops performance by up to -6.95% @ -64mb with both  
> reverted, but down to -55% @ -64mb (compared to the already much lower  
> 256m throughput)
>   -> plain 2.6.32 is much more sensitive to lower available memory AND  
> always a level below
> c) there is a weird but very reproducible improvement with even lower  
> memory - very probably another issue or better another solution and not  
> related here - but who knows :-)
>   -> still both reverted is always much better than plain 2.6.32
>

Ok. Two avenues of attack then although both are focused on 5f8dcc21.
The first is the prototype below that should avoid congestion_wait. The
second is to reintroduce a fallback to other per-cpu lists and see if
that was a big factor.

>> If a small amount of memory does help, I'd also test with min_free_kbytes
>> at a lower value? If reducing it restores the performance, it again implies
>> that memory usage is the real problem.
>>   
>
> I'll run a few tests with bigger/smaller numbers of min_free_kbytes just  
> to be sure.
>> Thing is, even if adding small amounts of memory or reducing
>> min_free_kbytes helps, it does not explain why e084b ever made a
>> difference because all that patch does is alter the ordering of pages on
>> the free list. That might have some cache consequences but it would not
>> impact direct reclaim like this.
>>
>>   
>>> You can all clearly see that the patches somehow affect the ratio at  
>>> which __alloc_pages_direct_reclaim runs into a race between 
>>> try_to_free  pages that could actually free something, but a few 
>>> lines below can't  get a page from the free list.
>>>     
>>
>> There actually is potentially a significant delay from when direct
>> reclaim returns and a new allocation attempt happens. However, it's been
>> like this for a long time so I don't think it's the issue.
>>   
> Due to the increased percentage of progress, but !page in direct_reclaim  
> I still think it is related.

It probably is, yes, but unfortunately e084b has very little to do with
that logic, although 5f8dcc21 might indirectly be affecting it.

> I agree that "having" that potential race between try_to_free and  
> get_page is not the issue, but I assume it is very likely that the  
> patches change the timing there so that it happens much more often.
>

Patch e084b would make very little difference to the timing of events
though except from a cache perspective but the slowdown is too severe
for that.

>> I have a prototype patch that removes usage of congestion_wait() altogether
>> and puts processes to sleep on a waitqueue waiting for watermarks to restore
>> or a timeout. However, I think it would be just treating the symptoms here
>> rather than the real issue.
>>   
>
> Anyway I'd be happy to test this patch if you could sent it to me,  
> because as long as we don't have a "real" solution I like such  
> symptom-fixes as a start :-)
> At least it would fix going into congestion_wait even without dirty  
> pages or I/O's in flight - waiting for watermarks sounds much better to  
> me independent to the current issue.

Hmm, the patch as it currently stands is below. However, I warn you that
it has only been boot-tested on qemu with no proper testing done, either
functional or performance testing. It may just go mad, drink your beer and
chew on your dog.

>>> Outside in the function alloc_pages_slowpath this leads to a call to  
>>> congestion_wait which is absolutely futile as there are absolutely no 
>>>  writes in flight to wait for.
>>>
>>> Now this kills effectively up to 80% of our throughput - Any idea of  
>>> better understanding the link between the patches and the effect is   
>>> welcome and might lead to a solution.
>>>
>>> FYI - I tried all patches you suggested - none of them affects this.
>>>
>>>     
>>
>> I'm still at a total loss to explain this. Memory overhead of the second
>> patch is a vague possibility and worth checking out by slightly increasing
>> available memory on the partition or reducing min_free_kbytes. It does not
>> explain why the first patch makes a difference.
>>   
> In a discussion with Hannes Reinecke (hare@suse.de) he brought up that  
> in a low memory scenario the ordering of the pages might twist.
> Today we have the single list of pages - add hot ones to the head and  
> cold to the tail. Both patches affect that list management - e084b  
> changes the order of some, and 5f8dcc is splitting it per migrate type.

Correct on both counts.

> What now could happen is that you free pages and get a list like:
> H1 - H2 - H3 - C3 - C2 - C1 (H is hot and C is cold)
> In low mem scenarios it could now happen that an allocation for hot pages  
> falls back to cold pages (as usual fallback), but couldn't it be that it  
> then also gets the cold pages in reverse order again?

Yes, this is true but that would only matter from a cache perspective
as you say your hardware and drivers are doing no automatic merging of
requests.

> Without e084b it would be like:
> H1 - H2 - H3 - C1 - C2 - C3 and therefore all at least in order (with  
> the issue that cold allocations from the right are reverse -> e084b)
> 5f8dcc is now lowering the size of those lists by splitting it into  
> different migration types, so maybe both together are increasing the  
> chance to get such an issue.
>

There is a good chance that 5f8dcc is causing direct reclaim to be entered
when it could have been avoided as pages were on the per-cpu lists but
I don't think the difference in cache hotness is enough to explain a 60%
loss on your machine.

> I hope I explained that idea right, Hannes please feel free to correct  
> and extend if needed.
> I also added Nick on his request.

The prototype patch for avoiding congestion_wait is below. I'll start
work on a fallback-to-other-percpu-lists patch.


diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..72465c1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -398,6 +398,9 @@ struct zone {
 	unsigned long		wait_table_hash_nr_entries;
 	unsigned long		wait_table_bits;
 
+	/* queue for processes waiting for pressure to relieve */
+	wait_queue_head_t	*pressure_wq;
+
 	/*
 	 * Discontig memory support fields.
 	 */
diff --git a/mm/internal.h b/mm/internal.h
index 6a697bb..d447d57 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -251,6 +251,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 #define ZONE_RECLAIM_SUCCESS	1
 #endif
 
+extern void check_zone_pressure(struct zone *zone);
+extern long zonepressure_wait(struct zone *zone, long timeout);
+
 extern int hwpoison_filter(struct page *p);
 
 extern u32 hwpoison_filter_dev_major;
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..c43d068 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
+#include <linux/sched.h>
 
 struct pglist_data *first_online_pgdat(void)
 {
@@ -87,3 +88,42 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+void check_zone_pressure(struct zone *zone)
+{
+	/* If no process is waiting, nothing to do */
+	if (!waitqueue_active(zone->pressure_wq))
+		return;
+
+	/* Check if the high watermark is ok for order 0 */
+	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
+		wake_up_interruptible(zone->pressure_wq);
+}
+
+/**
+ * zonepressure_wait - Wait for pressure on a zone to ease off
+ * @zone: The zone that is expected to be under pressure
+ * @timeout: Wait until pressure is relieved or this timeout is reached
+ *
+ * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
+ * It's considered to be relieved if any direct reclaimer or kswapd brings
+ * the zone above the high watermark
+ */
+long zonepressure_wait(struct zone *zone, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
+
+	/*
+	 * XXX: Is io_schedule_timeout() really appropriate here? Maybe not
+	 * because it'll get accounted for as IO waiting where that is not
+	 * really happening any more. Revisit what sort of wait this should
+	 * be
+	 */
+	ret = io_schedule_timeout(timeout);
+	finish_wait(zone->pressure_wq, &wait);
+
+	return ret;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..e80de7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1734,8 +1734,10 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
 			preferred_zone, migratetype);
 
-		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+		if (!page && gfp_mask & __GFP_NOFAIL) {
+			/* If still failing, wait for pressure on zone to relieve */
+			zonepressure_wait(preferred_zone, HZ/50);
+		}
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -1905,8 +1907,8 @@ rebalance:
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
-		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		/* Too much pressure, back off a bit at let reclaimers do work */
+		zonepressure_wait(preferred_zone, HZ/50);
 		goto rebalance;
 	}
 
@@ -3254,6 +3256,38 @@ int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 	return 0;
 }
 
+static noinline __init_refok
+void zone_pressure_wq_cleanup(struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	size_t free_size = sizeof(wait_queue_head_t);
+
+	if (!slab_is_available())
+		free_bootmem_node(pgdat, __pa(zone->pressure_wq), free_size);
+	else
+		kfree(zone->pressure_wq);
+}
+
+static noinline __init_refok
+int zone_pressure_wq_init(struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	size_t alloc_size = sizeof(wait_queue_head_t);
+
+	if (!slab_is_available())
+		zone->pressure_wq = (wait_queue_head_t *)
+			alloc_bootmem_node(pgdat, alloc_size);
+	else
+		zone->pressure_wq = kmalloc(alloc_size, GFP_KERNEL);
+
+	if (!zone->pressure_wq)
+		return -ENOMEM;
+
+	init_waitqueue_head(zone->pressure_wq);
+
+	return 0;
+}
+
 static int __zone_pcp_update(void *data)
 {
 	struct zone *zone = data;
@@ -3306,9 +3340,15 @@ __meminit int init_currently_empty_zone(struct zone *zone,
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret;
-	ret = zone_wait_table_init(zone, size);
+
+	ret = zone_pressure_wq_init(zone);
 	if (ret)
 		return ret;
+	ret = zone_wait_table_init(zone, size);
+	if (ret) {
+		zone_pressure_wq_cleanup(zone);
+		return ret;
+	}
 	pgdat->nr_zones = zone_idx(zone) + 1;
 
 	zone->zone_start_pfn = zone_start_pfn;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..7a9aaae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1709,6 +1709,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		}
 
 		shrink_zone(priority, zone, sc);
+		check_zone_pressure(zone);
 	}
 }
 
@@ -1954,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
 {
 	int all_zones_ok;
 	int priority;
@@ -2080,8 +2081,10 @@ loop_again:
 			 * zone has way too many pages free already.
 			 */
 			if (!zone_watermark_ok(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					8*high_wmark_pages(zone), end_zone, 0)) {
 				shrink_zone(priority, zone, &sc);
+			}
+			check_zone_pressure(zone);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);
@@ -2120,8 +2123,11 @@ loop_again:
 		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
 			if (has_under_min_watermark_zone)
 				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-			else
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else {
+				prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
+				schedule_timeout(HZ/10);
+				finish_wait(&pgdat->kswapd_wait, wait);
+			}
 		}
 
 		/*
@@ -2270,7 +2276,7 @@ static int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret)
-			balance_pgdat(pgdat, order);
+			balance_pgdat(pgdat, &wait, order);
 	}
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-08 15:21                         ` Mel Gorman
@ 2010-02-08 16:55                           ` Mel Gorman
  2010-02-09  6:23                           ` Christian Ehrhardt
  2010-02-09 15:52                           ` Christian Ehrhardt
  2 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2010-02-08 16:55 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin

> <SNIP>
> The prototype patch for avoiding congestion_wait is below. I'll start
> work on a fallback-to-other-percpu-lists patch.
> 

And here is the prototype of the fallback-to-other-percpu-lists patch.
I'm afraid I've only managed to test it on qemu. My three test machines are
still occupied :(

==== CUT HERE ====
page allocator: Fallback to other per-cpu lists when the target list is empty and memory is low

When a per-cpu list of pages for a given migratetype is empty, the page
allocator is called to refill the PCP list. It's possible when memory
is low that this results in the process entering direct reclaim even
if it wasn't strictly necessary because there were pages free for other
migratetypes. Unconditionally falling back to other PCP lists hurts the
fragmentation-avoidance strategy which is also undesirable.

When the desired PCP list is empty, this patch checks how many free pages
there are on the PCP lists and if refilling the list could result in direct
reclaim. If direct reclaim is unlikely, the PCP list is refilled to maintain
fragmentation-avoidance. Otherwise, a page from an alternative PCP list is
chosen to maintain performance and avoid direct reclaim.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   37 ++++++++++++++++++++++++++++++++++---
 1 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..009d683 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,6 +1168,39 @@ void split_page(struct page *page, unsigned int order)
 		set_page_refcounted(page + i);
 }
 
+/* Decide whether to find an alternative PCP list or refill */
+static struct list_head *pcp_fallback(struct zone *zone,
+			struct per_cpu_pages *pcp, 
+			int start_migratetype, int cold)
+{
+	int i;
+	int migratetype;
+	struct list_head *list;
+	long free_pages = zone_page_state(zone, NR_FREE_PAGES) - pcp->batch;
+
+	/* 
+	 * Find a PCPU list with free pages in the same order as
+	 * fragmentation-avoidance fallback in the event that refilling
+	 * the PCP list may result in direct reclaim
+	 */
+	if (pcp->count && free_pages <= low_wmark_pages(zone)) {
+		for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
+			migratetype = fallbacks[start_migratetype][i];
+			list = &pcp->lists[migratetype];
+
+			if (!list_empty(list))
+				return list;
+		}
+	}
+
+	/* Alternatively, we need to allocate more memory to the PCP lists */
+	list = &pcp->lists[start_migratetype];
+	pcp->count += rmqueue_bulk(zone, 0, pcp->batch, list,
+					migratetype, cold);
+
+	return list;
+}
+
 /*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
@@ -1193,9 +1226,7 @@ again:
 		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
 		if (list_empty(list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, list,
-					migratetype, cold);
+			list = pcp_fallback(zone, pcp, migratetype, cold);
 			if (unlikely(list_empty(list)))
 				goto failed;
 		}

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-08 15:21                         ` Mel Gorman
  2010-02-08 16:55                           ` Mel Gorman
@ 2010-02-09  6:23                           ` Christian Ehrhardt
  2010-02-09 15:52                           ` Christian Ehrhardt
  2 siblings, 0 replies; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-09  6:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin



Mel Gorman wrote:
> On Mon, Feb 08, 2010 at 03:01:16PM +0100, Christian Ehrhardt wrote:
>>
>> Mel Gorman wrote:
>>> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>>>   
>>>> I'll keep the old thread below as reference.
>>>>
>>>> After taking a round of ensuring reproducibility and a pile of new   
>>>> measurements I now can come back with several new insights.
>>>>
>>>> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel   
>>>> threads) with sequential read load and all that 4 times to find   
>>>> potential noise. But since I changed to that load instead of random 
>>>> read with one thread and ensuring the most memory is cleared (sync + 
>>>> echo 3 >  /proc/sys/vm/drop_caches + a few sleeps) . The noise is now 
>>>> down <2%.  For detailed questions about the setup feel free to ask me 
>>>> directly as I  won't flood this thread too much with such details.
>>>>
>>>>     
>>> Is there any chance you have a driver script for the test that you could send
>>> me? I'll then try reproducing based on that script and see what happens. I'm
>>> not optimistic I'll be able to reproduce the problem because I think
>>> it's specific to your setup but you never know.
>>>   
>> I don't have one as it runs in a bigger automated test environment, but  
>> it is easy enough to write down something comparable.
> 
> I'd appreciate it, thanks.
> 

Testing of your two patches starts in a few minutes, thanks in advance.

Here is the info on how to execute the core of the test - I cross my fingers that anyone else can reproduce it that way :-)

I use it in a huge automation framework which takes care of setting up the system, disks, gathering statistics and so on, but it essentially comes down to something simple like this:

#!/bin/bash
# reboot your system with 256m
# attach 16 disks (usually up to 64, but 16 should be enough to show the issue)
# mount your disks at /mnt/subw0, /mnt/subw1, ...
for i in 4 8 16 4 8 16 4 8 16 4 8 16
do
        sync; sleep 10s; echo 3 > /proc/sys/vm/drop_caches; sleep 2s;
        iozone -s 2000m -r 64k -t $i  -e -w -R -C  -i 0 -F /mnt/subw0 /mnt/subw1 /mnt/subw2 /mnt/subw3 /mnt/subw4 /mnt/subw5 /mnt/subw6 /mnt/subw7 /mnt/subw8 /mnt/subw9 /mnt/subw10 /mnt/subw11 /mnt/subw12 /mnt/subw13 /mnt/subw14 /mnt/subw15
        sync; sleep 10s; echo 3 > /proc/sys/vm/drop_caches; sleep 2s;
        iozone -s 2000m -r 64k -t $i  -e -w -R -C  -i 1 -F /mnt/subw0 /mnt/subw1 /mnt/subw2 /mnt/subw3 /mnt/subw4 /mnt/subw5 /mnt/subw6 /mnt/subw7 /mnt/subw8 /mnt/subw9 /mnt/subw10 /mnt/subw11 /mnt/subw12 /mnt/subw13 /mnt/subw14 /mnt/subw15
done
# while we could reduce the number of writes to one 16 thread write I use it that way as it is more similar to our original load (makes no difference anyway)

[...]
-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-08 15:21                         ` Mel Gorman
  2010-02-08 16:55                           ` Mel Gorman
  2010-02-09  6:23                           ` Christian Ehrhardt
@ 2010-02-09 15:52                           ` Christian Ehrhardt
  2010-02-09 17:57                             ` Mel Gorman
  2 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-09 15:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin



Mel Gorman wrote:
> On Mon, Feb 08, 2010 at 03:01:16PM +0100, Christian Ehrhardt wrote:
>>
>> Mel Gorman wrote:
>>> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>>>   

[...]

>>> How reproducible are these results with patch e084b reverted? i.e. I know
>>> it does not work now, but did reverting on the old kernel always fix it
>>> or were there occasions where the figures would be still bad?
>>>   
>> Reverting e084b in the past showed something like +40%, so I thought it  
>> helps.
>> Afterwards I found out it wasn't a good testcase for reproducibility and  
>> when looking at a series it had +5%,-7%,+20%,...
>> So by far too noisy to be useful for bisect, discussion or fix testing.
>> Thats why I made my big round reinventing the testcases in a more  
>> reproducible way like described above.
>> In those the original issue is still very visible - which means 4 runs  
>> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
>> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
> 
> About all e084b can be affecting is cache hotness when fallback is occurring
> but 60% regression still seems way too high for just that.

Yes, most of the ideas you, I and a few others had so far went for cache hotness - but the summary was always "couldn't be that much".

On the other hand we have to consider that it could be a small timing difference due to cache or whatever else that leads into congestion_wait, which then waits HZ/50. That way a 1µs cause can lead to 20001µs lost, which can then cause the 60%.
Another important thing is that "cold page" afaik refers to the page content, and not the management struct etc. As the time is lost in management only (=> before accessing the actual 4k page content) it should not matter to the mm code itself whether a page is cold/hot.
That much for direct_reclaim, which either runs into "progress, but !page" (BAD case) or not (GOOD case) - there only cache effects of struct page, pcp lists etc. should have an effect.
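
To put a rough number on that (purely a back-of-envelope sketch that assumes every congestion_wait really sleeps for its full timeout, which is not guaranteed): HZ/50 jiffies is 1/50 s = 20 ms regardless of HZ, and the bad case above logged 8980 such waits at 16 threads.

awk 'BEGIN { printf "%.0f s spent sleeping in the worst case\n", 8980 * 20 / 1000 }'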

>>>> Another bisect (keeping e084b reverted) brought up git id 5f8dcc21 
>>>> which  came in later. Both patches unapplied individually don't 
>>>> improve  anything. But both patches reverted at the same time on git 
>>>> v2.6.32  bring us back to our old values (remember that is more than 
>>>> 2Gb/s  improvement in throughput)!
>>>>
>>>> Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how this 
>>>>  can cause so much trouble.
>>>>     
>>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
>>> workloads can suffer. A second is that users of high-order allocations can
>>> enter direct reclaim a lot more than previously. This was fixed in commit
>>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
>>> your case. 
>>>
>>> The only other interesting thing in commit 5f8dcc21 is that it increases
>>> the memory overhead of a per-cpu structure. If your memory usage is really
>>> borderline, it might have been just enough to push it over the edge.
>>>   
>> I had this thoughts too, but as it only shows an effect with 5f8dcc21  
>> and e084b reverted at the same time I'm wondering if it can be that.
>> But I agree that this should be considered too until a verification can  
>> be done.
>>>   
> 
> Another consequence of 5f8dcc is that it is possible for pages to be on the
> per-cpu lists of another migratetype when the page allocator is called. There
> is potential that page reclaim is being entered as a result even though pages
> were free on the per-cpu lists.
> 
> It would not explain why reverting e084b makes such a difference but I
> can prototype a patch that falls back to other per-cpu lists before
> calling the page allocator as it used to do but without linear searching
> the per-cpu lists.

I tested the patch you submitted in another reply.
In short - it only gives ~+4%, probably by speeding things up a bit, but it doesn't touch the issue in any way.
Some more detail below at the other test result.

[...]

>>
>>                                                               4 THREAD  READ               8 THREAD READ               16 THREAD READ         16THR % portions
>>
>> perf_count_congestion_wait                                        305                               1970                         8980
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                                  0                            0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath         305                               1970                         8979               100.00%
>> perf_count_pages_direct_reclaim                                  1153                               6818                        32217
>> perf_count_failed_pages_direct_reclaim                            305                               1556                         8979
>> perf_count_failed_pages_direct_reclaim_but_progress               305                               1478                         8979               27.87%
>>
[...]
>>>   
>>>> perf_count_failed_pages_direct_reclaim_but_progress               305                        1478                        8979                   27.87%
>>>>
>>>> GOOD CASE WITH REVERTS                                          4 THREAD  READ               8 THREAD READ               16 THREAD READ        16THR % portions
>>>> perf_count_congestion_wait                                         25                          76                         1114
>>>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0                           0                            0
>>>> perf_count_call_congestion_wait_from_alloc_pages_slowpath          25                          76                         1114               99.98%
>>>> perf_count_pages_direct_reclaim                                  1054                        9150                        62297
>>>> perf_count_failed_pages_direct_reclaim                             25                          64                         1114
>>>> perf_count_failed_pages_direct_reclaim_but_progress                25                          57                         1114                1.79%
>>>>
>>>>
>>>> I hope the format is kept, it should be good with every monospace viewer.
>>>>     
>>> It got manged but I think I've it fixed above. The biggest thing I can see
>>> is that direct reclaim is a lot more successful with the patches reverted but
>>> that in itself doesn't make sense. Neither patch affects how many pages should
>>> be free or reclaimable - just what order they are allocated and freed in.
>>>
>>> With botgh patches reverted, is the performance 100% reliable or does it
>>> sometimes vary?
>>>   
>> It is 100% percent reliable now - reliably bad with plain 2.6.32 as well  
>> as reliably much better (e.g. +40% @ 16threads) with both reverted.
>>
> 
> Ok.
> 
>>> If reverting 5f8dcc21 is required, I'm leaning towards believing that the
>>> additional memory overhead with that patch is enough to push this workload
>>> over the edge where entering direct reclaim is necessary a lot more to keep
>>> the zones over the min watermark. You mentioned early on that adding 64MB
>>> to the machine makes the problem go away. Do you know what the cut-off point
>>> is? For example, is adding 4MB enough?
>>>   
>> That was another one probably only seen due to the lack of good  
>> reproducibility in my old setup.
> 
> Ok.
> 
>> I made bigger memory scaling tests with the new setup. Therefore I ran  
>> the workload with 160 to 356 megabytes in 32mb steps (256 is the default  
>> in all other runs).
>> The result is that more than 256m memory only brings a slight benefit  
>> (which is ok as the load doesn't reread the data read into page cache  
>> and it just doesn't need much more).
>>
>> Here some data about scaling memory normalized to the 256m memory values:
>> - deviation to 256m case -                  160m    192m    224m    256m     288m    320m    356m
>> plain 2.6.32                              +9.12% -55.11% -17.15%   =100%   +1.46%  +1.46%  +1.46%
>> 2.6.32 - 5f8dcc21 and e084b reverted      +6.95%  -6.95%  -0.80%   =100%   +2.14%  +2.67%  +2.67%
>> ------------------------------------------------------------------------------------------------
>> deviation between each other (all +)      60.64% 182.93%  63.44%  36.50%   37.41%  38.13%  38.13%
>> What can be seen:
>> a) more memory brings up to +2.67%/+1.46%, but not more when further  
>> increasing memory (reasonable as we don't reread cached files)
>> b) decreasing memory drops performance by up to -6.95% @ -64mb with both  
>> reverted, but down to -55% @ -64mb (compared to the already much lower  
>> 256m throughput)
>>   -> plain 2.6.32 is much more sensible to lower available memory AND  
>> always a level below
>> c) there is a weird but very reproducible improvement with even lower  
>> memory - very probably another issue or better another solution and not  
>> related here - but who knows :-)
>>   -> still both reverted is always much better than plain 2.6.32
>>
> 
> Ok. Two avenues of attack then although both are focused on 5f8dcc21.
> The first is the prototype below that should avoid congestion_wait. The
> second is to reintroduce a fallback to other per-cpu lists and see if
> that was a big factor.

I tested both of your patches, btw thanks for the quick work!


[...]

> 
> Hmm, the patch as it currently stands is below. However, I warn you that
> it has only been boot-tested on qemu with no proper testing doing, either
> functional or performance testing. It may just go mad, drink your beer and
> chew on your dog.

Contrary to all your warnings it worked pretty well.
The patch replacing congestion_wait calls - which we know are futile in my scenario
anyway - with zone_wait calls helps a lot. But, as you suggested, so far only by alleviating the symptoms.

From a throughput perspective alone it looks almost fixed:
                                                    4 thread      8 thread    16 thread
plain 2.6.32                                        100%          100%        100%
2.6.32 avoid direct reclaim                         103.85%       103.60%     104.04%
2.6.32 zone wait                                    125.46%       140.87%     131.89%
2.6.32 zone wait + 5f8dcc21 and e084b reverted      124.46%       143.35%     144.21%

Looking at the counters, this theory (only fixing symptoms) seems to be confirmed. Before, the ratio of perf_count_pages_direct_reclaim calls that ended in perf_count_failed_pages_direct_reclaim_but_progress was ~2% in the good cases and ~30% in the bad cases. As congestion_wait in my case almost always waits the full HZ/50, this is a lot.
With the zone wait patch I see a huge throughput win, most probably because all the (still ~30%!) waits no longer have to sleep for the full timeout.
But reverting 5f8dcc21 and e084b still fixes the issue of so many direct reclaims running into that condition, giving throughput another final push.

                                                      % ran into try_to_free did progress but get_page got !page
plain 2.6.32                                          30.16%
2.6.32 zone wait                                      29.05%
2.6.32 zone wait + 5f8dcc21 and e084b reverted        2.93%

Note - with the zone wait patch there are also almost twice as many direct_reclaim calls, probably because several waiters come off the list when the watermark is restored and allocate too much.

Another good side note - 5f8dcc21 and e084b reverted plus your zone wait patch is better than anything I have seen so far :-)
Maybe it's worth measuring both of your patches plus 5f8dcc21 and e084b reverted with 64 threads (the effect grows with the number of threads), just to rejoice in huge throughput values for a short time.

>>>> Outside in the function alloc_pages_slowpath this leads to a call to  
>>>> congestion_wait which is absolutely futile as there are absolutely no 
>>>>  writes in flight to wait for.
>>>>
>>>> Now this kills effectively up to 80% of our throughput - Any idea of  
>>>> better understanding the link between the patches and the effect is   
>>>> welcome and might lead to a solution.
>>>>
>>>> FYI - I tried all patches you suggested - none of them affects this.
>>>>
>>>>     
>>> I'm still at a total loss to explain this. Memory overhead of the second
>>> patch is a vague possibility and worth checking out by slightly increasing
>>> available memory on the partition or reducing min_free_kbytes. It does not
>>> explain why the first patch makes a difference.
>>>   
>> In a discussion with Hannes Reinecke (hare@suse.de) he brought up that  
>> in a low memory scenario the ordering of the pages might twist.
>> Today we have the single linst of pages - add hot ones to the head and  
>> cold to the tail. Both patches affect that list management - e084b  
>> changes the order of some, and 5f8dcc is splitting it per migrate type.
> 
> Correct on both counts.
> 
>> What now could happen is that you free pages and get a list like:
>> H1 - H2 - H3 - C3 - C2 - C1 (H is hot and C is cold)
>> In low mem scenarios it could now happen that a allocation for hot pages  
>> falls back to cold pages (as usual fallback), but couldnt it be that it  
>> then also gets the cold pages in reverse order again ?
> 
> Yes, this is true but that would only matter from a cache perspective
> as you say your hardware and drivers are doing no automatic merging of
> requests.
> 
>> Without e084b it would be like:
>> H1 - H2 - H3 - C1 - C2 - C3 and therefore all at least in order (with  
>> the issue that cold allocations from the right are reverse -> e084b)
>> 5f8dcc is now lowering the size of those lists by splitting it into  
>> different migration types, so maybe both together are increasing the  
>> chance to get such an issue.
>>
> 
> There is a good chance that 5f8dcc is causing direct reclaim to be entered
> when it could have been avoided as pages were on the per-cpu lists but
> I don't think the difference in cache hotness is enough to explain a 60%
> loss on your machine.
> 

As far as I understand, that's what your patch #2 fixed (page allocator: Fallback to other per-cpu lists when the target list is empty and memory is low), but as mentioned above that gave me +4%.
Which is actually a fine win, but not for the issue currently in discussion.
As a bonus I can now give you a little Tested-by for both patches :-)

[...]


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-09 15:52                           ` Christian Ehrhardt
@ 2010-02-09 17:57                             ` Mel Gorman
  2010-02-11 16:11                               ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2010-02-09 17:57 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin

> >>> How reproducible are these results with patch e084b reverted? i.e. I know
> >>> it does not work now, but did reverting on the old kernel always fix it
> >>> or were there occasions where the figures would be still bad?
> >>>   
> >> Reverting e084b in the past showed something like +40%, so I thought it  
> >> helps.
> >> Afterwards I found out it wasn't a good testcase for reproducibility and  
> >> when looking at a series it had +5%,-7%,+20%,...
> >> So by far too noisy to be useful for bisect, discussion or fix testing.
> >> Thats why I made my big round reinventing the testcases in a more  
> >> reproducible way like described above.
> >> In those the original issue is still very visible - which means 4 runs  
> >> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
> >> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
> > 
> > About all e084b can be affecting is cache hotness when fallback is occuring
> > but 60% regression still seems way too high for just that.
> 
> Yes, most the ideas you, I an a few others so far had went for cache
> hotness - but always the summary was "couldn't be that much".

Well, it's cache hotness or the driver or hardware doing automatic merging
of requests (which you say is not happening). There are two other
possibilities but their effect should be very low as well.

> On the other hand we have to consider that it could be a small timing due
> to cache or whatever else which leads into congestion_wait and then it waits
> HZ/50. That way a 1µs cause can lead to 20001µs lost which then cause 60%.

Maybe although it would imply the window where it can go wrong is
very small. It seems unlikely that it would be hit every time.

> Another important thing is that "cold page" afaik means the page content,
> and not the management struct etc. As the time lost is in management only
> (=> before accessing the actual 4k page content) it should not be important
> if a page is cold/hot for the mm code itself.

True. Even the zeroing should be a non-temporal store so the timing
should not be affected.

> That much for direct_reclaim that either runs into progress, but !page
> (BAD case) or not (GOOD case) - there only cache effects of struct page,
> pcp lists etc should have an effect.

There are only two other possibilities I can think of.

1. The extra parameter to rmqueue_bulk() means that your platform
   generates particularly bad code to pass a 6th parameter.
2. The branch inside the for loop is causing serious problems somehow

Causing a 60% slowdown from either of these seems unlikely but there isn't
much choice other than to eliminate them as possibilities. The attached patch is
for "option 1" above. I'll think more on how best to handle "option 2".

> >>>> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21 
> >>>> which  came in later. Both patches unapplied individually don't 
> >>>> improve  anything. But both patches reverted at the same time on git 
> >>>> v2.6.32  bring us back to our old values (remember that is more than 
> >>>> 2Gb/s  improvement in throughput)!
> >>>>
> >>>> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this 
> >>>>  can cause so much trouble.
> >>>>     
> >>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
> >>> workloads can suffer. A second is that users of high-order allocations can
> >>> enter direct reclaim a lot more than previously. This was fixed in commit
> >>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
> >>> your case. 
> >>>
> >>> The only other interesting thing in commit 5f8dcc21 is that it increases
> >>> the memory overhead of a per-cpu structure. If your memory usage is really
> >>> borderline, it might have been just enough to push it over the edge.
> >>>   
> >> I had this thoughts too, but as it only shows an effect with 5f8dcc21  
> >> and e084b reverted at the same time I'm wondering if it can be that.
> >> But I agree that this should be considered too until a verification can  
> >> be done.
> > 
> > Another consequence of 5f8dcc is that pages are on the per-cpu lists that
> > it is possible for pages to be on the per-cpu lists when the page allocator
> > is called. There is potential that page reclaim is being entered as a
> > result even though pages were free on the per-cpu lists.
> > 
> > It would not explain why reverting e084b makes such a difference but I
> > can prototype a patch that falls back to other per-cpu lists before
> > calling the page allocator as it used to do but without linear searching
> > the per-cpu lists.
> 
> I tested these patch you submitted in another reply.
> Short - it only gives it ~+4% by probably speeding up things a bit but
> not touching the issue in any way.
> Some more detail below at the other test result.
> 

You mention it gains 4% which is not great, but what impact, if any, had
it on the number of calls to direct reclaim?

> <SNIP>
> > Ok. Two avenues of attack then although both are focused on 5f8dcc21.
> > The first is the prototype below that should avoid congestion_wait. The
> > second is to reintroduce a fallback to other per-cpu lists and see if
> > that was a big factor.
> 
> I tested both of your patches, btw thanks for the quick work!
> 

Thanks for persisting with this! I know it's dragging on a lot longer
than anyone would like.

> > Hmm, the patch as it currently stands is below. However, I warn you that
> > it has only been boot-tested on qemu with no proper testing doing, either
> > functional or performance testing. It may just go mad, drink your beer and
> > chew on your dog.
> 
> In opposition to all your warnings it worked pretty well.

Nice one.

> The patch replacing congestion_wait calls - which we know re futile in my scenario
> anyway - by zone_wait calls helps a lot. But so far as you suggested only by alleviating the symptoms.
> 
> From a throughput perspective alone it looks almost fixed:
>                                                     4 thread      8 thread    16 thread
> plain 2.6.32                                        100%          100%        100%
> 2.6.32 avoid direct reclaim                         103.85%       103.60%     104.04%
> 2.6.32 zone wait                                    125.46%       140.87%     131.89%
> 2.6.32 zone wait + 5f8dcc21 and e084b reverted      124.46%       143.35%     144.21%
> 

Ok, the fact that the congestion_wait-avoidance patch makes such a large
difference is very interesting, although if the direct reclaim calls were
lower, it would be less pronounced. I will keep those patches on the
back-burner for another while in case we cannot resolve this problem a better
way, but it's nice to know they may have some potential.

> Looking at the counters this theory (only fixing symptoms) seems to be
> confirmed. Before the ratio of perf_count_pages_direct_reclaim calls that
> ended in perf_count_failed_pages_direct_reclaim_but_progress was ~2% in the
> good cases and ~30% in bad cases. As congestion_wait in my case almost always
> waits the full HZ/50 this is a lot.

> With the zone wait patch I see a huge throughput win, most probably by
> all the (still ~30%!) waits that don't have to wait for the full timeout.

That would be my expectation. The underlying problem of
too-many-direct-reclaims is not solved.

> But still reverting 5f8dcc21 and e084b fix the issue of so much direct
> reclaims running in into that condition and that way giving throughput
> another last push.
>                                                       % ran into try_to_free did progress but get_page got !page
> plain 2.6.32                                          30.16%
> 2.6.32 zone wait                                      29.05%
> 2.6.32 zone wait + 5f8dcc21 and e084b reverted        2.93%
> 

And what impact did "Fallback to other per-cpu lists when the target list
is empty and memory is low" have on the number of calls to direct reclaim?

> Note - with the zone wait patch there are also almost twice as much
> direct_reclaim calls. Probably by several waiters coming of the list when
> the watermark is restored and allocating too much.

It could be either a thundering herd problem or just the fact that the
processes are not asleep.

> Another good side note - 5f8dcc21 and e084b reverted plus your zone wait
> patch is better than anything I have seen so far :-)

Well, that's something positive out of all this. The congestion_wait() patch
was developed in response to this thread. Even though I knew it was treating
symptoms and not problems, congestion_wait() was identified as a bad
idea some time ago because the lack of pages may have nothing to do with IO.

> Maybe its worth to measure  both of your patches plus 5f8dcc21 and e084b
> reverted with 64 threads (the effect grows with # threads) just to rejoice
> in huge throughput values for a short time.

:)

> > <SNIP>
> > 
> > There is a good chance that 5f8dcc is causing direct reclaim to be entered
> > when it could have been avoided as pages were on the per-cpu lists but
> > I don't think the difference in cache hotness is enough to explain a 60%
> > loss on your machine.
> > 
> 
> As far as I get it thats what your patch #2 fixed (page allocator: Fallback
> to other per-cpu lists when the target list is empty and memory is low),
> but as mentioned below that gave me +4%.

I need to know what effect, if any, it had on the direct reclaim
figures. If it does drop them, but not by enough, then fallback can be
allowed to occur in more circumstances - for example, by changing

+	if (pcp->count && free_pages <= low_wmark_pages(zone)) {

with

+	if (pcp->count && free_pages <= high_wmark_pages(zone)) {

> Which is actually a fine win, but not for the issue currently in discussion.
> As bonus I can give you now a little tested-by for both patches :-)
> 

Thanks :)

Still, the grind continues to try resolve this :(

1. If the "Fallback to other per-cpu lists" does reduce the number of
	calls to direct reclaim, but not by enough, please replace the
	comparison of low_wmark_pages with high_wmark_pages and let me know a)
	how much it affects throughput and b) how much it affects the number
	of calls to direct reclaim

	This is addressing potential problems in 5f8dcc

2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
	number of parameters being passed is making some unexpected large
	difference

	This is trying to figure out more what the problem with e084b is

==== CUT HERE ====
rmqueue_bulk-fewer-parameters

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7021c68..180bab2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -936,7 +936,7 @@ retry_reserve:
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
  * Returns the number of new pages which were placed at *list.
  */
-static int rmqueue_bulk(struct zone *zone, unsigned int order, 
+static int rmqueue_bulk(struct zone *zone,
 			unsigned long count, struct list_head *list,
 			int migratetype, int cold)
 {
@@ -944,7 +944,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, 0, migratetype);
 		if (unlikely(page == NULL))
 			break;
 
@@ -964,7 +964,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		set_page_private(page, migratetype);
 		list = &page->lru;
 	}
-	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
 	spin_unlock(&zone->lock);
 	return i;
 }
@@ -1231,7 +1231,7 @@ again:
 		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
 		if (list_empty(list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
+			pcp->count += rmqueue_bulk(zone,
 					pcp->batch, list,
 					migratetype, cold);
 			if (unlikely(list_empty(list)))

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-09 17:57                             ` Mel Gorman
@ 2010-02-11 16:11                               ` Christian Ehrhardt
  2010-02-12 10:05                                 ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-11 16:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, npiggin, gregkh

Mel Gorman wrote:
>>>>> How reproducible are these results with patch e084b reverted? i.e. I know
>>>>> it does not work now, but did reverting on the old kernel always fix it
>>>>> or were there occasions where the figures would be still bad?
>>>>>   
>>>> Reverting e084b in the past showed something like +40%, so I thought it  
>>>> helps.
>>>> Afterwards I found out it wasn't a good testcase for reproducibility and  
>>>> when looking at a series it had +5%,-7%,+20%,...
>>>> So by far too noisy to be useful for bisect, discussion or fix testing.
>>>> Thats why I made my big round reinventing the testcases in a more  
>>>> reproducible way like described above.
>>>> In those the original issue is still very visible - which means 4 runs  
>>>> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
>>>> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
>>> About all e084b can be affecting is cache hotness when fallback is occuring
>>> but 60% regression still seems way too high for just that.
>> Yes, most the ideas you, I an a few others so far had went for cache
>> hotness - but always the summary was "couldn't be that much".
> 
> Well, it's cache hotness or the driver or hardware doing automatic merging
> of requests (which you say is not happening). There are two other
> possibilties but their effect should be very low as well.
> 
>> On the other hand we have to consider that it could be a small timing due
>> to cache or whatever else which leads into congestion_wait and then it waits
>> HZ/50. That way a 1µs cause can lead to 20001µs lost which then cause 60%.
> 
> Maybe although it would imply the window where it can go wrong is
> very small. It seems unlikely that it would be hit every time.
> 
>> Another important thing is that "cold page" afaik means the page content,
>> and not the management struct etc. As the time lost is in management only
>> (=> before accessing the actual 4k page content) it should not be important
>> if a page is cold/hot for the mm code itself.
> 
> True. Even the zeroing should be a non-temporal store so the timing
> should not be affected.
> 
>> That much for direct_reclaim that either runs into progress, but !page
>> (BAD case) or not (GOOD case) - there only cache effects of struct page,
>> pcp lists etc should have an effect.
> 
> There are only two other possibilities I can think of.
> 
> 1. The extra parameter to rmqueue_bulk() means that your platform
>    generates particularly bad code to pass a 6th parameter.

I never checked that number, but offhand I can tell you that 5 parameters can be passed via registers and starting with the 6th they need to go via the stack.
So much for hope; some numbers below to keep the results together.

> 2. The branch inside the for loop is causing serious problems somehow

I also know some branches can be tricky on s390, so I look forward to seeing your idea of how we could verify that without changing too much behavior.

> 
> Causing a 60% slowdown from either of these seems unlikely but there isn't
> much choice than to eliminate them as possibilities. The attached patch is
> for "option 1" above. I'll think more on how to best handle "option 2"

>>>>>> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21 
>>>>>> which  came in later. Both patches unapplied individually don't 
>>>>>> improve  anything. But both patches reverted at the same time on git 
>>>>>> v2.6.32  bring us back to our old values (remember that is more than 
>>>>>> 2Gb/s  improvement in throughput)!
>>>>>>
>>>>>> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this 
>>>>>>  can cause so much trouble.
>>>>>>     
>>>>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
>>>>> workloads can suffer. A second is that users of high-order allocations can
>>>>> enter direct reclaim a lot more than previously. This was fixed in commit
>>>>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
>>>>> your case. 
>>>>>
>>>>> The only other interesting thing in commit 5f8dcc21 is that it increases
>>>>> the memory overhead of a per-cpu structure. If your memory usage is really
>>>>> borderline, it might have been just enough to push it over the edge.
>>>>>   
>>>> I had this thoughts too, but as it only shows an effect with 5f8dcc21  
>>>> and e084b reverted at the same time I'm wondering if it can be that.
>>>> But I agree that this should be considered too until a verification can  
>>>> be done.
>>> Another consequence of 5f8dcc is that pages are on the per-cpu lists that
>>> it is possible for pages to be on the per-cpu lists when the page allocator
>>> is called. There is potential that page reclaim is being entered as a
>>> result even though pages were free on the per-cpu lists.
>>>
>>> It would not explain why reverting e084b makes such a difference but I
>>> can prototype a patch that falls back to other per-cpu lists before
>>> calling the page allocator as it used to do but without linear searching
>>> the per-cpu lists.
>> I tested these patch you submitted in another reply.
>> Short - it only gives it ~+4% by probably speeding up things a bit but
>> not touching the issue in any way.
>> Some more detail below at the other test result.
>>
> 
> You mention it gains 4% which is not great, but what impact, if any, had
> it on the number of calls to direct reclaim?

Sorry, I forgot to add those numbers yesterday.
Unfortunately it had no effect. Comparing 2.6.32 to 2.6.32 with the avoid-direct-reclaim patch gives these numbers for e.g. 16 threads:


with avoid direct reclaim (pcp list fallback)
perf_count_pages_direct_reclaim         32909
perf_count_failed_pages_direct_reclaim_but_progress      9802   =29.79%

remember, original 2.6.32 had:
perf_count_pages_direct_reclaim         33662
perf_count_failed_pages_direct_reclaim_but_progress     10152   =30.16%

So no big progress due to that (~4% TP as mentioned before).

 
>> <SNIP>
>>> Ok. Two avenues of attack then although both are focused on 5f8dcc21.
>>> The first is the prototype below that should avoid congestion_wait. The
>>> second is to reintroduce a fallback to other per-cpu lists and see if
>>> that was a big factor.
>> I tested both of your patches, btw thanks for the quick work!
>>
> 
> Thanks for persisting with this! I know it's dragging on a lot longer
> than anyone would like.
>
>>> Hmm, the patch as it currently stands is below. However, I warn you that
>>> it has only been boot-tested on qemu with no proper testing doing, either
>>> functional or performance testing. It may just go mad, drink your beer and
>>> chew on your dog.
>> In opposition to all your warnings it worked pretty well.
> 
> Nice one.
> 
>> The patch replacing congestion_wait calls - which we know re futile in my scenario
>> anyway - by zone_wait calls helps a lot. But so far as you suggested only by alleviating the symptoms.
>>
>> From a throughput perspective alone it looks almost fixed:
>>                                                     4 thread      8 thread    16 thread
>> plain 2.6.32                                        100%          100%        100%
>> 2.6.32 avoid direct reclaim                         103.85%       103.60%     104.04%
>> 2.6.32 zone wait                                    125.46%       140.87%     131.89%
>> 2.6.32 zone wait + 5f8dcc21 and e084b reverted      124.46%       143.35%     144.21%
>>
> 
> Ok, the fact that the congestion_wait-avoidance patch makes such a large
> difference is very interesting although if the direct reclaim calls were
> lower, it would be less pronounced. I will be having those patches on the
> back-burner for another while unless we cannot resolve this problem a better
> way but it's nice to know they may have some potential.
> 
>> Looking at the counters this theory (only fixing symptoms) seems to be
>> confirmed. Before the ratio of perf_count_pages_direct_reclaim calls that
>> ended in perf_count_failed_pages_direct_reclaim_but_progress was ~2% in the
>> good cases and ~30% in bad cases. As congestion_wait in my case almost always
>> waits the full HZ/50 this is a lot.
> 
>> With the zone wait patch I see a huge throughput win, most probably by
>> all the (still ~30%!) waits that don't have to wait for the full timeout.
> 
> That would be my expectation. The underlying problem of
> too-many-direct-reclaims is not solved.
> 
>> But still reverting 5f8dcc21 and e084b fix the issue of so much direct
>> reclaims running in into that condition and that way giving throughput
>> another last push.
>>                                                       % ran into try_to_free did progress but get_page got !page
>> plain 2.6.32                                          30.16%
>> 2.6.32 zone wait                                      29.05%
>> 2.6.32 zone wait + 5f8dcc21 and e084b reverted        2.93%
>>
> 
> And what impact did "Fallback to other per-cpu lists when the target list
> is empty and memory is low" have on the number of calls to direct reclaim?

see above

> 
>> Note - with the zone wait patch there are also almost twice as much
>> direct_reclaim calls. Probably by several waiters coming of the list when
>> the watermark is restored and allocating too much.
> 
> It could be either a thundering herd problem or just the fact that the
> processes are not asleep.

That sounds right, but like other things we had it is just related to the symptoms, so I guess we can neglect it for now.

>> Another good side note - 5f8dcc21 and e084b reverted plus your zone wait
>> patch is better than anything I have seen so far :-)
> 
> Well, that's something positive out of all this. The congestion_wait() patch
> was developed in response to this thread. Even though I knew it was treating
> symptons and not problems, the congestion_wait() was identified as a bad
> idea some time ago because the lack of pages may have nothing to do with IO.
> 
>> Maybe its worth to measure  both of your patches plus 5f8dcc21 and e084b
>> reverted with 64 threads (the effect grows with # threads) just to rejoice
>> in huge throughput values for a short time.
> 
> :)
> 
>>> <SNIP>
>>>
>>> There is a good chance that 5f8dcc is causing direct reclaim to be entered
>>> when it could have been avoided as pages were on the per-cpu lists but
>>> I don't think the difference in cache hotness is enough to explain a 60%
>>> loss on your machine.
>>>
>> As far as I get it thats what your patch #2 fixed (page allocator: Fallback
>> to other per-cpu lists when the target list is empty and memory is low),
>> but as mentioned below that gave me +4%.
> 
> I need to know what effect, if any, it had on the direct reclaim
> figures. If it does drop them, but not by enough, then fallback can be
> allowed to occur in more cirsumstances - for example, by changing
> 
> +	if (pcp->count && free_pages <= low_wmark_pages(zone)) {
> 
> with
> 
> +	if (pcp->count && free_pages <= high_wmark_pages(zone)) {
> 
>> Which is actually a fine win, but not for the issue currently in discussion.
>> As bonus I can give you now a little tested-by for both patches :-)
>>
> 
> Thanks :)
> 
> Still, the grind continues to try resolve this :(
> 
> 1. If the "Fallback to other per-cpu lists" does reduce the number of
> 	calls to direct reclaim, but not by enough, please replace the
> 	comparison of low_wmark_pages with high_wmark_pages and let me know a)
> 	how much it affects throughput and b) how much it affects the number
> 	of calls to direct reclaim

The difference between low and high watermark is minimal.

Here with the avoid-direct-reclaim patch and the high watermark:
perf_count_pages_direct_reclaim         32839
perf_count_failed_pages_direct_reclaim_but_progress	  488   =30.16%

as reference from above the one with low watermark
perf_count_pages_direct_reclaim         32909
perf_count_failed_pages_direct_reclaim_but_progress      9802   =29.79%

=> this is inside the small amount of noise that is left.

> 
> 	This is addressing potential problems in 5f8dcc
> 
> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
> 	number of parameters being passed is making some unexpected large
> 	difference

BINGO - this definitely hit something.
This experimental patch does two things - on one hand it closes the race we had:

                                                                4 THREAD READ   8 THREAD READ    16 THREAD READ	    %ofcalls
perf_count_congestion_wait                                         13              27                52
perf_count_call_congestion_wait_from_alloc_pages_high_priority      0               0                 0	
perf_count_call_congestion_wait_from_alloc_pages_slowpath          13              27                52              99.52%
perf_count_pages_direct_reclaim                                 30867           56811            131552	
perf_count_failed_pages_direct_reclaim                             14              24                52	
perf_count_failed_pages_direct_reclaim_but_progress                14              21                52              0.04% !!!

On the other hand we see that the number of direct_reclaim calls increased by ~x4.

I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders, which worked on 4x the number of pages at once.

This leaves me with two ideas of what the real issue could be:
1. either really the 6th parameter, as this is the first one that needs to go on the stack and that way might open a race and rob gcc of a big pile of optimization chances
2. maybe just the fact that we now only do "order 0" operations changes the behaviour

I think I'll try to cover that tomorrow by keeping the old call with 6 parameters but rewriting it internally to use 0 again (and hope I don't need too many tricks to stop gcc from optimizing it out).
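Something like the sketch below is what I have in mind - untested, and the barrier() is only a first idea to keep gcc from throwing the parameter away, so take it with a grain of salt:

/* Keep the 6-parameter signature (so parameter passing stays as in plain
 * 2.6.32 + e084b) but use order 0 internally, which is what all callers
 * pass anyway. */
static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, int cold)
{
	int i;

	barrier();	/* hopefully enough to keep 'order' a real parameter */
	order = 0;	/* force order-0 behaviour from here on */

	spin_lock(&zone->lock);
	for (i = 0; i < count; ++i) {
		struct page *page = __rmqueue(zone, order, migratetype);
		if (unlikely(page == NULL))
			break;
		if (likely(cold == 0))
			list_add(&page->lru, list);
		else
			list_add_tail(&page->lru, list);
		set_page_private(page, migratetype);
		list = &page->lru;
	}
	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
	spin_unlock(&zone->lock);
	return i;
}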
Maybe I could also extend statistics to track order in the failing cases vs. all cases or something like this.

I don't have full access to the machines tomorrow, so it might be Monday before I can give a proper response.
Thanks a lot Mel, this brought us much closer to a final fix.


> 	This is trying to figure out more what the problem with e084b is
> 
> ==== CUT HERE ====
> rmqueue_bulk-fewer-parameters
[...]
-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-11 16:11                               ` Christian Ehrhardt
@ 2010-02-12 10:05                                 ` Nick Piggin
  2010-02-15  6:59                                   ` Nick Piggin
  2010-02-15 15:46                                   ` Christian Ehrhardt
  0 siblings, 2 replies; 41+ messages in thread
From: Nick Piggin @ 2010-02-12 10:05 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
> > 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
> > 	number of parameters being passed is making some unexpected large
> > 	difference
> 
> BINGO - this definitely hit something.
> This experimental patch does two things - on one hand it closes the race we had:
> 
>                                                                 4 THREAD READ   8 THREAD READ    16 THREAD READ	    %ofcalls
> perf_count_congestion_wait                                         13              27                52
> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0               0                 0	
> perf_count_call_congestion_wait_from_alloc_pages_slowpath          13              27                52              99.52%
> perf_count_pages_direct_reclaim                                 30867           56811            131552	
> perf_count_failed_pages_direct_reclaim                             14              24                52	
> perf_count_failed_pages_direct_reclaim_but_progress                14              21                52              0.04% !!!
> 
> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
> 
> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.

But the order parameter was always passed as constant 0 by the caller?

 
> This leaves me with two ideas what the real issue could be:
> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances

It must be something to do with code generation AFAIKS. I'm surprised
the function isn't inlined, giving exactly the same result regardless
of the patch.

Unlikely to be a correctness issue with code generation, but I'm
really surprised that a small difference in performance could have
such a big (and apparently repeatable) effect on behaviour like this.

What's the assembly look like?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-12 10:05                                 ` Nick Piggin
@ 2010-02-15  6:59                                   ` Nick Piggin
  2010-02-15 15:46                                   ` Christian Ehrhardt
  1 sibling, 0 replies; 41+ messages in thread
From: Nick Piggin @ 2010-02-15  6:59 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

On Fri, Feb 12, 2010 at 09:05:19PM +1100, Nick Piggin wrote:
> > This leaves me with two ideas what the real issue could be:
> > 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
> 
> It must be something to do wih code generation AFAIKS. I'm surprised
> the function isn't inlined, giving exactly the same result regardless
> of the patch.
> 
> Unlikely to be a correctness issue with code generation, but I'm
> really surprised that a small difference in performance could have
> such a big (and apparently repeatable) effect on behaviour like this.
> 
> What's the assembly look like?

Oh, and what happens if you always_inline it? I would have thought
gcc should be able to generate the same code in both cases -- unless
I'm really missing something and there is a functional change in
there?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-12 10:05                                 ` Nick Piggin
  2010-02-15  6:59                                   ` Nick Piggin
@ 2010-02-15 15:46                                   ` Christian Ehrhardt
  2010-02-16 11:25                                     ` Mel Gorman
  1 sibling, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-15 15:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mel Gorman, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh



Nick Piggin wrote:
> On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
>>> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
>>> 	number of parameters being passed is making some unexpected large
>>> 	difference
>> BINGO - this definitely hit something.
>> This experimental patch does two things - on one hand it closes the race we had:
>>
>>                                                                 4 THREAD READ   8 THREAD READ    16 THREAD READ	    %ofcalls
>> perf_count_congestion_wait                                         13              27                52
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0               0                 0	
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath          13              27                52              99.52%
>> perf_count_pages_direct_reclaim                                 30867           56811            131552	
>> perf_count_failed_pages_direct_reclaim                             14              24                52	
>> perf_count_failed_pages_direct_reclaim_but_progress                14              21                52              0.04% !!!
>>
>> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
>>
>> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.
> 
> But the order parameter was always passed as constant 0 by the caller?

Uh - I didn't see that at first glance - you're right.
So the x4 in direct_reclaim calls is not caused by e.g. missing order information. Thanks for spotting this.

> 
>> This leaves me with two ideas what the real issue could be:
>> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
> 
> It must be something to do with code generation AFAIKS. I'm surprised
> the function isn't inlined, giving exactly the same result regardless
> of the patch.

After checking the asm I can tell that rmqueue_bulk is inlined.
Therefore the test to inline it explicitly, as requested in another mail, can be considered unnecessary.
It actually boils down to something different in Mel's patch - see below.
 
> Unlikely to be a correctness issue with code generation, but I'm
> really surprised that a small difference in performance could have
> such a big (and apparently repeatable) effect on behaviour like this.
> 
> What's the assembly look like?
> 

Line 214763
This is the preparation of the __mod_zone_page_state call from rmqueue_bulk which  is inlined into
get_page_from_freelist.

!ORDER CHANGED FOR READABILITY!
FAST                               SLOW
lgr  %r2,%r11  #parm 1 from r11    lgr  %r2, %r11  #parm 1 from r11 - same
lghi %r3,0     #parm 2 = 0 const   lghi %r3,0      #parm 2 = 0 - same
lghi %r4,-1    #parm 3 = -1 const  lcr  %r4,%r12   #parm 3 as complement of r12?
                                   lgfr %r4,%r4    #parm 3 - clear upper 32 bit?
brasl          #call               brasl           #call

=> This is most probably this part of Mel's patch:
 23 @@ -965,7 +965,7 @@
 24                 set_page_private(page, migratetype);
 25                 list = &page->lru;
 26         }
 27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
 29         spin_unlock(&zone->lock);
 30         return i;
 31  }

-> doesn't look too important, but to be sure we can build an executable with
just this change, but still passing order to rmqueue_bulk - see below.


Line 214800
Different end of the function. In the slow path there is a check on %r4 and,
depending on it, two different jump targets inside get_page_from_freelist, while in
the fast version there is only an unconditional jump (both after a
_raw_spin_lock_wait).

FAST                                  SLOW
brasl   # call _raw_spin_lock_wait    brasl              # _raw_spin_lock_wait
j       1e6294 get_page_from_freelist lg   %r4,288(%r15) # from stack
                                      ltgr %r4,%r4       # compare
                                      jne  1e62a2 get_page_from_freelist
                                      lhi  %r12,0 # new call gets r12 @ 0 (GOT?)
                                      j    1e6340 get_page_from_freelist
The rest is the same. So what is left is that the slow variant has a new branch
back into get_page_from_freelist with r12 set to zero. This jump goes directly
after the also new lcr call seen in the first difference and might therefore be
related to that change.

Now I first thought it might not be rmqueue_bulk that misses optimization, but
__mod_zone_page_state - but that one looks the same in both cases.

Therefore I took a shot for 2.6.32 with just that snippet above applied (__mod_zone_page_state call without order which should be fine as we know it is 0 in all cases).
And it is interesting to see that this snippet alone is enough to fix throughput and the direct_reclaim -> progress&!page ratio.

The differences in asm are pretty much the same; as rmqueue_bulk was already inlined before, the actually intended change to its parameters was negligible.
I wondered whether it matters that it is a constant value (-1) or whether the cause was that shift. So I tried:

 23 @@ -965,7 +965,7 @@
 24                 set_page_private(page, migratetype);
 25                 list = &page->lru;
 26         }
 27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
 29         spin_unlock(&zone->lock);
 30         return i;
 31  }

Now that one still has the issue of the huge throughput loss and the bad ratio of direct_reclaims leading into congestion_wait.
Now as far as I can see, that - oh so important - "-i" or "-1" ends up in zone_page_state_add as variable x:
static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_stat[item]);
}

I guess the intention there is to update the zone with the number of pages freed - and I also guess that -1 as constant might be wrong there.
That would also explain some weird output I saw like this after boot:
h37lp01:~ # cat /proc/meminfo
MemTotal:         251216 kB
MemFree:          483460 kB bigger than total

BUT - and that is now again open for discussion - "__mod_zone_page_state(zone, NR_FREE_PAGES, -1)", even being wrong, fixed the issue in terms of throughput and the congestion_waits as well as reverting e084b+5f8dcc21 did!
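
Just to illustrate (for myself) why the reported value can even grow beyond MemTotal, here is a little toy model - the batch size of 16 is an assumption, not measured, and it assumes the free side still adds back the real page count:

#include <stdio.h>

/* Toy model of the NR_FREE_PAGES drift with the (wrong) constant -1:
 * every pcp refill removes 'batch' pages from the free lists but only
 * subtracts 1 from the counter; freeing the pages back later adds the
 * full 'batch' again, so each refill/free cycle over-reports free
 * memory by (batch - 1) pages. */
int main(void)
{
	const long batch = 16;		/* assumed pcp->batch, illustrative */
	long drift = 0;			/* pages over-reported as free      */
	long cycle;

	for (cycle = 0; cycle < 10000; cycle++)
		drift += batch - 1;

	printf("over-reported free memory: %ld kB after %ld refills\n",
	       drift * 4, cycle);	/* 4 kB pages */
	return 0;
}

With a counter drifting like that, the watermark checks get more and more optimistic, which would explain why direct reclaim (and with it the congestion_wait path) is entered far less often.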


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-15 15:46                                   ` Christian Ehrhardt
@ 2010-02-16 11:25                                     ` Mel Gorman
  2010-02-16 16:47                                       ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2010-02-16 11:25 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
> 
> 
> Nick Piggin wrote:
> > On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
> >>> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
> >>> 	number of parameters being passed is making some unexpected large
> >>> 	difference
> >> BINGO - this definitely hit something.
> >> This experimental patch does two things - on one hand it closes the race we had:
> >>
> >>                                                                 4 THREAD READ   8 THREAD READ    16 THREAD READ	    %ofcalls
> >> perf_count_congestion_wait                                         13              27                52
> >> perf_count_call_congestion_wait_from_alloc_pages_high_priority      0               0                 0	
> >> perf_count_call_congestion_wait_from_alloc_pages_slowpath          13              27                52              99.52%
> >> perf_count_pages_direct_reclaim                                 30867           56811            131552	
> >> perf_count_failed_pages_direct_reclaim                             14              24                52	
> >> perf_count_failed_pages_direct_reclaim_but_progress                14              21                52              0.04% !!!
> >>
> >> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
> >>
> >> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.
> > 
> > But the order parameter was always passed as constant 0 by the caller?
> 
> Uh - I didn't see that in the first look - you're right.
> So the x4 in direct_reclaim calls are not caused by e.g. missing order information. Thanks for spotting this.
> 
> > 
> >> This leaves me with two ideas what the real issue could be:
> >> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
> > 
> > It must be something to do with code generation AFAIKS. I'm surprised
> > the function isn't inlined, giving exactly the same result regardless
> > of the patch.
> 
> after checking asm I can tell that rmqueue_bulk is inlined.
> Therefore the test to inline it explicitly as requested in another mail can be considered negligible.

Well, this is good at least. The expectation was that it was getting
automatically inlined but I'd stopped expecting anything as this problem
is weird.

> It actually boils down to something different in Mels patch - see below.
>  
> > Unlikely to be a correctness issue with code generation, but I'm
> > really surprised that a small difference in performance could have
> > such a big (and apparently repeatable) effect on behaviour like this.
> > 
> > What's the assembly look like?
> > 
> 
> Line 214763
> This is the preparation of the __mod_zone_page_state call from rmqueue_bulk which  is inlined into
> get_page_from_freelist.
> 
> !ORDER CHANGED FOR READABILITY!
> FAST                               SLOW
> lgr  %r2,%r11  #parm 1 from r11    lgr  %r2, %r11  #parm 1 from r11 - same
> lghi %r3,0     #parm 2 = 0 const   lghi %r3,0      #parm 2 = 0 - same
> lghi %r4,-1    #parm 3 = -1 const  lcr  %r4,%r12   #parm 3 as complement of r12?
>                                    lgfr %r4,%r4    #parm 3 - clear upper 32 bit?

I'm not an s390 assembler expert, but from the (only) reference I found:

LGHI - loads a half word (16 bits) into a 64-bit register,
		here -1 into register 4

LCR  - load the complement of count (i.e. -count)
LGFR - load a 64-bit register with a 32-bit value, no idea if this clears the
	upper bits or not.

So, they are doing very different things here.... err, crap, because I
broke it.  The patch to eliminate the parameter is wrong

-       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+       __mod_zone_page_state(zone, NR_FREE_PAGES, -1);


should have been

-       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);

> brasl          #call               brasl           #call
> 
> => This is most probably this part of Mel's patch:
>  23 @@ -965,7 +965,7 @@
>  24                 set_page_private(page, migratetype);
>  25                 list = &page->lru;
>  26         }
>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
>  29         spin_unlock(&zone->lock);
>  30         return i;
>  31  }
> 
> -> doesn't look too important, but to be sure we can build an executable with
> just this change, but still passing order to rmqueue_bulk - see below.
> 

It's important, it's broken. That -1 should be -i. Can you test with
that fixed? I'm very sorry.

> Line 214800
> Differend end of the function. In Slow path there is a check to %r4 and
> dependent two different jump targets inside get_page_from_freelist, while in
> the fast version there is only an unconditional jump (both after a
> _raw_spin_lock_wait).
> 
> FAST                                  SLOW
> brasl   # call _raw_spin_lock_wait    brasl              # _raw_spin_lock_wait
> j       1e6294 get_page_from_freelist lg   %r4,288(%r15) # from stack
>                                       ltgr %r4,%r4       # compare
>                                       jne  1e62a2 get_page_from_freelist
>                                       lhi  %r12,0 # new call gets r12 @ 0 (GOT?)
>                                       j    1e6340 get_page_from_freelist
>
> The rest is the same. So what is left is that the slow variant has a new branch
> back in to get_page_from_freelist with r12 set to zero. This jump wents directly
> after the also new lcr call seen in the first difference and might therfore be
> related to that change.
> 
> Now I first thought it might not be rmqueue_bulk that misses optimization, but
> __mod_zone_page_state - but that one looks the same in both cases.
> 

> Therefore I took a shot for 2.6.32 with just that snippet above applied
> (__mod_zone_page_state call without order which should be fine as we know it
> is 0 in all cases). And it is interesting to see that this snippet alone
> is enough to fix throughput and the direct_reclaim -> progress&!page ratio.

Unfortunately, it's because I broke the free counters in that patch so
we are probably breaking watermarks. I imagine the counters in
/proc/vmstat are very funny looking.

> 
> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
> 
>  23 @@ -965,7 +965,7 @@
>  24                 set_page_private(page, migratetype);
>  25                 list = &page->lru;
>  26         }
>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>  29         spin_unlock(&zone->lock);
>  30         return i;
>  31  }
> 
> Now that one has still the issue of the very huge throughput loss and the bad ratio of driect_reclaims leading into congestion_wait.
> Now as far as I can see that - oh so important - "-i" or "-1" ends up in zone_page_state_add as variable x:
> static inline void zone_page_state_add(long x, struct zone *zone,
>                                  enum zone_stat_item item)
> {
>         atomic_long_add(x, &zone->vm_stat[item]);
>         atomic_long_add(x, &vm_stat[item]);
> }
> 
> I guess the intention there is to update the zone with the number of pages freed - and I also guess that -1 as constant might be wrong there.

It is.

> That would also explain some weird output I saw like this after boot:
> h37lp01:~ # cat /proc/meminfo
> MemTotal:         251216 kB
> MemFree:          483460 kB bigger than total
> 
> BUT, and that is now again open for discussion - "__mod_zone_page_state(zone, NR_FREE_PAGES, -1)" even being wrong - fixed the issue in terms of throughput and the congestion_waits as good as reverting e084b+5f8dcc21!

It "fixes" it only by not calling direct reclaim when it should :(

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-16 11:25                                     ` Mel Gorman
@ 2010-02-16 16:47                                       ` Christian Ehrhardt
  2010-02-17  9:55                                         ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-16 16:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

[-- Attachment #1: Type: text/plain, Size: 5686 bytes --]

Mel Gorman wrote:
> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
[...]
>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>
>>  23 @@ -965,7 +965,7 @@
>>  24                 set_page_private(page, migratetype);
>>  25                 list = &page->lru;
>>  26         }
>>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>  29         spin_unlock(&zone->lock);
>>  30         return i;
>>  31  }
>>
[...]
> It "fixes" it only by not calling direct reclaim when it should :(

Yeah, as we both realized -1 was not right, so it was more of a crazy workaround :-)

Anyway, after that turned out to be a dead end again, I dug even deeper into the details of direct_reclaim - I think we can agree that, from the counters we already have, the race between try_to_free making progress and get_page not getting a page (which causes the congestion_wait) is the source of the issue.

So, to bring some more light into all that, I extended my perf counters to track a few more details in direct_reclaim.
Two patches are attached; they apply on top of the other three already available in this thread.
The intention is
a) to track the time (see the note on units right after this list)
 a1) spent in try_to_free_pages
 a2) consumed after try_to_free_pages until get_page_from_freelist
 a3) spent in get_page_from_freelist
b1) after seeing the order!=0 -> drain_all_pages path I wondered if order might differ, even though all calls look like they have order zero
b2) tracking the average number of pages freed by try_to_free_pages for the fast path and the slow path (progress&!page)
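
(A note on units: the timestamps are taken with get_clock(), the s390 TOD
clock, and converted by the tod_duration_ns() helper in the attached patch
as ns = ticks * 1000 / 4096 = (ticks * 125) >> 9 - assuming I remember
correctly that the TOD clock ticks 4096 times per microsecond. The
duration figures below should therefore be read as nanoseconds.)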

Naming convention (otherwise it might get confusing here):
Good case - the scenario e.g. with e084b and 5f8dcc21 reverted, resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
Bad case - the scenario e.g. on a clean 2.6.32
Fast path - direct reclaim calls that did not run into progress&!page
Slow path - direct reclaim calls that ran into progress&!page, ending up in a long congestion_wait, and therefore called the "slow" path

Mini summary of what we had before in huge tables:
            fast path   slow path
GOOD CASE      ~98%       ~1-3%
BAD CASE       ~70%        ~30%
-> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)

Out of the numbers I got, the following things might help us create a new approach to a solution.
The timings showed that the so-called slow case actually passes through direct_reclaim much faster in the bad case.

GOOD CASE                                        duration
a1 Fast-avg-duration_pre_ttf_2_post_ttf            164099
a2 Fast-avg-duration_post_ttf_2_pre_get_page          459
a3 Fast-avg-duration_pre_get_page_2_post_get_page     346
a1 Slow-avg-duration_pre_ttf_2_post_ttf            127621
a2 Slow-avg-duration_post_ttf_2_pre_get_page         1957
a3 Slow-avg-duration_pre_get_page_2_post_get_page     256
BAD CASE                                         duration   deviation to good case in %
a1 Fast-avg-duration_pre_ttf_2_post_ttf            122921   -25.09%
a2 Fast-avg-duration_post_ttf_2_pre_get_page          521   13.53%
a3 Fast-avg-duration_pre_get_page_2_post_get_page     244   -29.55%
a1 Slow-avg-duration_pre_ttf_2_post_ttf            109740   -14.01%
a2 Slow-avg-duration_post_ttf_2_pre_get_page          250   -87.18%
a3 Slow-avg-duration_pre_get_page_2_post_get_page     117   -54.16%

That means that in the bad case the execution is much faster. Especially in the calls that eventually run into the slow path: try_to_free is 14% faster, more importantly the time between try_to_free and get_page is 87% faster (less than a fifth), and finally get_page is 54% faster - but that last one is probably just failing on an almost empty list, which is fast^^.

As I checked, order was always zero, so the time is not spent in drain_all_pages, and the only other thing left might be cond_resched?!
Everything else is a few assignments, so it can't be much else.
But why would e.g. not running into schedule in cond_resched cause get_page to not find anything? I don't know, and I would expect it to be the other way around - the faster you get from free to get, the more pages should be left.
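
For reference, the a2 interval (t2 -> t3) covers roughly this region of
__alloc_pages_direct_reclaim - paraphrased from my reading of 2.6.32
mm/page_alloc.c, so take the exact lines as an assumption, not a quote:

	*did_some_progress = try_to_free_pages(zonelist, order,
					       gfp_mask, nodemask);
	/* t2 taken here */
	p->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();
	p->flags &= ~PF_MEMALLOC;

	cond_resched();	/* the only spot in a2 where we can lose the cpu */

	if (order != 0)	/* never taken here, order is always 0 */
		drain_all_pages();
	/* t3 taken here */
	if (likely(*did_some_progress))
		page = get_page_from_freelist(...);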

I thought the progress try_to_free_pages is making might be important as well, so I took numbers for that too.
From those I see that the good case as well as the bad case frees an average of 62 pages in the fast path.
But in the slow path they start to differ - while the good case, which runs into that path only seldom anyway, frees an average of 1.46 pages (that might be the thing causing it not to get a page eventually), the bad case makes progress of avg 37 pages even in the slow path.

PAGES-FREED  fast path   slow path
GOOD CASE      ~62       ~1.46
BAD CASE       ~62       ~37

Thinking of it as asking "how few pages do I have to free before I fall from the fast to the slow path", the two kernels behave differently - it looks wrong, but it is interesting.
The good case only drops to the slow path (!page) at ~1.46 pages freed, while the bad case seems to enter it much earlier, even with 37 pages freed.

As order is always 0 and get_page is afaik about getting just "one" page, I wonder where these 37 pages disappeared, especially as in the bad case it is much faster getting to get_page after freeing those ~37 pages.

Comments and ideas welcome!

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

[-- Attachment #2: perf-count-direct-reclaim-durations.diff --]
[-- Type: text/x-patch, Size: 6199 bytes --]

Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c	2010-02-16 11:06:29.000000000 +0100
+++ linux-2.6-git/kernel/sysctl.c	2010-02-16 13:38:48.000000000 +0100
@@ -264,9 +264,64 @@
 extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
 extern unsigned long perf_count_call_congestion_wait_from_alloc_pages_high_priority;
 extern unsigned long perf_count_call_congestion_wait_from_alloc_pages_slowpath;
+
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page;
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_ttf;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf",
+		.data           = &perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page",
+		.data           = &perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page",
+		.data           = &perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf",
+		.data           = &perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page",
+		.data           = &perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page",
+		.data           = &perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname       = "perf_count_congestion_wait",
 		.data           = &perf_count_congestion_wait,
 		.mode           = 0666,
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2010-02-16 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2010-02-16 13:38:24.000000000 +0100
@@ -51,6 +51,7 @@
 #include <trace/events/kmem.h>
 
 #include <asm/tlbflush.h>
+#include <asm/timex.h>
 #include <asm/div64.h>
 #include "internal.h"
 
@@ -1655,6 +1656,17 @@
 unsigned long perf_count_failed_pages_direct_reclaim = 0;
 unsigned long perf_count_failed_pages_direct_reclaim_but_progress = 0;
 
+unsigned long long perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf = 0;
+unsigned long long perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page = 0;
+unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page = 0;
+unsigned long long perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf = 0;
+unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page = 0;
+unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page = 0;
+
+static inline unsigned long long tod_duration_ns(unsigned long long from, unsigned long long to) {
+	return ((to*125)>>9)-((from*125)>>9);
+}
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1665,6 +1677,7 @@
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	unsigned long long t1,t2,t3,t4;
 
 	cond_resched();
 
@@ -1675,8 +1688,12 @@
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
+	t1 = get_clock();
+
 	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
 
+	t2 = get_clock();
+
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
 	p->flags &= ~PF_MEMALLOC;
@@ -1686,17 +1703,28 @@
 	if (order != 0)
 		drain_all_pages();
 
+	t3 = get_clock();
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+	t4 = get_clock();
 
 	perf_count_pages_direct_reclaim++;
 	if (!page)
 		perf_count_failed_pages_direct_reclaim++;
-	if (!page && (*did_some_progress))
+	if (!page && (*did_some_progress)) {
 		perf_count_failed_pages_direct_reclaim_but_progress++;
+		perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
+		perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
+		perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+	}
+	else {
+		perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
+		perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
+		perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+	}
 
 	return page;
 }

[-- Attachment #3: perf-count-direct-reclaim-orderandprogress.diff --]
[-- Type: text/x-patch, Size: 4380 bytes --]

Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c	2010-02-16 13:38:48.000000000 +0100
+++ linux-2.6-git/kernel/sysctl.c	2010-02-16 15:07:21.000000000 +0100
@@ -267,13 +267,49 @@
 
 extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf;
 extern unsigned long long perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page;
-extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page;
 extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf;
 extern unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page;
-extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_ttf;
+extern unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page;
+extern unsigned long perf_count_direct_reclaim_fast_sum_progress;
+extern unsigned long perf_count_direct_reclaim_slow_sum_progress;
+extern unsigned long perf_count_direct_reclaim_fast_sum_order;
+extern unsigned long perf_count_direct_reclaim_slow_sum_order;
 static struct ctl_table perf_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_sum_progress",
+		.data           = &perf_count_direct_reclaim_slow_sum_progress,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_slow_sum_order",
+		.data           = &perf_count_direct_reclaim_slow_sum_order,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_sum_progress",
+		.data           = &perf_count_direct_reclaim_fast_sum_progress,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_direct_reclaim_fast_sum_order",
+		.data           = &perf_count_direct_reclaim_fast_sum_order,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname       = "perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf",
 		.data           = &perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf,
 		.mode           = 0666,
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2010-02-16 14:58:28.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2010-02-16 15:06:15.000000000 +0100
@@ -1663,6 +1663,11 @@
 unsigned long long perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page = 0;
 unsigned long long perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page = 0;
 
+unsigned long perf_count_direct_reclaim_fast_sum_progress = 0;
+unsigned long perf_count_direct_reclaim_slow_sum_progress = 0;
+unsigned long perf_count_direct_reclaim_fast_sum_order = 0;
+unsigned long perf_count_direct_reclaim_slow_sum_order = 0;
+
 static inline unsigned long long tod_duration_ns(unsigned long long from, unsigned long long to) {
 	return ((to*125)>>9)-((from*125)>>9);
 }
@@ -1719,11 +1724,17 @@
 		perf_count_direct_reclaim_slow_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
 		perf_count_direct_reclaim_slow_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
 		perf_count_direct_reclaim_slow_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+
+		perf_count_direct_reclaim_slow_sum_progress += *did_some_progress;
+		perf_count_direct_reclaim_slow_sum_order += order;
 	}
 	else {
 		perf_count_direct_reclaim_fast_dursum_pre_ttf_2_post_ttf += tod_duration_ns(t1,t2);
 		perf_count_direct_reclaim_fast_dursum_post_ttf_2_pre_get_page += tod_duration_ns(t2,t3);
 		perf_count_direct_reclaim_fast_dursum_pre_get_page_2_post_get_page += tod_duration_ns(t3,t4);
+
+		perf_count_direct_reclaim_fast_sum_progress += *did_some_progress;
+		perf_count_direct_reclaim_fast_sum_order += order;
 	}
 
 	return page;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-16 16:47                                       ` Christian Ehrhardt
@ 2010-02-17  9:55                                         ` Christian Ehrhardt
  2010-02-17 10:03                                           ` Christian Ehrhardt
  2010-02-18 11:43                                           ` Mel Gorman
  0 siblings, 2 replies; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-17  9:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

Christian Ehrhardt wrote:
> Mel Gorman wrote:
>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
> [...]
>>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>>
>>>  23 @@ -965,7 +965,7 @@
>>>  24                 set_page_private(page, migratetype);
>>>  25                 list = &page->lru;
>>>  26         }
>>>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>>  29         spin_unlock(&zone->lock);
>>>  30         return i;
>>>  31  }
>>>
> [...]
>> It "fixes" it only by not calling direct reclaim when it should :(
> 
> yeah as we both realized -1 was not right so it was more a crazy workaround :-)
> 
> Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
> 
> So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
> Two patches are attached and apply after the other three already available in that thread.
> The intention is
> a) to track the time
>  a1) spent in try_to_free_pages
>  a2) consumed after try_to_free_pages until get_page_from_freelist
>  a3) spent in get_pages_from_freelist
> b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
> b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
> 
> Naming convention (otherwise it might get confusing here)
> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
> Bad case - the scenario e.g. on a clean 2.6.32
> Fast path - direct reclaim calls that did not run into progress&!page
> Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
> 
> Mini summary of what we had before in huge tables:
>             fast path   slow path
> GOOD CASE      ~98%       ~1-3%
> BAD CASE       ~70%        ~30%
> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
> 
> Out of the numbers I got the following things might help us to create a new approach to a solution.
> The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
> 
> GOOD CASE                                        duration
> a1 Fast-avg-duration_pre_ttf_2_post_ttf            164099
> a2 Fast-avg-duration_post_ttf_2_pre_get_page          459
> a3 Fast-avg-duration_pre_get_page_2_post_get_page     346
> a1 Slow-avg-duration_pre_ttf_2_post_ttf            127621
> a2 Slow-avg-duration_post_ttf_2_pre_get_page         1957
> a3 Slow-avg-duration_pre_get_page_2_post_get_page     256
> BAD CASE                                         duration   deviation to good case in %
> a1 Fast-avg-duration_pre_ttf_2_post_ttf            122921   -25.09%
> a2 Fast-avg-duration_post_ttf_2_pre_get_page          521   13.53%
> a3 Fast-avg-duration_pre_get_page_2_post_get_page     244   -29.55%
> a1 Slow-avg-duration_pre_ttf_2_post_ttf            109740   -14.01%
> a2 Slow-avg-duration_post_ttf_2_pre_get_page          250   -87.18%
> a3 Slow-avg-duration_pre_get_page_2_post_get_page     117   -54.16%
> 
> That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
> 
> As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
> Everything else are a few assignments so it can't be much else.
> But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.

The reason here is probably the fact that in the bad case a lot of
processes are waiting on congestion_wait and are therefore not runnable,
and thus not scheduled via cond_resched.

I'll test this theory today or tomorrow with cond_resched in 
direct_reclaim commented out and expect almost no difference.

> I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
> From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
> But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
> 
> PAGES-FREED  fast path   slow path
> GOOD CASE      ~62       ~1.46
> BAD CASE       ~62       ~37

5f8dcc21 introduced per-migrate-type pcp lists. Is it possible that we
run into a scenario where try_to_free frees a lot of pages, but of the
wrong migrate type, so that get_page afterwards still comes up empty?
At least try_to_free_pages and the shrink_ functions it calls are not
migrate type aware, while get_page and its subsequent buffered_rmqueue
and rmqueue_bulk are - btw here comes patch e084b.

I only see buffered_rmqueue choose a specific pcp list based on migrate
type, and a fallback to MIGRATE_RESERVE - is that enough fallback? What
if the reserve is empty too, but a few other types are not, and those
other types are the ones filled by try_to_free?

I'll try to get a per migrate type #pages statistic after direct_reclaim 
reaches !page - maybe that can confirm some parts of my theory.

> Thinking of it as asking "how few pages do I have to free until I fall from fast to slow path" the kernels behave different it looks wrong but interesting.
> The good case only drops to slow path (!page) in case of ~1.46 pages freed while the bad case seems to enter that much earlier with even 37 pages freed.
> 
> As order is always 0 and get_page afaik about getting just "one" page I wonder where these 37 pages disappeared especially as in bad case it is much faster getting to get_pages after freeing those ~37 pages.
> 
> Comments and ideas welcome!
> 
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-17  9:55                                         ` Christian Ehrhardt
@ 2010-02-17 10:03                                           ` Christian Ehrhardt
  2010-02-18 11:43                                           ` Mel Gorman
  1 sibling, 0 replies; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-17 10:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh



Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
>> Mel Gorman wrote:
>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
>> [...]
>>>> The differences in asm are pretty much the same, as before 
>>>> rmqueue_bulk was already inlined the actually intended change to its 
>>>> parameters was negligible.
>>>> I wondered if it would be important if that is a constant value (-1) 
>>>> or if the reason was caused by that shift. So I tried:
>>>>
>>>>  23 @@ -965,7 +965,7 @@
>>>>  24                 set_page_private(page, migratetype);
>>>>  25                 list = &page->lru;
>>>>  26         }
>>>>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>>>  29         spin_unlock(&zone->lock);
>>>>  30         return i;
>>>>  31  }
>>>>
>> [...]
>>> It "fixes" it only by not calling direct reclaim when it should :(
>>
>> yeah as we both realized -1 was not right so it was more a crazy 
>> workaround :-)
>>
>> Anyway after that being a dead end again I dug even deeper into the 
>> details of direct_reclaim - I think we can agree that out of the 
>> counters we already know the race between try_to_free making progress 
>> and get_page not getting a page causing the congestion_wait is source 
>> of the issue.
>>
>> So what I tried to bring some more light into all that was extending 
>> my perf counters to track a few more details in direct_reclaim.
>> Two patches are attached and apply after the other three already 
>> available in that thread.
>> The intention is
>> a) to track the time
>>  a1) spent in try_to_free_pages
>>  a2) consumed after try_to_free_pages until get_page_from_freelist
>>  a3) spent in get_pages_from_freelist
>> b1) after seeing that order!=0 -> drain_all_pages I wondered if that 
>> might differ even all calls look like they have zero
>> b2) tracking the average amount of pages freed by try_to_free_pages 
>> for fast path and slow path (progres&!page)
>>
>> Naming convention (otherwise it might get confusing here)
>> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted 
>> resulting in high throughput and a low ratio of direct_reclaim running 
>> into progress&!page
>> Bad case - the scenario e.g. on a clean 2.6.32
>> Fast path - direct reclaim calls that did not run into progress&!page
>> Slow path - direct reclaim calls that ran into progress&!page ending 
>> up in a long congestion_wait and therefore called "slow" path
>>
>> Mini summary of what we had before in huge tables:
>>             fast path   slow path
>> GOOD CASE      ~98%       ~1-3%
>> BAD CASE       ~70%        ~30%
>> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone 
>> threads (worse with even more threads)
>>
>> Out of the numbers I got the following things might help us to create 
>> a new approach to a solution.
>> The timings showed that that the so called slow case is actually much 
>> faster passing though direct_reclaim in bad case.
>>
>> GOOD CASE                                        duration
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf            164099
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page          459
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page     346
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf            127621
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page         1957
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page     256
>> BAD CASE                                         duration   deviation 
>> to good case in %
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf            122921   -25.09%
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page          521   13.53%
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page     244   -29.55%
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf            109740   -14.01%
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page          250   -87.18%
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page     117   -54.16%
>>
>> That means that in the bad case the execution is much faster. 
>> Especially in the case that eventually runs into the slow path 
>> try_to_free is 14% faster, more important the time between try_to_free 
>> and get_pages is 87%! faster => less than a fifth and finally get_page 
>> is 54% faster, but that is probably just failing in an almost empty 
>> list which is fast^^.
>>
>> As I checked order which always was zero the time is not spent in 
>> drain_all_pages and the only other thing left might be cond_resched ?!
>> Everything else are a few assignments so it can't be much else.
>> But why would e.g. not running into schedule in cond_resched cause 
>> get_pages to not find anything - I don't know and I would expect it 
>> should be the other way around - the faster you get from free to get 
>> the more pages should be left.
> 
> THe reason here is probably the the fact that in the bad case a lot of 
> processes are waiting on congestion_wait and are therefore not runnnable 
> and that way not scheduled via cond_resched.
> 
> I'll test this theory today or tomorrow with cond_resched in 
> direct_reclaim commented out and expect almost no difference.
> 
>> I thought the progress try_to_free_pages is doing might be important 
>> as well so I took numbers for that too.
>> From those I see that the good case as well as the bad case has an 
>> average of 62 pages freed in fast path.
>> But in slow path they start to differ - while the good case that is 
>> running only seldom in that path anyway frees an average of 1.46 pages 
>> (that might be the thing causing it not getting a page eventually) in 
>> the bad case it makes a progress of avg 37 pages even in slow path.
>>
>> PAGES-FREED  fast path   slow path
>> GOOD CASE      ~62       ~1.46
>> BAD CASE       ~62       ~37
> 
> 5f8dcc21 introduced per migrate type pcp lists, is it possible that we 
> run in a scenario where try_to_free frees a lot of pages via, but of the 
> wrong migrate type?
> And afterwards get_page
> At least try_to_free_page and it's called shrink_ functions is not 
> migrate type aware while get_page and its subsequent buffered_rmqueue 
> and rmqueue_bulk are - btw here comes patch e084b.
> 
> I only see buffered_rmqueue chose a specific pcp list based on migrate 
> type, and a fallback to migrate_reserve - is that enough fallback, what 
> if the reserve is empty too but a few other types would not and those 
> other types are the ones filled by try_to_free?

I just saw the full type iteration in __rmqueue_fallback, but the average
of 37 freed pages must still be going somewhere if get_page doesn't get
one.
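
For reference, this is roughly how I read the order-0 path in 2.6.32
mm/page_alloc.c (paraphrased from memory, so treat it as an assumption
rather than a verbatim copy): buffered_rmqueue() takes from the pcp list
of the requested migratetype and refills it via rmqueue_bulk(), which
calls __rmqueue() per page:

	static struct page *__rmqueue(struct zone *zone, unsigned int order,
						int migratetype)
	{
		struct page *page;

	retry_reserve:
		page = __rmqueue_smallest(zone, order, migratetype);

		/* requested type empty in the buddy lists: steal from
		 * the other migrate types ... */
		if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
			page = __rmqueue_fallback(zone, order, migratetype);

			/* ... and only if that fails too, fall back to
			 * MIGRATE_RESERVE */
			if (!page) {
				migratetype = MIGRATE_RESERVE;
				goto retry_reserve;
			}
		}

		return page;
	}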

> I'll try to get a per migrate type #pages statistic after direct_reclaim 
> reaches !page - maybe that can confirm some parts of my theory.

This might still be worth a try.

>> Thinking of it as asking "how few pages do I have to free until I fall 
>> from fast to slow path" the kernels behave different it looks wrong 
>> but interesting.
>> The good case only drops to slow path (!page) in case of ~1.46 pages 
>> freed while the bad case seems to enter that much earlier with even 37 
>> pages freed.
>>
>> As order is always 0 and get_page afaik about getting just "one" page 
>> I wonder where these 37 pages disappeared especially as in bad case it 
>> is much faster getting to get_pages after freeing those ~37 pages.
>>
>> Comments and ideas welcome!
>>
>>
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-17  9:55                                         ` Christian Ehrhardt
  2010-02-17 10:03                                           ` Christian Ehrhardt
@ 2010-02-18 11:43                                           ` Mel Gorman
  2010-02-18 16:09                                             ` Christian Ehrhardt
  1 sibling, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2010-02-18 11:43 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

On Wed, Feb 17, 2010 at 10:55:08AM +0100, Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
>> Mel Gorman wrote:
>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
>> [...]
>>>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>>>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>>>
>>>>  23 @@ -965,7 +965,7 @@
>>>>  24                 set_page_private(page, migratetype);
>>>>  25                 list = &page->lru;
>>>>  26         }
>>>>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>>>  29         spin_unlock(&zone->lock);
>>>>  30         return i;
>>>>  31  }
>>>>
>> [...]
>>> It "fixes" it only by not calling direct reclaim when it should :(
>>
>> yeah as we both realized -1 was not right so it was more a crazy workaround :-)
>>
>> Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
>>
>> So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
>> Two patches are attached and apply after the other three already available in that thread.
>> The intention is
>> a) to track the time
>>  a1) spent in try_to_free_pages
>>  a2) consumed after try_to_free_pages until get_page_from_freelist
>>  a3) spent in get_pages_from_freelist
>> b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
>> b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
>>
>> Naming convention (otherwise it might get confusing here)
>> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
>> Bad case - the scenario e.g. on a clean 2.6.32
>> Fast path - direct reclaim calls that did not run into progress&!page
>> Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
>>
>> Mini summary of what we had before in huge tables:
>>             fast path   slow path
>> GOOD CASE      ~98%       ~1-3%
>> BAD CASE       ~70%        ~30%
>> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
>>
>> Out of the numbers I got the following things might help us to create a new approach to a solution.
>> The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
>>
>> GOOD CASE                                        duration
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf            164099
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page          459
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page     346
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf            127621
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page         1957
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page     256
>> BAD CASE                                         duration   deviation to good case in %
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf            122921   -25.09%
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page          521   13.53%
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page     244   -29.55%
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf            109740   -14.01%
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page          250   -87.18%
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page     117   -54.16%
>>
>> That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
>>
>> As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
>> Everything else are a few assignments so it can't be much else.
>> But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.
>
> THe reason here is probably the the fact that in the bad case a lot of  
> processes are waiting on congestion_wait and are therefore not runnnable  
> and that way not scheduled via cond_resched.
>
> I'll test this theory today or tomorrow with cond_resched in  
> direct_reclaim commented out and expect almost no difference.
>
>> I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
>> From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
>> But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
>>
>> PAGES-FREED  fast path   slow path
>> GOOD CASE      ~62       ~1.46
>> BAD CASE       ~62       ~37
>
> 5f8dcc21 introduced per migrate type pcp lists, is it possible that we  
> run in a scenario where try_to_free frees a lot of pages via, but of the  
> wrong migrate type?

It's possible but the window is small. When a threshold is reached on the
PCP lists, they get drained to the buddy lists and later picked up again
by __rmqueue_fallback(). I had considered the possibility of pages of the
wrong type being on the PCP lists which led to the patch "page-allocator:
Fallback to other per-cpu lists when the target list is empty and memory is
low" but you reported it made no difference even when fallback was allowed
with high watermarks.
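
Roughly (from memory, so the details may be slightly off), the drain in
free_hot_cold_page() works like this:

	/* free an order-0 page to this cpu's list for its migratetype */
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high) {
		/* spill a batch back to the buddy lists, where
		 * __rmqueue_fallback() can find the pages again */
		free_pcppages_bulk(zone, pcp->batch, pcp);
		pcp->count -= pcp->batch;
	}

so a page of the "wrong" type is only invisible to allocations of other
migratetypes until the next drain of that cpu's list.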

> And afterwards get_page
> At least try_to_free_page and it's called shrink_ functions is not  
> migrate type aware while get_page and its subsequent buffered_rmqueue  
> and rmqueue_bulk are -

Unless a large number of pages are pinned on those PCP lists (which
doesn't appear to be the case or the fallback patch would have helped),
the pages of other migratetypes are found by __rmqueue_fallback().

> btw here comes patch e084b.
>

Which still has zero explanation.

> I only see buffered_rmqueue chose a specific pcp list based on migrate  
> type, and a fallback to migrate_reserve - is that enough fallback, what  
> if the reserve is empty too but a few other types would not and those  
> other types are the ones filled by try_to_free?
>

They get found by __rmqueue_fallback().

> I'll try to get a per migrate type #pages statistic after direct_reclaim  
> reaches !page - maybe that can confirm some parts of my theory.
>
>> Thinking of it as asking "how few pages do I have to free until I fall from fast to slow path" the kernels behave different it looks wrong but interesting.
>> The good case only drops to slow path (!page) in case of ~1.46 pages freed while the bad case seems to enter that much earlier with even 37 pages freed.
>>
>> As order is always 0 and get_page afaik about getting just "one" page I wonder where these 37 pages disappeared especially as in bad case it is much faster getting to get_pages after freeing those ~37 pages.
>>
>> Comments and ideas welcome!

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-18 11:43                                           ` Mel Gorman
@ 2010-02-18 16:09                                             ` Christian Ehrhardt
  2010-02-19 11:19                                               ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-18 16:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh



Mel Gorman wrote:
> On Wed, Feb 17, 2010 at 10:55:08AM +0100, Christian Ehrhardt wrote:
>> Christian Ehrhardt wrote:
>>> Mel Gorman wrote:
>>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
>>> [...]
>>>>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>>>>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>>>>
>>>>>  23 @@ -965,7 +965,7 @@
>>>>>  24                 set_page_private(page, migratetype);
>>>>>  25                 list = &page->lru;
>>>>>  26         }
>>>>>  27 -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>>>  28 +       __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>>>>  29         spin_unlock(&zone->lock);
>>>>>  30         return i;
>>>>>  31  }
>>>>>
>>> [...]
>>>> It "fixes" it only by not calling direct reclaim when it should :(
>>> yeah as we both realized -1 was not right so it was more a crazy workaround :-)
>>>
>>> Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
>>>
>>> So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
>>> Two patches are attached and apply after the other three already available in that thread.
>>> The intention is
>>> a) to track the time
>>>  a1) spent in try_to_free_pages
>>>  a2) consumed after try_to_free_pages until get_page_from_freelist
>>>  a3) spent in get_pages_from_freelist
>>> b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
>>> b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
>>>
>>> Naming convention (otherwise it might get confusing here)
>>> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
>>> Bad case - the scenario e.g. on a clean 2.6.32
>>> Fast path - direct reclaim calls that did not run into progress&!page
>>> Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
>>>
>>> Mini summary of what we had before in huge tables:
>>>             fast path   slow path
>>> GOOD CASE      ~98%       ~1-3%
>>> BAD CASE       ~70%        ~30%
>>> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
>>>
>>> Out of the numbers I got the following things might help us to create a new approach to a solution.
>>> The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
>>>
>>> GOOD CASE                                        duration
>>> a1 Fast-avg-duration_pre_ttf_2_post_ttf            164099
>>> a2 Fast-avg-duration_post_ttf_2_pre_get_page          459
>>> a3 Fast-avg-duration_pre_get_page_2_post_get_page     346
>>> a1 Slow-avg-duration_pre_ttf_2_post_ttf            127621
>>> a2 Slow-avg-duration_post_ttf_2_pre_get_page         1957
>>> a3 Slow-avg-duration_pre_get_page_2_post_get_page     256
>>> BAD CASE                                         duration   deviation to good case in %
>>> a1 Fast-avg-duration_pre_ttf_2_post_ttf            122921   -25.09%
>>> a2 Fast-avg-duration_post_ttf_2_pre_get_page          521   13.53%
>>> a3 Fast-avg-duration_pre_get_page_2_post_get_page     244   -29.55%
>>> a1 Slow-avg-duration_pre_ttf_2_post_ttf            109740   -14.01%
>>> a2 Slow-avg-duration_post_ttf_2_pre_get_page          250   -87.18%
>>> a3 Slow-avg-duration_pre_get_page_2_post_get_page     117   -54.16%
>>>
>>> That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
>>>
>>> As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
>>> Everything else are a few assignments so it can't be much else.
>>> But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.
>> THe reason here is probably the the fact that in the bad case a lot of  
>> processes are waiting on congestion_wait and are therefore not runnnable  
>> and that way not scheduled via cond_resched.
>>
>> I'll test this theory today or tomorrow with cond_resched in  
>> direct_reclaim commented out and expect almost no difference.
>>
>>> I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
>>> From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
>>> But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
>>>
>>> PAGES-FREED  fast path   slow path
>>> GOOD CASE      ~62       ~1.46
>>> BAD CASE       ~62       ~37
>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that we  
>> run in a scenario where try_to_free frees a lot of pages via, but of the  
>> wrong migrate type?
> 
> It's possible but the window is small. When a threshold is reached on the
> PCP lists, they get drained to the buddy lists and later picked up again
> by __rmqueue_fallback(). I had considered the possibility of pages of the
> wrong type being on the PCP lists which led to the patch "page-allocator:
> Fallback to other per-cpu lists when the target list is empty and memory is
> low" but you reported it made no difference even when fallback was allowed
> with high watermarks.
> [...]

Today I created rather huge debug logs - I'll spare everyone too much
detail.
Eventually it comes down to a kind of "cat /proc/zoneinfo"-like output,
extended to list everything per migrate type too.

From that I still think there should be plenty of pages to get the
allocation through, as suggested by the high number of pages freed by
try_to_free.

Due to that I tried calling a second get_page_from_freelist after the
first failing one and after the statistic output, which showed me there
were enough pages available.
Even that second one failed - but this is good, as it allows me to run
that second call with a kind of "debugthis" parameter, which I'll do
tomorrow.

If anyone wants the full 6.2M of that data let me know, but I think the
step-by-step debug of the second try will give us much smaller and
better insights to discuss.

More about that tomorrow,
Christian

P.S. Just one output of it here from a 16 thread read case:
### extended zoneinfo after failed direct reclaim ###
Progress just made in try_to_free         32
Current allocation order   0
Current allocation gfp_flags      201da GFP_COLD '1'
Current allocation migrate type '2'
Current cpu id          0
zone      DMA
zni:   pages_free                      503
         min                       508
         low                       635
         high                      762
         scanned                    96
         spanned                 65536
         present                 64640
     nr_free_pages                 503
     nr_inactive_anon             1536
     nr_active_anon               1480
     nr_inactive_file            33197
     nr_active_file               9492
     nr_unevictable               3016
     nr_mlock                     3016
     nr_anon_pages                5188
     nr_mapped                    2295
     nr_file_pages               43714
     nr_dirty                      206
     nr_writeback                    0
     nr_slab_reclaimable          2762
     nr_slab_unreclaimable        5368
     nr_page_table_pages           151
     nr_kernel_stack               215
     nr_unstable                     0
     nr_bounce                       0
     nr_vmscan_write               476
     nr_writeback_temp               0
     nr_isolated_anon                0
     nr_isolated_file               32
     nr_shmem                       62
         protection: (0,0,0)
Free areas:
Order  0 - nr_free        198 (matching        198 order 0 pages)
    MIGRATE_UNMOVABLE          198
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  1 - nr_free          2 (matching          4 order 0 pages)
    MIGRATE_UNMOVABLE            1
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              1*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  2 - nr_free          0 (matching          0 order 0 pages)
    MIGRATE_UNMOVABLE            0
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  3 - nr_free          0 (matching          0 order 0 pages)
    MIGRATE_UNMOVABLE            0
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  4 - nr_free          0 (matching          0 order 0 pages)
    MIGRATE_UNMOVABLE            0
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  5 - nr_free          0 (matching          0 order 0 pages)
    MIGRATE_UNMOVABLE            0
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  6 - nr_free          0 (matching          0 order 0 pages)
    MIGRATE_UNMOVABLE            0
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  7 - nr_free          0 (matching          0 order 0 pages)
    MIGRATE_UNMOVABLE            0
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              0
    MIGRATE_ISOLATE              0
Order  8 - nr_free          1 (matching        256 order 0 pages)
    MIGRATE_UNMOVABLE            0
    MIGRATE_RECLAIMABLE          0
    MIGRATE_MOVABLE              0*
    MIGRATE_RESERVE              1
    MIGRATE_ISOLATE              0
pagesets
     cpu:  0 (current cpu)
           count:         46
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE             46*
           vm_stats_threshold:         16
     cpu:  1
           count:         77
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE             77*
           vm_stats_threshold:         16
     cpu:  2
           count:         10
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE             10*
           vm_stats_threshold:         16
     cpu:  3
           count:          8
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE              8*
           vm_stats_threshold:         16
     cpu:  4
           count:         11
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE             11*
           vm_stats_threshold:         16
     cpu:  5
           count:         74
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE             74*
           vm_stats_threshold:         16
     cpu:  6
           count:         65
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE             65*
           vm_stats_threshold:         16
     cpu:  7
           count:          0
           high:          90
           batch:         15
           MIGRATE_UNMOVABLE            0
           MIGRATE_RECLAIMABLE          0
           MIGRATE_MOVABLE              0*
           vm_stats_threshold:         16
   all_unreclaimable:          0
   prev_priority:             10
   start_pfn:                  0
   inactive_ratio:             1


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-18 16:09                                             ` Christian Ehrhardt
@ 2010-02-19 11:19                                               ` Christian Ehrhardt
  2010-02-19 15:19                                                 ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-19 11:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

[-- Attachment #1: Type: text/plain, Size: 5852 bytes --]

Christian Ehrhardt wrote:
> 
> 
> Mel Gorman wrote:
>> On Wed, Feb 17, 2010 at 10:55:08AM +0100, Christian Ehrhardt wrote:
>>> Christian Ehrhardt wrote:
>>>> Mel Gorman wrote:
>>>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:

[...]

>>>>
>>>> PAGES-FREED  fast path   slow path
>>>> GOOD CASE      ~62       ~1.46
>>>> BAD CASE       ~62       ~37
>>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that 
>>> we  run in a scenario where try_to_free frees a lot of pages via, but 
>>> of the  wrong migrate type?
>>
>> It's possible but the window is small. When a threshold is reached on the
>> PCP lists, they get drained to the buddy lists and later picked up again
>> by __rmqueue_fallback(). I had considered the possibility of pages of the
>> wrong type being on the PCP lists which led to the patch "page-allocator:
>> Fallback to other per-cpu lists when the target list is empty and 
>> memory is
>> low" but you reported it made no difference even when fallback was 
>> allowed
>> with high watermarks.
>> [...]
> 
> Today I created rather huge debug logs - I'll spare everyone with too 
> much detail.
> Eventually it comes down to some kind of cat /proc/zoneinfo like output 
> extended to list all things per migrate type too.
> 
>  From that I still think there should be plenty of pages to get the 
> allocation through, as it was suggested by the high amount of pages 
> freed by try_to_free.
> > 
[...]

> More about that tomorrow,

Well tomorrow is now, and I think I got some important new insights.

As mentioned before I realized that a second call still fails most of the time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist and buffered_rmqueue to see where the allocation exactly fails (patch attached).

Now with debugme active in a second call after it had progress&&!page in direct_reclaim I saw the following repeating pattern in most of the cases:
   get_page_from_freelist - zone loop - current zone 'DMA'
   get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
   get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
   get_page_from_freelist - return page '(null)'

Ok - now we at least know exactly why it gets no page.
Remember there are plenty of pages, like in my zoneinfo-like report in the last mail.
I didn't expect that, but the watermarks are actually what stops the allocations here.
The zone_watermark_ok check reports that there are not enough pages for the current watermark and
finally it ends at the zone_reclaim_mode check, which is always zero on s390 as we are not CONFIG_NUMA.
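For reference, this is roughly what that check does - a simplified sketch of the
2.6.32-era zone_watermark_ok() written down from memory, not the exact code:

/*
 * Simplified sketch of zone_watermark_ok() - from memory, not verbatim.
 * For our order-0 page cache allocations the loop never runs, so it
 * boils down to "do the free pages stay above mark + lowmem_reserve
 * after this allocation".
 */
static int zone_watermark_ok_sketch(struct zone *z, int order, unsigned long mark,
				    int classzone_idx, int alloc_flags)
{
	long min = mark;
	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
	int o;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;

	for (o = 0; o < order; o++) {
		/* at order o+1 and above, order-o pages are of no use */
		free_pages -= z->free_area[o].nr_free << o;
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}
	return 1;
}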

Ok, remember my scenario - several parallel iozone sequential read processes - there's not much allocation going on except for the page cache read ahead related to that read workload.
The page cache allocations seem to have no special watermarks selected via their GFP flags and therefore get stalled by congestion_wait - which, with no writes in flight, consumes its full timeout.
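For illustration, congestion_wait() is roughly the following (simplified sketch
from memory, not verbatim) - it sleeps on a global congestion waitqueue that is
only woken when a backing device clears its congested state, so with no writes
in flight nothing ever wakes us early and the full 20ms pass even if memory was
freed in the meantime:

/*
 * Rough sketch of congestion_wait(): the caller is only woken early by
 * clear_bdi_congested() on some backing device, never because pages
 * became available in the meantime.
 */
long congestion_wait_sketch(int sync, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);
	return ret;
}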

As I see significant impact on the iozone throughput and not only e.g. bad counters in direct_reclaim, the congestion_wait stalls seem to happen often/long enough to stall the application I/O itself.
That means that from the time VFS starts a read ahead, the page allocation is stalled long enough that the data is not ready when the application tries to read it, while it would be if the allocation were fast enough.

A simple test for this theory was to allow those failing allocations a second chance without watermark restrictions before putting them to sleep for such a long time.

Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c  2010-02-19 09:53:14.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c       2010-02-19 09:56:26.000000000 +0100
@@ -1954,7 +1954,22 @@

        if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
                /* Wait for some write requests to complete then retry */
-               congestion_wait(BLK_RW_ASYNC, HZ/50);
+               /*
+                * If it gets here try it without watermarks before going
+                * to sleep.
+                *
+                * This will end up in alloc_high_priority and if that fails
+                * once more direct_reclaim but this time without watermark
+                * checks.
+                *
+                * FIXME: that is just for verification - a real fix needs to
+                * ensure e.g. page cache allocations don't drain all pages
+                * under watermark
+                */
+               if (!(alloc_flags & ALLOC_NO_WATERMARKS))
+                       alloc_flags &= ALLOC_NO_WATERMARKS;
+               else
+                       congestion_wait(BLK_RW_ASYNC, HZ/50);
                goto rebalance;
        }


This fixes all issues I have, but as stated in the FIXME it is unfortunately no fix for the real world.

Unfortunately, even now knowing the place of the issue so well, I don't see the connection to the commits e084b+5f8dcc21 - I couldn't find anything, but did they change the accounting somewhere or e.g. change the timing/order of watermark updates and allocations?

Eventually it might come down to a discussion of allocation priorities and we might even keep them as is and accept this issue - I still would prefer a good second chance implementation, other page cache allocation flags or something else that explicitly solves this issue.
Mel's patch that replaces congestion_wait with a wait for the zone watermarks becoming available again is definitely a step in the right direction and should go into upstream and the long term support branches.



-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

[-- Attachment #2: perf-count-page-alloc-fail-page-debugchain.diff --]
[-- Type: text/x-patch, Size: 6947 bytes --]

Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2010-02-18 15:58:20.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2010-02-18 16:53:19.000000000 +0100
@@ -1179,7 +1179,7 @@
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, int order, gfp_t gfp_flags,
-			int migratetype)
+			int migratetype, int debugme)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1195,14 +1195,19 @@
 		pcp = &zone_pcp(zone, cpu)->pcp;
 		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
+		if (debugme) printk(KERN_INFO"%s - got list %p for migratetype %d\n", __func__, list, migratetype);
 		if (list_empty(list)) {
+			if (debugme) printk(KERN_INFO"%s - list was empty - go for rmqueue_bulk\n", __func__);
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
 					migratetype, cold);
-			if (unlikely(list_empty(list)))
+			if (unlikely(list_empty(list))) {
+				if (debugme) printk(KERN_INFO"%s - list still empty after rmqueue_bulk\n", __func__);
 				goto failed;
+			}
 		}
 
+		if (debugme) printk(KERN_INFO"%s - list has pages - I'm irq save on one cpu and take one\n", __func__);
 		if (cold)
 			page = list_entry(list->prev, struct page, lru);
 		else
@@ -1238,8 +1243,10 @@
 	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
-	if (prep_new_page(page, order, gfp_flags))
+	if (prep_new_page(page, order, gfp_flags)) {
+		if (debugme) printk(KERN_INFO"%s - prep_new_page triggered - goto again\n", __func__);
 		goto again;
+	}
 	return page;
 
 failed:
@@ -1515,7 +1522,7 @@
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int migratetype, int debugme)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1533,45 +1540,61 @@
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						high_zoneidx, nodemask) {
+		if (debugme) printk(KERN_INFO"%s - zone loop - current zone '%s'\n", __func__, zone->name);
 		if (NUMA_BUILD && zlc_active &&
-			!zlc_zone_worth_trying(zonelist, z, allowednodes))
+			!zlc_zone_worth_trying(zonelist, z, allowednodes)) {
+				if (debugme) printk(KERN_INFO"%s - skip due to NUMA_BUILD && zlc_active\n", __func__);
 				continue;
+		}
 		if ((alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
+			!cpuset_zone_allowed_softwall(zone, gfp_mask)) {
+				if (debugme) printk(KERN_INFO"%s - skip due to ALLOC_CPUSET && !allowed_softwall\n", __func__);
 				goto try_next_zone;
+		}
 
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
 			int ret;
 
+			if (debugme) printk(KERN_INFO"%s - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)\n", __func__);
+
 			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (zone_watermark_ok(zone, order, mark,
-				    classzone_idx, alloc_flags))
+				    classzone_idx, alloc_flags)) {
+				if (debugme) printk(KERN_INFO"%s - goto this_zone due to zone_watermark_ok\n", __func__);
 				goto try_this_zone;
+			}
 
-			if (zone_reclaim_mode == 0)
+			if (zone_reclaim_mode == 0) {
+				if (debugme) printk(KERN_INFO"%s - goto zone_full due to zone_reclaim_mode==0\n", __func__);
 				goto this_zone_full;
+			}
 
 			ret = zone_reclaim(zone, gfp_mask, order);
 			switch (ret) {
 			case ZONE_RECLAIM_NOSCAN:
 				/* did not scan */
+				if (debugme) printk(KERN_INFO"%s - goto next_zone due to ZONE_RECLAIM_NOSCAN\n", __func__);
 				goto try_next_zone;
 			case ZONE_RECLAIM_FULL:
 				/* scanned but unreclaimable */
+				if (debugme) printk(KERN_INFO"%s - goto zone_full due to ZONE_RECLAIM_FULL\n", __func__);
 				goto this_zone_full;
 			default:
 				/* did we reclaim enough */
 				if (!zone_watermark_ok(zone, order, mark,
-						classzone_idx, alloc_flags))
+						classzone_idx, alloc_flags)) {
+					if (debugme) printk(KERN_INFO"%s - goto zone_full due to !zone_watermark_ok\n", __func__);
 					goto this_zone_full;
+				}
 			}
 		}
 
 try_this_zone:
+		if (debugme) printk(KERN_INFO"%s - reach buffered_rmqueue for zone '%s'\n", __func__, zone->name);
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+						gfp_mask, migratetype, debugme);
 		if (page)
 			break;
 this_zone_full:
@@ -1594,6 +1617,8 @@
 		zlc_active = 0;
 		goto zonelist_scan;
 	}
+
+	if (debugme) printk(KERN_INFO"%s - return page '%p'\n", __func__, page);
 	return page;
 }
 
@@ -1655,7 +1680,7 @@
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, migratetype,0);
 	if (page)
 		goto out;
 
@@ -1979,7 +2004,7 @@
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
-					migratetype);
+					migratetype, 0);
 	t4 = get_clock();
 
 	perf_count_pages_direct_reclaim++;
@@ -1994,11 +2019,12 @@
 		perf_count_direct_reclaim_slow_sum_progress += *did_some_progress;
 		perf_count_direct_reclaim_slow_sum_order += order;
 
-		perf_count_page_stats(nodemask, zonelist, high_zoneidx, preferred_zone, migratetype, *did_some_progress, gfp_mask, order);
+		// perf_count_page_stats(nodemask, zonelist, high_zoneidx, preferred_zone, migratetype, *did_some_progress, gfp_mask, order);
+		// We know stats look good but even second alloc fail (might be racy) - call get_page_from_freelist with debug instead
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
-					migratetype);
+					migratetype, 1);
 		if (!page)
 			printk(KERN_DEBUG"second get_page_from_freelist fail\n");
 		else
@@ -2032,7 +2058,7 @@
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype,0);
 
 		if (!page && gfp_mask & __GFP_NOFAIL) {
 			perf_count_call_congestion_wait_from_alloc_pages_high_priority++;
@@ -2142,7 +2168,7 @@
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype,0);
 	if (page)
 		goto got_pg;
 
@@ -2268,7 +2294,7 @@
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype,0);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-19 11:19                                               ` Christian Ehrhardt
@ 2010-02-19 15:19                                                 ` Mel Gorman
  2010-02-22 15:42                                                   ` Christian Ehrhardt
  2010-03-02  6:52                                                   ` Nick Piggin
  0 siblings, 2 replies; 41+ messages in thread
From: Mel Gorman @ 2010-02-19 15:19 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> >>>>
> >>>> PAGES-FREED  fast path   slow path
> >>>> GOOD CASE      ~62       ~1.46
> >>>> BAD CASE       ~62       ~37
> >>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that 
> >>> we  run in a scenario where try_to_free frees a lot of pages via, but 
> >>> of the  wrong migrate type?
> >>
> >> It's possible but the window is small. When a threshold is reached on the
> >> PCP lists, they get drained to the buddy lists and later picked up again
> >> by __rmqueue_fallback(). I had considered the possibility of pages of the
> >> wrong type being on the PCP lists which led to the patch "page-allocator:
> >> Fallback to other per-cpu lists when the target list is empty and 
> >> memory is
> >> low" but you reported it made no difference even when fallback was 
> >> allowed
> >> with high watermarks.
> >> [...]
> > 
> > Today I created rather huge debug logs - I'll spare everyone with too 
> > much detail.
> > Eventually it comes down to some kind of cat /proc/zoneinfo like output 
> > extended to list all things per migrate type too.
> > 
> >  From that I still think there should be plenty of pages to get the 
> > allocation through, as it was suggested by the high amount of pages 
> > freed by try_to_free.
> > > 
> [...]
> 
> > More about that tomorrow,
> 
> Well tomorrow is now, and I think I got some important new insights.
> 
> As mentioned before I realized that a second call still fails most of the
> time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist
> and buffered_rmqueue to see where the allocations exactly fails (patch
> attached).
> 
> Now with debugme active in a second call after it had progress&&!page in direct_reclaim I saw the following repeating pattern in most of the cases:
>    get_page_from_freelist - zone loop - current zone 'DMA'
>    get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
>    get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
>    get_page_from_freelist - return page '(null)'
> 
> Ok - now we at least exactly know why it gets no page.
> Remember there are plenty of pages like it was in my zoneinfo like report in the last mail.
> I didn't expect that, but actually watermarks are what stops the allocations here.

That is somewhat expected. We also don't want to go underneath them
because that can lead to system deadlock.

> The zone_watermark_ok check reports that there are not enough pages for the current watermark and
> finally it ends with zone_reclaim_mode which is always zero on s390 as we are not CONFIG_NUMA.
> 
> Ok remember my scenario - several parallel iozone sequential read processes
> - theres not much allocation going on except for the page cache read ahead
> related to that read workload.
> The allocations for page cache seem to have no special watermarks selected
> via their GFP flags and therefore get stalled by congestion_wait - which in
> consequence of no available writes in flight consumes its full timeout.
>

Which I'd expect to some extent, but it still stuns me that e084b makes
a difference to any of this. The one-list-per-migratetype patch would
make some sense except restoring something similar to the old behaviour
didn't help either.

> As I see significant impacts to the iozone throughput and not only
> e.g. bad counters in direct_reclaim the congestion_wait stalling seems to
> be so often/long to stall the application I/O itself.
>
> That means from the time VFS starting a read ahead it seems to stall that
> page allocation long enough that the data is not ready when the application
> tries to read it, while it would be if the allocation would be fast enough.
>
> A simple test for this theory was to allow those failing allocations a
> second chance without watermark restrictions before putting them to sleep
> for such a long time.

> 
> Index: linux-2.6-git/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page_alloc.c  2010-02-19 09:53:14.000000000 +0100
> +++ linux-2.6-git/mm/page_alloc.c       2010-02-19 09:56:26.000000000 +0100
> @@ -1954,7 +1954,22 @@
> 
>         if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
>                 /* Wait for some write requests to complete then retry */
> -               congestion_wait(BLK_RW_ASYNC, HZ/50);
> +               /*
> +                * If it gets here try it without watermarks before going
> +                * to sleep.
> +                *
> +                * This will end up in alloc_high_priority and if that fails
> +                * once more direct_reclaim but this time without watermark
> +                * checks.
> +                *
> +                * FIXME: that is just for verification - a real fix needs to
> +                * ensure e.g. page cache allocations don't drain all pages
> +                * under watermark
> +                */
> +               if (!(alloc_flags & ALLOC_NO_WATERMARKS))
> +                       alloc_flags &= ALLOC_NO_WATERMARKS;
> +               else
> +                       congestion_wait(BLK_RW_ASYNC, HZ/50);
>                 goto rebalance;
>         }
> 
> This fixes all issues I have, but as stated in the FIXME it is unfortunately
> no fix for the real world.

It's possible to deadlock a system with this patch. It's also not acting
as you intended. I think you meant either |= or = there but anyway.

> Unfortunately even now knowing the place of the issue so well I don't see
> the connection to the commits e084b+5f8dcc21

Still a mystery.

>  - I couldn't find something but
> did they change the accounting somewhere or e.g. changed the timing/order
> of watermark updates and allocations?
>

Not that I can think of.

> Eventually it might come down to a discussion of allocation priorities and
> we might even keep them as is and accept this issue - I still would prefer
> a good second chance implementation, other page cache allocation flags or
> something else that explicitly solves this issue.
>

In that line, the patch that replaced congestion_wait() with a waitqueue
makes some sense.

> Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> becoming available again is definitely a step in the right direction and
> should go into upstream and the long term support branches.

I'll need to do a number of tests before I can move that upstream but I
don't think it's a merge candidate. Unfortunately, I'll be offline for a
week starting tomorrow so I won't be able to do the testing.

When I get back, I'll revisit those patches with the view to pushing
them upstream. I hate to treat symptoms here without knowing the
underlying problem but this has been spinning in circles for ages with
little forward progress :(

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-19 15:19                                                 ` Mel Gorman
@ 2010-02-22 15:42                                                   ` Christian Ehrhardt
  2010-02-25 15:13                                                     ` Christian Ehrhardt
  2010-03-02  6:52                                                   ` Nick Piggin
  1 sibling, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-22 15:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

Mel Gorman wrote:
[...]
> 
>> Unfortunately even now knowing the place of the issue so well I don't see
>> the connection to the commits e084b+5f8dcc21
> 
> Still a mystery.
> 
>>  - I couldn't find something but
>> did they change the accounting somewhere or e.g. changed the timing/order
>> of watermark updates and allocations?
>>
> 
> Not that I can think of.
> 
>> Eventually it might come down to a discussion of allocation priorities and
>> we might even keep them as is and accept this issue - I still would prefer
>> a good second chance implementation, other page cache allocation flags or
>> something else that explicitly solves this issue.
>>
> 
> In that line, the patch that replaced congestion_wait() with a waitqueue
> makes some sense.

[...]

> I'll need to do a number of tests before I can move that upstream but I
> don't think it's a merge candidate. Unfortunately, I'll be offline for a
> week starting tomorrow so I won't be able to do the testing.
> 
> When I get back, I'll revisit those patches with the view to pushing
> them upstream. I hate to treat symptoms here without knowing the
> underlying problem but this has been spinning in circles for ages with
> little forward progress :(

I'll continue with some debugging in search for the real reasons, but if 
I can't find a new way to look at it I think we have to drop it for now.

While I hate fixing symptoms too, I completely agree that  it is time to 
bring this fix upstream and in fact into the stable kernel too, to have 
at least a ~98% workaround out there.

I'm looking forward to your revised patch after you are back and I'm 
eager to test it again.


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-22 15:42                                                   ` Christian Ehrhardt
@ 2010-02-25 15:13                                                     ` Christian Ehrhardt
  2010-02-26 11:18                                                       ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Ehrhardt @ 2010-02-25 15:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

Christian Ehrhardt wrote:
> Mel Gorman wrote:
> [...]
> 
>> I'll need to do a number of tests before I can move that upstream but I
>> don't think it's a merge candidate. Unfortunately, I'll be offline for a
>> week starting tomorrow so I won't be able to do the testing.
>>
>> When I get back, I'll revisit those patches with the view to pushing
>> them upstream. I hate to treat symptoms here without knowing the
>> underlying problem but this has been spinning in circles for ages with
>> little forward progress :(
> 
> I'll continue with some debugging in search for the real reasons, but if 
> I can't find a new way to look at it I think we have to drop it for now.
> 
[...]
> 

As a last try I partially rewrote my debug patches, which now report what
I call "extended zone info" (like /proc/zoneinfo plus free area and per
migrate type counters) once every second at a random direct_reclaim call.

Naming: As before I call plain 2.6.32 "orig" or "bad case", and 2.6.32
with e084b and 5f8dcc21 reverted "Rev" or "good case".
Depending on whether an allocation failed or not before the
statistics were reported I call them "failed" or "worked".
Therefore I split the resulting data into the four cases orig-failed,
orig-worked, Rev-failed, Rev-worked.

This could again end up in a report of something most people expected (like
being stopped by the watermark last time), but I still think it is worth
reporting and having everyone take a look at it.

PRE)
First, and probably most important to keep in mind later on - the good
case seems to have more pages free, usually staying above the watermark,
and is therefore not running into those failed direct_reclaim allocations.
The question for all the facts I describe below is "can this affect the
number of free pages directly, or is it indirectly a pointer to whatever
else is going on that affects the # free pages?"

As a note for all data below: the page cache allocations that occur when
running the read workload are __GFP_COLD=1 with preferred migrate type
MOVABLE.

1) Free page distribution per order in free areas lists
These numbers cover migrate type distribution across free areas per order,
similar to what /proc/pagetypeinfo reports.

There is a major difference between plain 2.6.32 and the kernel with 
e084b and 5f8dcc21 reverted. While the good case shows at least some 
distribution, with a few elements in orders 2-7, the bad case looks quite different.
The bad case has a huge peak at order 0, is about even at order 1 and then has much less in orders 2-7.
Both cases keep one order 8 page as reserve at all times.

Pages per Order      0      1      2      3      4     5      6       7    8
Bad Case        272.85  22.10   2.43   0.51   0.14  0.01   0.01    0.06    1
Good Case        97.55  15.29   3.08   1.39   0.77  0.11   0.29    0.90    1

This might not look like much, but factored down to 4k order-0 pages these numbers would look like:

4kPages per Order    0      1      2      3      4     5      6       7    8
Bad Case        272.85  44.21   9.73   4.08   2.23  0.23   0.45    7.62  256
Good Case        97.55  30.58  12.30  11.12  12.24  3.64  18.67  114.91  256

So something seems to allow grouping into higher orders much better in the good case.

I wonder if there might be code somewhere doing things like this:
if (could_collapse_this_into_a_higher_order)
    free
else
    do_nothing
-> leaving fewer free pages and fewer higher order pages in the bad case.

Remember my introduction - what we are ultimately searching for is why the
bad case has fewer free pages.

3) Migrate types on free areas
Looking at the numbers above in more detail, i.e. split into the
different migrate types, shows another difference.
The bad case has most of its pages as unmovable - interestingly, almost exactly
the amount of pages that is shifted from higher orders to order 0 when comparing
the good and bad cases.
So this might be related to the different order distribution we see above.
(BTW - on s390 all memory from 0-2Gb is one zone; as we have 256m in these tests
all is in one zone)

BAD CASE										
Free pgs per migrate type @ order     0     1     2     3     4     5     6     7     8
MIGRATE_UNMOVABLE                178.17  0.38  0.00  0.00  0.00  0.00  0.00  0.00     0
MIGRATE_RECLAIMABLE               12.95  0.58  0.01  0.00  0.00  0.00  0.00  0.00     0
MIGRATE_MOVABLE                   81.74 21.14  2.29  0.50  0.13  0.00  0.00  0.00     0
MIGRATE_RESERVE                    0.00  0.00  0.13  0.01  0.01  0.01  0.01  0.06     1
										
GOOD CASE										
Free pgs per migrate type @ order     0     1     2     3     4     5     6     7     8
Normal	MIGRATE_UNMOVABLE         21.70  0.14  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Normal	MIGRATE_RECLAIMABLE        4.15  0.22  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Normal	MIGRATE_MOVABLE           68.71 12.38  0.88  0.63  0.06  0.00  0.00  0.00  0.00
Normal	MIGRATE_RESERVE            2.99  2.56  2.19  0.77  0.71  0.11  0.29  0.90  1.00
Normal	MIGRATE_ISOLATE            0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Maybe this gives someone a good hint why we see that different grouping,
or even why we have fewer free pages in the bad case.
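
As background for why the migrate types matter here at all: when the free list
of the preferred type is empty, __rmqueue_fallback() steals from the other
types in a fixed preference order, roughly the following (simplified sketch
from memory, not the exact 2.6.32 table):

/*
 * Sketch of the fallback preference order used by __rmqueue_fallback()
 * when the preferred free list is empty. MOVABLE allocations (like the
 * page cache read ahead here) fall back to RECLAIMABLE and then
 * UNMOVABLE blocks, so the types can get mixed under memory pressure.
 */
static int fallback_order_sketch[MIGRATE_TYPES][3] = {
	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
	[MIGRATE_RESERVE]     = { MIGRATE_RESERVE,     MIGRATE_RESERVE,   MIGRATE_RESERVE }, /* never used */
};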

4) PCP list fill ratio
Finally the last major difference I see is the fill ratio of the pcp lists.
The good case has an average of ~62 pages on the pcp lists while the bad
case has only ~35 pages (exact averages below).

AVG count on pcp lists	
bad case	35.33
good case	62.46


When looking at the migrate types on the pcp lists (which is only possible
with 5f8dcc21 applied, i.e. not reverted) it looks like "that is where the
movable ones have gone" - which they might not have done before per migrate
type pcp list support.

AVG count per migrate type in bad case
MIGRATE_UNMOVABLE      12.57
MIGRATE_RECLAIMABLE     2.03
MIGRATE_MOVABLE        31.89

Is it possible that with 5f8dcc21 the MIGRATE_MOVABLE pages are drained
from the free areas to the pcp lists more aggressively and leave MIGRATE_UNMOVABLE
behind, which then might e.g. not be groupable or something like that - and
that this somehow ends up leaving fewer free pages?
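
To make that suspicion concrete, the freeing side after 5f8dcc21 looks roughly
like this (a simplified sketch from memory, mirroring the allocation side shown
in the buffered_rmqueue() hunk of the debug patch I posted earlier;
MIGRATE_ISOLATE handling and the accounting details are left out):

/*
 * Simplified sketch of free_hot_cold_page() with 5f8dcc21 applied:
 * every freed page is parked on the per-cpu list of its pageblock's
 * migrate type and only drained back to the buddy free areas in
 * batches of pcp->batch once pcp->count exceeds pcp->high.
 */
static void free_hot_cold_page_sketch(struct page *page, int cold)
{
	struct zone *zone = page_zone(page);
	struct per_cpu_pages *pcp = &zone_pcp(zone, get_cpu())->pcp;
	int migratetype = get_pageblock_migratetype(page);

	if (cold)
		list_add_tail(&page->lru, &pcp->lists[migratetype]);
	else
		list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;

	if (pcp->count >= pcp->high) {
		free_pcppages_bulk(zone, pcp->batch, pcp);
		pcp->count -= pcp->batch;
	}
	put_cpu();
}

With high at 90 and batch 15 per cpu (see the pagesets dump in my earlier mail),
a fair number of freed pages can be parked per cpu before anything flows back to
the buddy free areas.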


FIN)
So, that's it from my side.
I look forward to the finalized congestion_wait->zone wait patch however
this turns out (the zone wait is reasonable in my opinion, whether it fixes
this symptom or not).
But still I have a small amount of hope left that all the data I found here
might give someone the kick to see what's going on backstage in mm due to
those patches.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-25 15:13                                                     ` Christian Ehrhardt
@ 2010-02-26 11:18                                                       ` Nick Piggin
  0 siblings, 0 replies; 41+ messages in thread
From: Nick Piggin @ 2010-02-26 11:18 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh

On Thu, Feb 25, 2010 at 04:13:52PM +0100, Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
> FIN)
> So, thats it from my side.
> I look forward to the finalized congestion_wait->zone wait patch however
> this turns out (zone wait is resonable if fixing this symptom or not in
> my opinion).
> But still I have a small amount of hope left that all the data I found here
> might give someone the kick to see whats going on in mm's backstage due to
> that patches.

Hi Christian,

Thanks for doing all this work. I've been looking at a couple of other
regressions while it seemed like you guys were making progress. I can't
seem to reproduce it on my x86-64 system (was anybody able to reproduce
it on x86?)

Anyway I have asked our mainframe guy whether we can set up an
environment.

I agree it is a real problem that really needs to be properly explained.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-02-19 15:19                                                 ` Mel Gorman
  2010-02-22 15:42                                                   ` Christian Ehrhardt
@ 2010-03-02  6:52                                                   ` Nick Piggin
  2010-03-02 10:04                                                     ` Mel Gorman
  1 sibling, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-03-02  6:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christian Ehrhardt, Andrew Morton, linux-kernel, epasch,
	SCHILLIG, Martin Schwidefsky, Heiko Carstens, christof.schmitt,
	thoss, hare, gregkh

On Fri, Feb 19, 2010 at 03:19:34PM +0000, Mel Gorman wrote:
> On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> > Eventually it might come down to a discussion of allocation priorities and
> > we might even keep them as is and accept this issue - I still would prefer
> > a good second chance implementation, other page cache allocation flags or
> > something else that explicitly solves this issue.
> >
> 
> In that line, the patch that replaced congestion_wait() with a waitqueue
> makes some sense.
> 
> > Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> > becoming available again is definitely a step in the right direction and
> > should go into upstream and the long term support branches.
> 
> I'll need to do a number of tests before I can move that upstream but I
> don't think it's a merge candidate. Unfortunately, I'll be offline for a
> week starting tomorrow so I won't be able to do the testing.
> 
> When I get back, I'll revisit those patches with the view to pushing
> them upstream. I hate to treat symptoms here without knowing the
> underlying problem but this has been spinning in circles for ages with
> little forward progress :(

The zone pressure waitqueue patch makes sense. We may even want to make
it more strictly FIFO (eg. check upfront if there are waiters on the
queue before allocating a page, and if yes then add ourself to the back
of the waitqueue). And also possibly even look at doing the wakeups in
the page-freeing path. Although that might start adding too much
overhead, so it's quite possible your sloppy-but-lighter timeout
approach is preferable.
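
Roughly what I have in mind for the upfront check - just a sketch, and purely
hypothetical: pressure_wq is assumed from your (unmerged) zone-wait patch and
the waiters flag does not exist anywhere yet:

/*
 * Hypothetical sketch only: ZONE_PRESSURE_WAITERS and zone->pressure_wq
 * are assumed from the zone-wait patch, neither exists in mainline.
 * The only cost added to the fast path is one flag test.
 */
static inline bool zone_has_pressure_waiters(struct zone *zone)
{
	return test_bit(ZONE_PRESSURE_WAITERS, &zone->flags);
}

/* called before taking a page from the pcp/buddy lists */
static void maybe_queue_behind_waiters(struct zone *zone)
{
	if (likely(!zone_has_pressure_waiters(zone)))
		return;

	/* be fair: let the earlier waiters get their pages first */
	wait_event_timeout(zone->pressure_wq,
			   !zone_has_pressure_waiters(zone), HZ / 50);
}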


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-03-02  6:52                                                   ` Nick Piggin
@ 2010-03-02 10:04                                                     ` Mel Gorman
  2010-03-02 10:36                                                       ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2010-03-02 10:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christian Ehrhardt, Andrew Morton, linux-kernel, epasch,
	SCHILLIG, Martin Schwidefsky, Heiko Carstens, christof.schmitt,
	thoss, hare, gregkh

On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
> On Fri, Feb 19, 2010 at 03:19:34PM +0000, Mel Gorman wrote:
> > On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> > > Eventually it might come down to a discussion of allocation priorities and
> > > we might even keep them as is and accept this issue - I still would prefer
> > > a good second chance implementation, other page cache allocation flags or
> > > something else that explicitly solves this issue.
> > >
> > 
> > In that line, the patch that replaced congestion_wait() with a waitqueue
> > makes some sense.
> > 
> > > Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> > > becoming available again is definitely a step in the right direction and
> > > should go into upstream and the long term support branches.
> > 
> > I'll need to do a number of tests before I can move that upstream but I
> > don't think it's a merge candidate. Unfortunately, I'll be offline for a
> > week starting tomorrow so I won't be able to do the testing.
> > 
> > When I get back, I'll revisit those patches with the view to pushing
> > them upstream. I hate to treat symptoms here without knowing the
> > underlying problem but this has been spinning in circles for ages with
> > little forward progress :(
> 
> The zone pressure waitqueue patch makes sense.

I've just started the rebase and considering what sort of test is best
for it.

> We may even want to make
> it more strictly FIFO (eg. check upfront if there are waiters on the
> queue before allocating a page, and if yes then add ourself to the back
> of the waitqueue).

To be really strict about this, we'd have to check in the hot-path of the
per-cpu allocator which would be undesirable. We could check further in the
slow-path but I bet it'd be very rare that the logic would be triggered. For
a process to enter the FIFO due to waiters that were not yet woken up, the
system would have to be a) under heavy memory pressure b) reclaim taking such
a long time that check_zone_pressure() is not being called in time and c)
a process exiting or otherwise freeing memory such that the watermarks are
cleared without reclaim being involved.

This seems overkill but maybe you have a simpler case in mind?

> And also possibly even look at doing the wakeups in
> the page-freeing path. Although that might start adding too much
> overhead, so it's quite possible your sloppy-but-lighter timeout
> approach is preferable.
> 

That's how I felt about it. I was going to put another check_zone_pressure()
check after a pcp drain but thought it was too expensive.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-03-02 10:04                                                     ` Mel Gorman
@ 2010-03-02 10:36                                                       ` Nick Piggin
  2010-03-02 11:01                                                         ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-03-02 10:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christian Ehrhardt, Andrew Morton, linux-kernel, epasch,
	SCHILLIG, Martin Schwidefsky, Heiko Carstens, christof.schmitt,
	thoss, hare, gregkh

On Tue, Mar 02, 2010 at 10:04:02AM +0000, Mel Gorman wrote:
> On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
> > The zone pressure waitqueue patch makes sense.
> 
> I've just started the rebase and considering what sort of test is best
> for it.
> 
> > We may even want to make
> > it more strictly FIFO (eg. check upfront if there are waiters on the
> > queue before allocating a page, and if yes then add ourself to the back
> > of the waitqueue).
> 
> To be really strict about this, we'd have to check in the hot-path of the
> per-cpu allocator which would be undesirable.

Yes, but it would also be desirable for other reasons (eliminating
all unnecessary latency here). Which is why I say it is a tradeoff
but it might be worthwhile.

We obviously wouldn't load the waitqueue itself in the fastpath, but
probably some field in a cacheline we already touch.


> We could check further in the
> slow-path but I bet it'd be very rare that the logic would be triggered. For
> a process to enter the FIFO due to waiters that were not yet woken up, the
> system would have to be a) under heavy memory pressure b) reclaim taking such
> a long time that check_zone_pressure() is not being called in time and c)
> a process exiting or otherwise freeing memory such that the watermarks are
> cleared without reclaim being involved.

I don't think it would be too rare. Things can get freed up and
other allocations come in while reclaim is happening. But anyway
the nasty thing about the "rare" events is that they do add a
rare source of unexpected latency or starvation.

I'm not saying necessarily that it will make a noticable improvement
to throughput for this test case, but that now that we have the
waitqueue there, we should think about exactly how far we want to
go with it.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-03-02 10:36                                                       ` Nick Piggin
@ 2010-03-02 11:01                                                         ` Mel Gorman
  2010-03-02 11:18                                                           ` Nick Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2010-03-02 11:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christian Ehrhardt, Andrew Morton, linux-kernel, epasch,
	SCHILLIG, Martin Schwidefsky, Heiko Carstens, christof.schmitt,
	thoss, hare, gregkh

On Tue, Mar 02, 2010 at 09:36:46PM +1100, Nick Piggin wrote:
> On Tue, Mar 02, 2010 at 10:04:02AM +0000, Mel Gorman wrote:
> > On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
> > > The zone pressure waitqueue patch makes sense.
> > 
> > I've just started the rebase and considering what sort of test is best
> > for it.
> > 
> > > We may even want to make
> > > it more strictly FIFO (eg. check upfront if there are waiters on the
> > > queue before allocating a page, and if yes then add ourself to the back
> > > of the waitqueue).
> > 
> > To be really strict about this, we'd have to check in the hot-path of the
> > per-cpu allocator which would be undesirable.
> 
> Yes, but it would also be desirable for other reasons (eliminating
> all unnecessary latency here). Which is why I say it is a tradeoff
> but it might be worthwhile.
> 
> We obviously wouldn't load the waitqueue itself in the fastpath, but
> probably some field in a cacheline we already touch.
> 

I'm struggling to justify it in my head. Granted, right now the pressure_wq is
located beside the wait_table as a "rarely-used" field. It could be located
before the free_area[] so it would be pulled in which would have some impact
for the high-order lists but probably not too bad. This would make
accessing pressure_wq a little cheaper.

The impact of *not* checking is unfairness. A late process makes forward
progress while others sleep on a queue waiting for a reclaim pass to finish
and the queue to be woken up. At worst, processes continue sleeping for the
entire timeout because late processes kept the zone below the watermark.

Is a little unfairness in a path concerned with heavy memory pressure
enough to justify checking on every page allocation?

> > We could check further in the
> > slow-path but I bet it'd be very rare that the logic would be triggered. For
> > a process to enter the FIFO due to waiters that were not yet woken up, the
> > system would have to be a) under heavy memory pressure b) reclaim taking such
> > a long time that check_zone_pressure() is not being called in time and c)
> > a process exiting or otherwise freeing memory such that the watermarks are
> > cleared without reclaim being involved.
> 
> I don't think it would be too rare. Things can get freed up and
> other allocations come in while reclaim is happening. But anyway
> the nasty thing about the "rare" events is that they do add a
> rare source of unexpected latency or starvation.
> 

If processes are asleep on the waitqueue, reclaim must be active (by kswapd
if nothing else). If pages are getting freed above the necessary watermark,
then the processes will be woken up when the current shrink_zone() finished
unless unfair processes are keeping the zone below watermarks.  But unless
reclaim is taking an extraordinary long length of time, there would be little
difference between waking the queue in the free path and waking it in the
reclaim path.

Again, is a little unfairness under heavy memory pressure enough to
alter the main paths?

> I'm not saying necessarily that it will make a noticable improvement
> to throughput for this test case, but that now that we have the
> waitqueue there, we should think about exactly how far we want to
> go with it.

Fair point but I'd sooner look at the other places the VM waits on timeouts
and see can they be converted to waiting on events. Just in case, after these
tests complete (and assuming they generate usable figures), I'll prototype
some of the suggestions and see what the impact is.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-03-02 11:01                                                         ` Mel Gorman
@ 2010-03-02 11:18                                                           ` Nick Piggin
  2010-03-02 11:24                                                             ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Nick Piggin @ 2010-03-02 11:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christian Ehrhardt, Andrew Morton, linux-kernel, epasch,
	SCHILLIG, Martin Schwidefsky, Heiko Carstens, christof.schmitt,
	thoss, hare, gregkh

On Tue, Mar 02, 2010 at 11:01:50AM +0000, Mel Gorman wrote:
> On Tue, Mar 02, 2010 at 09:36:46PM +1100, Nick Piggin wrote:
> > On Tue, Mar 02, 2010 at 10:04:02AM +0000, Mel Gorman wrote:
> > > On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
> > > > The zone pressure waitqueue patch makes sense.
> > > 
> > > I've just started the rebase and considering what sort of test is best
> > > for it.
> > > 
> > > > We may even want to make
> > > > it more strictly FIFO (eg. check upfront if there are waiters on the
> > > > queue before allocating a page, and if yes then add ourself to the back
> > > > of the waitqueue).
> > > 
> > > To be really strict about this, we'd have to check in the hot-path of the
> > > per-cpu allocator which would be undesirable.
> > 
> > Yes, but it would also be desirable for other reasons (eliminating
> > all unnecessary latency here). Which is why I say it is a tradeoff
> > but it might be worthwhile.
> > 
> > We obviously wouldn't load the waitqueue itself in the fastpath, but
> > probably some field in a cacheline we already touch.
> > 
> 
> I'm struggling to justify it in my head. Granted, right now the pressure_wq is
> located beside the wait_table as a "rarely-used" field. It could be located
> before the free_area[] so it would be pulled in which would have some impact
> for the high-order lists but probably not too bad. This would make
> accessing pressure_wq a little cheaper.

No you would mark it in another field in the slowpath.

It would be justified if the aggregate cost of the work is outweighed
by the reduction in unnecessary wait time in the congestion code, plus
some unquantifiable goodness from the extra fairness.

Obviously that depends on the workload, but so does every change we
do like this.
 

> The impact of *not* checking is unfairness. A late process makes forward
> progress while others sleep on a queue waiting for a reclaim pass to finish
> and the queue to be woken up. At worst, processes continue sleeping for the
> entire timeout because late processes kept the zone below the watermark.
> 
> Is a little unfairness in a path concerned with heavy memory pressure
> enough to justify checking on every page allocation?

Well, we've had patches that do a lot more than that in the allocator
that most people don't really use :) (and they've had relatively small
impact).

But no, it adds at least another branch, icache cost, and at least one
more memory bit in the fastpath. Obviously that's a drawback.


> > > We could check further in the
> > > slow-path but I bet it'd be very rare that the logic would be triggered. For
> > > a process to enter the FIFO due to waiters that were not yet woken up, the
> > > system would have to be a) under heavy memory pressure b) reclaim taking such
> > > a long time that check_zone_pressure() is not being called in time and c)
> > > a process exiting or otherwise freeing memory such that the watermarks are
> > > cleared without reclaim being involved.
> > 
> > I don't think it would be too rare. Things can get freed up and
> > other allocations come in while reclaim is happening. But anyway
> > the nasty thing about the "rare" events is that they do add a
> > rare source of unexpected latency or starvation.
> > 
> 
> If processes are asleep on the waitqueue, reclaim must be active (by kswapd
> if nothing else). If pages are getting freed above the necessary watermark,
> then the processes will be woken up when the current shrink_zone() finished
> unless unfair processes are keeping the zone below watermarks.  But unless
> reclaim is taking an extraordinary long length of time, there would be little
> difference between waking the queue in the free path and waking it in the
> reclaim path.

Reclaim can take quite a while, yes.

 
> Again, is a little unfairness under heavy memory pressure enough to
> alter the main paths?

Well you didn't let me answer yet :) See above.

 
> > I'm not saying necessarily that it will make a noticable improvement
> > to throughput for this test case, but that now that we have the
> > waitqueue there, we should think about exactly how far we want to
> > go with it.
> 
> Fair point but I'd sooner look at the other places the VM waits on timeouts
> and see can they be converted to waiting on events. Just in case, after these
> tests complete (and assuming they generate usable figures), I'll prototype
> some of the suggestions and see what the impact is.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-03-02 11:18                                                           ` Nick Piggin
@ 2010-03-02 11:24                                                             ` Mel Gorman
  2010-03-03  6:51                                                               ` Christian Ehrhardt
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2010-03-02 11:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christian Ehrhardt, Andrew Morton, linux-kernel, epasch,
	SCHILLIG, Martin Schwidefsky, Heiko Carstens, christof.schmitt,
	thoss, hare, gregkh

On Tue, Mar 02, 2010 at 10:18:27PM +1100, Nick Piggin wrote:
> On Tue, Mar 02, 2010 at 11:01:50AM +0000, Mel Gorman wrote:
> > On Tue, Mar 02, 2010 at 09:36:46PM +1100, Nick Piggin wrote:
> > > On Tue, Mar 02, 2010 at 10:04:02AM +0000, Mel Gorman wrote:
> > > > On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
> > > > > The zone pressure waitqueue patch makes sense.
> > > > 
> > > > I've just started the rebase and considering what sort of test is best
> > > > for it.
> > > > 
> > > > > We may even want to make
> > > > > it more strictly FIFO (eg. check upfront if there are waiters on the
> > > > > queue before allocating a page, and if yes then add ourself to the back
> > > > > of the waitqueue).
> > > > 
> > > > To be really strict about this, we'd have to check in the hot-path of the
> > > > per-cpu allocator which would be undesirable.
> > > 
> > > Yes, but it would also be desirable for other reasons (eliminating
> > > all unnecessary latency here). Which is why I say it is a tradeoff
> > > but it might be worthwhile.
> > > 
> > > We obviously wouldn't load the waitqueue itself in the fastpath, but
> > > probably some field in a cacheline we already touch.
> > > 
> > 
> > I'm struggling to justify it in my head. Granted, right now the pressure_wq is
> > located beside the wait_table as a "rarely-used" field. It could be located
> > before the free_area[] so it would be pulled in which would have some impact
> > for the high-order lists but probably not too bad. This would make
> > accessing pressure_wq a little cheaper.
> 
> No you would mark it in another field in the slowpath.
> 

i.e. a flag in zone->flags that is set if something is waiting on the queue?

> It would be justified if the aggregate cost of the work is outweighed
> by the reduction in unnecessary wait time in the congestion code, plus
> some unquantifiable goodness from the extra fairness.
> 
> Obviously that depends on the workload, but so does every change we
> do like this.
>  

True. I'll sketch out what needs to happen along these lines and see
what it looks like.

> > The impact of *not* checking is unfairness. A late process makes forward
> > progress while others sleep on a queue waiting for a reclaim pass to finish
> > and the queue to be woken up. At worst, processes continue sleeping for the
> > entire timeout because late processes kept the zone below the watermark.
> > 
> > Is a little unfairness in a path concerned with heavy memory pressure
> > enough to justify checking on every page allocation?
> 
> Well, we've had patches that do a lot more than that in the allocator
> that most people don't really use :) (and they've had relatively small
> impact).
> 
> But no, it adds at least another branch, icache cost, and at least one
> more memory bit in the fastpath. Obviously that's a drawback.
> 

Of which the branch would be the greatest concern but it's also a very
unlikely branch.

> > > > We could check further in the
> > > > slow-path but I bet it'd be very rare that the logic would be triggered. For
> > > > a process to enter the FIFO due to waiters that were not yet woken up, the
> > > > system would have to be a) under heavy memory pressure b) reclaim taking such
> > > > a long time that check_zone_pressure() is not being called in time and c)
> > > > a process exiting or otherwise freeing memory such that the watermarks are
> > > > cleared without reclaim being involved.
> > > 
> > > I don't think it would be too rare. Things can get freed up and
> > > other allocations come in while reclaim is happening. But anyway
> > > the nasty thing about the "rare" events is that they do add a
> > > rare source of unexpected latency or starvation.
> > > 
> > 
> > If processes are asleep on the waitqueue, reclaim must be active (by kswapd
> > if nothing else). If pages are getting freed above the necessary watermark,
> > then the processes will be woken up when the current shrink_zone() finished
> > unless unfair processes are keeping the zone below watermarks.  But unless
> > reclaim is taking an extraordinary long length of time, there would be little
> > difference between waking the queue in the free path and waking it in the
> > reclaim path.
> 
> Reclaim can take quite a while, yes.
> 
>  
> > Again, is a little unfairness under heavy memory pressure enough to
> > alter the main paths?
> 
> Well you didn't let me answer yet :) See above.
> 
>  
> > > I'm not saying necessarily that it will make a noticable improvement
> > > to throughput for this test case, but that now that we have the
> > > waitqueue there, we should think about exactly how far we want to
> > > go with it.
> > 
> > Fair point but I'd sooner look at the other places the VM waits on timeouts
> > and see can they be converted to waiting on events. Just in case, after these
> > tests complete (and assuming they generate usable figures), I'll prototype
> > some of the suggestions and see what the impact is.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
  2010-03-02 11:24                                                             ` Mel Gorman
@ 2010-03-03  6:51                                                               ` Christian Ehrhardt
  0 siblings, 0 replies; 41+ messages in thread
From: Christian Ehrhardt @ 2010-03-03  6:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Andrew Morton, linux-kernel, epasch, SCHILLIG,
	Martin Schwidefsky, Heiko Carstens, christof.schmitt, thoss,
	hare, gregkh



Mel Gorman wrote:
> On Tue, Mar 02, 2010 at 10:18:27PM +1100, Nick Piggin wrote:
>> On Tue, Mar 02, 2010 at 11:01:50AM +0000, Mel Gorman wrote:
>>> On Tue, Mar 02, 2010 at 09:36:46PM +1100, Nick Piggin wrote:
>>>> On Tue, Mar 02, 2010 at 10:04:02AM +0000, Mel Gorman wrote:
>>>>> On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
[...]
>>>>> We could check further in the
>>>>> slow-path but I bet it'd be very rare that the logic would be triggered. For
>>>>> a process to enter the FIFO due to waiters that were not yet woken up, the
>>>>> system would have to be a) under heavy memory pressure b) reclaim taking such
>>>>> a long time that check_zone_pressure() is not being called in time and c)
>>>>> a process exiting or otherwise freeing memory such that the watermarks are
>>>>> cleared without reclaim being involved.
>>>> I don't think it would be too rare. Things can get freed up and
>>>> other allocations come in while reclaim is happening. But anyway
>>>> the nasty thing about the "rare" events is that they do add a
>>>> rare source of unexpected latency or starvation.
>>>>
>>> If processes are asleep on the waitqueue, reclaim must be active (by kswapd
>>> if nothing else). If pages are getting freed above the necessary watermark,
>>> then the processes will be woken up when the current shrink_zone() finished
>>> unless unfair processes are keeping the zone below watermarks.  But unless
>>> reclaim is taking an extraordinary long length of time, there would be little
>>> difference between waking the queue in the free path and waking it in the
>>> reclaim path.
>>
>> Reclaim can take quite a while, yes.
>>

On the one hand, the question "is waiter A not yet woken after 
shrink_zone(), while greedy B has already drained pages back below the 
watermark" is worth pursuing to make this new waitqueue approach as 
good as it can be.
On the other hand, one can also look at it this way: the allocator is 
now at least waiting for the right thing - the related watermark being 
restored - which is in any case better than waiting on writes that may 
or may not free enough pages, or, as in my case, may not even be 
happening :-)
And even if the timing can be a bit racy, as you described, it will 
still be much better than it is at the moment.
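
Just to make it concrete what "waiting for the right thing" means to
me (a rough sketch only, with the watermark arguments picked
arbitrarily): instead of the blind

	congestion_wait(BLK_RW_ASYNC, HZ/50);

the slow path effectively becomes a wait on the zone's own watermark,
with the timeout kept only as a safety net:

	wait_event_timeout(zone->pressure_wq,
			   zone_watermark_ok(zone, order,
					     low_wmark_pages(zone), 0, 0),
			   HZ/50);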

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2010-03-03  6:51 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-07 14:39 Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set" Christian Ehrhardt
2009-12-07 15:09 ` Mel Gorman
2009-12-08 17:59   ` Christian Ehrhardt
2009-12-10 14:36     ` Christian Ehrhardt
2009-12-11 11:20       ` Mel Gorman
2009-12-11 14:47         ` Christian Ehrhardt
2009-12-18 13:38           ` Christian Ehrhardt
2009-12-18 17:42             ` Mel Gorman
2010-01-14 12:30               ` Christian Ehrhardt
2010-01-19 11:33                 ` Mel Gorman
2010-02-05 15:51                   ` Christian Ehrhardt
2010-02-05 17:49                     ` Mel Gorman
2010-02-08 14:01                       ` Christian Ehrhardt
2010-02-08 15:21                         ` Mel Gorman
2010-02-08 16:55                           ` Mel Gorman
2010-02-09  6:23                           ` Christian Ehrhardt
2010-02-09 15:52                           ` Christian Ehrhardt
2010-02-09 17:57                             ` Mel Gorman
2010-02-11 16:11                               ` Christian Ehrhardt
2010-02-12 10:05                                 ` Nick Piggin
2010-02-15  6:59                                   ` Nick Piggin
2010-02-15 15:46                                   ` Christian Ehrhardt
2010-02-16 11:25                                     ` Mel Gorman
2010-02-16 16:47                                       ` Christian Ehrhardt
2010-02-17  9:55                                         ` Christian Ehrhardt
2010-02-17 10:03                                           ` Christian Ehrhardt
2010-02-18 11:43                                           ` Mel Gorman
2010-02-18 16:09                                             ` Christian Ehrhardt
2010-02-19 11:19                                               ` Christian Ehrhardt
2010-02-19 15:19                                                 ` Mel Gorman
2010-02-22 15:42                                                   ` Christian Ehrhardt
2010-02-25 15:13                                                     ` Christian Ehrhardt
2010-02-26 11:18                                                       ` Nick Piggin
2010-03-02  6:52                                                   ` Nick Piggin
2010-03-02 10:04                                                     ` Mel Gorman
2010-03-02 10:36                                                       ` Nick Piggin
2010-03-02 11:01                                                         ` Mel Gorman
2010-03-02 11:18                                                           ` Nick Piggin
2010-03-02 11:24                                                             ` Mel Gorman
2010-03-03  6:51                                                               ` Christian Ehrhardt
2010-02-08 15:02                       ` Christian Ehrhardt
