linux-kernel.vger.kernel.org archive mirror
* kswapd0: excessive CPU usage
@ 2012-10-11  8:52 Jiri Slaby
  2012-10-11 13:44 ` Valdis.Kletnieks
  2012-10-11 22:14 ` kswapd0: excessive " Andrew Morton
  0 siblings, 2 replies; 52+ messages in thread
From: Jiri Slaby @ 2012-10-11  8:52 UTC (permalink / raw)
  To: linux-mm, LKML, Jiri Slaby

Hi,

with 3.6.0-next-20121008, kswapd0 is spinning my CPU at 100% for 1
minute or so. If I try to suspend to RAM, this trace appears:
kswapd0         R  running task        0   577      2 0x00000000
 0000000000000000 00000000000000c0 cccccccccccccccd ffff8801c4146800
 ffff8801c4b15c88 ffffffff8116ee05 0000000000003e32 ffff8801c3a79000
 ffff8801c4b15ca8 ffffffff8116fdf8 ffff8801c480f398 ffff8801c3a79000
Call Trace:
 [<ffffffff8116ee05>] ? put_super+0x25/0x40
 [<ffffffff8116fdd4>] ? grab_super_passive+0x24/0xa0
 [<ffffffff8116ff99>] ? prune_super+0x149/0x1b0
 [<ffffffff81131531>] ? shrink_slab+0xa1/0x2d0
 [<ffffffff8113452d>] ? kswapd+0x66d/0xb60
 [<ffffffff81133ec0>] ? try_to_free_pages+0x180/0x180
 [<ffffffff810a2770>] ? kthread+0xc0/0xd0
 [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130
 [<ffffffff816a6c9c>] ? ret_from_fork+0x7c/0x90
 [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130

# cat /proc/vmstat
nr_free_pages 239962
nr_inactive_anon 89825
nr_active_anon 711136
nr_inactive_file 60386
nr_active_file 46668
nr_unevictable 0
nr_mlock 0
nr_anon_pages 500678
nr_mapped 41319
nr_file_pages 319317
nr_dirty 45
nr_writeback 0
nr_slab_reclaimable 21909
nr_slab_unreclaimable 21598
nr_page_table_pages 12131
nr_kernel_stack 491
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1674280
nr_vmscan_immediate_reclaim 301662
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 212263
nr_dirtied 10620227
nr_written 9260939
nr_anon_transparent_hugepages 172
nr_free_cma 0
nr_dirty_threshold 31459
nr_dirty_background_threshold 15729
pgpgin 31311778
pgpgout 38987552
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_dma32 245169455
pgalloc_normal 279685864
pgalloc_movable 0
pgfree 537318727
pgactivate 13126755
pgdeactivate 2482953
pgfault 645947575
pgmajfault 193427
pgrefill_dma 0
pgrefill_dma32 1124272
pgrefill_normal 1998033
pgrefill_movable 0
pgsteal_kswapd_dma 0
pgsteal_kswapd_dma32 2531015
pgsteal_kswapd_normal 3403006
pgsteal_kswapd_movable 0
pgsteal_direct_dma 0
pgsteal_direct_dma32 362488
pgsteal_direct_normal 1134511
pgsteal_direct_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 2693620
pgscan_kswapd_normal 5836491
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 368374
pgscan_direct_normal 1658486
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 258410
slabs_scanned 86459392
kswapd_inodesteal 3907549
kswapd_low_wmark_hit_quickly 15408
kswapd_high_wmark_hit_quickly 23113
kswapd_skip_congestion_wait 10
pageoutrun 2165627235
allocstall 11256
pgrotated 219624
compact_blocks_moved 4862077
compact_pages_moved 1970005
compact_pagemigrate_failed 1726156
compact_stall 21275
compact_fail 6589
compact_success 14686
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2799
unevictable_pgs_scanned 0
unevictable_pgs_rescued 22563
unevictable_pgs_mlocked 22563
unevictable_pgs_munlocked 22563
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 18725
thp_fault_fallback 64868
thp_collapse_alloc 9216
thp_collapse_alloc_failed 2031
thp_split 2146
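
One back-of-the-envelope way to read the dump above (a hypothetical aside computed from the posted values, not part of the original report): kswapd's per-page reclaim efficiency looks healthy, but pageoutrun is enormous relative to the scan counters, which is what spinning without progress looks like.

```python
# Values copied from the /proc/vmstat dump above.
pgscan_kswapd = 2693620 + 5836491    # pgscan_kswapd_dma32 + _normal
pgsteal_kswapd = 2531015 + 3403006   # pgsteal_kswapd_dma32 + _normal
pageoutrun = 2165627235

# ~70% of the pages kswapd scanned were reclaimed -- healthy on its own...
efficiency = 100.0 * pgsteal_kswapd / pgscan_kswapd
print("kswapd reclaim efficiency: %.0f%%" % efficiency)

# ...yet kswapd ran its balance loop over 2 billion times against only
# ~8.5M pages scanned, i.e. hundreds of passes per scanned page.
passes_per_page = pageoutrun / pgscan_kswapd
print("balance passes per scanned page: %.0f" % passes_per_page)
```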

Any ideas what it could be?

-- 
js
suse labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-10-11  8:52 kswapd0: excessive CPU usage Jiri Slaby
@ 2012-10-11 13:44 ` Valdis.Kletnieks
  2012-10-11 15:34   ` Jiri Slaby
  2012-10-11 22:14 ` kswapd0: excessive " Andrew Morton
  1 sibling, 1 reply; 52+ messages in thread
From: Valdis.Kletnieks @ 2012-10-11 13:44 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: linux-mm, LKML, Jiri Slaby


On Thu, 11 Oct 2012 10:52:28 +0200, Jiri Slaby said:
> Hi,
>
> with 3.6.0-next-20121008, kswapd0 is spinning my CPU at 100% for 1
> minute or so.


>  [<ffffffff8116ee05>] ? put_super+0x25/0x40
>  [<ffffffff8116fdd4>] ? grab_super_passive+0x24/0xa0
>  [<ffffffff8116ff99>] ? prune_super+0x149/0x1b0
>  [<ffffffff81131531>] ? shrink_slab+0xa1/0x2d0
>  [<ffffffff8113452d>] ? kswapd+0x66d/0xb60
>  [<ffffffff81133ec0>] ? try_to_free_pages+0x180/0x180
>  [<ffffffff810a2770>] ? kthread+0xc0/0xd0
>  [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130
>  [<ffffffff816a6c9c>] ? ret_from_fork+0x7c/0x90
>  [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130

I don't know what it is; I haven't finished bisecting it yet. But I can
confirm that I started seeing the same problem 2 or 3 weeks ago. Note that
said call trace does *NOT* require a suspend: I don't do suspend on my
laptop, and I'm seeing kswapd burn CPU with similar traces.

# cat /proc/31/stack
[<ffffffff81110306>] grab_super_passive+0x44/0x76
[<ffffffff81110372>] prune_super+0x3a/0x13c
[<ffffffff810dc52a>] shrink_slab+0x95/0x301
[<ffffffff810defb7>] kswapd+0x5c8/0x902
[<ffffffff8104eea4>] kthread+0x9d/0xa5
[<ffffffff815ccfac>] ret_from_fork+0x7c/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/31/stack
[<ffffffff8110f5af>] put_super+0x29/0x2d
[<ffffffff8110f637>] drop_super+0x1b/0x20
[<ffffffff81110462>] prune_super+0x12a/0x13c
[<ffffffff810dc52a>] shrink_slab+0x95/0x301
[<ffffffff810defb7>] kswapd+0x5c8/0x902
[<ffffffff8104eea4>] kthread+0x9d/0xa5
[<ffffffff815ccfac>] ret_from_fork+0x7c/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

So at least we know we're not hallucinating. :)






* Re: kswapd0: excessive CPU usage
  2012-10-11 13:44 ` Valdis.Kletnieks
@ 2012-10-11 15:34   ` Jiri Slaby
  2012-10-11 17:56     ` Valdis.Kletnieks
  0 siblings, 1 reply; 52+ messages in thread
From: Jiri Slaby @ 2012-10-11 15:34 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-mm, LKML, Jiri Slaby

On 10/11/2012 03:44 PM, Valdis.Kletnieks@vt.edu wrote:
> So at least we know we're not hallucinating. :)

Just a thought: do you have RAID?

-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-10-11 15:34   ` Jiri Slaby
@ 2012-10-11 17:56     ` Valdis.Kletnieks
  2012-10-11 17:59       ` Jiri Slaby
  0 siblings, 1 reply; 52+ messages in thread
From: Valdis.Kletnieks @ 2012-10-11 17:56 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: linux-mm, LKML, Jiri Slaby


On Thu, 11 Oct 2012 17:34:24 +0200, Jiri Slaby said:
> On 10/11/2012 03:44 PM, Valdis.Kletnieks@vt.edu wrote:
> > So at least we know we're not hallucinating. :)
>
> Just a thought: do you have RAID?

Nope, just a 160G laptop spinning hard drive. Filesystems are
ext4 on LVM on a cryptoLUKS partition on /dev/sda2.



* Re: kswapd0: excessive CPU usage
  2012-10-11 17:56     ` Valdis.Kletnieks
@ 2012-10-11 17:59       ` Jiri Slaby
  2012-10-11 18:19         ` Valdis.Kletnieks
  0 siblings, 1 reply; 52+ messages in thread
From: Jiri Slaby @ 2012-10-11 17:59 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Jiri Slaby, linux-mm, LKML

On 10/11/2012 07:56 PM, Valdis.Kletnieks@vt.edu wrote:
> On Thu, 11 Oct 2012 17:34:24 +0200, Jiri Slaby said:
>> On 10/11/2012 03:44 PM, Valdis.Kletnieks@vt.edu wrote:
>>> So at least we know we're not hallucinating. :)
>> 
>> Just a thought: do you have RAID?
> 
> Nope, just a 160G laptop spinning hard drive. Filesystems are ext4
> on LVM on a cryptoLUKS partition on /dev/sda2.

OK, maybe it's compaction. Do you have CONFIG_COMPACTION=y?


-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-10-11 17:59       ` Jiri Slaby
@ 2012-10-11 18:19         ` Valdis.Kletnieks
  2012-10-11 22:08           ` kswapd0: excessive " Jiri Slaby
  0 siblings, 1 reply; 52+ messages in thread
From: Valdis.Kletnieks @ 2012-10-11 18:19 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Jiri Slaby, linux-mm, LKML


On Thu, 11 Oct 2012 19:59:33 +0200, Jiri Slaby said:
> On 10/11/2012 07:56 PM, Valdis.Kletnieks@vt.edu wrote:
> > On Thu, 11 Oct 2012 17:34:24 +0200, Jiri Slaby said:
> >> On 10/11/2012 03:44 PM, Valdis.Kletnieks@vt.edu wrote:
> >>> So at least we know we're not hallucinating. :)
> >>
> >> Just a thought: do you have RAID?
> >
> > Nope, just a 160G laptop spinning hard drive. Filesystems are ext4
> > on LVM on a cryptoLUKS partition on /dev/sda2.
>
> OK, maybe it's compaction. Do you have CONFIG_COMPACTION=y?

# zgrep COMPAC /proc/config.gz
CONFIG_COMPACTION=y

Hope that tells you something useful.





* Re: kswapd0: excessive CPU usage
  2012-10-11 18:19         ` Valdis.Kletnieks
@ 2012-10-11 22:08           ` Jiri Slaby
  2012-10-12 12:37             ` Jiri Slaby
  0 siblings, 1 reply; 52+ messages in thread
From: Jiri Slaby @ 2012-10-11 22:08 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Jiri Slaby, linux-mm, LKML

On 10/11/2012 08:19 PM, Valdis.Kletnieks@vt.edu wrote:
> # zgrep COMPAC /proc/config.gz
> CONFIG_COMPACTION=y
> 
> Hope that tells you something useful.

It just supports another theory of mine. This seems to fix it for me:
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1830,8 +1830,8 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
         */
        pages_for_compaction = (2UL << sc->order);

-       pages_for_compaction = scale_for_compaction(pages_for_compaction,
-                                                   lruvec, sc);
+/*     pages_for_compaction = scale_for_compaction(pages_for_compaction,
+                                                   lruvec, sc);*/
        inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
        if (nr_swap_pages > 0)
                inactive_lru_pages += get_lru_size(lruvec,
                                                   LRU_INACTIVE_ANON);

And for you?

(It's an effective revert of "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures".)
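
To see why that scaling blows up, here is a rough model (not kernel code; the cap comes from the kernel's COMPACT_MAX_DEFER_SHIFT of 6, and the numbers are purely illustrative) of how the reclaim target in should_continue_reclaim() grows as compaction keeps failing:

```python
# Illustrative model of the reclaim target, not the kernel implementation.
COMPACT_MAX_DEFER_SHIFT = 6  # kernel cap on compact_defer_shift

def pages_for_compaction(order, defer_shift, scaled=True):
    """Reclaim target for one lruvec; 'scaled' mimics the disputed patch."""
    target = 2 << order                      # 2UL << sc->order
    if scaled:
        target <<= min(defer_shift, COMPACT_MAX_DEFER_SHIFT)
    return target

# For a THP allocation (order 9) the target grows from 1024 pages with no
# failures to 65536 pages once compaction has been deferred repeatedly,
# so kswapd keeps scanning long after it should have gone back to sleep.
for shift in (0, 2, 4, 6):
    print(shift, pages_for_compaction(9, shift))
```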

regards,
-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-10-11  8:52 kswapd0: excessive CPU usage Jiri Slaby
  2012-10-11 13:44 ` Valdis.Kletnieks
@ 2012-10-11 22:14 ` Andrew Morton
  2012-10-11 22:26   ` Jiri Slaby
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2012-10-11 22:14 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: linux-mm, LKML, Jiri Slaby

On Thu, 11 Oct 2012 10:52:28 +0200
Jiri Slaby <jslaby@suse.cz> wrote:

> with 3.6.0-next-20121008, kswapd0 is spinning my CPU at 100% for 1
> minute or so. If I try to suspend to RAM, this trace appears:
> kswapd0         R  running task        0   577      2 0x00000000
>  0000000000000000 00000000000000c0 cccccccccccccccd ffff8801c4146800
>  ffff8801c4b15c88 ffffffff8116ee05 0000000000003e32 ffff8801c3a79000
>  ffff8801c4b15ca8 ffffffff8116fdf8 ffff8801c480f398 ffff8801c3a79000
> Call Trace:
>  [<ffffffff8116ee05>] ? put_super+0x25/0x40
>  [<ffffffff8116fdd4>] ? grab_super_passive+0x24/0xa0
>  [<ffffffff8116ff99>] ? prune_super+0x149/0x1b0
>  [<ffffffff81131531>] ? shrink_slab+0xa1/0x2d0
>  [<ffffffff8113452d>] ? kswapd+0x66d/0xb60
>  [<ffffffff81133ec0>] ? try_to_free_pages+0x180/0x180
>  [<ffffffff810a2770>] ? kthread+0xc0/0xd0
>  [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130
>  [<ffffffff816a6c9c>] ? ret_from_fork+0x7c/0x90
>  [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130

Could you please do a sysrq-T a few times while it's spinning, to
confirm that this trace is consistently the culprit?



* Re: kswapd0: excessive CPU usage
  2012-10-11 22:14 ` kswapd0: excessive " Andrew Morton
@ 2012-10-11 22:26   ` Jiri Slaby
  0 siblings, 0 replies; 52+ messages in thread
From: Jiri Slaby @ 2012-10-11 22:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jiri Slaby, linux-mm, LKML

On 10/12/2012 12:14 AM, Andrew Morton wrote:
> Could you please do a sysrq-T a few times while it's spinning, to
> confirm that this trace is consistently the culprit?

For me, yes: shrink_slab is in most of the traces.

-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-10-11 22:08           ` kswapd0: excessive " Jiri Slaby
@ 2012-10-12 12:37             ` Jiri Slaby
  2012-10-12 13:57               ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Jiri Slaby @ 2012-10-12 12:37 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Jiri Slaby, linux-mm, LKML, Mel Gorman, Andrew Morton

On 10/12/2012 12:08 AM, Jiri Slaby wrote:
> (It's an effective revert of "mm: vmscan: scale number of pages
> reclaimed by reclaim/compaction based on failures".)

Given kswapd had hours of runtime in ps/top output yesterday in the
morning and after the revert it's now 2 minutes in sum for the last 24h,
I would say, it's gone.

Mel, you wrote me that it's unlikely to be this patch, but not impossible
in the end. Can you take a look, please? If you need some trace-cmd output
or anything, just let us know.

This is x86_64, 6G of RAM, no swap. FWIW EXT4, SLUB, COMPACTION all
enabled/used.

thanks,
-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-10-12 12:37             ` Jiri Slaby
@ 2012-10-12 13:57               ` Mel Gorman
  2012-10-15  9:54                 ` Jiri Slaby
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2012-10-12 13:57 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

On Fri, Oct 12, 2012 at 02:37:58PM +0200, Jiri Slaby wrote:
> On 10/12/2012 12:08 AM, Jiri Slaby wrote:
> > (It's an effective revert of "mm: vmscan: scale number of pages
> > reclaimed by reclaim/compaction based on failures".)
> 
> Given kswapd had hours of runtime in ps/top output yesterday in the
> morning and after the revert it's now 2 minutes in sum for the last 24h,
> I would say, it's gone.
> 
> Mel, you wrote me it's unlikely the patch, but not impossible in the
> end. Can you take a look, please? If you need some trace-cmd output or
> anything, just let us know.
> 
> This is x86_64, 6G of RAM, no swap. FWIW EXT4, SLUB, COMPACTION all
> enabled/used.
> 

Can you monitor the behaviour of this patch please? Please keep a particular
eye on kswapd activity and the amount of free memory. If free memory is
spiking it might indicate that kswapd is still too aggressive with the loss
of the __GFP_NO_KSWAPD flag. One way to tell is to record /proc/vmstat over
time and see what the pgsteal_* figures look like. If they are climbing
aggressively during what should be normal usage then it might show that
kswapd is still too aggressive when asked to reclaim for THP.
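
A minimal way to record those counters over time, along the lines suggested above (a sketch: the sampling interval, sample count, and delta-only printing are my assumptions, not anything specified here):

```python
import time

def read_pgsteal(path="/proc/vmstat"):
    """Parse the pgsteal_* counters out of a vmstat-format file."""
    counters = {}
    with open(path) as f:
        for line in f:
            name, value = line.split()
            if name.startswith("pgsteal_"):
                counters[name] = int(value)
    return counters

def watch(interval=10, samples=6, path="/proc/vmstat"):
    """Print per-interval pgsteal_* deltas; numbers climbing steadily
    during normal desktop use would support the theory that kswapd is
    still too aggressive."""
    prev = read_pgsteal(path)
    for _ in range(samples):
        time.sleep(interval)
        cur = read_pgsteal(path)
        print({k: cur[k] - prev[k] for k in cur if cur[k] != prev[k]})
        prev = cur

# watch() would then be left running while reproducing the workload.
```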

Thanks very much.

---8<---
mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim

Jiri Slaby reported the following:

	(It's an effective revert of "mm: vmscan: scale number of pages
	reclaimed by reclaim/compaction based on failures".)
	Given kswapd had hours of runtime in ps/top output yesterday in the
	morning and after the revert it's now 2 minutes in sum for the last 24h,
	I would say, it's gone.

The intention of the patch in question was to compensate for the loss of
lumpy reclaim. Part of the reason lumpy reclaim worked is that it
aggressively reclaimed pages, and this patch was meant to be a
sane compromise.

When compaction fails, it gets deferred, and both compaction and
reclaim/compaction are deferred to avoid excessive reclaim. However, since
commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up each time
and continues reclaiming, which was not taken into account when the patch
was developed.

As this path does not take deferred compaction into account, it scans
aggressively before falling out and making the compaction_deferred check in
compaction_ready(). This patch avoids kswapd scaling pages for reclaim and
leaves the aggressive reclaim to the process attempting the THP
allocation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2624edc..2b7edfa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
 #ifdef CONFIG_COMPACTION
 /*
  * If compaction is deferred for sc->order then scale the number of pages
- * reclaimed based on the number of consecutive allocation failures
+ * reclaimed based on the number of consecutive allocation failures. This
+ * scaling only happens for direct reclaim as it is about to attempt
+ * compaction. If compaction fails, future allocations will be deferred
+ * and reclaim avoided. On the other hand, kswapd does not take compaction
+ * deferral into account so if it scaled, it could scan excessively even
+ * though allocations are temporarily not being attempted.
  */
 static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
 			struct lruvec *lruvec, struct scan_control *sc)
 {
 	struct zone *zone = lruvec_zone(lruvec);
 
-	if (zone->compact_order_failed <= sc->order)
+	if (zone->compact_order_failed <= sc->order &&
+	    !current_is_kswapd())
 		pages_for_compaction <<= zone->compact_defer_shift;
 	return pages_for_compaction;
 }


* Re: kswapd0: excessive CPU usage
  2012-10-12 13:57               ` Mel Gorman
@ 2012-10-15  9:54                 ` Jiri Slaby
  2012-10-15 11:09                   ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Jiri Slaby @ 2012-10-15  9:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

On 10/12/2012 03:57 PM, Mel Gorman wrote:
> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
> 
> Jiri Slaby reported the following:
> 
> 	(It's an effective revert of "mm: vmscan: scale number of pages
> 	reclaimed by reclaim/compaction based on failures".)
> 	Given kswapd had hours of runtime in ps/top output yesterday in the
> 	morning and after the revert it's now 2 minutes in sum for the last 24h,
> 	I would say, it's gone.
> 
> The intention of the patch in question was to compensate for the loss of
> lumpy reclaim. Part of the reason lumpy reclaim worked is that it
> aggressively reclaimed pages, and this patch was meant to be a
> sane compromise.
> 
> When compaction fails, it gets deferred, and both compaction and
> reclaim/compaction are deferred to avoid excessive reclaim. However, since
> commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up each time
> and continues reclaiming, which was not taken into account when the patch
> was developed.
> 
> As this path does not take deferred compaction into account, it scans
> aggressively before falling out and making the compaction_deferred check in
> compaction_ready(). This patch avoids kswapd scaling pages for reclaim and
> leaves the aggressive reclaim to the process attempting the THP
> allocation.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2624edc..2b7edfa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>  #ifdef CONFIG_COMPACTION
>  /*
>   * If compaction is deferred for sc->order then scale the number of pages
> - * reclaimed based on the number of consecutive allocation failures
> + * reclaimed based on the number of consecutive allocation failures. This
> + * scaling only happens for direct reclaim as it is about to attempt
> + * compaction. If compaction fails, future allocations will be deferred
> + * and reclaim avoided. On the other hand, kswapd does not take compaction
> + * deferral into account so if it scaled, it could scan excessively even
> + * though allocations are temporarily not being attempted.
>   */
>  static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
>  			struct lruvec *lruvec, struct scan_control *sc)
>  {
>  	struct zone *zone = lruvec_zone(lruvec);
>  
> -	if (zone->compact_order_failed <= sc->order)
> +	if (zone->compact_order_failed <= sc->order &&
> +	    !current_is_kswapd())
>  		pages_for_compaction <<= zone->compact_defer_shift;
>  	return pages_for_compaction;
>  }

Yes, applying this instead of the revert fixes the issue as well.

thanks,
-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-10-15  9:54                 ` Jiri Slaby
@ 2012-10-15 11:09                   ` Mel Gorman
  2012-10-29 10:52                     ` Thorsten Leemhuis
  2012-11-02 10:44                     ` Zdenek Kabelac
  0 siblings, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2012-10-15 11:09 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
> On 10/12/2012 03:57 PM, Mel Gorman wrote:
> > mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
> > 
> > Jiri Slaby reported the following:
> > 
> > 	(It's an effective revert of "mm: vmscan: scale number of pages
> > 	reclaimed by reclaim/compaction based on failures".)
> > 	Given kswapd had hours of runtime in ps/top output yesterday in the
> > 	morning and after the revert it's now 2 minutes in sum for the last 24h,
> > 	I would say, it's gone.
> > 
> > The intention of the patch in question was to compensate for the loss of
> > lumpy reclaim. Part of the reason lumpy reclaim worked is that it
> > aggressively reclaimed pages, and this patch was meant to be a
> > sane compromise.
> > 
> > When compaction fails, it gets deferred, and both compaction and
> > reclaim/compaction are deferred to avoid excessive reclaim. However, since
> > commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up each time
> > and continues reclaiming, which was not taken into account when the patch
> > was developed.
> > 
> > As this path does not take deferred compaction into account, it scans
> > aggressively before falling out and making the compaction_deferred check in
> > compaction_ready(). This patch avoids kswapd scaling pages for reclaim and
> > leaves the aggressive reclaim to the process attempting the THP
> > allocation.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |   10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2624edc..2b7edfa 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> >  #ifdef CONFIG_COMPACTION
> >  /*
> >   * If compaction is deferred for sc->order then scale the number of pages
> > - * reclaimed based on the number of consecutive allocation failures
> > + * reclaimed based on the number of consecutive allocation failures. This
> > + * scaling only happens for direct reclaim as it is about to attempt
> > + * compaction. If compaction fails, future allocations will be deferred
> > + * and reclaim avoided. On the other hand, kswapd does not take compaction
> > + * deferral into account so if it scaled, it could scan excessively even
> > + * though allocations are temporarily not being attempted.
> >   */
> >  static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> >  			struct lruvec *lruvec, struct scan_control *sc)
> >  {
> >  	struct zone *zone = lruvec_zone(lruvec);
> >  
> > -	if (zone->compact_order_failed <= sc->order)
> > +	if (zone->compact_order_failed <= sc->order &&
> > +	    !current_is_kswapd())
> >  		pages_for_compaction <<= zone->compact_defer_shift;
> >  	return pages_for_compaction;
> >  }
> 
> Yes, applying this instead of the revert fixes the issue as well.
> 

Thanks Jiri.

-- 
Mel Gorman
SUSE Labs


* Re: kswapd0: excessive CPU usage
  2012-10-15 11:09                   ` Mel Gorman
@ 2012-10-29 10:52                     ` Thorsten Leemhuis
  2012-10-30 19:18                       ` Mel Gorman
  2012-11-02 10:44                     ` Zdenek Kabelac
  1 sibling, 1 reply; 52+ messages in thread
From: Thorsten Leemhuis @ 2012-10-29 10:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jiri Slaby, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

Hi!

On 15.10.2012 13:09, Mel Gorman wrote:
> On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
>> On 10/12/2012 03:57 PM, Mel Gorman wrote:
>>> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
>>> Jiri Slaby reported the following:
 > [...]
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 2624edc..2b7edfa 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>>>   #ifdef CONFIG_COMPACTION
>>>   /*
>>>    * If compaction is deferred for sc->order then scale the number of pages
>>> - * reclaimed based on the number of consecutive allocation failures
>>> + * reclaimed based on the number of consecutive allocation failures. This
>>> + * scaling only happens for direct reclaim as it is about to attempt
>>> + * compaction. If compaction fails, future allocations will be deferred
>>> + * and reclaim avoided. On the other hand, kswapd does not take compaction
>>> + * deferral into account so if it scaled, it could scan excessively even
>>> + * though allocations are temporarily not being attempted.
>>>    */
>>>   static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
>>>   			struct lruvec *lruvec, struct scan_control *sc)
>>>   {
>>>   	struct zone *zone = lruvec_zone(lruvec);
>>>
>>> -	if (zone->compact_order_failed <= sc->order)
>>> +	if (zone->compact_order_failed <= sc->order &&
>>> +	    !current_is_kswapd())
>>>   		pages_for_compaction <<= zone->compact_defer_shift;
>>>   	return pages_for_compaction;
>>>   }
>> Yes, applying this instead of the revert fixes the issue as well.

Just wondering, is there a reason why this patch wasn't applied to 
mainline? Did it simply fall through the cracks? Or am I missing something?

I'm asking because I think I still see the issue on 
3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are hitting 
it, too:
https://bugzilla.redhat.com/show_bug.cgi?id=866988

Or are we seeing something different which just looks similar? I can 
test the patch if it needs further testing, but from the discussion I 
got the impression that everything is clear and the patch ready for merging.

CU
  knurd


* Re: kswapd0: excessive CPU usage
  2012-10-29 10:52                     ` Thorsten Leemhuis
@ 2012-10-30 19:18                       ` Mel Gorman
  2012-10-31 11:25                         ` Thorsten Leemhuis
  2012-11-04 16:36                         ` Rik van Riel
  0 siblings, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2012-10-30 19:18 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Jiri Slaby, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

On Mon, Oct 29, 2012 at 11:52:03AM +0100, Thorsten Leemhuis wrote:
> Hi!
> 
> On 15.10.2012 13:09, Mel Gorman wrote:
> >On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
> >>On 10/12/2012 03:57 PM, Mel Gorman wrote:
> >>>mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
> >>>Jiri Slaby reported the following:
> > [...]
> >>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>index 2624edc..2b7edfa 100644
> >>>--- a/mm/vmscan.c
> >>>+++ b/mm/vmscan.c
> >>>@@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> >>>  #ifdef CONFIG_COMPACTION
> >>>  /*
> >>>   * If compaction is deferred for sc->order then scale the number of pages
> >>>- * reclaimed based on the number of consecutive allocation failures
> >>>+ * reclaimed based on the number of consecutive allocation failures. This
> >>>+ * scaling only happens for direct reclaim as it is about to attempt
> >>>+ * compaction. If compaction fails, future allocations will be deferred
> >>>+ * and reclaim avoided. On the other hand, kswapd does not take compaction
> >>>+ * deferral into account so if it scaled, it could scan excessively even
> >>>+ * though allocations are temporarily not being attempted.
> >>>   */
> >>>  static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> >>>  			struct lruvec *lruvec, struct scan_control *sc)
> >>>  {
> >>>  	struct zone *zone = lruvec_zone(lruvec);
> >>>
> >>>-	if (zone->compact_order_failed <= sc->order)
> >>>+	if (zone->compact_order_failed <= sc->order &&
> >>>+	    !current_is_kswapd())
> >>>  		pages_for_compaction <<= zone->compact_defer_shift;
> >>>  	return pages_for_compaction;
> >>>  }
> >>Yes, applying this instead of the revert fixes the issue as well.
> 
> Just wondering, is there a reason why this patch wasn't applied to
> mainline? Did it simply fall through the cracks? Or am I missing
> something?
> 

It's because a problem was reported related to the patch (off-list,
whoops). I'm waiting to hear if a second patch fixes the problem or not.

> I'm asking because I think I still see the issue on
> 3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are
> hitting it, too:
> https://bugzilla.redhat.com/show_bug.cgi?id=866988
> 

I like the steps to reproduce. Is step 3 profit?

> Or are we seeing something different which just looks similar?  I can
> test the patch if it needs further testing, but from the discussion
> I got the impression that everything is clear and the patch ready
> for merging.

It could be the same issue. Can you test with the "mm: vmscan: scale
number of pages reclaimed by reclaim/compaction only in direct reclaim"
patch and the following on top please?

Thanks.

---8<---
mm: page_alloc: Do not wake kswapd if the request is for THP but deferred

Since commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd gets woken
for every THP request in the slow path. If compaction has been deferred,
the waker will not compact or enter direct reclaim on its own behalf,
but kswapd is still woken to reclaim free pages that no one may consume.
If compaction was deferred because pages and slab were not reclaimable,
then kswapd is just consuming cycles for no gain.

This patch avoids waking kswapd if compaction has been deferred. kswapd
will still be woken when compaction is running, to reduce the latency of
THP allocations.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..e72674c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2378,6 +2378,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+/* Returns true if the allocation is likely for THP */
+static bool is_thp_alloc(gfp_t gfp_mask, unsigned int order)
+{
+	if (order == pageblock_order &&
+	    (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+		return true;
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2416,7 +2425,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx,
+	/*
+	 * kswapd is woken except when this is a THP request and compaction
+	 * is deferred. If we are backing off reclaim/compaction then kswapd
+	 * should not be awake aggressively reclaiming with no consumers of
+	 * the freed pages
+	 */
+	if (!(is_thp_alloc(gfp_mask, order) &&
+	      compaction_deferred(preferred_zone, order)))
+		wake_all_kswapd(order, zonelist, high_zoneidx,
 					zone_idx(preferred_zone));
 
 	/*
@@ -2494,7 +2511,7 @@ rebalance:
 	 * system then fail the allocation instead of entering direct reclaim.
 	 */
 	if ((deferred_compaction || contended_compaction) &&
-	    (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+	    is_thp_alloc(gfp_mask, order))
 		goto nopage;
 
 	/* Try direct reclaim and then allocating */

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-10-30 19:18                       ` Mel Gorman
@ 2012-10-31 11:25                         ` Thorsten Leemhuis
  2012-10-31 15:04                           ` Mel Gorman
  2012-11-04 16:36                         ` Rik van Riel
  1 sibling, 1 reply; 52+ messages in thread
From: Thorsten Leemhuis @ 2012-10-31 11:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jiri Slaby, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

On 30.10.2012 20:18, Mel Gorman wrote:
> On Mon, Oct 29, 2012 at 11:52:03AM +0100, Thorsten Leemhuis wrote:
>> On 15.10.2012 13:09, Mel Gorman wrote:
>>> On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
>>>> On 10/12/2012 03:57 PM, Mel Gorman wrote:
>>>>> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
>>>>> Jiri Slaby reported the following:
> [...]
>>>> Yes, applying this instead of the revert fixes the issue as well.
>> Just wondering, is there a reason why this patch wasn't applied to
>> mainline? Did it simply fall through the cracks? Or am I missing
>> something?
> It's because a problem was reported related to the patch (off-list,
> whoops). I'm waiting to hear if a second patch fixes the problem or not.

Anything in particular I should look out for while testing?

>> I'm asking because I think I still see the issue on
>> 3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are
>> hitting it, too:
>> https://bugzilla.redhat.com/show_bug.cgi?id=866988
> I like the steps to reproduce.

One of those cases where the bugzilla bug template was not very helpful 
or where it was not used as intended (you decide) :-)

> Is step 3 profit?

Yes, but psst, don't tell anyone; step 4 (world domination! for real!) 
is also hidden to keep that part of the big plan a secret for now ;-)

>> Or are we seeing something different which just looks similar?  I can
>> test the patch if it needs further testing, but from the discussion
>> I got the impression that everything is clear and the patch ready
>> for merging.
> It could be the same issue. Can you test with the "mm: vmscan: scale
> number of pages reclaimed by reclaim/compaction only in direct reclaim"
> patch and the following on top please?

Built a vanilla mainline kernel with those two patches and installed it 
on the machine where I was seeing problems with high kswapd0 load on 
3.7-rc3. Ran it for an hour yesterday and a few hours today; seems the 
patches fix the issue for me, as kswapd behaves:

$ LC_ALL=C ps -aux | grep 'kswapd'
root       62  0.0  0.0      0     0 ?      S    Oct30   0:05 [kswapd0]

So everything is looking fine again so far thx to the two patches  -- 
hopefully it stays that way even after hitting "send" in my mailer in a 
few seconds.

CU
knurd


* Re: kswapd0: excessive CPU usage
  2012-10-31 11:25                         ` Thorsten Leemhuis
@ 2012-10-31 15:04                           ` Mel Gorman
  0 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2012-10-31 15:04 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Jiri Slaby, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

On Wed, Oct 31, 2012 at 12:25:13PM +0100, Thorsten Leemhuis wrote:
> On 30.10.2012 20:18, Mel Gorman wrote:
> >On Mon, Oct 29, 2012 at 11:52:03AM +0100, Thorsten Leemhuis wrote:
> >>On 15.10.2012 13:09, Mel Gorman wrote:
> >>>On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
> >>>>On 10/12/2012 03:57 PM, Mel Gorman wrote:
> >>>>>mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
> >>>>>Jiri Slaby reported the following:
> >[...]
> >>>>Yes, applying this instead of the revert fixes the issue as well.
> >>Just wondering, is there a reason why this patch wasn't applied to
> >>mainline? Did it simply fall through the cracks? Or am I missing
> >>something?
> >It's because a problem was reported related to the patch (off-list,
> >whoops). I'm waiting to hear if a second patch fixes the problem or not.
> 
> Anything in particular I should look out for while testing?
> 

Excessive reclaim, high CPU usage by kswapd, processes getting stuck in
isolate_migratepages or isolate_freepages.

> >>I'm asking because I think I still see the issue on
> >>3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are
> >>hitting it, too:
> >>https://bugzilla.redhat.com/show_bug.cgi?id=866988
> >I like the steps to reproduce.
> 
> One of those cases where the bugzilla bug template was not very
> helpful or where it was not used as intended (you decide) :-)
> 

It wins at entertainment value if nothing else :)

> >Is step 3 profit?
> 
> Yes, but psst, don't tell anyone; step 4 (world domination! for
> real!) is also hidden to keep that part of the big plan a secret for
> now ;-)
> 

No doubt it's the default private comment #1 !

> >>Or are we seeing something different which just looks similar?  I can
> >>test the patch if it needs further testing, but from the discussion
> >>I got the impression that everything is clear and the patch ready
> >>for merging.
> >It could be the same issue. Can you test with the "mm: vmscan: scale
> >number of pages reclaimed by reclaim/compaction only in direct reclaim"
> >patch and the following on top please?
> 
> Built a vanilla mainline kernel with those two patches and installed
> it on the machine where I was seeing problems with high kswapd0 load on
> 3.7-rc3. Ran it for an hour yesterday and a few hours today; seems the
> patches fix the issue for me as kswapd behaves:
> 
> $ LC_ALL=C ps -aux | grep 'kswapd'
> root       62  0.0  0.0      0     0 ?      S    Oct30   0:05 [kswapd0]
> 
> So everything is looking fine again so far thx to the two patches
> -- hopefully it stays that way even after hitting "send" in my
> mailer in a few seconds.
> 

Ok, great. Keep an eye on it please. If Jiri Slaby reports similar
success then I'll collapse the two patches together and resend to
Andrew.

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: kswapd0: excessive CPU usage
  2012-10-15 11:09                   ` Mel Gorman
  2012-10-29 10:52                     ` Thorsten Leemhuis
@ 2012-11-02 10:44                     ` Zdenek Kabelac
  2012-11-02 10:53                       ` Jiri Slaby
  1 sibling, 1 reply; 52+ messages in thread
From: Zdenek Kabelac @ 2012-11-02 10:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jiri Slaby, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

Dne 15.10.2012 13:09, Mel Gorman napsal(a):
> On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
>> On 10/12/2012 03:57 PM, Mel Gorman wrote:
>>> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
>>>
>>> Jiri Slaby reported the following:
>>>
>>> 	(It's an effective revert of "mm: vmscan: scale number of pages
>>> 	reclaimed by reclaim/compaction based on failures".)
>>> 	Given kswapd had hours of runtime in ps/top output yesterday in the
>>> 	morning and after the revert it's now 2 minutes in sum for the last 24h,
>>> 	I would say, it's gone.
>>>
>>> The intention of the patch in question was to compensate for the loss of
>>> lumpy reclaim. Part of the reason lumpy reclaim worked is because it
>>> aggressively reclaimed pages and this patch was meant to be a
>>> sane compromise.
>>>
>>> When compaction fails, it gets deferred and both compaction and
>>> reclaim/compaction are deferred to avoid excessive reclaim. However, since
>>> commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
>>> and continues reclaiming which was not taken into account when the patch
>>> was developed.
>>>
>>> As it is not taking deferred compaction into account in this path it scans
>>> aggressively before falling out and making the compaction_deferred check in
>>> compaction_ready. This patch avoids kswapd scaling pages for reclaim and
>>> leaves the aggressive reclaim to the process attempting the THP
>>> allocation.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>> ---
>>>   mm/vmscan.c |   10 ++++++++--
>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 2624edc..2b7edfa 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>>>   #ifdef CONFIG_COMPACTION
>>>   /*
>>>    * If compaction is deferred for sc->order then scale the number of pages
>>> - * reclaimed based on the number of consecutive allocation failures
>>> + * reclaimed based on the number of consecutive allocation failures. This
>>> + * scaling only happens for direct reclaim as it is about to attempt
>>> + * compaction. If compaction fails, future allocations will be deferred
>>> + * and reclaim avoided. On the other hand, kswapd does not take compaction
>>> + * deferral into account so if it scaled, it could scan excessively even
>>> + * though allocations are temporarily not being attempted.
>>>    */
>>>   static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
>>>   			struct lruvec *lruvec, struct scan_control *sc)
>>>   {
>>>   	struct zone *zone = lruvec_zone(lruvec);
>>>
>>> -	if (zone->compact_order_failed <= sc->order)
>>> +	if (zone->compact_order_failed <= sc->order &&
>>> +	    !current_is_kswapd())
>>>   		pages_for_compaction <<= zone->compact_defer_shift;
>>>   	return pages_for_compaction;
>>>   }
>>
>> Yes, applying this instead of the revert fixes the issue as well.
>>
>


I've applied this patch on a 3.7.0-rc3 kernel - and I still see excessive CPU 
usage, mainly after suspend/resume.

Here is just a simple kswapd backtrace from the running kernel:

kswapd0         R  running task        0    30      2 0x00000000
  ffff8801331ddae8 0000000000000082 ffff880135b8a340 0000000000000008
  ffff880135b8a340 ffff8801331ddfd8 ffff8801331ddfd8 ffff8801331ddfd8
  ffff880071db8000 ffff880135b8a340 0000000000000286 ffff8801331dc000
Call Trace:
  [<ffffffff81555cd2>] preempt_schedule+0x42/0x60
  [<ffffffff81557b75>] _raw_spin_unlock+0x55/0x60
  [<ffffffff811929d1>] put_super+0x31/0x40
  [<ffffffff81192aa2>] drop_super+0x22/0x30
  [<ffffffff81193be9>] prune_super+0x149/0x1b0
  [<ffffffff81141e2a>] shrink_slab+0xba/0x510
  [<ffffffff81185baa>] ? mem_cgroup_iter+0x17a/0x2e0
  [<ffffffff81185afa>] ? mem_cgroup_iter+0xca/0x2e0
  [<ffffffff811450f9>] balance_pgdat+0x629/0x7f0
  [<ffffffff81145434>] kswapd+0x174/0x620
  [<ffffffff8106fd20>] ? __init_waitqueue_head+0x60/0x60
  [<ffffffff811452c0>] ? balance_pgdat+0x7f0/0x7f0
  [<ffffffff8106f50b>] kthread+0xdb/0xe0
  [<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140
  [<ffffffff8155fb1c>] ret_from_fork+0x7c/0xb0
  [<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140


Zdenek




* Re: kswapd0: excessive CPU usage
  2012-11-02 10:44                     ` Zdenek Kabelac
@ 2012-11-02 10:53                       ` Jiri Slaby
  2012-11-02 19:45                         ` Jiri Slaby
  0 siblings, 1 reply; 52+ messages in thread
From: Jiri Slaby @ 2012-11-02 10:53 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Mel Gorman, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>> Yes, applying this instead of the revert fixes the issue as well.
> 
> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
> CPU usage - mainly  after  suspend/resume
> 
> Here is just simple  kswapd backtrace from running kernel:

Yup, this is what we were seeing with only the former patch applied, too.
Try applying the other one as well:
https://patchwork.kernel.org/patch/1673231/

For me I would say, it is fixed by the two patches now. I won't be able
to report later, since I'm leaving for a conference tomorrow.

> kswapd0         R  running task        0    30      2 0x00000000
...
>  [<ffffffff81141e2a>] shrink_slab+0xba/0x510

thanks,
-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-11-02 10:53                       ` Jiri Slaby
@ 2012-11-02 19:45                         ` Jiri Slaby
  2012-11-04 11:26                           ` Zdenek Kabelac
                                             ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Jiri Slaby @ 2012-11-02 19:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML,
	Andrew Morton

On 11/02/2012 11:53 AM, Jiri Slaby wrote:
> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>> Yes, applying this instead of the revert fixes the issue as well.
>>
>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>> CPU usage - mainly  after  suspend/resume
>>
>> Here is just simple  kswapd backtrace from running kernel:
> 
> Yup, this is what we were seeing with the former patch only too. Try to
> apply the other one too:
> https://patchwork.kernel.org/patch/1673231/
> 
> For me I would say, it is fixed by the two patches now. I won't be able
> to report later, since I'm leaving to a conference tomorrow.

Damn it. It recurred just now, with both patches applied, after I
started a Java program which consumed some more memory. Though there are
still 2 gigs free, kswapd is spinning:
[<ffffffff810b00da>] __cond_resched+0x2a/0x40
[<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
[<ffffffff8113478d>] kswapd+0x66d/0xb60
[<ffffffff810a25d0>] kthread+0xc0/0xd0
[<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

-- 
js
suse labs


* Re: kswapd0: excessive CPU usage
  2012-11-02 19:45                         ` Jiri Slaby
@ 2012-11-04 11:26                           ` Zdenek Kabelac
  2012-11-05 14:24                           ` [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" Mel Gorman
  2012-11-09  4:22                           ` kswapd0: excessive CPU usage Seth Jennings
  2 siblings, 0 replies; 52+ messages in thread
From: Zdenek Kabelac @ 2012-11-04 11:26 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Mel Gorman, Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton

Dne 2.11.2012 20:45, Jiri Slaby napsal(a):
> On 11/02/2012 11:53 AM, Jiri Slaby wrote:
>> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>>> Yes, applying this instead of the revert fixes the issue as well.
>>>
>>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>>> CPU usage - mainly  after  suspend/resume
>>>
>>> Here is just simple  kswapd backtrace from running kernel:
>>
>> Yup, this is what we were seeing with the former patch only too. Try to
>> apply the other one too:
>> https://patchwork.kernel.org/patch/1673231/
>>
>> For me I would say, it is fixed by the two patches now. I won't be able
>> to report later, since I'm leaving to a conference tomorrow.
>
> Damn it. It recurred right now, with both patches applied. After I
> started a java program which consumed some more memory. Though there are
> still 2 gigs free, kswap is spinning:
> [<ffffffff810b00da>] __cond_resched+0x2a/0x40
> [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
> [<ffffffff8113478d>] kswapd+0x66d/0xb60
> [<ffffffff810a25d0>] kthread+0xc0/0xd0
> [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>

Yep - I wanted to report again myself and then noticed your reply.

Yes - I also have both patches installed now - and I still observe kswapd 
eating my CPU. It seems (at least for me) that a prior suspend and resume 
is a way to trigger it more frequently.

However there is a change in behaviour - while before kswapd was running 
almost indefinitely, now the CPU spikes are in the range of minutes
(i.e. with ~2 days of uptime, kswapd has over 32 minutes of CPU time).
My machine has 4GB, and no swap (disabled)

firefox (22 mins), thunderbird (3 mins) and pidgin (0.5 min) are the 3 most 
memory- and CPU-hungry apps at the moment.

Zdenek




* Re: kswapd0: excessive CPU usage
  2012-10-30 19:18                       ` Mel Gorman
  2012-10-31 11:25                         ` Thorsten Leemhuis
@ 2012-11-04 16:36                         ` Rik van Riel
  1 sibling, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2012-11-04 16:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Thorsten Leemhuis, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton

On 10/30/2012 03:18 PM, Mel Gorman wrote:

>   restart:
> -	wake_all_kswapd(order, zonelist, high_zoneidx,
> +	/*
> +	 * kswapd is woken except when this is a THP request and compaction
> +	 * is deferred. If we are backing off reclaim/compaction then kswapd
> +	 * should not be awake aggressively reclaiming with no consumers of
> +	 * the freed pages
> +	 */
> +	if (!(is_thp_alloc(gfp_mask, order) &&
> +	      compaction_deferred(preferred_zone, order)))
> +		wake_all_kswapd(order, zonelist, high_zoneidx,
>   					zone_idx(preferred_zone));

What is special about thp allocations here?

Surely other large allocations that keep failing
should get the same treatment, of not waking up
kswapd if compaction is deferred?



* [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  2012-11-02 19:45                         ` Jiri Slaby
  2012-11-04 11:26                           ` Zdenek Kabelac
@ 2012-11-05 14:24                           ` Mel Gorman
  2012-11-06 10:15                             ` Johannes Hirte
  2012-11-09  9:12                             ` Mel Gorman
  2012-11-09  4:22                           ` kswapd0: excessive CPU usage Seth Jennings
  2 siblings, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2012-11-05 14:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby, linux-mm,
	Rik van Riel, Jiri Slaby, LKML

Jiri Slaby reported the following:

	(It's an effective revert of "mm: vmscan: scale number of pages
	reclaimed by reclaim/compaction based on failures".) Given kswapd
	had hours of runtime in ps/top output yesterday in the morning
	and after the revert it's now 2 minutes in sum for the last 24h,
	I would say, it's gone.

The intention of the patch in question was to compensate for the loss
of lumpy reclaim. Part of the reason lumpy reclaim worked is because
it aggressively reclaimed pages and this patch was meant to be a sane
compromise.

When compaction fails, it gets deferred and both compaction and
reclaim/compaction are deferred to avoid excessive reclaim. However, since
commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
and continues reclaiming which was not taken into account when the patch
was developed.

Attempts to address the problem ended up just changing the shape of the
problem instead of fixing it. The release window is getting closer, and
while a failing THP allocation is not a major problem, kswapd chewing up a lot of
CPU is. This patch reverts "mm: vmscan: scale number of pages reclaimed
by reclaim/compaction based on failures" and will be revisited in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   25 -------------------------
 1 file changed, 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2624edc..e081ee8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct scan_control *sc)
 	return false;
 }
 
-#ifdef CONFIG_COMPACTION
-/*
- * If compaction is deferred for sc->order then scale the number of pages
- * reclaimed based on the number of consecutive allocation failures
- */
-static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
-			struct lruvec *lruvec, struct scan_control *sc)
-{
-	struct zone *zone = lruvec_zone(lruvec);
-
-	if (zone->compact_order_failed <= sc->order)
-		pages_for_compaction <<= zone->compact_defer_shift;
-	return pages_for_compaction;
-}
-#else
-static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
-			struct lruvec *lruvec, struct scan_control *sc)
-{
-	return pages_for_compaction;
-}
-#endif
-
 /*
  * Reclaim/compaction is used for high-order allocation requests. It reclaims
  * order-0 pages before compacting the zone. should_continue_reclaim() returns
@@ -1829,9 +1807,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-
-	pages_for_compaction = scale_for_compaction(pages_for_compaction,
-						    lruvec, sc);
 	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
 		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);


* Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  2012-11-05 14:24                           ` [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" Mel Gorman
@ 2012-11-06 10:15                             ` Johannes Hirte
  2012-11-09  8:36                               ` Mel Gorman
  2012-11-09  9:12                             ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Hirte @ 2012-11-06 10:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, Rik van Riel, Jiri Slaby, LKML

Am Mon, 5 Nov 2012 14:24:49 +0000
schrieb Mel Gorman <mgorman@suse.de>:

> Jiri Slaby reported the following:
> 
> 	(It's an effective revert of "mm: vmscan: scale number of pages
> 	reclaimed by reclaim/compaction based on failures".) Given kswapd
> 	had hours of runtime in ps/top output yesterday in the morning
> 	and after the revert it's now 2 minutes in sum for the last 24h,
> 	I would say, it's gone.
> 
> The intention of the patch in question was to compensate for the loss
> of lumpy reclaim. Part of the reason lumpy reclaim worked is because
> it aggressively reclaimed pages and this patch was meant to be a sane
> compromise.
> 
> When compaction fails, it gets deferred and both compaction and
> reclaim/compaction are deferred to avoid excessive reclaim. However, since
> commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each
> time and continues reclaiming which was not taken into account when
> the patch was developed.
> 
> Attempts to address the problem ended up just changing the shape of
> the problem instead of fixing it. The release window gets closer and
> while a THP allocation failing is not a major problem, kswapd chewing
> up a lot of CPU is. This patch reverts "mm: vmscan: scale number of
> pages reclaimed by reclaim/compaction based on failures" and will be
> revisited in the future.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   25 -------------------------
>  1 file changed, 25 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2624edc..e081ee8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct
> scan_control *sc) return false;
>  }
>  
> -#ifdef CONFIG_COMPACTION
> -/*
> - * If compaction is deferred for sc->order then scale the number of
> pages
> - * reclaimed based on the number of consecutive allocation failures
> - */
> > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > -			struct lruvec *lruvec, struct scan_control *sc)
> > -{
> -	struct zone *zone = lruvec_zone(lruvec);
> -
> -	if (zone->compact_order_failed <= sc->order)
> -		pages_for_compaction <<= zone->compact_defer_shift;
> -	return pages_for_compaction;
> -}
> -#else
> > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > -			struct lruvec *lruvec, struct scan_control *sc)
> > -{
> -	return pages_for_compaction;
> -}
> -#endif
> -
>  /*
> >   * Reclaim/compaction is used for high-order allocation requests. It reclaims
> >   * order-0 pages before compacting the zone. should_continue_reclaim() returns
> > @@ -1829,9 +1807,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
>  	 * inactive lists are large enough, continue reclaiming
>  	 */
>  	pages_for_compaction = (2UL << sc->order);
> -
> > -	pages_for_compaction = scale_for_compaction(pages_for_compaction,
> > -						    lruvec, sc);
>  	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
>  	if (nr_swap_pages > 0)
> >  		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > --

Even with this patch I see kswapd0 very often at the top of top - much more
than with kernel 3.6.


* Re: kswapd0: excessive CPU usage
  2012-11-02 19:45                         ` Jiri Slaby
  2012-11-04 11:26                           ` Zdenek Kabelac
  2012-11-05 14:24                           ` [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" Mel Gorman
@ 2012-11-09  4:22                           ` Seth Jennings
  2012-11-09  8:07                             ` Zdenek Kabelac
  2012-11-09  8:40                             ` Mel Gorman
  2 siblings, 2 replies; 52+ messages in thread
From: Seth Jennings @ 2012-11-09  4:22 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Mel Gorman, Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

On 11/02/2012 02:45 PM, Jiri Slaby wrote:
> On 11/02/2012 11:53 AM, Jiri Slaby wrote:
>> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>>> Yes, applying this instead of the revert fixes the issue as well.
>>>
>>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>>> CPU usage - mainly  after  suspend/resume
>>>
>>> Here is just simple  kswapd backtrace from running kernel:
>>
>> Yup, this is what we were seeing with the former patch only too. Try to
>> apply the other one too:
>> https://patchwork.kernel.org/patch/1673231/
>>
>> For me I would say, it is fixed by the two patches now. I won't be able
>> to report later, since I'm leaving to a conference tomorrow.
> 
> Damn it. It recurred right now, with both patches applied. After I
> started a java program which consumed some more memory. Though there are
> still 2 gigs free, kswap is spinning:
> [<ffffffff810b00da>] __cond_resched+0x2a/0x40
> [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
> [<ffffffff8113478d>] kswapd+0x66d/0xb60
> [<ffffffff810a25d0>] kthread+0xc0/0xd0
> [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff

I'm also hitting this issue in v3.7-rc4.  It appears that the last
release not affected by this issue was v3.3.  Bisecting the changes
included for v3.4-rc1 showed that this commit introduced the issue:

fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
Author: Rik van Riel <riel@redhat.com>
Date:   Wed Mar 21 16:33:51 2012 -0700

    vmscan: reclaim at order 0 when compaction is enabled
...

This is plausible since the issue seems to be in the kswapd + compaction
realm.  I've yet to figure out exactly what about this commit results in
kswapd spinning.

I would be interested if someone can confirm this finding.

--
Seth



* Re: kswapd0: excessive CPU usage
  2012-11-09  4:22                           ` kswapd0: excessive CPU usage Seth Jennings
@ 2012-11-09  8:07                             ` Zdenek Kabelac
  2012-11-09  9:06                               ` Mel Gorman
  2012-11-09  8:40                             ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: Zdenek Kabelac @ 2012-11-09  8:07 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Jiri Slaby, Mel Gorman, Valdis.Kletnieks, Jiri Slaby, linux-mm,
	LKML, Andrew Morton, Rik van Riel, Robert Jennings

Dne 9.11.2012 05:22, Seth Jennings napsal(a):
> On 11/02/2012 02:45 PM, Jiri Slaby wrote:
>> On 11/02/2012 11:53 AM, Jiri Slaby wrote:
>>> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>>>> Yes, applying this instead of the revert fixes the issue as well.
>>>>
>>>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>>>> CPU usage - mainly  after  suspend/resume
>>>>
>>>> Here is just simple  kswapd backtrace from running kernel:
>>>
>>> Yup, this is what we were seeing with the former patch only too. Try to
>>> apply the other one too:
>>> https://patchwork.kernel.org/patch/1673231/
>>>
>>> For me I would say, it is fixed by the two patches now. I won't be able
>>> to report later, since I'm leaving to a conference tomorrow.
>>
>> Damn it. It recurred right now, with both patches applied. After I
>> started a java program which consumed some more memory. Though there are
>> still 2 gigs free, kswap is spinning:
>> [<ffffffff810b00da>] __cond_resched+0x2a/0x40
>> [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
>> [<ffffffff8113478d>] kswapd+0x66d/0xb60
>> [<ffffffff810a25d0>] kthread+0xc0/0xd0
>> [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I'm also hitting this issue in v3.7-rc4.  It appears that the last
> release not affected by this issue was v3.3.  Bisecting the changes
> included for v3.4-rc1 showed that this commit introduced the issue:
>
> fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
> commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
> Author: Rik van Riel <riel@redhat.com>
> Date:   Wed Mar 21 16:33:51 2012 -0700
>
>      vmscan: reclaim at order 0 when compaction is enabled
> ...
>
> This is plausible since the issue seems to be in the kswapd + compaction
> realm.  I've yet to figure out exactly what about this commit results in
> kswapd spinning.
>
> I would be interested if someone can confirm this finding.
>
> --
> Seth
>


On my system (3.7-rc4) the problem seems to be effectively solved by the 
revert patch: https://lkml.org/lkml/2012/11/5/308

i.e. in 2 days of uptime kswapd0 has eaten 6 seconds of CPU time, which is 
IMHO ok - I'm not observing any kswapd0 busy loops on the CPU.


Zdenek



* Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  2012-11-06 10:15                             ` Johannes Hirte
@ 2012-11-09  8:36                               ` Mel Gorman
  2012-11-14 21:43                                 ` Johannes Hirte
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2012-11-09  8:36 UTC (permalink / raw)
  To: Johannes Hirte
  Cc: Andrew Morton, Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, Rik van Riel, Jiri Slaby, LKML

On Tue, Nov 06, 2012 at 11:15:54AM +0100, Johannes Hirte wrote:
> Am Mon, 5 Nov 2012 14:24:49 +0000
> schrieb Mel Gorman <mgorman@suse.de>:
> 
> > Jiri Slaby reported the following:
> > 
> > 	(It's an effective revert of "mm: vmscan: scale number of
> > pages reclaimed by reclaim/compaction based on failures".) Given
> > kswapd had hours of runtime in ps/top output yesterday in the morning
> > 	and after the revert it's now 2 minutes in sum for the last
> > 24h, I would say, it's gone.
> > 
> > The intention of the patch in question was to compensate for the loss
> > of lumpy reclaim. Part of the reason lumpy reclaim worked is because
> > it aggressively reclaimed pages and this patch was meant to be a sane
> > compromise.
> > 
> > When compaction fails, it gets deferred and both compaction and
> > reclaim/compaction are deferred to avoid excessive reclaim. However, since
> > commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each
> > time and continues reclaiming which was not taken into account when
> > the patch was developed.
> > 
> > Attempts to address the problem ended up just changing the shape of
> > the problem instead of fixing it. The release window gets closer and
> > while a THP allocation failing is not a major problem, kswapd chewing
> > up a lot of CPU is. This patch reverts "mm: vmscan: scale number of
> > pages reclaimed by reclaim/compaction based on failures" and will be
> > revisited in the future.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |   25 -------------------------
> >  1 file changed, 25 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2624edc..e081ee8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct
> > scan_control *sc) return false;
> >  }
> >  
> > -#ifdef CONFIG_COMPACTION
> > -/*
> > - * If compaction is deferred for sc->order then scale the number of
> > pages
> > - * reclaimed based on the number of consecutive allocation failures
> > - */
> > -static unsigned long scale_for_compaction(unsigned long
> > pages_for_compaction,
> > -			struct lruvec *lruvec, struct scan_control
> > *sc) -{
> > -	struct zone *zone = lruvec_zone(lruvec);
> > -
> > -	if (zone->compact_order_failed <= sc->order)
> > -		pages_for_compaction <<= zone->compact_defer_shift;
> > -	return pages_for_compaction;
> > -}
> > -#else
> > -static unsigned long scale_for_compaction(unsigned long
> > pages_for_compaction,
> > -			struct lruvec *lruvec, struct scan_control
> > *sc) -{
> > -	return pages_for_compaction;
> > -}
> > -#endif
> > -
> >  /*
> >   * Reclaim/compaction is used for high-order allocation requests. It
> > reclaims
> >   * order-0 pages before compacting the zone.
> > should_continue_reclaim() returns @@ -1829,9 +1807,6 @@ static inline
> > bool should_continue_reclaim(struct lruvec *lruvec,
> >  	 * inactive lists are large enough, continue reclaiming
> >  	 */
> >  	pages_for_compaction = (2UL << sc->order);
> > -
> > -	pages_for_compaction =
> > scale_for_compaction(pages_for_compaction,
> > -						    lruvec, sc);
> >  	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> >  	if (nr_swap_pages > 0)
> >  		inactive_lru_pages += get_lru_size(lruvec,
> > LRU_INACTIVE_ANON); --
> 
> Even with this patch I see kswapd0 very often on top. Much more than
> with kernel 3.6.

How severe is the CPU usage? The higher usage can be explained by "mm:
remove __GFP_NO_KSWAPD" which allows kswapd to compact memory to reduce
the amount of time processes spend in compaction but will result in the
CPU cost being incurred by kswapd.

Is it really high like the bug was reporting with high usage over long
periods of time or do you just see it using 2-6% of CPU for short
periods?

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-09  4:22                           ` kswapd0: excessive CPU usage Seth Jennings
  2012-11-09  8:07                             ` Zdenek Kabelac
@ 2012-11-09  8:40                             ` Mel Gorman
  1 sibling, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2012-11-09  8:40 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Jiri Slaby, Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

On Thu, Nov 08, 2012 at 10:22:05PM -0600, Seth Jennings wrote:
> On 11/02/2012 02:45 PM, Jiri Slaby wrote:
> > On 11/02/2012 11:53 AM, Jiri Slaby wrote:
> >> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
> >>>>> Yes, applying this instead of the revert fixes the issue as well.
> >>>
> >>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
> >>> CPU usage - mainly  after  suspend/resume
> >>>
> >>> Here is just simple  kswapd backtrace from running kernel:
> >>
> >> Yup, this is what we were seeing with the former patch only too. Try to
> >> apply the other one too:
> >> https://patchwork.kernel.org/patch/1673231/
> >>
> >> For me I would say, it is fixed by the two patches now. I won't be able
> >> to report later, since I'm leaving to a conference tomorrow.
> > 
> > Damn it. It recurred right now, with both patches applied. After I
> > started a java program which consumed some more memory. Though there are
> > still 2 gigs free, kswap is spinning:
> > [<ffffffff810b00da>] __cond_resched+0x2a/0x40
> > [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
> > [<ffffffff8113478d>] kswapd+0x66d/0xb60
> > [<ffffffff810a25d0>] kthread+0xc0/0xd0
> > [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
> > [<ffffffffffffffff>] 0xffffffffffffffff
> 
> I'm also hitting this issue in v3.7-rc4.  It appears that the last
> release not affected by this issue was v3.3.  Bisecting the changes
> included for v3.4-rc1 showed that this commit introduced the issue:
> 
> fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
> commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
> Author: Rik van Riel <riel@redhat.com>
> Date:   Wed Mar 21 16:33:51 2012 -0700
> 
>     vmscan: reclaim at order 0 when compaction is enabled
> ...
> 
> This is plausible since the issue seems to be in the kswapd + compaction
> realm.  I've yet to figure out exactly what about this commit results in
> kswapd spinning.
> 
> I would be interested if someone can confirm this finding.
> 

I cannot confirm the actual finding as I don't see the same sort of
problems. However, this does make sense and was more or less expected.
Reclaiming at order-0 would have forced compaction to be used more instead
of lumpy reclaim (less CPU usage but greater system disruption that is
harder to measure). Shortly after, lumpy reclaim was removed entirely, so
now larger amounts of CPU time are spent compacting memory that previously
would have been reclaimed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-09  8:07                             ` Zdenek Kabelac
@ 2012-11-09  9:06                               ` Mel Gorman
  2012-11-11  9:13                                 ` Zdenek Kabelac
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2012-11-09  9:06 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

On Fri, Nov 09, 2012 at 09:07:45AM +0100, Zdenek Kabelac wrote:
> >fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
> >commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
> >Author: Rik van Riel <riel@redhat.com>
> >Date:   Wed Mar 21 16:33:51 2012 -0700
> >
> >     vmscan: reclaim at order 0 when compaction is enabled
> >...
> >
> >This is plausible since the issue seems to be in the kswapd + compaction
> >realm.  I've yet to figure out exactly what about this commit results in
> >kswapd spinning.
> >
> >I would be interested if someone can confirm this finding.
> >
> >--
> >Seth
> >
> 
> 
> On my system 3.7-rc4 the problem seems to be effectively solved by
> revert patch: https://lkml.org/lkml/2012/11/5/308
> 

Ok, while there is still a question on whether it's enough I think it's
sensible to at least start with the obvious one.

Thanks very much.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  2012-11-05 14:24                           ` [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" Mel Gorman
  2012-11-06 10:15                             ` Johannes Hirte
@ 2012-11-09  9:12                             ` Mel Gorman
  1 sibling, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2012-11-09  9:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby, linux-mm,
	Rik van Riel, Jiri Slaby, LKML

On Mon, Nov 05, 2012 at 02:24:49PM +0000, Mel Gorman wrote:
> Jiri Slaby reported the following:
> 
> 	(It's an effective revert of "mm: vmscan: scale number of pages
> 	reclaimed by reclaim/compaction based on failures".) Given kswapd
> 	had hours of runtime in ps/top output yesterday in the morning
> 	and after the revert it's now 2 minutes in sum for the last 24h,
> 	I would say, it's gone.
> 
> The intention of the patch in question was to compensate for the loss
> of lumpy reclaim. Part of the reason lumpy reclaim worked is because
> it aggressively reclaimed pages and this patch was meant to be a sane
> compromise.
> 
> When compaction fails, it gets deferred and both compaction and
> reclaim/compaction are deferred to avoid excessive reclaim. However, since
> commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
> and continues reclaiming which was not taken into account when the patch
> was developed.
> 
> Attempts to address the problem ended up just changing the shape of the
> problem instead of fixing it. The release window gets closer and while a
> THP allocation failing is not a major problem, kswapd chewing up a lot of
> CPU is. This patch reverts "mm: vmscan: scale number of pages reclaimed
> by reclaim/compaction based on failures" and will be revisited in the future.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Andrew, can you pick up this patch please and drop
mm-vmscan-scale-number-of-pages-reclaimed-by-reclaim-compaction-only-in-direct-reclaim.patch
?

There are mixed reports on how much it helps but it comes down to "this
fixes a problem" versus "kswapd is still showing higher usage". I think
the higher kswapd usage is explained by the removal of __GFP_NO_KSWAPD
and so while higher usage is bad, it is not necessarily unjustified.
Ideally it would have been proven that having kswapd doing the work
reduced application stalls in direct reclaim but unfortunately I do not
have concrete evidence of that at this time.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-09  9:06                               ` Mel Gorman
@ 2012-11-11  9:13                                 ` Zdenek Kabelac
  2012-11-12 11:37                                   ` [PATCH] Revert "mm: remove __GFP_NO_KSWAPD" Mel Gorman
  2012-11-12 12:19                                   ` kswapd0: excessive CPU usage Mel Gorman
  0 siblings, 2 replies; 52+ messages in thread
From: Zdenek Kabelac @ 2012-11-11  9:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

Dne 9.11.2012 10:06, Mel Gorman napsal(a):
> On Fri, Nov 09, 2012 at 09:07:45AM +0100, Zdenek Kabelac wrote:
>>> fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
>>> commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
>>> Author: Rik van Riel <riel@redhat.com>
>>> Date:   Wed Mar 21 16:33:51 2012 -0700
>>>
>>>      vmscan: reclaim at order 0 when compaction is enabled
>>> ...
>>>
>>> This is plausible since the issue seems to be in the kswapd + compaction
>>> realm.  I've yet to figure out exactly what about this commit results in
>>> kswapd spinning.
>>>
>>> I would be interested if someone can confirm this finding.
>>>
>>> --
>>> Seth
>>>
>>
>>
>> On my system 3.7-rc4 the problem seems to be effectively solved by
>> revert patch: https://lkml.org/lkml/2012/11/5/308
>>
>
> Ok, while there is still a question on whether it's enough I think it's
> sensible to at least start with the obvious one.
>


Hmm, so it just took longer to hit the problem and observe kswapd0
spinning on my CPU again - it's not as endless as before - but it still 
easily eats minutes - it helps to turn off Firefox or TB (memory-hungry 
apps) so kswapd0 stops soon - and to restart those apps again.
(And I still have like >1GB of cached memory)

kswapd0         R  running task        0    30      2 0x00000000
  ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
  ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
  ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
Call Trace:
  [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
  [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
  [<ffffffff81192971>] put_super+0x31/0x40
  [<ffffffff81192a42>] drop_super+0x22/0x30
  [<ffffffff81193b89>] prune_super+0x149/0x1b0
  [<ffffffff81141e2a>] shrink_slab+0xba/0x510
  [<ffffffff81185b4a>] ? mem_cgroup_iter+0x17a/0x2e0
  [<ffffffff81185a9a>] ? mem_cgroup_iter+0xca/0x2e0
  [<ffffffff81145099>] balance_pgdat+0x629/0x7f0
  [<ffffffff811453d4>] kswapd+0x174/0x620
  [<ffffffff8106fd20>] ? __init_waitqueue_head+0x60/0x60
  [<ffffffff81145260>] ? balance_pgdat+0x7f0/0x7f0
  [<ffffffff8106f50b>] kthread+0xdb/0xe0
  [<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140
  [<ffffffff8155fa1c>] ret_from_fork+0x7c/0xb0
  [<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140


runnable tasks:
             task   PID         tree-key  switches  prio     exec-runtime      sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
          kswapd0    30   8689943.729790     36266   120   8689943.729790  201495.640629  56609485.489414 /
      kworker/0:1 14790   8689937.729790     16969   120   8689937.729790     374.385996    150405.181652 /
R           bash 14855       821.749268        50   120       821.749268     24.027535      5252.291128 /autogroup-304




Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 146
CPU    1: hi:  186, btch:  31 usd: 135
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd: 131
CPU    1: hi:  186, btch:  31 usd: 132
active_anon:726521 inactive_anon:26442 isolated_anon:0
  active_file:77765 inactive_file:76890 isolated_file:0
  unevictable:12 dirty:4 writeback:0 unstable:0
  free:40261 slab_reclaimable:12414 slab_unreclaimable:9694
  mapped:26382 shmem:162712 pagetables:6618 bounce:0
  free_cma:0
DMA free:15676kB min:272kB low:340kB high:408kB active_anon:208kB 
inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:208kB slab_reclaimable:8kB 
slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB 
free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2951 3836 3836
DMA32 free:126072kB min:51776kB low:64720kB high:77664kB active_anon:2175104kB 
inactive_anon:98976kB active_file:296252kB inactive_file:297648kB 
unevictable:48kB isolated(anon):0kB isolated(file):0kB present:3021960kB 
mlocked:48kB dirty:12kB writeback:0kB mapped:77664kB shmem:620388kB 
slab_reclaimable:19128kB slab_unreclaimable:6292kB kernel_stack:624kB 
pagetables:8900kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 885 885
Normal free:19296kB min:15532kB low:19412kB high:23296kB active_anon:730772kB 
inactive_anon:6792kB active_file:14808kB inactive_file:9912kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:906664kB mlocked:0kB dirty:4kB 
writeback:0kB mapped:27864kB shmem:30252kB slab_reclaimable:30520kB 
slab_unreclaimable:32476kB kernel_stack:2496kB pagetables:17572kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB 1*8kB 3*16kB 2*32kB 3*64kB 2*128kB 3*256kB 2*512kB 3*1024kB 
3*2048kB 1*4096kB = 15676kB
DMA32: 730*4kB 328*8kB 223*16kB 123*32kB 182*64kB 96*128kB 172*256kB 56*512kB 
12*1024kB 1*2048kB 1*4096kB = 128120kB
Normal: 600*4kB 384*8kB 164*16kB 122*32kB 40*64kB 7*128kB 1*256kB 1*512kB 
1*1024kB 1*2048kB 0*4096kB = 19296kB
317367 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
1032176 pages RAM
42789 pages reserved
642501 pages shared
869271 pages non-shared


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-11  9:13                                 ` Zdenek Kabelac
@ 2012-11-12 11:37                                   ` Mel Gorman
  2012-11-16 19:14                                     ` Josh Boyer
  2012-11-20  9:18                                     ` Glauber Costa
  2012-11-12 12:19                                   ` kswapd0: excessive CPU usage Mel Gorman
  1 sibling, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2012-11-12 11:37 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following

	Hmm,  so it's just took longer to hit the problem and observe
	kswapd0 spinning on my CPU again - it's not as endless like before -
	but still it easily eats minutes - it helps to	turn off  Firefox
	or TB  (memory hungry apps) so kswapd0 stops soon - and restart
	those apps again.  (And I still have like >1GB of cached memory)

	kswapd0         R  running task        0    30      2 0x00000000
	 ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
	 ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
	 ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
	Call Trace:
	 [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
	 [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
	 [<ffffffff81192971>] put_super+0x31/0x40
	 [<ffffffff81192a42>] drop_super+0x22/0x30
	 [<ffffffff81193b89>] prune_super+0x149/0x1b0
	 [<ffffffff81141e2a>] shrink_slab+0xba/0x510

The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be reclaimed.

The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.

If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
is a storm of THP requests that are simply rejected, it will still
be the case that kswapd is awake for a prolonged period of time
as pgdat->kswapd_max_order is updated each time. This is noticed by
the main kswapd() loop and it will not call kswapd_try_to_sleep().
Instead it will loop, shrinking a small number of pages and calling
shrink_slab() on each iteration.

The temptation is to supply a patch that checks if kswapd was woken for
THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
backed up by proper testing. As 3.7 is very close to release and this is
not a bug we should release with, a safer path is to revert "mm: remove
__GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
balance_pgdat() logic in general.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 drivers/mtd/mtdcore.c           |    6 ++++--
 include/linux/gfp.h             |    5 ++++-
 include/trace/events/gfpflags.h |    1 +
 mm/page_alloc.c                 |    7 ++++---
 4 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 374c46d..ec794a7 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1077,7 +1077,8 @@ EXPORT_SYMBOL_GPL(mtd_writev);
  * until the request succeeds or until the allocation size falls below
  * the system page size. This attempts to make sure it does not adversely
  * impact system performance, so when allocating more than one page, we
- * ask the memory allocator to avoid re-trying.
+ * ask the memory allocator to avoid re-trying, swapping, writing back
+ * or performing I/O.
  *
  * Note, this function also makes sure that the allocated buffer is aligned to
  * the MTD device's min. I/O unit, i.e. the "mtd->writesize" value.
@@ -1091,7 +1092,8 @@ EXPORT_SYMBOL_GPL(mtd_writev);
  */
 void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
 {
-	gfp_t flags = __GFP_NOWARN | __GFP_WAIT | __GFP_NORETRY;
+	gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
+		       __GFP_NORETRY | __GFP_NO_KSWAPD;
 	size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
 	void *kbuf;
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 02c1c971..d0a7967 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -31,6 +31,7 @@ struct vm_area_struct;
 #define ___GFP_THISNODE		0x40000u
 #define ___GFP_RECLAIMABLE	0x80000u
 #define ___GFP_NOTRACK		0x200000u
+#define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
 
@@ -85,6 +86,7 @@ struct vm_area_struct;
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
+#define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
@@ -114,7 +116,8 @@ struct vm_area_struct;
 				 __GFP_MOVABLE)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
 #define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN)
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
+			 __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9391706..d6fd8e5 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -36,6 +36,7 @@
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
 	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
 	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
+	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
 	{(unsigned long)__GFP_OTHER_NODE,	"GFP_OTHER_NODE"}	\
 	) : "GFP_NOWAIT"
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..7228260 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2416,8 +2416,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx,
-					zone_idx(preferred_zone));
+	if (!(gfp_mask & __GFP_NO_KSWAPD))
+		wake_all_kswapd(order, zonelist, high_zoneidx,
+						zone_idx(preferred_zone));
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -2494,7 +2495,7 @@ rebalance:
 	 * system then fail the allocation instead of entering direct reclaim.
 	 */
 	if ((deferred_compaction || contended_compaction) &&
-	    (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+						(gfp_mask & __GFP_NO_KSWAPD))
 		goto nopage;
 
 	/* Try direct reclaim and then allocating */

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-11  9:13                                 ` Zdenek Kabelac
  2012-11-12 11:37                                   ` [PATCH] Revert "mm: remove __GFP_NO_KSWAPD" Mel Gorman
@ 2012-11-12 12:19                                   ` Mel Gorman
  2012-11-12 13:13                                     ` Zdenek Kabelac
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2012-11-12 12:19 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
> Hmm,  so it's just took longer to hit the problem and observe kswapd0
> spinning on my CPU again - it's not as endless like before - but
> still it easily eats minutes - it helps to  turn off  Firefox or TB
> (memory hungry apps) so kswapd0 stops soon - and restart those apps
> again.
> (And I still have like >1GB of cached memory)
> 

I posted a "safe" patch that I believe explains why you are seeing what
you are seeing. It does mean that there will still be some stalls due to
THP because kswapd is not helping and it's avoiding the problem rather
than trying to deal with it.

Hence, I'm also going to post this patch even though I have not tested
it myself. If you find it fixes the problem then it would be a
preferable patch to the revert. It still is the case that the
balance_pgdat() logic is in sort need of a rethink as it's pretty
twisted right now.

Thanks

---8<---
mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following

	Hmm,  so it's just took longer to hit the problem and observe
	kswapd0 spinning on my CPU again - it's not as endless like before -
	but still it easily eats minutes - it helps to	turn off  Firefox
	or TB  (memory hungry apps) so kswapd0 stops soon - and restart
	those apps again.  (And I still have like >1GB of cached memory)

	kswapd0         R  running task        0    30      2 0x00000000
	 ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
	 ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
	 ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
	Call Trace:
	 [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
	 [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
	 [<ffffffff81192971>] put_super+0x31/0x40
	 [<ffffffff81192a42>] drop_super+0x22/0x30
	 [<ffffffff81193b89>] prune_super+0x149/0x1b0
	 [<ffffffff81141e2a>] shrink_slab+0xba/0x510

The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be reclaimed.

The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.

If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
is a storm of THP requests that are simply rejected, it will still
be the case that kswapd is awake for a prolonged period of time
as pgdat->kswapd_max_order is updated each time. This is noticed by
the main kswapd() loop and it will not call kswapd_try_to_sleep().
Instead it will loop, shrinking a small number of pages and calling
shrink_slab() on each iteration.

This patch defers when kswapd gets woken up for THP allocations. For !THP
allocations, kswapd is always woken up. For THP allocations, kswapd is
woken up iff the process is willing to enter into direct
reclaim/compaction.

Signed-off-by: Mel Gorman <mgorman@suse.de>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..0b469b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2378,6 +2378,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+/* Returns true if the allocation is likely for THP */
+static bool is_thp_alloc(gfp_t gfp_mask, unsigned int order)
+{
+	if (order == pageblock_order &&
+	    (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+		return true;
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2416,7 +2425,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx,
+	/* The decision whether to wake kswapd for THP is made later */
+	if (!is_thp_alloc(gfp_mask, order))
+		wake_all_kswapd(order, zonelist, high_zoneidx,
 					zone_idx(preferred_zone));
 
 	/*
@@ -2487,15 +2498,21 @@ rebalance:
 		goto got_pg;
 	sync_migration = true;
 
-	/*
-	 * If compaction is deferred for high-order allocations, it is because
-	 * sync compaction recently failed. In this is the case and the caller
-	 * requested a movable allocation that does not heavily disrupt the
-	 * system then fail the allocation instead of entering direct reclaim.
-	 */
-	if ((deferred_compaction || contended_compaction) &&
-	    (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
-		goto nopage;
+	if (is_thp_alloc(gfp_mask, order)) {
+		/*
+		 * If compaction is deferred for high-order allocations, it is
+		 * because sync compaction recently failed. In this is the case
+		 * and the caller requested a movable allocation that does not
+		 * heavily disrupt the system then fail the allocation instead
+		 * of entering direct reclaim.
+		 */
+		if (deferred_compaction || contended_compaction)
+			goto nopage;
+
+		/* If process is willing to reclaim/compact then wake kswapd */
+		wake_all_kswapd(order, zonelist, high_zoneidx,
+					zone_idx(preferred_zone));
+	}
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-12 12:19                                   ` kswapd0: excessive CPU usage Mel Gorman
@ 2012-11-12 13:13                                     ` Zdenek Kabelac
  2012-11-12 13:31                                       ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Zdenek Kabelac @ 2012-11-12 13:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

Dne 12.11.2012 13:19, Mel Gorman napsal(a):
> On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
>> Hmm,  so it's just took longer to hit the problem and observe kswapd0
>> spinning on my CPU again - it's not as endless like before - but
>> still it easily eats minutes - it helps to  turn off  Firefox or TB
>> (memory hungry apps) so kswapd0 stops soon - and restart those apps
>> again.
>> (And I still have like >1GB of cached memory)
>>
>
> I posted a "safe" patch that I believe explains why you are seeing what
> you are seeing. It does mean that there will still be some stalls due to
> THP because kswapd is not helping and it's avoiding the problem rather
> than trying to deal with it.
>
> Hence, I'm also going to post this patch even though I have not tested
> it myself. If you find it fixes the problem then it would be a
> preferable patch to the revert. It still is the case that the
> balance_pgdat() logic is in sort need of a rethink as it's pretty
> twisted right now.
>


Should I apply them all together for 3.7-rc5 ?

1) https://lkml.org/lkml/2012/11/5/308
2) https://lkml.org/lkml/2012/11/12/113
3) https://lkml.org/lkml/2012/11/12/151

Zdenek


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-12 13:13                                     ` Zdenek Kabelac
@ 2012-11-12 13:31                                       ` Mel Gorman
  2012-11-12 14:50                                         ` Zdenek Kabelac
  2012-11-18 19:00                                         ` Zdenek Kabelac
  0 siblings, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2012-11-12 13:31 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

On Mon, Nov 12, 2012 at 02:13:20PM +0100, Zdenek Kabelac wrote:
> Dne 12.11.2012 13:19, Mel Gorman napsal(a):
> >On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
> >>Hmm, so it just took longer to hit the problem and observe kswapd0
> >>spinning on my CPU again - it's not as endless as before - but
> >>it still easily eats minutes - it helps to turn off Firefox or TB
> >>(memory-hungry apps) so kswapd0 stops soon - and to restart those apps
> >>again.
> >>(And I still have like >1GB of cached memory)
> >>
> >
> >I posted a "safe" patch that I believe explains why you are seeing what
> >you are seeing. It does mean that there will still be some stalls due to
> >THP because kswapd is not helping and it's avoiding the problem rather
> >than trying to deal with it.
> >
> >Hence, I'm also going to post this patch even though I have not tested
> >it myself. If you find it fixes the problem then it would be a
> >preferable patch to the revert. It still is the case that the
> > balance_pgdat() logic is in sore need of a rethink as it's pretty
> >twisted right now.
> >
> 
> 
> Should I apply them all together for 3.7-rc5?
> 
> 1) https://lkml.org/lkml/2012/11/5/308
> 2) https://lkml.org/lkml/2012/11/12/113
> 3) https://lkml.org/lkml/2012/11/12/151
> 

Not all together. Test either 1+2 or 1+3. 1+2 is the safer choice but
does nothing about THP stalls. 1+3 is a riskier version but depends on
me being correct about what the root cause of the problem you see is.

If both 1+2 and 1+3 work for you, I'd choose 1+3 for merging. If you only
have the time to test one combination then it would be preferred that you
test the safe option of 1+2.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-12 13:31                                       ` Mel Gorman
@ 2012-11-12 14:50                                         ` Zdenek Kabelac
  2012-11-18 19:00                                         ` Zdenek Kabelac
  1 sibling, 0 replies; 52+ messages in thread
From: Zdenek Kabelac @ 2012-11-12 14:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

Dne 12.11.2012 14:31, Mel Gorman napsal(a):
> On Mon, Nov 12, 2012 at 02:13:20PM +0100, Zdenek Kabelac wrote:
>> Dne 12.11.2012 13:19, Mel Gorman napsal(a):
>>> On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
>>>> Hmm, so it just took longer to hit the problem and observe kswapd0
>>>> spinning on my CPU again - it's not as endless as before - but
>>>> it still easily eats minutes - it helps to turn off Firefox or TB
>>>> (memory-hungry apps) so kswapd0 stops soon - and to restart those apps
>>>> again.
>>>> (And I still have like >1GB of cached memory)
>>>>
>>>
>>> I posted a "safe" patch that I believe explains why you are seeing what
>>> you are seeing. It does mean that there will still be some stalls due to
>>> THP because kswapd is not helping and it's avoiding the problem rather
>>> than trying to deal with it.
>>>
>>> Hence, I'm also going to post this patch even though I have not tested
>>> it myself. If you find it fixes the problem then it would be a
>>> preferable patch to the revert. It still is the case that the
>>> balance_pgdat() logic is in sore need of a rethink as it's pretty
>>> twisted right now.
>>>
>>
>>
>> Should I apply them all together for 3.7-rc5?
>>
>> 1) https://lkml.org/lkml/2012/11/5/308
>> 2) https://lkml.org/lkml/2012/11/12/113
>> 3) https://lkml.org/lkml/2012/11/12/151
>>
>
> Not all together. Test either 1+2 or 1+3. 1+2 is the safer choice but
> does nothing about THP stalls. 1+3 is a riskier version but depends on
> me being correct about what the root cause of the problem you see is.
>
> If both 1+2 and 1+3 work for you, I'd choose 1+3 for merging. If you only
> have the time to test one combination then it would be preferred that you
> test the safe option of 1+2.
>
>

I'll go with 1+2 for a couple of days - the issue is that I've no idea how
it suddenly gets triggered - it seemed to run fine for 2-3 days even with
just 1) - but then kswapd0 started to occupy the CPU for minutes.
It looks like some intensive workload in Firefox (Flash) may lead to that.

Anyway, it's hard to tell quickly whether it helped.

Zdenek


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  2012-11-09  8:36                               ` Mel Gorman
@ 2012-11-14 21:43                                 ` Johannes Hirte
  0 siblings, 0 replies; 52+ messages in thread
From: Johannes Hirte @ 2012-11-14 21:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Zdenek Kabelac, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, Rik van Riel, Jiri Slaby, LKML

Am Fri, 9 Nov 2012 08:36:37 +0000
schrieb Mel Gorman <mgorman@suse.de>:

> On Tue, Nov 06, 2012 at 11:15:54AM +0100, Johannes Hirte wrote:
> > Am Mon, 5 Nov 2012 14:24:49 +0000
> > schrieb Mel Gorman <mgorman@suse.de>:
> > 
> > > Jiri Slaby reported the following:
> > > 
> > > 	(It's an effective revert of "mm: vmscan: scale number of
> > > pages reclaimed by reclaim/compaction based on failures".) Given
> > > kswapd had hours of runtime in ps/top output yesterday in the
> > > morning and after the revert it's now 2 minutes in sum for the
> > > last 24h, I would say, it's gone.
> > > 
> > > The intention of the patch in question was to compensate for the
> > > loss of lumpy reclaim. Part of the reason lumpy reclaim worked is
> > > because it aggressively reclaimed pages and this patch was meant
> > > to be a sane compromise.
> > > 
> > > When compaction fails, it gets deferred, and both compaction and
> > > reclaim/compaction are deferred to avoid excessive reclaim. However,
> > > since commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is
> > > woken up each time and continues reclaiming, which was not taken
> > > into account when the patch was developed.
> > > 
> > > Attempts to address the problem ended up just changing the shape
> > > of the problem instead of fixing it. The release window gets
> > > closer and while a THP allocation failing is not a major problem,
> > > kswapd chewing up a lot of CPU is. This patch reverts "mm:
> > > vmscan: scale number of pages reclaimed by reclaim/compaction
> > > based on failures" and will be revisited in the future.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  mm/vmscan.c |   25 -------------------------
> > >  1 file changed, 25 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 2624edc..e081ee8 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> > >  	return false;
> > >  }
> > >  
> > > -#ifdef CONFIG_COMPACTION
> > > -/*
> > > - * If compaction is deferred for sc->order then scale the number of pages
> > > - * reclaimed based on the number of consecutive allocation failures
> > > - */
> > > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > > -			struct lruvec *lruvec, struct scan_control *sc)
> > > -{
> > > -	struct zone *zone = lruvec_zone(lruvec);
> > > -
> > > -	if (zone->compact_order_failed <= sc->order)
> > > -		pages_for_compaction <<= zone->compact_defer_shift;
> > > -	return pages_for_compaction;
> > > -}
> > > -#else
> > > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > > -			struct lruvec *lruvec, struct scan_control *sc)
> > > -{
> > > -	return pages_for_compaction;
> > > -}
> > > -#endif
> > > -
> > >  /*
> > >   * Reclaim/compaction is used for high-order allocation requests. It reclaims
> > >   * order-0 pages before compacting the zone. should_continue_reclaim() returns
> > > @@ -1829,9 +1807,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > >  	 * inactive lists are large enough, continue reclaiming
> > >  	 */
> > >  	pages_for_compaction = (2UL << sc->order);
> > > -
> > > -	pages_for_compaction = scale_for_compaction(pages_for_compaction,
> > > -						    lruvec, sc);
> > >  	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> > >  	if (nr_swap_pages > 0)
> > >  		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > > --
> > 
> > Even with this patch I still see kswapd0 very often at the top of top's
> > output - much more than with kernel 3.6.
> 
> How severe is the CPU usage? The higher usage can be explained by "mm:
> remove __GFP_NO_KSWAPD" which allows kswapd to compact memory to
> reduce the amount of time processes spend in compaction but will
> result in the CPU cost being incurred by kswapd.
> 
> Is it really high like the bug was reporting with high usage over long
> periods of time or do you just see it using 2-6% of CPU for short
> periods?

It is really high. With compile jobs (make -j4 on a dual core) I've seen
kswapd0 consuming at least 50% CPU most of the time.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-12 11:37                                   ` [PATCH] Revert "mm: remove __GFP_NO_KSWAPD" Mel Gorman
@ 2012-11-16 19:14                                     ` Josh Boyer
  2012-11-16 19:51                                       ` Andrew Morton
  2012-11-16 20:06                                       ` Mel Gorman
  2012-11-20  9:18                                     ` Glauber Costa
  1 sibling, 2 replies; 52+ messages in thread
From: Josh Boyer @ 2012-11-16 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zdenek Kabelac, Seth Jennings, Jiri Slaby, Valdis.Kletnieks,
	Jiri Slaby, linux-mm, LKML, Andrew Morton, Rik van Riel,
	Robert Jennings

On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <mgorman@suse.de> wrote:
> With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
> based on failures" reverted, Zdenek Kabelac reported the following
>
>         Hmm, so it just took longer to hit the problem and observe
>         kswapd0 spinning on my CPU again - it's not as endless as before -
>         but it still easily eats minutes - it helps to turn off Firefox
>         or TB (memory-hungry apps) so kswapd0 stops soon - and to restart
>         those apps again. (And I still have like >1GB of cached memory)
>
>         kswapd0         R  running task        0    30      2 0x00000000
>          ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
>          ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
>          ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
>         Call Trace:
>          [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
>          [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
>          [<ffffffff81192971>] put_super+0x31/0x40
>          [<ffffffff81192a42>] drop_super+0x22/0x30
>          [<ffffffff81193b89>] prune_super+0x149/0x1b0
>          [<ffffffff81141e2a>] shrink_slab+0xba/0x510
>
> The sysrq+m indicates the system has no swap so it'll never reclaim
> anonymous pages as part of reclaim/compaction. That is one part of the
> problem but not the root cause as file-backed pages could also be reclaimed.
>
> The likely underlying problem is that kswapd is woken up or kept awake
> for each THP allocation request in the page allocator slow path.
>
> If compaction fails for the requesting process then compaction will be
> deferred for a time and direct reclaim is avoided. However, if there
> is a storm of THP requests that are simply rejected, it will still
> be the case that kswapd is awake for a prolonged period of time
> as pgdat->kswapd_max_order is updated each time. This is noticed by
> the main kswapd() loop and it will not call kswapd_try_to_sleep().
> Instead it will loop, shrinking a small number of pages and calling
> shrink_slab() on each iteration.
>
> The temptation is to supply a patch that checks if kswapd was woken for
> THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
> backed up by proper testing. As 3.7 is very close to release and this is
> not a bug we should release with, a safer path is to revert "mm: remove
> __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
> balance_pgdat() logic in general.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Does anyone know if this is queued to go into 3.7 somewhere?  I looked
a bit and can't find it in a tree.  We have a few reports of Fedora
rawhide users hitting this.

josh

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-16 19:14                                     ` Josh Boyer
@ 2012-11-16 19:51                                       ` Andrew Morton
  2012-11-20  1:43                                         ` Valdis.Kletnieks
  2012-11-16 20:06                                       ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2012-11-16 19:51 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Mel Gorman, Zdenek Kabelac, Seth Jennings, Jiri Slaby,
	Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Rik van Riel,
	Robert Jennings

On Fri, 16 Nov 2012 14:14:47 -0500
Josh Boyer <jwboyer@gmail.com> wrote:

> > The temptation is to supply a patch that checks if kswapd was woken for
> > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
> > backed up by proper testing. As 3.7 is very close to release and this is
> > not a bug we should release with, a safer path is to revert "mm: remove
> > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
> > balance_pgdat() logic in general.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Does anyone know if this is queued to go into 3.7 somewhere?  I looked
> a bit and can't find it in a tree.  We have a few reports of Fedora
> rawhide users hitting this.

Still thinking about it.  We're reverting quite a lot of material
lately. 
mm-revert-mm-vmscan-scale-number-of-pages-reclaimed-by-reclaim-compaction-based-on-failures.patch
and revert-mm-fix-up-zone-present-pages.patch are queued for 3.7.

I'll toss this one in there as well, but I can't say I'm feeling
terribly confident.  How is Valdis's machine nowadays?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-16 19:14                                     ` Josh Boyer
  2012-11-16 19:51                                       ` Andrew Morton
@ 2012-11-16 20:06                                       ` Mel Gorman
  2012-11-20 15:38                                         ` Josh Boyer
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2012-11-16 20:06 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Zdenek Kabelac, Seth Jennings, Jiri Slaby, Valdis.Kletnieks,
	Jiri Slaby, linux-mm, LKML, Andrew Morton, Rik van Riel,
	Robert Jennings

On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <mgorman@suse.de> wrote:
> > With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
> > based on failures" reverted, Zdenek Kabelac reported the following
> >
> >         Hmm, so it just took longer to hit the problem and observe
> >         kswapd0 spinning on my CPU again - it's not as endless as before -
> >         but it still easily eats minutes - it helps to turn off Firefox
> >         or TB (memory-hungry apps) so kswapd0 stops soon - and to restart
> >         those apps again. (And I still have like >1GB of cached memory)
> >
> >         kswapd0         R  running task        0    30      2 0x00000000
> >          ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
> >          ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
> >          ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
> >         Call Trace:
> >          [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
> >          [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
> >          [<ffffffff81192971>] put_super+0x31/0x40
> >          [<ffffffff81192a42>] drop_super+0x22/0x30
> >          [<ffffffff81193b89>] prune_super+0x149/0x1b0
> >          [<ffffffff81141e2a>] shrink_slab+0xba/0x510
> >
> > The sysrq+m indicates the system has no swap so it'll never reclaim
> > anonymous pages as part of reclaim/compaction. That is one part of the
> > problem but not the root cause as file-backed pages could also be reclaimed.
> >
> > The likely underlying problem is that kswapd is woken up or kept awake
> > for each THP allocation request in the page allocator slow path.
> >
> > If compaction fails for the requesting process then compaction will be
> > deferred for a time and direct reclaim is avoided. However, if there
> > is a storm of THP requests that are simply rejected, it will still
> > be the case that kswapd is awake for a prolonged period of time
> > as pgdat->kswapd_max_order is updated each time. This is noticed by
> > the main kswapd() loop and it will not call kswapd_try_to_sleep().
> > Instead it will loop, shrinking a small number of pages and calling
> > shrink_slab() on each iteration.
> >
> > The temptation is to supply a patch that checks if kswapd was woken for
> > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
> > backed up by proper testing. As 3.7 is very close to release and this is
> > not a bug we should release with, a safer path is to revert "mm: remove
> > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
> > balance_pgdat() logic in general.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Does anyone know if this is queued to go into 3.7 somewhere?  I looked
> a bit and can't find it in a tree.  We have a few reports of Fedora
> rawhide users hitting this.
> 

No, because I was waiting to hear a) whether it worked and preferably b)
whether the alternative "less safe" option worked. This close to release it
might be better to just go with the safe option.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-12 13:31                                       ` Mel Gorman
  2012-11-12 14:50                                         ` Zdenek Kabelac
@ 2012-11-18 19:00                                         ` Zdenek Kabelac
  2012-11-18 19:07                                           ` Jiri Slaby
  1 sibling, 1 reply; 52+ messages in thread
From: Zdenek Kabelac @ 2012-11-18 19:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Seth Jennings, Jiri Slaby, Valdis.Kletnieks, Jiri Slaby,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

Dne 12.11.2012 14:31, Mel Gorman napsal(a):
> On Mon, Nov 12, 2012 at 02:13:20PM +0100, Zdenek Kabelac wrote:
>> Dne 12.11.2012 13:19, Mel Gorman napsal(a):
>>> On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
>>>> Hmm, so it just took longer to hit the problem and observe kswapd0
>>>> spinning on my CPU again - it's not as endless as before - but
>>>> it still easily eats minutes - it helps to turn off Firefox or TB
>>>> (memory-hungry apps) so kswapd0 stops soon - and to restart those apps
>>>> again.
>>>> (And I still have like >1GB of cached memory)
>>>>
>>>
>>> I posted a "safe" patch that I believe explains why you are seeing what
>>> you are seeing. It does mean that there will still be some stalls due to
>>> THP because kswapd is not helping and it's avoiding the problem rather
>>> than trying to deal with it.
>>>
>>> Hence, I'm also going to post this patch even though I have not tested
>>> it myself. If you find it fixes the problem then it would be a
>>> preferable patch to the revert. It still is the case that the
>>> balance_pgdat() logic is in sore need of a rethink as it's pretty
>>> twisted right now.
>>>
>>
>>
>> Should I apply them all together for 3.7-rc5?
>>
>> 1) https://lkml.org/lkml/2012/11/5/308
>> 2) https://lkml.org/lkml/2012/11/12/113
>> 3) https://lkml.org/lkml/2012/11/12/151
>>
>
> Not all together. Test either 1+2 or 1+3. 1+2 is the safer choice but
> does nothing about THP stalls. 1+3 is a riskier version but depends on
> me being correct about what the root cause of the problem you see is.
>
> If both 1+2 and 1+3 work for you, I'd choose 1+3 for merging. If you only
> have the time to test one combination then it would be preferred that you
> test the safe option of 1+2.

So I've tested 1+2 for a few days - once I rebooted for another reason,
but today this happened to me (with ~2-day uptime):

For some reason my machine ran out of memory and the OOM killer killed
firefox and then even the whole X session.

I'm unsure whether it's related to those 2 patches - but I've never had
such an OOM failure before.

Should I experiment now with 1+3 - or is there a newer thing to test?

Zdenek

  X: page allocation failure: order:0, mode:0x200da
  Pid: 1126, comm: X Not tainted 3.7.0-rc5-00007-g95e21c5 #100
  Call Trace:
   [<ffffffff811354e9>] warn_alloc_failed+0xe9/0x140
   [<ffffffff81138eda>] __alloc_pages_nodemask+0x7fa/0xa40
   [<ffffffff81148fc3>] shmem_getpage_gfp+0x603/0x9d0
   [<ffffffff8100a166>] ? native_sched_clock+0x26/0x90
   [<ffffffff81149d6f>] shmem_fault+0x4f/0xa0
   [<ffffffff812ad69e>] shm_fault+0x1e/0x20
   [<ffffffff811571d3>] __do_fault+0x73/0x4d0
   [<ffffffff81131640>] ? generic_file_aio_write+0xb0/0x100
   [<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
   [<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
   [<ffffffff81081711>] ? get_parent_ip+0x11/0x50
  rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
  rsyslogd cpuset=/ mems_allowed=0
  Pid: 571, comm: rsyslogd Not tainted 3.7.0-rc5-00007-g95e21c5 #100
  Call Trace:
   [<ffffffff8154dfcb>] dump_header.isra.12+0x78/0x224
   [<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
   [<ffffffff81557842>] ? _raw_spin_unlock_irqrestore+0x42/0x80
   [<ffffffff81317c0e>] ? ___ratelimit+0x9e/0x130
   [<ffffffff81133ac3>] oom_kill_process+0x1d3/0x330
   [<ffffffff81134219>] out_of_memory+0x439/0x4a0
   [<ffffffff81139056>] __alloc_pages_nodemask+0x976/0xa40
   [<ffffffff811304b5>] ? find_get_page+0x5/0x230
   [<ffffffff811322a0>] filemap_fault+0x2d0/0x480
   [<ffffffff811571d3>] __do_fault+0x73/0x4d0
   [<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
   [<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
   [<ffffffff81081711>] ? get_parent_ip+0x11/0x50
   [<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
   [<ffffffff8155ae7d>] __do_page_fault+0x15d/0x4e0
   [<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
   [<ffffffff815578b5>] ? _raw_spin_unlock+0x35/0x60
   [<ffffffff811f8d9c>] ? proc_reg_read+0x8c/0xc0
   [<ffffffff815580a3>] ? error_sti+0x5/0x6
   [<ffffffff8131f55d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
   [<ffffffff8155b20e>] do_page_fault+0xe/0x10
   [<ffffffff81557ea2>] page_fault+0x22/0x30
  Mem-Info:
  DMA per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  CPU    1: hi:    0, btch:   1 usd:   0
  DMA32 per-cpu:
  CPU    0: hi:  186, btch:  31 usd:  30
  CPU    1: hi:  186, btch:  31 usd:   6
  Normal per-cpu:
  CPU    0: hi:  186, btch:  31 usd:  30
  CPU    1: hi:  186, btch:  31 usd:   0
  active_anon:900420 inactive_anon:28835 isolated_anon:0
   active_file:43 inactive_file:21 isolated_file:0
   unevictable:4 dirty:34 writeback:2 unstable:0
   free:20731 slab_reclaimable:8641 slab_unreclaimable:10446
   mapped:18325 shmem:243662 pagetables:7705 bounce:0
   free_cma:0
  DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB 
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB 
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 2951 3836 3836
  DMA32 free:55296kB min:51776kB low:64720kB high:77664kB 
active_anon:2834992kB inactive_anon:107924kB active_file:92kB 
inactive_file:52kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:3021968kB mlocked:0kB dirty:88kB writeback:0kB mapped:65460kB 
shmem:943100kB slab_reclaimable:11700kB slab_unreclaimable:8968kB 
kernel_stack:592kB pagetables:11852kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:180 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 885 885
  Normal free:15508kB min:15532kB low:19412kB high:23296kB 
active_anon:763796kB inactive_anon:6544kB active_file:80kB inactive_file:32kB 
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB 
mlocked:16kB dirty:48kB writeback:52kB mapped:6168kB shmem:27952kB 
slab_reclaimable:22864kB slab_unreclaimable:32800kB kernel_stack:2568kB 
pagetables:18968kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:234 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB 
2*2048kB 1*4096kB = 12120kB
  DMA32: 900*4kB 1512*8kB 513*16kB 635*32kB 109*64kB 8*128kB 0*256kB 0*512kB 
1*1024kB 1*2048kB 0*4096kB = 55296kB
  Normal: 452*4kB 363*8kB 225*16kB 139*32kB 30*64kB 4*128kB 1*256kB 0*512kB 
0*1024kB 1*2048kB 0*4096kB = 17496kB
  243783 total pagecache pages
  0 pages in swap cache
  Swap cache stats: add 0, delete 0, find 0/0
  Free swap  = 0kB
  Total swap = 0kB
  1032176 pages RAM
  42789 pages reserved
  553592 pages shared
  943414 pages non-shared
  [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
  [  351]     0   351    74685     1679     154        0             0 
systemd-journal
  [  544]     0   544     5863      107      16        0             0 bluetoothd
  [  545]     0   545    88977      725      56        0             0 
NetworkManager
  [  546]     0   546    30170      158      15        0             0 crond
  [  552]     0   552     1879       28       8        0             0 gpm
  [  557]     0   557     1092       37       8        0             0 acpid
  [  564]    81   564     6361      373      16        0          -900 dbus-daemon
  [  566]     0   566    61331      155      22        0             0 rsyslogd
  [  567]   498   567     7026      104      19        0             0 
avahi-daemon
  [  568]   498   568     6994       59      17        0             0 
avahi-daemon
  [  573]     0   573     1758       33       9        0             0 mcelog
  [  578]     0   578     5925       51      16        0             0 atd
  [  586]   105   586   121536     4270      56        0             0 polkitd
  [  593]     0   593    21967      205      48        0          -900 
modem-manager
  [  601]     0   601     1087       26       8        0             0 thinkfan
  [  619]     0   619   122722     1085     129        0             0 libvirtd
  [  630]    32   630     4812       68      13        0             0 rpcbind
  [  633]     0   633    20080      199      43        0         -1000 sshd
  [  653]    29   653     5905      116      16        0             0 rpc.statd
  [  700]     0   700    13173      190      28        0             0 
wpa_supplicant
  [  719]     0   719     4810       50      14        0             0 rpc.idmapd
  [  730]     0   730    28268       36      10        0             0 rpc.rquotad
  [  766]     0   766     6030      153      15        0             0 rpc.mountd
  [  806]    99   806     3306       45      11        0             0 dnsmasq
  [  985]     0   985    21219      150      46        0             0 login
  [  988]     0   988   260408      355      48        0             0 
console-kit-dae
  [ 1053] 11641  1053    28706      241      14        0             0 bash
  [ 1097] 11641  1097    27972       58      10        0             0 startx
  [ 1125] 11641  1125     3487       48      13        0             0 xinit
  [ 1126] 11641  1126    80028    35289     154        0             0 X
  [ 1138] 11641  1138   142989      930     122        0             0 
gnome-session
  [ 1151] 11641  1151     4013       64      12        0             0 dbus-launch
  [ 1152] 11641  1152     6069       82      17        0             0 dbus-daemon
  [ 1154] 11641  1154    85449      162      36        0             0 
at-spi-bus-laun
  [ 1158] 11641  1158     6103      116      17        0             0 dbus-daemon
  [ 1161] 11641  1161    32328      174      33        0             0 
at-spi2-registr
  [ 1172] 11641  1172     4013       65      13        0             0 dbus-launch
  [ 1173] 11641  1173     6350      265      18        0             0 dbus-daemon
  [ 1177] 11641  1177    37416      416      29        0             0 gconfd-2
  [ 1184] 11641  1184   117556     1203      44        0             0 
gnome-keyring-d
  [ 1185] 11641  1185   224829     2236     177        0             0 
gnome-settings-
  [ 1194]     0  1194    57227      786      46        0             0 upowerd
  [ 1226] 11641  1226    77392      190      36        0             0 gvfsd
  [ 1246] 11641  1246   118201      772      90        0             0 pulseaudio
  [ 1247]   496  1247    41161       59      17        0             0 
rtkit-daemon
  [ 1252] 11641  1252    29494      205      58        0             0 
gconf-helper
  [ 1253]   106  1253    81296      355      46        0             0 colord
  [ 1257] 11641  1257    59080     1574      60        0             0 openbox
  [ 1258] 11641  1258   185569     3216     146        0             0 gnome-panel
  [ 1264] 11641  1264    64102      229      27        0             0 
dconf-service
  [ 1268] 11641  1268   139203      858     116        0             0 
gnome-user-shar
  [ 1269] 11641  1269   268645    27442     334        0             0 pidgin
  [ 1270] 11641  1270   142642     1064     117        0             0 
bluetooth-apple
  [ 1271] 11641  1271   193218     1775     175        0             0 nm-applet
  [ 1272] 11641  1272   220194     1810     138        0             0 
gnome-sound-app
  [ 1285] 11641  1285    80914      632      45        0             0 
gvfs-udisks2-vo
  [ 1287]     0  1287    88101      599      41        0             0 udisksd
  [ 1295] 11641  1295   177162    14140     150        0             0 wnck-applet
  [ 1297] 11641  1297   281043     3161     199        0             0 
clock-applet
  [ 1299] 11641  1299   142537     1053     120        0             0 
cpufreq-applet
  [ 1302] 11641  1302   141960      986     113        0             0 
notification-ar
  [ 1340] 11641  1340   190026     6265     144        0             0 
gnome-terminal
  [ 1346] 11641  1346     2123       35      10        0             0 
gnome-pty-helpe
  [ 1347] 11641  1347    28719      253      11        0             0 bash
  [ 1858] 11641  1858    10895      101      27        0             0 xfconfd
  [ 2052] 11641  2052    28720      255      11        0             0 bash
  [ 6239] 11641  6239    73437      711      88        0             0 kdeinit4
  [ 6240] 11641  6240    83952      717     101        0             0 klauncher
  [ 6242] 11641  6242   126497     1479     172        0             0 kded4
  [ 6244] 11641  6244     2977       48      11        0             0 gam_server
  [10804] 11641 10804   101320      307      47        0             0 gvfsd-http
  [12175]     0 12175    27197       32      10        0             0 agetty
  [12249] 11641 12249    28719      252      14        0             0 bash
  [14862]     0 14862    51773      344      55        0             0 cupsd
  [14868]     4 14868    18105      158      39        0             0 cups-polld
  [16728] 11641 16728    28691      244      12        0             0 bash
  [16975]     0 16975     9109      253      23        0         -1000 systemd-udevd
  [17618]     0 17618     8245       87      22        0             0 systemd-logind
  [ 3133] 11641  3133    43721      132      40        0             0 su
  [ 3136]     0  3136    28564      139      12        0             0 bash
  [ 3983] 11641  3983    43722      134      41        0             0 su
  [ 3986]     0  3986    28564      144      13        0             0 bash
  [16350] 11641 16350    28691      245      14        0             0 bash
  [31228] 11641 31228    28691      245      11        0             0 bash
  [31922] 11641 31922    28719      250      13        0             0 bash
  [ 2340] 11641  2340    28691      245      15        0             0 bash
  [12586]    38 12586     7851      150      19        0             0 ntpd
  [32658] 11641 32658    41192      424      35        0             0 mc
  [32660] 11641 32660    28692      245      13        0             0 bash
  [29193] 11641 29193   713846   414344    1614        0             0 firefox
  [10971] 11641 10971    43722      133      43        0             0 su
  [10974]     0 10974    28564      132      12        0             0 bash
  [11343]     0 11343    28497       66      11        0             0 ksmtuned
  [11387] 11641 11387    28719      254      11        0             0 bash
  [11450] 11641 11450    28691      246      13        0             0 bash
  [11576] 11641 11576    43722      133      40        0             0 su
  [11579]     0 11579    28564      141      13        0             0 bash
  [12106] 11641 12106    28691      244      12        0             0 bash
  [12141] 11641 12141    43722      132      44        0             0 su
  [12144]     0 12144    28564      140      11        0             0 bash
  [12264] 11641 12264    28691      245      11        0             0 bash
  [12299] 11641 12299    43721      133      40        0             0 su
  [12302]     0 12302    28564      137      12        0             0 bash
  [26024] 11641 26024    28691      245      13        0             0 bash
  [26083] 11641 26083    28691      245      13        0             0 bash
  [28235] 11641 28235    43721      132      42        0             0 su
  [28238]     0 28238    28564      143      13        0             0 bash
  [29460] 11641 29460    43721      132      42        0             0 su
  [29463]     0 29463    28564      137      12        0             0 bash
  [29758] 11641 29758    28720      256      12        0             0 bash
  [29864] 11641 29864    41916     1153      36        0             0 mc
  [29866] 11641 29866    28728      257      11        0             0 bash
  [32750]     0 32750    23164     2994      47        0             0 dhclient
  [  323]     0   323    24081      471      48        0             0 sendmail
  [  347]    51   347    20347      367      38        0             0 sendmail
  [  907] 11641   907   379562   159766     707        0             0 thunderbird
  [ 6340] 11641  6340    28719      251      12        0             0 bash
  [ 6790] 11641  6790    80307      620     101        0             0 xfce4-notifyd
  [ 6844]     0  6844    26669       23       9        0             0 sleep
  Out of memory: Kill process 29193 (firefox) score 420 or sacrifice child
  Killed process 29193 (firefox) total-vm:2855384kB, anon-rss:1653868kB, file-rss:3508kB
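The "score 420" in the kill message above is the kernel's badness value for the
task, visible at any time through procfs. A minimal sketch (not from the
original thread; assumes a Linux /proc layout) that lists the likeliest OOM
victims the way the killer would rank them:

```python
import os

def oom_scores():
    """Return {pid: (score, adj, comm)} for currently running processes.

    /proc/<pid>/oom_score is the kernel's live badness value (roughly the
    task's memory footprint as a fraction of usable memory, shifted by
    oom_score_adj); the OOM killer picks the highest score.
    """
    out = {}
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open(f'/proc/{pid}/oom_score') as f:
                score = int(f.read())
            with open(f'/proc/{pid}/oom_score_adj') as f:
                adj = int(f.read())
            with open(f'/proc/{pid}/comm') as f:
                comm = f.read().strip()
        except (FileNotFoundError, PermissionError, ProcessLookupError):
            continue  # process exited while we were scanning
        out[int(pid)] = (score, adj, comm)
    return out

if __name__ == '__main__':
    # Top five candidates, highest badness first -- firefox's 414344-page
    # RSS is why it sits at the top of a table like the one above.
    ranked = sorted(oom_scores().items(), key=lambda kv: -kv[1][0])
    for pid, (score, adj, comm) in ranked[:5]:
        print(f'{pid:>7} {score:>6} {adj:>6} {comm}')
```

The -1000 values in the table (sshd, systemd-udevd) are oom_score_adj
settings that exempt a task from the killer entirely.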
   [<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
   [<ffffffff8115b12a>] __get_user_pages+0x12a/0x530
   [<ffffffff8115b575>] get_dump_page+0x45/0x60
   [<ffffffff811eec6d>] elf_core_dump+0x16bd/0x1960
   [<ffffffff811edf86>] ? elf_core_dump+0x9d6/0x1960
   [<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
   [<ffffffff815546ae>] ? mutex_unlock+0xe/0x10
   [<ffffffff8118ed63>] ? do_truncate+0x73/0xa0
   [<ffffffff811f55a1>] do_coredump+0xa21/0xeb0
   [<ffffffff810b22a0>] ? debug_check_no_locks_freed+0xe0/0x170
   [<ffffffff810abe8d>] ? trace_hardirqs_off+0xd/0x10
   [<ffffffff8105a961>] get_signal_to_deliver+0x2e1/0x960
   [<ffffffff8100236f>] do_signal+0x3f/0x9a0
   [<ffffffff81540000>] ? pci_fixup_msi_k8t_onboard_sound+0x7d/0x97
   [<ffffffff8154b565>] ? is_prefetch.isra.15+0x1a6/0x1fd
   [<ffffffff815580a3>] ? error_sti+0x5/0x6
   [<ffffffff81557cd1>] ? retint_signal+0x11/0x90
   [<ffffffff81002d70>] do_notify_resume+0x80/0xb0
   [<ffffffff81557d06>] retint_signal+0x46/0x90
  Mem-Info:
  DMA per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  CPU    1: hi:    0, btch:   1 usd:   0
  DMA32 per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   0
  CPU    1: hi:  186, btch:  31 usd:  30
  Normal per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   0
  CPU    1: hi:  186, btch:  31 usd:  30
  active_anon:900420 inactive_anon:28835 isolated_anon:0
   active_file:8 inactive_file:0 isolated_file:0
   unevictable:4 dirty:34 writeback:2 unstable:0
   free:20724 slab_reclaimable:8641 slab_unreclaimable:10446
   mapped:18325 shmem:243662 pagetables:7705 bounce:0
   free_cma:0
  DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB 
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB 
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 2951 3836 3836
  DMA32 free:55404kB min:51776kB low:64720kB high:77664kB 
active_anon:2834992kB inactive_anon:107924kB active_file:0kB 
inactive_file:28kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:3021968kB mlocked:0kB dirty:0kB writeback:0kB mapped:65460kB 
shmem:943100kB slab_reclaimable:11700kB slab_unreclaimable:8968kB 
kernel_stack:592kB pagetables:11852kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:129 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 885 885
  Normal free:15364kB min:15532kB low:19412kB high:23296kB 
active_anon:763796kB inactive_anon:6544kB active_file:0kB inactive_file:24kB 
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB 
mlocked:16kB dirty:48kB writeback:52kB mapped:6168kB shmem:27952kB 
slab_reclaimable:22864kB slab_unreclaimable:32800kB kernel_stack:2568kB 
pagetables:18968kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:379 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB 
2*2048kB 1*4096kB = 12120kB
  DMA32: 896*4kB 1512*8kB 513*16kB 635*32kB 109*64kB 8*128kB 0*256kB 0*512kB 
1*1024kB 1*2048kB 0*4096kB = 55280kB
  Normal: 403*4kB 377*8kB 225*16kB 139*32kB 30*64kB 4*128kB 1*256kB 0*512kB 
0*1024kB 1*2048kB 0*4096kB = 17412kB
  243733 total pagecache pages
  rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
  rsyslogd cpuset=/ mems_allowed=0
  Pid: 571, comm: rsyslogd Not tainted 3.7.0-rc5-00007-g95e21c5 #100
  Call Trace:
   [<ffffffff8154dfcb>] dump_header.isra.12+0x78/0x224
   [<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
   [<ffffffff81557842>] ? _raw_spin_unlock_irqrestore+0x42/0x80
   [<ffffffff81317c0e>] ? ___ratelimit+0x9e/0x130
   [<ffffffff81133ac3>] oom_kill_process+0x1d3/0x330
   [<ffffffff81134219>] out_of_memory+0x439/0x4a0
   [<ffffffff81139056>] __alloc_pages_nodemask+0x976/0xa40
   [<ffffffff811304b5>] ? find_get_page+0x5/0x230
   [<ffffffff811322a0>] filemap_fault+0x2d0/0x480
   [<ffffffff811571d3>] __do_fault+0x73/0x4d0
   [<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
   [<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
   [<ffffffff81081711>] ? get_parent_ip+0x11/0x50
   [<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
   [<ffffffff8155ae7d>] __do_page_fault+0x15d/0x4e0
   [<ffffffff815578b5>] ? _raw_spin_unlock+0x35/0x60
   [<ffffffff811f8d9c>] ? proc_reg_read+0x8c/0xc0
   [<ffffffff815580a3>] ? error_sti+0x5/0x6
   [<ffffffff8131f55d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
   [<ffffffff8155b20e>] do_page_fault+0xe/0x10
   [<ffffffff81557ea2>] page_fault+0x22/0x30
  Mem-Info:
  DMA per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  CPU    1: hi:    0, btch:   1 usd:   0
  DMA32 per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   0
  CPU    1: hi:  186, btch:  31 usd:  30
  Normal per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   1
  CPU    1: hi:  186, btch:  31 usd:  46
  active_anon:900420 inactive_anon:28835 isolated_anon:0
   active_file:0 inactive_file:7 isolated_file:0
   unevictable:4 dirty:0 writeback:2 unstable:0
   free:20691 slab_reclaimable:8641 slab_unreclaimable:10446
   mapped:18325 shmem:243662 pagetables:7705 bounce:0
   free_cma:0
  DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB 
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB 
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 2951 3836 3836
  DMA32 free:55280kB min:51776kB low:64720kB high:77664kB 
active_anon:2834992kB inactive_anon:107924kB active_file:0kB 
inactive_file:12kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:3021968kB mlocked:0kB dirty:0kB writeback:0kB mapped:65460kB 
shmem:943100kB slab_reclaimable:11700kB slab_unreclaimable:8968kB 
kernel_stack:592kB pagetables:11852kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:520 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 885 885
  Normal free:15364kB min:15532kB low:19412kB high:23296kB 
active_anon:763796kB inactive_anon:6544kB active_file:0kB inactive_file:16kB 
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB 
mlocked:16kB dirty:48kB writeback:52kB mapped:6168kB shmem:27952kB 
slab_reclaimable:22864kB slab_unreclaimable:32800kB kernel_stack:2568kB 
pagetables:18968kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:571 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB 
2*2048kB 1*4096kB = 12120kB
  DMA32: 896*4kB 1512*8kB 513*16kB 635*32kB 109*64kB 8*128kB 0*256kB 0*512kB 
1*1024kB 1*2048kB 0*4096kB = 55280kB
  Normal: 403*4kB 377*8kB 225*16kB 139*32kB 30*64kB 4*128kB 1*256kB 0*512kB 
0*1024kB 1*2048kB 0*4096kB = 17412kB
  243733 total pagecache pages
  0 pages in swap cache
  Swap cache stats: add 0, delete 0, find 0/0
  Free swap  = 0kB
  Total swap = 0kB
  0 pages in swap cache
  Swap cache stats: add 0, delete 0, find 0/0
  Free swap  = 0kB
  Total swap = 0kB
  1032176 pages RAM
  42789 pages reserved
  553579 pages shared
  943538 pages non-shared
  1032176 pages RAM
  42789 pages reserved
  553576 pages shared
  943549 pages non-shared
  [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
  [  351]     0   351    74685     1682     154        0             0 systemd-journal
  [  544]     0   544     5863      107      16        0             0 bluetoothd
  [  545]     0   545    88977      725      56        0             0 NetworkManager
  [  546]     0   546    30170      158      15        0             0 crond
  [  552]     0   552     1879       28       8        0             0 gpm
  [  557]     0   557     1092       37       8        0             0 acpid
  [  564]    81   564     6361      373      16        0          -900 dbus-daemon
  [  566]     0   566    61331      155      22        0             0 rsyslogd
  [  567]   498   567     7026      104      19        0             0 avahi-daemon
  [  568]   498   568     6994       59      17        0             0 avahi-daemon
  [  573]     0   573     1758       33       9        0             0 mcelog
  [  578]     0   578     5925       51      16        0             0 atd
  [  586]   105   586   121536     4270      56        0             0 polkitd
  [  593]     0   593    21967      205      48        0          -900 modem-manager
  [  601]     0   601     1087       26       8        0             0 thinkfan
  [  619]     0   619   122722     1085     129        0             0 libvirtd
  [  630]    32   630     4812       68      13        0             0 rpcbind
  [  633]     0   633    20080      199      43        0         -1000 sshd
  [  653]    29   653     5905      116      16        0             0 rpc.statd
  [  700]     0   700    13173      190      28        0             0 wpa_supplicant
  [  719]     0   719     4810       50      14        0             0 rpc.idmapd
  [  730]     0   730    28268       36      10        0             0 rpc.rquotad
  [  766]     0   766     6030      153      15        0             0 rpc.mountd
  [  806]    99   806     3306       45      11        0             0 dnsmasq
  [  985]     0   985    21219      150      46        0             0 login
  [  988]     0   988   260408      355      48        0             0 console-kit-dae
  [ 1053] 11641  1053    28706      241      14        0             0 bash
  [ 1097] 11641  1097    27972       58      10        0             0 startx
  [ 1125] 11641  1125     3487       48      13        0             0 xinit
  [ 1126] 11641  1126    80028    35379     154        0             0 X
  [ 1138] 11641  1138   142989      930     122        0             0 gnome-session
  [ 1151] 11641  1151     4013       64      12        0             0 dbus-launch
  [ 1152] 11641  1152     6069       82      17        0             0 dbus-daemon
  [ 1154] 11641  1154    85449      162      36        0             0 at-spi-bus-laun
  [ 1158] 11641  1158     6103      116      17        0             0 dbus-daemon
  [ 1161] 11641  1161    32328      174      33        0             0 at-spi2-registr
  [ 1172] 11641  1172     4013       65      13        0             0 dbus-launch
  [ 1173] 11641  1173     6350      265      18        0             0 dbus-daemon
  [ 1177] 11641  1177    37416      416      29        0             0 gconfd-2
  [ 1184] 11641  1184   117556     1203      44        0             0 gnome-keyring-d
  [ 1185] 11641  1185   224829     2236     177        0             0 gnome-settings-
  [ 1194]     0  1194    57227      786      46        0             0 upowerd
  [ 1226] 11641  1226    77392      190      36        0             0 gvfsd
  [ 1246] 11641  1246   118201      772      90        0             0 pulseaudio
  [ 1247]   496  1247    41161       59      17        0             0 rtkit-daemon
  [ 1252] 11641  1252    29494      205      58        0             0 gconf-helper
  [ 1253]   106  1253    81296      355      46        0             0 colord
  [ 1257] 11641  1257    59080     1574      60        0             0 openbox
  [ 1258] 11641  1258   185569     3216     146        0             0 gnome-panel
  [ 1264] 11641  1264    64102      229      27        0             0 dconf-service
  [ 1268] 11641  1268   139203      858     116        0             0 gnome-user-shar
  [ 1269] 11641  1269   268645    27442     334        0             0 pidgin
  [ 1270] 11641  1270   142642     1064     117        0             0 bluetooth-apple
  [ 1271] 11641  1271   193218     1775     175        0             0 nm-applet
  [ 1272] 11641  1272   220194     1810     138        0             0 gnome-sound-app
  [ 1285] 11641  1285    80914      632      45        0             0 gvfs-udisks2-vo
  [ 1287]     0  1287    88101      599      41        0             0 udisksd
  [ 1295] 11641  1295   177162    14140     150        0             0 wnck-applet
  [ 1297] 11641  1297   281043     3161     199        0             0 clock-applet
  [ 1299] 11641  1299   142537     1051     120        0             0 cpufreq-applet
  [ 1302] 11641  1302   141960      986     113        0             0 notification-ar
  [ 1340] 11641  1340   190026     6265     144        0             0 gnome-terminal
  [ 1346] 11641  1346     2123       35      10        0             0 gnome-pty-helpe
  [ 1347] 11641  1347    28719      253      11        0             0 bash
  [ 1858] 11641  1858    10895      101      27        0             0 xfconfd
  X: page allocation failure: order:0, mode:0x200da
  Pid: 1126, comm: X Not tainted 3.7.0-rc5-00007-g95e21c5 #100
  [ 2052] 11641  2052    28720      255      11        0             0 bash
  [ 6239] 11641  6239    73437      711      88        0             0 kdeinit4
  [ 6240] 11641  6240    83952      717     101        0             0 klauncher
  Call Trace:
   [<ffffffff811354e9>] warn_alloc_failed+0xe9/0x140
   [<ffffffff81138eda>] __alloc_pages_nodemask+0x7fa/0xa40
   [<ffffffff81148fc3>] shmem_getpage_gfp+0x603/0x9d0
   [<ffffffff8100a166>] ? native_sched_clock+0x26/0x90
   [<ffffffff81149d6f>] shmem_fault+0x4f/0xa0
   [<ffffffff812ad69e>] shm_fault+0x1e/0x20
   [<ffffffff811571d3>] __do_fault+0x73/0x4d0
   [<ffffffff81131640>] ? generic_file_aio_write+0xb0/0x100
   [<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
   [<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
   [<ffffffff81081711>] ? get_parent_ip+0x11/0x50
   [<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
   [<ffffffff8115b12a>] __get_user_pages+0x12a/0x530
   [<ffffffff8115b575>] get_dump_page+0x45/0x60
   [<ffffffff811eec6d>] elf_core_dump+0x16bd/0x1960
   [<ffffffff811edf86>] ? elf_core_dump+0x9d6/0x1960
   [<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
   [<ffffffff815546ae>] ? mutex_unlock+0xe/0x10
   [<ffffffff8118ed63>] ? do_truncate+0x73/0xa0
   [<ffffffff811f55a1>] do_coredump+0xa21/0xeb0
   [<ffffffff810b22a0>] ? debug_check_no_locks_freed+0xe0/0x170
   [<ffffffff810abe8d>] ? trace_hardirqs_off+0xd/0x10
   [<ffffffff8105a961>] get_signal_to_deliver+0x2e1/0x960
   [<ffffffff8100236f>] do_signal+0x3f/0x9a0
   [<ffffffff81540000>] ? pci_fixup_msi_k8t_onboard_sound+0x7d/0x97
   [<ffffffff8154b565>] ? is_prefetch.isra.15+0x1a6/0x1fd
   [<ffffffff815580a3>] ? error_sti+0x5/0x6
   [<ffffffff81557cd1>] ? retint_signal+0x11/0x90
   [<ffffffff81002d70>] do_notify_resume+0x80/0xb0
   [<ffffffff81557d06>] retint_signal+0x46/0x90
  Mem-Info:
  DMA per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  CPU    1: hi:    0, btch:   1 usd:   0
  DMA32 per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   0
  CPU    1: hi:  186, btch:  31 usd:   0
  Normal per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   1
  CPU    1: hi:  186, btch:  31 usd:  14
  active_anon:900420 inactive_anon:28978 isolated_anon:0
   active_file:22 inactive_file:24 isolated_file:0
   unevictable:4 dirty:5 writeback:0 unstable:0
   free:20346 slab_reclaimable:8656 slab_unreclaimable:10414
   mapped:18437 shmem:243751 pagetables:7717 bounce:0
   free_cma:0
  DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB 
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB 
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 2951 3836 3836
  DMA32 free:55316kB min:51776kB low:64720kB high:77664kB 
active_anon:2834992kB inactive_anon:108408kB active_file:52kB 
inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:3021968kB mlocked:0kB dirty:20kB writeback:0kB mapped:65916kB 
shmem:943452kB slab_reclaimable:11716kB slab_unreclaimable:8904kB 
kernel_stack:488kB pagetables:11880kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:3103 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 885 885
  Normal free:13948kB min:15532kB low:19412kB high:23296kB 
active_anon:763796kB inactive_anon:6632kB active_file:36kB inactive_file:40kB 
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB 
mlocked:16kB dirty:0kB writeback:0kB mapped:6160kB shmem:27956kB 
slab_reclaimable:22908kB slab_unreclaimable:32736kB kernel_stack:2352kB 
pagetables:18988kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:602 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB 
2*2048kB 1*4096kB = 12120kB
  DMA32: 883*4kB 1525*8kB 513*16kB 637*32kB 109*64kB 8*128kB 0*256kB 0*512kB 
1*1024kB 1*2048kB 0*4096kB = 55396kB
  Normal: 269*4kB 255*8kB 227*16kB 141*32kB 30*64kB 4*128kB 1*256kB 0*512kB 
0*1024kB 1*2048kB 0*4096kB = 15996kB
  243797 total pagecache pages
  0 pages in swap cache
  Swap cache stats: add 0, delete 0, find 0/0
  Free swap  = 0kB
  Total swap = 0kB
  1032176 pages RAM
  42789 pages reserved
  553637 pages shared
  943817 pages non-shared
  X: page allocation failure: order:0, mode:0x200da
  Pid: 1126, comm: X Not tainted 3.7.0-rc5-00007-g95e21c5 #100
  Call Trace:
   [<ffffffff811354e9>] warn_alloc_failed+0xe9/0x140
   [<ffffffff81138eda>] __alloc_pages_nodemask+0x7fa/0xa40
   [<ffffffff81148fc3>] shmem_getpage_gfp+0x603/0x9d0
   [<ffffffff8100a166>] ? native_sched_clock+0x26/0x90
   [<ffffffff81149d6f>] shmem_fault+0x4f/0xa0
   [<ffffffff812ad69e>] shm_fault+0x1e/0x20
   [<ffffffff811571d3>] __do_fault+0x73/0x4d0
   [<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
   [<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
   [<ffffffff81081711>] ? get_parent_ip+0x11/0x50
   [<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
   [<ffffffff8115b12a>] __get_user_pages+0x12a/0x530
   [<ffffffff815578b5>] ? _raw_spin_unlock+0x35/0x60
   [<ffffffff8115b575>] get_dump_page+0x45/0x60
   [<ffffffff811eec6d>] elf_core_dump+0x16bd/0x1960
   [<ffffffff811edf86>] ? elf_core_dump+0x9d6/0x1960
   [<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
   [<ffffffff815546ae>] ? mutex_unlock+0xe/0x10
   [<ffffffff8118ed63>] ? do_truncate+0x73/0xa0
   [<ffffffff811f55a1>] do_coredump+0xa21/0xeb0
   [<ffffffff810b22a0>] ? debug_check_no_locks_freed+0xe0/0x170
   [<ffffffff810abe8d>] ? trace_hardirqs_off+0xd/0x10
   [<ffffffff8105a961>] get_signal_to_deliver+0x2e1/0x960
   [<ffffffff8100236f>] do_signal+0x3f/0x9a0
   [<ffffffff81540000>] ? pci_fixup_msi_k8t_onboard_sound+0x7d/0x97
   [<ffffffff8154b565>] ? is_prefetch.isra.15+0x1a6/0x1fd
   [<ffffffff815580a3>] ? error_sti+0x5/0x6
   [<ffffffff81557cd1>] ? retint_signal+0x11/0x90
   [<ffffffff81002d70>] do_notify_resume+0x80/0xb0
   [<ffffffff81557d06>] retint_signal+0x46/0x90
  Mem-Info:
  DMA per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  CPU    1: hi:    0, btch:   1 usd:   0
  DMA32 per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   0
  CPU    1: hi:  186, btch:  31 usd:   0
  Normal per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   1
  CPU    1: hi:  186, btch:  31 usd:  24
  active_anon:900420 inactive_anon:28978 isolated_anon:0
   active_file:22 inactive_file:24 isolated_file:19
   unevictable:4 dirty:5 writeback:0 unstable:0
   free:20222 slab_reclaimable:8656 slab_unreclaimable:10414
   mapped:18437 shmem:243751 pagetables:7717 bounce:0
   free_cma:0
  DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB 
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB 
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 2951 3836 3836
  DMA32 free:55316kB min:51776kB low:64720kB high:77664kB 
active_anon:2834992kB inactive_anon:108408kB active_file:52kB 
inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:3021968kB mlocked:0kB dirty:20kB writeback:0kB mapped:65916kB 
shmem:943452kB slab_reclaimable:11716kB slab_unreclaimable:8904kB 
kernel_stack:488kB pagetables:11880kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:3940 all_unreclaimable? yes
  [ 6242] 11641  6242   126497     1479     172        0             0 kded4
  [ 6244] 11641  6244     2977       48      11        0             0 gam_server
  [10804] 11641 10804   101320      307      47        0             0 gvfsd-http
  [12175]     0 12175    27197       32      10        0             0 agetty
  [12249] 11641 12249    28719      252      14        0             0 bash
  [14862]     0 14862    51773      344      55        0             0 cupsd
  [14868]     4 14868    18105      158      39        0             0 cups-polld
  [16728] 11641 16728    28691      244      12        0             0 bash
  [16975]     0 16975     9109      253      23        0         -1000 systemd-udevd
  [17618]     0 17618     8245       87      22        0             0 systemd-logind
  [ 3133] 11641  3133    43721      132      40        0             0 su
  [ 3136]     0  3136    28564      139      12        0             0 bash
  [ 3983] 11641  3983    43722      134      41        0             0 su
  [ 3986]     0  3986    28564      144      13        0             0 bash
  [16350] 11641 16350    28691      245      14        0             0 bash
  [31228] 11641 31228    28691      245      11        0             0 bash
  [31922] 11641 31922    28719      250      13        0             0 bash
  [ 2340] 11641  2340    28691      245      15        0             0 bash
  [12586]    38 12586     7851      150      19        0             0 ntpd
  [32658] 11641 32658    41192      424      35        0             0 mc
  [32660] 11641 32660    28692      245      13        0             0 bash
  [10971] 11641 10971    43722      133      43        0             0 su
  [10974]     0 10974    28564      132      12        0             0 bash
  [11343]     0 11343    28497       66      11        0             0 ksmtuned
  [11387] 11641 11387    28719      254      11        0             0 bash
  [11450] 11641 11450    28691      246      13        0             0 bash
  [11576] 11641 11576    43722      133      40        0             0 su
  [11579]     0 11579    28564      141      13        0             0 bash
  [12106] 11641 12106    28691      244      12        0             0 bash
  [12141] 11641 12141    43722      132      44        0             0 su
  [12144]     0 12144    28564      140      11        0             0 bash
  [12264] 11641 12264    28691      245      11        0             0 bash
  [12299] 11641 12299    43721      133      40        0             0 su
  [12302]     0 12302    28564      137      12        0             0 bash
  [26024] 11641 26024    28691      245      13        0             0 bash
  [26083] 11641 26083    28691      245      13        0             0 bash
  [28235] 11641 28235    43721      132      42        0             0 su
  [28238]     0 28238    28564      143      13        0             0 bash
  [29460] 11641 29460    43721      132      42        0             0 su
  [29463]     0 29463    28564      137      12        0             0 bash
  [29758] 11641 29758    28720      256      12        0             0 bash
  [29864] 11641 29864    41916     1153      36        0             0 mc
  [29866] 11641 29866    28728      257      11        0             0 bash
  [32750]     0 32750    23164     2994      47        0             0 dhclient
  [  323]     0   323    24081      471      48        0             0 sendmail
  [  347]    51   347    20347      367      38        0             0 sendmail
  [  907] 11641   907   379562   159766     707        0             0 thunderbird
  [ 6340] 11641  6340    28719      251      12        0             0 bash
  [ 6790] 11641  6790    80307      620     101        0             0 xfce4-notifyd
  [ 6844]     0  6844    26669       23       9        0             0 sleep
  Out of memory: Kill process 907 (thunderbird) score 162 or sacrifice child
  Killed process 907 (thunderbird) total-vm:1518248kB, anon-rss:638476kB, file-rss:588kB
  lowmem_reserve[]: 0 0 885 885
  Normal free:12832kB min:15532kB low:19412kB high:23296kB 
active_anon:763796kB inactive_anon:6632kB active_file:36kB inactive_file:40kB 
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB 
mlocked:16kB dirty:0kB writeback:0kB mapped:6160kB shmem:27956kB 
slab_reclaimable:22908kB slab_unreclaimable:32736kB kernel_stack:2352kB 
pagetables:18988kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:1742 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB 
2*2048kB 1*4096kB = 12120kB
  DMA32: 883*4kB 1525*8kB 513*16kB 637*32kB 109*64kB 8*128kB 0*256kB 0*512kB 
1*1024kB 1*2048kB 0*4096kB = 55396kB
  Normal: 270*4kB 173*8kB 198*16kB 141*32kB 30*64kB 4*128kB 1*256kB 0*512kB 
0*1024kB 1*2048kB 0*4096kB = 14880kB
  243797 total pagecache pages
  0 pages in swap cache
  Swap cache stats: add 0, delete 0, find 0/0
  Free swap  = 0kB
  Total swap = 0kB
  1032176 pages RAM
  42789 pages reserved
  553659 pages shared
  937056 pages non-shared
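The repeated "Free swap = 0kB / Total swap = 0kB" lines show the machine is
running without any swap configured, so anonymous memory cannot be reclaimed
at all and killing a process is the only way out once reclaim stalls. A small
sketch (not from the thread) of how to confirm that state on a live system:

```shell
# Active swap devices/files; only the header line means no swap,
# matching the "Total swap = 0kB" in the OOM report above.
cat /proc/swaps

# The same totals the OOM report prints come from /proc/meminfo.
grep -E '^Swap(Total|Free):' /proc/meminfo
```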

  SysRq : Emergency Sync
  Emergency Sync complete
  SysRq : Emergency Remount R/O
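The "SysRq : Emergency Sync" / "Emergency Remount R/O" lines above come from
the magic-SysRq keyboard chords. For reference, the same functions can be
driven from a root shell by writing the command letter to
/proc/sysrq-trigger; the write lines below are commented out because they act
on the running system:

```shell
# Which SysRq functions are currently permitted (bitmask; 1 means
# everything is enabled, 0 means the interface is disabled).
cat /proc/sys/kernel/sysrq

# Equivalents of the chords seen in the log (root only):
#   echo s > /proc/sysrq-trigger   # emergency sync
#   echo u > /proc/sysrq-trigger   # emergency remount read-only
#   echo m > /proc/sysrq-trigger   # dump Mem-Info, as in the reports above
```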
^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: kswapd0: excessive CPU usage
  2012-11-18 19:00                                         ` Zdenek Kabelac
@ 2012-11-18 19:07                                           ` Jiri Slaby
  0 siblings, 0 replies; 52+ messages in thread
From: Jiri Slaby @ 2012-11-18 19:07 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Mel Gorman, Seth Jennings, Jiri Slaby, Valdis.Kletnieks,
	linux-mm, LKML, Andrew Morton, Rik van Riel, Robert Jennings

On 11/18/2012 08:00 PM, Zdenek Kabelac wrote:
> For some reason my machine went out of memory and OOM killed
> firefox and then even the whole X session.
> 
> Unsure whether it's related to those 2 patches - but I've never had
> such an OOM failure before.

As I wrote, this would be me:
https://lkml.org/lkml/2012/11/15/150

There is no -next tree for Friday that would already contain the set.
So for now, it should be enough for you to apply:
https://lkml.org/lkml/2012/11/15/95

Or, alternatively, if you use a brand-new systemd, it likes to fork-bomb
via udev.

thanks,
-- 
js
suse labs


* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-16 19:51                                       ` Andrew Morton
@ 2012-11-20  1:43                                         ` Valdis.Kletnieks
  0 siblings, 0 replies; 52+ messages in thread
From: Valdis.Kletnieks @ 2012-11-20  1:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Josh Boyer, Mel Gorman, Zdenek Kabelac, Seth Jennings,
	Jiri Slaby, Jiri Slaby, linux-mm, LKML, Rik van Riel,
	Robert Jennings


On Fri, 16 Nov 2012 11:51:24 -0800, Andrew Morton said:
> On Fri, 16 Nov 2012 14:14:47 -0500
> Josh Boyer <jwboyer@gmail.com> wrote:
>
> > > The temptation is to supply a patch that checks if kswapd was woken for
> > > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
> > > backed up by proper testing. As 3.7 is very close to release and this is
> > > not a bug we should release with, a safer path is to revert "mm: remove
> > > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
> > > balance_pgdat() logic in general.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> >
> > Does anyone know if this is queued to go into 3.7 somewhere?  I looked
> > a bit and can't find it in a tree.  We have a few reports of Fedora
> > rawhide users hitting this.
>
> Still thinking about it.  We're reverting quite a lot of material
> lately.
> mm-revert-mm-vmscan-scale-number-of-pages-reclaimed-by-reclaim-compaction-based-on-failures.patch
> and revert-mm-fix-up-zone-present-pages.patch are queued for 3.7.
>
> I'll toss this one in there as well, but I can't say I'm feeling
> terribly confident.  How is Valdis's machine nowadays?

I admit possibly having lost the plot.  With the two patches you mention stuck
on top of next-20121114, I'm seeing fewer kswapd issues but am still tripping
over them on occasion.  It seems to be related to uptime - I don't see any for
a few hours, but then they become more frequent.  I was seeing quite a few of
them yesterday after I had a 30-hour uptime.

I'll stick Mel's "mm: remove __GFP_NO_KSWAPD" patch on this evening and let you
know what happens (might be a day or two before I have definitive results, as
usually my laptop gets rebooted twice a day).


[-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-12 11:37                                   ` [PATCH] Revert "mm: remove __GFP_NO_KSWAPD" Mel Gorman
  2012-11-16 19:14                                     ` Josh Boyer
@ 2012-11-20  9:18                                     ` Glauber Costa
  2012-11-20 20:18                                       ` Andrew Morton
  1 sibling, 1 reply; 52+ messages in thread
From: Glauber Costa @ 2012-11-20  9:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zdenek Kabelac, Seth Jennings, Jiri Slaby, Valdis.Kletnieks,
	Jiri Slaby, linux-mm, LKML, Andrew Morton, Rik van Riel,
	Robert Jennings

On 11/12/2012 03:37 PM, Mel Gorman wrote:
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 02c1c971..d0a7967 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -31,6 +31,7 @@ struct vm_area_struct;
>  #define ___GFP_THISNODE		0x40000u
>  #define ___GFP_RECLAIMABLE	0x80000u
>  #define ___GFP_NOTRACK		0x200000u
> +#define ___GFP_NO_KSWAPD	0x400000u
>  #define ___GFP_OTHER_NODE	0x800000u
>  #define ___GFP_WRITE		0x1000000u

Keep in mind that this bit has been reused in -mm.
If this patch needs to be reverted, we'll need to first change
the definition of __GFP_KMEMCG (and __GFP_BITS_SHIFT as a result), or it
would break things.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-16 20:06                                       ` Mel Gorman
@ 2012-11-20 15:38                                         ` Josh Boyer
  2012-11-20 16:13                                           ` Bruno Wolff III
                                                             ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Josh Boyer @ 2012-11-20 15:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zdenek Kabelac, Seth Jennings, Jiri Slaby, Valdis.Kletnieks,
	Jiri Slaby, linux-mm, LKML, Andrew Morton, Rik van Riel,
	Robert Jennings, Thorsten Leemhuis, bruno

On Fri, Nov 16, 2012 at 3:06 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
>> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
>> > based on failures" reverted, Zdenek Kabelac reported the following
>> >
>> >         Hmm, so it just took longer to hit the problem and observe
>> >         kswapd0 spinning on my CPU again - it's not as endless as before -
>> >         but still it easily eats minutes - it helps to turn off Firefox
>> >         or TB (memory-hungry apps) so kswapd0 stops soon - and restart
>> >         those apps again.  (And I still have like >1GB of cached memory)
>> >
>> >         kswapd0         R  running task        0    30      2 0x00000000
>> >          ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
>> >          ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
>> >          ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
>> >         Call Trace:
>> >          [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
>> >          [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
>> >          [<ffffffff81192971>] put_super+0x31/0x40
>> >          [<ffffffff81192a42>] drop_super+0x22/0x30
>> >          [<ffffffff81193b89>] prune_super+0x149/0x1b0
>> >          [<ffffffff81141e2a>] shrink_slab+0xba/0x510
>> >
>> > The sysrq+m indicates the system has no swap so it'll never reclaim
>> > anonymous pages as part of reclaim/compaction. That is one part of the
>> > problem but not the root cause as file-backed pages could also be reclaimed.
>> >
>> > The likely underlying problem is that kswapd is woken up or kept awake
>> > for each THP allocation request in the page allocator slow path.
>> >
>> > If compaction fails for the requesting process then compaction will be
>> > deferred for a time and direct reclaim is avoided. However, if there
>> > is a storm of THP requests that are simply rejected, it will still
>> > be the case that kswapd is awake for a prolonged period of time
>> > as pgdat->kswapd_max_order is updated each time. This is noticed by
>> > the main kswapd() loop and it will not call kswapd_try_to_sleep().
>> > Instead it will loop, shrinking a small number of pages and calling
>> > shrink_slab() on each iteration.
>> >
>> > The temptation is to supply a patch that checks if kswapd was woken for
>> > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
>> > backed up by proper testing. As 3.7 is very close to release and this is
>> > not a bug we should release with, a safer path is to revert "mm: remove
>> > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
>> > balance_pgdat() logic in general.
>> >
>> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>>
>> Does anyone know if this is queued to go into 3.7 somewhere?  I looked
>> a bit and can't find it in a tree.  We have a few reports of Fedora
>> rawhide users hitting this.
>>
>
> No, because I was waiting to hear if a) it worked and preferably if the
> alternative "less safe" option worked. This close to release it might be
> better to just go with the safe option.

We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
and people say this revert patch doesn't seem to make the issue go away
fully.  Thorsten has created another kernel with the other patch applied
for testing.

At least I think that is the latest status from the bug.  Hopefully the
commenters will chime in.

josh
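The wake-storm mechanism described in Mel's changelog above can be sketched as a small userspace model. The names echo the kernel's pgdat->kswapd_max_order and kswapd_try_to_sleep(), but this is an illustrative toy under those assumptions, not the actual mm/vmscan.c logic:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the kswapd wake logic under discussion.  The
 * field name mirrors pgdat->kswapd_max_order, but this is a sketch,
 * not real kernel code. */
struct pgdat_model {
	int kswapd_max_order;	/* highest order requested by wakers */
};

/* A THP allocation request wakes kswapd and records its order
 * (order 9 for a 2MB huge page on x86-64). */
static void wakeup_kswapd_model(struct pgdat_model *pgdat, int order)
{
	if (order > pgdat->kswapd_max_order)
		pgdat->kswapd_max_order = order;
}

/* The main loop only tries to sleep when no higher-order request
 * arrived since the last balance pass; a storm of THP requests keeps
 * kswapd_max_order elevated, so kswapd keeps looping instead. */
static bool kswapd_should_sleep(struct pgdat_model *pgdat, int balanced_order)
{
	return pgdat->kswapd_max_order <= balanced_order;
}
```

In this model, repeated order-9 wakeups keep the sleep check failing until kswapd manages to balance at that order, which matches the prolonged shrink_slab() spinning seen in the traces.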

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-20 15:38                                         ` Josh Boyer
@ 2012-11-20 16:13                                           ` Bruno Wolff III
  2012-11-20 17:43                                           ` Thorsten Leemhuis
  2012-11-21 15:08                                           ` Mel Gorman
  2 siblings, 0 replies; 52+ messages in thread
From: Bruno Wolff III @ 2012-11-20 16:13 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Mel Gorman, Zdenek Kabelac, Seth Jennings, Jiri Slaby,
	Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton,
	Rik van Riel, Robert Jennings, Thorsten Leemhuis

On Tue, Nov 20, 2012 at 10:38:45 -0500,
   Josh Boyer <jwboyer@gmail.com> wrote:
>
>We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
>and people say this revert patch doesn't seem to make the issue go away
>fully.  Thorsten has created another kernel with the other patch applied
>for testing.
>
>At least I think that is the latest status from the bug.  Hopefully the
>commenters will chime in.

I am seeing kswapd0 hogging a CPU right now. I have two rsyncs and an md sync 
running and a couple of large-memory processes (java and firefox) idle.

I haven't been seeing this happen as often as previously. Before, doing a 
yum update with an rsync running was pretty good at triggering the problem. 
Now, not so much.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-20 15:38                                         ` Josh Boyer
  2012-11-20 16:13                                           ` Bruno Wolff III
@ 2012-11-20 17:43                                           ` Thorsten Leemhuis
  2012-11-23 15:20                                             ` Thorsten Leemhuis
  2012-11-21 15:08                                           ` Mel Gorman
  2 siblings, 1 reply; 52+ messages in thread
From: Thorsten Leemhuis @ 2012-11-20 17:43 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Mel Gorman, Zdenek Kabelac, Seth Jennings, Jiri Slaby,
	Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton,
	Rik van Riel, Robert Jennings, bruno

On 20.11.2012 16:38, Josh Boyer wrote:
> On Fri, Nov 16, 2012 at 3:06 PM, Mel Gorman <mgorman@suse.de> wrote:
>> On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
>>> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <mgorman@suse.de> wrote:
>>>> With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
>>>> based on failures" reverted, Zdenek Kabelac reported the following
>>>>
>>>>          Hmm, so it just took longer to hit the problem and observe
>>>>          kswapd0 spinning on my CPU again - it's not as endless as before -
>>>>          but still it easily eats minutes - it helps to turn off Firefox
>>>>          or TB (memory-hungry apps) so kswapd0 stops soon - and restart
>>>>          those apps again.  (And I still have like >1GB of cached memory)
>>>>
>>>>          kswapd0         R  running task        0    30      2 0x00000000
>>>>           ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
>>>>           ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
>>>>           ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
>>>>          Call Trace:
>>>>           [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
>>>>           [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
>>>>           [<ffffffff81192971>] put_super+0x31/0x40
>>>>           [<ffffffff81192a42>] drop_super+0x22/0x30
>>>>           [<ffffffff81193b89>] prune_super+0x149/0x1b0
>>>>           [<ffffffff81141e2a>] shrink_slab+0xba/0x510
>>>>
>>>> The sysrq+m indicates the system has no swap so it'll never reclaim
>>>> anonymous pages as part of reclaim/compaction. That is one part of the
>>>> problem but not the root cause as file-backed pages could also be reclaimed.
>>>>
>>>> The likely underlying problem is that kswapd is woken up or kept awake
>>>> for each THP allocation request in the page allocator slow path.
>>>>
>>>> If compaction fails for the requesting process then compaction will be
>>>> deferred for a time and direct reclaim is avoided. However, if there
>>>> is a storm of THP requests that are simply rejected, it will still
>>>> be the case that kswapd is awake for a prolonged period of time
>>>> as pgdat->kswapd_max_order is updated each time. This is noticed by
>>>> the main kswapd() loop and it will not call kswapd_try_to_sleep().
>>>> Instead it will loop, shrinking a small number of pages and calling
>>>> shrink_slab() on each iteration.
>>>>
>>>> The temptation is to supply a patch that checks if kswapd was woken for
>>>> THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
>>>> backed up by proper testing. As 3.7 is very close to release and this is
>>>> not a bug we should release with, a safer path is to revert "mm: remove
>>>> __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
>>>> balance_pgdat() logic in general.
>>>>
>>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>>
>>> Does anyone know if this is queued to go into 3.7 somewhere?  I looked
>>> a bit and can't find it in a tree.  We have a few reports of Fedora
>>> rawhide users hitting this.
>>
>> No, because I was waiting to hear if a) it worked and preferably if the
>> alternative "less safe" option worked. This close to release it might be
>> better to just go with the safe option.
>
> We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
> and people say this revert patch doesn't seem to make the issue go away
> fully.  Thorsten has created another kernel with the other patch applied
> for testing.
>
> At least I think that is the latest status from the bug.  Hopefully the
> commenters will chime in.

The short story from my current point of view is:

  * my main machine at home where I initially saw the issue that started 
this thread seems to be running fine with rc6 and the "safe" patch Mel 
posted in https://lkml.org/lkml/2012/11/12/113. Before that I ran an rc5 
kernel with the revert that went into rc6 and the "safe" patch -- that 
worked fine for a few days, too.

  * I have a second machine where I started to use 3.7-rc kernels only 
yesterday (the machine triggered a bug in the radeon driver that seems 
to be fixed in rc6) which showed symptoms like the ones Zdenek Kabelac 
mentions in this thread. I wasn't able to look closer at it, but simply 
tried rc6 with the safe patch, which didn't help. I'm now running rc6 
with the "riskier" patch from https://lkml.org/lkml/2012/11/12/151
I can't yet tell if it helps. If the problem shows up again I'll try to 
capture more debugging data via sysrq -- there wasn't any time for that 
when I was running rc6 with the safe patch, sorry.

Thorsten

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-20  9:18                                     ` Glauber Costa
@ 2012-11-20 20:18                                       ` Andrew Morton
  2012-11-21  8:30                                         ` Glauber Costa
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2012-11-20 20:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Mel Gorman, Zdenek Kabelac, Seth Jennings, Jiri Slaby,
	Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Rik van Riel,
	Robert Jennings

On Tue, 20 Nov 2012 13:18:19 +0400
Glauber Costa <glommer@parallels.com> wrote:

> On 11/12/2012 03:37 PM, Mel Gorman wrote:
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 02c1c971..d0a7967 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -31,6 +31,7 @@ struct vm_area_struct;
> >  #define ___GFP_THISNODE		0x40000u
> >  #define ___GFP_RECLAIMABLE	0x80000u
> >  #define ___GFP_NOTRACK		0x200000u
> > +#define ___GFP_NO_KSWAPD	0x400000u
> >  #define ___GFP_OTHER_NODE	0x800000u
> >  #define ___GFP_WRITE		0x1000000u
> 
> Keep in mind that this bit has been reused in -mm.
> If this patch needs to be reverted, we'll need to first change
> the definition of __GFP_KMEMCG (and __GFP_BITS_SHIFT as a result), or it
> would break things.

I presently have

/* Plain integer GFP bitmasks. Do not use this directly. */
#define ___GFP_DMA		0x01u
#define ___GFP_HIGHMEM		0x02u
#define ___GFP_DMA32		0x04u
#define ___GFP_MOVABLE		0x08u
#define ___GFP_WAIT		0x10u
#define ___GFP_HIGH		0x20u
#define ___GFP_IO		0x40u
#define ___GFP_FS		0x80u
#define ___GFP_COLD		0x100u
#define ___GFP_NOWARN		0x200u
#define ___GFP_REPEAT		0x400u
#define ___GFP_NOFAIL		0x800u
#define ___GFP_NORETRY		0x1000u
#define ___GFP_MEMALLOC		0x2000u
#define ___GFP_COMP		0x4000u
#define ___GFP_ZERO		0x8000u
#define ___GFP_NOMEMALLOC	0x10000u
#define ___GFP_HARDWALL		0x20000u
#define ___GFP_THISNODE		0x40000u
#define ___GFP_RECLAIMABLE	0x80000u
#define ___GFP_KMEMCG		0x100000u
#define ___GFP_NOTRACK		0x200000u
#define ___GFP_NO_KSWAPD	0x400000u
#define ___GFP_OTHER_NODE	0x800000u
#define ___GFP_WRITE		0x1000000u

and

#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

Which I think is OK?

I'd forgotten about __GFP_BITS_SHIFT.  Should we do this?

--- a/include/linux/gfp.h~a
+++ a/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+/* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
  * GFP bitmasks..
_
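The arithmetic being checked by hand above can also be verified mechanically. A userspace sketch of the constraint follows; the values are copied from the listing in this mail, and the static assertions are an illustration, not actual include/linux/gfp.h code:

```c
#include <assert.h>

/* Values copied from the gfp.h listing above (illustrative only). */
#define ___GFP_KMEMCG		0x100000u
#define ___GFP_NO_KSWAPD	0x400000u
#define ___GFP_WRITE		0x1000000u	/* highest flag, bit 24 */

#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((1u << __GFP_BITS_SHIFT) - 1)

/* Compile-time check: the shift must leave room for the highest flag. */
_Static_assert(___GFP_WRITE < (1u << __GFP_BITS_SHIFT),
	       "__GFP_BITS_SHIFT too small for highest GFP flag");

/* The reintroduced bit must not collide with the -mm __GFP_KMEMCG bit. */
_Static_assert((___GFP_NO_KSWAPD & ___GFP_KMEMCG) == 0,
	       "GFP flag bits overlap");
```

With bit 24 as the highest flag and a shift of 25, the mask is 0x1ffffff and covers every flag, so no change to __GFP_BITS_SHIFT is needed.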


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-20 20:18                                       ` Andrew Morton
@ 2012-11-21  8:30                                         ` Glauber Costa
  0 siblings, 0 replies; 52+ messages in thread
From: Glauber Costa @ 2012-11-21  8:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Zdenek Kabelac, Seth Jennings, Jiri Slaby,
	Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Rik van Riel,
	Robert Jennings

On 11/21/2012 12:18 AM, Andrew Morton wrote:
> On Tue, 20 Nov 2012 13:18:19 +0400
> Glauber Costa <glommer@parallels.com> wrote:
> 
>> On 11/12/2012 03:37 PM, Mel Gorman wrote:
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index 02c1c971..d0a7967 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -31,6 +31,7 @@ struct vm_area_struct;
>>>  #define ___GFP_THISNODE		0x40000u
>>>  #define ___GFP_RECLAIMABLE	0x80000u
>>>  #define ___GFP_NOTRACK		0x200000u
>>> +#define ___GFP_NO_KSWAPD	0x400000u
>>>  #define ___GFP_OTHER_NODE	0x800000u
>>>  #define ___GFP_WRITE		0x1000000u
>>
>> Keep in mind that this bit has been reused in -mm.
>> If this patch needs to be reverted, we'll need to first change
>> the definition of __GFP_KMEMCG (and __GFP_BITS_SHIFT as a result), or it
>> would break things.
> 
> I presently have
> 
> /* Plain integer GFP bitmasks. Do not use this directly. */
> #define ___GFP_DMA		0x01u
> #define ___GFP_HIGHMEM		0x02u
> #define ___GFP_DMA32		0x04u
> #define ___GFP_MOVABLE		0x08u
> #define ___GFP_WAIT		0x10u
> #define ___GFP_HIGH		0x20u
> #define ___GFP_IO		0x40u
> #define ___GFP_FS		0x80u
> #define ___GFP_COLD		0x100u
> #define ___GFP_NOWARN		0x200u
> #define ___GFP_REPEAT		0x400u
> #define ___GFP_NOFAIL		0x800u
> #define ___GFP_NORETRY		0x1000u
> #define ___GFP_MEMALLOC		0x2000u
> #define ___GFP_COMP		0x4000u
> #define ___GFP_ZERO		0x8000u
> #define ___GFP_NOMEMALLOC	0x10000u
> #define ___GFP_HARDWALL		0x20000u
> #define ___GFP_THISNODE		0x40000u
> #define ___GFP_RECLAIMABLE	0x80000u
> #define ___GFP_KMEMCG		0x100000u
> #define ___GFP_NOTRACK		0x200000u
> #define ___GFP_NO_KSWAPD	0x400000u
> #define ___GFP_OTHER_NODE	0x800000u
> #define ___GFP_WRITE		0x1000000u
> 
> and
> 

Hmm, I didn't realize there was also another free bit at 0x100000u.
This seems fine.

> #define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> 
> Which I think is OK?
Yes, if we haven't increased the size of the flag-space, no need to
change it.

> 
> I'd forgotten about __GFP_BITS_SHIFT.  Should we do this?
> 
> --- a/include/linux/gfp.h~a
> +++ a/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
>  #define ___GFP_NO_KSWAPD	0x400000u
>  #define ___GFP_OTHER_NODE	0x800000u
>  #define ___GFP_WRITE		0x1000000u
> +/* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
This is a very helpful comment.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-20 15:38                                         ` Josh Boyer
  2012-11-20 16:13                                           ` Bruno Wolff III
  2012-11-20 17:43                                           ` Thorsten Leemhuis
@ 2012-11-21 15:08                                           ` Mel Gorman
  2 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2012-11-21 15:08 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Zdenek Kabelac, Seth Jennings, Jiri Slaby, Valdis.Kletnieks,
	Jiri Slaby, linux-mm, LKML, Andrew Morton, Rik van Riel,
	Robert Jennings, Thorsten Leemhuis, bruno

On Tue, Nov 20, 2012 at 10:38:45AM -0500, Josh Boyer wrote:
> On Fri, Nov 16, 2012 at 3:06 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
> >> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <mgorman@suse.de> wrote:
> >> > With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
> >> > based on failures" reverted, Zdenek Kabelac reported the following
> >> >
> >> >         Hmm, so it just took longer to hit the problem and observe
> >> >         kswapd0 spinning on my CPU again - it's not as endless as before -
> >> >         but still it easily eats minutes - it helps to turn off Firefox
> >> >         or TB (memory-hungry apps) so kswapd0 stops soon - and restart
> >> >         those apps again.  (And I still have like >1GB of cached memory)
> >> >
> >> >         kswapd0         R  running task        0    30      2 0x00000000
> >> >          ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
> >> >          ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
> >> >          ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
> >> >         Call Trace:
> >> >          [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
> >> >          [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
> >> >          [<ffffffff81192971>] put_super+0x31/0x40
> >> >          [<ffffffff81192a42>] drop_super+0x22/0x30
> >> >          [<ffffffff81193b89>] prune_super+0x149/0x1b0
> >> >          [<ffffffff81141e2a>] shrink_slab+0xba/0x510
> >> >
> >> > The sysrq+m indicates the system has no swap so it'll never reclaim
> >> > anonymous pages as part of reclaim/compaction. That is one part of the
> >> > problem but not the root cause as file-backed pages could also be reclaimed.
> >> >
> >> > The likely underlying problem is that kswapd is woken up or kept awake
> >> > for each THP allocation request in the page allocator slow path.
> >> >
> >> > If compaction fails for the requesting process then compaction will be
> >> > deferred for a time and direct reclaim is avoided. However, if there
> >> > is a storm of THP requests that are simply rejected, it will still
> >> > be the case that kswapd is awake for a prolonged period of time
> >> > as pgdat->kswapd_max_order is updated each time. This is noticed by
> >> > the main kswapd() loop and it will not call kswapd_try_to_sleep().
> >> > Instead it will loop, shrinking a small number of pages and calling
> >> > shrink_slab() on each iteration.
> >> >
> >> > The temptation is to supply a patch that checks if kswapd was woken for
> >> > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
> >> > backed up by proper testing. As 3.7 is very close to release and this is
> >> > not a bug we should release with, a safer path is to revert "mm: remove
> >> > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
> >> > balance_pgdat() logic in general.
> >> >
> >> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> >>
> >> Does anyone know if this is queued to go into 3.7 somewhere?  I looked
> >> a bit and can't find it in a tree.  We have a few reports of Fedora
> >> rawhide users hitting this.
> >>
> >
> > No, because I was waiting to hear if a) it worked and preferably if the
> > alternative "less safe" option worked. This close to release it might be
> > better to just go with the safe option.
> 
> We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
> and people say this revert patch doesn't seem to make the issue go away
> fully.  Thorsten has created another kernel with the other patch applied
> for testing.
> 

There is also a potential accounting bug that could be affecting this:
https://lkml.org/lkml/2012/11/20/613 . NR_FREE_PAGES affects watermark
calculations. If it drifts too far, processes will keep entering direct
reclaim and waking kswapd even when there is no need to.
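The effect of that drift can be sketched as a toy model: watermark checks compare the NR_FREE_PAGES counter, not the true free-page count, so a counter reading low makes the zone look short of memory. This is loosely modelled on zone_watermark_ok(); the function and all values here are illustrative, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the accounting problem: if per-cpu drift makes the
 * NR_FREE_PAGES counter read below the true free count, callers see
 * the zone under its watermark and enter reclaim / wake kswapd
 * needlessly.  Illustrative only. */
static bool below_watermark(long true_free, long counter_drift, long watermark)
{
	long counted_free = true_free - counter_drift; /* what NR_FREE_PAGES reports */
	return counted_free <= watermark;
}
```

With an accurate counter the zone stays out of reclaim; with enough drift the same zone appears below its watermark and triggers the spurious kswapd wakeups described above.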

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-20 17:43                                           ` Thorsten Leemhuis
@ 2012-11-23 15:20                                             ` Thorsten Leemhuis
  2012-11-27 11:12                                               ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Thorsten Leemhuis @ 2012-11-23 15:20 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Mel Gorman, Zdenek Kabelac, Seth Jennings, Jiri Slaby,
	Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton,
	Rik van Riel, Robert Jennings, bruno

Thorsten Leemhuis wrote on 20.11.2012 18:43:
> On 20.11.2012 16:38, Josh Boyer wrote:
> 
> The short story from my current point of view is:

Quick update, in case anybody is interested:

>  * my main machine at home where I initially saw the issue that started
> this thread seems to be running fine with rc6 and the "safe" patch Mel
> posted in https://lkml.org/lkml/2012/11/12/113 Before that I ran a rc5
> kernel with the revert that went into rc6 and the "safe" patch -- that
> worked fine for a few days, too.

On this machine I'm running an rc6 kernel + the fix for the accounting
bug(¹) that went into mainline ~40 hours ago + the "riskier" patch Mel
posted in https://lkml.org/lkml/2012/11/12/151

Up to now everything works fine.

(¹) https://lkml.org/lkml/2012/11/21/362

>  * I have a second machine where I started to use 3.7-rc kernels only
> yesterday (the machine triggered a bug in the radeon driver that seems
> to be fixed in rc6) which showed symptoms like the ones Zdenek Kabelac
> mentions in this thread. I wasn't able to look closer at it, but simply
> tried rc6 with the safe patch, which didn't help. I'm now running rc6
> with the "riskier" patch from https://lkml.org/lkml/2012/11/12/151
> I can't yet tell if it helps. If the problem shows up again I'll try to
> capture more debugging data via sysrq -- there wasn't any time for that
> when I was running rc6 with the safe patch, sorry.

This machine is now also behaving fine with the above-mentioned rc6 kernel +
the two patches. It seems the accounting bug was the root cause of the
problems this machine showed.

CU
 Thorsten

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"
  2012-11-23 15:20                                             ` Thorsten Leemhuis
@ 2012-11-27 11:12                                               ` Mel Gorman
  0 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2012-11-27 11:12 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Josh Boyer, Zdenek Kabelac, Seth Jennings, Jiri Slaby,
	Valdis.Kletnieks, Jiri Slaby, linux-mm, LKML, Andrew Morton,
	Rik van Riel, Robert Jennings, bruno

On Fri, Nov 23, 2012 at 04:20:48PM +0100, Thorsten Leemhuis wrote:
> Thorsten Leemhuis wrote on 20.11.2012 18:43:
> > On 20.11.2012 16:38, Josh Boyer wrote:
> > 
> > The short story from my current point of view is:
> 
> Quick update, in case anybody is interested:
> 
> >  * my main machine at home where I initially saw the issue that started
> > this thread seems to be running fine with rc6 and the "safe" patch Mel
> > posted in https://lkml.org/lkml/2012/11/12/113 Before that I ran a rc5
> > kernel with the revert that went into rc6 and the "safe" patch -- that
> > worked fine for a few days, too.
> 
> On this machine I'm running a rc6 kernel + the fix for the accounting
> bug(¹) that went into mainline ~40 hours ago + the "riskier" patch Mel
> posted in https://lkml.org/lkml/2012/11/12/151
> 
> Up to now everything works fine.
> 
> (¹) https://lkml.org/lkml/2012/11/21/362
> 

That's good news, thanks for the follow-up. Maybe 3.7 will not be a complete
disaster with respect to THP after all this.

The riskier patch was not picked up simply because it was riskier and
would still be vulnerable to the effective infinite loop Johannes found in
kswapd. It'll all need to be revisited.

> >  * I have a second machine where I started to use 3.7-rc kernels only
> > yesterday (the machine triggered a bug in the radeon driver that seems
> > to be fixed in rc6) which showed symptoms like the ones Zdenek Kabelac
> > mentions in this thread. I wasn't able to look closer at it, but simply
> > tried rc6 with the safe patch, which didn't help. I'm now running rc6
> > with the "riskier" patch from https://lkml.org/lkml/2012/11/12/151
> > I can't yet tell if it helps. If the problem shows up again I'll try to
> > capture more debugging data via sysrq -- there wasn't any time for that
> > when I was running rc6 with the safe patch, sorry.
> 
> This machine is now also behaving fine with above mentioned rc6 kernel +
> the two patches. It seems the accounting bug was the root cause for the
> problems this machine showed.
> 

For some yes, for others no. Others are getting stuck in effectively
infinite loops in kswapd, and the trigger cases are different although
the symptoms look similar.

Thanks again.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2012-11-27 11:12 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-11  8:52 kswapd0: wxcessive CPU usage Jiri Slaby
2012-10-11 13:44 ` Valdis.Kletnieks
2012-10-11 15:34   ` Jiri Slaby
2012-10-11 17:56     ` Valdis.Kletnieks
2012-10-11 17:59       ` Jiri Slaby
2012-10-11 18:19         ` Valdis.Kletnieks
2012-10-11 22:08           ` kswapd0: excessive " Jiri Slaby
2012-10-12 12:37             ` Jiri Slaby
2012-10-12 13:57               ` Mel Gorman
2012-10-15  9:54                 ` Jiri Slaby
2012-10-15 11:09                   ` Mel Gorman
2012-10-29 10:52                     ` Thorsten Leemhuis
2012-10-30 19:18                       ` Mel Gorman
2012-10-31 11:25                         ` Thorsten Leemhuis
2012-10-31 15:04                           ` Mel Gorman
2012-11-04 16:36                         ` Rik van Riel
2012-11-02 10:44                     ` Zdenek Kabelac
2012-11-02 10:53                       ` Jiri Slaby
2012-11-02 19:45                         ` Jiri Slaby
2012-11-04 11:26                           ` Zdenek Kabelac
2012-11-05 14:24                           ` [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" Mel Gorman
2012-11-06 10:15                             ` Johannes Hirte
2012-11-09  8:36                               ` Mel Gorman
2012-11-14 21:43                                 ` Johannes Hirte
2012-11-09  9:12                             ` Mel Gorman
2012-11-09  4:22                           ` kswapd0: excessive CPU usage Seth Jennings
2012-11-09  8:07                             ` Zdenek Kabelac
2012-11-09  9:06                               ` Mel Gorman
2012-11-11  9:13                                 ` Zdenek Kabelac
2012-11-12 11:37                                   ` [PATCH] Revert "mm: remove __GFP_NO_KSWAPD" Mel Gorman
2012-11-16 19:14                                     ` Josh Boyer
2012-11-16 19:51                                       ` Andrew Morton
2012-11-20  1:43                                         ` Valdis.Kletnieks
2012-11-16 20:06                                       ` Mel Gorman
2012-11-20 15:38                                         ` Josh Boyer
2012-11-20 16:13                                           ` Bruno Wolff III
2012-11-20 17:43                                           ` Thorsten Leemhuis
2012-11-23 15:20                                             ` Thorsten Leemhuis
2012-11-27 11:12                                               ` Mel Gorman
2012-11-21 15:08                                           ` Mel Gorman
2012-11-20  9:18                                     ` Glauber Costa
2012-11-20 20:18                                       ` Andrew Morton
2012-11-21  8:30                                         ` Glauber Costa
2012-11-12 12:19                                   ` kswapd0: excessive CPU usage Mel Gorman
2012-11-12 13:13                                     ` Zdenek Kabelac
2012-11-12 13:31                                       ` Mel Gorman
2012-11-12 14:50                                         ` Zdenek Kabelac
2012-11-18 19:00                                         ` Zdenek Kabelac
2012-11-18 19:07                                           ` Jiri Slaby
2012-11-09  8:40                             ` Mel Gorman
2012-10-11 22:14 ` kswapd0: wxcessive " Andrew Morton
2012-10-11 22:26   ` Jiri Slaby

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).