* [ 00/41] 3.0.39-rc2 stable review
@ 2012-07-30 17:30 Greg Kroah-Hartman
2012-07-30 17:31 ` [ 01/41] cifs: always update the inode cache with the results from a FIND_* Greg Kroah-Hartman
` (40 more replies)
0 siblings, 41 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:30 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Greg KH, torvalds, akpm, alan
From: Greg KH <gregkh@linuxfoundation.org>
3.0.39-rc1 had a few problems, so I figured we should do a -rc2 just to
ensure that it's all working properly. Here it is, please test.
There are 41 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed Aug 1 17:29:00 UTC 2012.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.0.39-rc2.gz
and the diffstat can be found below.
thanks,
greg k-h
-------------
.../trace/postprocess/trace-vmscan-postprocess.pl | 8 +-
Makefile | 4 +-
arch/mips/include/asm/thread_info.h | 4 +-
arch/mips/kernel/vmlinux.lds.S | 3 +-
drivers/base/memory.c | 58 ++--
drivers/md/dm-raid1.c | 2 +-
drivers/md/dm-region-hash.c | 5 +-
fs/btrfs/disk-io.c | 5 +-
fs/cifs/readdir.c | 7 +-
fs/hugetlbfs/inode.c | 3 +-
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 4 +-
fs/ubifs/sb.c | 8 +-
include/linux/cpuset.h | 47 ++--
include/linux/fs.h | 11 +-
include/linux/init_task.h | 8 +
include/linux/memcontrol.h | 3 +-
include/linux/migrate.h | 23 +-
include/linux/mmzone.h | 14 +
include/linux/sched.h | 2 +-
include/linux/swap.h | 7 +-
include/trace/events/vmscan.h | 85 +++++-
kernel/cpuset.c | 63 ++---
kernel/fork.c | 3 +
kernel/time/ntp.c | 8 +-
mm/compaction.c | 26 +-
mm/filemap.c | 11 +-
mm/hugetlb.c | 13 +-
mm/memcontrol.c | 3 +-
mm/memory-failure.c | 2 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 30 +-
mm/migrate.c | 240 ++++++++++------
mm/page_alloc.c | 118 +++++---
mm/slab.c | 13 +-
mm/slub.c | 40 ++-
mm/vmscan.c | 296 ++++++++++++++++----
mm/vmstat.c | 2 +-
38 files changed, 827 insertions(+), 356 deletions(-)
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 01/41] cifs: always update the inode cache with the results from a FIND_*
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 02/41] ntp: Fix STA_INS/DEL clearing bug Greg Kroah-Hartman
` (39 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Bill Robertson, Dion Edwards,
Jeff Layton, Steve French
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Jeff Layton <jlayton@redhat.com>
commit cd60042cc1392e79410dc8de9e9c1abb38a29e57 upstream.
When we get back a FIND_FIRST/NEXT result, we have some info about the
dentry that we use to instantiate a new inode. We were ignoring and
discarding that info when we had an existing dentry in the cache.
Fix this by updating the inode in place when we find an existing dentry
and the uniqueid is the same.
Reported-and-Tested-by: Andrew Bartlett <abartlet@samba.org>
Reported-by: Bill Robertson <bill_robertson@debortoli.com.au>
Reported-by: Dion Edwards <dion_edwards@debortoli.com.au>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/cifs/readdir.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/fs/cifs/readdir.c
+++ b/fs/cifs/readdir.c
@@ -85,9 +85,12 @@ cifs_readdir_lookup(struct dentry *paren
dentry = d_lookup(parent, name);
if (dentry) {
- /* FIXME: check for inode number changes? */
- if (dentry->d_inode != NULL)
+ inode = dentry->d_inode;
+ /* update inode in place if i_ino didn't change */
+ if (inode && CIFS_I(inode)->uniqueid == fattr->cf_uniqueid) {
+ cifs_fattr_to_inode(inode, fattr);
return dentry;
+ }
d_drop(dentry);
dput(dentry);
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 02/41] ntp: Fix STA_INS/DEL clearing bug
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
2012-07-30 17:31 ` [ 01/41] cifs: always update the inode cache with the results from a FIND_* Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 03/41] mm: fix lost kswapd wakeup in kswapd_stop() Greg Kroah-Hartman
` (38 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, John Stultz, Ingo Molnar,
Peter Zijlstra, Richard Cochran, Prarit Bhargava,
Thomas Gleixner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: John Stultz <johnstul@us.ibm.com>
commit 6b1859dba01c7d512b72d77e3fd7da8354235189 upstream.
In commit 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d, I
introduced a bug that kept the STA_INS or STA_DEL bit
from being cleared from time_status via adjtimex()
without forcing STA_PLL first.
Usually once the STA_INS is set, it isn't cleared
until the leap second is applied, so its unlikely this
affected anyone. However during testing I noticed it
took some effort to cancel a leap second once STA_INS
was set.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
kernel/time/ntp.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -375,7 +375,9 @@ int second_overflow(unsigned long secs)
time_state = TIME_DEL;
break;
case TIME_INS:
- if (secs % 86400 == 0) {
+ if (!(time_status & STA_INS))
+ time_state = TIME_OK;
+ else if (secs % 86400 == 0) {
leap = -1;
time_state = TIME_OOP;
time_tai++;
@@ -384,7 +386,9 @@ int second_overflow(unsigned long secs)
}
break;
case TIME_DEL:
- if ((secs + 1) % 86400 == 0) {
+ if (!(time_status & STA_DEL))
+ time_state = TIME_OK;
+ else if ((secs + 1) % 86400 == 0) {
leap = 1;
time_tai--;
time_state = TIME_WAIT;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 03/41] mm: fix lost kswapd wakeup in kswapd_stop()
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
2012-07-30 17:31 ` [ 01/41] cifs: always update the inode cache with the results from a FIND_* Greg Kroah-Hartman
2012-07-30 17:31 ` [ 02/41] ntp: Fix STA_INS/DEL clearing bug Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 04/41] MIPS: Properly align the .data..init_task section Greg Kroah-Hartman
` (37 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Aaditya Kumar, KOSAKI Motohiro,
Minchan Kim, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Aaditya Kumar <aaditya.kumar.30@gmail.com>
commit 1c7e7f6c0703d03af6bcd5ccc11fc15d23e5ecbe upstream.
Offlining memory may block forever, waiting for kswapd() to wake up
because kswapd() does not check the event kthread->should_stop before
sleeping.
The proper pattern, from Documentation/memory-barriers.txt, is:
--- waker ---
event_indicated = 1;
wake_up_process(event_daemon);
--- sleeper ---
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (event_indicated)
break;
schedule();
}
set_current_state() may be wrapped by:
prepare_to_wait();
In the kswapd() case, event_indicated is kthread->should_stop.
=== offlining memory (waker) ===
kswapd_stop()
kthread_stop()
kthread->should_stop = 1
wake_up_process()
wait_for_completion()
=== kswapd_try_to_sleep (sleeper) ===
kswapd_try_to_sleep()
prepare_to_wait()
.
.
schedule()
.
.
finish_wait()
The schedule() needs to be protected by a test of kthread->should_stop,
which is wrapped by kthread_should_stop().
Reproducer:
Do heavy file I/O in background.
Do a memory offline/online in a tight loop
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2695,7 +2695,10 @@ static void kswapd_try_to_sleep(pg_data_
* them before going back to sleep.
*/
set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
- schedule();
+
+ if (!kthread_should_stop())
+ schedule();
+
set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
} else {
if (remaining)
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 04/41] MIPS: Properly align the .data..init_task section.
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (2 preceding siblings ...)
2012-07-30 17:31 ` [ 03/41] mm: fix lost kswapd wakeup in kswapd_stop() Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 05/41] UBIFS: fix a bug in empty space fix-up Greg Kroah-Hartman
` (36 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, David Daney, Ralf Baechle, linux-mips
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: David Daney <david.daney@cavium.com>
commit 7b1c0d26a8e272787f0f9fcc5f3e8531df3b3409 upstream.
Improper alignment can lead to unbootable systems and/or random
crashes.
[ralf@linux-mips.org: This is a lond standing bug since
6eb10bc9e2deab06630261cd05c4cb1e9a60e980 (kernel.org) rsp.
c422a10917f75fd19fa7fe070aaaa23e384dae6f (lmo) [MIPS: Clean up linker script
using new linker script macros.] so dates back to 2.6.32.]
Signed-off-by: David Daney <david.daney@cavium.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/3881/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
arch/mips/include/asm/thread_info.h | 4 ++--
arch/mips/kernel/vmlinux.lds.S | 3 ++-
2 files changed, 4 insertions(+), 3 deletions(-)
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -60,6 +60,8 @@ struct thread_info {
register struct thread_info *__current_thread_info __asm__("$28");
#define current_thread_info() __current_thread_info
+#endif /* !__ASSEMBLY__ */
+
/* thread information allocation */
#if defined(CONFIG_PAGE_SIZE_4KB) && defined(CONFIG_32BIT)
#define THREAD_SIZE_ORDER (1)
@@ -97,8 +99,6 @@ register struct thread_info *__current_t
#define free_thread_info(info) kfree(info)
-#endif /* !__ASSEMBLY__ */
-
#define PREEMPT_ACTIVE 0x10000000
/*
--- a/arch/mips/kernel/vmlinux.lds.S
+++ b/arch/mips/kernel/vmlinux.lds.S
@@ -1,5 +1,6 @@
#include <asm/asm-offsets.h>
#include <asm/page.h>
+#include <asm/thread_info.h>
#include <asm-generic/vmlinux.lds.h>
#undef mips
@@ -73,7 +74,7 @@ SECTIONS
.data : { /* Data */
. = . + DATAOFFSET; /* for CONFIG_MAPPED_KERNEL */
- INIT_TASK_DATA(PAGE_SIZE)
+ INIT_TASK_DATA(THREAD_SIZE)
NOSAVE_DATA
CACHELINE_ALIGNED_DATA(1 << CONFIG_MIPS_L1_CACHE_SHIFT)
READ_MOSTLY_DATA(1 << CONFIG_MIPS_L1_CACHE_SHIFT)
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 05/41] UBIFS: fix a bug in empty space fix-up
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (3 preceding siblings ...)
2012-07-30 17:31 ` [ 04/41] MIPS: Properly align the .data..init_task section Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 06/41] dm raid1: fix crash with mirror recovery and discard Greg Kroah-Hartman
` (35 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Artem Bityutskiy, Iwo Mergler, James Nute
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
commit c6727932cfdb13501108b16c38463c09d5ec7a74 upstream.
UBIFS has a feature called "empty space fix-up" which is a quirk to work-around
limitations of dumb flasher programs. Namely, of those flashers that are unable
to skip NAND pages full of 0xFFs while flashing, resulting in empty space at
the end of half-filled eraseblocks to be unusable for UBIFS. This feature is
relatively new (introduced in v3.0).
The fix-up routine (fixup_free_space()) is executed only once at the very first
mount if the superblock has the 'space_fixup' flag set (can be done with -F
option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
writes it back to the same LEB. The routine assumes the image is pristine and
does not have anything in the journal.
There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
All but one LEB of the log of a pristine file-system are empty. And one
contains just a commit start node. And 'fixup_free_space()' just unmapped this
LEB, which resulted in wiping the commit start node. As a result, some users
were unable to mount the file-system next time with the following symptom:
UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0
The root-cause of this bug was that 'fixup_free_space()' wrongly assumed
that the beginning of empty space in the log head (c->lhead_offs) was known
on mount. However, it is not the case - it was always 0. UBIFS does not store
in it the master node and finds out by scanning the log on every mount.
The fix is simple - just pass commit start node size instead of 0 to
'fixup_leb()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
Reported-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
Tested-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
Reported-by: James Nute <newten82@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/ubifs/sb.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/fs/ubifs/sb.c
+++ b/fs/ubifs/sb.c
@@ -715,8 +715,12 @@ static int fixup_free_space(struct ubifs
lnum = ubifs_next_log_lnum(c, lnum);
}
- /* Fixup the current log head */
- err = fixup_leb(c, c->lhead_lnum, c->lhead_offs);
+ /*
+ * Fixup the log head which contains the only a CS node at the
+ * beginning.
+ */
+ err = fixup_leb(c, c->lhead_lnum,
+ ALIGN(UBIFS_CS_NODE_SZ, c->min_io_size));
if (err)
goto out;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 06/41] dm raid1: fix crash with mirror recovery and discard
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (4 preceding siblings ...)
2012-07-30 17:31 ` [ 05/41] UBIFS: fix a bug in empty space fix-up Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 07/41] mm/vmstat.c: cache align vm_stat Greg Kroah-Hartman
` (34 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mikulas Patocka, Alasdair G Kergon
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mikulas Patocka <mpatocka@redhat.com>
commit 751f188dd5ab95b3f2b5f2f467c38aae5a2877eb upstream.
This patch fixes a crash when a discard request is sent during mirror
recovery.
Firstly, some background. Generally, the following sequence happens during
mirror synchronization:
- function do_recovery is called
- do_recovery calls dm_rh_recovery_prepare
- dm_rh_recovery_prepare uses a semaphore to limit the number
simultaneously recovered regions (by default the semaphore value is 1,
so only one region at a time is recovered)
- dm_rh_recovery_prepare calls __rh_recovery_prepare,
__rh_recovery_prepare asks the log driver for the next region to
recover. Then, it sets the region state to DM_RH_RECOVERING. If there
are no pending I/Os on this region, the region is added to
quiesced_regions list. If there are pending I/Os, the region is not
added to any list. It is added to the quiesced_regions list later (by
dm_rh_dec function) when all I/Os finish.
- when the region is on quiesced_regions list, there are no I/Os in
flight on this region. The region is popped from the list in
dm_rh_recovery_start function. Then, a kcopyd job is started in the
recover function.
- when the kcopyd job finishes, recovery_complete is called. It calls
dm_rh_recovery_end. dm_rh_recovery_end adds the region to
recovered_regions or failed_recovered_regions list (depending on
whether the copy operation was successful or not).
The above mechanism assumes that if the region is in DM_RH_RECOVERING
state, no new I/Os are started on this region. When I/O is started,
dm_rh_inc_pending is called, which increases reg->pending count. When
I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
If the count is zero and the region was in DM_RH_RECOVERING state,
dm_rh_dec adds it to the quiesced_regions list.
Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
in DM_RH_RECOVERING state, it could be added to quiesced_regions list
multiple times or it could be added to this list when kcopyd is copying
data (it is assumed that the region is not on any list while kcopyd does
its jobs). This results in memory corruption and crash.
There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
do not belong to any region, so they are always added to the sync list
in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
requests. These bypasses avoid the crash possibility described above.
These bypasses were improperly implemented for REQ_DISCARD when
the mirror target gained discard support in commit
5fc2ffeabb9ee0fc0e71ff16b49f34f0ed3d05b4 (dm raid1: support discard).
In do_writes, REQ_DISCARD requests is always added to the sync queue and
immediately dispatched (even if the region is in DM_RH_RECOVERING). However,
dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts. So it violates the
rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
corruption described above.
This patch changes it so that REQ_DISCARD requests follow the same path
as REQ_FLUSH. This avoids the crash.
Reference: https://bugzilla.redhat.com/837607
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/md/dm-raid1.c | 2 +-
drivers/md/dm-region-hash.c | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1210,7 +1210,7 @@ static int mirror_end_io(struct dm_targe
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (!(bio->bi_rw & REQ_FLUSH))
+ if (!(bio->bi_rw & (REQ_FLUSH | REQ_DISCARD)))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -404,6 +404,9 @@ void dm_rh_mark_nosync(struct dm_region_
return;
}
+ if (bio->bi_rw & REQ_DISCARD)
+ return;
+
/* We must inform the log that the sync count has changed. */
log->type->set_region_sync(log, region, 0);
@@ -524,7 +527,7 @@ void dm_rh_inc_pending(struct dm_region_
struct bio *bio;
for (bio = bios->head; bio; bio = bio->bi_next) {
- if (bio->bi_rw & REQ_FLUSH)
+ if (bio->bi_rw & (REQ_FLUSH | REQ_DISCARD))
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 07/41] mm/vmstat.c: cache align vm_stat
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (5 preceding siblings ...)
2012-07-30 17:31 ` [ 06/41] dm raid1: fix crash with mirror recovery and discard Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 08/41] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis Greg Kroah-Hartman
` (33 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Dimitri Sivanich,
Christoph Lameter, Mel Gorman, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Dimitri Sivanich <sivanich@sgi.com>
commit a1cb2c60ddc98ff4e5246f410558805401ceee67 upstream.
Stable note: Not tracked on Bugzilla. This patch is known to make a big
difference to tmpfs performance on larger machines.
This was found to adversely affect tmpfs I/O performance.
Tests run on a 640 cpu UV system.
With 120 threads doing parallel writes, each to different tmpfs mounts:
No patch: ~300 MB/sec
With vm_stat alignment: ~430 MB/sec
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmstat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -78,7 +78,7 @@ void vm_events_fold_cpu(int cpu)
*
* vm_stat contains the global counters
*/
-atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
EXPORT_SYMBOL(vm_stat);
#ifdef CONFIG_SMP
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 08/41] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (6 preceding siblings ...)
2012-07-30 17:31 ` [ 07/41] mm/vmstat.c: cache align vm_stat Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 09/41] mm: reduce the amount of work done when updating min_free_kbytes Greg Kroah-Hartman
` (32 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, KAMEZAWA Hiroyuki,
Nathan Fontenot, Greg Kroah-Hartman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit 2bbcb8788311a40714b585fc11b51da6ffa2ab92 upstream.
Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=721039 .
Without the patch, memory hot-add can fail for kernel configurations
that do not set CONFIG_SPARSEMEM_VMEMMAP.
(Resending as I am not seeing it in -next so maybe it got lost)
mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
It is expected that memory being brought online is PageReserved
similar to what happens when the page allocator is being brought up.
Memory is onlined in "memory blocks" which consist of one or more
sections. Unfortunately, the code that verifies PageReserved is
currently assuming that the memmap backing all these pages is virtually
contiguous which is only the case when CONFIG_SPARSEMEM_VMEMMAP is set.
As a result, memory hot-add is failing on those configurations with
the message;
kernel: section number XXX page number 256 not reserved, was it already online?
This patch updates the PageReserved check to lookup struct page once
per section to guarantee the correct struct page is being checked.
[Check pages within sections properly: rientjes@google.com]
[original patch by: nfont@linux.vnet.ibm.com]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
drivers/base/memory.c | 58 ++++++++++++++++++++++++++++++++++----------------
1 file changed, 40 insertions(+), 18 deletions(-)
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -224,13 +224,48 @@ int memory_isolate_notify(unsigned long
}
/*
+ * The probe routines leave the pages reserved, just as the bootmem code does.
+ * Make sure they're still that way.
+ */
+static bool pages_correctly_reserved(unsigned long start_pfn,
+ unsigned long nr_pages)
+{
+ int i, j;
+ struct page *page;
+ unsigned long pfn = start_pfn;
+
+ /*
+ * memmap between sections is not contiguous except with
+ * SPARSEMEM_VMEMMAP. We lookup the page once per section
+ * and assume memmap is contiguous within each section
+ */
+ for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
+ if (WARN_ON_ONCE(!pfn_valid(pfn)))
+ return false;
+ page = pfn_to_page(pfn);
+
+ for (j = 0; j < PAGES_PER_SECTION; j++) {
+ if (PageReserved(page + j))
+ continue;
+
+ printk(KERN_WARNING "section number %ld page number %d "
+ "not reserved, was it already online?\n",
+ pfn_to_section_nr(pfn), j);
+
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/*
* MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
* OK to have direct references to sparsemem variables in here.
*/
static int
memory_block_action(unsigned long phys_index, unsigned long action)
{
- int i;
unsigned long start_pfn, start_paddr;
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
struct page *first_page;
@@ -238,26 +273,13 @@ memory_block_action(unsigned long phys_i
first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
- /*
- * The probe routines leave the pages reserved, just
- * as the bootmem code does. Make sure they're still
- * that way.
- */
- if (action == MEM_ONLINE) {
- for (i = 0; i < nr_pages; i++) {
- if (PageReserved(first_page+i))
- continue;
-
- printk(KERN_WARNING "section number %ld page number %d "
- "not reserved, was it already online?\n",
- phys_index, i);
- return -EBUSY;
- }
- }
-
switch (action) {
case MEM_ONLINE:
start_pfn = page_to_pfn(first_page);
+
+ if (!pages_correctly_reserved(start_pfn, nr_pages))
+ return -EBUSY;
+
ret = online_pages(start_pfn, nr_pages);
break;
case MEM_OFFLINE:
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 09/41] mm: reduce the amount of work done when updating min_free_kbytes
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (7 preceding siblings ...)
2012-07-30 17:31 ` [ 08/41] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 10/41] mm: vmscan: fix force-scanning small targets without swap Greg Kroah-Hartman
` (31 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Greg KH, torvalds, akpm, alan, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.
Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
Large machines with 1TB or more of RAM take a long time to boot
without this patch and may spew out soft lockup warnings.
When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE. Ordinarily, this work is unnoticable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.
The bulk of the work is due to calling calling pageblock_is_reserved() an
unnecessary amount of times and accessing far more struct page metadata
than is necessary. This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve() improving boot times on 1TB machines.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/page_alloc.c | 40 ++++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 16 deletions(-)
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3418,25 +3418,33 @@ static void setup_zone_migrate_reserve(s
if (page_to_nid(page) != zone_to_nid(zone))
continue;
- /* Blocks with reserved pages will never free, skip them. */
- block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
- if (pageblock_is_reserved(pfn, block_end_pfn))
- continue;
-
block_migratetype = get_pageblock_migratetype(page);
- /* If this block is reserved, account for it */
- if (reserve > 0 && block_migratetype == MIGRATE_RESERVE) {
- reserve--;
- continue;
- }
+ /* Only test what is necessary when the reserves are not met */
+ if (reserve > 0) {
+ /*
+ * Blocks with reserved pages will never free, skip
+ * them.
+ */
+ block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
+ if (pageblock_is_reserved(pfn, block_end_pfn))
+ continue;
- /* Suitable for reserving if this block is movable */
- if (reserve > 0 && block_migratetype == MIGRATE_MOVABLE) {
- set_pageblock_migratetype(page, MIGRATE_RESERVE);
- move_freepages_block(zone, page, MIGRATE_RESERVE);
- reserve--;
- continue;
+ /* If this block is reserved, account for it */
+ if (block_migratetype == MIGRATE_RESERVE) {
+ reserve--;
+ continue;
+ }
+
+ /* Suitable for reserving if this block is movable */
+ if (block_migratetype == MIGRATE_MOVABLE) {
+ set_pageblock_migratetype(page,
+ MIGRATE_RESERVE);
+ move_freepages_block(zone, page,
+ MIGRATE_RESERVE);
+ reserve--;
+ continue;
+ }
}
/*
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 10/41] mm: vmscan: fix force-scanning small targets without swap
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (8 preceding siblings ...)
2012-07-30 17:31 ` [ 09/41] mm: reduce the amount of work done when updating min_free_kbytes Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 11/41] vmscan: clear ZONE_CONGESTED for zone with good watermark Greg Kroah-Hartman
` (30 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Johannes Weiner,
KAMEZAWA Hiroyuki, Michal Hocko, Ying Han, Balbir Singh,
KOSAKI Motohiro, Daisuke Nishimura, Mel Gorman, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johannes Weiner <jweiner@redhat.com>
commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream.
Stable note: Not tracked in Bugzilla. This patch augments an earlier commit
that avoids scanning priority being artificially raised. The older
fix was particularly important for small memcgs to avoid calling
wait_iff_congested() unnecessarily.
Without swap, anonymous pages are not scanned. As such, they should not
count when considering force-scanning a small target if there is no swap.
Otherwise, targets are not force-scanned even when their effective scan
number is zero and the other conditions--kswapd/memcg--apply.
This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
targets").
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1747,23 +1747,15 @@ static void get_scan_count(struct zone *
u64 fraction[2], denominator;
enum lru_list l;
int noswap = 0;
- int force_scan = 0;
+ bool force_scan = false;
unsigned long nr_force_scan[2];
-
- anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
- zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
- file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
- zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
-
- if (((anon + file) >> priority) < SWAP_CLUSTER_MAX) {
- /* kswapd does zone balancing and need to scan this zone */
- if (scanning_global_lru(sc) && current_is_kswapd())
- force_scan = 1;
- /* memcg may have small limit and need to avoid priority drop */
- if (!scanning_global_lru(sc))
- force_scan = 1;
- }
+ /* kswapd does zone balancing and needs to scan this zone */
+ if (scanning_global_lru(sc) && current_is_kswapd())
+ force_scan = true;
+ /* memcg may have small limit and need to avoid priority drop */
+ if (!scanning_global_lru(sc))
+ force_scan = true;
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1776,6 +1768,11 @@ static void get_scan_count(struct zone *
goto out;
}
+ anon = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
+ zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
+ file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
+ zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+
if (scanning_global_lru(sc)) {
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 11/41] vmscan: clear ZONE_CONGESTED for zone with good watermark
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (9 preceding siblings ...)
2012-07-30 17:31 ` [ 10/41] mm: vmscan: fix force-scanning small targets without swap Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 12/41] vmscan: add shrink_slab tracepoints Greg Kroah-Hartman
` (29 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Shaohua Li, Mel Gorman, Minchan Kim
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Shaohua Li <shaohua.li@intel.com>
commit 439423f6894aa0dec22187526827456f5004baed upstream.
Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
ZONE_CONGESTED after it balances a zone and this patch fixes a bug
where that was failing to happen. Without this patch, processes
can stall in wait_iff_congested unnecessarily. For users, this can
look like an interactivity stall but some workloads would see it
as sudden drop in throughput.
ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
task. It's possible ZONE_CONGESTED isn't cleared in some cases:
1. the zone is already balanced just entering balance_pgdat() for
order-0 because concurrent tasks free memory. In this case, later
check will skip the zone as it's balanced so the flag isn't cleared.
2. high order balance fallbacks to order-0. quote from Mel: At the
end of balance_pgdat(), kswapd uses the following logic;
If reclaiming at high order {
for each zone {
if all_unreclaimable
skip
if watermark is not met
order = 0
loop again
/* watermark is met */
clear congested
}
}
i.e. it clears ZONE_CONGESTED if it the zone is balanced. if not,
it restarts balancing at order-0. However, if the higher zones are
balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
that only happens after a zone is shrunk. This can mean that
wait_iff_congested() stalls unnecessarily.
This patch makes kswapd clear ZONE_CONGESTED during its initial
highmem->dma scan for zones that are already balanced.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2456,6 +2456,9 @@ loop_again:
high_wmark_pages(zone), 0, 0)) {
end_zone = i;
break;
+ } else {
+ /* If balanced, clear the congested flag */
+ zone_clear_flag(zone, ZONE_CONGESTED);
}
}
if (i < 0)
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 12/41] vmscan: add shrink_slab tracepoints
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (10 preceding siblings ...)
2012-07-30 17:31 ` [ 11/41] vmscan: clear ZONE_CONGESTED for zone with good watermark Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 13/41] vmscan: shrinker->nr updates race and go wrong Greg Kroah-Hartman
` (28 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Dave Chinner, Al Viro, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Dave Chinner <dchinner@redhat.com>
commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.
Stable note: This patch makes later patches easier to apply but otherwise
has little to justify it. It is a diagnostic patch that was part
of a series addressing excessive slab shrinking after GFP_NOFS
failures. There is detailed information on the series' motivation
at https://lkml.org/lkml/2011/6/2/42 .
It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/trace/events/vmscan.h | 77 ++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 8 +++-
2 files changed, 84 insertions(+), 1 deletion(-)
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -179,6 +179,83 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_en
TP_ARGS(nr_reclaimed)
);
+TRACE_EVENT(mm_shrink_slab_start,
+ TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+ long nr_objects_to_shrink, unsigned long pgs_scanned,
+ unsigned long lru_pgs, unsigned long cache_items,
+ unsigned long long delta, unsigned long total_scan),
+
+ TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs,
+ cache_items, delta, total_scan),
+
+ TP_STRUCT__entry(
+ __field(struct shrinker *, shr)
+ __field(void *, shrink)
+ __field(long, nr_objects_to_shrink)
+ __field(gfp_t, gfp_flags)
+ __field(unsigned long, pgs_scanned)
+ __field(unsigned long, lru_pgs)
+ __field(unsigned long, cache_items)
+ __field(unsigned long long, delta)
+ __field(unsigned long, total_scan)
+ ),
+
+ TP_fast_assign(
+ __entry->shr = shr;
+ __entry->shrink = shr->shrink;
+ __entry->nr_objects_to_shrink = nr_objects_to_shrink;
+ __entry->gfp_flags = sc->gfp_mask;
+ __entry->pgs_scanned = pgs_scanned;
+ __entry->lru_pgs = lru_pgs;
+ __entry->cache_items = cache_items;
+ __entry->delta = delta;
+ __entry->total_scan = total_scan;
+ ),
+
+ TP_printk("%pF %p: objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+ __entry->shrink,
+ __entry->shr,
+ __entry->nr_objects_to_shrink,
+ show_gfp_flags(__entry->gfp_flags),
+ __entry->pgs_scanned,
+ __entry->lru_pgs,
+ __entry->cache_items,
+ __entry->delta,
+ __entry->total_scan)
+);
+
+TRACE_EVENT(mm_shrink_slab_end,
+ TP_PROTO(struct shrinker *shr, int shrinker_retval,
+ long unused_scan_cnt, long new_scan_cnt),
+
+ TP_ARGS(shr, shrinker_retval, unused_scan_cnt, new_scan_cnt),
+
+ TP_STRUCT__entry(
+ __field(struct shrinker *, shr)
+ __field(void *, shrink)
+ __field(long, unused_scan)
+ __field(long, new_scan)
+ __field(int, retval)
+ __field(long, total_scan)
+ ),
+
+ TP_fast_assign(
+ __entry->shr = shr;
+ __entry->shrink = shr->shrink;
+ __entry->unused_scan = unused_scan_cnt;
+ __entry->new_scan = new_scan_cnt;
+ __entry->retval = shrinker_retval;
+ __entry->total_scan = new_scan_cnt - unused_scan_cnt;
+ ),
+
+ TP_printk("%pF %p: unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+ __entry->shrink,
+ __entry->shr,
+ __entry->unused_scan,
+ __entry->new_scan,
+ __entry->total_scan,
+ __entry->retval)
+);
DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -250,6 +250,7 @@ unsigned long shrink_slab(struct shrink_
unsigned long long delta;
unsigned long total_scan;
unsigned long max_pass;
+ int shrink_ret = 0;
max_pass = do_shrinker_shrink(shrinker, shrink, 0);
delta = (4 * nr_pages_scanned) / shrinker->seeks;
@@ -274,9 +275,12 @@ unsigned long shrink_slab(struct shrink_
total_scan = shrinker->nr;
shrinker->nr = 0;
+ trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+ nr_pages_scanned, lru_pages,
+ max_pass, delta, total_scan);
+
while (total_scan >= SHRINK_BATCH) {
long this_scan = SHRINK_BATCH;
- int shrink_ret;
int nr_before;
nr_before = do_shrinker_shrink(shrinker, shrink, 0);
@@ -293,6 +297,8 @@ unsigned long shrink_slab(struct shrink_
}
shrinker->nr += total_scan;
+ trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
+ shrinker->nr);
}
up_read(&shrinker_rwsem);
out:
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 13/41] vmscan: shrinker->nr updates race and go wrong
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (11 preceding siblings ...)
2012-07-30 17:31 ` [ 12/41] vmscan: add shrink_slab tracepoints Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 14/41] vmscan: reduce wind up shrinker->nr when shrinker cant do work Greg Kroah-Hartman
` (27 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Dave Chinner, Al Viro, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Dave Chinner <dchinner@redhat.com>
commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream.
Stable note: Not tracked in Bugzilla. This patch reduces excessive
reclaim of slab objects reducing the amount of information
that has to be brought back in from disk.
shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusio for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.
As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!
Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
other updates via cmpxchg loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++-------------
1 file changed, 32 insertions(+), 13 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -251,17 +251,29 @@ unsigned long shrink_slab(struct shrink_
unsigned long total_scan;
unsigned long max_pass;
int shrink_ret = 0;
+ long nr;
+ long new_nr;
+ /*
+ * copy the current shrinker scan count into a local variable
+ * and zero it so that other concurrent shrinker invocations
+ * don't also do this scanning work.
+ */
+ do {
+ nr = shrinker->nr;
+ } while (cmpxchg(&shrinker->nr, nr, 0) != nr);
+
+ total_scan = nr;
max_pass = do_shrinker_shrink(shrinker, shrink, 0);
delta = (4 * nr_pages_scanned) / shrinker->seeks;
delta *= max_pass;
do_div(delta, lru_pages + 1);
- shrinker->nr += delta;
- if (shrinker->nr < 0) {
+ total_scan += delta;
+ if (total_scan < 0) {
printk(KERN_ERR "shrink_slab: %pF negative objects to "
"delete nr=%ld\n",
- shrinker->shrink, shrinker->nr);
- shrinker->nr = max_pass;
+ shrinker->shrink, total_scan);
+ total_scan = max_pass;
}
/*
@@ -269,13 +281,10 @@ unsigned long shrink_slab(struct shrink_
* never try to free more than twice the estimate number of
* freeable entries.
*/
- if (shrinker->nr > max_pass * 2)
- shrinker->nr = max_pass * 2;
-
- total_scan = shrinker->nr;
- shrinker->nr = 0;
+ if (total_scan > max_pass * 2)
+ total_scan = max_pass * 2;
- trace_mm_shrink_slab_start(shrinker, shrink, total_scan,
+ trace_mm_shrink_slab_start(shrinker, shrink, nr,
nr_pages_scanned, lru_pages,
max_pass, delta, total_scan);
@@ -296,9 +305,19 @@ unsigned long shrink_slab(struct shrink_
cond_resched();
}
- shrinker->nr += total_scan;
- trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan,
- shrinker->nr);
+ /*
+ * move the unused scan count back into the shrinker in a
+ * manner that handles concurrent updates. If we exhausted the
+ * scan, there is no need to do an update.
+ */
+ do {
+ nr = shrinker->nr;
+ new_nr = total_scan + nr;
+ if (total_scan <= 0)
+ break;
+ } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
+
+ trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
}
up_read(&shrinker_rwsem);
out:
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 14/41] vmscan: reduce wind up shrinker->nr when shrinker cant do work
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (12 preceding siblings ...)
2012-07-30 17:31 ` [ 13/41] vmscan: shrinker->nr updates race and go wrong Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 15/41] vmscan: limit direct reclaim for higher order allocations Greg Kroah-Hartman
` (26 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Dave Chinner, Al Viro, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Dave Chinner <dchinner@redhat.com>
commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream.
Stable note: Not tracked in Bugzilla. This patch reduces excessive
reclaim of slab objects reducing the amount of information that
has to be brought back in from disk. The third and fourth paragram
in the series describes the impact.
When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea ehind this is that
whenteh shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.
However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
it's maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.
This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.
Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.
This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.
To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.
If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,6 +277,21 @@ unsigned long shrink_slab(struct shrink_
}
/*
+ * We need to avoid excessive windup on filesystem shrinkers
+ * due to large numbers of GFP_NOFS allocations causing the
+ * shrinkers to return -1 all the time. This results in a large
+ * nr being built up so when a shrink that can do some work
+ * comes along it empties the entire cache due to nr >>>
+ * max_pass. This is bad for sustaining a working set in
+ * memory.
+ *
+ * Hence only allow the shrinker to scan the entire cache when
+ * a large delta change is calculated directly.
+ */
+ if (delta < max_pass / 4)
+ total_scan = min(total_scan, max_pass / 2);
+
+ /*
* Avoid risking looping forever due to too large nr value:
* never try to free more than twice the estimate number of
* freeable entries.
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 15/41] vmscan: limit direct reclaim for higher order allocations
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (13 preceding siblings ...)
2012-07-30 17:31 ` [ 14/41] vmscan: reduce wind up shrinker->nr when shrinker cant do work Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 16/41] vmscan: abort reclaim/compaction if compaction can proceed Greg Kroah-Hartman
` (25 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Rik van Riel, Johannes Weiner,
Mel Gorman, Andrea Arcangeli, Minchan Kim
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Rik van Riel <riel@redhat.com>
commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream.
Stable note: Not tracked on Bugzilla. THP and compaction was found to
aggressively reclaim pages and stall systems under different
situations that was addressed piecemeal over time. Paragraph
3 of this changelog is the motivation for this patch.
When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory. Due to the unfreeable
pages, compaction fails.
Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas. However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.
This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.
This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.
If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.
[jweiner@redhat.com: change comment to explain the order check]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2059,6 +2059,22 @@ static void shrink_zones(int priority, s
continue;
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
+ if (COMPACTION_BUILD) {
+ /*
+ * If we already have plenty of memory
+ * free for compaction, don't free any
+ * more. Even though compaction is
+ * invoked for any non-zero order,
+ * only frequent costly order
+ * reclamation is disruptive enough to
+ * become a noticable problem, like
+ * transparent huge page allocations.
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
+ (compaction_suitable(zone, sc->order) ||
+ compaction_deferred(zone)))
+ continue;
+ }
/*
* This steals pages from memory cgroups over softlimit
* and returns the number of reclaimed pages and
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 16/41] vmscan: abort reclaim/compaction if compaction can proceed
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (14 preceding siblings ...)
2012-07-30 17:31 ` [ 15/41] vmscan: limit direct reclaim for higher order allocations Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 17/41] mm: compaction: trivial clean up in acct_isolated() Greg Kroah-Hartman
` (24 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Rik van Riel,
Minchan Kim, Johannes Weiner, Josh Boyer, Andrea Arcangeli
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit e0c23279c9f800c403f37511484d9014ac83adec upstream.
Stable note: Not tracked on Bugzilla. THP and compaction was found to
aggressively reclaim pages and stall systems under different
situations that was addressed piecemeal over time.
If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps. This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2037,14 +2037,19 @@ restart:
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
+ *
+ * This function returns true if a zone is being reclaimed for a costly
+ * high-order allocation and compaction is either ready to begin or deferred.
+ * This indicates to the caller that it should retry the allocation or fail.
*/
-static void shrink_zones(int priority, struct zonelist *zonelist,
+static bool shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
+ bool should_abort_reclaim = false;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2061,19 +2066,20 @@ static void shrink_zones(int priority, s
continue; /* Let kswapd poll it */
if (COMPACTION_BUILD) {
/*
- * If we already have plenty of memory
- * free for compaction, don't free any
- * more. Even though compaction is
- * invoked for any non-zero order,
- * only frequent costly order
- * reclamation is disruptive enough to
- * become a noticable problem, like
- * transparent huge page allocations.
+ * If we already have plenty of memory free for
+ * compaction in this zone, don't free any more.
+ * Even though compaction is invoked for any
+ * non-zero order, only frequent costly order
+ * reclamation is disruptive enough to become a
+ * noticable problem, like transparent huge page
+ * allocations.
*/
if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
(compaction_suitable(zone, sc->order) ||
- compaction_deferred(zone)))
+ compaction_deferred(zone))) {
+ should_abort_reclaim = true;
continue;
+ }
}
/*
* This steals pages from memory cgroups over softlimit
@@ -2092,6 +2098,8 @@ static void shrink_zones(int priority, s
shrink_zone(priority, zone, sc);
}
+
+ return should_abort_reclaim;
}
static bool zone_reclaimable(struct zone *zone)
@@ -2156,7 +2164,9 @@ static unsigned long do_try_to_free_page
sc->nr_scanned = 0;
if (!priority)
disable_swap_token(sc->mem_cgroup);
- shrink_zones(priority, zonelist, sc);
+ if (shrink_zones(priority, zonelist, sc))
+ break;
+
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 17/41] mm: compaction: trivial clean up in acct_isolated()
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (15 preceding siblings ...)
2012-07-30 17:31 ` [ 16/41] vmscan: abort reclaim/compaction if compaction can proceed Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 18/41] mm: change isolate mode from #define to bitwise type Greg Kroah-Hartman
` (23 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Minchan Kim, Johannes Weiner,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Mel Gorman, Rik van Riel,
Michal Hocko, Andrea Arcangeli
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Minchan Kim <minchan.kim@gmail.com>
commit b9e84ac1536d35aee03b2601f19694949f0bd506 upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but has no other impact.
acct_isolated of compaction uses page_lru_base_type which returns only
base type of LRU list so it never returns LRU_ACTIVE_ANON or
LRU_ACTIVE_FILE. In addtion, cc->nr_[anon|file] is used in only
acct_isolated so it doesn't have fields in conpact_control.
This patch removes fields from compact_control and makes clear function of
acct_issolated which counts the number of anon|file pages isolated.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/compaction.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -35,10 +35,6 @@ struct compact_control {
unsigned long migrate_pfn; /* isolate_migratepages search base */
bool sync; /* Synchronous migration */
- /* Account for isolated anon and file pages */
- unsigned long nr_anon;
- unsigned long nr_file;
-
unsigned int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
@@ -223,17 +219,13 @@ static void isolate_freepages(struct zon
static void acct_isolated(struct zone *zone, struct compact_control *cc)
{
struct page *page;
- unsigned int count[NR_LRU_LISTS] = { 0, };
+ unsigned int count[2] = { 0, };
- list_for_each_entry(page, &cc->migratepages, lru) {
- int lru = page_lru_base_type(page);
- count[lru]++;
- }
+ list_for_each_entry(page, &cc->migratepages, lru)
+ count[!!page_is_file_cache(page)]++;
- cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
- cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
}
/* Similar to reclaim, but different enough that they don't share logic */
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 18/41] mm: change isolate mode from #define to bitwise type
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (16 preceding siblings ...)
2012-07-30 17:31 ` [ 17/41] mm: compaction: trivial clean up in acct_isolated() Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 19/41] mm: compaction: make isolate_lru_page() filter-aware Greg Kroah-Hartman
` (22 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Minchan Kim, Johannes Weiner,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Mel Gorman, Rik van Riel,
Michal Hocko, Andrea Arcangeli
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Minchan Kim <minchan.kim@gmail.com>
commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but has no other impact.
Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.
Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags. INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."
This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
Documentation/trace/postprocess/trace-vmscan-postprocess.pl | 8 +-
include/linux/memcontrol.h | 3
include/linux/mmzone.h | 8 ++
include/linux/swap.h | 7 --
include/trace/events/vmscan.h | 8 +-
mm/compaction.c | 3
mm/memcontrol.c | 3
mm/vmscan.c | 37 ++++++------
8 files changed, 43 insertions(+), 34 deletions(-)
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -379,10 +379,10 @@ EVENT_PROCESS:
# To closer match vmstat scanning statistics, only count isolate_both
# and isolate_inactive as scanning. isolate_active is rotation
- # isolate_inactive == 0
- # isolate_active == 1
- # isolate_both == 2
- if ($isolate_mode != 1) {
+ # isolate_inactive == 1
+ # isolate_active == 2
+ # isolate_both == 3
+ if ($isolate_mode != 2) {
$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
}
$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,7 +35,8 @@ enum mem_cgroup_page_stat_item {
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
- int mode, struct zone *z,
+ isolate_mode_t mode,
+ struct zone *z,
struct mem_cgroup *mem_cont,
int active, int file);
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,6 +158,14 @@ static inline int is_unevictable_lru(enu
return (l == LRU_UNEVICTABLE);
}
+/* Isolate inactive pages */
+#define ISOLATE_INACTIVE ((__force isolate_mode_t)0x1)
+/* Isolate active pages */
+#define ISOLATE_ACTIVE ((__force isolate_mode_t)0x2)
+
+/* LRU Isolation modes. */
+typedef unsigned __bitwise__ isolate_mode_t;
+
enum zone_watermarks {
WMARK_MIN,
WMARK_LOW,
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -243,11 +243,6 @@ static inline void lru_cache_add_file(st
__lru_cache_add(page, LRU_INACTIVE_FILE);
}
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
-#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
-
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
@@ -259,7 +254,7 @@ extern unsigned long mem_cgroup_shrink_n
unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned);
-extern int __isolate_lru_page(struct page *page, int mode, int file);
+extern int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -266,7 +266,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolat
unsigned long nr_lumpy_taken,
unsigned long nr_lumpy_dirty,
unsigned long nr_lumpy_failed,
- int isolate_mode),
+ isolate_mode_t isolate_mode),
TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
@@ -278,7 +278,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolat
__field(unsigned long, nr_lumpy_taken)
__field(unsigned long, nr_lumpy_dirty)
__field(unsigned long, nr_lumpy_failed)
- __field(int, isolate_mode)
+ __field(isolate_mode_t, isolate_mode)
),
TP_fast_assign(
@@ -312,7 +312,7 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_templ
unsigned long nr_lumpy_taken,
unsigned long nr_lumpy_dirty,
unsigned long nr_lumpy_failed,
- int isolate_mode),
+ isolate_mode_t isolate_mode),
TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode)
@@ -327,7 +327,7 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_templ
unsigned long nr_lumpy_taken,
unsigned long nr_lumpy_dirty,
unsigned long nr_lumpy_failed,
- int isolate_mode),
+ isolate_mode_t isolate_mode),
TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode)
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,7 +371,8 @@ static isolate_migrate_t isolate_migrate
}
/* Try isolate the page */
- if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
+ if (__isolate_lru_page(page,
+ ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
continue;
VM_BUG_ON(PageTransCompound(page));
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1251,7 +1251,8 @@ mem_cgroup_get_reclaim_stat_from_page(st
unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
- int mode, struct zone *z,
+ isolate_mode_t mode,
+ struct zone *z,
struct mem_cgroup *mem_cont,
int active, int file)
{
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1012,23 +1012,27 @@ keep_lumpy:
*
* returns 0 on success, -ve errno on failure.
*/
-int __isolate_lru_page(struct page *page, int mode, int file)
+int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
{
+ bool all_lru_mode;
int ret = -EINVAL;
/* Only take pages on the LRU. */
if (!PageLRU(page))
return ret;
+ all_lru_mode = (mode & (ISOLATE_ACTIVE|ISOLATE_INACTIVE)) ==
+ (ISOLATE_ACTIVE|ISOLATE_INACTIVE);
+
/*
* When checking the active state, we need to be sure we are
* dealing with comparible boolean values. Take the logical not
* of each.
*/
- if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
+ if (!all_lru_mode && !PageActive(page) != !(mode & ISOLATE_ACTIVE))
return ret;
- if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
+ if (!all_lru_mode && !!page_is_file_cache(page) != file)
return ret;
/*
@@ -1076,7 +1080,8 @@ int __isolate_lru_page(struct page *page
*/
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
struct list_head *src, struct list_head *dst,
- unsigned long *scanned, int order, int mode, int file)
+ unsigned long *scanned, int order, isolate_mode_t mode,
+ int file)
{
unsigned long nr_taken = 0;
unsigned long nr_lumpy_taken = 0;
@@ -1201,8 +1206,8 @@ static unsigned long isolate_lru_pages(u
static unsigned long isolate_pages_global(unsigned long nr,
struct list_head *dst,
unsigned long *scanned, int order,
- int mode, struct zone *z,
- int active, int file)
+ isolate_mode_t mode,
+ struct zone *z, int active, int file)
{
int lru = LRU_BASE;
if (active)
@@ -1448,6 +1453,7 @@ shrink_inactive_list(unsigned long nr_to
unsigned long nr_taken;
unsigned long nr_anon;
unsigned long nr_file;
+ isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1458,15 +1464,15 @@ shrink_inactive_list(unsigned long nr_to
}
set_reclaim_mode(priority, sc, false);
+ if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
+ reclaim_mode |= ISOLATE_ACTIVE;
+
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
if (scanning_global_lru(sc)) {
- nr_taken = isolate_pages_global(nr_to_scan,
- &page_list, &nr_scanned, sc->order,
- sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, 0, file);
+ nr_taken = isolate_pages_global(nr_to_scan, &page_list,
+ &nr_scanned, sc->order, reclaim_mode, zone, 0, file);
zone->pages_scanned += nr_scanned;
if (current_is_kswapd())
__count_zone_vm_events(PGSCAN_KSWAPD, zone,
@@ -1475,12 +1481,9 @@ shrink_inactive_list(unsigned long nr_to
__count_zone_vm_events(PGSCAN_DIRECT, zone,
nr_scanned);
} else {
- nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
- &page_list, &nr_scanned, sc->order,
- sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, sc->mem_cgroup,
- 0, file);
+ nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
+ &nr_scanned, sc->order, reclaim_mode, zone,
+ sc->mem_cgroup, 0, file);
/*
* mem_cgroup_isolate_pages() keeps track of
* scanned pages on its own.
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 19/41] mm: compaction: make isolate_lru_page() filter-aware
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (17 preceding siblings ...)
2012-07-30 17:31 ` [ 18/41] mm: change isolate mode from #define to bitwise type Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 20/41] mm: zone_reclaim: " Greg Kroah-Hartman
` (21 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Minchan Kim, Johannes Weiner,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Mel Gorman, Rik van Riel,
Michal Hocko, Andrea Arcangeli
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Minchan Kim <minchan.kim@gmail.com>
commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.
Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
list leading to poor reclaim decisions which has a variable
performance impact.
In async mode, compaction doesn't migrate dirty or writeback pages. So,
it's meaningless to pick the page and re-add it to lru list.
Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback. So it could be migrated. But it's very unlikely as
isolate and migration cycle is much faster than writeout.
So, this patch helps cpu overhead and prevent unnecessary LRU churning.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/mmzone.h | 2 ++
mm/compaction.c | 7 +++++--
mm/vmscan.c | 3 +++
3 files changed, 10 insertions(+), 2 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -162,6 +162,8 @@ static inline int is_unevictable_lru(enu
#define ISOLATE_INACTIVE ((__force isolate_mode_t)0x1)
/* Isolate active pages */
#define ISOLATE_ACTIVE ((__force isolate_mode_t)0x2)
+/* Isolate clean file */
+#define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
/* LRU Isolation modes. */
typedef unsigned __bitwise__ isolate_mode_t;
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -261,6 +261,7 @@ static isolate_migrate_t isolate_migrate
unsigned long last_pageblock_nr = 0, pageblock_nr;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct list_head *migratelist = &cc->migratepages;
+ isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
/* Do not scan outside zone boundaries */
low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -370,9 +371,11 @@ static isolate_migrate_t isolate_migrate
continue;
}
+ if (!cc->sync)
+ mode |= ISOLATE_CLEAN;
+
/* Try isolate the page */
- if (__isolate_lru_page(page,
- ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
+ if (__isolate_lru_page(page, mode, 0) != 0)
continue;
VM_BUG_ON(PageTransCompound(page));
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1045,6 +1045,9 @@ int __isolate_lru_page(struct page *page
ret = -EBUSY;
+ if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
+ return ret;
+
if (likely(get_page_unless_zero(page))) {
/*
* Be careful not to clear PageLRU until after we're
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 20/41] mm: zone_reclaim: make isolate_lru_page() filter-aware
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (18 preceding siblings ...)
2012-07-30 17:31 ` [ 19/41] mm: compaction: make isolate_lru_page() filter-aware Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 21/41] mm: migration: clean up unmap_and_move() Greg Kroah-Hartman
` (20 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Minchan Kim, Johannes Weiner,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michal Hocko, Mel Gorman,
Rik van Riel, Andrea Arcangeli
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Minchan Kim <minchan.kim@gmail.com>
commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.
Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
leading to poor reclaim decisions which has a variable
performance impact.
In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless,
we have isolated mapped page and re-add it into LRU's head. It's
unnecessary CPU overhead and makes LRU churning.
Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped. So it could be
migrated. But race is rare and although it happens, it's no big deal.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/mmzone.h | 2 ++
mm/vmscan.c | 20 ++++++++++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -164,6 +164,8 @@ static inline int is_unevictable_lru(enu
#define ISOLATE_ACTIVE ((__force isolate_mode_t)0x2)
/* Isolate clean file */
#define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
+/* Isolate unmapped file */
+#define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x8)
/* LRU Isolation modes. */
typedef unsigned __bitwise__ isolate_mode_t;
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,6 +1048,9 @@ int __isolate_lru_page(struct page *page
if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
return ret;
+ if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
+ return ret;
+
if (likely(get_page_unless_zero(page))) {
/*
* Be careful not to clear PageLRU until after we're
@@ -1471,6 +1474,12 @@ shrink_inactive_list(unsigned long nr_to
reclaim_mode |= ISOLATE_ACTIVE;
lru_add_drain();
+
+ if (!sc->may_unmap)
+ reclaim_mode |= ISOLATE_UNMAPPED;
+ if (!sc->may_writepage)
+ reclaim_mode |= ISOLATE_CLEAN;
+
spin_lock_irq(&zone->lru_lock);
if (scanning_global_lru(sc)) {
@@ -1588,19 +1597,26 @@ static void shrink_active_list(unsigned
struct page *page;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
unsigned long nr_rotated = 0;
+ isolate_mode_t reclaim_mode = ISOLATE_ACTIVE;
lru_add_drain();
+
+ if (!sc->may_unmap)
+ reclaim_mode |= ISOLATE_UNMAPPED;
+ if (!sc->may_writepage)
+ reclaim_mode |= ISOLATE_CLEAN;
+
spin_lock_irq(&zone->lru_lock);
if (scanning_global_lru(sc)) {
nr_taken = isolate_pages_global(nr_pages, &l_hold,
&pgscanned, sc->order,
- ISOLATE_ACTIVE, zone,
+ reclaim_mode, zone,
1, file);
zone->pages_scanned += pgscanned;
} else {
nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
&pgscanned, sc->order,
- ISOLATE_ACTIVE, zone,
+ reclaim_mode, zone,
sc->mem_cgroup, 1, file);
/*
* mem_cgroup_isolate_pages() keeps track of
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 21/41] mm: migration: clean up unmap_and_move()
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (19 preceding siblings ...)
2012-07-30 17:31 ` [ 20/41] mm: zone_reclaim: " Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 22/41] mm: compaction: allow compaction to isolate dirty pages Greg Kroah-Hartman
` (19 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Minchan Kim, KOSAKI Motohiro,
Johannes Weiner, KAMEZAWA Hiroyuki, Mel Gorman, Rik van Riel,
Michal Hocko, Andrea Arcangeli
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Minchan Kim <minchan.kim@gmail.com>
commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but has no other impact.
unmap_and_move() is one a big messy function. Clean it up.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/migrate.c | 75 +++++++++++++++++++++++++++++++----------------------------
1 file changed, 40 insertions(+), 35 deletions(-)
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -621,38 +621,18 @@ static int move_to_new_page(struct page
return rc;
}
-/*
- * Obtain the lock on page, remove all ptes and migrate the page
- * to the newly allocated page in newpage.
- */
-static int unmap_and_move(new_page_t get_new_page, unsigned long private,
- struct page *page, int force, bool offlining, bool sync)
+static int __unmap_and_move(struct page *page, struct page *newpage,
+ int force, bool offlining, bool sync)
{
- int rc = 0;
- int *result = NULL;
- struct page *newpage = get_new_page(page, private, &result);
+ int rc = -EAGAIN;
int remap_swapcache = 1;
int charge = 0;
struct mem_cgroup *mem;
struct anon_vma *anon_vma = NULL;
- if (!newpage)
- return -ENOMEM;
-
- if (page_count(page) == 1) {
- /* page was freed from under us. So we are done. */
- goto move_newpage;
- }
- if (unlikely(PageTransHuge(page)))
- if (unlikely(split_huge_page(page)))
- goto move_newpage;
-
- /* prepare cgroup just returns 0 or -ENOMEM */
- rc = -EAGAIN;
-
if (!trylock_page(page)) {
if (!force || !sync)
- goto move_newpage;
+ goto out;
/*
* It's not safe for direct compaction to call lock_page.
@@ -668,7 +648,7 @@ static int unmap_and_move(new_page_t get
* altogether.
*/
if (current->flags & PF_MEMALLOC)
- goto move_newpage;
+ goto out;
lock_page(page);
}
@@ -785,27 +765,52 @@ uncharge:
mem_cgroup_end_migration(mem, page, newpage, rc == 0);
unlock:
unlock_page(page);
+out:
+ return rc;
+}
+
+/*
+ * Obtain the lock on page, remove all ptes and migrate the page
+ * to the newly allocated page in newpage.
+ */
+static int unmap_and_move(new_page_t get_new_page, unsigned long private,
+ struct page *page, int force, bool offlining, bool sync)
+{
+ int rc = 0;
+ int *result = NULL;
+ struct page *newpage = get_new_page(page, private, &result);
+
+ if (!newpage)
+ return -ENOMEM;
-move_newpage:
+ if (page_count(page) == 1) {
+ /* page was freed from under us. So we are done. */
+ goto out;
+ }
+
+ if (unlikely(PageTransHuge(page)))
+ if (unlikely(split_huge_page(page)))
+ goto out;
+
+ rc = __unmap_and_move(page, newpage, force, offlining, sync);
+out:
if (rc != -EAGAIN) {
- /*
- * A page that has been migrated has all references
- * removed and will be freed. A page that has not been
- * migrated will have kepts its references and be
- * restored.
- */
- list_del(&page->lru);
+ /*
+ * A page that has been migrated has all references
+ * removed and will be freed. A page that has not been
+ * migrated will have kepts its references and be
+ * restored.
+ */
+ list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
putback_lru_page(page);
}
-
/*
* Move the new page to the LRU. If migration was not successful
* then this will free the page.
*/
putback_lru_page(newpage);
-
if (result) {
if (rc)
*result = rc;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 22/41] mm: compaction: allow compaction to isolate dirty pages
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (20 preceding siblings ...)
2012-07-30 17:31 ` [ 21/41] mm: migration: clean up unmap_and_move() Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 23/41] mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage Greg Kroah-Hartman
` (18 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Andrea Arcangeli,
Rik van Riel, Minchan Kim, Dave Jones, Jan Kara, Andy Isaacson,
Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
information by reducing LRU list churning had the side-effect of
reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Short summary: There are severe stalls when a USB stick using VFAT is
used with THP enabled that are reduced by this series. If you are
experiencing this problem, please test and report back and considering I
have seen complaints from openSUSE and Fedora users on this as well as a
few private mails, I'm guessing it's a widespread issue. This is a new
type of USB-related stall because it is due to synchronous compaction
writing where as in the past the big problem was dirty pages reaching
the end of the LRU and being written by reclaim.
Am cc'ing Andrew this time and this series would replace
mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
for wider testing and ideally it would be reverted and replaced by this
series.
That said, the later patches could really do with some review. If this
series is not the answer then a new direction needs to be discussed
because as it is, the stalls are unacceptable as the results in this
leader show.
For testers that try backporting this to 3.1, it won't work because
there is a non-obvious dependency on not writing back pages in direct
reclaim so you need those patches too.
Changelog since V5
o Rebase to 3.2-rc5
o Tidy up the changelogs a bit
Changelog since V4
o Added reviewed-bys, credited Andrea properly for sync-light
o Allow dirty pages without mappings to be considered for migration
o Bound the number of pages freed for compaction
o Isolate PageReclaim pages on their own LRU list
This is against 3.2-rc5 and follows on from discussions on "mm: Do
not stall in synchronous compaction for THP allocations" and "[RFC
PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
patch eliminated stalls due to compaction which sometimes resulted in
user-visible interactivity problems on browsers by simply never using
sync compaction. The downside was that THP success allocation rates
were lower because dirty pages were not being migrated as reported by
Andrea. His approach at fixing this was nacked on the grounds that
it reverted fixes from Rik merged that reduced the amount of pages
reclaimed as it severely impacted his workloads performance.
This series attempts to reconcile the requirements of maximising THP
usage, without stalling in a user-visible fashion due to compaction
or cheating by reclaiming an excessive number of pages.
Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
dirty pages. This is because migration can move some dirty
pages without blocking.
Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
synchronous compaction when it should be. This is unrelated
to the reported stalls but is worth fixing.
Patch 3 checks if we isolated a compound page during lumpy scan and
account for it properly. For the most part, this affects
tracing so it's unrelated to the stalls but worth fixing.
Patch 4 notes that it is possible to abort reclaim early for compaction
and return 0 to the page allocator potentially entering the
"may oom" path. This has not been observed in practice but
the rest of the series potentially makes it easier to happen.
Patch 5 adds a sync parameter to the migratepage callback and gives
the callback responsibility for migrating the page without
blocking if sync==false. For example, fallback_migrate_page
will not call writepage if sync==false. This increases the
number of pages that can be handled by asynchronous compaction
thereby reducing stalls.
Patch 6 restores filter-awareness to isolate_lru_page for migration.
In practice, it means that pages under writeback and pages
without a ->migratepage callback will not be isolated
for migration.
Patch 7 avoids calling direct reclaim if compaction is deferred but
makes sure that compaction is only deferred if sync
compaction was used.
Patch 8 introduces a sync-light migration mechanism that sync compaction
uses. The objective is to allow some stalls but to not call
->writepage which can lead to significant user-visible stalls.
Patch 9 notes that while we want to abort reclaim ASAP to allow
compation to go ahead that we leave a very small window of
opportunity for compaction to run. This patch allows more pages
to be freed by reclaim but bounds the number to a reasonable
level based on the high watermark on each zone.
Patch 10 allows slabs to be shrunk even after compaction_ready() is
true for one zone. This is to avoid a problem whereby a single
small zone can abort reclaim even though no pages have been
reclaimed and no suitably large zone is in a usable state.
Patch 11 fixes a problem with the rate of page scanning. As reclaim is
rarely stalling on pages under writeback it means that scan
rates are very high. This is particularly true for direct
reclaim which is not calling writepage. The vmstat figures
implied that much of this was busy work with PageReclaim pages
marked for immediate reclaim. This patch is a prototype that
moves these pages to their own LRU list.
This has been tested and other than 2 USB keys getting trashed,
nothing horrible fell out. That said, I am a bit unhappy with the
rescue logic in patch 11 but did not find a better way around it. It
does significantly reduce scan rates and System CPU time indicating
it is the right direction to take.
What is of critical importance is that stalls due to compaction
are massively reduced even though sync compaction was still
allowed. Testing from people complaining about stalls copying to USBs
with THP enabled are particularly welcome.
The following tests all involve THP usage and USB keys in some
way. Each test follows this type of pattern
1. Read from some fast fast storage, be it raw device or file. Each time
the copy finishes, start again until the test ends
2. Write a large file to a filesystem on a USB stick. Each time the copy
finishes, start again until the test ends
3. When memory is low, start an alloc process that creates a mapping
the size of physical memory to stress THP allocation. This is the
"real" part of the test and the part that is meant to trigger
stalls when THP is enabled. Copying continues in the background.
4. Record the CPU usage and time to execute of the alloc process
5. Record the number of THP allocs and fallbacks as well as the number of THP
pages in use a the end of the test just before alloc exited
6. Run the test 5 times to get an idea of variability
7. Between each run, sync is run and caches dropped and the test
waits until nr_dirty is a small number to avoid interference
or caching between iterations that would skew the figures.
The individual tests were then
writebackCPDeviceBasevfat
Disable THP, read from a raw device (sda), vfat on USB stick
writebackCPDeviceBaseext4
Disable THP, read from a raw device (sda), ext4 on USB stick
writebackCPDevicevfat
THP enabled, read from a raw device (sda), vfat on USB stick
writebackCPDeviceext4
THP enabled, read from a raw device (sda), ext4 on USB stick
writebackCPFilevfat
THP enabled, read from a file on fast storage and USB, both vfat
writebackCPFileext4
THP enabled, read from a file on fast storage and USB, both ext4
The kernels tested were
3.1 3.1
vanilla 3.2-rc5
freemore Patches 1-10
immediate Patches 1-11
andrea The 8 patches Andrea posted as a basis of comparison
The results are very long unfortunately. I'll start with the case
where we are not using THP at all
writebackCPDeviceBasevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
+/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
+/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
+/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
The THP figures are obviously all 0 because THP was enabled. The
main thing to watch is the elapsed times and how they compare to
times when THP is enabled later. It's also important to note that
elapsed time is improved by this series as System CPu time is much
reduced.
writebackCPDevicevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
+/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60
(-10818.53%)
User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
+/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 (
95.48%)
+/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
+/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
+/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
+/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28
The first thing to note is the "Elapsed Time" for the vanilla kernels
of 2249 seconds versus 56 with THP disabled which might explain the
reports of USB stalls with THP enabled. Applying the patches brings
performance in line with THP-disabled performance while isolating
pages for immediate reclaim from the LRU cuts down System CPU time.
The "Fault Alloc" success rate figures are also improved. The vanilla
kernel only managed to allocate 76.6 pages on average over the course
of 5 iterations where as applying the series allocated 181.20 on
average albeit it is well within variance. It's worth noting that
applies the series at least descreases the amount of variance which
implies an improvement.
Andrea's series had a higher success rate for THP allocations but
at a severe cost to elapsed time which is still better than vanilla
but still much worse than disabling THP altogether. One can bring my
series close to Andrea's by removing this check
/*
* If compaction is deferred for high-order allocations, it is because
* sync compaction recently failed. In this is the case and the caller
* has requested the system not be heavily disrupted, fail the
* allocation now instead of entering direct reclaim
*/
if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
goto nopage;
I didn't include a patch that removed the above check because hurting
overall performance to improve the THP figure is not what the average
user wants. It's something to consider though if someone really wants
to maximise THP usage no matter what it does to the workload initially.
This is summary of vmstat figures from the same test.
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
Page Ins 3257266139 1111844061 17263623 10901575 161423219
Page Outs 81054922 30364312 3626530 3657687 8753730
Swap Ins 3294 2851 6560 4964 4592
Swap Outs 390073 528094 620197 790912 698285
Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
Kswapd efficiency 83% 69% 58% 57% 79%
Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
Direct efficiency 74% 9% 0% 1% 0%
Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
Percentage direct scans 96% 99% 99% 98% 99%
Page writes by reclaim 722646 529174 620319 791018 699198
Page writes file 332573 1080 122 106 913
Page writes anon 390073 528094 620197 790912 698285
Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
Page rescued immediate 0 0 0 87848 0
Slabs scanned 23552 23552 9216 8192 9216
Direct inode steals 231 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 28076 786 0 61 6
THP fault alloc 609 383 753 906 1433
THP collapse alloc 12 6 0 0 6
THP splits 536 211 456 593 1136
THP fault fallback 4406 4633 4263 4110 3583
THP collapse fail 120 127 0 0 4
Compaction stalls 1810 728 623 779 3200
Compaction success 196 53 60 80 123
Compaction failures 1614 675 563 699 3077
Compaction pages moved 193158 53545 243185 333457 226688
Compaction move failure 9952 9396 16424 23676 45070
The main things to look at are
1. Page In/out figures are much reduced by the series.
2. Direct page scanning is incredibly high (264745.137 pages scanned
per second on the vanilla kernel) but isolating PageReclaim pages
on their own list reduces the number of pages scanned significantly.
3. The fact that "Page rescued immediate" is a positive number implies
that we sometimes race removing pages from the LRU_IMMEDIATE list
that need to be put back on a normal LRU but it happens only for
0.07% of the pages marked for immediate reclaim.
writebackCPDeviceext4
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
+/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
+/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
+/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
+/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
+/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
+/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
Similar test but the USB stick is using ext4 instead of vfat. As
ext4 does not use writepage for migration, the large stalls due to
compaction when THP is enabled are not observed. Still, isolating
PageReclaim pages on their own list helped completion time largely
by reducing the number of pages scanned by direct reclaim although
time spend in congestion_wait could also be a factor.
Again, Andrea's series had far higher success rates for THP allocation
at the cost of elapsed time. I didn't look too closely but a quick
look at the vmstat figures tells me kswapd reclaimed 8 times more pages
than the patch series and direct reclaim reclaimed roughly three times
as many pages. It follows that if memory is aggressively reclaimed,
there will be more available for THP.
writebackCPFilevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
+/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75
(-6863.76%)
User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
+/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
+/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
+/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
+/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
+/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28
In this case, the test is reading/writing only from filesystems but as
it's vfat, it's slow due to calling writepage during compaction. Little
to observe really - the time to complete the test goes way down
with the series applied and THP allocation success rates go up in
comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
the elapsed time for that kernel is abysmal so it is not really a
sensible comparison.
As before, Andrea's series allocates more THPs at the cost of overall
performance.
writebackCPFileext4
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
+/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
+/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
+/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
+/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
+/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
+/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
Same type of story - elapsed times go down. In this case, allocation
success rates are roughtly the same. As before, Andrea's has higher
success rates but takes a lot longer.
Overall the series does reduce latencies and while the tests are
inherency racy as alloc competes with the cp processes, the variability
was included. The THP allocation rates are not as high as they could
be but that is because we would have to be more aggressive about
reclaim and compaction impacting overall performance.
This patch:
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
is was meaningless to pick the page and re-add it to the LRU list.
What was missed during review is that asynchronous migration moves dirty
pages if their ->migratepage callback is migrate_page() because these can
be moved without blocking. This potentially impacted hugepage allocation
success rates by a factor depending on how many dirty pages are in the
system.
This patch partially reverts 39deaf85 to allow migration to isolate dirty
pages again. This increases how much compaction disrupts the LRU but that
is addressed later in the series.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/compaction.c | 3 ---
1 file changed, 3 deletions(-)
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,9 +371,6 @@ static isolate_migrate_t isolate_migrate
continue;
}
- if (!cc->sync)
- mode |= ISOLATE_CLEAN;
-
/* Try isolate the page */
if (__isolate_lru_page(page, mode, 0) != 0)
continue;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 23/41] mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (21 preceding siblings ...)
2012-07-30 17:31 ` [ 22/41] mm: compaction: allow compaction to isolate dirty pages Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 24/41] mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred Greg Kroah-Hartman
` (17 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Rik van Riel,
Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
aging information by reducing LRU list churning had the side-effect
of reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/btrfs/disk-io.c | 4 -
fs/hugetlbfs/inode.c | 3 -
fs/nfs/internal.h | 2
fs/nfs/write.c | 4 -
include/linux/fs.h | 9 ++-
include/linux/migrate.h | 2
mm/migrate.c | 129 ++++++++++++++++++++++++++++++++++--------------
7 files changed, 106 insertions(+), 47 deletions(-)
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -801,7 +801,7 @@ static int btree_submit_bio_hook(struct
#ifdef CONFIG_MIGRATION
static int btree_migratepage(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
/*
* we can't safely write a btree page from here,
@@ -816,7 +816,7 @@ static int btree_migratepage(struct addr
if (page_has_private(page) &&
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
}
#endif
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,7 +568,8 @@ static int hugetlbfs_set_page_dirty(stru
}
static int hugetlbfs_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page,
+ bool sync)
{
int rc;
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs
#ifdef CONFIG_MIGRATION
extern int nfs_migrate_page(struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
#else
#define nfs_migrate_page NULL
#endif
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1662,7 +1662,7 @@ out_error:
#ifdef CONFIG_MIGRATION
int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
- struct page *page)
+ struct page *page, bool sync)
{
/*
* If PagePrivate is set, then the page is currently associated with
@@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_spac
nfs_fscache_release_page(page, GFP_KERNEL);
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
}
#endif
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -607,9 +607,12 @@ struct address_space_operations {
loff_t offset, unsigned long nr_segs);
int (*get_xip_mem)(struct address_space *, pgoff_t, int,
void **, unsigned long *);
- /* migrate the contents of a page to the specified target */
+ /*
+ * migrate the contents of a page to the specified target. If sync
+ * is false, it must not block.
+ */
int (*migratepage) (struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
@@ -2478,7 +2481,7 @@ extern int generic_check_addressable(uns
#ifdef CONFIG_MIGRATION
extern int buffer_migrate_page(struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
#else
#define buffer_migrate_page NULL
#endif
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct p
extern void putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
bool sync);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -220,6 +220,55 @@ out:
pte_unmap_unlock(ptep, ptl);
}
+#ifdef CONFIG_BLOCK
+/* Returns true if all buffers are successfully locked */
+static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+{
+ struct buffer_head *bh = head;
+
+ /* Simple case, sync compaction */
+ if (sync) {
+ do {
+ get_bh(bh);
+ lock_buffer(bh);
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+
+ return true;
+ }
+
+ /* async case, we cannot block on lock_buffer so use trylock_buffer */
+ do {
+ get_bh(bh);
+ if (!trylock_buffer(bh)) {
+ /*
+ * We failed to lock the buffer and cannot stall in
+ * async migration. Release the taken locks
+ */
+ struct buffer_head *failed_bh = bh;
+ put_bh(failed_bh);
+ bh = head;
+ while (bh != failed_bh) {
+ unlock_buffer(bh);
+ put_bh(bh);
+ bh = bh->b_this_page;
+ }
+ return false;
+ }
+
+ bh = bh->b_this_page;
+ } while (bh != head);
+ return true;
+}
+#else
+static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
+ bool sync)
+{
+ return true;
+}
+#endif /* CONFIG_BLOCK */
+
/*
* Replace the page in the mapping.
*
@@ -229,7 +278,8 @@ out:
* 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
*/
static int migrate_page_move_mapping(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page,
+ struct buffer_head *head, bool sync)
{
int expected_count;
void **pslot;
@@ -259,6 +309,19 @@ static int migrate_page_move_mapping(str
}
/*
+ * In the async migration case of moving a page with buffers, lock the
+ * buffers using trylock before the mapping is moved. If the mapping
+ * was moved, we later failed to lock the buffers and could not move
+ * the mapping back due to an elevated page count, we would have to
+ * block waiting on other references to be dropped.
+ */
+ if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+ page_unfreeze_refs(page, expected_count);
+ spin_unlock_irq(&mapping->tree_lock);
+ return -EAGAIN;
+ }
+
+ /*
* Now we know that no one else is looking at the page.
*/
get_page(newpage); /* add cache reference */
@@ -415,13 +478,13 @@ EXPORT_SYMBOL(fail_migrate_page);
* Pages are locked upon entry and exit.
*/
int migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
int rc;
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
- rc = migrate_page_move_mapping(mapping, newpage, page);
+ rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
if (rc)
return rc;
@@ -438,28 +501,28 @@ EXPORT_SYMBOL(migrate_page);
* exist.
*/
int buffer_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
struct buffer_head *bh, *head;
int rc;
if (!page_has_buffers(page))
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
head = page_buffers(page);
- rc = migrate_page_move_mapping(mapping, newpage, page);
+ rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
if (rc)
return rc;
- bh = head;
- do {
- get_bh(bh);
- lock_buffer(bh);
- bh = bh->b_this_page;
-
- } while (bh != head);
+ /*
+ * In the async case, migrate_page_move_mapping locked the buffers
+ * with an IRQ-safe spinlock held. In the sync case, the buffers
+ * need to be locked now
+ */
+ if (sync)
+ BUG_ON(!buffer_migrate_lock_buffers(head, sync));
ClearPagePrivate(page);
set_page_private(newpage, page_private(page));
@@ -536,10 +599,13 @@ static int writeout(struct address_space
* Default handling if a filesystem does not provide a migration function.
*/
static int fallback_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
- if (PageDirty(page))
+ if (PageDirty(page)) {
+ if (!sync)
+ return -EBUSY;
return writeout(mapping, page);
+ }
/*
* Buffers may be managed in a filesystem specific way.
@@ -549,7 +615,7 @@ static int fallback_migrate_page(struct
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
}
/*
@@ -585,29 +651,18 @@ static int move_to_new_page(struct page
mapping = page_mapping(page);
if (!mapping)
- rc = migrate_page(mapping, newpage, page);
- else {
+ rc = migrate_page(mapping, newpage, page, sync);
+ else if (mapping->a_ops->migratepage)
/*
- * Do not writeback pages if !sync and migratepage is
- * not pointing to migrate_page() which is nonblocking
- * (swapcache/tmpfs uses migratepage = migrate_page).
+ * Most pages have a mapping and most filesystems provide a
+ * migratepage callback. Anonymous pages are part of swap
+ * space which also has its own migratepage callback. This
+ * is the most common path for page migration.
*/
- if (PageDirty(page) && !sync &&
- mapping->a_ops->migratepage != migrate_page)
- rc = -EBUSY;
- else if (mapping->a_ops->migratepage)
- /*
- * Most pages have a mapping and most filesystems
- * should provide a migration function. Anonymous
- * pages are part of swap space which also has its
- * own migration function. This is the most common
- * path for page migration.
- */
- rc = mapping->a_ops->migratepage(mapping,
- newpage, page);
- else
- rc = fallback_migrate_page(mapping, newpage, page);
- }
+ rc = mapping->a_ops->migratepage(mapping,
+ newpage, page, sync);
+ else
+ rc = fallback_migrate_page(mapping, newpage, page, sync);
if (rc) {
newpage->mapping = NULL;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 24/41] mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (22 preceding siblings ...)
2012-07-30 17:31 ` [ 23/41] mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 25/41] mm: compaction: make isolate_lru_page() filter-aware again Greg Kroah-Hartman
` (16 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Minchan Kim,
Rik van Riel, Andrea Arcangeli, Dave Jones, Jan Kara,
Andy Isaacson, Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream.
Stable note: Not tracked in Buzilla. This was part of a series that
reduced interactivity stalls experienced when THP was enabled.
If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed. For small high-orders, this has a
reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
to fail the allocation rather than stall the caller in direct reclaim.
This patch skips direct reclaim if compaction is deferred and the caller
specifies __GFP_NO_KSWAPD.
Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even in
cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/page_alloc.c | 45 +++++++++++++++++++++++++++++++++++----------
1 file changed, 35 insertions(+), 10 deletions(-)
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1897,14 +1897,20 @@ static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
- int migratetype, unsigned long *did_some_progress,
- bool sync_migration)
+ int migratetype, bool sync_migration,
+ bool *deferred_compaction,
+ unsigned long *did_some_progress)
{
struct page *page;
- if (!order || compaction_deferred(preferred_zone))
+ if (!order)
return NULL;
+ if (compaction_deferred(preferred_zone)) {
+ *deferred_compaction = true;
+ return NULL;
+ }
+
current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, sync_migration);
@@ -1932,7 +1938,13 @@ __alloc_pages_direct_compact(gfp_t gfp_m
* but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);
- defer_compaction(preferred_zone);
+
+ /*
+ * As async compaction considers a subset of pageblocks, only
+ * defer if the failure was a sync compaction failure.
+ */
+ if (sync_migration)
+ defer_compaction(preferred_zone);
cond_resched();
}
@@ -1944,8 +1956,9 @@ static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
- int migratetype, unsigned long *did_some_progress,
- bool sync_migration)
+ int migratetype, bool sync_migration,
+ bool *deferred_compaction,
+ unsigned long *did_some_progress)
{
return NULL;
}
@@ -2095,6 +2108,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
bool sync_migration = false;
+ bool deferred_compaction = false;
/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -2175,12 +2189,22 @@ rebalance:
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
- migratetype, &did_some_progress,
- sync_migration);
+ migratetype, sync_migration,
+ &deferred_compaction,
+ &did_some_progress);
if (page)
goto got_pg;
sync_migration = true;
+ /*
+ * If compaction is deferred for high-order allocations, it is because
+ * sync compaction recently failed. In this is the case and the caller
+ * has requested the system not be heavily disrupted, fail the
+ * allocation now instead of entering direct reclaim
+ */
+ if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+ goto nopage;
+
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
@@ -2243,8 +2267,9 @@ rebalance:
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
- migratetype, &did_some_progress,
- sync_migration);
+ migratetype, sync_migration,
+ &deferred_compaction,
+ &did_some_progress);
if (page)
goto got_pg;
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 25/41] mm: compaction: make isolate_lru_page() filter-aware again
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (23 preceding siblings ...)
2012-07-30 17:31 ` [ 24/41] mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 26/41] kswapd: avoid unnecessary rebalance after an unsuccessful balancing Greg Kroah-Hartman
` (15 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Rik van Riel,
Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit c82449352854ff09e43062246af86bdeb628f0c3 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
information by reducing LRU list churning had the side-effect of
reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
is was meaningless to pick the page and re-add it to the LRU list. This
had to be partially reverted because some dirty pages can be migrated by
compaction without blocking.
This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise LRU
disruption.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/mmzone.h | 2 ++
mm/compaction.c | 3 +++
mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++--
3 files changed, 38 insertions(+), 2 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,6 +166,8 @@ static inline int is_unevictable_lru(enu
#define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
/* Isolate unmapped file */
#define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x8)
+/* Isolate for asynchronous migration */
+#define ISOLATE_ASYNC_MIGRATE ((__force isolate_mode_t)0x10)
/* LRU Isolation modes. */
typedef unsigned __bitwise__ isolate_mode_t;
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -371,6 +371,9 @@ static isolate_migrate_t isolate_migrate
continue;
}
+ if (!cc->sync)
+ mode |= ISOLATE_ASYNC_MIGRATE;
+
/* Try isolate the page */
if (__isolate_lru_page(page, mode, 0) != 0)
continue;
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1045,8 +1045,39 @@ int __isolate_lru_page(struct page *page
ret = -EBUSY;
- if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
- return ret;
+ /*
+ * To minimise LRU disruption, the caller can indicate that it only
+ * wants to isolate pages it will be able to operate on without
+ * blocking - clean pages for the most part.
+ *
+ * ISOLATE_CLEAN means that only clean pages should be isolated. This
+ * is used by reclaim when it is cannot write to backing storage
+ *
+ * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
+ * that it is possible to migrate without blocking
+ */
+ if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
+ /* All the caller can do on PageWriteback is block */
+ if (PageWriteback(page))
+ return ret;
+
+ if (PageDirty(page)) {
+ struct address_space *mapping;
+
+ /* ISOLATE_CLEAN means only clean pages */
+ if (mode & ISOLATE_CLEAN)
+ return ret;
+
+ /*
+ * Only pages without mappings or that have a
+ * ->migratepage callback are possible to migrate
+ * without blocking
+ */
+ mapping = page_mapping(page);
+ if (mapping && !mapping->a_ops->migratepage)
+ return ret;
+ }
+ }
if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
return ret;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 26/41] kswapd: avoid unnecessary rebalance after an unsuccessful balancing
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (24 preceding siblings ...)
2012-07-30 17:31 ` [ 25/41] mm: compaction: make isolate_lru_page() filter-aware again Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 27/41] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping Greg Kroah-Hartman
` (14 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Alex Shi, Tim Chen, Mel Gorman,
Pádraig Brady, Rik van Riel, KOSAKI Motohiro
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Alex Shi <alex.shi@intel.com>
commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream.
Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
patch reduces kswapd CPU usage.
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing. But in the following scenario, kswapd do something that
is not matched our expectation. The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2844,7 +2844,9 @@ static void kswapd_try_to_sleep(pg_data_
static int kswapd(void *p)
{
unsigned long order, new_order;
+ unsigned balanced_order;
int classzone_idx, new_classzone_idx;
+ int balanced_classzone_idx;
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
@@ -2875,7 +2877,9 @@ static int kswapd(void *p)
set_freezable();
order = new_order = 0;
+ balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
+ balanced_classzone_idx = classzone_idx;
for ( ; ; ) {
int ret;
@@ -2884,7 +2888,8 @@ static int kswapd(void *p)
* new request of a similar or harder type will succeed soon
* so consider going to sleep on the basis we reclaimed at
*/
- if (classzone_idx >= new_classzone_idx && order == new_order) {
+ if (balanced_classzone_idx >= new_classzone_idx &&
+ balanced_order == new_order) {
new_order = pgdat->kswapd_max_order;
new_classzone_idx = pgdat->classzone_idx;
pgdat->kswapd_max_order = 0;
@@ -2899,7 +2904,8 @@ static int kswapd(void *p)
order = new_order;
classzone_idx = new_classzone_idx;
} else {
- kswapd_try_to_sleep(pgdat, order, classzone_idx);
+ kswapd_try_to_sleep(pgdat, balanced_order,
+ balanced_classzone_idx);
order = pgdat->kswapd_max_order;
classzone_idx = pgdat->classzone_idx;
pgdat->kswapd_max_order = 0;
@@ -2916,7 +2922,9 @@ static int kswapd(void *p)
*/
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
- order = balance_pgdat(pgdat, order, &classzone_idx);
+ balanced_classzone_idx = classzone_idx;
+ balanced_order = balance_pgdat(pgdat, order,
+ &balanced_classzone_idx);
}
}
return 0;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 27/41] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (25 preceding siblings ...)
2012-07-30 17:31 ` [ 26/41] kswapd: avoid unnecessary rebalance after an unsuccessful balancing Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 28/41] mm: compaction: introduce sync-light migration for use by compaction Greg Kroah-Hartman
` (13 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Alex Shi, Mel Gorman, Minchan Kim,
Pádraig Brady
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Alex Shi <alex.shi@intel.com>
commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream.
Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
patch reduces kswapd CPU usage.
There 2 places to read pgdat in kswapd. One is return from a successful
balance, another is waked up from kswapd sleeping. The new_order and
new_classzone_idx represent the balance input order and classzone_idx.
But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.
1: after a successful balance, kswapd goes to sleep, and new_order = 0;
new_classzone_idx = __MAX_NR_ZONES - 1;
2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL
3: in the balance_pgdat() running, a new balance wakeup happened with
order = 5, and classzone_idx = ZONE_NORMAL
4: the first wakeup(order = 3) finished successufly, return order = 3
but, the new_order is still 0, so, this balancing will be treated as a
failed balance. And then the second tighter balancing will be missed.
So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2908,6 +2908,8 @@ static int kswapd(void *p)
balanced_classzone_idx);
order = pgdat->kswapd_max_order;
classzone_idx = pgdat->classzone_idx;
+ new_order = order;
+ new_classzone_idx = classzone_idx;
pgdat->kswapd_max_order = 0;
pgdat->classzone_idx = pgdat->nr_zones - 1;
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 28/41] mm: compaction: introduce sync-light migration for use by compaction
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (26 preceding siblings ...)
2012-07-30 17:31 ` [ 27/41] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-31 16:43 ` Herton Ronaldo Krzesinski
2012-07-30 17:31 ` [ 29/41] mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available Greg Kroah-Hartman
` (12 subsequent siblings)
40 siblings, 1 reply; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Rik van Riel,
Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.
Stable note: Not tracked in Buzilla. This was part of a series that
reduced interactivity stalls experienced when THP was enabled.
These stalls were particularly noticable when copying data
to a USB stick but the experiences for users varied a lot.
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/btrfs/disk-io.c | 5 +--
fs/hugetlbfs/inode.c | 2 -
fs/nfs/internal.h | 2 -
fs/nfs/write.c | 4 +-
include/linux/fs.h | 6 ++-
include/linux/migrate.h | 23 +++++++++++---
mm/compaction.c | 2 -
mm/memory-failure.c | 2 -
mm/memory_hotplug.c | 2 -
mm/mempolicy.c | 2 -
mm/migrate.c | 78 ++++++++++++++++++++++++++----------------------
11 files changed, 76 insertions(+), 52 deletions(-)
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -801,7 +801,8 @@ static int btree_submit_bio_hook(struct
#ifdef CONFIG_MIGRATION
static int btree_migratepage(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
{
/*
* we can't safely write a btree page from here,
@@ -816,7 +817,7 @@ static int btree_migratepage(struct addr
if (page_has_private(page) &&
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page, sync);
+ return migrate_page(mapping, newpage, page, mode);
}
#endif
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -569,7 +569,7 @@ static int hugetlbfs_set_page_dirty(stru
static int hugetlbfs_migrate_page(struct address_space *mapping,
struct page *newpage, struct page *page,
- bool sync)
+ enum migrate_mode mode)
{
int rc;
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs
#ifdef CONFIG_MIGRATION
extern int nfs_migrate_page(struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *, enum migrate_mode);
#else
#define nfs_migrate_page NULL
#endif
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1662,7 +1662,7 @@ out_error:
#ifdef CONFIG_MIGRATION
int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
- struct page *page, bool sync)
+ struct page *page, enum migrate_mode mode)
{
/*
* If PagePrivate is set, then the page is currently associated with
@@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_spac
nfs_fscache_release_page(page, GFP_KERNEL);
- return migrate_page(mapping, newpage, page, sync);
+ return migrate_page(mapping, newpage, page, mode);
}
#endif
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -523,6 +523,7 @@ enum positive_aop_returns {
struct page;
struct address_space;
struct writeback_control;
+enum migrate_mode;
struct iov_iter {
const struct iovec *iov;
@@ -612,7 +613,7 @@ struct address_space_operations {
* is false, it must not block.
*/
int (*migratepage) (struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *, enum migrate_mode);
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
@@ -2481,7 +2482,8 @@ extern int generic_check_addressable(uns
#ifdef CONFIG_MIGRATION
extern int buffer_migrate_page(struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *,
+ enum migrate_mode);
#else
#define buffer_migrate_page NULL
#endif
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -6,18 +6,31 @@
typedef struct page *new_page_t(struct page *, unsigned long private, int **);
+/*
+ * MIGRATE_ASYNC means never block
+ * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
+ * on most operations but not ->writepage as the potential stall time
+ * is too significant
+ * MIGRATE_SYNC will block when migrating pages
+ */
+enum migrate_mode {
+ MIGRATE_ASYNC,
+ MIGRATE_SYNC_LIGHT,
+ MIGRATE_SYNC,
+};
+
#ifdef CONFIG_MIGRATION
#define PAGE_MIGRATION 1
extern void putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *, enum migrate_mode);
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync);
+ enum migrate_mode mode);
extern int migrate_huge_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync);
+ enum migrate_mode mode);
extern int fail_migrate_page(struct address_space *,
struct page *, struct page *);
@@ -36,10 +49,10 @@ extern int migrate_huge_page_move_mappin
static inline void putback_lru_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync) { return -ENOSYS; }
+ enum migrate_mode mode) { return -ENOSYS; }
static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync) { return -ENOSYS; }
+ enum migrate_mode mode) { return -ENOSYS; }
static inline int migrate_prep(void) { return -ENOSYS; }
static inline int migrate_prep_local(void) { return -ENOSYS; }
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -577,7 +577,7 @@ static int compact_zone(struct zone *zon
nr_migrate = cc->nr_migratepages;
err = migrate_pages(&cc->migratepages, compaction_alloc,
(unsigned long)cc, false,
- cc->sync);
+ cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
update_nr_listpages(cc);
nr_remaining = cc->nr_migratepages;
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1464,7 +1464,7 @@ int soft_offline_page(struct page *page,
page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
- 0, true);
+ 0, MIGRATE_SYNC);
if (ret) {
putback_lru_pages(&pagelist);
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -747,7 +747,7 @@ do_migrate_range(unsigned long start_pfn
}
/* this function returns # of failed pages */
ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
- true, true);
+ true, MIGRATE_SYNC);
if (ret)
putback_lru_pages(&source);
}
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -926,7 +926,7 @@ static int migrate_to_node(struct mm_str
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_node_page, dest,
- false, true);
+ false, MIGRATE_SYNC);
if (err)
putback_lru_pages(&pagelist);
}
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -222,12 +222,13 @@ out:
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
-static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+static bool buffer_migrate_lock_buffers(struct buffer_head *head,
+ enum migrate_mode mode)
{
struct buffer_head *bh = head;
/* Simple case, sync compaction */
- if (sync) {
+ if (mode != MIGRATE_ASYNC) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -263,7 +264,7 @@ static bool buffer_migrate_lock_buffers(
}
#else
static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
- bool sync)
+ enum migrate_mode mode)
{
return true;
}
@@ -279,7 +280,7 @@ static inline bool buffer_migrate_lock_b
*/
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
- struct buffer_head *head, bool sync)
+ struct buffer_head *head, enum migrate_mode mode)
{
int expected_count;
void **pslot;
@@ -315,7 +316,8 @@ static int migrate_page_move_mapping(str
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+ if (mode == MIGRATE_ASYNC && head &&
+ !buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
@@ -478,13 +480,14 @@ EXPORT_SYMBOL(fail_migrate_page);
* Pages are locked upon entry and exit.
*/
int migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
{
int rc;
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
- rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
+ rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
if (rc)
return rc;
@@ -501,17 +504,17 @@ EXPORT_SYMBOL(migrate_page);
* exist.
*/
int buffer_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page, enum migrate_mode mode)
{
struct buffer_head *bh, *head;
int rc;
if (!page_has_buffers(page))
- return migrate_page(mapping, newpage, page, sync);
+ return migrate_page(mapping, newpage, page, mode);
head = page_buffers(page);
- rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
+ rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
if (rc)
return rc;
@@ -521,8 +524,8 @@ int buffer_migrate_page(struct address_s
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (sync)
- BUG_ON(!buffer_migrate_lock_buffers(head, sync));
+ if (mode != MIGRATE_ASYNC)
+ BUG_ON(!buffer_migrate_lock_buffers(head, mode));
ClearPagePrivate(page);
set_page_private(newpage, page_private(page));
@@ -599,10 +602,11 @@ static int writeout(struct address_space
* Default handling if a filesystem does not provide a migration function.
*/
static int fallback_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page, enum migrate_mode mode)
{
if (PageDirty(page)) {
- if (!sync)
+ /* Only writeback pages in full synchronous migration */
+ if (mode != MIGRATE_SYNC)
return -EBUSY;
return writeout(mapping, page);
}
@@ -615,7 +619,7 @@ static int fallback_migrate_page(struct
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page, sync);
+ return migrate_page(mapping, newpage, page, mode);
}
/*
@@ -630,7 +634,7 @@ static int fallback_migrate_page(struct
* == 0 - success
*/
static int move_to_new_page(struct page *newpage, struct page *page,
- int remap_swapcache, bool sync)
+ int remap_swapcache, enum migrate_mode mode)
{
struct address_space *mapping;
int rc;
@@ -651,7 +655,7 @@ static int move_to_new_page(struct page
mapping = page_mapping(page);
if (!mapping)
- rc = migrate_page(mapping, newpage, page, sync);
+ rc = migrate_page(mapping, newpage, page, mode);
else if (mapping->a_ops->migratepage)
/*
* Most pages have a mapping and most filesystems provide a
@@ -660,9 +664,9 @@ static int move_to_new_page(struct page
* is the most common path for page migration.
*/
rc = mapping->a_ops->migratepage(mapping,
- newpage, page, sync);
+ newpage, page, mode);
else
- rc = fallback_migrate_page(mapping, newpage, page, sync);
+ rc = fallback_migrate_page(mapping, newpage, page, mode);
if (rc) {
newpage->mapping = NULL;
@@ -677,7 +681,7 @@ static int move_to_new_page(struct page
}
static int __unmap_and_move(struct page *page, struct page *newpage,
- int force, bool offlining, bool sync)
+ int force, bool offlining, enum migrate_mode mode)
{
int rc = -EAGAIN;
int remap_swapcache = 1;
@@ -686,7 +690,7 @@ static int __unmap_and_move(struct page
struct anon_vma *anon_vma = NULL;
if (!trylock_page(page)) {
- if (!force || !sync)
+ if (!force || mode == MIGRATE_ASYNC)
goto out;
/*
@@ -732,10 +736,12 @@ static int __unmap_and_move(struct page
if (PageWriteback(page)) {
/*
- * For !sync, there is no point retrying as the retry loop
- * is expected to be too short for PageWriteback to be cleared
+ * Only in the case of a full syncronous migration is it
+ * necessary to wait for PageWriteback. In the async case,
+ * the retry loop is too short and in the sync-light case,
+ * the overhead of stalling is too much
*/
- if (!sync) {
+ if (mode != MIGRATE_SYNC) {
rc = -EBUSY;
goto uncharge;
}
@@ -806,7 +812,7 @@ static int __unmap_and_move(struct page
skip_unmap:
if (!page_mapped(page))
- rc = move_to_new_page(newpage, page, remap_swapcache, sync);
+ rc = move_to_new_page(newpage, page, remap_swapcache, mode);
if (rc && remap_swapcache)
remove_migration_ptes(page, page);
@@ -829,7 +835,8 @@ out:
* to the newly allocated page in newpage.
*/
static int unmap_and_move(new_page_t get_new_page, unsigned long private,
- struct page *page, int force, bool offlining, bool sync)
+ struct page *page, int force, bool offlining,
+ enum migrate_mode mode)
{
int rc = 0;
int *result = NULL;
@@ -847,7 +854,7 @@ static int unmap_and_move(new_page_t get
if (unlikely(split_huge_page(page)))
goto out;
- rc = __unmap_and_move(page, newpage, force, offlining, sync);
+ rc = __unmap_and_move(page, newpage, force, offlining, mode);
out:
if (rc != -EAGAIN) {
/*
@@ -895,7 +902,8 @@ out:
*/
static int unmap_and_move_huge_page(new_page_t get_new_page,
unsigned long private, struct page *hpage,
- int force, bool offlining, bool sync)
+ int force, bool offlining,
+ enum migrate_mode mode)
{
int rc = 0;
int *result = NULL;
@@ -908,7 +916,7 @@ static int unmap_and_move_huge_page(new_
rc = -EAGAIN;
if (!trylock_page(hpage)) {
- if (!force || !sync)
+ if (!force || mode != MIGRATE_SYNC)
goto out;
lock_page(hpage);
}
@@ -919,7 +927,7 @@ static int unmap_and_move_huge_page(new_
try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
if (!page_mapped(hpage))
- rc = move_to_new_page(new_hpage, hpage, 1, sync);
+ rc = move_to_new_page(new_hpage, hpage, 1, mode);
if (rc)
remove_migration_ptes(hpage, hpage);
@@ -962,7 +970,7 @@ out:
*/
int migrate_pages(struct list_head *from,
new_page_t get_new_page, unsigned long private, bool offlining,
- bool sync)
+ enum migrate_mode mode)
{
int retry = 1;
int nr_failed = 0;
@@ -983,7 +991,7 @@ int migrate_pages(struct list_head *from
rc = unmap_and_move(get_new_page, private,
page, pass > 2, offlining,
- sync);
+ mode);
switch(rc) {
case -ENOMEM:
@@ -1013,7 +1021,7 @@ out:
int migrate_huge_pages(struct list_head *from,
new_page_t get_new_page, unsigned long private, bool offlining,
- bool sync)
+ enum migrate_mode mode)
{
int retry = 1;
int nr_failed = 0;
@@ -1030,7 +1038,7 @@ int migrate_huge_pages(struct list_head
rc = unmap_and_move_huge_page(get_new_page,
private, page, pass > 2, offlining,
- sync);
+ mode);
switch(rc) {
case -ENOMEM:
@@ -1159,7 +1167,7 @@ set_status:
err = 0;
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_page_node,
- (unsigned long)pm, 0, true);
+ (unsigned long)pm, 0, MIGRATE_SYNC);
if (err)
putback_lru_pages(&pagelist);
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 29/41] mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (27 preceding siblings ...)
2012-07-30 17:31 ` [ 28/41] mm: compaction: introduce sync-light migration for use by compaction Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 30/41] mm: vmscan: do not OOM if aborting reclaim to start compaction Greg Kroah-Hartman
` (11 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Rik van Riel,
Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream.
Stable note: Not tracked on Bugzilla. THP and compaction was found to
aggressively reclaim pages and stall systems under different
situations that was addressed piecemeal over time. This patch
addresses a problem where the fix regressed THP allocation
success rates.
In commit e0887c19 ("vmscan: limit direct reclaim for higher order
allocations"), Rik noted that reclaim was too aggressive when THP was
enabled. In his initial patch he used the number of free pages to decide
if reclaim should abort for compaction. My feedback was that reclaim and
compaction should be using the same logic when deciding if reclaim should
be aborted.
Unfortunately, this had the effect of reducing THP success rates when the
workload included something like streaming reads that continually
allocated pages. The window during which compaction could run and return
a THP was too small.
This patch combines Rik's two patches together. compaction_suitable() is
still used to decide if reclaim should be aborted to allow compaction is
used. However, it will also ensure that there is a reasonable buffer of
free pages available. This improves upon the THP allocation success rates
but bounds the number of pages that are freed for compaction.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 44 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 39 insertions(+), 5 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2075,6 +2075,42 @@ restart:
throttle_vm_writeout(sc->gfp_mask);
}
+/* Returns true if compaction should go ahead for a high-order request */
+static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
+{
+ unsigned long balance_gap, watermark;
+ bool watermark_ok;
+
+ /* Do not consider compaction for orders reclaim is meant to satisfy */
+ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER)
+ return false;
+
+ /*
+ * Compaction takes time to run and there are potentially other
+ * callers using the pages just freed. Continue reclaiming until
+ * there is a buffer of free pages available to give compaction
+ * a reasonable chance of completing and allocating the page
+ */
+ balance_gap = min(low_wmark_pages(zone),
+ (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+ KSWAPD_ZONE_BALANCE_GAP_RATIO);
+ watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
+ watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+
+ /*
+ * If compaction is deferred, reclaim up to a point where
+ * compaction will have a chance of success when re-enabled
+ */
+ if (compaction_deferred(zone))
+ return watermark_ok;
+
+ /* If compaction is not ready to start, keep reclaiming */
+ if (!compaction_suitable(zone, sc->order))
+ return false;
+
+ return watermark_ok;
+}
+
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2092,8 +2128,8 @@ restart:
* scan then give up on it.
*
* This function returns true if a zone is being reclaimed for a costly
- * high-order allocation and compaction is either ready to begin or deferred.
- * This indicates to the caller that it should retry the allocation or fail.
+ * high-order allocation and compaction is ready to begin. This indicates to
+ * the caller that it should retry the allocation or fail.
*/
static bool shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
@@ -2127,9 +2163,7 @@ static bool shrink_zones(int priority, s
* noticable problem, like transparent huge page
* allocations.
*/
- if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
- (compaction_suitable(zone, sc->order) ||
- compaction_deferred(zone))) {
+ if (compaction_ready(zone, sc)) {
should_abort_reclaim = true;
continue;
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 30/41] mm: vmscan: do not OOM if aborting reclaim to start compaction
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (28 preceding siblings ...)
2012-07-30 17:31 ` [ 29/41] mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 31/41] mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone Greg Kroah-Hartman
` (10 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Rik van Riel,
Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but otherwise has little to justify it. The
problem it fixes was never observed but the source of the
theoretical problem did not exist for very long.
During direct reclaim it is possible that reclaim will be aborted so that
compaction can be attempted to satisfy a high-order allocation. If this
decision is made before any pages are reclaimed, it is possible that 0 is
returned to the page allocator potentially triggering an OOM. This has
not been observed but it is a possibility so this patch addresses it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2240,6 +2240,7 @@ static unsigned long do_try_to_free_page
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
+ bool should_abort_reclaim;
get_mems_allowed();
delayacct_freepages_start();
@@ -2251,7 +2252,8 @@ static unsigned long do_try_to_free_page
sc->nr_scanned = 0;
if (!priority)
disable_swap_token(sc->mem_cgroup);
- if (shrink_zones(priority, zonelist, sc))
+ should_abort_reclaim = shrink_zones(priority, zonelist, sc);
+ if (should_abort_reclaim)
break;
/*
@@ -2318,6 +2320,10 @@ out:
if (oom_killer_disabled)
return 0;
+ /* Aborting reclaim to try compaction? don't OOM, then */
+ if (should_abort_reclaim)
+ return 1;
+
/* top priority shrink_zones still had more to do? don't OOM, then */
if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
return 1;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 31/41] mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (29 preceding siblings ...)
2012-07-30 17:31 ` [ 30/41] mm: vmscan: do not OOM if aborting reclaim to start compaction Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 32/41] vmscan: promote shared file mapped pages Greg Kroah-Hartman
` (9 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Rik van Riel,
Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Nai Xia, Johannes Weiner
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.
Stable note: Not tracked on Bugzilla. THP and compaction was found to
aggressively reclaim pages and stall systems under different
situations that was addressed piecemeal over time.
If compaction can proceed for a given zone, shrink_zones() does not
reclaim any more pages from it. After commit [e0c2327: vmscan: abort
reclaim/compaction if compaction can proceed], do_try_to_free_pages()
tries to finish as soon as possible once one zone can compact.
This was intended to prevent slabs being shrunk unnecessarily but there
are side-effects. One is that a small zone that is ready for compaction
will abort reclaim even if the chances of successfully allocating a THP
from that zone is small. It also means that reclaim can return too early
even though sc->nr_to_reclaim pages were not reclaimed.
This partially reverts the commit until it is proven that slabs are really
being shrunk unnecessarily but preserves the check to return 1 to avoid
OOM if reclaim was aborted prematurely.
[aarcange@redhat.com: This patch replaces a revert from Andrea]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2129,7 +2129,8 @@ static inline bool compaction_ready(stru
*
* This function returns true if a zone is being reclaimed for a costly
* high-order allocation and compaction is ready to begin. This indicates to
- * the caller that it should retry the allocation or fail.
+ * the caller that it should consider retrying the allocation instead of
+ * further reclaim.
*/
static bool shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
@@ -2138,7 +2139,7 @@ static bool shrink_zones(int priority, s
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
- bool should_abort_reclaim = false;
+ bool aborted_reclaim = false;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2164,7 +2165,7 @@ static bool shrink_zones(int priority, s
* allocations.
*/
if (compaction_ready(zone, sc)) {
- should_abort_reclaim = true;
+ aborted_reclaim = true;
continue;
}
}
@@ -2186,7 +2187,7 @@ static bool shrink_zones(int priority, s
shrink_zone(priority, zone, sc);
}
- return should_abort_reclaim;
+ return aborted_reclaim;
}
static bool zone_reclaimable(struct zone *zone)
@@ -2240,7 +2241,7 @@ static unsigned long do_try_to_free_page
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
- bool should_abort_reclaim;
+ bool aborted_reclaim;
get_mems_allowed();
delayacct_freepages_start();
@@ -2252,9 +2253,7 @@ static unsigned long do_try_to_free_page
sc->nr_scanned = 0;
if (!priority)
disable_swap_token(sc->mem_cgroup);
- should_abort_reclaim = shrink_zones(priority, zonelist, sc);
- if (should_abort_reclaim)
- break;
+ aborted_reclaim = shrink_zones(priority, zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
@@ -2320,8 +2319,8 @@ out:
if (oom_killer_disabled)
return 0;
- /* Aborting reclaim to try compaction? don't OOM, then */
- if (should_abort_reclaim)
+ /* Aborted reclaim to try compaction? don't OOM, then */
+ if (aborted_reclaim)
return 1;
/* top priority shrink_zones still had more to do? don't OOM, then */
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 32/41] vmscan: promote shared file mapped pages
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (30 preceding siblings ...)
2012-07-30 17:31 ` [ 31/41] mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 33/41] vmscan: activate executable pages after first usage Greg Kroah-Hartman
` (8 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Konstantin Khlebnikov,
Pekka Enberg, Minchan Kim, KAMEZAWA Hiroyuki, Wu Fengguang,
Johannes Weiner, Nick Piggin, Mel Gorman, Shaohua Li,
Rik van Riel, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Konstantin Khlebnikov <khlebnikov@openvz.org>
commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream.
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time. The specific workload being
addressed here in described in paragraph four and while paragraph
five says it did not help performance as such, it made a difference
to major page faults. I'm aware of at least one bug for a large
vendor that was due to increased major faults.
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file
pages. Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.
Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access. Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.
After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.
I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18). There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory. In
this situation major-pagefaults are very costly, because all containers
share the same page. In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)
These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.
Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -723,7 +723,7 @@ static enum page_references page_check_r
*/
SetPageReferenced(page);
- if (referenced_page)
+ if (referenced_page || referenced_ptes > 1)
return PAGEREF_ACTIVATE;
return PAGEREF_KEEP;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 33/41] vmscan: activate executable pages after first usage
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (31 preceding siblings ...)
2012-07-30 17:31 ` [ 32/41] vmscan: promote shared file mapped pages Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 34/41] mm/vmscan.c: consider swap space when deciding whether to continue reclaim Greg Kroah-Hartman
` (7 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Konstantin Khlebnikov,
Pekka Enberg, Minchan Kim, KAMEZAWA Hiroyuki, Wu Fengguang,
Johannes Weiner, Nick Piggin, Mel Gorman, Shaohua Li,
Rik van Riel, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Konstantin Khlebnikov <khlebnikov@openvz.org>
commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream.
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time.
Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").
Currently these pages can become "first class citizens" only after second
usage. After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -726,6 +726,12 @@ static enum page_references page_check_r
if (referenced_page || referenced_ptes > 1)
return PAGEREF_ACTIVATE;
+ /*
+ * Activate file-backed executable pages after first usage.
+ */
+ if (vm_flags & VM_EXEC)
+ return PAGEREF_ACTIVATE;
+
return PAGEREF_KEEP;
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 34/41] mm/vmscan.c: consider swap space when deciding whether to continue reclaim
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (32 preceding siblings ...)
2012-07-30 17:31 ` [ 33/41] vmscan: activate executable pages after first usage Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 35/41] mm: test PageSwapBacked in lumpy reclaim Greg Kroah-Hartman
` (6 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Minchan Kim, KOSAKI Motohiro,
Mel Gorman, Rik van Riel, Johannes Weiner, Andrea Arcangeli
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Minchan Kim <minchan@kernel.org>
commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream.
Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
usage on swapless systems with high anonymous memory usage.
It's pointless to continue reclaiming when we have no swap space and lots
of anon pages in the inactive list.
Without this patch, it is possible when swap is disabled to continue
trying to reclaim when there are only anonymous pages in the system even
though that will not make any progress.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2008,8 +2008,9 @@ static inline bool should_continue_recla
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
- inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON) +
- zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+ inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+ if (nr_swap_pages > 0)
+ inactive_lru_pages += zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
inactive_lru_pages > pages_for_compaction)
return true;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 35/41] mm: test PageSwapBacked in lumpy reclaim
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (33 preceding siblings ...)
2012-07-30 17:31 ` [ 34/41] mm/vmscan.c: consider swap space when deciding whether to continue reclaim Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 36/41] mm: vmscan: convert global reclaim to per-memcg LRU lists Greg Kroah-Hartman
` (5 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Hugh Dickins, KOSAKI Motohiro,
Minchan Kim, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Hugh Dickins <hughd@google.com>
commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream.
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time. Even though the subject
refers to lumpy reclaim, it impacts compaction as well.
Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1199,7 +1199,7 @@ static unsigned long isolate_lru_pages(u
* anon page which don't already have a swap slot is
* pointless.
*/
- if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
+ if (nr_swap_pages <= 0 && PageSwapBacked(cursor_page) &&
!PageSwapCache(cursor_page))
break;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 36/41] mm: vmscan: convert global reclaim to per-memcg LRU lists
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (34 preceding siblings ...)
2012-07-30 17:31 ` [ 35/41] mm: test PageSwapBacked in lumpy reclaim Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 37/41] cpusets: avoid looping when storing to mems_allowed if one node remains set Greg Kroah-Hartman
` (4 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Greg KH, torvalds, akpm, alan, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johannes Weiner <jweiner@redhat.com>
commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch.
Stable note: Not tracked in Bugzilla. This is a partial backport of an
upstream commit addressing a completely different issue
that accidentally contained an important fix. The workload
this patch helps was memcached when IO is started in the
background. memcached should stay resident but without this patch
it gets swapped. Sometimes this manifests as a drop in throughput
but mostly it was observed through /proc/vmstat.
Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant
to fix a problem whereby small scan targets on memcg were ignored causing
priority to raise too sharply. It forced scanning to take place if the
target was small, memcg or kswapd.
>From the time it was introduced it caused excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
convert global reclaim to per-memcg LRU lists] by making it harder for
kswapd to force scan small targets but that patchset is not suitable for
backporting. This was later changed again by commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] into a format that looks
like it would be a straight-forward backport but there is a subtle
difference due to the use of lruvecs.
The impact of the accidental fix is to make it harder for kswapd to force
scan small targets by taking zone->all_unreclaimable into account. This
patch is the closest equivalent available based on what is backported.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1850,7 +1850,8 @@ static void get_scan_count(struct zone *
unsigned long nr_force_scan[2];
/* kswapd does zone balancing and needs to scan this zone */
- if (scanning_global_lru(sc) && current_is_kswapd())
+ if (scanning_global_lru(sc) && current_is_kswapd() &&
+ zone->all_unreclaimable)
force_scan = true;
/* memcg may have small limit and need to avoid priority drop */
if (!scanning_global_lru(sc))
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 37/41] cpusets: avoid looping when storing to mems_allowed if one node remains set
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (35 preceding siblings ...)
2012-07-30 17:31 ` [ 36/41] mm: vmscan: convert global reclaim to per-memcg LRU lists Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 38/41] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask Greg Kroah-Hartman
` (3 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, David Rientjes, Miao Xie,
KOSAKI Motohiro, Nick Piggin, Paul Menage, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: David Rientjes <rientjes@google.com>
commit 89e8a244b97e48f1f30e898b6f32acca477f2a13 upstream.
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is
extremely expensive and severely impacted page allocator performance.
This is part of a series of patches that reduce page allocator
overhead.
{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.
This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself. It is not atypical to see writes to
cpuset.mems take over 2 seconds to complete, for example. In low memory
conditions, this is problematic because it's one of the most imporant
times to change cpuset.mems in the first place!
The only way a task's set of allowable nodes may change is through cpusets
by writing to cpuset.mems and when attaching a task to a generic code is
not reading the nodemask with get_mems_allowed() at the same time, and
then clearing all the old nodes. This prevents the possibility that a
reader will see an empty nodemask at the same time the writer is storing a
new nodemask.
If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes. Changing a task's
nodemask is protected by cgroup_mutex so it's guaranteed that two threads
are not changing the same task's nodemask at the same time, so the
nodemask is guaranteed to be stored before another thread changes it and
determines whether a node remains set or not.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Paul Menage <paul@paulmenage.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
kernel/cpuset.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,6 +949,8 @@ static void cpuset_migrate_mm(struct mm_
static void cpuset_change_task_nodemask(struct task_struct *tsk,
nodemask_t *newmems)
{
+ bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+
repeat:
/*
* Allow tasks that have access to memory reserves because they have
@@ -963,7 +965,6 @@ repeat:
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
-
/*
* ensure checking ->mems_allowed_change_disable after setting all new
* allowed nodes.
@@ -980,9 +981,11 @@ repeat:
/*
* Allocation of memory is very fast, we needn't sleep when waiting
- * for the read-side.
+ * for the read-side. No wait is necessary, however, if at least one
+ * node remains unchanged.
*/
- while (ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+ while (masks_disjoint &&
+ ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
task_unlock(tsk);
if (!task_curr(tsk))
yield();
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 38/41] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (36 preceding siblings ...)
2012-07-30 17:31 ` [ 37/41] cpusets: avoid looping when storing to mems_allowed if one node remains set Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 39/41] cpuset: mm: reduce large amounts of memory barrier related damage v3 Greg Kroah-Hartman
` (2 subsequent siblings)
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Miao Xie, David Rientjes,
KOSAKI Motohiro, Paul Menage, Stephen Rothwell, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: David Rientjes <rientjes@google.com>
commit b246272ecc5ac68c743b15c9e41a2275f7ce70e2 upstream.
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This is
part of a series of patches that reduce page allocator overhead.
Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes where the two nodemasks, as a result of
the remap, are now disjoint.
c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread. This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.
This stall is unncessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes. This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind. To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.
This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.
Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change. This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
kernel/cpuset.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -123,6 +123,19 @@ static inline struct cpuset *task_cs(str
struct cpuset, css);
}
+#ifdef CONFIG_NUMA
+static inline bool task_has_mempolicy(struct task_struct *task)
+{
+ return task->mempolicy;
+}
+#else
+static inline bool task_has_mempolicy(struct task_struct *task)
+{
+ return false;
+}
+#endif
+
+
/* bits in struct cpuset flags field */
typedef enum {
CS_CPU_EXCLUSIVE,
@@ -949,7 +962,7 @@ static void cpuset_migrate_mm(struct mm_
static void cpuset_change_task_nodemask(struct task_struct *tsk,
nodemask_t *newmems)
{
- bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+ bool need_loop;
repeat:
/*
@@ -962,6 +975,14 @@ repeat:
return;
task_lock(tsk);
+ /*
+ * Determine if a loop is necessary if another thread is doing
+ * get_mems_allowed(). If at least one node remains unchanged and
+ * tsk does not have a mempolicy, then an empty nodemask will not be
+ * possible when mems_allowed is larger than a word.
+ */
+ need_loop = task_has_mempolicy(tsk) ||
+ !nodes_intersects(*newmems, tsk->mems_allowed);
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
@@ -981,11 +1002,9 @@ repeat:
/*
* Allocation of memory is very fast, we needn't sleep when waiting
- * for the read-side. No wait is necessary, however, if at least one
- * node remains unchanged.
+ * for the read-side.
*/
- while (masks_disjoint &&
- ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+ while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
task_unlock(tsk);
if (!task_curr(tsk))
yield();
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 39/41] cpuset: mm: reduce large amounts of memory barrier related damage v3
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (37 preceding siblings ...)
2012-07-30 17:31 ` [ 38/41] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 40/41] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma Greg Kroah-Hartman
2012-07-30 17:31 ` [ 41/41] vmscan: fix initial shrinker size handling Greg Kroah-Hartman
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Mel Gorman, Miao Xie,
David Rientjes, Peter Zijlstra, Christoph Lameter
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This
is part of a series of patches that reduce page allocator overhead.
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/cpuset.h | 51 +++++++++++++++++++---------------------------
include/linux/init_task.h | 8 +++++++
include/linux/sched.h | 2 -
kernel/cpuset.c | 43 +++++++-------------------------------
kernel/fork.c | 3 ++
mm/filemap.c | 11 ++++++---
mm/hugetlb.c | 15 +++++++++----
mm/mempolicy.c | 28 ++++++++++++++++++-------
mm/page_alloc.c | 33 ++++++++++++++++++++---------
mm/slab.c | 13 +++++++----
mm/slub.c | 36 ++++++++++++++++++++------------
mm/vmscan.c | 2 -
12 files changed, 135 insertions(+), 110 deletions(-)
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -89,42 +89,33 @@ extern void rebuild_sched_domains(void);
extern void cpuset_print_task_mems_allowed(struct task_struct *p);
/*
- * reading current mems_allowed and mempolicy in the fastpath must protected
- * by get_mems_allowed()
+ * get_mems_allowed is required when making decisions involving mems_allowed
+ * such as during page allocation. mems_allowed can be updated in parallel
+ * and depending on the new value an operation can fail potentially causing
+ * process failure. A retry loop with get_mems_allowed and put_mems_allowed
+ * prevents these artificial failures.
*/
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
{
- current->mems_allowed_change_disable++;
+ return read_seqcount_begin(¤t->mems_allowed_seq);
+}
- /*
- * ensure that reading mems_allowed and mempolicy happens after the
- * update of ->mems_allowed_change_disable.
- *
- * the write-side task finds ->mems_allowed_change_disable is not 0,
- * and knows the read-side task is reading mems_allowed or mempolicy,
- * so it will clear old bits lazily.
- */
- smp_mb();
-}
-
-static inline void put_mems_allowed(void)
-{
- /*
- * ensure that reading mems_allowed and mempolicy before reducing
- * mems_allowed_change_disable.
- *
- * the write-side task will know that the read-side task is still
- * reading mems_allowed or mempolicy, don't clears old bits in the
- * nodemask.
- */
- smp_mb();
- --ACCESS_ONCE(current->mems_allowed_change_disable);
+/*
+ * If this returns false, the operation that took place after get_mems_allowed
+ * may have failed. It is up to the caller to retry the operation if
+ * appropriate.
+ */
+static inline bool put_mems_allowed(unsigned int seq)
+{
+ return !read_seqcount_retry(¤t->mems_allowed_seq, seq);
}
static inline void set_mems_allowed(nodemask_t nodemask)
{
task_lock(current);
+ write_seqcount_begin(¤t->mems_allowed_seq);
current->mems_allowed = nodemask;
+ write_seqcount_end(¤t->mems_allowed_seq);
task_unlock(current);
}
@@ -234,12 +225,14 @@ static inline void set_mems_allowed(node
{
}
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
{
+ return 0;
}
-static inline void put_mems_allowed(void)
+static inline bool put_mems_allowed(unsigned int seq)
{
+ return true;
}
#endif /* !CONFIG_CPUSETS */
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -30,6 +30,13 @@ extern struct fs_struct init_fs;
#define INIT_THREADGROUP_FORK_LOCK(sig)
#endif
+#ifdef CONFIG_CPUSETS
+#define INIT_CPUSET_SEQ \
+ .mems_allowed_seq = SEQCNT_ZERO,
+#else
+#define INIT_CPUSET_SEQ
+#endif
+
#define INIT_SIGNALS(sig) { \
.nr_threads = 1, \
.wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -193,6 +200,7 @@ extern struct cred init_cred;
INIT_FTRACE_GRAPH \
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
+ INIT_CPUSET_SEQ \
}
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1484,7 +1484,7 @@ struct task_struct {
#endif
#ifdef CONFIG_CPUSETS
nodemask_t mems_allowed; /* Protected by alloc_lock */
- int mems_allowed_change_disable;
+ seqcount_t mems_allowed_seq; /* Seqence no to catch updates */
int cpuset_mem_spread_rotor;
int cpuset_slab_spread_rotor;
#endif
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(
{
bool need_loop;
-repeat:
/*
* Allow tasks that have access to memory reserves because they have
* been OOM killed to get memory anywhere.
@@ -983,45 +982,19 @@ repeat:
*/
need_loop = task_has_mempolicy(tsk) ||
!nodes_intersects(*newmems, tsk->mems_allowed);
- nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
- mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
- /*
- * ensure checking ->mems_allowed_change_disable after setting all new
- * allowed nodes.
- *
- * the read-side task can see an nodemask with new allowed nodes and
- * old allowed nodes. and if it allocates page when cpuset clears newly
- * disallowed ones continuous, it can see the new allowed bits.
- *
- * And if setting all new allowed nodes is after the checking, setting
- * all new allowed nodes and clearing newly disallowed ones will be done
- * continuous, and the read-side task may find no node to alloc page.
- */
- smp_mb();
-
- /*
- * Allocation of memory is very fast, we needn't sleep when waiting
- * for the read-side.
- */
- while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
- task_unlock(tsk);
- if (!task_curr(tsk))
- yield();
- goto repeat;
- }
+ if (need_loop)
+ write_seqcount_begin(&tsk->mems_allowed_seq);
- /*
- * ensure checking ->mems_allowed_change_disable before clearing all new
- * disallowed nodes.
- *
- * if clearing newly disallowed bits before the checking, the read-side
- * task may find no node to alloc page.
- */
- smp_mb();
+ nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
+ mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
tsk->mems_allowed = *newmems;
+
+ if (need_loop)
+ write_seqcount_end(&tsk->mems_allowed_seq);
+
task_unlock(tsk);
}
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -985,6 +985,9 @@ static int copy_signal(unsigned long clo
#ifdef CONFIG_CGROUPS
init_rwsem(&sig->threadgroup_fork_lock);
#endif
+#ifdef CONFIG_CPUSETS
+ seqcount_init(&tsk->mems_allowed_seq);
+#endif
sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -516,10 +516,13 @@ struct page *__page_cache_alloc(gfp_t gf
struct page *page;
if (cpuset_do_page_mem_spread()) {
- get_mems_allowed();
- n = cpuset_mem_spread_node();
- page = alloc_pages_exact_node(n, gfp, 0);
- put_mems_allowed();
+ unsigned int cpuset_mems_cookie;
+ do {
+ cpuset_mems_cookie = get_mems_allowed();
+ n = cpuset_mem_spread_node();
+ page = alloc_pages_exact_node(n, gfp, 0);
+ } while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+
return page;
}
return alloc_pages(gfp, 0);
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -454,14 +454,16 @@ static struct page *dequeue_huge_page_vm
struct vm_area_struct *vma,
unsigned long address, int avoid_reserve)
{
- struct page *page = NULL;
+ struct page *page;
struct mempolicy *mpol;
nodemask_t *nodemask;
struct zonelist *zonelist;
struct zone *zone;
struct zoneref *z;
+ unsigned int cpuset_mems_cookie;
- get_mems_allowed();
+retry_cpuset:
+ cpuset_mems_cookie = get_mems_allowed();
zonelist = huge_zonelist(vma, address,
htlb_alloc_mask, &mpol, &nodemask);
/*
@@ -488,10 +490,15 @@ static struct page *dequeue_huge_page_vm
}
}
}
-err:
+
mpol_cond_put(mpol);
- put_mems_allowed();
+ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ goto retry_cpuset;
return page;
+
+err:
+ mpol_cond_put(mpol);
+ return NULL;
}
static void update_and_free_page(struct hstate *h, struct page *page)
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1810,18 +1810,24 @@ struct page *
alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, int node)
{
- struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct mempolicy *pol;
struct zonelist *zl;
struct page *page;
+ unsigned int cpuset_mems_cookie;
+
+retry_cpuset:
+ pol = get_vma_policy(current, vma, addr);
+ cpuset_mems_cookie = get_mems_allowed();
- get_mems_allowed();
if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
mpol_cond_put(pol);
page = alloc_page_interleave(gfp, order, nid);
- put_mems_allowed();
+ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ goto retry_cpuset;
+
return page;
}
zl = policy_zonelist(gfp, pol, node);
@@ -1832,7 +1838,8 @@ alloc_pages_vma(gfp_t gfp, int order, st
struct page *page = __alloc_pages_nodemask(gfp, order,
zl, policy_nodemask(gfp, pol));
__mpol_put(pol);
- put_mems_allowed();
+ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ goto retry_cpuset;
return page;
}
/*
@@ -1840,7 +1847,8 @@ alloc_pages_vma(gfp_t gfp, int order, st
*/
page = __alloc_pages_nodemask(gfp, order, zl,
policy_nodemask(gfp, pol));
- put_mems_allowed();
+ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ goto retry_cpuset;
return page;
}
@@ -1867,11 +1875,14 @@ struct page *alloc_pages_current(gfp_t g
{
struct mempolicy *pol = current->mempolicy;
struct page *page;
+ unsigned int cpuset_mems_cookie;
if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
pol = &default_policy;
- get_mems_allowed();
+retry_cpuset:
+ cpuset_mems_cookie = get_mems_allowed();
+
/*
* No reference counting needed for current->mempolicy
* nor system default_policy
@@ -1882,7 +1893,10 @@ struct page *alloc_pages_current(gfp_t g
page = __alloc_pages_nodemask(gfp, order,
policy_zonelist(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));
- put_mems_allowed();
+
+ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ goto retry_cpuset;
+
return page;
}
EXPORT_SYMBOL(alloc_pages_current);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2293,8 +2293,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
struct zone *preferred_zone;
- struct page *page;
+ struct page *page = NULL;
int migratetype = allocflags_to_migratetype(gfp_mask);
+ unsigned int cpuset_mems_cookie;
gfp_mask &= gfp_allowed_mask;
@@ -2313,15 +2314,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;
- get_mems_allowed();
+retry_cpuset:
+ cpuset_mems_cookie = get_mems_allowed();
+
/* The preferred zone is used for statistics later */
first_zones_zonelist(zonelist, high_zoneidx,
nodemask ? : &cpuset_current_mems_allowed,
&preferred_zone);
- if (!preferred_zone) {
- put_mems_allowed();
- return NULL;
- }
+ if (!preferred_zone)
+ goto out;
/* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
@@ -2331,9 +2332,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
- put_mems_allowed();
trace_mm_page_alloc(page, order, gfp_mask, migratetype);
+
+out:
+ /*
+ * When updating a task's mems_allowed, it is possible to race with
+ * parallel threads in such a way that an allocation can fail while
+ * the mask is being updated. If a page allocation is about to fail,
+ * check if the cpuset changed during allocation and if so, retry.
+ */
+ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
+ goto retry_cpuset;
+
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2557,13 +2568,15 @@ void si_meminfo_node(struct sysinfo *val
bool skip_free_areas_node(unsigned int flags, int nid)
{
bool ret = false;
+ unsigned int cpuset_mems_cookie;
if (!(flags & SHOW_MEM_FILTER_NODES))
goto out;
- get_mems_allowed();
- ret = !node_isset(nid, cpuset_current_mems_allowed);
- put_mems_allowed();
+ do {
+ cpuset_mems_cookie = get_mems_allowed();
+ ret = !node_isset(nid, cpuset_current_mems_allowed);
+ } while (!put_mems_allowed(cpuset_mems_cookie));
out:
return ret;
}
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3218,12 +3218,10 @@ static void *alternate_node_alloc(struct
if (in_interrupt() || (flags & __GFP_THISNODE))
return NULL;
nid_alloc = nid_here = numa_mem_id();
- get_mems_allowed();
if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
nid_alloc = cpuset_slab_spread_node();
else if (current->mempolicy)
nid_alloc = slab_node(current->mempolicy);
- put_mems_allowed();
if (nid_alloc != nid_here)
return ____cache_alloc_node(cachep, flags, nid_alloc);
return NULL;
@@ -3246,14 +3244,17 @@ static void *fallback_alloc(struct kmem_
enum zone_type high_zoneidx = gfp_zone(flags);
void *obj = NULL;
int nid;
+ unsigned int cpuset_mems_cookie;
if (flags & __GFP_THISNODE)
return NULL;
- get_mems_allowed();
- zonelist = node_zonelist(slab_node(current->mempolicy), flags);
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
+retry_cpuset:
+ cpuset_mems_cookie = get_mems_allowed();
+ zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+
retry:
/*
* Look through allowed nodes for objects available
@@ -3306,7 +3307,9 @@ retry:
}
}
}
- put_mems_allowed();
+
+ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
+ goto retry_cpuset;
return obj;
}
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1457,6 +1457,7 @@ static struct page *get_any_partial(stru
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
struct page *page;
+ unsigned int cpuset_mems_cookie;
/*
* The defrag ratio allows a configuration of the tradeoffs between
@@ -1480,23 +1481,32 @@ static struct page *get_any_partial(stru
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;
- get_mems_allowed();
- zonelist = node_zonelist(slab_node(current->mempolicy), flags);
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
- struct kmem_cache_node *n;
+ do {
+ cpuset_mems_cookie = get_mems_allowed();
+ zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ struct kmem_cache_node *n;
- n = get_node(s, zone_to_nid(zone));
+ n = get_node(s, zone_to_nid(zone));
- if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
- n->nr_partial > s->min_partial) {
- page = get_partial_node(n);
- if (page) {
- put_mems_allowed();
- return page;
+ if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
+ n->nr_partial > s->min_partial) {
+ page = get_partial_node(n);
+ if (page) {
+ /*
+ * Return the object even if
+ * put_mems_allowed indicated that
+ * the cpuset mems_allowed was
+ * updated in parallel. It's a
+ * harmless race between the alloc
+ * and the cpuset update.
+ */
+ put_mems_allowed(cpuset_mems_cookie);
+ return page;
+ }
}
}
- }
- put_mems_allowed();
+ } while (!put_mems_allowed(cpuset_mems_cookie));
#endif
return NULL;
}
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,7 +2251,6 @@ static unsigned long do_try_to_free_page
unsigned long writeback_threshold;
bool aborted_reclaim;
- get_mems_allowed();
delayacct_freepages_start();
if (scanning_global_lru(sc))
@@ -2314,7 +2313,6 @@ static unsigned long do_try_to_free_page
out:
delayacct_freepages_end();
- put_mems_allowed();
if (sc->nr_reclaimed)
return sc->nr_reclaimed;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 40/41] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (38 preceding siblings ...)
2012-07-30 17:31 ` [ 39/41] cpuset: mm: reduce large amounts of memory barrier related damage v3 Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 41/41] vmscan: fix initial shrinker size handling Greg Kroah-Hartman
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Konstantin Khlebnikov, Mel Gorman,
David Rientjes
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Konstantin Khlebnikov <khlebnikov@openvz.org>
commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream.
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This
is part of a series of patches that reduce page allocator overhead.
Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
large amounts of memory barrier related damage v3")
Local variable "page" can be uninitialized if the nodemask from vma policy
does not intersects with nodemask from cpuset. Even if it doesn't happens
it is better to initialize this variable explicitly than to introduce
a kernel oops in a weird corner case.
mm/hugetlb.c: In function `alloc_huge_page':
mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -454,7 +454,7 @@ static struct page *dequeue_huge_page_vm
struct vm_area_struct *vma,
unsigned long address, int avoid_reserve)
{
- struct page *page;
+ struct page *page = NULL;
struct mempolicy *mpol;
nodemask_t *nodemask;
struct zonelist *zonelist;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [ 41/41] vmscan: fix initial shrinker size handling
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
` (39 preceding siblings ...)
2012-07-30 17:31 ` [ 40/41] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma Greg Kroah-Hartman
@ 2012-07-30 17:31 ` Greg Kroah-Hartman
40 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-30 17:31 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Greg KH, torvalds, akpm, alan, Konstantin Khlebnikov,
Dave Chinner, Mel Gorman
From: Greg KH <gregkh@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Konstantin Khlebnikov <khlebnikov@openvz.org>
commit 635697c663f38106063d5659f0cf2e45afcd4bb5 upstream.
Stable note: The commit [acf92b48: vmscan: shrinker->nr updates race and
go wrong] aimed to reduce excessive reclaim of slab objects but
had bug in how it treated shrinker functions that returned -1.
A shrinker function can return -1, means that it cannot do anything
without a risk of deadlock. For example prune_super() does this if it
cannot grab a superblock refrence, even if nr_to_scan=0. Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this. So the next time around this shrinker can cause
really big pressure. Let's skip such shrinkers instead.
Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/vmscan.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -248,12 +248,16 @@ unsigned long shrink_slab(struct shrink_
list_for_each_entry(shrinker, &shrinker_list, list) {
unsigned long long delta;
- unsigned long total_scan;
- unsigned long max_pass;
+ long total_scan;
+ long max_pass;
int shrink_ret = 0;
long nr;
long new_nr;
+ max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+ if (max_pass <= 0)
+ continue;
+
/*
* copy the current shrinker scan count into a local variable
* and zero it so that other concurrent shrinker invocations
@@ -264,7 +268,6 @@ unsigned long shrink_slab(struct shrink_
} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
total_scan = nr;
- max_pass = do_shrinker_shrink(shrinker, shrink, 0);
delta = (4 * nr_pages_scanned) / shrinker->seeks;
delta *= max_pass;
do_div(delta, lru_pages + 1);
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [ 28/41] mm: compaction: introduce sync-light migration for use by compaction
2012-07-30 17:31 ` [ 28/41] mm: compaction: introduce sync-light migration for use by compaction Greg Kroah-Hartman
@ 2012-07-31 16:43 ` Herton Ronaldo Krzesinski
2012-07-31 17:00 ` Greg Kroah-Hartman
0 siblings, 1 reply; 44+ messages in thread
From: Herton Ronaldo Krzesinski @ 2012-07-31 16:43 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: linux-kernel, stable, torvalds, akpm, alan, Mel Gorman,
Rik van Riel, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Nai Xia, Johannes Weiner
On Mon, Jul 30, 2012 at 10:31:27AM -0700, Greg Kroah-Hartman wrote:
> From: Greg KH <gregkh@linuxfoundation.org>
>
> 3.0-stable review patch. If anyone has any objections, please let me know.
>
> ------------------
>
> From: Mel Gorman <mgorman@suse.de>
>
> commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.
We need also to pick recent fix dc32f63453f56d07a1073a697dcd843dd3098c09
after applying this one.
>
> Stable note: Not tracked in Buzilla. This was part of a series that
> reduced interactivity stalls experienced when THP was enabled.
> These stalls were particularly noticable when copying data
> to a USB stick but the experiences for users varied a lot.
>
> This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
> mode that avoids writing back pages to backing storage. Async compaction
> maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
> For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
> used.
>
> This avoids sync compaction stalling for an excessive length of time,
> particularly when copying files to a USB stick where there might be a
> large number of dirty pages backed by a filesystem that does not support
> ->writepages.
>
> [aarcange@redhat.com: This patch is heavily based on Andrea's work]
> [akpm@linux-foundation.org: fix fs/nfs/write.c build]
> [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Andy Isaacson <adi@hexapodia.org>
> Cc: Nai Xia <nai.xia@gmail.com>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
> ---
> fs/btrfs/disk-io.c | 5 +--
> fs/hugetlbfs/inode.c | 2 -
> fs/nfs/internal.h | 2 -
> fs/nfs/write.c | 4 +-
> include/linux/fs.h | 6 ++-
> include/linux/migrate.h | 23 +++++++++++---
> mm/compaction.c | 2 -
> mm/memory-failure.c | 2 -
> mm/memory_hotplug.c | 2 -
> mm/mempolicy.c | 2 -
> mm/migrate.c | 78 ++++++++++++++++++++++++++----------------------
> 11 files changed, 76 insertions(+), 52 deletions(-)
>
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -801,7 +801,8 @@ static int btree_submit_bio_hook(struct
>
> #ifdef CONFIG_MIGRATION
> static int btree_migratepage(struct address_space *mapping,
> - struct page *newpage, struct page *page, bool sync)
> + struct page *newpage, struct page *page,
> + enum migrate_mode mode)
> {
> /*
> * we can't safely write a btree page from here,
> @@ -816,7 +817,7 @@ static int btree_migratepage(struct addr
> if (page_has_private(page) &&
> !try_to_release_page(page, GFP_KERNEL))
> return -EAGAIN;
> - return migrate_page(mapping, newpage, page, sync);
> + return migrate_page(mapping, newpage, page, mode);
> }
> #endif
>
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -569,7 +569,7 @@ static int hugetlbfs_set_page_dirty(stru
>
> static int hugetlbfs_migrate_page(struct address_space *mapping,
> struct page *newpage, struct page *page,
> - bool sync)
> + enum migrate_mode mode)
> {
> int rc;
>
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs
>
> #ifdef CONFIG_MIGRATION
> extern int nfs_migrate_page(struct address_space *,
> - struct page *, struct page *, bool);
> + struct page *, struct page *, enum migrate_mode);
> #else
> #define nfs_migrate_page NULL
> #endif
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1662,7 +1662,7 @@ out_error:
>
> #ifdef CONFIG_MIGRATION
> int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
> - struct page *page, bool sync)
> + struct page *page, enum migrate_mode mode)
> {
> /*
> * If PagePrivate is set, then the page is currently associated with
> @@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_spac
>
> nfs_fscache_release_page(page, GFP_KERNEL);
>
> - return migrate_page(mapping, newpage, page, sync);
> + return migrate_page(mapping, newpage, page, mode);
> }
> #endif
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -523,6 +523,7 @@ enum positive_aop_returns {
> struct page;
> struct address_space;
> struct writeback_control;
> +enum migrate_mode;
>
> struct iov_iter {
> const struct iovec *iov;
> @@ -612,7 +613,7 @@ struct address_space_operations {
> * is false, it must not block.
> */
> int (*migratepage) (struct address_space *,
> - struct page *, struct page *, bool);
> + struct page *, struct page *, enum migrate_mode);
> int (*launder_page) (struct page *);
> int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
> unsigned long);
> @@ -2481,7 +2482,8 @@ extern int generic_check_addressable(uns
>
> #ifdef CONFIG_MIGRATION
> extern int buffer_migrate_page(struct address_space *,
> - struct page *, struct page *, bool);
> + struct page *, struct page *,
> + enum migrate_mode);
> #else
> #define buffer_migrate_page NULL
> #endif
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -6,18 +6,31 @@
>
> typedef struct page *new_page_t(struct page *, unsigned long private, int **);
>
> +/*
> + * MIGRATE_ASYNC means never block
> + * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
> + * on most operations but not ->writepage as the potential stall time
> + * is too significant
> + * MIGRATE_SYNC will block when migrating pages
> + */
> +enum migrate_mode {
> + MIGRATE_ASYNC,
> + MIGRATE_SYNC_LIGHT,
> + MIGRATE_SYNC,
> +};
> +
> #ifdef CONFIG_MIGRATION
> #define PAGE_MIGRATION 1
>
> extern void putback_lru_pages(struct list_head *l);
> extern int migrate_page(struct address_space *,
> - struct page *, struct page *, bool);
> + struct page *, struct page *, enum migrate_mode);
> extern int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync);
> + enum migrate_mode mode);
> extern int migrate_huge_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync);
> + enum migrate_mode mode);
>
> extern int fail_migrate_page(struct address_space *,
> struct page *, struct page *);
> @@ -36,10 +49,10 @@ extern int migrate_huge_page_move_mappin
> static inline void putback_lru_pages(struct list_head *l) {}
> static inline int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync) { return -ENOSYS; }
> + enum migrate_mode mode) { return -ENOSYS; }
> static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync) { return -ENOSYS; }
> + enum migrate_mode mode) { return -ENOSYS; }
>
> static inline int migrate_prep(void) { return -ENOSYS; }
> static inline int migrate_prep_local(void) { return -ENOSYS; }
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -577,7 +577,7 @@ static int compact_zone(struct zone *zon
> nr_migrate = cc->nr_migratepages;
> err = migrate_pages(&cc->migratepages, compaction_alloc,
> (unsigned long)cc, false,
> - cc->sync);
> + cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
> update_nr_listpages(cc);
> nr_remaining = cc->nr_migratepages;
>
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1464,7 +1464,7 @@ int soft_offline_page(struct page *page,
> page_is_file_cache(page));
> list_add(&page->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> - 0, true);
> + 0, MIGRATE_SYNC);
> if (ret) {
> putback_lru_pages(&pagelist);
> pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -747,7 +747,7 @@ do_migrate_range(unsigned long start_pfn
> }
> /* this function returns # of failed pages */
> ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
> - true, true);
> + true, MIGRATE_SYNC);
> if (ret)
> putback_lru_pages(&source);
> }
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -926,7 +926,7 @@ static int migrate_to_node(struct mm_str
>
> if (!list_empty(&pagelist)) {
> err = migrate_pages(&pagelist, new_node_page, dest,
> - false, true);
> + false, MIGRATE_SYNC);
> if (err)
> putback_lru_pages(&pagelist);
> }
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -222,12 +222,13 @@ out:
>
> #ifdef CONFIG_BLOCK
> /* Returns true if all buffers are successfully locked */
> -static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
> +static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> + enum migrate_mode mode)
> {
> struct buffer_head *bh = head;
>
> /* Simple case, sync compaction */
> - if (sync) {
> + if (mode != MIGRATE_ASYNC) {
> do {
> get_bh(bh);
> lock_buffer(bh);
> @@ -263,7 +264,7 @@ static bool buffer_migrate_lock_buffers(
> }
> #else
> static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
> - bool sync)
> + enum migrate_mode mode)
> {
> return true;
> }
> @@ -279,7 +280,7 @@ static inline bool buffer_migrate_lock_b
> */
> static int migrate_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page,
> - struct buffer_head *head, bool sync)
> + struct buffer_head *head, enum migrate_mode mode)
> {
> int expected_count;
> void **pslot;
> @@ -315,7 +316,8 @@ static int migrate_page_move_mapping(str
> * the mapping back due to an elevated page count, we would have to
> * block waiting on other references to be dropped.
> */
> - if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
> + if (mode == MIGRATE_ASYNC && head &&
> + !buffer_migrate_lock_buffers(head, mode)) {
> page_unfreeze_refs(page, expected_count);
> spin_unlock_irq(&mapping->tree_lock);
> return -EAGAIN;
> @@ -478,13 +480,14 @@ EXPORT_SYMBOL(fail_migrate_page);
> * Pages are locked upon entry and exit.
> */
> int migrate_page(struct address_space *mapping,
> - struct page *newpage, struct page *page, bool sync)
> + struct page *newpage, struct page *page,
> + enum migrate_mode mode)
> {
> int rc;
>
> BUG_ON(PageWriteback(page)); /* Writeback must be complete */
>
> - rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
> + rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
>
> if (rc)
> return rc;
> @@ -501,17 +504,17 @@ EXPORT_SYMBOL(migrate_page);
> * exist.
> */
> int buffer_migrate_page(struct address_space *mapping,
> - struct page *newpage, struct page *page, bool sync)
> + struct page *newpage, struct page *page, enum migrate_mode mode)
> {
> struct buffer_head *bh, *head;
> int rc;
>
> if (!page_has_buffers(page))
> - return migrate_page(mapping, newpage, page, sync);
> + return migrate_page(mapping, newpage, page, mode);
>
> head = page_buffers(page);
>
> - rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
> + rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
>
> if (rc)
> return rc;
> @@ -521,8 +524,8 @@ int buffer_migrate_page(struct address_s
> * with an IRQ-safe spinlock held. In the sync case, the buffers
> * need to be locked now
> */
> - if (sync)
> - BUG_ON(!buffer_migrate_lock_buffers(head, sync));
> + if (mode != MIGRATE_ASYNC)
> + BUG_ON(!buffer_migrate_lock_buffers(head, mode));
>
> ClearPagePrivate(page);
> set_page_private(newpage, page_private(page));
> @@ -599,10 +602,11 @@ static int writeout(struct address_space
> * Default handling if a filesystem does not provide a migration function.
> */
> static int fallback_migrate_page(struct address_space *mapping,
> - struct page *newpage, struct page *page, bool sync)
> + struct page *newpage, struct page *page, enum migrate_mode mode)
> {
> if (PageDirty(page)) {
> - if (!sync)
> + /* Only writeback pages in full synchronous migration */
> + if (mode != MIGRATE_SYNC)
> return -EBUSY;
> return writeout(mapping, page);
> }
> @@ -615,7 +619,7 @@ static int fallback_migrate_page(struct
> !try_to_release_page(page, GFP_KERNEL))
> return -EAGAIN;
>
> - return migrate_page(mapping, newpage, page, sync);
> + return migrate_page(mapping, newpage, page, mode);
> }
>
> /*
> @@ -630,7 +634,7 @@ static int fallback_migrate_page(struct
> * == 0 - success
> */
> static int move_to_new_page(struct page *newpage, struct page *page,
> - int remap_swapcache, bool sync)
> + int remap_swapcache, enum migrate_mode mode)
> {
> struct address_space *mapping;
> int rc;
> @@ -651,7 +655,7 @@ static int move_to_new_page(struct page
>
> mapping = page_mapping(page);
> if (!mapping)
> - rc = migrate_page(mapping, newpage, page, sync);
> + rc = migrate_page(mapping, newpage, page, mode);
> else if (mapping->a_ops->migratepage)
> /*
> * Most pages have a mapping and most filesystems provide a
> @@ -660,9 +664,9 @@ static int move_to_new_page(struct page
> * is the most common path for page migration.
> */
> rc = mapping->a_ops->migratepage(mapping,
> - newpage, page, sync);
> + newpage, page, mode);
> else
> - rc = fallback_migrate_page(mapping, newpage, page, sync);
> + rc = fallback_migrate_page(mapping, newpage, page, mode);
>
> if (rc) {
> newpage->mapping = NULL;
> @@ -677,7 +681,7 @@ static int move_to_new_page(struct page
> }
>
> static int __unmap_and_move(struct page *page, struct page *newpage,
> - int force, bool offlining, bool sync)
> + int force, bool offlining, enum migrate_mode mode)
> {
> int rc = -EAGAIN;
> int remap_swapcache = 1;
> @@ -686,7 +690,7 @@ static int __unmap_and_move(struct page
> struct anon_vma *anon_vma = NULL;
>
> if (!trylock_page(page)) {
> - if (!force || !sync)
> + if (!force || mode == MIGRATE_ASYNC)
> goto out;
>
> /*
> @@ -732,10 +736,12 @@ static int __unmap_and_move(struct page
>
> if (PageWriteback(page)) {
> /*
> - * For !sync, there is no point retrying as the retry loop
> - * is expected to be too short for PageWriteback to be cleared
> + * Only in the case of a full syncronous migration is it
> + * necessary to wait for PageWriteback. In the async case,
> + * the retry loop is too short and in the sync-light case,
> + * the overhead of stalling is too much
> */
> - if (!sync) {
> + if (mode != MIGRATE_SYNC) {
> rc = -EBUSY;
> goto uncharge;
> }
> @@ -806,7 +812,7 @@ static int __unmap_and_move(struct page
>
> skip_unmap:
> if (!page_mapped(page))
> - rc = move_to_new_page(newpage, page, remap_swapcache, sync);
> + rc = move_to_new_page(newpage, page, remap_swapcache, mode);
>
> if (rc && remap_swapcache)
> remove_migration_ptes(page, page);
> @@ -829,7 +835,8 @@ out:
> * to the newly allocated page in newpage.
> */
> static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> - struct page *page, int force, bool offlining, bool sync)
> + struct page *page, int force, bool offlining,
> + enum migrate_mode mode)
> {
> int rc = 0;
> int *result = NULL;
> @@ -847,7 +854,7 @@ static int unmap_and_move(new_page_t get
> if (unlikely(split_huge_page(page)))
> goto out;
>
> - rc = __unmap_and_move(page, newpage, force, offlining, sync);
> + rc = __unmap_and_move(page, newpage, force, offlining, mode);
> out:
> if (rc != -EAGAIN) {
> /*
> @@ -895,7 +902,8 @@ out:
> */
> static int unmap_and_move_huge_page(new_page_t get_new_page,
> unsigned long private, struct page *hpage,
> - int force, bool offlining, bool sync)
> + int force, bool offlining,
> + enum migrate_mode mode)
> {
> int rc = 0;
> int *result = NULL;
> @@ -908,7 +916,7 @@ static int unmap_and_move_huge_page(new_
> rc = -EAGAIN;
>
> if (!trylock_page(hpage)) {
> - if (!force || !sync)
> + if (!force || mode != MIGRATE_SYNC)
> goto out;
> lock_page(hpage);
> }
> @@ -919,7 +927,7 @@ static int unmap_and_move_huge_page(new_
> try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>
> if (!page_mapped(hpage))
> - rc = move_to_new_page(new_hpage, hpage, 1, sync);
> + rc = move_to_new_page(new_hpage, hpage, 1, mode);
>
> if (rc)
> remove_migration_ptes(hpage, hpage);
> @@ -962,7 +970,7 @@ out:
> */
> int migrate_pages(struct list_head *from,
> new_page_t get_new_page, unsigned long private, bool offlining,
> - bool sync)
> + enum migrate_mode mode)
> {
> int retry = 1;
> int nr_failed = 0;
> @@ -983,7 +991,7 @@ int migrate_pages(struct list_head *from
>
> rc = unmap_and_move(get_new_page, private,
> page, pass > 2, offlining,
> - sync);
> + mode);
>
> switch(rc) {
> case -ENOMEM:
> @@ -1013,7 +1021,7 @@ out:
>
> int migrate_huge_pages(struct list_head *from,
> new_page_t get_new_page, unsigned long private, bool offlining,
> - bool sync)
> + enum migrate_mode mode)
> {
> int retry = 1;
> int nr_failed = 0;
> @@ -1030,7 +1038,7 @@ int migrate_huge_pages(struct list_head
>
> rc = unmap_and_move_huge_page(get_new_page,
> private, page, pass > 2, offlining,
> - sync);
> + mode);
>
> switch(rc) {
> case -ENOMEM:
> @@ -1159,7 +1167,7 @@ set_status:
> err = 0;
> if (!list_empty(&pagelist)) {
> err = migrate_pages(&pagelist, new_page_node,
> - (unsigned long)pm, 0, true);
> + (unsigned long)pm, 0, MIGRATE_SYNC);
> if (err)
> putback_lru_pages(&pagelist);
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe stable" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
[]'s
Herton
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [ 28/41] mm: compaction: introduce sync-light migration for use by compaction
2012-07-31 16:43 ` Herton Ronaldo Krzesinski
@ 2012-07-31 17:00 ` Greg Kroah-Hartman
0 siblings, 0 replies; 44+ messages in thread
From: Greg Kroah-Hartman @ 2012-07-31 17:00 UTC (permalink / raw)
To: Herton Ronaldo Krzesinski
Cc: linux-kernel, stable, torvalds, akpm, alan, Mel Gorman,
Rik van Riel, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Nai Xia, Johannes Weiner
On Tue, Jul 31, 2012 at 01:43:37PM -0300, Herton Ronaldo Krzesinski wrote:
> On Mon, Jul 30, 2012 at 10:31:27AM -0700, Greg Kroah-Hartman wrote:
> > From: Greg KH <gregkh@linuxfoundation.org>
> >
> > 3.0-stable review patch. If anyone has any objections, please let me know.
> >
> > ------------------
> >
> > From: Mel Gorman <mgorman@suse.de>
> >
> > commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.
>
> We need also to pick recent fix dc32f63453f56d07a1073a697dcd843dd3098c09
> after applying this one.
Thanks, I'll queue that up for the next round of kernels, given that
people have lived with the problem up until now :)
greg k-h
^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2012-07-31 17:01 UTC | newest]
Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-30 17:30 [ 00/41] 3.0.39-rc2 stable review Greg Kroah-Hartman
2012-07-30 17:31 ` [ 01/41] cifs: always update the inode cache with the results from a FIND_* Greg Kroah-Hartman
2012-07-30 17:31 ` [ 02/41] ntp: Fix STA_INS/DEL clearing bug Greg Kroah-Hartman
2012-07-30 17:31 ` [ 03/41] mm: fix lost kswapd wakeup in kswapd_stop() Greg Kroah-Hartman
2012-07-30 17:31 ` [ 04/41] MIPS: Properly align the .data..init_task section Greg Kroah-Hartman
2012-07-30 17:31 ` [ 05/41] UBIFS: fix a bug in empty space fix-up Greg Kroah-Hartman
2012-07-30 17:31 ` [ 06/41] dm raid1: fix crash with mirror recovery and discard Greg Kroah-Hartman
2012-07-30 17:31 ` [ 07/41] mm/vmstat.c: cache align vm_stat Greg Kroah-Hartman
2012-07-30 17:31 ` [ 08/41] mm: memory hotplug: Check if pages are correctly reserved on a per-section basis Greg Kroah-Hartman
2012-07-30 17:31 ` [ 09/41] mm: reduce the amount of work done when updating min_free_kbytes Greg Kroah-Hartman
2012-07-30 17:31 ` [ 10/41] mm: vmscan: fix force-scanning small targets without swap Greg Kroah-Hartman
2012-07-30 17:31 ` [ 11/41] vmscan: clear ZONE_CONGESTED for zone with good watermark Greg Kroah-Hartman
2012-07-30 17:31 ` [ 12/41] vmscan: add shrink_slab tracepoints Greg Kroah-Hartman
2012-07-30 17:31 ` [ 13/41] vmscan: shrinker->nr updates race and go wrong Greg Kroah-Hartman
2012-07-30 17:31 ` [ 14/41] vmscan: reduce wind up shrinker->nr when shrinker cant do work Greg Kroah-Hartman
2012-07-30 17:31 ` [ 15/41] vmscan: limit direct reclaim for higher order allocations Greg Kroah-Hartman
2012-07-30 17:31 ` [ 16/41] vmscan: abort reclaim/compaction if compaction can proceed Greg Kroah-Hartman
2012-07-30 17:31 ` [ 17/41] mm: compaction: trivial clean up in acct_isolated() Greg Kroah-Hartman
2012-07-30 17:31 ` [ 18/41] mm: change isolate mode from #define to bitwise type Greg Kroah-Hartman
2012-07-30 17:31 ` [ 19/41] mm: compaction: make isolate_lru_page() filter-aware Greg Kroah-Hartman
2012-07-30 17:31 ` [ 20/41] mm: zone_reclaim: " Greg Kroah-Hartman
2012-07-30 17:31 ` [ 21/41] mm: migration: clean up unmap_and_move() Greg Kroah-Hartman
2012-07-30 17:31 ` [ 22/41] mm: compaction: allow compaction to isolate dirty pages Greg Kroah-Hartman
2012-07-30 17:31 ` [ 23/41] mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage Greg Kroah-Hartman
2012-07-30 17:31 ` [ 24/41] mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred Greg Kroah-Hartman
2012-07-30 17:31 ` [ 25/41] mm: compaction: make isolate_lru_page() filter-aware again Greg Kroah-Hartman
2012-07-30 17:31 ` [ 26/41] kswapd: avoid unnecessary rebalance after an unsuccessful balancing Greg Kroah-Hartman
2012-07-30 17:31 ` [ 27/41] kswapd: assign new_order and new_classzone_idx after wakeup in sleeping Greg Kroah-Hartman
2012-07-30 17:31 ` [ 28/41] mm: compaction: introduce sync-light migration for use by compaction Greg Kroah-Hartman
2012-07-31 16:43 ` Herton Ronaldo Krzesinski
2012-07-31 17:00 ` Greg Kroah-Hartman
2012-07-30 17:31 ` [ 29/41] mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available Greg Kroah-Hartman
2012-07-30 17:31 ` [ 30/41] mm: vmscan: do not OOM if aborting reclaim to start compaction Greg Kroah-Hartman
2012-07-30 17:31 ` [ 31/41] mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone Greg Kroah-Hartman
2012-07-30 17:31 ` [ 32/41] vmscan: promote shared file mapped pages Greg Kroah-Hartman
2012-07-30 17:31 ` [ 33/41] vmscan: activate executable pages after first usage Greg Kroah-Hartman
2012-07-30 17:31 ` [ 34/41] mm/vmscan.c: consider swap space when deciding whether to continue reclaim Greg Kroah-Hartman
2012-07-30 17:31 ` [ 35/41] mm: test PageSwapBacked in lumpy reclaim Greg Kroah-Hartman
2012-07-30 17:31 ` [ 36/41] mm: vmscan: convert global reclaim to per-memcg LRU lists Greg Kroah-Hartman
2012-07-30 17:31 ` [ 37/41] cpusets: avoid looping when storing to mems_allowed if one node remains set Greg Kroah-Hartman
2012-07-30 17:31 ` [ 38/41] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask Greg Kroah-Hartman
2012-07-30 17:31 ` [ 39/41] cpuset: mm: reduce large amounts of memory barrier related damage v3 Greg Kroah-Hartman
2012-07-30 17:31 ` [ 40/41] mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma Greg Kroah-Hartman
2012-07-30 17:31 ` [ 41/41] vmscan: fix initial shrinker size handling Greg Kroah-Hartman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).