* [RFCv2 0/3] free reclaimed pages by paging out instantly
@ 2014-06-20 6:48 Minchan Kim
2014-06-20 6:48 ` [RFCv2 1/3] mm: Don't hide spin_lock in swap_info_get internal Minchan Kim
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Minchan Kim @ 2014-06-20 6:48 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Rik van Riel, Mel Gorman,
Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim
Normally, pages whose reclaim I/O has completed are rotated back to the
tail of the inactive LRU without being freed. The reason it works this
way is that we cannot free a page from atomic context (i.e.,
end_page_writeback) because various locks involved are not aware of
atomic context.
So, to reclaim those I/O-completed pages, we need one more reclaim
iteration, which causes unnecessary aging as well as CPU overhead.
Long ago, when this was first attempted, the main concern was memcg
locking, but recently Johannes made an impressive effort to simplify
the memcg locking[1], so I coded this up again on top of his patchset.
(Kudos to Johannes!)
[1] mm: memcontrol: naturalize charge lifetime v3
https://lkml.org/lkml/2014/6/18/631
So this patchset should go in after [1], but it is also good timing to
show how [1] simplifies mm so that we can go further like this.
On a 1G, 12-CPU KVM guest, I built the kernel 5 times; the results are
below. Most fields do not change much, but one thing I notice is
allocstall and pgrotated: we save about half of the direct reclaim
stalls and 96% of the page rotations.
Yay!
Welcome testing, review and any feedback!
git clone -b mm/mm/asap_reclaim-v1r3 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
TITLE OLD NEW DIFF (RATIO)
nr_free_pages -79,600 -85,018 -5,418 (106.81)
nr_alloc_batch -532 64 596 (-12.24)
nr_inactive_anon 1,601 1,913 312 (119.48)
nr_active_anon -66,021 -64,913 1,108 (98.32)
nr_inactive_file 25,921 29,974 4,053 (115.64)
nr_active_file 112,575 112,604 29 (100.03)
nr_unevictable 0 0 0 (100.00)
nr_mlock 0 0 0 (100.00)
nr_anon_pages -63,698 -62,601 1,097 (98.28)
nr_mapped -31,818 -32,657 -839 (102.64)
nr_file_pages 139,336 143,684 4,348 (103.12)
nr_dirty 9,741 9,332 -409 (95.80)
nr_writeback 0 0 0 (100.00)
nr_slab_reclaimable 3,453 2,969 -484 (85.99)
nr_slab_unreclaimable 1,348 1,503 155 (111.49)
nr_page_table_pages 303 169 -134 (55.92)
nr_kernel_stack 5 18 13 (316.67)
nr_unstable 0 0 0 (100.00)
nr_bounce 0 0 0 (100.00)
nr_vmscan_write 2,347,178 2,475,802 128,624 (105.48)
nr_vmscan_immediate_reclaim 21,292 18,863 -2,429 (88.59)
nr_writeback_temp 0 0 0 (100.00)
nr_isolated_anon 0 0 0 (100.00)
nr_isolated_file 0 0 0 (100.00)
nr_shmem -940 -940 0 (100.00)
nr_dirtied 9,059,820 9,055,880 -3,940 (99.96)
nr_written 10,721,761 10,875,384 153,623 (101.43)
numa_hit 667,598,890 667,289,491 -309,399 (99.95)
numa_miss 0 0 0 (100.00)
numa_foreign 0 0 0 (100.00)
numa_interleave 0 0 0 (100.00)
numa_local 667,598,890 667,289,491 -309,399 (99.95)
numa_other 0 0 0 (100.00)
workingset_refault 6,843,535 6,953,675 110,140 (101.61)
workingset_activate 500,648 462,529 -38,119 (92.39)
workingset_nodereclaim 13,696 12,420 -1,276 (90.68)
nr_anon_transparent_hugepages 0 0 0 (100.00)
nr_free_cma 0 0 0 (100.00)
nr_dirty_threshold 5,890 5,756 -134 (97.73)
nr_dirty_background_threshold 2,945 2,878 -67 (97.73)
pgpgin 40,253,436 40,048,188 -205,248 (99.49)
pgpgout 43,348,244 43,949,700 601,456 (101.39)
pswpin 1,341,538 1,341,174 -364 (99.97)
pswpout 2,238,838 2,401,758 162,920 (107.28)
pgalloc_dma 8,107,785 8,658,979 551,194 (106.80)
pgalloc_dma32 662,079,225 661,199,629 -879,596 (99.87)
pgalloc_normal 0 0 0 (100.00)
pgalloc_movable 0 0 0 (100.00)
pgfree 670,107,583 669,774,227 -333,356 (99.95)
pgactivate 6,644,334 6,643,232 -1,102 (99.98)
pgdeactivate 12,717,804 12,591,803 -126,001 (99.01)
pgfault 720,714,051 720,522,028 -192,023 (99.97)
pgmajfault 293,791 300,790 6,999 (102.38)
pgrefill_dma 339,536 357,065 17,529 (105.16)
pgrefill_dma32 13,042,608 12,882,276 -160,332 (98.77)
pgrefill_normal 0 0 0 (100.00)
pgrefill_movable 0 0 0 (100.00)
pgsteal_kswapd_dma 176,437 182,289 5,852 (103.32)
pgsteal_kswapd_dma32 17,820,059 15,877,438 -1,942,621 (89.10)
pgsteal_kswapd_normal 0 0 0 (100.00)
pgsteal_kswapd_movable 0 0 0 (100.00)
pgsteal_direct_dma 30 63 33 (206.45)
pgsteal_direct_dma32 388,468 208,411 -180,057 (53.65)
pgsteal_direct_normal 0 0 0 (100.00)
pgsteal_direct_movable 0 0 0 (100.00)
pgscan_kswapd_dma 190,486 199,076 8,590 (104.51)
pgscan_kswapd_dma32 22,002,203 20,034,956 -1,967,247 (91.06)
pgscan_kswapd_normal 0 0 0 (100.00)
pgscan_kswapd_movable 0 0 0 (100.00)
pgscan_direct_dma 45 175 130 (382.61)
pgscan_direct_dma32 722,765 714,866 -7,899 (98.91)
pgscan_direct_normal 0 0 0 (100.00)
pgscan_direct_movable 0 0 0 (100.00)
pgscan_direct_throttle 0 0 0 (100.00)
zone_reclaim_failed 0 0 0 (100.00)
pginodesteal 0 0 0 (100.00)
slabs_scanned 3,537,255 3,580,408 43,153 (101.22)
kswapd_inodesteal 374 0 -374 (0.27)
kswapd_low_wmark_hit_quickly 2,485 2,528 43 (101.73)
kswapd_high_wmark_hit_quickly 1,078 728 -350 (67.56)
pageoutrun 4,652 4,346 -306 (93.42)
allocstall 8,312 4,524 -3,788 (54.43)
pgrotated 2,205,712 86,860 -2,118,852 (3.94)
drop_pagecache 0 0 0 (100.00)
drop_slab 0 0 0 (100.00)
pgmigrate_success 0 0 0 (100.00)
pgmigrate_fail 0 0 0 (100.00)
compact_migrate_scanned 0 0 0 (100.00)
compact_free_scanned 0 0 0 (100.00)
compact_isolated 0 0 0 (100.00)
compact_stall 7 6 -1 (87.50)
compact_fail 7 6 -1 (87.50)
compact_success 0 0 0 (100.00)
htlb_buddy_alloc_success 0 0 0 (100.00)
htlb_buddy_alloc_fail 0 0 0 (100.00)
unevictable_pgs_culled 0 0 0 (100.00)
unevictable_pgs_scanned 0 0 0 (100.00)
unevictable_pgs_rescued 0 0 0 (100.00)
unevictable_pgs_mlocked 0 0 0 (100.00)
unevictable_pgs_munlocked 0 0 0 (100.00)
unevictable_pgs_cleared 0 0 0 (100.00)
unevictable_pgs_stranded 0 0 0 (100.00)
thp_fault_alloc 0 0 0 (100.00)
thp_fault_fallback 0 0 0 (100.00)
thp_collapse_alloc 0 0 0 (100.00)
thp_collapse_alloc_failed 0 0 0 (100.00)
thp_split 0 0 0 (100.00)
thp_zero_page_alloc 0 0 0 (100.00)
thp_zero_page_alloc_failed 0 0 0 (100.00)
Minchan Kim (3):
mm: Don't hide spin_lock in swap_info_get internal
mm: Introduce atomic_remove_mapping
mm: Free reclaimed pages independent of next reclaim
include/linux/swap.h | 4 ++++
mm/filemap.c | 17 +++++++++-----
mm/swap.c | 21 ++++++++++++++++++
mm/swapfile.c | 17 ++++++++++++--
mm/vmscan.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 114 insertions(+), 8 deletions(-)
--
2.0.0
* [RFCv2 1/3] mm: Don't hide spin_lock in swap_info_get internal
2014-06-20 6:48 [RFCv2 0/3] free reclaimed pages by paging out instantly Minchan Kim
@ 2014-06-20 6:48 ` Minchan Kim
2014-06-20 6:48 ` [RFCv2 2/3] mm: Introduce atomic_remove_mapping Minchan Kim
2014-06-20 6:48 ` [RFCv2 3/3] mm: Free reclaimed pages independent of next reclaim Minchan Kim
2 siblings, 0 replies; 4+ messages in thread
From: Minchan Kim @ 2014-06-20 6:48 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Rik van Riel, Mel Gorman,
Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim
Currently, swap_info_get takes the spinlock internally but leaves
releasing it to the caller. That asymmetric pattern is not good in
general, and the next patch needs to manage the lock from the caller's
side, so move the lock acquisition out to the callers.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/swapfile.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e0ac59..ec2ce926ea5f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -740,7 +740,6 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
goto bad_offset;
if (!p->swap_map[offset])
goto bad_free;
- spin_lock(&p->lock);
return p;
bad_free:
@@ -835,6 +834,7 @@ void swap_free(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
+ spin_lock(&p->lock);
swap_entry_free(p, entry, 1);
spin_unlock(&p->lock);
}
@@ -849,6 +849,7 @@ void swapcache_free(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
+ spin_lock(&p->lock);
swap_entry_free(p, entry, SWAP_HAS_CACHE);
spin_unlock(&p->lock);
}
@@ -868,6 +869,7 @@ int page_swapcount(struct page *page)
entry.val = page_private(page);
p = swap_info_get(entry);
if (p) {
+ spin_lock(&p->lock);
count = swap_count(p->swap_map[swp_offset(entry)]);
spin_unlock(&p->lock);
}
@@ -950,6 +952,7 @@ int free_swap_and_cache(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
+ spin_lock(&p->lock);
if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
page = find_get_page(swap_address_space(entry),
entry.val);
@@ -2763,6 +2766,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
goto outer;
}
+ spin_lock(&si->lock);
offset = swp_offset(entry);
count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
--
2.0.0
* [RFCv2 2/3] mm: Introduce atomic_remove_mapping
2014-06-20 6:48 [RFCv2 0/3] free reclaimed pages by paging out instantly Minchan Kim
2014-06-20 6:48 ` [RFCv2 1/3] mm: Don't hide spin_lock in swap_info_get internal Minchan Kim
@ 2014-06-20 6:48 ` Minchan Kim
2014-06-20 6:48 ` [RFCv2 3/3] mm: Free reclaimed pages independent of next reclaim Minchan Kim
2 siblings, 0 replies; 4+ messages in thread
From: Minchan Kim @ 2014-06-20 6:48 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Rik van Riel, Mel Gorman,
Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
Trond Myklebust, linux-nfs
To release a page from atomic context (i.e., softirq), the locks
involved in the work must be aware of that.
There are two such locks:
one is mapping->tree_lock and the other is swap_info_struct->lock.
mapping->tree_lock is already IRQ-aware, so it is no problem, but
swap_info_struct->lock is not, so atomic_remove_mapping just uses
spin_trylock; if it fails to take the lock, it falls back to moving
the page to the LRU's tail and expects the page to be freed by the
next reclaim pass.
One change I am aware of: with this patch, mapping->a_ops->freepage is
called in atomic context. Its only user is nfs_readdir_clear_array,
which looks fine to me.
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/swap.h | 4 ++++
mm/swapfile.c | 11 ++++++++-
mm/vmscan.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 77 insertions(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 94fd0b23f3f9..5df540205bda 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -336,6 +336,8 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+extern int atomic_remove_mapping(struct address_space *mapping,
+ struct page *page);
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
@@ -441,6 +443,7 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
+extern struct swap_info_struct *swap_info_get(swp_entry_t entry);
extern swp_entry_t get_swap_page(void);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
@@ -449,6 +452,7 @@ extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
extern void swapcache_free(swp_entry_t);
+extern void __swapcache_free(swp_entry_t);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ec2ce926ea5f..d76496a8a104 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -722,7 +722,7 @@ swp_entry_t get_swap_page_of_type(int type)
return (swp_entry_t) {0};
}
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+struct swap_info_struct *swap_info_get(swp_entry_t entry)
{
struct swap_info_struct *p;
unsigned long offset, type;
@@ -855,6 +855,15 @@ void swapcache_free(swp_entry_t entry)
}
}
+void __swapcache_free(swp_entry_t entry)
+{
+ struct swap_info_struct *p;
+
+ p = swap_info_get(entry);
+ if (p)
+ swap_entry_free(p, entry, SWAP_HAS_CACHE);
+}
+
/*
* How many references to page are currently swapped out?
* This does not give an exact answer when swap count is continued,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 521f7eab1798..a7efddc571f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -526,6 +526,69 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}
/*
+ * Attempt to detach a locked page from its ->mapping in atomic context.
+ * If it is dirty, someone else has a ref on the page, or we cannot
+ * take the necessary locks, abort and return 0.
+ * If it was successfully detached, return 1.
+ * Assumes the caller has a single ref on this page.
+ */
+int atomic_remove_mapping(struct address_space *mapping,
+ struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ BUG_ON(mapping != page_mapping(page));
+ BUG_ON(!irqs_disabled());
+
+ spin_lock(&mapping->tree_lock);
+
+ /* Look at comment in __remove_mapping */
+ if (!page_freeze_refs(page, 2))
+ goto cannot_free;
+ /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
+ if (unlikely(PageDirty(page))) {
+ page_unfreeze_refs(page, 2);
+ goto cannot_free;
+ }
+
+ if (PageSwapCache(page)) {
+ swp_entry_t swap = { .val = page_private(page) };
+ struct swap_info_struct *p = swap_info_get(swap);
+
+ if (!p || !spin_trylock(&p->lock)) {
+ page_unfreeze_refs(page, 2);
+ goto cannot_free;
+ }
+
+ mem_cgroup_swapout(page, swap);
+ __delete_from_swap_cache(page);
+ spin_unlock(&mapping->tree_lock);
+ __swapcache_free(swap);
+ spin_unlock(&p->lock);
+ } else {
+ void (*freepage)(struct page *);
+
+ freepage = mapping->a_ops->freepage;
+ __delete_from_page_cache(page, NULL);
+ spin_unlock(&mapping->tree_lock);
+
+ if (freepage != NULL)
+ freepage(page);
+ }
+
+ /*
+ * Unfreezing the refcount with 1 rather than 2 effectively
+ * drops the pagecache ref for us without requiring another
+ * atomic operation.
+ */
+ page_unfreeze_refs(page, 1);
+ return 1;
+
+cannot_free:
+ spin_unlock(&mapping->tree_lock);
+ return 0;
+}
+
+/*
* Same as remove_mapping, but if the page is removed from the mapping, it
* gets returned with a refcount of 0.
*/
--
2.0.0
* [RFCv2 3/3] mm: Free reclaimed pages independent of next reclaim
2014-06-20 6:48 [RFCv2 0/3] free reclaimed pages by paging out instantly Minchan Kim
2014-06-20 6:48 ` [RFCv2 1/3] mm: Don't hide spin_lock in swap_info_get internal Minchan Kim
2014-06-20 6:48 ` [RFCv2 2/3] mm: Introduce atomic_remove_mapping Minchan Kim
@ 2014-06-20 6:48 ` Minchan Kim
2 siblings, 0 replies; 4+ messages in thread
From: Minchan Kim @ 2014-06-20 6:48 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Rik van Riel, Mel Gorman,
Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim
Write-out of dirty pages for reclaim (both file and swap I/O) is
asynchronous, so when page writeback completes, the page is rotated
back to the LRU tail to be freed by the next reclaim pass.
But that causes unnecessary CPU overhead and extra aging, with a
higher reclaim priority than necessary.
This patch releases such pages instantly when the I/O completes,
without the LRU movement, so we can reduce reclaim events.
The patch wakes up PG_writeback waiters before clearing the
PG_reclaim bit, because the page could be released during the
rotation. That races slightly with the readahead logic (which reuses
PG_reclaim as the readahead marker), but the chance is small and
there would be no big side effect even if it happens, I believe.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/filemap.c | 17 +++++++++++------
mm/swap.c | 21 +++++++++++++++++++++
2 files changed, 32 insertions(+), 6 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c2f30ed8e95f..6e09de6cf510 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -752,23 +752,28 @@ EXPORT_SYMBOL(unlock_page);
*/
void end_page_writeback(struct page *page)
{
+ if (!test_clear_page_writeback(page))
+ BUG();
+
+ smp_mb__after_atomic();
+ wake_up_page(page, PG_writeback);
+
/*
* TestClearPageReclaim could be used here but it is an atomic
* operation and overkill in this particular case. Failing to
* shuffle a page marked for immediate reclaim is too mild to
* justify taking an atomic operation penalty at the end of
* ever page writeback.
+ *
+ * Clearing PG_reclaim after waking up the waiter is slightly racy.
+ * Readahead might see PageReclaim as the PageReadahead marker,
+ * so the readahead logic might be broken temporarily, but it
+ * doesn't matter enough to care.
*/
if (PageReclaim(page)) {
ClearPageReclaim(page);
rotate_reclaimable_page(page);
}
-
- if (!test_clear_page_writeback(page))
- BUG();
-
- smp_mb__after_atomic();
- wake_up_page(page, PG_writeback);
}
EXPORT_SYMBOL(end_page_writeback);
diff --git a/mm/swap.c b/mm/swap.c
index 3074210f245d..d61b8783ccc3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -443,6 +443,27 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
enum lru_list lru = page_lru_base_type(page);
+ struct address_space *mapping;
+
+ if (!trylock_page(page))
+ goto move_tail;
+
+ mapping = page_mapping(page);
+ if (!mapping)
+ goto unlock;
+
+ /*
+ * If it is successful, atomic_remove_mapping
+ * makes page->count one, so the page will be
+ * released when the caller drops its refcount.
+ */
+ if (atomic_remove_mapping(mapping, page)) {
+ unlock_page(page);
+ return;
+ }
+unlock:
+ unlock_page(page);
+move_tail:
list_move_tail(&page->lru, &lruvec->lists[lru]);
(*pgmoved)++;
}
--
2.0.0