From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> To: "linux-mm@kvack.org" <linux-mm@kvack.org> Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>, "balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>, "hugh@veritas.com" <hugh@veritas.com> Subject: [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller Date: Tue, 21 Apr 2009 16:21:21 +0900 [thread overview] Message-ID: <20090421162121.1a1d15fe.kamezawa.hiroyu@jp.fujitsu.com> (raw) maybe this patch covers almost all cases of swap-leak of memcg, Nishimura-san reported. There are many callers of lock_page() but lock_page() which can be racy with free_swap_and_cache() is not so much, I think. Nishimura-san, How about this ? == From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> free_swap_and_cache(), which is called under various spin_lock, is designed to be best-effort function and it does nothing if the page is locked, under I/O. But it's ok because global lru will find the page finally. But, when it comes to memory+swap cgroup, global LRU may not work and swap entry can be alive as "not used but not freed" state for very very long time because memory cgroup's LRU routine scans its own LRU only. (Such stale swap-cache is out of memcg's LRU) Nishimura repoted such kind of swap cache makes memcg unstable and - we can never free mem_cgroup object (....it's big) even if no users. - OOM killer can happen because amounts of charge-to-swap is leaked. This patch tries to fix the problem by adding PageCgroupStale() flag. If a page which is under zappped is busy swap cache, recored it as Stale. At the end of swap I/O, Stale flag is checked and swap and swapcache is freed if necessary. Page migration case is also checked. Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- include/linux/page_cgroup.h | 4 +++ include/linux/swap.h | 8 ++++++ mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++---- mm/page_io.c | 2 + mm/swapfile.c | 1 5 files changed, 62 insertions(+), 4 deletions(-) Index: linux-2.6.30-rc2/include/linux/page_cgroup.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/page_cgroup.h +++ linux-2.6.30-rc2/include/linux/page_cgroup.h @@ -26,6 +26,7 @@ enum { PCG_LOCK, /* page cgroup is locked */ PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ + PCG_STALE, /* may be a stale swap-cache */ }; #define TESTPCGFLAG(uname, lname) \ @@ -46,6 +47,9 @@ TESTPCGFLAG(Cache, CACHE) TESTPCGFLAG(Used, USED) CLEARPCGFLAG(Used, USED) +TESTPCGFLAG(Stale, STALE) +SETPCGFLAG(Stale, STALE) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc->page); Index: linux-2.6.30-rc2/include/linux/swap.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/swap.h +++ linux-2.6.30-rc2/include/linux/swap.h @@ -344,10 +344,18 @@ mem_cgroup_uncharge_swapcache(struct pag #endif #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern void mem_cgroup_uncharge_swap(swp_entry_t ent); +extern void mem_cgroup_mark_swapcache_stale(struct page *page); +extern void mem_cgroup_fixup_swapcache(struct page *page); #else static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) { } +static void mem_cgroup_check_mark_swapcache_stale(struct page *page) +{ +} +static void mem_cgroup_fixup_swapcache(struct page *page) +{ +} #endif #else /* CONFIG_SWAP */ Index: linux-2.6.30-rc2/mm/memcontrol.c =================================================================== --- linux-2.6.30-rc2.orig/mm/memcontrol.c +++ linux-2.6.30-rc2/mm/memcontrol.c @@ -1534,6 +1534,47 @@ void mem_cgroup_uncharge_swap(swp_entry_ } rcu_read_unlock(); } + +/* + * free_swap_and_cache() is an best-effort function and it doesn't free + * swapent if the swapcache seems to be busy (ex. the page is locked.) + * This behavior is designed to be as is but mem+swap cgroup has to handle it. + * Otherwise, swp_entry seems to be leaked (for very long time) + */ +void mem_cgroup_mark_swapcache_stale(struct page *page) +{ + struct page_cgroup *pc; + + if (!PageSwapCache(page) || page_mapped(page)) + return; + + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + + /* + * This "Stale" flag will be cleared when the page is reused + * somewhere. + */ + if (!PageCgroupUsed(pc)) + SetPageCgroupStale(pc); + unlock_page_cgroup(pc); +} + +void mem_cgroup_fixup_swapcache(struct page *page) +{ + struct page_cgroup *pc = lookup_page_cgroup(page); + + /* Stale flag will be cleared automatically */ + if (unlikely(PageCgroupStale(pc))) { + if (get_page_unless_zero(page)) { + lock_page(page); + try_to_free_swap(page); + unlock_page(page); + page_cache_release(page); + } + } +} + #endif /* @@ -1604,17 +1645,19 @@ void mem_cgroup_end_migration(struct mem __mem_cgroup_commit_charge(mem, pc, ctype); /* - * Both of oldpage and newpage are still under lock_page(). - * Then, we don't have to care about race in radix-tree. - * But we have to be careful that this page is unmapped or not. + * oldpage is under lock_page() at migration. Then, we don't have to + * care about race in radix-tree. But we have to be careful + * that this page is unmapped or not. * * There is a case for !page_mapped(). At the start of * migration, oldpage was mapped. But now, it's zapped. * But we know *target* page is not freed/reused under us. * mem_cgroup_uncharge_page() does all necessary checks. */ - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) { mem_cgroup_uncharge_page(target); + mem_cgroup_fixup_swapcache(target); + } } /* Index: linux-2.6.30-rc2/mm/page_io.c =================================================================== --- linux-2.6.30-rc2.orig/mm/page_io.c +++ linux-2.6.30-rc2/mm/page_io.c @@ -68,6 +68,7 @@ static void end_swap_bio_write(struct bi } end_page_writeback(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } void end_swap_bio_read(struct bio *bio, int err) @@ -87,6 +88,7 @@ void end_swap_bio_read(struct bio *bio, } unlock_page(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } /* Index: linux-2.6.30-rc2/mm/swapfile.c =================================================================== --- linux-2.6.30-rc2.orig/mm/swapfile.c +++ linux-2.6.30-rc2/mm/swapfile.c @@ -587,6 +587,7 @@ int free_swap_and_cache(swp_entry_t entr if (swap_entry_free(p, entry) == 1) { page = find_get_page(&swapper_space, entry.val); if (page && !trylock_page(page)) { + mem_cgroup_mark_swapcache_stale(page); page_cache_release(page); page = NULL; }
WARNING: multiple messages have this Message-ID (diff)
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> To: "linux-mm@kvack.org" <linux-mm@kvack.org> Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>, "balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>, "hugh@veritas.com" <hugh@veritas.com> Subject: [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller Date: Tue, 21 Apr 2009 16:21:21 +0900 [thread overview] Message-ID: <20090421162121.1a1d15fe.kamezawa.hiroyu@jp.fujitsu.com> (raw) maybe this patch covers almost all cases of swap-leak of memcg, Nishimura-san reported. There are many callers of lock_page() but lock_page() which can be racy with free_swap_and_cache() is not so much, I think. Nishimura-san, How about this ? == From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> free_swap_and_cache(), which is called under various spin_lock, is designed to be best-effort function and it does nothing if the page is locked, under I/O. But it's ok because global lru will find the page finally. But, when it comes to memory+swap cgroup, global LRU may not work and swap entry can be alive as "not used but not freed" state for very very long time because memory cgroup's LRU routine scans its own LRU only. (Such stale swap-cache is out of memcg's LRU) Nishimura repoted such kind of swap cache makes memcg unstable and - we can never free mem_cgroup object (....it's big) even if no users. - OOM killer can happen because amounts of charge-to-swap is leaked. This patch tries to fix the problem by adding PageCgroupStale() flag. If a page which is under zappped is busy swap cache, recored it as Stale. At the end of swap I/O, Stale flag is checked and swap and swapcache is freed if necessary. Page migration case is also checked. Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- include/linux/page_cgroup.h | 4 +++ include/linux/swap.h | 8 ++++++ mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++---- mm/page_io.c | 2 + mm/swapfile.c | 1 5 files changed, 62 insertions(+), 4 deletions(-) Index: linux-2.6.30-rc2/include/linux/page_cgroup.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/page_cgroup.h +++ linux-2.6.30-rc2/include/linux/page_cgroup.h @@ -26,6 +26,7 @@ enum { PCG_LOCK, /* page cgroup is locked */ PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ + PCG_STALE, /* may be a stale swap-cache */ }; #define TESTPCGFLAG(uname, lname) \ @@ -46,6 +47,9 @@ TESTPCGFLAG(Cache, CACHE) TESTPCGFLAG(Used, USED) CLEARPCGFLAG(Used, USED) +TESTPCGFLAG(Stale, STALE) +SETPCGFLAG(Stale, STALE) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc->page); Index: linux-2.6.30-rc2/include/linux/swap.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/swap.h +++ linux-2.6.30-rc2/include/linux/swap.h @@ -344,10 +344,18 @@ mem_cgroup_uncharge_swapcache(struct pag #endif #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern void mem_cgroup_uncharge_swap(swp_entry_t ent); +extern void mem_cgroup_mark_swapcache_stale(struct page *page); +extern void mem_cgroup_fixup_swapcache(struct page *page); #else static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) { } +static void mem_cgroup_check_mark_swapcache_stale(struct page *page) +{ +} +static void mem_cgroup_fixup_swapcache(struct page *page) +{ +} #endif #else /* CONFIG_SWAP */ Index: linux-2.6.30-rc2/mm/memcontrol.c =================================================================== --- linux-2.6.30-rc2.orig/mm/memcontrol.c +++ linux-2.6.30-rc2/mm/memcontrol.c @@ -1534,6 +1534,47 @@ void mem_cgroup_uncharge_swap(swp_entry_ } rcu_read_unlock(); } + +/* + * free_swap_and_cache() is an best-effort function and it doesn't free + * swapent if the swapcache seems to be busy (ex. the page is locked.) + * This behavior is designed to be as is but mem+swap cgroup has to handle it. + * Otherwise, swp_entry seems to be leaked (for very long time) + */ +void mem_cgroup_mark_swapcache_stale(struct page *page) +{ + struct page_cgroup *pc; + + if (!PageSwapCache(page) || page_mapped(page)) + return; + + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + + /* + * This "Stale" flag will be cleared when the page is reused + * somewhere. + */ + if (!PageCgroupUsed(pc)) + SetPageCgroupStale(pc); + unlock_page_cgroup(pc); +} + +void mem_cgroup_fixup_swapcache(struct page *page) +{ + struct page_cgroup *pc = lookup_page_cgroup(page); + + /* Stale flag will be cleared automatically */ + if (unlikely(PageCgroupStale(pc))) { + if (get_page_unless_zero(page)) { + lock_page(page); + try_to_free_swap(page); + unlock_page(page); + page_cache_release(page); + } + } +} + #endif /* @@ -1604,17 +1645,19 @@ void mem_cgroup_end_migration(struct mem __mem_cgroup_commit_charge(mem, pc, ctype); /* - * Both of oldpage and newpage are still under lock_page(). - * Then, we don't have to care about race in radix-tree. - * But we have to be careful that this page is unmapped or not. + * oldpage is under lock_page() at migration. Then, we don't have to + * care about race in radix-tree. But we have to be careful + * that this page is unmapped or not. * * There is a case for !page_mapped(). At the start of * migration, oldpage was mapped. But now, it's zapped. * But we know *target* page is not freed/reused under us. * mem_cgroup_uncharge_page() does all necessary checks. */ - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) { mem_cgroup_uncharge_page(target); + mem_cgroup_fixup_swapcache(target); + } } /* Index: linux-2.6.30-rc2/mm/page_io.c =================================================================== --- linux-2.6.30-rc2.orig/mm/page_io.c +++ linux-2.6.30-rc2/mm/page_io.c @@ -68,6 +68,7 @@ static void end_swap_bio_write(struct bi } end_page_writeback(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } void end_swap_bio_read(struct bio *bio, int err) @@ -87,6 +88,7 @@ void end_swap_bio_read(struct bio *bio, } unlock_page(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } /* Index: linux-2.6.30-rc2/mm/swapfile.c =================================================================== --- linux-2.6.30-rc2.orig/mm/swapfile.c +++ linux-2.6.30-rc2/mm/swapfile.c @@ -587,6 +587,7 @@ int free_swap_and_cache(swp_entry_t entr if (swap_entry_free(p, entry) == 1) { page = find_get_page(&swapper_space, entry.val); if (page && !trylock_page(page)) { + mem_cgroup_mark_swapcache_stale(page); page_cache_release(page); page = NULL; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next reply other threads:[~2009-04-21 7:23 UTC|newest] Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top 2009-04-21 7:21 KAMEZAWA Hiroyuki [this message] 2009-04-21 7:21 ` [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller KAMEZAWA Hiroyuki 2009-04-22 5:38 ` Daisuke Nishimura 2009-04-22 5:38 ` Daisuke Nishimura 2009-04-22 6:10 ` KAMEZAWA Hiroyuki 2009-04-22 6:10 ` KAMEZAWA Hiroyuki 2009-04-23 4:14 ` Daisuke Nishimura 2009-04-23 4:14 ` Daisuke Nishimura 2009-04-23 8:45 ` KAMEZAWA Hiroyuki 2009-04-23 8:45 ` KAMEZAWA Hiroyuki 2009-04-24 4:33 ` KAMEZAWA Hiroyuki 2009-04-24 4:33 ` KAMEZAWA Hiroyuki 2009-04-24 6:21 ` Daisuke Nishimura 2009-04-24 6:21 ` Daisuke Nishimura 2009-04-24 7:28 ` [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3 KAMEZAWA Hiroyuki 2009-04-24 7:28 ` KAMEZAWA Hiroyuki 2009-04-24 8:07 ` Daisuke Nishimura 2009-04-24 8:07 ` Daisuke Nishimura 2009-04-25 12:54 ` Daisuke Nishimura 2009-04-25 12:54 ` Daisuke Nishimura 2009-04-25 16:06 ` Daisuke Nishimura 2009-04-25 16:06 ` Daisuke Nishimura 2009-04-27 7:39 ` KAMEZAWA Hiroyuki 2009-04-27 7:39 ` KAMEZAWA Hiroyuki 2009-04-27 8:12 ` Balbir Singh 2009-04-27 8:12 ` Balbir Singh 2009-04-27 8:21 ` KAMEZAWA Hiroyuki 2009-04-27 8:21 ` KAMEZAWA Hiroyuki 2009-04-27 8:43 ` Balbir Singh 2009-04-27 8:43 ` Balbir Singh 2009-04-27 8:49 ` KAMEZAWA Hiroyuki 2009-04-27 8:49 ` KAMEZAWA Hiroyuki
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20090421162121.1a1d15fe.kamezawa.hiroyu@jp.fujitsu.com \ --to=kamezawa.hiroyu@jp.fujitsu.com \ --cc=balbir@linux.vnet.ibm.com \ --cc=hugh@veritas.com \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=nishimura@mxp.nes.nec.co.jp \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.