From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754678AbZDUHXR (ORCPT ); Tue, 21 Apr 2009 03:23:17 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753946AbZDUHW5 (ORCPT ); Tue, 21 Apr 2009 03:22:57 -0400 Received: from fgwmail7.fujitsu.co.jp ([192.51.44.37]:54366 "EHLO fgwmail7.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754107AbZDUHWz (ORCPT ); Tue, 21 Apr 2009 03:22:55 -0400 Date: Tue, 21 Apr 2009 16:21:21 +0900 From: KAMEZAWA Hiroyuki To: "linux-mm@kvack.org" Cc: "linux-kernel@vger.kernel.org" , "nishimura@mxp.nes.nec.co.jp" , "balbir@linux.vnet.ibm.com" , "hugh@veritas.com" Subject: [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller Message-Id: <20090421162121.1a1d15fe.kamezawa.hiroyu@jp.fujitsu.com> Organization: FUJITSU Co. LTD. X-Mailer: Sylpheed 2.5.0 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org maybe this patch covers almost all cases of swap-leak of memcg, Nishimura-san reported. There are many callers of lock_page() but lock_page() which can be racy with free_swap_and_cache() is not so much, I think. Nishimura-san, How about this ? == From: KAMEZAWA Hiroyuki free_swap_and_cache(), which is called under various spin_lock, is designed to be best-effort function and it does nothing if the page is locked, under I/O. But it's ok because global lru will find the page finally. But, when it comes to memory+swap cgroup, global LRU may not work and swap entry can be alive as "not used but not freed" state for very very long time because memory cgroup's LRU routine scans its own LRU only. (Such stale swap-cache is out of memcg's LRU) Nishimura repoted such kind of swap cache makes memcg unstable and - we can never free mem_cgroup object (....it's big) even if no users. - OOM killer can happen because amounts of charge-to-swap is leaked. This patch tries to fix the problem by adding PageCgroupStale() flag. If a page which is under zappped is busy swap cache, recored it as Stale. At the end of swap I/O, Stale flag is checked and swap and swapcache is freed if necessary. Page migration case is also checked. Reported-by: Daisuke Nishimura Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/page_cgroup.h | 4 +++ include/linux/swap.h | 8 ++++++ mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++---- mm/page_io.c | 2 + mm/swapfile.c | 1 5 files changed, 62 insertions(+), 4 deletions(-) Index: linux-2.6.30-rc2/include/linux/page_cgroup.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/page_cgroup.h +++ linux-2.6.30-rc2/include/linux/page_cgroup.h @@ -26,6 +26,7 @@ enum { PCG_LOCK, /* page cgroup is locked */ PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ + PCG_STALE, /* may be a stale swap-cache */ }; #define TESTPCGFLAG(uname, lname) \ @@ -46,6 +47,9 @@ TESTPCGFLAG(Cache, CACHE) TESTPCGFLAG(Used, USED) CLEARPCGFLAG(Used, USED) +TESTPCGFLAG(Stale, STALE) +SETPCGFLAG(Stale, STALE) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc->page); Index: linux-2.6.30-rc2/include/linux/swap.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/swap.h +++ linux-2.6.30-rc2/include/linux/swap.h @@ -344,10 +344,18 @@ mem_cgroup_uncharge_swapcache(struct pag #endif #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern void mem_cgroup_uncharge_swap(swp_entry_t ent); +extern void mem_cgroup_mark_swapcache_stale(struct page *page); +extern void mem_cgroup_fixup_swapcache(struct page *page); #else static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) { } +static void mem_cgroup_check_mark_swapcache_stale(struct page *page) +{ +} +static void mem_cgroup_fixup_swapcache(struct page *page) +{ +} #endif #else /* CONFIG_SWAP */ Index: linux-2.6.30-rc2/mm/memcontrol.c =================================================================== --- linux-2.6.30-rc2.orig/mm/memcontrol.c +++ linux-2.6.30-rc2/mm/memcontrol.c @@ -1534,6 +1534,47 @@ void mem_cgroup_uncharge_swap(swp_entry_ } rcu_read_unlock(); } + +/* + * free_swap_and_cache() is an best-effort function and it doesn't free + * swapent if the swapcache seems to be busy (ex. the page is locked.) + * This behavior is designed to be as is but mem+swap cgroup has to handle it. + * Otherwise, swp_entry seems to be leaked (for very long time) + */ +void mem_cgroup_mark_swapcache_stale(struct page *page) +{ + struct page_cgroup *pc; + + if (!PageSwapCache(page) || page_mapped(page)) + return; + + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + + /* + * This "Stale" flag will be cleared when the page is reused + * somewhere. + */ + if (!PageCgroupUsed(pc)) + SetPageCgroupStale(pc); + unlock_page_cgroup(pc); +} + +void mem_cgroup_fixup_swapcache(struct page *page) +{ + struct page_cgroup *pc = lookup_page_cgroup(page); + + /* Stale flag will be cleared automatically */ + if (unlikely(PageCgroupStale(pc))) { + if (get_page_unless_zero(page)) { + lock_page(page); + try_to_free_swap(page); + unlock_page(page); + page_cache_release(page); + } + } +} + #endif /* @@ -1604,17 +1645,19 @@ void mem_cgroup_end_migration(struct mem __mem_cgroup_commit_charge(mem, pc, ctype); /* - * Both of oldpage and newpage are still under lock_page(). - * Then, we don't have to care about race in radix-tree. - * But we have to be careful that this page is unmapped or not. + * oldpage is under lock_page() at migration. Then, we don't have to + * care about race in radix-tree. But we have to be careful + * that this page is unmapped or not. * * There is a case for !page_mapped(). At the start of * migration, oldpage was mapped. But now, it's zapped. * But we know *target* page is not freed/reused under us. * mem_cgroup_uncharge_page() does all necessary checks. */ - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) { mem_cgroup_uncharge_page(target); + mem_cgroup_fixup_swapcache(target); + } } /* Index: linux-2.6.30-rc2/mm/page_io.c =================================================================== --- linux-2.6.30-rc2.orig/mm/page_io.c +++ linux-2.6.30-rc2/mm/page_io.c @@ -68,6 +68,7 @@ static void end_swap_bio_write(struct bi } end_page_writeback(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } void end_swap_bio_read(struct bio *bio, int err) @@ -87,6 +88,7 @@ void end_swap_bio_read(struct bio *bio, } unlock_page(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } /* Index: linux-2.6.30-rc2/mm/swapfile.c =================================================================== --- linux-2.6.30-rc2.orig/mm/swapfile.c +++ linux-2.6.30-rc2/mm/swapfile.c @@ -587,6 +587,7 @@ int free_swap_and_cache(swp_entry_t entr if (swap_entry_free(p, entry) == 1) { page = find_get_page(&swapper_space, entry.val); if (page && !trylock_page(page)) { + mem_cgroup_mark_swapcache_stale(page); page_cache_release(page); page = NULL; } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 038C76B0055 for ; Tue, 21 Apr 2009 03:22:39 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n3L7Mxoh013465 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Tue, 21 Apr 2009 16:22:59 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id C057745DE4F for ; Tue, 21 Apr 2009 16:22:58 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 9FD5345DD71 for ; Tue, 21 Apr 2009 16:22:58 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id A576A1DB8040 for ; Tue, 21 Apr 2009 16:22:58 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 5A71C1DB8037 for ; Tue, 21 Apr 2009 16:22:55 +0900 (JST) Date: Tue, 21 Apr 2009 16:21:21 +0900 From: KAMEZAWA Hiroyuki Subject: [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller Message-Id: <20090421162121.1a1d15fe.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: "linux-mm@kvack.org" Cc: "linux-kernel@vger.kernel.org" , "nishimura@mxp.nes.nec.co.jp" , "balbir@linux.vnet.ibm.com" , "hugh@veritas.com" List-ID: maybe this patch covers almost all cases of swap-leak of memcg, Nishimura-san reported. There are many callers of lock_page() but lock_page() which can be racy with free_swap_and_cache() is not so much, I think. Nishimura-san, How about this ? == From: KAMEZAWA Hiroyuki free_swap_and_cache(), which is called under various spin_lock, is designed to be best-effort function and it does nothing if the page is locked, under I/O. But it's ok because global lru will find the page finally. But, when it comes to memory+swap cgroup, global LRU may not work and swap entry can be alive as "not used but not freed" state for very very long time because memory cgroup's LRU routine scans its own LRU only. (Such stale swap-cache is out of memcg's LRU) Nishimura repoted such kind of swap cache makes memcg unstable and - we can never free mem_cgroup object (....it's big) even if no users. - OOM killer can happen because amounts of charge-to-swap is leaked. This patch tries to fix the problem by adding PageCgroupStale() flag. If a page which is under zappped is busy swap cache, recored it as Stale. At the end of swap I/O, Stale flag is checked and swap and swapcache is freed if necessary. Page migration case is also checked. Reported-by: Daisuke Nishimura Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/page_cgroup.h | 4 +++ include/linux/swap.h | 8 ++++++ mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++---- mm/page_io.c | 2 + mm/swapfile.c | 1 5 files changed, 62 insertions(+), 4 deletions(-) Index: linux-2.6.30-rc2/include/linux/page_cgroup.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/page_cgroup.h +++ linux-2.6.30-rc2/include/linux/page_cgroup.h @@ -26,6 +26,7 @@ enum { PCG_LOCK, /* page cgroup is locked */ PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ + PCG_STALE, /* may be a stale swap-cache */ }; #define TESTPCGFLAG(uname, lname) \ @@ -46,6 +47,9 @@ TESTPCGFLAG(Cache, CACHE) TESTPCGFLAG(Used, USED) CLEARPCGFLAG(Used, USED) +TESTPCGFLAG(Stale, STALE) +SETPCGFLAG(Stale, STALE) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc->page); Index: linux-2.6.30-rc2/include/linux/swap.h =================================================================== --- linux-2.6.30-rc2.orig/include/linux/swap.h +++ linux-2.6.30-rc2/include/linux/swap.h @@ -344,10 +344,18 @@ mem_cgroup_uncharge_swapcache(struct pag #endif #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern void mem_cgroup_uncharge_swap(swp_entry_t ent); +extern void mem_cgroup_mark_swapcache_stale(struct page *page); +extern void mem_cgroup_fixup_swapcache(struct page *page); #else static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) { } +static void mem_cgroup_check_mark_swapcache_stale(struct page *page) +{ +} +static void mem_cgroup_fixup_swapcache(struct page *page) +{ +} #endif #else /* CONFIG_SWAP */ Index: linux-2.6.30-rc2/mm/memcontrol.c =================================================================== --- linux-2.6.30-rc2.orig/mm/memcontrol.c +++ linux-2.6.30-rc2/mm/memcontrol.c @@ -1534,6 +1534,47 @@ void mem_cgroup_uncharge_swap(swp_entry_ } rcu_read_unlock(); } + +/* + * free_swap_and_cache() is an best-effort function and it doesn't free + * swapent if the swapcache seems to be busy (ex. the page is locked.) + * This behavior is designed to be as is but mem+swap cgroup has to handle it. + * Otherwise, swp_entry seems to be leaked (for very long time) + */ +void mem_cgroup_mark_swapcache_stale(struct page *page) +{ + struct page_cgroup *pc; + + if (!PageSwapCache(page) || page_mapped(page)) + return; + + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + + /* + * This "Stale" flag will be cleared when the page is reused + * somewhere. + */ + if (!PageCgroupUsed(pc)) + SetPageCgroupStale(pc); + unlock_page_cgroup(pc); +} + +void mem_cgroup_fixup_swapcache(struct page *page) +{ + struct page_cgroup *pc = lookup_page_cgroup(page); + + /* Stale flag will be cleared automatically */ + if (unlikely(PageCgroupStale(pc))) { + if (get_page_unless_zero(page)) { + lock_page(page); + try_to_free_swap(page); + unlock_page(page); + page_cache_release(page); + } + } +} + #endif /* @@ -1604,17 +1645,19 @@ void mem_cgroup_end_migration(struct mem __mem_cgroup_commit_charge(mem, pc, ctype); /* - * Both of oldpage and newpage are still under lock_page(). - * Then, we don't have to care about race in radix-tree. - * But we have to be careful that this page is unmapped or not. + * oldpage is under lock_page() at migration. Then, we don't have to + * care about race in radix-tree. But we have to be careful + * that this page is unmapped or not. * * There is a case for !page_mapped(). At the start of * migration, oldpage was mapped. But now, it's zapped. * But we know *target* page is not freed/reused under us. * mem_cgroup_uncharge_page() does all necessary checks. */ - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) { mem_cgroup_uncharge_page(target); + mem_cgroup_fixup_swapcache(target); + } } /* Index: linux-2.6.30-rc2/mm/page_io.c =================================================================== --- linux-2.6.30-rc2.orig/mm/page_io.c +++ linux-2.6.30-rc2/mm/page_io.c @@ -68,6 +68,7 @@ static void end_swap_bio_write(struct bi } end_page_writeback(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } void end_swap_bio_read(struct bio *bio, int err) @@ -87,6 +88,7 @@ void end_swap_bio_read(struct bio *bio, } unlock_page(page); bio_put(bio); + mem_cgroup_fixup_swapcache(page); } /* Index: linux-2.6.30-rc2/mm/swapfile.c =================================================================== --- linux-2.6.30-rc2.orig/mm/swapfile.c +++ linux-2.6.30-rc2/mm/swapfile.c @@ -587,6 +587,7 @@ int free_swap_and_cache(swp_entry_t entr if (swap_entry_free(p, entry) == 1) { page = find_get_page(&swapper_space, entry.val); if (page && !trylock_page(page)) { + mem_cgroup_mark_swapcache_stale(page); page_cache_release(page); page = NULL; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org