* [RFC] memcg: handle swapcache leak
@ 2009-03-17  4:57 Daisuke Nishimura
  2009-03-17  5:39 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-17  4:57 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Hugh Dickins, Daisuke Nishimura

Hi.

There are (at least) two types (described below) of swapcache leak in the current memcg.

By "swapcache leak" I mean a swapcache page for which:
  a. the process that used the page has already exited (or unmapped
     the page), and
  b. the page is not linked to any memcg's LRU because it is !PageCgroupUsed.

So only global page reclaim or swapoff can free these leaked swapcaches.
This means memcg's memory pressure can use up all swap entries if
the memory size of the system is greater than that of swap.

1. race between exit and swap-in
  Assume processA is exiting and processB is doing swap-in.

  If some pages of processA have been swapped out, it calls free_swap_and_cache().
  If, at the same time, processB is calling read_swap_cache_async() on
  a swap entry *that is used by processA*, a race like the one below can happen.

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
      swap_entry_free() == 0           |
      find_get_page() -> found         |
      try_lock_page() -> fail & return |
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.

  This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
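
  For reference, the page is not linked at the last step above because
  mem_cgroup_add_lru_list() returns early for !PageCgroupUsed pages.
  Roughly (an abridged sketch, not the exact function):

===
void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
{
	struct page_cgroup *pc;
	struct mem_cgroup_per_zone *mz;

	if (mem_cgroup_disabled())
		return;
	pc = lookup_page_cgroup(page);
	/* barrier so that pc->mem_cgroup is visible */
	smp_rmb();
	if (!PageCgroupUsed(pc))
		return;		/* the racy swapcache above ends up here */

	mz = page_cgroup_zoneinfo(pc);
	MEM_CGROUP_ZSTAT(mz, lru) += 1;
	list_add(&pc->lru, &mz->lists[lru]);
}
===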

  Swapcaches leaked this way have been charged as swap, so their swap
  entries hold a reference to the associated memcg and the memcg's refcnt
  has been incremented.
  As a result, this memcg cannot be freed until global page reclaim
  frees the swapcache or swapoff is executed.

  Actually, I saw a "struct mem_cgroup" leak (checked by "grep kmalloc-1024 /proc/slabinfo")
  in my test, where I create a new directory, move all tasks to the new
  directory, and remove the old directory under memcg's memory pressure.
  This "struct mem_cgroup" leak didn't happen when
  /proc/sys/vm/page-cluster was set to 0.
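
  A minimal sketch of that test procedure (the /cgroup mount point and the
  directory names are hypothetical; the memory pressure itself comes from
  memory hogs running inside the group):

===
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	FILE *old_tasks, *new_tasks;
	char pid[32];

	mkdir("/cgroup/new", 0755);
	old_tasks = fopen("/cgroup/old/tasks", "r");
	new_tasks = fopen("/cgroup/new/tasks", "w");
	if (!old_tasks || !new_tasks)
		return 1;
	while (fgets(pid, sizeof(pid), old_tasks)) {
		fputs(pid, new_tasks);	/* one write per pid moves the task */
		fflush(new_tasks);
	}
	fclose(old_tasks);
	fclose(new_tasks);
	/*
	 * rmdir() succeeds once the group is empty, but the struct
	 * mem_cgroup can stay pinned; that is what shows up in
	 * "grep kmalloc-1024 /proc/slabinfo".
	 */
	return rmdir("/cgroup/old");
}
===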

2. race between exit and swap-out
  If page_remove_rmap() is called by the owner process on an anonymous
  page (not yet on swapcache, so it is uncharged at this point) before
  shrink_page_list() adds the page to swapcache, the page becomes a
  swapcache page with !PageCgroupUsed.

  And if this swapcache is not freed by shrink_page_list(), it goes back
  to the global LRU, but not to the memcg's LRU, because the page is
  !PageCgroupUsed.

  This type of leak can be avoided by modifying shrink_page_list() like this:

===
@@ -775,6 +776,21 @@ activate_locked:
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
+		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
+			struct page_cgroup *pc;
+
+			pc = lookup_page_cgroup(page);
+			/*
+			 * Used bit of swapcache is solid under page lock.
+			 */
+			if (unlikely(!PageCgroupUsed(pc)))
+				/*
+				 * This can happen if the page is unmapped by
+				 * the owner process before it is added to
+				 * swapcache.
+				 */
+				try_to_free_swap(page);
+		}
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);
===


I've confirmed that no leak happens in a simple swap in/out test with the above
shrink_page_list() patch applied and /proc/sys/vm/page-cluster set to 0.
(I think I should check page migration and rmdir too.)
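
For what it's worth, a minimal sketch of such a swap in/out test (the size and
pass count are assumptions, not my exact test): run it inside a memcg whose
memory.limit_in_bytes is well below SIZE, so pages keep being swapped out and
faulted back in.

===
#include <stdio.h>
#include <stdlib.h>

#define SIZE	(64UL << 20)	/* larger than the memcg limit */
#define PASSES	100

int main(void)
{
	char *buf = malloc(SIZE);
	unsigned long i, pass;

	if (!buf) {
		perror("malloc");
		return 1;
	}
	for (pass = 0; pass < PASSES; pass++)
		for (i = 0; i < SIZE; i += 4096)
			buf[i] = (char)pass;	/* touch every page */
	free(buf);
	return 0;
}
===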

I think the root cause of these problems is that !PageCgroupUsed pages are not linked
to any memcg's LRU.
So I'm now trying to implement a "dummy_memcg" to maintain !PageCgroupUsed pages.

Any comments or suggestions would be welcome.


Thanks,
Daisuke Nishimura.


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  4:57 [RFC] memcg: handle swapcache leak Daisuke Nishimura
@ 2009-03-17  5:39 ` KAMEZAWA Hiroyuki
  2009-03-17  6:11   ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17  5:39 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 13:57:02 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi.
> 
> There are (at least) 2 types(described later) of swapcache leak in current memcg.
> 
> I mean by "swapcache leak" a swapcache which:
>   a. the process that used the page has already exited(or
>      unmapped the page).
>   b. is not linked to memcg's LRU because the page is !PageCgroupUsed.
> 
> So, only the global page reclaim or swapoff can free these leaked swapcaches.
> This means memcg's memory pressure can use up all swap entries if
> the memory size of the system is greater than that of swap.
> 
> 1. race between exit and swap-in
>   Assume processA is exitting and processB is doing swap-in.
> 
>   If some pages of processA has been swapped out, it calls free_swap_and_cache().
>   And if at the same time, processB is calling read_swap_cache_async() about
>   a swap entry *that is used by processA*, a race like below can happen.
> 
>             processA                   |           processB
>   -------------------------------------+-------------------------------------
>     (free_swap_and_cache())            |  (read_swap_cache_async())
>                                        |    swap_duplicate()
>                                        |    __set_page_locked()
>                                        |    add_to_swap_cache()
>       swap_entry_free() == 0           |
                          == 1?
>       find_get_page() -> found         |
>       try_lock_page() -> fail & return |
>                                        |    lru_cache_add_anon()
>                                        |      doesn't link this page to memcg's
>                                        |      LRU, because of !PageCgroupUsed.
> 
>   This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
> 
>   And this type of leaked swapcaches have been charged as swap,
>   so swap entries of them have reference to the associated memcg
>   and the refcnt of the memcg has been incremented.
>   As a result this memcg cannot be free'ed until global page reclaim
>   frees this swapcache or swapoff is executed.
> 
Okay, this can happen.

>   Actually, I saw "struct mem_cgroup leak"(checked by "grep kmalloc-1024 /proc/slabinfo")
>   in my test, where I create a new directory, move all tasks to the new
>   directory, and remove the old directory under memcg's memory pressure.
>   And, this "struct mem_cgroup leak" didn't happen with setting
>   /proc/sys/vm/page-cluster to 0.
> 

Hmm, but IMHO, this is not a "leak". A "leak" means the object will never be freed.
This is a "delay".

And I tend to allow this. (A stale SwapCache page will stay on the LRU until the
global LRU scan finds it, but that's not what I'd call a leak.)



> 2. race between exit and swap-out
>   If page_remove_rmap() is called by the owner process about an anonymous
>   page(not on swapchache, so uncharged here) before shrink_page_list() adds
>   the page to swapcache, this page becomes a swapcache with !PageCgroupUsed.
> 
>   And if this swapcache is not free'ed by shrink_page_list(), it goes back
>   to global LRU, but doesn't go back to memcg's LRU because the page is
>   !PageCgroupUsed.
> 
>   This type of leak can be avoided by modifying shrink_page_list() like:
> 
> ===
> @@ -775,6 +776,21 @@ activate_locked:
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> +		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> +			struct page_cgroup *pc;
> +
> +			pc = lookup_page_cgroup(page);
> +			/*
> +			 * Used bit of swapcache is solid under page lock.
> +			 */
> +			if (unlikely(!PageCgroupUsed(pc)))
> +				/*
> +				 * This can happen if the page is unmapped by
> +				 * the owner process before it is added to
> +				 * swapcache.
> +				 */
> +				try_to_free_swap(page);
> +		}
>  		unlock_page(page);
>  keep:
>  		list_add(&page->lru, &ret_pages);
> ===
> 
> 
> I've confirmed that no leak happens with this patch for shrink_page_list() applied
> and setting /proc/sys/vm/page-cluster to 0 in a simple swap in/out test.
> (I think I should check page migration and rmdir too.)
> 

But this is also a "delay", isn't it ?

I think both "delays" come from the nature of the current LRU design, which allows
small windows of this kind. But there is no "leak".

IMHO, I tend to allow this kind of "delay" considering the trade-off.

I have no problem as long as rmdir() can succeed.

Thanks,
-Kame


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  5:39 ` KAMEZAWA Hiroyuki
@ 2009-03-17  6:11   ` Daisuke Nishimura
  2009-03-17  7:29     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-17  6:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 14:39:03 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 17 Mar 2009 13:57:02 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > Hi.
> > 
> > There are (at least) 2 types(described later) of swapcache leak in current memcg.
> > 
> > I mean by "swapcache leak" a swapcache which:
> >   a. the process that used the page has already exited(or
> >      unmapped the page).
> >   b. is not linked to memcg's LRU because the page is !PageCgroupUsed.
> > 
> > So, only the global page reclaim or swapoff can free these leaked swapcaches.
> > This means memcg's memory pressure can use up all swap entries if
> > the memory size of the system is greater than that of swap.
> > 
> > 1. race between exit and swap-in
> >   Assume processA is exitting and processB is doing swap-in.
> > 
> >   If some pages of processA has been swapped out, it calls free_swap_and_cache().
> >   And if at the same time, processB is calling read_swap_cache_async() about
> >   a swap entry *that is used by processA*, a race like below can happen.
> > 
> >             processA                   |           processB
> >   -------------------------------------+-------------------------------------
> >     (free_swap_and_cache())            |  (read_swap_cache_async())
> >                                        |    swap_duplicate()
> >                                        |    __set_page_locked()
> >                                        |    add_to_swap_cache()
> >       swap_entry_free() == 0           |
>                           == 1?
> >       find_get_page() -> found         |
> >       try_lock_page() -> fail & return |
> >                                        |    lru_cache_add_anon()
> >                                        |      doesn't link this page to memcg's
> >                                        |      LRU, because of !PageCgroupUsed.
> > 
> >   This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
> > 
> >   And this type of leaked swapcaches have been charged as swap,
> >   so swap entries of them have reference to the associated memcg
> >   and the refcnt of the memcg has been incremented.
> >   As a result this memcg cannot be free'ed until global page reclaim
> >   frees this swapcache or swapoff is executed.
> > 
> Okay. can happen.
> 
> >   Actually, I saw "struct mem_cgroup leak"(checked by "grep kmalloc-1024 /proc/slabinfo")
> >   in my test, where I create a new directory, move all tasks to the new
> >   directory, and remove the old directory under memcg's memory pressure.
> >   And, this "struct mem_cgroup leak" didn't happen with setting
> >   /proc/sys/vm/page-cluster to 0.
> > 
> 
> Hmm, but IHMO, this is not "leak". "leak" means the object will not be freed forever.
> This is a "delay".
> 
> And I tend to allow this. (stale SwapCache will be on LRU until global LRU found it,
> but it's not called leak.)
> 
You're right, but memcg's reclaim doesn't scan the global LRU,
so these swapcaches cannot be freed by memcg's reclaim.

This means that a system under memcg's memory pressure but without
global memory pressure can use up swap space as swapcaches, doesn't it ?
That's what I'm worrying about.
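
(For reference, a minimal way to watch this from userspace; a sketch, not part
of any fix. If stale swapcaches keep piling up, free swap keeps shrinking even
though no cgroup is anywhere near its memsw limit.)

===
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;

	if (sysinfo(&si))
		return 1;
	printf("swap: %lu kB total, %lu kB free\n",
	       si.totalswap * si.mem_unit / 1024,
	       si.freeswap * si.mem_unit / 1024);
	return 0;
}
===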


Thanks,
Daisuke Nishimura.

> 
> 
> > 2. race between exit and swap-out
> >   If page_remove_rmap() is called by the owner process about an anonymous
> >   page(not on swapchache, so uncharged here) before shrink_page_list() adds
> >   the page to swapcache, this page becomes a swapcache with !PageCgroupUsed.
> > 
> >   And if this swapcache is not free'ed by shrink_page_list(), it goes back
> >   to global LRU, but doesn't go back to memcg's LRU because the page is
> >   !PageCgroupUsed.
> > 
> >   This type of leak can be avoided by modifying shrink_page_list() like:
> > 
> > ===
> > @@ -775,6 +776,21 @@ activate_locked:
> >  		SetPageActive(page);
> >  		pgactivate++;
> >  keep_locked:
> > +		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> > +			struct page_cgroup *pc;
> > +
> > +			pc = lookup_page_cgroup(page);
> > +			/*
> > +			 * Used bit of swapcache is solid under page lock.
> > +			 */
> > +			if (unlikely(!PageCgroupUsed(pc)))
> > +				/*
> > +				 * This can happen if the page is unmapped by
> > +				 * the owner process before it is added to
> > +				 * swapcache.
> > +				 */
> > +				try_to_free_swap(page);
> > +		}
> >  		unlock_page(page);
> >  keep:
> >  		list_add(&page->lru, &ret_pages);
> > ===
> > 
> > 
> > I've confirmed that no leak happens with this patch for shrink_page_list() applied
> > and setting /proc/sys/vm/page-cluster to 0 in a simple swap in/out test.
> > (I think I should check page migration and rmdir too.)
> > 
> 
> But this is also "delay", isn't it ?
> 
> I think both "delay" comes from nature of current LRU desgin which allows small window
> of this kinds. But there is no "leak". 
> 
> IMHO, I tend to allow this kinds of "delay" considering trade-off.
> 
> I have no troubles if rmdir() can success.
> 
> Thanks,
> -Kame
> 


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  6:11   ` Daisuke Nishimura
@ 2009-03-17  7:29     ` KAMEZAWA Hiroyuki
  2009-03-17  9:38       ` KAMEZAWA Hiroyuki
  2009-03-18  0:08       ` [RFC] memcg: handle swapcache leak Daisuke Nishimura
  0 siblings, 2 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17  7:29 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 15:11:13 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:


> > Hmm, but IHMO, this is not "leak". "leak" means the object will not be freed forever.
> > This is a "delay".
> > 
> > And I tend to allow this. (stale SwapCache will be on LRU until global LRU found it,
> > but it's not called leak.)
> > 
> You're right, but memcg's reclaim doesn't scan global LRU,
> so these swapcaches cannot be free'ed by memcg's reclaim.
> 
right.

> This means that a system with memcg's memory pressure but without
> global memory pressure can use up swap space as swapcaches, doesn't it ?
> That's what I'm worrying about.
> 
This kind of behavior (don't add to the memcg LRU if !PageCgroupUsed()) is for swapin-readahead.
We need this behavior.

We have never seen swap exhausted by this issue.....but yes, the probability is not 0%.

Without memcg, when a page is added to swap, the global LRU runs anyway.
With memcg, when a page is added to swap, the global LRU may not run.

Give me time, I'll find a fix.

Thanks,
-Kame


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  7:29     ` KAMEZAWA Hiroyuki
@ 2009-03-17  9:38       ` KAMEZAWA Hiroyuki
  2009-03-18  1:17         ` Daisuke Nishimura
  2009-03-18  0:08       ` [RFC] memcg: handle swapcache leak Daisuke Nishimura
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17  9:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 16:29:50 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Give me time, I'll find a fix.
> 
Following is the result of a quick hack, *not* tested.

But how does this look ? (please ignore the garbage..)

==
---
 include/linux/page_cgroup.h |   13 ++++
 mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 146 insertions(+), 7 deletions(-)

Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+        { return test_and_set_bit(PCG_##lname, &pc->flags);}
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar13/mm/memcontrol.c
@@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 };
 
 /* for encoding cft->private value on file */
-#define _MEM			(0)
-#define _MEMSWAP		(1)
-#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
-#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
-#define MEMFILE_ATTR(val)	((val) & 0xffff)
+#define _MEM                   (0)
+#define _MEMSWAP               (1)
+#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
+#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val)      ((val) & 0xffff)
+
+/* for orphan page_cgroups, guarded by zone->lock. */
+struct orphan_pcg_list {
+	struct list_head zone[MAX_NR_ZONES];
+};
+struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
+atomic_t num_orphan_pages;
+
+static inline struct list_head *orphan_lru(int nid, int zid)
+{
+	/*
+	 * to kick this BUG_ON(), swapcache must be generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
 
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
@@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
+		list_del_init(&pc->lru);
+		atomic_dec(&num_orphan_pages);
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
 	 */
 	smp_rmb();
 	/* unused page is not rotated. */
-	if (!PageCgroupUsed(pc))
+	if (unlikely(!PageCgroupUsed(pc)))
 		return;
 	mz = page_cgroup_zoneinfo(pc);
 	list_move(&pc->lru, &mz->lists[lru]);
@@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
+	if (unlikely(!PageCgroupUsed(pc))) {
+		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
+			struct list_head *lru;
+			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
+			list_add_tail(&pc->lru, lru);
+			atomic_inc(&num_orphan_pages);
+		}
 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
 	return num;
 }
 
+
+
+/* Using big number here for avoiding to free swap-cache of readahead. */
+#define CHECK_ORPHAN_THRESH  (4096)
+
+static __init void init_orphan_lru(void)
+{
+	struct orphan_pcg_list *opl;
+	int nid, zid;
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			INIT_LIST_HEAD(&opl->zone[zid]);
+		orphan_list[nid] = opl;
+	}
+}
+/* 
+ * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
+ * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct list_head *lru = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain, scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	/* check one by one */
+	scan = 0;
+	drain = 0;
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
+		scan++;
+		pc = list_entry(lru->next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, lru);
+		/* get page for isolate_lru_page() */
+		if (get_page_unless_zero(page)) {
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			if (!isolate_lru_page(page)) {
+				/* This page is not ON LRU */
+				if (trylock_page(page)) {
+					drain += try_to_free_swap(page);
+					unlock_page(page);
+				}
+				putback_lru_page(page);
+			}
+			put_page(page);
+			spin_lock_irqsave(&zone->lru_lock, flags);		
+		}
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+static int last_visit;
+void check_stale_swapcaches(void)
+{
+	int nid, zid, drain;
+	
+	nid = last_visit;
+	drain = 0;
+	
+	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
+		return;
+		
+again:
+	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
+	if (nid == MAX_NUMNODES) {
+		nid = 0;
+		if (!node_state(nid, N_HIGH_MEMORY))
+			goto again;
+	}
+	last_visit = nid;
+
+	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
+		drain += drain_orphan_swapcaches(nid, zid);
+}
+
+
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
 	int ret, total = 0;
 	int loop = 0;
 
+	if (vm_swap_full())
+		check_stale_swapcaches();
+
 	while (loop < 2) {
 		victim = mem_cgroup_select_victim(root_mem);
 		if (victim == root_mem)
@@ -2454,6 +2579,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  7:29     ` KAMEZAWA Hiroyuki
  2009-03-17  9:38       ` KAMEZAWA Hiroyuki
@ 2009-03-18  0:08       ` Daisuke Nishimura
  1 sibling, 0 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18  0:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 16:29:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 17 Mar 2009 15:11:13 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> 
> > > Hmm, but IHMO, this is not "leak". "leak" means the object will not be freed forever.
> > > This is a "delay".
> > > 
> > > And I tend to allow this. (stale SwapCache will be on LRU until global LRU found it,
> > > but it's not called leak.)
> > > 
> > You're right, but memcg's reclaim doesn't scan global LRU,
> > so these swapcaches cannot be free'ed by memcg's reclaim.
> > 
> right.
> 
> > This means that a system with memcg's memory pressure but without
> > global memory pressure can use up swap space as swapcaches, doesn't it ?
> > That's what I'm worrying about.
> > 
> This kind of behavior (don't add to LRU if !PageCgroupUsed()) is for swapin-readahead.
> We need this hebavior.
> 
> We never see the swap is exhausted by this issue .....but yes, not 0%.
> 
Just FYI.
I ran 5 programs last night, each using 8MB, with mem.limit=32M
and 30MB of swap on the system.
All swap space was used up by swapcache and some programs were oom'ed.


Thanks,
Daisuke Nishimura.

> Without memcg, when the page is added to swap, global LRU runs, anyway.
> With memcg, when the page is added to swap, global LRU will not runs.
> 
> Give me time, I'll find a fix.
> 
> Thanks,
> -Kame
> 


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  9:38       ` KAMEZAWA Hiroyuki
@ 2009-03-18  1:17         ` Daisuke Nishimura
  2009-03-18  1:34           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18  1:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 18:38:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 17 Mar 2009 16:29:50 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Give me time, I'll find a fix.
> > 
> Fowlling is  a result of quick hack, *not* tested.
> 
> but how this looks ? (please ignore garbage..)
> 
I agree with managing special per-zone lists to handle these swapcaches.

A few comments are below.

> ==
> ---
>  include/linux/page_cgroup.h |   13 ++++
>  mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 146 insertions(+), 7 deletions(-)
> 
> Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +        { return test_and_set_bit(PCG_##lname, &pc->flags);}
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> @@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
>  };
>  
>  /* for encoding cft->private value on file */
> -#define _MEM			(0)
> -#define _MEMSWAP		(1)
> -#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> -#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> +#define _MEM                   (0)
> +#define _MEMSWAP               (1)
> +#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
> +#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
> +#define MEMFILE_ATTR(val)      ((val) & 0xffff)
> +
> +/* for orphan page_cgroups, guarded by zone->lock. */
> +struct orphan_pcg_list {
> +	struct list_head zone[MAX_NR_ZONES];
> +};
> +struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
> +atomic_t num_orphan_pages;
> +
> +static inline struct list_head *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * to kick this BUG_ON(), swapcache must be generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
>  
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
> @@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
> +		list_del_init(&pc->lru);
> +		atomic_dec(&num_orphan_pages);
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
Is this "PageSwapCache(page)" check needed ?

What happens, for example, if a swapcache page which has been swapped in by
readahead and has not been mapped by the owner process is zapped by that process ?
IIUC, free_swap_and_cache() removes the page from the swapcache before the page
is removed from the LRU.

> @@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
>  	 */
>  	smp_rmb();
>  	/* unused page is not rotated. */
> -	if (!PageCgroupUsed(pc))
> +	if (unlikely(!PageCgroupUsed(pc)))
>  		return;
>  	mz = page_cgroup_zoneinfo(pc);
>  	list_move(&pc->lru, &mz->lists[lru]);
> @@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> +	if (unlikely(!PageCgroupUsed(pc))) {
> +		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
> +			struct list_head *lru;
> +			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
> +			list_add_tail(&pc->lru, lru);
> +			atomic_inc(&num_orphan_pages);
> +		}
>  		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> @@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
>  	return num;
>  }
>  
> +
> +
> +/* Using big number here for avoiding to free swap-cache of readahead. */
> +#define CHECK_ORPHAN_THRESH  (4096)
> +
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_pcg_list *opl;
> +	int nid, zid;
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			INIT_LIST_HEAD(&opl->zone[zid]);
> +		orphan_list[nid] = opl;
> +	}
> +}
> +/* 
> + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> + */
> +static int drain_orphan_swapcaches(int nid, int zid)
> +{
> +	struct page_cgroup *pc;
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *lru = orphan_lru(nid, zid);
> +	unsigned long flags;
> +	int drain, scan;
> +
> +	zone = &NODE_DATA(nid)->node_zones[zid];
> +	/* check one by one */
> +	scan = 0;
> +	drain = 0;
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
> +		scan++;
> +		pc = list_entry(lru->next, struct page_cgroup, lru);
> +		page = pc->page;
> +		/* Rotate */
> +		list_del(&pc->lru);
> +		list_add_tail(&pc->lru, lru);
> +		/* get page for isolate_lru_page() */
> +		if (get_page_unless_zero(page)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			if (!isolate_lru_page(page)) {
> +				/* This page is not ON LRU */
> +				if (trylock_page(page)) {
> +					drain += try_to_free_swap(page);
> +					unlock_page(page);
> +				}
> +				putback_lru_page(page);
> +			}
> +			put_page(page);
> +			spin_lock_irqsave(&zone->lru_lock, flags);		
> +		}
> +	}
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +	return drain;
> +}
> +
> +static int last_visit;
> +void check_stale_swapcaches(void)
> +{
> +	int nid, zid, drain;
> +	
> +	nid = last_visit;
> +	drain = 0;
> +	
> +	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
> +		return;
> +		
> +again:
> +	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> +	if (nid == MAX_NUMNODES) {
> +		nid = 0;
> +		if (!node_state(nid, N_HIGH_MEMORY))
> +			goto again;
> +	}
> +	last_visit = nid;
> +
> +	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
> +		drain += drain_orphan_swapcaches(nid, zid);
> +}
> +
> +
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
>  	int ret, total = 0;
>  	int loop = 0;
>  
> +	if (vm_swap_full())
> +		check_stale_swapcaches();
> +
hmm... vm_swap_full() would be enough from the kernel's point of view, but
users can see the "leak" if the swap size is big (I don't want to hear
"hey! something is leaking!").

Just removing vm_swap_full() is enough, isn't it ?
check_stale_swapcaches() already checks the threshold.


Thanks,
Daisuke Nishimura.

>  	while (loop < 2) {
>  		victim = mem_cgroup_select_victim(root_mem);
>  		if (victim == root_mem)
> @@ -2454,6 +2579,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> 


* Re: [RFC] memcg: handle swapcache leak
  2009-03-18  1:17         ` Daisuke Nishimura
@ 2009-03-18  1:34           ` KAMEZAWA Hiroyuki
  2009-03-18  3:51             ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18  1:34 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 10:17:27 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > ==
> > ---
> >  include/linux/page_cgroup.h |   13 ++++
> >  mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 146 insertions(+), 7 deletions(-)
> > 
> > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > @@ -26,6 +26,7 @@ enum {
> >  	PCG_LOCK,  /* page cgroup is locked */
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> >  };
> >  
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> >  
> > +#define TESTSETPCGFLAG(uname, lname) \
> > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > +        { return test_and_set_bit(PCG_##lname, &pc->flags);}
> > +
> > +#define TESTCLEARPCGFLAG(uname, lname) \
> > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > +        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
> > +
> >  /* Cache flag is set only once (at allocation) */
> >  TESTPCGFLAG(Cache, CACHE)
> >  
> >  TESTPCGFLAG(Used, USED)
> >  CLEARPCGFLAG(Used, USED)
> >  
> > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > +
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > @@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> >  };
> >  
> >  /* for encoding cft->private value on file */
> > -#define _MEM			(0)
> > -#define _MEMSWAP		(1)
> > -#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> > -#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> > -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> > +#define _MEM                   (0)
> > +#define _MEMSWAP               (1)
> > +#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
> > +#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
> > +#define MEMFILE_ATTR(val)      ((val) & 0xffff)
> > +
> > +/* for orphan page_cgroups, guarded by zone->lock. */
> > +struct orphan_pcg_list {
> > +	struct list_head zone[MAX_NR_ZONES];
> > +};
> > +struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
> > +atomic_t num_orphan_pages;
> > +
> > +static inline struct list_head *orphan_lru(int nid, int zid)
> > +{
> > +	/*
> > +	 * to kick this BUG_ON(), swapcache must be generated while init.
> > +	 * or NID should be invalid.
> > +	 */
> > +	BUG_ON(!orphan_list[nid]);
> > +	return  &orphan_list[nid]->zone[zid];
> > +}
> > +
> >  
> >  static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> > @@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  	pc = lookup_page_cgroup(page);
> > +	/*
> > +	 * If the page is SwapCache and already on global LRU, it will be on
> > +	 * orphan list. remove here
> > +	 */
> > +	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
> > +		list_del_init(&pc->lru);
> > +		atomic_dec(&num_orphan_pages);
> > +	}
> >  	/* can happen while we handle swapcache. */
> >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> >  		return;
> Is this check "PageSwapCache(page)" needed ?
> 
Because TestClearPageCgroupOrphan() is atomic, I filter it with the PageSwapCache() check first.

> What happens, for example, if a swapcache which has been swaped-in by readahead
> and has not been mapped by the owner process is zapped by the process ?
It will never be zapped by a process. It'll stay on the LRU as a stale SwapCache page.
This orphan list is for such pages.

> IIUC, free_swap_and_cache() removes the page from swapcache before the page is
> removed from LRU.
> 
If the page is removed from the SwapCache, there is no problem of a swp_entry leak,
right ?

I'm sorry if I'm missing your point.


> > @@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
> >  	 */
> >  	smp_rmb();
> >  	/* unused page is not rotated. */
> > -	if (!PageCgroupUsed(pc))
> > +	if (unlikely(!PageCgroupUsed(pc)))
> >  		return;
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	list_move(&pc->lru, &mz->lists[lru]);
> > @@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
> >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> >  	 */
> >  	smp_rmb();
> > -	if (!PageCgroupUsed(pc))
> > +	if (unlikely(!PageCgroupUsed(pc))) {
> > +		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
> > +			struct list_head *lru;
> > +			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
> > +			list_add_tail(&pc->lru, lru);
> > +			atomic_inc(&num_orphan_pages);
> > +		}
> >  		return;
> > +	}
> >  
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > @@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
> >  	return num;
> >  }
> >  
> > +
> > +
> > +/* Using big number here for avoiding to free swap-cache of readahead. */
> > +#define CHECK_ORPHAN_THRESH  (4096)
> > +
> > +static __init void init_orphan_lru(void)
> > +{
> > +	struct orphan_pcg_list *opl;
> > +	int nid, zid;
> > +
> > +	for_each_node_state(nid, N_POSSIBLE) {
> > +		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
> > +		BUG_ON(!opl);
> > +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > +			INIT_LIST_HEAD(&opl->zone[zid]);
> > +		orphan_list[nid] = opl;
> > +	}
> > +}
> > +/* 
> > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > + */
> > +static int drain_orphan_swapcaches(int nid, int zid)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct list_head *lru = orphan_lru(nid, zid);
> > +	unsigned long flags;
> > +	int drain, scan;
> > +
> > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > +	/* check one by one */
> > +	scan = 0;
> > +	drain = 0;
> > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > +	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
> > +		scan++;
> > +		pc = list_entry(lru->next, struct page_cgroup, lru);
> > +		page = pc->page;
> > +		/* Rotate */
> > +		list_del(&pc->lru);
> > +		list_add_tail(&pc->lru, lru);
> > +		/* get page for isolate_lru_page() */
> > +		if (get_page_unless_zero(page)) {
> > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +			if (!isolate_lru_page(page)) {
> > +				/* This page is not ON LRU */
> > +				if (trylock_page(page)) {
> > +					drain += try_to_free_swap(page);
> > +					unlock_page(page);
> > +				}
> > +				putback_lru_page(page);
> > +			}
> > +			put_page(page);
> > +			spin_lock_irqsave(&zone->lru_lock, flags);		
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +
> > +	return drain;
> > +}
> > +
> > +static int last_visit;
> > +void check_stale_swapcaches(void)
> > +{
> > +	int nid, zid, drain;
> > +	
> > +	nid = last_visit;
> > +	drain = 0;
> > +	
> > +	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
> > +		return;
> > +		
> > +again:
> > +	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > +	if (nid == MAX_NUMNODES) {
> > +		nid = 0;
> > +		if (!node_state(nid, N_HIGH_MEMORY))
> > +			goto again;
> > +	}
> > +	last_visit = nid;
> > +
> > +	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
> > +		drain += drain_orphan_swapcaches(nid, zid);
> > +}
> > +
> > +
> > +
> >  /*
> >   * Visit the first child (need not be the first child as per the ordering
> >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > @@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
> >  	int ret, total = 0;
> >  	int loop = 0;
> >  
> > +	if (vm_swap_full())
> > +		check_stale_swapcaches();
> > +
> hmm... vm_swap_full() would be enough from the kernel point of view, but
> users can see the "leak" if the swapsize is big(I don't want to here
> "hey! something is leaking!").
> 
> Just removing vm_swap_full() is enough, isn't it ?
> check_stale_swapcaches() checks the threshhold.
> 
How about
      if (vm_swap_full() || noswap)

?

Thanks,
-Kame




* Re: [RFC] memcg: handle swapcache leak
  2009-03-18  1:34           ` KAMEZAWA Hiroyuki
@ 2009-03-18  3:51             ` Daisuke Nishimura
  2009-03-18  4:05               ` KAMEZAWA Hiroyuki
  2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18  3:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 10:34:18 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 18 Mar 2009 10:17:27 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > ==
> > > ---
> > >  include/linux/page_cgroup.h |   13 ++++
> > >  mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
> > >  2 files changed, 146 insertions(+), 7 deletions(-)
> > > 
> > > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > @@ -26,6 +26,7 @@ enum {
> > >  	PCG_LOCK,  /* page cgroup is locked */
> > >  	PCG_CACHE, /* charged as cache */
> > >  	PCG_USED, /* this object is in use. */
> > > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> > >  };
> > >  
> > >  #define TESTPCGFLAG(uname, lname)			\
> > > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> > >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> > >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> > >  
> > > +#define TESTSETPCGFLAG(uname, lname) \
> > > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > > +        { return test_and_set_bit(PCG_##lname, &pc->flags);}
> > > +
> > > +#define TESTCLEARPCGFLAG(uname, lname) \
> > > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > > +        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
> > > +
> > >  /* Cache flag is set only once (at allocation) */
> > >  TESTPCGFLAG(Cache, CACHE)
> > >  
> > >  TESTPCGFLAG(Used, USED)
> > >  CLEARPCGFLAG(Used, USED)
> > >  
> > > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > > +
> > > +
> > >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> > >  {
> > >  	return page_to_nid(pc->page);
> > > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > @@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> > >  };
> > >  
> > >  /* for encoding cft->private value on file */
> > > -#define _MEM			(0)
> > > -#define _MEMSWAP		(1)
> > > -#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> > > -#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> > > -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> > > +#define _MEM                   (0)
> > > +#define _MEMSWAP               (1)
> > > +#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
> > > +#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
> > > +#define MEMFILE_ATTR(val)      ((val) & 0xffff)
> > > +
> > > +/* for orphan page_cgroups, guarded by zone->lock. */
> > > +struct orphan_pcg_list {
> > > +	struct list_head zone[MAX_NR_ZONES];
> > > +};
> > > +struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
> > > +atomic_t num_orphan_pages;
> > > +
> > > +static inline struct list_head *orphan_lru(int nid, int zid)
> > > +{
> > > +	/*
> > > +	 * to kick this BUG_ON(), swapcache must be generated while init.
> > > +	 * or NID should be invalid.
> > > +	 */
> > > +	BUG_ON(!orphan_list[nid]);
> > > +	return  &orphan_list[nid]->zone[zid];
> > > +}
> > > +
> > >  
> > >  static void mem_cgroup_get(struct mem_cgroup *mem);
> > >  static void mem_cgroup_put(struct mem_cgroup *mem);
> > > @@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
> > >  	if (mem_cgroup_disabled())
> > >  		return;
> > >  	pc = lookup_page_cgroup(page);
> > > +	/*
> > > +	 * If the page is SwapCache and already on global LRU, it will be on
> > > +	 * orphan list. remove here
> > > +	 */
> > > +	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
> > > +		list_del_init(&pc->lru);
> > > +		atomic_dec(&num_orphan_pages);
> > > +	}
> > >  	/* can happen while we handle swapcache. */
> > >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> > >  		return;
> > Is this check "PageSwapCache(page)" needed ?
> > 
> Because TestClearPageCgroupOrhpan() is atomic, I filter it by SwapCache() check.
> 
> > What happens, for example, if a swapcache which has been swaped-in by readahead
> > and has not been mapped by the owner process is zapped by the process ?
> Will never be zapped by process. It'll be on LRU as stale SwapCache().
> This orphan list is for such pages.
> 
Ah, I'm talking about the case of:
  zap_pte_range() -> (the pte holds a swp_entry whose page is in swapcache) free_swap_and_cache()

> > IIUC, free_swap_and_cache() removes the page from swapcache before the page is
> > removed from LRU.
> > 
> If the page is removed from SwapCache, there is no problem of swp_entry leak.
> right ?
> 
> I'm sorry if I miss your point.
> 
> 
Yes, there would be no swp_entry leak, but these pages, which have been
removed from the swapcache and are being freed by free_swap_and_cache(),
cannot be removed from the orphan_lru, although they are removed from the
global LRU, right ?

> > > @@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
> > >  	 */
> > >  	smp_rmb();
> > >  	/* unused page is not rotated. */
> > > -	if (!PageCgroupUsed(pc))
> > > +	if (unlikely(!PageCgroupUsed(pc)))
> > >  		return;
> > >  	mz = page_cgroup_zoneinfo(pc);
> > >  	list_move(&pc->lru, &mz->lists[lru]);
> > > @@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
> > >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> > >  	 */
> > >  	smp_rmb();
> > > -	if (!PageCgroupUsed(pc))
> > > +	if (unlikely(!PageCgroupUsed(pc))) {
> > > +		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
> > > +			struct list_head *lru;
> > > +			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
> > > +			list_add_tail(&pc->lru, lru);
> > > +			atomic_inc(&num_orphan_pages);
> > > +		}
> > >  		return;
> > > +	}
> > >  
> > >  	mz = page_cgroup_zoneinfo(pc);
> > >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > > @@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
> > >  	return num;
> > >  }
> > >  
> > > +
> > > +
> > > +/* Using big number here for avoiding to free swap-cache of readahead. */
> > > +#define CHECK_ORPHAN_THRESH  (4096)
> > > +
> > > +static __init void init_orphan_lru(void)
> > > +{
> > > +	struct orphan_pcg_list *opl;
> > > +	int nid, zid;
> > > +
> > > +	for_each_node_state(nid, N_POSSIBLE) {
> > > +		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
> > > +		BUG_ON(!opl);
> > > +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > > +			INIT_LIST_HEAD(&opl->zone[zid]);
> > > +		orphan_list[nid] = opl;
> > > +	}
> > > +}
> > > +/* 
> > > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > > + */
> > > +static int drain_orphan_swapcaches(int nid, int zid)
> > > +{
> > > +	struct page_cgroup *pc;
> > > +	struct zone *zone;
> > > +	struct page *page;
> > > +	struct list_head *lru = orphan_lru(nid, zid);
> > > +	unsigned long flags;
> > > +	int drain, scan;
> > > +
> > > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > > +	/* check one by one */
> > > +	scan = 0;
> > > +	drain = 0;
> > > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > > +	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
> > > +		scan++;
> > > +		pc = list_entry(lru->next, struct page_cgroup, lru);
> > > +		page = pc->page;
> > > +		/* Rotate */
> > > +		list_del(&pc->lru);
> > > +		list_add_tail(&pc->lru, lru);
> > > +		/* get page for isolate_lru_page() */
> > > +		if (get_page_unless_zero(page)) {
> > > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +			if (!isolate_lru_page(page)) {
> > > +				/* This page is not ON LRU */
> > > +				if (trylock_page(page)) {
> > > +					drain += try_to_free_swap(page);
> > > +					unlock_page(page);
> > > +				}
> > > +				putback_lru_page(page);
> > > +			}
> > > +			put_page(page);
> > > +			spin_lock_irqsave(&zone->lru_lock, flags);		
> > > +		}
> > > +	}
> > > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +
> > > +	return drain;
> > > +}
> > > +
> > > +static int last_visit;
> > > +void check_stale_swapcaches(void)
> > > +{
> > > +	int nid, zid, drain;
> > > +	
> > > +	nid = last_visit;
> > > +	drain = 0;
> > > +	
> > > +	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
> > > +		return;
> > > +		
> > > +again:
> > > +	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > > +	if (nid == MAX_NUMNODES) {
> > > +		nid = 0;
> > > +		if (!node_state(nid, N_HIGH_MEMORY))
> > > +			goto again;
> > > +	}
> > > +	last_visit = nid;
> > > +
> > > +	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
> > > +		drain += drain_orphan_swapcaches(nid, zid);
> > > +}
> > > +
> > > +
> > > +
> > >  /*
> > >   * Visit the first child (need not be the first child as per the ordering
> > >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > > @@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
> > >  	int ret, total = 0;
> > >  	int loop = 0;
> > >  
> > > +	if (vm_swap_full())
> > > +		check_stale_swapcaches();
> > > +
> > hmm... vm_swap_full() would be enough from the kernel point of view, but
> > users can see the "leak" if the swapsize is big(I don't want to here
> > "hey! something is leaking!").
> > 
> > Just removing vm_swap_full() is enough, isn't it ?
> > check_stale_swapcaches() checks the threshhold.
> > 
> How about
>       if (vm_swap_full() || noswap)
> 
> ?
> 
It may work for the type-1 swapcaches that I described in my first mail,
because their memsw charges are not uncharged while they are on the swapcache.

But it doesn't work for the type-2 swapcaches, because they are uncharged
from both mem and memsw.

Hmm, should I send the patch for shrink_page_list() attached in my first mail
as a separate patch ?


Thanks,
Daisuke Nishimura.


* Re: [RFC] memcg: handle swapcache leak
  2009-03-18  3:51             ` Daisuke Nishimura
@ 2009-03-18  4:05               ` KAMEZAWA Hiroyuki
  2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18  4:05 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 12:51:54 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > I'm sorry if I miss your point.
> > 
> > 
> Yes, there would be no problem of swp_entry leak, but these pages, which have
> been removed from swapcache and are being free'ed by free_swap_and_cache,
> cannot be removed from orphan_lru, although they are removed from global LRU, right ?
> 
Ah, I see, thank you. The SwapCache flag is cleared before the page is deleted from the LRU.
OK, will fix.

> > 
> It may work for type-1 of swapcaches that I described in first mail,
> because memsw charges of them are not uncharged while they are on swapcache.
> 
> But it doesn't work for type-2 of swapcaches because they are uncharged
> from both mem and memsw.
> 
> Hmm, should I send a patch for shrink_page_list() attached in first mail
> as another patch ?
> 
OK, just check the number of pages on the orphan list.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18  3:51             ` Daisuke Nishimura
  2009-03-18  4:05               ` KAMEZAWA Hiroyuki
@ 2009-03-18  8:57               ` KAMEZAWA Hiroyuki
  2009-03-18 14:17                 ` Daisuke Nishimura
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18  8:57 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

How about this? I did a short test and this seems to work well.
I'd be glad if you would share your test method with us.
(Hopefully, update memcg_debug.txt ;)

Changes:
 - modified the condition that triggers reclaim of orphan pages.

If I get a good answer, I'll repost this with a CC: to Andrew.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Nishimura reported that unused swap caches are not reclaimed well under memcg.

Assume that a memory cgroup limits the memory usage of all applications
and file caches well, so the global LRU scan (kswapd etc.) never runs.

First, there is an *allowed* race that leaves SwapCache pages on the global LRU.
SwapCache pages can stay on the global LRU even when their swp_entry is no longer
referenced by anyone (no ptes). When a global LRU scan runs, they are reclaimed
by try_to_free_swap(). But they never appear on memcg's private LRU and are never
reclaimed by memcg's reclaim routines.

Second, there are readahead SwapCaches; some of them end up never being used and
are eventually reclaimed by the global LRU when a scan runs. But they are not on
memcg's private LRU and will not be reclaimed until a global LRU scan runs.

From memcg's point of view, the above two cases are not good. Especially, *unused*
swp_entries add pressure to memcg's mem+swap controller and can finally cause OOM.
(Nishimura confirmed this can cause OOM.)

This patch tries to reclaim unused swap caches by
  - adding a list for unused swap caches (orphan_list)
  - trying to reclaim the orphan list when it exceeds a threshold.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |   13 +++
 mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 197 insertions(+), 1 deletion(-)

Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar13/mm/memcontrol.c
@@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroup which is not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is typical type but
+ * there are other corner cases.
+ *
+ * Usually, updates to this list happens when swap cache is readaheaded and
+ * finally used by process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		int event;
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+#define ORPHAN_EVENT_THRESH (256)
+static void check_orphan_stat(void);
+atomic_t nr_orphan_caches;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(), swapcache is generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	if (TestClearPageCgroupOrphan(pc)) {
+		list_del_init(&pc->lru);
+		atomic_dec(&nr_orphan_caches);
+	}
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	if (TestSetPageCgroupOrphan(pc)) {
+		struct orphan_list_zone *opl;
+		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+		list_add_tail(&pc->lru, &opl->list);
+		atomic_inc(&nr_orphan_caches);
+		if (opl->event++ > ORPHAN_EVENT_THRESH) {
+			check_orphan_stat();
+			opl->event = 0;
+		}
+	}
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
+	if (!PageCgroupUsed(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 
 	spin_lock_irqsave(&zone->lru_lock, flags);
+	if (PageCgroupOrphan(pc))
+		remove_orphan_list(pc);
+
 	/* link when the page is linked to LRU but page_cgroup isn't */
 	if (PageLRU(page) && list_empty(&pc->lru))
 		mem_cgroup_add_lru_list(page, page_lru(page));
@@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
 	return num;
 }
 
+
+
+/*
+ * Using big number here for avoiding to free orphan swap-cache by readahead
+ * We don't want to delete swap caches read by readahead.
+ */
+static int orphan_thresh(void)
+{
+	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
+	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
+
+	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
+
+	/* too small value will kill readahead */
+	if (nr_pages < base)
+		return base;
+
+	/* too big is not suitable here */
+	if (nr_pages > base * 4)
+		return base * 4;
+
+	return nr_pages;
+}
+
+/*
+ * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
+ * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct orphan_list_zone *lru = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain, scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	/* check one by one */
+	scan = 0;
+	drain = 0;
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
+		scan++;
+		pc = list_entry(lru->list.next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, &lru->list);
+		/* get page for isolate_lru_page() */
+		if (get_page_unless_zero(page)) {
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			if (!isolate_lru_page(page)) {
+				/* Now, This page is removed from LRU */
+				if (trylock_page(page)) {
+					drain += try_to_free_swap(page);
+					unlock_page(page);
+				}
+				putback_lru_page(page);
+			}
+			put_page(page);
+			spin_lock_irqsave(&zone->lru_lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+/* access without lock...serialization is not so important here. */
+static int last_visit;
+void try_delete_orphan_caches(struct work_struct *work)
+{
+	int nid, zid, drain;
+
+	nid = last_visit;
+	drain = 0;
+	while (drain < SWAP_CLUSTER_MAX) {
+		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
+		if (nid == MAX_NUMNODES)
+			nid = 0;
+		last_visit = nid;
+		if (node_state(nid, N_HIGH_MEMORY))
+			for (zid = 0; zid < MAX_NR_ZONES; zid++)
+				drain += drain_orphan_swapcaches(nid, zid);
+		if (nid == 0)
+			break;
+	}
+}
+DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
+
+static void check_orphan_stat(void)
+{
+	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
+		schedule_work(&orphan_delete_work);
+}
+
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+			opl->zone[zid].event = 0;
+		}
+		orphan_list[nid] = opl;
+	}
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
@ 2009-03-18 14:17                 ` Daisuke Nishimura
  2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18 14:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, Hugh Dickins, d-nishimura, Daisuke Nishimura

On Wed, 18 Mar 2009 17:57:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> How about this ? I did short test and this eems to work well.
> I'm glad if you share us your test method.
> (Hopefully, update memcg_debug.txt ;)
> 
s/memcg_debug/memcg_test/ ?

I don't do anything special that isn't in the document.
I just run one of them (or a combination) for a long time, repeatedly,
and observe what happens.
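
If it helps, here is a generic swap-pressure load one can run inside a memcg to
reproduce this kind of leak (an illustrative user-space sketch, not one of the
memcg_test.txt cases; pick SIZE larger than memory.limit_in_bytes so the task
keeps faulting pages in and out of swap):

	#include <stdio.h>
	#include <stdlib.h>

	#define SIZE	(256UL << 20)	/* 256MB working set */
	#define PAGE	4096UL

	int main(void)
	{
		unsigned char *buf = malloc(SIZE);
		unsigned long i, pass;

		if (!buf) {
			perror("malloc");
			return 1;
		}
		for (pass = 0; ; pass++) {
			/* touch every page so it must be faulted back in */
			for (i = 0; i < SIZE; i += PAGE)
				buf[i] = (unsigned char)pass;
			if (!(pass % 10))
				fprintf(stderr, "pass %lu\n", pass);
		}
		return 0;
	}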

> Changes:
>  - modified condition to trigger reclaim orphan paes.
> 
> If I get good answer, I'll repost this with CC: to Andrew.
> 
It looks good to me.

Unfortunately, I don't have enough time tomorrow.
I'll test this during this weekend and report the result in next week.


Thanks,
Daisuke Nishimura.

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Nishimura reported unused-swap-cache is not reclaimed well under memcg.
> 
> Assume that memory cgroup well limits the memory usage of all applications
> and file caches, and global-LRU-scan (kswapd() etc..) never runs.
> 
> First, there is *allowed* race to SwapCache on global LRU. There can be
> SwapCaches on global LRU, even when swp_entry is not referred by anyone(ptes).
> When global LRU scan runs, it will be reclaimed by try_to_free_swap().
> But, they will not appear in memcg's private LRU and never reclaimed by
> memcg's reclaim routines.
> 
> Second, there are readahead SwapCaches, some of then tend to be not used
> and reclaimed by global LRU when scan runs, at last. But they are not on
> memcg's private LRU and will not be reclaimed until global-lru-scan runs.
> 
> From memcg's point of view, above 2 is not very good. Especially, *unused*
> swp_entry adds pressure to memcg's mem+swap controller and finally cause OOM.
> (Nishimura confirmed this can cause OOM.)
> 
> This patch tries to reclaim unused-swapcache by 
>   - add a list for unused-swapcache (orphan_list)
>   - try to recalim orhan list by some threshold.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/page_cgroup.h |   13 +++
>  mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 197 insertions(+), 1 deletion(-)
> 
> Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> @@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		int event;
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +#define ORPHAN_EVENT_THRESH (256)
> +static void check_orphan_stat(void);
> +atomic_t nr_orphan_caches;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	if (TestClearPageCgroupOrphan(pc)) {
> +		list_del_init(&pc->lru);
> +		atomic_dec(&nr_orphan_caches);
> +	}
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	if (TestSetPageCgroupOrphan(pc)) {
> +		struct orphan_list_zone *opl;
> +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +		list_add_tail(&pc->lru, &opl->list);
> +		atomic_inc(&nr_orphan_caches);
> +		if (opl->event++ > ORPHAN_EVENT_THRESH) {
> +			check_orphan_stat();
> +			opl->event = 0;
> +		}
> +	}
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> +	if (!PageCgroupUsed(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
>  		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> @@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>  
>  	spin_lock_irqsave(&zone->lru_lock, flags);
> +	if (PageCgroupOrphan(pc))
> +		remove_orphan_list(pc);
> +
>  	/* link when the page is linked to LRU but page_cgroup isn't */
>  	if (PageLRU(page) && list_empty(&pc->lru))
>  		mem_cgroup_add_lru_list(page, page_lru(page));
> @@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
>  	return num;
>  }
>  
> +
> +
> +/*
> + * Using big number here for avoiding to free orphan swap-cache by readahead
> + * We don't want to delete swap caches read by readahead.
> + */
> +static int orphan_thresh(void)
> +{
> +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> +
> +	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
> +
> +	/* too small value will kill readahead */
> +	if (nr_pages < base)
> +		return base;
> +
> +	/* too big is not suitable here */
> +	if (nr_pages > base * 4)
> +		return base * 4;
> +
> +	return nr_pages;
> +}
> +
> +/*
> + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> + */
> +static int drain_orphan_swapcaches(int nid, int zid)
> +{
> +	struct page_cgroup *pc;
> +	struct zone *zone;
> +	struct page *page;
> +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> +	unsigned long flags;
> +	int drain, scan;
> +
> +	zone = &NODE_DATA(nid)->node_zones[zid];
> +	/* check one by one */
> +	scan = 0;
> +	drain = 0;
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
> +		scan++;
> +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> +		page = pc->page;
> +		/* Rotate */
> +		list_del(&pc->lru);
> +		list_add_tail(&pc->lru, &lru->list);
> +		/* get page for isolate_lru_page() */
> +		if (get_page_unless_zero(page)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			if (!isolate_lru_page(page)) {
> +				/* Now, This page is removed from LRU */
> +				if (trylock_page(page)) {
> +					drain += try_to_free_swap(page);
> +					unlock_page(page);
> +				}
> +				putback_lru_page(page);
> +			}
> +			put_page(page);
> +			spin_lock_irqsave(&zone->lru_lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +	return drain;
> +}
> +
> +/* access without lock...serialization is not so important here. */
> +static int last_visit;
> +void try_delete_orphan_caches(struct work_struct *work)
> +{
> +	int nid, zid, drain;
> +
> +	nid = last_visit;
> +	drain = 0;
> +	while (drain < SWAP_CLUSTER_MAX) {
> +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> +		if (nid == MAX_NUMNODES)
> +			nid = 0;
> +		last_visit = nid;
> +		if (node_state(nid, N_HIGH_MEMORY))
> +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +				drain += drain_orphan_swapcaches(nid, zid);
> +		if (nid == 0)
> +			break;
> +	}
> +}
> +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> +
> +static void check_orphan_stat(void)
> +{
> +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> +		schedule_work(&orphan_delete_work);
> +}
> +
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +			opl->zone[zid].event = 0;
> +		}
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18 14:17                 ` Daisuke Nishimura
@ 2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
  2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18 23:45 UTC (permalink / raw)
  To: nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 23:17:38 +0900
Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:

> On Wed, 18 Mar 2009 17:57:34 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > How about this ? I did short test and this eems to work well.
> > I'm glad if you share us your test method.
> > (Hopefully, update memcg_debug.txt ;)
> > 
> s/memcg_debug/memcg_test/ ?
> 
> I don't do anything special not in the document.
> I just do one (or combination) of them for a long time and repeatedly,
> and observe what happens.
> 
Hmm, ok.

> > Changes:
> >  - modified condition to trigger reclaim orphan paes.
> > 
> > If I get good answer, I'll repost this with CC: to Andrew.
> > 
> It looks good to me.
> 
> Unfortunately, I don't have enough time tomorrow.
> I'll test this during this weekend and report the result in next week.
> 
Okay, thank you very much. I'll review this again.

Thanks,
-Kame


> 
> Thanks,
> Daisuke Nishimura.
> 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Nishimura reported unused-swap-cache is not reclaimed well under memcg.
> > 
> > Assume that memory cgroup well limits the memory usage of all applications
> > and file caches, and global-LRU-scan (kswapd() etc..) never runs.
> > 
> > First, there is *allowed* race to SwapCache on global LRU. There can be
> > SwapCaches on global LRU, even when swp_entry is not referred by anyone(ptes).
> > When global LRU scan runs, it will be reclaimed by try_to_free_swap().
> > But, they will not appear in memcg's private LRU and never reclaimed by
> > memcg's reclaim routines.
> > 
> > Second, there are readahead SwapCaches, some of then tend to be not used
> > and reclaimed by global LRU when scan runs, at last. But they are not on
> > memcg's private LRU and will not be reclaimed until global-lru-scan runs.
> > 
> > From memcg's point of view, above 2 is not very good. Especially, *unused*
> > swp_entry adds pressure to memcg's mem+swap controller and finally cause OOM.
> > (Nishimura confirmed this can cause OOM.)
> > 
> > This patch tries to reclaim unused-swapcache by 
> >   - add a list for unused-swapcache (orphan_list)
> >   - try to recalim orhan list by some threshold.
> > 
> > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/page_cgroup.h |   13 +++
> >  mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 197 insertions(+), 1 deletion(-)
> > 
> > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > @@ -26,6 +26,7 @@ enum {
> >  	PCG_LOCK,  /* page cgroup is locked */
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> >  };
> >  
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> >  
> > +#define TESTSETPCGFLAG(uname, lname) \
> > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> > +
> > +#define TESTCLEARPCGFLAG(uname, lname) \
> > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> > +
> >  /* Cache flag is set only once (at allocation) */
> >  TESTPCGFLAG(Cache, CACHE)
> >  
> >  TESTPCGFLAG(Used, USED)
> >  CLEARPCGFLAG(Used, USED)
> >  
> > +TESTPCGFLAG(Orphan, ORPHAN)
> > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > @@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
> >   * When moving account, the page is not on LRU. It's isolated.
> >   */
> >  
> > +/*
> > + * Orphan List is a list for page_cgroup which is not free but not under
> > + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> > + * there are other corner cases.
> > + *
> > + * Usually, updates to this list happens when swap cache is readaheaded and
> > + * finally used by process.
> > + */
> > +
> > +/* for orphan page_cgroups, updated under zone->lru_lock. */
> > +
> > +struct orphan_list_node {
> > +	struct orphan_list_zone {
> > +		int event;
> > +		struct list_head list;
> > +	} zone[MAX_NR_ZONES];
> > +};
> > +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> > +#define ORPHAN_EVENT_THRESH (256)
> > +static void check_orphan_stat(void);
> > +atomic_t nr_orphan_caches;
> > +
> > +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> > +{
> > +	/*
> > +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> > +	 * or NID should be invalid.
> > +	 */
> > +	BUG_ON(!orphan_list[nid]);
> > +	return  &orphan_list[nid]->zone[zid];
> > +}
> > +
> > +static inline void remove_orphan_list(struct page_cgroup *pc)
> > +{
> > +	if (TestClearPageCgroupOrphan(pc)) {
> > +		list_del_init(&pc->lru);
> > +		atomic_dec(&nr_orphan_caches);
> > +	}
> > +}
> > +
> > +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> > +{
> > +	if (TestSetPageCgroupOrphan(pc)) {
> > +		struct orphan_list_zone *opl;
> > +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> > +		list_add_tail(&pc->lru, &opl->list);
> > +		atomic_inc(&nr_orphan_caches);
> > +		if (opl->event++ > ORPHAN_EVENT_THRESH) {
> > +			check_orphan_stat();
> > +			opl->event = 0;
> > +		}
> > +	}
> > +}
> > +
> > +
> >  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> >  {
> >  	struct page_cgroup *pc;
> > @@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  	pc = lookup_page_cgroup(page);
> > +	/*
> > +	 * If the page is SwapCache and already on global LRU, it will be on
> > +	 * orphan list. remove here
> > +	 */
> > +	if (unlikely(PageCgroupOrphan(pc))) {
> > +		remove_orphan_list(pc);
> > +		return;
> > +	}
> >  	/* can happen while we handle swapcache. */
> >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> >  		return;
> > @@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
> >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> >  	 */
> >  	smp_rmb();
> > -	if (!PageCgroupUsed(pc))
> > +	if (!PageCgroupUsed(pc)) {
> > +		/* handle swap cache here */
> > +		add_orphan_list(page, pc);
> >  		return;
> > +	}
> >  
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > @@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
> >  	struct page_cgroup *pc = lookup_page_cgroup(page);
> >  
> >  	spin_lock_irqsave(&zone->lru_lock, flags);
> > +	if (PageCgroupOrphan(pc))
> > +		remove_orphan_list(pc);
> > +
> >  	/* link when the page is linked to LRU but page_cgroup isn't */
> >  	if (PageLRU(page) && list_empty(&pc->lru))
> >  		mem_cgroup_add_lru_list(page, page_lru(page));
> > @@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
> >  	return num;
> >  }
> >  
> > +
> > +
> > +/*
> > + * Using big number here for avoiding to free orphan swap-cache by readahead
> > + * We don't want to delete swap caches read by readahead.
> > + */
> > +static int orphan_thresh(void)
> > +{
> > +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> > +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> > +
> > +	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
> > +
> > +	/* too small value will kill readahead */
> > +	if (nr_pages < base)
> > +		return base;
> > +
> > +	/* too big is not suitable here */
> > +	if (nr_pages > base * 4)
> > +		return base * 4;
> > +
> > +	return nr_pages;
> > +}
> > +
> > +/*
> > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > + */
> > +static int drain_orphan_swapcaches(int nid, int zid)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> > +	unsigned long flags;
> > +	int drain, scan;
> > +
> > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > +	/* check one by one */
> > +	scan = 0;
> > +	drain = 0;
> > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > +	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
> > +		scan++;
> > +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> > +		page = pc->page;
> > +		/* Rotate */
> > +		list_del(&pc->lru);
> > +		list_add_tail(&pc->lru, &lru->list);
> > +		/* get page for isolate_lru_page() */
> > +		if (get_page_unless_zero(page)) {
> > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +			if (!isolate_lru_page(page)) {
> > +				/* Now, This page is removed from LRU */
> > +				if (trylock_page(page)) {
> > +					drain += try_to_free_swap(page);
> > +					unlock_page(page);
> > +				}
> > +				putback_lru_page(page);
> > +			}
> > +			put_page(page);
> > +			spin_lock_irqsave(&zone->lru_lock, flags);
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +
> > +	return drain;
> > +}
> > +
> > +/* access without lock...serialization is not so important here. */
> > +static int last_visit;
> > +void try_delete_orphan_caches(struct work_struct *work)
> > +{
> > +	int nid, zid, drain;
> > +
> > +	nid = last_visit;
> > +	drain = 0;
> > +	while (drain < SWAP_CLUSTER_MAX) {
> > +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > +		if (nid == MAX_NUMNODES)
> > +			nid = 0;
> > +		last_visit = nid;
> > +		if (node_state(nid, N_HIGH_MEMORY))
> > +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > +				drain += drain_orphan_swapcaches(nid, zid);
> > +		if (nid == 0)
> > +			break;
> > +	}
> > +}
> > +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> > +
> > +static void check_orphan_stat(void)
> > +{
> > +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> > +		schedule_work(&orphan_delete_work);
> > +}
> > +
> > +static __init void init_orphan_lru(void)
> > +{
> > +	struct orphan_list_node *opl;
> > +	int nid, zid;
> > +
> > +	for_each_node_state(nid, N_POSSIBLE) {
> > +		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
> > +		BUG_ON(!opl);
> > +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > +			INIT_LIST_HEAD(&opl->zone[zid].list);
> > +			opl->zone[zid].event = 0;
> > +		}
> > +		orphan_list[nid] = opl;
> > +	}
> > +}
> > +
> >  /*
> >   * Visit the first child (need not be the first child as per the ordering
> >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > @@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
> >  	/* root ? */
> >  	if (cont->parent == NULL) {
> >  		enable_swap_cgroup();
> > +		init_orphan_lru();
> >  		parent = NULL;
> >  	} else {
> >  		parent = mem_cgroup_from_cont(cont->parent);
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > 
> 
> 
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
@ 2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
  2009-03-19  9:06                       ` [PATCH] fix unused/stale swap cache handling on memcg v2 KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19  2:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Thu, 19 Mar 2009 08:45:23 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > > Changes:
> > >  - modified condition to trigger reclaim orphan paes.
> > > 
> > > If I get good answer, I'll repost this with CC: to Andrew.
> > > 
> > It looks good to me.
> > 
> > Unfortunately, I don't have enough time tomorrow.
> > I'll test this during this weekend and report the result in next week.
> > 
> Okay, thank you very much. I'll review this again.
> 
I noticed a bug... I will post v2.

Thanks,
-Kame

> Thanks,
> -Kame
> 
> 
> > 
> > Thanks,
> > Daisuke Nishimura.
> > 
> > > ==
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > Nishimura reported unused-swap-cache is not reclaimed well under memcg.
> > > 
> > > Assume that memory cgroup well limits the memory usage of all applications
> > > and file caches, and global-LRU-scan (kswapd() etc..) never runs.
> > > 
> > > First, there is *allowed* race to SwapCache on global LRU. There can be
> > > SwapCaches on global LRU, even when swp_entry is not referred by anyone(ptes).
> > > When global LRU scan runs, it will be reclaimed by try_to_free_swap().
> > > But, they will not appear in memcg's private LRU and never reclaimed by
> > > memcg's reclaim routines.
> > > 
> > > Second, there are readahead SwapCaches, some of then tend to be not used
> > > and reclaimed by global LRU when scan runs, at last. But they are not on
> > > memcg's private LRU and will not be reclaimed until global-lru-scan runs.
> > > 
> > > From memcg's point of view, above 2 is not very good. Especially, *unused*
> > > swp_entry adds pressure to memcg's mem+swap controller and finally cause OOM.
> > > (Nishimura confirmed this can cause OOM.)
> > > 
> > > This patch tries to reclaim unused-swapcache by 
> > >   - add a list for unused-swapcache (orphan_list)
> > >   - try to recalim orhan list by some threshold.
> > > 
> > > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  include/linux/page_cgroup.h |   13 +++
> > >  mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 197 insertions(+), 1 deletion(-)
> > > 
> > > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > @@ -26,6 +26,7 @@ enum {
> > >  	PCG_LOCK,  /* page cgroup is locked */
> > >  	PCG_CACHE, /* charged as cache */
> > >  	PCG_USED, /* this object is in use. */
> > > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> > >  };
> > >  
> > >  #define TESTPCGFLAG(uname, lname)			\
> > > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> > >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> > >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> > >  
> > > +#define TESTSETPCGFLAG(uname, lname) \
> > > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > > +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> > > +
> > > +#define TESTCLEARPCGFLAG(uname, lname) \
> > > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > > +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> > > +
> > >  /* Cache flag is set only once (at allocation) */
> > >  TESTPCGFLAG(Cache, CACHE)
> > >  
> > >  TESTPCGFLAG(Used, USED)
> > >  CLEARPCGFLAG(Used, USED)
> > >  
> > > +TESTPCGFLAG(Orphan, ORPHAN)
> > > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > > +
> > >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> > >  {
> > >  	return page_to_nid(pc->page);
> > > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > @@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
> > >   * When moving account, the page is not on LRU. It's isolated.
> > >   */
> > >  
> > > +/*
> > > + * Orphan List is a list for page_cgroup which is not free but not under
> > > + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> > > + * there are other corner cases.
> > > + *
> > > + * Usually, updates to this list happens when swap cache is readaheaded and
> > > + * finally used by process.
> > > + */
> > > +
> > > +/* for orphan page_cgroups, updated under zone->lru_lock. */
> > > +
> > > +struct orphan_list_node {
> > > +	struct orphan_list_zone {
> > > +		int event;
> > > +		struct list_head list;
> > > +	} zone[MAX_NR_ZONES];
> > > +};
> > > +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> > > +#define ORPHAN_EVENT_THRESH (256)
> > > +static void check_orphan_stat(void);
> > > +atomic_t nr_orphan_caches;
> > > +
> > > +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> > > +{
> > > +	/*
> > > +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> > > +	 * or NID should be invalid.
> > > +	 */
> > > +	BUG_ON(!orphan_list[nid]);
> > > +	return  &orphan_list[nid]->zone[zid];
> > > +}
> > > +
> > > +static inline void remove_orphan_list(struct page_cgroup *pc)
> > > +{
> > > +	if (TestClearPageCgroupOrphan(pc)) {
> > > +		list_del_init(&pc->lru);
> > > +		atomic_dec(&nr_orphan_caches);
> > > +	}
> > > +}
> > > +
> > > +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> > > +{
> > > +	if (TestSetPageCgroupOrphan(pc)) {
> > > +		struct orphan_list_zone *opl;
> > > +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> > > +		list_add_tail(&pc->lru, &opl->list);
> > > +		atomic_inc(&nr_orphan_caches);
> > > +		if (opl->event++ > ORPHAN_EVENT_THRESH) {
> > > +			check_orphan_stat();
> > > +			opl->event = 0;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +
> > >  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> > >  {
> > >  	struct page_cgroup *pc;
> > > @@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
> > >  	if (mem_cgroup_disabled())
> > >  		return;
> > >  	pc = lookup_page_cgroup(page);
> > > +	/*
> > > +	 * If the page is SwapCache and already on global LRU, it will be on
> > > +	 * orphan list. remove here
> > > +	 */
> > > +	if (unlikely(PageCgroupOrphan(pc))) {
> > > +		remove_orphan_list(pc);
> > > +		return;
> > > +	}
> > >  	/* can happen while we handle swapcache. */
> > >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> > >  		return;
> > > @@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
> > >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> > >  	 */
> > >  	smp_rmb();
> > > -	if (!PageCgroupUsed(pc))
> > > +	if (!PageCgroupUsed(pc)) {
> > > +		/* handle swap cache here */
> > > +		add_orphan_list(page, pc);
> > >  		return;
> > > +	}
> > >  
> > >  	mz = page_cgroup_zoneinfo(pc);
> > >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > > @@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
> > >  	struct page_cgroup *pc = lookup_page_cgroup(page);
> > >  
> > >  	spin_lock_irqsave(&zone->lru_lock, flags);
> > > +	if (PageCgroupOrphan(pc))
> > > +		remove_orphan_list(pc);
> > > +
> > >  	/* link when the page is linked to LRU but page_cgroup isn't */
> > >  	if (PageLRU(page) && list_empty(&pc->lru))
> > >  		mem_cgroup_add_lru_list(page, page_lru(page));
> > > @@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
> > >  	return num;
> > >  }
> > >  
> > > +
> > > +
> > > +/*
> > > + * Using big number here for avoiding to free orphan swap-cache by readahead
> > > + * We don't want to delete swap caches read by readahead.
> > > + */
> > > +static int orphan_thresh(void)
> > > +{
> > > +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> > > +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> > > +
> > > +	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
> > > +
> > > +	/* too small value will kill readahead */
> > > +	if (nr_pages < base)
> > > +		return base;
> > > +
> > > +	/* too big is not suitable here */
> > > +	if (nr_pages > base * 4)
> > > +		return base * 4;
> > > +
> > > +	return nr_pages;
> > > +}
> > > +
> > > +/*
> > > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > > + */
> > > +static int drain_orphan_swapcaches(int nid, int zid)
> > > +{
> > > +	struct page_cgroup *pc;
> > > +	struct zone *zone;
> > > +	struct page *page;
> > > +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> > > +	unsigned long flags;
> > > +	int drain, scan;
> > > +
> > > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > > +	/* check one by one */
> > > +	scan = 0;
> > > +	drain = 0;
> > > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > > +	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
> > > +		scan++;
> > > +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> > > +		page = pc->page;
> > > +		/* Rotate */
> > > +		list_del(&pc->lru);
> > > +		list_add_tail(&pc->lru, &lru->list);
> > > +		/* get page for isolate_lru_page() */
> > > +		if (get_page_unless_zero(page)) {
> > > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +			if (!isolate_lru_page(page)) {
> > > +				/* Now, This page is removed from LRU */
> > > +				if (trylock_page(page)) {
> > > +					drain += try_to_free_swap(page);
> > > +					unlock_page(page);
> > > +				}
> > > +				putback_lru_page(page);
> > > +			}
> > > +			put_page(page);
> > > +			spin_lock_irqsave(&zone->lru_lock, flags);
> > > +		}
> > > +	}
> > > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +
> > > +	return drain;
> > > +}
> > > +
> > > +/* access without lock...serialization is not so important here. */
> > > +static int last_visit;
> > > +void try_delete_orphan_caches(struct work_struct *work)
> > > +{
> > > +	int nid, zid, drain;
> > > +
> > > +	nid = last_visit;
> > > +	drain = 0;
> > > +	while (drain < SWAP_CLUSTER_MAX) {
> > > +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > > +		if (nid == MAX_NUMNODES)
> > > +			nid = 0;
> > > +		last_visit = nid;
> > > +		if (node_state(nid, N_HIGH_MEMORY))
> > > +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > > +				drain += drain_orphan_swapcaches(nid, zid);
> > > +		if (nid == 0)
> > > +			break;
> > > +	}
> > > +}
> > > +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> > > +
> > > +static void check_orphan_stat(void)
> > > +{
> > > +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> > > +		schedule_work(&orphan_delete_work);
> > > +}
> > > +
> > > +static __init void init_orphan_lru(void)
> > > +{
> > > +	struct orphan_list_node *opl;
> > > +	int nid, zid;
> > > +
> > > +	for_each_node_state(nid, N_POSSIBLE) {
> > > +		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
> > > +		BUG_ON(!opl);
> > > +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > > +			INIT_LIST_HEAD(&opl->zone[zid].list);
> > > +			opl->zone[zid].event = 0;
> > > +		}
> > > +		orphan_list[nid] = opl;
> > > +	}
> > > +}
> > > +
> > >  /*
> > >   * Visit the first child (need not be the first child as per the ordering
> > >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > > @@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > >  	/* root ? */
> > >  	if (cont->parent == NULL) {
> > >  		enable_swap_cgroup();
> > > +		init_orphan_lru();
> > >  		parent = NULL;
> > >  	} else {
> > >  		parent = mem_cgroup_from_cont(cont->parent);
> > > 
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majordomo@kvack.org.  For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > > 
> > 
> > 
> > 
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
@ 2009-03-19  9:06                       ` KAMEZAWA Hiroyuki
  2009-03-19 10:01                         ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19  9:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

The core logic is much improved and I confirmed it can reduce
orphan swap caches. (But the patch size is bigger than expected.)
A long-term test is required, and we have to verify that the parameters are
reasonable and that this doesn't make swapped-out applications slow.
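
(As a rough reading of the parameters below, not a measured number: on a 4-CPU
box with page_cluster=3 and nr_threads around 500, orphan_thresh() gives
8 * 500 = 4000 pages, which sits between base = 4 * 256 = 1024 and base * 4 =
4096, so the drain worker is scheduled once roughly 4000 orphan pages, about
16MB with 4KB pages, have accumulated.)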

-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Nishimura reported that unused swap caches are not reclaimed well under memcg.

Assume that a memory cgroup limits the memory usage of all applications
and file caches well, so the global LRU scan (kswapd etc.) never runs.

First, there is an *allowed* race that leaves SwapCache pages on the global LRU.
SwapCache pages can stay on the global LRU even when their swp_entry is no longer
referenced by anyone (no ptes). When a global LRU scan runs, they are reclaimed
by try_to_free_swap(). But they never appear on memcg's private LRU and are never
reclaimed by memcg's reclaim routines.

Second, there are readahead SwapCaches; some of them end up never being used and
are eventually reclaimed by the global LRU when a scan runs. But they are not on
memcg's private LRU and will not be reclaimed until a global LRU scan runs.

From memcg's point of view, the above two cases are not good. Especially, *unused*
swp_entries add pressure to memcg's mem+swap controller and can finally cause OOM.
(Nishimura confirmed this can cause OOM.)

This patch tries to reclaim unused swap caches by
  - adding a list for unused swap caches (orphan_list)
  - trying to reclaim the orphan list when it exceeds a threshold.

BTW, if we don't remove case 2 (unused swap caches), we can't determine a correct
threshold for reclaiming stale entries. So those pages should be dropped
to some extent. try_to_free_swap() cannot be used for case 2, so I added
try_to_drop_swapcache(). remove_mapping() checks all the critical conditions.

Changelog: v1 -> v2
 - use kmalloc_node() instead of kmalloc()
 - added try_to_drop_swapcache()
 - fixed silly bugs.
 - if only the root cgroup exists, none of this logic runs (all work is done by the global LRU).

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |   13 ++
 include/linux/swap.h        |    6 +
 mm/memcontrol.c             |  195 +++++++++++++++++++++++++++++++++++++++++++-
 mm/swapfile.c               |   23 +++++
 4 files changed, 236 insertions(+), 1 deletion(-)

Index: mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.29-Mar11.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.29-Mar11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar11/mm/memcontrol.c
@@ -371,6 +371,64 @@ static int mem_cgroup_walk_tree(struct m
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroup which is not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is typical type but
+ * there are other corner cases.
+ *
+ * Usually, updates to this list happens when swap cache is readaheaded and
+ * finally used by process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		int event;
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+#define ORPHAN_EVENT_THRESH (256)
+static void check_orphan_stat(void);
+static atomic_t nr_orphan_caches;
+static int memory_cgroup_is_used __read_mostly;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(), swapcache is generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	if (TestClearPageCgroupOrphan(pc)) {
+		list_del_init(&pc->lru);
+		atomic_dec(&nr_orphan_caches);
+	}
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	if (!TestSetPageCgroupOrphan(pc)) {
+		struct orphan_list_zone *opl;
+		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+		list_add_tail(&pc->lru, &opl->list);
+		atomic_inc(&nr_orphan_caches);
+		if (unlikely(opl->event++ > ORPHAN_EVENT_THRESH)) {
+			/* Orphan is not problem if no mem_cgroup is used */
+			if (memory_cgroup_is_used)
+				check_orphan_stat();
+			opl->event = 0;
+		}
+	}
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +438,14 @@ void mem_cgroup_del_lru_list(struct page
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,8 +499,11 @@ void mem_cgroup_add_lru_list(struct page
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
+	if (!PageCgroupUsed(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -471,6 +540,9 @@ static void mem_cgroup_lru_add_after_com
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 
 	spin_lock_irqsave(&zone->lru_lock, flags);
+	if (PageCgroupOrphan(pc))
+		remove_orphan_list(pc);
+
 	/* link when the page is linked to LRU but page_cgroup isn't */
 	if (PageLRU(page) && list_empty(&pc->lru))
 		mem_cgroup_add_lru_list(page, page_lru(page));
@@ -785,6 +857,125 @@ static int mem_cgroup_count_children(str
 }
 
 /*
+ * Using big number here for avoiding to free orphan swap-cache by readahead
+ * We don't want to delete swap caches read by readahead.
+ */
+static int orphan_thresh(void)
+{
+	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
+	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
+
+	nr_pages *= nr_threads; /* nr_threads can be too big, too small */
+
+	/* too small value will kill readahead */
+	if (nr_pages < base)
+		return base;
+
+	/* too big is not suitable here */
+	if (nr_pages > base * 4)
+		return base * 4;
+
+	return nr_pages;
+}
+
+/*
+ * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
+ * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct orphan_list_zone *lru = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain, scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	scan = ORPHAN_EVENT_THRESH/2;
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	while (!list_empty(&lru->list) && (scan > 0)) {
+		scan--;
+		pc = list_entry(lru->list.next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, &lru->list);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		/* Remove from LRU */
+		if (!isolate_lru_page(page)) { /* get_page is called */
+			if (!page_mapped(page) && trylock_page(page)) {
+				/* This does all necessary jobs */
+				drain += try_to_drop_swapcache(page);
+				unlock_page(page);
+			}
+			putback_lru_page(page); /* put_page is called */
+		}
+		spin_lock_irqsave(&zone->lru_lock, flags);
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+/*
+ * last_visit is a marker to remember which node should be scanned next.
+ * Only one worker can enter this routine at the same time.
+ */
+static int last_visit;
+void try_delete_orphan_caches(struct work_struct *work)
+{
+	int nid, zid, drain;
+	static atomic_t orphan_scan_worker;
+
+	if (atomic_inc_return(&orphan_scan_worker) > 1) {
+		atomic_dec(&orphan_scan_worker);
+		return;
+	}
+	nid = last_visit;
+	drain = 0;
+	while (!drain) {
+		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
+		if (nid == MAX_NUMNODES)
+			nid = 0;
+		last_visit = nid;
+		if (node_state(nid, N_HIGH_MEMORY))
+			for (zid = 0; zid < MAX_NR_ZONES; zid++)
+				drain += drain_orphan_swapcaches(nid, zid);
+		if (nid == 0)
+			break;
+	}
+	atomic_dec(&orphan_scan_worker);
+}
+DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
+
+static void check_orphan_stat(void)
+{
+	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
+		schedule_work(&orphan_delete_work);
+}
+
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+	int size = sizeof(struct orphan_list_node);
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		if (node_state(nid, N_NORMAL_MEMORY))
+			opl = kmalloc_node(size,  GFP_KERNEL, nid);
+		else
+			opl = kmalloc(size, GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+			opl->zone[zid].event = 0;
+		}
+		orphan_list[nid] = opl;
+	}
+}
+
+/*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
  * that to reclaim free pages from.
@@ -2454,10 +2645,12 @@ mem_cgroup_create(struct cgroup_subsys *
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
+		memory_cgroup_is_used = 1;
 	}
 
 	if (parent && parent->use_hierarchy) {
Index: mmotm-2.6.29-Mar11/mm/swapfile.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/swapfile.c
+++ mmotm-2.6.29-Mar11/mm/swapfile.c
@@ -571,6 +571,29 @@ int try_to_free_swap(struct page *page)
 }
 
 /*
+ * Similar to try_to_free_swap() but this drops SwapCache without checking
+ * page_swapcount(). With this, the function removes not only an unused swap
+ * entry but also a swap cache which is in memory but never used.
+ * The caller should have a reference to this page and it must be locked.
+ */
+int try_to_drop_swapcache(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+
+	if (!PageSwapCache(page))
+		return 0;
+	if (PageWriteback(page))
+		return 0;
+	if (page_mapped(page))
+		return 0;
+	/*
+	 * remove_mapping() will succeed only when there is no extra
+	 * user of the swap cache. (This keeps us safe against speculative lookup.)
+	 */
+	return remove_mapping(&swapper_space, page);
+}
+
+/*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
  */
Index: mmotm-2.6.29-Mar11/include/linux/swap.h
===================================================================
--- mmotm-2.6.29-Mar11.orig/include/linux/swap.h
+++ mmotm-2.6.29-Mar11/include/linux/swap.h
@@ -312,6 +312,7 @@ extern sector_t swapdev_block(int, pgoff
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
+extern int try_to_drop_swapcache(struct page *);
 struct backing_dev_info;
 
 /* linux/mm/thrash.c */
@@ -414,6 +415,11 @@ static inline int try_to_free_swap(struc
 	return 0;
 }
 
+static inline int try_to_drop_swapcache(struct page *page)
+{
+	return 0;
+}
+
 static inline swp_entry_t get_swap_page(void)
 {
 	swp_entry_t entry;


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19  9:06                       ` [PATCH] fix unused/stale swap cache handling on memcg v2 KAMEZAWA Hiroyuki
@ 2009-03-19 10:01                         ` Daisuke Nishimura
  2009-03-19 10:13                           ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-19 10:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> The core logic is much improved and I confirmed this logic can reduce
> orphan swap-caches. (But the patch size is bigger than expected.)
> Long-term testing is required, and we have to verify the parameters are
> reasonable and whether this doesn't make swapped-out applications slow..
> 
Thank you for your patch.
I'll test this version and check what happens about swapcache usage.

Thanks,
Daisuke Nishimura.

> -Kame
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Nishimura reported unused swap cache is not reclaimed well under memcg.
> 
> Assume that the memory cgroup limits the memory usage of all applications
> and file caches well, and the global LRU scan (kswapd() etc..) never runs.
> 
> First, there is an *allowed* race for SwapCache on the global LRU. There can
> be SwapCaches on the global LRU even when the swp_entry is not referred to by
> anyone (ptes). When a global LRU scan runs, they will be reclaimed by
> try_to_free_swap(). But they will not appear on memcg's private LRU and are
> never reclaimed by memcg's reclaim routines.
> 
> Second, there are readahead SwapCaches; some of them tend not to be used and
> are eventually reclaimed by the global LRU when a scan runs. But they are not
> on memcg's private LRU and will not be reclaimed until a global LRU scan runs.
> 
> From memcg's point of view, the above 2 cases are not very good. In particular,
> *unused* swp_entries add pressure to memcg's mem+swap controller and finally
> cause OOM. (Nishimura confirmed this can cause OOM.)
> 
> This patch tries to reclaim unused swapcache by
>   - adding a list for unused swapcache (orphan_list)
>   - trying to reclaim the orphan list at some threshold.
> 
> BTW, if we don't remove "2" (unused swapcache), we can't detect a correct
> threshold for reclaiming stale entries. So the pages should be dropped
> to some extent. try_to_free_swap() cannot be used for "2", so I added
> try_to_drop_swapcache(). remove_mapping() checks all the critical things.
> 
> Changelog: v1 -> v2
>  - use kmalloc_node() instead of kmalloc()
>  - added try_to_drop_swapcache()
>  - fixed silly bugs.
>  - If only the root cgroup exists, no logic will work. (all jobs are done by global LRU)
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/page_cgroup.h |   13 ++
>  include/linux/swap.h        |    6 +
>  mm/memcontrol.c             |  195 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/swapfile.c               |   23 +++++
>  4 files changed, 236 insertions(+), 1 deletion(-)
> 
> Index: mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> Index: mmotm-2.6.29-Mar11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Mar11/mm/memcontrol.c
> @@ -371,6 +371,64 @@ static int mem_cgroup_walk_tree(struct m
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		int event;
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +#define ORPHAN_EVENT_THRESH (256)
> +static void check_orphan_stat(void);
> +static atomic_t nr_orphan_caches;
> +static int memory_cgroup_is_used __read_mostly;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	if (TestClearPageCgroupOrphan(pc)) {
> +		list_del_init(&pc->lru);
> +		atomic_dec(&nr_orphan_caches);
> +	}
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	if (!TestSetPageCgroupOrphan(pc)) {
> +		struct orphan_list_zone *opl;
> +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +		list_add_tail(&pc->lru, &opl->list);
> +		atomic_inc(&nr_orphan_caches);
> +		if (unlikely(opl->event++ > ORPHAN_EVENT_THRESH)) {
> +			/* Orphan is not problem if no mem_cgroup is used */
> +			if (memory_cgroup_is_used)
> +				check_orphan_stat();
> +			opl->event = 0;
> +		}
> +	}
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +438,14 @@ void mem_cgroup_del_lru_list(struct page
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,8 +499,11 @@ void mem_cgroup_add_lru_list(struct page
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> +	if (!PageCgroupUsed(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
>  		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> @@ -471,6 +540,9 @@ static void mem_cgroup_lru_add_after_com
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>  
>  	spin_lock_irqsave(&zone->lru_lock, flags);
> +	if (PageCgroupOrphan(pc))
> +		remove_orphan_list(pc);
> +
>  	/* link when the page is linked to LRU but page_cgroup isn't */
>  	if (PageLRU(page) && list_empty(&pc->lru))
>  		mem_cgroup_add_lru_list(page, page_lru(page));
> @@ -785,6 +857,125 @@ static int mem_cgroup_count_children(str
>  }
>  
>  /*
> + * Using big number here for avoiding to free orphan swap-cache by readahead
> + * We don't want to delete swap caches read by readahead.
> + */
> +static int orphan_thresh(void)
> +{
> +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> +
> +	nr_pages *= nr_threads; /* nr_threads can be too big, too small */
> +
> +	/* too small value will kill readahead */
> +	if (nr_pages < base)
> +		return base;
> +
> +	/* too big is not suitable here */
> +	if (nr_pages > base * 4)
> +		return base * 4;
> +
> +	return nr_pages;
> +}
> +
> +/*
> + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> + */
> +static int drain_orphan_swapcaches(int nid, int zid)
> +{
> +	struct page_cgroup *pc;
> +	struct zone *zone;
> +	struct page *page;
> +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> +	unsigned long flags;
> +	int drain, scan;
> +
> +	zone = &NODE_DATA(nid)->node_zones[zid];
> +	scan = ORPHAN_EVENT_THRESH/2;
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	while (!list_empty(&lru->list) && (scan > 0)) {
> +		scan--;
> +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> +		page = pc->page;
> +		/* Rotate */
> +		list_del(&pc->lru);
> +		list_add_tail(&pc->lru, &lru->list);
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +		/* Remove from LRU */
> +		if (!isolate_lru_page(page)) { /* get_page is called */
> +			if (!page_mapped(page) && trylock_page(page)) {
> +				/* This does all necessary jobs */
> +				drain += try_to_drop_swapcache(page);
> +				unlock_page(page);
> +			}
> +			putback_lru_page(page); /* put_page is called */
> +		}
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +	}
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +	return drain;
> +}
> +
> +/*
> + * last_visit is marker to remember which node should be scanned next.
> + * Only one worker can enter this routine at the same time.
> + */
> +static int last_visit;
> +void try_delete_orphan_caches(struct work_struct *work)
> +{
> +	int nid, zid, drain;
> +	static atomic_t orphan_scan_worker;
> +
> +	if (atomic_inc_return(&orphan_scan_worker) > 1) {
> +		atomic_dec(&orphan_scan_worker);
> +		return;
> +	}
> +	nid = last_visit;
> +	drain = 0;
> +	while (!drain) {
> +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> +		if (nid == MAX_NUMNODES)
> +			nid = 0;
> +		last_visit = nid;
> +		if (node_state(nid, N_HIGH_MEMORY))
> +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +				drain += drain_orphan_swapcaches(nid, zid);
> +		if (nid == 0)
> +			break;
> +	}
> +	atomic_dec(&orphan_scan_worker);
> +}
> +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> +
> +static void check_orphan_stat(void)
> +{
> +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> +		schedule_work(&orphan_delete_work);
> +}
> +
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +	int size = sizeof(struct orphan_list_node);
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		if (node_state(nid, N_NORMAL_MEMORY))
> +			opl = kmalloc_node(size,  GFP_KERNEL, nid);
> +		else
> +			opl = kmalloc(size, GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +			opl->zone[zid].event = 0;
> +		}
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
> +/*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
>   * that to reclaim free pages from.
> @@ -2454,10 +2645,12 @@ mem_cgroup_create(struct cgroup_subsys *
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		mem->use_hierarchy = parent->use_hierarchy;
> +		memory_cgroup_is_used = 1;
>  	}
>  
>  	if (parent && parent->use_hierarchy) {
> Index: mmotm-2.6.29-Mar11/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/mm/swapfile.c
> +++ mmotm-2.6.29-Mar11/mm/swapfile.c
> @@ -571,6 +571,29 @@ int try_to_free_swap(struct page *page)
>  }
>  
>  /*
> + * Similar to try_to_free_swap() but this drops SwapCache without checking
> + * page_swapcount(). By this, this function removes not only unused swap entry
> + * but alos a swap-cache which is on memory but never used.
> + * The caller should have a reference to this page and it must be locked.
> + */
> +int try_to_drop_swapcache(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
> +
> +	if (!PageSwapCache(page))
> +		return 0;
> +	if (PageWriteback(page))
> +		return 0;
> +	if (page_mapped(page))
> +		return 0;
> +	/*
> +	 * remove_mapping() will success only when there is no extra
> + 	 * user of swap cache. (Keeping sanity be speculative lookup)
> + 	 */
> +	return remove_mapping(&swapper_space, page);
> +}
> +
> +/*
>   * Free the swap entry like above, but also try to
>   * free the page cache entry if it is the last user.
>   */
> Index: mmotm-2.6.29-Mar11/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/include/linux/swap.h
> +++ mmotm-2.6.29-Mar11/include/linux/swap.h
> @@ -312,6 +312,7 @@ extern sector_t swapdev_block(int, pgoff
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>  extern int reuse_swap_page(struct page *);
>  extern int try_to_free_swap(struct page *);
> +extern int try_to_drop_swapcache(struct page *);
>  struct backing_dev_info;
>  
>  /* linux/mm/thrash.c */
> @@ -414,6 +415,11 @@ static inline int try_to_free_swap(struc
>  	return 0;
>  }
>  
> +static inline int try_to_drop_swapcache(struct page *page)
> +{
> +	return 0;
> +}
> +
>  static inline swp_entry_t get_swap_page(void)
>  {
>  	swp_entry_t entry;
> 


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19 10:01                         ` Daisuke Nishimura
@ 2009-03-19 10:13                           ` Daisuke Nishimura
  2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-19 10:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Thu, 19 Mar 2009 19:01:18 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Core logic are much improved and I confirmed this logic can reduce
> > orphan swap-caches. (But the patch size is bigger than expected.)
> > Long term test is required and we have to verify paramaters are reasonable
> > and whether this doesn't make swapped-out applications slow..
> > 
> Thank you for your patch.
> I'll test this version and check what happens about swapcache usage.
> 
hmm... underflow of inactive_anon seems to happen after a while.
I've not done anything yet other than causing memory pressure.

[nishimura@GibsonE ~]$ cat /cgroup/memory/01/memory.stat
cache 22994944
rss 10559488
pgpgin 2301009
pgpgout 2292817
active_anon 21004288
inactive_anon 18446744073709510656
active_file 1605632
inactive_file 10944512
unevictable 0
hierarchical_memory_limit 33554432
hierarchical_memsw_limit 50331648
inactive_ratio 1
recent_rotated_anon 857
recent_rotated_file 10
recent_scanned_anon 877
recent_scanned_file 400
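
(For reference, a quick sanity check on that inactive_anon value, assuming 4kB
pages: 2^64 - 18446744073709510656 = 40960 bytes = 10 pages, so the counter has
gone 10 pages below zero rather than the usage actually being that large.)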


Thanks,
Daisuke Nishimura.


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19 10:13                           ` Daisuke Nishimura
@ 2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
  2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19 10:46 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, Balbir Singh,
	Hugh Dickins

Daisuke Nishimura wrote:
> On Thu, 19 Mar 2009 19:01:18 +0900, Daisuke Nishimura
> <nishimura@mxp.nes.nec.co.jp> wrote:
>> On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > Core logic are much improved and I confirmed this logic can reduce
>> > orphan swap-caches. (But the patch size is bigger than expected.)
>> > Long term test is required and we have to verify paramaters are
>> reasonable
>> > and whether this doesn't make swapped-out applications slow..
>> >
>> Thank you for your patch.
>> I'll test this version and check what happens about swapcache usage.
>>
> hmm... underflow of inactive_anon seems to happen after a while.
> I've not done anything but causing memory pressure yet.
>
Hmm.. maybe I'm missing something. Maybe mem_cgroup_commit_charge() removes
the Orphan flag implicitly.

I'll dig, but I may not be able to post a patch this week.

Thanks,
-Kame



> [nishimura@GibsonE ~]$ cat /cgroup/memory/01/memory.stat
> cache 22994944
> rss 10559488
> pgpgin 2301009
> pgpgout 2292817
> active_anon 21004288
> inactive_anon 18446744073709510656
> active_file 1605632
> inactive_file 10944512
> unevictable 0
> hierarchical_memory_limit 33554432
> hierarchical_memsw_limit 50331648
> inactive_ratio 1
> recent_rotated_anon 857
> recent_rotated_file 10
> recent_scanned_anon 877
> recent_scanned_file 400
>
>
> Thanks,
> Daisuke Nishimura.
>
>



* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
@ 2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
  2009-03-20  7:45                                 ` [PATCH] fix unused/stale swap cache handling on memcg v3 KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19 11:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Daisuke Nishimura, linux-mm, Balbir Singh,
	Hugh Dickins

KAMEZAWA Hiroyuki wrote:
> Daisuke Nishimura wrote:
>> On Thu, 19 Mar 2009 19:01:18 +0900, Daisuke Nishimura
>> <nishimura@mxp.nes.nec.co.jp> wrote:
>>> On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki
>>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> > Core logic are much improved and I confirmed this logic can reduce
>>> > orphan swap-caches. (But the patch size is bigger than expected.)
>>> > Long term test is required and we have to verify paramaters are
>>> reasonable
>>> > and whether this doesn't make swapped-out applications slow..
>>> >
>>> Thank you for your patch.
>>> I'll test this version and check what happens about swapcache usage.
>>>
>> hmm... underflow of inactive_anon seems to happen after a while.
>> I've not done anything but causing memory pressure yet.
>>
> Hmm..maybe I miss something. maybe mem_cgroup_commit_charge() removes
> Orphan flag implicitly.
>
I couldn't reproduce it, hmm.. but yes, there would seem to be something racy.

> I'll dig but may not be able to post a patch in this week.
>
The more I consider it, the more complicated the code becomes.
Sigh... I'd like to find another way, if I can.

Thanks,
-Kame


* [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
@ 2009-03-20  7:45                                 ` KAMEZAWA Hiroyuki
  2009-03-23  1:45                                   ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-20  7:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Daisuke Nishimura, linux-mm, Balbir Singh,
	Hugh Dickins

I'll test this one this weekend.
Maybe it's much simpler than the previous ones. Thank you for all your help!

-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Nishimura reported that, in a racy case, swap cache is not freed even if it
will never be used. To make use of the laziness of the LRU, some racy pages
are _intentionally_ not freed, and the kernel expects the global LRU to
reclaim them later.

When it comes to memcg, if well controlled, the global LRU will not run very
often, and the above "ok, it's busy, reclaim it later by global LRU" logic
means a leak of swp_entries. Nishimura found that this can cause OOM.

This patch tries to fix this by calling try_to_free_swap() against the
stale swap caches.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    2 ++
 mm/memcontrol.c      |   41 +++++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c        |   23 ++++++++++++++++++-----
 mm/vmscan.c          |    9 +++++++++
 4 files changed, 70 insertions(+), 5 deletions(-)

Index: mmotm-2.6.29-Mar11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar11/mm/memcontrol.c
@@ -1550,8 +1550,49 @@ void mem_cgroup_uncharge_swap(swp_entry_
 	}
 	rcu_read_unlock();
 }
+
 #endif
 
+/* For handling some racy cases. */
+struct memcg_swap_validate {
+	struct work_struct work;
+	struct page *page;
+};
+
+static void mem_cgroup_validate_swapcache_cb(struct work_struct *work)
+{
+	struct memcg_swap_validate *mywork;
+	struct page *page;
+
+	mywork = container_of(work, struct memcg_swap_validate, work);
+	page = mywork->page;
+	/* We can wait for the lock now... validate whether the swap is still alive or not */
+	lock_page(page);
+	try_to_free_swap(page);
+	unlock_page(page);
+	put_page(page);
+	kfree(mywork);
+	return;
+}
+
+void mem_cgroup_validate_swapcache(struct page *page)
+{
+	struct memcg_swap_validate *work;
+	/*
+	 * Unfortunately, we cannot lock this page here. So, schedule this
+	 * again later.
+	 */
+	get_page(page);
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (work) {
+		INIT_WORK(&work->work, mem_cgroup_validate_swapcache_cb);
+		work->page = page;
+		schedule_work(&work->work);
+	} else /* If this small kmalloc() fails, LRU will work and find this */
+		put_page(page);
+	return;
+}
+
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
Index: mmotm-2.6.29-Mar11/mm/swapfile.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/swapfile.c
+++ mmotm-2.6.29-Mar11/mm/swapfile.c
@@ -578,6 +578,7 @@ int free_swap_and_cache(swp_entry_t entr
 {
 	struct swap_info_struct *p;
 	struct page *page = NULL;
+	struct page *check = NULL;
 
 	if (is_migration_entry(entry))
 		return 1;
@@ -586,9 +587,11 @@ int free_swap_and_cache(swp_entry_t entr
 	if (p) {
 		if (swap_entry_free(p, entry) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
-			if (page && !trylock_page(page)) {
-				page_cache_release(page);
-				page = NULL;
+			if (page) {
+				if (!trylock_page(page)) {
+					check = page;
+					page = NULL;
+				}
 			}
 		}
 		spin_unlock(&swap_lock);
@@ -602,10 +605,20 @@ int free_swap_and_cache(swp_entry_t entr
 				(!page_mapped(page) || vm_swap_full())) {
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
-		}
+		} else
+			check = page;
 		unlock_page(page);
-		page_cache_release(page);
+		if (!check)
+			page_cache_release(page);
 	}
+
+	if (check) {
+		/* Check accounting of this page in lazy way.*/
+		if (PageSwapCache(check) && !page_mapped(check))
+			mem_cgroup_validate_swapcache(check);
+		page_cache_release(check);
+	}
+
 	return p != NULL;
 }
 
Index: mmotm-2.6.29-Mar11/mm/vmscan.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/vmscan.c
+++ mmotm-2.6.29-Mar11/mm/vmscan.c
@@ -782,6 +782,15 @@ activate_locked:
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
+		/*
+		 * This can happen in a racy case between unmap and us. If
+		 * a page is added to swapcache while it's being unmapped, the
+		 * page may reach here. Check again whether this page (swap) is
+		 * worth keeping.
+		 * (Does this need to be done only under memcg?)
+		 */
+		if (PageSwapCache(page) && !page_mapped(page))
+			try_to_free_swap(page);
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);
Index: mmotm-2.6.29-Mar11/include/linux/swap.h
===================================================================
--- mmotm-2.6.29-Mar11.orig/include/linux/swap.h
+++ mmotm-2.6.29-Mar11/include/linux/swap.h
@@ -337,11 +337,13 @@ static inline void disable_swap_token(vo
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern void mem_cgroup_validate_swapcache(struct page *page);
 #else
 static inline void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
 {
 }
+static inline void mem_cgroup_validate_swapcache(struct page *page) {}
 #endif
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern void mem_cgroup_uncharge_swap(swp_entry_t ent);


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-20  7:45                                 ` [PATCH] fix unused/stale swap cache handling on memcg v3 KAMEZAWA Hiroyuki
@ 2009-03-23  1:45                                   ` Daisuke Nishimura
  2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-23  1:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

It might be a nitpick, but I think this patch cannot handle the case:

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
      swap_entry_free() == 1           |
      find_get_page() -> cannot find   |
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.

And I think we should avoid changing non-memcg code as much as possible,
so I prefer the orphan list approach.

The cause of the odd behavior of the previous orphan list patch is
the race between commit_charge_swapin and lru_add.
The PCG_ORPHAN flag is set by lru_add, so it cannot be seen while the
page is on a pvec (on another cpu). So lru_del_before_commit_swapcache
cannot remove the pc from the orphan list.

How about this patch?
It (rebased on mmotm) worked well (applied to -rc8 + some patches) during last
night on my note PC (i386, 2 CPUs, 2GB).

- change cache_charge to use try_charge_swapin() and commit_charge_swapin()
  when PageSwapCache.
- keep zone->lru_lock while commit_charge if needed to protect PCG_ORPHAN.

It only introduces orphan list and doesn't implement the reclaim part.

I can post a patch only for changing cache_charge if you want.


Thanks,
Daisuke Nishimura.
===
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |   13 +++
 mm/memcontrol.c             |  203 ++++++++++++++++++++++++++-----------------
 2 files changed, 135 insertions(+), 81 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..47ad25c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 55dea59..39cf11f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroups which are not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is the typical
+ * case, but there are other corner cases.
+ *
+ * Usually, updates to this list happen when swap cache is read ahead and
+ * finally used by a process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(): a swapcache was generated during init,
+	 * or the NID is invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	if (TestClearPageCgroupOrphan(pc))
+		list_del_init(&pc->lru);
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	if (!TestSetPageCgroupOrphan(pc)) {
+		struct orphan_list_zone *opl;
+		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+		list_add_tail(&pc->lru, &opl->list);
+	}
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on the global LRU, it will be
+	 * on the orphan list. Remove it here.
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
-		return;
+	if (!PageCgroupUsed(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
+		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_add(&pc->lru, &mz->lists[lru]);
 }
 
-/*
- * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
- * lru because the page may.be reused after it's fully uncharged (because of
- * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
- * it again. This function is only used to charge SwapCache. It's done under
- * lock_page and expected that zone->lru_lock is never held.
- */
-static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/*
-	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
-	 * is guarded by lock_page() because the page is SwapCache.
-	 */
-	if (!PageCgroupUsed(pc))
-		mem_cgroup_del_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/* link when the page is linked to LRU but page_cgroup isn't */
-	if (PageLRU(page) && list_empty(&pc->lru))
-		mem_cgroup_add_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-
 void mem_cgroup_move_lists(struct page *page,
 			   enum lru_list from, enum lru_list to)
 {
@@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 	return num;
 }
 
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+	int size = sizeof(struct orphan_list_node);
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		if (node_state(nid, N_NORMAL_MEMORY))
+			opl = kmalloc_node(size,  GFP_KERNEL, nid);
+		else
+			opl = kmalloc(size, GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+		orphan_list[nid] = opl;
+	}
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1238,6 +1274,10 @@ int mem_cgroup_newpage_charge(struct page *page,
 				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
+static void
+__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
+					enum charge_type ctype);
+
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
@@ -1274,16 +1314,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		unlock_page_cgroup(pc);
 	}
 
-	if (do_swap_account && PageSwapCache(page)) {
-		mem = try_get_mem_cgroup_from_swapcache(page);
-		if (mem)
-			mm = NULL;
-		  else
-			mem = NULL;
-		/* SwapCache may be still linked to LRU now. */
-		mem_cgroup_lru_del_before_commit_swapcache(page);
-	}
-
 	if (unlikely(!mm && !mem))
 		mm = &init_mm;
 
@@ -1291,32 +1321,16 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		return mem_cgroup_charge_common(page, mm, gfp_mask,
 				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
 
-	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
-	if (mem)
-		css_put(&mem->css);
-	if (PageSwapCache(page))
-		mem_cgroup_lru_add_after_commit_swapcache(page);
+	/* shmem */
+	if (PageSwapCache(page)) {
+		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
+		if (!ret)
+			__mem_cgroup_commit_charge_swapin(page, mem,
+					MEM_CGROUP_CHARGE_TYPE_SHMEM);
+	} else
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
+					MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
 
-	if (do_swap_account && !ret && PageSwapCache(page)) {
-		swp_entry_t ent = {.val = page_private(page)};
-		unsigned short id;
-		/* avoid double counting */
-		id = swap_cgroup_record(ent, 0);
-		rcu_read_lock();
-		mem = mem_cgroup_lookup(id);
-		if (mem) {
-			/*
-			 * We did swap-in. Then, this entry is doubly counted
-			 * both in mem and memsw. We uncharge it, here.
-			 * Recorded ID can be obsolete. We avoid calling
-			 * css_tryget()
-			 */
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-			mem_cgroup_put(mem);
-		}
-		rcu_read_unlock();
-	}
 	return ret;
 }
 
@@ -1359,18 +1373,40 @@ charge_cur_mm:
 	return __mem_cgroup_try_charge(mm, mask, ptr, true);
 }
 
-void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+static void
+__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
+					enum charge_type ctype)
 {
-	struct page_cgroup *pc;
+	unsigned long flags;
+	struct zone *zone = page_zone(page);
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+	int locked = 0;
 
 	if (mem_cgroup_disabled())
 		return;
 	if (!ptr)
 		return;
-	pc = lookup_page_cgroup(page);
-	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
-	mem_cgroup_lru_add_after_commit_swapcache(page);
+
+	/*
+	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
+	 * is guarded by lock_page() because the page is SwapCache.
+	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
+	 */
+	if (!PageCgroupUsed(pc)) {
+		locked = 1;
+		spin_lock_irqsave(&zone->lru_lock, flags);
+		mem_cgroup_del_lru_list(page, page_lru(page));
+	}
+
+	__mem_cgroup_commit_charge(ptr, pc, ctype);
+
+	if (locked) {
+		/* link when the page is linked to LRU but page_cgroup isn't */
+		if (PageLRU(page) && list_empty(&pc->lru))
+			mem_cgroup_add_lru_list(page, page_lru(page));
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	}
+
 	/*
 	 * Now swap is on-memory. This means this page may be
 	 * counted both as mem and swap....double count.
@@ -1396,8 +1432,12 @@ void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
+}
 
+void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+{
+	__mem_cgroup_commit_charge_swapin(page, ptr,
+					MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -2452,6 +2492,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  1:45                                   ` Daisuke Nishimura
@ 2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
  2009-03-23  5:04                                       ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23  2:41 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Mon, 23 Mar 2009 10:45:55 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> It might be a nitpick, but I think this patch cannot handle the case:
> 
>             processA                   |           processB
>   -------------------------------------+-------------------------------------
>     (free_swap_and_cache())            |  (read_swap_cache_async())
>                                        |    swap_duplicate()
>       swap_entry_free() == 1           |
>       find_get_page() -> cannot find   |
>                                        |    __set_page_locked()
>                                        |    add_to_swap_cache()
>                                        |    lru_cache_add_anon()
>                                        |      doesn't link this page to memcg's
>                                        |      LRU, because of !PageCgroupUsed.
> 
> And I think we should avoid changing non-memcg code as long as possible,
> so I prefer orphan list approach.
> 
> The cause of an odd behavior of the previous orphan list patch is 
> the race between commit_charge_swapin and lru_add.
> PCG_ORPHAN flags is set by lru_add, so it cannot be seen while the
> page is on pvec(on another cpu). So, lru_del_before_commit_swapcache
> cannot remove the pc from orphan list.
> 
> How about this patch ?
> It(rebased on mmotm) worked well(applied -rc8+some patches) during last night
> on my note-pc(i386, 2CPU, 2GB).
> 
I trust your test and the patch seems to be good.  I have some nitpicks.


> - change cache_charge to use try_charge_swapin() and commit_charge_swapin()
>   when PageSwapCache.
> - keep zone->lru_lock while commit_charge if needed to protect PCG_ORPHAN.
> 
> It only introduces orphan list and doesn't implement the reclaim part.
> 
> I can post a patch only for changing cache_charge if you want.
> 
> 
> Thanks,
> Daisuke Nishimura.
> ===
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/page_cgroup.h |   13 +++
>  mm/memcontrol.c             |  203 ++++++++++++++++++++++++++-----------------
>  2 files changed, 135 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..47ad25c 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +

The TESTCLEAR and TESTSET variants are not necessary in this approach.
SETPCGFLAG() and CLEARPCGFLAG() seem to be enough.
All changes (including commit) are under zone->lru_lock.
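
e.g. a minimal sketch of what I mean, assuming SETPCGFLAG(Orphan, ORPHAN) and
CLEARPCGFLAG(Orphan, ORPHAN) are also declared next to the existing macros:

	/* callers hold zone->lru_lock, so the plain (non test-and-set) accessors are enough */
	static inline void remove_orphan_list(struct page_cgroup *pc)
	{
		if (PageCgroupOrphan(pc)) {
			ClearPageCgroupOrphan(pc);
			list_del_init(&pc->lru);
		}
	}

	static void add_orphan_list(struct page *page, struct page_cgroup *pc)
	{
		if (!PageCgroupOrphan(pc)) {
			struct orphan_list_zone *opl;

			SetPageCgroupOrphan(pc);
			opl = orphan_lru(page_to_nid(page), page_zonenum(page));
			list_add_tail(&pc->lru, &opl->list);
		}
	}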



>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 55dea59..39cf11f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	if (TestClearPageCgroupOrphan(pc))
> +		list_del_init(&pc->lru);
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	if (!TestSetPageCgroupOrphan(pc)) {
> +		struct orphan_list_zone *opl;
> +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +		list_add_tail(&pc->lru, &opl->list);
> +	}
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> -		return;
> +	if (!PageCgroupUsed(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
> + 		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
>  	list_add(&pc->lru, &mz->lists[lru]);
>  }
>  
> -/*
> - * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
> - * lru because the page may.be reused after it's fully uncharged (because of
> - * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
> - * it again. This function is only used to charge SwapCache. It's done under
> - * lock_page and expected that zone->lru_lock is never held.
> - */
> -static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/*
> -	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> -	 * is guarded by lock_page() because the page is SwapCache.
> -	 */
> -	if (!PageCgroupUsed(pc))
> -		mem_cgroup_del_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/* link when the page is linked to LRU but page_cgroup isn't */
> -	if (PageLRU(page) && list_empty(&pc->lru))
> -		mem_cgroup_add_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -
>  void mem_cgroup_move_lists(struct page *page,
>  			   enum lru_list from, enum lru_list to)
>  {
> @@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
>  	return num;
>  }
>  
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +	int size = sizeof(struct orphan_list_node);
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		if (node_state(nid, N_NORMAL_MEMORY))
> +			opl = kmalloc_node(size,  GFP_KERNEL, nid);
> +		else
> +			opl = kmalloc(size, GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1238,6 +1274,10 @@ int mem_cgroup_newpage_charge(struct page *page,
>  				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
>  }
>  
> +static void
> +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> +					enum charge_type ctype);
> +
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask)
>  {
> @@ -1274,16 +1314,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		unlock_page_cgroup(pc);
>  	}
>  
> -	if (do_swap_account && PageSwapCache(page)) {
> -		mem = try_get_mem_cgroup_from_swapcache(page);
> -		if (mem)
> -			mm = NULL;
> -		  else
> -			mem = NULL;
> -		/* SwapCache may be still linked to LRU now. */
> -		mem_cgroup_lru_del_before_commit_swapcache(page);
> -	}
> -
>  	if (unlikely(!mm && !mem))
>  		mm = &init_mm;
>  
> @@ -1291,32 +1321,16 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		return mem_cgroup_charge_common(page, mm, gfp_mask,
>  				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
>  
> -	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
> -	if (mem)
> -		css_put(&mem->css);
> -	if (PageSwapCache(page))
> -		mem_cgroup_lru_add_after_commit_swapcache(page);
> +	/* shmem */
> +	if (PageSwapCache(page)) {
> +		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
> +		if (!ret)
> +			__mem_cgroup_commit_charge_swapin(page, mem,
> +					MEM_CGROUP_CHARGE_TYPE_SHMEM);
> +	} else
> +		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> +					MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
>  
> -	if (do_swap_account && !ret && PageSwapCache(page)) {
> -		swp_entry_t ent = {.val = page_private(page)};
> -		unsigned short id;
> -		/* avoid double counting */
> -		id = swap_cgroup_record(ent, 0);
> -		rcu_read_lock();
> -		mem = mem_cgroup_lookup(id);
> -		if (mem) {
> -			/*
> -			 * We did swap-in. Then, this entry is doubly counted
> -			 * both in mem and memsw. We uncharge it, here.
> -			 * Recorded ID can be obsolete. We avoid calling
> -			 * css_tryget()
> -			 */
> -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> -			mem_cgroup_put(mem);
> -		}
> -		rcu_read_unlock();
> -	}
>  	return ret;
>  }
>  
Nice clean-up here :)


> @@ -1359,18 +1373,40 @@ charge_cur_mm:
>  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
>  }
>  
> -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> +static void
> +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> +					enum charge_type ctype)
>  {
> -	struct page_cgroup *pc;
> +	unsigned long flags;
> +	struct zone *zone = page_zone(page);
> +	struct page_cgroup *pc = lookup_page_cgroup(page);
> +	int locked = 0;
>  
>  	if (mem_cgroup_disabled())
>  		return;
>  	if (!ptr)
>  		return;
> -	pc = lookup_page_cgroup(page);
> -	mem_cgroup_lru_del_before_commit_swapcache(page);
> -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> -	mem_cgroup_lru_add_after_commit_swapcache(page);
> +
> +	/*
> +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> +	 * is guarded by lock_page() because the page is SwapCache.
> +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> +	 */
> +	if (!PageCgroupUsed(pc)) {
> +		locked = 1;
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +		mem_cgroup_del_lru_list(page, page_lru(page));
> +	}
Maybe nice. I tried to use lock_page_cgroup() in add_list but I couldn't ;(
I think this works well. But I wonder... why do you have to check PageCgroupUsed()?
And is it correct? Removing the PageCgroupUsed() bit check is nice.
(This will be a "usually returns true" check, anyway.)

> +
> +	__mem_cgroup_commit_charge(ptr, pc, ctype);
> +


> +	if (locked) {
> +		/* link when the page is linked to LRU but page_cgroup isn't */
> +		if (PageLRU(page) && list_empty(&pc->lru))
> +			mem_cgroup_add_lru_list(page, page_lru(page));
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +	}
> +

Regards,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
@ 2009-03-23  5:04                                       ` Daisuke Nishimura
  2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-23  5:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

> > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
> >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> >  
> > +#define TESTSETPCGFLAG(uname, lname) \
> > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> > +
> > +#define TESTCLEARPCGFLAG(uname, lname) \
> > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> > +
> >  /* Cache flag is set only once (at allocation) */
> >  TESTPCGFLAG(Cache, CACHE)
> >  
> >  TESTPCGFLAG(Used, USED)
> >  CLEARPCGFLAG(Used, USED)
> >  
> > +TESTPCGFLAG(Orphan, ORPHAN)
> > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > +
> 
> This TESTCLEAR, TESTSET is not necessary in this approarch.
> SETPCGFLAG() and CLEARPCGFLAG() seems to be enough.
> All changes (including commit) is under zone->lru_lock.
> 
Okay.

> > @@ -1238,6 +1274,10 @@ int mem_cgroup_newpage_charge(struct page *page,
> >  				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> >  }
> >  
> > +static void
> > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > +					enum charge_type ctype);
> > +
> >  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  				gfp_t gfp_mask)
> >  {
> > @@ -1274,16 +1314,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  		unlock_page_cgroup(pc);
> >  	}
> >  
> > -	if (do_swap_account && PageSwapCache(page)) {
> > -		mem = try_get_mem_cgroup_from_swapcache(page);
> > -		if (mem)
> > -			mm = NULL;
> > -		  else
> > -			mem = NULL;
> > -		/* SwapCache may be still linked to LRU now. */
> > -		mem_cgroup_lru_del_before_commit_swapcache(page);
> > -	}
> > -
> >  	if (unlikely(!mm && !mem))
> >  		mm = &init_mm;
> >  
> > @@ -1291,32 +1321,16 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  		return mem_cgroup_charge_common(page, mm, gfp_mask,
> >  				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> >  
> > -	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> > -				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
> > -	if (mem)
> > -		css_put(&mem->css);
> > -	if (PageSwapCache(page))
> > -		mem_cgroup_lru_add_after_commit_swapcache(page);
> > +	/* shmem */
> > +	if (PageSwapCache(page)) {
> > +		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
> > +		if (!ret)
> > +			__mem_cgroup_commit_charge_swapin(page, mem,
> > +					MEM_CGROUP_CHARGE_TYPE_SHMEM);
> > +	} else
> > +		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> > +					MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
> >  
> > -	if (do_swap_account && !ret && PageSwapCache(page)) {
> > -		swp_entry_t ent = {.val = page_private(page)};
> > -		unsigned short id;
> > -		/* avoid double counting */
> > -		id = swap_cgroup_record(ent, 0);
> > -		rcu_read_lock();
> > -		mem = mem_cgroup_lookup(id);
> > -		if (mem) {
> > -			/*
> > -			 * We did swap-in. Then, this entry is doubly counted
> > -			 * both in mem and memsw. We uncharge it, here.
> > -			 * Recorded ID can be obsolete. We avoid calling
> > -			 * css_tryget()
> > -			 */
> > -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > -			mem_cgroup_put(mem);
> > -		}
> > -		rcu_read_unlock();
> > -	}
> >  	return ret;
> >  }
> >  
> Nice clean-up here :)
> 
Thanks, I'll send a cleanup patch for this part later.

> > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> >  }
> >  
> > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > +static void
> > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > +					enum charge_type ctype)
> >  {
> > -	struct page_cgroup *pc;
> > +	unsigned long flags;
> > +	struct zone *zone = page_zone(page);
> > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > +	int locked = 0;
> >  
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  	if (!ptr)
> >  		return;
> > -	pc = lookup_page_cgroup(page);
> > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > +
> > +	/*
> > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > +	 * is guarded by lock_page() because the page is SwapCache.
> > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > +	 */
> > +	if (!PageCgroupUsed(pc)) {
> > +		locked = 1;
> > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > +	}
> Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> And is it correct ? Removing PageCgroupUsed() bit check is nice.
> (This will be "usually returns true" check, anyway)
> 
I've just copied lru_del_before_commit_swapcache.

As you say, this check will return false only in (C) case in memcg_test.txt,
and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
would be no problem.

OK, I'll remove this check.

This is the updated version(w/o cache_charge cleanup).

BTW, Should I merge reclaim part based on your patch and post it ?


Thanks,
Daisuke Nishimura.
===
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |    5 ++
 mm/memcontrol.c             |  137 +++++++++++++++++++++++++++++--------------
 2 files changed, 97 insertions(+), 45 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..e65e61e 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -46,6 +47,10 @@ TESTPCGFLAG(Cache, CACHE)
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+SETPCGFLAG(Orphan, ORPHAN)
+CLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fc6d6c..3492286 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroup which is not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is typical type but
+ * there are other corner cases.
+ *
+ * Usually, updates to this list happens when swap cache is readaheaded and
+ * finally used by process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(), swapcache is generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	ClearPageCgroupOrphan(pc);
+	list_del_init(&pc->lru);
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	struct orphan_list_zone *opl;
+
+	SetPageCgroupOrphan(pc);
+	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+	list_add_tail(&pc->lru, &opl->list);
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
-		return;
+	if (!PageCgroupUsed(pc) && !PageCgroupOrphan(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
+ 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_add(&pc->lru, &mz->lists[lru]);
 }
 
-/*
- * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
- * lru because the page may.be reused after it's fully uncharged (because of
- * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
- * it again. This function is only used to charge SwapCache. It's done under
- * lock_page and expected that zone->lru_lock is never held.
- */
-static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/*
-	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
-	 * is guarded by lock_page() because the page is SwapCache.
-	 */
-	if (!PageCgroupUsed(pc))
-		mem_cgroup_del_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/* link when the page is linked to LRU but page_cgroup isn't */
-	if (PageLRU(page) && list_empty(&pc->lru))
-		mem_cgroup_add_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-
 void mem_cgroup_move_lists(struct page *page,
 			   enum lru_list from, enum lru_list to)
 {
@@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 	return num;
 }
 
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+	int size = sizeof(struct orphan_list_node);
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		if (node_state(nid, N_NORMAL_MEMORY))
+			opl = kmalloc_node(size,  GFP_KERNEL, nid);
+		else
+			opl = kmalloc(size, GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+		orphan_list[nid] = opl;
+	}
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1341,16 +1377,28 @@ static void
 __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
 					enum charge_type ctype)
 {
-	struct page_cgroup *pc;
+	unsigned long flags;
+	struct zone *zone = page_zone(page);
+	struct page_cgroup *pc = lookup_page_cgroup(page);
 
 	if (mem_cgroup_disabled())
 		return;
 	if (!ptr)
 		return;
-	pc = lookup_page_cgroup(page);
-	mem_cgroup_lru_del_before_commit_swapcache(page);
+
+	/* If this pc is on orphan LRU, it is removed from orphan list here. */
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	mem_cgroup_del_lru_list(page, page_lru(page));
+
+	/* We should hold zone->lru_lock to protect PCG_ORPHAN. */
+	VM_BUG_ON(PageCgroupOrphan(pc));
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
-	mem_cgroup_lru_add_after_commit_swapcache(page);
+
+	/* link when the page is linked to LRU but page_cgroup isn't */
+	if (PageLRU(page) && list_empty(&pc->lru))
+		mem_cgroup_add_lru_list(page, page_lru(page));
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
 	/*
 	 * Now swap is on-memory. This means this page may be
 	 * counted both as mem and swap....double count.
@@ -1376,8 +1424,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
-
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
@@ -2438,6 +2484,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  5:04                                       ` Daisuke Nishimura
@ 2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
  2009-03-24  8:32                                           ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23  5:22 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Mon, 23 Mar 2009 14:04:19 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > Nice clean-up here :)
> > 
> Thanks, I'll send a cleanup patch for this part later.
> 
Thank you, I'll look into.

> > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > >  }
> > >  
> > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > +static void
> > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > +					enum charge_type ctype)
> > >  {
> > > -	struct page_cgroup *pc;
> > > +	unsigned long flags;
> > > +	struct zone *zone = page_zone(page);
> > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > +	int locked = 0;
> > >  
> > >  	if (mem_cgroup_disabled())
> > >  		return;
> > >  	if (!ptr)
> > >  		return;
> > > -	pc = lookup_page_cgroup(page);
> > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > +
> > > +	/*
> > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > +	 */
> > > +	if (!PageCgroupUsed(pc)) {
> > > +		locked = 1;
> > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > +	}
> > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > (This will be "usually returns true" check, anyway)
> > 
> I've just copied lru_del_before_commit_swapcache.
> 
ya, considering now, it seems to be silly quick-hack.

> As you say, this check will return false only in (C) case in memcg_test.txt,
> and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> would be no problem.
> 
> OK, I'll remove this check.
> 
Thanks,

> This is the updated version(w/o cache_charge cleanup).
> 
> BTW, Should I merge reclaim part based on your patch and post it ?
> 
I think not necessary. keeping changes minimum is important as BUGFIX.
We can visit here again when new -RC stage starts.

no problem from my review.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thank you very much!
-Kame


> 
> Thanks,
> Daisuke Nishimura.
> ===
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/page_cgroup.h |    5 ++
>  mm/memcontrol.c             |  137 +++++++++++++++++++++++++++++--------------
>  2 files changed, 97 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..e65e61e 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -46,6 +47,10 @@ TESTPCGFLAG(Cache, CACHE)
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +SETPCGFLAG(Orphan, ORPHAN)
> +CLEARPCGFLAG(Orphan, ORPHAN)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2fc6d6c..3492286 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	ClearPageCgroupOrphan(pc);
> +	list_del_init(&pc->lru);
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	struct orphan_list_zone *opl;
> +
> +	SetPageCgroupOrphan(pc);
> +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +	list_add_tail(&pc->lru, &opl->list);
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> -		return;
> +	if (!PageCgroupUsed(pc) && !PageCgroupOrphan(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
> + 		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
>  	list_add(&pc->lru, &mz->lists[lru]);
>  }
>  
> -/*
> - * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
> - * lru because the page may.be reused after it's fully uncharged (because of
> - * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
> - * it again. This function is only used to charge SwapCache. It's done under
> - * lock_page and expected that zone->lru_lock is never held.
> - */
> -static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/*
> -	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> -	 * is guarded by lock_page() because the page is SwapCache.
> -	 */
> -	if (!PageCgroupUsed(pc))
> -		mem_cgroup_del_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/* link when the page is linked to LRU but page_cgroup isn't */
> -	if (PageLRU(page) && list_empty(&pc->lru))
> -		mem_cgroup_add_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -
>  void mem_cgroup_move_lists(struct page *page,
>  			   enum lru_list from, enum lru_list to)
>  {
> @@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
>  	return num;
>  }
>  
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +	int size = sizeof(struct orphan_list_node);
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		if (node_state(nid, N_NORMAL_MEMORY))
> +			opl = kmalloc_node(size,  GFP_KERNEL, nid);
> +		else
> +			opl = kmalloc(size, GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1341,16 +1377,28 @@ static void
>  __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
>  					enum charge_type ctype)
>  {
> -	struct page_cgroup *pc;
> +	unsigned long flags;
> +	struct zone *zone = page_zone(page);
> +	struct page_cgroup *pc = lookup_page_cgroup(page);
>  
>  	if (mem_cgroup_disabled())
>  		return;
>  	if (!ptr)
>  		return;
> -	pc = lookup_page_cgroup(page);
> -	mem_cgroup_lru_del_before_commit_swapcache(page);
> +
> +	/* If this pc is on orphan LRU, it is removed from orphan list here. */
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	mem_cgroup_del_lru_list(page, page_lru(page));
> +
> +	/* We should hold zone->lru_lock to protect PCG_ORPHAN. */
> +	VM_BUG_ON(PageCgroupOrphan(pc));
>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> -	mem_cgroup_lru_add_after_commit_swapcache(page);
> +
> +	/* link when the page is linked to LRU but page_cgroup isn't */
> +	if (PageLRU(page) && list_empty(&pc->lru))
> +		mem_cgroup_add_lru_list(page, page_lru(page));
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
>  	/*
>  	 * Now swap is on-memory. This means this page may be
>  	 * counted both as mem and swap....double count.
> @@ -1376,8 +1424,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> -
>  }
>  
>  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> @@ -2438,6 +2484,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
@ 2009-03-24  8:32                                           ` Daisuke Nishimura
  2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-24  8:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Mon, 23 Mar 2009 14:22:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 23 Mar 2009 14:04:19 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > Nice clean-up here :)
> > > 
> > Thanks, I'll send a cleanup patch for this part later.
> > 
> Thank you, I'll look into.
> 
> > > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > > >  }
> > > >  
> > > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > +static void
> > > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > > +					enum charge_type ctype)
> > > >  {
> > > > -	struct page_cgroup *pc;
> > > > +	unsigned long flags;
> > > > +	struct zone *zone = page_zone(page);
> > > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > > +	int locked = 0;
> > > >  
> > > >  	if (mem_cgroup_disabled())
> > > >  		return;
> > > >  	if (!ptr)
> > > >  		return;
> > > > -	pc = lookup_page_cgroup(page);
> > > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > > +
> > > > +	/*
> > > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > > +	 */
> > > > +	if (!PageCgroupUsed(pc)) {
> > > > +		locked = 1;
> > > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > > +	}
> > > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > > (This will be "usually returns true" check, anyway)
> > > 
> > I've just copied lru_del_before_commit_swapcache.
> > 
> ya, considering now, it seems to be silly quick-hack.
> 
> > As you say, this check will return false only in (C) case in memcg_test.txt,
> > and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> > would be no problem.
> > 
> > OK, I'll remove this check.
> > 
> Thanks,
> 
> > This is the updated version(w/o cache_charge cleanup).
> > 
> > BTW, Should I merge reclaim part based on your patch and post it ?
> > 
> I think not necessary. keeping changes minimum is important as BUGFIX.
> We can visit here again when new -RC stage starts.
> 
> no problem from my review.
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
Just FYI, this version of orphan list framework works fine
w/o causing BUG more than 24h.

So, I believe we can implement reclaim part based on this
to fix the original problem.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-24  8:32                                           ` Daisuke Nishimura
@ 2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
  2009-04-17  6:34                                               ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-24 23:57 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 24 Mar 2009 17:32:18 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Mon, 23 Mar 2009 14:22:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 23 Mar 2009 14:04:19 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > > Nice clean-up here :)
> > > > 
> > > Thanks, I'll send a cleanup patch for this part later.
> > > 
> > Thank you, I'll look into.
> > 
> > > > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > > > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > > > >  }
> > > > >  
> > > > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > > +static void
> > > > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > > > +					enum charge_type ctype)
> > > > >  {
> > > > > -	struct page_cgroup *pc;
> > > > > +	unsigned long flags;
> > > > > +	struct zone *zone = page_zone(page);
> > > > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > > > +	int locked = 0;
> > > > >  
> > > > >  	if (mem_cgroup_disabled())
> > > > >  		return;
> > > > >  	if (!ptr)
> > > > >  		return;
> > > > > -	pc = lookup_page_cgroup(page);
> > > > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > > > +
> > > > > +	/*
> > > > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > > > +	 */
> > > > > +	if (!PageCgroupUsed(pc)) {
> > > > > +		locked = 1;
> > > > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > > > +	}
> > > > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > > > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > > > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > > > (This will be "usually returns true" check, anyway)
> > > > 
> > > I've just copied lru_del_before_commit_swapcache.
> > > 
> > ya, considering now, it seems to be silly quick-hack.
> > 
> > > As you say, this check will return false only in (C) case in memcg_test.txt,
> > > and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> > > would be no problem.
> > > 
> > > OK, I'll remove this check.
> > > 
> > Thanks,
> > 
> > > This is the updated version(w/o cache_charge cleanup).
> > > 
> > > BTW, Should I merge reclaim part based on your patch and post it ?
> > > 
> > I think not necessary. keeping changes minimum is important as BUGFIX.
> > We can visit here again when new -RC stage starts.
> > 
> > no problem from my review.
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> Just FYI, this version of orphan list framework works fine
> w/o causing BUG more than 24h.
> 
> So, I believe we can implement reclaim part based on this
> to fix the original problem.
> 
ok, but I'd like to wait to start it until the end of merge-window.


Thanks,
-Kame

> 
> Thanks,
> Daisuke Nishimura.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <dont@kvack.org>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
@ 2009-04-17  6:34                                               ` Daisuke Nishimura
  2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-17  6:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Wed, 25 Mar 2009 08:57:13 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 24 Mar 2009 17:32:18 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Mon, 23 Mar 2009 14:22:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 23 Mar 2009 14:04:19 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > 
> > > > > Nice clean-up here :)
> > > > > 
> > > > Thanks, I'll send a cleanup patch for this part later.
> > > > 
> > > Thank you, I'll look into.
> > > 
> > > > > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > > > > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > > > > >  }
> > > > > >  
> > > > > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > > > +static void
> > > > > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > > > > +					enum charge_type ctype)
> > > > > >  {
> > > > > > -	struct page_cgroup *pc;
> > > > > > +	unsigned long flags;
> > > > > > +	struct zone *zone = page_zone(page);
> > > > > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > > > > +	int locked = 0;
> > > > > >  
> > > > > >  	if (mem_cgroup_disabled())
> > > > > >  		return;
> > > > > >  	if (!ptr)
> > > > > >  		return;
> > > > > > -	pc = lookup_page_cgroup(page);
> > > > > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > > > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > > > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > > > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > > > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > > > > +	 */
> > > > > > +	if (!PageCgroupUsed(pc)) {
> > > > > > +		locked = 1;
> > > > > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > > > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > > > > +	}
> > > > > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > > > > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > > > > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > > > > (This will be "usually returns true" check, anyway)
> > > > > 
> > > > I've just copied lru_del_before_commit_swapcache.
> > > > 
> > > ya, considering now, it seems to be silly quick-hack.
> > > 
> > > > As you say, this check will return false only in (C) case in memcg_test.txt,
> > > > and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> > > > would be no problem.
> > > > 
> > > > OK, I'll remove this check.
> > > > 
> > > Thanks,
> > > 
> > > > This is the updated version(w/o cache_charge cleanup).
> > > > 
> > > > BTW, Should I merge reclaim part based on your patch and post it ?
> > > > 
> > > I think not necessary. keeping changes minimum is important as BUGFIX.
> > > We can visit here again when new -RC stage starts.
> > > 
> > > no problem from my review.
> > > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > Just FYI, this version of orphan list framework works fine
> > w/o causing BUG more than 24h.
> > 
> > So, I believe we can implement reclaim part based on this
> > to fix the original problem.
> > 
> ok, but I'd like to wait to start it until the end of merge-window.
> 
I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
and have been testing it these days.

Major changes from your version:
- count the number of orphan pages per zone and make the threshold per zone(4MB).
- As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
  But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
  so add a check and try_to_free_swap to the end of shrink_page_list.

It seems work fine, no "pseud leak" of SwapCache can be seen.

What do you think ?
If it's all right, I'll merge this with the orphan list framework patch
and send it to Andrew with other fixes of memcg that I have.

Thanks,
Daisuke Nishimura.
===

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/swap.h |    6 +++
 mm/memcontrol.c      |  119 +++++++++++++++++++++++++++++++++++++++++++++++---
 mm/swapfile.c        |   23 ++++++++++
 mm/vmscan.c          |   20 ++++++++
 4 files changed, 162 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62d8143..02baae1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -311,6 +311,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
+extern int try_to_drop_swapcache(struct page *);
 struct backing_dev_info;
 
 /* linux/mm/thrash.c */
@@ -418,6 +419,11 @@ static inline int try_to_free_swap(struct page *page)
 	return 0;
 }
 
+static inline int try_to_drop_swapcache(struct page *page)
+{
+	return 0;
+}
+
 static inline swp_entry_t get_swap_page(void)
 {
 	swp_entry_t entry;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 259b09e..8638c7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -384,10 +384,14 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
 
 struct orphan_list_node {
 	struct orphan_list_zone {
+		unsigned long count;
 		struct list_head list;
 	} zone[MAX_NR_ZONES];
 };
 struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+#define ORPHAN_THRESH (1024)	/* 4MB per zone */
+static void try_scan_orphan_list(int, int);
+static int memory_cgroup_is_used __read_mostly;
 
 static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
 {
@@ -399,19 +403,29 @@ static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
 	return  &orphan_list[nid]->zone[zid];
 }
 
-static inline void remove_orphan_list(struct page_cgroup *pc)
+static inline void remove_orphan_list(struct page *page, struct page_cgroup *pc)
 {
+	struct orphan_list_zone *opl;
+
 	ClearPageCgroupOrphan(pc);
+	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
 	list_del_init(&pc->lru);
+	opl->count--;
 }
 
 static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
 {
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
 	struct orphan_list_zone *opl;
 
 	SetPageCgroupOrphan(pc);
-	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+	opl = orphan_lru(nid, zid);
 	list_add_tail(&pc->lru, &opl->list);
+	if (unlikely(opl->count++ > ORPHAN_THRESH))
+		/* Orphan is not problem if no mem_cgroup is used */
+		if (memory_cgroup_is_used)
+			try_scan_orphan_list(nid, zid);
 }
 
 
@@ -429,7 +443,7 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	 * orphan list. remove here
 	 */
 	if (unlikely(PageCgroupOrphan(pc))) {
-		remove_orphan_list(pc);
+		remove_orphan_list(page, pc);
 		return;
 	}
 	/* can happen while we handle swapcache. */
@@ -802,6 +816,89 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 	return num;
 }
 
+
+/*
+ * Usually, *unused* swap caches are reclaimed by the global LRU. But if no one
+ * kicks the global LRU, they will not be reclaimed. With memcg, this is a problem.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct orphan_list_zone *opl = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain = 0;
+	int scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	scan = opl->count/5;
+	while (!list_empty(&opl->list) && (scan > 0)) {
+		pc = list_entry(opl->list.next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, &opl->list);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		scan--;
+		/* Remove from LRU */
+		if (!isolate_lru_page(page)) { /* get_page is called */
+			if (!page_mapped(page) && trylock_page(page)) {
+				/* This does all necessary jobs */
+				drain += try_to_drop_swapcache(page);
+				unlock_page(page);
+			}
+			putback_lru_page(page); /* put_page is called */
+		}
+		spin_lock_irqsave(&zone->lru_lock, flags);
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+static void try_delete_orphan_caches_all(void)
+{
+	int nid, zid;
+
+	for_each_node_state(nid, N_HIGH_MEMORY)
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			drain_orphan_swapcaches(nid, zid);
+}
+
+/* Only one worker can scan orphan lists at the same time. */
+static atomic_t orphan_scan_worker;
+struct orphan_scan {
+	int node;
+	int zone;
+	struct work_struct work;
+};
+static struct orphan_scan orphan_scan;
+
+static void try_delete_orphan_caches(struct work_struct *work)
+{
+	int nid, zid;
+
+	nid = orphan_scan.node;
+	zid = orphan_scan.zone;
+	drain_orphan_swapcaches(nid, zid);
+	atomic_dec(&orphan_scan_worker);
+}
+
+static void try_scan_orphan_list(int nid, int zid)
+{
+	if (atomic_inc_return(&orphan_scan_worker) > 1) {
+		atomic_dec(&orphan_scan_worker);
+		return;
+	}
+	orphan_scan.node = nid;
+	orphan_scan.zone = zid;
+	INIT_WORK(&orphan_scan.work, try_delete_orphan_caches);
+	schedule_work(&orphan_scan.work);
+	/* call back function decrements orphan_scan_worker */
+}
+
 static __init void init_orphan_lru(void)
 {
 	struct orphan_list_node *opl;
@@ -814,8 +911,10 @@ static __init void init_orphan_lru(void)
 		else
 			opl = kmalloc(size, GFP_KERNEL);
 		BUG_ON(!opl);
-		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 			INIT_LIST_HEAD(&opl->zone[zid].list);
+			opl->zone[zid].count = 0;
+		}
 		orphan_list[nid] = opl;
 	}
 }
@@ -1000,6 +1099,10 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (ret)
 			continue;
 
+		/* unused SwapCache might pressure the memsw usage */
+		if (nr_retries < MEM_CGROUP_RECLAIM_RETRIES/2 && noswap)
+			try_delete_orphan_caches_all();
+
 		/*
 		 * try_to_free_mem_cgroup_pages() might not give us a full
 		 * picture of reclaim. Some pages are reclaimed and might be
@@ -1787,9 +1890,12 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
-		if (curusage >= oldusage)
+		if (curusage >= oldusage) {
 			retry_count--;
-		else
+			/* unused SwapCache might pressure the memsw usage */
+			if (retry_count < MEM_CGROUP_RECLAIM_RETRIES/2)
+				try_delete_orphan_caches_all();
+		} else
 			oldusage = curusage;
 	}
 	return ret;
@@ -2483,6 +2589,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
+		memory_cgroup_is_used = 1;
 	}
 
 	if (parent && parent->use_hierarchy) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 312fafe..9416196 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -571,6 +571,29 @@ int try_to_free_swap(struct page *page)
 }
 
 /*
+ * Similar to try_to_free_swap() but this drops SwapCache without checking
+ * page_swapcount(). By this, this function removes not only an unused swap entry
+ * but also a swap cache which is in memory but never used.
+ * The caller should have a reference to this page and it must be locked.
+ */
+int try_to_drop_swapcache(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+
+	if (!PageSwapCache(page))
+		return 0;
+	if (PageWriteback(page))
+		return 0;
+	if (page_mapped(page))
+		return 0;
+	/*
+	 * remove_mapping() will succeed only when there is no extra
+	 * user of the swap cache. (This keeps us safe against speculative lookup.)
+	 */
+	return remove_mapping(&swapper_space, page);
+}
+
+/*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
  */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 38d7506..b123eca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -38,6 +38,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
 
@@ -785,6 +786,25 @@ activate_locked:
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
+		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
+			struct page_cgroup *pc;
+
+			pc = lookup_page_cgroup(page);
+			/*
+			 * Used bit of swapcache is solid under page lock.
+			 */
+			if (unlikely(!PageCgroupUsed(pc)))
+				/*
+				 * This can happen if the page is unmapped by
+				 * the owner process before it is added to
+				 * swapcache.
+				 * These swapcaches are usually set dirty by
+				 * add_to_swap, but try_to_drop_swapcache can't
+				 * free dirty swapcaches.
+				 * So free these swapcaches here.
+				 */
+				try_to_free_swap(page);
+		}
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  6:34                                               ` Daisuke Nishimura
@ 2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
  2009-04-17  7:50                                                   ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  6:54 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 15:34:55 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> and have been testing it these days.
> 
Good trial! 
Honestly, I've written a patch to fix this problem in these days but seems to
be over-kill ;)


> Major changes from your version:
> - count the number of orphan pages per zone and make the threshold per zone(4MB).
> - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
>   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
>   so add a check and try_to_free_swap to the end of shrink_page_list.
> 
> It seems work fine, no "pseud leak" of SwapCache can be seen.
> 
> What do you think ?
> If it's all right, I'll merge this with the orphan list framework patch
> and send it to Andrew with other fixes of memcg that I have.
> 
I'm sorry but my answer is "please wait". The reason is..

1. When global LRU works, the pages will be reclaimed.
2. Global LRU will work finally.
3. While testing, "stale" swap cache cannot be big amount.

But, after "soft limit", the situaion will change.
1. Even when global LRU works, page selection is influenced by memcg.
2. So, when we implement soft-limit, we need to handle swap-cache.

Your patch will be necessary finally in near future. But, now, it just
adds code and cannot be very much help, I think.

So, my answer is "please wait"


> Thanks,
> Daisuke Nishimura.
> ===
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/swap.h |    6 +++
>  mm/memcontrol.c      |  119 +++++++++++++++++++++++++++++++++++++++++++++++---
>  mm/swapfile.c        |   23 ++++++++++
>  mm/vmscan.c          |   20 ++++++++
>  4 files changed, 162 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 62d8143..02baae1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -311,6 +311,7 @@ extern sector_t swapdev_block(int, pgoff_t);
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>  extern int reuse_swap_page(struct page *);
>  extern int try_to_free_swap(struct page *);
> +extern int try_to_drop_swapcache(struct page *);
>  struct backing_dev_info;
>  
>  /* linux/mm/thrash.c */
> @@ -418,6 +419,11 @@ static inline int try_to_free_swap(struct page *page)
>  	return 0;
>  }
>  
> +static inline int try_to_drop_swapcache(struct page *page)
> +{
> +	return 0;
> +}
> +
>  static inline swp_entry_t get_swap_page(void)
>  {
>  	swp_entry_t entry;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 259b09e..8638c7b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -384,10 +384,14 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
>  
>  struct orphan_list_node {
>  	struct orphan_list_zone {
> +		unsigned long count;
>  		struct list_head list;
>  	} zone[MAX_NR_ZONES];
>  };
>  struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +#define ORPHAN_THRESH (1024)	/* 4MB per zone */
> +static void try_scan_orphan_list(int, int);
> +static int memory_cgroup_is_used __read_mostly;
>  
>  static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
>  {
> @@ -399,19 +403,29 @@ static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
>  	return  &orphan_list[nid]->zone[zid];
>  }
>  
> -static inline void remove_orphan_list(struct page_cgroup *pc)
> +static inline void remove_orphan_list(struct page *page, struct page_cgroup *pc)
>  {
> +	struct orphan_list_zone *opl;
> +
>  	ClearPageCgroupOrphan(pc);

I wonder whether lock_page_cgroup() is necessary here or not..


> +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
>  	list_del_init(&pc->lru);
> +	opl->count--;
>  }
>  
>  static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
>  {
> +	int nid = page_to_nid(page);
> +	int zid = page_zonenum(page);
>  	struct orphan_list_zone *opl;
>  
>  	SetPageCgroupOrphan(pc);

here too.

I'm sorry, please give me time. I'd like to post a new version of the soft-limit patches
next week. I'm sorry for the delay in my work.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
@ 2009-04-17  7:50                                                   ` Daisuke Nishimura
  2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
  2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-17  7:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 17 Apr 2009 15:34:55 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > and have been testing it these days.
> > 
> Good trial! 
> Honestly, I've written a patch to fix this problem in these days but seems to
> be over-kill ;)
> 
> 
> > Major changes from your version:
> > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > 
> > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > 
> > What do you think ?
> > If it's all right, I'll merge this with the orphan list framework patch
> > and send it to Andrew with other fixes of memcg that I have.
> > 
> I'm sorry but my answer is "please wait". The reason is..
> 
> 1. When global LRU works, the pages will be reclaimed.
> 2. Global LRU will work finally.
> 3. While testing, "stale" swap cache cannot be big amount.
> 
Hmm, I can't understand 2.

If (memsize on system) >> (swapsize on system), global LRU doesn't run
and all the swap space can be used up by these SwapCache.
This means setting mem.limit can use up all the swap space on the system.
I've tested with 50MB size of swap and it can be used up in less than 24h.
I think it's not small.

> But, after "soft limit", the situaion will change.
> 1. Even when global LRU works, page selection is influenced by memcg.
> 2. So, when we implement soft-limit, we need to handle swap-cache.
> 
> Your patch will be necessary finally in near future. But, now, it just
> adds code and cannot be very much help, I think.
> 
> So, my answer is "please wait"
> 
> 
> > @@ -399,19 +403,29 @@ static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> >  	return  &orphan_list[nid]->zone[zid];
> >  }
> >  
> > -static inline void remove_orphan_list(struct page_cgroup *pc)
> > +static inline void remove_orphan_list(struct page *page, struct page_cgroup *pc)
> >  {
> > +	struct orphan_list_zone *opl;
> > +
> >  	ClearPageCgroupOrphan(pc);
> 
> I wonder whether lock_page_cgroup() is necessary here or not..
> 
> 
> > +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> >  	list_del_init(&pc->lru);
> > +	opl->count--;
> >  }
> >  
> >  static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
> >  {
> > +	int nid = page_to_nid(page);
> > +	int zid = page_zonenum(page);
> >  	struct orphan_list_zone *opl;
> >  
> >  	SetPageCgroupOrphan(pc);
> 
> here too.
> 
I think PCG_ORPHAN is protected by zone->lru_lock.

Thanks,
Daisuke Nishimura.

> I'm sorry, please give me time. I'd like to post a new version of the soft-limit patches
> next week. I'm sorry for the delay in my work.
> 
> Thanks,
> -Kame
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  7:50                                                   ` Daisuke Nishimura
@ 2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
  2009-04-17  8:12                                                       ` Daisuke Nishimura
  2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  7:58 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 16:50:36 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 17 Apr 2009 15:34:55 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > and have been testing it these days.
> > > 
> > Good trial! 
> > Honestly, I've written a patch to fix this problem in these days but seems to
> > be over-kill ;)
> > 
> > 
> > > Major changes from your version:
> > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > 
> > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > 
> > > What do you think ?
> > > If it's all right, I'll merge this with the orphan list framework patch
> > > and send it to Andrew with other fixes of memcg that I have.
> > > 
> > I'm sorry but my answer is "please wait". The reason is..
> > 
> > 1. When global LRU works, the pages will be reclaimed.
> > 2. Global LRU will work finally.
> > 3. While testing, "stale" swap cache cannot be big amount.
> > 
> Hmm, I can't understand 2.
> 
> If (memsize on system) >> (swapsize on system), global LRU doesn't run
> and all the swap space can be used up by these SwapCache.
> This means setting mem.limit can use up all the swap space on the system.
> I've tested with 50MB size of swap and it can be used up in less than 24h.
> I think it's not small.
> 

Please add a hook to shrink_zone() to fix this, as you did.
The orphan list is overkill at this stage.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  7:50                                                   ` Daisuke Nishimura
  2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
@ 2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  8:11 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 16:50:36 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> > >  	list_del_init(&pc->lru);
> > > +	opl->count--;
> > >  }
> > >  
> > >  static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
> > >  {
> > > +	int nid = page_to_nid(page);
> > > +	int zid = page_zonenum(page);
> > >  	struct orphan_list_zone *opl;
> > >  
> > >  	SetPageCgroupOrphan(pc);
> > 
> > here too.
> > 
> I think PCG_ORPHAN is protected by zone->lru_lock.
> 

The condition for swap caches is different from that for file caches and anonymous pages.

File cache and anon pages are marked as USED before the first call of add_to_lru,
so the following code in commit_charge_swapin() never breaks page_cgroup->flags for them:

        pc->mem_cgroup = mem;
        smp_wmb();
        pc->flags = pcg_default_flags[ctype];

But a swap cache page can already be on the LRU, with its Orphan bit being set or cleared
under zone->lru_lock, when this code runs; then pc->flags can be broken.

Please notice that

  /* Cache flag is set only once (at allocation) */
  TESTPCGFLAG(Cache, CACHE)

  TESTPCGFLAG(Used, USED)
  CLEARPCGFLAG(Used, USED)

ClearPageCgroupUsed() is the only operation that modifies page_cgroup->flags afterwards,
but it's done under lock_page_cgroup().

If you want to avoid lock_page_cgroup(), please rewrite commit_charge_swapin to do

    SetPageCgroupUsed(pc);
    SetPageCgroupCache(pc);
    ....
    or
    atomic_cmpxchg(&pc->flags, oldval, pcg_default_flags[ctype]);

or something like that.

I'd like to separate the lock bit from the flag bits, but I cannot find a way to do it.
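A very rough sketch of what I mean (illustration only, not the real memcg code; it
assumes pc->flags stays a plain unsigned long, and the function name is made up):
==
/*
 * Sketch: merge the default flags into pc->flags instead of overwriting
 * it, so a concurrent Set/ClearPageCgroupOrphan done under zone->lru_lock
 * is not lost.  pc->mem_cgroup and the smp_wmb() are assumed to have been
 * done just before, as in the current code.
 */
static void commit_swapin_flags_sketch(struct page_cgroup *pc, int ctype)
{
	unsigned long old, new;

	do {
		old = pc->flags;
		new = old | pcg_default_flags[ctype];
	} while (cmpxchg(&pc->flags, old, new) != old);
}
==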


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
@ 2009-04-17  8:12                                                       ` Daisuke Nishimura
  2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-17  8:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 16:58:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 17 Apr 2009 16:50:36 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Fri, 17 Apr 2009 15:34:55 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > > and have been testing it these days.
> > > > 
> > > Good trial! 
> > > Honestly, I've written a patch to fix this problem in these days but seems to
> > > be over-kill ;)
> > > 
> > > 
> > > > Major changes from your version:
> > > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > > 
> > > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > > 
> > > > What do you think ?
> > > > If it's all right, I'll merge this with the orphan list framework patch
> > > > and send it to Andrew with other fixes of memcg that I have.
> > > > 
> > > I'm sorry but my answer is "please wait". The reason is..
> > > 
> > > 1. When global LRU works, the pages will be reclaimed.
> > > 2. Global LRU will work finally.
> > > 3. While testing, "stale" swap cache cannot be big amount.
> > > 
> > Hmm, I can't understand 2.
> > 
> > If (memsize on system) >> (swapsize on system), global LRU doesn't run
> > and all the swap space can be used up by these SwapCache.
> > This means setting mem.limit can use up all the swap space on the system.
> > I've tested with 50MB size of swap and it can be used up in less than 24h.
> > I think it's not small.
> > 
> 
> plz add hook to shrink_zone() to fix this as you did. 
> orphan list is overkilling at this stage.
> 
I see.

I'll make a patch, test it, and repost it next week.
It can prevent at least type-2 orphan SwapCache.

I'll revisit the orphan list if needed in the future.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  8:12                                                       ` Daisuke Nishimura
@ 2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
  2009-04-21  2:35                                                           ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  8:13 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 17:12:01 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 17 Apr 2009 16:58:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 17 Apr 2009 16:50:36 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > On Fri, 17 Apr 2009 15:34:55 +0900
> > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > > > and have been testing it these days.
> > > > > 
> > > > Good trial! 
> > > > Honestly, I've written a patch to fix this problem in these days but seems to
> > > > be over-kill ;)
> > > > 
> > > > 
> > > > > Major changes from your version:
> > > > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > > > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > > > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > > > 
> > > > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > > > 
> > > > > What do you think ?
> > > > > If it's all right, I'll merge this with the orphan list framework patch
> > > > > and send it to Andrew with other fixes of memcg that I have.
> > > > > 
> > > > I'm sorry but my answer is "please wait". The reason is..
> > > > 
> > > > 1. When global LRU works, the pages will be reclaimed.
> > > > 2. Global LRU will work finally.
> > > > 3. While testing, "stale" swap cache cannot be big amount.
> > > > 
> > > Hmm, I can't understand 2.
> > > 
> > > If (memsize on system) >> (swapsize on system), global LRU doesn't run
> > > and all the swap space can be used up by these SwapCache.
> > > This means setting mem.limit can use up all the swap space on the system.
> > > I've tested with 50MB size of swap and it can be used up in less than 24h.
> > > I think it's not small.
> > > 
> > 
> > plz add hook to shrink_zone() to fix this as you did. 
> > orphan list is overkilling at this stage.
> > 
> I see.
> 
> I'll make a patch, test it, and repost it in next week.
> It can prevent at least type-2 of orphan SwapCache.
> 
BTW, does type-1 still exist?

> I'll revisit orphan list if needed in future.
> 
Thank you!

Regards,
-Kame

> 
> Thanks,
> Daisuke Nishimura.
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
@ 2009-04-21  2:35                                                           ` Daisuke Nishimura
  2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-21  2:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

Sorry for the late reply.

On Fri, 17 Apr 2009 17:13:43 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 17 Apr 2009 17:12:01 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 17 Apr 2009 16:58:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Fri, 17 Apr 2009 16:50:36 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > 
> > > > On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > On Fri, 17 Apr 2009 15:34:55 +0900
> > > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > > > > and have been testing it these days.
> > > > > > 
> > > > > Good trial! 
> > > > > Honestly, I've written a patch to fix this problem in these days but seems to
> > > > > be over-kill ;)
> > > > > 
> > > > > 
> > > > > > Major changes from your version:
> > > > > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > > > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > > > > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > > > > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > > > > 
> > > > > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > > > > 
> > > > > > What do you think ?
> > > > > > If it's all right, I'll merge this with the orphan list framework patch
> > > > > > and send it to Andrew with other fixes of memcg that I have.
> > > > > > 
> > > > > I'm sorry but my answer is "please wait". The reason is..
> > > > > 
> > > > > 1. When global LRU works, the pages will be reclaimed.
> > > > > 2. Global LRU will work finally.
> > > > > 3. While testing, "stale" swap cache cannot be big amount.
> > > > > 
> > > > Hmm, I can't understand 2.
> > > > 
> > > > If (memsize on system) >> (swapsize on system), global LRU doesn't run
> > > > and all the swap space can be used up by these SwapCache.
> > > > This means setting mem.limit can use up all the swap space on the system.
> > > > I've tested with 50MB size of swap and it can be used up in less than 24h.
> > > > I think it's not small.
> > > > 
> > > 
> > > plz add hook to shrink_zone() to fix this as you did. 
> > > orphan list is overkilling at this stage.
> > > 
> > I see.
> > 
> > I'll make a patch, test it, and repost it in next week.
> > It can prevent at least type-2 of orphan SwapCache.
> > 
> BTW, type-1 still exits ?
> 
Ah, I meant adding to shrink_page_list():

@@ -785,6 +786,23 @@ activate_locked:
                SetPageActive(page);
                pgactivate++;
 keep_locked:
+               if (!scanning_global_lru(sc) && PageSwapCache(page)) {
+                       struct page_cgroup *pc;
+
+                       pc = lookup_page_cgroup(page);
+                       /*
+                        * Used bit of swapcache is solid under page lock.
+                        */
+                       if (unlikely(!PageCgroupUsed(pc)))
+                               /*
+                                * This can happen if the page is free'ed by
+                                * the owner process before it is added to
+                                * swapcache.
+                                * These swapcache cannot be managed by memcg
+                                * well, so free it here.
+                                */
+                               try_to_free_swap(page);
+               }
                unlock_page(page);
 keep:
                list_add(&page->lru, &ret_pages);

This cannot prevent type-1 orphan SwapCache (caused by the race
between exit() and swap-in readahead).
Type-1 can pressure the memsw usage (and trigger OOM as a result, if memsw.limit is set)
and make struct mem_cgroup unfreeable even after rmdir (because it holds a refcount
to the mem_cgroup).

Do you have any ideas for solving the orphan SwapCache problem by adding some hooks to shrink_zone()?
(scan some pages from the global LRU and check whether each is an orphan SwapCache page,
with some code like the above?)

And what do you think about adding the above code to shrink_page_list()?
I think it might be unnecessary if we can solve the problem in another way, though.
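To make the shrink_zone() idea above a bit more concrete, something like the rough
sketch below is what I imagine (drop_orphan_swapcache() and how/when shrink_zone()
isolates the batch are assumptions; the page/page_cgroup helpers are the existing ones):
==
/*
 * Sketch: walk a batch of pages that shrink_zone() has already isolated
 * from the global LRU and free any swap cache nobody is charged for.
 * The pages stay on "page_list"; the caller puts them back afterwards.
 */
static void drop_orphan_swapcache(struct list_head *page_list)
{
	struct page *page;

	list_for_each_entry(page, page_list, lru) {
		struct page_cgroup *pc;

		if (!PageSwapCache(page) || !trylock_page(page))
			continue;
		pc = lookup_page_cgroup(page);
		/* Used bit of swapcache is stable under page lock */
		if (unlikely(!PageCgroupUsed(pc)))
			try_to_free_swap(page);
		unlock_page(page);
	}
}
==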


Thanks,
Daisuke Nishimura.

> > I'll revisit orphan list if needed in future.
> > 
> Thank you!.
> 
> Regards,
> -Kame
> 
> > 
> > Thanks,
> > Daisuke Nishimura.
> > 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-21  2:35                                                           ` Daisuke Nishimura
@ 2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
  2009-04-21  4:05                                                               ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-21  2:57 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 21 Apr 2009 11:35:25 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> @@ -785,6 +786,23 @@ activate_locked:
>                 SetPageActive(page);
>                 pgactivate++;
>  keep_locked:
> +               if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> +                       struct page_cgroup *pc;
> +
> +                       pc = lookup_page_cgroup(page);
> +                       /*
> +                        * Used bit of swapcache is solid under page lock.
> +                        */
> +                       if (unlikely(!PageCgroupUsed(pc)))
> +                               /*
> +                                * This can happen if the page is free'ed by
> +                                * the owner process before it is added to
> +                                * swapcache.
> +                                * These swapcache cannot be managed by memcg
> +                                * well, so free it here.
> +                                */
> +                               try_to_free_swap(page);
> +               }
>                 unlock_page(page);
>  keep:
>                 list_add(&page->lru, &ret_pages);
> 
> This cannot prevent type-1 orphan SwapCache(caused by the race
> between exit() and swap-in readahead).
> Type-1 can pressure the memsw usage(trigger OOM if memsw.limit is set, as a result)
> and make struct mem_cgroup unfreeable even after rmdir(because it holds refcount
> to mem_cgroup).
Hmm.
   free_swap_and_cache()
	-> trylock_page() => failure case?

Add the following code:
==
        page = find_get_page(&swapper_space, entry.val);
        if (page && !trylock_page(page)) {
                mem_cgroup_retry_free_swap_lazy(page);  <=====
                page_cache_release(page);
                page = NULL;
        }
==
and do some kind of lazy op... I'll try something.
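Just so the idea is clearer, a very rough sketch of the lazy part (everything below is
guesswork, not an interface proposal; I pass the swap entry instead of the page so no
page reference has to be kept across the delay):
==
/*
 * Sketch: remember the entry whose page could not be locked and retry
 * the free later from a delayed work.  One slot is enough to show the
 * idea; a real version needs a small queue and proper locking.
 */
static swp_entry_t pending_entry;

static void retry_free_swapcache(struct work_struct *work)
{
	struct page *page = find_get_page(&swapper_space, pending_entry.val);

	if (!page)
		return;
	if (trylock_page(page)) {
		try_to_free_swap(page);
		unlock_page(page);
	}
	page_cache_release(page);
}
static DECLARE_DELAYED_WORK(retry_work, retry_free_swapcache);

static void mem_cgroup_retry_free_swap_lazy(swp_entry_t entry)
{
	pending_entry = entry;
	schedule_delayed_work(&retry_work, HZ);
}
==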

> 
> Do you have any ideas to solve orphan SwapCache problem by adding some hooks to shrink_zone() ?
> (scan some pages from global LRU and check whether it's orphan SwapCache or not by
> adding some code like above ?)
> 
> And, what do you think about adding above code to shrink_page_list() ?
> I think it might be unnecessary if we can solve the problem in another way, though.
> 

I think your hook itself is not so bad (even if we remove it later).

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
@ 2009-04-21  4:05                                                               ` Daisuke Nishimura
  0 siblings, 0 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-21  4:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 21 Apr 2009 11:57:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 21 Apr 2009 11:35:25 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > @@ -785,6 +786,23 @@ activate_locked:
> >                 SetPageActive(page);
> >                 pgactivate++;
> >  keep_locked:
> > +               if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> > +                       struct page_cgroup *pc;
> > +
> > +                       pc = lookup_page_cgroup(page);
> > +                       /*
> > +                        * Used bit of swapcache is solid under page lock.
> > +                        */
> > +                       if (unlikely(!PageCgroupUsed(pc)))
> > +                               /*
> > +                                * This can happen if the page is free'ed by
> > +                                * the owner process before it is added to
> > +                                * swapcache.
> > +                                * These swapcache cannot be managed by memcg
> > +                                * well, so free it here.
> > +                                */
> > +                               try_to_free_swap(page);
> > +               }
> >                 unlock_page(page);
> >  keep:
> >                 list_add(&page->lru, &ret_pages);
> > 
> > This cannot prevent type-1 orphan SwapCache(caused by the race
> > between exit() and swap-in readahead).
> > Type-1 can pressure the memsw usage(trigger OOM if memsw.limit is set, as a result)
> > and make struct mem_cgroup unfreeable even after rmdir(because it holds refcount
> > to mem_cgroup).
> Hmm.
>    free_swap_cache()
> 	-> trylock_page() => failure case ?
> 
Yes, but there is another case:

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
      swap_entry_free() == 1           |
      find_get_page() -> cannot find   |
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.


> add following codes.
> ==
>  588                         page = find_get_page(&swapper_space, entry.val);
>  589                         if (page && !trylock_page(page)) {
> 				     mem_cgroup_retry_free_swap_lazy(page);  <=====
>  590                                 page_cache_release(page);
>  591                                 page = NULL;
>  592                         }
> ==
> and  do some kind of lazy ops..I'll try some.
> 
> > 
> > Do you have any ideas to solve orphan SwapCache problem by adding some hooks to shrink_zone() ?
> > (scan some pages from global LRU and check whether it's orphan SwapCache or not by
> > adding some code like above ?)
> > 
> > And, what do you think about adding above code to shrink_page_list() ?
> > I think it might be unnecessary if we can solve the problem in another way, though.
> > 
> 
> I think your hook itself is not very bad. (even if we remove this later..)
> 
I think whether we should remove this or not depends on how we fix type-1.
Anyway, I'll leave it as it is for now.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2009-04-21  4:09 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-17  4:57 [RFC] memcg: handle swapcache leak Daisuke Nishimura
2009-03-17  5:39 ` KAMEZAWA Hiroyuki
2009-03-17  6:11   ` Daisuke Nishimura
2009-03-17  7:29     ` KAMEZAWA Hiroyuki
2009-03-17  9:38       ` KAMEZAWA Hiroyuki
2009-03-18  1:17         ` Daisuke Nishimura
2009-03-18  1:34           ` KAMEZAWA Hiroyuki
2009-03-18  3:51             ` Daisuke Nishimura
2009-03-18  4:05               ` KAMEZAWA Hiroyuki
2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
2009-03-18 14:17                 ` Daisuke Nishimura
2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
2009-03-19  9:06                       ` [PATCH] fix unused/stale swap cache handling on memcg v2 KAMEZAWA Hiroyuki
2009-03-19 10:01                         ` Daisuke Nishimura
2009-03-19 10:13                           ` Daisuke Nishimura
2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
2009-03-20  7:45                                 ` [PATCH] fix unused/stale swap cache handling on memcg v3 KAMEZAWA Hiroyuki
2009-03-23  1:45                                   ` Daisuke Nishimura
2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
2009-03-23  5:04                                       ` Daisuke Nishimura
2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
2009-03-24  8:32                                           ` Daisuke Nishimura
2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
2009-04-17  6:34                                               ` Daisuke Nishimura
2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
2009-04-17  7:50                                                   ` Daisuke Nishimura
2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
2009-04-17  8:12                                                       ` Daisuke Nishimura
2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
2009-04-21  2:35                                                           ` Daisuke Nishimura
2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
2009-04-21  4:05                                                               ` Daisuke Nishimura
2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
2009-03-18  0:08       ` [RFC] memcg: handle swapcache leak Daisuke Nishimura
