* [patch 00/13] mm: memcontrol: naturalize charge lifetime v4
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Hi,

This is version 4 of the memcg charge naturalization series.  Changes
since v3 include:

o reorder THP charges, __GFP_NORETRY, oom-disabled/NOFAIL patches (Michal)
o remove explicit OOM behavior charge parameter (Michal)
o add acks

These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages.  This drastically simplifies the code
and reduces charging and uncharging overhead.  The most expensive part
of charging and uncharging is the page_cgroup bit spinlock, which is
removed entirely after this series.

Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup
(i.e. executing in the root memcg).  Before:

    15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string                  
    13.31%              cat  [kernel.kallsyms]   [k] memset                                    
    11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage                         
     4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist                    
     2.38%              cat  [kernel.kallsyms]   [k] put_page                                  
     2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge                
     2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common              
     1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list                          
     1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup                       
     1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn                      

After:

    15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string                  
    13.48%           cat  [kernel.kallsyms]   [k] memset                                    
    11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage                         
     3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist                    
     2.46%           cat  [kernel.kallsyms]   [k] put_page                                  
     2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list                          
     1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup                       
     1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn                      
     1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk                        
     1.30%           cat  [kernel.kallsyms]   [k] kfree                                     

As you can see, the memcg-specific entries have dropped out of the
profile entirely.  The code footprint has shrunk quite a bit as well:

   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o

 Documentation/cgroups/memcg_test.txt |  160 +---
 include/linux/memcontrol.h           |   94 +--
 include/linux/page_cgroup.h          |   43 +-
 include/linux/swap.h                 |   15 +-
 kernel/events/uprobes.c              |    1 +
 mm/filemap.c                         |   13 +-
 mm/huge_memory.c                     |   57 +-
 mm/memcontrol.c                      | 1455 ++++++++++++----------------------
 mm/memory.c                          |   43 +-
 mm/migrate.c                         |   44 +-
 mm/rmap.c                            |   20 -
 mm/shmem.c                           |   32 +-
 mm/swap.c                            |   40 +
 mm/swap_state.c                      |    8 +-
 mm/swapfile.c                        |   21 +-
 mm/truncate.c                        |    9 -
 mm/vmscan.c                          |   12 +-
 mm/zswap.c                           |    2 +-
 18 files changed, 719 insertions(+), 1350 deletions(-)


* [patch 01/13] mm: memcontrol: fold mem_cgroup_do_charge()
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

This function was split out because mem_cgroup_try_charge() got too
big.  But having essentially one sequence of operations arbitrarily
split in half is not good for reworking the code.  Fold it back in.
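
To illustrate the shape of the folded code, here is a stand-alone toy
model of the resulting charge path - names and numbers are illustrative
assumptions, not the kernel code: try an oversized batch first, fall
back to the exact request, then reclaim and retry a bounded number of
times.

	#include <stdio.h>
	#include <stdbool.h>

	#define CHARGE_BATCH	32	/* stand-in for the percpu batch size */
	#define MAX_RETRIES	5

	static long limit = 64, usage;	/* toy res_counter */

	static bool counter_charge(long nr)
	{
		if (usage + nr > limit)
			return false;
		usage += nr;
		return true;
	}

	static long reclaim(void)
	{
		long freed = usage < 8 ? usage : 8;	/* pretend to reclaim */
		usage -= freed;
		return freed;
	}

	static int try_charge(long nr_pages)
	{
		long batch = nr_pages > CHARGE_BATCH ? nr_pages : CHARGE_BATCH;
		int retries = MAX_RETRIES;
	retry:
		if (counter_charge(batch))
			return 0;	/* done_restock: batch - nr_pages is cached */
		if (batch > nr_pages) {
			/* never reclaim on behalf of optional batching */
			batch = nr_pages;
			goto retry;
		}
		if (reclaim() && retries--)
			goto retry;
		return -1;	/* nomem; the real code may invoke the OOM killer */
	}

	int main(void)
	{
		printf("%d\n", try_charge(40));	/* fits within the limit */
		printf("%d\n", try_charge(40));	/* over limit: reclaim, retry */
		return 0;
	}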

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 166 ++++++++++++++++++++++----------------------------------
 1 file changed, 64 insertions(+), 102 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2c7bcb0e6eb..94531df14d37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2551,80 +2551,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-
-/* See mem_cgroup_try_charge() for details */
-enum {
-	CHARGE_OK,		/* success */
-	CHARGE_RETRY,		/* need to retry but retry is not bad */
-	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
-	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
-};
-
-static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				unsigned int nr_pages, unsigned int min_pages,
-				bool invoke_oom)
-{
-	unsigned long csize = nr_pages * PAGE_SIZE;
-	struct mem_cgroup *mem_over_limit;
-	struct res_counter *fail_res;
-	unsigned long flags = 0;
-	int ret;
-
-	ret = res_counter_charge(&memcg->res, csize, &fail_res);
-
-	if (likely(!ret)) {
-		if (!do_swap_account)
-			return CHARGE_OK;
-		ret = res_counter_charge(&memcg->memsw, csize, &fail_res);
-		if (likely(!ret))
-			return CHARGE_OK;
-
-		res_counter_uncharge(&memcg->res, csize);
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
-		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
-	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
-	/*
-	 * Never reclaim on behalf of optional batching, retry with a
-	 * single page instead.
-	 */
-	if (nr_pages > min_pages)
-		return CHARGE_RETRY;
-
-	if (!(gfp_mask & __GFP_WAIT))
-		return CHARGE_WOULDBLOCK;
-
-	if (gfp_mask & __GFP_NORETRY)
-		return CHARGE_NOMEM;
-
-	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
-	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
-		return CHARGE_RETRY;
-	/*
-	 * Even though the limit is exceeded at this point, reclaim
-	 * may have been able to free some pages.  Retry the charge
-	 * before killing the task.
-	 *
-	 * Only for regular pages, though: huge pages are rather
-	 * unlikely to succeed so close to the limit, and we fall back
-	 * to regular pages anyway in case of failure.
-	 */
-	if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
-		return CHARGE_RETRY;
-
-	/*
-	 * At task move, charge accounts can be doubly counted. So, it's
-	 * better to wait until the end of task_move if something is going on.
-	 */
-	if (mem_cgroup_wait_acct_move(mem_over_limit))
-		return CHARGE_RETRY;
-
-	if (invoke_oom)
-		mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
-
-	return CHARGE_NOMEM;
-}
-
 /**
  * mem_cgroup_try_charge - try charging a memcg
  * @memcg: memcg to charge
@@ -2641,7 +2567,11 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	int ret;
+	struct mem_cgroup *mem_over_limit;
+	struct res_counter *fail_res;
+	unsigned long nr_reclaimed;
+	unsigned long flags = 0;
+	unsigned long long size;
 
 	if (mem_cgroup_is_root(memcg))
 		goto done;
@@ -2661,44 +2591,76 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 
 	if (gfp_mask & __GFP_NOFAIL)
 		oom = false;
-again:
+retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
 
-	do {
-		bool invoke_oom = oom && !nr_oom_retries;
+	size = batch * PAGE_SIZE;
+	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+		if (!do_swap_account)
+			goto done_restock;
+		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+			goto done_restock;
+		res_counter_uncharge(&memcg->res, size);
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+	} else
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 
-		/* If killed, bypass charge */
-		if (fatal_signal_pending(current))
-			goto bypass;
+	if (batch > nr_pages) {
+		batch = nr_pages;
+		goto retry;
+	}
 
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
-					   nr_pages, invoke_oom);
-		switch (ret) {
-		case CHARGE_OK:
-			break;
-		case CHARGE_RETRY: /* not in OOM situation but retry */
-			batch = nr_pages;
-			goto again;
-		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
-			goto nomem;
-		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom || invoke_oom)
-				goto nomem;
-			nr_oom_retries--;
-			break;
-		}
-	} while (ret != CHARGE_OK);
+	if (!(gfp_mask & __GFP_WAIT))
+		goto nomem;
 
-	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
-done:
-	return 0;
+	if (gfp_mask & __GFP_NORETRY)
+		goto nomem;
+
+	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+
+	if (mem_cgroup_margin(mem_over_limit) >= batch)
+		goto retry;
+	/*
+	 * Even though the limit is exceeded at this point, reclaim
+	 * may have been able to free some pages.  Retry the charge
+	 * before killing the task.
+	 *
+	 * Only for regular pages, though: huge pages are rather
+	 * unlikely to succeed so close to the limit, and we fall back
+	 * to regular pages anyway in case of failure.
+	 */
+	if (nr_reclaimed && batch <= (1 << PAGE_ALLOC_COSTLY_ORDER))
+		goto retry;
+	/*
+	 * At task move, charge accounts can be doubly counted. So, it's
+	 * better to wait until the end of task_move if something is going on.
+	 */
+	if (mem_cgroup_wait_acct_move(mem_over_limit))
+		goto retry;
+
+	if (fatal_signal_pending(current))
+		goto bypass;
+
+	if (!oom)
+		goto nomem;
+
+	if (nr_oom_retries--)
+		goto retry;
+
+	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
 bypass:
 	return -EINTR;
+
+done_restock:
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+done:
+	return 0;
 }
 
 /**
-- 
2.0.0


* [patch 02/13] mm: memcontrol: rearrange charging fast path
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

The charging path currently starts out with OOM condition checks when
OOM is the rarest possible case.

Rearrange this code so that the OOM/task-dying checks run only after
the percpu charge and the res_counter charge have been attempted, and
bail out before entering reclaim.  Attempting the charge first costs an
(oom-)killed task far less than checking for OOM conditions on every
single charge costs everybody else.  Also, only check __GFP_NOFAIL when
the charge would actually fail.
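
Schematically (a sketch of the new ordering, not the patch hunks
themselves), the cheap, almost-always-successful attempts now come
first, and the rare task-state tests only run once those have failed:

	#include <stdbool.h>
	#include <stdio.h>

	/* toy stubs: the stock hit is the overwhelmingly common case */
	static bool consume_stock(void)  { return true;  }
	static bool counter_charge(void) { return true;  }
	static bool task_is_dying(void)  { return false; }	/* rare */
	static bool in_memcg_oom(void)   { return false; }	/* rare */

	static int charge(void)
	{
		if (consume_stock())
			return 0;	/* fast path: no task-state tests at all */
		if (counter_charge())
			return 0;
		if (task_is_dying())
			return 1;	/* bypass: let the task exit quickly */
		if (in_memcg_oom())
			return -1;	/* nomem */
		/* ... enter reclaim and retry ... */
		return -1;
	}

	int main(void)
	{
		printf("%d\n", charge());
		return 0;
	}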

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94531df14d37..e946f7439b16 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2575,22 +2575,6 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 
 	if (mem_cgroup_is_root(memcg))
 		goto done;
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
-		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
-		goto bypass;
-
-	if (unlikely(task_in_memcg_oom(current)))
-		goto nomem;
-
-	if (gfp_mask & __GFP_NOFAIL)
-		oom = false;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2612,6 +2596,20 @@ retry:
 		goto retry;
 	}
 
+	/*
+	 * Unlike in global OOM situations, memcg is not in a physical
+	 * memory shortage.  Allow dying and OOM-killed tasks to
+	 * bypass the last charges so that they can exit quickly and
+	 * free their memory.
+	 */
+	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+		     fatal_signal_pending(current) ||
+		     current->flags & PF_EXITING))
+		goto bypass;
+
+	if (unlikely(task_in_memcg_oom(current)))
+		goto nomem;
+
 	if (!(gfp_mask & __GFP_WAIT))
 		goto nomem;
 
@@ -2640,6 +2638,9 @@ retry:
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	if (gfp_mask & __GFP_NOFAIL)
+		goto bypass;
+
 	if (fatal_signal_pending(current))
 		goto bypass;
 
-- 
2.0.0


* [patch 03/13] mm: memcontrol: reclaim at least once for __GFP_NORETRY
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Currently, a __GFP_NORETRY charge makes a single attempt and gives up
without even trying to reclaim.  Bring the behavior in line with the
page allocator and reclaim at least once before giving up.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e946f7439b16..16f0206696ce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2613,13 +2613,13 @@ retry:
 	if (!(gfp_mask & __GFP_WAIT))
 		goto nomem;
 
-	if (gfp_mask & __GFP_NORETRY)
-		goto nomem;
-
 	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= batch)
 		goto retry;
+
+	if (gfp_mask & __GFP_NORETRY)
+		goto nomem;
 	/*
 	 * Even though the limit is exceeded at this point, reclaim
 	 * may have been able to free some pages.  Retry the charge
-- 
2.0.0


* [patch 04/13] mm: huge_memory: use GFP_TRANSHUGE when charging huge pages
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Transparent huge page charges prefer falling back to regular pages
rather than spending a lot of time in direct reclaim.

Desired reclaim behavior is usually declared in the gfp mask, but THP
charges use GFP_KERNEL and then rely on the fact that OOM is disabled
for THP charges, and that OOM-disabled charges don't retry reclaim.
Needless to say, this is anything but obvious and quite error prone.

Convert THP charges to use GFP_TRANSHUGE instead, which implies
__GFP_NORETRY, to indicate the low-latency requirement.
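
The underlying idea is that reclaim policy is derived from the gfp mask
alone.  A minimal sketch of that - the flag values and the exact
composition of GFP_TRANSHUGE here are illustrative assumptions, not the
kernel's actual definitions:

	#include <stdbool.h>
	#include <stdio.h>

	#define __GFP_WAIT	0x01u	/* may enter reclaim */
	#define __GFP_NORETRY	0x02u	/* give up after one reclaim cycle */
	#define GFP_KERNEL	__GFP_WAIT
	#define GFP_TRANSHUGE	(GFP_KERNEL | __GFP_NORETRY)	/* assumed */

	/* policy read off the mask, no out-of-band "THP means no OOM" rule */
	static bool may_reclaim(unsigned int gfp) { return gfp & __GFP_WAIT; }
	static bool may_retry(unsigned int gfp) { return !(gfp & __GFP_NORETRY); }

	int main(void)
	{
		printf("THP charge: reclaim=%d retry=%d\n",
		       may_reclaim(GFP_TRANSHUGE), may_retry(GFP_TRANSHUGE));
		return 0;
	}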

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/huge_memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e60837dc785c..10cd7f2bf776 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -827,7 +827,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(mem_cgroup_charge_anon(page, mm, GFP_KERNEL))) {
+	if (unlikely(mem_cgroup_charge_anon(page, mm, GFP_TRANSHUGE))) {
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1101,7 +1101,7 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))) {
+	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_page(page);
@@ -2368,7 +2368,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL)))
+	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE)))
 		return;
 
 	/*
-- 
2.0.0


* [patch 05/13] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

There is no reason why oom-disabled and __GFP_NOFAIL charges should
try to reclaim only once when every other charge tries several times
before giving up.  Make them all retry the same number of times.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16f0206696ce..9c646b9b56f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2566,7 +2566,7 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 				 bool oom)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
-	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
 	struct res_counter *fail_res;
 	unsigned long nr_reclaimed;
@@ -2638,6 +2638,9 @@ retry:
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	if (nr_retries--)
+		goto retry;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		goto bypass;
 
@@ -2647,9 +2650,6 @@ retry:
 	if (!oom)
 		goto nomem;
 
-	if (nr_oom_retries--)
-		goto retry;
-
 	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.0.0


* [patch 06/13] mm: memcontrol: remove explicit OOM parameter in charge path
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

From: Michal Hocko <mhocko@suse.cz>

For the page allocator, __GFP_NORETRY implies that no OOM should be
triggered, whereas memcg has an explicit parameter to disable OOM.

The only callsites that want OOM disabled are THP charges and charge
moving.  THP already uses __GFP_NORETRY and charge moving can use it
as well - one full reclaim cycle should be plenty.  Switch it over,
then remove the OOM parameter.
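
As a general pattern (a sketch, not the memcg API itself), this folds a
boolean mode argument into the flags word the caller already passes,
making the flags the single source of truth:

	#include <stdbool.h>
	#include <stdio.h>

	#define FLAG_NORETRY	0x1u

	/* before: OOM behavior split between the flags and a bool */
	static bool may_oom_old(unsigned int flags, bool oom)
	{
		return oom && !(flags & FLAG_NORETRY);
	}

	/* after: the flags word alone decides */
	static bool may_oom_new(unsigned int flags)
	{
		return !(flags & FLAG_NORETRY);
	}

	int main(void)
	{
		printf("%d %d\n", may_oom_old(FLAG_NORETRY, true),
		       may_oom_new(FLAG_NORETRY));
		return 0;
	}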

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 32 ++++++++++----------------------
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9c646b9b56f4..c765125694e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2555,15 +2555,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
  * mem_cgroup_try_charge - try charging a memcg
  * @memcg: memcg to charge
  * @nr_pages: number of pages to charge
- * @oom: trigger OOM if reclaim fails
  *
  * Returns 0 if @memcg was charged successfully, -EINTR if the charge
  * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
  */
 static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 				 gfp_t gfp_mask,
-				 unsigned int nr_pages,
-				 bool oom)
+				 unsigned int nr_pages)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -2647,9 +2645,6 @@ retry:
 	if (fatal_signal_pending(current))
 		goto bypass;
 
-	if (!oom)
-		goto nomem;
-
 	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
@@ -2675,15 +2670,14 @@ done:
  */
 static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 				 gfp_t gfp_mask,
-				 unsigned int nr_pages,
-				 bool oom)
+				 unsigned int nr_pages)
 
 {
 	struct mem_cgroup *memcg;
 	int ret;
 
 	memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages, oom);
+	ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages);
 	css_put(&memcg->css);
 	if (ret == -EINTR)
 		memcg = root_mem_cgroup;
@@ -2900,8 +2894,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	if (ret)
 		return ret;
 
-	ret = mem_cgroup_try_charge(memcg, gfp, size >> PAGE_SHIFT,
-				    oom_gfp_allowed(gfp));
+	ret = mem_cgroup_try_charge(memcg, gfp, size >> PAGE_SHIFT);
 	if (ret == -EINTR)  {
 		/*
 		 * mem_cgroup_try_charge() chosed to bypass to root due to
@@ -3650,7 +3643,6 @@ int mem_cgroup_charge_anon(struct page *page,
 {
 	unsigned int nr_pages = 1;
 	struct mem_cgroup *memcg;
-	bool oom = true;
 
 	if (mem_cgroup_disabled())
 		return 0;
@@ -3662,14 +3654,9 @@ int mem_cgroup_charge_anon(struct page *page,
 	if (PageTransHuge(page)) {
 		nr_pages <<= compound_order(page);
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		/*
-		 * Never OOM-kill a process for a huge page.  The
-		 * fault handler will fall back to regular pages.
-		 */
-		oom = false;
 	}
 
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages, oom);
+	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages);
 	if (!memcg)
 		return -ENOMEM;
 	__mem_cgroup_commit_charge(memcg, page, nr_pages,
@@ -3706,7 +3693,7 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		memcg = try_get_mem_cgroup_from_page(page);
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, mask, 1, true);
+	ret = mem_cgroup_try_charge(memcg, mask, 1);
 	css_put(&memcg->css);
 	if (ret == -EINTR)
 		memcg = root_mem_cgroup;
@@ -3733,7 +3720,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 	if (!PageSwapCache(page)) {
 		struct mem_cgroup *memcg;
 
-		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
+		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1);
 		if (!memcg)
 			return -ENOMEM;
 		*memcgp = memcg;
@@ -3802,7 +3789,7 @@ int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
 		return 0;
 	}
 
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
+	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1);
 	if (!memcg)
 		return -ENOMEM;
 	__mem_cgroup_commit_charge(memcg, page, 1, type, false);
@@ -6414,7 +6401,8 @@ one_by_one:
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
 		}
-		ret = mem_cgroup_try_charge(memcg, GFP_KERNEL, 1, false);
+		ret = mem_cgroup_try_charge(memcg,
+					    GFP_KERNEL & ~__GFP_NORETRY, 1);
 		if (ret)
 			/* mem_cgroup_clear_mc() will do uncharge later */
 			return ret;
-- 
2.0.0


* [patch 07/13] mm: memcontrol: simplify move precharge function
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

The move precharge function does some baroque things: it tries raw
res_counter charging of the entire amount first, and then falls back
to a loop of one-by-one charges, with checks for pending signals and
cond_resched() batching.

Just use mem_cgroup_try_charge() without __GFP_WAIT for the first bulk
charge attempt.  In the one-by-one loop, remove the signal check (this
is already checked in try_charge), and simply call cond_resched()
after every charge - it's not that expensive.
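
A stand-alone sketch of the resulting two-phase strategy (toy stubs,
not the kernel code): one optimistic bulk charge that may not reclaim,
then a page-by-page fallback that may:

	#include <stdbool.h>
	#include <stdio.h>

	static long limit = 100, usage, precharged;

	static int try_charge(long nr, bool may_reclaim)
	{
		if (usage + nr > limit) {
			if (!may_reclaim)
				return -1;
			usage = usage > 10 ? usage - 10 : 0;	/* pretend reclaim */
			if (usage + nr > limit)
				return -1;
		}
		usage += nr;
		return 0;
	}

	static int precharge(long count)
	{
		/* phase 1: a single bulk charge, reclaim not allowed */
		if (!try_charge(count, false)) {
			precharged += count;
			return 0;
		}
		/* phase 2: one page at a time, reclaim allowed; on failure
		 * the caller unwinds whatever was precharged so far */
		while (count--) {
			if (try_charge(1, true))
				return -1;
			precharged++;
		}
		return 0;
	}

	int main(void)
	{
		usage = 95;	/* near the limit: bulk fails, singles succeed */
		printf("%d (precharged %ld)\n", precharge(8), precharged);
		return 0;
	}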

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 48 +++++++++++++++---------------------------------
 1 file changed, 15 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c765125694e2..382af03a040a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6359,56 +6359,38 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
-#define PRECHARGE_COUNT_AT_ONCE	256
 static int mem_cgroup_do_precharge(unsigned long count)
 {
 	int ret = 0;
-	int batch_count = PRECHARGE_COUNT_AT_ONCE;
-	struct mem_cgroup *memcg = mc.to;
 
-	if (mem_cgroup_is_root(memcg)) {
+	if (mem_cgroup_is_root(mc.to)) {
 		mc.precharge += count;
 		/* we don't need css_get for root */
 		return ret;
 	}
-	/* try to charge at once */
-	if (count > 1) {
-		struct res_counter *dummy;
-		/*
-		 * "memcg" cannot be under rmdir() because we've already checked
-		 * by cgroup_lock_live_cgroup() that it is not removed and we
-		 * are still under the same cgroup_mutex. So we can postpone
-		 * css_get().
-		 */
-		if (res_counter_charge(&memcg->res, PAGE_SIZE * count, &dummy))
-			goto one_by_one;
-		if (do_swap_account && res_counter_charge(&memcg->memsw,
-						PAGE_SIZE * count, &dummy)) {
-			res_counter_uncharge(&memcg->res, PAGE_SIZE * count);
-			goto one_by_one;
-		}
+
+	/* Try a single bulk charge without reclaim first */
+	ret = mem_cgroup_try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
+	if (!ret) {
 		mc.precharge += count;
 		return ret;
 	}
-one_by_one:
-	/* fall back to one by one charge */
+
+	/* Try charges one by one with reclaim */
 	while (count--) {
-		if (signal_pending(current)) {
-			ret = -EINTR;
-			break;
-		}
-		if (!batch_count--) {
-			batch_count = PRECHARGE_COUNT_AT_ONCE;
-			cond_resched();
-		}
-		ret = mem_cgroup_try_charge(memcg,
+		ret = mem_cgroup_try_charge(mc.to,
 					    GFP_KERNEL & ~__GFP_NORETRY, 1);
+		/*
+		 * In case of failure, any residual charges against
+		 * mc.to will be dropped by mem_cgroup_clear_mc()
+		 * later on.
+		 */
 		if (ret)
-			/* mem_cgroup_clear_mc() will do uncharge later */
 			return ret;
 		mc.precharge++;
+		cond_resched();
 	}
-	return ret;
+	return 0;
 }
 
 /**
-- 
2.0.0


* [patch 08/13] mm: memcontrol: catch root bypass in move precharge
From: Johannes Weiner @ 2014-06-18 20:40 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

When mem_cgroup_try_charge() returns -EINTR, it bypassed the charge to
the root memcg.  But move precharging does not catch this and treats
this case as if no charge had happened, thus leaking a charge against
root.  Because of an old optimization, the root memcg's res_counter is
not actually charged right now, but it's still an imbalance and
subsequent patches will charge the root memcg again.

Catch those bypasses to the root memcg and properly cancel them before
giving up the move.
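
A condensed model of the leak (toy helpers; in the patch the unwinding
is done with __mem_cgroup_cancel_charge()): a charge that returns
-EINTR has in fact been counted against root, so the caller must
explicitly cancel it:

	#include <stdio.h>

	#define EINTR	4
	static long root_usage;

	/* toy: a dying task's charge is bypassed to root, signaled by -EINTR */
	static int try_charge(long nr, int dying)
	{
		if (dying) {
			root_usage += nr;	/* the charge landed on root */
			return -EINTR;
		}
		return 0;
	}

	static void cancel_charge(long nr)
	{
		root_usage -= nr;
	}

	int main(void)
	{
		int ret = try_charge(16, 1);
		if (ret == -EINTR)	/* without this, 16 pages leak */
			cancel_charge(16);
		printf("root usage: %ld\n", root_usage);
		return 0;
	}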

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 382af03a040a..b6b5f7f98618 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6375,6 +6375,10 @@ static int mem_cgroup_do_precharge(unsigned long count)
 		mc.precharge += count;
 		return ret;
 	}
+	if (ret == -EINTR) {
+		__mem_cgroup_cancel_charge(root_mem_cgroup, count);
+		return ret;
+	}
 
 	/* Try charges one by one with reclaim */
 	while (count--) {
@@ -6383,8 +6387,11 @@ static int mem_cgroup_do_precharge(unsigned long count)
 		/*
 		 * In case of failure, any residual charges against
 		 * mc.to will be dropped by mem_cgroup_clear_mc()
-		 * later on.
+		 * later on.  However, cancel any charges that are
+		 * bypassed to root right away or they'll be lost.
 		 */
+		if (ret == -EINTR)
+			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
 		if (ret)
 			return ret;
 		mc.precharge++;
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 141+ messages in thread

* [patch 09/13] mm: memcontrol: use root_mem_cgroup res_counter
  2014-06-18 20:40 ` Johannes Weiner
@ 2014-06-18 20:40   ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Due to an old optimization to keep expensive res_counter changes at a
minimum, the root_mem_cgroup res_counter is never charged; there is no
limit at that level anyway, and any statistics can be generated on
demand by summing up the counters of all other cgroups.

However, with per-cpu charge caches, res_counter operations do not
even show up in profiles anymore, so this optimization is no longer
necessary.

Remove it to simplify the code.
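
With every first-level memcg's counters parented to root, an ordinary
hierarchical charge now lands in the root res_counter as well, and
root usage becomes a plain counter read instead of a recursive
statistics walk (a sketch of the effect of the hunks below):

	res_counter_init(&memcg->res, &root_mem_cgroup->res);
	res_counter_init(&memcg->memsw, &root_mem_cgroup->memsw);

	/* no mem_cgroup_recursive_stat() summing needed anymore */
	usage = res_counter_read_u64(&memcg->res, RES_USAGE);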

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 152 ++++++++++++++++----------------------------------------
 1 file changed, 44 insertions(+), 108 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b6b5f7f98618..d2b8429002c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2570,9 +2570,8 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 	unsigned long nr_reclaimed;
 	unsigned long flags = 0;
 	unsigned long long size;
+	int ret = 0;
 
-	if (mem_cgroup_is_root(memcg))
-		goto done;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2650,13 +2649,15 @@ nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
 bypass:
-	return -EINTR;
+	memcg = root_mem_cgroup;
+	ret = -EINTR;
+	goto retry;
 
 done_restock:
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 done:
-	return 0;
+	return ret;
 }
 
 /**
@@ -2695,13 +2696,11 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
 				       unsigned int nr_pages)
 {
-	if (!mem_cgroup_is_root(memcg)) {
-		unsigned long bytes = nr_pages * PAGE_SIZE;
+	unsigned long bytes = nr_pages * PAGE_SIZE;
 
-		res_counter_uncharge(&memcg->res, bytes);
-		if (do_swap_account)
-			res_counter_uncharge(&memcg->memsw, bytes);
-	}
+	res_counter_uncharge(&memcg->res, bytes);
+	if (do_swap_account)
+		res_counter_uncharge(&memcg->memsw, bytes);
 }
 
 /*
@@ -2713,9 +2712,6 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
-	if (mem_cgroup_is_root(memcg))
-		return;
-
 	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
 	if (do_swap_account)
 		res_counter_uncharge_until(&memcg->memsw,
@@ -3943,7 +3939,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	 * replacement page, so leave it alone when phasing out the
 	 * page that is unused after the migration.
 	 */
-	if (!end_migration && !mem_cgroup_is_root(memcg))
+	if (!end_migration)
 		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
 
 	return memcg;
@@ -4076,8 +4072,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 * We uncharge this because swap is freed.  This memcg can
 		 * be obsolete one. We avoid calling css_tryget_online().
 		 */
-		if (!mem_cgroup_is_root(memcg))
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 		mem_cgroup_swap_statistics(memcg, false);
 		css_put(&memcg->css);
 	}
@@ -4767,78 +4762,24 @@ out:
 	return retval;
 }
 
-
-static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
-					       enum mem_cgroup_stat_index idx)
-{
-	struct mem_cgroup *iter;
-	long val = 0;
-
-	/* Per-cpu values can be negative, use a signed accumulator */
-	for_each_mem_cgroup_tree(iter, memcg)
-		val += mem_cgroup_read_stat(iter, idx);
-
-	if (val < 0) /* race ? */
-		val = 0;
-	return val;
-}
-
-static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
-{
-	u64 val;
-
-	if (!mem_cgroup_is_root(memcg)) {
-		if (!swap)
-			return res_counter_read_u64(&memcg->res, RES_USAGE);
-		else
-			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
-	}
-
-	/*
-	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
-	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
-	 */
-	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
-	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
-
-	if (swap)
-		val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
-
-	return val << PAGE_SHIFT;
-}
-
 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
-				   struct cftype *cft)
+			       struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	u64 val;
-	int name;
-	enum res_type type;
-
-	type = MEMFILE_TYPE(cft->private);
-	name = MEMFILE_ATTR(cft->private);
+	enum res_type type = MEMFILE_TYPE(cft->private);
+	int name = MEMFILE_ATTR(cft->private);
 
 	switch (type) {
 	case _MEM:
-		if (name == RES_USAGE)
-			val = mem_cgroup_usage(memcg, false);
-		else
-			val = res_counter_read_u64(&memcg->res, name);
-		break;
+		return res_counter_read_u64(&memcg->res, name);
 	case _MEMSWAP:
-		if (name == RES_USAGE)
-			val = mem_cgroup_usage(memcg, true);
-		else
-			val = res_counter_read_u64(&memcg->memsw, name);
-		break;
+		return res_counter_read_u64(&memcg->memsw, name);
 	case _KMEM:
-		val = res_counter_read_u64(&memcg->kmem, name);
+		return res_counter_read_u64(&memcg->kmem, name);
 		break;
 	default:
 		BUG();
 	}
-
-	return val;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -5300,7 +5241,10 @@ static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 	if (!t)
 		goto unlock;
 
-	usage = mem_cgroup_usage(memcg, swap);
+	if (!swap)
+		usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	else
+		usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 
 	/*
 	 * current_threshold points to threshold just below or equal to usage.
@@ -5392,15 +5336,15 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 
 	mutex_lock(&memcg->thresholds_lock);
 
-	if (type == _MEM)
+	if (type == _MEM) {
 		thresholds = &memcg->thresholds;
-	else if (type == _MEMSWAP)
+		usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
-	else
+		usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+	} else
 		BUG();
 
-	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
-
 	/* Check if a threshold crossed before adding a new one */
 	if (thresholds->primary)
 		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
@@ -5480,18 +5424,19 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	int i, j, size;
 
 	mutex_lock(&memcg->thresholds_lock);
-	if (type == _MEM)
+
+	if (type == _MEM) {
 		thresholds = &memcg->thresholds;
-	else if (type == _MEMSWAP)
+		usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
-	else
+		usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+	} else
 		BUG();
 
 	if (!thresholds->primary)
 		goto unlock;
 
-	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
-
 	/* Check if a threshold crossed before removing */
 	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
 
@@ -6246,9 +6191,9 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		 * core guarantees its existence.
 		 */
 	} else {
-		res_counter_init(&memcg->res, NULL);
-		res_counter_init(&memcg->memsw, NULL);
-		res_counter_init(&memcg->kmem, NULL);
+		res_counter_init(&memcg->res, &root_mem_cgroup->res);
+		res_counter_init(&memcg->memsw, &root_mem_cgroup->memsw);
+		res_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
@@ -6361,13 +6306,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 /* Handlers for move charge at task migration. */
 static int mem_cgroup_do_precharge(unsigned long count)
 {
-	int ret = 0;
-
-	if (mem_cgroup_is_root(mc.to)) {
-		mc.precharge += count;
-		/* we don't need css_get for root */
-		return ret;
-	}
+	int ret;
 
 	/* Try a single bulk charge without reclaim first */
 	ret = mem_cgroup_try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
@@ -6674,21 +6613,18 @@ static void __mem_cgroup_clear_mc(void)
 	/* we must fixup refcnts and charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
-		if (!mem_cgroup_is_root(mc.from))
-			res_counter_uncharge(&mc.from->memsw,
-						PAGE_SIZE * mc.moved_swap);
+		res_counter_uncharge(&mc.from->memsw,
+				     PAGE_SIZE * mc.moved_swap);
 
 		for (i = 0; i < mc.moved_swap; i++)
 			css_put(&mc.from->css);
 
-		if (!mem_cgroup_is_root(mc.to)) {
-			/*
-			 * we charged both to->res and to->memsw, so we should
-			 * uncharge to->res.
-			 */
-			res_counter_uncharge(&mc.to->res,
-						PAGE_SIZE * mc.moved_swap);
-		}
+		/*
+		 * we charged both to->res and to->memsw, so we should
+		 * uncharge to->res.
+		 */
+		res_counter_uncharge(&mc.to->res,
+				     PAGE_SIZE * mc.moved_swap);
 		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 141+ messages in thread

* [patch 10/13] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed
  2014-06-18 20:40 ` Johannes Weiner
@ 2014-06-18 20:40   ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

There is a write barrier between setting pc->mem_cgroup and
PageCgroupUsed, which was added to allow LRU operations to look up the
memcg LRU list of a page without acquiring the page_cgroup lock.

But ever since 38c5d72f3ebe ("memcg: simplify LRU handling by new
rule"), pages are ensured to be off-LRU while charging, so nobody else
is changing LRU state while pc->mem_cgroup is being written, and there
are no read barriers anymore.

Remove the unnecessary write barrier.
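
The commit path is thereby reduced to two plain stores with no
intervening barrier (a sketch of what remains after the hunk below):

	pc->mem_cgroup = memcg;
	/* page is off-LRU while charging; no lockless readers */
	SetPageCgroupUsed(pc);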

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2b8429002c0..199bd50359ad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2795,14 +2795,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	}
 
 	pc->mem_cgroup = memcg;
-	/*
-	 * We access a page_cgroup asynchronously without lock_page_cgroup().
-	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
-	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
-	 * before USED bit, we need memory barrier here.
-	 * See mem_cgroup_add_lru_list(), etc.
-	 */
-	smp_wmb();
 	SetPageCgroupUsed(pc);
 
 	if (lrucare) {
@@ -3483,7 +3475,6 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		smp_wmb();/* see __commit_charge() */
 		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 141+ messages in thread

* [patch 11/13] mm: memcontrol: do not acquire page_cgroup lock for kmem pages
  2014-06-18 20:40 ` Johannes Weiner
@ 2014-06-18 20:40   ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Kmem page charging and uncharging are serialized by means of exclusive
access to the page.  Do not take the page_cgroup lock, and do not set
pc->flags atomically.
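
Both paths then come down to plain loads and stores on the
page_cgroup (condensed from the hunks below):

	/* commit: the page is fresh and not visible to anybody else */
	pc = lookup_page_cgroup(page);
	pc->mem_cgroup = memcg;
	pc->flags = PCG_USED;

	/* uncharge: exclusive access, no lock, no atomic flag ops */
	memcg = pc->mem_cgroup;
	pc->flags = 0;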

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 199bd50359ad..5e7f8e7dc0d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3407,12 +3407,13 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
 		return;
 	}
-
+	/*
+	 * The page is freshly allocated and not visible to any
+	 * outside callers yet.  Set up pc non-atomically.
+	 */
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
-	unlock_page_cgroup(pc);
+	pc->flags = PCG_USED;
 }
 
 void __memcg_kmem_uncharge_pages(struct page *page, int order)
@@ -3422,19 +3423,11 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 
 
 	pc = lookup_page_cgroup(page);
-	/*
-	 * Fast unlocked return. Theoretically might have changed, have to
-	 * check again after locking.
-	 */
 	if (!PageCgroupUsed(pc))
 		return;
 
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
+	memcg = pc->mem_cgroup;
+	pc->flags = 0;
 
 	/*
 	 * We trust that only if there is a memcg associated with the page, it
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 141+ messages in thread

* [patch 12/13] mm: memcontrol: rewrite charge API
  2014-06-18 20:40 ` Johannes Weiner
@ 2014-06-18 20:40   ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

The memcg charge API charges pages before they are rmapped - i.e. have
an actual "type" - and so every callsite needs its own set of charge
and uncharge functions to know what type is being operated on.  Worse,
uncharge has to happen from a context that is still type-specific,
rather than at the end of the page's lifetime with exclusive access,
and so requires a lot of synchronization.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

  mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
  pages from the memcg if necessary.

  mem_cgroup_commit_charge() commits the page to the charge once it
  has a valid page->mapping and PageAnon() reliably tells the type.

  mem_cgroup_cancel_charge() aborts the transaction.

This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.

As pages need to be committed after rmap is established but before
they are added to the LRU, page_add_new_anon_rmap() must stop doing
LRU additions again.  Revive lru_cache_add_active_or_unevictable().
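
A converted callsite then follows a single pattern regardless of page
type (a condensed sketch of the anon fault conversions below; the
error label is illustrative):

	if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg))
		goto oom;
	/* ... instantiate the page, set up page->mapping ... */
	page_add_new_anon_rmap(page, vma, address);
	mem_cgroup_commit_charge(page, memcg, false);
	lru_cache_add_active_or_unevictable(page, vma);

If instantiation fails after a successful try_charge(), the caller
aborts the transaction with mem_cgroup_cancel_charge() instead.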

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/memcg_test.txt |  32 +--
 include/linux/memcontrol.h           |  53 ++---
 include/linux/swap.h                 |   3 +
 kernel/events/uprobes.c              |   1 +
 mm/filemap.c                         |   9 +-
 mm/huge_memory.c                     |  57 +++--
 mm/memcontrol.c                      | 407 ++++++++++++++---------------------
 mm/memory.c                          |  41 ++--
 mm/rmap.c                            |  19 --
 mm/shmem.c                           |  24 ++-
 mm/swap.c                            |  34 +++
 mm/swapfile.c                        |  14 +-
 12 files changed, 314 insertions(+), 380 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 80ac454704b8..bcf750d3cecd 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -24,24 +24,7 @@ Please note that implementation details can be changed.
 
    a page/swp_entry may be charged (usage += PAGE_SIZE) at
 
-	mem_cgroup_charge_anon()
-	  Called at new page fault and Copy-On-Write.
-
-	mem_cgroup_try_charge_swapin()
-	  Called at do_swap_page() (page fault on swap entry) and swapoff.
-	  Followed by charge-commit-cancel protocol. (With swap accounting)
-	  At commit, a charge recorded in swap_cgroup is removed.
-
-	mem_cgroup_charge_file()
-	  Called at add_to_page_cache()
-
-	mem_cgroup_cache_charge_swapin()
-	  Called at shmem's swapin.
-
-	mem_cgroup_prepare_migration()
-	  Called before migration. "extra" charge is done and followed by
-	  charge-commit-cancel protocol.
-	  At commit, charge against oldpage or newpage will be committed.
+	mem_cgroup_try_charge()
 
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
@@ -69,19 +52,14 @@ Please note that implementation details can be changed.
 	to new page is committed. At failure, charge to old page is committed.
 
 3. charge-commit-cancel
-	In some case, we can't know this "charge" is valid or not at charging
-	(because of races).
-	To handle such case, there are charge-commit-cancel functions.
-		mem_cgroup_try_charge_XXX
-		mem_cgroup_commit_charge_XXX
-		mem_cgroup_cancel_charge_XXX
-	these are used in swap-in and migration.
+	Memcg pages are charged in two steps:
+		mem_cgroup_try_charge()
+		mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
 
 	At try_charge(), there are no flags to say "this page is charged".
 	at this point, usage += PAGE_SIZE.
 
-	At commit(), the function checks the page should be charged or not
-	and set flags or avoid charging.(usage -= PAGE_SIZE)
+	At commit(), the page is associated with the memcg.
 
 	At cancel(), simply usage -= PAGE_SIZE.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb65d29516ca..1a9a096858e0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
-/*
- * All "charge" functions with gfp_mask should use GFP_KERNEL or
- * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
- * alloc memory but reclaims memory from all available zones. So, "where I want
- * memory from" bits of gfp_mask has no meaning. So any bits of that field is
- * available but adding a rule is better. charge functions' gfp_mask should
- * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
- * codes.
- * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
- */
-
-extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask);
-/* for swap handling */
-extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
-extern void mem_cgroup_commit_charge_swapin(struct page *page,
-					struct mem_cgroup *memcg);
-extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
-
-extern int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
+void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
+			      bool lrucare);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -233,30 +216,22 @@ void mem_cgroup_print_bad_page(struct page *page);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
-static inline int mem_cgroup_charge_anon(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_charge_file(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
+static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+					gfp_t gfp_mask,
+					struct mem_cgroup **memcgp)
 {
+	*memcgp = NULL;
 	return 0;
 }
 
-static inline void mem_cgroup_commit_charge_swapin(struct page *page,
-					  struct mem_cgroup *memcg)
+static inline void mem_cgroup_commit_charge(struct page *page,
+					    struct mem_cgroup *memcg,
+					    bool lrucare)
 {
 }
 
-static inline void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
+static inline void mem_cgroup_cancel_charge(struct page *page,
+					    struct mem_cgroup *memcg)
 {
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bdbee80eede..290905133078 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -321,6 +321,9 @@ extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
 
+extern void lru_cache_add_active_or_unevictable(struct page *page,
+						struct vm_area_struct *vma);
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c445e392e93f..d17f27c69bfc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -179,6 +179,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	get_page(kpage);
 	page_add_new_anon_rmap(kpage, vma, addr);
+	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
 		dec_mm_counter(mm, MM_FILEPAGES);
diff --git a/mm/filemap.c b/mm/filemap.c
index dafb06f70a09..114cd89c1cc2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -560,19 +560,19 @@ static int __add_to_page_cache_locked(struct page *page,
 				      pgoff_t offset, gfp_t gfp_mask,
 				      void **shadowp)
 {
+	struct mem_cgroup *memcg;
 	int error;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
-	error = mem_cgroup_charge_file(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
 	if (error)
 		return error;
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
-		mem_cgroup_uncharge_cache_page(page);
+		mem_cgroup_cancel_charge(page, memcg);
 		return error;
 	}
 
@@ -587,13 +587,14 @@ static int __add_to_page_cache_locked(struct page *page,
 		goto err_insert;
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_commit_charge(page, memcg, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
+	mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	return error;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10cd7f2bf776..2377efed2924 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -715,13 +715,20 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					unsigned long haddr, pmd_t *pmd,
 					struct page *page)
 {
+	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	spinlock_t *ptl;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
+		return VM_FAULT_OOM;
+
 	pgtable = pte_alloc_one(mm, haddr);
-	if (unlikely(!pgtable))
+	if (unlikely(!pgtable)) {
+		mem_cgroup_cancel_charge(page, memcg);
 		return VM_FAULT_OOM;
+	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
 	/*
@@ -734,7 +741,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(page);
+		mem_cgroup_cancel_charge(page, memcg);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -742,6 +749,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, haddr, pmd, entry);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -827,13 +836,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(mem_cgroup_charge_anon(page, mm, GFP_TRANSHUGE))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
 	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -948,6 +951,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct page *page,
 					unsigned long haddr)
 {
+	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pgtable_t pgtable;
 	pmd_t _pmd;
@@ -968,20 +972,21 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					       __GFP_OTHER_NODE,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
-			     mem_cgroup_charge_anon(pages[i], mm,
-						       GFP_KERNEL))) {
+			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
+						   &memcg))) {
 			if (pages[i])
 				put_page(pages[i]);
-			mem_cgroup_uncharge_start();
 			while (--i >= 0) {
-				mem_cgroup_uncharge_page(pages[i]);
+				memcg = (void *)page_private(pages[i]);
+				set_page_private(pages[i], 0);
+				mem_cgroup_cancel_charge(pages[i], memcg);
 				put_page(pages[i]);
 			}
-			mem_cgroup_uncharge_end();
 			kfree(pages);
 			ret |= VM_FAULT_OOM;
 			goto out;
 		}
+		set_page_private(pages[i], (unsigned long)memcg);
 	}
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1010,7 +1015,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		pte_t *pte, entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vma, haddr);
+		mem_cgroup_commit_charge(pages[i], memcg, false);
+		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -1034,12 +1043,12 @@ out:
 out_free_pages:
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mem_cgroup_uncharge_page(pages[i]);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
+		mem_cgroup_cancel_charge(pages[i], memcg);
 		put_page(pages[i]);
 	}
-	mem_cgroup_uncharge_end();
 	kfree(pages);
 	goto out;
 }
@@ -1050,6 +1059,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 	struct page *page = NULL, *new_page;
+	struct mem_cgroup *memcg;
 	unsigned long haddr;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -1101,7 +1111,8 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE))) {
+	if (unlikely(mem_cgroup_try_charge(new_page, mm,
+					   GFP_TRANSHUGE, &memcg))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_page(page);
@@ -1130,7 +1141,7 @@ alloc:
 		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(new_page);
+		mem_cgroup_cancel_charge(new_page, memcg);
 		put_page(new_page);
 		goto out_mn;
 	} else {
@@ -1139,6 +1150,8 @@ alloc:
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
+		mem_cgroup_commit_charge(new_page, memcg, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache_pmd(vma, address, pmd);
 		if (!page) {
@@ -2358,6 +2371,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	struct mem_cgroup *memcg;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2368,7 +2382,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE)))
+	if (unlikely(mem_cgroup_try_charge(new_page, mm,
+					   GFP_TRANSHUGE, &memcg)))
 		return;
 
 	/*
@@ -2457,6 +2472,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address);
+	mem_cgroup_commit_charge(new_page, memcg, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
@@ -2470,7 +2487,7 @@ out_up_write:
 	return;
 
 out:
-	mem_cgroup_uncharge_page(new_page);
+	mem_cgroup_cancel_charge(new_page, memcg);
 	goto out_up_write;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e7f8e7dc0d8..602fe7207c2d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2551,17 +2551,8 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-/**
- * mem_cgroup_try_charge - try charging a memcg
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- *
- * Returns 0 if @memcg was charged successfully, -EINTR if the charge
- * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
- */
-static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages)
+static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
+		      unsigned int nr_pages)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -2660,41 +2651,7 @@ done:
 	return ret;
 }
 
-/**
- * mem_cgroup_try_charge_mm - try charging a mm
- * @mm: mm_struct to charge
- * @nr_pages: number of pages to charge
- * @oom: trigger OOM if reclaim fails
- *
- * Returns the charged mem_cgroup associated with the given mm_struct or
- * NULL the charge failed.
- */
-static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages)
-
-{
-	struct mem_cgroup *memcg;
-	int ret;
-
-	memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		memcg = NULL;
-
-	return memcg;
-}
-
-/*
- * Somemtimes we have to undo a charge we got by try_charge().
- * This function is for that and do uncharge, put css's refcnt.
- * gotten by try_charge().
- */
-static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
-				       unsigned int nr_pages)
+static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
@@ -2760,17 +2717,13 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	return memcg;
 }
 
-static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
-				       struct page *page,
-				       unsigned int nr_pages,
-				       enum charge_type ctype,
-				       bool lrucare)
+static void commit_charge(struct page *page, struct mem_cgroup *memcg,
+			  unsigned int nr_pages, bool anon, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
 	struct lruvec *lruvec;
 	bool was_on_lru = false;
-	bool anon;
 
 	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
@@ -2807,11 +2760,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON)
-		anon = true;
-	else
-		anon = false;
-
 	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
 	unlock_page_cgroup(pc);
 
@@ -2882,21 +2830,21 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	if (ret)
 		return ret;
 
-	ret = mem_cgroup_try_charge(memcg, gfp, size >> PAGE_SHIFT);
+	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
 	if (ret == -EINTR)  {
 		/*
-		 * mem_cgroup_try_charge() chosed to bypass to root due to
-		 * OOM kill or fatal signal.  Since our only options are to
-		 * either fail the allocation or charge it to this cgroup, do
-		 * it as a temporary condition. But we can't fail. From a
-		 * kmem/slab perspective, the cache has already been selected,
-		 * by mem_cgroup_kmem_get_cache(), so it is too late to change
+		 * try_charge() chose to bypass to root due to OOM kill or
+		 * fatal signal.  Since our only options are to either fail
+		 * the allocation or charge it to this cgroup, do it as a
+		 * temporary condition. But we can't fail. From a kmem/slab
+		 * perspective, the cache has already been selected, by
+		 * mem_cgroup_kmem_get_cache(), so it is too late to change
 		 * our minds.
 		 *
 		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed during
-		 * mem_cgroup_try_charge() above. Tasks that were already
-		 * dying when the allocation triggers should have been already
+		 * memcg_charge_kmem in a sane state, but was OOM-killed
+		 * during try_charge() above. Tasks that were already dying
+		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
 		res_counter_charge_nofail(&memcg->res, size, &fail_res);
@@ -3618,164 +3566,6 @@ out:
 	return ret;
 }
 
-int mem_cgroup_charge_anon(struct page *page,
-			      struct mm_struct *mm, gfp_t gfp_mask)
-{
-	unsigned int nr_pages = 1;
-	struct mem_cgroup *memcg;
-
-	if (mem_cgroup_disabled())
-		return 0;
-
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	VM_BUG_ON(!mm);
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages);
-	if (!memcg)
-		return -ENOMEM;
-	__mem_cgroup_commit_charge(memcg, page, nr_pages,
-				   MEM_CGROUP_CHARGE_TYPE_ANON, false);
-	return 0;
-}
-
-/*
- * While swap-in, try_charge -> commit or cancel, the page is locked.
- * And when try_charge() successfully returns, one refcnt to memcg without
- * struct page_cgroup is acquired. This refcnt will be consumed by
- * "commit()" or removed by "cancel()"
- */
-static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-					  struct page *page,
-					  gfp_t mask,
-					  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-	int ret;
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Every swap fault against a single page tries to charge the
-	 * page, bail as early as possible.  shmem_unuse() encounters
-	 * already charged pages, too.  The USED bit is protected by
-	 * the page lock, which serializes swap cache removal, which
-	 * in turn serializes uncharging.
-	 */
-	if (PageCgroupUsed(pc))
-		goto out;
-	if (do_swap_account)
-		memcg = try_get_mem_cgroup_from_page(page);
-	if (!memcg)
-		memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, mask, 1);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		return ret;
-out:
-	*memcgp = memcg;
-	return 0;
-}
-
-int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
-				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
-{
-	if (mem_cgroup_disabled()) {
-		*memcgp = NULL;
-		return 0;
-	}
-	/*
-	 * A racing thread's fault, or swapoff, may have already
-	 * updated the pte, and even removed page from swap cache: in
-	 * those cases unuse_pte()'s pte_same() test will fail; but
-	 * there's also a KSM case which does need to charge the page.
-	 */
-	if (!PageSwapCache(page)) {
-		struct mem_cgroup *memcg;
-
-		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1);
-		if (!memcg)
-			return -ENOMEM;
-		*memcgp = memcg;
-		return 0;
-	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
-}
-
-void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-	__mem_cgroup_cancel_charge(memcg, 1);
-}
-
-static void
-__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
-					enum charge_type ctype)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-
-	__mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
-	/*
-	 * Now swap is on-memory. This means this page may be
-	 * counted both as mem and swap....double count.
* [patch 12/13] mm: memcontrol: rewrite charge API
@ 2014-06-18 20:40   ` Johannes Weiner
  0 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

The memcg charge API charges pages before they are rmapped - i.e. have
an actual "type" - and so every callsite needs its own set of charge
and uncharge functions to know what type is being operated on.  Worse,
uncharge has to happen from a context that is still type-specific,
rather than at the end of the page's lifetime with exclusive access,
and so requires a lot of synchronization.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

  mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
  pages from the memcg if necessary.

  mem_cgroup_commit_charge() commits the page to the charge once it
  has a valid page->mapping and PageAnon() reliably tells the type.

  mem_cgroup_cancel_charge() aborts the transaction.

This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.

As pages need to be committed after rmap is established but before
they are added to the LRU, page_add_new_anon_rmap() must stop doing
LRU additions again.  Revive lru_cache_add_active_or_unevictable().
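
To make the new protocol concrete, here is a condensed sketch of what
converted callsites look like, modeled on the do_anonymous_page() hunk
below.  example_anon_fault() and the pte setup step are placeholders;
the mem_cgroup_*() and LRU calls are the API exactly as introduced by
this patch:

	static int example_anon_fault(struct mm_struct *mm,
				      struct vm_area_struct *vma,
				      unsigned long address,
				      struct page *page)
	{
		struct mem_cgroup *memcg;

		if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
			return VM_FAULT_OOM;

		/* set up the pte; on failure, abort the transaction */
		if (example_set_pte_failed(mm, vma, address, page)) {
			mem_cgroup_cancel_charge(page, memcg);
			return VM_FAULT_OOM;
		}

		/* commit only once the rmap tells the page's type */
		page_add_new_anon_rmap(page, vma, address);
		mem_cgroup_commit_charge(page, memcg, false);
		lru_cache_add_active_or_unevictable(page, vma);
		return 0;
	}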

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/memcg_test.txt |  32 +--
 include/linux/memcontrol.h           |  53 ++---
 include/linux/swap.h                 |   3 +
 kernel/events/uprobes.c              |   1 +
 mm/filemap.c                         |   9 +-
 mm/huge_memory.c                     |  57 +++--
 mm/memcontrol.c                      | 407 ++++++++++++++---------------------
 mm/memory.c                          |  41 ++--
 mm/rmap.c                            |  19 --
 mm/shmem.c                           |  24 ++-
 mm/swap.c                            |  34 +++
 mm/swapfile.c                        |  14 +-
 12 files changed, 314 insertions(+), 380 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 80ac454704b8..bcf750d3cecd 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -24,24 +24,7 @@ Please note that implementation details can be changed.
 
    a page/swp_entry may be charged (usage += PAGE_SIZE) at
 
-	mem_cgroup_charge_anon()
-	  Called at new page fault and Copy-On-Write.
-
-	mem_cgroup_try_charge_swapin()
-	  Called at do_swap_page() (page fault on swap entry) and swapoff.
-	  Followed by charge-commit-cancel protocol. (With swap accounting)
-	  At commit, a charge recorded in swap_cgroup is removed.
-
-	mem_cgroup_charge_file()
-	  Called at add_to_page_cache()
-
-	mem_cgroup_cache_charge_swapin()
-	  Called at shmem's swapin.
-
-	mem_cgroup_prepare_migration()
-	  Called before migration. "extra" charge is done and followed by
-	  charge-commit-cancel protocol.
-	  At commit, charge against oldpage or newpage will be committed.
+	mem_cgroup_try_charge()
 
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
@@ -69,19 +52,14 @@ Please note that implementation details can be changed.
 	to new page is committed. At failure, charge to old page is committed.
 
 3. charge-commit-cancel
-	In some case, we can't know this "charge" is valid or not at charging
-	(because of races).
-	To handle such case, there are charge-commit-cancel functions.
-		mem_cgroup_try_charge_XXX
-		mem_cgroup_commit_charge_XXX
-		mem_cgroup_cancel_charge_XXX
-	these are used in swap-in and migration.
+	Memcg pages are charged in two steps:
+		mem_cgroup_try_charge()
+		mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
 
 	At try_charge(), there are no flags to say "this page is charged".
 	at this point, usage += PAGE_SIZE.
 
-	At commit(), the function checks the page should be charged or not
-	and set flags or avoid charging.(usage -= PAGE_SIZE)
+	At commit(), the page is associated with the memcg.
 
 	At cancel(), simply usage -= PAGE_SIZE.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb65d29516ca..1a9a096858e0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
-/*
- * All "charge" functions with gfp_mask should use GFP_KERNEL or
- * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
- * alloc memory but reclaims memory from all available zones. So, "where I want
- * memory from" bits of gfp_mask has no meaning. So any bits of that field is
- * available but adding a rule is better. charge functions' gfp_mask should
- * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
- * codes.
- * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
- */
-
-extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask);
-/* for swap handling */
-extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
-extern void mem_cgroup_commit_charge_swapin(struct page *page,
-					struct mem_cgroup *memcg);
-extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
-
-extern int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
+void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
+			      bool lrucare);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -233,30 +216,22 @@ void mem_cgroup_print_bad_page(struct page *page);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
-static inline int mem_cgroup_charge_anon(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_charge_file(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
+static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+					gfp_t gfp_mask,
+					struct mem_cgroup **memcgp)
 {
+	*memcgp = NULL;
 	return 0;
 }
 
-static inline void mem_cgroup_commit_charge_swapin(struct page *page,
-					  struct mem_cgroup *memcg)
+static inline void mem_cgroup_commit_charge(struct page *page,
+					    struct mem_cgroup *memcg,
+					    bool lrucare)
 {
 }
 
-static inline void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
+static inline void mem_cgroup_cancel_charge(struct page *page,
+					    struct mem_cgroup *memcg)
 {
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bdbee80eede..290905133078 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -321,6 +321,9 @@ extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
 
+extern void lru_cache_add_active_or_unevictable(struct page *page,
+						struct vm_area_struct *vma);
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c445e392e93f..d17f27c69bfc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -179,6 +179,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	get_page(kpage);
 	page_add_new_anon_rmap(kpage, vma, addr);
+	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
 		dec_mm_counter(mm, MM_FILEPAGES);
diff --git a/mm/filemap.c b/mm/filemap.c
index dafb06f70a09..114cd89c1cc2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -560,19 +560,19 @@ static int __add_to_page_cache_locked(struct page *page,
 				      pgoff_t offset, gfp_t gfp_mask,
 				      void **shadowp)
 {
+	struct mem_cgroup *memcg;
 	int error;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
-	error = mem_cgroup_charge_file(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
 	if (error)
 		return error;
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
-		mem_cgroup_uncharge_cache_page(page);
+		mem_cgroup_cancel_charge(page, memcg);
 		return error;
 	}
 
@@ -587,13 +587,14 @@ static int __add_to_page_cache_locked(struct page *page,
 		goto err_insert;
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_commit_charge(page, memcg, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
+	mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	return error;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10cd7f2bf776..2377efed2924 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -715,13 +715,20 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					unsigned long haddr, pmd_t *pmd,
 					struct page *page)
 {
+	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	spinlock_t *ptl;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
+		return VM_FAULT_OOM;
+
 	pgtable = pte_alloc_one(mm, haddr);
-	if (unlikely(!pgtable))
+	if (unlikely(!pgtable)) {
+		mem_cgroup_cancel_charge(page, memcg);
 		return VM_FAULT_OOM;
+	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
 	/*
@@ -734,7 +741,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(page);
+		mem_cgroup_cancel_charge(page, memcg);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -742,6 +749,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, haddr, pmd, entry);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -827,13 +836,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(mem_cgroup_charge_anon(page, mm, GFP_TRANSHUGE))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
 	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -948,6 +951,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct page *page,
 					unsigned long haddr)
 {
+	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pgtable_t pgtable;
 	pmd_t _pmd;
@@ -968,20 +972,21 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					       __GFP_OTHER_NODE,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
-			     mem_cgroup_charge_anon(pages[i], mm,
-						       GFP_KERNEL))) {
+			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
+						   &memcg))) {
 			if (pages[i])
 				put_page(pages[i]);
-			mem_cgroup_uncharge_start();
 			while (--i >= 0) {
-				mem_cgroup_uncharge_page(pages[i]);
+				memcg = (void *)page_private(pages[i]);
+				set_page_private(pages[i], 0);
+				mem_cgroup_cancel_charge(pages[i], memcg);
 				put_page(pages[i]);
 			}
-			mem_cgroup_uncharge_end();
 			kfree(pages);
 			ret |= VM_FAULT_OOM;
 			goto out;
 		}
+		set_page_private(pages[i], (unsigned long)memcg);
 	}
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1010,7 +1015,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		pte_t *pte, entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vma, haddr);
+		mem_cgroup_commit_charge(pages[i], memcg, false);
+		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -1034,12 +1043,12 @@ out:
 out_free_pages:
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mem_cgroup_uncharge_page(pages[i]);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
+		mem_cgroup_cancel_charge(pages[i], memcg);
 		put_page(pages[i]);
 	}
-	mem_cgroup_uncharge_end();
 	kfree(pages);
 	goto out;
 }
@@ -1050,6 +1059,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 	struct page *page = NULL, *new_page;
+	struct mem_cgroup *memcg;
 	unsigned long haddr;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -1101,7 +1111,8 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE))) {
+	if (unlikely(mem_cgroup_try_charge(new_page, mm,
+					   GFP_TRANSHUGE, &memcg))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_page(page);
@@ -1130,7 +1141,7 @@ alloc:
 		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(new_page);
+		mem_cgroup_cancel_charge(new_page, memcg);
 		put_page(new_page);
 		goto out_mn;
 	} else {
@@ -1139,6 +1150,8 @@ alloc:
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
+		mem_cgroup_commit_charge(new_page, memcg, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache_pmd(vma, address, pmd);
 		if (!page) {
@@ -2358,6 +2371,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	struct mem_cgroup *memcg;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2368,7 +2382,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE)))
+	if (unlikely(mem_cgroup_try_charge(new_page, mm,
+					   GFP_TRANSHUGE, &memcg)))
 		return;
 
 	/*
@@ -2457,6 +2472,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address);
+	mem_cgroup_commit_charge(new_page, memcg, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
@@ -2470,7 +2487,7 @@ out_up_write:
 	return;
 
 out:
-	mem_cgroup_uncharge_page(new_page);
+	mem_cgroup_cancel_charge(new_page, memcg);
 	goto out_up_write;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e7f8e7dc0d8..602fe7207c2d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2551,17 +2551,8 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-/**
- * mem_cgroup_try_charge - try charging a memcg
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- *
- * Returns 0 if @memcg was charged successfully, -EINTR if the charge
- * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
- */
-static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages)
+static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
+		      unsigned int nr_pages)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -2660,41 +2651,7 @@ done:
 	return ret;
 }
 
-/**
- * mem_cgroup_try_charge_mm - try charging a mm
- * @mm: mm_struct to charge
- * @nr_pages: number of pages to charge
- * @oom: trigger OOM if reclaim fails
- *
- * Returns the charged mem_cgroup associated with the given mm_struct or
- * NULL the charge failed.
- */
-static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages)
-
-{
-	struct mem_cgroup *memcg;
-	int ret;
-
-	memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		memcg = NULL;
-
-	return memcg;
-}
-
-/*
- * Somemtimes we have to undo a charge we got by try_charge().
- * This function is for that and do uncharge, put css's refcnt.
- * gotten by try_charge().
- */
-static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
-				       unsigned int nr_pages)
+static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
@@ -2760,17 +2717,13 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	return memcg;
 }
 
-static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
-				       struct page *page,
-				       unsigned int nr_pages,
-				       enum charge_type ctype,
-				       bool lrucare)
+static void commit_charge(struct page *page, struct mem_cgroup *memcg,
+			  unsigned int nr_pages, bool anon, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
 	struct lruvec *lruvec;
 	bool was_on_lru = false;
-	bool anon;
 
 	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
@@ -2807,11 +2760,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON)
-		anon = true;
-	else
-		anon = false;
-
 	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
 	unlock_page_cgroup(pc);
 
@@ -2882,21 +2830,21 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	if (ret)
 		return ret;
 
-	ret = mem_cgroup_try_charge(memcg, gfp, size >> PAGE_SHIFT);
+	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
 	if (ret == -EINTR)  {
 		/*
-		 * mem_cgroup_try_charge() chosed to bypass to root due to
-		 * OOM kill or fatal signal.  Since our only options are to
-		 * either fail the allocation or charge it to this cgroup, do
-		 * it as a temporary condition. But we can't fail. From a
-		 * kmem/slab perspective, the cache has already been selected,
-		 * by mem_cgroup_kmem_get_cache(), so it is too late to change
+		 * try_charge() chose to bypass to root due to OOM kill or
+		 * fatal signal.  Since our only options are to either fail
+		 * the allocation or charge it to this cgroup, do it as a
+		 * temporary condition. But we can't fail. From a kmem/slab
+		 * perspective, the cache has already been selected, by
+		 * mem_cgroup_kmem_get_cache(), so it is too late to change
 		 * our minds.
 		 *
 		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed during
-		 * mem_cgroup_try_charge() above. Tasks that were already
-		 * dying when the allocation triggers should have been already
+		 * memcg_charge_kmem in a sane state, but was OOM-killed
+		 * during try_charge() above. Tasks that were already dying
+		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
 		res_counter_charge_nofail(&memcg->res, size, &fail_res);
@@ -3618,164 +3566,6 @@ out:
 	return ret;
 }
 
-int mem_cgroup_charge_anon(struct page *page,
-			      struct mm_struct *mm, gfp_t gfp_mask)
-{
-	unsigned int nr_pages = 1;
-	struct mem_cgroup *memcg;
-
-	if (mem_cgroup_disabled())
-		return 0;
-
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	VM_BUG_ON(!mm);
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages);
-	if (!memcg)
-		return -ENOMEM;
-	__mem_cgroup_commit_charge(memcg, page, nr_pages,
-				   MEM_CGROUP_CHARGE_TYPE_ANON, false);
-	return 0;
-}
-
-/*
- * While swap-in, try_charge -> commit or cancel, the page is locked.
- * And when try_charge() successfully returns, one refcnt to memcg without
- * struct page_cgroup is acquired. This refcnt will be consumed by
- * "commit()" or removed by "cancel()"
- */
-static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-					  struct page *page,
-					  gfp_t mask,
-					  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-	int ret;
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Every swap fault against a single page tries to charge the
-	 * page, bail as early as possible.  shmem_unuse() encounters
-	 * already charged pages, too.  The USED bit is protected by
-	 * the page lock, which serializes swap cache removal, which
-	 * in turn serializes uncharging.
-	 */
-	if (PageCgroupUsed(pc))
-		goto out;
-	if (do_swap_account)
-		memcg = try_get_mem_cgroup_from_page(page);
-	if (!memcg)
-		memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, mask, 1);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		return ret;
-out:
-	*memcgp = memcg;
-	return 0;
-}
-
-int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
-				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
-{
-	if (mem_cgroup_disabled()) {
-		*memcgp = NULL;
-		return 0;
-	}
-	/*
-	 * A racing thread's fault, or swapoff, may have already
-	 * updated the pte, and even removed page from swap cache: in
-	 * those cases unuse_pte()'s pte_same() test will fail; but
-	 * there's also a KSM case which does need to charge the page.
-	 */
-	if (!PageSwapCache(page)) {
-		struct mem_cgroup *memcg;
-
-		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1);
-		if (!memcg)
-			return -ENOMEM;
-		*memcgp = memcg;
-		return 0;
-	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
-}
-
-void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-	__mem_cgroup_cancel_charge(memcg, 1);
-}
-
-static void
-__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
-					enum charge_type ctype)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-
-	__mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
-	/*
-	 * Now swap is on-memory. This means this page may be
-	 * counted both as mem and swap....double count.
-	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
-	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
-	 * may call delete_from_swap_cache() before reach here.
-	 */
-	if (do_swap_account && PageSwapCache(page)) {
-		swp_entry_t ent = {.val = page_private(page)};
-		mem_cgroup_uncharge_swap(ent);
-	}
-}
-
-void mem_cgroup_commit_charge_swapin(struct page *page,
-				     struct mem_cgroup *memcg)
-{
-	__mem_cgroup_commit_charge_swapin(page, memcg,
-					  MEM_CGROUP_CHARGE_TYPE_ANON);
-}
-
-int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
-{
-	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
-	struct mem_cgroup *memcg;
-	int ret;
-
-	if (mem_cgroup_disabled())
-		return 0;
-	if (PageCompound(page))
-		return 0;
-
-	if (PageSwapCache(page)) { /* shmem */
-		ret = __mem_cgroup_try_charge_swapin(mm, page,
-						     gfp_mask, &memcg);
-		if (ret)
-			return ret;
-		__mem_cgroup_commit_charge_swapin(page, memcg, type);
-		return 0;
-	}
-
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1);
-	if (!memcg)
-		return -ENOMEM;
-	__mem_cgroup_commit_charge(memcg, page, 1, type, false);
-	return 0;
-}
-
 static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
 				   unsigned int nr_pages,
 				   const enum charge_type ctype)
@@ -4122,7 +3912,6 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	enum charge_type ctype;
 
 	*memcgp = NULL;
 
@@ -4184,16 +3973,12 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * page. In the case new page is migrated but not remapped, new page's
 	 * mapcount will be finally 0 and we call uncharge in end_migration().
 	 */
-	if (PageAnon(page))
-		ctype = MEM_CGROUP_CHARGE_TYPE_ANON;
-	else
-		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	/*
 	 * The page is committed to the memcg, but it's not actually
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
+	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
 }
 
 /* remove redundant charge if migration failed*/
@@ -4252,7 +4037,6 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
 {
 	struct mem_cgroup *memcg = NULL;
 	struct page_cgroup *pc;
-	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -4278,7 +4062,7 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
 	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
 	 * LRU while we overwrite pc->mem_cgroup.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, type, true);
+	commit_charge(newpage, memcg, 1, false, true);
 }
 
 #ifdef CONFIG_DEBUG_VM
@@ -6293,20 +6077,19 @@ static int mem_cgroup_do_precharge(unsigned long count)
 	int ret;
 
 	/* Try a single bulk charge without reclaim first */
-	ret = mem_cgroup_try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
+	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
 	if (!ret) {
 		mc.precharge += count;
 		return ret;
 	}
 	if (ret == -EINTR) {
-		__mem_cgroup_cancel_charge(root_mem_cgroup, count);
+		cancel_charge(root_mem_cgroup, count);
 		return ret;
 	}
 
 	/* Try charges one by one with reclaim */
 	while (count--) {
-		ret = mem_cgroup_try_charge(mc.to,
-					    GFP_KERNEL & ~__GFP_NORETRY, 1);
+		ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_NORETRY, 1);
 		/*
 		 * In case of failure, any residual charges against
 		 * mc.to will be dropped by mem_cgroup_clear_mc()
@@ -6314,7 +6097,7 @@ static int mem_cgroup_do_precharge(unsigned long count)
 		 * bypassed to root right away or they'll be lost.
 		 */
 		if (ret == -EINTR)
-			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
+			cancel_charge(root_mem_cgroup, 1);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -6583,7 +6366,7 @@ static void __mem_cgroup_clear_mc(void)
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
-		__mem_cgroup_cancel_charge(mc.to, mc.precharge);
+		cancel_charge(mc.to, mc.precharge);
 		mc.precharge = 0;
 	}
 	/*
@@ -6591,7 +6374,7 @@ static void __mem_cgroup_clear_mc(void)
 	 * we must uncharge here.
 	 */
 	if (mc.moved_charge) {
-		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
+		cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
 	/* we must fixup refcnts and charges */
@@ -6917,6 +6700,150 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+/**
+ * mem_cgroup_try_charge - try charging a page
+ * @page: page to charge
+ * @mm: mm context of the victim
+ * @gfp_mask: reclaim mode
+ * @memcgp: charged memcg return
+ *
+ * Try to charge @page to the memcg that @mm belongs to, reclaiming
+ * pages according to @gfp_mask if necessary.
+ *
+ * Returns 0 on success, with *@memcgp pointing to the charged memcg.
+ * Otherwise, an error code is returned.
+ *
+ * After page->mapping has been set up, the caller must finalize the
+ * charge with mem_cgroup_commit_charge().  Or abort the transaction
+ * with mem_cgroup_cancel_charge() in case page instantiation fails.
+ */
+int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
+	int ret = 0;
+
+	if (mem_cgroup_disabled())
+		goto out;
+
+	if (PageSwapCache(page)) {
+		struct page_cgroup *pc = lookup_page_cgroup(page);
+		/*
+		 * Every swap fault against a single page tries to charge the
+		 * page, bail as early as possible.  shmem_unuse() encounters
+		 * already charged pages, too.  The USED bit is protected by
+		 * the page lock, which serializes swap cache removal, which
+		 * in turn serializes uncharging.
+		 */
+		if (PageCgroupUsed(pc))
+			goto out;
+	}
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	if (do_swap_account && PageSwapCache(page))
+		memcg = try_get_mem_cgroup_from_page(page);
+	if (!memcg)
+		memcg = get_mem_cgroup_from_mm(mm);
+
+	ret = try_charge(memcg, gfp_mask, nr_pages);
+
+	css_put(&memcg->css);
+
+	if (ret == -EINTR) {
+		memcg = root_mem_cgroup;
+		ret = 0;
+	}
+out:
+	*memcgp = memcg;
+	return ret;
+}
+
+/**
+ * mem_cgroup_commit_charge - commit a page charge
+ * @page: page to charge
+ * @memcg: memcg to charge the page to
+ * @lrucare: page might be on LRU already
+ *
+ * Finalize a charge transaction started by mem_cgroup_try_charge(),
+ * after page->mapping has been set up.  This must happen atomically
+ * as part of the page instantiation, i.e. under the page table lock
+ * for anonymous pages, under the page lock for page and swap cache.
+ *
+ * In addition, the page must not be on the LRU during the commit, to
+ * prevent racing with task migration.  If it might be, use @lrucare.
+ *
+ * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
+ */
+void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
+			      bool lrucare)
+{
+	unsigned int nr_pages = 1;
+
+	VM_BUG_ON_PAGE(!page->mapping, page);
+	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
+
+	if (mem_cgroup_disabled())
+		return;
+	/*
+	 * Swap faults will attempt to charge the same page multiple
+	 * times.  But reuse_swap_page() might have removed the page
+	 * from swapcache already, so we can't check PageSwapCache().
+	 */
+	if (!memcg)
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		/*
+		 * The swap entry might not get freed for a long time,
+		 * let's not wait for it.  The page already received a
+		 * memory+swap charge, drop the swap entry duplicate.
+		 */
+		mem_cgroup_uncharge_swap(entry);
+	}
+}
+
+/**
+ * mem_cgroup_cancel_charge - cancel a page charge
+ * @page: page to charge
+ * @memcg: memcg to charge the page to
+ *
+ * Cancel a charge transaction started by mem_cgroup_try_charge().
+ */
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+{
+	unsigned int nr_pages = 1;
+
+	if (mem_cgroup_disabled())
+		return;
+	/*
+	 * Swap faults will attempt to charge the same page multiple
+	 * times.  But reuse_swap_page() might have removed the page
+	 * from swapcache already, so we can't check PageSwapCache().
+	 */
+	if (!memcg)
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	cancel_charge(memcg, nr_pages);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/memory.c b/mm/memory.c
index d67fd9fcf1f2..d66988d56caf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2049,6 +2049,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *dirty_page = NULL;
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	struct mem_cgroup *memcg;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
@@ -2204,7 +2205,7 @@ gotten:
 	}
 	__SetPageUptodate(new_page);
 
-	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
 	mmun_start  = address & PAGE_MASK;
@@ -2234,6 +2235,8 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		mem_cgroup_commit_charge(new_page, memcg, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -2271,7 +2274,7 @@ gotten:
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
 	} else
-		mem_cgroup_uncharge_page(new_page);
+		mem_cgroup_cancel_charge(new_page, memcg);
 
 	if (new_page)
 		page_cache_release(new_page);
@@ -2407,10 +2410,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	spinlock_t *ptl;
 	struct page *page, *swapcache;
+	struct mem_cgroup *memcg;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
-	struct mem_cgroup *ptr;
 	int exclusive = 0;
 	int ret = 0;
 
@@ -2486,7 +2489,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -2511,10 +2514,6 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * while the page is counted on swap but not yet in mapcount i.e.
 	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
 	 * must be called after the swap_free(), or it will never succeed.
-	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
-	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
-	 * in page->private. In this case, a record in swap_cgroup  is silently
-	 * discarded at swap_free().
 	 */
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
@@ -2530,12 +2529,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
 	set_pte_at(mm, address, page_table, pte);
-	if (page == swapcache)
+	if (page == swapcache) {
 		do_page_add_anon_rmap(page, vma, address, exclusive);
-	else /* ksm created a completely new copy */
+		mem_cgroup_commit_charge(page, memcg, true);
+	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, address);
-	/* It's better to call commit-charge after rmap is established */
-	mem_cgroup_commit_charge_swapin(page, ptr);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
+	}
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2568,7 +2569,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge_swapin(ptr);
+	mem_cgroup_cancel_charge(page, memcg);
 	pte_unmap_unlock(page_table, ptl);
 out_page:
 	unlock_page(page);
@@ -2624,6 +2625,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		unsigned int flags)
 {
+	struct mem_cgroup *memcg;
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
@@ -2657,7 +2659,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	__SetPageUptodate(page);
 
-	if (mem_cgroup_charge_anon(page, mm, GFP_KERNEL))
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_page;
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -2670,6 +2672,8 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
+	mem_cgroup_commit_charge(page, memcg, false);
+	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
@@ -2679,7 +2683,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return 0;
 release:
-	mem_cgroup_uncharge_page(page);
+	mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
@@ -2913,6 +2917,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
 	struct page *fault_page, *new_page;
+	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pte_t *pte;
 	int ret;
@@ -2924,7 +2929,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL)) {
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
@@ -2944,12 +2949,14 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
+	mem_cgroup_commit_charge(new_page, memcg, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
 	page_cache_release(fault_page);
 	return ret;
 uncharge_out:
-	mem_cgroup_uncharge_page(new_page);
+	mem_cgroup_cancel_charge(new_page, memcg);
 	page_cache_release(new_page);
 	return ret;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index bf05fc872ae8..07576e0b92ef 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1032,25 +1032,6 @@ void page_add_new_anon_rmap(struct page *page,
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
 			hpage_nr_pages(page));
 	__page_set_anon_rmap(page, vma, address, 1);
-
-	VM_BUG_ON_PAGE(PageLRU(page), page);
-	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
-		SetPageActive(page);
-		lru_cache_add(page);
-		return;
-	}
-
-	if (!TestSetPageMlocked(page)) {
-		/*
-		 * We use the irq-unsafe __mod_zone_page_stat because this
-		 * counter is not modified from interrupt context, and the pte
-		 * lock is held(spinlock), which implies preemption disabled.
-		 */
-		__mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
-		count_vm_event(UNEVICTABLE_PGMLOCKED);
-	}
-	add_page_to_unevictable_list(page);
 }
 
 /**
diff --git a/mm/shmem.c b/mm/shmem.c
index f484c276e994..ea968bf84942 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -668,6 +668,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 {
 	struct list_head *this, *next;
 	struct shmem_inode_info *info;
+	struct mem_cgroup *memcg;
 	int found = 0;
 	int error = 0;
 
@@ -683,7 +684,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_charge_file(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -701,8 +702,11 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	}
 	mutex_unlock(&shmem_swaplist_mutex);
 
-	if (found < 0)
+	if (found < 0) {
 		error = found;
+		mem_cgroup_cancel_charge(page, memcg);
+	} else
+		mem_cgroup_commit_charge(page, memcg, true);
 out:
 	unlock_page(page);
 	page_cache_release(page);
@@ -1005,6 +1009,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo;
+	struct mem_cgroup *memcg;
 	struct page *page;
 	swp_entry_t swap;
 	int error;
@@ -1080,8 +1085,7 @@ repeat:
 				goto failed;
 		}
 
-		error = mem_cgroup_charge_file(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1097,12 +1101,16 @@ repeat:
 			 * Reset swap.val? No, leave it so "failed" goes back to
 			 * "repeat": reading a hole and writing should succeed.
 			 */
-			if (error)
+			if (error) {
+				mem_cgroup_cancel_charge(page, memcg);
 				delete_from_swap_cache(page);
+			}
 		}
 		if (error)
 			goto failed;
 
+		mem_cgroup_commit_charge(page, memcg, true);
+
 		spin_lock(&info->lock);
 		info->swapped--;
 		shmem_recalc_inode(inode);
@@ -1134,8 +1142,7 @@ repeat:
 
 		__SetPageSwapBacked(page);
 		__set_page_locked(page);
-		error = mem_cgroup_charge_file(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (error)
 			goto decused;
 		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1145,9 +1152,10 @@ repeat:
 			radix_tree_preload_end();
 		}
 		if (error) {
-			mem_cgroup_uncharge_cache_page(page);
+			mem_cgroup_cancel_charge(page, memcg);
 			goto decused;
 		}
+		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e3472248b..a98f48626359 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -695,6 +695,40 @@ void add_page_to_unevictable_list(struct page *page)
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+/**
+ * lru_cache_add_active_or_unevictable
+ * @page:  the page to be added to LRU
+ * @vma:   vma in which page is mapped for determining reclaimability
+ *
+ * Place @page on the active or unevictable LRU list, depending on its
+ * evictability.  Note that if the page is not evictable, it goes
+ * directly back onto its zone's unevictable list; it does NOT use a
+ * per-cpu pagevec.
+ */
+void lru_cache_add_active_or_unevictable(struct page *page,
+					 struct vm_area_struct *vma)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+		SetPageActive(page);
+		lru_cache_add(page);
+		return;
+	}
+
+	if (!TestSetPageMlocked(page)) {
+		/*
+		 * We use the irq-unsafe __mod_zone_page_state because this
+		 * counter is not modified from interrupt context, and the pte
+		 * lock is held (spinlock), which implies preemption disabled.
+		 */
+		__mod_zone_page_state(page_zone(page), NR_MLOCK,
+				    hpage_nr_pages(page));
+		count_vm_event(UNEVICTABLE_PGMLOCKED);
+	}
+	add_page_to_unevictable_list(page);
+}
+
 /*
  * If the page can not be invalidated, it is moved to the
  * inactive list to speed up its reclaim.  It is moved to the
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4c524f7bd0bf..0883b4912ff7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1106,15 +1106,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
-					 GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
-		mem_cgroup_cancel_charge_swapin(memcg);
+		mem_cgroup_cancel_charge(page, memcg);
 		ret = 0;
 		goto out;
 	}
@@ -1124,11 +1123,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
-	if (page == swapcache)
+	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr);
-	else /* ksm created a completely new copy */
+		mem_cgroup_commit_charge(page, memcg, true);
+	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr);
-	mem_cgroup_commit_charge_swapin(page, memcg);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
+	}
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
-- 
2.0.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 141+ messages in thread

* [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-06-18 20:40 ` Johannes Weiner
@ 2014-06-18 20:40   ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had
their page->mapping established, uncharges had to happen while the
page type was still known from the context: unmap for anonymous pages,
page cache removal for file and shmem pages, and swap cache truncation
for swap pages.  However, these operations happen well before the page
is actually freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging
  (see the sketch after this list).

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts in which no
  swapout actually takes place, and it has to figure out the direction
  from page state to make sure memory and memory+swap are always
  correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().
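
As an illustration of the first point, this is the PCG_LOCK pattern
deleted from include/linux/page_cgroup.h by this patch; condensed
from the hunk below (a paraphrase of removed code, not a new
interface):

	static inline void lock_page_cgroup(struct page_cgroup *pc)
	{
		/* protects pc->mem_cgroup, PCG_USED, PCG_MIGRATION */
		bit_spin_lock(PCG_LOCK, &pc->flags);
	}

	lock_page_cgroup(pc);
	/* ... inspect or modify pc->mem_cgroup and pc->flags ... */
	unlock_page_cgroup(pc);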

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(),
when we know for sure that nobody is looking at the page anymore.
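
Condensed from the mm/swap.c hunks below, the final-put paths now
look like this (a sketch of the call shape only; the LRU teardown
and full function bodies are in the hunks):

	static void __page_cache_release(struct page *page)
	{
		/* ... remove the page from its LRU list ... */
		mem_cgroup_uncharge(page);	/* refcount is zero here */
	}

	void release_pages(struct page **pages, int nr, bool cold)
	{
		mem_cgroup_uncharge_start();	/* batch res_counter updates */
		/* ... per-page putback and mem_cgroup_uncharge(page) ... */
		mem_cgroup_uncharge_end();
	}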

For page migration, introduce mem_cgroup_migrate(), which is called
after the migration is successful and the new page is fully rmapped.
Because the old page is no longer uncharged after migration, prevent
double charges by decoupling the page's memcg association (PCG_USED
and pc->mem_cgroup) from the page holding an actual charge.  The new
bits PCG_MEM and PCG_MEMSW represent the respective charges and are
transferred to the new page during migration.
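
Condensed from the mm/memcontrol.c hunk below, the charge transfer
boils down to moving those bits (sketch; THP scaling and the
VM_BUG_ON sanity checks are omitted):

	void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
				bool lrucare)
	{
		struct page_cgroup *pc = lookup_page_cgroup(oldpage);

		if (!PageCgroupUsed(pc))
			return;
		/* the res_counter charge travels with PCG_MEM/PCG_MEMSW */
		pc->flags &= ~(PCG_MEM | PCG_MEMSW);
		commit_charge(newpage, pc->mem_cgroup, 1, lrucare);
	}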

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache().

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
entry before the final put_page() in page reclaim.
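
Condensed from the new mem_cgroup_swapout() below (sketch; the
VM_BUG_ON checks are omitted):

	void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);

		if (!do_swap_account || !PageCgroupUsed(pc))
			return;
		/* record the owner, keep the memsw charge on the entry */
		swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
		pc->flags &= ~PCG_MEMSW;
		css_get(&pc->mem_cgroup->css);
		mem_cgroup_swap_statistics(pc->mem_cgroup, true);
	}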

Finally, page_cgroup changes are now protected by whatever protection
the page itself offers: anonymous pages are charged under the page
table lock, whereas page cache insertions, swapin, and migration hold
the page lock.  Uncharging happens under full exclusion with no
outstanding references.  Charging and uncharging also ensure that the
page is off-LRU, which serializes against charge migration.  Remove
the very costly page_cgroup lock and set pc->flags non-atomically.
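
Putting it together, a charging site now follows the pattern from
the earlier patches in this series (sketch based on the
do_anonymous_page() conversion; error paths abbreviated):

	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
		goto oom_free_page;
	/* ... set up the pte under the page table lock ... */
	page_add_new_anon_rmap(page, vma, address);
	mem_cgroup_commit_charge(page, memcg, false);
	lru_cache_add_active_or_unevictable(page, vma);
	/* on failure before the rmap is set up: */
	mem_cgroup_cancel_charge(page, memcg);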

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/memcg_test.txt | 128 +------
 include/linux/memcontrol.h           |  49 +--
 include/linux/page_cgroup.h          |  43 +--
 include/linux/swap.h                 |  12 +-
 mm/filemap.c                         |   4 +-
 mm/memcontrol.c                      | 684 ++++++++++++-----------------------
 mm/memory.c                          |   2 -
 mm/migrate.c                         |  44 +--
 mm/rmap.c                            |   1 -
 mm/shmem.c                           |   8 +-
 mm/swap.c                            |   6 +
 mm/swap_state.c                      |   8 +-
 mm/swapfile.c                        |   7 +-
 mm/truncate.c                        |   9 -
 mm/vmscan.c                          |  12 +-
 mm/zswap.c                           |   2 +-
 16 files changed, 297 insertions(+), 722 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index bcf750d3cecd..8870b0212150 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -29,28 +29,13 @@ Please note that implementation details can be changed.
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 
-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
 
 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
- 	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().
 
 	4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	    memory.usage/memsw.usage changes to this page/swp_entry will be
-	 Case          (A)      (B)       (C)     (D)
-         Event
-       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
-          ===========================================
-          (3)        +1/+1    +1/+1     +1/+1   +1/+1
-          (4)          -       0/ 0       -     -1/ 0
-          (5)         0/-1     0/ 0     -1/-1    0/ 0
-          (6)          -       0/-1       -      0/-1
-          ===========================================
-       Result         1/ 1     1/ 1      1/ 1    1/ 1
-
-       In any cases, charges to this page should be 1/ 1.
-
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 
@@ -136,28 +74,20 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	    swp_entry's refcnt -= 1.
 
 
-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.
 
 5. Page Cache
    	Page Cache is charged at
 	- add_to_page_cache_locked().
 
-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().
 
 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.
 
@@ -170,56 +100,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.
 
 7. Page Migration
-   	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-          try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
-
-	  But zap_pte() (by exit or munmap) can be called while migration,
-	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+
+	mem_cgroup_migrate()
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a9a096858e0..806b8fa15c5f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);
 
-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg)
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 
-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
 {
 }
 
-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
 
-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }
 
@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-		struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..97b5c39a31c8 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -3,9 +3,9 @@
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED,	/* This page is charged to a memcg */
+	PCG_MEM,	/* This page holds a memory charge */
+	PCG_MEMSW,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };
 
@@ -44,42 +44,9 @@ static inline void __init page_cgroup_init(void)
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define TESTCLEARPCGFLAG(uname, lname)			\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return test_bit(PCG_USED, &pc->flags);
 }
 
 #else /* CONFIG_MEMCG */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 290905133078..94fd0b23f3f9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
@@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 114cd89c1cc2..c2f30ed8e95f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -233,7 +233,6 @@ void delete_from_page_cache(struct page *page)
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
 		freepage(page);
@@ -501,8 +500,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 602fe7207c2d..94d7c40b9f26 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -882,13 +882,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
-{
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
-}
-
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
@@ -909,13 +902,13 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 bool anon, int nr_pages)
+					 int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
-	if (anon)
+	if (PageAnon(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
@@ -1347,20 +1340,6 @@ out:
 	return lruvec;
 }
 
-/*
- * Following LRU functions are allowed to be used without PCG_LOCK.
- * Operations are called by routine of global LRU independently from memcg.
- * What we have to take care of here is validness of pc->mem_cgroup.
- *
- * Changes to pc->mem_cgroup happens when
- * 1. charge
- * 2. moving account
- * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
- * It is added to LRU before charge.
- * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
- * When moving account, the page is not on LRU. It's isolated.
- */
-
 /**
  * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
@@ -2261,22 +2240,14 @@ cleanup:
  *
  * Notes: Race condition
  *
- * We usually use lock_page_cgroup() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
+ * Charging occurs during page instantiation, while the page is
+ * unmapped and locked in page migration, or while the page table is
+ * locked in THP migration.  No race is possible.
  *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
+ * Uncharge happens to pages with zero references; no race is possible.
  *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
- *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check memcg->moving_account and detect there are possibility
- * of race or not. If there is, we take a lock.
+ * Charge moving between groups is protected by checking
+ * memcg->moving_account and taking the move_lock in the slowpath.
  */
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
@@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
+/*
+ * try_get_mem_cgroup_from_page - look up page's memcg association
+ * @page: the page
+ *
+ * Look up, get a css reference, and return the memcg that owns @page.
+ *
+ * The page must be locked to prevent racing with swap-in and page
+ * cache charges.  If coming from an unlocked page table, the caller
+ * must ensure the page is on the LRU or this can race with charging.
+ */
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		memcg = pc->mem_cgroup;
 		if (memcg && !css_tryget_online(&memcg->css))
@@ -2713,19 +2693,17 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			memcg = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
 	return memcg;
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare)
+			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
 	struct lruvec *lruvec;
 	bool was_on_lru = false;
 
-	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
 	 * we don't need page_cgroup_lock about tail pages, becase they are not
@@ -2747,8 +2725,22 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		}
 	}
 
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point:
+	 *
+	 * - the page is uncharged
+	 *
+	 * - the page is off-LRU
+	 *
+	 * - an anonymous fault has exclusive page access, except for
+	 *   a locked page table
+	 *
+	 * - a page cache insertion, a swapin fault, or a migration
+	 *   have the page locked
+	 */
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
+	pc->flags = PCG_USED | PCG_MEM | PCG_MEMSW;
 
 	if (lrucare) {
 		if (was_on_lru) {
@@ -2760,9 +2752,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
-
+	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -3395,7 +3385,6 @@ static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
 /*
  * Because tail pages are not marked as "used", set it. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -3416,7 +3405,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
+		pc->flags = head_pc->flags;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
@@ -3446,7 +3435,6 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	unsigned long flags;
 	int ret;
-	bool anon = PageAnon(page);
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -3460,15 +3448,13 @@ static int mem_cgroup_move_account(struct page *page,
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
-	lock_page_cgroup(pc);
-
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto unlock;
+		goto out;
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
+	if (!PageAnon(page) && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3482,15 +3468,19 @@ static int mem_cgroup_move_account(struct page *page,
 			       nr_pages);
 	}
 
-	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
+	mem_cgroup_charge_statistics(from, page, -nr_pages);
+
+	/*
+	 * It is safe to change pc->mem_cgroup here because the page
+	 * is referenced, charged, and isolated - we can't race with
+	 * uncharging, charging, migration, or LRU putback.
+	 */
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
+	mem_cgroup_charge_statistics(to, page, nr_pages);
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
-unlock:
-	unlock_page_cgroup(pc);
 	/*
 	 * check events
 	 */
@@ -3566,193 +3556,6 @@ out:
 	return ret;
 }
 
-static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
-				   unsigned int nr_pages,
-				   const enum charge_type ctype)
-{
-	struct memcg_batch_info *batch = NULL;
-	bool uncharge_memsw = true;
-
-	/* If swapout, usage of swap doesn't decrease */
-	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-		uncharge_memsw = false;
-
-	batch = &current->memcg_batch;
-	/*
-	 * In usual, we do css_get() when we remember memcg pointer.
-	 * But in this case, we keep res->usage until end of a series of
-	 * uncharges. Then, it's ok to ignore memcg's refcnt.
-	 */
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continuously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-
-	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
-
-	if (nr_pages > 1)
-		goto direct_uncharge;
-
-	/*
-	 * In typical case, batch->memcg == mem. This means we can
-	 * merge a series of uncharges to an uncharge of res_counter.
-	 * If not, we uncharge res_counter ony by one.
-	 */
-	if (batch->memcg != memcg)
-		goto direct_uncharge;
-	/* remember freed charge and uncharge it later */
-	batch->nr_pages++;
-	if (uncharge_memsw)
-		batch->memsw_nr_pages++;
-	return;
-direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (unlikely(batch->memcg != memcg))
-		memcg_oom_recover(memcg);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
-			     bool end_migration)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!PageCgroupUsed(pc)))
-		return NULL;
-
-	lock_page_cgroup(pc);
-
-	memcg = pc->mem_cgroup;
-
-	if (!PageCgroupUsed(pc))
-		goto unlock_out;
-
-	anon = PageAnon(page);
-
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_ANON:
-		/*
-		 * Generally PageAnon tells if it's the anon statistics to be
-		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
-		 * used before page reached the stage of being marked PageAnon.
-		 */
-		anon = true;
-		/* fallthrough */
-	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		/* See mem_cgroup_prepare_migration() */
-		if (page_mapped(page))
-			goto unlock_out;
-		/*
-		 * Pages under migration may not be uncharged.  But
-		 * end_migration() /must/ be the one uncharging the
-		 * unused post-migration page and so it has to call
-		 * here with the migration bit still set.  See the
-		 * res_counter handling below.
-		 */
-		if (!end_migration && PageCgroupMigration(pc))
-			goto unlock_out;
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
-		if (!PageAnon(page)) {	/* Shared memory */
-			if (page->mapping && !page_is_file_cache(page))
-				goto unlock_out;
-		} else if (page_mapped(page)) /* Anon */
-				goto unlock_out;
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
-
-	ClearPageCgroupUsed(pc);
-	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
-	 */
-
-	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
-	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
-	/*
-	 * Migration does not charge the res_counter for the
-	 * replacement page, so leave it alone when phasing out the
-	 * page that is unused after the migration.
-	 */
-	if (!end_migration)
-		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
-
-	return memcg;
-
-unlock_out:
-	unlock_page_cgroup(pc);
-	return NULL;
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	/*
-	 * If the page is in swap cache, uncharge should be deferred
-	 * to the swap path, which also properly accounts swap usage
-	 * and handles memcg lifetime.
-	 *
-	 * Note that this check is not stable and reclaim may add the
-	 * page to swap cache at any time after this.  However, if the
-	 * page is not in swap cache by the time page->mapcount hits
-	 * 0, there won't be any page table references to the swap
-	 * slot, and reclaim will free it and not actually write the
-	 * page to disk.
-	 */
-	if (PageSwapCache(page))
-		return;
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping, page);
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
-}
-
 /*
  * Batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
  * In that cases, pages are freed continuously and we can expect pages
@@ -3800,57 +3603,12 @@ void mem_cgroup_uncharge_end(void)
 	batch->memcg = NULL;
 }
 
-#ifdef CONFIG_SWAP
-/*
- * called after __delete_from_swap_cache() and drop "page" account.
- * memcg information is recorded to swap_cgroup of "ent"
- */
-void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
-{
-	struct mem_cgroup *memcg;
-	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
-
-	if (!swapout) /* this was a swap cache but the swap is unused ! */
-		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
-
-	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
-
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
-	if (do_swap_account && swapout && memcg)
-		swap_cgroup_record(ent, mem_cgroup_id(memcg));
-}
-#endif
-
 #ifdef CONFIG_MEMCG_SWAP
-/*
- * called from swap_entry_free(). remove record in swap_cgroup and
- * uncharge "memsw" account.
- */
-void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
+					 bool charge)
 {
-	struct mem_cgroup *memcg;
-	unsigned short id;
-
-	if (!do_swap_account)
-		return;
-
-	id = swap_cgroup_record(ent, 0);
-	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
-	if (memcg) {
-		/*
-		 * We uncharge this because swap is freed.  This memcg can
-		 * be obsolete one. We avoid calling css_tryget_online().
-		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
-	}
-	rcu_read_unlock();
+	int val = (charge) ? 1 : -1;
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -3902,169 +3660,6 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
-/*
- * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
- * page belongs to.
- */
-void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-				  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-
-	*memcgp = NULL;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	if (PageTransHuge(page))
-		nr_pages <<= compound_order(page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		css_get(&memcg->css);
-		/*
-		 * At migrating an anonymous page, its mapcount goes down
-		 * to 0 and uncharge() will be called. But, even if it's fully
-		 * unmapped, migration may fail and this page has to be
-		 * charged again. We set MIGRATION flag here and delay uncharge
-		 * until end_migration() is called
-		 *
-		 * Corner Case Thinking
-		 * A)
-		 * When the old page was mapped as Anon and it's unmap-and-freed
-		 * while migration was ongoing.
-		 * If unmap finds the old page, uncharge() of it will be delayed
-		 * until end_migration(). If unmap finds a new page, it's
-		 * uncharged when it make mapcount to be 1->0. If unmap code
-		 * finds swap_migration_entry, the new page will not be mapped
-		 * and end_migration() will find it(mapcount==0).
-		 *
-		 * B)
-		 * When the old page was mapped but migraion fails, the kernel
-		 * remaps it. A charge for it is kept by MIGRATION flag even
-		 * if mapcount goes down to 0. We can do remap successfully
-		 * without charging it again.
-		 *
-		 * C)
-		 * The "old" page is under lock_page() until the end of
-		 * migration, so, the old page itself will not be swapped-out.
-		 * If the new page is swapped out before end_migraton, our
-		 * hook to usual swap-out path will catch the event.
-		 */
-		if (PageAnon(page))
-			SetPageCgroupMigration(pc);
-	}
-	unlock_page_cgroup(pc);
-	/*
-	 * If the page is not charged at this point,
-	 * we return here.
-	 */
-	if (!memcg)
-		return;
-
-	*memcgp = memcg;
-	/*
-	 * We charge new page before it's used/mapped. So, even if unlock_page()
-	 * is called before end_migration, we can catch all events on this new
-	 * page. In the case new page is migrated but not remapped, new page's
-	 * mapcount will be finally 0 and we call uncharge in end_migration().
-	 */
-	/*
-	 * The page is committed to the memcg, but it's not actually
-	 * charged to the res_counter since we plan on replacing the
-	 * old one and only one page is going to be left afterwards.
-	 */
-	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-	struct page *used, *unused;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (!memcg)
-		return;
-
-	if (!migration_ok) {
-		used = oldpage;
-		unused = newpage;
-	} else {
-		used = newpage;
-		unused = oldpage;
-	}
-	anon = PageAnon(used);
-	__mem_cgroup_uncharge_common(unused,
-				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
-				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
-				     true);
-	css_put(&memcg->css);
-	/*
-	 * We disallowed uncharge of pages under migration because mapcount
-	 * of the page goes down to zero, temporarly.
-	 * Clear the flag and check the page should be charged.
-	 */
-	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
-	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * If a page is a file cache, radix-tree replacement is very atomic
-	 * and we can skip this check. When it was an Anon page, its mapcount
-	 * goes down to 0. But because we added MIGRATION flage, it's not
-	 * uncharged yet. There are several case but page->mapcount check
-	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
-	 * check. (see prepare_charge() also)
-	 */
-	if (anon)
-		mem_cgroup_uncharge_page(used);
-}
-
-/*
- * At replace page cache, newpage is not under any memcg but it's on
- * LRU. So, this function doesn't touch res_counter but handles LRU
- * in correct way. Both pages are locked so we cannot race with uncharge.
- */
-void mem_cgroup_replace_page_cache(struct page *oldpage,
-				  struct page *newpage)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(oldpage);
-	/* fix accounting on old pages */
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * When called from shmem_replace_page(), in some cases the
-	 * oldpage has already been charged, and in some cases not.
-	 */
-	if (!memcg)
-		return;
-	/*
-	 * Even if newpage->mapping was NULL before starting replacement,
-	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
-	 * LRU while we overwrite pc->mem_cgroup.
-	 */
-	commit_charge(newpage, memcg, 1, false, true);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -6239,9 +5834,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
-		 * Do only loose check w/o page_cgroup lock.
-		 * mem_cgroup_move_account() checks the pc is valid or not under
-		 * the lock.
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the pc is valid or
+		 * not under LRU exclusion.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
@@ -6700,6 +6295,67 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+#ifdef CONFIG_MEMCG_SWAP
+/**
+ * mem_cgroup_swapout - transfer a memsw charge to swap
+ * @page: page whose memsw charge to transfer
+ * @entry: swap entry to move the charge to
+ *
+ * Transfer the memsw charge of @page to @entry.
+ */
+void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+	struct page_cgroup *pc;
+	unsigned short oldid;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (!do_swap_account)
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Readahead page, never charged */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	VM_BUG_ON_PAGE(oldid, page);
+
+	pc->flags &= ~PCG_MEMSW;
+	css_get(&pc->mem_cgroup->css);
+	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
+}
+
+/**
+ * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * @entry: swap entry to uncharge
+ *
+ * Drop the memsw charge associated with @entry.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	if (!do_swap_account)
+		return;
+
+	id = swap_cgroup_record(entry, 0);
+	rcu_read_lock();
+	memcg = mem_cgroup_lookup(id);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(memcg, false);
+		css_put(&memcg->css);
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
@@ -6802,7 +6458,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+	commit_charge(page, memcg, nr_pages, lrucare);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6844,6 +6500,116 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_uncharge - uncharge a page
+ * @page: page to uncharge
+ *
+ * Uncharge a page previously charged with mem_cgroup_try_charge() and
+ * mem_cgroup_commit_charge().
+ */
+void mem_cgroup_uncharge(struct page *page)
+{
+	struct memcg_batch_info *batch;
+	unsigned int nr_pages = 1;
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Every final put_page() ends up here */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point, we have fully
+	 * exclusive access to the page.
+	 */
+	memcg = pc->mem_cgroup;
+	flags = pc->flags;
+	pc->flags = 0;
+
+	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
+	memcg_check_events(memcg, page);
+
+	batch = &current->memcg_batch;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	else if (batch->memcg != memcg)
+		goto uncharge;
+	if (nr_pages > 1)
+		goto uncharge;
+	if (!batch->do_batch)
+		goto uncharge;
+	if (test_thread_flag(TIF_MEMDIE))
+		goto uncharge;
+	if (flags & PCG_MEM)
+		batch->nr_pages++;
+	if (flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	return;
+uncharge:
+	if (flags & PCG_MEM)
+		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	if (flags & PCG_MEMSW)
+		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+	if (batch->memcg != memcg)
+		memcg_oom_recover(memcg);
+}
+
+/**
+ * mem_cgroup_migrate - migrate a charge to another page
+ * @oldpage: currently charged page
+ * @newpage: page to transfer the charge to
+ * @lrucare: page might be on LRU already
+ *
+ * Migrate the charge from @oldpage to @newpage.
+ *
+ * Both pages must be locked and @newpage->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare)
+{
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(oldpage);
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), oldpage);
+	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
+
+	if (PageTransHuge(oldpage)) {
+		nr_pages <<= compound_order(oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
+	}
+
+	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/memory.c b/mm/memory.c
index d66988d56caf..a4b5b17b54c9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		details = NULL;
 
 	BUG_ON(addr >= end);
-	mem_cgroup_uncharge_start();
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
-	mem_cgroup_uncharge_end();
 }
 
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 63f0cd559999..4a9991aeebe9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		newpage->mapping = NULL;
+		if (!PageAnon(newpage))
+			newpage->mapping = NULL;
 	} else {
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
-		page->mapping = NULL;
+		if (!PageAnon(page))
+			page->mapping = NULL;
+		mem_cgroup_migrate(page, newpage, false);
 	}
 
 	unlock_page(newpage);
@@ -797,7 +800,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
-	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
@@ -823,9 +825,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		lock_page(page);
 	}
 
-	/* charge against new page */
-	mem_cgroup_prepare_migration(page, newpage, &mem);
-
 	if (PageWriteback(page)) {
 		/*
 		 * Only in the case of a full synchronous migration is it
@@ -835,10 +834,10 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 */
 		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
-			goto uncharge;
+			goto out_unlock;
 		}
 		if (!force)
-			goto uncharge;
+			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -874,7 +873,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 			 */
 			remap_swapcache = 0;
 		} else {
-			goto uncharge;
+			goto out_unlock;
 		}
 	}
 
@@ -887,7 +886,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the page migration right away (proteced by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto uncharge;
+		goto out_unlock;
 	}
 
 	/*
@@ -906,7 +905,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto uncharge;
+			goto out_unlock;
 		}
 		goto skip_unmap;
 	}
@@ -925,10 +924,7 @@ skip_unmap:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-uncharge:
-	mem_cgroup_end_migration(mem, page, newpage,
-				 (rc == MIGRATEPAGE_SUCCESS ||
-				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
+out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1787,7 +1783,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
@@ -1853,15 +1848,6 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	/*
-	 * Traditional migration needs to prepare the memcg charge
-	 * transaction early to prevent the old page from being
-	 * uncharged when installing migration entries.  Here we can
-	 * save the potential rollback and start the charge transfer
-	 * only when migration is already known to end successfully.
-	 */
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
@@ -1889,14 +1875,10 @@ fail_putback:
 		goto fail_putback;
 	}
 
+	mem_cgroup_migrate(page, new_page, false);
+
 	page_remove_rmap(page);
 
-	/*
-	 * Finish the charge transaction under the page table lock to
-	 * prevent split_huge_page() from dividing up the charge
-	 * before it's fully transferred to the new page.
-	 */
-	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 07576e0b92ef..e10d60543e9b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1089,7 +1089,6 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		mem_cgroup_uncharge_page(page);
 		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
diff --git a/mm/shmem.c b/mm/shmem.c
index ea968bf84942..dc9eb434ea8e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -405,7 +405,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -433,7 +432,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -484,7 +482,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -511,7 +508,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 
@@ -809,7 +805,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -982,7 +978,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 		 */
 		oldpage = newpage;
 	} else {
-		mem_cgroup_replace_page_cache(oldpage, newpage);
+		mem_cgroup_migrate(oldpage, newpage, false);
 		lru_cache_add_anon(newpage);
 		*pagep = newpage;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index a98f48626359..3074210f245d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
@@ -915,6 +916,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	mem_cgroup_uncharge_start();
+
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
@@ -946,6 +949,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
+		mem_cgroup_uncharge(page);
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
@@ -955,6 +959,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	mem_cgroup_uncharge_end();
+
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
 EXPORT_SYMBOL(release_pages);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2972eee184a4..e160151da6b8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free(entry);
 			return 0;
 		}
 
@@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 		return 0;
 	}
 }
@@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free(entry);
 	page_cache_release(page);
 }
 
@@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0883b4912ff7..8798b2e0ac59 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	p = swap_info_get(entry);
 	if (p) {
-		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		if (page)
-			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
+		swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 6a78c814bebf..b352481c276d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -367,7 +365,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -389,7 +386,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 	cleancache_invalidate_inode(mapping);
@@ -488,7 +484,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -517,7 +512,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -548,7 +542,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -597,7 +590,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -650,7 +642,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe8eb67..521f7eab1798 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -571,9 +571,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -594,7 +595,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__clear_page_locked(page);
 free_it:
+		mem_cgroup_uncharge(page);
 		nr_reclaimed++;
 
 		/*
@@ -1126,12 +1127,13 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	mem_cgroup_uncharge_end();
 
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
-	mem_cgroup_uncharge_end();
+
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_congested += nr_congested;
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
@@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
@@ -1650,6 +1654,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
diff --git a/mm/zswap.c b/mm/zswap.c
index 008388fe7b0f..333d70c66093 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
-- 
2.0.0



* [patch 13/13] mm: memcontrol: rewrite uncharge API
@ 2014-06-18 20:40   ` Johannes Weiner
  0 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-18 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had
their page->mapping established, uncharges had to happen when the page
type could still be known from the context: unmap for anonymous pages,
page cache removal for file and shmem pages, and swap cache truncation
for swap pages.  However, these operations happen well before the page
is actually freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(),
when we know for sure that nobody is looking at the page anymore.
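
To illustrate the new pattern, here is a condensed sketch of the
__page_cache_release() hunk from mm/swap.c further down (details of
the LRU handling elided):

	static void __page_cache_release(struct page *page)
	{
		if (PageLRU(page)) {
			/* zone->lru_lock, del_page_from_lru_list() */
		}
		/* refcount is zero, nobody else can see the page */
		mem_cgroup_uncharge(page);
	}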

For page migration, introduce mem_cgroup_migrate(), which is called
after the migration is successful and the new page is fully rmapped.
Because the old page is no longer uncharged after migration, prevent
double charges by decoupling the page's memcg association (PCG_USED
and pc->mem_cgroup) from the page holding an actual charge.  The new
bits PCG_MEM and PCG_MEMSW represent the respective charges and are
transferred to the new page during migration.
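
The transfer itself is then a simple flag handoff; condensed from the
new mem_cgroup_migrate() in the mm/memcontrol.c hunk below:

	pc = lookup_page_cgroup(oldpage);
	if (!PageCgroupUsed(pc))
		return;
	/* take the charge bits off the old page... */
	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
	/* ...and commit them against the new page */
	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);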

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache().
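
In replace_page_cache_page(), that is now a single call, with @lrucare
set because the new page may already be on the LRU at that point (see
the mm/filemap.c hunk below):

	mem_cgroup_migrate(old, new, true);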

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
entry before the final put_page() in page reclaim.
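
The ordering is visible in the __remove_mapping() hunk in mm/vmscan.c
below; condensed, with reclaim holding the last reference to the page:

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };

		mem_cgroup_swapout(page, swap);	/* PCG_MEMSW -> swap entry */
		__delete_from_swap_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		swapcache_free(swap);		/* page parameter is gone */
	}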

Finally, page_cgroup changes are now protected by whatever protection
the page itself offers: anonymous pages are charged under the page
table lock, whereas page cache insertions, swapin, and migration hold
the page lock.  Uncharging happens under full exclusion with no
outstanding references.  Charging and uncharging also ensure that the
page is off-LRU, which serializes against charge migration.  Remove
the very costly page_cgroup lock and set pc->flags non-atomically.
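
As a summary of the serialization rules after this patch - a
restatement of the above, not new semantics:

	/*
	 * pc->mem_cgroup and pc->flags are stable because:
	 *
	 *   anon fault:        page table lock
	 *   page cache insert: page lock
	 *   swapin:            page lock
	 *   migration:         both pages locked
	 *   uncharge:          zero references, full exclusion
	 *   move_account:      page isolated from the LRU + move_lock
	 */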

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/memcg_test.txt | 128 +------
 include/linux/memcontrol.h           |  49 +--
 include/linux/page_cgroup.h          |  43 +--
 include/linux/swap.h                 |  12 +-
 mm/filemap.c                         |   4 +-
 mm/memcontrol.c                      | 684 ++++++++++++-----------------------
 mm/memory.c                          |   2 -
 mm/migrate.c                         |  44 +--
 mm/rmap.c                            |   1 -
 mm/shmem.c                           |   8 +-
 mm/swap.c                            |   6 +
 mm/swap_state.c                      |   8 +-
 mm/swapfile.c                        |   7 +-
 mm/truncate.c                        |   9 -
 mm/vmscan.c                          |  12 +-
 mm/zswap.c                           |   2 +-
 16 files changed, 297 insertions(+), 722 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index bcf750d3cecd..8870b0212150 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -29,28 +29,13 @@ Please note that implementation details can be changed.
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 
-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
 
 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
- 	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().
 
 	4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	    memory.usage/memsw.usage changes to this page/swp_entry will be
-	 Case          (A)      (B)       (C)     (D)
-         Event
-       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
-          ===========================================
-          (3)        +1/+1    +1/+1     +1/+1   +1/+1
-          (4)          -       0/ 0       -     -1/ 0
-          (5)         0/-1     0/ 0     -1/-1    0/ 0
-          (6)          -       0/-1       -      0/-1
-          ===========================================
-       Result         1/ 1     1/ 1      1/ 1    1/ 1
-
-       In any cases, charges to this page should be 1/ 1.
-
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 
@@ -136,28 +74,20 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	    swp_entry's refcnt -= 1.
 
 
-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.
 
 5. Page Cache
    	Page Cache is charged at
 	- add_to_page_cache_locked().
 
-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().
 
 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But a brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.
 
@@ -170,56 +100,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.
 
 7. Page Migration
-   	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-          try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
-
-	  But zap_pte() (by exit or munmap) can be called while migration,
-	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+
+	mem_cgroup_migrate()
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a9a096858e0..806b8fa15c5f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);
 
-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg)
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 
-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
 {
 }
 
-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
 
-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }
 
@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-		struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..97b5c39a31c8 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -3,9 +3,9 @@
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED,	/* This page is charged to a memcg */
+	PCG_MEM,	/* This page holds a memory charge */
+	PCG_MEMSW,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };
 
@@ -44,42 +44,9 @@ static inline void __init page_cgroup_init(void)
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define TESTCLEARPCGFLAG(uname, lname)			\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return test_bit(PCG_USED, &pc->flags);
 }
 
 #else /* CONFIG_MEMCG */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 290905133078..94fd0b23f3f9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
@@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 114cd89c1cc2..c2f30ed8e95f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -233,7 +233,6 @@ void delete_from_page_cache(struct page *page)
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
 		freepage(page);
@@ -501,8 +500,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 602fe7207c2d..94d7c40b9f26 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -882,13 +882,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
-{
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
-}
-
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
@@ -909,13 +902,13 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 bool anon, int nr_pages)
+					 int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
-	if (anon)
+	if (PageAnon(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
@@ -1347,20 +1340,6 @@ out:
 	return lruvec;
 }
 
-/*
- * Following LRU functions are allowed to be used without PCG_LOCK.
- * Operations are called by routine of global LRU independently from memcg.
- * What we have to take care of here is validness of pc->mem_cgroup.
- *
- * Changes to pc->mem_cgroup happens when
- * 1. charge
- * 2. moving account
- * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
- * It is added to LRU before charge.
- * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
- * When moving account, the page is not on LRU. It's isolated.
- */
-
 /**
  * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
@@ -2261,22 +2240,14 @@ cleanup:
  *
  * Notes: Race condition
  *
- * We usually use lock_page_cgroup() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
+ * Charging occurs during page instantiation, while the page is
+ * unmapped and locked in page migration, or while the page table is
+ * locked in THP migration.  No race is possible.
  *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
+ * Uncharge happens to pages with zero references, no race possible.
  *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
- *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check memcg->moving_account and detect there are possibility
- * of race or not. If there is, we take a lock.
+ * Charge moving between groups is protected by checking the memcg's
+ * moving_account counter and taking the move_lock in the slowpath.
  */
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
@@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
+/*
+ * try_get_mem_cgroup_from_page - look up page's memcg association
+ * @page: the page
+ *
+ * Look up, get a css reference, and return the memcg that owns @page.
+ *
+ * The page must be locked to prevent racing with swap-in and page
+ * cache charges.  If coming from an unlocked page table, the caller
+ * must ensure the page is on the LRU or this can race with charging.
+ */
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		memcg = pc->mem_cgroup;
 		if (memcg && !css_tryget_online(&memcg->css))
@@ -2713,19 +2693,17 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			memcg = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
 	return memcg;
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare)
+			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
 	struct lruvec *lruvec;
 	bool was_on_lru = false;
 
-	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
 	 * we don't need page_cgroup_lock about tail pages, because they are not
@@ -2747,8 +2725,22 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		}
 	}
 
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point:
+	 *
+	 * - the page is uncharged
+	 *
+	 * - the page is off-LRU
+	 *
+	 * - an anonymous fault has exclusive page access, except for
+	 *   a locked page table
+	 *
+	 * - a page cache insertion, a swapin fault, or a migration
+	 *   have the page locked
+	 */
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
+	pc->flags = PCG_USED | PCG_MEM | PCG_MEMSW;
 
 	if (lrucare) {
 		if (was_on_lru) {
@@ -2760,9 +2752,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
-
+	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -3395,7 +3385,6 @@ static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
 /*
  * Because tail pages are not marked as "used", set it. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -3416,7 +3405,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
+		pc->flags = head_pc->flags;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
@@ -3446,7 +3435,6 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	unsigned long flags;
 	int ret;
-	bool anon = PageAnon(page);
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -3460,15 +3448,13 @@ static int mem_cgroup_move_account(struct page *page,
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
-	lock_page_cgroup(pc);
-
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto unlock;
+		goto out;
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
+	if (!PageAnon(page) && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3482,15 +3468,19 @@ static int mem_cgroup_move_account(struct page *page,
 			       nr_pages);
 	}
 
-	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
+	mem_cgroup_charge_statistics(from, page, -nr_pages);
+
+	/*
+	 * It is safe to change pc->mem_cgroup here because the page
+	 * is referenced, charged, and isolated - we can't race with
+	 * uncharging, charging, migration, or LRU putback.
+	 */
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
+	mem_cgroup_charge_statistics(to, page, nr_pages);
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
-unlock:
-	unlock_page_cgroup(pc);
 	/*
 	 * check events
 	 */
@@ -3566,193 +3556,6 @@ out:
 	return ret;
 }
 
-static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
-				   unsigned int nr_pages,
-				   const enum charge_type ctype)
-{
-	struct memcg_batch_info *batch = NULL;
-	bool uncharge_memsw = true;
-
-	/* If swapout, usage of swap doesn't decrease */
-	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-		uncharge_memsw = false;
-
-	batch = &current->memcg_batch;
-	/*
-	 * In usual, we do css_get() when we remember memcg pointer.
-	 * But in this case, we keep res->usage until end of a series of
-	 * uncharges. Then, it's ok to ignore memcg's refcnt.
-	 */
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continuously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-
-	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
-
-	if (nr_pages > 1)
-		goto direct_uncharge;
-
-	/*
-	 * In typical case, batch->memcg == mem. This means we can
-	 * merge a series of uncharges to an uncharge of res_counter.
-	 * If not, we uncharge res_counter ony by one.
-	 */
-	if (batch->memcg != memcg)
-		goto direct_uncharge;
-	/* remember freed charge and uncharge it later */
-	batch->nr_pages++;
-	if (uncharge_memsw)
-		batch->memsw_nr_pages++;
-	return;
-direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (unlikely(batch->memcg != memcg))
-		memcg_oom_recover(memcg);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
-			     bool end_migration)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!PageCgroupUsed(pc)))
-		return NULL;
-
-	lock_page_cgroup(pc);
-
-	memcg = pc->mem_cgroup;
-
-	if (!PageCgroupUsed(pc))
-		goto unlock_out;
-
-	anon = PageAnon(page);
-
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_ANON:
-		/*
-		 * Generally PageAnon tells if it's the anon statistics to be
-		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
-		 * used before page reached the stage of being marked PageAnon.
-		 */
-		anon = true;
-		/* fallthrough */
-	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		/* See mem_cgroup_prepare_migration() */
-		if (page_mapped(page))
-			goto unlock_out;
-		/*
-		 * Pages under migration may not be uncharged.  But
-		 * end_migration() /must/ be the one uncharging the
-		 * unused post-migration page and so it has to call
-		 * here with the migration bit still set.  See the
-		 * res_counter handling below.
-		 */
-		if (!end_migration && PageCgroupMigration(pc))
-			goto unlock_out;
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
-		if (!PageAnon(page)) {	/* Shared memory */
-			if (page->mapping && !page_is_file_cache(page))
-				goto unlock_out;
-		} else if (page_mapped(page)) /* Anon */
-				goto unlock_out;
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
-
-	ClearPageCgroupUsed(pc);
-	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
-	 */
-
-	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
-	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
-	/*
-	 * Migration does not charge the res_counter for the
-	 * replacement page, so leave it alone when phasing out the
-	 * page that is unused after the migration.
-	 */
-	if (!end_migration)
-		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
-
-	return memcg;
-
-unlock_out:
-	unlock_page_cgroup(pc);
-	return NULL;
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	/*
-	 * If the page is in swap cache, uncharge should be deferred
-	 * to the swap path, which also properly accounts swap usage
-	 * and handles memcg lifetime.
-	 *
-	 * Note that this check is not stable and reclaim may add the
-	 * page to swap cache at any time after this.  However, if the
-	 * page is not in swap cache by the time page->mapcount hits
-	 * 0, there won't be any page table references to the swap
-	 * slot, and reclaim will free it and not actually write the
-	 * page to disk.
-	 */
-	if (PageSwapCache(page))
-		return;
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping, page);
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
-}
-
 /*
  * Batch_start/batch_end is called in unmap_page_range/invalidate/truncate.
  * In those cases, pages are freed continuously and we can expect pages
@@ -3800,57 +3603,12 @@ void mem_cgroup_uncharge_end(void)
 	batch->memcg = NULL;
 }
 
-#ifdef CONFIG_SWAP
-/*
- * called after __delete_from_swap_cache() and drop "page" account.
- * memcg information is recorded to swap_cgroup of "ent"
- */
-void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
-{
-	struct mem_cgroup *memcg;
-	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
-
-	if (!swapout) /* this was a swap cache but the swap is unused ! */
-		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
-
-	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
-
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
-	if (do_swap_account && swapout && memcg)
-		swap_cgroup_record(ent, mem_cgroup_id(memcg));
-}
-#endif
-
 #ifdef CONFIG_MEMCG_SWAP
-/*
- * called from swap_entry_free(). remove record in swap_cgroup and
- * uncharge "memsw" account.
- */
-void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
+					 bool charge)
 {
-	struct mem_cgroup *memcg;
-	unsigned short id;
-
-	if (!do_swap_account)
-		return;
-
-	id = swap_cgroup_record(ent, 0);
-	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
-	if (memcg) {
-		/*
-		 * We uncharge this because swap is freed.  This memcg can
-		 * be obsolete one. We avoid calling css_tryget_online().
-		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
-	}
-	rcu_read_unlock();
+	int val = (charge) ? 1 : -1;
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -3902,169 +3660,6 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
-/*
- * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
- * page belongs to.
- */
-void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-				  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-
-	*memcgp = NULL;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	if (PageTransHuge(page))
-		nr_pages <<= compound_order(page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		css_get(&memcg->css);
-		/*
-		 * At migrating an anonymous page, its mapcount goes down
-		 * to 0 and uncharge() will be called. But, even if it's fully
-		 * unmapped, migration may fail and this page has to be
-		 * charged again. We set MIGRATION flag here and delay uncharge
-		 * until end_migration() is called
-		 *
-		 * Corner Case Thinking
-		 * A)
-		 * When the old page was mapped as Anon and it's unmap-and-freed
-		 * while migration was ongoing.
-		 * If unmap finds the old page, uncharge() of it will be delayed
-		 * until end_migration(). If unmap finds a new page, it's
-		 * uncharged when it make mapcount to be 1->0. If unmap code
-		 * finds swap_migration_entry, the new page will not be mapped
-		 * and end_migration() will find it(mapcount==0).
-		 *
-		 * B)
-		 * When the old page was mapped but migraion fails, the kernel
-		 * remaps it. A charge for it is kept by MIGRATION flag even
-		 * if mapcount goes down to 0. We can do remap successfully
-		 * without charging it again.
-		 *
-		 * C)
-		 * The "old" page is under lock_page() until the end of
-		 * migration, so, the old page itself will not be swapped-out.
-		 * If the new page is swapped out before end_migraton, our
-		 * hook to usual swap-out path will catch the event.
-		 */
-		if (PageAnon(page))
-			SetPageCgroupMigration(pc);
-	}
-	unlock_page_cgroup(pc);
-	/*
-	 * If the page is not charged at this point,
-	 * we return here.
-	 */
-	if (!memcg)
-		return;
-
-	*memcgp = memcg;
-	/*
-	 * We charge new page before it's used/mapped. So, even if unlock_page()
-	 * is called before end_migration, we can catch all events on this new
-	 * page. In the case new page is migrated but not remapped, new page's
-	 * mapcount will be finally 0 and we call uncharge in end_migration().
-	 */
-	/*
-	 * The page is committed to the memcg, but it's not actually
-	 * charged to the res_counter since we plan on replacing the
-	 * old one and only one page is going to be left afterwards.
-	 */
-	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-	struct page *used, *unused;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (!memcg)
-		return;
-
-	if (!migration_ok) {
-		used = oldpage;
-		unused = newpage;
-	} else {
-		used = newpage;
-		unused = oldpage;
-	}
-	anon = PageAnon(used);
-	__mem_cgroup_uncharge_common(unused,
-				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
-				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
-				     true);
-	css_put(&memcg->css);
-	/*
-	 * We disallowed uncharge of pages under migration because mapcount
-	 * of the page goes down to zero, temporarly.
-	 * Clear the flag and check the page should be charged.
-	 */
-	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
-	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * If a page is a file cache, radix-tree replacement is very atomic
-	 * and we can skip this check. When it was an Anon page, its mapcount
-	 * goes down to 0. But because we added MIGRATION flage, it's not
-	 * uncharged yet. There are several case but page->mapcount check
-	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
-	 * check. (see prepare_charge() also)
-	 */
-	if (anon)
-		mem_cgroup_uncharge_page(used);
-}
-
-/*
- * At replace page cache, newpage is not under any memcg but it's on
- * LRU. So, this function doesn't touch res_counter but handles LRU
- * in correct way. Both pages are locked so we cannot race with uncharge.
- */
-void mem_cgroup_replace_page_cache(struct page *oldpage,
-				  struct page *newpage)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(oldpage);
-	/* fix accounting on old pages */
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * When called from shmem_replace_page(), in some cases the
-	 * oldpage has already been charged, and in some cases not.
-	 */
-	if (!memcg)
-		return;
-	/*
-	 * Even if newpage->mapping was NULL before starting replacement,
-	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
-	 * LRU while we overwrite pc->mem_cgroup.
-	 */
-	commit_charge(newpage, memcg, 1, false, true);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -6239,9 +5834,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
-		 * Do only loose check w/o page_cgroup lock.
-		 * mem_cgroup_move_account() checks the pc is valid or not under
-		 * the lock.
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the pc is valid or
+		 * not under LRU exclusion.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
@@ -6700,6 +6295,67 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+#ifdef CONFIG_MEMCG_SWAP
+/**
+ * mem_cgroup_swapout - transfer a memsw charge to swap
+ * @page: page whose memsw charge to transfer
+ * @entry: swap entry to move the charge to
+ *
+ * Transfer the memsw charge of @page to @entry.
+ */
+void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+	struct page_cgroup *pc;
+	unsigned short oldid;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (!do_swap_account)
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Readahead page, never charged */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	VM_BUG_ON_PAGE(oldid, page);
+
+	pc->flags &= ~PCG_MEMSW;
+	css_get(&pc->mem_cgroup->css);
+	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
+}
+
+/**
+ * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * @entry: swap entry to uncharge
+ *
+ * Drop the memsw charge associated with @entry.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	if (!do_swap_account)
+		return;
+
+	id = swap_cgroup_record(entry, 0);
+	rcu_read_lock();
+	memcg = mem_cgroup_lookup(id);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(memcg, false);
+		css_put(&memcg->css);
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
@@ -6802,7 +6458,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+	commit_charge(page, memcg, nr_pages, lrucare);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6844,6 +6500,116 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_uncharge - uncharge a page
+ * @page: page to uncharge
+ *
+ * Uncharge a page previously charged with mem_cgroup_try_charge() and
+ * mem_cgroup_commit_charge().
+ */
+void mem_cgroup_uncharge(struct page *page)
+{
+	struct memcg_batch_info *batch;
+	unsigned int nr_pages = 1;
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Every final put_page() ends up here */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point, we have fully
+	 * exclusive access to the page.
+	 */
+	memcg = pc->mem_cgroup;
+	flags = pc->flags;
+	pc->flags = 0;
+
+	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
+	memcg_check_events(memcg, page);
+
+	batch = &current->memcg_batch;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	else if (batch->memcg != memcg)
+		goto uncharge;
+	if (nr_pages > 1)
+		goto uncharge;
+	if (!batch->do_batch)
+		goto uncharge;
+	if (test_thread_flag(TIF_MEMDIE))
+		goto uncharge;
+	if (flags & PCG_MEM)
+		batch->nr_pages++;
+	if (flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	return;
+uncharge:
+	if (flags & PCG_MEM)
+		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	if (flags & PCG_MEMSW)
+		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+	if (batch->memcg != memcg)
+		memcg_oom_recover(memcg);
+}
+
+/**
+ * mem_cgroup_migrate - migrate a charge to another page
+ * @oldpage: currently charged page
+ * @newpage: page to transfer the charge to
+ * @lrucare: page might be on LRU already
+ *
+ * Migrate the charge from @oldpage to @newpage.
+ *
+ * Both pages must be locked, @newpage->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare)
+{
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(oldpage);
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), oldpage);
+	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
+
+	if (PageTransHuge(oldpage)) {
+		nr_pages <<= compound_order(oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
+	}
+
+	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/memory.c b/mm/memory.c
index d66988d56caf..a4b5b17b54c9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		details = NULL;
 
 	BUG_ON(addr >= end);
-	mem_cgroup_uncharge_start();
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
-	mem_cgroup_uncharge_end();
 }
 
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 63f0cd559999..4a9991aeebe9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		newpage->mapping = NULL;
+		if (!PageAnon(newpage))
+			newpage->mapping = NULL;
 	} else {
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
-		page->mapping = NULL;
+		if (!PageAnon(page))
+			page->mapping = NULL;
+		mem_cgroup_migrate(page, newpage, false);
 	}
 
 	unlock_page(newpage);
@@ -797,7 +800,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
-	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
@@ -823,9 +825,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		lock_page(page);
 	}
 
-	/* charge against new page */
-	mem_cgroup_prepare_migration(page, newpage, &mem);
-
 	if (PageWriteback(page)) {
 		/*
 		 * Only in the case of a full synchronous migration is it
@@ -835,10 +834,10 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 */
 		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
-			goto uncharge;
+			goto out_unlock;
 		}
 		if (!force)
-			goto uncharge;
+			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -874,7 +873,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 			 */
 			remap_swapcache = 0;
 		} else {
-			goto uncharge;
+			goto out_unlock;
 		}
 	}
 
@@ -887,7 +886,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the page migration right away (proteced by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto uncharge;
+		goto out_unlock;
 	}
 
 	/*
@@ -906,7 +905,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto uncharge;
+			goto out_unlock;
 		}
 		goto skip_unmap;
 	}
@@ -925,10 +924,7 @@ skip_unmap:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-uncharge:
-	mem_cgroup_end_migration(mem, page, newpage,
-				 (rc == MIGRATEPAGE_SUCCESS ||
-				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
+out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1787,7 +1783,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
@@ -1853,15 +1848,6 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	/*
-	 * Traditional migration needs to prepare the memcg charge
-	 * transaction early to prevent the old page from being
-	 * uncharged when installing migration entries.  Here we can
-	 * save the potential rollback and start the charge transfer
-	 * only when migration is already known to end successfully.
-	 */
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
@@ -1889,14 +1875,10 @@ fail_putback:
 		goto fail_putback;
 	}
 
+	mem_cgroup_migrate(page, new_page, false);
+
 	page_remove_rmap(page);
 
-	/*
-	 * Finish the charge transaction under the page table lock to
-	 * prevent split_huge_page() from dividing up the charge
-	 * before it's fully transferred to the new page.
-	 */
-	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 07576e0b92ef..e10d60543e9b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1089,7 +1089,6 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		mem_cgroup_uncharge_page(page);
 		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
diff --git a/mm/shmem.c b/mm/shmem.c
index ea968bf84942..dc9eb434ea8e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -405,7 +405,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -433,7 +432,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -484,7 +482,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -511,7 +508,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 
@@ -809,7 +805,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -982,7 +978,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 		 */
 		oldpage = newpage;
 	} else {
-		mem_cgroup_replace_page_cache(oldpage, newpage);
+		mem_cgroup_migrate(oldpage, newpage, false);
 		lru_cache_add_anon(newpage);
 		*pagep = newpage;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index a98f48626359..3074210f245d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
@@ -915,6 +916,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	mem_cgroup_uncharge_start();
+
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
@@ -946,6 +949,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
+		mem_cgroup_uncharge(page);
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
@@ -955,6 +959,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	mem_cgroup_uncharge_end();
+
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
 EXPORT_SYMBOL(release_pages);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2972eee184a4..e160151da6b8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free(entry);
 			return 0;
 		}
 
@@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 		return 0;
 	}
 }
@@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free(entry);
 	page_cache_release(page);
 }
 
@@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0883b4912ff7..8798b2e0ac59 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	p = swap_info_get(entry);
 	if (p) {
-		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		if (page)
-			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
+		swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 6a78c814bebf..b352481c276d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -367,7 +365,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -389,7 +386,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 	cleancache_invalidate_inode(mapping);
@@ -488,7 +484,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -517,7 +512,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -548,7 +542,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -597,7 +590,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -650,7 +642,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe8eb67..521f7eab1798 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -571,9 +571,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -594,7 +595,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__clear_page_locked(page);
 free_it:
+		mem_cgroup_uncharge(page);
 		nr_reclaimed++;
 
 		/*
@@ -1126,12 +1127,13 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	mem_cgroup_uncharge_end();
 
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
-	mem_cgroup_uncharge_end();
+
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_congested += nr_congested;
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
@@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
@@ -1650,6 +1654,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
diff --git a/mm/zswap.c b/mm/zswap.c
index 008388fe7b0f..333d70c66093 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 141+ messages in thread

* [PATCH -mm] memcg: mem_cgroup_charge_statistics needs preempt_disable
  2014-06-18 20:40   ` Johannes Weiner
@ 2014-06-20 16:36     ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-06-20 16:36 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: Hugh Dickins, Tejun Heo, Vladimir Davydov, linux-mm, LKML

Preemption was previously disabled by lock_page_cgroup, which has
been removed by "mm: memcontrol: rewrite uncharge API".
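
A minimal sketch of the rule at stake (illustrative fragment, not taken
from the patch; the counter matches the one updated below):

	/*
	 * __this_cpu_add() assumes the caller already runs with
	 * preemption disabled; otherwise the task could migrate to
	 * another CPU between computing the percpu address and the
	 * add itself, corrupting a remote CPU's counter.
	 */
	preempt_disable();
	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
	preempt_enable();

	/*
	 * this_cpu_add() would be the preemption-safe variant; it
	 * does the bracketing (or uses a single percpu instruction)
	 * internally.
	 */
	this_cpu_add(memcg->stat->nr_page_events, nr_pages);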

This fixes a flood of splats like this:
[    3.149371] BUG: using __this_cpu_add() in preemptible [00000000] code: udevd/1271
[    3.151458] caller is __this_cpu_preempt_check+0x13/0x15
[    3.152927] CPU: 0 PID: 1271 Comm: udevd Not tainted 3.15.0-test1 #366
[    3.154637] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[    3.156788]  0000000000000000 ffff88000005fba8 ffffffff814efe3f 0000000000000000
[    3.158810]  ffff88000005fbd8 ffffffff8125b969 ffff880007413448 0000000000000001
[    3.160836]  ffffea00001e8c00 0000000000000001 ffff88000005fbe8 ffffffff8125b9a8
[    3.162950] Call Trace:
[    3.163598]  [<ffffffff814efe3f>] dump_stack+0x4e/0x7a
[    3.164942]  [<ffffffff8125b969>] check_preemption_disabled+0xd2/0xe5
[    3.166618]  [<ffffffff8125b9a8>] __this_cpu_preempt_check+0x13/0x15
[    3.168267]  [<ffffffff8112b630>] mem_cgroup_charge_statistics.isra.36+0xb5/0xc6
[    3.170169]  [<ffffffff8112d2c5>] commit_charge+0x23c/0x256
[    3.171823]  [<ffffffff8113101b>] mem_cgroup_commit_charge+0xb8/0xd7
[    3.173838]  [<ffffffff810f5dab>] shmem_getpage_gfp+0x399/0x605
[    3.175363]  [<ffffffff810f7456>] shmem_write_begin+0x3d/0x58
[    3.176854]  [<ffffffff810e1361>] generic_perform_write+0xbc/0x192
[    3.178445]  [<ffffffff8114a086>] ? file_update_time+0x34/0xac
[    3.179952]  [<ffffffff810e2ae4>] __generic_file_aio_write+0x2c0/0x300
[    3.181655]  [<ffffffff810e2b76>] generic_file_aio_write+0x52/0xbd
[    3.183234]  [<ffffffff81133944>] do_sync_write+0x59/0x78
[    3.184630]  [<ffffffff81133ea8>] vfs_write+0xc4/0x181
[    3.185957]  [<ffffffff81134801>] SyS_write+0x4a/0x91
[    3.187258]  [<ffffffff814fd30e>] tracesys+0xd0/0xd5

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
Andrew,
the changelog is quite modest but this should be folded into
mm-memcontrol-rewrite-uncharge-api.patch anyway. If you want a
regular patch, please let me know.

 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 241cf4f91e24..cbf373085b6c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -904,6 +904,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
 					 int nr_pages)
 {
+	preempt_disable();
+
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
@@ -928,6 +930,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	}
 
 	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
+	preempt_enable();
 }
 
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-06-18 20:40   ` Johannes Weiner
@ 2014-06-21  0:34     ` Sasha Levin
  -1 siblings, 0 replies; 141+ messages in thread
From: Sasha Levin @ 2014-06-21  0:34 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On 06/18/2014 04:40 PM, Johannes Weiner wrote:
> The memcg uncharging code that is involved towards the end of a page's
> lifetime - truncation, reclaim, swapout, migration - is impressively
> complicated and fragile.
> 
> Because anonymous and file pages were always charged before they had
> their page->mapping established, uncharges had to happen when the page
> type could still be known from the context; as in unmap for anonymous,
> page cache removal for file and shmem pages, and swap cache truncation
> for swap pages.  However, these operations happen well before the page
> is actually freed, and so a lot of synchronization is necessary:
> 
> - Charging, uncharging, page migration, and charge migration all need
>   to take a per-page bit spinlock as they could race with uncharging.
> 
> - Swap cache truncation happens during both swap-in and swap-out, and
>   possibly repeatedly before the page is actually freed.  This means
>   that the memcg swapout code is called from many contexts that make
>   no sense and it has to figure out the direction from page state to
>   make sure memory and memory+swap are always correctly charged.
> 
> - On page migration, the old page might be unmapped but then reused,
>   so memcg code has to prevent untimely uncharging in that case.
>   Because this code - which should be a simple charge transfer - is so
>   special-cased, it is not reusable for replace_page_cache().
> 
> But now that charged pages always have a page->mapping, introduce
> mem_cgroup_uncharge(), which is called after the final put_page(),
> when we know for sure that nobody is looking at the page anymore.
> 
> For page migration, introduce mem_cgroup_migrate(), which is called
> after the migration is successful and the new page is fully rmapped.
> Because the old page is no longer uncharged after migration, prevent
> double charges by decoupling the page's memcg association (PCG_USED
> and pc->mem_cgroup) from the page holding an actual charge.  The new
> bits PCG_MEM and PCG_MEMSW represent the respective charges and are
> transferred to the new page during migration.
> 
> mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> which gets rid of mem_cgroup_replace_page_cache().
> 
> Swap accounting is massively simplified: because the page is no longer
> uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
> can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
> entry before the final put_page() in page reclaim.
> 
> Finally, page_cgroup changes are now protected by whatever protection
> the page itself offers: anonymous pages are charged under the page
> table lock, whereas page cache insertions, swapin, and migration hold
> the page lock.  Uncharging happens under full exclusion with no
> outstanding references.  Charging and uncharging also ensure that the
> page is off-LRU, which serializes against charge migration.  Remove
> the very costly page_cgroup lock and set pc->flags non-atomically.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
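
A condensed sketch of the new entry points named above, with call sites
taken from the hunks in this patch (surrounding details trimmed):

	/* Freeing: uncharge once, after the last reference is gone
	 * (__page_cache_release(), release_pages(), shrink_page_list()). */
	mem_cgroup_uncharge(page);

	/* Migration: transfer the existing charge instead of
	 * uncharging and recharging (move_to_new_page()). */
	if (rc == MIGRATEPAGE_SUCCESS)
		mem_cgroup_migrate(page, newpage, false);

	/* Reclaim swapout: hand the memory+swap charge over to the
	 * swap entry before the final put_page() (__remove_mapping()). */
	mem_cgroup_swapout(page, swap);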

Hi Johannes,

I'm seeing the following when booting a VM, bisection pointed me to this
patch.

[   32.830823] BUG: using __this_cpu_add() in preemptible [00000000] code: mkdir/8677
[   32.831522] caller is __this_cpu_preempt_check+0x13/0x20
[   32.832079] CPU: 35 PID: 8677 Comm: mkdir Not tainted 3.16.0-rc1-next-20140620-sasha-00023-g8fc12ed #700
[   32.832898]  ffffffffb27ea69d ffff8800cb91b618 ffffffffb151820b 0000000000000002
[   32.833607]  0000000000000023 ffff8800cb91b648 ffffffffaeb4c799 ffff88006efa5b60
[   32.834318]  ffffea0007cff9c0 0000000000000001 0000000000000001 ffff8800cb91b658
[   32.835030] Call Trace:
[   32.835257] dump_stack (lib/dump_stack.c:52)
[   32.835755] check_preemption_disabled (./arch/x86/include/asm/preempt.h:80 lib/smp_processor_id.c:49)
[   32.836336] __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[   32.836991] mem_cgroup_charge_statistics.isra.23 (mm/memcontrol.c:930)
[   32.837682] commit_charge (mm/memcontrol.c:2761)
[   32.838187] ? _raw_spin_unlock_irq (./arch/x86/include/asm/paravirt.h:819 include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
[   32.838735] ? get_parent_ip (kernel/sched/core.c:2546)
[   32.839230] mem_cgroup_commit_charge (mm/memcontrol.c:6519)
[   32.839807] __add_to_page_cache_locked (mm/filemap.c:588 include/linux/jump_label.h:115 include/trace/events/filemap.h:50 mm/filemap.c:589)
[   32.840479] add_to_page_cache_lru (mm/filemap.c:627)
[   32.841048] read_cache_pages (mm/readahead.c:92)
[   32.841560] ? v9fs_cache_session_get_key (fs/9p/cache.c:306)
[   32.842145] ? v9fs_write_begin (fs/9p/vfs_addr.c:99)
[   32.842694] v9fs_vfs_readpages (fs/9p/vfs_addr.c:127)
[   32.843251] __do_page_cache_readahead (mm/readahead.c:123 mm/readahead.c:200)
[   32.843848] ? __do_page_cache_readahead (include/linux/rcupdate.h:877 mm/readahead.c:178)
[   32.844435] ? __const_udelay (arch/x86/lib/delay.c:126)
[   32.844944] filemap_fault (include/linux/memcontrol.h:141 include/linux/memcontrol.h:198 mm/filemap.c:1869)
[   32.845465] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[   32.845999] __do_fault (mm/memory.c:2705)
[   32.846472] ? mem_cgroup_try_charge (include/linux/cgroup.h:158 mm/memcontrol.c:6467)
[   32.847048] do_cow_fault (mm/memory.c:2936)
[   32.847561] __handle_mm_fault (mm/memory.c:3078 mm/memory.c:3205 mm/memory.c:3322)
[   32.848092] ? __const_udelay (arch/x86/lib/delay.c:126)
[   32.848596] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[   32.849157] handle_mm_fault (mm/memory.c:3345)
[   32.849665] __do_page_fault (arch/x86/mm/fault.c:1230)
[   32.850239] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[   32.850963] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305)
[   32.851442] ? sched_clock_local (kernel/sched/clock.c:214)
[   32.852034] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[   32.852669] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[   32.853243] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[   32.853854] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
[   32.854393] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[   32.854924] async_page_fault (arch/x86/kernel/entry_64.S:1322)
[   32.855507] ? __clear_user (arch/x86/lib/usercopy_64.c:22)
[   32.855999] ? __clear_user (arch/x86/lib/usercopy_64.c:18 arch/x86/lib/usercopy_64.c:21)
[   32.856488] clear_user (arch/x86/lib/usercopy_64.c:54)
[   32.856997] padzero (fs/binfmt_elf.c:122)
[   32.857440] load_elf_binary (fs/binfmt_elf.c:909 (discriminator 1))
[   32.857949] ? search_binary_handler (fs/exec.c:1374)
[   32.858550] ? preempt_count_sub (kernel/sched/core.c:2602)
[   32.859089] search_binary_handler (fs/exec.c:1375)
[   32.859654] do_execve_common.isra.19 (fs/exec.c:1412 fs/exec.c:1508)
[   32.860319] ? do_execve_common.isra.19 (./arch/x86/include/asm/current.h:14 fs/exec.c:1406 fs/exec.c:1508)
[   32.860949] do_execve (fs/exec.c:1551)
[   32.861390] SyS_execve (fs/exec.c:1602)
[   32.861848] stub_execve (arch/x86/kernel/entry_64.S:662)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-06-21  0:34     ` Sasha Levin
  (?)
@ 2014-06-21  0:56       ` Andrew Morton
  -1 siblings, 0 replies; 141+ messages in thread
From: Andrew Morton @ 2014-06-21  0:56 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Johannes Weiner, Michal Hocko, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Fri, 20 Jun 2014 20:34:43 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:

> I'm seeing the following when booting a VM, bisection pointed me to this
> patch.
> 
> [   32.830823] BUG: using __this_cpu_add() in preemptible [00000000] code: mkdir/8677

Thanks.  This one was fixed earlier today.

From: Michal Hocko <mhocko@suse.cz>
Subject: memcg: mem_cgroup_charge_statistics needs preempt_disable

Preemption was previously disabled by lock_page_cgroup, which has been
removed by "mm: memcontrol: rewrite uncharge API".

This fixes a flood of splats like this:
[    3.149371] BUG: using __this_cpu_add() in preemptible [00000000] code: udevd/1271
[    3.151458] caller is __this_cpu_preempt_check+0x13/0x15
[    3.152927] CPU: 0 PID: 1271 Comm: udevd Not tainted 3.15.0-test1 #366
[    3.154637] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[    3.156788]  0000000000000000 ffff88000005fba8 ffffffff814efe3f 0000000000000000
[    3.158810]  ffff88000005fbd8 ffffffff8125b969 ffff880007413448 0000000000000001
[    3.160836]  ffffea00001e8c00 0000000000000001 ffff88000005fbe8 ffffffff8125b9a8
[    3.162950] Call Trace:
[    3.163598]  [<ffffffff814efe3f>] dump_stack+0x4e/0x7a
[    3.164942]  [<ffffffff8125b969>] check_preemption_disabled+0xd2/0xe5
[    3.166618]  [<ffffffff8125b9a8>] __this_cpu_preempt_check+0x13/0x15
[    3.168267]  [<ffffffff8112b630>] mem_cgroup_charge_statistics.isra.36+0xb5/0xc6
[    3.170169]  [<ffffffff8112d2c5>] commit_charge+0x23c/0x256
[    3.171823]  [<ffffffff8113101b>] mem_cgroup_commit_charge+0xb8/0xd7
[    3.173838]  [<ffffffff810f5dab>] shmem_getpage_gfp+0x399/0x605
[    3.175363]  [<ffffffff810f7456>] shmem_write_begin+0x3d/0x58
[    3.176854]  [<ffffffff810e1361>] generic_perform_write+0xbc/0x192
[    3.178445]  [<ffffffff8114a086>] ? file_update_time+0x34/0xac
[    3.179952]  [<ffffffff810e2ae4>] __generic_file_aio_write+0x2c0/0x300
[    3.181655]  [<ffffffff810e2b76>] generic_file_aio_write+0x52/0xbd
[    3.183234]  [<ffffffff81133944>] do_sync_write+0x59/0x78
[    3.184630]  [<ffffffff81133ea8>] vfs_write+0xc4/0x181
[    3.185957]  [<ffffffff81134801>] SyS_write+0x4a/0x91
[    3.187258]  [<ffffffff814fd30e>] tracesys+0xd0/0xd5

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    3 +++
 1 file changed, 3 insertions(+)

diff -puN mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api-fix-4 mm/memcontrol.c
--- a/mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api-fix-4
+++ a/mm/memcontrol.c
@@ -904,6 +904,8 @@ static void mem_cgroup_charge_statistics
 					 struct page *page,
 					 int nr_pages)
 {
+	preempt_disable();
+
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
@@ -928,6 +930,7 @@ static void mem_cgroup_charge_statistics
 	}
 
 	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
+	preempt_enable();
 }
 
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
_


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-06-21  0:56       ` Andrew Morton
@ 2014-06-21  1:03         ` Sasha Levin
  -1 siblings, 0 replies; 141+ messages in thread
From: Sasha Levin @ 2014-06-21  1:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On 06/20/2014 08:56 PM, Andrew Morton wrote:
> On Fri, 20 Jun 2014 20:34:43 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> I'm seeing the following when booting a VM, bisection pointed me to this
>> patch.
>>
>> [   32.830823] BUG: using __this_cpu_add() in preemptible [00000000] code: mkdir/8677
> 
> Thanks.  This one was fixed earlier today.

Thanks, Andrew. My first bisection attempt went sideways and ended up
pointing at "fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write"
for some reason.

My attempt to understand what data integrity has to do with cgroups was unfruitful :(


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [PATCH -mm] memcg: mem_cgroup_charge_statistics needs preempt_disable
  2014-06-20 16:36     ` Michal Hocko
@ 2014-06-23  4:16       ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-06-23  4:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov, linux-mm, LKML

On Fri, Jun 20, 2014 at 06:36:11PM +0200, Michal Hocko wrote:
> Preemption was previously disabled by lock_page_cgroup, which has
> been removed by "mm: memcontrol: rewrite uncharge API".
> 
> This fixes a flood of splats like this:
> [    3.149371] BUG: using __this_cpu_add() in preemptible [00000000] code: udevd/1271
> [    3.151458] caller is __this_cpu_preempt_check+0x13/0x15
> [    3.152927] CPU: 0 PID: 1271 Comm: udevd Not tainted 3.15.0-test1 #366
> [    3.154637] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [    3.156788]  0000000000000000 ffff88000005fba8 ffffffff814efe3f 0000000000000000
> [    3.158810]  ffff88000005fbd8 ffffffff8125b969 ffff880007413448 0000000000000001
> [    3.160836]  ffffea00001e8c00 0000000000000001 ffff88000005fbe8 ffffffff8125b9a8
> [    3.162950] Call Trace:
> [    3.163598]  [<ffffffff814efe3f>] dump_stack+0x4e/0x7a
> [    3.164942]  [<ffffffff8125b969>] check_preemption_disabled+0xd2/0xe5
> [    3.166618]  [<ffffffff8125b9a8>] __this_cpu_preempt_check+0x13/0x15
> [    3.168267]  [<ffffffff8112b630>] mem_cgroup_charge_statistics.isra.36+0xb5/0xc6
> [    3.170169]  [<ffffffff8112d2c5>] commit_charge+0x23c/0x256
> [    3.171823]  [<ffffffff8113101b>] mem_cgroup_commit_charge+0xb8/0xd7
> [    3.173838]  [<ffffffff810f5dab>] shmem_getpage_gfp+0x399/0x605
> [    3.175363]  [<ffffffff810f7456>] shmem_write_begin+0x3d/0x58
> [    3.176854]  [<ffffffff810e1361>] generic_perform_write+0xbc/0x192
> [    3.178445]  [<ffffffff8114a086>] ? file_update_time+0x34/0xac
> [    3.179952]  [<ffffffff810e2ae4>] __generic_file_aio_write+0x2c0/0x300
> [    3.181655]  [<ffffffff810e2b76>] generic_file_aio_write+0x52/0xbd
> [    3.183234]  [<ffffffff81133944>] do_sync_write+0x59/0x78
> [    3.184630]  [<ffffffff81133ea8>] vfs_write+0xc4/0x181
> [    3.185957]  [<ffffffff81134801>] SyS_write+0x4a/0x91
> [    3.187258]  [<ffffffff814fd30e>] tracesys+0xd0/0xd5
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Thanks, Michal.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 12/13] mm: memcontrol: rewrite charge API
  2014-06-18 20:40   ` Johannes Weiner
  (?)
@ 2014-06-23  6:15     ` Uwe Kleine-König
  -1 siblings, 0 replies; 141+ messages in thread
From: Uwe Kleine-König @ 2014-06-23  6:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel, kernel

Hello,

On Wed, Jun 18, 2014 at 04:40:44PM -0400, Johannes Weiner wrote:
> The memcg charge API charges pages before they are rmapped - i.e. have
> an actual "type" - and so every callsite needs its own set of charge
> and uncharge functions to know what type is being operated on.  Worse,
> uncharge has to happen from a context that is still type-specific,
> rather than at the end of the page's lifetime with exclusive access,
> and so requires a lot of synchronization.
> ...
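
A condensed sketch of the split charge sequence the quoted changelog
refers to; the function names appear in traces earlier in this thread,
while the exact prototypes here are assumptions:

	struct mem_cgroup *memcg;
	int ret;

	/* Reserve the charge first, before the page has a "type". */
	ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
	if (ret)
		return ret;

	/* ... make the page visible: rmap or page cache insertion ... */

	/* Commit once page->mapping is established. */
	mem_cgroup_commit_charge(page, memcg, false);

	/* Error paths after try_charge would undo the reservation with
	 * a matching cancel call (assumed name: mem_cgroup_cancel_charge). */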

This patch made it into next-20140623 as 5e49555277df (mm: memcontrol: rewrite
charge API) and it makes efm32_defconfig (ARCH=arm) fail with:

  CC      mm/swap.o
mm/swap.c: In function 'lru_cache_add_active_or_unevictable':
mm/swap.c:719:2: error: implicit declaration of function 'TestSetPageMlocked' [-Werror=implicit-function-declaration]
  if (!TestSetPageMlocked(page)) {
  ^
cc1: some warnings being treated as errors
scripts/Makefile.build:257: recipe for target 'mm/swap.o' failed
make[3]: *** [mm/swap.o] Error 1
Makefile:1471: recipe for target 'mm/swap.o' failed

imx_v4_v5_defconfig works, so probably the thing that makes
efm32_defconfig fail is CONFIG_MMU=n.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 12/13] mm: memcontrol: rewrite charge API
  2014-06-23  6:15     ` Uwe Kleine-König
@ 2014-06-23  9:30       ` Michal Hocko
  0 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-06-23  9:30 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Johannes Weiner, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel, kernel

On Mon 23-06-14 08:15:26, Uwe Kleine-König wrote:
> Hello,
> 
> On Wed, Jun 18, 2014 at 04:40:44PM -0400, Johannes Weiner wrote:
> > The memcg charge API charges pages before they are rmapped - i.e. have
> > an actual "type" - and so every callsite needs its own set of charge
> > and uncharge functions to know what type is being operated on.  Worse,
> > uncharge has to happen from a context that is still type-specific,
> > rather than at the end of the page's lifetime with exclusive access,
> > and so requires a lot of synchronization.
> > ...
> 
> this patch made it into next-20140623 as 5e49555277df (mm: memcontrol: rewrite
> charge API) and it makes efm32_defconfig (ARCH=arm) fail with:
> 
>   CC      mm/swap.o
> mm/swap.c: In function 'lru_cache_add_active_or_unevictable':
> mm/swap.c:719:2: error: implicit declaration of function 'TestSetPageMlocked' [-Werror=implicit-function-declaration]
>   if (!TestSetPageMlocked(page)) {
>   ^
> cc1: some warnings being treated as errors
> scripts/Makefile.build:257: recipe for target 'mm/swap.o' failed
> make[3]: *** [mm/swap.o] Error 1
> Makefile:1471: recipe for target 'mm/swap.o' failed
> 
> imx_v4_v5_defconfig works, so probably the thing that makes
> efm32_defconfig fail is CONFIG_MMU=n.

Fix is here:
http://marc.info/?l=linux-mm&m=140330132521104
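
(Not the fix linked above, but a minimal sketch of the shape such a fix
can take; hypothetical, for illustration only. On !MMU kernels there
are no mlocked pages, so the missing helper could simply be stubbed
out:)

	#ifndef CONFIG_MMU
	static inline int TestSetPageMlocked(struct page *page)
	{
		return 0;	/* no mlock without an MMU */
	}
	#endif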

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 12/13] mm: memcontrol: rewrite charge API
  2014-06-23  9:30       ` Michal Hocko
@ 2014-06-23  9:42         ` Uwe Kleine-König
  0 siblings, 0 replies; 141+ messages in thread
From: Uwe Kleine-König @ 2014-06-23  9:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel, kernel

On Mon, Jun 23, 2014 at 11:30:52AM +0200, Michal Hocko wrote:
> On Mon 23-06-14 08:15:26, Uwe Kleine-König wrote:
> > Hello,
> > 
> > On Wed, Jun 18, 2014 at 04:40:44PM -0400, Johannes Weiner wrote:
> > > The memcg charge API charges pages before they are rmapped - i.e. have
> > > an actual "type" - and so every callsite needs its own set of charge
> > > and uncharge functions to know what type is being operated on.  Worse,
> > > uncharge has to happen from a context that is still type-specific,
> > > rather than at the end of the page's lifetime with exclusive access,
> > > and so requires a lot of synchronization.
> > > ...
> > 
> > this patch made it into next-20140623 as 5e49555277df (mm: memcontrol: rewrite
> > charge API) and it makes efm32_defconfig (ARCH=arm) fail with:
> > 
> >   CC      mm/swap.o
> > mm/swap.c: In function 'lru_cache_add_active_or_unevictable':
> > mm/swap.c:719:2: error: implicit declaration of function 'TestSetPageMlocked' [-Werror=implicit-function-declaration]
> >   if (!TestSetPageMlocked(page)) {
> >   ^
> > cc1: some warnings being treated as errors
> > scripts/Makefile.build:257: recipe for target 'mm/swap.o' failed
> > make[3]: *** [mm/swap.o] Error 1
> > Makefile:1471: recipe for target 'mm/swap.o' failed
> > 
> > imx_v4_v5_defconfig works, so probably the thing that makes
> > efm32_defconfig fail is CONFIG_MMU=n.
> 
> Fix is here:
> http://marc.info/?l=linux-mm&m=140330132521104
Thanks for the link.

I have another problem that makes my machine fail to boot, but at least
this patch makes next compile again for me.

Thanks
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 12/13] mm: memcontrol: rewrite charge API
  2014-06-18 20:40   ` Johannes Weiner
@ 2014-07-14 15:04     ` Michal Hocko
  0 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-14 15:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Hi,
I've finally managed to untangle myself from internal stuff...

On Wed 18-06-14 16:40:44, Johannes Weiner wrote:
> The memcg charge API charges pages before they are rmapped - i.e. have
> an actual "type" - and so every callsite needs its own set of charge
> and uncharge functions to know what type is being operated on.  Worse,
> uncharge has to happen from a context that is still type-specific,
> rather than at the end of the page's lifetime with exclusive access,
> and so requires a lot of synchronization.
> 
> Rewrite the charge API to provide a generic set of try_charge(),
> commit_charge() and cancel_charge() transaction operations, much like
> what's currently done for swap-in:
> 
>   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
>   pages from the memcg if necessary.
> 
>   mem_cgroup_commit_charge() commits the page to the charge once it
>   has a valid page->mapping and PageAnon() reliably tells the type.
> 
>   mem_cgroup_cancel_charge() aborts the transaction.
> 
> This reduces the charge API and enables subsequent patches to
> drastically simplify uncharging.
> 
> As pages need to be committed after rmap is established but before
> they are added to the LRU, page_add_new_anon_rmap() must stop doing
> LRU additions again.  Revive lru_cache_add_active_or_unevictable().
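
(A minimal sketch of how a callsite strings these together, modeled on
a hypothetical anonymous fault; the function names are from this patch,
locking and error handling elided:)

	struct mem_cgroup *memcg;

	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
		return VM_FAULT_OOM;
	/*
	 * ... instantiate the page; any failure past this point must
	 * call mem_cgroup_cancel_charge(page, memcg) to abort ...
	 */
	page_add_new_anon_rmap(page, vma, address);
	mem_cgroup_commit_charge(page, memcg, false);	/* not on LRU yet */
	lru_cache_add_active_or_unevictable(page, vma);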

I think it would make more sense to do
lru_cache_add_active_or_unevictable in a separate patch for easier
review. Too late, though...

A few comments below.
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

The patch looks correct but the code is quite tricky so I hope I didn't
miss anything.

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  Documentation/cgroups/memcg_test.txt |  32 +--
>  include/linux/memcontrol.h           |  53 ++---
>  include/linux/swap.h                 |   3 +
>  kernel/events/uprobes.c              |   1 +
>  mm/filemap.c                         |   9 +-
>  mm/huge_memory.c                     |  57 +++--
>  mm/memcontrol.c                      | 407 ++++++++++++++---------------------
>  mm/memory.c                          |  41 ++--
>  mm/rmap.c                            |  19 --
>  mm/shmem.c                           |  24 ++-
>  mm/swap.c                            |  34 +++
>  mm/swapfile.c                        |  14 +-
>  12 files changed, 314 insertions(+), 380 deletions(-)
> 
[...]
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index eb65d29516ca..1a9a096858e0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> -/*
> - * All "charge" functions with gfp_mask should use GFP_KERNEL or
> - * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> - * alloc memory but reclaims memory from all available zones. So, "where I want
> - * memory from" bits of gfp_mask has no meaning. So any bits of that field is
> - * available but adding a rule is better. charge functions' gfp_mask should
> - * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
> - * codes.
> - * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> - */

I think we should slightly modify the comment, but the primary idea
should stay there. What about the following?
/*
 * Although memcg charge functions do not allocate any memory, they
 * still take a GFP mask to control the reclaim process (therefore
 * gfp_mask & GFP_RECLAIM_MASK is expected).
 * GFP_KERNEL should be used for the general charge path without any
 * constraints on the reclaim.
 * __GFP_WAIT should be cleared for atomic contexts.
 * __GFP_NORETRY should be set for charges which should rather fail
 * than spend too much time reclaiming.
 * __GFP_NOFAIL should be set for charges which cannot fail.
 */
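
(A hypothetical caller honoring those rules from a context that cannot
sleep might look like this sketch, with GFP_RECLAIM_MASK as referenced
in the comment above:)

	/* atomic context: charge, but never wait for reclaim */
	error = mem_cgroup_try_charge(page, mm,
				      (gfp_mask & GFP_RECLAIM_MASK) & ~__GFP_WAIT,
				      &memcg);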

> -
> -extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask);
> -/* for swap handling */
> -extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> -		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> -extern void mem_cgroup_commit_charge_swapin(struct page *page,
> -					struct mem_cgroup *memcg);
> -extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
> -
> -extern int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
> +void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> +			      bool lrucare);
> +void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
>  
>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);

[...]

> @@ -948,6 +951,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  					struct page *page,
>  					unsigned long haddr)
>  {
> +	struct mem_cgroup *memcg;
>  	spinlock_t *ptl;
>  	pgtable_t pgtable;
>  	pmd_t _pmd;
> @@ -968,20 +972,21 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  					       __GFP_OTHER_NODE,
>  					       vma, address, page_to_nid(page));
>  		if (unlikely(!pages[i] ||
> -			     mem_cgroup_charge_anon(pages[i], mm,
> -						       GFP_KERNEL))) {
> +			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
> +						   &memcg))) {
>  			if (pages[i])
>  				put_page(pages[i]);
> -			mem_cgroup_uncharge_start();
>  			while (--i >= 0) {
> -				mem_cgroup_uncharge_page(pages[i]);
> +				memcg = (void *)page_private(pages[i]);

Hmm, OK, the memcg couldn't go away even if the mm owner has left it,
because the charge is already there and the page is not on the LRU, so
mem_cgroup_css_free() will wait until we uncharge the page or put it on
the LRU.

> +				set_page_private(pages[i], 0);
> +				mem_cgroup_cancel_charge(pages[i], memcg);
>  				put_page(pages[i]);
>  			}
> -			mem_cgroup_uncharge_end();
>  			kfree(pages);
>  			ret |= VM_FAULT_OOM;
>  			goto out;
>  		}

		/*
		 * Pages might end up charged to different memcgs
		 * because the mm owner might move while we are
		 * allocating them. Abuse the ->private field to
		 * store the charged memcg until we know whether to
		 * commit or cancel the charge.
		 */
> +		set_page_private(pages[i], (unsigned long)memcg);
>  	}
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {

[...]
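
(Spelled out, the pattern that comment would document is roughly the
following, a sketch based on the hunk above:)

	/* allocation loop: remember which memcg each page was charged to */
	if (!mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL, &memcg))
		set_page_private(pages[i], (unsigned long)memcg);

	/* later, once the outcome for pages[i] is known: */
	memcg = (void *)page_private(pages[i]);
	set_page_private(pages[i], 0);
	mem_cgroup_commit_charge(pages[i], memcg, false);	/* or cancel */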

> +/**
> + * mem_cgroup_commit_charge - commit a page charge
> + * @page: page to charge
> + * @memcg: memcg to charge the page to
> + * @lrucare: page might be on LRU already
> + *
> + * Finalize a charge transaction started by mem_cgroup_try_charge(),
> + * after page->mapping has been set up.  This must happen atomically
> + * as part of the page instantiation, i.e. under the page table lock
> + * for anonymous pages, under the page lock for page and swap cache.
> + *
> + * In addition, the page must not be on the LRU during the commit, to
> + * prevent racing with task migration.  If it might be, use @lrucare.
> + *
> + * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
> + */
> +void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> +			      bool lrucare)

I think we should be explicit that this is only required for LRU pages.
kmem doesn't have to finalize the transaction.
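
(Concretely, assuming the callsites in this series, the two cases look
like:)

	/* freshly allocated page, cannot be on the LRU yet */
	mem_cgroup_commit_charge(page, memcg, false);

	/* swap cache page may already be on the LRU: pass lrucare */
	mem_cgroup_commit_charge(page, memcg, true);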

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 12/13] mm: memcontrol: rewrite charge API
  2014-07-14 15:04     ` Michal Hocko
@ 2014-07-14 17:13       ` Johannes Weiner
  0 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-14 17:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Mon, Jul 14, 2014 at 05:04:46PM +0200, Michal Hocko wrote:
> Hi,
> I've finally managed to untangle myself from internal stuff...
> 
> On Wed 18-06-14 16:40:44, Johannes Weiner wrote:
> > The memcg charge API charges pages before they are rmapped - i.e. have
> > an actual "type" - and so every callsite needs its own set of charge
> > and uncharge functions to know what type is being operated on.  Worse,
> > uncharge has to happen from a context that is still type-specific,
> > rather than at the end of the page's lifetime with exclusive access,
> > and so requires a lot of synchronization.
> > 
> > Rewrite the charge API to provide a generic set of try_charge(),
> > commit_charge() and cancel_charge() transaction operations, much like
> > what's currently done for swap-in:
> > 
> >   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
> >   pages from the memcg if necessary.
> > 
> >   mem_cgroup_commit_charge() commits the page to the charge once it
> >   has a valid page->mapping and PageAnon() reliably tells the type.
> > 
> >   mem_cgroup_cancel_charge() aborts the transaction.
> > 
> > This reduces the charge API and enables subsequent patches to
> > drastically simplify uncharging.
> > 
> > As pages need to be committed after rmap is established but before
> > they are added to the LRU, page_add_new_anon_rmap() must stop doing
> > LRU additions again.  Revive lru_cache_add_active_or_unevictable().
> 
> I think it would make more sense to do
> lru_cache_add_active_or_unevictable in a separate patch for easier
> review. Too late, though...
> 
> A few comments below.
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> The patch looks correct but the code is quite tricky so I hope I didn't
> miss anything.
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> > @@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
> >  };
> >  
> >  #ifdef CONFIG_MEMCG
> > -/*
> > - * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > - * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> > - * alloc memory but reclaims memory from all available zones. So, "where I want
> > - * memory from" bits of gfp_mask has no meaning. So any bits of that field is
> > - * available but adding a rule is better. charge functions' gfp_mask should
> > - * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
> > - * codes.
> > - * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> > - */
> 
> I think we should slightly modify the comment, but the primary idea
> should stay there. What about the following?
> /*
>  * Although memcg charge functions do not allocate any memory, they
>  * still take a GFP mask to control the reclaim process (therefore
>  * gfp_mask & GFP_RECLAIM_MASK is expected).
>  * GFP_KERNEL should be used for the general charge path without any
>  * constraints on the reclaim.
>  * __GFP_WAIT should be cleared for atomic contexts.
>  * __GFP_NORETRY should be set for charges which should rather fail
>  * than spend too much time reclaiming.
>  * __GFP_NOFAIL should be set for charges which cannot fail.
>  */

What *is* the primary idea here?

Taking any kind of gfp mask and interpreting the bits that pertain to
you is done in a lot of places already, and there really is no need to
duplicate the documentation and risk it getting stale and misleading.

> > @@ -948,6 +951,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  					struct page *page,
> >  					unsigned long haddr)
> >  {
> > +	struct mem_cgroup *memcg;
> >  	spinlock_t *ptl;
> >  	pgtable_t pgtable;
> >  	pmd_t _pmd;
> > @@ -968,20 +972,21 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  					       __GFP_OTHER_NODE,
> >  					       vma, address, page_to_nid(page));
> >  		if (unlikely(!pages[i] ||
> > -			     mem_cgroup_charge_anon(pages[i], mm,
> > -						       GFP_KERNEL))) {
> > +			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
> > +						   &memcg))) {
> >  			if (pages[i])
> >  				put_page(pages[i]);
> > -			mem_cgroup_uncharge_start();
> >  			while (--i >= 0) {
> > -				mem_cgroup_uncharge_page(pages[i]);
> > +				memcg = (void *)page_private(pages[i]);
> 
> Hmm, OK, the memcg couldn't go away even if the mm owner has left it,
> because the charge is already there and the page is not on the LRU, so
> mem_cgroup_css_free() will wait until we uncharge the page or put it
> on the LRU.

Yep, res_counter charges have always pinned the memcg.  We already
used this exact protocol and relied on the same lifetime rules for
swapin charging.

> > +/**
> > + * mem_cgroup_commit_charge - commit a page charge
> > + * @page: page to charge
> > + * @memcg: memcg to charge the page to
> > + * @lrucare: page might be on LRU already
> > + *
> > + * Finalize a charge transaction started by mem_cgroup_try_charge(),
> > + * after page->mapping has been set up.  This must happen atomically
> > + * as part of the page instantiation, i.e. under the page table lock
> > + * for anonymous pages, under the page lock for page and swap cache.
> > + *
> > + * In addition, the page must not be on the LRU during the commit, to
> > + * prevent racing with task migration.  If it might be, use @lrucare.
> > + *
> > + * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
> > + */
> > +void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> > +			      bool lrucare)
> 
> I think we should be explicit that this is only required for LRU pages.
> kmem doesn't have to finalize the transaction.

The function itself only applies to user/LRU pages.  kmem has its own
separate API for charge/commit/cancel/uncharge.
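
(For reference, assuming the memcontrol.h of the time, that separate
kmem API is roughly:)

	bool memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
				       int order);
	void memcg_kmem_commit_charge(struct page *page,
				      struct mem_cgroup *memcg, int order);
	void memcg_kmem_uncharge_pages(struct page *page, int order);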

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 12/13] mm: memcontrol: rewrite charge API
  2014-07-14 17:13       ` Johannes Weiner
@ 2014-07-14 18:43         ` Michal Hocko
  0 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-14 18:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Mon 14-07-14 13:13:24, Johannes Weiner wrote:
> On Mon, Jul 14, 2014 at 05:04:46PM +0200, Michal Hocko wrote:
> > Hi,
> > I've finally managed to untangle myself from internal stuff...
> > 
> > On Wed 18-06-14 16:40:44, Johannes Weiner wrote:
> > > The memcg charge API charges pages before they are rmapped - i.e. have
> > > an actual "type" - and so every callsite needs its own set of charge
> > > and uncharge functions to know what type is being operated on.  Worse,
> > > uncharge has to happen from a context that is still type-specific,
> > > rather than at the end of the page's lifetime with exclusive access,
> > > and so requires a lot of synchronization.
> > > 
> > > Rewrite the charge API to provide a generic set of try_charge(),
> > > commit_charge() and cancel_charge() transaction operations, much like
> > > what's currently done for swap-in:
> > > 
> > >   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
> > >   pages from the memcg if necessary.
> > > 
> > >   mem_cgroup_commit_charge() commits the page to the charge once it
> > >   has a valid page->mapping and PageAnon() reliably tells the type.
> > > 
> > >   mem_cgroup_cancel_charge() aborts the transaction.
> > > 
> > > This reduces the charge API and enables subsequent patches to
> > > drastically simplify uncharging.
> > > 
> > > As pages need to be committed after rmap is established but before
> > > they are added to the LRU, page_add_new_anon_rmap() must stop doing
> > > LRU additions again.  Revive lru_cache_add_active_or_unevictable().
> > 
> > I think it would make more sense to do
> > lru_cache_add_active_or_unevictable in a separate patch for easier
> > review. Too late, though...
> > 
> > A few comments below.
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > The patch looks correct but the code is quite tricky so I hope I didn't
> > miss anything.
> > 
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks!
> 
> > > @@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
> > >  };
> > >  
> > >  #ifdef CONFIG_MEMCG
> > > -/*
> > > - * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > > - * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> > > - * alloc memory but reclaims memory from all available zones. So, "where I want
> > > - * memory from" bits of gfp_mask has no meaning. So any bits of that field is
> > > - * available but adding a rule is better. charge functions' gfp_mask should
> > > - * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
> > > - * codes.
> > > - * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> > > - */
> > 
> > I think we should slightly modify the comment, but the primary idea
> > should stay there. What about the following?
> > /*
> >  * Although memcg charge functions do not allocate any memory, they
> >  * still take a GFP mask to control the reclaim process (therefore
> >  * gfp_mask & GFP_RECLAIM_MASK is expected).
> >  * GFP_KERNEL should be used for the general charge path without any
> >  * constraints on the reclaim.
> >  * __GFP_WAIT should be cleared for atomic contexts.
> >  * __GFP_NORETRY should be set for charges which should rather fail
> >  * than spend too much time reclaiming.
> >  * __GFP_NOFAIL should be set for charges which cannot fail.
> >  */
> 
> What *is* the primary idea here?
> 
> Taking any kind of gfp mask and interpreting the bits that pertain to
> you is done in a lot of places already, and there really is no need to
> duplicate the documentation and risk it getting stale and misleading.

The idea was to document which flags we care about, as not all of them
are implemented. On the other hand, I do agree that stale documentation
is worse than none.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 12/13] mm: memcontrol: rewrite charge API
@ 2014-07-14 18:43         ` Michal Hocko
  0 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-14 18:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Mon 14-07-14 13:13:24, Johannes Weiner wrote:
> On Mon, Jul 14, 2014 at 05:04:46PM +0200, Michal Hocko wrote:
> > Hi,
> > I've finally manage to untagle myself from internal stuff...
> > 
> > On Wed 18-06-14 16:40:44, Johannes Weiner wrote:
> > > The memcg charge API charges pages before they are rmapped - i.e. have
> > > an actual "type" - and so every callsite needs its own set of charge
> > > and uncharge functions to know what type is being operated on.  Worse,
> > > uncharge has to happen from a context that is still type-specific,
> > > rather than at the end of the page's lifetime with exclusive access,
> > > and so requires a lot of synchronization.
> > > 
> > > Rewrite the charge API to provide a generic set of try_charge(),
> > > commit_charge() and cancel_charge() transaction operations, much like
> > > what's currently done for swap-in:
> > > 
> > >   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
> > >   pages from the memcg if necessary.
> > > 
> > >   mem_cgroup_commit_charge() commits the page to the charge once it
> > >   has a valid page->mapping and PageAnon() reliably tells the type.
> > > 
> > >   mem_cgroup_cancel_charge() aborts the transaction.
> > > 
> > > This reduces the charge API and enables subsequent patches to
> > > drastically simplify uncharging.
> > > 
> > > As pages need to be committed after rmap is established but before
> > > they are added to the LRU, page_add_new_anon_rmap() must stop doing
> > > LRU additions again.  Revive lru_cache_add_active_or_unevictable().
> > 
> > I think it would make more sense to do
> > lru_cache_add_active_or_unevictable in a separate patch for easier
> > review. Too late, though...
> > 
> > Few comments bellow
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > The patch looks correct but the code is quite tricky so I hope I didn't
> > miss anything.
> > 
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks!
> 
> > > @@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
> > >  };
> > >  
> > >  #ifdef CONFIG_MEMCG
> > > -/*
> > > - * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > > - * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> > > - * alloc memory but reclaims memory from all available zones. So, "where I want
> > > - * memory from" bits of gfp_mask has no meaning. So any bits of that field is
> > > - * available but adding a rule is better. charge functions' gfp_mask should
> > > - * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
> > > - * codes.
> > > - * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> > > - */
> > 
> > I think we should slightly modify the comment but the primary idea
> > should stay there. What about the following?
> > /*
> >  * Although memcg charge functions do not allocate any memory, they
> >  * still take a GFP mask to control the reclaim process (therefore
> >  * gfp_mask & GFP_RECLAIM_MASK is expected).
> >  * GFP_KERNEL should be used for the general charge path without any
> >  * constraints on the reclaim.
> >  * __GFP_WAIT should be cleared for atomic contexts.
> >  * __GFP_NORETRY should be set for charges which might fail rather than
> >  * spend too much time reclaiming.
> >  * __GFP_NOFAIL should be set for charges which cannot fail.
> >  */
> 
> What *is* the primary idea here?
> 
> Taking any kind of gfp mask and interpreting the bits that pertain to
> you is done in a lot of places already, and there really is no need to
> duplicate the documentation and risk it getting stale and misleading.

The idea was to document which flags we care about, as not all of them
are implemented. On the other hand, I do agree that stale documentation
is worse than no documentation.

[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-06-18 20:40   ` Johannes Weiner
  (?)
@ 2014-07-15  8:25     ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15  8:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

As there were follow-up fixes on top of this one, I have squashed them
into the following patch (changelogs preserved) for review. I hope I
haven't missed any patch. I will respond to this email with the review
comments. It is quite large, so it will take some time...
---
From 11adda1da1d21ba4c07759dedb68c203a991e5eb Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 19 Jun 2014 10:14:50 +0200
Subject: [PATCH] mm: memcontrol: rewrite uncharge API

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages.  However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.
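
As a sketch of where the new hook sits - a simplified release_pages()-style
loop; variable names are assumptions and the real function also handles LRU
removal and compound pages:

	mem_cgroup_uncharge_start();		/* open an uncharge batch */
	for (i = 0; i < nr; i++) {
		struct page *page = pages[i];

		if (!put_page_testzero(page))
			continue;		/* not the final reference */
		mem_cgroup_uncharge(page);	/* refcount 0: full exclusion */
		list_add(&page->lru, &pages_to_free);
	}
	mem_cgroup_uncharge_end();		/* flush coalesced uncharges */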

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped.  Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge.  The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.
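
A hypothetical sketch of that transfer (the real mem_cgroup_migrate()
also handles LRU care and various bailout conditions):

	old_pc = lookup_page_cgroup(oldpage);
	new_pc = lookup_page_cgroup(newpage);

	/* move the charge bits; the old page keeps only its association */
	charge = old_pc->flags & (PCG_MEM | PCG_MEMSW);
	old_pc->flags &= ~(PCG_MEM | PCG_MEMSW);

	new_pc->mem_cgroup = old_pc->mem_cgroup;
	new_pc->flags = PCG_USED | charge;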

mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
gets rid of mem_cgroup_replace_page_cache().

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration.  Remove the very costly page_cgroup
lock and set pc->flags non-atomically.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix 2

It's not entirely clear whether do_swap_account or PCG_MEMSW is the
authoritative answer to whether a page is swap-accounted or not.  This
currently leads to the following memsw counter underflow when swap
accounting is disabled:

[    2.753355] WARNING: CPU: 0 PID: 1 at kernel/res_counter.c:28 res_counter_uncharge_locked+0x48/0x74()
[    2.753355] CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-00238-gddc5bfe #1
[    2.753355] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    2.753355]  0000000000000000 ffff880012073c50 ffffffff81a23b9d ffff880012073c88
[    2.753355]  ffffffff810bc765 ffffffff8111fac8 0000000000001000 ffff88001200fa50
[    2.753355]  0000000000000001 ffff88001200fa01 ffff880012073c98 ffffffff810bc84b
[    2.753355] Call Trace:
[    2.753355]  [<ffffffff81a23b9d>] dump_stack+0x19/0x1b
[    2.753355]  [<ffffffff810bc765>] warn_slowpath_common+0x73/0x8c
[    2.753355]  [<ffffffff8111fac8>] ? res_counter_uncharge_locked+0x48/0x74
[    2.753355]  [<ffffffff810bc84b>] warn_slowpath_null+0x1a/0x1c
[    2.753355]  [<ffffffff8111fac8>] res_counter_uncharge_locked+0x48/0x74
[    2.753355]  [<ffffffff8111fd02>] res_counter_uncharge_until+0x4e/0xa9
[    2.753355]  [<ffffffff8111fd70>] res_counter_uncharge+0x13/0x15
[    2.753355]  [<ffffffff8119499c>] mem_cgroup_uncharge_end+0x73/0x8d
[    2.753355]  [<ffffffff8115735e>] release_pages+0x1f2/0x20d
[    2.753355]  [<ffffffff8116cc3a>] tlb_flush_mmu_free+0x28/0x43
[    2.753355]  [<ffffffff8116d5e5>] tlb_flush_mmu+0x20/0x23
[    2.753355]  [<ffffffff8116d5fc>] tlb_finish_mmu+0x14/0x39
[    2.753355]  [<ffffffff811730c1>] unmap_region+0xcd/0xdf
[    2.753355]  [<ffffffff81172b0e>] ? vma_gap_callbacks_propagate+0x18/0x33
[    2.753355]  [<ffffffff81174bf1>] do_munmap+0x252/0x2e0
[    2.753355]  [<ffffffff81174cc3>] vm_munmap+0x44/0x5c
[    2.753355]  [<ffffffff81174cfe>] SyS_munmap+0x23/0x29
[    2.753355]  [<ffffffff81a31567>] system_call_fastpath+0x16/0x1b
[    2.753355] ---[ end trace cfeb07101f6fbdfb ]---

Don't set PCG_MEMSW when swap accounting is disabled, so that uncharging
only has to look at this per-page flag.

mem_cgroup_swapout() could also fully rely on this flag, but as it can
bail out before even looking up the page_cgroup, check do_swap_account as
a performance optimization and only sanity test for PCG_MEMSW.
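
In essence (these lines appear verbatim in the squashed diff further
down):

	/* commit_charge(): set PCG_MEMSW only when swap is accounted */
	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);

	/* mem_cgroup_swapout(): cheap global check up front, the
	 * per-page flag is then only sanity-tested */
	if (!do_swap_account)
		return;
	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);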

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: mem_cgroup_charge_statistics needs preempt_disable

Preemption was previously disabled by lock_page_cgroup(), which has been
removed by "mm: memcontrol: rewrite uncharge API".
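
The fix brackets the statistics update with IRQs disabled, which also
guarantees a stable CPU for the __this_cpu ops (see the commit_charge()
hunk in the diff below):

	local_irq_disable();
	mem_cgroup_charge_statistics(memcg, page, nr_pages);
	memcg_check_events(memcg, page);
	local_irq_enable();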

This fixes a flood of splats like this:
[    3.149371] BUG: using __this_cpu_add() in preemptible [00000000] code: udevd/1271
[    3.151458] caller is __this_cpu_preempt_check+0x13/0x15
[    3.152927] CPU: 0 PID: 1271 Comm: udevd Not tainted 3.15.0-test1 #366
[    3.154637] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[    3.156788]  0000000000000000 ffff88000005fba8 ffffffff814efe3f 0000000000000000
[    3.158810]  ffff88000005fbd8 ffffffff8125b969 ffff880007413448 0000000000000001
[    3.160836]  ffffea00001e8c00 0000000000000001 ffff88000005fbe8 ffffffff8125b9a8
[    3.162950] Call Trace:
[    3.163598]  [<ffffffff814efe3f>] dump_stack+0x4e/0x7a
[    3.164942]  [<ffffffff8125b969>] check_preemption_disabled+0xd2/0xe5
[    3.166618]  [<ffffffff8125b9a8>] __this_cpu_preempt_check+0x13/0x15
[    3.168267]  [<ffffffff8112b630>] mem_cgroup_charge_statistics.isra.36+0xb5/0xc6
[    3.170169]  [<ffffffff8112d2c5>] commit_charge+0x23c/0x256
[    3.171823]  [<ffffffff8113101b>] mem_cgroup_commit_charge+0xb8/0xd7
[    3.173838]  [<ffffffff810f5dab>] shmem_getpage_gfp+0x399/0x605
[    3.175363]  [<ffffffff810f7456>] shmem_write_begin+0x3d/0x58
[    3.176854]  [<ffffffff810e1361>] generic_perform_write+0xbc/0x192
[    3.178445]  [<ffffffff8114a086>] ? file_update_time+0x34/0xac
[    3.179952]  [<ffffffff810e2ae4>] __generic_file_aio_write+0x2c0/0x300
[    3.181655]  [<ffffffff810e2b76>] generic_file_aio_write+0x52/0xbd
[    3.183234]  [<ffffffff81133944>] do_sync_write+0x59/0x78
[    3.184630]  [<ffffffff81133ea8>] vfs_write+0xc4/0x181
[    3.185957]  [<ffffffff81134801>] SyS_write+0x4a/0x91
[    3.187258]  [<ffffffff814fd30e>] tracesys+0xd0/0xd5

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page-cgroup: fix flags definition

Since commit a9ce315aaec1f ("mm: memcontrol: rewrite uncharge API"),
PCG_* flags are used as bit masks, but they are still defined in an enum
as bit numbers. Fix it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix - uncharge from IRQ context

Hugh reports:

======================================================
[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
3.16.0-rc2-mm1 #3 Not tainted
------------------------------------------------------
cc1/2771 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&rtpz->lock)->rlock){+.+.-.}, at: [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
dd
and this task is already holding:
 (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff8110da3f>] release_pages+0xe7/0x239
which would create a new lock dependency:
 (&(&zone->lru_lock)->rlock){..-.-.} -> (&(&rtpz->lock)->rlock){+.+.-.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&(&zone->lru_lock)->rlock){..-.-.}
... which became SOFTIRQ-irq-safe at:
  [<ffffffff810c201e>] __lock_acquire+0x59f/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bdfbd>] _raw_spin_lock_irqsave+0x3f/0x51
  [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
  [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
  [<ffffffff8110e298>] rotate_reclaimable_page+0xb2/0xd4
  [<ffffffff811018bf>] end_page_writeback+0x1c/0x45
  [<ffffffff81134400>] end_swap_bio_write+0x5c/0x69
  [<ffffffff8123473e>] bio_endio+0x50/0x6e
  [<ffffffff81238dee>] blk_update_request+0x163/0x255
  [<ffffffff81238ef7>] blk_update_bidi_request+0x17/0x65
  [<ffffffff81239242>] blk_end_bidi_request+0x1a/0x56
  [<ffffffff81239289>] blk_end_request+0xb/0xd
  [<ffffffff813a075a>] scsi_io_completion+0x16d/0x553
  [<ffffffff81399c0f>] scsi_finish_command+0xb6/0xbf
  [<ffffffff813a0564>] scsi_softirq_done+0xe9/0xf0
  [<ffffffff8123e8e5>] blk_done_softirq+0x79/0x8b
  [<ffffffff81088675>] __do_softirq+0xfc/0x21f
  [<ffffffff8108898f>] irq_exit+0x3d/0x92
  [<ffffffff81032379>] do_IRQ+0xcc/0xe5
  [<ffffffff815bf5ac>] ret_from_intr+0x0/0x13
  [<ffffffff81443ac0>] cpuidle_enter+0x12/0x14
  [<ffffffff810bb4e4>] cpu_startup_entry+0x187/0x243
  [<ffffffff815a90ab>] rest_init+0x12f/0x133
  [<ffffffff81970e7c>] start_kernel+0x396/0x3a3
  [<ffffffff81970489>] x86_64_start_reservations+0x2a/0x2c
  [<ffffffff81970552>] x86_64_start_kernel+0xc7/0xca

to a SOFTIRQ-irq-unsafe lock:
 (&(&rtpz->lock)->rlock){+.+.-.}
... which became SOFTIRQ-irq-unsafe at:
...  [<ffffffff810c2095>] __lock_acquire+0x616/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
  [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
  [<ffffffff811535bb>] commit_charge+0x260/0x26f
  [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
  [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
  [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
  [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
  [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
  [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
  [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
  [<ffffffff8115a56c>] vfs_write+0xb5/0x169
  [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
  [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b

The soft limit tree lock needs to be IRQ-safe as it's acquired while
holding the IRQ-safe zone->lru_lock.

But more importantly, with uncharge happening in release_pages() now,
this path is executed from interrupt context.

Make the soft limit tree lock, uncharge batching, and charge
statistics IRQ-safe.
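
For the soft limit tree this amounts to switching to the irqsave lock
variants, as in the mem_cgroup_remove_exceeded() hunk below:

	unsigned long flags;

	spin_lock_irqsave(&mctz->lock, flags);
	__mem_cgroup_remove_exceeded(mz, mctz);
	spin_unlock_irqrestore(&mctz->lock, flags);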

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix - double migration

Hugh reports:

VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM))
mm/memcontrol.c:6680!
page had count 1 mapcount 0 mapping anon index 0x196
flags locked uptodate reclaim swapbacked, pcflags 1, memcg not root
mem_cgroup_migrate < move_to_new_page < migrate_pages < compact_zone <
compact_zone_order < try_to_compact_pages < __alloc_pages_direct_compact <
__alloc_pages_nodemask < alloc_pages_vma < do_huge_pmd_anonymous_page <
handle_mm_fault < __do_page_fault

mem_cgroup_migrate() assumes that a page is only migrated once and
then freed immediately after.

However, putting the page back on the LRU list and dropping the
isolation refcount is not done atomically.  This allows a PFN-based
migrator like compaction to isolate the page, see the expected
anonymous page refcount of 1, and migrate the page once more.

Catch pages that have already been migrated and abort migration
gracefully.
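
A plausible shape for that guard (hypothetical sketch, the exact fix is
not reproduced in this mail):

	/*
	 * A page that was already migrated once no longer holds PCG_MEM,
	 * so a second PFN-based migration attempt must bail out instead
	 * of transferring the charge again.
	 */
	pc = lookup_page_cgroup(oldpage);
	if (!PageCgroupUsed(pc) || !(pc->flags & PCG_MEM))
		return;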

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix - migrate before re-mapping

Mapped file accounting depends on the page being charged already, or it
won't get accounted properly, and the mapped file counter will underflow
during unmap later on.

Move mem_cgroup_migrate() before remove_migration_ptes().
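
Ordering sketch in a move_to_new_page()-style path (assumed context):

	/* transfer the charge before the page becomes mapped again, so
	 * mapped file accounting always sees a charged page */
	mem_cgroup_migrate(page, newpage, false);
	remove_migration_ptes(page, newpage);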

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/cgroups/memcg_test.txt | 128 +-----
 include/linux/memcontrol.h           |  49 +--
 include/linux/page_cgroup.h          |  43 +-
 include/linux/swap.h                 |  12 +-
 mm/filemap.c                         |   4 +-
 mm/memcontrol.c                      | 748 ++++++++++++-----------------------
 mm/memory.c                          |   2 -
 mm/migrate.c                         |  44 +--
 mm/rmap.c                            |   1 -
 mm/shmem.c                           |   8 +-
 mm/swap.c                            |   6 +
 mm/swap_state.c                      |   8 +-
 mm/swapfile.c                        |   7 +-
 mm/truncate.c                        |   9 -
 mm/vmscan.c                          |  12 +-
 mm/zswap.c                           |   2 +-
 16 files changed, 333 insertions(+), 750 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index bcf750d3cecd..8870b0212150 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -29,28 +29,13 @@ Please note that implementation details can be changed.
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 
-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
 
 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
- 	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().
 
 	4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	    memory.usage/memsw.usage changes to this page/swp_entry will be
-	 Case          (A)      (B)       (C)     (D)
-         Event
-       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
-          ===========================================
-          (3)        +1/+1    +1/+1     +1/+1   +1/+1
-          (4)          -       0/ 0       -     -1/ 0
-          (5)         0/-1     0/ 0     -1/-1    0/ 0
-          (6)          -       0/-1       -      0/-1
-          ===========================================
-       Result         1/ 1     1/ 1      1/ 1    1/ 1
-
-       In any cases, charges to this page should be 1/ 1.
-
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 
@@ -136,28 +74,20 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	    swp_entry's refcnt -= 1.
 
 
-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.
 
 5. Page Cache
    	Page Cache is charged at
 	- add_to_page_cache_locked().
 
-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().
 
 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.
 
@@ -170,56 +100,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.
 
 7. Page Migration
-   	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-          try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
-
-	  But zap_pte() (by exit or munmap) can be called while migration,
-	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+
+	mem_cgroup_migrate()
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a9a096858e0..806b8fa15c5f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);
 
-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg)
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 
-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
 {
 }
 
-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
 
-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }
 
@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-		struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..9bfb8e68a595 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -3,9 +3,9 @@
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED = 0x01,	/* This page is charged to a memcg */
+	PCG_MEM = 0x02,		/* This page holds a memory charge */
+	PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };
 
@@ -44,42 +44,9 @@ static inline void __init page_cgroup_init(void)
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define TESTCLEARPCGFLAG(uname, lname)			\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return !!(pc->flags & PCG_USED);
 }
 
 #else /* CONFIG_MEMCG */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 290905133078..94fd0b23f3f9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
@@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 4afbf3da885a..698f2c2a511b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -233,7 +233,6 @@ void delete_from_page_cache(struct page *page)
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
 		freepage(page);
@@ -501,8 +500,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe17420afdc7..e4afdbdda0a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -754,9 +754,11 @@ static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 				       struct mem_cgroup_tree_per_zone *mctz)
 {
-	spin_lock(&mctz->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mctz->lock, flags);
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
 
@@ -779,7 +781,9 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 		 * mem is over its softlimit.
 		 */
 		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
+			unsigned long flags;
+
+			spin_lock_irqsave(&mctz->lock, flags);
 			/* if on-tree, remove it */
 			if (mz->on_tree)
 				__mem_cgroup_remove_exceeded(mz, mctz);
@@ -788,7 +792,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 			 * If excess is 0, no tree ops.
 			 */
 			__mem_cgroup_insert_exceeded(mz, mctz, excess);
-			spin_unlock(&mctz->lock);
+			spin_unlock_irqrestore(&mctz->lock, flags);
 		}
 	}
 }
@@ -839,9 +843,9 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 {
 	struct mem_cgroup_per_zone *mz;
 
-	spin_lock(&mctz->lock);
+	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irq(&mctz->lock);
 	return mz;
 }
 
@@ -882,13 +886,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
-{
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
-}
-
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
@@ -909,13 +906,13 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 bool anon, int nr_pages)
+					 int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
-	if (anon)
+	if (PageAnon(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
@@ -1013,7 +1010,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 {
-	preempt_disable();
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -1026,8 +1022,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
 #endif
-		preempt_enable();
-
 		mem_cgroup_threshold(memcg);
 		if (unlikely(do_softlimit))
 			mem_cgroup_update_tree(memcg, page);
@@ -1035,8 +1029,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		if (unlikely(do_numainfo))
 			atomic_inc(&memcg->numainfo_events);
 #endif
-	} else
-		preempt_enable();
+	}
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
@@ -1347,20 +1340,6 @@ out:
 	return lruvec;
 }
 
-/*
- * Following LRU functions are allowed to be used without PCG_LOCK.
- * Operations are called by routine of global LRU independently from memcg.
- * What we have to take care of here is validness of pc->mem_cgroup.
- *
- * Changes to pc->mem_cgroup happens when
- * 1. charge
- * 2. moving account
- * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
- * It is added to LRU before charge.
- * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
- * When moving account, the page is not on LRU. It's isolated.
- */
-
 /**
  * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
@@ -2261,22 +2240,14 @@ cleanup:
  *
  * Notes: Race condition
  *
- * We usually use lock_page_cgroup() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
- *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
+ * Charging occurs during page instantiation, while the page is
+ * unmapped and locked in page migration, or while the page table is
+ * locked in THP migration.  No race is possible.
  *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
+ * Uncharge happens to pages with zero references, no race possible.
  *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check memcg->moving_account and detect there are possibility
- * of race or not. If there is, we take a lock.
+ * Charge moving between groups is protected by checking mm->moving
+ * account and taking the move_lock in the slowpath.
  */
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
@@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
+/*
+ * try_get_mem_cgroup_from_page - look up page's memcg association
+ * @page: the page
+ *
+ * Look up, get a css reference, and return the memcg that owns @page.
+ *
+ * The page must be locked to prevent racing with swap-in and page
+ * cache charges.  If coming from an unlocked page table, the caller
+ * must ensure the page is on the LRU or this can race with charging.
+ */
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		memcg = pc->mem_cgroup;
 		if (memcg && !css_tryget_online(&memcg->css))
@@ -2713,19 +2693,17 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			memcg = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
 	return memcg;
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare)
+			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
-	struct lruvec *lruvec;
 	bool was_on_lru = false;
+	struct lruvec *lruvec;
 
-	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
 	 * we don't need page_cgroup_lock about tail pages, becase they are not
@@ -2747,8 +2725,22 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		}
 	}
 
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point:
+	 *
+	 * - the page is uncharged
+	 *
+	 * - the page is off-LRU
+	 *
+	 * - an anonymous fault has exclusive page access, except for
+	 *   a locked page table
+	 *
+	 * - a page cache insertion, a swapin fault, or a migration
+	 *   have the page locked
+	 */
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
+	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
 
 	if (lrucare) {
 		if (was_on_lru) {
@@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
-
+	local_irq_disable();
+	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
 	 * if they exceeds softlimit.
 	 */
 	memcg_check_events(memcg, page);
+	local_irq_enable();
 }
 
 static DEFINE_MUTEX(set_limit_mutex);
@@ -3446,7 +3438,6 @@ static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
 /*
  * Because tail pages are not marked as "used", set it. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -3467,7 +3458,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
+		pc->flags = head_pc->flags;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
@@ -3497,7 +3488,6 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	unsigned long flags;
 	int ret;
-	bool anon = PageAnon(page);
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -3511,15 +3501,13 @@ static int mem_cgroup_move_account(struct page *page,
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
-	lock_page_cgroup(pc);
-
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto unlock;
+		goto out;
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
+	if (!PageAnon(page) && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3533,20 +3521,23 @@ static int mem_cgroup_move_account(struct page *page,
 			       nr_pages);
 	}
 
-	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
+	/*
+	 * It is safe to change pc->mem_cgroup here because the page
+	 * is referenced, charged, and isolated - we can't race with
+	 * uncharging, charging, migration, or LRU putback.
+	 */
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
-unlock:
-	unlock_page_cgroup(pc);
-	/*
-	 * check events
-	 */
+
+	local_irq_disable();
+	mem_cgroup_charge_statistics(to, page, nr_pages);
 	memcg_check_events(to, page);
+	mem_cgroup_charge_statistics(from, page, -nr_pages);
 	memcg_check_events(from, page);
+	local_irq_enable();
 out:
 	return ret;
 }
@@ -3617,193 +3608,6 @@ out:
 	return ret;
 }
 
-static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
-				   unsigned int nr_pages,
-				   const enum charge_type ctype)
-{
-	struct memcg_batch_info *batch = NULL;
-	bool uncharge_memsw = true;
-
-	/* If swapout, usage of swap doesn't decrease */
-	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-		uncharge_memsw = false;
-
-	batch = &current->memcg_batch;
-	/*
-	 * In usual, we do css_get() when we remember memcg pointer.
-	 * But in this case, we keep res->usage until end of a series of
-	 * uncharges. Then, it's ok to ignore memcg's refcnt.
-	 */
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continuously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-
-	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
-
-	if (nr_pages > 1)
-		goto direct_uncharge;
-
-	/*
-	 * In typical case, batch->memcg == mem. This means we can
-	 * merge a series of uncharges to an uncharge of res_counter.
-	 * If not, we uncharge res_counter ony by one.
-	 */
-	if (batch->memcg != memcg)
-		goto direct_uncharge;
-	/* remember freed charge and uncharge it later */
-	batch->nr_pages++;
-	if (uncharge_memsw)
-		batch->memsw_nr_pages++;
-	return;
-direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (unlikely(batch->memcg != memcg))
-		memcg_oom_recover(memcg);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
-			     bool end_migration)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!PageCgroupUsed(pc)))
-		return NULL;
-
-	lock_page_cgroup(pc);
-
-	memcg = pc->mem_cgroup;
-
-	if (!PageCgroupUsed(pc))
-		goto unlock_out;
-
-	anon = PageAnon(page);
-
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_ANON:
-		/*
-		 * Generally PageAnon tells if it's the anon statistics to be
-		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
-		 * used before page reached the stage of being marked PageAnon.
-		 */
-		anon = true;
-		/* fallthrough */
-	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		/* See mem_cgroup_prepare_migration() */
-		if (page_mapped(page))
-			goto unlock_out;
-		/*
-		 * Pages under migration may not be uncharged.  But
-		 * end_migration() /must/ be the one uncharging the
-		 * unused post-migration page and so it has to call
-		 * here with the migration bit still set.  See the
-		 * res_counter handling below.
-		 */
-		if (!end_migration && PageCgroupMigration(pc))
-			goto unlock_out;
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
-		if (!PageAnon(page)) {	/* Shared memory */
-			if (page->mapping && !page_is_file_cache(page))
-				goto unlock_out;
-		} else if (page_mapped(page)) /* Anon */
-				goto unlock_out;
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
-
-	ClearPageCgroupUsed(pc);
-	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
-	 */
-
-	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
-	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
-	/*
-	 * Migration does not charge the res_counter for the
-	 * replacement page, so leave it alone when phasing out the
-	 * page that is unused after the migration.
-	 */
-	if (!end_migration)
-		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
-
-	return memcg;
-
-unlock_out:
-	unlock_page_cgroup(pc);
-	return NULL;
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	/*
-	 * If the page is in swap cache, uncharge should be deferred
-	 * to the swap path, which also properly accounts swap usage
-	 * and handles memcg lifetime.
-	 *
-	 * Note that this check is not stable and reclaim may add the
-	 * page to swap cache at any time after this.  However, if the
-	 * page is not in swap cache by the time page->mapcount hits
-	 * 0, there won't be any page table references to the swap
-	 * slot, and reclaim will free it and not actually write the
-	 * page to disk.
-	 */
-	if (PageSwapCache(page))
-		return;
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping, page);
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
-}
-
 /*
  * Batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
  * In that cases, pages are freed continuously and we can expect pages
@@ -3814,6 +3618,9 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
 
 void mem_cgroup_uncharge_start(void)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
 	current->memcg_batch.do_batch++;
 	/* We can do nest. */
 	if (current->memcg_batch.do_batch == 1) {
@@ -3821,21 +3628,18 @@ void mem_cgroup_uncharge_start(void)
 		current->memcg_batch.nr_pages = 0;
 		current->memcg_batch.memsw_nr_pages = 0;
 	}
+	local_irq_restore(flags);
 }
 
 void mem_cgroup_uncharge_end(void)
 {
 	struct memcg_batch_info *batch = &current->memcg_batch;
+	unsigned long flags;
 
-	if (!batch->do_batch)
-		return;
-
-	batch->do_batch--;
-	if (batch->do_batch) /* If stacked, do nothing. */
-		return;
-
-	if (!batch->memcg)
-		return;
+	local_irq_save(flags);
+	VM_BUG_ON(!batch->do_batch);
+	if (--batch->do_batch) /* If stacked, do nothing */
+		goto out;
 	/*
 	 * This "batch->memcg" is valid without any css_get/put etc...
 	 * bacause we hide charges behind us.
@@ -3847,61 +3651,16 @@ void mem_cgroup_uncharge_end(void)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
 	memcg_oom_recover(batch->memcg);
-	/* forget this pointer (for sanity check) */
-	batch->memcg = NULL;
-}
-
-#ifdef CONFIG_SWAP
-/*
- * called after __delete_from_swap_cache() and drop "page" account.
- * memcg information is recorded to swap_cgroup of "ent"
- */
-void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
-{
-	struct mem_cgroup *memcg;
-	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
-
-	if (!swapout) /* this was a swap cache but the swap is unused ! */
-		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
-
-	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
-
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
-	if (do_swap_account && swapout && memcg)
-		swap_cgroup_record(ent, mem_cgroup_id(memcg));
+out:
+	local_irq_restore(flags);
 }
-#endif
 
 #ifdef CONFIG_MEMCG_SWAP
-/*
- * called from swap_entry_free(). remove record in swap_cgroup and
- * uncharge "memsw" account.
- */
-void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
+					 bool charge)
 {
-	struct mem_cgroup *memcg;
-	unsigned short id;
-
-	if (!do_swap_account)
-		return;
-
-	id = swap_cgroup_record(ent, 0);
-	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
-	if (memcg) {
-		/*
-		 * We uncharge this because swap is freed.  This memcg can
-		 * be obsolete one. We avoid calling css_tryget_online().
-		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
-	}
-	rcu_read_unlock();
+	int val = (charge) ? 1 : -1;
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -3953,169 +3712,6 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
-/*
- * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
- * page belongs to.
- */
-void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-				  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-
-	*memcgp = NULL;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	if (PageTransHuge(page))
-		nr_pages <<= compound_order(page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		css_get(&memcg->css);
-		/*
-		 * At migrating an anonymous page, its mapcount goes down
-		 * to 0 and uncharge() will be called. But, even if it's fully
-		 * unmapped, migration may fail and this page has to be
-		 * charged again. We set MIGRATION flag here and delay uncharge
-		 * until end_migration() is called
-		 *
-		 * Corner Case Thinking
-		 * A)
-		 * When the old page was mapped as Anon and it's unmap-and-freed
-		 * while migration was ongoing.
-		 * If unmap finds the old page, uncharge() of it will be delayed
-		 * until end_migration(). If unmap finds a new page, it's
-		 * uncharged when it make mapcount to be 1->0. If unmap code
-		 * finds swap_migration_entry, the new page will not be mapped
-		 * and end_migration() will find it(mapcount==0).
-		 *
-		 * B)
-		 * When the old page was mapped but migraion fails, the kernel
-		 * remaps it. A charge for it is kept by MIGRATION flag even
-		 * if mapcount goes down to 0. We can do remap successfully
-		 * without charging it again.
-		 *
-		 * C)
-		 * The "old" page is under lock_page() until the end of
-		 * migration, so, the old page itself will not be swapped-out.
-		 * If the new page is swapped out before end_migraton, our
-		 * hook to usual swap-out path will catch the event.
-		 */
-		if (PageAnon(page))
-			SetPageCgroupMigration(pc);
-	}
-	unlock_page_cgroup(pc);
-	/*
-	 * If the page is not charged at this point,
-	 * we return here.
-	 */
-	if (!memcg)
-		return;
-
-	*memcgp = memcg;
-	/*
-	 * We charge new page before it's used/mapped. So, even if unlock_page()
-	 * is called before end_migration, we can catch all events on this new
-	 * page. In the case new page is migrated but not remapped, new page's
-	 * mapcount will be finally 0 and we call uncharge in end_migration().
-	 */
-	/*
-	 * The page is committed to the memcg, but it's not actually
-	 * charged to the res_counter since we plan on replacing the
-	 * old one and only one page is going to be left afterwards.
-	 */
-	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-	struct page *used, *unused;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (!memcg)
-		return;
-
-	if (!migration_ok) {
-		used = oldpage;
-		unused = newpage;
-	} else {
-		used = newpage;
-		unused = oldpage;
-	}
-	anon = PageAnon(used);
-	__mem_cgroup_uncharge_common(unused,
-				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
-				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
-				     true);
-	css_put(&memcg->css);
-	/*
-	 * We disallowed uncharge of pages under migration because mapcount
-	 * of the page goes down to zero, temporarly.
-	 * Clear the flag and check the page should be charged.
-	 */
-	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
-	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * If a page is a file cache, radix-tree replacement is very atomic
-	 * and we can skip this check. When it was an Anon page, its mapcount
-	 * goes down to 0. But because we added MIGRATION flage, it's not
-	 * uncharged yet. There are several case but page->mapcount check
-	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
-	 * check. (see prepare_charge() also)
-	 */
-	if (anon)
-		mem_cgroup_uncharge_page(used);
-}
-
-/*
- * At replace page cache, newpage is not under any memcg but it's on
- * LRU. So, this function doesn't touch res_counter but handles LRU
- * in correct way. Both pages are locked so we cannot race with uncharge.
- */
-void mem_cgroup_replace_page_cache(struct page *oldpage,
-				  struct page *newpage)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(oldpage);
-	/* fix accounting on old pages */
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * When called from shmem_replace_page(), in some cases the
-	 * oldpage has already been charged, and in some cases not.
-	 */
-	if (!memcg)
-		return;
-	/*
-	 * Even if newpage->mapping was NULL before starting replacement,
-	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
-	 * LRU while we overwrite pc->mem_cgroup.
-	 */
-	commit_charge(newpage, memcg, 1, false, true);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -4314,7 +3910,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
+		spin_lock_irq(&mctz->lock);
 
 		/*
 		 * If we failed to reclaim anything from this memory cgroup
@@ -4354,7 +3950,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 		 */
 		/* If excess == 0, no tree ops */
 		__mem_cgroup_insert_exceeded(mz, mctz, excess);
-		spin_unlock(&mctz->lock);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&mz->memcg->css);
 		loop++;
 		/*
@@ -6290,9 +5886,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
-		 * Do only loose check w/o page_cgroup lock.
-		 * mem_cgroup_move_account() checks the pc is valid or not under
-		 * the lock.
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the pc is valid or
+		 * not under LRU exclusion.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
@@ -6751,6 +6347,67 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+#ifdef CONFIG_MEMCG_SWAP
+/**
+ * mem_cgroup_swapout - transfer a memsw charge to swap
+ * @page: page whose memsw charge to transfer
+ * @entry: swap entry to move the charge to
+ *
+ * Transfer the memsw charge of @page to @entry.
+ */
+void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+	struct page_cgroup *pc;
+	unsigned short oldid;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (!do_swap_account)
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Readahead page, never charged */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	VM_BUG_ON_PAGE(oldid, page);
+
+	pc->flags &= ~PCG_MEMSW;
+	css_get(&pc->mem_cgroup->css);
+	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
+}
+
+/**
+ * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * @entry: swap entry to uncharge
+ *
+ * Drop the memsw charge associated with @entry.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	if (!do_swap_account)
+		return;
+
+	id = swap_cgroup_record(entry, 0);
+	rcu_read_lock();
+	memcg = mem_cgroup_lookup(id);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(memcg, false);
+		css_put(&memcg->css);
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
@@ -6853,7 +6510,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+	commit_charge(page, memcg, nr_pages, lrucare);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6895,6 +6552,123 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_uncharge - uncharge a page
+ * @page: page to uncharge
+ *
+ * Uncharge a page previously charged with mem_cgroup_try_charge() and
+ * mem_cgroup_commit_charge().
+ */
+void mem_cgroup_uncharge(struct page *page)
+{
+	struct memcg_batch_info *batch;
+	unsigned int nr_pages = 1;
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+	unsigned long pc_flags;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Every final put_page() ends up here */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point, we have fully
+	 * exclusive access to the page.
+	 */
+	memcg = pc->mem_cgroup;
+	pc_flags = pc->flags;
+	pc->flags = 0;
+
+	local_irq_save(flags);
+
+	if (nr_pages > 1)
+		goto direct;
+	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+		goto direct;
+	batch = &current->memcg_batch;
+	if (!batch->do_batch)
+		goto direct;
+	if (batch->memcg && batch->memcg != memcg)
+		goto direct;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	if (pc_flags & PCG_MEM)
+		batch->nr_pages++;
+	if (pc_flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	goto out;
+direct:
+	if (pc_flags & PCG_MEM)
+		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	if (pc_flags & PCG_MEMSW)
+		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+	memcg_oom_recover(memcg);
+out:
+	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
+	memcg_check_events(memcg, page);
+
+	local_irq_restore(flags);
+}
+
+/**
+ * mem_cgroup_migrate - migrate a charge to another page
+ * @oldpage: currently charged page
+ * @newpage: page to transfer the charge to
+ * @lrucare: page might be on LRU already
+ *
+ * Migrate the charge from @oldpage to @newpage.
+ *
+ * Both pages must be locked, @newpage->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare)
+{
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(oldpage);
+	if (!PageCgroupUsed(pc))
+		return;
+
+	/* Already migrated */
+	if (!(pc->flags & PCG_MEM))
+		return;
+
+	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
+	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
+
+	if (PageTransHuge(oldpage)) {
+		nr_pages <<= compound_order(oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
+	}
+
+	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/memory.c b/mm/memory.c
index 16bce85947dc..d9a1f1038982 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		details = NULL;
 
 	BUG_ON(addr >= end);
-	mem_cgroup_uncharge_start();
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
-	mem_cgroup_uncharge_end();
 }
 
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 63f0cd559999..9da3cf84d30a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		newpage->mapping = NULL;
+		if (!PageAnon(newpage))
+			newpage->mapping = NULL;
 	} else {
+		mem_cgroup_migrate(page, newpage, false);
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
-		page->mapping = NULL;
+		if (!PageAnon(page))
+			page->mapping = NULL;
 	}
 
 	unlock_page(newpage);
@@ -797,7 +800,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
-	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
@@ -823,9 +825,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		lock_page(page);
 	}
 
-	/* charge against new page */
-	mem_cgroup_prepare_migration(page, newpage, &mem);
-
 	if (PageWriteback(page)) {
 		/*
 		 * Only in the case of a full synchronous migration is it
@@ -835,10 +834,10 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 */
 		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
-			goto uncharge;
+			goto out_unlock;
 		}
 		if (!force)
-			goto uncharge;
+			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -874,7 +873,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 			 */
 			remap_swapcache = 0;
 		} else {
-			goto uncharge;
+			goto out_unlock;
 		}
 	}
 
@@ -887,7 +886,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the page migration right away (proteced by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto uncharge;
+		goto out_unlock;
 	}
 
 	/*
@@ -906,7 +905,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto uncharge;
+			goto out_unlock;
 		}
 		goto skip_unmap;
 	}
@@ -925,10 +924,7 @@ skip_unmap:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-uncharge:
-	mem_cgroup_end_migration(mem, page, newpage,
-				 (rc == MIGRATEPAGE_SUCCESS ||
-				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
+out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1787,7 +1783,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
@@ -1853,15 +1848,6 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	/*
-	 * Traditional migration needs to prepare the memcg charge
-	 * transaction early to prevent the old page from being
-	 * uncharged when installing migration entries.  Here we can
-	 * save the potential rollback and start the charge transfer
-	 * only when migration is already known to end successfully.
-	 */
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
@@ -1889,14 +1875,10 @@ fail_putback:
 		goto fail_putback;
 	}
 
+	mem_cgroup_migrate(page, new_page, false);
+
 	page_remove_rmap(page);
 
-	/*
-	 * Finish the charge transaction under the page table lock to
-	 * prevent split_huge_page() from dividing up the charge
-	 * before it's fully transferred to the new page.
-	 */
-	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index d6673ebd0108..a930392b0454 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1085,7 +1085,6 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		mem_cgroup_uncharge_page(page);
 		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
diff --git a/mm/shmem.c b/mm/shmem.c
index da0fc83af9d5..498c8cfac48d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -405,7 +405,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -433,7 +432,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -484,7 +482,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -511,7 +508,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 
@@ -809,7 +805,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -982,7 +978,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 		 */
 		oldpage = newpage;
 	} else {
-		mem_cgroup_replace_page_cache(oldpage, newpage);
+		mem_cgroup_migrate(oldpage, newpage, false);
 		lru_cache_add_anon(newpage);
 		*pagep = newpage;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index a98f48626359..3074210f245d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
@@ -915,6 +916,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	mem_cgroup_uncharge_start();
+
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
@@ -946,6 +949,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
+		mem_cgroup_uncharge(page);
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
@@ -955,6 +959,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	mem_cgroup_uncharge_end();
+
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
 EXPORT_SYMBOL(release_pages);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2972eee184a4..e160151da6b8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free(entry);
 			return 0;
 		}
 
@@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 		return 0;
 	}
 }
@@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free(entry);
 	page_cache_release(page);
 }
 
@@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0883b4912ff7..8798b2e0ac59 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	p = swap_info_get(entry);
 	if (p) {
-		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		if (page)
-			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
+		swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 6a78c814bebf..b352481c276d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -367,7 +365,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -389,7 +386,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 	cleancache_invalidate_inode(mapping);
@@ -488,7 +484,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -517,7 +512,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -548,7 +542,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -597,7 +590,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -650,7 +642,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 71f23c0c1090..98234e9ccb5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -571,9 +571,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -594,7 +595,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__clear_page_locked(page);
 free_it:
+		mem_cgroup_uncharge(page);
 		nr_reclaimed++;
 
 		/*
@@ -1126,12 +1127,13 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	mem_cgroup_uncharge_end();
 
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
-	mem_cgroup_uncharge_end();
+
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_congested += nr_congested;
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
@@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
@@ -1652,6 +1656,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
diff --git a/mm/zswap.c b/mm/zswap.c
index 008388fe7b0f..333d70c66093 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
-- 
2.0.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
@ 2014-07-15  8:25     ` Michal Hocko
  0 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15  8:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

As there were follow-up fixes on top of this one, I have squashed them
into the following patch (changelogs preserved) for review. I hope I
haven't missed any patch. I will respond to this email with the review
comments. It is quite large, so it will take some time...
---
From 11adda1da1d21ba4c07759dedb68c203a991e5eb Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 19 Jun 2014 10:14:50 +0200
Subject: [PATCH] mm: memcontrol: rewrite uncharge API

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context: in unmap for anonymous pages, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages.  However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.
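
A caller-side sketch of the resulting charge lifetime (error handling
elided; the mem_cgroup_try_charge() arguments are assumed from its
kernel-doc, the others match the memcontrol.h changes below):

	struct mem_cgroup *memcg;

	/* 1. reserve the charge; may enter reclaim or OOM */
	if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg))
		return -ENOMEM;

	/* 2. instantiate the page, set up page->mapping */

	/* 3. bind the reserved charge to the page */
	mem_cgroup_commit_charge(page, memcg, false);

	/* 4. much later: the final put_page() reaches
	 *    mem_cgroup_uncharge() from the page release paths */
	put_page(page);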

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped.  Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge.  The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
gets rid of mem_cgroup_replace_page_cache().

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.
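
In the mm/vmscan.c hunk below, __remove_mapping() thus becomes:

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };
		/* transfer the memsw charge to the swap entry */
		mem_cgroup_swapout(page, swap);
		__delete_from_swap_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		swapcache_free(swap);
	}

while the remaining memory charge is dropped at the final put_page().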

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration.  Remove the very costly page_cgroup
lock and set pc->flags non-atomically.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix 2

It's not entirely clear whether do_swap_account or PCG_MEMSW is the
authoritative answer to whether a page is swap-accounted or not.  This
currently leads to the following memsw counter underflow when swap
accounting is disabled:

[    2.753355] WARNING: CPU: 0 PID: 1 at kernel/res_counter.c:28 res_counter_uncharge_locked+0x48/0x74()
[    2.753355] CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-00238-gddc5bfe #1
[    2.753355] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    2.753355]  0000000000000000 ffff880012073c50 ffffffff81a23b9d ffff880012073c88
[    2.753355]  ffffffff810bc765 ffffffff8111fac8 0000000000001000 ffff88001200fa50
[    2.753355]  0000000000000001 ffff88001200fa01 ffff880012073c98 ffffffff810bc84b
[    2.753355] Call Trace:
[    2.753355]  [<ffffffff81a23b9d>] dump_stack+0x19/0x1b
[    2.753355]  [<ffffffff810bc765>] warn_slowpath_common+0x73/0x8c
[    2.753355]  [<ffffffff8111fac8>] ? res_counter_uncharge_locked+0x48/0x74
[    2.753355]  [<ffffffff810bc84b>] warn_slowpath_null+0x1a/0x1c
[    2.753355]  [<ffffffff8111fac8>] res_counter_uncharge_locked+0x48/0x74
[    2.753355]  [<ffffffff8111fd02>] res_counter_uncharge_until+0x4e/0xa9
[    2.753355]  [<ffffffff8111fd70>] res_counter_uncharge+0x13/0x15
[    2.753355]  [<ffffffff8119499c>] mem_cgroup_uncharge_end+0x73/0x8d
[    2.753355]  [<ffffffff8115735e>] release_pages+0x1f2/0x20d
[    2.753355]  [<ffffffff8116cc3a>] tlb_flush_mmu_free+0x28/0x43
[    2.753355]  [<ffffffff8116d5e5>] tlb_flush_mmu+0x20/0x23
[    2.753355]  [<ffffffff8116d5fc>] tlb_finish_mmu+0x14/0x39
[    2.753355]  [<ffffffff811730c1>] unmap_region+0xcd/0xdf
[    2.753355]  [<ffffffff81172b0e>] ? vma_gap_callbacks_propagate+0x18/0x33
[    2.753355]  [<ffffffff81174bf1>] do_munmap+0x252/0x2e0
[    2.753355]  [<ffffffff81174cc3>] vm_munmap+0x44/0x5c
[    2.753355]  [<ffffffff81174cfe>] SyS_munmap+0x23/0x29
[    2.753355]  [<ffffffff81a31567>] system_call_fastpath+0x16/0x1b
[    2.753355] ---[ end trace cfeb07101f6fbdfb ]---

Don't set PCG_MEMSW when swap accounting is disabled, so that uncharging
only has to look at this per-page flag.

mem_cgroup_swapout() could also fully rely on this flag, but as it can
bail out before even looking up the page_cgroup, check do_swap_account as
a performance optimization and only sanity test for PCG_MEMSW.
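
Commit and uncharge then agree on a single per-page source of truth:

	/* commit_charge() */
	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);

	/* mem_cgroup_uncharge() only consults the flag */
	if (pc_flags & PCG_MEMSW)
		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);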

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: mem_cgroup_charge_statistics needs preempt_disable

Preemption was previously disabled by lock_page_cgroup(), which has been
removed by "mm: memcontrol: rewrite uncharge API".

This fixes a flood of splats like this:
[    3.149371] BUG: using __this_cpu_add() in preemptible [00000000] code: udevd/1271
[    3.151458] caller is __this_cpu_preempt_check+0x13/0x15
[    3.152927] CPU: 0 PID: 1271 Comm: udevd Not tainted 3.15.0-test1 #366
[    3.154637] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[    3.156788]  0000000000000000 ffff88000005fba8 ffffffff814efe3f 0000000000000000
[    3.158810]  ffff88000005fbd8 ffffffff8125b969 ffff880007413448 0000000000000001
[    3.160836]  ffffea00001e8c00 0000000000000001 ffff88000005fbe8 ffffffff8125b9a8
[    3.162950] Call Trace:
[    3.163598]  [<ffffffff814efe3f>] dump_stack+0x4e/0x7a
[    3.164942]  [<ffffffff8125b969>] check_preemption_disabled+0xd2/0xe5
[    3.166618]  [<ffffffff8125b9a8>] __this_cpu_preempt_check+0x13/0x15
[    3.168267]  [<ffffffff8112b630>] mem_cgroup_charge_statistics.isra.36+0xb5/0xc6
[    3.170169]  [<ffffffff8112d2c5>] commit_charge+0x23c/0x256
[    3.171823]  [<ffffffff8113101b>] mem_cgroup_commit_charge+0xb8/0xd7
[    3.173838]  [<ffffffff810f5dab>] shmem_getpage_gfp+0x399/0x605
[    3.175363]  [<ffffffff810f7456>] shmem_write_begin+0x3d/0x58
[    3.176854]  [<ffffffff810e1361>] generic_perform_write+0xbc/0x192
[    3.178445]  [<ffffffff8114a086>] ? file_update_time+0x34/0xac
[    3.179952]  [<ffffffff810e2ae4>] __generic_file_aio_write+0x2c0/0x300
[    3.181655]  [<ffffffff810e2b76>] generic_file_aio_write+0x52/0xbd
[    3.183234]  [<ffffffff81133944>] do_sync_write+0x59/0x78
[    3.184630]  [<ffffffff81133ea8>] vfs_write+0xc4/0x181
[    3.185957]  [<ffffffff81134801>] SyS_write+0x4a/0x91
[    3.187258]  [<ffffffff814fd30e>] tracesys+0xd0/0xd5
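
In the squashed patch, the per-cpu statistics and the event check are
instead run with interrupts disabled by the callers, as in
commit_charge():

	local_irq_disable();
	mem_cgroup_charge_statistics(memcg, page, nr_pages);
	memcg_check_events(memcg, page);
	local_irq_enable();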

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page-cgroup: fix flags definition

Since commit a9ce315aaec1f ("mm: memcontrol: rewrite uncharge API"),
PCG_* flags are used as bit masks, but they are still defined in an enum
as bit numbers. Fix it.
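
A minimal illustration of the difference, with hypothetical names:

	enum { BITNR_USED };			/* bit number: 0 */
	#define BITMASK_USED	0x01		/* bit mask */

	unsigned long flags = 0;

	set_bit(BITNR_USED, &flags);		/* sets bit 0: flags == 0x1 */

	bool wrong = flags & BITNR_USED;	/* 0x1 & 0 == 0: always false */
	bool right = flags & BITMASK_USED;	/* correct mask test: true */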

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix - uncharge from IRQ context

Hugh reports:

======================================================
[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
3.16.0-rc2-mm1 #3 Not tainted
------------------------------------------------------
cc1/2771 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&rtpz->lock)->rlock){+.+.-.}, at: [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
and this task is already holding:
 (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff8110da3f>] release_pages+0xe7/0x239
which would create a new lock dependency:
 (&(&zone->lru_lock)->rlock){..-.-.} -> (&(&rtpz->lock)->rlock){+.+.-.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&(&zone->lru_lock)->rlock){..-.-.}
... which became SOFTIRQ-irq-safe at:
  [<ffffffff810c201e>] __lock_acquire+0x59f/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bdfbd>] _raw_spin_lock_irqsave+0x3f/0x51
  [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
  [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
  [<ffffffff8110e298>] rotate_reclaimable_page+0xb2/0xd4
  [<ffffffff811018bf>] end_page_writeback+0x1c/0x45
  [<ffffffff81134400>] end_swap_bio_write+0x5c/0x69
  [<ffffffff8123473e>] bio_endio+0x50/0x6e
  [<ffffffff81238dee>] blk_update_request+0x163/0x255
  [<ffffffff81238ef7>] blk_update_bidi_request+0x17/0x65
  [<ffffffff81239242>] blk_end_bidi_request+0x1a/0x56
  [<ffffffff81239289>] blk_end_request+0xb/0xd
  [<ffffffff813a075a>] scsi_io_completion+0x16d/0x553
  [<ffffffff81399c0f>] scsi_finish_command+0xb6/0xbf
  [<ffffffff813a0564>] scsi_softirq_done+0xe9/0xf0
  [<ffffffff8123e8e5>] blk_done_softirq+0x79/0x8b
  [<ffffffff81088675>] __do_softirq+0xfc/0x21f
  [<ffffffff8108898f>] irq_exit+0x3d/0x92
  [<ffffffff81032379>] do_IRQ+0xcc/0xe5
  [<ffffffff815bf5ac>] ret_from_intr+0x0/0x13
  [<ffffffff81443ac0>] cpuidle_enter+0x12/0x14
  [<ffffffff810bb4e4>] cpu_startup_entry+0x187/0x243
  [<ffffffff815a90ab>] rest_init+0x12f/0x133
  [<ffffffff81970e7c>] start_kernel+0x396/0x3a3
  [<ffffffff81970489>] x86_64_start_reservations+0x2a/0x2c
  [<ffffffff81970552>] x86_64_start_kernel+0xc7/0xca

to a SOFTIRQ-irq-unsafe lock:
 (&(&rtpz->lock)->rlock){+.+.-.}
... which became SOFTIRQ-irq-unsafe at:
...  [<ffffffff810c2095>] __lock_acquire+0x616/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
  [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
  [<ffffffff811535bb>] commit_charge+0x260/0x26f
  [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
  [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
  [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
  [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
  [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
  [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
  [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
  [<ffffffff8115a56c>] vfs_write+0xb5/0x169
  [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
  [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b

The soft limit tree lock needs to be IRQ-safe as it's acquired while
holding the IRQ-safe zone->lru_lock.

But more importantly, with uncharge happening in release_pages() now,
this path is executed from interrupt context.

Make the soft limit tree lock, uncharge batching, and charge
statistics IRQ-safe.
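
The tree-lock conversion follows the usual irqsave pattern, e.g.:

	unsigned long flags;

	spin_lock_irqsave(&mctz->lock, flags);
	__mem_cgroup_remove_exceeded(mz, mctz);
	spin_unlock_irqrestore(&mctz->lock, flags);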

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix - double migration

Hugh reports:

VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM))
mm/memcontrol.c:6680!
page had count 1 mapcount 0 mapping anon index 0x196
flags locked uptodate reclaim swapbacked, pcflags 1, memcg not root
mem_cgroup_migrate < move_to_new_page < migrate_pages < compact_zone <
compact_zone_order < try_to_compact_pages < __alloc_pages_direct_compact <
__alloc_pages_nodemask < alloc_pages_vma < do_huge_pmd_anonymous_page <
handle_mm_fault < __do_page_fault

mem_cgroup_migrate() assumes that a page is only migrated once and
then freed immediately after.

However, putting the page back on the LRU list and dropping the
isolation refcount is not done atomically.  This allows a PFN-based
migrator like compaction to isolate the page, see the expected
anonymous page refcount of 1, and migrate the page once more.

Catch pages that have already been migrated and abort migration
gracefully.
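
The abort is an early exit at the top of mem_cgroup_migrate():

	pc = lookup_page_cgroup(oldpage);
	if (!PageCgroupUsed(pc))
		return;

	/* Already migrated: the charge left with an earlier copy */
	if (!(pc->flags & PCG_MEM))
		return;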

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rewrite uncharge API fix - migrate before re-mapping

Mapped file accounting depends on the page being charged already, or it
won't get accounted properly, and the mapped file counter will underflow
during unmap later on.

Move mem_cgroup_migrate() before remove_migration_ptes().
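
In move_to_new_page(), the success path thus reads:

	mem_cgroup_migrate(page, newpage, false);	/* charge newpage first */
	if (remap_swapcache)
		remove_migration_ptes(page, newpage);	/* then re-map */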

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/cgroups/memcg_test.txt | 128 +-----
 include/linux/memcontrol.h           |  49 +--
 include/linux/page_cgroup.h          |  43 +-
 include/linux/swap.h                 |  12 +-
 mm/filemap.c                         |   4 +-
 mm/memcontrol.c                      | 748 ++++++++++++-----------------------
 mm/memory.c                          |   2 -
 mm/migrate.c                         |  44 +--
 mm/rmap.c                            |   1 -
 mm/shmem.c                           |   8 +-
 mm/swap.c                            |   6 +
 mm/swap_state.c                      |   8 +-
 mm/swapfile.c                        |   7 +-
 mm/truncate.c                        |   9 -
 mm/vmscan.c                          |  12 +-
 mm/zswap.c                           |   2 +-
 16 files changed, 333 insertions(+), 750 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index bcf750d3cecd..8870b0212150 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -29,28 +29,13 @@ Please note that implementation details can be changed.
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 
-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
 
 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
- 	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().
 
 	4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	    memory.usage/memsw.usage changes to this page/swp_entry will be
-	 Case          (A)      (B)       (C)     (D)
-         Event
-       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
-          ===========================================
-          (3)        +1/+1    +1/+1     +1/+1   +1/+1
-          (4)          -       0/ 0       -     -1/ 0
-          (5)         0/-1     0/ 0     -1/-1    0/ 0
-          (6)          -       0/-1       -      0/-1
-          ===========================================
-       Result         1/ 1     1/ 1      1/ 1    1/ 1
-
-       In any cases, charges to this page should be 1/ 1.
-
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 
@@ -136,28 +74,20 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	    swp_entry's refcnt -= 1.
 
 
-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.
 
 5. Page Cache
    	Page Cache is charged at
 	- add_to_page_cache_locked().
 
-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().
 
 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.
 
@@ -170,56 +100,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.
 
 7. Page Migration
-   	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-          try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
-
-	  But zap_pte() (by exit or munmap) can be called while migration,
-	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+
+	mem_cgroup_migrate()
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a9a096858e0..806b8fa15c5f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);
 
-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg)
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 
-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
 {
 }
 
-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
 
-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }
 
@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-		struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..9bfb8e68a595 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -3,9 +3,9 @@
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED = 0x01,	/* This page is charged to a memcg */
+	PCG_MEM = 0x02,		/* This page holds a memory charge */
+	PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };
 
@@ -44,42 +44,9 @@ static inline void __init page_cgroup_init(void)
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define TESTCLEARPCGFLAG(uname, lname)			\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return !!(pc->flags & PCG_USED);
 }
 
 #else /* CONFIG_MEMCG */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 290905133078..94fd0b23f3f9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
@@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 4afbf3da885a..698f2c2a511b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -233,7 +233,6 @@ void delete_from_page_cache(struct page *page)
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
 		freepage(page);
@@ -501,8 +500,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe17420afdc7..e4afdbdda0a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -754,9 +754,11 @@ static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 				       struct mem_cgroup_tree_per_zone *mctz)
 {
-	spin_lock(&mctz->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mctz->lock, flags);
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
 
@@ -779,7 +781,9 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 		 * mem is over its softlimit.
 		 */
 		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
+			unsigned long flags;
+
+			spin_lock_irqsave(&mctz->lock, flags);
 			/* if on-tree, remove it */
 			if (mz->on_tree)
 				__mem_cgroup_remove_exceeded(mz, mctz);
@@ -788,7 +792,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 			 * If excess is 0, no tree ops.
 			 */
 			__mem_cgroup_insert_exceeded(mz, mctz, excess);
-			spin_unlock(&mctz->lock);
+			spin_unlock_irqrestore(&mctz->lock, flags);
 		}
 	}
 }
@@ -839,9 +843,9 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 {
 	struct mem_cgroup_per_zone *mz;
 
-	spin_lock(&mctz->lock);
+	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irq(&mctz->lock);
 	return mz;
 }
 
@@ -882,13 +886,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
-{
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
-}
-
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
@@ -909,13 +906,13 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 bool anon, int nr_pages)
+					 int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
-	if (anon)
+	if (PageAnon(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
@@ -1013,7 +1010,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 {
-	preempt_disable();
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -1026,8 +1022,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
 #endif
-		preempt_enable();
-
 		mem_cgroup_threshold(memcg);
 		if (unlikely(do_softlimit))
 			mem_cgroup_update_tree(memcg, page);
@@ -1035,8 +1029,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		if (unlikely(do_numainfo))
 			atomic_inc(&memcg->numainfo_events);
 #endif
-	} else
-		preempt_enable();
+	}
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
@@ -1347,20 +1340,6 @@ out:
 	return lruvec;
 }
 
-/*
- * Following LRU functions are allowed to be used without PCG_LOCK.
- * Operations are called by routine of global LRU independently from memcg.
- * What we have to take care of here is validness of pc->mem_cgroup.
- *
- * Changes to pc->mem_cgroup happens when
- * 1. charge
- * 2. moving account
- * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
- * It is added to LRU before charge.
- * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
- * When moving account, the page is not on LRU. It's isolated.
- */
-
 /**
  * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
@@ -2261,22 +2240,14 @@ cleanup:
  *
  * Notes: Race condition
  *
- * We usually use lock_page_cgroup() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
- *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
+ * Charging occurs during page instantiation, while the page is
+ * unmapped and locked in page migration, or while the page table is
+ * locked in THP migration.  No race is possible.
  *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
+ * Uncharge happens to pages with zero references, no race possible.
  *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check memcg->moving_account and detect there are possibility
- * of race or not. If there is, we take a lock.
+ * Charge moving between groups is protected by checking mm->moving
+ * account and taking the move_lock in the slowpath.
  */
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
@@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
+/*
+ * try_get_mem_cgroup_from_page - look up page's memcg association
+ * @page: the page
+ *
+ * Look up, get a css reference, and return the memcg that owns @page.
+ *
+ * The page must be locked to prevent racing with swap-in and page
+ * cache charges.  If coming from an unlocked page table, the caller
+ * must ensure the page is on the LRU or this can race with charging.
+ */
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		memcg = pc->mem_cgroup;
 		if (memcg && !css_tryget_online(&memcg->css))
@@ -2713,19 +2693,17 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			memcg = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
 	return memcg;
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare)
+			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
-	struct lruvec *lruvec;
 	bool was_on_lru = false;
+	struct lruvec *lruvec;
 
-	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
 	 * we don't need page_cgroup_lock for tail pages, because they are not
@@ -2747,8 +2725,22 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		}
 	}
 
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point:
+	 *
+	 * - the page is uncharged
+	 *
+	 * - the page is off-LRU
+	 *
+	 * - an anonymous fault has exclusive page access, except for
+	 *   a locked page table
+	 *
+	 * - a page cache insertion, a swapin fault, or a migration
+	 *   have the page locked
+	 */
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
+	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
 
 	if (lrucare) {
 		if (was_on_lru) {
@@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
-
+	local_irq_disable();
+	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
 	 * if they exceed the softlimit.
 	 */
 	memcg_check_events(memcg, page);
+	local_irq_enable();
 }
 
 static DEFINE_MUTEX(set_limit_mutex);
@@ -3446,7 +3438,6 @@ static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
 /*
 * Because tail pages are not marked as "used", set them. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -3467,7 +3458,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
+		pc->flags = head_pc->flags;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
@@ -3497,7 +3488,6 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	unsigned long flags;
 	int ret;
-	bool anon = PageAnon(page);
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -3511,15 +3501,13 @@ static int mem_cgroup_move_account(struct page *page,
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
-	lock_page_cgroup(pc);
-
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto unlock;
+		goto out;
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
+	if (!PageAnon(page) && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3533,20 +3521,23 @@ static int mem_cgroup_move_account(struct page *page,
 			       nr_pages);
 	}
 
-	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
+	/*
+	 * It is safe to change pc->mem_cgroup here because the page
+	 * is referenced, charged, and isolated - we can't race with
+	 * uncharging, charging, migration, or LRU putback.
+	 */
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
-unlock:
-	unlock_page_cgroup(pc);
-	/*
-	 * check events
-	 */
+
+	local_irq_disable();
+	mem_cgroup_charge_statistics(to, page, nr_pages);
 	memcg_check_events(to, page);
+	mem_cgroup_charge_statistics(from, page, -nr_pages);
 	memcg_check_events(from, page);
+	local_irq_enable();
 out:
 	return ret;
 }
@@ -3617,193 +3608,6 @@ out:
 	return ret;
 }
 
-static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
-				   unsigned int nr_pages,
-				   const enum charge_type ctype)
-{
-	struct memcg_batch_info *batch = NULL;
-	bool uncharge_memsw = true;
-
-	/* If swapout, usage of swap doesn't decrease */
-	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-		uncharge_memsw = false;
-
-	batch = &current->memcg_batch;
-	/*
-	 * In usual, we do css_get() when we remember memcg pointer.
-	 * But in this case, we keep res->usage until end of a series of
-	 * uncharges. Then, it's ok to ignore memcg's refcnt.
-	 */
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continuously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-
-	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
-
-	if (nr_pages > 1)
-		goto direct_uncharge;
-
-	/*
-	 * In typical case, batch->memcg == mem. This means we can
-	 * merge a series of uncharges to an uncharge of res_counter.
-	 * If not, we uncharge res_counter ony by one.
-	 */
-	if (batch->memcg != memcg)
-		goto direct_uncharge;
-	/* remember freed charge and uncharge it later */
-	batch->nr_pages++;
-	if (uncharge_memsw)
-		batch->memsw_nr_pages++;
-	return;
-direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (unlikely(batch->memcg != memcg))
-		memcg_oom_recover(memcg);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
-			     bool end_migration)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!PageCgroupUsed(pc)))
-		return NULL;
-
-	lock_page_cgroup(pc);
-
-	memcg = pc->mem_cgroup;
-
-	if (!PageCgroupUsed(pc))
-		goto unlock_out;
-
-	anon = PageAnon(page);
-
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_ANON:
-		/*
-		 * Generally PageAnon tells if it's the anon statistics to be
-		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
-		 * used before page reached the stage of being marked PageAnon.
-		 */
-		anon = true;
-		/* fallthrough */
-	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		/* See mem_cgroup_prepare_migration() */
-		if (page_mapped(page))
-			goto unlock_out;
-		/*
-		 * Pages under migration may not be uncharged.  But
-		 * end_migration() /must/ be the one uncharging the
-		 * unused post-migration page and so it has to call
-		 * here with the migration bit still set.  See the
-		 * res_counter handling below.
-		 */
-		if (!end_migration && PageCgroupMigration(pc))
-			goto unlock_out;
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
-		if (!PageAnon(page)) {	/* Shared memory */
-			if (page->mapping && !page_is_file_cache(page))
-				goto unlock_out;
-		} else if (page_mapped(page)) /* Anon */
-				goto unlock_out;
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
-
-	ClearPageCgroupUsed(pc);
-	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
-	 */
-
-	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
-	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
-	/*
-	 * Migration does not charge the res_counter for the
-	 * replacement page, so leave it alone when phasing out the
-	 * page that is unused after the migration.
-	 */
-	if (!end_migration)
-		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
-
-	return memcg;
-
-unlock_out:
-	unlock_page_cgroup(pc);
-	return NULL;
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	/*
-	 * If the page is in swap cache, uncharge should be deferred
-	 * to the swap path, which also properly accounts swap usage
-	 * and handles memcg lifetime.
-	 *
-	 * Note that this check is not stable and reclaim may add the
-	 * page to swap cache at any time after this.  However, if the
-	 * page is not in swap cache by the time page->mapcount hits
-	 * 0, there won't be any page table references to the swap
-	 * slot, and reclaim will free it and not actually write the
-	 * page to disk.
-	 */
-	if (PageSwapCache(page))
-		return;
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping, page);
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
-}
-
 /*
 * Batch_start/batch_end are called in unmap_page_range/invalidate/truncate.
 * In those cases, pages are freed continuously and we can expect pages
@@ -3814,6 +3618,9 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
 
 void mem_cgroup_uncharge_start(void)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
 	current->memcg_batch.do_batch++;
 	/* We can nest. */
 	if (current->memcg_batch.do_batch == 1) {
@@ -3821,21 +3628,18 @@ void mem_cgroup_uncharge_start(void)
 		current->memcg_batch.nr_pages = 0;
 		current->memcg_batch.memsw_nr_pages = 0;
 	}
+	local_irq_restore(flags);
 }
 
 void mem_cgroup_uncharge_end(void)
 {
 	struct memcg_batch_info *batch = &current->memcg_batch;
+	unsigned long flags;
 
-	if (!batch->do_batch)
-		return;
-
-	batch->do_batch--;
-	if (batch->do_batch) /* If stacked, do nothing. */
-		return;
-
-	if (!batch->memcg)
-		return;
+	local_irq_save(flags);
+	VM_BUG_ON(!batch->do_batch);
+	if (--batch->do_batch) /* If stacked, do nothing */
+		goto out;
 	/*
 	 * This "batch->memcg" is valid without any css_get/put etc...
 	 * because we hide charges behind us.
@@ -3847,61 +3651,16 @@ void mem_cgroup_uncharge_end(void)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
 	memcg_oom_recover(batch->memcg);
-	/* forget this pointer (for sanity check) */
-	batch->memcg = NULL;
-}
-
-#ifdef CONFIG_SWAP
-/*
- * called after __delete_from_swap_cache() and drop "page" account.
- * memcg information is recorded to swap_cgroup of "ent"
- */
-void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
-{
-	struct mem_cgroup *memcg;
-	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
-
-	if (!swapout) /* this was a swap cache but the swap is unused ! */
-		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
-
-	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
-
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
-	if (do_swap_account && swapout && memcg)
-		swap_cgroup_record(ent, mem_cgroup_id(memcg));
+out:
+	local_irq_restore(flags);
 }
-#endif
 
 #ifdef CONFIG_MEMCG_SWAP
-/*
- * called from swap_entry_free(). remove record in swap_cgroup and
- * uncharge "memsw" account.
- */
-void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
+					 bool charge)
 {
-	struct mem_cgroup *memcg;
-	unsigned short id;
-
-	if (!do_swap_account)
-		return;
-
-	id = swap_cgroup_record(ent, 0);
-	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
-	if (memcg) {
-		/*
-		 * We uncharge this because swap is freed.  This memcg can
-		 * be obsolete one. We avoid calling css_tryget_online().
-		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
-	}
-	rcu_read_unlock();
+	int val = (charge) ? 1 : -1;
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -3953,169 +3712,6 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
-/*
- * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
- * page belongs to.
- */
-void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-				  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-
-	*memcgp = NULL;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	if (PageTransHuge(page))
-		nr_pages <<= compound_order(page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		css_get(&memcg->css);
-		/*
-		 * At migrating an anonymous page, its mapcount goes down
-		 * to 0 and uncharge() will be called. But, even if it's fully
-		 * unmapped, migration may fail and this page has to be
-		 * charged again. We set MIGRATION flag here and delay uncharge
-		 * until end_migration() is called
-		 *
-		 * Corner Case Thinking
-		 * A)
-		 * When the old page was mapped as Anon and it's unmap-and-freed
-		 * while migration was ongoing.
-		 * If unmap finds the old page, uncharge() of it will be delayed
-		 * until end_migration(). If unmap finds a new page, it's
-		 * uncharged when it make mapcount to be 1->0. If unmap code
-		 * finds swap_migration_entry, the new page will not be mapped
-		 * and end_migration() will find it(mapcount==0).
-		 *
-		 * B)
-		 * When the old page was mapped but migraion fails, the kernel
-		 * remaps it. A charge for it is kept by MIGRATION flag even
-		 * if mapcount goes down to 0. We can do remap successfully
-		 * without charging it again.
-		 *
-		 * C)
-		 * The "old" page is under lock_page() until the end of
-		 * migration, so, the old page itself will not be swapped-out.
-		 * If the new page is swapped out before end_migraton, our
-		 * hook to usual swap-out path will catch the event.
-		 */
-		if (PageAnon(page))
-			SetPageCgroupMigration(pc);
-	}
-	unlock_page_cgroup(pc);
-	/*
-	 * If the page is not charged at this point,
-	 * we return here.
-	 */
-	if (!memcg)
-		return;
-
-	*memcgp = memcg;
-	/*
-	 * We charge new page before it's used/mapped. So, even if unlock_page()
-	 * is called before end_migration, we can catch all events on this new
-	 * page. In the case new page is migrated but not remapped, new page's
-	 * mapcount will be finally 0 and we call uncharge in end_migration().
-	 */
-	/*
-	 * The page is committed to the memcg, but it's not actually
-	 * charged to the res_counter since we plan on replacing the
-	 * old one and only one page is going to be left afterwards.
-	 */
-	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-	struct page *used, *unused;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (!memcg)
-		return;
-
-	if (!migration_ok) {
-		used = oldpage;
-		unused = newpage;
-	} else {
-		used = newpage;
-		unused = oldpage;
-	}
-	anon = PageAnon(used);
-	__mem_cgroup_uncharge_common(unused,
-				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
-				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
-				     true);
-	css_put(&memcg->css);
-	/*
-	 * We disallowed uncharge of pages under migration because mapcount
-	 * of the page goes down to zero, temporarly.
-	 * Clear the flag and check the page should be charged.
-	 */
-	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
-	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * If a page is a file cache, radix-tree replacement is very atomic
-	 * and we can skip this check. When it was an Anon page, its mapcount
-	 * goes down to 0. But because we added MIGRATION flage, it's not
-	 * uncharged yet. There are several case but page->mapcount check
-	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
-	 * check. (see prepare_charge() also)
-	 */
-	if (anon)
-		mem_cgroup_uncharge_page(used);
-}
-
-/*
- * At replace page cache, newpage is not under any memcg but it's on
- * LRU. So, this function doesn't touch res_counter but handles LRU
- * in correct way. Both pages are locked so we cannot race with uncharge.
- */
-void mem_cgroup_replace_page_cache(struct page *oldpage,
-				  struct page *newpage)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(oldpage);
-	/* fix accounting on old pages */
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * When called from shmem_replace_page(), in some cases the
-	 * oldpage has already been charged, and in some cases not.
-	 */
-	if (!memcg)
-		return;
-	/*
-	 * Even if newpage->mapping was NULL before starting replacement,
-	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
-	 * LRU while we overwrite pc->mem_cgroup.
-	 */
-	commit_charge(newpage, memcg, 1, false, true);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -4314,7 +3910,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
+		spin_lock_irq(&mctz->lock);
 
 		/*
 		 * If we failed to reclaim anything from this memory cgroup
@@ -4354,7 +3950,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 		 */
 		/* If excess == 0, no tree ops */
 		__mem_cgroup_insert_exceeded(mz, mctz, excess);
-		spin_unlock(&mctz->lock);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&mz->memcg->css);
 		loop++;
 		/*
@@ -6290,9 +5886,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
-		 * Do only loose check w/o page_cgroup lock.
-		 * mem_cgroup_move_account() checks the pc is valid or not under
-		 * the lock.
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the pc is valid or
+		 * not under LRU exclusion.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
@@ -6751,6 +6347,67 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+#ifdef CONFIG_MEMCG_SWAP
+/**
+ * mem_cgroup_swapout - transfer a memsw charge to swap
+ * @page: page whose memsw charge to transfer
+ * @entry: swap entry to move the charge to
+ *
+ * Transfer the memsw charge of @page to @entry.
+ */
+void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+	struct page_cgroup *pc;
+	unsigned short oldid;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (!do_swap_account)
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Readahead page, never charged */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	VM_BUG_ON_PAGE(oldid, page);
+
+	pc->flags &= ~PCG_MEMSW;
+	css_get(&pc->mem_cgroup->css);
+	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
+}
+
+/**
+ * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * @entry: swap entry to uncharge
+ *
+ * Drop the memsw charge associated with @entry.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	if (!do_swap_account)
+		return;
+
+	id = swap_cgroup_record(entry, 0);
+	rcu_read_lock();
+	memcg = mem_cgroup_lookup(id);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(memcg, false);
+		css_put(&memcg->css);
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
@@ -6853,7 +6510,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+	commit_charge(page, memcg, nr_pages, lrucare);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6895,6 +6552,123 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_uncharge - uncharge a page
+ * @page: page to uncharge
+ *
+ * Uncharge a page previously charged with mem_cgroup_try_charge() and
+ * mem_cgroup_commit_charge().
+ */
+void mem_cgroup_uncharge(struct page *page)
+{
+	struct memcg_batch_info *batch;
+	unsigned int nr_pages = 1;
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+	unsigned long pc_flags;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Every final put_page() ends up here */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point, we have fully
+	 * exclusive access to the page.
+	 */
+	memcg = pc->mem_cgroup;
+	pc_flags = pc->flags;
+	pc->flags = 0;
+
+	local_irq_save(flags);
+
+	if (nr_pages > 1)
+		goto direct;
+	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+		goto direct;
+	batch = &current->memcg_batch;
+	if (!batch->do_batch)
+		goto direct;
+	if (batch->memcg && batch->memcg != memcg)
+		goto direct;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	if (pc_flags & PCG_MEM)
+		batch->nr_pages++;
+	if (pc_flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	goto out;
+direct:
+	if (pc_flags & PCG_MEM)
+		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	if (pc_flags & PCG_MEMSW)
+		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+	memcg_oom_recover(memcg);
+out:
+	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
+	memcg_check_events(memcg, page);
+
+	local_irq_restore(flags);
+}
+
+/**
+ * mem_cgroup_migrate - migrate a charge to another page
+ * @oldpage: currently charged page
+ * @newpage: page to transfer the charge to
+ * @lrucare: page might be on LRU already
+ *
+ * Migrate the charge from @oldpage to @newpage.
+ *
+ * Both pages must be locked, @newpage->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare)
+{
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(oldpage);
+	if (!PageCgroupUsed(pc))
+		return;
+
+	/* Already migrated */
+	if (!(pc->flags & PCG_MEM))
+		return;
+
+	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
+	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
+
+	if (PageTransHuge(oldpage)) {
+		nr_pages <<= compound_order(oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
+	}
+
+	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/memory.c b/mm/memory.c
index 16bce85947dc..d9a1f1038982 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		details = NULL;
 
 	BUG_ON(addr >= end);
-	mem_cgroup_uncharge_start();
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
-	mem_cgroup_uncharge_end();
 }
 
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 63f0cd559999..9da3cf84d30a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		newpage->mapping = NULL;
+		if (!PageAnon(newpage))
+			newpage->mapping = NULL;
 	} else {
+		mem_cgroup_migrate(page, newpage, false);
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
-		page->mapping = NULL;
+		if (!PageAnon(page))
+			page->mapping = NULL;
 	}
 
 	unlock_page(newpage);
@@ -797,7 +800,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
-	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
@@ -823,9 +825,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		lock_page(page);
 	}
 
-	/* charge against new page */
-	mem_cgroup_prepare_migration(page, newpage, &mem);
-
 	if (PageWriteback(page)) {
 		/*
 		 * Only in the case of a full synchronous migration is it
@@ -835,10 +834,10 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 */
 		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
-			goto uncharge;
+			goto out_unlock;
 		}
 		if (!force)
-			goto uncharge;
+			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -874,7 +873,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 			 */
 			remap_swapcache = 0;
 		} else {
-			goto uncharge;
+			goto out_unlock;
 		}
 	}
 
@@ -887,7 +886,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the page migration right away (protected by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto uncharge;
+		goto out_unlock;
 	}
 
 	/*
@@ -906,7 +905,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto uncharge;
+			goto out_unlock;
 		}
 		goto skip_unmap;
 	}
@@ -925,10 +924,7 @@ skip_unmap:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-uncharge:
-	mem_cgroup_end_migration(mem, page, newpage,
-				 (rc == MIGRATEPAGE_SUCCESS ||
-				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
+out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1787,7 +1783,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
@@ -1853,15 +1848,6 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	/*
-	 * Traditional migration needs to prepare the memcg charge
-	 * transaction early to prevent the old page from being
-	 * uncharged when installing migration entries.  Here we can
-	 * save the potential rollback and start the charge transfer
-	 * only when migration is already known to end successfully.
-	 */
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
@@ -1889,14 +1875,10 @@ fail_putback:
 		goto fail_putback;
 	}
 
+	mem_cgroup_migrate(page, new_page, false);
+
 	page_remove_rmap(page);
 
-	/*
-	 * Finish the charge transaction under the page table lock to
-	 * prevent split_huge_page() from dividing up the charge
-	 * before it's fully transferred to the new page.
-	 */
-	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index d6673ebd0108..a930392b0454 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1085,7 +1085,6 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		mem_cgroup_uncharge_page(page);
 		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
diff --git a/mm/shmem.c b/mm/shmem.c
index da0fc83af9d5..498c8cfac48d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -405,7 +405,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -433,7 +432,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -484,7 +482,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -511,7 +508,6 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 
@@ -809,7 +805,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -982,7 +978,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 		 */
 		oldpage = newpage;
 	} else {
-		mem_cgroup_replace_page_cache(oldpage, newpage);
+		mem_cgroup_migrate(oldpage, newpage, false);
 		lru_cache_add_anon(newpage);
 		*pagep = newpage;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index a98f48626359..3074210f245d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
@@ -915,6 +916,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	mem_cgroup_uncharge_start();
+
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
@@ -946,6 +949,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
+		mem_cgroup_uncharge(page);
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
@@ -955,6 +959,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	mem_cgroup_uncharge_end();
+
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
 EXPORT_SYMBOL(release_pages);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2972eee184a4..e160151da6b8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free(entry);
 			return 0;
 		}
 
@@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 		return 0;
 	}
 }
@@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free(entry);
 	page_cache_release(page);
 }
 
@@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0883b4912ff7..8798b2e0ac59 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	p = swap_info_get(entry);
 	if (p) {
-		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		if (page)
-			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
+		swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 6a78c814bebf..b352481c276d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -367,7 +365,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -389,7 +386,6 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 	cleancache_invalidate_inode(mapping);
@@ -488,7 +484,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -517,7 +512,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -548,7 +542,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -597,7 +590,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -650,7 +642,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 71f23c0c1090..98234e9ccb5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -571,9 +571,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -594,7 +595,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__clear_page_locked(page);
 free_it:
+		mem_cgroup_uncharge(page);
 		nr_reclaimed++;
 
 		/*
@@ -1126,12 +1127,13 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	mem_cgroup_uncharge_end();
 
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
-	mem_cgroup_uncharge_end();
+
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_congested += nr_congested;
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
@@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
@@ -1652,6 +1656,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
diff --git a/mm/zswap.c b/mm/zswap.c
index 008388fe7b0f..333d70c66093 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
-- 
2.0.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 141+ messages in thread
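
A caller-side sketch of the charge half of the reworked API that this
uncharge rewrite pairs with.  This is assembled from reading the series
as a whole; the exact signatures are not quoted in this message, so
treat them as assumed:

	struct mem_cgroup *memcg;

	if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg))
		goto err;	/* charge failed, back out of the fault */

	/* ... install the page in the page table or page cache ... */

	if (installed)
		mem_cgroup_commit_charge(page, memcg, lrucare);
	else
		mem_cgroup_cancel_charge(page, memcg);

	/* the final put_page() later ends up in mem_cgroup_uncharge() */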

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15  8:25     ` Michal Hocko
@ 2014-07-15 12:19       ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15 12:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

[...]
> +/**
> + * mem_cgroup_migrate - migrate a charge to another page
> + * @oldpage: currently charged page
> + * @newpage: page to transfer the charge to
> + * @lrucare: page might be on LRU already

which one? I guess the newpage?
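
(If it is indeed the new page, a clarified line might read:

	 * @lrucare: @newpage might be on the LRU already

-- a suggested wording, not what the patch currently says.)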

> + *
> + * Migrate the charge from @oldpage to @newpage.
> + *
> + * Both pages must be locked, @newpage->mapping must be set up.
> + */
> +void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> +			bool lrucare)
> +{
> +	unsigned int nr_pages = 1;
> +	struct page_cgroup *pc;
> +
> +	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> +	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);

	VM_BUG_ON_PAGE(PageLRU(newpage) && !lrucare, newpage);

> +	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	pc = lookup_page_cgroup(oldpage);
> +	if (!PageCgroupUsed(pc))
> +		return;
> +
> +	/* Already migrated */
> +	if (!(pc->flags & PCG_MEM))
> +		return;
> +
> +	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
> +	pc->flags &= ~(PCG_MEM | PCG_MEMSW);

What about PCG_USED?
Wouldn't we uncharge the currently transferred charge when oldpage does
its last put_page when the migration is done?

On a not directly related note. I was quite surprised to see that
__unmap_and_move calls putback_lru_page on oldpage even when migration
succeeded. So it goes through mem_cgroup_page_lruvec which checks
PCG_USED and resets pc->mem_cgroup to root for !PCG_USED.
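
For reference, a condensed sketch of the mem_cgroup_uncharge() path
from the patch, annotated with the flag state at oldpage's final
put_page(); the annotations are my reading, not the author's:

	if (!PageCgroupUsed(pc))	/* PCG_USED still set: proceed */
		return;
	memcg = pc->mem_cgroup;
	pc_flags = pc->flags;		/* PCG_MEM/PCG_MEMSW cleared by
					 * mem_cgroup_migrate() */
	pc->flags = 0;
	...
	if (pc_flags & PCG_MEM)		/* false: res counter untouched */
		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
	if (pc_flags & PCG_MEMSW)	/* false */
		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
	...
	mem_cgroup_charge_statistics(memcg, page, -nr_pages);	/* still runs */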

> +
> +	if (PageTransHuge(oldpage)) {
> +		nr_pages <<= compound_order(oldpage);
> +		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
> +		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
> +	}
> +
> +	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
> +}
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15  8:25     ` Michal Hocko
  (?)
@ 2014-07-15 14:23       ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15 14:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue 15-07-14 10:25:45, Michal Hocko wrote:
[...]
> diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
> index bcf750d3cecd..8870b0212150 100644
> --- a/Documentation/cgroups/memcg_test.txt
> +++ b/Documentation/cgroups/memcg_test.txt
[...]
>  6. Shmem(tmpfs) Page Cache
> -	Memcg's charge/uncharge have special handlers of shmem. The best way
> -	to understand shmem's page state transition is to read mm/shmem.c.
> +	The best way to understand shmem's page state transition is to read
> +	mm/shmem.c.

:D

[...]
>  7. Page Migration
> -   	One of the most complicated functions is page-migration-handler.
> -	Memcg has 2 routines. Assume that we are migrating a page's contents
> -	from OLDPAGE to NEWPAGE.
> -
> -	Usual migration logic is..
> -	(a) remove the page from LRU.
> -	(b) allocate NEWPAGE (migration target)
> -	(c) lock by lock_page().
> -	(d) unmap all mappings.
> -	(e-1) If necessary, replace entry in radix-tree.
> -	(e-2) move contents of a page.
> -	(f) map all mappings again.
> -	(g) pushback the page to LRU.
> -	(-) OLDPAGE will be freed.
> -
> -	Before (g), memcg should complete all necessary charge/uncharge to
> -	NEWPAGE/OLDPAGE.
> -
> -	The point is....
> -	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
> -          try_to_unmap() drops all mapcount and the page will not be
> -	  SwapCache.
> -
> -	- If OLDPAGE is SwapCache, charges will be kept at (g) because
> -	  __delete_from_swap_cache() isn't called at (e-1)
> -
> -	- If OLDPAGE is page-cache, charges will be kept at (g) because
> -	  remove_from_swap_cache() isn't called at (e-1)
> -
> -	memcg provides following hooks.
> -
> -	- mem_cgroup_prepare_migration(OLDPAGE)
> -	  Called after (b) to account a charge (usage += PAGE_SIZE) against
> -	  memcg which OLDPAGE belongs to.
> -
> -        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
> -	  Called after (f) before (g).
> -	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
> -	  charged, a charge by prepare_migration() is automatically canceled.
> -	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
> -
> -	  But zap_pte() (by exit or munmap) can be called while migration,
> -	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
> +
> +	mem_cgroup_migrate()

This doesn't tell us anything about the page migration. On the other
hand I am not entirely sure the documentation here is very helpful.
There is some outdated information. I wouldn't be opposed to removing
everything up to the "9. Typical Tests." section, which should be the
primary target of the file anyway.

>  8. LRU
>          Each memcg has its own private LRU. Now, its handling is under global
[...]
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 290905133078..94fd0b23f3f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
>  }
>  #endif
>  #ifdef CONFIG_MEMCG_SWAP
> -extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> +extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> +extern void mem_cgroup_uncharge_swap(swp_entry_t entry);

Wouldn't it be nicer to have those two with symmetric names?
mem_cgroup_{un}charge_swap?
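
That is, hypothetically:

	extern void mem_cgroup_charge_swap(struct page *page, swp_entry_t entry);
	extern void mem_cgroup_uncharge_swap(swp_entry_t entry);

(an illustration of the suggested naming only, not code from the series)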

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fe17420afdc7..e4afdbdda0a7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>  		spin_unlock_irq(&zone->lru_lock);
>  	}
>  
> -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> -	unlock_page_cgroup(pc);
> -
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
>  	/*
>  	 * "charge_statistics" updated event counter. Then, check it.
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
>  	 * if they exceed the softlimit.
>  	 */
>  	memcg_check_events(memcg, page);
> +	local_irq_enable();

preempt_{enable,disable} should be sufficient for
mem_cgroup_charge_statistics and memcg_check_events, no?
The first one is about per-cpu accounting (and that should be atomic
wrt. IRQ on the same CPU) and the latter one uses IRQ-safe locks down in
mem_cgroup_update_tree.

Not that it would matter much; it is just surprising.
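
For illustration, the scenario that local_irq_disable() closes if the
per-cpu update is not in fact IRQ-atomic -- a sketch under that
assumption, with an uncharge running from interrupt context:

	preempt_disable();
	/* __this_cpu_add() is a plain load-add-store on most arches */
	__this_cpu_add(memcg->stat->count[idx], nr_pages);
	/*
	 * An IRQ arriving between the load and the store that does its
	 * own __this_cpu_add() on the same counter loses one of the two
	 * updates; disabling preemption does not keep the IRQ off this
	 * CPU.
	 */
	preempt_enable();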

>  }
>  
>  static DEFINE_MUTEX(set_limit_mutex);
[...]
> @@ -3533,20 +3521,23 @@ static int mem_cgroup_move_account(struct page *page,
>  			       nr_pages);
>  	}
>  
> -	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
> +	/*
> +	 * It is safe to change pc->mem_cgroup here because the page
> +	 * is referenced, charged, and isolated - we can't race with
> +	 * uncharging, charging, migration, or LRU putback.
> +	 */
>  
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
> -	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
>  	move_unlock_mem_cgroup(from, &flags);
>  	ret = 0;
> -unlock:
> -	unlock_page_cgroup(pc);
> -	/*
> -	 * check events
> -	 */
> +
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(to, page, nr_pages);
>  	memcg_check_events(to, page);
> +	mem_cgroup_charge_statistics(from, page, -nr_pages);
>  	memcg_check_events(from, page);
> +	local_irq_enable();
>  out:
>  	return ret;
>  }

Same here.

[...]
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 63f0cd559999..9da3cf84d30a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		rc = fallback_migrate_page(mapping, newpage, page, mode);
>  
>  	if (rc != MIGRATEPAGE_SUCCESS) {
> -		newpage->mapping = NULL;
> +		if (!PageAnon(newpage))
> +			newpage->mapping = NULL;

OK, I am probably washed out from looking into this for too long, but I
cannot figure out why you have done this...

>  	} else {
> +		mem_cgroup_migrate(page, newpage, false);
>  		if (remap_swapcache)
>  			remove_migration_ptes(page, newpage);
> -		page->mapping = NULL;
> +		if (!PageAnon(page))
> +			page->mapping = NULL;
>  	}
>  
>  	unlock_page(newpage);

[...]
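
One possible reading, from the diff itself: the 'anon' argument is gone,
so the uncharge path derives anon vs. file from the page at the final
put_page():

	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
		/* internally tests PageAnon(page), i.e. page->mapping */

which would require an anon page's ->mapping to stay intact until that
last uncharge -- hence, presumably, the PageAnon() checks above.  (My
reading, not confirmed by the author.)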

The semantics are much cleaner now. I have to digest the details of the
patch because it is really huge. But nothing really jumped out at me during
the review (except for a few minor things mentioned here and one mentioned
in another email regarding the USED flag).

Good work! 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
@ 2014-07-15 14:23       ` Michal Hocko
  0 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15 14:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue 15-07-14 10:25:45, Michal Hocko wrote:
[...]
> diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
> index bcf750d3cecd..8870b0212150 100644
> --- a/Documentation/cgroups/memcg_test.txt
> +++ b/Documentation/cgroups/memcg_test.txt
[...]
>  6. Shmem(tmpfs) Page Cache
> -	Memcg's charge/uncharge have special handlers of shmem. The best way
> -	to understand shmem's page state transition is to read mm/shmem.c.
> +	The best way to understand shmem's page state transition is to read
> +	mm/shmem.c.

:D

[...]
>  7. Page Migration
> -   	One of the most complicated functions is page-migration-handler.
> -	Memcg has 2 routines. Assume that we are migrating a page's contents
> -	from OLDPAGE to NEWPAGE.
> -
> -	Usual migration logic is..
> -	(a) remove the page from LRU.
> -	(b) allocate NEWPAGE (migration target)
> -	(c) lock by lock_page().
> -	(d) unmap all mappings.
> -	(e-1) If necessary, replace entry in radix-tree.
> -	(e-2) move contents of a page.
> -	(f) map all mappings again.
> -	(g) pushback the page to LRU.
> -	(-) OLDPAGE will be freed.
> -
> -	Before (g), memcg should complete all necessary charge/uncharge to
> -	NEWPAGE/OLDPAGE.
> -
> -	The point is....
> -	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
> -          try_to_unmap() drops all mapcount and the page will not be
> -	  SwapCache.
> -
> -	- If OLDPAGE is SwapCache, charges will be kept at (g) because
> -	  __delete_from_swap_cache() isn't called at (e-1)
> -
> -	- If OLDPAGE is page-cache, charges will be kept at (g) because
> -	  remove_from_swap_cache() isn't called at (e-1)
> -
> -	memcg provides following hooks.
> -
> -	- mem_cgroup_prepare_migration(OLDPAGE)
> -	  Called after (b) to account a charge (usage += PAGE_SIZE) against
> -	  memcg which OLDPAGE belongs to.
> -
> -        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
> -	  Called after (f) before (g).
> -	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
> -	  charged, a charge by prepare_migration() is automatically canceled.
> -	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
> -
> -	  But zap_pte() (by exit or munmap) can be called while migration,
> -	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
> +
> +	mem_cgroup_migrate()

This doesn't tell us anything abouta the page migration. On the other
hand I am not entirely sure the documentation here is very much helpful.
There is some outdated information. I wouldn't be opposed to remove
everything up to "9. Typical Tests." section which should be the primary
target of the file anyway.

>  8. LRU
>          Each memcg has its own private LRU. Now, its handling is under global
[...]
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 290905133078..94fd0b23f3f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
>  }
>  #endif
>  #ifdef CONFIG_MEMCG_SWAP
> -extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> +extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> +extern void mem_cgroup_uncharge_swap(swp_entry_t entry);

Wouldn't it be nicer to have those two with symmetric names?
mem_cgroup_{un}charge_swap?

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fe17420afdc7..e4afdbdda0a7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>  		spin_unlock_irq(&zone->lru_lock);
>  	}
>  
> -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> -	unlock_page_cgroup(pc);
> -
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
>  	/*
>  	 * "charge_statistics" updated event counter. Then, check it.
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
>  	 * if they exceeds softlimit.
>  	 */
>  	memcg_check_events(memcg, page);
> +	local_irq_enable();

preempt_{enable,disbale} should be sufficient for
mem_cgroup_charge_statistics and memcg_check_events no?
The first one is about per-cpu accounting (and that should be atomic
wrt. IRQ on the same CPU) and the later one uses IRQ safe locks down in
mem_cgroup_update_tree.

Not that it would matter much it is just surprising.

>  }
>  
>  static DEFINE_MUTEX(set_limit_mutex);
[...]
> @@ -3533,20 +3521,23 @@ static int mem_cgroup_move_account(struct page *page,
>  			       nr_pages);
>  	}
>  
> -	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
> +	/*
> +	 * It is safe to change pc->mem_cgroup here because the page
> +	 * is referenced, charged, and isolated - we can't race with
> +	 * uncharging, charging, migration, or LRU putback.
> +	 */
>  
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
> -	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
>  	move_unlock_mem_cgroup(from, &flags);
>  	ret = 0;
> -unlock:
> -	unlock_page_cgroup(pc);
> -	/*
> -	 * check events
> -	 */
> +
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(to, page, nr_pages);
>  	memcg_check_events(to, page);
> +	mem_cgroup_charge_statistics(from, page, -nr_pages);
>  	memcg_check_events(from, page);
> +	local_irq_enable();
>  out:
>  	return ret;
>  }

Same here.

[...]
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 63f0cd559999..9da3cf84d30a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		rc = fallback_migrate_page(mapping, newpage, page, mode);
>  
>  	if (rc != MIGRATEPAGE_SUCCESS) {
> -		newpage->mapping = NULL;
> +		if (!PageAnon(newpage))
> +			newpage->mapping = NULL;

OK, I am probably washed out from looking at this for too long, but I
cannot figure out why you have done this...

>  	} else {
> +		mem_cgroup_migrate(page, newpage, false);
>  		if (remap_swapcache)
>  			remove_migration_ptes(page, newpage);
> -		page->mapping = NULL;
> +		if (!PageAnon(page))
> +			page->mapping = NULL;
>  	}
>  
>  	unlock_page(newpage);

[...]

The semantic is much cleaner now. I have to digest details about the
patch because it is really huge. But nothing really jumped at me during
the review (except for few minor things mentioned here and one mentioned
in other email regarding USED flag).

Good work! 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 14:23       ` Michal Hocko
@ 2014-07-15 15:09         ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-15 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 04:23:50PM +0200, Michal Hocko wrote:
> On Tue 15-07-14 10:25:45, Michal Hocko wrote:
> [...]
> > diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
> > index bcf750d3cecd..8870b0212150 100644
> > --- a/Documentation/cgroups/memcg_test.txt
> > +++ b/Documentation/cgroups/memcg_test.txt
> [...]
> >  6. Shmem(tmpfs) Page Cache
> > -	Memcg's charge/uncharge have special handlers of shmem. The best way
> > -	to understand shmem's page state transition is to read mm/shmem.c.
> > +	The best way to understand shmem's page state transition is to read
> > +	mm/shmem.c.
> 
> :D
> 
> [...]
> >  7. Page Migration
> > -   	One of the most complicated functions is page-migration-handler.
> > -	Memcg has 2 routines. Assume that we are migrating a page's contents
> > -	from OLDPAGE to NEWPAGE.
> > -
> > -	Usual migration logic is..
> > -	(a) remove the page from LRU.
> > -	(b) allocate NEWPAGE (migration target)
> > -	(c) lock by lock_page().
> > -	(d) unmap all mappings.
> > -	(e-1) If necessary, replace entry in radix-tree.
> > -	(e-2) move contents of a page.
> > -	(f) map all mappings again.
> > -	(g) pushback the page to LRU.
> > -	(-) OLDPAGE will be freed.
> > -
> > -	Before (g), memcg should complete all necessary charge/uncharge to
> > -	NEWPAGE/OLDPAGE.
> > -
> > -	The point is....
> > -	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
> > -          try_to_unmap() drops all mapcount and the page will not be
> > -	  SwapCache.
> > -
> > -	- If OLDPAGE is SwapCache, charges will be kept at (g) because
> > -	  __delete_from_swap_cache() isn't called at (e-1)
> > -
> > -	- If OLDPAGE is page-cache, charges will be kept at (g) because
> > -	  remove_from_swap_cache() isn't called at (e-1)
> > -
> > -	memcg provides following hooks.
> > -
> > -	- mem_cgroup_prepare_migration(OLDPAGE)
> > -	  Called after (b) to account a charge (usage += PAGE_SIZE) against
> > -	  memcg which OLDPAGE belongs to.
> > -
> > -        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
> > -	  Called after (f) before (g).
> > -	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
> > -	  charged, a charge by prepare_migration() is automatically canceled.
> > -	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
> > -
> > -	  But zap_pte() (by exit or munmap) can be called while migration,
> > -	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
> > +
> > +	mem_cgroup_migrate()
> 
> This doesn't tell us anything about the page migration. On the other
> hand I am not entirely sure the documentation here is very helpful.
> There is some outdated information. I wouldn't be opposed to removing
> everything up to the "9. Typical Tests." section, which should be the
> primary target of the file anyway.

Yeah, documentation of the implementation should be directly in the
source code and this file is kind of pointless.  So all I did there
was remove things that were wrong after my changes.  But I agree it
can probably be removed completely.

> > @@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
> >  }
> >  #endif
> >  #ifdef CONFIG_MEMCG_SWAP
> > -extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> > +extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> > +extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
> 
> Wouldn't it be nicer to have those two with symmetric names?
> mem_cgroup_{un}charge_swap?

I thought about that when I wrote them, but their operation is not
actually symmetrical.  The first one moves a memsw charge from a page
to a swap entry when the page gets reclaimed (rather than when the
swap entry is allocated), while the second one uncharges the swap
entry once it is released.
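
For illustration, the intended charge lifetime as a minimal sketch
(mem_cgroup_swapout() and mem_cgroup_uncharge_swap() are the functions
from this series; the call sites shown are simplified assumptions, not
the exact kernel paths):

	/*
	 * Reclaim writes a charged page out to swap; the memsw charge
	 * is transferred from the page to its swap entry:
	 */
	mem_cgroup_swapout(page, entry);

	/*
	 * Much later, when the last reference to the swap slot is
	 * dropped (from the swapfile code), the charge finally goes:
	 */
	mem_cgroup_uncharge_swap(entry);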

> > @@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> >  		spin_unlock_irq(&zone->lru_lock);
> >  	}
> >  
> > -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> > -	unlock_page_cgroup(pc);
> > -
> > +	local_irq_disable();
> > +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> >  	/*
> >  	 * "charge_statistics" updated event counter. Then, check it.
> >  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> >  	 * if they exceeds softlimit.
> >  	 */
> >  	memcg_check_events(memcg, page);
> > +	local_irq_enable();
> 
> preempt_{enable,disable} should be sufficient for
> mem_cgroup_charge_statistics and memcg_check_events, no?
> The first one is about per-cpu accounting (and that should be atomic
> wrt. IRQ on the same CPU) and the latter one uses IRQ-safe locks down in
> mem_cgroup_update_tree.

How could it be atomic wrt. IRQ on the local CPU when IRQs that modify
the counters can fire on the local CPU?

> > @@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >  		rc = fallback_migrate_page(mapping, newpage, page, mode);
> >  
> >  	if (rc != MIGRATEPAGE_SUCCESS) {
> > -		newpage->mapping = NULL;
> > +		if (!PageAnon(newpage))
> > +			newpage->mapping = NULL;
> 
> OK, I am probably washed out from looking at this for too long, but I
> cannot figure out why you have done this...

mem_cgroup_uncharge() relies on PageAnon() working.  Usually, anon
pages retain their page->mapping until they hit the page allocator;
the exception was old migration pages.
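
(For context: PageAnon() is encoded in the low bit of page->mapping, so
clearing page->mapping destroys that information.  A simplified sketch
of the test, close to the include/linux/mm.h of that era:)

	#define PAGE_MAPPING_ANON	1

	static inline int PageAnon(struct page *page)
	{
		return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
	}

Once page->mapping is cleared, a formerly anonymous page can no longer
be identified as such by the uncharge path.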

> >  	} else {
> > +		mem_cgroup_migrate(page, newpage, false);
> >  		if (remap_swapcache)
> >  			remove_migration_ptes(page, newpage);
> > -		page->mapping = NULL;
> > +		if (!PageAnon(page))
> > +			page->mapping = NULL;
> >  	}
> >  
> >  	unlock_page(newpage);
> 
> [...]
> 
> The semantic is much cleaner now. I have to digest details about the
> patch because it is really huge. But nothing really jumped at me during
> the review (except for few minor things mentioned here and one mentioned
> in other email regarding USED flag).
> 
> Good work! 

Thanks!

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 15:09         ` Johannes Weiner
@ 2014-07-15 15:18           ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15 15:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue 15-07-14 11:09:37, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 04:23:50PM +0200, Michal Hocko wrote:
> > On Tue 15-07-14 10:25:45, Michal Hocko wrote:
[...]
> > > @@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> > >  		spin_unlock_irq(&zone->lru_lock);
> > >  	}
> > >  
> > > -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> > > -	unlock_page_cgroup(pc);
> > > -
> > > +	local_irq_disable();
> > > +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> > >  	/*
> > >  	 * "charge_statistics" updated event counter. Then, check it.
> > >  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> > >  	 * if they exceeds softlimit.
> > >  	 */
> > >  	memcg_check_events(memcg, page);
> > > +	local_irq_enable();
> > 
> > preempt_{enable,disable} should be sufficient for
> > mem_cgroup_charge_statistics and memcg_check_events, no?
> > The first one is about per-cpu accounting (and that should be atomic
> > wrt. IRQ on the same CPU) and the latter one uses IRQ-safe locks down in
> > mem_cgroup_update_tree.
> 
> How could it be atomic wrt. IRQ on the local CPU when IRQs that modify
> the counters can fire on the local CPU?

I meant that __this_cpu_add and __this_cpu_inc should be atomic wrt. IRQ.
We do not care that an IRQ might jump in between two per-cpu operations.
This is racy from other CPUs anyway.

> 
> > > @@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> > >  		rc = fallback_migrate_page(mapping, newpage, page, mode);
> > >  
> > >  	if (rc != MIGRATEPAGE_SUCCESS) {
> > > -		newpage->mapping = NULL;
> > > +		if (!PageAnon(newpage))
> > > +			newpage->mapping = NULL;
> > 
> > OK, I am probably washed out from looking at this for too long, but I
> > cannot figure out why you have done this...
> 
> mem_cgroup_uncharge() relies on PageAnon() working.  Usually, anon
> pages retain their page->mapping until they hit the page allocator;
> the exception was old migration pages.

OK, got it now. I was surprised by a change in the !memcg path. Maybe
this is worth a comment?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 15:18           ` Michal Hocko
@ 2014-07-15 15:46             ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-15 15:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 05:18:18PM +0200, Michal Hocko wrote:
> On Tue 15-07-14 11:09:37, Johannes Weiner wrote:
> > On Tue, Jul 15, 2014 at 04:23:50PM +0200, Michal Hocko wrote:
> > > On Tue 15-07-14 10:25:45, Michal Hocko wrote:
> [...]
> > > > @@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> > > >  		spin_unlock_irq(&zone->lru_lock);
> > > >  	}
> > > >  
> > > > -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> > > > -	unlock_page_cgroup(pc);
> > > > -
> > > > +	local_irq_disable();
> > > > +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> > > >  	/*
> > > >  	 * "charge_statistics" updated event counter. Then, check it.
> > > >  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> > > >  	 * if they exceeds softlimit.
> > > >  	 */
> > > >  	memcg_check_events(memcg, page);
> > > > +	local_irq_enable();
> > > 
> > > preempt_{enable,disable} should be sufficient for
> > > mem_cgroup_charge_statistics and memcg_check_events, no?
> > > The first one is about per-cpu accounting (and that should be atomic
> > > wrt. IRQ on the same CPU) and the latter one uses IRQ-safe locks down in
> > > mem_cgroup_update_tree.
> > 
> > How could it be atomic wrt. IRQ on the local CPU when IRQs that modify
> > the counters can fire on the local CPU?
> 
> I meant that __this_cpu_add and __this_cpu_inc should be atomic wrt. IRQ.
> We do not care that an IRQ might jump in between two per-cpu operations.
> This is racy from other CPUs anyway.

It's really about a single RMW (+=) being interrupted by an IRQ.
this_cpu_ guarantees IRQ-atomicity, but __this_cpu_ does not.

We could probably migrate this code over to this_cpu and get rid of
the IRQ disabling later, because, as you said, we don't care about
being interrupted between separate counter updates.
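
A sketch of the race in question, assuming the generic (non-x86) percpu
implementation, where the += is not a single instruction:

	/* __this_cpu_add(counter, nr) expands to roughly: */
	unsigned long tmp;

	tmp = __this_cpu_read(counter);		/* read */
	/* an IRQ firing here can also modify 'counter' ... */
	__this_cpu_write(counter, tmp + nr);	/* ... and its update is lost */

	/* this_cpu_add(counter, nr) does the same RMW IRQ-atomically */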

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-06-18 20:40   ` Johannes Weiner
@ 2014-07-15 15:55     ` Naoya Horiguchi
  -1 siblings, 0 replies; 141+ messages in thread
From: Naoya Horiguchi @ 2014-07-15 15:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
...
> diff --git a/mm/swap.c b/mm/swap.c
> index a98f48626359..3074210f245d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
> +	mem_cgroup_uncharge(page);
>  }
>  
>  static void __put_single_page(struct page *page)

This seems to cause a list breakage in hstate->hugepage_activelist
when freeing a hugetlbfs page.
For hugetlbfs, we uncharge in free_huge_page() which is called after
__page_cache_release(), so I think that we don't have to uncharge here.

In my testing, moving mem_cgroup_uncharge() inside if (PageLRU) block
fixed the problem, so if that works for you, could you fold the change
into your patch?
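
(For reference, a sketch of that move; the body above the
del_page_from_lru_list() line is assumed from the 3.16-era mm/swap.c,
and only the position of the uncharge differs from the patch quoted
above:)

	static void __page_cache_release(struct page *page)
	{
		if (PageLRU(page)) {
			struct zone *zone = page_zone(page);
			struct lruvec *lruvec;
			unsigned long flags;

			spin_lock_irqsave(&zone->lru_lock, flags);
			lruvec = mem_cgroup_page_lruvec(page, zone);
			VM_BUG_ON_PAGE(!PageLRU(page), page);
			__ClearPageLRU(page);
			del_page_from_lru_list(page, lruvec, page_off_lru(page));
			spin_unlock_irqrestore(&zone->lru_lock, flags);
			/* moved inside the PageLRU branch: */
			mem_cgroup_uncharge(page);
		}
	}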

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 15:46             ` Johannes Weiner
@ 2014-07-15 15:56               ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15 15:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue 15-07-14 11:46:43, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 05:18:18PM +0200, Michal Hocko wrote:
> > On Tue 15-07-14 11:09:37, Johannes Weiner wrote:
> > > On Tue, Jul 15, 2014 at 04:23:50PM +0200, Michal Hocko wrote:
> > > > On Tue 15-07-14 10:25:45, Michal Hocko wrote:
> > [...]
> > > > > @@ -2760,15 +2752,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> > > > >  		spin_unlock_irq(&zone->lru_lock);
> > > > >  	}
> > > > >  
> > > > > -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> > > > > -	unlock_page_cgroup(pc);
> > > > > -
> > > > > +	local_irq_disable();
> > > > > +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> > > > >  	/*
> > > > >  	 * "charge_statistics" updated event counter. Then, check it.
> > > > >  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> > > > >  	 * if they exceeds softlimit.
> > > > >  	 */
> > > > >  	memcg_check_events(memcg, page);
> > > > > +	local_irq_enable();
> > > > 
> > > > preempt_{enable,disable} should be sufficient for
> > > > mem_cgroup_charge_statistics and memcg_check_events, no?
> > > > The first one is about per-cpu accounting (and that should be atomic
> > > > wrt. IRQ on the same CPU) and the latter one uses IRQ-safe locks down in
> > > > mem_cgroup_update_tree.
> > > 
> > > How could it be atomic wrt. IRQ on the local CPU when IRQs that modify
> > > the counters can fire on the local CPU?
> > 
> > I meant that __this_cpu_add and __this_cpu_inc should be atomic wrt. IRQ.
> > We do not care that an IRQ might jump in between two per-cpu operations.
> > This is racy from other CPUs anyway.
> 
> It's really about a single RMW (+=) being interrupted by an IRQ.
> this_cpu_ guarantees IRQ-atomicity, but __this_cpu_ does not.

Yes, you are right. I was too x86-centric, where both add and inc are
really a single instruction. The generic implementation already shows I
was wrong.

Sorry for the noise!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 15:55     ` Naoya Horiguchi
@ 2014-07-15 16:07       ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15 16:07 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Johannes Weiner, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> ...
> > diff --git a/mm/swap.c b/mm/swap.c
> > index a98f48626359..3074210f245d 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  	}
> > +	mem_cgroup_uncharge(page);
> >  }
> >  
> >  static void __put_single_page(struct page *page)
> 
> This seems to cause a list breakage in hstate->hugepage_activelist
> when freeing a hugetlbfs page.

This looks like fallout from
http://marc.info/?l=linux-mm&m=140475936311294&w=2

I didn't get to review this one, but the easiest fix seems to be to
check PageHuge and not call uncharge.

> For hugetlbfs, we uncharge in free_huge_page() which is called after
> __page_cache_release(), so I think that we don't have to uncharge here.
> 
> In my testing, moving mem_cgroup_uncharge() inside if (PageLRU) block
> fixed the problem, so if that works for you, could you fold the change
> into your patch?
> 
> Thanks,
> Naoya Horiguchi
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 16:07       ` Michal Hocko
@ 2014-07-15 17:34         ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-15 17:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 06:07:35PM +0200, Michal Hocko wrote:
> On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> > On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> > ...
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index a98f48626359..3074210f245d 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> > >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > >  	}
> > > +	mem_cgroup_uncharge(page);
> > >  }
> > >  
> > >  static void __put_single_page(struct page *page)
> > 
> > This seems to cause a list breakage in hstate->hugepage_activelist
> > when freeing a hugetlbfs page.
> 
> This looks like fallout from
> http://marc.info/?l=linux-mm&m=140475936311294&w=2
> 
> I didn't get to review this one, but the easiest fix seems to be to
> check PageHuge and not call uncharge.

Yes, that makes sense.  I'm also moving the uncharge call into
__put_single_page() and __put_compound_page() so that PageHuge(), a
function call, only needs to be checked for compound pages.

> > For hugetlbfs, we uncharge in free_huge_page() which is called after
> > __page_cache_release(), so I think that we don't have to uncharge here.
> > 
> > In my testing, moving mem_cgroup_uncharge() inside if (PageLRU) block
> > fixed the problem, so if that works for you, could you fold the change
> > into your patch?

Memcg pages that *do* need uncharging might not necessarily be on the
LRU list.

Does the following work for you?

Thanks!

---

diff --git a/mm/swap.c b/mm/swap.c
index 3461f2f5be20..af5c8ad830d1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+	mem_cgroup_uncharge(page);
 	free_hot_cold_page(page, false);
 }
 
@@ -76,6 +76,8 @@ static void __put_compound_page(struct page *page)
 	compound_page_dtor *dtor;
 
 	__page_cache_release(page);
+	if (!PageHuge(page))
+		mem_cgroup_uncharge(page);
 	dtor = get_compound_page_dtor(page);
 	(*dtor)(page);
 }

^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 17:34         ` Johannes Weiner
@ 2014-07-15 18:21           ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-15 18:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Naoya Horiguchi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue 15-07-14 13:34:39, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 06:07:35PM +0200, Michal Hocko wrote:
> > On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> > > On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> > > ...
> > > > diff --git a/mm/swap.c b/mm/swap.c
> > > > index a98f48626359..3074210f245d 100644
> > > > --- a/mm/swap.c
> > > > +++ b/mm/swap.c
> > > > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> > > >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> > > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > >  	}
> > > > +	mem_cgroup_uncharge(page);
> > > >  }
> > > >  
> > > >  static void __put_single_page(struct page *page)
> > > 
> > > This seems to cause a list breakage in hstate->hugepage_activelist
> > > when freeing a hugetlbfs page.
> > 
> > This looks like fallout from
> > http://marc.info/?l=linux-mm&m=140475936311294&w=2
> > 
> > I didn't get to review this one, but the easiest fix seems to be to
> > check PageHuge and not call uncharge.
> 
> Yes, that makes sense.  I'm also moving the uncharge call into
> __put_single_page() and __put_compound_page() so that PageHuge(), a
> function call, only needs to be checked for compound pages.

Hmm, there doesn't seem to be any point in calling __page_cache_release
for a huge page either.  So it should be sufficient that
__put_compound_page doesn't call __page_cache_release for PageHuge
pages, and the uncharge can stay where it is. Maybe this would be
slightly better...
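
A sketch of that variant, reusing the __put_compound_page() from the
patch earlier in this thread and leaving mem_cgroup_uncharge() at the
end of __page_cache_release():

	static void __put_compound_page(struct page *page)
	{
		compound_page_dtor *dtor;

		/*
		 * hugetlb pages are never on the regular LRU and are
		 * uncharged from free_huge_page(), so skip the release
		 * (and with it the uncharge) entirely:
		 */
		if (!PageHuge(page))
			__page_cache_release(page);
		dtor = get_compound_page_dtor(page);
		(*dtor)(page);
	}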
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 17:34         ` Johannes Weiner
@ 2014-07-15 18:43           ` Naoya Horiguchi
  -1 siblings, 0 replies; 141+ messages in thread
From: Naoya Horiguchi @ 2014-07-15 18:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 01:34:39PM -0400, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 06:07:35PM +0200, Michal Hocko wrote:
> > On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> > > On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> > > ...
> > > > diff --git a/mm/swap.c b/mm/swap.c
> > > > index a98f48626359..3074210f245d 100644
> > > > --- a/mm/swap.c
> > > > +++ b/mm/swap.c
> > > > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> > > >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> > > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > >  	}
> > > > +	mem_cgroup_uncharge(page);
> > > >  }
> > > >  
> > > >  static void __put_single_page(struct page *page)
> > > 
> > > This seems to cause a list breakage in hstate->hugepage_activelist
> > > when freeing a hugetlbfs page.
> > 
> > This looks like fallout from
> > http://marc.info/?l=linux-mm&m=140475936311294&w=2
> > 
> > I didn't get to review this one, but the easiest fix seems to be to
> > check PageHuge() and not call uncharge.
> 
> Yes, that makes sense.  I'm also moving the uncharge call into
> __put_single_page() and __put_compound_page() so that PageHuge(), a
> function call, only needs to be checked for compound pages.
> 
> > > For hugetlbfs, we uncharge in free_huge_page() which is called after
> > > __page_cache_release(), so I think that we don't have to uncharge here.
> > > 
> > > In my testing, moving mem_cgroup_uncharge() inside the if (PageLRU) block
> > > fixed the problem, so if that works for you, could you fold the change
> > > into your patch?
> 
> Memcg pages that *do* need uncharging might not necessarily be on the
> LRU list.

OK.

> Does the following work for you?

Unfortunately, with this change I saw the following bug when stress-testing
hugepage migration.  move_to_new_page() is also called by
unmap_and_move_huge_page(), so we need some hugetlb-related handling around
mem_cgroup_migrate() (see the sketch after the trace).

[   76.753994] page:ffffea0000d18000 count:2 mapcount:0 mapping:ffff88003dc2c738 index:0x8
[   76.755171] page flags: 0x1fffff80004019(locked|uptodate|dirty|head)
[   76.756195] page dumped because: VM_BUG_ON_PAGE(PageCgroupUsed(pc))
[   76.758869] pc:ffff88003ebc6000 pc->flags:1 pc->mem_cgroup:ffff88011e19a800
[   76.760158] ------------[ cut here ]------------
[   76.760878] kernel BUG at /src/linux-dev/mm/memcontrol.c:2707!
[   76.761119] invalid opcode: 0000 [#1] SMP
[   76.761119] Modules linked in: bnep bluetooth ip6t_rpfilter cfg80211 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev parport_pc parport serio_raw microcode i2c_piix4 virtio_balloon floppy pcspkr nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi
[   76.761119] CPU: 1 PID: 1536 Comm: mbind_fuzz Not tainted 3.15.0-140715-1353-00016-g8d61f2a989c8 #263
[   76.761119] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   76.761119] task: ffff88011d6e8000 ti: ffff8800bbd84000 task.ti: ffff8800bbd84000
[   76.761119] RIP: 0010:[<ffffffff811fee3b>]  [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[   76.761119] RSP: 0018:ffff8800bbd87ce8  EFLAGS: 00010292
[   76.761119] RAX: 000000000000003f RBX: ffffea0000d18000 RCX: 0000000000000000
[   76.761119] RDX: 0000000000000001 RSI: ffff88007ec0d318 RDI: ffff88007ec0d318
[   76.761119] RBP: ffff8800bbd87d28 R08: 000000000000000a R09: 0000000000000000
[   76.761119] R10: 0000000000000000 R11: ffff8800bbd879be R12: ffff88011e19a800
[   76.761119] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88003ebc6000
[   76.761119] FS:  00007f7441c3a740(0000) GS:ffff88007ec00000(0000) knlGS:0000000000000000
[   76.761119] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   76.761119] CR2: 00007ffffb5b8950 CR3: 000000003740e000 CR4: 00000000000006e0
[   76.761119] Stack:
[   76.761119]  0000000000000002 0000000000000001 00000200bbd87d28 ffffea0002b70000
[   76.761119]  ffffea0000d18000 0000000000000000 0000000000000001 0000000000000000
[   76.761119]  ffff8800bbd87d50 ffffffff81202380 ffffea0000d18000 ffffea0002b70000
[   76.761119] Call Trace:
[   76.761119]  [<ffffffff81202380>] mem_cgroup_migrate+0x100/0x1d0
[   76.761119]  [<ffffffff811f3e4d>] move_to_new_page+0xbd/0x110
[   76.761119]  [<ffffffff811f40d3>] unmap_and_move_huge_page+0x233/0x290
[   76.761119]  [<ffffffff811f477d>] migrate_pages+0xad/0x1e0
[   76.761119]  [<ffffffff811e43f0>] ? alloc_pages_vma+0x1a0/0x1a0
[   76.761119]  [<ffffffff811e4cea>] do_mbind+0x2ea/0x380
[   76.761119]  [<ffffffff811e4e1b>] SyS_mbind+0x9b/0xb0
[   76.761119]  [<ffffffff81742a12>] system_call_fastpath+0x16/0x1b
[   76.761119] Code: 13 45 19 c0 41 83 e0 02 48 c1 ea 06 83 e2 01 48 83 fa 01 41 83 d8 ff e9 30 ff ff ff 48 c7 c6 20 d0 a8 81 48 89 df e8 55 fb f9 ff <0f> 0b 48 c7 c6 f3 e2 a8 81 48 89 df e8 44 fb f9 ff 0f 0b 48 c7
[   76.761119] RIP  [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[   76.761119]  RSP <ffff8800bbd87ce8>
[   76.801726] ---[ end trace ddfccaa1a6a58baa ]---
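
(Concretely, the kind of hugetlb handling I mean would be a guard of
roughly this shape in move_to_new_page() - hypothetical and untested:)

	/* hypothetical: skip memcg migration for hugetlb pages */
	if (!PageHuge(page))
		mem_cgroup_migrate(page, newpage, false);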


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 18:43           ` Naoya Horiguchi
@ 2014-07-15 19:04             ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-15 19:04 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 02:43:58PM -0400, Naoya Horiguchi wrote:
> On Tue, Jul 15, 2014 at 01:34:39PM -0400, Johannes Weiner wrote:
> > On Tue, Jul 15, 2014 at 06:07:35PM +0200, Michal Hocko wrote:
> > > On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> > > > On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> > > > ...
> > > > > diff --git a/mm/swap.c b/mm/swap.c
> > > > > index a98f48626359..3074210f245d 100644
> > > > > --- a/mm/swap.c
> > > > > +++ b/mm/swap.c
> > > > > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> > > > >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> > > > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > > >  	}
> > > > > +	mem_cgroup_uncharge(page);
> > > > >  }
> > > > >  
> > > > >  static void __put_single_page(struct page *page)
> > > > 
> > > > This seems to cause a list breakage in hstate->hugepage_activelist
> > > > when freeing a hugetlbfs page.
> > > 
> > > This looks like fallout from
> > > http://marc.info/?l=linux-mm&m=140475936311294&w=2
> > > 
> > > I didn't get to review this one, but the easiest fix seems to be to
> > > check PageHuge() and not call uncharge.
> > 
> > Yes, that makes sense.  I'm also moving the uncharge call into
> > __put_single_page() and __put_compound_page() so that PageHuge(), a
> > function call, only needs to be checked for compound pages.
> > 
> > > > For hugetlbfs, we uncharge in free_huge_page() which is called after
> > > > __page_cache_release(), so I think that we don't have to uncharge here.
> > > > 
> > > > In my testing, moving mem_cgroup_uncharge() inside the if (PageLRU) block
> > > > fixed the problem, so if that works for you, could you fold the change
> > > > into your patch?
> > 
> > Memcg pages that *do* need uncharging might not necessarily be on the
> > LRU list.
> 
> OK.
> 
> > Does the following work for you?
> 
> Unfortunately, with this change I saw the following bug message when
> stressing with hugepage migration.
> move_to_new_page() is called by unmap_and_move_huge_page() too, so
> we need some hugetlb related code around mem_cgroup_migrate().

Can we just move hugetlb_cgroup_migrate() into move_to_new_page()?  It
doesn't seem to depend on any page-specific state.

diff --git a/mm/migrate.c b/mm/migrate.c
index 7f5a42403fae..219da52d2f43 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -781,7 +781,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		if (!PageAnon(newpage))
 			newpage->mapping = NULL;
 	} else {
-		mem_cgroup_migrate(page, newpage, false);
+		if (PageHuge(page))
+			hugetlb_cgroup_migrate(hpage, new_hpage);
+		else
+			mem_cgroup_migrate(page, newpage, false);
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
 		if (!PageAnon(page))
@@ -1064,9 +1067,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-	if (rc == MIGRATEPAGE_SUCCESS)
-		hugetlb_cgroup_migrate(hpage, new_hpage);
-
 	unlock_page(hpage);
 out:
 	if (rc != -EAGAIN)

^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 19:04             ` Johannes Weiner
@ 2014-07-15 20:49               ` Naoya Horiguchi
  -1 siblings, 0 replies; 141+ messages in thread
From: Naoya Horiguchi @ 2014-07-15 20:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 03:04:54PM -0400, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 02:43:58PM -0400, Naoya Horiguchi wrote:
> > On Tue, Jul 15, 2014 at 01:34:39PM -0400, Johannes Weiner wrote:
> > > On Tue, Jul 15, 2014 at 06:07:35PM +0200, Michal Hocko wrote:
> > > > On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> > > > > On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> > > > > ...
> > > > > > diff --git a/mm/swap.c b/mm/swap.c
> > > > > > index a98f48626359..3074210f245d 100644
> > > > > > --- a/mm/swap.c
> > > > > > +++ b/mm/swap.c
> > > > > > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> > > > > >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> > > > > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > > > >  	}
> > > > > > +	mem_cgroup_uncharge(page);
> > > > > >  }
> > > > > >  
> > > > > >  static void __put_single_page(struct page *page)
> > > > > 
> > > > > This seems to cause a list breakage in hstate->hugepage_activelist
> > > > > when freeing a hugetlbfs page.
> > > > 
> > > > This looks like fallout from
> > > > http://marc.info/?l=linux-mm&m=140475936311294&w=2
> > > > 
> > > > I didn't get to review this one, but the easiest fix seems to be to
> > > > check PageHuge() and not call uncharge.
> > > 
> > > Yes, that makes sense.  I'm also moving the uncharge call into
> > > __put_single_page() and __put_compound_page() so that PageHuge(), a
> > > function call, only needs to be checked for compound pages.
> > > 
> > > > > For hugetlbfs, we uncharge in free_huge_page() which is called after
> > > > > __page_cache_release(), so I think that we don't have to uncharge here.
> > > > > 
> > > > > In my testing, moving mem_cgroup_uncharge() inside the if (PageLRU) block
> > > > > fixed the problem, so if that works for you, could you fold the change
> > > > > into your patch?
> > > 
> > > Memcg pages that *do* need uncharging might not necessarily be on the
> > > LRU list.
> > 
> > OK.
> > 
> > > Does the following work for you?
> > 
> > Unfortunately, with this change I saw the following bug message when
> > stressing with hugepage migration.
> > move_to_new_page() is called by unmap_and_move_huge_page() too, so
> > we need some hugetlb related code around mem_cgroup_migrate().
> 
> Can we just move hugetlb_cgroup_migrate() into move_to_new_page()?  It
> doesn't seem to depend on any page-specific state.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7f5a42403fae..219da52d2f43 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -781,7 +781,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		if (!PageAnon(newpage))
>  			newpage->mapping = NULL;
>  	} else {
> -		mem_cgroup_migrate(page, newpage, false);
> +		if (PageHuge(page))
> +			hugetlb_cgroup_migrate(hpage, new_hpage);

			hugetlb_cgroup_migrate(page, newpage);

is needed to build successfully.

And yes, with this change the bug in move_to_new_page() is gone, so we
are one step further.

But I ran into other bugs, shown below.

[   56.692744] BUG: Bad page state in process sysctl  pfn:71c00
[   56.693722] page:ffffea0001c70000 count:0 mapcount:0 mapping:          (null) index:0x8
[   56.695121] page flags: 0x5fffff80004008(uptodate|head)
[   56.695990] page dumped because: cgroup check failed
[   56.696816] pc:ffff88007eb9c000 pc->flags:7 pc->mem_cgroup:ffff8800be59a800
[   56.698059] Modules linked in: stap_6484a34ef9f0ebb4400874c66d0905ac__1496(O) bnep bluetooth ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 cfg80211 xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode parport_pc serio_raw parport virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi floppy
[   56.707416] CPU: 2 PID: 1872 Comm: sysctl Tainted: G    B      O  3.15.0-140715-1512-00017-gf1ab1502aa49 #264
[   56.709024] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   56.709810]  ffffffff81a8e0d5 ffff88003d787cb0 ffffffff8172d057 ffff88003d787cc8
[   56.711158]  ffffffff8172d08e ffffea0001c70000 ffff88003d787cf0 ffffffff8119e7a5
[   56.712344]  0000000000000000 000fffff80000000 ffffffff81a8e0d5 ffff88003d787d28
[   56.713551] Call Trace:
[   56.714088]  [<ffffffff8172d057>] __dump_stack+0x19/0x1b
[   56.714793]  [<ffffffff8172d08e>] dump_stack+0x35/0x46
[   56.715546]  [<ffffffff8119e7a5>] bad_page+0xd5/0x130
[   56.716369]  [<ffffffff8119e958>] free_pages_prepare+0x158/0x190
[   56.717222]  [<ffffffff8119edab>] __free_pages_ok+0x1b/0xb0
[   56.717960]  [<ffffffff8119f859>] __free_pages+0x29/0x50
[   56.718710]  [<ffffffff811dbce0>] update_and_free_page+0xd0/0x110
[   56.719575]  [<ffffffff811dd663>] free_pool_huge_page+0xd3/0xf0
[   56.720407]  [<ffffffff811dd7ec>] set_max_huge_pages+0x16c/0x1c0
[   56.721255]  [<ffffffff811dd968>] __nr_hugepages_store_common+0x128/0x1a0
[   56.722203]  [<ffffffff811ddb28>] hugetlb_sysctl_handler_common+0x98/0xb0
[   56.723147]  [<ffffffff811de56e>] hugetlb_sysctl_handler+0x1e/0x20
[   56.723962]  [<ffffffff8127a103>] proc_sys_call_handler+0xa3/0xb0
[   56.724805]  [<ffffffff8127a124>] proc_sys_write+0x14/0x20
[   56.725844]  [<ffffffff8120921a>] vfs_write+0xba/0x1e0
[   56.726792]  [<ffffffff81209d8d>] SyS_write+0x4d/0xc0
[   56.727596]  [<ffffffff81742a12>] system_call_fastpath+0x16/0x1b
[   58.894865] page:ffffea0001cf8000 count:2 mapcount:0 mapping:ffff88003d481278 index:0x1
[   58.896112] page flags: 0x5fffff80004809(locked|uptodate|private|head)
[   58.897148] page dumped because: VM_BUG_ON_PAGE(PageCgroupUsed(pc))
[   58.899325] pc:ffff88007ebbe000 pc->flags:7 pc->mem_cgroup:ffff8800be59a800
[   58.900359] ------------[ cut here ]------------
[   58.901016] kernel BUG at /src/linux-dev/mm/memcontrol.c:2707!
[   58.901331] invalid opcode: 0000 [#1] SMP
[   58.901331] Modules linked in: stap_6484a34ef9f0ebb4400874c66d0905ac__1496(O) bnep bluetooth ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 cfg80211 xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode parport_pc serio_raw parport virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi floppy
[   58.901331] CPU: 1 PID: 1918 Comm: mbind_fuzz Tainted: G    B      O  3.15.0-140715-1512-00017-gf1ab1502aa49 #264
[   58.901331] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   58.901331] task: ffff8800bd763b20 ti: ffff8800bd750000 task.ti: ffff8800bd750000
[   58.901331] RIP: 0010:[<ffffffff811fee3b>]  [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[   58.901331] RSP: 0000:ffff8800bd753c38  EFLAGS: 00010296
[   58.901331] RAX: 000000000000003f RBX: ffffea0001cf8000 RCX: 0000000000000000
[   58.901331] RDX: 0000000000000001 RSI: ffff88007ec0d318 RDI: ffff88007ec0d318
[   58.901331] RBP: ffff8800bd753c78 R08: 000000000000000a R09: 0000000000000000
[   58.901331] R10: 0000000000000000 R11: ffff8800bd75390e R12: ffff8800be59a800
[   58.901331] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88007ebbe000
[   58.901331] FS:  00007f9ce6fa0740(0000) GS:ffff88007ec00000(0000) knlGS:0000000000000000
[   58.901331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.901331] CR2: 0000700004600000 CR3: 000000007c194000 CR4: 00000000000006e0
[   58.901331] Stack:
[   58.901331]  ffff8800be59a800 ffffea0001cf8000 000002003d481290 ffffea0001cf8000
[   58.901331]  ffff88003d481278 0000000000000000 ffff88003d481290 00000000000000d0
[   58.901331]  ffff8800bd753c90 ffffffff812020fc ffffea0001cf8000 ffff8800bd753cd8
[   58.901331] Call Trace:
[   58.901331]  [<ffffffff812020fc>] mem_cgroup_commit_charge+0x6c/0xf0
[   58.901331]  [<ffffffff81196c8c>] __add_to_page_cache_locked+0xec/0x1e0
[   58.901331]  [<ffffffff81196d91>] add_to_page_cache_locked+0x11/0x20
[   58.901331]  [<ffffffff811df425>] hugetlb_no_page+0x105/0x3b0
[   58.901331]  [<ffffffff8138f799>] ? __rb_insert_augmented+0xf9/0x1e0
[   58.901331]  [<ffffffff811e02f4>] hugetlb_fault+0x2c4/0x3c0
[   58.901331]  [<ffffffff811bd184>] ? vma_interval_tree_insert+0x84/0x90
[   58.901331]  [<ffffffff811c5d93>] __handle_mm_fault+0x303/0x340
[   58.901331]  [<ffffffff811c5e5f>] handle_mm_fault+0x8f/0x130
[   58.901331]  [<ffffffff8173d3f6>] __do_page_fault+0x176/0x520
[   58.901331]  [<ffffffff8132d993>] ? file_map_prot_check+0x63/0xd0
[   58.901331]  [<ffffffff811b46a9>] ? vm_mmap_pgoff+0x99/0xc0
[   58.901331]  [<ffffffff8173d7ac>] do_page_fault+0xc/0x10
[   58.901331]  [<ffffffff8173a122>] page_fault+0x22/0x30
[   58.901331] Code: 13 45 19 c0 41 83 e0 02 48 c1 ea 06 83 e2 01 48 83 fa 01 41 83 d8 ff e9 30 ff ff ff 48 c7 c6 20 d0 a8 81 48 89 df e8 55 fb f9 ff <0f> 0b 48 c7 c6 f3 e2 a8 81 48 89 df e8 44 fb f9 ff 0f 0b 48 c7
[   58.901331] RIP  [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[   58.901331]  RSP <ffff8800bd753c38>
[   58.944251] ---[ end trace 2f1aecd49dae161f ]---

I feel that these two messages have the same cause (they just manifest
differently). __add_to_page_cache_locked() (and thus mem_cgroup_try_charge())
can be called for hugetlb pages, while we avoid calling
mem_cgroup_migrate()/mem_cgroup_uncharge() for hugetlb. This seems to leave
the hugepage's page_cgroup in an inconsistent state, and results in the bad
page bug ("page dumped because: cgroup check failed"). So maybe more
PageHuge() checks are necessary around the charging code.
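
To spell out the sequence I suspect from the two traces above (my
reading, not verified against the source):

 1. hugetlb_no_page() -> add_to_page_cache_locked() ->
    mem_cgroup_try_charge()/mem_cgroup_commit_charge() charges the
    hugepage and sets PageCgroupUsed.
 2. The hugetlb free path never calls mem_cgroup_uncharge(), so the
    flag stays set on the freed page ("cgroup check failed").
 3. When the same page is charged again, commit_charge() trips
    VM_BUG_ON_PAGE(PageCgroupUsed(pc)).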

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 20:49               ` Naoya Horiguchi
@ 2014-07-15 21:48                 ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-15 21:48 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 04:49:53PM -0400, Naoya Horiguchi wrote:
> I feel that these two messages have the same cause (they just manifest
> differently). __add_to_page_cache_locked() (and thus mem_cgroup_try_charge())
> can be called for hugetlb pages, while we avoid calling
> mem_cgroup_migrate()/mem_cgroup_uncharge() for hugetlb. This seems to leave
> the hugepage's page_cgroup in an inconsistent state, and results in the bad
> page bug ("page dumped because: cgroup check failed"). So maybe more
> PageHuge() checks are necessary around the charging code.

This struck me as odd because I don't remember removing a PageHuge()
call in the charge path and wondered how it worked before my changes:
apparently it just checked PageCompound() in mem_cgroup_charge_file().
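
(From memory, that was an early bailout of roughly this shape -
illustrative only, I haven't re-checked the exact pre-rewrite code:)

	if (PageCompound(page))
		return 0;	/* compound pages, incl. hugetlb, were never charged */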

So it's not fallout from the new uncharge batching code; it was already
broken by the rewrite of the charge API, because that is when hugetlb
pages started entering the charging code.

In any case, we don't have file-specific charging code anymore, and the
PageCompound() check would have needed changing for THP cache anyway.
So I guess the solution is to check PageHuge() in charge, uncharge, and
migrate for now.  Oh well.

How about this?

diff --git a/mm/filemap.c b/mm/filemap.c
index 9c99d6868a5e..b61194273b56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -564,9 +564,12 @@ static int __add_to_page_cache_locked(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
-	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
-	if (error)
-		return error;
+	if (!PageHuge(page)) {
+		error = mem_cgroup_try_charge(page, current->mm,
+					      gfp_mask, &memcg);
+		if (error)
+			return error;
+	}
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 7f5a42403fae..dabed2f08609 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -781,7 +781,8 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		if (!PageAnon(newpage))
 			newpage->mapping = NULL;
 	} else {
-		mem_cgroup_migrate(page, newpage, false);
+		if (!PageHuge(page))
+			mem_cgroup_migrate(page, newpage, false);
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
 		if (!PageAnon(page))
diff --git a/mm/swap.c b/mm/swap.c
index 3461f2f5be20..97b6ec132398 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+	mem_cgroup_uncharge(page);
 	free_hot_cold_page(page, false);
 }
 
@@ -75,7 +75,10 @@ static void __put_compound_page(struct page *page)
 {
 	compound_page_dtor *dtor;
 
-	__page_cache_release(page);
+	if (!PageHuge(page)) {
+		__page_cache_release(page);
+		mem_cgroup_uncharge(page);
+	}
 	dtor = get_compound_page_dtor(page);
 	(*dtor)(page);
 }

^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 21:48                 ` Johannes Weiner
@ 2014-07-16  7:55                   ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-16  7:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Naoya Horiguchi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue 15-07-14 17:48:43, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 04:49:53PM -0400, Naoya Horiguchi wrote:
> > I feel that these two messages have the same cause (they just manifest
> > differently). __add_to_page_cache_locked() (and thus mem_cgroup_try_charge())
> > can be called for hugetlb pages, while we avoid calling
> > mem_cgroup_migrate()/mem_cgroup_uncharge() for hugetlb. This seems to leave
> > the hugepage's page_cgroup in an inconsistent state, and results in the bad
> > page bug ("page dumped because: cgroup check failed"). So maybe more
> > PageHuge() checks are necessary around the charging code.
> 
> This struck me as odd because I don't remember removing a PageHuge()
> call in the charge path and wondered how it worked before my changes:
> apparently it just checked PageCompound() in mem_cgroup_charge_file().

I noticed the PageCompound() check during review, which made me look into
the history, but 52d4b9ac0b98 (memcg: allocate all page_cgroup at boot)
didn't explain why it was added, so I dismissed it as a hack from that
time that was no longer relevant. Sorry, I should have been more careful.

> So it's not fallout from the new uncharge batching code; it was already
> broken by the rewrite of the charge API, because that is when hugetlb
> pages started entering the charging code.
> 
> In any case, we don't have file-specific charging code anymore, and the
> PageCompound() check would have needed changing for THP cache anyway.
> So I guess the solution is to check PageHuge() in charge, uncharge, and
> migrate for now.  Oh well.
> 
> How about this?

Looks good to me. I do not know why you moved the uncharge call out of
__page_cache_release() (the function deserves a better name, btw. -
lru_page_release() would sound a little better to me, and it would also
be a natural place for the uncharge) when you already check PageHuge()
in __put_compound_page(). But that is just a minor thing.
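
To illustrate the naming and placement I mean (a hypothetical sketch
based on the current __page_cache_release(), not a proposed patch):

static void lru_page_release(struct page *page)
{
	if (PageLRU(page)) {
		struct zone *zone = page_zone(page);
		struct lruvec *lruvec;
		unsigned long flags;

		spin_lock_irqsave(&zone->lru_lock, flags);
		lruvec = mem_cgroup_page_lruvec(page, zone);
		VM_BUG_ON_PAGE(!PageLRU(page), page);
		__ClearPageLRU(page);
		del_page_from_lru_list(page, lruvec, page_off_lru(page));
		spin_unlock_irqrestore(&zone->lru_lock, flags);
	}
	/* the uncharge would live here naturally */
	mem_cgroup_uncharge(page);
}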

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9c99d6868a5e..b61194273b56 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -564,9 +564,12 @@ static int __add_to_page_cache_locked(struct page *page,
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
>  
> -	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
> -	if (error)
> -		return error;
> +	if (!PageHuge(page)) {
> +		error = mem_cgroup_try_charge(page, current->mm,
> +					      gfp_mask, &memcg);
> +		if (error)
> +			return error;
> +	}
>  
>  	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
>  	if (error) {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7f5a42403fae..dabed2f08609 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -781,7 +781,8 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		if (!PageAnon(newpage))
>  			newpage->mapping = NULL;
>  	} else {
> -		mem_cgroup_migrate(page, newpage, false);
> +		if (!PageHuge(page))
> +			mem_cgroup_migrate(page, newpage, false);
>  		if (remap_swapcache)
>  			remove_migration_ptes(page, newpage);
>  		if (!PageAnon(page))
> diff --git a/mm/swap.c b/mm/swap.c
> index 3461f2f5be20..97b6ec132398 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
> -	mem_cgroup_uncharge(page);
>  }
>  
>  static void __put_single_page(struct page *page)
>  {
>  	__page_cache_release(page);
> +	mem_cgroup_uncharge(page);
>  	free_hot_cold_page(page, false);
>  }
>  
> @@ -75,7 +75,10 @@ static void __put_compound_page(struct page *page)
>  {
>  	compound_page_dtor *dtor;
>  
> -	__page_cache_release(page);
> +	if (!PageHuge(page)) {
> +		__page_cache_release(page);
> +		mem_cgroup_uncharge(page);
> +	}
>  	dtor = get_compound_page_dtor(page);
>  	(*dtor)(page);
>  }

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 21:48                 ` Johannes Weiner
@ 2014-07-16 13:30                   ` Naoya Horiguchi
  -1 siblings, 0 replies; 141+ messages in thread
From: Naoya Horiguchi @ 2014-07-16 13:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Tue, Jul 15, 2014 at 05:48:43PM -0400, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 04:49:53PM -0400, Naoya Horiguchi wrote:
> > I feel that these 2 messages have the same cause (just appear differently).
> > __add_to_page_cache_locked() (and mem_cgroup_try_charge()) can be called
> > for hugetlb, while we avoid calling mem_cgroup_migrate()/mem_cgroup_uncharge()
> > for hugetlb. This seems to make page_cgroup of the hugepage inconsistent,
> > and results in the bad page bug ("page dumped because: cgroup check failed").
> > So maybe some more PageHuge checks are necessary around the charging code.
> 
> This struck me as odd because I don't remember removing a PageHuge()
> call in the charge path and wondered how it worked before my changes:
> apparently it just checked PageCompound() in mem_cgroup_charge_file().
> 
> So it's not fallout of the new uncharge batching code, but was already
> broken during the rewrite of the charge API because then hugetlb pages
> entered the charging code.
> 
> Anyway, we don't have file-specific charging code anymore, and the
> PageCompound() check would have required changing anyway for THP
> cache.  So I guess the solution is checking PageHuge() in charge,
> uncharge, and migrate for now.  Oh well.
> 
> How about this?

With a bit of tweaking, this patch solved the problem, thanks!

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9c99d6868a5e..b61194273b56 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -564,9 +564,12 @@ static int __add_to_page_cache_locked(struct page *page,
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
>  
> -	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
> -	if (error)
> -		return error;
> +	if (!PageHuge(page)) {
> +		error = mem_cgroup_try_charge(page, current->mm,
> +					      gfp_mask, &memcg);
> +		if (error)
> +			return error;
> +	}
>  
>  	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
>  	if (error) {

We have mem_cgroup_commit_charge() later in __add_to_page_cache_locked(),
so adding an "if (!PageHuge(page))" check for it is necessary too.
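
A sketch of that guard (the two mem_cgroup_cancel_charge() call sites
in the error paths would need the same treatment):

	if (!PageHuge(page))
		mem_cgroup_commit_charge(page, memcg, false);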

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7f5a42403fae..dabed2f08609 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -781,7 +781,8 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		if (!PageAnon(newpage))
>  			newpage->mapping = NULL;
>  	} else {
> -		mem_cgroup_migrate(page, newpage, false);
> +		if (!PageHuge(page))
> +			mem_cgroup_migrate(page, newpage, false);
>  		if (remap_swapcache)
>  			remove_migration_ptes(page, newpage);
>  		if (!PageAnon(page))
> diff --git a/mm/swap.c b/mm/swap.c
> index 3461f2f5be20..97b6ec132398 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
> -	mem_cgroup_uncharge(page);
>  }
>  
>  static void __put_single_page(struct page *page)
>  {
>  	__page_cache_release(page);
> +	mem_cgroup_uncharge_page(page);

My kernel is based on mmotm-2014-07-09-17-08, where mem_cgroup_uncharge_page()
does not exist any more.  mem_cgroup_uncharge(page) seems to be what was meant.

>  	free_hot_cold_page(page, false);
>  }
>  
> @@ -75,7 +75,10 @@ static void __put_compound_page(struct page *page)
>  {
>  	compound_page_dtor *dtor;
>  
> -	__page_cache_release(page);
> +	if (!PageHuge(page)) {
> +		__page_cache_release(page);
> +		mem_cgroup_uncharge_page(page);

ditto.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-16 13:30                   ` Naoya Horiguchi
@ 2014-07-16 14:14                     ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-16 14:14 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Wed, Jul 16, 2014 at 09:30:50AM -0400, Naoya Horiguchi wrote:
> On Tue, Jul 15, 2014 at 05:48:43PM -0400, Johannes Weiner wrote:
> > On Tue, Jul 15, 2014 at 04:49:53PM -0400, Naoya Horiguchi wrote:
> > > I feel that these 2 messages have the same cause (just appear differently).
> > > __add_to_page_cache_locked() (and mem_cgroup_try_charge()) can be called
> > > for hugetlb, while we avoid calling mem_cgroup_migrate()/mem_cgroup_uncharge()
> > > for hugetlb. This seems to make page_cgroup of the hugepage inconsistent,
> > > and results in the bad page bug ("page dumped because: cgroup check failed").
> > > So maybe some more PageHuge checks are necessary around the charging code.
> > 
> > This struck me as odd because I don't remember removing a PageHuge()
> > call in the charge path and wondered how it worked before my changes:
> > apparently it just checked PageCompound() in mem_cgroup_charge_file().
> > 
> > So it's not fallout of the new uncharge batching code, but was already
> > broken during the rewrite of the charge API because then hugetlb pages
> > entered the charging code.
> > 
> > Anyway, we don't have file-specific charging code anymore, and the
> > PageCompound() check would have required changing anyway for THP
> > cache.  So I guess the solution is checking PageHuge() in charge,
> > uncharge, and migrate for now.  Oh well.
> > 
> > How about this?
> 
> With a bit of tweaking, this patch solved the problem, thanks!
> 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 9c99d6868a5e..b61194273b56 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -564,9 +564,12 @@ static int __add_to_page_cache_locked(struct page *page,
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >  	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
> >  
> > -	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
> > -	if (error)
> > -		return error;
> > +	if (!PageHuge(page)) {
> > +		error = mem_cgroup_try_charge(page, current->mm,
> > +					      gfp_mask, &memcg);
> > +		if (error)
> > +			return error;
> > +	}
> >  
> >  	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
> >  	if (error) {
> 
> We have mem_cgroup_commit_charge() later in __add_to_page_cache_locked(),
> so adding an "if (!PageHuge(page))" check for it is necessary too.

You are right.  Annotated them all now.

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 7f5a42403fae..dabed2f08609 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -781,7 +781,8 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >  		if (!PageAnon(newpage))
> >  			newpage->mapping = NULL;
> >  	} else {
> > -		mem_cgroup_migrate(page, newpage, false);
> > +		if (!PageHuge(page))
> > +			mem_cgroup_migrate(page, newpage, false);

I deleted this again as it was a followup fix to hugepages getting
wrongfully charged as file cache.  They shouldn't be, and
mem_cgroup_migrate() checks whether the page is charged.
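
For reference, a sketch from memory of the kind of early return meant
here, using the same lookup_page_cgroup()/PageCgroupUsed() pattern as
the uncharge pre-check below (exact placement in mem_cgroup_migrate()
may differ):

	pc = lookup_page_cgroup(oldpage);
	if (!PageCgroupUsed(pc))
		return;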

> >  		if (remap_swapcache)
> >  			remove_migration_ptes(page, newpage);
> >  		if (!PageAnon(page))
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 3461f2f5be20..97b6ec132398 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
> >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  	}
> > -	mem_cgroup_uncharge(page);
> >  }
> >  
> >  static void __put_single_page(struct page *page)
> >  {
> >  	__page_cache_release(page);
> > +	mem_cgroup_uncharge_page(page);
> 
> My kernel is based on mmotm-2014-07-09-17-08, where mem_cgroup_uncharge_page()
> does not exist any more.  mem_cgroup_uncharge(page) seems to be what was meant.

Sorry, I should have build-tested.  The old name is still a reflex...

> >  	free_hot_cold_page(page, false);
> >  }
> >  
> > @@ -75,7 +75,10 @@ static void __put_compound_page(struct page *page)
> >  {
> >  	compound_page_dtor *dtor;
> >  
> > -	__page_cache_release(page);
> > +	if (!PageHuge(page)) {
> > +		__page_cache_release(page);
> > +		mem_cgroup_uncharge_page(page);

I reverted all these mm/swap.c changes again as well.  Instead,
mem_cgroup_uncharge() now does a preliminary check if the page is
charged before it touches page->lru.

That should be much more robust: now the vetting whether a page is
valid for memcg happens at charge time only, all other operations
check first if a page is charged before doing anything else to it.

These two places should be the only ones that need fixing then:

diff --git a/mm/filemap.c b/mm/filemap.c
index 9c99d6868a5e..bfe0745a704d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -31,6 +31,7 @@
 #include <linux/security.h>
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
+#include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
@@ -558,19 +559,24 @@ static int __add_to_page_cache_locked(struct page *page,
 				      pgoff_t offset, gfp_t gfp_mask,
 				      void **shadowp)
 {
+	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
 	int error;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
-	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
-	if (error)
-		return error;
+	if (!huge) {
+		error = mem_cgroup_try_charge(page, current->mm,
+					      gfp_mask, &memcg);
+		if (error)
+			return error;
+	}
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
-		mem_cgroup_cancel_charge(page, memcg);
+		if (!huge)
+			mem_cgroup_cancel_charge(page, memcg);
 		return error;
 	}
 
@@ -585,14 +591,16 @@ static int __add_to_page_cache_locked(struct page *page,
 		goto err_insert;
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_commit_charge(page, memcg, false);
+	if (!huge)
+		mem_cgroup_commit_charge(page, memcg, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_cancel_charge(page, memcg);
+	if (!huge)
+		mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	return error;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 063080e35459..b5de5deddbfb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6635,9 +6635,16 @@ static void uncharge_list(struct list_head *page_list)
  */
 void mem_cgroup_uncharge(struct page *page)
 {
+	struct page_cgroup *pc;
+
 	if (mem_cgroup_disabled())
 		return;
 
+	/* Don't touch page->lru of any random page, pre-check: */
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return;
+
 	INIT_LIST_HEAD(&page->lru);
 	uncharge_list(&page->lru);
 }

^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-16 14:14                     ` Johannes Weiner
@ 2014-07-16 14:57                       ` Naoya Horiguchi
  -1 siblings, 0 replies; 141+ messages in thread
From: Naoya Horiguchi @ 2014-07-16 14:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, linux-kernel

On Wed, Jul 16, 2014 at 10:14:47AM -0400, Johannes Weiner wrote:
...
> > >  	free_hot_cold_page(page, false);
> > >  }
> > >  
> > > @@ -75,7 +75,10 @@ static void __put_compound_page(struct page *page)
> > >  {
> > >  	compound_page_dtor *dtor;
> > >  
> > > -	__page_cache_release(page);
> > > +	if (!PageHuge(page)) {
> > > +		__page_cache_release(page);
> > > +		mem_cgroup_uncharge_page(page);
> 
> I reverted all these mm/swap.c changes again as well.  Instead,
> mem_cgroup_uncharge() now does a preliminary check if the page is
> charged before it touches page->lru.
> 
> That should be much more robust: now the vetting whether a page is
> valid for memcg happens at charge time only, all other operations
> check first if a page is charged before doing anything else to it.
> 
> These two places should be the only ones that need fixing then:

This change also passed my testing, so the problem should be fixed.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-15 12:19       ` Michal Hocko
@ 2014-07-18  7:12         ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-18  7:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue 15-07-14 14:19:35, Michal Hocko wrote:
> [...]
> > +/**
> > + * mem_cgroup_migrate - migrate a charge to another page
> > + * @oldpage: currently charged page
> > + * @newpage: page to transfer the charge to
> > + * @lrucare: page might be on LRU already
> 
> which one? I guess the newpage?
> 
> > + *
> > + * Migrate the charge from @oldpage to @newpage.
> > + *
> > + * Both pages must be locked, @newpage->mapping must be set up.
> > + */
> > +void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> > +			bool lrucare)
> > +{
> > +	unsigned int nr_pages = 1;
> > +	struct page_cgroup *pc;
> > +
> > +	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> > +	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> > +	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
> > +	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
> 
> 	VM_BUG_ON_PAGE(PageLRU(newpage) && !lrucare, newpage);

I guess everything except these two notes got addressed.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-18  7:12         ` Michal Hocko
@ 2014-07-18 14:45           ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-18 14:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Tejun Heo, Vladimir Davydov,
	Miklos Szeredi, linux-mm, cgroups, linux-kernel

Hi Michal,

[cc'ing Miklos for fuse's use of replace_page_cache()]

On Fri, Jul 18, 2014 at 09:12:46AM +0200, Michal Hocko wrote:
> On Tue 15-07-14 14:19:35, Michal Hocko wrote:
> > [...]
> > > +/**
> > > + * mem_cgroup_migrate - migrate a charge to another page
> > > + * @oldpage: currently charged page
> > > + * @newpage: page to transfer the charge to
> > > + * @lrucare: page might be on LRU already
> > 
> > which one? I guess the newpage?
> > 
> > > + *
> > > + * Migrate the charge from @oldpage to @newpage.
> > > + *
> > > + * Both pages must be locked, @newpage->mapping must be set up.
> > > + */
> > > +void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> > > +			bool lrucare)
> > > +{
> > > +	unsigned int nr_pages = 1;
> > > +	struct page_cgroup *pc;
> > > +
> > > +	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> > > +	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> > > +	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
> > > +	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
> > 
> > 	VM_BUG_ON_PAGE(PageLRU(newpage) && !lrucare, newpage);
> 
> I guess everything except these two notes got addressed.

Sorry, they fell through the cracks.

Yes, @newpage can already be on the LRU, and it's what @lrucare is
for.  However, you got me thinking about the source page, and so I
went back to replace_page_cache() and the fuse code, which is the only
user of it.
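
For context: replace_page_cache_page() in mm/filemap.c is that single
call site, and it hands both pages to the charge-migration path with
lrucare set -- sketch from memory, not quoted verbatim:

	mem_cgroup_migrate(old, new, true);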

I assumed the source page would always be new, according to this part
in fuse_try_move_page():

	/*
	 * This is a new and locked page, it shouldn't be mapped or
	 * have any special flags on it
	 */
	if (WARN_ON(page_mapped(oldpage)))
		goto out_fallback_unlock;
	if (WARN_ON(page_has_private(oldpage)))
		goto out_fallback_unlock;
	if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
		goto out_fallback_unlock;
	if (WARN_ON(PageMlocked(oldpage)))
		goto out_fallback_unlock;

However, it's in the page cache and I can't really convince myself
that it's not also on the LRU.  Miklos, I have trouble pinpointing
where oldpage is instantiated exactly and what state it might be in -
can it already be on the LRU?

If it can, we need to make sure we don't change pc->mem_cgroup while
mem_cgroup_migrate() is looking at it:

---
From c636935736bafa4d6800fe040a0c3cff7ce334ea Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 18 Jul 2014 09:48:42 -0400
Subject: [patch] mm: memcontrol: rewrite uncharge API fix - page cache
 migration

It was known that the target page in migration could be on the LRU -
clarify this in mem_cgroup_migrate() and correct the VM_BUG_ON_PAGE().

However, the source page can also be on the LRU in case of page cache
replacement and there is nothing stabilizing pc->mem_cgroup right now:
grab the page lock in mem_cgroup_move_account() to prevent page cache
replacement from racing with charge moving.

Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9db142d83b5c..c9cebf2cf273 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3450,9 +3450,17 @@ static int mem_cgroup_move_account(struct page *page,
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
+	/*
+	 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup
+	 * of its source page while we change it: page migration takes
+	 * both pages off the LRU, but page cache replacement doesn't.
+	 */
+	if (!trylock_page(page))
+		goto out;
+
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto out;
+		goto out_unlock;
 
 	move_lock_mem_cgroup(from, &flags);
 
@@ -3487,6 +3495,8 @@ static int mem_cgroup_move_account(struct page *page,
 	mem_cgroup_charge_statistics(from, page, -nr_pages);
 	memcg_check_events(from, page);
 	local_irq_enable();
+out_unlock:
+	unlock_page(page);
 out:
 	return ret;
 }
@@ -6614,7 +6624,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
  * mem_cgroup_migrate - migrate a charge to another page
  * @oldpage: currently charged page
  * @newpage: page to transfer the charge to
- * @lrucare: page might be on LRU already
+ * @lrucare: @newpage might be on LRU already
  *
  * Migrate the charge from @oldpage to @newpage.
  *
@@ -6628,8 +6638,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 
 	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
-	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
-	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage);
 	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
 
 	if (mem_cgroup_disabled())
-- 
2.0.0



^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-18 14:45           ` Johannes Weiner
@ 2014-07-18 15:12             ` Miklos Szeredi
  -1 siblings, 0 replies; 141+ messages in thread
From: Miklos Szeredi @ 2014-07-18 15:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Fri, Jul 18, 2014 at 4:45 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> I assumed the source page would always be new, according to this part
> in fuse_try_move_page():
>
>         /*
>          * This is a new and locked page, it shouldn't be mapped or
>          * have any special flags on it
>          */
>         if (WARN_ON(page_mapped(oldpage)))
>                 goto out_fallback_unlock;
>         if (WARN_ON(page_has_private(oldpage)))
>                 goto out_fallback_unlock;
>         if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
>                 goto out_fallback_unlock;
>         if (WARN_ON(PageMlocked(oldpage)))
>                 goto out_fallback_unlock;
>
> However, it's in the page cache and I can't really convince myself
> that it's not also on the LRU.  Miklos, I have trouble pinpointing
> where oldpage is instantiated exactly and what state it might be in -
> can it already be on the LRU?

oldpage comes from ->readpages() (*NOT* ->readpage()), i.e. readahead.

AFAICS it is added to the LRU in read_cache_pages(), so it looks like
it is definitely on the LRU at that point.
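
A sketch from memory of the relevant loop in read_cache_pages() --
add_to_page_cache_lru() is what inserts the page into the page cache
and puts it on the LRU in one go:

	while (!list_empty(pages)) {
		page = list_to_page(pages);
		list_del(&page->lru);
		if (add_to_page_cache_lru(page, mapping,
					  page->index, GFP_KERNEL)) {
			read_cache_pages_invalidate_page(mapping, page);
			continue;
		}
		/* page is now in the page cache and on the LRU */
	}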

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-18 15:12             ` Miklos Szeredi
@ 2014-07-19 17:39               ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-19 17:39 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Fri, Jul 18, 2014 at 05:12:54PM +0200, Miklos Szeredi wrote:
> On Fri, Jul 18, 2014 at 4:45 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > I assumed the source page would always be new, according to this part
> > in fuse_try_move_page():
> >
> >         /*
> >          * This is a new and locked page, it shouldn't be mapped or
> >          * have any special flags on it
> >          */
> >         if (WARN_ON(page_mapped(oldpage)))
> >                 goto out_fallback_unlock;
> >         if (WARN_ON(page_has_private(oldpage)))
> >                 goto out_fallback_unlock;
> >         if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
> >                 goto out_fallback_unlock;
> >         if (WARN_ON(PageMlocked(oldpage)))
> >                 goto out_fallback_unlock;
> >
> > However, it's in the page cache and I can't really convince myself
> > that it's not also on the LRU.  Miklos, I have trouble pinpointing
> > where oldpage is instantiated exactly and what state it might be in -
> > can it already be on the LRU?
> 
> oldpage comes from ->readpages() (*NOT* ->readpage()), i.e. readahead.
> 
> AFAICS it is added to the LRU in read_cache_pages(), so it looks like
> it is definitely on the LRU at that point.

I see, thanks!

Then we need charge migration to lock the page like I proposed.  But
it's not enough: we also need to exclude isolation and putback while
we uncharge it, and make sure that if it was on the LRU it's moved to
the correct lruvec (the root memcg's):

---
From ce51bdcf02bee94a1f1049864b1665c2d9830281 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 18 Jul 2014 09:48:42 -0400
Subject: [patch] mm: memcontrol: rewrite uncharge API fix - page cache
 migration

It was known that the target page in migration could be on the LRU -
clarify this in mem_cgroup_migrate() and correct the VM_BUG_ON_PAGE().

However, during page cache replacement, the source page can also be on
the LRU, and two things need to be considered:

1. charge moving can race and change pc->mem_cgroup from under us:
grab the page lock in mem_cgroup_move_account() to prevent that.

2. the lruvec of the page changes as we uncharge it, and putback can
race with us: grab the lru lock and isolate the page iff on LRU to
prevent races and to ensure the page is on the right lruvec afterward.

Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
---
 mm/memcontrol.c | 83 +++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 57 insertions(+), 26 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9db142d83b5c..b7c9a202dee9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2696,13 +2696,42 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	return memcg;
 }
 
+static void lock_page_lru(struct page *page, int *isolated)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page)) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		ClearPageLRU(page);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		*isolated = 1;
+	} else
+		*isolated = 0;
+}
+
+static void unlock_page_lru(struct page *page, int isolated)
+{
+	struct zone *zone = page_zone(page);
+
+	if (isolated) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+		SetPageLRU(page);
+		add_page_to_lru_list(page, lruvec, page_lru(page));
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct zone *uninitialized_var(zone);
-	bool was_on_lru = false;
-	struct lruvec *lruvec;
+	int isolated;
 
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
@@ -2714,16 +2743,8 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
 	 * may already be on some other mem_cgroup's LRU.  Take care of it.
 	 */
-	if (lrucare) {
-		zone = page_zone(page);
-		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page)) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, page_lru(page));
-			was_on_lru = true;
-		}
-	}
+	if (lrucare)
+		lock_page_lru(page, &isolated);
 
 	/*
 	 * Nobody should be changing or seriously looking at
@@ -2742,15 +2763,8 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	pc->mem_cgroup = memcg;
 	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
 
-	if (lrucare) {
-		if (was_on_lru) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			VM_BUG_ON_PAGE(PageLRU(page), page);
-			SetPageLRU(page);
-			add_page_to_lru_list(page, lruvec, page_lru(page));
-		}
-		spin_unlock_irq(&zone->lru_lock);
-	}
+	if (lrucare)
+		unlock_page_lru(page, isolated);
 
 	local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, page, nr_pages);
@@ -3450,9 +3464,17 @@ static int mem_cgroup_move_account(struct page *page,
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
+	/*
+	 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup
+	 * of its source page while we change it: page migration takes
+	 * both pages off the LRU, but page cache replacement doesn't.
+	 */
+	if (!trylock_page(page))
+		goto out;
+
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto out;
+		goto out_unlock;
 
 	move_lock_mem_cgroup(from, &flags);
 
@@ -3487,6 +3509,8 @@ static int mem_cgroup_move_account(struct page *page,
 	mem_cgroup_charge_statistics(from, page, -nr_pages);
 	memcg_check_events(from, page);
 	local_irq_enable();
+out_unlock:
+	unlock_page(page);
 out:
 	return ret;
 }
@@ -6614,7 +6638,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
  * mem_cgroup_migrate - migrate a charge to another page
  * @oldpage: currently charged page
  * @newpage: page to transfer the charge to
- * @lrucare: page might be on LRU already
+ * @lrucare: both pages might be on the LRU already
  *
  * Migrate the charge from @oldpage to @newpage.
  *
@@ -6625,11 +6649,12 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 {
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
+	int isolated;
 
 	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
-	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
-	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage);
 	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
 
 	if (mem_cgroup_disabled())
@@ -6648,8 +6673,14 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
 	}
 
+	if (lrucare)
+		lock_page_lru(oldpage, &isolated);
+
 	pc->flags = 0;
 
+	if (lrucare)
+		unlock_page_lru(oldpage, isolated);
+
 	local_irq_disable();
 	mem_cgroup_charge_statistics(pc->mem_cgroup, oldpage, -nr_pages);
 	memcg_check_events(pc->mem_cgroup, oldpage);
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-19 17:39               ` Johannes Weiner
@ 2014-07-22 15:08                 ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-22 15:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Sat 19-07-14 13:39:11, Johannes Weiner wrote:
> On Fri, Jul 18, 2014 at 05:12:54PM +0200, Miklos Szeredi wrote:
> > On Fri, Jul 18, 2014 at 4:45 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > I assumed the source page would always be new, according to this part
> > > in fuse_try_move_page():
> > >
> > >         /*
> > >          * This is a new and locked page, it shouldn't be mapped or
> > >          * have any special flags on it
> > >          */
> > >         if (WARN_ON(page_mapped(oldpage)))
> > >                 goto out_fallback_unlock;
> > >         if (WARN_ON(page_has_private(oldpage)))
> > >                 goto out_fallback_unlock;
> > >         if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
> > >                 goto out_fallback_unlock;
> > >         if (WARN_ON(PageMlocked(oldpage)))
> > >                 goto out_fallback_unlock;
> > >
> > > However, it's in the page cache and I can't really convince myself
> > > that it's not also on the LRU.  Miklos, I have trouble pinpointing
> > > where oldpage is instantiated exactly and what state it might be in -
> > > can it already be on the LRU?
> > 
> > oldpage comes from ->readpages() (*NOT* ->readpage()), i.e. readahead.
> > 
> > AFAICS it is added to the LRU in read_cache_pages(), so it looks like
> > it is definitely on the LRU at that point.

OK, so my understanding of the code was wrong :/ and staring at it for
quite a while didn't help much. The fuse code is so full of indirection
it makes my head spin. So what is the exact state of the old and new
pages? Both might be on the LRU, OK, but can both of them be charged to
a memcg? Possibly different memcgs?

How should we test this code path, Miklos?

> I see, thanks!
> 
> Then we need charge migration to lock the page like I proposed.  But
> it's not enough: we also need to exclude isolation and putback while
> we uncharge it, and make sure that if it was on the LRU it's moved to
> the correct lruvec (the root memcg's):
> 
> ---
> From ce51bdcf02bee94a1f1049864b1665c2d9830281 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 18 Jul 2014 09:48:42 -0400
> Subject: [patch] mm: memcontrol: rewrite uncharge API fix - page cache
>  migration
> 
> It was known that the target page in migration could be on the LRU -
> clarify this in mem_cgroup_migrate() and correct the VM_BUG_ON_PAGE().
> 
> However, during page cache replacement, the source page can also be on
> the LRU, and two things need to be considered:
> 
> 1. charge moving can race and change pc->mem_cgroup from under us:
> grab the page lock in mem_cgroup_move_account() to prevent that.
>
> 2. the lruvec of the page changes as we uncharge it, and putback can
> race with us: grab the lru lock and isolate the page iff on LRU to
> prevent races and to ensure the page is on the right lruvec afterward.
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>

I am not sure this is appropriate, as I didn't consider the old page
being on the LRU. I only didn't like the VM_BUG_ON_PAGE without lrucare
handling for the newpage part, because that was known to blow up.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> ---
>  mm/memcontrol.c | 83 +++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 57 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9db142d83b5c..b7c9a202dee9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2696,13 +2696,42 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  	return memcg;
>  }
>  
> +static void lock_page_lru(struct page *page, int *isolated)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	spin_lock_irq(&zone->lru_lock);
> +	if (PageLRU(page)) {
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, zone);
> +		ClearPageLRU(page);
> +		del_page_from_lru_list(page, lruvec, page_lru(page));
> +		*isolated = 1;
> +	} else
> +		*isolated = 0;
> +}
> +
> +static void unlock_page_lru(struct page *page, int isolated)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	if (isolated) {
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, zone);
> +		VM_BUG_ON_PAGE(PageLRU(page), page);
> +		SetPageLRU(page);
> +		add_page_to_lru_list(page, lruvec, page_lru(page));
> +	}
> +	spin_unlock_irq(&zone->lru_lock);
> +}
> +
>  static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>  			  unsigned int nr_pages, bool lrucare)
>  {
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> -	struct zone *uninitialized_var(zone);
> -	bool was_on_lru = false;
> -	struct lruvec *lruvec;
> +	int isolated;
>  
>  	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
>  	/*
> @@ -2714,16 +2743,8 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>  	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
>  	 * may already be on some other mem_cgroup's LRU.  Take care of it.
>  	 */
> -	if (lrucare) {
> -		zone = page_zone(page);
> -		spin_lock_irq(&zone->lru_lock);
> -		if (PageLRU(page)) {
> -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> -			ClearPageLRU(page);
> -			del_page_from_lru_list(page, lruvec, page_lru(page));
> -			was_on_lru = true;
> -		}
> -	}
> +	if (lrucare)
> +		lock_page_lru(page, &isolated);
>  
>  	/*
>  	 * Nobody should be changing or seriously looking at
> @@ -2742,15 +2763,8 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>  	pc->mem_cgroup = memcg;
>  	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
>  
> -	if (lrucare) {
> -		if (was_on_lru) {
> -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> -			VM_BUG_ON_PAGE(PageLRU(page), page);
> -			SetPageLRU(page);
> -			add_page_to_lru_list(page, lruvec, page_lru(page));
> -		}
> -		spin_unlock_irq(&zone->lru_lock);
> -	}
> +	if (lrucare)
> +		unlock_page_lru(page, isolated);
>  
>  	local_irq_disable();
>  	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> @@ -3450,9 +3464,17 @@ static int mem_cgroup_move_account(struct page *page,
>  	if (nr_pages > 1 && !PageTransHuge(page))
>  		goto out;
>  
> +	/*
> +	 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup
> +	 * of its source page while we change it: page migration takes
> +	 * both pages off the LRU, but page cache replacement doesn't.
> +	 */
> +	if (!trylock_page(page))
> +		goto out;
> +
>  	ret = -EINVAL;
>  	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
> -		goto out;
> +		goto out_unlock;
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> @@ -3487,6 +3509,8 @@ static int mem_cgroup_move_account(struct page *page,
>  	mem_cgroup_charge_statistics(from, page, -nr_pages);
>  	memcg_check_events(from, page);
>  	local_irq_enable();
> +out_unlock:
> +	unlock_page(page);
>  out:
>  	return ret;
>  }
> @@ -6614,7 +6638,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
>   * mem_cgroup_migrate - migrate a charge to another page
>   * @oldpage: currently charged page
>   * @newpage: page to transfer the charge to
> - * @lrucare: page might be on LRU already
> + * @lrucare: both pages might be on the LRU already
>   *
>   * Migrate the charge from @oldpage to @newpage.
>   *
> @@ -6625,11 +6649,12 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
>  {
>  	unsigned int nr_pages = 1;
>  	struct page_cgroup *pc;
> +	int isolated;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
>  	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> -	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
> -	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
> +	VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage);
>  	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
>  
>  	if (mem_cgroup_disabled())
> @@ -6648,8 +6673,14 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
>  		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
>  	}
>  
> +	if (lrucare)
> +		lock_page_lru(oldpage, &isolated);
> +
>  	pc->flags = 0;
>  
> +	if (lrucare)
> +		unlock_page_lru(oldpage, isolated);
> +
>  	local_irq_disable();
>  	mem_cgroup_charge_statistics(pc->mem_cgroup, oldpage, -nr_pages);
>  	memcg_check_events(pc->mem_cgroup, oldpage);
> -- 
> 2.0.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-22 15:08                 ` Michal Hocko
  (?)
@ 2014-07-22 15:44                   ` Miklos Szeredi
  -1 siblings, 0 replies; 141+ messages in thread
From: Miklos Szeredi @ 2014-07-22 15:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Tue, Jul 22, 2014 at 5:08 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Sat 19-07-14 13:39:11, Johannes Weiner wrote:
>> On Fri, Jul 18, 2014 at 05:12:54PM +0200, Miklos Szeredi wrote:
>> > On Fri, Jul 18, 2014 at 4:45 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> >
>> > > I assumed the source page would always be new, according to this part
>> > > in fuse_try_move_page():
>> > >
>> > >         /*
>> > >          * This is a new and locked page, it shouldn't be mapped or
>> > >          * have any special flags on it
>> > >          */
>> > >         if (WARN_ON(page_mapped(oldpage)))
>> > >                 goto out_fallback_unlock;
>> > >         if (WARN_ON(page_has_private(oldpage)))
>> > >                 goto out_fallback_unlock;
>> > >         if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
>> > >                 goto out_fallback_unlock;
>> > >         if (WARN_ON(PageMlocked(oldpage)))
>> > >                 goto out_fallback_unlock;
>> > >
>> > > However, it's in the page cache and I can't really convince myself
>> > > that it's not also on the LRU.  Miklos, I have trouble pinpointing
>> > > where oldpage is instantiated exactly and what state it might be in -
>> > > can it already be on the LRU?
>> >
>> > oldpage comes from ->readpages() (*NOT* ->readpage()), i.e. readahead.
>> >
>> > AFAICS it is added to the LRU in read_cache_pages(), so it looks like
>> > it is definitely on the LRU at that point.
>
> OK, so my understanding of the code was wrong :/ and staring at it for
> quite a while didn't help much. The fuse code is so full of indirection
> it makes my head spin.

Definitely needs a rewrite.  But forget the complexities for the
moment and just consider this single case:

 - ->readpages() is called to do some readahead; the pages are locked,
added to the page cache and, AFAICS, charged to a memcg (in
add_to_page_cache_lru()).

 - fuse sends a READ request to userspace and it gets a reply with
splice(... SPLICE_F_MOVE).  What this means is that a bunch of pages of
indefinite origin are to replace (if possible) the pages already in
the page cache.  If that's not possible, for some reason, it falls back
to copying the contents.  So, AFAICS, the oldpage and the newpage can
be charged to different memcgs.
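
FWIW, stripped of all the indirection, the move path boils down to
something like this on the daemon side.  This is a simplified,
untested sketch of what I believe the libfuse splice path does; a real
reply also has to put the fuse_out_header into the pipe first, which
is omitted here, and the error paths leak the pipe fds:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* backing_fd: file on the underlying fs, fuse_dev: fd of /dev/fuse */
	static int reply_with_move(int backing_fd, loff_t off, size_t len,
				   int fuse_dev)
	{
		int pipefd[2];

		if (pipe(pipefd))
			return -1;
		/* file -> pipe: the pipe buffers now reference the backing
		 * file's page cache pages instead of holding copies */
		if (splice(backing_fd, &off, pipefd[1], NULL, len,
			   SPLICE_F_MOVE) < 0)
			return -1;
		/* pipe -> /dev/fuse: with SPLICE_F_MOVE the kernel may steal
		 * these pages and install them over the fuse file's own page
		 * cache pages (fuse_try_move_page()) instead of copying */
		if (splice(pipefd[0], NULL, fuse_dev, NULL, len,
			   SPLICE_F_MOVE) < 0)
			return -1;
		close(pipefd[0]);
		close(pipefd[1]);
		return 0;
	}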

>
> How should we test this code path, Miklos?

  fusexmp_fh -osplice_write,splice_move /mnt/fuse

This will mirror / under /mnt/fuse and will use splice to move data
from the underlying filesystem to the fuse filesystem, hopefully.

It would be useful if it had some instrumentation telling us the
actual number of pages successfully moved, but it doesn't have that
yet.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-22 15:44                   ` Miklos Szeredi
  (?)
@ 2014-07-23 14:38                     ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-23 14:38 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Johannes Weiner, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Tue 22-07-14 17:44:43, Miklos Szeredi wrote:
> On Tue, Jul 22, 2014 at 5:08 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Sat 19-07-14 13:39:11, Johannes Weiner wrote:
> >> On Fri, Jul 18, 2014 at 05:12:54PM +0200, Miklos Szeredi wrote:
> >> > On Fri, Jul 18, 2014 at 4:45 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> >
> >> > > I assumed the source page would always be new, according to this part
> >> > > in fuse_try_move_page():
> >> > >
> >> > >         /*
> >> > >          * This is a new and locked page, it shouldn't be mapped or
> >> > >          * have any special flags on it
> >> > >          */
> >> > >         if (WARN_ON(page_mapped(oldpage)))
> >> > >                 goto out_fallback_unlock;
> >> > >         if (WARN_ON(page_has_private(oldpage)))
> >> > >                 goto out_fallback_unlock;
> >> > >         if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
> >> > >                 goto out_fallback_unlock;
> >> > >         if (WARN_ON(PageMlocked(oldpage)))
> >> > >                 goto out_fallback_unlock;
> >> > >
> >> > > However, it's in the page cache and I can't really convince myself
> >> > > that it's not also on the LRU.  Miklos, I have trouble pinpointing
> >> > > where oldpage is instantiated exactly and what state it might be in -
> >> > > can it already be on the LRU?
> >> >
> >> > oldpage comes from ->readpages() (*NOT* ->readpage()), i.e. readahead.
> >> >
> >> > AFAICS it is added to the LRU in read_cache_pages(), so it looks like
> >> > it is definitely on the LRU at that point.
> >
> > OK, so my understanding of the code was wrong :/ and staring at it for
> > quite a while didn't help much. The fuse code is so full of indirection
> > it makes my head spin.
> 
> Definitely needs a rewrite.  But forget the complexities for the
> moment and just consider this single case:
> 
>  - ->readpages() is called to do some readahead; the pages are locked,
> added to the page cache and, AFAICS, charged to a memcg (in
> add_to_page_cache_lru()).
> 
>  - fuse sends a READ request to userspace and it gets a reply with
> splice(... SPLICE_F_MOVE).  What this means is that a bunch of pages of
> indefinite origin are to replace (if possible) the pages already in
> the page cache.  If that's not possible, for some reason, it falls back
> to copying the contents.  So, AFAICS, the oldpage and the newpage can
> be charged to different memcgs.

OK, thanks for the clarification. I had this feeling but couldn't wrap
my head around the indirection of the code.

It seems that checking PageCgroupUsed(new) and bailing out early in
mem_cgroup_migrate() should just work, no?
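
Something like this, I mean (untested sketch, reusing the existing
helpers; it would sit at the top of mem_cgroup_migrate() next to the
existing lookup of oldpage's pc):

	pc = lookup_page_cgroup(newpage);
	if (PageCgroupUsed(pc))
		return;	/* newpage already charged, nothing to migrate */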

> > How should we test this code path, Miklos?
> 
>   fusexmp_fh -osplice_write,splice_move /mnt/fuse
> 
> This will mirror / under /mnt/fuse and will use splice to move data
> from the underlying filesystem to the fuse filesystem, hopefully.
> 
> It would be useful if it had some instrumentation telling us the
> actual number of pages successfully moved, but it doesn't have that
> yet.

Thanks, I will try to play with this tomorrow when I have more time.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-23 14:38                     ` Michal Hocko
@ 2014-07-23 15:06                       ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-23 15:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Wed, Jul 23, 2014 at 04:38:47PM +0200, Michal Hocko wrote:
> On Tue 22-07-14 17:44:43, Miklos Szeredi wrote:
> > On Tue, Jul 22, 2014 at 5:08 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > > On Sat 19-07-14 13:39:11, Johannes Weiner wrote:
> > >> On Fri, Jul 18, 2014 at 05:12:54PM +0200, Miklos Szeredi wrote:
> > >> > On Fri, Jul 18, 2014 at 4:45 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >> >
> > >> > > I assumed the source page would always be new, according to this part
> > >> > > in fuse_try_move_page():
> > >> > >
> > >> > >         /*
> > >> > >          * This is a new and locked page, it shouldn't be mapped or
> > >> > >          * have any special flags on it
> > >> > >          */
> > >> > >         if (WARN_ON(page_mapped(oldpage)))
> > >> > >                 goto out_fallback_unlock;
> > >> > >         if (WARN_ON(page_has_private(oldpage)))
> > >> > >                 goto out_fallback_unlock;
> > >> > >         if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
> > >> > >                 goto out_fallback_unlock;
> > >> > >         if (WARN_ON(PageMlocked(oldpage)))
> > >> > >                 goto out_fallback_unlock;
> > >> > >
> > >> > > However, it's in the page cache and I can't really convince myself
> > >> > > that it's not also on the LRU.  Miklos, I have trouble pinpointing
> > >> > > where oldpage is instantiated exactly and what state it might be in -
> > >> > > can it already be on the LRU?
> > >> >
> > >> > oldpage comes from ->readpages() (*NOT* ->readpage()), i.e. readahead.
> > >> >
> > >> > AFAICS it is added to the LRU in read_cache_pages(), so it looks like
> > >> > it is definitely on the LRU at that point.
> > >
> > > OK, so my understanding of the code was wrong :/ and staring at it for
> > > quite a while didn't help much. The fuse code is so full of indirection
> > > it makes my head spin.
> > 
> > Definitely needs a rewrite.  But forget the complexities for the
> > moment and just consider this single case:
> > 
> >  - ->readpages() is called to do some readahead; the pages are locked,
> > added to the page cache and, AFAICS, charged to a memcg (in
> > add_to_page_cache_lru()).
> > 
> >  - fuse sends a READ request to userspace and it gets a reply with
> > splice(... SPLICE_F_MOVE).  What this means is that a bunch of pages of
> > indefinite origin are to replace (if possible) the pages already in
> > the page cache.  If that's not possible, for some reason, it falls back
> > to copying the contents.  So, AFAICS, the oldpage and the newpage can
> > be charged to different memcgs.

Can the new page be anything other than previous page cache?  The pipe
buffer stealing code truncates those pages, but at that point they
would already be charged as cache.

> OK, thanks for the clarification. I had this feeling but couldn't wrap
> my head around the indirection of the code.
> 
> It seems that checking PageCgroupUsed(new) and bailing out early in
> mem_cgroup_migrate() should just work, no?

If the new page is already charged as page cache, we could just drop
the call to mem_cgroup_migrate() altogether.

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-23 15:06                       ` Johannes Weiner
  (?)
@ 2014-07-23 15:19                         ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-23 15:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Wed 23-07-14 11:06:08, Johannes Weiner wrote:
> On Wed, Jul 23, 2014 at 04:38:47PM +0200, Michal Hocko wrote:
[...]
> > OK, thanks for the clarification. I had this feeling but couldn't wrap
> > my head around the indirection of the code.
> > 
> > It seems that checking PageCgroupUsed(new) and bailing out early in
> > mem_cgroup_migrate should just work, no?
> 
> If the new page is already charged as page cache, we could just drop
> the call to mem_cgroup_migrate() altogether.

Yeah, it is just that we do not want to do all the
page->page_cgroup->PageCgroupUsed thing in replace_page_cache_page.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-23 15:19                         ` Michal Hocko
@ 2014-07-23 15:36                           ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-23 15:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Wed, Jul 23, 2014 at 05:19:09PM +0200, Michal Hocko wrote:
> On Wed 23-07-14 11:06:08, Johannes Weiner wrote:
> > On Wed, Jul 23, 2014 at 04:38:47PM +0200, Michal Hocko wrote:
> [...]
> > > OK, thanks for the clarification. I had this feeling but couldn't wrap
> > > my head around the indirection of the code.
> > > 
> > > It seems that checking PageCgroupUsed(new) and bailing out early in
> > > mem_cgroup_migrate should just work, no?
> > 
> > If the new page is already charged as page cache, we could just drop
> > the call to mem_cgroup_migrate() altogether.
> 
> Yeah, it is just that we do not want to do all the
> page->page_cgroup->PageCgroupUsed thing in replace_page_cache_page.

If the new page is *always* already charged as cache, there is no
reason to even check PageCgroupUsed.  We wouldn't have to do anything
at this point.  The old code had to, because pages were uncharged
during truncation, but now we could just carry the original charge
across truncation and reuse as the replacement page, and then
uncharge the old page.  No migration necessary.
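
Something like this, purely illustrative -- the helper name is made
up, only mem_cgroup_uncharge() is from the rewritten API:

/*
 * Hypothetical sketch of the "carry the charge" idea, not actual
 * kernel code: if truncation left the charge on the stolen page,
 * replacement would reduce to uncharging the old page.
 */
static void replace_page_cache_carry_charge(struct page *oldpage,
					    struct page *newpage)
{
	/* newpage keeps the charge from its previous page cache life */
	mem_cgroup_uncharge(oldpage);
}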

That's why I'm asking if newpage is always charged, truncated page
cache, or whether it can be something else.

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-23 15:06                       ` Johannes Weiner
@ 2014-07-23 18:08                         ` Miklos Szeredi
  -1 siblings, 0 replies; 141+ messages in thread
From: Miklos Szeredi @ 2014-07-23 18:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Wed, Jul 23, 2014 at 5:06 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Can the new page be anything other than previous page cache?

It could be an ordinary pipe buffer too.  Stealable as well (see
generic_pipe_buf_steal()).
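
For reference, the generic helper is tiny -- roughly (paraphrased,
not a verbatim copy):

int generic_pipe_buf_steal(struct pipe_inode_info *pipe,
			   struct pipe_buffer *buf)
{
	struct page *page = buf->page;

	/*
	 * A reference count of one means the pipe is the sole owner
	 * of the page: lock it and let the caller take it over.
	 */
	if (page_count(page) == 1) {
		lock_page(page);
		return 0;
	}
	return 1;	/* shared elsewhere, not stealable */
}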

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-23 18:08                         ` Miklos Szeredi
  (?)
@ 2014-07-23 21:02                           ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-23 21:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

Hi Miklos,

On Wed, Jul 23, 2014 at 08:08:57PM +0200, Miklos Szeredi wrote:
> On Wed, Jul 23, 2014 at 5:06 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Can the new page be anything other than previous page cache?
> 
> It could be an ordinary pipe buffer too.  Stealable as well (see
> generic_pipe_buf_steal()).

Okay, they need charging, so we can't get rid of mem_cgroup_migrate()
in replace_page_cache().  With the fuse example mount you described I
can trigger the current code to blow up, so below is a fix to check if
the target page is already charged.

On an unrelated note, while playing around with the fuse example mount
and heavy swapping workloads I get the following in dmesg (changed
fuse_check_page() to use dump_page(), will send a patch later):

[  298.771921] page:ffffea000468cb80 count:1 mapcount:0 mapping:          (null) index:0x1e852f8
[  298.780517] page flags: 0x5fffc000080029(locked|uptodate|lru|swapbacked)
[  298.787385] page dumped because: fuse: trying to steal weird page
[  298.793500] pc:ffff880215f232e0 pc->flags:7 pc->mem_cgroup:ffff880216c23000

[  298.801031] page:ffffea0004662f00 count:1 mapcount:0 mapping:          (null) index:0x1e85324
[  298.809689] page flags: 0x5fffc000080029(locked|uptodate|lru|swapbacked)
[  298.816615] page dumped because: fuse: trying to steal weird page
[  298.822791] pc:ffff880215f18bc0 pc->flags:7 pc->mem_cgroup:ffff880216c23000

etc.

Somehow the page stealing ends up taking out anonymous pages, but it
must be a race condition as it happens rarely and irregularly.

---
From 2c3525cb556313936845a7c57f4c4adc655b6680 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 23 Jul 2014 15:00:15 -0400
Subject: [patch] mm: memcontrol: rewrite uncharge API fix - page cache
 migration 2

In case of fuse page cache replacement the target page in migration
can already be charged when splice steals it from page cache.  That
triggers the !PageCgroupUsed() assertion during commit:

[  755.141095] page:ffffea00031f9b00 count:2 mapcount:0 mapping:ffff8800c84d1858 index:0x0
[  755.141097] page flags: 0x3fffc000000029(locked|uptodate|lru)
[  755.141098] page dumped because: VM_BUG_ON_PAGE(PageCgroupUsed(pc))
[  755.141098] pc:ffff880215cfe6c0 pc->flags:7 pc->mem_cgroup:ffff880216c23000
[  755.141113] ------------[ cut here ]------------
[  755.141113] kernel BUG at /home/hannes/src/linux/linux/mm/memcontrol.c:2736!
[  755.141115] invalid opcode: 0000 [#1] SMP
[  755.141117] CPU: 0 PID: 342 Comm: lt-fusexmp_fh Not tainted 3.16.0-rc5-mm1-00502-g5e5b90c20054 #367
[  755.141117] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS, BIOS P1.30 05/10/2012
[  755.141118] task: ffff880213104580 ti: ffff8800c9204000 task.ti: ffff8800c9204000
[  755.141121] RIP: 0010:[<ffffffff81188497>]  [<ffffffff81188497>] commit_charge+0xa7/0xb0
[  755.141122] RSP: 0018:ffff8800c9207c18  EFLAGS: 00010286
[  755.141123] RAX: 000000000000003f RBX: ffffea00031f9b00 RCX: 0000000000004c4b
[  755.141123] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff880213104580
[  755.141124] RBP: ffff8800c9207c40 R08: 0000000000000001 R09: 0000000000000000
[  755.141124] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880216c23000
[  755.141125] R13: 0000000000000001 R14: ffff8800c84d1858 R15: 0000000000000000
[  755.141125] FS:  00007fc15f7fe700(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
[  755.141126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  755.141127] CR2: 00007f693db3b6b0 CR3: 0000000211d54000 CR4: 00000000000407f0
[  755.141127] Stack:
[  755.141128]  ffffea00031f8480 ffffea00031f9b00 ffffea00031f8480 ffffea00031f9b00
[  755.141129]  0000000000000000 ffff8800c9207c78 ffffffff8118e283 00000001c9207c60
[  755.141130]  ffff880215cfe120 00000001c9207c78 ffffea00031f9b00 ffffea00031f8480
[  755.141131] Call Trace:
[  755.141133]  [<ffffffff8118e283>] mem_cgroup_migrate+0xe3/0x210
[  755.141135]  [<ffffffff8111a086>] replace_page_cache_page+0xf6/0x1c0
[  755.141137]  [<ffffffff8127aceb>] fuse_copy_page+0x1bb/0x5f0
[  755.141138]  [<ffffffff8127b20f>] fuse_copy_args+0xef/0x140
[  755.141140]  [<ffffffff8127caba>] fuse_dev_do_write+0x7ba/0xd30
[  755.141143]  [<ffffffff8109518d>] ? trace_hardirqs_on_caller+0x15d/0x200
[  755.141146]  [<ffffffff816a83ea>] ? __mutex_unlock_slowpath+0xaa/0x180
[  755.141147]  [<ffffffff8109518d>] ? trace_hardirqs_on_caller+0x15d/0x200
[  755.141148]  [<ffffffff8109523d>] ? trace_hardirqs_on+0xd/0x10
[  755.141150]  [<ffffffff8127d2b2>] fuse_dev_splice_write+0x282/0x360
[  755.141152]  [<ffffffff811c4ce1>] SyS_splice+0x351/0x800
[  755.141153]  [<ffffffff8109518d>] ? trace_hardirqs_on_caller+0x15d/0x200
[  755.141155]  [<ffffffff816ab192>] system_call_fastpath+0x16/0x1b
[  755.141166] Code: 07 48 89 10 8b 75 e4 e8 f8 fd ff ff 48 83 c4 10 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 48 c7 c6 68 0b 9c 81 48 89 df e8 e9 a2 f9 ff <0f> 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 48 39 f7 74 26 48 85
[  755.141167] RIP  [<ffffffff81188497>] commit_charge+0xa7/0xb0
[  755.141167]  RSP <ffff8800c9207c18>
[  755.141665] ---[ end trace 2d0ea36c8e3ded5b ]---

If the target page is already charged, just leave it as is and abort
the charge migration attempt.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b7c9a202dee9..3eaa6e83c168 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6660,6 +6660,12 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 	if (mem_cgroup_disabled())
 		return;
 
+	/* Page cache replacement: new page already charged? */
+	pc = lookup_page_cgroup(newpage);
+	if (PageCgroupUsed(pc))
+		return;
+
+	/* Re-entrant migration: old page already uncharged? */
 	pc = lookup_page_cgroup(oldpage);
 	if (!PageCgroupUsed(pc))
 		return;
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-23 21:02                           ` Johannes Weiner
@ 2014-07-24  8:46                             ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-24  8:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Wed 23-07-14 17:02:41, Johannes Weiner wrote:
[...]
> From 2c3525cb556313936845a7c57f4c4adc655b6680 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 23 Jul 2014 15:00:15 -0400
> Subject: [patch] mm: memcontrol: rewrite uncharge API fix - page cache
>  migration 2
> 
> In case of fuse page cache replacement the target page in migration
> can already be charged when splice steals it from page cache.  That
> triggers the !PageCgroupUsed() assertion during commit:
> 
> [  755.141095] page:ffffea00031f9b00 count:2 mapcount:0 mapping:ffff8800c84d1858 index:0x0
> [  755.141097] page flags: 0x3fffc000000029(locked|uptodate|lru)
> [  755.141098] page dumped because: VM_BUG_ON_PAGE(PageCgroupUsed(pc))
> [  755.141098] pc:ffff880215cfe6c0 pc->flags:7 pc->mem_cgroup:ffff880216c23000
> [  755.141113] ------------[ cut here ]------------
> [  755.141113] kernel BUG at /home/hannes/src/linux/linux/mm/memcontrol.c:2736!
> [  755.141115] invalid opcode: 0000 [#1] SMP
> [  755.141117] CPU: 0 PID: 342 Comm: lt-fusexmp_fh Not tainted 3.16.0-rc5-mm1-00502-g5e5b90c20054 #367
> [  755.141117] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS, BIOS P1.30 05/10/2012
> [  755.141118] task: ffff880213104580 ti: ffff8800c9204000 task.ti: ffff8800c9204000
> [  755.141121] RIP: 0010:[<ffffffff81188497>]  [<ffffffff81188497>] commit_charge+0xa7/0xb0
> [  755.141122] RSP: 0018:ffff8800c9207c18  EFLAGS: 00010286
> [  755.141123] RAX: 000000000000003f RBX: ffffea00031f9b00 RCX: 0000000000004c4b
> [  755.141123] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff880213104580
> [  755.141124] RBP: ffff8800c9207c40 R08: 0000000000000001 R09: 0000000000000000
> [  755.141124] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880216c23000
> [  755.141125] R13: 0000000000000001 R14: ffff8800c84d1858 R15: 0000000000000000
> [  755.141125] FS:  00007fc15f7fe700(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
> [  755.141126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  755.141127] CR2: 00007f693db3b6b0 CR3: 0000000211d54000 CR4: 00000000000407f0
> [  755.141127] Stack:
> [  755.141128]  ffffea00031f8480 ffffea00031f9b00 ffffea00031f8480 ffffea00031f9b00
> [  755.141129]  0000000000000000 ffff8800c9207c78 ffffffff8118e283 00000001c9207c60
> [  755.141130]  ffff880215cfe120 00000001c9207c78 ffffea00031f9b00 ffffea00031f8480
> [  755.141131] Call Trace:
> [  755.141133]  [<ffffffff8118e283>] mem_cgroup_migrate+0xe3/0x210
> [  755.141135]  [<ffffffff8111a086>] replace_page_cache_page+0xf6/0x1c0
> [  755.141137]  [<ffffffff8127aceb>] fuse_copy_page+0x1bb/0x5f0
> [  755.141138]  [<ffffffff8127b20f>] fuse_copy_args+0xef/0x140
> [  755.141140]  [<ffffffff8127caba>] fuse_dev_do_write+0x7ba/0xd30
> [  755.141143]  [<ffffffff8109518d>] ? trace_hardirqs_on_caller+0x15d/0x200
> [  755.141146]  [<ffffffff816a83ea>] ? __mutex_unlock_slowpath+0xaa/0x180
> [  755.141147]  [<ffffffff8109518d>] ? trace_hardirqs_on_caller+0x15d/0x200
> [  755.141148]  [<ffffffff8109523d>] ? trace_hardirqs_on+0xd/0x10
> [  755.141150]  [<ffffffff8127d2b2>] fuse_dev_splice_write+0x282/0x360
> [  755.141152]  [<ffffffff811c4ce1>] SyS_splice+0x351/0x800
> [  755.141153]  [<ffffffff8109518d>] ? trace_hardirqs_on_caller+0x15d/0x200
> [  755.141155]  [<ffffffff816ab192>] system_call_fastpath+0x16/0x1b
> [  755.141166] Code: 07 48 89 10 8b 75 e4 e8 f8 fd ff ff 48 83 c4 10 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 48 c7 c6 68 0b 9c 81 48 89 df e8 e9 a2 f9 ff <0f> 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 48 39 f7 74 26 48 85
> [  755.141167] RIP  [<ffffffff81188497>] commit_charge+0xa7/0xb0
> [  755.141167]  RSP <ffff8800c9207c18>
> [  755.141665] ---[ end trace 2d0ea36c8e3ded5b ]---
> 
> If the target page is already charged, just leave it as is and abort
> the charge migration attempt.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

We can reduce the lookup only to the lruvec==true case, no?

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b7c9a202dee9..3eaa6e83c168 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6660,6 +6660,12 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
>  	if (mem_cgroup_disabled())
>  		return;
>  
> +	/* Page cache replacement: new page already charged? */
> +	pc = lookup_page_cgroup(newpage);
> +	if (PageCgroupUsed(pc))
> +		return;
> +
> +	/* Re-entrant migration: old page already uncharged? */
>  	pc = lookup_page_cgroup(oldpage);
>  	if (!PageCgroupUsed(pc))
>  		return;
> -- 
> 2.0.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-24  8:46                             ` Michal Hocko
  (?)
@ 2014-07-24  9:02                               ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-24  9:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Thu 24-07-14 10:46:44, Michal Hocko wrote:
> On Wed 23-07-14 17:02:41, Johannes Weiner wrote:
[...]
> We can reduce the lookup only to the lruvec==true case, no?

Dohh
s@can@should@

newpage shouldn't be charged in all other cases and it would be a bug.
Or am I missing something?
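
I.e., something along these lines on top of your fix (illustrative
only; by lruvec above I mean the lrucare parameter of
mem_cgroup_migrate()):

	if (lrucare) {
		/* only page cache replacement can legitimately hand
		 * in an already-charged newpage; skip the lookup on
		 * every other migration path */
		pc = lookup_page_cgroup(newpage);
		if (PageCgroupUsed(pc))
			return;
	}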

> Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> > ---
> >  mm/memcontrol.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index b7c9a202dee9..3eaa6e83c168 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6660,6 +6660,12 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  
> > +	/* Page cache replacement: new page already charged? */
> > +	pc = lookup_page_cgroup(newpage);
> > +	if (PageCgroupUsed(pc))
> > +		return;
> > +
> > +	/* Re-entrant migration: old page already uncharged? */
> >  	pc = lookup_page_cgroup(oldpage);
> >  	if (!PageCgroupUsed(pc))
> >  		return;
> > -- 
> > 2.0.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-24  9:02                               ` Michal Hocko
  (?)
@ 2014-07-25 15:26                                 ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-25 15:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Thu, Jul 24, 2014 at 11:02:57AM +0200, Michal Hocko wrote:
> On Thu 24-07-14 10:46:44, Michal Hocko wrote:
> > On Wed 23-07-14 17:02:41, Johannes Weiner wrote:
> [...]
> > We can reduce the lookup only to the lruvec==true case, no?
> 
> Dohh
> s@can@should@
> 
> newpage shouldn't be charged in all other cases and it would be a bug.
> Or am I missing something?

Yeah, but I'd hate to put that assumption onto the @lrucare parameter,
it just coincides.

> > Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-25 15:26                                 ` Johannes Weiner
@ 2014-07-25 15:43                                   ` Michal Hocko
  -1 siblings, 0 replies; 141+ messages in thread
From: Michal Hocko @ 2014-07-25 15:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Fri 25-07-14 11:26:54, Johannes Weiner wrote:
> On Thu, Jul 24, 2014 at 11:02:57AM +0200, Michal Hocko wrote:
> > On Thu 24-07-14 10:46:44, Michal Hocko wrote:
> > > On Wed 23-07-14 17:02:41, Johannes Weiner wrote:
> > [...]
> > > We can reduce the lookup only to the lruvec==true case, no?
> > 
> > Dohh
> > s@can@should@
> > 
> > newpage shouldn't be charged in all other cases and it would be a bug.
> > Or am I missing something?
> 
> Yeah, but I'd hate to put that assumption onto the @lrucare parameter,
> it just coincides.

Yes, you are right. Maybe replace_page_cache_page should have its own
memcg variant which does all the trickery and then calls
mem_cgroup_migrate when necessary...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 13/13] mm: memcontrol: rewrite uncharge API
  2014-07-25 15:43                                   ` Michal Hocko
@ 2014-07-25 17:34                                     ` Johannes Weiner
  -1 siblings, 0 replies; 141+ messages in thread
From: Johannes Weiner @ 2014-07-25 17:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Miklos Szeredi, Andrew Morton, Hugh Dickins, Tejun Heo,
	Vladimir Davydov, linux-mm, cgroups, Kernel Mailing List

On Fri, Jul 25, 2014 at 05:43:20PM +0200, Michal Hocko wrote:
> On Fri 25-07-14 11:26:54, Johannes Weiner wrote:
> > On Thu, Jul 24, 2014 at 11:02:57AM +0200, Michal Hocko wrote:
> > > On Thu 24-07-14 10:46:44, Michal Hocko wrote:
> > > > On Wed 23-07-14 17:02:41, Johannes Weiner wrote:
> > > [...]
> > > > We can reduce the lookup only to the lruvec==true case, no?
> > > 
> > > Dohh
> > > s@can@should@
> > > 
> > > newpage shouldn't be charged in all other cases and it would be a bug.
> > > Or am I missing something?
> > 
> > Yeah, but I'd hate to put that assumption onto the @lrucare parameter,
> > it just coincides.
> 
> Yes, you are right. Maybe replace_page_cache_page should have its own
> memcg variant which does all the trickery and then calls
> mem_cgroup_migrate when necessary...

The code flow doesn't really lend itself to nesting.  It's basically
three steps: validate input, clear the old page, commit the new page.

void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
{
	struct page_cgroup *pc;

	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
	VM_BUG_ON_PAGE(PageTransHuge(oldpage) != PageTransHuge(newpage),
		       newpage);

	if (mem_cgroup_disabled())
		return;

	/* Re-entrant migration: old page already uncharged? */
	pc = lookup_page_cgroup(oldpage);
	if (!PageCgroupUsed(pc))
		return;

	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);

	pc->flags = 0;
	commit_charge(newpage, pc->mem_cgroup, false);
}

void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage)
{
	struct page_cgroup *pc;
	int isolated;

	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);

	if (mem_cgroup_disabled())
		return;

	/* New page already charged? */
	pc = lookup_page_cgroup(newpage);
	if (PageCgroupUsed(pc))
		return;

	pc = lookup_page_cgroup(oldpage);

	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);

	lock_page_lru(oldpage, &isolated);
	pc->flags = 0;
	unlock_page_lru(oldpage, isolated);

	commit_charge(newpage, pc->mem_cgroup, true);
}

Only the call to commit_charge() is the same and there is a little bit
of overlap in the VM_BUG_ON_PAGEs...  I'd rather have a single migrate
function, because it's so small that the code is simpler than nesting
and/or duplicating multiple functions.

^ permalink raw reply	[flat|nested] 141+ messages in thread

end of thread, other threads:[~2014-07-25 17:34 UTC | newest]

Thread overview: 141+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-18 20:40 [patch 00/13] mm: memcontrol: naturalize charge lifetime v4 Johannes Weiner
2014-06-18 20:40 ` Johannes Weiner
2014-06-18 20:40 ` [patch 01/13] mm: memcontrol: fold mem_cgroup_do_charge() Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 02/13] mm: memcontrol: rearrange charging fast path Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 03/13] mm: memcontrol: reclaim at least once for __GFP_NORETRY Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 04/13] mm: huge_memory: use GFP_TRANSHUGE when charging huge pages Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 05/13] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 06/13] mm: memcontrol: remove explicit OOM parameter in charge path Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 07/13] mm: memcontrol: simplify move precharge function Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 08/13] mm: memcontrol: catch root bypass in move precharge Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 09/13] mm: memcontrol: use root_mem_cgroup res_counter Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 10/13] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 11/13] mm: memcontrol: do not acquire page_cgroup lock for kmem pages Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-18 20:40 ` [patch 12/13] mm: memcontrol: rewrite charge API Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-23  6:15   ` Uwe Kleine-König
2014-06-23  6:15     ` Uwe Kleine-König
2014-06-23  6:15     ` Uwe Kleine-König
2014-06-23  9:30     ` Michal Hocko
2014-06-23  9:30       ` Michal Hocko
2014-06-23  9:30       ` Michal Hocko
2014-06-23  9:42       ` Uwe Kleine-König
2014-06-23  9:42         ` Uwe Kleine-König
2014-06-23  9:42         ` Uwe Kleine-König
2014-07-14 15:04   ` Michal Hocko
2014-07-14 15:04     ` Michal Hocko
2014-07-14 15:04     ` Michal Hocko
2014-07-14 17:13     ` Johannes Weiner
2014-07-14 17:13       ` Johannes Weiner
2014-07-14 18:43       ` Michal Hocko
2014-07-14 18:43         ` Michal Hocko
2014-06-18 20:40 ` [patch 13/13] mm: memcontrol: rewrite uncharge API Johannes Weiner
2014-06-18 20:40   ` Johannes Weiner
2014-06-20 16:36   ` [PATCH -mm] memcg: mem_cgroup_charge_statistics needs preempt_disable Michal Hocko
2014-06-20 16:36     ` Michal Hocko
2014-06-23  4:16     ` Johannes Weiner
2014-06-23  4:16       ` Johannes Weiner
2014-06-21  0:34   ` [patch 13/13] mm: memcontrol: rewrite uncharge API Sasha Levin
2014-06-21  0:34     ` Sasha Levin
2014-06-21  0:56     ` Andrew Morton
2014-06-21  0:56       ` Andrew Morton
2014-06-21  0:56       ` Andrew Morton
2014-06-21  1:03       ` Sasha Levin
2014-06-21  1:03         ` Sasha Levin
2014-07-15  8:25   ` Michal Hocko
2014-07-15  8:25     ` Michal Hocko
2014-07-15  8:25     ` Michal Hocko
2014-07-15 12:19     ` Michal Hocko
2014-07-15 12:19       ` Michal Hocko
2014-07-18  7:12       ` Michal Hocko
2014-07-18  7:12         ` Michal Hocko
2014-07-18 14:45         ` Johannes Weiner
2014-07-18 14:45           ` Johannes Weiner
2014-07-18 14:45           ` Johannes Weiner
2014-07-18 15:12           ` Miklos Szeredi
2014-07-18 15:12             ` Miklos Szeredi
2014-07-19 17:39             ` Johannes Weiner
2014-07-19 17:39               ` Johannes Weiner
2014-07-19 17:39               ` Johannes Weiner
2014-07-22 15:08               ` Michal Hocko
2014-07-22 15:08                 ` Michal Hocko
2014-07-22 15:44                 ` Miklos Szeredi
2014-07-22 15:44                   ` Miklos Szeredi
2014-07-22 15:44                   ` Miklos Szeredi
2014-07-23 14:38                   ` Michal Hocko
2014-07-23 14:38                     ` Michal Hocko
2014-07-23 14:38                     ` Michal Hocko
2014-07-23 15:06                     ` Johannes Weiner
2014-07-23 15:06                       ` Johannes Weiner
2014-07-23 15:19                       ` Michal Hocko
2014-07-23 15:19                         ` Michal Hocko
2014-07-23 15:19                         ` Michal Hocko
2014-07-23 15:36                         ` Johannes Weiner
2014-07-23 15:36                           ` Johannes Weiner
2014-07-23 18:08                       ` Miklos Szeredi
2014-07-23 18:08                         ` Miklos Szeredi
2014-07-23 21:02                         ` Johannes Weiner
2014-07-23 21:02                           ` Johannes Weiner
2014-07-23 21:02                           ` Johannes Weiner
2014-07-24  8:46                           ` Michal Hocko
2014-07-24  8:46                             ` Michal Hocko
2014-07-24  9:02                             ` Michal Hocko
2014-07-24  9:02                               ` Michal Hocko
2014-07-24  9:02                               ` Michal Hocko
2014-07-25 15:26                               ` Johannes Weiner
2014-07-25 15:26                                 ` Johannes Weiner
2014-07-25 15:26                                 ` Johannes Weiner
2014-07-25 15:43                                 ` Michal Hocko
2014-07-25 15:43                                   ` Michal Hocko
2014-07-25 17:34                                   ` Johannes Weiner
2014-07-25 17:34                                     ` Johannes Weiner
2014-07-15 14:23     ` Michal Hocko
2014-07-15 14:23       ` Michal Hocko
2014-07-15 14:23       ` Michal Hocko
2014-07-15 15:09       ` Johannes Weiner
2014-07-15 15:09         ` Johannes Weiner
2014-07-15 15:18         ` Michal Hocko
2014-07-15 15:18           ` Michal Hocko
2014-07-15 15:46           ` Johannes Weiner
2014-07-15 15:46             ` Johannes Weiner
2014-07-15 15:56             ` Michal Hocko
2014-07-15 15:56               ` Michal Hocko
2014-07-15 15:55   ` Naoya Horiguchi
2014-07-15 15:55     ` Naoya Horiguchi
2014-07-15 16:07     ` Michal Hocko
2014-07-15 16:07       ` Michal Hocko
2014-07-15 17:34       ` Johannes Weiner
2014-07-15 17:34         ` Johannes Weiner
2014-07-15 17:34         ` Johannes Weiner
2014-07-15 18:21         ` Michal Hocko
2014-07-15 18:21           ` Michal Hocko
2014-07-15 18:21           ` Michal Hocko
2014-07-15 18:43         ` Naoya Horiguchi
2014-07-15 18:43           ` Naoya Horiguchi
2014-07-15 19:04           ` Johannes Weiner
2014-07-15 19:04             ` Johannes Weiner
2014-07-15 19:04             ` Johannes Weiner
2014-07-15 20:49             ` Naoya Horiguchi
2014-07-15 20:49               ` Naoya Horiguchi
2014-07-15 21:48               ` Johannes Weiner
2014-07-15 21:48                 ` Johannes Weiner
2014-07-16  7:55                 ` Michal Hocko
2014-07-16  7:55                   ` Michal Hocko
2014-07-16 13:30                 ` Naoya Horiguchi
2014-07-16 13:30                   ` Naoya Horiguchi
2014-07-16 14:14                   ` Johannes Weiner
2014-07-16 14:14                     ` Johannes Weiner
2014-07-16 14:57                     ` Naoya Horiguchi
2014-07-16 14:57                       ` Naoya Horiguchi
2014-07-16 14:57                       ` Naoya Horiguchi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.