* mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch
@ 2014-08-06 20:59 Andrew Morton
  2014-08-06 21:00 ` mm: memcontrol: rewrite charge API Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andrew Morton @ 2014-08-06 20:59 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko; +Cc: linux-mm


Do we feel these are ready for merging?

I'll send along the consolidated patches for eyeballs.


* mm: memcontrol: rewrite charge API
  2014-08-06 20:59 mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch Andrew Morton
@ 2014-08-06 21:00 ` Andrew Morton
  2014-08-06 21:00   ` mm: memcontrol: rewrite uncharge API Andrew Morton
  2014-08-07  2:26 ` mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch Johannes Weiner
  2014-08-07  7:44 ` Michal Hocko
  2 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2014-08-06 21:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, linux-mm

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: rewrite charge API

The memcg charge API charges pages before they are rmapped - i.e.  have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on.  Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

  mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
  pages from the memcg if necessary.

  mem_cgroup_commit_charge() commits the page to the charge once it
  has a valid page->mapping and PageAnon() reliably tells the type.

  mem_cgroup_cancel_charge() aborts the transaction.

This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.

As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again.  Revive lru_cache_add_active_or_unevictable().
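
To illustrate the new transaction at a callsite, here is a condensed
sketch modeled on the do_anonymous_page() hunk in mm/memory.c below.
It is not a literal excerpt: pte_still_valid stands in for the real
page-table-lock revalidation, and error unwinding is elided.

  struct mem_cgroup *memcg;

  /* Reserve the charge up front; this may reclaim from the memcg. */
  if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
          return VM_FAULT_OOM;

  /* ... map the page under the page table lock ... */
  if (pte_still_valid) {
          /* rmap first, so page->mapping and PageAnon() are reliable, */
          page_add_new_anon_rmap(page, vma, address);
          /* then commit the charge and do the LRU addition explicitly. */
          mem_cgroup_commit_charge(page, memcg, false);
          lru_cache_add_active_or_unevictable(page, vma);
  } else {
          /* Instantiation failed: abort the transaction. */
          mem_cgroup_cancel_charge(page, memcg);
  }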

[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/cgroups/memcg_test.txt |   32 -
 include/linux/memcontrol.h           |   53 ---
 include/linux/swap.h                 |    3 
 kernel/events/uprobes.c              |   15 
 mm/filemap.c                         |   21 -
 mm/huge_memory.c                     |   57 ++-
 mm/memcontrol.c                      |  407 ++++++++++---------------
 mm/memory.c                          |   41 +-
 mm/rmap.c                            |   19 -
 mm/shmem.c                           |   37 +-
 mm/swap.c                            |   34 ++
 mm/swapfile.c                        |   14 
 12 files changed, 338 insertions(+), 395 deletions(-)

diff -puN Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-charge-api Documentation/cgroups/memcg_test.txt
--- a/Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-charge-api
+++ a/Documentation/cgroups/memcg_test.txt
@@ -24,24 +24,7 @@ Please note that implementation details
 
    a page/swp_entry may be charged (usage += PAGE_SIZE) at
 
-	mem_cgroup_charge_anon()
-	  Called at new page fault and Copy-On-Write.
-
-	mem_cgroup_try_charge_swapin()
-	  Called at do_swap_page() (page fault on swap entry) and swapoff.
-	  Followed by charge-commit-cancel protocol. (With swap accounting)
-	  At commit, a charge recorded in swap_cgroup is removed.
-
-	mem_cgroup_charge_file()
-	  Called at add_to_page_cache()
-
-	mem_cgroup_cache_charge_swapin()
-	  Called at shmem's swapin.
-
-	mem_cgroup_prepare_migration()
-	  Called before migration. "extra" charge is done and followed by
-	  charge-commit-cancel protocol.
-	  At commit, charge against oldpage or newpage will be committed.
+	mem_cgroup_try_charge()
 
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
@@ -69,19 +52,14 @@ Please note that implementation details
 	to new page is committed. At failure, charge to old page is committed.
 
 3. charge-commit-cancel
-	In some case, we can't know this "charge" is valid or not at charging
-	(because of races).
-	To handle such case, there are charge-commit-cancel functions.
-		mem_cgroup_try_charge_XXX
-		mem_cgroup_commit_charge_XXX
-		mem_cgroup_cancel_charge_XXX
-	these are used in swap-in and migration.
+	Memcg pages are charged in two steps:
+		mem_cgroup_try_charge()
+		mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
 
 	At try_charge(), there are no flags to say "this page is charged".
 	at this point, usage += PAGE_SIZE.
 
-	At commit(), the function checks the page should be charged or not
-	and set flags or avoid charging.(usage -= PAGE_SIZE)
+	At commit(), the page is associated with the memcg.
 
 	At cancel(), simply usage -= PAGE_SIZE.
 
diff -puN include/linux/memcontrol.h~mm-memcontrol-rewrite-charge-api include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~mm-memcontrol-rewrite-charge-api
+++ a/include/linux/memcontrol.h
@@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
-/*
- * All "charge" functions with gfp_mask should use GFP_KERNEL or
- * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
- * alloc memory but reclaims memory from all available zones. So, "where I want
- * memory from" bits of gfp_mask has no meaning. So any bits of that field is
- * available but adding a rule is better. charge functions' gfp_mask should
- * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
- * codes.
- * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
- */
-
-extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask);
-/* for swap handling */
-extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
-extern void mem_cgroup_commit_charge_swapin(struct page *page,
-					struct mem_cgroup *memcg);
-extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
-
-extern int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
+void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
+			      bool lrucare);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -233,30 +216,22 @@ void mem_cgroup_print_bad_page(struct pa
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
-static inline int mem_cgroup_charge_anon(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_charge_file(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
+static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+					gfp_t gfp_mask,
+					struct mem_cgroup **memcgp)
 {
+	*memcgp = NULL;
 	return 0;
 }
 
-static inline void mem_cgroup_commit_charge_swapin(struct page *page,
-					  struct mem_cgroup *memcg)
+static inline void mem_cgroup_commit_charge(struct page *page,
+					    struct mem_cgroup *memcg,
+					    bool lrucare)
 {
 }
 
-static inline void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
+static inline void mem_cgroup_cancel_charge(struct page *page,
+					    struct mem_cgroup *memcg)
 {
 }
 
diff -puN include/linux/swap.h~mm-memcontrol-rewrite-charge-api include/linux/swap.h
--- a/include/linux/swap.h~mm-memcontrol-rewrite-charge-api
+++ a/include/linux/swap.h
@@ -321,6 +321,9 @@ extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
 
+extern void lru_cache_add_active_or_unevictable(struct page *page,
+						struct vm_area_struct *vma);
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff -puN kernel/events/uprobes.c~mm-memcontrol-rewrite-charge-api kernel/events/uprobes.c
--- a/kernel/events/uprobes.c~mm-memcontrol-rewrite-charge-api
+++ a/kernel/events/uprobes.c
@@ -167,6 +167,11 @@ static int __replace_page(struct vm_area
 	/* For mmu_notifiers */
 	const unsigned long mmun_start = addr;
 	const unsigned long mmun_end   = addr + PAGE_SIZE;
+	struct mem_cgroup *memcg;
+
+	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
+	if (err)
+		return err;
 
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
@@ -179,6 +184,8 @@ static int __replace_page(struct vm_area
 
 	get_page(kpage);
 	page_add_new_anon_rmap(kpage, vma, addr);
+	mem_cgroup_commit_charge(kpage, memcg, false);
+	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
 		dec_mm_counter(mm, MM_FILEPAGES);
@@ -200,6 +207,7 @@ static int __replace_page(struct vm_area
 
 	err = 0;
  unlock:
+	mem_cgroup_cancel_charge(kpage, memcg);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	unlock_page(page);
 	return err;
@@ -315,18 +323,11 @@ retry:
 	if (!new_page)
 		goto put_old;
 
-	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))
-		goto put_new;
-
 	__SetPageUptodate(new_page);
 	copy_highpage(new_page, old_page);
 	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 
 	ret = __replace_page(vma, vaddr, old_page, new_page);
-	if (ret)
-		mem_cgroup_uncharge_page(new_page);
-
-put_new:
 	page_cache_release(new_page);
 put_old:
 	put_page(old_page);
diff -puN mm/filemap.c~mm-memcontrol-rewrite-charge-api mm/filemap.c
--- a/mm/filemap.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/filemap.c
@@ -31,6 +31,7 @@
 #include <linux/security.h>
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
+#include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
@@ -548,19 +549,24 @@ static int __add_to_page_cache_locked(st
 				      pgoff_t offset, gfp_t gfp_mask,
 				      void **shadowp)
 {
+	int huge = PageHuge(page);
+	struct mem_cgroup *memcg;
 	int error;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
-	error = mem_cgroup_charge_file(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
-	if (error)
-		return error;
+	if (!huge) {
+		error = mem_cgroup_try_charge(page, current->mm,
+					      gfp_mask, &memcg);
+		if (error)
+			return error;
+	}
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
-		mem_cgroup_uncharge_cache_page(page);
+		if (!huge)
+			mem_cgroup_cancel_charge(page, memcg);
 		return error;
 	}
 
@@ -575,13 +581,16 @@ static int __add_to_page_cache_locked(st
 		goto err_insert;
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
+	if (!huge)
+		mem_cgroup_commit_charge(page, memcg, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
+	if (!huge)
+		mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	return error;
 }
diff -puN mm/huge_memory.c~mm-memcontrol-rewrite-charge-api mm/huge_memory.c
--- a/mm/huge_memory.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/huge_memory.c
@@ -715,13 +715,20 @@ static int __do_huge_pmd_anonymous_page(
 					unsigned long haddr, pmd_t *pmd,
 					struct page *page)
 {
+	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	spinlock_t *ptl;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
+		return VM_FAULT_OOM;
+
 	pgtable = pte_alloc_one(mm, haddr);
-	if (unlikely(!pgtable))
+	if (unlikely(!pgtable)) {
+		mem_cgroup_cancel_charge(page, memcg);
 		return VM_FAULT_OOM;
+	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
 	/*
@@ -734,7 +741,7 @@ static int __do_huge_pmd_anonymous_page(
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(page);
+		mem_cgroup_cancel_charge(page, memcg);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -742,6 +749,8 @@ static int __do_huge_pmd_anonymous_page(
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, haddr, pmd, entry);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -827,13 +836,7 @@ int do_huge_pmd_anonymous_page(struct mm
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(mem_cgroup_charge_anon(page, mm, GFP_TRANSHUGE))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
 	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -979,6 +982,7 @@ static int do_huge_pmd_wp_page_fallback(
 					struct page *page,
 					unsigned long haddr)
 {
+	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pgtable_t pgtable;
 	pmd_t _pmd;
@@ -999,20 +1003,21 @@ static int do_huge_pmd_wp_page_fallback(
 					       __GFP_OTHER_NODE,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
-			     mem_cgroup_charge_anon(pages[i], mm,
-						       GFP_KERNEL))) {
+			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
+						   &memcg))) {
 			if (pages[i])
 				put_page(pages[i]);
-			mem_cgroup_uncharge_start();
 			while (--i >= 0) {
-				mem_cgroup_uncharge_page(pages[i]);
+				memcg = (void *)page_private(pages[i]);
+				set_page_private(pages[i], 0);
+				mem_cgroup_cancel_charge(pages[i], memcg);
 				put_page(pages[i]);
 			}
-			mem_cgroup_uncharge_end();
 			kfree(pages);
 			ret |= VM_FAULT_OOM;
 			goto out;
 		}
+		set_page_private(pages[i], (unsigned long)memcg);
 	}
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1041,7 +1046,11 @@ static int do_huge_pmd_wp_page_fallback(
 		pte_t *pte, entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vma, haddr);
+		mem_cgroup_commit_charge(pages[i], memcg, false);
+		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -1065,12 +1074,12 @@ out:
 out_free_pages:
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mem_cgroup_uncharge_page(pages[i]);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
+		mem_cgroup_cancel_charge(pages[i], memcg);
 		put_page(pages[i]);
 	}
-	mem_cgroup_uncharge_end();
 	kfree(pages);
 	goto out;
 }
@@ -1081,6 +1090,7 @@ int do_huge_pmd_wp_page(struct mm_struct
 	spinlock_t *ptl;
 	int ret = 0;
 	struct page *page = NULL, *new_page;
+	struct mem_cgroup *memcg;
 	unsigned long haddr;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -1132,7 +1142,8 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE))) {
+	if (unlikely(mem_cgroup_try_charge(new_page, mm,
+					   GFP_TRANSHUGE, &memcg))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_page(page);
@@ -1161,7 +1172,7 @@ alloc:
 		put_user_huge_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(new_page);
+		mem_cgroup_cancel_charge(new_page, memcg);
 		put_page(new_page);
 		goto out_mn;
 	} else {
@@ -1170,6 +1181,8 @@ alloc:
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
+		mem_cgroup_commit_charge(new_page, memcg, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache_pmd(vma, address, pmd);
 		if (!page) {
@@ -2389,6 +2402,7 @@ static void collapse_huge_page(struct mm
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	struct mem_cgroup *memcg;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2399,7 +2413,8 @@ static void collapse_huge_page(struct mm
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_TRANSHUGE)))
+	if (unlikely(mem_cgroup_try_charge(new_page, mm,
+					   GFP_TRANSHUGE, &memcg)))
 		return;
 
 	/*
@@ -2486,6 +2501,8 @@ static void collapse_huge_page(struct mm
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address);
+	mem_cgroup_commit_charge(new_page, memcg, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
@@ -2499,7 +2516,7 @@ out_up_write:
 	return;
 
 out:
-	mem_cgroup_uncharge_page(new_page);
+	mem_cgroup_cancel_charge(new_page, memcg);
 	goto out_up_write;
 }
 
diff -puN mm/memcontrol.c~mm-memcontrol-rewrite-charge-api mm/memcontrol.c
--- a/mm/memcontrol.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/memcontrol.c
@@ -2551,17 +2551,8 @@ static int memcg_cpu_hotplug_callback(st
 	return NOTIFY_OK;
 }
 
-/**
- * mem_cgroup_try_charge - try charging a memcg
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- *
- * Returns 0 if @memcg was charged successfully, -EINTR if the charge
- * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
- */
-static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages)
+static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
+		      unsigned int nr_pages)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -2660,41 +2651,7 @@ done:
 	return ret;
 }
 
-/**
- * mem_cgroup_try_charge_mm - try charging a mm
- * @mm: mm_struct to charge
- * @nr_pages: number of pages to charge
- * @oom: trigger OOM if reclaim fails
- *
- * Returns the charged mem_cgroup associated with the given mm_struct or
- * NULL the charge failed.
- */
-static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages)
-
-{
-	struct mem_cgroup *memcg;
-	int ret;
-
-	memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		memcg = NULL;
-
-	return memcg;
-}
-
-/*
- * Somemtimes we have to undo a charge we got by try_charge().
- * This function is for that and do uncharge, put css's refcnt.
- * gotten by try_charge().
- */
-static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
-				       unsigned int nr_pages)
+static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
@@ -2760,17 +2717,13 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 	return memcg;
 }
 
-static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
-				       struct page *page,
-				       unsigned int nr_pages,
-				       enum charge_type ctype,
-				       bool lrucare)
+static void commit_charge(struct page *page, struct mem_cgroup *memcg,
+			  unsigned int nr_pages, bool anon, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
 	struct lruvec *lruvec;
 	bool was_on_lru = false;
-	bool anon;
 
 	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
@@ -2807,11 +2760,6 @@ static void __mem_cgroup_commit_charge(s
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON)
-		anon = true;
-	else
-		anon = false;
-
 	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
 	unlock_page_cgroup(pc);
 
@@ -2882,21 +2830,21 @@ static int memcg_charge_kmem(struct mem_
 	if (ret)
 		return ret;
 
-	ret = mem_cgroup_try_charge(memcg, gfp, size >> PAGE_SHIFT);
+	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
 	if (ret == -EINTR)  {
 		/*
-		 * mem_cgroup_try_charge() chosed to bypass to root due to
-		 * OOM kill or fatal signal.  Since our only options are to
-		 * either fail the allocation or charge it to this cgroup, do
-		 * it as a temporary condition. But we can't fail. From a
-		 * kmem/slab perspective, the cache has already been selected,
-		 * by mem_cgroup_kmem_get_cache(), so it is too late to change
+		 * try_charge() chose to bypass to root due to OOM kill or
+		 * fatal signal.  Since our only options are to either fail
+		 * the allocation or charge it to this cgroup, do it as a
+		 * temporary condition. But we can't fail. From a kmem/slab
+		 * perspective, the cache has already been selected, by
+		 * mem_cgroup_kmem_get_cache(), so it is too late to change
 		 * our minds.
 		 *
 		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed during
-		 * mem_cgroup_try_charge() above. Tasks that were already
-		 * dying when the allocation triggers should have been already
+		 * memcg_charge_kmem in a sane state, but was OOM-killed
+		 * during try_charge() above. Tasks that were already dying
+		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
 		res_counter_charge_nofail(&memcg->res, size, &fail_res);
@@ -3618,164 +3566,6 @@ out:
 	return ret;
 }
 
-int mem_cgroup_charge_anon(struct page *page,
-			      struct mm_struct *mm, gfp_t gfp_mask)
-{
-	unsigned int nr_pages = 1;
-	struct mem_cgroup *memcg;
-
-	if (mem_cgroup_disabled())
-		return 0;
-
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	VM_BUG_ON(!mm);
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages);
-	if (!memcg)
-		return -ENOMEM;
-	__mem_cgroup_commit_charge(memcg, page, nr_pages,
-				   MEM_CGROUP_CHARGE_TYPE_ANON, false);
-	return 0;
-}
-
-/*
- * While swap-in, try_charge -> commit or cancel, the page is locked.
- * And when try_charge() successfully returns, one refcnt to memcg without
- * struct page_cgroup is acquired. This refcnt will be consumed by
- * "commit()" or removed by "cancel()"
- */
-static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-					  struct page *page,
-					  gfp_t mask,
-					  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-	int ret;
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Every swap fault against a single page tries to charge the
-	 * page, bail as early as possible.  shmem_unuse() encounters
-	 * already charged pages, too.  The USED bit is protected by
-	 * the page lock, which serializes swap cache removal, which
-	 * in turn serializes uncharging.
-	 */
-	if (PageCgroupUsed(pc))
-		goto out;
-	if (do_swap_account)
-		memcg = try_get_mem_cgroup_from_page(page);
-	if (!memcg)
-		memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, mask, 1);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		return ret;
-out:
-	*memcgp = memcg;
-	return 0;
-}
-
-int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
-				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
-{
-	if (mem_cgroup_disabled()) {
-		*memcgp = NULL;
-		return 0;
-	}
-	/*
-	 * A racing thread's fault, or swapoff, may have already
-	 * updated the pte, and even removed page from swap cache: in
-	 * those cases unuse_pte()'s pte_same() test will fail; but
-	 * there's also a KSM case which does need to charge the page.
-	 */
-	if (!PageSwapCache(page)) {
-		struct mem_cgroup *memcg;
-
-		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1);
-		if (!memcg)
-			return -ENOMEM;
-		*memcgp = memcg;
-		return 0;
-	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
-}
-
-void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-	__mem_cgroup_cancel_charge(memcg, 1);
-}
-
-static void
-__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
-					enum charge_type ctype)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-
-	__mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
-	/*
-	 * Now swap is on-memory. This means this page may be
-	 * counted both as mem and swap....double count.
-	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
-	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
-	 * may call delete_from_swap_cache() before reach here.
-	 */
-	if (do_swap_account && PageSwapCache(page)) {
-		swp_entry_t ent = {.val = page_private(page)};
-		mem_cgroup_uncharge_swap(ent);
-	}
-}
-
-void mem_cgroup_commit_charge_swapin(struct page *page,
-				     struct mem_cgroup *memcg)
-{
-	__mem_cgroup_commit_charge_swapin(page, memcg,
-					  MEM_CGROUP_CHARGE_TYPE_ANON);
-}
-
-int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
-{
-	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
-	struct mem_cgroup *memcg;
-	int ret;
-
-	if (mem_cgroup_disabled())
-		return 0;
-	if (PageCompound(page))
-		return 0;
-
-	if (PageSwapCache(page)) { /* shmem */
-		ret = __mem_cgroup_try_charge_swapin(mm, page,
-						     gfp_mask, &memcg);
-		if (ret)
-			return ret;
-		__mem_cgroup_commit_charge_swapin(page, memcg, type);
-		return 0;
-	}
-
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1);
-	if (!memcg)
-		return -ENOMEM;
-	__mem_cgroup_commit_charge(memcg, page, 1, type, false);
-	return 0;
-}
-
 static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
 				   unsigned int nr_pages,
 				   const enum charge_type ctype)
@@ -4122,7 +3912,6 @@ void mem_cgroup_prepare_migration(struct
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	enum charge_type ctype;
 
 	*memcgp = NULL;
 
@@ -4184,16 +3973,12 @@ void mem_cgroup_prepare_migration(struct
 	 * page. In the case new page is migrated but not remapped, new page's
 	 * mapcount will be finally 0 and we call uncharge in end_migration().
 	 */
-	if (PageAnon(page))
-		ctype = MEM_CGROUP_CHARGE_TYPE_ANON;
-	else
-		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	/*
 	 * The page is committed to the memcg, but it's not actually
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
+	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
 }
 
 /* remove redundant charge if migration failed*/
@@ -4252,7 +4037,6 @@ void mem_cgroup_replace_page_cache(struc
 {
 	struct mem_cgroup *memcg = NULL;
 	struct page_cgroup *pc;
-	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -4278,7 +4062,7 @@ void mem_cgroup_replace_page_cache(struc
 	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
 	 * LRU while we overwrite pc->mem_cgroup.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, type, true);
+	commit_charge(newpage, memcg, 1, false, true);
 }
 
 #ifdef CONFIG_DEBUG_VM
@@ -6319,20 +6103,19 @@ static int mem_cgroup_do_precharge(unsig
 	int ret;
 
 	/* Try a single bulk charge without reclaim first */
-	ret = mem_cgroup_try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
+	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
 	if (!ret) {
 		mc.precharge += count;
 		return ret;
 	}
 	if (ret == -EINTR) {
-		__mem_cgroup_cancel_charge(root_mem_cgroup, count);
+		cancel_charge(root_mem_cgroup, count);
 		return ret;
 	}
 
 	/* Try charges one by one with reclaim */
 	while (count--) {
-		ret = mem_cgroup_try_charge(mc.to,
-					    GFP_KERNEL & ~__GFP_NORETRY, 1);
+		ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_NORETRY, 1);
 		/*
 		 * In case of failure, any residual charges against
 		 * mc.to will be dropped by mem_cgroup_clear_mc()
@@ -6340,7 +6123,7 @@ static int mem_cgroup_do_precharge(unsig
 		 * bypassed to root right away or they'll be lost.
 		 */
 		if (ret == -EINTR)
-			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
+			cancel_charge(root_mem_cgroup, 1);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -6609,7 +6392,7 @@ static void __mem_cgroup_clear_mc(void)
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
-		__mem_cgroup_cancel_charge(mc.to, mc.precharge);
+		cancel_charge(mc.to, mc.precharge);
 		mc.precharge = 0;
 	}
 	/*
@@ -6617,7 +6400,7 @@ static void __mem_cgroup_clear_mc(void)
 	 * we must uncharge here.
 	 */
 	if (mc.moved_charge) {
-		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
+		cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
 	/* we must fixup refcnts and charges */
@@ -6946,6 +6729,150 @@ static void __init enable_swap_cgroup(vo
 }
 #endif
 
+/**
+ * mem_cgroup_try_charge - try charging a page
+ * @page: page to charge
+ * @mm: mm context of the victim
+ * @gfp_mask: reclaim mode
+ * @memcgp: charged memcg return
+ *
+ * Try to charge @page to the memcg that @mm belongs to, reclaiming
+ * pages according to @gfp_mask if necessary.
+ *
+ * Returns 0 on success, with *@memcgp pointing to the charged memcg.
+ * Otherwise, an error code is returned.
+ *
+ * After page->mapping has been set up, the caller must finalize the
+ * charge with mem_cgroup_commit_charge().  Or abort the transaction
+ * with mem_cgroup_cancel_charge() in case page instantiation fails.
+ */
+int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
+	int ret = 0;
+
+	if (mem_cgroup_disabled())
+		goto out;
+
+	if (PageSwapCache(page)) {
+		struct page_cgroup *pc = lookup_page_cgroup(page);
+		/*
+		 * Every swap fault against a single page tries to charge the
+		 * page, bail as early as possible.  shmem_unuse() encounters
+		 * already charged pages, too.  The USED bit is protected by
+		 * the page lock, which serializes swap cache removal, which
+		 * in turn serializes uncharging.
+		 */
+		if (PageCgroupUsed(pc))
+			goto out;
+	}
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	if (do_swap_account && PageSwapCache(page))
+		memcg = try_get_mem_cgroup_from_page(page);
+	if (!memcg)
+		memcg = get_mem_cgroup_from_mm(mm);
+
+	ret = try_charge(memcg, gfp_mask, nr_pages);
+
+	css_put(&memcg->css);
+
+	if (ret == -EINTR) {
+		memcg = root_mem_cgroup;
+		ret = 0;
+	}
+out:
+	*memcgp = memcg;
+	return ret;
+}
+
+/**
+ * mem_cgroup_commit_charge - commit a page charge
+ * @page: page to charge
+ * @memcg: memcg to charge the page to
+ * @lrucare: page might be on LRU already
+ *
+ * Finalize a charge transaction started by mem_cgroup_try_charge(),
+ * after page->mapping has been set up.  This must happen atomically
+ * as part of the page instantiation, i.e. under the page table lock
+ * for anonymous pages, under the page lock for page and swap cache.
+ *
+ * In addition, the page must not be on the LRU during the commit, to
+ * prevent racing with task migration.  If it might be, use @lrucare.
+ *
+ * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
+ */
+void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
+			      bool lrucare)
+{
+	unsigned int nr_pages = 1;
+
+	VM_BUG_ON_PAGE(!page->mapping, page);
+	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
+
+	if (mem_cgroup_disabled())
+		return;
+	/*
+	 * Swap faults will attempt to charge the same page multiple
+	 * times.  But reuse_swap_page() might have removed the page
+	 * from swapcache already, so we can't check PageSwapCache().
+	 */
+	if (!memcg)
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		/*
+		 * The swap entry might not get freed for a long time,
+		 * let's not wait for it.  The page already received a
+		 * memory+swap charge, drop the swap entry duplicate.
+		 */
+		mem_cgroup_uncharge_swap(entry);
+	}
+}
+
+/**
+ * mem_cgroup_cancel_charge - cancel a page charge
+ * @page: page to charge
+ * @memcg: memcg to charge the page to
+ *
+ * Cancel a charge transaction started by mem_cgroup_try_charge().
+ */
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+{
+	unsigned int nr_pages = 1;
+
+	if (mem_cgroup_disabled())
+		return;
+	/*
+	 * Swap faults will attempt to charge the same page multiple
+	 * times.  But reuse_swap_page() might have removed the page
+	 * from swapcache already, so we can't check PageSwapCache().
+	 */
+	if (!memcg)
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	cancel_charge(memcg, nr_pages);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff -puN mm/memory.c~mm-memcontrol-rewrite-charge-api mm/memory.c
--- a/mm/memory.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/memory.c
@@ -2049,6 +2049,7 @@ static int do_wp_page(struct mm_struct *
 	struct page *dirty_page = NULL;
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	struct mem_cgroup *memcg;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
@@ -2204,7 +2205,7 @@ gotten:
 	}
 	__SetPageUptodate(new_page);
 
-	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
 	mmun_start  = address & PAGE_MASK;
@@ -2234,6 +2235,8 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		mem_cgroup_commit_charge(new_page, memcg, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -2271,7 +2274,7 @@ gotten:
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
 	} else
-		mem_cgroup_uncharge_page(new_page);
+		mem_cgroup_cancel_charge(new_page, memcg);
 
 	if (new_page)
 		page_cache_release(new_page);
@@ -2407,10 +2410,10 @@ static int do_swap_page(struct mm_struct
 {
 	spinlock_t *ptl;
 	struct page *page, *swapcache;
+	struct mem_cgroup *memcg;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
-	struct mem_cgroup *ptr;
 	int exclusive = 0;
 	int ret = 0;
 
@@ -2486,7 +2489,7 @@ static int do_swap_page(struct mm_struct
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -2511,10 +2514,6 @@ static int do_swap_page(struct mm_struct
 	 * while the page is counted on swap but not yet in mapcount i.e.
 	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
 	 * must be called after the swap_free(), or it will never succeed.
-	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
-	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
-	 * in page->private. In this case, a record in swap_cgroup  is silently
-	 * discarded at swap_free().
 	 */
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
@@ -2530,12 +2529,14 @@ static int do_swap_page(struct mm_struct
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
 	set_pte_at(mm, address, page_table, pte);
-	if (page == swapcache)
+	if (page == swapcache) {
 		do_page_add_anon_rmap(page, vma, address, exclusive);
-	else /* ksm created a completely new copy */
+		mem_cgroup_commit_charge(page, memcg, true);
+	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, address);
-	/* It's better to call commit-charge after rmap is established */
-	mem_cgroup_commit_charge_swapin(page, ptr);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
+	}
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2568,7 +2569,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge_swapin(ptr);
+	mem_cgroup_cancel_charge(page, memcg);
 	pte_unmap_unlock(page_table, ptl);
 out_page:
 	unlock_page(page);
@@ -2624,6 +2625,7 @@ static int do_anonymous_page(struct mm_s
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		unsigned int flags)
 {
+	struct mem_cgroup *memcg;
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
@@ -2657,7 +2659,7 @@ static int do_anonymous_page(struct mm_s
 	 */
 	__SetPageUptodate(page);
 
-	if (mem_cgroup_charge_anon(page, mm, GFP_KERNEL))
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_page;
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -2670,6 +2672,8 @@ static int do_anonymous_page(struct mm_s
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
+	mem_cgroup_commit_charge(page, memcg, false);
+	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
@@ -2679,7 +2683,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return 0;
 release:
-	mem_cgroup_uncharge_page(page);
+	mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
@@ -2917,6 +2921,7 @@ static int do_cow_fault(struct mm_struct
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
 	struct page *fault_page, *new_page;
+	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pte_t *pte;
 	int ret;
@@ -2928,7 +2933,7 @@ static int do_cow_fault(struct mm_struct
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL)) {
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
@@ -2948,12 +2953,14 @@ static int do_cow_fault(struct mm_struct
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
+	mem_cgroup_commit_charge(new_page, memcg, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
 	page_cache_release(fault_page);
 	return ret;
 uncharge_out:
-	mem_cgroup_uncharge_page(new_page);
+	mem_cgroup_cancel_charge(new_page, memcg);
 	page_cache_release(new_page);
 	return ret;
 }
diff -puN mm/rmap.c~mm-memcontrol-rewrite-charge-api mm/rmap.c
--- a/mm/rmap.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/rmap.c
@@ -1032,25 +1032,6 @@ void page_add_new_anon_rmap(struct page
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
 			hpage_nr_pages(page));
 	__page_set_anon_rmap(page, vma, address, 1);
-
-	VM_BUG_ON_PAGE(PageLRU(page), page);
-	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
-		SetPageActive(page);
-		lru_cache_add(page);
-		return;
-	}
-
-	if (!TestSetPageMlocked(page)) {
-		/*
-		 * We use the irq-unsafe __mod_zone_page_stat because this
-		 * counter is not modified from interrupt context, and the pte
-		 * lock is held(spinlock), which implies preemption disabled.
-		 */
-		__mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
-		count_vm_event(UNEVICTABLE_PGMLOCKED);
-	}
-	add_page_to_unevictable_list(page);
 }
 
 /**
diff -puN mm/shmem.c~mm-memcontrol-rewrite-charge-api mm/shmem.c
--- a/mm/shmem.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/shmem.c
@@ -604,7 +604,7 @@ static int shmem_unuse_inode(struct shme
 	radswap = swp_to_radix_entry(swap);
 	index = radix_tree_locate_item(&mapping->page_tree, radswap);
 	if (index == -1)
-		return 0;
+		return -EAGAIN;	/* tell shmem_unuse we found nothing */
 
 	/*
 	 * Move _head_ to start search for next from here.
@@ -663,7 +663,6 @@ static int shmem_unuse_inode(struct shme
 			spin_unlock(&info->lock);
 			swap_free(swap);
 		}
-		error = 1;	/* not an error, but entry was found */
 	}
 	return error;
 }
@@ -675,7 +674,7 @@ int shmem_unuse(swp_entry_t swap, struct
 {
 	struct list_head *this, *next;
 	struct shmem_inode_info *info;
-	int found = 0;
+	struct mem_cgroup *memcg;
 	int error = 0;
 
 	/*
@@ -690,26 +689,32 @@ int shmem_unuse(swp_entry_t swap, struct
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_charge_file(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
+	error = -EAGAIN;
 
 	mutex_lock(&shmem_swaplist_mutex);
 	list_for_each_safe(this, next, &shmem_swaplist) {
 		info = list_entry(this, struct shmem_inode_info, swaplist);
 		if (info->swapped)
-			found = shmem_unuse_inode(info, swap, &page);
+			error = shmem_unuse_inode(info, swap, &page);
 		else
 			list_del_init(&info->swaplist);
 		cond_resched();
-		if (found)
+		if (error != -EAGAIN)
 			break;
+		/* found nothing in this: move on to search the next */
 	}
 	mutex_unlock(&shmem_swaplist_mutex);
 
-	if (found < 0)
-		error = found;
+	if (error) {
+		if (error != -ENOMEM)
+			error = 0;
+		mem_cgroup_cancel_charge(page, memcg);
+	} else
+		mem_cgroup_commit_charge(page, memcg, true);
 out:
 	unlock_page(page);
 	page_cache_release(page);
@@ -1013,6 +1018,7 @@ static int shmem_getpage_gfp(struct inod
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo;
+	struct mem_cgroup *memcg;
 	struct page *page;
 	swp_entry_t swap;
 	int error;
@@ -1091,8 +1097,7 @@ repeat:
 				goto failed;
 		}
 
-		error = mem_cgroup_charge_file(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1108,12 +1113,16 @@ repeat:
 			 * Reset swap.val? No, leave it so "failed" goes back to
 			 * "repeat": reading a hole and writing should succeed.
 			 */
-			if (error)
+			if (error) {
+				mem_cgroup_cancel_charge(page, memcg);
 				delete_from_swap_cache(page);
+			}
 		}
 		if (error)
 			goto failed;
 
+		mem_cgroup_commit_charge(page, memcg, true);
+
 		spin_lock(&info->lock);
 		info->swapped--;
 		shmem_recalc_inode(inode);
@@ -1151,8 +1160,7 @@ repeat:
 		if (sgp == SGP_WRITE)
 			init_page_accessed(page);
 
-		error = mem_cgroup_charge_file(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (error)
 			goto decused;
 		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1162,9 +1170,10 @@ repeat:
 			radix_tree_preload_end();
 		}
 		if (error) {
-			mem_cgroup_uncharge_cache_page(page);
+			mem_cgroup_cancel_charge(page, memcg);
 			goto decused;
 		}
+		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
diff -puN mm/swap.c~mm-memcontrol-rewrite-charge-api mm/swap.c
--- a/mm/swap.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/swap.c
@@ -695,6 +695,40 @@ void add_page_to_unevictable_list(struct
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+/**
+ * lru_cache_add_active_or_unevictable
+ * @page:  the page to be added to LRU
+ * @vma:   vma in which page is mapped for determining reclaimability
+ *
+ * Place @page on the active or unevictable LRU list, depending on its
+ * evictability.  Note that if the page is not evictable, it goes
+ * directly back onto it's zone's unevictable list, it does NOT use a
+ * per cpu pagevec.
+ */
+void lru_cache_add_active_or_unevictable(struct page *page,
+					 struct vm_area_struct *vma)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+		SetPageActive(page);
+		lru_cache_add(page);
+		return;
+	}
+
+	if (!TestSetPageMlocked(page)) {
+		/*
+		 * We use the irq-unsafe __mod_zone_page_stat because this
+		 * counter is not modified from interrupt context, and the pte
+		 * lock is held(spinlock), which implies preemption disabled.
+		 */
+		__mod_zone_page_state(page_zone(page), NR_MLOCK,
+				    hpage_nr_pages(page));
+		count_vm_event(UNEVICTABLE_PGMLOCKED);
+	}
+	add_page_to_unevictable_list(page);
+}
+
 /*
  * If the page can not be invalidated, it is moved to the
  * inactive list to speed up its reclaim.  It is moved to the
diff -puN mm/swapfile.c~mm-memcontrol-rewrite-charge-api mm/swapfile.c
--- a/mm/swapfile.c~mm-memcontrol-rewrite-charge-api
+++ a/mm/swapfile.c
@@ -1106,15 +1106,14 @@ static int unuse_pte(struct vm_area_stru
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
-					 GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
-		mem_cgroup_cancel_charge_swapin(memcg);
+		mem_cgroup_cancel_charge(page, memcg);
 		ret = 0;
 		goto out;
 	}
@@ -1124,11 +1123,14 @@ static int unuse_pte(struct vm_area_stru
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
-	if (page == swapcache)
+	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr);
-	else /* ksm created a completely new copy */
+		mem_cgroup_commit_charge(page, memcg, true);
+	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr);
-	mem_cgroup_commit_charge_swapin(page, memcg);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
+	}
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
_


* mm: memcontrol: rewrite uncharge API
  2014-08-06 21:00 ` mm: memcontrol: rewrite charge API Andrew Morton
@ 2014-08-06 21:00   ` Andrew Morton
  2014-08-06 21:02     ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2014-08-06 21:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, linux-mm

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: rewrite uncharge API

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages.  However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped.  Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge.  The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
gets rid of mem_cgroup_replace_page_cache().

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock. 
Uncharging happens under full exclusion with no outstanding references. 
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration.  Remove the very costly page_cgroup
lock and set pc->flags non-atomically.
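
For orientation, the uncharge-side hooks introduced or kept by this
patch land at the following points in a page's lifetime. This is a
summary sketch based on the changelog and the memcg_test.txt hunk
below; the argument lists are assumptions and should be checked
against the full diff.

  /* Freeing: called once the final reference to the page is gone. */
  mem_cgroup_uncharge(page);

  /* Migration: called after newpage is fully rmapped; the memcg
   * association and outstanding charge move from oldpage to newpage. */
  mem_cgroup_migrate(oldpage, newpage, lrucare);

  /* Reclaim swapout: transfer the memory+swap charge (PCG_MEMSW) to
   * the swap entry before the final put_page(). */
  mem_cgroup_swapout(page, entry);

  /* Swap slot freeing: the swap charge disappears when the
   * swp_entry's refcount drops to zero. */
  mem_cgroup_uncharge_swap(entry);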

[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/cgroups/memcg_test.txt |  126 ---
 include/linux/memcontrol.h           |   49 -
 include/linux/page_cgroup.h          |   43 -
 include/linux/swap.h                 |   12 
 mm/filemap.c                         |    4 
 mm/memcontrol.c                      |  832 +++++++++----------------
 mm/memory.c                          |    2 
 mm/migrate.c                         |   44 -
 mm/rmap.c                            |    1 
 mm/shmem.c                           |    8 
 mm/swap.c                            |    6 
 mm/swap_state.c                      |    8 
 mm/swapfile.c                        |    7 
 mm/truncate.c                        |    9 
 mm/vmscan.c                          |   12 
 mm/zswap.c                           |    2 
 16 files changed, 394 insertions(+), 771 deletions(-)

diff -puN Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api Documentation/cgroups/memcg_test.txt
--- a/Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api
+++ a/Documentation/cgroups/memcg_test.txt
@@ -29,28 +29,13 @@ Please note that implementation details
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 
-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
 
 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFI
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
- 	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().
 
 	4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFI
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	    memory.usage/memsw.usage changes to this page/swp_entry will be
-	 Case          (A)      (B)       (C)     (D)
-         Event
-       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
-          ===========================================
-          (3)        +1/+1    +1/+1     +1/+1   +1/+1
-          (4)          -       0/ 0       -     -1/ 0
-          (5)         0/-1     0/ 0     -1/-1    0/ 0
-          (6)          -       0/-1       -      0/-1
-          ===========================================
-       Result         1/ 1     1/ 1      1/ 1    1/ 1
-
-       In any cases, charges to this page should be 1/ 1.
-
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 
@@ -136,28 +74,20 @@ Under below explanation, we assume CONFI
 	    swp_entry's refcnt -= 1.
 
 
-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.
 
 5. Page Cache
    	Page Cache is charged at
 	- add_to_page_cache_locked().
 
-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().
 
 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.
 
@@ -170,56 +100,10 @@ Under below explanation, we assume CONFI
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.
 
 7. Page Migration
-   	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-          try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
 
-	  But zap_pte() (by exit or munmap) can be called while migration,
-	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+	mem_cgroup_migrate()
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
diff -puN include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api
+++ a/include/linux/memcontrol.h
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct pag
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);
 
-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_str
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 
-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(st
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_cha
 {
 }
 
-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
 
-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }
 
@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-		struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff -puN include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api include/linux/page_cgroup.h
--- a/include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api
+++ a/include/linux/page_cgroup.h
@@ -3,9 +3,9 @@
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED = 0x01,	/* This page is charged to a memcg */
+	PCG_MEM = 0x02,		/* This page holds a memory charge */
+	PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };
 
@@ -44,42 +44,9 @@ static inline void __init page_cgroup_in
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define TESTCLEARPCGFLAG(uname, lname)			\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return !!(pc->flags & PCG_USED);
 }
 
 #else /* CONFIG_MEMCG */
diff -puN include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api include/linux/swap.h
--- a/include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api
+++ a/include/linux/swap.h
@@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
@@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }
 
diff -puN mm/filemap.c~mm-memcontrol-rewrite-uncharge-api mm/filemap.c
--- a/mm/filemap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/filemap.c
@@ -234,7 +234,6 @@ void delete_from_page_cache(struct page
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
 		freepage(page);
@@ -490,8 +489,7 @@ int replace_page_cache_page(struct page
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
diff -puN mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api mm/memcontrol.c
--- a/mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/memcontrol.c
@@ -754,9 +754,11 @@ static void __mem_cgroup_remove_exceeded
 static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 				       struct mem_cgroup_tree_per_zone *mctz)
 {
-	spin_lock(&mctz->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mctz->lock, flags);
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
 
@@ -779,7 +781,9 @@ static void mem_cgroup_update_tree(struc
 		 * mem is over its softlimit.
 		 */
 		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
+			unsigned long flags;
+
+			spin_lock_irqsave(&mctz->lock, flags);
 			/* if on-tree, remove it */
 			if (mz->on_tree)
 				__mem_cgroup_remove_exceeded(mz, mctz);
@@ -788,7 +792,7 @@ static void mem_cgroup_update_tree(struc
 			 * If excess is 0, no tree ops.
 			 */
 			__mem_cgroup_insert_exceeded(mz, mctz, excess);
-			spin_unlock(&mctz->lock);
+			spin_unlock_irqrestore(&mctz->lock, flags);
 		}
 	}
 }
@@ -839,9 +843,9 @@ mem_cgroup_largest_soft_limit_node(struc
 {
 	struct mem_cgroup_per_zone *mz;
 
-	spin_lock(&mctz->lock);
+	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irq(&mctz->lock);
 	return mz;
 }
 
@@ -882,13 +886,6 @@ static long mem_cgroup_read_stat(struct
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
-{
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
-}
-
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
@@ -909,13 +906,13 @@ static unsigned long mem_cgroup_read_eve
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 bool anon, int nr_pages)
+					 int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
-	if (anon)
+	if (PageAnon(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
@@ -1013,7 +1010,6 @@ static bool mem_cgroup_event_ratelimit(s
  */
 static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 {
-	preempt_disable();
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -1026,8 +1022,6 @@ static void memcg_check_events(struct me
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
 #endif
-		preempt_enable();
-
 		mem_cgroup_threshold(memcg);
 		if (unlikely(do_softlimit))
 			mem_cgroup_update_tree(memcg, page);
@@ -1035,8 +1029,7 @@ static void memcg_check_events(struct me
 		if (unlikely(do_numainfo))
 			atomic_inc(&memcg->numainfo_events);
 #endif
-	} else
-		preempt_enable();
+	}
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
@@ -1347,20 +1340,6 @@ out:
 	return lruvec;
 }
 
-/*
- * Following LRU functions are allowed to be used without PCG_LOCK.
- * Operations are called by routine of global LRU independently from memcg.
- * What we have to take care of here is validness of pc->mem_cgroup.
- *
- * Changes to pc->mem_cgroup happens when
- * 1. charge
- * 2. moving account
- * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
- * It is added to LRU before charge.
- * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
- * When moving account, the page is not on LRU. It's isolated.
- */
-
 /**
  * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
@@ -2261,22 +2240,14 @@ cleanup:
  *
  * Notes: Race condition
  *
- * We usually use lock_page_cgroup() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
- *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
- *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
- *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check memcg->moving_account and detect there are possibility
- * of race or not. If there is, we take a lock.
+ * Charging occurs during page instantiation, while the page is
+ * unmapped and locked in page migration, or while the page table is
+ * locked in THP migration.  No race is possible.
+ *
+ * Uncharge happens to pages with zero references, no race possible.
+ *
+ * Charge moving between groups is protected by checking mm->moving
+ * account and taking the move_lock in the slowpath.
  */
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
@@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_loo
 	return mem_cgroup_from_id(id);
 }
 
+/*
+ * try_get_mem_cgroup_from_page - look up page's memcg association
+ * @page: the page
+ *
+ * Look up, get a css reference, and return the memcg that owns @page.
+ *
+ * The page must be locked to prevent racing with swap-in and page
+ * cache charges.  If coming from an unlocked page table, the caller
+ * must ensure the page is on the LRU or this can race with charging.
+ */
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		memcg = pc->mem_cgroup;
 		if (memcg && !css_tryget_online(&memcg->css))
@@ -2713,19 +2693,46 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 			memcg = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
 	return memcg;
 }
 
+static void lock_page_lru(struct page *page, int *isolated)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page)) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		ClearPageLRU(page);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		*isolated = 1;
+	} else
+		*isolated = 0;
+}
+
+static void unlock_page_lru(struct page *page, int isolated)
+{
+	struct zone *zone = page_zone(page);
+
+	if (isolated) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+		SetPageLRU(page);
+		add_page_to_lru_list(page, lruvec, page_lru(page));
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare)
+			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct zone *uninitialized_var(zone);
-	struct lruvec *lruvec;
-	bool was_on_lru = false;
+	int isolated;
 
-	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
 	 * we don't need page_cgroup_lock about tail pages, becase they are not
@@ -2736,39 +2743,38 @@ static void commit_charge(struct page *p
 	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
 	 * may already be on some other mem_cgroup's LRU.  Take care of it.
 	 */
-	if (lrucare) {
-		zone = page_zone(page);
-		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page)) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, page_lru(page));
-			was_on_lru = true;
-		}
-	}
+	if (lrucare)
+		lock_page_lru(page, &isolated);
 
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point:
+	 *
+	 * - the page is uncharged
+	 *
+	 * - the page is off-LRU
+	 *
+	 * - an anonymous fault has exclusive page access, except for
+	 *   a locked page table
+	 *
+	 * - a page cache insertion, a swapin fault, or a migration
+	 *   have the page locked
+	 */
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
+	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
 
-	if (lrucare) {
-		if (was_on_lru) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			VM_BUG_ON_PAGE(PageLRU(page), page);
-			SetPageLRU(page);
-			add_page_to_lru_list(page, lruvec, page_lru(page));
-		}
-		spin_unlock_irq(&zone->lru_lock);
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
+	if (lrucare)
+		unlock_page_lru(page, isolated);
 
+	local_irq_disable();
+	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
 	 * if they exceeds softlimit.
 	 */
 	memcg_check_events(memcg, page);
+	local_irq_enable();
 }
 
 static DEFINE_MUTEX(set_limit_mutex);
@@ -3395,7 +3401,6 @@ static inline void memcg_unregister_all_
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
 /*
  * Because tail pages are not marked as "used", set it. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -3416,7 +3421,7 @@ void mem_cgroup_split_huge_fixup(struct
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
+		pc->flags = head_pc->flags;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
@@ -3446,7 +3451,6 @@ static int mem_cgroup_move_account(struc
 {
 	unsigned long flags;
 	int ret;
-	bool anon = PageAnon(page);
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -3460,15 +3464,21 @@ static int mem_cgroup_move_account(struc
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
-	lock_page_cgroup(pc);
+	/*
+	 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup
+	 * of its source page while we change it: page migration takes
+	 * both pages off the LRU, but page cache replacement doesn't.
+	 */
+	if (!trylock_page(page))
+		goto out;
 
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto unlock;
+		goto out_unlock;
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
+	if (!PageAnon(page) && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3482,20 +3492,25 @@ static int mem_cgroup_move_account(struc
 			       nr_pages);
 	}
 
-	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
+	/*
+	 * It is safe to change pc->mem_cgroup here because the page
+	 * is referenced, charged, and isolated - we can't race with
+	 * uncharging, charging, migration, or LRU putback.
+	 */
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
-unlock:
-	unlock_page_cgroup(pc);
-	/*
-	 * check events
-	 */
+
+	local_irq_disable();
+	mem_cgroup_charge_statistics(to, page, nr_pages);
 	memcg_check_events(to, page);
+	mem_cgroup_charge_statistics(from, page, -nr_pages);
 	memcg_check_events(from, page);
+	local_irq_enable();
+out_unlock:
+	unlock_page(page);
 out:
 	return ret;
 }
@@ -3566,193 +3581,6 @@ out:
 	return ret;
 }
 
-static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
-				   unsigned int nr_pages,
-				   const enum charge_type ctype)
-{
-	struct memcg_batch_info *batch = NULL;
-	bool uncharge_memsw = true;
-
-	/* If swapout, usage of swap doesn't decrease */
-	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-		uncharge_memsw = false;
-
-	batch = &current->memcg_batch;
-	/*
-	 * In usual, we do css_get() when we remember memcg pointer.
-	 * But in this case, we keep res->usage until end of a series of
-	 * uncharges. Then, it's ok to ignore memcg's refcnt.
-	 */
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continuously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-
-	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
-
-	if (nr_pages > 1)
-		goto direct_uncharge;
-
-	/*
-	 * In typical case, batch->memcg == mem. This means we can
-	 * merge a series of uncharges to an uncharge of res_counter.
-	 * If not, we uncharge res_counter ony by one.
-	 */
-	if (batch->memcg != memcg)
-		goto direct_uncharge;
-	/* remember freed charge and uncharge it later */
-	batch->nr_pages++;
-	if (uncharge_memsw)
-		batch->memsw_nr_pages++;
-	return;
-direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (unlikely(batch->memcg != memcg))
-		memcg_oom_recover(memcg);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
-			     bool end_migration)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!PageCgroupUsed(pc)))
-		return NULL;
-
-	lock_page_cgroup(pc);
-
-	memcg = pc->mem_cgroup;
-
-	if (!PageCgroupUsed(pc))
-		goto unlock_out;
-
-	anon = PageAnon(page);
-
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_ANON:
-		/*
-		 * Generally PageAnon tells if it's the anon statistics to be
-		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
-		 * used before page reached the stage of being marked PageAnon.
-		 */
-		anon = true;
-		/* fallthrough */
-	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		/* See mem_cgroup_prepare_migration() */
-		if (page_mapped(page))
-			goto unlock_out;
-		/*
-		 * Pages under migration may not be uncharged.  But
-		 * end_migration() /must/ be the one uncharging the
-		 * unused post-migration page and so it has to call
-		 * here with the migration bit still set.  See the
-		 * res_counter handling below.
-		 */
-		if (!end_migration && PageCgroupMigration(pc))
-			goto unlock_out;
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
-		if (!PageAnon(page)) {	/* Shared memory */
-			if (page->mapping && !page_is_file_cache(page))
-				goto unlock_out;
-		} else if (page_mapped(page)) /* Anon */
-				goto unlock_out;
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
-
-	ClearPageCgroupUsed(pc);
-	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
-	 */
-
-	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
-	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
-	/*
-	 * Migration does not charge the res_counter for the
-	 * replacement page, so leave it alone when phasing out the
-	 * page that is unused after the migration.
-	 */
-	if (!end_migration)
-		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
-
-	return memcg;
-
-unlock_out:
-	unlock_page_cgroup(pc);
-	return NULL;
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	/*
-	 * If the page is in swap cache, uncharge should be deferred
-	 * to the swap path, which also properly accounts swap usage
-	 * and handles memcg lifetime.
-	 *
-	 * Note that this check is not stable and reclaim may add the
-	 * page to swap cache at any time after this.  However, if the
-	 * page is not in swap cache by the time page->mapcount hits
-	 * 0, there won't be any page table references to the swap
-	 * slot, and reclaim will free it and not actually write the
-	 * page to disk.
-	 */
-	if (PageSwapCache(page))
-		return;
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping, page);
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
-}
-
 /*
  * Batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
  * In that cases, pages are freed continuously and we can expect pages
@@ -3763,6 +3591,9 @@ void mem_cgroup_uncharge_cache_page(stru
 
 void mem_cgroup_uncharge_start(void)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
 	current->memcg_batch.do_batch++;
 	/* We can do nest. */
 	if (current->memcg_batch.do_batch == 1) {
@@ -3770,21 +3601,18 @@ void mem_cgroup_uncharge_start(void)
 		current->memcg_batch.nr_pages = 0;
 		current->memcg_batch.memsw_nr_pages = 0;
 	}
+	local_irq_restore(flags);
 }
 
 void mem_cgroup_uncharge_end(void)
 {
 	struct memcg_batch_info *batch = &current->memcg_batch;
+	unsigned long flags;
 
-	if (!batch->do_batch)
-		return;
-
-	batch->do_batch--;
-	if (batch->do_batch) /* If stacked, do nothing. */
-		return;
-
-	if (!batch->memcg)
-		return;
+	local_irq_save(flags);
+	VM_BUG_ON(!batch->do_batch);
+	if (--batch->do_batch) /* If stacked, do nothing */
+		goto out;
 	/*
 	 * This "batch->memcg" is valid without any css_get/put etc...
 	 * bacause we hide charges behind us.
@@ -3796,61 +3624,16 @@ void mem_cgroup_uncharge_end(void)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
 	memcg_oom_recover(batch->memcg);
-	/* forget this pointer (for sanity check) */
-	batch->memcg = NULL;
-}
-
-#ifdef CONFIG_SWAP
-/*
- * called after __delete_from_swap_cache() and drop "page" account.
- * memcg information is recorded to swap_cgroup of "ent"
- */
-void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
-{
-	struct mem_cgroup *memcg;
-	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
-
-	if (!swapout) /* this was a swap cache but the swap is unused ! */
-		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
-
-	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
-
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
-	if (do_swap_account && swapout && memcg)
-		swap_cgroup_record(ent, mem_cgroup_id(memcg));
+out:
+	local_irq_restore(flags);
 }
-#endif
 
 #ifdef CONFIG_MEMCG_SWAP
-/*
- * called from swap_entry_free(). remove record in swap_cgroup and
- * uncharge "memsw" account.
- */
-void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
+					 bool charge)
 {
-	struct mem_cgroup *memcg;
-	unsigned short id;
-
-	if (!do_swap_account)
-		return;
-
-	id = swap_cgroup_record(ent, 0);
-	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
-	if (memcg) {
-		/*
-		 * We uncharge this because swap is freed.  This memcg can
-		 * be obsolete one. We avoid calling css_tryget_online().
-		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
-	}
-	rcu_read_unlock();
+	int val = (charge) ? 1 : -1;
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -3902,169 +3685,6 @@ static inline int mem_cgroup_move_swap_a
 }
 #endif
 
-/*
- * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
- * page belongs to.
- */
-void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-				  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-
-	*memcgp = NULL;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	if (PageTransHuge(page))
-		nr_pages <<= compound_order(page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		css_get(&memcg->css);
-		/*
-		 * At migrating an anonymous page, its mapcount goes down
-		 * to 0 and uncharge() will be called. But, even if it's fully
-		 * unmapped, migration may fail and this page has to be
-		 * charged again. We set MIGRATION flag here and delay uncharge
-		 * until end_migration() is called
-		 *
-		 * Corner Case Thinking
-		 * A)
-		 * When the old page was mapped as Anon and it's unmap-and-freed
-		 * while migration was ongoing.
-		 * If unmap finds the old page, uncharge() of it will be delayed
-		 * until end_migration(). If unmap finds a new page, it's
-		 * uncharged when it make mapcount to be 1->0. If unmap code
-		 * finds swap_migration_entry, the new page will not be mapped
-		 * and end_migration() will find it(mapcount==0).
-		 *
-		 * B)
-		 * When the old page was mapped but migraion fails, the kernel
-		 * remaps it. A charge for it is kept by MIGRATION flag even
-		 * if mapcount goes down to 0. We can do remap successfully
-		 * without charging it again.
-		 *
-		 * C)
-		 * The "old" page is under lock_page() until the end of
-		 * migration, so, the old page itself will not be swapped-out.
-		 * If the new page is swapped out before end_migraton, our
-		 * hook to usual swap-out path will catch the event.
-		 */
-		if (PageAnon(page))
-			SetPageCgroupMigration(pc);
-	}
-	unlock_page_cgroup(pc);
-	/*
-	 * If the page is not charged at this point,
-	 * we return here.
-	 */
-	if (!memcg)
-		return;
-
-	*memcgp = memcg;
-	/*
-	 * We charge new page before it's used/mapped. So, even if unlock_page()
-	 * is called before end_migration, we can catch all events on this new
-	 * page. In the case new page is migrated but not remapped, new page's
-	 * mapcount will be finally 0 and we call uncharge in end_migration().
-	 */
-	/*
-	 * The page is committed to the memcg, but it's not actually
-	 * charged to the res_counter since we plan on replacing the
-	 * old one and only one page is going to be left afterwards.
-	 */
-	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-	struct page *used, *unused;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (!memcg)
-		return;
-
-	if (!migration_ok) {
-		used = oldpage;
-		unused = newpage;
-	} else {
-		used = newpage;
-		unused = oldpage;
-	}
-	anon = PageAnon(used);
-	__mem_cgroup_uncharge_common(unused,
-				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
-				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
-				     true);
-	css_put(&memcg->css);
-	/*
-	 * We disallowed uncharge of pages under migration because mapcount
-	 * of the page goes down to zero, temporarly.
-	 * Clear the flag and check the page should be charged.
-	 */
-	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
-	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * If a page is a file cache, radix-tree replacement is very atomic
-	 * and we can skip this check. When it was an Anon page, its mapcount
-	 * goes down to 0. But because we added MIGRATION flage, it's not
-	 * uncharged yet. There are several case but page->mapcount check
-	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
-	 * check. (see prepare_charge() also)
-	 */
-	if (anon)
-		mem_cgroup_uncharge_page(used);
-}
-
-/*
- * At replace page cache, newpage is not under any memcg but it's on
- * LRU. So, this function doesn't touch res_counter but handles LRU
- * in correct way. Both pages are locked so we cannot race with uncharge.
- */
-void mem_cgroup_replace_page_cache(struct page *oldpage,
-				  struct page *newpage)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(oldpage);
-	/* fix accounting on old pages */
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * When called from shmem_replace_page(), in some cases the
-	 * oldpage has already been charged, and in some cases not.
-	 */
-	if (!memcg)
-		return;
-	/*
-	 * Even if newpage->mapping was NULL before starting replacement,
-	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
-	 * LRU while we overwrite pc->mem_cgroup.
-	 */
-	commit_charge(newpage, memcg, 1, false, true);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -4263,7 +3883,7 @@ unsigned long mem_cgroup_soft_limit_recl
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
+		spin_lock_irq(&mctz->lock);
 
 		/*
 		 * If we failed to reclaim anything from this memory cgroup
@@ -4303,7 +3923,7 @@ unsigned long mem_cgroup_soft_limit_recl
 		 */
 		/* If excess == 0, no tree ops */
 		__mem_cgroup_insert_exceeded(mz, mctz, excess);
-		spin_unlock(&mctz->lock);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&mz->memcg->css);
 		loop++;
 		/*
@@ -6265,9 +5885,9 @@ static enum mc_target_type get_mctgt_typ
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
-		 * Do only loose check w/o page_cgroup lock.
-		 * mem_cgroup_move_account() checks the pc is valid or not under
-		 * the lock.
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the pc is valid or
+		 * not under LRU exclusion.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
@@ -6729,6 +6349,67 @@ static void __init enable_swap_cgroup(vo
 }
 #endif
 
+#ifdef CONFIG_MEMCG_SWAP
+/**
+ * mem_cgroup_swapout - transfer a memsw charge to swap
+ * @page: page whose memsw charge to transfer
+ * @entry: swap entry to move the charge to
+ *
+ * Transfer the memsw charge of @page to @entry.
+ */
+void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+	struct page_cgroup *pc;
+	unsigned short oldid;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (!do_swap_account)
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Readahead page, never charged */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	VM_BUG_ON_PAGE(oldid, page);
+
+	pc->flags &= ~PCG_MEMSW;
+	css_get(&pc->mem_cgroup->css);
+	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
+}
+
+/**
+ * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * @entry: swap entry to uncharge
+ *
+ * Drop the memsw charge associated with @entry.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	if (!do_swap_account)
+		return;
+
+	id = swap_cgroup_record(entry, 0);
+	rcu_read_lock();
+	memcg = mem_cgroup_lookup(id);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(memcg, false);
+		css_put(&memcg->css);
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
@@ -6831,7 +6512,7 @@ void mem_cgroup_commit_charge(struct pag
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+	commit_charge(page, memcg, nr_pages, lrucare);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6873,6 +6554,139 @@ void mem_cgroup_cancel_charge(struct pag
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_uncharge - uncharge a page
+ * @page: page to uncharge
+ *
+ * Uncharge a page previously charged with mem_cgroup_try_charge() and
+ * mem_cgroup_commit_charge().
+ */
+void mem_cgroup_uncharge(struct page *page)
+{
+	struct memcg_batch_info *batch;
+	unsigned int nr_pages = 1;
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+	unsigned long pc_flags;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Every final put_page() ends up here */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point, we have fully
+	 * exclusive access to the page.
+	 */
+	memcg = pc->mem_cgroup;
+	pc_flags = pc->flags;
+	pc->flags = 0;
+
+	local_irq_save(flags);
+
+	if (nr_pages > 1)
+		goto direct;
+	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+		goto direct;
+	batch = &current->memcg_batch;
+	if (!batch->do_batch)
+		goto direct;
+	if (batch->memcg && batch->memcg != memcg)
+		goto direct;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	if (pc_flags & PCG_MEM)
+		batch->nr_pages++;
+	if (pc_flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	goto out;
+direct:
+	if (pc_flags & PCG_MEM)
+		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	if (pc_flags & PCG_MEMSW)
+		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+	memcg_oom_recover(memcg);
+out:
+	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
+	memcg_check_events(memcg, page);
+
+	local_irq_restore(flags);
+}
+
+/**
+ * mem_cgroup_migrate - migrate a charge to another page
+ * @oldpage: currently charged page
+ * @newpage: page to transfer the charge to
+ * @lrucare: both pages might be on the LRU already
+ *
+ * Migrate the charge from @oldpage to @newpage.
+ *
+ * Both pages must be locked, @newpage->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare)
+{
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
+	int isolated;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	/* Page cache replacement: new page already charged? */
+	pc = lookup_page_cgroup(newpage);
+	if (PageCgroupUsed(pc))
+		return;
+
+	/* Re-entrant migration: old page already uncharged? */
+	pc = lookup_page_cgroup(oldpage);
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
+	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
+
+	if (PageTransHuge(oldpage)) {
+		nr_pages <<= compound_order(oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
+	}
+
+	if (lrucare)
+		lock_page_lru(oldpage, &isolated);
+
+	pc->flags = 0;
+
+	if (lrucare)
+		unlock_page_lru(oldpage, isolated);
+
+	local_irq_disable();
+	mem_cgroup_charge_statistics(pc->mem_cgroup, oldpage, -nr_pages);
+	memcg_check_events(pc->mem_cgroup, oldpage);
+	local_irq_enable();
+
+	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff -puN mm/memory.c~mm-memcontrol-rewrite-uncharge-api mm/memory.c
--- a/mm/memory.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/memory.c
@@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_
 		details = NULL;
 
 	BUG_ON(addr >= end);
-	mem_cgroup_uncharge_start();
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
-	mem_cgroup_uncharge_end();
 }
 
 
diff -puN mm/migrate.c~mm-memcontrol-rewrite-uncharge-api mm/migrate.c
--- a/mm/migrate.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/migrate.c
@@ -778,11 +778,14 @@ static int move_to_new_page(struct page
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		newpage->mapping = NULL;
+		if (!PageAnon(newpage))
+			newpage->mapping = NULL;
 	} else {
+		mem_cgroup_migrate(page, newpage, false);
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
-		page->mapping = NULL;
+		if (!PageAnon(page))
+			page->mapping = NULL;
 	}
 
 	unlock_page(newpage);
@@ -795,7 +798,6 @@ static int __unmap_and_move(struct page
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
-	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
@@ -821,9 +823,6 @@ static int __unmap_and_move(struct page
 		lock_page(page);
 	}
 
-	/* charge against new page */
-	mem_cgroup_prepare_migration(page, newpage, &mem);
-
 	if (PageWriteback(page)) {
 		/*
 		 * Only in the case of a full synchronous migration is it
@@ -833,10 +832,10 @@ static int __unmap_and_move(struct page
 		 */
 		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
-			goto uncharge;
+			goto out_unlock;
 		}
 		if (!force)
-			goto uncharge;
+			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -872,7 +871,7 @@ static int __unmap_and_move(struct page
 			 */
 			remap_swapcache = 0;
 		} else {
-			goto uncharge;
+			goto out_unlock;
 		}
 	}
 
@@ -885,7 +884,7 @@ static int __unmap_and_move(struct page
 		 * the page migration right away (proteced by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto uncharge;
+		goto out_unlock;
 	}
 
 	/*
@@ -904,7 +903,7 @@ static int __unmap_and_move(struct page
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto uncharge;
+			goto out_unlock;
 		}
 		goto skip_unmap;
 	}
@@ -923,10 +922,7 @@ skip_unmap:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-uncharge:
-	mem_cgroup_end_migration(mem, page, newpage,
-				 (rc == MIGRATEPAGE_SUCCESS ||
-				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
+out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1786,7 +1782,6 @@ int migrate_misplaced_transhuge_page(str
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
@@ -1852,15 +1847,6 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	/*
-	 * Traditional migration needs to prepare the memcg charge
-	 * transaction early to prevent the old page from being
-	 * uncharged when installing migration entries.  Here we can
-	 * save the potential rollback and start the charge transfer
-	 * only when migration is already known to end successfully.
-	 */
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
@@ -1888,14 +1874,10 @@ fail_putback:
 		goto fail_putback;
 	}
 
+	mem_cgroup_migrate(page, new_page, false);
+
 	page_remove_rmap(page);
 
-	/*
-	 * Finish the charge transaction under the page table lock to
-	 * prevent split_huge_page() from dividing up the charge
-	 * before it's fully transferred to the new page.
-	 */
-	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
diff -puN mm/rmap.c~mm-memcontrol-rewrite-uncharge-api mm/rmap.c
--- a/mm/rmap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/rmap.c
@@ -1089,7 +1089,6 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		mem_cgroup_uncharge_page(page);
 		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
diff -puN mm/shmem.c~mm-memcontrol-rewrite-uncharge-api mm/shmem.c
--- a/mm/shmem.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/shmem.c
@@ -406,7 +406,6 @@ static void shmem_undo_range(struct inod
 			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -434,7 +433,6 @@ static void shmem_undo_range(struct inod
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -482,7 +480,6 @@ static void shmem_undo_range(struct inod
 			index = start;
 			continue;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -518,7 +515,6 @@ static void shmem_undo_range(struct inod
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 
@@ -818,7 +814,7 @@ static int shmem_writepage(struct page *
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -991,7 +987,7 @@ static int shmem_replace_page(struct pag
 		 */
 		oldpage = newpage;
 	} else {
-		mem_cgroup_replace_page_cache(oldpage, newpage);
+		mem_cgroup_migrate(oldpage, newpage, false);
 		lru_cache_add_anon(newpage);
 		*pagep = newpage;
 	}
diff -puN mm/swap.c~mm-memcontrol-rewrite-uncharge-api mm/swap.c
--- a/mm/swap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/swap.c
@@ -62,6 +62,7 @@ static void __page_cache_release(struct
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
@@ -915,6 +916,8 @@ void release_pages(struct page **pages,
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	mem_cgroup_uncharge_start();
+
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
@@ -946,6 +949,7 @@ void release_pages(struct page **pages,
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
+		mem_cgroup_uncharge(page);
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
@@ -955,6 +959,8 @@ void release_pages(struct page **pages,
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	mem_cgroup_uncharge_end();
+
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
 EXPORT_SYMBOL(release_pages);
diff -puN mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api mm/swap_state.c
--- a/mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/swap_state.c
@@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struc
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free(entry);
 			return 0;
 		}
 
@@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struc
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 		return 0;
 	}
 }
@@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free(entry);
 	page_cache_release(page);
 }
 
@@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_e
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff -puN mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api mm/swapfile.c
--- a/mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/swapfile.c
@@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	p = swap_info_get(entry);
 	if (p) {
-		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		if (page)
-			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
+		swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
diff -puN mm/truncate.c~mm-memcontrol-rewrite-uncharge-api mm/truncate.c
--- a/mm/truncate.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/truncate.c
@@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct a
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -369,7 +367,6 @@ void truncate_inode_pages_range(struct a
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -394,7 +391,6 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 	cleancache_invalidate_inode(mapping);
@@ -493,7 +489,6 @@ unsigned long invalidate_mapping_pages(s
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -522,7 +517,6 @@ unsigned long invalidate_mapping_pages(s
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -553,7 +547,6 @@ invalidate_complete_page2(struct address
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -602,7 +595,6 @@ int invalidate_inode_pages2_range(struct
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -655,7 +647,6 @@ int invalidate_inode_pages2_range(struct
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
diff -puN mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api mm/vmscan.c
--- a/mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/vmscan.c
@@ -571,9 +571,10 @@ static int __remove_mapping(struct addre
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -594,7 +595,6 @@ static int __remove_mapping(struct addre
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(st
 		 */
 		__clear_page_locked(page);
 free_it:
+		mem_cgroup_uncharge(page);
 		nr_reclaimed++;
 
 		/*
@@ -1126,12 +1127,13 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	mem_cgroup_uncharge_end();
 
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
-	mem_cgroup_uncharge_end();
+
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_congested += nr_congested;
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
@@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lr
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
@@ -1650,6 +1654,8 @@ static void move_active_pages_to_lru(str
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
diff -puN mm/zswap.c~mm-memcontrol-rewrite-uncharge-api mm/zswap.c
--- a/mm/zswap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/zswap.c
@@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
_


* Re: mm: memcontrol: rewrite uncharge API
  2014-08-06 21:00   ` mm: memcontrol: rewrite uncharge API Andrew Morton
@ 2014-08-06 21:02     ` Andrew Morton
  2014-08-07  7:38       ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2014-08-06 21:02 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, linux-mm

On Wed, 6 Aug 2014 14:00:55 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: mm: memcontrol: rewrite uncharge API
> 

Nope, sorry, that was missing
mm-memcontrol-rewrite-uncharge-api-fix-clear-page-mapping-in-migration.patch.

This time:

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: rewrite uncharge API

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context: unmap for anonymous pages, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages.  However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
no sense, and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.
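
For illustration, here is a minimal sketch of the calling convention this
creates.  The helper name and the order-0, off-LRU assumptions are made up
for the example; release_pages() in the diff below is the real consumer:

  #include <linux/mm.h>
  #include <linux/memcontrol.h>

  /* Sketch only: batch-free order-0 pages that are already off the LRU */
  static void example_free_pages(struct page **pages, int nr)
  {
          int i;

          mem_cgroup_uncharge_start();    /* coalesce res_counter updates */
          for (i = 0; i < nr; i++) {
                  if (!put_page_testzero(pages[i]))
                          continue;       /* someone else still holds a reference */
                  /*
                   * Last reference gone: the page is fully exclusive, so
                   * uncharging needs no page_cgroup locking at all.
                   */
                  mem_cgroup_uncharge(pages[i]);
                  free_hot_cold_page(pages[i], false);
          }
          mem_cgroup_uncharge_end();
  }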

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped.  Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge.  The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.
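
To make the new contract concrete, a condensed sketch follows; the function
name is invented, the real call sites are the move_to_new_page() and
replace_page_cache_page() hunks below:

  #include <linux/mm.h>
  #include <linux/memcontrol.h>

  /* Sketch only: hand the charge over once migration has succeeded */
  static void example_transfer_charge(struct page *oldpage,
                                      struct page *newpage, bool lrucare)
  {
          /* contract: both pages locked, newpage->mapping already set up */
          VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
          VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);

          /*
           * lrucare is false for regular migration, where both pages are
           * isolated, and true for page cache replacement, where either
           * page may still be on the LRU.
           */
          mem_cgroup_migrate(oldpage, newpage, lrucare);
  }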

mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
gets rid of mem_cgroup_replace_page_cache().

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.
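
The handover in reclaim then boils down to a few lines.  This sketch
condenses the __remove_mapping() hunk of this patch; the wrapper name is
made up, and the caller is assumed to hold mapping->tree_lock with the
page's refcount already frozen to zero, as __remove_mapping() does:

  #include <linux/mm.h>
  #include <linux/swap.h>
  #include <linux/memcontrol.h>

  static void example_swapout(struct address_space *mapping,
                              struct page *page)
  {
          swp_entry_t swap = { .val = page_private(page) };

          mem_cgroup_swapout(page, swap); /* memsw charge -> swap_cgroup */
          __delete_from_swap_cache(page);
          spin_unlock_irq(&mapping->tree_lock);
          swapcache_free(swap);           /* new single-argument form */
  }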

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock. 
Uncharging happens under full exclusion with no outstanding references. 
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration.  Remove the very costly page_cgroup
lock and set pc->flags non-atomically.
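
For illustration, the uncharge fast path then reads roughly as below.  This
is a simplified rendering of mem_cgroup_uncharge() from the diff, valid
only inside mm/memcontrol.c where struct mem_cgroup is visible; batching,
THP handling, statistics and OOM recovery are left out:

  #include <linux/res_counter.h>
  #include <linux/page_cgroup.h>
  #include <linux/memcontrol.h>

  static void example_drop_charge(struct page *page)
  {
          struct page_cgroup *pc = lookup_page_cgroup(page);
          struct mem_cgroup *memcg;
          unsigned long flags;

          if (!PageCgroupUsed(pc))        /* every final put_page() ends up here */
                  return;

          /* full exclusion: plain loads and stores, no bit spinlock */
          memcg = pc->mem_cgroup;
          flags = pc->flags;
          pc->flags = 0;

          if (flags & PCG_MEM)
                  res_counter_uncharge(&memcg->res, PAGE_SIZE);
          if (flags & PCG_MEMSW)
                  res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
  }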

[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/cgroups/memcg_test.txt |  126 ---
 include/linux/memcontrol.h           |   49 -
 include/linux/page_cgroup.h          |   43 -
 include/linux/swap.h                 |   12 
 mm/filemap.c                         |    4 
 mm/memcontrol.c                      |  832 +++++++++----------------
 mm/memory.c                          |    2 
 mm/migrate.c                         |   38 -
 mm/rmap.c                            |    1 
 mm/shmem.c                           |    8 
 mm/swap.c                            |    6 
 mm/swap_state.c                      |    8 
 mm/swapfile.c                        |    7 
 mm/truncate.c                        |    9 
 mm/vmscan.c                          |   12 
 mm/zswap.c                           |    2 
 16 files changed, 390 insertions(+), 769 deletions(-)

diff -puN Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api Documentation/cgroups/memcg_test.txt
--- a/Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api
+++ a/Documentation/cgroups/memcg_test.txt
@@ -29,28 +29,13 @@ Please note that implementation details
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 
-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
 
 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFI
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
- 	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().
 
 	4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFI
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	    memory.usage/memsw.usage changes to this page/swp_entry will be
-	 Case          (A)      (B)       (C)     (D)
-         Event
-       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
-          ===========================================
-          (3)        +1/+1    +1/+1     +1/+1   +1/+1
-          (4)          -       0/ 0       -     -1/ 0
-          (5)         0/-1     0/ 0     -1/-1    0/ 0
-          (6)          -       0/-1       -      0/-1
-          ===========================================
-       Result         1/ 1     1/ 1      1/ 1    1/ 1
-
-       In any cases, charges to this page should be 1/ 1.
-
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 
@@ -136,28 +74,20 @@ Under below explanation, we assume CONFI
 	    swp_entry's refcnt -= 1.
 
 
-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.
 
 5. Page Cache
    	Page Cache is charged at
 	- add_to_page_cache_locked().
 
-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().
 
 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.
 
@@ -170,56 +100,10 @@ Under below explanation, we assume CONFI
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.
 
 7. Page Migration
-   	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-          try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
 
-	  But zap_pte() (by exit or munmap) can be called while migration,
-	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+	mem_cgroup_migrate()
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
diff -puN include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api
+++ a/include/linux/memcontrol.h
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct pag
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);
 
-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_str
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 
-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(st
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_cha
 {
 }
 
-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
 
-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }
 
@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-		struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff -puN include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api include/linux/page_cgroup.h
--- a/include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api
+++ a/include/linux/page_cgroup.h
@@ -3,9 +3,9 @@
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED = 0x01,	/* This page is charged to a memcg */
+	PCG_MEM = 0x02,		/* This page holds a memory charge */
+	PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };
 
@@ -44,42 +44,9 @@ static inline void __init page_cgroup_in
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define TESTCLEARPCGFLAG(uname, lname)			\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return !!(pc->flags & PCG_USED);
 }
 
 #else /* CONFIG_MEMCG */
diff -puN include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api include/linux/swap.h
--- a/include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api
+++ a/include/linux/swap.h
@@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
@@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }
 
diff -puN mm/filemap.c~mm-memcontrol-rewrite-uncharge-api mm/filemap.c
--- a/mm/filemap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/filemap.c
@@ -234,7 +234,6 @@ void delete_from_page_cache(struct page
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
 		freepage(page);
@@ -490,8 +489,7 @@ int replace_page_cache_page(struct page
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
diff -puN mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api mm/memcontrol.c
--- a/mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/memcontrol.c
@@ -754,9 +754,11 @@ static void __mem_cgroup_remove_exceeded
 static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 				       struct mem_cgroup_tree_per_zone *mctz)
 {
-	spin_lock(&mctz->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mctz->lock, flags);
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
 
@@ -779,7 +781,9 @@ static void mem_cgroup_update_tree(struc
 		 * mem is over its softlimit.
 		 */
 		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
+			unsigned long flags;
+
+			spin_lock_irqsave(&mctz->lock, flags);
 			/* if on-tree, remove it */
 			if (mz->on_tree)
 				__mem_cgroup_remove_exceeded(mz, mctz);
@@ -788,7 +792,7 @@ static void mem_cgroup_update_tree(struc
 			 * If excess is 0, no tree ops.
 			 */
 			__mem_cgroup_insert_exceeded(mz, mctz, excess);
-			spin_unlock(&mctz->lock);
+			spin_unlock_irqrestore(&mctz->lock, flags);
 		}
 	}
 }
@@ -839,9 +843,9 @@ mem_cgroup_largest_soft_limit_node(struc
 {
 	struct mem_cgroup_per_zone *mz;
 
-	spin_lock(&mctz->lock);
+	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irq(&mctz->lock);
 	return mz;
 }
 
@@ -882,13 +886,6 @@ static long mem_cgroup_read_stat(struct
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
-{
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
-}
-
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
@@ -909,13 +906,13 @@ static unsigned long mem_cgroup_read_eve
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 bool anon, int nr_pages)
+					 int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
-	if (anon)
+	if (PageAnon(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
@@ -1013,7 +1010,6 @@ static bool mem_cgroup_event_ratelimit(s
  */
 static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 {
-	preempt_disable();
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -1026,8 +1022,6 @@ static void memcg_check_events(struct me
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
 #endif
-		preempt_enable();
-
 		mem_cgroup_threshold(memcg);
 		if (unlikely(do_softlimit))
 			mem_cgroup_update_tree(memcg, page);
@@ -1035,8 +1029,7 @@ static void memcg_check_events(struct me
 		if (unlikely(do_numainfo))
 			atomic_inc(&memcg->numainfo_events);
 #endif
-	} else
-		preempt_enable();
+	}
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
@@ -1347,20 +1340,6 @@ out:
 	return lruvec;
 }
 
-/*
- * Following LRU functions are allowed to be used without PCG_LOCK.
- * Operations are called by routine of global LRU independently from memcg.
- * What we have to take care of here is validness of pc->mem_cgroup.
- *
- * Changes to pc->mem_cgroup happens when
- * 1. charge
- * 2. moving account
- * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
- * It is added to LRU before charge.
- * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
- * When moving account, the page is not on LRU. It's isolated.
- */
-
 /**
  * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
@@ -2261,22 +2240,14 @@ cleanup:
  *
  * Notes: Race condition
  *
- * We usually use lock_page_cgroup() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
- *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
- *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
- *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check memcg->moving_account and detect there are possibility
- * of race or not. If there is, we take a lock.
+ * Charging occurs during page instantiation, while the page is
+ * unmapped and locked in page migration, or while the page table is
+ * locked in THP migration.  No race is possible.
+ *
+ * Uncharge happens to pages with zero references, no race possible.
+ *
+ * Charge moving between groups is protected by checking mm->moving
+ * account and taking the move_lock in the slowpath.
  */
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
@@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_loo
 	return mem_cgroup_from_id(id);
 }
 
+/*
+ * try_get_mem_cgroup_from_page - look up page's memcg association
+ * @page: the page
+ *
+ * Look up, get a css reference, and return the memcg that owns @page.
+ *
+ * The page must be locked to prevent racing with swap-in and page
+ * cache charges.  If coming from an unlocked page table, the caller
+ * must ensure the page is on the LRU or this can race with charging.
+ */
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		memcg = pc->mem_cgroup;
 		if (memcg && !css_tryget_online(&memcg->css))
@@ -2713,19 +2693,46 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 			memcg = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
 	return memcg;
 }
 
+static void lock_page_lru(struct page *page, int *isolated)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page)) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		ClearPageLRU(page);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		*isolated = 1;
+	} else
+		*isolated = 0;
+}
+
+static void unlock_page_lru(struct page *page, int isolated)
+{
+	struct zone *zone = page_zone(page);
+
+	if (isolated) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+		SetPageLRU(page);
+		add_page_to_lru_list(page, lruvec, page_lru(page));
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare)
+			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct zone *uninitialized_var(zone);
-	struct lruvec *lruvec;
-	bool was_on_lru = false;
+	int isolated;
 
-	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
 	 * we don't need page_cgroup_lock about tail pages, becase they are not
@@ -2736,39 +2743,38 @@ static void commit_charge(struct page *p
 	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
 	 * may already be on some other mem_cgroup's LRU.  Take care of it.
 	 */
-	if (lrucare) {
-		zone = page_zone(page);
-		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page)) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, page_lru(page));
-			was_on_lru = true;
-		}
-	}
+	if (lrucare)
+		lock_page_lru(page, &isolated);
 
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point:
+	 *
+	 * - the page is uncharged
+	 *
+	 * - the page is off-LRU
+	 *
+	 * - an anonymous fault has exclusive page access, except for
+	 *   a locked page table
+	 *
+	 * - a page cache insertion, a swapin fault, or a migration
+	 *   have the page locked
+	 */
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
+	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
 
-	if (lrucare) {
-		if (was_on_lru) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			VM_BUG_ON_PAGE(PageLRU(page), page);
-			SetPageLRU(page);
-			add_page_to_lru_list(page, lruvec, page_lru(page));
-		}
-		spin_unlock_irq(&zone->lru_lock);
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
+	if (lrucare)
+		unlock_page_lru(page, isolated);
 
+	local_irq_disable();
+	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
 	 * if they exceeds softlimit.
 	 */
 	memcg_check_events(memcg, page);
+	local_irq_enable();
 }
 
 static DEFINE_MUTEX(set_limit_mutex);
@@ -3395,7 +3401,6 @@ static inline void memcg_unregister_all_
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
 /*
  * Because tail pages are not marked as "used", set it. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -3416,7 +3421,7 @@ void mem_cgroup_split_huge_fixup(struct
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
+		pc->flags = head_pc->flags;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
@@ -3446,7 +3451,6 @@ static int mem_cgroup_move_account(struc
 {
 	unsigned long flags;
 	int ret;
-	bool anon = PageAnon(page);
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -3460,15 +3464,21 @@ static int mem_cgroup_move_account(struc
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
-	lock_page_cgroup(pc);
+	/*
+	 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup
+	 * of its source page while we change it: page migration takes
+	 * both pages off the LRU, but page cache replacement doesn't.
+	 */
+	if (!trylock_page(page))
+		goto out;
 
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto unlock;
+		goto out_unlock;
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
+	if (!PageAnon(page) && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3482,20 +3492,25 @@ static int mem_cgroup_move_account(struc
 			       nr_pages);
 	}
 
-	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
+	/*
+	 * It is safe to change pc->mem_cgroup here because the page
+	 * is referenced, charged, and isolated - we can't race with
+	 * uncharging, charging, migration, or LRU putback.
+	 */
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
-unlock:
-	unlock_page_cgroup(pc);
-	/*
-	 * check events
-	 */
+
+	local_irq_disable();
+	mem_cgroup_charge_statistics(to, page, nr_pages);
 	memcg_check_events(to, page);
+	mem_cgroup_charge_statistics(from, page, -nr_pages);
 	memcg_check_events(from, page);
+	local_irq_enable();
+out_unlock:
+	unlock_page(page);
 out:
 	return ret;
 }
@@ -3566,193 +3581,6 @@ out:
 	return ret;
 }
 
-static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
-				   unsigned int nr_pages,
-				   const enum charge_type ctype)
-{
-	struct memcg_batch_info *batch = NULL;
-	bool uncharge_memsw = true;
-
-	/* If swapout, usage of swap doesn't decrease */
-	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-		uncharge_memsw = false;
-
-	batch = &current->memcg_batch;
-	/*
-	 * In usual, we do css_get() when we remember memcg pointer.
-	 * But in this case, we keep res->usage until end of a series of
-	 * uncharges. Then, it's ok to ignore memcg's refcnt.
-	 */
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continuously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-
-	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
-
-	if (nr_pages > 1)
-		goto direct_uncharge;
-
-	/*
-	 * In typical case, batch->memcg == mem. This means we can
-	 * merge a series of uncharges to an uncharge of res_counter.
-	 * If not, we uncharge res_counter ony by one.
-	 */
-	if (batch->memcg != memcg)
-		goto direct_uncharge;
-	/* remember freed charge and uncharge it later */
-	batch->nr_pages++;
-	if (uncharge_memsw)
-		batch->memsw_nr_pages++;
-	return;
-direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (unlikely(batch->memcg != memcg))
-		memcg_oom_recover(memcg);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
-			     bool end_migration)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!PageCgroupUsed(pc)))
-		return NULL;
-
-	lock_page_cgroup(pc);
-
-	memcg = pc->mem_cgroup;
-
-	if (!PageCgroupUsed(pc))
-		goto unlock_out;
-
-	anon = PageAnon(page);
-
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_ANON:
-		/*
-		 * Generally PageAnon tells if it's the anon statistics to be
-		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
-		 * used before page reached the stage of being marked PageAnon.
-		 */
-		anon = true;
-		/* fallthrough */
-	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		/* See mem_cgroup_prepare_migration() */
-		if (page_mapped(page))
-			goto unlock_out;
-		/*
-		 * Pages under migration may not be uncharged.  But
-		 * end_migration() /must/ be the one uncharging the
-		 * unused post-migration page and so it has to call
-		 * here with the migration bit still set.  See the
-		 * res_counter handling below.
-		 */
-		if (!end_migration && PageCgroupMigration(pc))
-			goto unlock_out;
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
-		if (!PageAnon(page)) {	/* Shared memory */
-			if (page->mapping && !page_is_file_cache(page))
-				goto unlock_out;
-		} else if (page_mapped(page)) /* Anon */
-				goto unlock_out;
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
-
-	ClearPageCgroupUsed(pc);
-	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
-	 */
-
-	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
-	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
-	/*
-	 * Migration does not charge the res_counter for the
-	 * replacement page, so leave it alone when phasing out the
-	 * page that is unused after the migration.
-	 */
-	if (!end_migration)
-		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
-
-	return memcg;
-
-unlock_out:
-	unlock_page_cgroup(pc);
-	return NULL;
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	/*
-	 * If the page is in swap cache, uncharge should be deferred
-	 * to the swap path, which also properly accounts swap usage
-	 * and handles memcg lifetime.
-	 *
-	 * Note that this check is not stable and reclaim may add the
-	 * page to swap cache at any time after this.  However, if the
-	 * page is not in swap cache by the time page->mapcount hits
-	 * 0, there won't be any page table references to the swap
-	 * slot, and reclaim will free it and not actually write the
-	 * page to disk.
-	 */
-	if (PageSwapCache(page))
-		return;
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping, page);
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
-}
-
 /*
  * Batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
  * In that cases, pages are freed continuously and we can expect pages
@@ -3763,6 +3591,9 @@ void mem_cgroup_uncharge_cache_page(stru
 
 void mem_cgroup_uncharge_start(void)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
 	current->memcg_batch.do_batch++;
 	/* We can do nest. */
 	if (current->memcg_batch.do_batch == 1) {
@@ -3770,21 +3601,18 @@ void mem_cgroup_uncharge_start(void)
 		current->memcg_batch.nr_pages = 0;
 		current->memcg_batch.memsw_nr_pages = 0;
 	}
+	local_irq_restore(flags);
 }
 
 void mem_cgroup_uncharge_end(void)
 {
 	struct memcg_batch_info *batch = &current->memcg_batch;
+	unsigned long flags;
 
-	if (!batch->do_batch)
-		return;
-
-	batch->do_batch--;
-	if (batch->do_batch) /* If stacked, do nothing. */
-		return;
-
-	if (!batch->memcg)
-		return;
+	local_irq_save(flags);
+	VM_BUG_ON(!batch->do_batch);
+	if (--batch->do_batch) /* If stacked, do nothing */
+		goto out;
 	/*
 	 * This "batch->memcg" is valid without any css_get/put etc...
 	 * bacause we hide charges behind us.
@@ -3796,61 +3624,16 @@ void mem_cgroup_uncharge_end(void)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
 	memcg_oom_recover(batch->memcg);
-	/* forget this pointer (for sanity check) */
-	batch->memcg = NULL;
-}
-
-#ifdef CONFIG_SWAP
-/*
- * called after __delete_from_swap_cache() and drop "page" account.
- * memcg information is recorded to swap_cgroup of "ent"
- */
-void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
-{
-	struct mem_cgroup *memcg;
-	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
-
-	if (!swapout) /* this was a swap cache but the swap is unused ! */
-		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
-
-	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
-
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
-	if (do_swap_account && swapout && memcg)
-		swap_cgroup_record(ent, mem_cgroup_id(memcg));
+out:
+	local_irq_restore(flags);
 }
-#endif
 
 #ifdef CONFIG_MEMCG_SWAP
-/*
- * called from swap_entry_free(). remove record in swap_cgroup and
- * uncharge "memsw" account.
- */
-void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
+					 bool charge)
 {
-	struct mem_cgroup *memcg;
-	unsigned short id;
-
-	if (!do_swap_account)
-		return;
-
-	id = swap_cgroup_record(ent, 0);
-	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
-	if (memcg) {
-		/*
-		 * We uncharge this because swap is freed.  This memcg can
-		 * be obsolete one. We avoid calling css_tryget_online().
-		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
-	}
-	rcu_read_unlock();
+	int val = (charge) ? 1 : -1;
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -3902,169 +3685,6 @@ static inline int mem_cgroup_move_swap_a
 }
 #endif
 
-/*
- * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
- * page belongs to.
- */
-void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-				  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-
-	*memcgp = NULL;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	if (PageTransHuge(page))
-		nr_pages <<= compound_order(page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		css_get(&memcg->css);
-		/*
-		 * At migrating an anonymous page, its mapcount goes down
-		 * to 0 and uncharge() will be called. But, even if it's fully
-		 * unmapped, migration may fail and this page has to be
-		 * charged again. We set MIGRATION flag here and delay uncharge
-		 * until end_migration() is called
-		 *
-		 * Corner Case Thinking
-		 * A)
-		 * When the old page was mapped as Anon and it's unmap-and-freed
-		 * while migration was ongoing.
-		 * If unmap finds the old page, uncharge() of it will be delayed
-		 * until end_migration(). If unmap finds a new page, it's
-		 * uncharged when it make mapcount to be 1->0. If unmap code
-		 * finds swap_migration_entry, the new page will not be mapped
-		 * and end_migration() will find it(mapcount==0).
-		 *
-		 * B)
-		 * When the old page was mapped but migraion fails, the kernel
-		 * remaps it. A charge for it is kept by MIGRATION flag even
-		 * if mapcount goes down to 0. We can do remap successfully
-		 * without charging it again.
-		 *
-		 * C)
-		 * The "old" page is under lock_page() until the end of
-		 * migration, so, the old page itself will not be swapped-out.
-		 * If the new page is swapped out before end_migraton, our
-		 * hook to usual swap-out path will catch the event.
-		 */
-		if (PageAnon(page))
-			SetPageCgroupMigration(pc);
-	}
-	unlock_page_cgroup(pc);
-	/*
-	 * If the page is not charged at this point,
-	 * we return here.
-	 */
-	if (!memcg)
-		return;
-
-	*memcgp = memcg;
-	/*
-	 * We charge new page before it's used/mapped. So, even if unlock_page()
-	 * is called before end_migration, we can catch all events on this new
-	 * page. In the case new page is migrated but not remapped, new page's
-	 * mapcount will be finally 0 and we call uncharge in end_migration().
-	 */
-	/*
-	 * The page is committed to the memcg, but it's not actually
-	 * charged to the res_counter since we plan on replacing the
-	 * old one and only one page is going to be left afterwards.
-	 */
-	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-	struct page *used, *unused;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (!memcg)
-		return;
-
-	if (!migration_ok) {
-		used = oldpage;
-		unused = newpage;
-	} else {
-		used = newpage;
-		unused = oldpage;
-	}
-	anon = PageAnon(used);
-	__mem_cgroup_uncharge_common(unused,
-				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
-				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
-				     true);
-	css_put(&memcg->css);
-	/*
-	 * We disallowed uncharge of pages under migration because mapcount
-	 * of the page goes down to zero, temporarly.
-	 * Clear the flag and check the page should be charged.
-	 */
-	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
-	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * If a page is a file cache, radix-tree replacement is very atomic
-	 * and we can skip this check. When it was an Anon page, its mapcount
-	 * goes down to 0. But because we added MIGRATION flage, it's not
-	 * uncharged yet. There are several case but page->mapcount check
-	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
-	 * check. (see prepare_charge() also)
-	 */
-	if (anon)
-		mem_cgroup_uncharge_page(used);
-}
-
-/*
- * At replace page cache, newpage is not under any memcg but it's on
- * LRU. So, this function doesn't touch res_counter but handles LRU
- * in correct way. Both pages are locked so we cannot race with uncharge.
- */
-void mem_cgroup_replace_page_cache(struct page *oldpage,
-				  struct page *newpage)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(oldpage);
-	/* fix accounting on old pages */
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * When called from shmem_replace_page(), in some cases the
-	 * oldpage has already been charged, and in some cases not.
-	 */
-	if (!memcg)
-		return;
-	/*
-	 * Even if newpage->mapping was NULL before starting replacement,
-	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
-	 * LRU while we overwrite pc->mem_cgroup.
-	 */
-	commit_charge(newpage, memcg, 1, false, true);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -4263,7 +3883,7 @@ unsigned long mem_cgroup_soft_limit_recl
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
+		spin_lock_irq(&mctz->lock);
 
 		/*
 		 * If we failed to reclaim anything from this memory cgroup
@@ -4303,7 +3923,7 @@ unsigned long mem_cgroup_soft_limit_recl
 		 */
 		/* If excess == 0, no tree ops */
 		__mem_cgroup_insert_exceeded(mz, mctz, excess);
-		spin_unlock(&mctz->lock);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&mz->memcg->css);
 		loop++;
 		/*
@@ -6265,9 +5885,9 @@ static enum mc_target_type get_mctgt_typ
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
-		 * Do only loose check w/o page_cgroup lock.
-		 * mem_cgroup_move_account() checks the pc is valid or not under
-		 * the lock.
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the pc is valid or
+		 * not under LRU exclusion.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
@@ -6729,6 +6349,67 @@ static void __init enable_swap_cgroup(vo
 }
 #endif
 
+#ifdef CONFIG_MEMCG_SWAP
+/**
+ * mem_cgroup_swapout - transfer a memsw charge to swap
+ * @page: page whose memsw charge to transfer
+ * @entry: swap entry to move the charge to
+ *
+ * Transfer the memsw charge of @page to @entry.
+ */
+void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+	struct page_cgroup *pc;
+	unsigned short oldid;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (!do_swap_account)
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Readahead page, never charged */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	VM_BUG_ON_PAGE(oldid, page);
+
+	pc->flags &= ~PCG_MEMSW;
+	css_get(&pc->mem_cgroup->css);
+	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
+}
+
+/**
+ * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * @entry: swap entry to uncharge
+ *
+ * Drop the memsw charge associated with @entry.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	if (!do_swap_account)
+		return;
+
+	id = swap_cgroup_record(entry, 0);
+	rcu_read_lock();
+	memcg = mem_cgroup_lookup(id);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(memcg, false);
+		css_put(&memcg->css);
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
@@ -6831,7 +6512,7 @@ void mem_cgroup_commit_charge(struct pag
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+	commit_charge(page, memcg, nr_pages, lrucare);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6873,6 +6554,139 @@ void mem_cgroup_cancel_charge(struct pag
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_uncharge - uncharge a page
+ * @page: page to uncharge
+ *
+ * Uncharge a page previously charged with mem_cgroup_try_charge() and
+ * mem_cgroup_commit_charge().
+ */
+void mem_cgroup_uncharge(struct page *page)
+{
+	struct memcg_batch_info *batch;
+	unsigned int nr_pages = 1;
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+	unsigned long pc_flags;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Every final put_page() ends up here */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point, we have fully
+	 * exclusive access to the page.
+	 */
+	memcg = pc->mem_cgroup;
+	pc_flags = pc->flags;
+	pc->flags = 0;
+
+	local_irq_save(flags);
+
+	if (nr_pages > 1)
+		goto direct;
+	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+		goto direct;
+	batch = &current->memcg_batch;
+	if (!batch->do_batch)
+		goto direct;
+	if (batch->memcg && batch->memcg != memcg)
+		goto direct;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	if (pc_flags & PCG_MEM)
+		batch->nr_pages++;
+	if (pc_flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	goto out;
+direct:
+	if (pc_flags & PCG_MEM)
+		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	if (pc_flags & PCG_MEMSW)
+		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+	memcg_oom_recover(memcg);
+out:
+	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
+	memcg_check_events(memcg, page);
+
+	local_irq_restore(flags);
+}
+
+/**
+ * mem_cgroup_migrate - migrate a charge to another page
+ * @oldpage: currently charged page
+ * @newpage: page to transfer the charge to
+ * @lrucare: both pages might be on the LRU already
+ *
+ * Migrate the charge from @oldpage to @newpage.
+ *
+ * Both pages must be locked, @newpage->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare)
+{
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
+	int isolated;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	/* Page cache replacement: new page already charged? */
+	pc = lookup_page_cgroup(newpage);
+	if (PageCgroupUsed(pc))
+		return;
+
+	/* Re-entrant migration: old page already uncharged? */
+	pc = lookup_page_cgroup(oldpage);
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
+	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
+
+	if (PageTransHuge(oldpage)) {
+		nr_pages <<= compound_order(oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
+	}
+
+	if (lrucare)
+		lock_page_lru(oldpage, &isolated);
+
+	pc->flags = 0;
+
+	if (lrucare)
+		unlock_page_lru(oldpage, isolated);
+
+	local_irq_disable();
+	mem_cgroup_charge_statistics(pc->mem_cgroup, oldpage, -nr_pages);
+	memcg_check_events(pc->mem_cgroup, oldpage);
+	local_irq_enable();
+
+	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff -puN mm/memory.c~mm-memcontrol-rewrite-uncharge-api mm/memory.c
--- a/mm/memory.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/memory.c
@@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_
 		details = NULL;
 
 	BUG_ON(addr >= end);
-	mem_cgroup_uncharge_start();
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
-	mem_cgroup_uncharge_end();
 }
 
 
diff -puN mm/migrate.c~mm-memcontrol-rewrite-uncharge-api mm/migrate.c
--- a/mm/migrate.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/migrate.c
@@ -780,6 +780,7 @@ static int move_to_new_page(struct page
 	if (rc != MIGRATEPAGE_SUCCESS) {
 		newpage->mapping = NULL;
 	} else {
+		mem_cgroup_migrate(page, newpage, false);
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
 		page->mapping = NULL;
@@ -795,7 +796,6 @@ static int __unmap_and_move(struct page
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
-	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
@@ -821,9 +821,6 @@ static int __unmap_and_move(struct page
 		lock_page(page);
 	}
 
-	/* charge against new page */
-	mem_cgroup_prepare_migration(page, newpage, &mem);
-
 	if (PageWriteback(page)) {
 		/*
 		 * Only in the case of a full synchronous migration is it
@@ -833,10 +830,10 @@ static int __unmap_and_move(struct page
 		 */
 		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
-			goto uncharge;
+			goto out_unlock;
 		}
 		if (!force)
-			goto uncharge;
+			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -872,7 +869,7 @@ static int __unmap_and_move(struct page
 			 */
 			remap_swapcache = 0;
 		} else {
-			goto uncharge;
+			goto out_unlock;
 		}
 	}
 
@@ -885,7 +882,7 @@ static int __unmap_and_move(struct page
 		 * the page migration right away (proteced by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto uncharge;
+		goto out_unlock;
 	}
 
 	/*
@@ -904,7 +901,7 @@ static int __unmap_and_move(struct page
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto uncharge;
+			goto out_unlock;
 		}
 		goto skip_unmap;
 	}
@@ -923,10 +920,7 @@ skip_unmap:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-uncharge:
-	mem_cgroup_end_migration(mem, page, newpage,
-				 (rc == MIGRATEPAGE_SUCCESS ||
-				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
+out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1786,7 +1780,6 @@ int migrate_misplaced_transhuge_page(str
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
@@ -1852,15 +1845,6 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	/*
-	 * Traditional migration needs to prepare the memcg charge
-	 * transaction early to prevent the old page from being
-	 * uncharged when installing migration entries.  Here we can
-	 * save the potential rollback and start the charge transfer
-	 * only when migration is already known to end successfully.
-	 */
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
@@ -1888,14 +1872,10 @@ fail_putback:
 		goto fail_putback;
 	}
 
+	mem_cgroup_migrate(page, new_page, false);
+
 	page_remove_rmap(page);
 
-	/*
-	 * Finish the charge transaction under the page table lock to
-	 * prevent split_huge_page() from dividing up the charge
-	 * before it's fully transferred to the new page.
-	 */
-	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
diff -puN mm/rmap.c~mm-memcontrol-rewrite-uncharge-api mm/rmap.c
--- a/mm/rmap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/rmap.c
@@ -1089,7 +1089,6 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		mem_cgroup_uncharge_page(page);
 		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
diff -puN mm/shmem.c~mm-memcontrol-rewrite-uncharge-api mm/shmem.c
--- a/mm/shmem.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/shmem.c
@@ -406,7 +406,6 @@ static void shmem_undo_range(struct inod
 			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -434,7 +433,6 @@ static void shmem_undo_range(struct inod
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -482,7 +480,6 @@ static void shmem_undo_range(struct inod
 			index = start;
 			continue;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -518,7 +515,6 @@ static void shmem_undo_range(struct inod
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 
@@ -818,7 +814,7 @@ static int shmem_writepage(struct page *
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -991,7 +987,7 @@ static int shmem_replace_page(struct pag
 		 */
 		oldpage = newpage;
 	} else {
-		mem_cgroup_replace_page_cache(oldpage, newpage);
+		mem_cgroup_migrate(oldpage, newpage, false);
 		lru_cache_add_anon(newpage);
 		*pagep = newpage;
 	}
diff -puN mm/swap.c~mm-memcontrol-rewrite-uncharge-api mm/swap.c
--- a/mm/swap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/swap.c
@@ -62,6 +62,7 @@ static void __page_cache_release(struct
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
@@ -915,6 +916,8 @@ void release_pages(struct page **pages,
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
+	mem_cgroup_uncharge_start();
+
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
@@ -946,6 +949,7 @@ void release_pages(struct page **pages,
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
+		mem_cgroup_uncharge(page);
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
@@ -955,6 +959,8 @@ void release_pages(struct page **pages,
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	mem_cgroup_uncharge_end();
+
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
 EXPORT_SYMBOL(release_pages);
diff -puN mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api mm/swap_state.c
--- a/mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/swap_state.c
@@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struc
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free(entry);
 			return 0;
 		}
 
@@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struc
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 		return 0;
 	}
 }
@@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free(entry);
 	page_cache_release(page);
 }
 
@@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_e
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff -puN mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api mm/swapfile.c
--- a/mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/swapfile.c
@@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	p = swap_info_get(entry);
 	if (p) {
-		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		if (page)
-			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
+		swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
diff -puN mm/truncate.c~mm-memcontrol-rewrite-uncharge-api mm/truncate.c
--- a/mm/truncate.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/truncate.c
@@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct a
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -369,7 +367,6 @@ void truncate_inode_pages_range(struct a
 			pagevec_release(&pvec);
 			break;
 		}
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -394,7 +391,6 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		index++;
 	}
 	cleancache_invalidate_inode(mapping);
@@ -493,7 +489,6 @@ unsigned long invalidate_mapping_pages(s
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -522,7 +517,6 @@ unsigned long invalidate_mapping_pages(s
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
@@ -553,7 +547,6 @@ invalidate_complete_page2(struct address
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -602,7 +595,6 @@ int invalidate_inode_pages2_range(struct
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
-		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
@@ -655,7 +647,6 @@ int invalidate_inode_pages2_range(struct
 		}
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
-		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
 	}
diff -puN mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api mm/vmscan.c
--- a/mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/vmscan.c
@@ -571,9 +571,10 @@ static int __remove_mapping(struct addre
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -594,7 +595,6 @@ static int __remove_mapping(struct addre
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(st
 		 */
 		__clear_page_locked(page);
 free_it:
+		mem_cgroup_uncharge(page);
 		nr_reclaimed++;
 
 		/*
@@ -1126,12 +1127,13 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	mem_cgroup_uncharge_end();
 
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
-	mem_cgroup_uncharge_end();
+
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_congested += nr_congested;
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
@@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lr
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
@@ -1650,6 +1654,8 @@ static void move_active_pages_to_lru(str
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
diff -puN mm/zswap.c~mm-memcontrol-rewrite-uncharge-api mm/zswap.c
--- a/mm/zswap.c~mm-memcontrol-rewrite-uncharge-api
+++ a/mm/zswap.c
@@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
_


* Re: mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch
  2014-08-06 20:59 mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch Andrew Morton
  2014-08-06 21:00 ` mm: memcontrol: rewrite charge API Andrew Morton
@ 2014-08-07  2:26 ` Johannes Weiner
  2014-08-07  7:44 ` Michal Hocko
  2 siblings, 0 replies; 12+ messages in thread
From: Johannes Weiner @ 2014-08-07  2:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm

On Wed, Aug 06, 2014 at 01:59:14PM -0700, Andrew Morton wrote:
> 
> Do we feel these are ready for merging?
> 
> I'll send along the consolidated patches for eyeballs.

I just went over them again, as well as over the individual fixes.
The uncharge-from-irq-context was a more fundamental fix-up, as was
getting migration from fuse splicing right, but aside from that I
think the fallout was fairly mild and seems to have settled; the last
bug fix is from two weeks ago.  Given how much this series reduces the
burden memcg puts on users and maintainers, I think it should go in.

Thanks!


* Re: mm: memcontrol: rewrite uncharge API
  2014-08-06 21:02     ` Andrew Morton
@ 2014-08-07  7:38       ` Michal Hocko
  2014-08-07 16:25         ` Johannes Weiner
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-08-07  7:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Johannes Weiner, linux-mm

On Wed 06-08-14 14:02:35, Andrew Morton wrote:
> On Wed, 6 Aug 2014 14:00:55 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Subject: mm: memcontrol: rewrite uncharge API
> > 
> 
> Nope, sorry, that was missing
> mm-memcontrol-rewrite-uncharge-api-fix-clear-page-mapping-in-migration.patch.
> 
> This time:
> 
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: mm: memcontrol: rewrite uncharge API
> 
> The memcg uncharging code that is involved towards the end of a page's
> lifetime - truncation, reclaim, swapout, migration - is impressively
> complicated and fragile.
> 
> Because anonymous and file pages were always charged before they had their
> page->mapping established, uncharges had to happen when the page type
> could still be known from the context; as in unmap for anonymous, page
> cache removal for file and shmem pages, and swap cache truncation for swap
> pages.  However, these operations happen well before the page is actually
> freed, and so a lot of synchronization is necessary:
> 
> - Charging, uncharging, page migration, and charge migration all need
>   to take a per-page bit spinlock as they could race with uncharging.
> 
> - Swap cache truncation happens during both swap-in and swap-out, and
>   possibly repeatedly before the page is actually freed.  This means
>   that the memcg swapout code is called from many contexts that make
>   no sense and it has to figure out the direction from page state to
>   make sure memory and memory+swap are always correctly charged.
> 
> - On page migration, the old page might be unmapped but then reused,
>   so memcg code has to prevent untimely uncharging in that case.
>   Because this code - which should be a simple charge transfer - is so
>   special-cased, it is not reusable for replace_page_cache().
> 
> But now that charged pages always have a page->mapping, introduce
> mem_cgroup_uncharge(), which is called after the final put_page(), when we
> know for sure that nobody is looking at the page anymore.
> 
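> As an illustration only (example_free_page() is a hypothetical helper,
> not part of this patch; the real hook points are __page_cache_release()
> and release_pages() in mm/swap.c), the final-free path boils down to a
> single type-agnostic call:
> 
> 	static void example_free_page(struct page *page)
> 	{
> 		/* refcount is zero and the page is off the LRU */
> 		VM_BUG_ON_PAGE(page_count(page), page);
> 		/* one call, no anon/file/swapcache special cases */
> 		mem_cgroup_uncharge(page);
> 		free_hot_cold_page(page, false);
> 	}
> 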
> For page migration, introduce mem_cgroup_migrate(), which is called after
> the migration is successful and the new page is fully rmapped.  Because
> the old page is no longer uncharged after migration, prevent double
> charges by decoupling the page's memcg association (PCG_USED and
> pc->mem_cgroup) from the page holding an actual charge.  The new bits
> PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
> to the new page during migration.
> 
> mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
> gets rid of mem_cgroup_replace_page_cache().
> 
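> A minimal sketch of the new migration sequence (example_finish_migration()
> is hypothetical, for illustration only; the real callers are the paths
> changed in mm/migrate.c and mm/filemap.c):
> 
> 	static void example_finish_migration(struct page *oldpage,
> 					     struct page *newpage)
> 	{
> 		/* newpage carries the data and is fully rmapped */
> 		mem_cgroup_migrate(oldpage, newpage, false);
> 		/*
> 		 * oldpage no longer holds the charge, so its final
> 		 * put_page() is a plain free.
> 		 */
> 		put_page(oldpage);
> 	}
> 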
> Swap accounting is massively simplified: because the page is no longer
> uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
> transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
> before the final put_page() in page reclaim.
> 
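> The reclaim-side ordering mirrors the __remove_mapping() hunk in
> mm/vmscan.c (repeated here only for illustration):
> 
> 	swp_entry_t swap = { .val = page_private(page) };
> 
> 	mem_cgroup_swapout(page, swap);	/* hand the memsw charge to the entry */
> 	__delete_from_swap_cache(page);
> 	spin_unlock_irq(&mapping->tree_lock);
> 	swapcache_free(swap);		/* no page argument needed anymore */
> 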
> Finally, page_cgroup changes are now protected by whatever protection the
> page itself offers: anonymous pages are charged under the page table lock,
> whereas page cache insertions, swapin, and migration hold the page lock. 
> Uncharging happens under full exclusion with no outstanding references. 
> Charging and uncharging also ensure that the page is off-LRU, which
> serializes against charge migration.  Remove the very costly page_cgroup
> lock and set pc->flags non-atomically.

I see some point in squashing all the fixups into a single patch, but I
am afraid we have lost some interesting details from the fixups this time.
I think that at least
mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch and
mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration-2.patch
would be better to go in on their own _or_ to have their changelogs folded
in here. The whole page cache replacement path is obscure, and we should
have it documented so we do not have to google for details or go through
painful code inspection next time.
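
For the record, my reading of the replacement path after this series,
based on the replace_page_cache_page() hunk below (an illustration, not
authoritative documentation):

	spin_lock_irq(&mapping->tree_lock);
	/* swap new for old in the radix tree, transfer zone stats */
	spin_unlock_irq(&mapping->tree_lock);
	mem_cgroup_migrate(old, new, true);	/* lrucare: new may be on the LRU already */
	radix_tree_preload_end();
	if (freepage)
		freepage(old);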

> [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]

The preempt_{en,dis}able calls were later replaced by disabling
interrupts, so this attribution might be confusing.

> [vdavydov@parallels.com: fix flags definition]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> Tested-by: Jet Chen <jet.chen@intel.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

If this comes from the above change then it should probably be removed.
You can replace it with my Acked-by. I have acked all the follow-up fixes
but forgot to ack the initial patch.

> Tested-by: Felipe Balbi <balbi@ti.com>

This Tested-by came from the same preempt_{en,dis}able patch AFAICS.

> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  Documentation/cgroups/memcg_test.txt |  126 ---
>  include/linux/memcontrol.h           |   49 -
>  include/linux/page_cgroup.h          |   43 -
>  include/linux/swap.h                 |   12 
>  mm/filemap.c                         |    4 
>  mm/memcontrol.c                      |  832 +++++++++----------------
>  mm/memory.c                          |    2 
>  mm/migrate.c                         |   38 -
>  mm/rmap.c                            |    1 
>  mm/shmem.c                           |    8 
>  mm/swap.c                            |    6 
>  mm/swap_state.c                      |    8 
>  mm/swapfile.c                        |    7 
>  mm/truncate.c                        |    9 
>  mm/vmscan.c                          |   12 
>  mm/zswap.c                           |    2 
>  16 files changed, 390 insertions(+), 769 deletions(-)
> 
> diff -puN Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api Documentation/cgroups/memcg_test.txt
> --- a/Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api
> +++ a/Documentation/cgroups/memcg_test.txt
> @@ -29,28 +29,13 @@ Please note that implementation details
>  2. Uncharge
>    a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
>  
> -	mem_cgroup_uncharge_page()
> -	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
> -	  to 0. If the page is SwapCache, uncharge is delayed until
> -	  mem_cgroup_uncharge_swapcache().
> -
> -	mem_cgroup_uncharge_cache_page()
> -	  Called when a page-cache is deleted from radix-tree. If the page is
> -	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
> -
> -	mem_cgroup_uncharge_swapcache()
> -	  Called when SwapCache is removed from radix-tree. The charge itself
> -	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
> -	  charge to swap occurs.)
> +	mem_cgroup_uncharge()
> +	  Called when a page's refcount goes down to 0.
>  
>  	mem_cgroup_uncharge_swap()
>  	  Called when swp_entry's refcnt goes down to 0. A charge against swap
>  	  disappears.
>  
> -	mem_cgroup_end_migration(old, new)
> -	At success of migration old is uncharged (if necessary), a charge
> -	to new page is committed. At failure, charge to old page is committed.
> -
>  3. charge-commit-cancel
>  	Memcg pages are charged in two steps:
>  		mem_cgroup_try_charge()
> @@ -69,18 +54,6 @@ Under below explanation, we assume CONFI
>  	Anonymous page is newly allocated at
>  		  - page fault into MAP_ANONYMOUS mapping.
>  		  - Copy-On-Write.
> - 	It is charged right after it's allocated before doing any page table
> -	related operations. Of course, it's uncharged when another page is used
> -	for the fault address.
> -
> -	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
> -	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
> -	are done at page_remove_rmap() when page_mapcount() goes down to 0.
> -
> -	Another page freeing is by page-reclaim (vmscan.c) and anonymous
> -	pages are swapped out. In this case, the page is marked as
> -	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
> -	as SwapCache(). It's delayed until __delete_from_swap_cache().
>  
>  	4.1 Swap-in.
>  	At swap-in, the page is taken from swap-cache. There are 2 cases.
> @@ -89,41 +62,6 @@ Under below explanation, we assume CONFI
>  	(b) If the SwapCache has been mapped by processes, it has been
>  	    charged already.
>  
> -	This swap-in is one of the most complicated work. In do_swap_page(),
> -	following events occur when pte is unchanged.
> -
> -	(1) the page (SwapCache) is looked up.
> -	(2) lock_page()
> -	(3) try_charge_swapin()
> -	(4) reuse_swap_page() (may call delete_swap_cache())
> -	(5) commit_charge_swapin()
> -	(6) swap_free().
> -
> -	Considering following situation for example.
> -
> -	(A) The page has not been charged before (2) and reuse_swap_page()
> -	    doesn't call delete_from_swap_cache().
> -	(B) The page has not been charged before (2) and reuse_swap_page()
> -	    calls delete_from_swap_cache().
> -	(C) The page has been charged before (2) and reuse_swap_page() doesn't
> -	    call delete_from_swap_cache().
> -	(D) The page has been charged before (2) and reuse_swap_page() calls
> -	    delete_from_swap_cache().
> -
> -	    memory.usage/memsw.usage changes to this page/swp_entry will be
> -	 Case          (A)      (B)       (C)     (D)
> -         Event
> -       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
> -          ===========================================
> -          (3)        +1/+1    +1/+1     +1/+1   +1/+1
> -          (4)          -       0/ 0       -     -1/ 0
> -          (5)         0/-1     0/ 0     -1/-1    0/ 0
> -          (6)          -       0/-1       -      0/-1
> -          ===========================================
> -       Result         1/ 1     1/ 1      1/ 1    1/ 1
> -
> -       In any cases, charges to this page should be 1/ 1.
> -
>  	4.2 Swap-out.
>  	At swap-out, typical state transition is below.
>  
> @@ -136,28 +74,20 @@ Under below explanation, we assume CONFI
>  	    swp_entry's refcnt -= 1.
>  
>  
> -	At (b), the page is marked as SwapCache and not uncharged.
> -	At (d), the page is removed from SwapCache and a charge in page_cgroup
> -	is moved to swap_cgroup.
> -
>  	Finally, at task exit,
>  	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
> -	Here, a charge in swap_cgroup disappears.
>  
>  5. Page Cache
>     	Page Cache is charged at
>  	- add_to_page_cache_locked().
>  
> -	uncharged at
> -	- __remove_from_page_cache().
> -
>  	The logic is very clear. (About migration, see below)
>  	Note: __remove_from_page_cache() is called by remove_from_page_cache()
>  	and __remove_mapping().
>  
>  6. Shmem(tmpfs) Page Cache
> -	Memcg's charge/uncharge have special handlers of shmem. The best way
> -	to understand shmem's page state transition is to read mm/shmem.c.
> +	The best way to understand shmem's page state transition is to read
> +	mm/shmem.c.
>  	But brief explanation of the behavior of memcg around shmem will be
>  	helpful to understand the logic.
>  
> @@ -170,56 +100,10 @@ Under below explanation, we assume CONFI
>  	It's charged when...
>  	- A new page is added to shmem's radix-tree.
>  	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
> -	It's uncharged when
> -	- A page is removed from radix-tree and not SwapCache.
> -	- When SwapCache is removed, a charge is moved to swap_cgroup.
> -	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
> -	  disappears.
>  
>  7. Page Migration
> -   	One of the most complicated functions is page-migration-handler.
> -	Memcg has 2 routines. Assume that we are migrating a page's contents
> -	from OLDPAGE to NEWPAGE.
> -
> -	Usual migration logic is..
> -	(a) remove the page from LRU.
> -	(b) allocate NEWPAGE (migration target)
> -	(c) lock by lock_page().
> -	(d) unmap all mappings.
> -	(e-1) If necessary, replace entry in radix-tree.
> -	(e-2) move contents of a page.
> -	(f) map all mappings again.
> -	(g) pushback the page to LRU.
> -	(-) OLDPAGE will be freed.
> -
> -	Before (g), memcg should complete all necessary charge/uncharge to
> -	NEWPAGE/OLDPAGE.
> -
> -	The point is....
> -	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
> -          try_to_unmap() drops all mapcount and the page will not be
> -	  SwapCache.
> -
> -	- If OLDPAGE is SwapCache, charges will be kept at (g) because
> -	  __delete_from_swap_cache() isn't called at (e-1)
> -
> -	- If OLDPAGE is page-cache, charges will be kept at (g) because
> -	  remove_from_swap_cache() isn't called at (e-1)
> -
> -	memcg provides following hooks.
> -
> -	- mem_cgroup_prepare_migration(OLDPAGE)
> -	  Called after (b) to account a charge (usage += PAGE_SIZE) against
> -	  memcg which OLDPAGE belongs to.
> -
> -        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
> -	  Called after (f) before (g).
> -	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
> -	  charged, a charge by prepare_migration() is automatically canceled.
> -	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
>  
> -	  But zap_pte() (by exit or munmap) can be called while migration,
> -	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
> +	mem_cgroup_migrate()
>  
>  8. LRU
>          Each memcg has its own private LRU. Now, its handling is under global
> diff -puN include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api
> +++ a/include/linux/memcontrol.h
> @@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct pag
>  			      bool lrucare);
>  void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
>  
> -struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> +void mem_cgroup_uncharge(struct page *page);
> +
> +/* Batched uncharging */
> +void mem_cgroup_uncharge_start(void);
> +void mem_cgroup_uncharge_end(void);
>  
> -/* For coalescing uncharge for reducing memcg' overhead*/
> -extern void mem_cgroup_uncharge_start(void);
> -extern void mem_cgroup_uncharge_end(void);
> +void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> +			bool lrucare);
>  
> -extern void mem_cgroup_uncharge_page(struct page *page);
> -extern void mem_cgroup_uncharge_cache_page(struct page *page);
> +struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>  
>  bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>  				  struct mem_cgroup *memcg);
> @@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_str
>  
>  extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
>  
> -extern void
> -mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> -			     struct mem_cgroup **memcgp);
> -extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> -	struct page *oldpage, struct page *newpage, bool migration_ok);
> -
>  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
>  				   struct mem_cgroup *,
>  				   struct mem_cgroup_reclaim_cookie *);
> @@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(st
>  void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  					struct task_struct *p);
> -extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> -					struct page *newpage);
>  
>  static inline void mem_cgroup_oom_enable(void)
>  {
> @@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_cha
>  {
>  }
>  
> -static inline void mem_cgroup_uncharge_start(void)
> +static inline void mem_cgroup_uncharge(struct page *page)
>  {
>  }
>  
> -static inline void mem_cgroup_uncharge_end(void)
> +static inline void mem_cgroup_uncharge_start(void)
>  {
>  }
>  
> -static inline void mem_cgroup_uncharge_page(struct page *page)
> +static inline void mem_cgroup_uncharge_end(void)
>  {
>  }
>  
> -static inline void mem_cgroup_uncharge_cache_page(struct page *page)
> +static inline void mem_cgroup_migrate(struct page *oldpage,
> +				      struct page *newpage,
> +				      bool lrucare)
>  {
>  }
>  
> @@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
>  	return NULL;
>  }
>  
> -static inline void
> -mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> -			     struct mem_cgroup **memcgp)
> -{
> -}
> -
> -static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> -		struct page *oldpage, struct page *newpage, bool migration_ok)
> -{
> -}
> -
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>  		struct mem_cgroup *prev,
> @@ -392,10 +377,6 @@ static inline
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> -static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
> -				struct page *newpage)
> -{
> -}
>  #endif /* CONFIG_MEMCG */
>  
>  #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> diff -puN include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api include/linux/page_cgroup.h
> --- a/include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api
> +++ a/include/linux/page_cgroup.h
> @@ -3,9 +3,9 @@
>  
>  enum {
>  	/* flags for mem_cgroup */
> -	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
> -	PCG_USED, /* this object is in use. */
> -	PCG_MIGRATION, /* under page migration */
> +	PCG_USED = 0x01,	/* This page is charged to a memcg */
> +	PCG_MEM = 0x02,		/* This page holds a memory charge */
> +	PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
>  	__NR_PCG_FLAGS,
>  };
>  
> @@ -44,42 +44,9 @@ static inline void __init page_cgroup_in
>  struct page_cgroup *lookup_page_cgroup(struct page *page);
>  struct page *lookup_cgroup_page(struct page_cgroup *pc);
>  
> -#define TESTPCGFLAG(uname, lname)			\
> -static inline int PageCgroup##uname(struct page_cgroup *pc)	\
> -	{ return test_bit(PCG_##lname, &pc->flags); }
> -
> -#define SETPCGFLAG(uname, lname)			\
> -static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
> -	{ set_bit(PCG_##lname, &pc->flags);  }
> -
> -#define CLEARPCGFLAG(uname, lname)			\
> -static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> -	{ clear_bit(PCG_##lname, &pc->flags);  }
> -
> -#define TESTCLEARPCGFLAG(uname, lname)			\
> -static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
> -	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
> -
> -TESTPCGFLAG(Used, USED)
> -CLEARPCGFLAG(Used, USED)
> -SETPCGFLAG(Used, USED)
> -
> -SETPCGFLAG(Migration, MIGRATION)
> -CLEARPCGFLAG(Migration, MIGRATION)
> -TESTPCGFLAG(Migration, MIGRATION)
> -
> -static inline void lock_page_cgroup(struct page_cgroup *pc)
> -{
> -	/*
> -	 * Don't take this lock in IRQ context.
> -	 * This lock is for pc->mem_cgroup, USED, MIGRATION
> -	 */
> -	bit_spin_lock(PCG_LOCK, &pc->flags);
> -}
> -
> -static inline void unlock_page_cgroup(struct page_cgroup *pc)
> +static inline int PageCgroupUsed(struct page_cgroup *pc)
>  {
> -	bit_spin_unlock(PCG_LOCK, &pc->flags);
> +	return !!(pc->flags & PCG_USED);
>  }
>  
>  #else /* CONFIG_MEMCG */
> diff -puN include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api include/linux/swap.h
> --- a/include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api
> +++ a/include/linux/swap.h
> @@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(
>  }
>  #endif
>  #ifdef CONFIG_MEMCG_SWAP
> -extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> +extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> +extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
>  #else
> -static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
> +static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
> +{
> +}
> +static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
>  {
>  }
>  #endif
> @@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> -extern void swapcache_free(swp_entry_t, struct page *page);
> +extern void swapcache_free(swp_entry_t);
>  extern int free_swap_and_cache(swp_entry_t);
>  extern int swap_type_of(dev_t, sector_t, struct block_device **);
>  extern unsigned int count_swap_pages(int, int);
> @@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t
>  {
>  }
>  
> -static inline void swapcache_free(swp_entry_t swp, struct page *page)
> +static inline void swapcache_free(swp_entry_t swp)
>  {
>  }
>  
> diff -puN mm/filemap.c~mm-memcontrol-rewrite-uncharge-api mm/filemap.c
> --- a/mm/filemap.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/filemap.c
> @@ -234,7 +234,6 @@ void delete_from_page_cache(struct page
>  	spin_lock_irq(&mapping->tree_lock);
>  	__delete_from_page_cache(page, NULL);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	mem_cgroup_uncharge_cache_page(page);
>  
>  	if (freepage)
>  		freepage(page);
> @@ -490,8 +489,7 @@ int replace_page_cache_page(struct page
>  		if (PageSwapBacked(new))
>  			__inc_zone_page_state(new, NR_SHMEM);
>  		spin_unlock_irq(&mapping->tree_lock);
> -		/* mem_cgroup codes must not be called under tree_lock */
> -		mem_cgroup_replace_page_cache(old, new);
> +		mem_cgroup_migrate(old, new, true);
>  		radix_tree_preload_end();
>  		if (freepage)
>  			freepage(old);
> diff -puN mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api mm/memcontrol.c
> --- a/mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/memcontrol.c
> @@ -754,9 +754,11 @@ static void __mem_cgroup_remove_exceeded
>  static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
>  				       struct mem_cgroup_tree_per_zone *mctz)
>  {
> -	spin_lock(&mctz->lock);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&mctz->lock, flags);
>  	__mem_cgroup_remove_exceeded(mz, mctz);
> -	spin_unlock(&mctz->lock);
> +	spin_unlock_irqrestore(&mctz->lock, flags);
>  }
>  
>  
> @@ -779,7 +781,9 @@ static void mem_cgroup_update_tree(struc
>  		 * mem is over its softlimit.
>  		 */
>  		if (excess || mz->on_tree) {
> -			spin_lock(&mctz->lock);
> +			unsigned long flags;
> +
> +			spin_lock_irqsave(&mctz->lock, flags);
>  			/* if on-tree, remove it */
>  			if (mz->on_tree)
>  				__mem_cgroup_remove_exceeded(mz, mctz);
> @@ -788,7 +792,7 @@ static void mem_cgroup_update_tree(struc
>  			 * If excess is 0, no tree ops.
>  			 */
>  			__mem_cgroup_insert_exceeded(mz, mctz, excess);
> -			spin_unlock(&mctz->lock);
> +			spin_unlock_irqrestore(&mctz->lock, flags);
>  		}
>  	}
>  }
> @@ -839,9 +843,9 @@ mem_cgroup_largest_soft_limit_node(struc
>  {
>  	struct mem_cgroup_per_zone *mz;
>  
> -	spin_lock(&mctz->lock);
> +	spin_lock_irq(&mctz->lock);
>  	mz = __mem_cgroup_largest_soft_limit_node(mctz);
> -	spin_unlock(&mctz->lock);
> +	spin_unlock_irq(&mctz->lock);
>  	return mz;
>  }
>  
> @@ -882,13 +886,6 @@ static long mem_cgroup_read_stat(struct
>  	return val;
>  }
>  
> -static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
> -					 bool charge)
> -{
> -	int val = (charge) ? 1 : -1;
> -	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> -}
> -
>  static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
>  					    enum mem_cgroup_events_index idx)
>  {
> @@ -909,13 +906,13 @@ static unsigned long mem_cgroup_read_eve
>  
>  static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
>  					 struct page *page,
> -					 bool anon, int nr_pages)
> +					 int nr_pages)
>  {
>  	/*
>  	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
>  	 * counted as CACHE even if it's on ANON LRU.
>  	 */
> -	if (anon)
> +	if (PageAnon(page))
>  		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
>  				nr_pages);
>  	else
> @@ -1013,7 +1010,6 @@ static bool mem_cgroup_event_ratelimit(s
>   */
>  static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>  {
> -	preempt_disable();
>  	/* threshold event is triggered in finer grain than soft limit */
>  	if (unlikely(mem_cgroup_event_ratelimit(memcg,
>  						MEM_CGROUP_TARGET_THRESH))) {
> @@ -1026,8 +1022,6 @@ static void memcg_check_events(struct me
>  		do_numainfo = mem_cgroup_event_ratelimit(memcg,
>  						MEM_CGROUP_TARGET_NUMAINFO);
>  #endif
> -		preempt_enable();
> -
>  		mem_cgroup_threshold(memcg);
>  		if (unlikely(do_softlimit))
>  			mem_cgroup_update_tree(memcg, page);
> @@ -1035,8 +1029,7 @@ static void memcg_check_events(struct me
>  		if (unlikely(do_numainfo))
>  			atomic_inc(&memcg->numainfo_events);
>  #endif
> -	} else
> -		preempt_enable();
> +	}
>  }
>  
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> @@ -1347,20 +1340,6 @@ out:
>  	return lruvec;
>  }
>  
> -/*
> - * Following LRU functions are allowed to be used without PCG_LOCK.
> - * Operations are called by routine of global LRU independently from memcg.
> - * What we have to take care of here is validness of pc->mem_cgroup.
> - *
> - * Changes to pc->mem_cgroup happens when
> - * 1. charge
> - * 2. moving account
> - * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
> - * It is added to LRU before charge.
> - * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
> - * When moving account, the page is not on LRU. It's isolated.
> - */
> -
>  /**
>   * mem_cgroup_page_lruvec - return lruvec for adding an lru page
>   * @page: the page
> @@ -2261,22 +2240,14 @@ cleanup:
>   *
>   * Notes: Race condition
>   *
> - * We usually use lock_page_cgroup() for accessing page_cgroup member but
> - * it tends to be costly. But considering some conditions, we doesn't need
> - * to do so _always_.
> - *
> - * Considering "charge", lock_page_cgroup() is not required because all
> - * file-stat operations happen after a page is attached to radix-tree. There
> - * are no race with "charge".
> - *
> - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
> - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
> - * if there are race with "uncharge". Statistics itself is properly handled
> - * by flags.
> - *
> - * Considering "move", this is an only case we see a race. To make the race
> - * small, we check memcg->moving_account and detect there are possibility
> - * of race or not. If there is, we take a lock.
> + * Charging occurs during page instantiation, while the page is
> + * unmapped and locked in page migration, or while the page table is
> + * locked in THP migration.  No race is possible.
> + *
> + * Uncharge happens to pages with zero references, no race possible.
> + *
> + * Charge moving between groups is protected by checking mm->moving
> + * account and taking the move_lock in the slowpath.
>   */
>  
>  void __mem_cgroup_begin_update_page_stat(struct page *page,
> @@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_loo
>  	return mem_cgroup_from_id(id);
>  }
>  
> +/*
> + * try_get_mem_cgroup_from_page - look up page's memcg association
> + * @page: the page
> + *
> + * Look up, get a css reference, and return the memcg that owns @page.
> + *
> + * The page must be locked to prevent racing with swap-in and page
> + * cache charges.  If coming from an unlocked page table, the caller
> + * must ensure the page is on the LRU or this can race with charging.
> + */
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
>  	struct mem_cgroup *memcg = NULL;
> @@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
>  	pc = lookup_page_cgroup(page);
> -	lock_page_cgroup(pc);
>  	if (PageCgroupUsed(pc)) {
>  		memcg = pc->mem_cgroup;
>  		if (memcg && !css_tryget_online(&memcg->css))
> @@ -2713,19 +2693,46 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>  			memcg = NULL;
>  		rcu_read_unlock();
>  	}
> -	unlock_page_cgroup(pc);
>  	return memcg;
>  }
>  
> +static void lock_page_lru(struct page *page, int *isolated)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	spin_lock_irq(&zone->lru_lock);
> +	if (PageLRU(page)) {
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, zone);
> +		ClearPageLRU(page);
> +		del_page_from_lru_list(page, lruvec, page_lru(page));
> +		*isolated = 1;
> +	} else
> +		*isolated = 0;
> +}
> +
> +static void unlock_page_lru(struct page *page, int isolated)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	if (isolated) {
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, zone);
> +		VM_BUG_ON_PAGE(PageLRU(page), page);
> +		SetPageLRU(page);
> +		add_page_to_lru_list(page, lruvec, page_lru(page));
> +	}
> +	spin_unlock_irq(&zone->lru_lock);
> +}
> +
>  static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> -			  unsigned int nr_pages, bool anon, bool lrucare)
> +			  unsigned int nr_pages, bool lrucare)
>  {
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> -	struct zone *uninitialized_var(zone);
> -	struct lruvec *lruvec;
> -	bool was_on_lru = false;
> +	int isolated;
>  
> -	lock_page_cgroup(pc);
>  	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
>  	/*
>  	 * we don't need page_cgroup_lock about tail pages, becase they are not
> @@ -2736,39 +2743,38 @@ static void commit_charge(struct page *p
>  	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
>  	 * may already be on some other mem_cgroup's LRU.  Take care of it.
>  	 */
> -	if (lrucare) {
> -		zone = page_zone(page);
> -		spin_lock_irq(&zone->lru_lock);
> -		if (PageLRU(page)) {
> -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> -			ClearPageLRU(page);
> -			del_page_from_lru_list(page, lruvec, page_lru(page));
> -			was_on_lru = true;
> -		}
> -	}
> +	if (lrucare)
> +		lock_page_lru(page, &isolated);
>  
> +	/*
> +	 * Nobody should be changing or seriously looking at
> +	 * pc->mem_cgroup and pc->flags at this point:
> +	 *
> +	 * - the page is uncharged
> +	 *
> +	 * - the page is off-LRU
> +	 *
> +	 * - an anonymous fault has exclusive page access, except for
> +	 *   a locked page table
> +	 *
> +	 * - a page cache insertion, a swapin fault, or a migration
> +	 *   have the page locked
> +	 */
>  	pc->mem_cgroup = memcg;
> -	SetPageCgroupUsed(pc);
> +	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
>  
> -	if (lrucare) {
> -		if (was_on_lru) {
> -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> -			VM_BUG_ON_PAGE(PageLRU(page), page);
> -			SetPageLRU(page);
> -			add_page_to_lru_list(page, lruvec, page_lru(page));
> -		}
> -		spin_unlock_irq(&zone->lru_lock);
> -	}
> -
> -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> -	unlock_page_cgroup(pc);
> +	if (lrucare)
> +		unlock_page_lru(page, isolated);
>  
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
>  	/*
>  	 * "charge_statistics" updated event counter. Then, check it.
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
>  	 * if they exceeds softlimit.
>  	 */
>  	memcg_check_events(memcg, page);
> +	local_irq_enable();
>  }
>  
>  static DEFINE_MUTEX(set_limit_mutex);
> @@ -3395,7 +3401,6 @@ static inline void memcg_unregister_all_
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  
> -#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
>  /*
>   * Because tail pages are not marked as "used", set it. We're under
>   * zone->lru_lock, 'splitting on pmd' and compound_lock.
> @@ -3416,7 +3421,7 @@ void mem_cgroup_split_huge_fixup(struct
>  	for (i = 1; i < HPAGE_PMD_NR; i++) {
>  		pc = head_pc + i;
>  		pc->mem_cgroup = memcg;
> -		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
> +		pc->flags = head_pc->flags;
>  	}
>  	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
>  		       HPAGE_PMD_NR);
> @@ -3446,7 +3451,6 @@ static int mem_cgroup_move_account(struc
>  {
>  	unsigned long flags;
>  	int ret;
> -	bool anon = PageAnon(page);
>  
>  	VM_BUG_ON(from == to);
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
> @@ -3460,15 +3464,21 @@ static int mem_cgroup_move_account(struc
>  	if (nr_pages > 1 && !PageTransHuge(page))
>  		goto out;
>  
> -	lock_page_cgroup(pc);
> +	/*
> +	 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup
> +	 * of its source page while we change it: page migration takes
> +	 * both pages off the LRU, but page cache replacement doesn't.
> +	 */
> +	if (!trylock_page(page))
> +		goto out;
>  
>  	ret = -EINVAL;
>  	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
> -		goto unlock;
> +		goto out_unlock;
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> +	if (!PageAnon(page) && page_mapped(page)) {
>  		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
>  			       nr_pages);
>  		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
> @@ -3482,20 +3492,25 @@ static int mem_cgroup_move_account(struc
>  			       nr_pages);
>  	}
>  
> -	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
> +	/*
> +	 * It is safe to change pc->mem_cgroup here because the page
> +	 * is referenced, charged, and isolated - we can't race with
> +	 * uncharging, charging, migration, or LRU putback.
> +	 */
>  
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
> -	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
>  	move_unlock_mem_cgroup(from, &flags);
>  	ret = 0;
> -unlock:
> -	unlock_page_cgroup(pc);
> -	/*
> -	 * check events
> -	 */
> +
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(to, page, nr_pages);
>  	memcg_check_events(to, page);
> +	mem_cgroup_charge_statistics(from, page, -nr_pages);
>  	memcg_check_events(from, page);
> +	local_irq_enable();
> +out_unlock:
> +	unlock_page(page);
>  out:
>  	return ret;
>  }
> @@ -3566,193 +3581,6 @@ out:
>  	return ret;
>  }
>  
> -static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
> -				   unsigned int nr_pages,
> -				   const enum charge_type ctype)
> -{
> -	struct memcg_batch_info *batch = NULL;
> -	bool uncharge_memsw = true;
> -
> -	/* If swapout, usage of swap doesn't decrease */
> -	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> -		uncharge_memsw = false;
> -
> -	batch = &current->memcg_batch;
> -	/*
> -	 * In usual, we do css_get() when we remember memcg pointer.
> -	 * But in this case, we keep res->usage until end of a series of
> -	 * uncharges. Then, it's ok to ignore memcg's refcnt.
> -	 */
> -	if (!batch->memcg)
> -		batch->memcg = memcg;
> -	/*
> -	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> -	 * In those cases, all pages freed continuously can be expected to be in
> -	 * the same cgroup and we have chance to coalesce uncharges.
> -	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
> -	 * because we want to do uncharge as soon as possible.
> -	 */
> -
> -	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
> -		goto direct_uncharge;
> -
> -	if (nr_pages > 1)
> -		goto direct_uncharge;
> -
> -	/*
> -	 * In typical case, batch->memcg == mem. This means we can
> -	 * merge a series of uncharges to an uncharge of res_counter.
> -	 * If not, we uncharge res_counter ony by one.
> -	 */
> -	if (batch->memcg != memcg)
> -		goto direct_uncharge;
> -	/* remember freed charge and uncharge it later */
> -	batch->nr_pages++;
> -	if (uncharge_memsw)
> -		batch->memsw_nr_pages++;
> -	return;
> -direct_uncharge:
> -	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
> -	if (uncharge_memsw)
> -		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
> -	if (unlikely(batch->memcg != memcg))
> -		memcg_oom_recover(memcg);
> -}
> -
> -/*
> - * uncharge if !page_mapped(page)
> - */
> -static struct mem_cgroup *
> -__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
> -			     bool end_migration)
> -{
> -	struct mem_cgroup *memcg = NULL;
> -	unsigned int nr_pages = 1;
> -	struct page_cgroup *pc;
> -	bool anon;
> -
> -	if (mem_cgroup_disabled())
> -		return NULL;
> -
> -	if (PageTransHuge(page)) {
> -		nr_pages <<= compound_order(page);
> -		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> -	}
> -	/*
> -	 * Check if our page_cgroup is valid
> -	 */
> -	pc = lookup_page_cgroup(page);
> -	if (unlikely(!PageCgroupUsed(pc)))
> -		return NULL;
> -
> -	lock_page_cgroup(pc);
> -
> -	memcg = pc->mem_cgroup;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto unlock_out;
> -
> -	anon = PageAnon(page);
> -
> -	switch (ctype) {
> -	case MEM_CGROUP_CHARGE_TYPE_ANON:
> -		/*
> -		 * Generally PageAnon tells if it's the anon statistics to be
> -		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
> -		 * used before page reached the stage of being marked PageAnon.
> -		 */
> -		anon = true;
> -		/* fallthrough */
> -	case MEM_CGROUP_CHARGE_TYPE_DROP:
> -		/* See mem_cgroup_prepare_migration() */
> -		if (page_mapped(page))
> -			goto unlock_out;
> -		/*
> -		 * Pages under migration may not be uncharged.  But
> -		 * end_migration() /must/ be the one uncharging the
> -		 * unused post-migration page and so it has to call
> -		 * here with the migration bit still set.  See the
> -		 * res_counter handling below.
> -		 */
> -		if (!end_migration && PageCgroupMigration(pc))
> -			goto unlock_out;
> -		break;
> -	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
> -		if (!PageAnon(page)) {	/* Shared memory */
> -			if (page->mapping && !page_is_file_cache(page))
> -				goto unlock_out;
> -		} else if (page_mapped(page)) /* Anon */
> -				goto unlock_out;
> -		break;
> -	default:
> -		break;
> -	}
> -
> -	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
> -
> -	ClearPageCgroupUsed(pc);
> -	/*
> -	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
> -	 * freed from LRU. This is safe because uncharged page is expected not
> -	 * to be reused (freed soon). Exception is SwapCache, it's handled by
> -	 * special functions.
> -	 */
> -
> -	unlock_page_cgroup(pc);
> -	/*
> -	 * even after unlock, we have memcg->res.usage here and this memcg
> -	 * will never be freed, so it's safe to call css_get().
> -	 */
> -	memcg_check_events(memcg, page);
> -	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> -		mem_cgroup_swap_statistics(memcg, true);
> -		css_get(&memcg->css);
> -	}
> -	/*
> -	 * Migration does not charge the res_counter for the
> -	 * replacement page, so leave it alone when phasing out the
> -	 * page that is unused after the migration.
> -	 */
> -	if (!end_migration)
> -		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
> -
> -	return memcg;
> -
> -unlock_out:
> -	unlock_page_cgroup(pc);
> -	return NULL;
> -}
> -
> -void mem_cgroup_uncharge_page(struct page *page)
> -{
> -	/* early check. */
> -	if (page_mapped(page))
> -		return;
> -	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
> -	/*
> -	 * If the page is in swap cache, uncharge should be deferred
> -	 * to the swap path, which also properly accounts swap usage
> -	 * and handles memcg lifetime.
> -	 *
> -	 * Note that this check is not stable and reclaim may add the
> -	 * page to swap cache at any time after this.  However, if the
> -	 * page is not in swap cache by the time page->mapcount hits
> -	 * 0, there won't be any page table references to the swap
> -	 * slot, and reclaim will free it and not actually write the
> -	 * page to disk.
> -	 */
> -	if (PageSwapCache(page))
> -		return;
> -	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
> -}
> -
> -void mem_cgroup_uncharge_cache_page(struct page *page)
> -{
> -	VM_BUG_ON_PAGE(page_mapped(page), page);
> -	VM_BUG_ON_PAGE(page->mapping, page);
> -	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
> -}
> -
>  /*
>   * Batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
>   * In that cases, pages are freed continuously and we can expect pages
> @@ -3763,6 +3591,9 @@ void mem_cgroup_uncharge_cache_page(stru
>  
>  void mem_cgroup_uncharge_start(void)
>  {
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
>  	current->memcg_batch.do_batch++;
>  	/* We can do nest. */
>  	if (current->memcg_batch.do_batch == 1) {
> @@ -3770,21 +3601,18 @@ void mem_cgroup_uncharge_start(void)
>  		current->memcg_batch.nr_pages = 0;
>  		current->memcg_batch.memsw_nr_pages = 0;
>  	}
> +	local_irq_restore(flags);
>  }
>  
>  void mem_cgroup_uncharge_end(void)
>  {
>  	struct memcg_batch_info *batch = &current->memcg_batch;
> +	unsigned long flags;
>  
> -	if (!batch->do_batch)
> -		return;
> -
> -	batch->do_batch--;
> -	if (batch->do_batch) /* If stacked, do nothing. */
> -		return;
> -
> -	if (!batch->memcg)
> -		return;
> +	local_irq_save(flags);
> +	VM_BUG_ON(!batch->do_batch);
> +	if (--batch->do_batch) /* If stacked, do nothing */
> +		goto out;
>  	/*
>  	 * This "batch->memcg" is valid without any css_get/put etc...
>  	 * bacause we hide charges behind us.
> @@ -3796,61 +3624,16 @@ void mem_cgroup_uncharge_end(void)
>  		res_counter_uncharge(&batch->memcg->memsw,
>  				     batch->memsw_nr_pages * PAGE_SIZE);
>  	memcg_oom_recover(batch->memcg);
> -	/* forget this pointer (for sanity check) */
> -	batch->memcg = NULL;
> -}
> -
> -#ifdef CONFIG_SWAP
> -/*
> - * called after __delete_from_swap_cache() and drop "page" account.
> - * memcg information is recorded to swap_cgroup of "ent"
> - */
> -void
> -mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> -{
> -	struct mem_cgroup *memcg;
> -	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
> -
> -	if (!swapout) /* this was a swap cache but the swap is unused ! */
> -		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
> -
> -	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
> -
> -	/*
> -	 * record memcg information,  if swapout && memcg != NULL,
> -	 * css_get() was called in uncharge().
> -	 */
> -	if (do_swap_account && swapout && memcg)
> -		swap_cgroup_record(ent, mem_cgroup_id(memcg));
> +out:
> +	local_irq_restore(flags);
>  }
> -#endif
>  
>  #ifdef CONFIG_MEMCG_SWAP
> -/*
> - * called from swap_entry_free(). remove record in swap_cgroup and
> - * uncharge "memsw" account.
> - */
> -void mem_cgroup_uncharge_swap(swp_entry_t ent)
> +static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
> +					 bool charge)
>  {
> -	struct mem_cgroup *memcg;
> -	unsigned short id;
> -
> -	if (!do_swap_account)
> -		return;
> -
> -	id = swap_cgroup_record(ent, 0);
> -	rcu_read_lock();
> -	memcg = mem_cgroup_lookup(id);
> -	if (memcg) {
> -		/*
> -		 * We uncharge this because swap is freed.  This memcg can
> -		 * be obsolete one. We avoid calling css_tryget_online().
> -		 */
> -		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> -		mem_cgroup_swap_statistics(memcg, false);
> -		css_put(&memcg->css);
> -	}
> -	rcu_read_unlock();
> +	int val = (charge) ? 1 : -1;
> +	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
>  }
>  
>  /**
> @@ -3902,169 +3685,6 @@ static inline int mem_cgroup_move_swap_a
>  }
>  #endif
>  
> -/*
> - * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
> - * page belongs to.
> - */
> -void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> -				  struct mem_cgroup **memcgp)
> -{
> -	struct mem_cgroup *memcg = NULL;
> -	unsigned int nr_pages = 1;
> -	struct page_cgroup *pc;
> -
> -	*memcgp = NULL;
> -
> -	if (mem_cgroup_disabled())
> -		return;
> -
> -	if (PageTransHuge(page))
> -		nr_pages <<= compound_order(page);
> -
> -	pc = lookup_page_cgroup(page);
> -	lock_page_cgroup(pc);
> -	if (PageCgroupUsed(pc)) {
> -		memcg = pc->mem_cgroup;
> -		css_get(&memcg->css);
> -		/*
> -		 * At migrating an anonymous page, its mapcount goes down
> -		 * to 0 and uncharge() will be called. But, even if it's fully
> -		 * unmapped, migration may fail and this page has to be
> -		 * charged again. We set MIGRATION flag here and delay uncharge
> -		 * until end_migration() is called
> -		 *
> -		 * Corner Case Thinking
> -		 * A)
> -		 * When the old page was mapped as Anon and it's unmap-and-freed
> -		 * while migration was ongoing.
> -		 * If unmap finds the old page, uncharge() of it will be delayed
> -		 * until end_migration(). If unmap finds a new page, it's
> -		 * uncharged when it make mapcount to be 1->0. If unmap code
> -		 * finds swap_migration_entry, the new page will not be mapped
> -		 * and end_migration() will find it(mapcount==0).
> -		 *
> -		 * B)
> -		 * When the old page was mapped but migraion fails, the kernel
> -		 * remaps it. A charge for it is kept by MIGRATION flag even
> -		 * if mapcount goes down to 0. We can do remap successfully
> -		 * without charging it again.
> -		 *
> -		 * C)
> -		 * The "old" page is under lock_page() until the end of
> -		 * migration, so, the old page itself will not be swapped-out.
> -		 * If the new page is swapped out before end_migraton, our
> -		 * hook to usual swap-out path will catch the event.
> -		 */
> -		if (PageAnon(page))
> -			SetPageCgroupMigration(pc);
> -	}
> -	unlock_page_cgroup(pc);
> -	/*
> -	 * If the page is not charged at this point,
> -	 * we return here.
> -	 */
> -	if (!memcg)
> -		return;
> -
> -	*memcgp = memcg;
> -	/*
> -	 * We charge new page before it's used/mapped. So, even if unlock_page()
> -	 * is called before end_migration, we can catch all events on this new
> -	 * page. In the case new page is migrated but not remapped, new page's
> -	 * mapcount will be finally 0 and we call uncharge in end_migration().
> -	 */
> -	/*
> -	 * The page is committed to the memcg, but it's not actually
> -	 * charged to the res_counter since we plan on replacing the
> -	 * old one and only one page is going to be left afterwards.
> -	 */
> -	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
> -}
> -
> -/* remove redundant charge if migration failed*/
> -void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> -	struct page *oldpage, struct page *newpage, bool migration_ok)
> -{
> -	struct page *used, *unused;
> -	struct page_cgroup *pc;
> -	bool anon;
> -
> -	if (!memcg)
> -		return;
> -
> -	if (!migration_ok) {
> -		used = oldpage;
> -		unused = newpage;
> -	} else {
> -		used = newpage;
> -		unused = oldpage;
> -	}
> -	anon = PageAnon(used);
> -	__mem_cgroup_uncharge_common(unused,
> -				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
> -				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
> -				     true);
> -	css_put(&memcg->css);
> -	/*
> -	 * We disallowed uncharge of pages under migration because mapcount
> -	 * of the page goes down to zero, temporarly.
> -	 * Clear the flag and check the page should be charged.
> -	 */
> -	pc = lookup_page_cgroup(oldpage);
> -	lock_page_cgroup(pc);
> -	ClearPageCgroupMigration(pc);
> -	unlock_page_cgroup(pc);
> -
> -	/*
> -	 * If a page is a file cache, radix-tree replacement is very atomic
> -	 * and we can skip this check. When it was an Anon page, its mapcount
> -	 * goes down to 0. But because we added MIGRATION flage, it's not
> -	 * uncharged yet. There are several case but page->mapcount check
> -	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
> -	 * check. (see prepare_charge() also)
> -	 */
> -	if (anon)
> -		mem_cgroup_uncharge_page(used);
> -}
> -
> -/*
> - * At replace page cache, newpage is not under any memcg but it's on
> - * LRU. So, this function doesn't touch res_counter but handles LRU
> - * in correct way. Both pages are locked so we cannot race with uncharge.
> - */
> -void mem_cgroup_replace_page_cache(struct page *oldpage,
> -				  struct page *newpage)
> -{
> -	struct mem_cgroup *memcg = NULL;
> -	struct page_cgroup *pc;
> -
> -	if (mem_cgroup_disabled())
> -		return;
> -
> -	pc = lookup_page_cgroup(oldpage);
> -	/* fix accounting on old pages */
> -	lock_page_cgroup(pc);
> -	if (PageCgroupUsed(pc)) {
> -		memcg = pc->mem_cgroup;
> -		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
> -		ClearPageCgroupUsed(pc);
> -	}
> -	unlock_page_cgroup(pc);
> -
> -	/*
> -	 * When called from shmem_replace_page(), in some cases the
> -	 * oldpage has already been charged, and in some cases not.
> -	 */
> -	if (!memcg)
> -		return;
> -	/*
> -	 * Even if newpage->mapping was NULL before starting replacement,
> -	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
> -	 * LRU while we overwrite pc->mem_cgroup.
> -	 */
> -	commit_charge(newpage, memcg, 1, false, true);
> -}
> -
>  #ifdef CONFIG_DEBUG_VM
>  static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
>  {
> @@ -4263,7 +3883,7 @@ unsigned long mem_cgroup_soft_limit_recl
>  						    gfp_mask, &nr_scanned);
>  		nr_reclaimed += reclaimed;
>  		*total_scanned += nr_scanned;
> -		spin_lock(&mctz->lock);
> +		spin_lock_irq(&mctz->lock);
>  
>  		/*
>  		 * If we failed to reclaim anything from this memory cgroup
> @@ -4303,7 +3923,7 @@ unsigned long mem_cgroup_soft_limit_recl
>  		 */
>  		/* If excess == 0, no tree ops */
>  		__mem_cgroup_insert_exceeded(mz, mctz, excess);
> -		spin_unlock(&mctz->lock);
> +		spin_unlock_irq(&mctz->lock);
>  		css_put(&mz->memcg->css);
>  		loop++;
>  		/*
> @@ -6265,9 +5885,9 @@ static enum mc_target_type get_mctgt_typ
>  	if (page) {
>  		pc = lookup_page_cgroup(page);
>  		/*
> -		 * Do only loose check w/o page_cgroup lock.
> -		 * mem_cgroup_move_account() checks the pc is valid or not under
> -		 * the lock.
> +		 * Do only loose check w/o serialization.
> +		 * mem_cgroup_move_account() checks the pc is valid or
> +		 * not under LRU exclusion.
>  		 */
>  		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
>  			ret = MC_TARGET_PAGE;
> @@ -6729,6 +6349,67 @@ static void __init enable_swap_cgroup(vo
>  }
>  #endif
>  
> +#ifdef CONFIG_MEMCG_SWAP
> +/**
> + * mem_cgroup_swapout - transfer a memsw charge to swap
> + * @page: page whose memsw charge to transfer
> + * @entry: swap entry to move the charge to
> + *
> + * Transfer the memsw charge of @page to @entry.
> + */
> +void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
> +{
> +	struct page_cgroup *pc;
> +	unsigned short oldid;
> +
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
> +	VM_BUG_ON_PAGE(page_count(page), page);
> +
> +	if (!do_swap_account)
> +		return;
> +
> +	pc = lookup_page_cgroup(page);
> +
> +	/* Readahead page, never charged */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +
> +	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
> +
> +	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
> +	VM_BUG_ON_PAGE(oldid, page);
> +
> +	pc->flags &= ~PCG_MEMSW;
> +	css_get(&pc->mem_cgroup->css);
> +	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
> +}
> +
> +/**
> + * mem_cgroup_uncharge_swap - uncharge a swap entry
> + * @entry: swap entry to uncharge
> + *
> + * Drop the memsw charge associated with @entry.
> + */
> +void mem_cgroup_uncharge_swap(swp_entry_t entry)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned short id;
> +
> +	if (!do_swap_account)
> +		return;
> +
> +	id = swap_cgroup_record(entry, 0);
> +	rcu_read_lock();
> +	memcg = mem_cgroup_lookup(id);
> +	if (memcg) {
> +		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> +		mem_cgroup_swap_statistics(memcg, false);
> +		css_put(&memcg->css);
> +	}
> +	rcu_read_unlock();
> +}
> +#endif
> +
>  /**
>   * mem_cgroup_try_charge - try charging a page
>   * @page: page to charge
> @@ -6831,7 +6512,7 @@ void mem_cgroup_commit_charge(struct pag
>  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  	}
>  
> -	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
> +	commit_charge(page, memcg, nr_pages, lrucare);
>  
>  	if (do_swap_account && PageSwapCache(page)) {
>  		swp_entry_t entry = { .val = page_private(page) };
> @@ -6873,6 +6554,139 @@ void mem_cgroup_cancel_charge(struct pag
>  	cancel_charge(memcg, nr_pages);
>  }
>  
> +/**
> + * mem_cgroup_uncharge - uncharge a page
> + * @page: page to uncharge
> + *
> + * Uncharge a page previously charged with mem_cgroup_try_charge() and
> + * mem_cgroup_commit_charge().
> + */
> +void mem_cgroup_uncharge(struct page *page)
> +{
> +	struct memcg_batch_info *batch;
> +	unsigned int nr_pages = 1;
> +	struct mem_cgroup *memcg;
> +	struct page_cgroup *pc;
> +	unsigned long pc_flags;
> +	unsigned long flags;
> +
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
> +	VM_BUG_ON_PAGE(page_count(page), page);
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	pc = lookup_page_cgroup(page);
> +
> +	/* Every final put_page() ends up here */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +
> +	if (PageTransHuge(page)) {
> +		nr_pages <<= compound_order(page);
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +	}
> +	/*
> +	 * Nobody should be changing or seriously looking at
> +	 * pc->mem_cgroup and pc->flags at this point, we have fully
> +	 * exclusive access to the page.
> +	 */
> +	memcg = pc->mem_cgroup;
> +	pc_flags = pc->flags;
> +	pc->flags = 0;
> +
> +	local_irq_save(flags);
> +
> +	if (nr_pages > 1)
> +		goto direct;
> +	if (unlikely(test_thread_flag(TIF_MEMDIE)))
> +		goto direct;
> +	batch = &current->memcg_batch;
> +	if (!batch->do_batch)
> +		goto direct;
> +	if (batch->memcg && batch->memcg != memcg)
> +		goto direct;
> +	if (!batch->memcg)
> +		batch->memcg = memcg;
> +	if (pc_flags & PCG_MEM)
> +		batch->nr_pages++;
> +	if (pc_flags & PCG_MEMSW)
> +		batch->memsw_nr_pages++;
> +	goto out;
> +direct:
> +	if (pc_flags & PCG_MEM)
> +		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
> +	if (pc_flags & PCG_MEMSW)
> +		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
> +	memcg_oom_recover(memcg);
> +out:
> +	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
> +	memcg_check_events(memcg, page);
> +
> +	local_irq_restore(flags);
> +}
> +
> +/**
> + * mem_cgroup_migrate - migrate a charge to another page
> + * @oldpage: currently charged page
> + * @newpage: page to transfer the charge to
> + * @lrucare: both pages might be on the LRU already
> + *
> + * Migrate the charge from @oldpage to @newpage.
> + *
> + * Both pages must be locked, @newpage->mapping must be set up.
> + */
> +void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> +			bool lrucare)
> +{
> +	unsigned int nr_pages = 1;
> +	struct page_cgroup *pc;
> +	int isolated;
> +
> +	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> +	VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage);
> +	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	/* Page cache replacement: new page already charged? */
> +	pc = lookup_page_cgroup(newpage);
> +	if (PageCgroupUsed(pc))
> +		return;
> +
> +	/* Re-entrant migration: old page already uncharged? */
> +	pc = lookup_page_cgroup(oldpage);
> +	if (!PageCgroupUsed(pc))
> +		return;
> +
> +	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
> +	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
> +
> +	if (PageTransHuge(oldpage)) {
> +		nr_pages <<= compound_order(oldpage);
> +		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
> +		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
> +	}
> +
> +	if (lrucare)
> +		lock_page_lru(oldpage, &isolated);
> +
> +	pc->flags = 0;
> +
> +	if (lrucare)
> +		unlock_page_lru(oldpage, isolated);
> +
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(pc->mem_cgroup, oldpage, -nr_pages);
> +	memcg_check_events(pc->mem_cgroup, oldpage);
> +	local_irq_enable();
> +
> +	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
> +}
> +
>  /*
>   * subsys_initcall() for memory controller.
>   *
> diff -puN mm/memory.c~mm-memcontrol-rewrite-uncharge-api mm/memory.c
> --- a/mm/memory.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/memory.c
> @@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_
>  		details = NULL;
>  
>  	BUG_ON(addr >= end);
> -	mem_cgroup_uncharge_start();
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
> @@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_
>  		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
>  	} while (pgd++, addr = next, addr != end);
>  	tlb_end_vma(tlb, vma);
> -	mem_cgroup_uncharge_end();
>  }
>  
>  
> diff -puN mm/migrate.c~mm-memcontrol-rewrite-uncharge-api mm/migrate.c
> --- a/mm/migrate.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/migrate.c
> @@ -780,6 +780,7 @@ static int move_to_new_page(struct page
>  	if (rc != MIGRATEPAGE_SUCCESS) {
>  		newpage->mapping = NULL;
>  	} else {
> +		mem_cgroup_migrate(page, newpage, false);
>  		if (remap_swapcache)
>  			remove_migration_ptes(page, newpage);
>  		page->mapping = NULL;
> @@ -795,7 +796,6 @@ static int __unmap_and_move(struct page
>  {
>  	int rc = -EAGAIN;
>  	int remap_swapcache = 1;
> -	struct mem_cgroup *mem;
>  	struct anon_vma *anon_vma = NULL;
>  
>  	if (!trylock_page(page)) {
> @@ -821,9 +821,6 @@ static int __unmap_and_move(struct page
>  		lock_page(page);
>  	}
>  
> -	/* charge against new page */
> -	mem_cgroup_prepare_migration(page, newpage, &mem);
> -
>  	if (PageWriteback(page)) {
>  		/*
>  		 * Only in the case of a full synchronous migration is it
> @@ -833,10 +830,10 @@ static int __unmap_and_move(struct page
>  		 */
>  		if (mode != MIGRATE_SYNC) {
>  			rc = -EBUSY;
> -			goto uncharge;
> +			goto out_unlock;
>  		}
>  		if (!force)
> -			goto uncharge;
> +			goto out_unlock;
>  		wait_on_page_writeback(page);
>  	}
>  	/*
> @@ -872,7 +869,7 @@ static int __unmap_and_move(struct page
>  			 */
>  			remap_swapcache = 0;
>  		} else {
> -			goto uncharge;
> +			goto out_unlock;
>  		}
>  	}
>  
> @@ -885,7 +882,7 @@ static int __unmap_and_move(struct page
>  		 * the page migration right away (proteced by page lock).
>  		 */
>  		rc = balloon_page_migrate(newpage, page, mode);
> -		goto uncharge;
> +		goto out_unlock;
>  	}
>  
>  	/*
> @@ -904,7 +901,7 @@ static int __unmap_and_move(struct page
>  		VM_BUG_ON_PAGE(PageAnon(page), page);
>  		if (page_has_private(page)) {
>  			try_to_free_buffers(page);
> -			goto uncharge;
> +			goto out_unlock;
>  		}
>  		goto skip_unmap;
>  	}
> @@ -923,10 +920,7 @@ skip_unmap:
>  	if (anon_vma)
>  		put_anon_vma(anon_vma);
>  
> -uncharge:
> -	mem_cgroup_end_migration(mem, page, newpage,
> -				 (rc == MIGRATEPAGE_SUCCESS ||
> -				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
> +out_unlock:
>  	unlock_page(page);
>  out:
>  	return rc;
> @@ -1786,7 +1780,6 @@ int migrate_misplaced_transhuge_page(str
>  	pg_data_t *pgdat = NODE_DATA(node);
>  	int isolated = 0;
>  	struct page *new_page = NULL;
> -	struct mem_cgroup *memcg = NULL;
>  	int page_lru = page_is_file_cache(page);
>  	unsigned long mmun_start = address & HPAGE_PMD_MASK;
>  	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> @@ -1852,15 +1845,6 @@ fail_putback:
>  		goto out_unlock;
>  	}
>  
> -	/*
> -	 * Traditional migration needs to prepare the memcg charge
> -	 * transaction early to prevent the old page from being
> -	 * uncharged when installing migration entries.  Here we can
> -	 * save the potential rollback and start the charge transfer
> -	 * only when migration is already known to end successfully.
> -	 */
> -	mem_cgroup_prepare_migration(page, new_page, &memcg);
> -
>  	orig_entry = *pmd;
>  	entry = mk_pmd(new_page, vma->vm_page_prot);
>  	entry = pmd_mkhuge(entry);
> @@ -1888,14 +1872,10 @@ fail_putback:
>  		goto fail_putback;
>  	}
>  
> +	mem_cgroup_migrate(page, new_page, false);
> +
>  	page_remove_rmap(page);
>  
> -	/*
> -	 * Finish the charge transaction under the page table lock to
> -	 * prevent split_huge_page() from dividing up the charge
> -	 * before it's fully transferred to the new page.
> -	 */
> -	mem_cgroup_end_migration(memcg, page, new_page, true);
>  	spin_unlock(ptl);
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  
> diff -puN mm/rmap.c~mm-memcontrol-rewrite-uncharge-api mm/rmap.c
> --- a/mm/rmap.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/rmap.c
> @@ -1089,7 +1089,6 @@ void page_remove_rmap(struct page *page)
>  	if (unlikely(PageHuge(page)))
>  		goto out;
>  	if (anon) {
> -		mem_cgroup_uncharge_page(page);
>  		if (PageTransHuge(page))
>  			__dec_zone_page_state(page,
>  					      NR_ANON_TRANSPARENT_HUGEPAGES);
> diff -puN mm/shmem.c~mm-memcontrol-rewrite-uncharge-api mm/shmem.c
> --- a/mm/shmem.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/shmem.c
> @@ -406,7 +406,6 @@ static void shmem_undo_range(struct inod
>  			pvec.pages, indices);
>  		if (!pvec.nr)
>  			break;
> -		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> @@ -434,7 +433,6 @@ static void shmem_undo_range(struct inod
>  		}
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
> -		mem_cgroup_uncharge_end();
>  		cond_resched();
>  		index++;
>  	}
> @@ -482,7 +480,6 @@ static void shmem_undo_range(struct inod
>  			index = start;
>  			continue;
>  		}
> -		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> @@ -518,7 +515,6 @@ static void shmem_undo_range(struct inod
>  		}
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
> -		mem_cgroup_uncharge_end();
>  		index++;
>  	}
>  
> @@ -818,7 +814,7 @@ static int shmem_writepage(struct page *
>  	}
>  
>  	mutex_unlock(&shmem_swaplist_mutex);
> -	swapcache_free(swap, NULL);
> +	swapcache_free(swap);
>  redirty:
>  	set_page_dirty(page);
>  	if (wbc->for_reclaim)
> @@ -991,7 +987,7 @@ static int shmem_replace_page(struct pag
>  		 */
>  		oldpage = newpage;
>  	} else {
> -		mem_cgroup_replace_page_cache(oldpage, newpage);
> +		mem_cgroup_migrate(oldpage, newpage, false);
>  		lru_cache_add_anon(newpage);
>  		*pagep = newpage;
>  	}
> diff -puN mm/swap.c~mm-memcontrol-rewrite-uncharge-api mm/swap.c
> --- a/mm/swap.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/swap.c
> @@ -62,6 +62,7 @@ static void __page_cache_release(struct
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
> +	mem_cgroup_uncharge(page);
>  }
>  
>  static void __put_single_page(struct page *page)
> @@ -915,6 +916,8 @@ void release_pages(struct page **pages,
>  	struct lruvec *lruvec;
>  	unsigned long uninitialized_var(flags);
>  
> +	mem_cgroup_uncharge_start();
> +
>  	for (i = 0; i < nr; i++) {
>  		struct page *page = pages[i];
>  
> @@ -946,6 +949,7 @@ void release_pages(struct page **pages,
>  			__ClearPageLRU(page);
>  			del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		}
> +		mem_cgroup_uncharge(page);
>  
>  		/* Clear Active bit in case of parallel mark_page_accessed */
>  		__ClearPageActive(page);
> @@ -955,6 +959,8 @@ void release_pages(struct page **pages,
>  	if (zone)
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  
> +	mem_cgroup_uncharge_end();
> +
>  	free_hot_cold_page_list(&pages_to_free, cold);
>  }
>  EXPORT_SYMBOL(release_pages);
> diff -puN mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api mm/swap_state.c
> --- a/mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/swap_state.c
> @@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struc
>  
>  	if (unlikely(PageTransHuge(page)))
>  		if (unlikely(split_huge_page_to_list(page, list))) {
> -			swapcache_free(entry, NULL);
> +			swapcache_free(entry);
>  			return 0;
>  		}
>  
> @@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struc
>  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
>  		 * clear SWAP_HAS_CACHE flag.
>  		 */
> -		swapcache_free(entry, NULL);
> +		swapcache_free(entry);
>  		return 0;
>  	}
>  }
> @@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page
>  	__delete_from_swap_cache(page);
>  	spin_unlock_irq(&address_space->tree_lock);
>  
> -	swapcache_free(entry, page);
> +	swapcache_free(entry);
>  	page_cache_release(page);
>  }
>  
> @@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_e
>  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
>  		 * clear SWAP_HAS_CACHE flag.
>  		 */
> -		swapcache_free(entry, NULL);
> +		swapcache_free(entry);
>  	} while (err != -ENOMEM);
>  
>  	if (new_page)
> diff -puN mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api mm/swapfile.c
> --- a/mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/swapfile.c
> @@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
> -void swapcache_free(swp_entry_t entry, struct page *page)
> +void swapcache_free(swp_entry_t entry)
>  {
>  	struct swap_info_struct *p;
> -	unsigned char count;
>  
>  	p = swap_info_get(entry);
>  	if (p) {
> -		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
> -		if (page)
> -			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
> +		swap_entry_free(p, entry, SWAP_HAS_CACHE);
>  		spin_unlock(&p->lock);
>  	}
>  }
> diff -puN mm/truncate.c~mm-memcontrol-rewrite-uncharge-api mm/truncate.c
> --- a/mm/truncate.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/truncate.c
> @@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct a
>  	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
>  			min(end - index, (pgoff_t)PAGEVEC_SIZE),
>  			indices)) {
> -		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> @@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct a
>  		}
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
> -		mem_cgroup_uncharge_end();
>  		cond_resched();
>  		index++;
>  	}
> @@ -369,7 +367,6 @@ void truncate_inode_pages_range(struct a
>  			pagevec_release(&pvec);
>  			break;
>  		}
> -		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> @@ -394,7 +391,6 @@ void truncate_inode_pages_range(struct a
>  		}
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
> -		mem_cgroup_uncharge_end();
>  		index++;
>  	}
>  	cleancache_invalidate_inode(mapping);
> @@ -493,7 +489,6 @@ unsigned long invalidate_mapping_pages(s
>  	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
>  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
>  			indices)) {
> -		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> @@ -522,7 +517,6 @@ unsigned long invalidate_mapping_pages(s
>  		}
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
> -		mem_cgroup_uncharge_end();
>  		cond_resched();
>  		index++;
>  	}
> @@ -553,7 +547,6 @@ invalidate_complete_page2(struct address
>  	BUG_ON(page_has_private(page));
>  	__delete_from_page_cache(page, NULL);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	mem_cgroup_uncharge_cache_page(page);
>  
>  	if (mapping->a_ops->freepage)
>  		mapping->a_ops->freepage(page);
> @@ -602,7 +595,6 @@ int invalidate_inode_pages2_range(struct
>  	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
>  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
>  			indices)) {
> -		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> @@ -655,7 +647,6 @@ int invalidate_inode_pages2_range(struct
>  		}
>  		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
> -		mem_cgroup_uncharge_end();
>  		cond_resched();
>  		index++;
>  	}
> diff -puN mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api mm/vmscan.c
> --- a/mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/vmscan.c
> @@ -571,9 +571,10 @@ static int __remove_mapping(struct addre
>  
>  	if (PageSwapCache(page)) {
>  		swp_entry_t swap = { .val = page_private(page) };
> +		mem_cgroup_swapout(page, swap);
>  		__delete_from_swap_cache(page);
>  		spin_unlock_irq(&mapping->tree_lock);
> -		swapcache_free(swap, page);
> +		swapcache_free(swap);
>  	} else {
>  		void (*freepage)(struct page *);
>  		void *shadow = NULL;
> @@ -594,7 +595,6 @@ static int __remove_mapping(struct addre
>  			shadow = workingset_eviction(mapping, page);
>  		__delete_from_page_cache(page, shadow);
>  		spin_unlock_irq(&mapping->tree_lock);
> -		mem_cgroup_uncharge_cache_page(page);
>  
>  		if (freepage != NULL)
>  			freepage(page);
> @@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(st
>  		 */
>  		__clear_page_locked(page);
>  free_it:
> +		mem_cgroup_uncharge(page);
>  		nr_reclaimed++;
>  
>  		/*
> @@ -1126,12 +1127,13 @@ keep:
>  		list_add(&page->lru, &ret_pages);
>  		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>  	}
> +	mem_cgroup_uncharge_end();
>  
>  	free_hot_cold_page_list(&free_pages, true);
>  
>  	list_splice(&ret_pages, page_list);
>  	count_vm_events(PGACTIVATE, pgactivate);
> -	mem_cgroup_uncharge_end();
> +
>  	*ret_nr_dirty += nr_dirty;
>  	*ret_nr_congested += nr_congested;
>  	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
> @@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lr
>  			__ClearPageActive(page);
>  			del_page_from_lru_list(page, lruvec, lru);
>  
> +			mem_cgroup_uncharge(page);
> +
>  			if (unlikely(PageCompound(page))) {
>  				spin_unlock_irq(&zone->lru_lock);
>  				(*get_compound_page_dtor(page))(page);
> @@ -1650,6 +1654,8 @@ static void move_active_pages_to_lru(str
>  			__ClearPageActive(page);
>  			del_page_from_lru_list(page, lruvec, lru);
>  
> +			mem_cgroup_uncharge(page);
> +
>  			if (unlikely(PageCompound(page))) {
>  				spin_unlock_irq(&zone->lru_lock);
>  				(*get_compound_page_dtor(page))(page);
> diff -puN mm/zswap.c~mm-memcontrol-rewrite-uncharge-api mm/zswap.c
> --- a/mm/zswap.c~mm-memcontrol-rewrite-uncharge-api
> +++ a/mm/zswap.c
> @@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp
>  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
>  		 * clear SWAP_HAS_CACHE flag.
>  		 */
> -		swapcache_free(entry, NULL);
> +		swapcache_free(entry);
>  	} while (err != -ENOMEM);
>  
>  	if (new_page)
> _
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch
  2014-08-06 20:59 mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch Andrew Morton
  2014-08-06 21:00 ` mm: memcontrol: rewrite charge API Andrew Morton
  2014-08-07  2:26 ` mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch Johannes Weiner
@ 2014-08-07  7:44 ` Michal Hocko
  2 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-08-07  7:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Johannes Weiner, linux-mm

On Wed 06-08-14 13:59:14, Andrew Morton wrote:
> 
> Do we feel these are ready for merging?

Yes, let's go with it. I do not think that waiting for the next cycle
will help much.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm: memcontrol: rewrite uncharge API
  2014-08-07  7:38       ` Michal Hocko
@ 2014-08-07 16:25         ` Johannes Weiner
  2014-08-07 22:40           ` Andrew Morton
  2014-08-08 11:42           ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Johannes Weiner @ 2014-08-07 16:25 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm

On Thu, Aug 07, 2014 at 09:38:26AM +0200, Michal Hocko wrote:
> On Wed 06-08-14 14:02:35, Andrew Morton wrote:
> > On Wed, 6 Aug 2014 14:00:55 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Subject: mm: memcontrol: rewrite uncharge API
> > > 
> > 
> > Nope, sorry, that was missing
> > mm-memcontrol-rewrite-uncharge-api-fix-clear-page-mapping-in-migration.patch.
> > 
> > This time:
> > 
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Subject: mm: memcontrol: rewrite uncharge API
> > 
> > The memcg uncharging code that is involved towards the end of a page's
> > lifetime - truncation, reclaim, swapout, migration - is impressively
> > complicated and fragile.
> > 
> > Because anonymous and file pages were always charged before they had their
> > page->mapping established, uncharges had to happen when the page type
> > could still be known from the context; as in unmap for anonymous, page
> > cache removal for file and shmem pages, and swap cache truncation for swap
> > pages.  However, these operations happen well before the page is actually
> > freed, and so a lot of synchronization is necessary:
> > 
> > - Charging, uncharging, page migration, and charge migration all need
> >   to take a per-page bit spinlock as they could race with uncharging.
> > 
> > - Swap cache truncation happens during both swap-in and swap-out, and
> >   possibly repeatedly before the page is actually freed.  This means
> >   that the memcg swapout code is called from many contexts that make
> >   no sense and it has to figure out the direction from page state to
> >   make sure memory and memory+swap are always correctly charged.
> > 
> > - On page migration, the old page might be unmapped but then reused,
> >   so memcg code has to prevent untimely uncharging in that case.
> >   Because this code - which should be a simple charge transfer - is so
> >   special-cased, it is not reusable for replace_page_cache().
> > 
> > But now that charged pages always have a page->mapping, introduce
> > mem_cgroup_uncharge(), which is called after the final put_page(), when we
> > know for sure that nobody is looking at the page anymore.
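
(For orientation, condensed from the mm/swap.c hunk quoted further down: the
final-put path now pulls the page off the LRU and then uncharges it with no
references left -- a sketch only, details of the real function elided:)

	static void __page_cache_release(struct page *page)
	{
		if (PageLRU(page)) {
			struct zone *zone = page_zone(page);
			struct lruvec *lruvec;
			unsigned long flags;

			spin_lock_irqsave(&zone->lru_lock, flags);
			lruvec = mem_cgroup_page_lruvec(page, zone);
			__ClearPageLRU(page);
			del_page_from_lru_list(page, lruvec, page_off_lru(page));
			spin_unlock_irqrestore(&zone->lru_lock, flags);
		}
		/* zero references: nobody else can see the page anymore */
		mem_cgroup_uncharge(page);
	}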
> > 
> > For page migration, introduce mem_cgroup_migrate(), which is called after
> > the migration is successful and the new page is fully rmapped.  Because
> > the old page is no longer uncharged after migration, prevent double
> > charges by decoupling the page's memcg association (PCG_USED and
> > pc->mem_cgroup) from the page holding an actual charge.  The new bits
> > PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
> > to the new page during migration.
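
(For orientation, the new per-page charge state bits, as defined in the
include/linux/page_cgroup.h hunk quoted further down -- repeated here only
as a sketch, with the unchanged parts of the enum elided:)

	enum {
		/* flags for mem_cgroup */
		PCG_USED = 0x01,	/* This page is charged to a memcg */
		PCG_MEM = 0x02,		/* This page holds a memory charge */
		PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
		/* ... */
	};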
> > 
> > mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
> > gets rid of mem_cgroup_replace_page_cache().
> > 
> > Swap accounting is massively simplified: because the page is no longer
> > uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
> > transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
> > before the final put_page() in page reclaim.
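
(For orientation, the reclaim-side ordering this describes, condensed from
the mm/vmscan.c __remove_mapping() hunk quoted further down -- a sketch
only:)

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };

		/* hand the memory+swap charge over to the swap entry... */
		mem_cgroup_swapout(page, swap);
		__delete_from_swap_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		/* ...so swapcache_free() no longer needs the page */
		swapcache_free(swap);
	}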
> > 
> > Finally, page_cgroup changes are now protected by whatever protection the
> > page itself offers: anonymous pages are charged under the page table lock,
> > whereas page cache insertions, swapin, and migration hold the page lock. 
> > Uncharging happens under full exclusion with no outstanding references. 
> > Charging and uncharging also ensure that the page is off-LRU, which
> > serializes against charge migration.  Remove the very costly page_cgroup
> > lock and set pc->flags non-atomically.
> 
> I see some point in squashing all the fixups into the single patch, but I
> am afraid we have lost some interesting details from the fixups this time.
> I think that at least
> mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch and
> mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration-2.patch
> should either go in on their own _or_ have their changelogs folded in here.
> The whole page cache replacement path is obscure, and we should have it
> documented so we do not have to google for details or go through painful
> code inspection next time.

I agree, we would lose something there.  There is a paragraph in the
changelog that says:

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache().

Could you please update it to say:

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache().  However, care
needs to be taken because both the source and the target page can
already be charged and on the LRU when fuse is splicing: grab the page
lock on the charge moving side to prevent changing pc->mem_cgroup of a
page under migration.  Also, the lruvecs of both pages change as we
uncharge the old and charge the new during migration, and putback may
race with us, so grab the lru lock and isolate the pages iff on LRU to
prevent races and ensure the pages are on the right lruvec afterward.
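
For reference, the lrucare part of mem_cgroup_migrate() then boils down to
the following (a condensed sketch of the hunk in the patch, not additional
changelog text):

	if (lrucare)
		lock_page_lru(oldpage, &isolated);	/* isolate iff on LRU */

	pc->flags = 0;					/* uncharge the old page */

	if (lrucare)
		unlock_page_lru(oldpage, isolated);	/* put back on its lruvec */

	local_irq_disable();
	mem_cgroup_charge_statistics(pc->mem_cgroup, oldpage, -nr_pages);
	memcg_check_events(pc->mem_cgroup, oldpage);
	local_irq_enable();

	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);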

> > [vdavydov@parallels.com: fix flags definition]
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Vladimir Davydov <vdavydov@parallels.com>
> > Tested-by: Jet Chen <jet.chen@intel.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> If this comes from the above change then it should probably be removed.
> You can replace it with my Acked-by. I have acked all the follow-up fixes
> but forgot to ack the initial patch.

Thanks!

> > Tested-by: Felipe Balbi <balbi@ti.com>
> 
> this tested-by came from the same preempt_{en,dis}able patch AFAICS.

Yeah it might be a bit overreaching to apply this to the full change.

On a different note, Michal, I just scrolled through the 2000 lines
that follow to see if you had any more comments, but there was only
your signature at the bottom.  Please look at the quoted context that
remains after you insert your inline comments and trim it accordingly.

Thanks,
Johannes

> > Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > 
> >  Documentation/cgroups/memcg_test.txt |  126 ---
> >  include/linux/memcontrol.h           |   49 -
> >  include/linux/page_cgroup.h          |   43 -
> >  include/linux/swap.h                 |   12 
> >  mm/filemap.c                         |    4 
> >  mm/memcontrol.c                      |  832 +++++++++----------------
> >  mm/memory.c                          |    2 
> >  mm/migrate.c                         |   38 -
> >  mm/rmap.c                            |    1 
> >  mm/shmem.c                           |    8 
> >  mm/swap.c                            |    6 
> >  mm/swap_state.c                      |    8 
> >  mm/swapfile.c                        |    7 
> >  mm/truncate.c                        |    9 
> >  mm/vmscan.c                          |   12 
> >  mm/zswap.c                           |    2 
> >  16 files changed, 390 insertions(+), 769 deletions(-)
> > 
> > diff -puN Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api Documentation/cgroups/memcg_test.txt
> > --- a/Documentation/cgroups/memcg_test.txt~mm-memcontrol-rewrite-uncharge-api
> > +++ a/Documentation/cgroups/memcg_test.txt
> > @@ -29,28 +29,13 @@ Please note that implementation details
> >  2. Uncharge
> >    a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
> >  
> > -	mem_cgroup_uncharge_page()
> > -	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
> > -	  to 0. If the page is SwapCache, uncharge is delayed until
> > -	  mem_cgroup_uncharge_swapcache().
> > -
> > -	mem_cgroup_uncharge_cache_page()
> > -	  Called when a page-cache is deleted from radix-tree. If the page is
> > -	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
> > -
> > -	mem_cgroup_uncharge_swapcache()
> > -	  Called when SwapCache is removed from radix-tree. The charge itself
> > -	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
> > -	  charge to swap occurs.)
> > +	mem_cgroup_uncharge()
> > +	  Called when a page's refcount goes down to 0.
> >  
> >  	mem_cgroup_uncharge_swap()
> >  	  Called when swp_entry's refcnt goes down to 0. A charge against swap
> >  	  disappears.
> >  
> > -	mem_cgroup_end_migration(old, new)
> > -	At success of migration old is uncharged (if necessary), a charge
> > -	to new page is committed. At failure, charge to old page is committed.
> > -
> >  3. charge-commit-cancel
> >  	Memcg pages are charged in two steps:
> >  		mem_cgroup_try_charge()
> > @@ -69,18 +54,6 @@ Under below explanation, we assume CONFI
> >  	Anonymous page is newly allocated at
> >  		  - page fault into MAP_ANONYMOUS mapping.
> >  		  - Copy-On-Write.
> > - 	It is charged right after it's allocated before doing any page table
> > -	related operations. Of course, it's uncharged when another page is used
> > -	for the fault address.
> > -
> > -	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
> > -	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
> > -	are done at page_remove_rmap() when page_mapcount() goes down to 0.
> > -
> > -	Another page freeing is by page-reclaim (vmscan.c) and anonymous
> > -	pages are swapped out. In this case, the page is marked as
> > -	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
> > -	as SwapCache(). It's delayed until __delete_from_swap_cache().
> >  
> >  	4.1 Swap-in.
> >  	At swap-in, the page is taken from swap-cache. There are 2 cases.
> > @@ -89,41 +62,6 @@ Under below explanation, we assume CONFI
> >  	(b) If the SwapCache has been mapped by processes, it has been
> >  	    charged already.
> >  
> > -	This swap-in is one of the most complicated work. In do_swap_page(),
> > -	following events occur when pte is unchanged.
> > -
> > -	(1) the page (SwapCache) is looked up.
> > -	(2) lock_page()
> > -	(3) try_charge_swapin()
> > -	(4) reuse_swap_page() (may call delete_swap_cache())
> > -	(5) commit_charge_swapin()
> > -	(6) swap_free().
> > -
> > -	Considering following situation for example.
> > -
> > -	(A) The page has not been charged before (2) and reuse_swap_page()
> > -	    doesn't call delete_from_swap_cache().
> > -	(B) The page has not been charged before (2) and reuse_swap_page()
> > -	    calls delete_from_swap_cache().
> > -	(C) The page has been charged before (2) and reuse_swap_page() doesn't
> > -	    call delete_from_swap_cache().
> > -	(D) The page has been charged before (2) and reuse_swap_page() calls
> > -	    delete_from_swap_cache().
> > -
> > -	    memory.usage/memsw.usage changes to this page/swp_entry will be
> > -	 Case          (A)      (B)       (C)     (D)
> > -         Event
> > -       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
> > -          ===========================================
> > -          (3)        +1/+1    +1/+1     +1/+1   +1/+1
> > -          (4)          -       0/ 0       -     -1/ 0
> > -          (5)         0/-1     0/ 0     -1/-1    0/ 0
> > -          (6)          -       0/-1       -      0/-1
> > -          ===========================================
> > -       Result         1/ 1     1/ 1      1/ 1    1/ 1
> > -
> > -       In any cases, charges to this page should be 1/ 1.
> > -
> >  	4.2 Swap-out.
> >  	At swap-out, typical state transition is below.
> >  
> > @@ -136,28 +74,20 @@ Under below explanation, we assume CONFI
> >  	    swp_entry's refcnt -= 1.
> >  
> >  
> > -	At (b), the page is marked as SwapCache and not uncharged.
> > -	At (d), the page is removed from SwapCache and a charge in page_cgroup
> > -	is moved to swap_cgroup.
> > -
> >  	Finally, at task exit,
> >  	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
> > -	Here, a charge in swap_cgroup disappears.
> >  
> >  5. Page Cache
> >     	Page Cache is charged at
> >  	- add_to_page_cache_locked().
> >  
> > -	uncharged at
> > -	- __remove_from_page_cache().
> > -
> >  	The logic is very clear. (About migration, see below)
> >  	Note: __remove_from_page_cache() is called by remove_from_page_cache()
> >  	and __remove_mapping().
> >  
> >  6. Shmem(tmpfs) Page Cache
> > -	Memcg's charge/uncharge have special handlers of shmem. The best way
> > -	to understand shmem's page state transition is to read mm/shmem.c.
> > +	The best way to understand shmem's page state transition is to read
> > +	mm/shmem.c.
> >  	But brief explanation of the behavior of memcg around shmem will be
> >  	helpful to understand the logic.
> >  
> > @@ -170,56 +100,10 @@ Under below explanation, we assume CONFI
> >  	It's charged when...
> >  	- A new page is added to shmem's radix-tree.
> >  	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
> > -	It's uncharged when
> > -	- A page is removed from radix-tree and not SwapCache.
> > -	- When SwapCache is removed, a charge is moved to swap_cgroup.
> > -	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
> > -	  disappears.
> >  
> >  7. Page Migration
> > -   	One of the most complicated functions is page-migration-handler.
> > -	Memcg has 2 routines. Assume that we are migrating a page's contents
> > -	from OLDPAGE to NEWPAGE.
> > -
> > -	Usual migration logic is..
> > -	(a) remove the page from LRU.
> > -	(b) allocate NEWPAGE (migration target)
> > -	(c) lock by lock_page().
> > -	(d) unmap all mappings.
> > -	(e-1) If necessary, replace entry in radix-tree.
> > -	(e-2) move contents of a page.
> > -	(f) map all mappings again.
> > -	(g) pushback the page to LRU.
> > -	(-) OLDPAGE will be freed.
> > -
> > -	Before (g), memcg should complete all necessary charge/uncharge to
> > -	NEWPAGE/OLDPAGE.
> > -
> > -	The point is....
> > -	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
> > -          try_to_unmap() drops all mapcount and the page will not be
> > -	  SwapCache.
> > -
> > -	- If OLDPAGE is SwapCache, charges will be kept at (g) because
> > -	  __delete_from_swap_cache() isn't called at (e-1)
> > -
> > -	- If OLDPAGE is page-cache, charges will be kept at (g) because
> > -	  remove_from_swap_cache() isn't called at (e-1)
> > -
> > -	memcg provides following hooks.
> > -
> > -	- mem_cgroup_prepare_migration(OLDPAGE)
> > -	  Called after (b) to account a charge (usage += PAGE_SIZE) against
> > -	  memcg which OLDPAGE belongs to.
> > -
> > -        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
> > -	  Called after (f) before (g).
> > -	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
> > -	  charged, a charge by prepare_migration() is automatically canceled.
> > -	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
> >  
> > -	  But zap_pte() (by exit or munmap) can be called while migration,
> > -	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
> > +	mem_cgroup_migrate()
> >  
> >  8. LRU
> >          Each memcg has its own private LRU. Now, its handling is under global
> > diff -puN include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api include/linux/memcontrol.h
> > --- a/include/linux/memcontrol.h~mm-memcontrol-rewrite-uncharge-api
> > +++ a/include/linux/memcontrol.h
> > @@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct pag
> >  			      bool lrucare);
> >  void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
> >  
> > -struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> > -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> > +void mem_cgroup_uncharge(struct page *page);
> > +
> > +/* Batched uncharging */
> > +void mem_cgroup_uncharge_start(void);
> > +void mem_cgroup_uncharge_end(void);
> >  
> > -/* For coalescing uncharge for reducing memcg' overhead*/
> > -extern void mem_cgroup_uncharge_start(void);
> > -extern void mem_cgroup_uncharge_end(void);
> > +void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> > +			bool lrucare);
> >  
> > -extern void mem_cgroup_uncharge_page(struct page *page);
> > -extern void mem_cgroup_uncharge_cache_page(struct page *page);
> > +struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> > +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> >  
> >  bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
> >  				  struct mem_cgroup *memcg);
> > @@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_str
> >  
> >  extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
> >  
> > -extern void
> > -mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> > -			     struct mem_cgroup **memcgp);
> > -extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> > -	struct page *oldpage, struct page *newpage, bool migration_ok);
> > -
> >  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
> >  				   struct mem_cgroup *,
> >  				   struct mem_cgroup_reclaim_cookie *);
> > @@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(st
> >  void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
> >  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >  					struct task_struct *p);
> > -extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> > -					struct page *newpage);
> >  
> >  static inline void mem_cgroup_oom_enable(void)
> >  {
> > @@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_cha
> >  {
> >  }
> >  
> > -static inline void mem_cgroup_uncharge_start(void)
> > +static inline void mem_cgroup_uncharge(struct page *page)
> >  {
> >  }
> >  
> > -static inline void mem_cgroup_uncharge_end(void)
> > +static inline void mem_cgroup_uncharge_start(void)
> >  {
> >  }
> >  
> > -static inline void mem_cgroup_uncharge_page(struct page *page)
> > +static inline void mem_cgroup_uncharge_end(void)
> >  {
> >  }
> >  
> > -static inline void mem_cgroup_uncharge_cache_page(struct page *page)
> > +static inline void mem_cgroup_migrate(struct page *oldpage,
> > +				      struct page *newpage,
> > +				      bool lrucare)
> >  {
> >  }
> >  
> > @@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
> >  	return NULL;
> >  }
> >  
> > -static inline void
> > -mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> > -			     struct mem_cgroup **memcgp)
> > -{
> > -}
> > -
> > -static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> > -		struct page *oldpage, struct page *newpage, bool migration_ok)
> > -{
> > -}
> > -
> >  static inline struct mem_cgroup *
> >  mem_cgroup_iter(struct mem_cgroup *root,
> >  		struct mem_cgroup *prev,
> > @@ -392,10 +377,6 @@ static inline
> >  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
> >  {
> >  }
> > -static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
> > -				struct page *newpage)
> > -{
> > -}
> >  #endif /* CONFIG_MEMCG */
> >  
> >  #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> > diff -puN include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api include/linux/page_cgroup.h
> > --- a/include/linux/page_cgroup.h~mm-memcontrol-rewrite-uncharge-api
> > +++ a/include/linux/page_cgroup.h
> > @@ -3,9 +3,9 @@
> >  
> >  enum {
> >  	/* flags for mem_cgroup */
> > -	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
> > -	PCG_USED, /* this object is in use. */
> > -	PCG_MIGRATION, /* under page migration */
> > +	PCG_USED = 0x01,	/* This page is charged to a memcg */
> > +	PCG_MEM = 0x02,		/* This page holds a memory charge */
> > +	PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
> >  	__NR_PCG_FLAGS,
> >  };
> >  
> > @@ -44,42 +44,9 @@ static inline void __init page_cgroup_in
> >  struct page_cgroup *lookup_page_cgroup(struct page *page);
> >  struct page *lookup_cgroup_page(struct page_cgroup *pc);
> >  
> > -#define TESTPCGFLAG(uname, lname)			\
> > -static inline int PageCgroup##uname(struct page_cgroup *pc)	\
> > -	{ return test_bit(PCG_##lname, &pc->flags); }
> > -
> > -#define SETPCGFLAG(uname, lname)			\
> > -static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
> > -	{ set_bit(PCG_##lname, &pc->flags);  }
> > -
> > -#define CLEARPCGFLAG(uname, lname)			\
> > -static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> > -	{ clear_bit(PCG_##lname, &pc->flags);  }
> > -
> > -#define TESTCLEARPCGFLAG(uname, lname)			\
> > -static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
> > -	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
> > -
> > -TESTPCGFLAG(Used, USED)
> > -CLEARPCGFLAG(Used, USED)
> > -SETPCGFLAG(Used, USED)
> > -
> > -SETPCGFLAG(Migration, MIGRATION)
> > -CLEARPCGFLAG(Migration, MIGRATION)
> > -TESTPCGFLAG(Migration, MIGRATION)
> > -
> > -static inline void lock_page_cgroup(struct page_cgroup *pc)
> > -{
> > -	/*
> > -	 * Don't take this lock in IRQ context.
> > -	 * This lock is for pc->mem_cgroup, USED, MIGRATION
> > -	 */
> > -	bit_spin_lock(PCG_LOCK, &pc->flags);
> > -}
> > -
> > -static inline void unlock_page_cgroup(struct page_cgroup *pc)
> > +static inline int PageCgroupUsed(struct page_cgroup *pc)
> >  {
> > -	bit_spin_unlock(PCG_LOCK, &pc->flags);
> > +	return !!(pc->flags & PCG_USED);
> >  }
> >  
> >  #else /* CONFIG_MEMCG */
> > diff -puN include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api include/linux/swap.h
> > --- a/include/linux/swap.h~mm-memcontrol-rewrite-uncharge-api
> > +++ a/include/linux/swap.h
> > @@ -382,9 +382,13 @@ static inline int mem_cgroup_swappiness(
> >  }
> >  #endif
> >  #ifdef CONFIG_MEMCG_SWAP
> > -extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> > +extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> > +extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
> >  #else
> > -static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
> > +static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
> > +{
> > +}
> > +static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
> >  {
> >  }
> >  #endif
> > @@ -444,7 +448,7 @@ extern void swap_shmem_alloc(swp_entry_t
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> >  extern void swap_free(swp_entry_t);
> > -extern void swapcache_free(swp_entry_t, struct page *page);
> > +extern void swapcache_free(swp_entry_t);
> >  extern int free_swap_and_cache(swp_entry_t);
> >  extern int swap_type_of(dev_t, sector_t, struct block_device **);
> >  extern unsigned int count_swap_pages(int, int);
> > @@ -508,7 +512,7 @@ static inline void swap_free(swp_entry_t
> >  {
> >  }
> >  
> > -static inline void swapcache_free(swp_entry_t swp, struct page *page)
> > +static inline void swapcache_free(swp_entry_t swp)
> >  {
> >  }
> >  
> > diff -puN mm/filemap.c~mm-memcontrol-rewrite-uncharge-api mm/filemap.c
> > --- a/mm/filemap.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/filemap.c
> > @@ -234,7 +234,6 @@ void delete_from_page_cache(struct page
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	__delete_from_page_cache(page, NULL);
> >  	spin_unlock_irq(&mapping->tree_lock);
> > -	mem_cgroup_uncharge_cache_page(page);
> >  
> >  	if (freepage)
> >  		freepage(page);
> > @@ -490,8 +489,7 @@ int replace_page_cache_page(struct page
> >  		if (PageSwapBacked(new))
> >  			__inc_zone_page_state(new, NR_SHMEM);
> >  		spin_unlock_irq(&mapping->tree_lock);
> > -		/* mem_cgroup codes must not be called under tree_lock */
> > -		mem_cgroup_replace_page_cache(old, new);
> > +		mem_cgroup_migrate(old, new, true);
> >  		radix_tree_preload_end();
> >  		if (freepage)
> >  			freepage(old);
> > diff -puN mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api mm/memcontrol.c
> > --- a/mm/memcontrol.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/memcontrol.c
> > @@ -754,9 +754,11 @@ static void __mem_cgroup_remove_exceeded
> >  static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
> >  				       struct mem_cgroup_tree_per_zone *mctz)
> >  {
> > -	spin_lock(&mctz->lock);
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&mctz->lock, flags);
> >  	__mem_cgroup_remove_exceeded(mz, mctz);
> > -	spin_unlock(&mctz->lock);
> > +	spin_unlock_irqrestore(&mctz->lock, flags);
> >  }
> >  
> >  
> > @@ -779,7 +781,9 @@ static void mem_cgroup_update_tree(struc
> >  		 * mem is over its softlimit.
> >  		 */
> >  		if (excess || mz->on_tree) {
> > -			spin_lock(&mctz->lock);
> > +			unsigned long flags;
> > +
> > +			spin_lock_irqsave(&mctz->lock, flags);
> >  			/* if on-tree, remove it */
> >  			if (mz->on_tree)
> >  				__mem_cgroup_remove_exceeded(mz, mctz);
> > @@ -788,7 +792,7 @@ static void mem_cgroup_update_tree(struc
> >  			 * If excess is 0, no tree ops.
> >  			 */
> >  			__mem_cgroup_insert_exceeded(mz, mctz, excess);
> > -			spin_unlock(&mctz->lock);
> > +			spin_unlock_irqrestore(&mctz->lock, flags);
> >  		}
> >  	}
> >  }
> > @@ -839,9 +843,9 @@ mem_cgroup_largest_soft_limit_node(struc
> >  {
> >  	struct mem_cgroup_per_zone *mz;
> >  
> > -	spin_lock(&mctz->lock);
> > +	spin_lock_irq(&mctz->lock);
> >  	mz = __mem_cgroup_largest_soft_limit_node(mctz);
> > -	spin_unlock(&mctz->lock);
> > +	spin_unlock_irq(&mctz->lock);
> >  	return mz;
> >  }
> >  
> > @@ -882,13 +886,6 @@ static long mem_cgroup_read_stat(struct
> >  	return val;
> >  }
> >  
> > -static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
> > -					 bool charge)
> > -{
> > -	int val = (charge) ? 1 : -1;
> > -	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> > -}
> > -
> >  static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
> >  					    enum mem_cgroup_events_index idx)
> >  {
> > @@ -909,13 +906,13 @@ static unsigned long mem_cgroup_read_eve
> >  
> >  static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> >  					 struct page *page,
> > -					 bool anon, int nr_pages)
> > +					 int nr_pages)
> >  {
> >  	/*
> >  	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
> >  	 * counted as CACHE even if it's on ANON LRU.
> >  	 */
> > -	if (anon)
> > +	if (PageAnon(page))
> >  		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
> >  				nr_pages);
> >  	else
> > @@ -1013,7 +1010,6 @@ static bool mem_cgroup_event_ratelimit(s
> >   */
> >  static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> >  {
> > -	preempt_disable();
> >  	/* threshold event is triggered in finer grain than soft limit */
> >  	if (unlikely(mem_cgroup_event_ratelimit(memcg,
> >  						MEM_CGROUP_TARGET_THRESH))) {
> > @@ -1026,8 +1022,6 @@ static void memcg_check_events(struct me
> >  		do_numainfo = mem_cgroup_event_ratelimit(memcg,
> >  						MEM_CGROUP_TARGET_NUMAINFO);
> >  #endif
> > -		preempt_enable();
> > -
> >  		mem_cgroup_threshold(memcg);
> >  		if (unlikely(do_softlimit))
> >  			mem_cgroup_update_tree(memcg, page);
> > @@ -1035,8 +1029,7 @@ static void memcg_check_events(struct me
> >  		if (unlikely(do_numainfo))
> >  			atomic_inc(&memcg->numainfo_events);
> >  #endif
> > -	} else
> > -		preempt_enable();
> > +	}
> >  }
> >  
> >  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> > @@ -1347,20 +1340,6 @@ out:
> >  	return lruvec;
> >  }
> >  
> > -/*
> > - * Following LRU functions are allowed to be used without PCG_LOCK.
> > - * Operations are called by routine of global LRU independently from memcg.
> > - * What we have to take care of here is validness of pc->mem_cgroup.
> > - *
> > - * Changes to pc->mem_cgroup happens when
> > - * 1. charge
> > - * 2. moving account
> > - * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
> > - * It is added to LRU before charge.
> > - * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
> > - * When moving account, the page is not on LRU. It's isolated.
> > - */
> > -
> >  /**
> >   * mem_cgroup_page_lruvec - return lruvec for adding an lru page
> >   * @page: the page
> > @@ -2261,22 +2240,14 @@ cleanup:
> >   *
> >   * Notes: Race condition
> >   *
> > - * We usually use lock_page_cgroup() for accessing page_cgroup member but
> > - * it tends to be costly. But considering some conditions, we doesn't need
> > - * to do so _always_.
> > - *
> > - * Considering "charge", lock_page_cgroup() is not required because all
> > - * file-stat operations happen after a page is attached to radix-tree. There
> > - * are no race with "charge".
> > - *
> > - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
> > - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
> > - * if there are race with "uncharge". Statistics itself is properly handled
> > - * by flags.
> > - *
> > - * Considering "move", this is an only case we see a race. To make the race
> > - * small, we check memcg->moving_account and detect there are possibility
> > - * of race or not. If there is, we take a lock.
> > + * Charging occurs during page instantiation, while the page is
> > + * unmapped and locked in page migration, or while the page table is
> > + * locked in THP migration.  No race is possible.
> > + *
> > + * Uncharge happens to pages with zero references, no race possible.
> > + *
> > + * Charge moving between groups is protected by checking mm->moving
> > + * account and taking the move_lock in the slowpath.
> >   */
> >  
> >  void __mem_cgroup_begin_update_page_stat(struct page *page,
> > @@ -2689,6 +2660,16 @@ static struct mem_cgroup *mem_cgroup_loo
> >  	return mem_cgroup_from_id(id);
> >  }
> >  
> > +/*
> > + * try_get_mem_cgroup_from_page - look up page's memcg association
> > + * @page: the page
> > + *
> > + * Look up, get a css reference, and return the memcg that owns @page.
> > + *
> > + * The page must be locked to prevent racing with swap-in and page
> > + * cache charges.  If coming from an unlocked page table, the caller
> > + * must ensure the page is on the LRU or this can race with charging.
> > + */
> >  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> >  {
> >  	struct mem_cgroup *memcg = NULL;
> > @@ -2699,7 +2680,6 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >  
> >  	pc = lookup_page_cgroup(page);
> > -	lock_page_cgroup(pc);
> >  	if (PageCgroupUsed(pc)) {
> >  		memcg = pc->mem_cgroup;
> >  		if (memcg && !css_tryget_online(&memcg->css))
> > @@ -2713,19 +2693,46 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> >  			memcg = NULL;
> >  		rcu_read_unlock();
> >  	}
> > -	unlock_page_cgroup(pc);
> >  	return memcg;
> >  }
> >  
> > +static void lock_page_lru(struct page *page, int *isolated)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +
> > +	spin_lock_irq(&zone->lru_lock);
> > +	if (PageLRU(page)) {
> > +		struct lruvec *lruvec;
> > +
> > +		lruvec = mem_cgroup_page_lruvec(page, zone);
> > +		ClearPageLRU(page);
> > +		del_page_from_lru_list(page, lruvec, page_lru(page));
> > +		*isolated = 1;
> > +	} else
> > +		*isolated = 0;
> > +}
> > +
> > +static void unlock_page_lru(struct page *page, int isolated)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +
> > +	if (isolated) {
> > +		struct lruvec *lruvec;
> > +
> > +		lruvec = mem_cgroup_page_lruvec(page, zone);
> > +		VM_BUG_ON_PAGE(PageLRU(page), page);
> > +		SetPageLRU(page);
> > +		add_page_to_lru_list(page, lruvec, page_lru(page));
> > +	}
> > +	spin_unlock_irq(&zone->lru_lock);
> > +}
> > +
> >  static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> > -			  unsigned int nr_pages, bool anon, bool lrucare)
> > +			  unsigned int nr_pages, bool lrucare)
> >  {
> >  	struct page_cgroup *pc = lookup_page_cgroup(page);
> > -	struct zone *uninitialized_var(zone);
> > -	struct lruvec *lruvec;
> > -	bool was_on_lru = false;
> > +	int isolated;
> >  
> > -	lock_page_cgroup(pc);
> >  	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
> >  	/*
> >  	 * we don't need page_cgroup_lock about tail pages, because they are not
> > @@ -2736,39 +2743,38 @@ static void commit_charge(struct page *p
> >  	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
> >  	 * may already be on some other mem_cgroup's LRU.  Take care of it.
> >  	 */
> > -	if (lrucare) {
> > -		zone = page_zone(page);
> > -		spin_lock_irq(&zone->lru_lock);
> > -		if (PageLRU(page)) {
> > -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> > -			ClearPageLRU(page);
> > -			del_page_from_lru_list(page, lruvec, page_lru(page));
> > -			was_on_lru = true;
> > -		}
> > -	}
> > +	if (lrucare)
> > +		lock_page_lru(page, &isolated);
> >  
> > +	/*
> > +	 * Nobody should be changing or seriously looking at
> > +	 * pc->mem_cgroup and pc->flags at this point:
> > +	 *
> > +	 * - the page is uncharged
> > +	 *
> > +	 * - the page is off-LRU
> > +	 *
> > +	 * - an anonymous fault has exclusive page access, except for
> > +	 *   a locked page table
> > +	 *
> > +	 * - a page cache insertion, a swapin fault, or a migration
> > +	 *   have the page locked
> > +	 */
> >  	pc->mem_cgroup = memcg;
> > -	SetPageCgroupUsed(pc);
> > +	pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0);
> >  
> > -	if (lrucare) {
> > -		if (was_on_lru) {
> > -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> > -			VM_BUG_ON_PAGE(PageLRU(page), page);
> > -			SetPageLRU(page);
> > -			add_page_to_lru_list(page, lruvec, page_lru(page));
> > -		}
> > -		spin_unlock_irq(&zone->lru_lock);
> > -	}
> > -
> > -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> > -	unlock_page_cgroup(pc);
> > +	if (lrucare)
> > +		unlock_page_lru(page, isolated);
> >  
> > +	local_irq_disable();
> > +	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> >  	/*
> >  	 * "charge_statistics" updated event counter. Then, check it.
> >  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> >  	 * if they exceeds softlimit.
> >  	 */
> >  	memcg_check_events(memcg, page);
> > +	local_irq_enable();
> >  }
> >  
> >  static DEFINE_MUTEX(set_limit_mutex);
> > @@ -3395,7 +3401,6 @@ static inline void memcg_unregister_all_
> >  
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  
> > -#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
> >  /*
> >   * Because tail pages are not marked as "used", set it. We're under
> >   * zone->lru_lock, 'splitting on pmd' and compound_lock.
> > @@ -3416,7 +3421,7 @@ void mem_cgroup_split_huge_fixup(struct
> >  	for (i = 1; i < HPAGE_PMD_NR; i++) {
> >  		pc = head_pc + i;
> >  		pc->mem_cgroup = memcg;
> > -		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
> > +		pc->flags = head_pc->flags;
> >  	}
> >  	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
> >  		       HPAGE_PMD_NR);
> > @@ -3446,7 +3451,6 @@ static int mem_cgroup_move_account(struc
> >  {
> >  	unsigned long flags;
> >  	int ret;
> > -	bool anon = PageAnon(page);
> >  
> >  	VM_BUG_ON(from == to);
> >  	VM_BUG_ON_PAGE(PageLRU(page), page);
> > @@ -3460,15 +3464,21 @@ static int mem_cgroup_move_account(struc
> >  	if (nr_pages > 1 && !PageTransHuge(page))
> >  		goto out;
> >  
> > -	lock_page_cgroup(pc);
> > +	/*
> > +	 * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup
> > +	 * of its source page while we change it: page migration takes
> > +	 * both pages off the LRU, but page cache replacement doesn't.
> > +	 */
> > +	if (!trylock_page(page))
> > +		goto out;
> >  
> >  	ret = -EINVAL;
> >  	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
> > -		goto unlock;
> > +		goto out_unlock;
> >  
> >  	move_lock_mem_cgroup(from, &flags);
> >  
> > -	if (!anon && page_mapped(page)) {
> > +	if (!PageAnon(page) && page_mapped(page)) {
> >  		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
> >  			       nr_pages);
> >  		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
> > @@ -3482,20 +3492,25 @@ static int mem_cgroup_move_account(struc
> >  			       nr_pages);
> >  	}
> >  
> > -	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
> > +	/*
> > +	 * It is safe to change pc->mem_cgroup here because the page
> > +	 * is referenced, charged, and isolated - we can't race with
> > +	 * uncharging, charging, migration, or LRU putback.
> > +	 */
> >  
> >  	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> > -	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
> >  	move_unlock_mem_cgroup(from, &flags);
> >  	ret = 0;
> > -unlock:
> > -	unlock_page_cgroup(pc);
> > -	/*
> > -	 * check events
> > -	 */
> > +
> > +	local_irq_disable();
> > +	mem_cgroup_charge_statistics(to, page, nr_pages);
> >  	memcg_check_events(to, page);
> > +	mem_cgroup_charge_statistics(from, page, -nr_pages);
> >  	memcg_check_events(from, page);
> > +	local_irq_enable();
> > +out_unlock:
> > +	unlock_page(page);
> >  out:
> >  	return ret;
> >  }
> > @@ -3566,193 +3581,6 @@ out:
> >  	return ret;
> >  }
> >  
> > -static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
> > -				   unsigned int nr_pages,
> > -				   const enum charge_type ctype)
> > -{
> > -	struct memcg_batch_info *batch = NULL;
> > -	bool uncharge_memsw = true;
> > -
> > -	/* If swapout, usage of swap doesn't decrease */
> > -	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > -		uncharge_memsw = false;
> > -
> > -	batch = &current->memcg_batch;
> > -	/*
> > -	 * In usual, we do css_get() when we remember memcg pointer.
> > -	 * But in this case, we keep res->usage until end of a series of
> > -	 * uncharges. Then, it's ok to ignore memcg's refcnt.
> > -	 */
> > -	if (!batch->memcg)
> > -		batch->memcg = memcg;
> > -	/*
> > -	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
> > -	 * In those cases, all pages freed continuously can be expected to be in
> > -	 * the same cgroup and we have chance to coalesce uncharges.
> > -	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
> > -	 * because we want to do uncharge as soon as possible.
> > -	 */
> > -
> > -	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
> > -		goto direct_uncharge;
> > -
> > -	if (nr_pages > 1)
> > -		goto direct_uncharge;
> > -
> > -	/*
> > -	 * In typical case, batch->memcg == mem. This means we can
> > -	 * merge a series of uncharges to an uncharge of res_counter.
> > -	 * If not, we uncharge res_counter ony by one.
> > -	 */
> > -	if (batch->memcg != memcg)
> > -		goto direct_uncharge;
> > -	/* remember freed charge and uncharge it later */
> > -	batch->nr_pages++;
> > -	if (uncharge_memsw)
> > -		batch->memsw_nr_pages++;
> > -	return;
> > -direct_uncharge:
> > -	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
> > -	if (uncharge_memsw)
> > -		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
> > -	if (unlikely(batch->memcg != memcg))
> > -		memcg_oom_recover(memcg);
> > -}
> > -
> > -/*
> > - * uncharge if !page_mapped(page)
> > - */
> > -static struct mem_cgroup *
> > -__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
> > -			     bool end_migration)
> > -{
> > -	struct mem_cgroup *memcg = NULL;
> > -	unsigned int nr_pages = 1;
> > -	struct page_cgroup *pc;
> > -	bool anon;
> > -
> > -	if (mem_cgroup_disabled())
> > -		return NULL;
> > -
> > -	if (PageTransHuge(page)) {
> > -		nr_pages <<= compound_order(page);
> > -		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > -	}
> > -	/*
> > -	 * Check if our page_cgroup is valid
> > -	 */
> > -	pc = lookup_page_cgroup(page);
> > -	if (unlikely(!PageCgroupUsed(pc)))
> > -		return NULL;
> > -
> > -	lock_page_cgroup(pc);
> > -
> > -	memcg = pc->mem_cgroup;
> > -
> > -	if (!PageCgroupUsed(pc))
> > -		goto unlock_out;
> > -
> > -	anon = PageAnon(page);
> > -
> > -	switch (ctype) {
> > -	case MEM_CGROUP_CHARGE_TYPE_ANON:
> > -		/*
> > -		 * Generally PageAnon tells if it's the anon statistics to be
> > -		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
> > -		 * used before page reached the stage of being marked PageAnon.
> > -		 */
> > -		anon = true;
> > -		/* fallthrough */
> > -	case MEM_CGROUP_CHARGE_TYPE_DROP:
> > -		/* See mem_cgroup_prepare_migration() */
> > -		if (page_mapped(page))
> > -			goto unlock_out;
> > -		/*
> > -		 * Pages under migration may not be uncharged.  But
> > -		 * end_migration() /must/ be the one uncharging the
> > -		 * unused post-migration page and so it has to call
> > -		 * here with the migration bit still set.  See the
> > -		 * res_counter handling below.
> > -		 */
> > -		if (!end_migration && PageCgroupMigration(pc))
> > -			goto unlock_out;
> > -		break;
> > -	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
> > -		if (!PageAnon(page)) {	/* Shared memory */
> > -			if (page->mapping && !page_is_file_cache(page))
> > -				goto unlock_out;
> > -		} else if (page_mapped(page)) /* Anon */
> > -				goto unlock_out;
> > -		break;
> > -	default:
> > -		break;
> > -	}
> > -
> > -	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
> > -
> > -	ClearPageCgroupUsed(pc);
> > -	/*
> > -	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
> > -	 * freed from LRU. This is safe because uncharged page is expected not
> > -	 * to be reused (freed soon). Exception is SwapCache, it's handled by
> > -	 * special functions.
> > -	 */
> > -
> > -	unlock_page_cgroup(pc);
> > -	/*
> > -	 * even after unlock, we have memcg->res.usage here and this memcg
> > -	 * will never be freed, so it's safe to call css_get().
> > -	 */
> > -	memcg_check_events(memcg, page);
> > -	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> > -		mem_cgroup_swap_statistics(memcg, true);
> > -		css_get(&memcg->css);
> > -	}
> > -	/*
> > -	 * Migration does not charge the res_counter for the
> > -	 * replacement page, so leave it alone when phasing out the
> > -	 * page that is unused after the migration.
> > -	 */
> > -	if (!end_migration)
> > -		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
> > -
> > -	return memcg;
> > -
> > -unlock_out:
> > -	unlock_page_cgroup(pc);
> > -	return NULL;
> > -}
> > -
> > -void mem_cgroup_uncharge_page(struct page *page)
> > -{
> > -	/* early check. */
> > -	if (page_mapped(page))
> > -		return;
> > -	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
> > -	/*
> > -	 * If the page is in swap cache, uncharge should be deferred
> > -	 * to the swap path, which also properly accounts swap usage
> > -	 * and handles memcg lifetime.
> > -	 *
> > -	 * Note that this check is not stable and reclaim may add the
> > -	 * page to swap cache at any time after this.  However, if the
> > -	 * page is not in swap cache by the time page->mapcount hits
> > -	 * 0, there won't be any page table references to the swap
> > -	 * slot, and reclaim will free it and not actually write the
> > -	 * page to disk.
> > -	 */
> > -	if (PageSwapCache(page))
> > -		return;
> > -	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
> > -}
> > -
> > -void mem_cgroup_uncharge_cache_page(struct page *page)
> > -{
> > -	VM_BUG_ON_PAGE(page_mapped(page), page);
> > -	VM_BUG_ON_PAGE(page->mapping, page);
> > -	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
> > -}
> > -
> >  /*
> >   * Batch_start/batch_end is called in unmap_page_range/invalidate/truncate.
> >   * In that cases, pages are freed continuously and we can expect pages
> > @@ -3763,6 +3591,9 @@ void mem_cgroup_uncharge_cache_page(stru
> >  
> >  void mem_cgroup_uncharge_start(void)
> >  {
> > +	unsigned long flags;
> > +
> > +	local_irq_save(flags);
> >  	current->memcg_batch.do_batch++;
> >  	/* We can do nest. */
> >  	if (current->memcg_batch.do_batch == 1) {
> > @@ -3770,21 +3601,18 @@ void mem_cgroup_uncharge_start(void)
> >  		current->memcg_batch.nr_pages = 0;
> >  		current->memcg_batch.memsw_nr_pages = 0;
> >  	}
> > +	local_irq_restore(flags);
> >  }
> >  
> >  void mem_cgroup_uncharge_end(void)
> >  {
> >  	struct memcg_batch_info *batch = &current->memcg_batch;
> > +	unsigned long flags;
> >  
> > -	if (!batch->do_batch)
> > -		return;
> > -
> > -	batch->do_batch--;
> > -	if (batch->do_batch) /* If stacked, do nothing. */
> > -		return;
> > -
> > -	if (!batch->memcg)
> > -		return;
> > +	local_irq_save(flags);
> > +	VM_BUG_ON(!batch->do_batch);
> > +	if (--batch->do_batch) /* If stacked, do nothing */
> > +		goto out;
> >  	/*
> >  	 * This "batch->memcg" is valid without any css_get/put etc...
> >  	 * because we hide charges behind us.
> > @@ -3796,61 +3624,16 @@ void mem_cgroup_uncharge_end(void)
> >  		res_counter_uncharge(&batch->memcg->memsw,
> >  				     batch->memsw_nr_pages * PAGE_SIZE);
> >  	memcg_oom_recover(batch->memcg);
> > -	/* forget this pointer (for sanity check) */
> > -	batch->memcg = NULL;
> > -}
> > -
> > -#ifdef CONFIG_SWAP
> > -/*
> > - * called after __delete_from_swap_cache() and drop "page" account.
> > - * memcg information is recorded to swap_cgroup of "ent"
> > - */
> > -void
> > -mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> > -{
> > -	struct mem_cgroup *memcg;
> > -	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
> > -
> > -	if (!swapout) /* this was a swap cache but the swap is unused ! */
> > -		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
> > -
> > -	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
> > -
> > -	/*
> > -	 * record memcg information,  if swapout && memcg != NULL,
> > -	 * css_get() was called in uncharge().
> > -	 */
> > -	if (do_swap_account && swapout && memcg)
> > -		swap_cgroup_record(ent, mem_cgroup_id(memcg));
> > +out:
> > +	local_irq_restore(flags);
> >  }
> > -#endif
> >  
> >  #ifdef CONFIG_MEMCG_SWAP
> > -/*
> > - * called from swap_entry_free(). remove record in swap_cgroup and
> > - * uncharge "memsw" account.
> > - */
> > -void mem_cgroup_uncharge_swap(swp_entry_t ent)
> > +static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
> > +					 bool charge)
> >  {
> > -	struct mem_cgroup *memcg;
> > -	unsigned short id;
> > -
> > -	if (!do_swap_account)
> > -		return;
> > -
> > -	id = swap_cgroup_record(ent, 0);
> > -	rcu_read_lock();
> > -	memcg = mem_cgroup_lookup(id);
> > -	if (memcg) {
> > -		/*
> > -		 * We uncharge this because swap is freed.  This memcg can
> > -		 * be obsolete one. We avoid calling css_tryget_online().
> > -		 */
> > -		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > -		mem_cgroup_swap_statistics(memcg, false);
> > -		css_put(&memcg->css);
> > -	}
> > -	rcu_read_unlock();
> > +	int val = (charge) ? 1 : -1;
> > +	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> >  }
> >  
> >  /**
> > @@ -3902,169 +3685,6 @@ static inline int mem_cgroup_move_swap_a
> >  }
> >  #endif
> >  
> > -/*
> > - * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
> > - * page belongs to.
> > - */
> > -void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> > -				  struct mem_cgroup **memcgp)
> > -{
> > -	struct mem_cgroup *memcg = NULL;
> > -	unsigned int nr_pages = 1;
> > -	struct page_cgroup *pc;
> > -
> > -	*memcgp = NULL;
> > -
> > -	if (mem_cgroup_disabled())
> > -		return;
> > -
> > -	if (PageTransHuge(page))
> > -		nr_pages <<= compound_order(page);
> > -
> > -	pc = lookup_page_cgroup(page);
> > -	lock_page_cgroup(pc);
> > -	if (PageCgroupUsed(pc)) {
> > -		memcg = pc->mem_cgroup;
> > -		css_get(&memcg->css);
> > -		/*
> > -		 * At migrating an anonymous page, its mapcount goes down
> > -		 * to 0 and uncharge() will be called. But, even if it's fully
> > -		 * unmapped, migration may fail and this page has to be
> > -		 * charged again. We set MIGRATION flag here and delay uncharge
> > -		 * until end_migration() is called
> > -		 *
> > -		 * Corner Case Thinking
> > -		 * A)
> > -		 * When the old page was mapped as Anon and it's unmap-and-freed
> > -		 * while migration was ongoing.
> > -		 * If unmap finds the old page, uncharge() of it will be delayed
> > -		 * until end_migration(). If unmap finds a new page, it's
> > -		 * uncharged when it make mapcount to be 1->0. If unmap code
> > -		 * finds swap_migration_entry, the new page will not be mapped
> > -		 * and end_migration() will find it(mapcount==0).
> > -		 *
> > -		 * B)
> > -		 * When the old page was mapped but migraion fails, the kernel
> > -		 * remaps it. A charge for it is kept by MIGRATION flag even
> > -		 * if mapcount goes down to 0. We can do remap successfully
> > -		 * without charging it again.
> > -		 *
> > -		 * C)
> > -		 * The "old" page is under lock_page() until the end of
> > -		 * migration, so, the old page itself will not be swapped-out.
> > -		 * If the new page is swapped out before end_migraton, our
> > -		 * hook to usual swap-out path will catch the event.
> > -		 */
> > -		if (PageAnon(page))
> > -			SetPageCgroupMigration(pc);
> > -	}
> > -	unlock_page_cgroup(pc);
> > -	/*
> > -	 * If the page is not charged at this point,
> > -	 * we return here.
> > -	 */
> > -	if (!memcg)
> > -		return;
> > -
> > -	*memcgp = memcg;
> > -	/*
> > -	 * We charge new page before it's used/mapped. So, even if unlock_page()
> > -	 * is called before end_migration, we can catch all events on this new
> > -	 * page. In the case new page is migrated but not remapped, new page's
> > -	 * mapcount will be finally 0 and we call uncharge in end_migration().
> > -	 */
> > -	/*
> > -	 * The page is committed to the memcg, but it's not actually
> > -	 * charged to the res_counter since we plan on replacing the
> > -	 * old one and only one page is going to be left afterwards.
> > -	 */
> > -	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
> > -}
> > -
> > -/* remove redundant charge if migration failed*/
> > -void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> > -	struct page *oldpage, struct page *newpage, bool migration_ok)
> > -{
> > -	struct page *used, *unused;
> > -	struct page_cgroup *pc;
> > -	bool anon;
> > -
> > -	if (!memcg)
> > -		return;
> > -
> > -	if (!migration_ok) {
> > -		used = oldpage;
> > -		unused = newpage;
> > -	} else {
> > -		used = newpage;
> > -		unused = oldpage;
> > -	}
> > -	anon = PageAnon(used);
> > -	__mem_cgroup_uncharge_common(unused,
> > -				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
> > -				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
> > -				     true);
> > -	css_put(&memcg->css);
> > -	/*
> > -	 * We disallowed uncharge of pages under migration because mapcount
> > -	 * of the page goes down to zero, temporarly.
> > -	 * Clear the flag and check the page should be charged.
> > -	 */
> > -	pc = lookup_page_cgroup(oldpage);
> > -	lock_page_cgroup(pc);
> > -	ClearPageCgroupMigration(pc);
> > -	unlock_page_cgroup(pc);
> > -
> > -	/*
> > -	 * If a page is a file cache, radix-tree replacement is very atomic
> > -	 * and we can skip this check. When it was an Anon page, its mapcount
> > -	 * goes down to 0. But because we added MIGRATION flage, it's not
> > -	 * uncharged yet. There are several case but page->mapcount check
> > -	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
> > -	 * check. (see prepare_charge() also)
> > -	 */
> > -	if (anon)
> > -		mem_cgroup_uncharge_page(used);
> > -}
> > -
> > -/*
> > - * At replace page cache, newpage is not under any memcg but it's on
> > - * LRU. So, this function doesn't touch res_counter but handles LRU
> > - * in correct way. Both pages are locked so we cannot race with uncharge.
> > - */
> > -void mem_cgroup_replace_page_cache(struct page *oldpage,
> > -				  struct page *newpage)
> > -{
> > -	struct mem_cgroup *memcg = NULL;
> > -	struct page_cgroup *pc;
> > -
> > -	if (mem_cgroup_disabled())
> > -		return;
> > -
> > -	pc = lookup_page_cgroup(oldpage);
> > -	/* fix accounting on old pages */
> > -	lock_page_cgroup(pc);
> > -	if (PageCgroupUsed(pc)) {
> > -		memcg = pc->mem_cgroup;
> > -		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
> > -		ClearPageCgroupUsed(pc);
> > -	}
> > -	unlock_page_cgroup(pc);
> > -
> > -	/*
> > -	 * When called from shmem_replace_page(), in some cases the
> > -	 * oldpage has already been charged, and in some cases not.
> > -	 */
> > -	if (!memcg)
> > -		return;
> > -	/*
> > -	 * Even if newpage->mapping was NULL before starting replacement,
> > -	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
> > -	 * LRU while we overwrite pc->mem_cgroup.
> > -	 */
> > -	commit_charge(newpage, memcg, 1, false, true);
> > -}
> > -
> >  #ifdef CONFIG_DEBUG_VM
> >  static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
> >  {
> > @@ -4263,7 +3883,7 @@ unsigned long mem_cgroup_soft_limit_recl
> >  						    gfp_mask, &nr_scanned);
> >  		nr_reclaimed += reclaimed;
> >  		*total_scanned += nr_scanned;
> > -		spin_lock(&mctz->lock);
> > +		spin_lock_irq(&mctz->lock);
> >  
> >  		/*
> >  		 * If we failed to reclaim anything from this memory cgroup
> > @@ -4303,7 +3923,7 @@ unsigned long mem_cgroup_soft_limit_recl
> >  		 */
> >  		/* If excess == 0, no tree ops */
> >  		__mem_cgroup_insert_exceeded(mz, mctz, excess);
> > -		spin_unlock(&mctz->lock);
> > +		spin_unlock_irq(&mctz->lock);
> >  		css_put(&mz->memcg->css);
> >  		loop++;
> >  		/*
> > @@ -6265,9 +5885,9 @@ static enum mc_target_type get_mctgt_typ
> >  	if (page) {
> >  		pc = lookup_page_cgroup(page);
> >  		/*
> > -		 * Do only loose check w/o page_cgroup lock.
> > -		 * mem_cgroup_move_account() checks the pc is valid or not under
> > -		 * the lock.
> > +		 * Do only loose check w/o serialization.
> > +		 * mem_cgroup_move_account() checks the pc is valid or
> > +		 * not under LRU exclusion.
> >  		 */
> >  		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> >  			ret = MC_TARGET_PAGE;
> > @@ -6729,6 +6349,67 @@ static void __init enable_swap_cgroup(vo
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_MEMCG_SWAP
> > +/**
> > + * mem_cgroup_swapout - transfer a memsw charge to swap
> > + * @page: page whose memsw charge to transfer
> > + * @entry: swap entry to move the charge to
> > + *
> > + * Transfer the memsw charge of @page to @entry.
> > + */
> > +void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
> > +{
> > +	struct page_cgroup *pc;
> > +	unsigned short oldid;
> > +
> > +	VM_BUG_ON_PAGE(PageLRU(page), page);
> > +	VM_BUG_ON_PAGE(page_count(page), page);
> > +
> > +	if (!do_swap_account)
> > +		return;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +
> > +	/* Readahead page, never charged */
> > +	if (!PageCgroupUsed(pc))
> > +		return;
> > +
> > +	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
> > +
> > +	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
> > +	VM_BUG_ON_PAGE(oldid, page);
> > +
> > +	pc->flags &= ~PCG_MEMSW;
> > +	css_get(&pc->mem_cgroup->css);
> > +	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
> > +}
> > +
> > +/**
> > + * mem_cgroup_uncharge_swap - uncharge a swap entry
> > + * @entry: swap entry to uncharge
> > + *
> > + * Drop the memsw charge associated with @entry.
> > + */
> > +void mem_cgroup_uncharge_swap(swp_entry_t entry)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	unsigned short id;
> > +
> > +	if (!do_swap_account)
> > +		return;
> > +
> > +	id = swap_cgroup_record(entry, 0);
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_lookup(id);
> > +	if (memcg) {
> > +		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > +		mem_cgroup_swap_statistics(memcg, false);
> > +		css_put(&memcg->css);
> > +	}
> > +	rcu_read_unlock();
> > +}
> > +#endif
> > +
> >  /**
> >   * mem_cgroup_try_charge - try charging a page
> >   * @page: page to charge
> > @@ -6831,7 +6512,7 @@ void mem_cgroup_commit_charge(struct pag
> >  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> >  	}
> >  
> > -	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
> > +	commit_charge(page, memcg, nr_pages, lrucare);
> >  
> >  	if (do_swap_account && PageSwapCache(page)) {
> >  		swp_entry_t entry = { .val = page_private(page) };
> > @@ -6873,6 +6554,139 @@ void mem_cgroup_cancel_charge(struct pag
> >  	cancel_charge(memcg, nr_pages);
> >  }
> >  
> > +/**
> > + * mem_cgroup_uncharge - uncharge a page
> > + * @page: page to uncharge
> > + *
> > + * Uncharge a page previously charged with mem_cgroup_try_charge() and
> > + * mem_cgroup_commit_charge().
> > + */
> > +void mem_cgroup_uncharge(struct page *page)
> > +{
> > +	struct memcg_batch_info *batch;
> > +	unsigned int nr_pages = 1;
> > +	struct mem_cgroup *memcg;
> > +	struct page_cgroup *pc;
> > +	unsigned long pc_flags;
> > +	unsigned long flags;
> > +
> > +	VM_BUG_ON_PAGE(PageLRU(page), page);
> > +	VM_BUG_ON_PAGE(page_count(page), page);
> > +
> > +	if (mem_cgroup_disabled())
> > +		return;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +
> > +	/* Every final put_page() ends up here */
> > +	if (!PageCgroupUsed(pc))
> > +		return;
> > +
> > +	if (PageTransHuge(page)) {
> > +		nr_pages <<= compound_order(page);
> > +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > +	}
> > +	/*
> > +	 * Nobody should be changing or seriously looking at
> > +	 * pc->mem_cgroup and pc->flags at this point, we have fully
> > +	 * exclusive access to the page.
> > +	 */
> > +	memcg = pc->mem_cgroup;
> > +	pc_flags = pc->flags;
> > +	pc->flags = 0;
> > +
> > +	local_irq_save(flags);
> > +
> > +	if (nr_pages > 1)
> > +		goto direct;
> > +	if (unlikely(test_thread_flag(TIF_MEMDIE)))
> > +		goto direct;
> > +	batch = &current->memcg_batch;
> > +	if (!batch->do_batch)
> > +		goto direct;
> > +	if (batch->memcg && batch->memcg != memcg)
> > +		goto direct;
> > +	if (!batch->memcg)
> > +		batch->memcg = memcg;
> > +	if (pc_flags & PCG_MEM)
> > +		batch->nr_pages++;
> > +	if (pc_flags & PCG_MEMSW)
> > +		batch->memsw_nr_pages++;
> > +	goto out;
> > +direct:
> > +	if (pc_flags & PCG_MEM)
> > +		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
> > +	if (pc_flags & PCG_MEMSW)
> > +		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
> > +	memcg_oom_recover(memcg);
> > +out:
> > +	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
> > +	memcg_check_events(memcg, page);
> > +
> > +	local_irq_restore(flags);
> > +}
> > +
> > +/**
> > + * mem_cgroup_migrate - migrate a charge to another page
> > + * @oldpage: currently charged page
> > + * @newpage: page to transfer the charge to
> > + * @lrucare: both pages might be on the LRU already
> > + *
> > + * Migrate the charge from @oldpage to @newpage.
> > + *
> > + * Both pages must be locked, @newpage->mapping must be set up.
> > + */
> > +void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> > +			bool lrucare)
> > +{
> > +	unsigned int nr_pages = 1;
> > +	struct page_cgroup *pc;
> > +	int isolated;
> > +
> > +	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> > +	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> > +	VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage);
> > +	VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage);
> > +	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
> > +
> > +	if (mem_cgroup_disabled())
> > +		return;
> > +
> > +	/* Page cache replacement: new page already charged? */
> > +	pc = lookup_page_cgroup(newpage);
> > +	if (PageCgroupUsed(pc))
> > +		return;
> > +
> > +	/* Re-entrant migration: old page already uncharged? */
> > +	pc = lookup_page_cgroup(oldpage);
> > +	if (!PageCgroupUsed(pc))
> > +		return;
> > +
> > +	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
> > +	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
> > +
> > +	if (PageTransHuge(oldpage)) {
> > +		nr_pages <<= compound_order(oldpage);
> > +		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
> > +		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
> > +	}
> > +
> > +	if (lrucare)
> > +		lock_page_lru(oldpage, &isolated);
> > +
> > +	pc->flags = 0;
> > +
> > +	if (lrucare)
> > +		unlock_page_lru(oldpage, isolated);
> > +
> > +	local_irq_disable();
> > +	mem_cgroup_charge_statistics(pc->mem_cgroup, oldpage, -nr_pages);
> > +	memcg_check_events(pc->mem_cgroup, oldpage);
> > +	local_irq_enable();
> > +
> > +	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
> > +}
> > +
> >  /*
> >   * subsys_initcall() for memory controller.
> >   *
> > diff -puN mm/memory.c~mm-memcontrol-rewrite-uncharge-api mm/memory.c
> > --- a/mm/memory.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/memory.c
> > @@ -1292,7 +1292,6 @@ static void unmap_page_range(struct mmu_
> >  		details = NULL;
> >  
> >  	BUG_ON(addr >= end);
> > -	mem_cgroup_uncharge_start();
> >  	tlb_start_vma(tlb, vma);
> >  	pgd = pgd_offset(vma->vm_mm, addr);
> >  	do {
> > @@ -1302,7 +1301,6 @@ static void unmap_page_range(struct mmu_
> >  		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
> >  	} while (pgd++, addr = next, addr != end);
> >  	tlb_end_vma(tlb, vma);
> > -	mem_cgroup_uncharge_end();
> >  }
> >  
> >  
> > diff -puN mm/migrate.c~mm-memcontrol-rewrite-uncharge-api mm/migrate.c
> > --- a/mm/migrate.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/migrate.c
> > @@ -780,6 +780,7 @@ static int move_to_new_page(struct page
> >  	if (rc != MIGRATEPAGE_SUCCESS) {
> >  		newpage->mapping = NULL;
> >  	} else {
> > +		mem_cgroup_migrate(page, newpage, false);
> >  		if (remap_swapcache)
> >  			remove_migration_ptes(page, newpage);
> >  		page->mapping = NULL;
> > @@ -795,7 +796,6 @@ static int __unmap_and_move(struct page
> >  {
> >  	int rc = -EAGAIN;
> >  	int remap_swapcache = 1;
> > -	struct mem_cgroup *mem;
> >  	struct anon_vma *anon_vma = NULL;
> >  
> >  	if (!trylock_page(page)) {
> > @@ -821,9 +821,6 @@ static int __unmap_and_move(struct page
> >  		lock_page(page);
> >  	}
> >  
> > -	/* charge against new page */
> > -	mem_cgroup_prepare_migration(page, newpage, &mem);
> > -
> >  	if (PageWriteback(page)) {
> >  		/*
> >  		 * Only in the case of a full synchronous migration is it
> > @@ -833,10 +830,10 @@ static int __unmap_and_move(struct page
> >  		 */
> >  		if (mode != MIGRATE_SYNC) {
> >  			rc = -EBUSY;
> > -			goto uncharge;
> > +			goto out_unlock;
> >  		}
> >  		if (!force)
> > -			goto uncharge;
> > +			goto out_unlock;
> >  		wait_on_page_writeback(page);
> >  	}
> >  	/*
> > @@ -872,7 +869,7 @@ static int __unmap_and_move(struct page
> >  			 */
> >  			remap_swapcache = 0;
> >  		} else {
> > -			goto uncharge;
> > +			goto out_unlock;
> >  		}
> >  	}
> >  
> > @@ -885,7 +882,7 @@ static int __unmap_and_move(struct page
> >  		 * the page migration right away (protected by page lock).
> >  		 */
> >  		rc = balloon_page_migrate(newpage, page, mode);
> > -		goto uncharge;
> > +		goto out_unlock;
> >  	}
> >  
> >  	/*
> > @@ -904,7 +901,7 @@ static int __unmap_and_move(struct page
> >  		VM_BUG_ON_PAGE(PageAnon(page), page);
> >  		if (page_has_private(page)) {
> >  			try_to_free_buffers(page);
> > -			goto uncharge;
> > +			goto out_unlock;
> >  		}
> >  		goto skip_unmap;
> >  	}
> > @@ -923,10 +920,7 @@ skip_unmap:
> >  	if (anon_vma)
> >  		put_anon_vma(anon_vma);
> >  
> > -uncharge:
> > -	mem_cgroup_end_migration(mem, page, newpage,
> > -				 (rc == MIGRATEPAGE_SUCCESS ||
> > -				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
> > +out_unlock:
> >  	unlock_page(page);
> >  out:
> >  	return rc;
> > @@ -1786,7 +1780,6 @@ int migrate_misplaced_transhuge_page(str
> >  	pg_data_t *pgdat = NODE_DATA(node);
> >  	int isolated = 0;
> >  	struct page *new_page = NULL;
> > -	struct mem_cgroup *memcg = NULL;
> >  	int page_lru = page_is_file_cache(page);
> >  	unsigned long mmun_start = address & HPAGE_PMD_MASK;
> >  	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> > @@ -1852,15 +1845,6 @@ fail_putback:
> >  		goto out_unlock;
> >  	}
> >  
> > -	/*
> > -	 * Traditional migration needs to prepare the memcg charge
> > -	 * transaction early to prevent the old page from being
> > -	 * uncharged when installing migration entries.  Here we can
> > -	 * save the potential rollback and start the charge transfer
> > -	 * only when migration is already known to end successfully.
> > -	 */
> > -	mem_cgroup_prepare_migration(page, new_page, &memcg);
> > -
> >  	orig_entry = *pmd;
> >  	entry = mk_pmd(new_page, vma->vm_page_prot);
> >  	entry = pmd_mkhuge(entry);
> > @@ -1888,14 +1872,10 @@ fail_putback:
> >  		goto fail_putback;
> >  	}
> >  
> > +	mem_cgroup_migrate(page, new_page, false);
> > +
> >  	page_remove_rmap(page);
> >  
> > -	/*
> > -	 * Finish the charge transaction under the page table lock to
> > -	 * prevent split_huge_page() from dividing up the charge
> > -	 * before it's fully transferred to the new page.
> > -	 */
> > -	mem_cgroup_end_migration(memcg, page, new_page, true);
> >  	spin_unlock(ptl);
> >  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> >  
> > diff -puN mm/rmap.c~mm-memcontrol-rewrite-uncharge-api mm/rmap.c
> > --- a/mm/rmap.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/rmap.c
> > @@ -1089,7 +1089,6 @@ void page_remove_rmap(struct page *page)
> >  	if (unlikely(PageHuge(page)))
> >  		goto out;
> >  	if (anon) {
> > -		mem_cgroup_uncharge_page(page);
> >  		if (PageTransHuge(page))
> >  			__dec_zone_page_state(page,
> >  					      NR_ANON_TRANSPARENT_HUGEPAGES);
> > diff -puN mm/shmem.c~mm-memcontrol-rewrite-uncharge-api mm/shmem.c
> > --- a/mm/shmem.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/shmem.c
> > @@ -406,7 +406,6 @@ static void shmem_undo_range(struct inod
> >  			pvec.pages, indices);
> >  		if (!pvec.nr)
> >  			break;
> > -		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  
> > @@ -434,7 +433,6 @@ static void shmem_undo_range(struct inod
> >  		}
> >  		pagevec_remove_exceptionals(&pvec);
> >  		pagevec_release(&pvec);
> > -		mem_cgroup_uncharge_end();
> >  		cond_resched();
> >  		index++;
> >  	}
> > @@ -482,7 +480,6 @@ static void shmem_undo_range(struct inod
> >  			index = start;
> >  			continue;
> >  		}
> > -		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  
> > @@ -518,7 +515,6 @@ static void shmem_undo_range(struct inod
> >  		}
> >  		pagevec_remove_exceptionals(&pvec);
> >  		pagevec_release(&pvec);
> > -		mem_cgroup_uncharge_end();
> >  		index++;
> >  	}
> >  
> > @@ -818,7 +814,7 @@ static int shmem_writepage(struct page *
> >  	}
> >  
> >  	mutex_unlock(&shmem_swaplist_mutex);
> > -	swapcache_free(swap, NULL);
> > +	swapcache_free(swap);
> >  redirty:
> >  	set_page_dirty(page);
> >  	if (wbc->for_reclaim)
> > @@ -991,7 +987,7 @@ static int shmem_replace_page(struct pag
> >  		 */
> >  		oldpage = newpage;
> >  	} else {
> > -		mem_cgroup_replace_page_cache(oldpage, newpage);
> > +		mem_cgroup_migrate(oldpage, newpage, false);
> >  		lru_cache_add_anon(newpage);
> >  		*pagep = newpage;
> >  	}
> > diff -puN mm/swap.c~mm-memcontrol-rewrite-uncharge-api mm/swap.c
> > --- a/mm/swap.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/swap.c
> > @@ -62,6 +62,7 @@ static void __page_cache_release(struct
> >  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  	}
> > +	mem_cgroup_uncharge(page);
> >  }
> >  
> >  static void __put_single_page(struct page *page)
> > @@ -915,6 +916,8 @@ void release_pages(struct page **pages,
> >  	struct lruvec *lruvec;
> >  	unsigned long uninitialized_var(flags);
> >  
> > +	mem_cgroup_uncharge_start();
> > +
> >  	for (i = 0; i < nr; i++) {
> >  		struct page *page = pages[i];
> >  
> > @@ -946,6 +949,7 @@ void release_pages(struct page **pages,
> >  			__ClearPageLRU(page);
> >  			del_page_from_lru_list(page, lruvec, page_off_lru(page));
> >  		}
> > +		mem_cgroup_uncharge(page);
> >  
> >  		/* Clear Active bit in case of parallel mark_page_accessed */
> >  		__ClearPageActive(page);
> > @@ -955,6 +959,8 @@ void release_pages(struct page **pages,
> >  	if (zone)
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  
> > +	mem_cgroup_uncharge_end();
> > +
> >  	free_hot_cold_page_list(&pages_to_free, cold);
> >  }
> >  EXPORT_SYMBOL(release_pages);
> > diff -puN mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api mm/swap_state.c
> > --- a/mm/swap_state.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/swap_state.c
> > @@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struc
> >  
> >  	if (unlikely(PageTransHuge(page)))
> >  		if (unlikely(split_huge_page_to_list(page, list))) {
> > -			swapcache_free(entry, NULL);
> > +			swapcache_free(entry);
> >  			return 0;
> >  		}
> >  
> > @@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struc
> >  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> >  		 * clear SWAP_HAS_CACHE flag.
> >  		 */
> > -		swapcache_free(entry, NULL);
> > +		swapcache_free(entry);
> >  		return 0;
> >  	}
> >  }
> > @@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page
> >  	__delete_from_swap_cache(page);
> >  	spin_unlock_irq(&address_space->tree_lock);
> >  
> > -	swapcache_free(entry, page);
> > +	swapcache_free(entry);
> >  	page_cache_release(page);
> >  }
> >  
> > @@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_e
> >  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> >  		 * clear SWAP_HAS_CACHE flag.
> >  		 */
> > -		swapcache_free(entry, NULL);
> > +		swapcache_free(entry);
> >  	} while (err != -ENOMEM);
> >  
> >  	if (new_page)
> > diff -puN mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api mm/swapfile.c
> > --- a/mm/swapfile.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/swapfile.c
> > @@ -843,16 +843,13 @@ void swap_free(swp_entry_t entry)
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
> > -void swapcache_free(swp_entry_t entry, struct page *page)
> > +void swapcache_free(swp_entry_t entry)
> >  {
> >  	struct swap_info_struct *p;
> > -	unsigned char count;
> >  
> >  	p = swap_info_get(entry);
> >  	if (p) {
> > -		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
> > -		if (page)
> > -			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
> > +		swap_entry_free(p, entry, SWAP_HAS_CACHE);
> >  		spin_unlock(&p->lock);
> >  	}
> >  }
> > diff -puN mm/truncate.c~mm-memcontrol-rewrite-uncharge-api mm/truncate.c
> > --- a/mm/truncate.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/truncate.c
> > @@ -281,7 +281,6 @@ void truncate_inode_pages_range(struct a
> >  	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
> >  			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> >  			indices)) {
> > -		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  
> > @@ -307,7 +306,6 @@ void truncate_inode_pages_range(struct a
> >  		}
> >  		pagevec_remove_exceptionals(&pvec);
> >  		pagevec_release(&pvec);
> > -		mem_cgroup_uncharge_end();
> >  		cond_resched();
> >  		index++;
> >  	}
> > @@ -369,7 +367,6 @@ void truncate_inode_pages_range(struct a
> >  			pagevec_release(&pvec);
> >  			break;
> >  		}
> > -		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  
> > @@ -394,7 +391,6 @@ void truncate_inode_pages_range(struct a
> >  		}
> >  		pagevec_remove_exceptionals(&pvec);
> >  		pagevec_release(&pvec);
> > -		mem_cgroup_uncharge_end();
> >  		index++;
> >  	}
> >  	cleancache_invalidate_inode(mapping);
> > @@ -493,7 +489,6 @@ unsigned long invalidate_mapping_pages(s
> >  	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
> >  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> >  			indices)) {
> > -		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  
> > @@ -522,7 +517,6 @@ unsigned long invalidate_mapping_pages(s
> >  		}
> >  		pagevec_remove_exceptionals(&pvec);
> >  		pagevec_release(&pvec);
> > -		mem_cgroup_uncharge_end();
> >  		cond_resched();
> >  		index++;
> >  	}
> > @@ -553,7 +547,6 @@ invalidate_complete_page2(struct address
> >  	BUG_ON(page_has_private(page));
> >  	__delete_from_page_cache(page, NULL);
> >  	spin_unlock_irq(&mapping->tree_lock);
> > -	mem_cgroup_uncharge_cache_page(page);
> >  
> >  	if (mapping->a_ops->freepage)
> >  		mapping->a_ops->freepage(page);
> > @@ -602,7 +595,6 @@ int invalidate_inode_pages2_range(struct
> >  	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
> >  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> >  			indices)) {
> > -		mem_cgroup_uncharge_start();
> >  		for (i = 0; i < pagevec_count(&pvec); i++) {
> >  			struct page *page = pvec.pages[i];
> >  
> > @@ -655,7 +647,6 @@ int invalidate_inode_pages2_range(struct
> >  		}
> >  		pagevec_remove_exceptionals(&pvec);
> >  		pagevec_release(&pvec);
> > -		mem_cgroup_uncharge_end();
> >  		cond_resched();
> >  		index++;
> >  	}
> > diff -puN mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api mm/vmscan.c
> > --- a/mm/vmscan.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/vmscan.c
> > @@ -571,9 +571,10 @@ static int __remove_mapping(struct addre
> >  
> >  	if (PageSwapCache(page)) {
> >  		swp_entry_t swap = { .val = page_private(page) };
> > +		mem_cgroup_swapout(page, swap);
> >  		__delete_from_swap_cache(page);
> >  		spin_unlock_irq(&mapping->tree_lock);
> > -		swapcache_free(swap, page);
> > +		swapcache_free(swap);
> >  	} else {
> >  		void (*freepage)(struct page *);
> >  		void *shadow = NULL;
> > @@ -594,7 +595,6 @@ static int __remove_mapping(struct addre
> >  			shadow = workingset_eviction(mapping, page);
> >  		__delete_from_page_cache(page, shadow);
> >  		spin_unlock_irq(&mapping->tree_lock);
> > -		mem_cgroup_uncharge_cache_page(page);
> >  
> >  		if (freepage != NULL)
> >  			freepage(page);
> > @@ -1097,6 +1097,7 @@ static unsigned long shrink_page_list(st
> >  		 */
> >  		__clear_page_locked(page);
> >  free_it:
> > +		mem_cgroup_uncharge(page);
> >  		nr_reclaimed++;
> >  
> >  		/*
> > @@ -1126,12 +1127,13 @@ keep:
> >  		list_add(&page->lru, &ret_pages);
> >  		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
> >  	}
> > +	mem_cgroup_uncharge_end();
> >  
> >  	free_hot_cold_page_list(&free_pages, true);
> >  
> >  	list_splice(&ret_pages, page_list);
> >  	count_vm_events(PGACTIVATE, pgactivate);
> > -	mem_cgroup_uncharge_end();
> > +
> >  	*ret_nr_dirty += nr_dirty;
> >  	*ret_nr_congested += nr_congested;
> >  	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
> > @@ -1429,6 +1431,8 @@ putback_inactive_pages(struct lruvec *lr
> >  			__ClearPageActive(page);
> >  			del_page_from_lru_list(page, lruvec, lru);
> >  
> > +			mem_cgroup_uncharge(page);
> > +
> >  			if (unlikely(PageCompound(page))) {
> >  				spin_unlock_irq(&zone->lru_lock);
> >  				(*get_compound_page_dtor(page))(page);
> > @@ -1650,6 +1654,8 @@ static void move_active_pages_to_lru(str
> >  			__ClearPageActive(page);
> >  			del_page_from_lru_list(page, lruvec, lru);
> >  
> > +			mem_cgroup_uncharge(page);
> > +
> >  			if (unlikely(PageCompound(page))) {
> >  				spin_unlock_irq(&zone->lru_lock);
> >  				(*get_compound_page_dtor(page))(page);
> > diff -puN mm/zswap.c~mm-memcontrol-rewrite-uncharge-api mm/zswap.c
> > --- a/mm/zswap.c~mm-memcontrol-rewrite-uncharge-api
> > +++ a/mm/zswap.c
> > @@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp
> >  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> >  		 * clear SWAP_HAS_CACHE flag.
> >  		 */
> > -		swapcache_free(entry, NULL);
> > +		swapcache_free(entry);
> >  	} while (err != -ENOMEM);
> >  
> >  	if (new_page)
> > _
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
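
To make the new calling convention concrete, here is a minimal, illustrative sketch of a bulk-free path using the entry points from the patch quoted above. The helper name free_page_list_memcg() is invented for this sketch; mem_cgroup_uncharge(), mem_cgroup_uncharge_start()/mem_cgroup_uncharge_end() and free_hot_cold_page_list() are the calls the patch itself uses in release_pages() and shrink_page_list(). The pages are assumed to have already dropped their last reference and to be off the LRU, matching the VM_BUG_ONs in mem_cgroup_uncharge().

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>

/* Illustrative only, not part of the patch. */
static void free_page_list_memcg(struct list_head *pages)
{
	struct page *page;

	/* Coalesce the res_counter updates for the whole batch. */
	mem_cgroup_uncharge_start();
	list_for_each_entry(page, pages, lru) {
		/*
		 * Final teardown: the page has no references left and
		 * is off the LRU, so pc->flags can be cleared without
		 * any page_cgroup locking.
		 */
		mem_cgroup_uncharge(page);
	}
	mem_cgroup_uncharge_end();

	/* Hand the refcount-zero pages back to the page allocator. */
	free_hot_cold_page_list(pages, true);
}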


* Re: mm: memcontrol: rewrite uncharge API
  2014-08-07 16:25         ` Johannes Weiner
@ 2014-08-07 22:40           ` Andrew Morton
  2014-08-07 23:02             ` Johannes Weiner
  2014-08-07 23:18             ` Johannes Weiner
  2014-08-08 11:42           ` Michal Hocko
  1 sibling, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2014-08-07 22:40 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Michal Hocko, linux-mm

On Thu, 7 Aug 2014 12:25:07 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> which gets rid of mem_cgroup_replace_page_cache().
> 
> Could you please update it to say:
> 
> mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> which gets rid of mem_cgroup_replace_page_cache().  However, care
> needs to be taken because both the source and the target page can
> already be charged and on the LRU when fuse is splicing: grab the page
> lock on the charge moving side to prevent changing pc->mem_cgroup of a
> page under migration.  Also, the lruvecs of both pages change as we
> uncharge the old and charge the new during migration, and putback may
> race with us, so grab the lru lock and isolate the pages iff on LRU to
> prevent races and ensure the pages are on the right lruvec afterward.

OK thanks, I did that, separated out
mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch again
and copied the [0/n] changelog text into mm-memcontrol-rewrite-charge-api.patch.

I'll get these (presently at http://ozlabs.org/~akpm/mmots/broken-out/)

mm-memcontrol-rewrite-charge-api.patch
mm-memcontrol-rewrite-uncharge-api.patch
mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch
mm-memcontrol-use-page-lists-for-uncharge-batching.patch
#
page-cgroup-trivial-cleanup.patch
page-cgroup-get-rid-of-nr_pcg_flags.patch
#
#
memcg-remove-lookup_cgroup_page-prototype.patch

on the wire tomorrow.
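
The replace_page_cache() case quoted above can be pictured as a call site like the following rough sketch. The wrapper is hypothetical; only mem_cgroup_migrate() and its documented requirements (both pages locked, newpage->mapping already set up) come from the patch, and lrucare is presumably true here because, per the changelog text, both pages can already be charged and on the LRU while fuse is splicing.

#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical wrapper, for illustration only.  The caller is assumed
 * to have already hooked @newpage into the mapping in place of
 * @oldpage, with both pages locked.
 */
static void replace_cache_page_charge(struct page *oldpage, struct page *newpage)
{
	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);

	/*
	 * lrucare == true: either page may already be on an LRU list,
	 * so the memcg code isolates the pages while the charge and
	 * pc->mem_cgroup are moved over.
	 */
	mem_cgroup_migrate(oldpage, newpage, true);
}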


* Re: mm: memcontrol: rewrite uncharge API
  2014-08-07 22:40           ` Andrew Morton
@ 2014-08-07 23:02             ` Johannes Weiner
  2014-08-07 23:18             ` Johannes Weiner
  1 sibling, 0 replies; 12+ messages in thread
From: Johannes Weiner @ 2014-08-07 23:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm

On Thu, Aug 07, 2014 at 03:40:46PM -0700, Andrew Morton wrote:
> On Thu, 7 Aug 2014 12:25:07 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> > which gets rid of mem_cgroup_replace_page_cache().
> > 
> > Could you please update it to say:
> > 
> > mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> > which gets rid of mem_cgroup_replace_page_cache().  However, care
> > needs to be taken because both the source and the target page can
> > already be charged and on the LRU when fuse is splicing: grab the page
> > lock on the charge moving side to prevent changing pc->mem_cgroup of a
> > page under migration.  Also, the lruvecs of both pages change as we
> > uncharge the old and charge the new during migration, and putback may
> > race with us, so grab the lru lock and isolate the pages iff on LRU to
> > prevent races and ensure the pages are on the right lruvec afterward.
> 
> OK thanks, I did that, separated out
> mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch again
> and copied the [0/n] changelog text into mm-memcontrol-rewrite-charge-api.patch.
> 
> I'll get these (presently at http://ozlabs.org/~akpm/mmots/broken-out/)
> 
> mm-memcontrol-rewrite-charge-api.patch
> mm-memcontrol-rewrite-uncharge-api.patch
> mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch

The updated changelog from mm-memcontrol-rewrite-uncharge-api.patch
now contains all the information from
mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch, so
having the two separate is redundant.

We could go either way with this - revert to the original changelog,
or fold the fix - but since the fix is for kernel crashes and nasty
race conditions I'd prefer folding it for bisectability.

> mm-memcontrol-use-page-lists-for-uncharge-batching.patch
> #
> page-cgroup-trivial-cleanup.patch
> page-cgroup-get-rid-of-nr_pcg_flags.patch
> #
> #
> memcg-remove-lookup_cgroup_page-prototype.patch
> 
> on the wire tomorrow.

Much appreciated, thanks!


* Re: mm: memcontrol: rewrite uncharge API
  2014-08-07 22:40           ` Andrew Morton
  2014-08-07 23:02             ` Johannes Weiner
@ 2014-08-07 23:18             ` Johannes Weiner
  1 sibling, 0 replies; 12+ messages in thread
From: Johannes Weiner @ 2014-08-07 23:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm

On Thu, Aug 07, 2014 at 03:40:46PM -0700, Andrew Morton wrote:
> On Thu, 7 Aug 2014 12:25:07 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> > which gets rid of mem_cgroup_replace_page_cache().
> > 
> > Could you please update it to say:
> > 
> > mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> > which gets rid of mem_cgroup_replace_page_cache().  However, care
> > needs to be taken because both the source and the target page can
> > already be charged and on the LRU when fuse is splicing: grab the page
> > lock on the charge moving side to prevent changing pc->mem_cgroup of a
> > page under migration.  Also, the lruvecs of both pages change as we
> > uncharge the old and charge the new during migration, and putback may
> > race with us, so grab the lru lock and isolate the pages iff on LRU to
> > prevent races and ensure the pages are on the right lruvec afterward.
> 
> OK thanks, I did that, separated out
> mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch again
> and copied the [0/n] changelog text into mm-memcontrol-rewrite-charge-api.patch.
> 
> I'll get these (presently at http://ozlabs.org/~akpm/mmots/broken-out/)
> 
> mm-memcontrol-rewrite-charge-api.patch
> mm-memcontrol-rewrite-uncharge-api.patch
> mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch
> mm-memcontrol-use-page-lists-for-uncharge-batching.patch
> #
> page-cgroup-trivial-cleanup.patch
> page-cgroup-get-rid-of-nr_pcg_flags.patch
> #
> #
> memcg-remove-lookup_cgroup_page-prototype.patch

mm-memcontrol-avoid-charge-statistics-churn-during-page-migration.patch
came a bit later, but it's a small optimization directly relevant to
the above changes, so you may want to send it along with them.

Michal already acked it on the list:
http://marc.info/?l=linux-mm&m=140724569618836&w=2


* Re: mm: memcontrol: rewrite uncharge API
  2014-08-07 16:25         ` Johannes Weiner
  2014-08-07 22:40           ` Andrew Morton
@ 2014-08-08 11:42           ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-08-08 11:42 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm

On Thu 07-08-14 12:25:07, Johannes Weiner wrote:
> On Thu, Aug 07, 2014 at 09:38:26AM +0200, Michal Hocko wrote:
> > On Wed 06-08-14 14:02:35, Andrew Morton wrote:
> > > On Wed, 6 Aug 2014 14:00:55 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> > > 
> > > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > > Subject: mm: memcontrol: rewrite uncharge API
> > > > 
> > > 
> > > Nope, sorry, that was missing
> > > mm-memcontrol-rewrite-uncharge-api-fix-clear-page-mapping-in-migration.patch.
> > > 
> > > This time:
> > > 
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Subject: mm: memcontrol: rewrite uncharge API
> > > 
> > > The memcg uncharging code that is involved towards the end of a page's
> > > lifetime - truncation, reclaim, swapout, migration - is impressively
> > > complicated and fragile.
> > > 
> > > Because anonymous and file pages were always charged before they had their
> > > page->mapping established, uncharges had to happen when the page type
> > > could still be known from the context; as in unmap for anonymous, page
> > > cache removal for file and shmem pages, and swap cache truncation for swap
> > > pages.  However, these operations happen well before the page is actually
> > > freed, and so a lot of synchronization is necessary:
> > > 
> > > - Charging, uncharging, page migration, and charge migration all need
> > >   to take a per-page bit spinlock as they could race with uncharging.
> > > 
> > > - Swap cache truncation happens during both swap-in and swap-out, and
> > >   possibly repeatedly before the page is actually freed.  This means
> > >   that the memcg swapout code is called from many contexts that make
> > >   no sense and it has to figure out the direction from page state to
> > >   make sure memory and memory+swap are always correctly charged.
> > > 
> > > - On page migration, the old page might be unmapped but then reused,
> > >   so memcg code has to prevent untimely uncharging in that case.
> > >   Because this code - which should be a simple charge transfer - is so
> > >   special-cased, it is not reusable for replace_page_cache().
> > > 
> > > But now that charged pages always have a page->mapping, introduce
> > > mem_cgroup_uncharge(), which is called after the final put_page(), when we
> > > know for sure that nobody is looking at the page anymore.
> > > 
> > > For page migration, introduce mem_cgroup_migrate(), which is called after
> > > the migration is successful and the new page is fully rmapped.  Because
> > > the old page is no longer uncharged after migration, prevent double
> > > charges by decoupling the page's memcg association (PCG_USED and
> > > pc->mem_cgroup) from the page holding an actual charge.  The new bits
> > > PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
> > > to the new page during migration.
> > > 
> > > mem_cgroup_migrate() is suitable for replace_page_cache() as well, which
> > > gets rid of mem_cgroup_replace_page_cache().
> > > 
> > > Swap accounting is massively simplified: because the page is no longer
> > > uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
> > > transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
> > > before the final put_page() in page reclaim.
> > > 
> > > Finally, page_cgroup changes are now protected by whatever protection the
> > > page itself offers: anonymous pages are charged under the page table lock,
> > > whereas page cache insertions, swapin, and migration hold the page lock. 
> > > Uncharging happens under full exclusion with no outstanding references. 
> > > Charging and uncharging also ensure that the page is off-LRU, which
> > > serializes against charge migration.  Remove the very costly page_cgroup
> > > lock and set pc->flags non-atomically.
> > 
> > I see some point in squashing all the fixups into the single patch but I
> > am afraid we have lost some interesting details from fix ups this time.
> > I think that at least
> > mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration.patch and
> > mm-memcontrol-rewrite-uncharge-api-fix-page-cache-migration-2.patch
> > would be good to go on their own _or_ their changelogs added here. The
> > whole page cache replace path is obscure and we should rather have that
> > documented so we do not have to google for details or go through painful
> > code inspection next time.
> 
> I agree, we would lose something there.  There is a paragraph in the
> changelog that says:
> 
> mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> which gets rid of mem_cgroup_replace_page_cache().
> 
> Could you please update it to say:
> 
> mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> which gets rid of mem_cgroup_replace_page_cache().  However, care
> needs to be taken because both the source and the target page can
> already be charged and on the LRU when fuse is splicing: grab the page
> lock on the charge moving side to prevent changing pc->mem_cgroup of a
> page under migration.  Also, the lruvecs of both pages change as we
> uncharge the old and charge the new during migration, and putback may
> race with us, so grab the lru lock and isolate the pages iff on LRU to
> prevent races and ensure the pages are on the right lruvec afterward.

Thanks! This is much better.

> > > [vdavydov@parallels.com: fix flags definition]
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > Cc: Vladimir Davydov <vdavydov@parallels.com>
> > > Tested-by: Jet Chen <jet.chen@intel.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > 
> > If this comes from the above change then is should be probably removed.
> > You can replace it by my Acked-by. I have acked all the follow up fixes
> > but forgot to ack the initial patch.
> 
> Thanks!
> 
> > > Tested-by: Felipe Balbi <balbi@ti.com>
> > 
> > this tested-by came from the same preempt_{en,dis}able patch AFAICS.
> 
> Yeah it might be a bit overreaching to apply this to the full change.
> 
> On a different note, Michal, I just scrolled through the 2000 lines
> that follow to see if you had any more comments, but there was only
> your signature at the bottom.  Please think about the quote context
> after you inserted your inline comments and then trim accordingly.

Sure I usually trim emails a lot. Forgot this time, sorry about that!
-- 
Michal Hocko
SUSE Labs
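
As a closing illustration of the swap handoff described in the changelog quoted in this message, here is a sketch modeled on the __remove_mapping() hunk of the patch. The helper name and its comments are editorial; mem_cgroup_swapout(), __delete_from_swap_cache(), swapcache_free() and mem_cgroup_uncharge_swap() are the functions the patch touches. The page is assumed to be locked, off the LRU, and to have had its reference count frozen to zero, as in reclaim, with mapping->tree_lock held by the caller.

#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/* Editorial sketch of the ordering in __remove_mapping(), not kernel code. */
static void swapcache_evict(struct address_space *mapping, struct page *page)
{
	swp_entry_t swap = { .val = page_private(page) };

	/*
	 * Record the owning memcg in the swap_cgroup map and clear
	 * PCG_MEMSW while pc->mem_cgroup still identifies the owner;
	 * the final mem_cgroup_uncharge() will then drop only the
	 * memory charge, and memory+swap stays charged to the entry.
	 */
	mem_cgroup_swapout(page, swap);

	__delete_from_swap_cache(page);
	spin_unlock_irq(&mapping->tree_lock);

	/*
	 * Drop the swapcache reference on the entry.  The memsw charge
	 * transferred above is released later by mem_cgroup_uncharge_swap()
	 * when swap_entry_free() finally frees the slot.
	 */
	swapcache_free(swap);
}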



Thread overview: 12+ messages
2014-08-06 20:59 mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch Andrew Morton
2014-08-06 21:00 ` mm: memcontrol: rewrite charge API Andrew Morton
2014-08-06 21:00   ` mm: memcontrol: rewrite uncharge API Andrew Morton
2014-08-06 21:02     ` Andrew Morton
2014-08-07  7:38       ` Michal Hocko
2014-08-07 16:25         ` Johannes Weiner
2014-08-07 22:40           ` Andrew Morton
2014-08-07 23:02             ` Johannes Weiner
2014-08-07 23:18             ` Johannes Weiner
2014-08-08 11:42           ` Michal Hocko
2014-08-07  2:26 ` mm-memcontrol-rewrite-charge-api.patch and mm-memcontrol-rewrite-uncharge-api.patch Johannes Weiner
2014-08-07  7:44 ` Michal Hocko
