[patch] mm: memcontrol: lockless page counters
From: Johannes Weiner @ 2014-09-19 13:22 UTC
  To: linux-mm; +Cc: Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

Memory is internally accounted in bytes, using spinlock-protected
64-bit counters, even though the smallest accounting delta is a page.
The counter interface is also convoluted and does too many things.

Introduce a new lockless, word-sized page counter API, then change
all memory accounting over to it and remove the old one.  Translation
to and from bytes then happens only when interfacing with userspace.

Aside from the locking costs, this gets rid of the icky unsigned long
long types in the very heart of memcg, which is great for 32-bit and
also makes the code a lot more readable.
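
For illustration, a minimal sketch of how a controller would use the
new API (the my_group structure and hooks here are hypothetical and
not part of this patch; only the page_counter calls are):

	#include <linux/memcontrol.h>	/* struct page_counter et al. */

	struct my_group {
		struct page_counter counter;
	};

	static void my_group_init(struct my_group *group,
				  struct my_group *parent)
	{
		/* charges against a child propagate up to its parent */
		page_counter_init(&group->counter,
				  parent ? &parent->counter : NULL);
	}

	static int my_group_charge(struct my_group *group,
				   unsigned long nr_pages)
	{
		struct page_counter *fail;

		/* fails with -ENOMEM if any counter in the hierarchy
		   would exceed its limit, and reports that counter */
		return page_counter_charge(&group->counter, nr_pages, &fail);
	}

	static void my_group_uncharge(struct my_group *group,
				      unsigned long nr_pages)
	{
		/* uncharges the counter and all of its ancestors */
		page_counter_uncharge(&group->counter, nr_pages);
	}

Userspace-visible files keep their byte units: a limit written in
bytes is parsed with page_counter_memparse() and applied with
page_counter_limit(), and usage reads back as count * PAGE_SIZE.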

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/hugetlb.txt          |   2 +-
 Documentation/cgroups/memory.txt           |   4 +-
 Documentation/cgroups/resource_counter.txt | 197 --------
 include/linux/hugetlb_cgroup.h             |   1 -
 include/linux/memcontrol.h                 |  37 +-
 include/linux/res_counter.h                | 223 ---------
 include/net/sock.h                         |  25 +-
 init/Kconfig                               |   9 +-
 kernel/Makefile                            |   1 -
 kernel/res_counter.c                       | 211 --------
 mm/hugetlb_cgroup.c                        | 100 ++--
 mm/memcontrol.c                            | 740 ++++++++++++++++-------------
 net/ipv4/tcp_memcontrol.c                  |  83 ++--
 13 files changed, 541 insertions(+), 1092 deletions(-)
 delete mode 100644 Documentation/cgroups/resource_counter.txt
 delete mode 100644 include/linux/res_counter.h
 delete mode 100644 kernel/res_counter.c

diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
index a9faaca1f029..106245c3aecc 100644
--- a/Documentation/cgroups/hugetlb.txt
+++ b/Documentation/cgroups/hugetlb.txt
@@ -29,7 +29,7 @@ Brief summary of control files
 
  hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
  hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes     # show current res_counter usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.failcnt		   # show the number of allocation failure due to HugeTLB limit
 
 For a system supporting two hugepage size (16M and 16G) the control
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 02ab997a1ed2..f624727ab404 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -52,9 +52,9 @@ Brief summary of control files.
  tasks				 # attach a task(thread) and show list of threads
  cgroup.procs			 # show list of processes
  cgroup.event_control		 # an interface for event_fd()
- memory.usage_in_bytes		 # show current res_counter usage for memory
+ memory.usage_in_bytes		 # show current usage for memory
 				 (See 5.5 for details)
- memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
+ memory.memsw.usage_in_bytes	 # show current usage for memory+Swap
 				 (See 5.5 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
deleted file mode 100644
index 762ca54eb929..000000000000
--- a/Documentation/cgroups/resource_counter.txt
+++ /dev/null
@@ -1,197 +0,0 @@
-
-		The Resource Counter
-
-The resource counter, declared at include/linux/res_counter.h,
-is supposed to facilitate the resource management by controllers
-by providing common stuff for accounting.
-
-This "stuff" includes the res_counter structure and routines
-to work with it.
-
-
-
-1. Crucial parts of the res_counter structure
-
- a. unsigned long long usage
-
- 	The usage value shows the amount of a resource that is consumed
-	by a group at a given time. The units of measurement should be
-	determined by the controller that uses this counter. E.g. it can
-	be bytes, items or any other unit the controller operates on.
-
- b. unsigned long long max_usage
-
- 	The maximal value of the usage over time.
-
- 	This value is useful when gathering statistical information about
-	the particular group, as it shows the actual resource requirements
-	for a particular group, not just some usage snapshot.
-
- c. unsigned long long limit
-
- 	The maximal allowed amount of resource to consume by the group. In
-	case the group requests for more resources, so that the usage value
-	would exceed the limit, the resource allocation is rejected (see
-	the next section).
-
- d. unsigned long long failcnt
-
- 	The failcnt stands for "failures counter". This is the number of
-	resource allocation attempts that failed.
-
- c. spinlock_t lock
-
- 	Protects changes of the above values.
-
-
-
-2. Basic accounting routines
-
- a. void res_counter_init(struct res_counter *rc,
-				struct res_counter *rc_parent)
-
- 	Initializes the resource counter. As usual, should be the first
-	routine called for a new counter.
-
-	The struct res_counter *parent can be used to define a hierarchical
-	child -> parent relationship directly in the res_counter structure,
-	NULL can be used to define no relationship.
-
- c. int res_counter_charge(struct res_counter *rc, unsigned long val,
-				struct res_counter **limit_fail_at)
-
-	When a resource is about to be allocated it has to be accounted
-	with the appropriate resource counter (controller should determine
-	which one to use on its own). This operation is called "charging".
-
-	This is not very important which operation - resource allocation
-	or charging - is performed first, but
-	  * if the allocation is performed first, this may create a
-	    temporary resource over-usage by the time resource counter is
-	    charged;
-	  * if the charging is performed first, then it should be uncharged
-	    on error path (if the one is called).
-
-	If the charging fails and a hierarchical dependency exists, the
-	limit_fail_at parameter is set to the particular res_counter element
-	where the charging failed.
-
- d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
-
-	When a resource is released (freed) it should be de-accounted
-	from the resource counter it was accounted to.  This is called
-	"uncharging". The return value of this function indicate the amount
-	of charges still present in the counter.
-
-	The _locked routines imply that the res_counter->lock is taken.
-
- e. u64 res_counter_uncharge_until
-		(struct res_counter *rc, struct res_counter *top,
-		 unsigned long val)
-
-	Almost same as res_counter_uncharge() but propagation of uncharge
-	stops when rc == top. This is useful when kill a res_counter in
-	child cgroup.
-
- 2.1 Other accounting routines
-
-    There are more routines that may help you with common needs, like
-    checking whether the limit is reached or resetting the max_usage
-    value. They are all declared in include/linux/res_counter.h.
-
-
-
-3. Analyzing the resource counter registrations
-
- a. If the failcnt value constantly grows, this means that the counter's
-    limit is too tight. Either the group is misbehaving and consumes too
-    many resources, or the configuration is not suitable for the group
-    and the limit should be increased.
-
- b. The max_usage value can be used to quickly tune the group. One may
-    set the limits to maximal values and either load the container with
-    a common pattern or leave one for a while. After this the max_usage
-    value shows the amount of memory the container would require during
-    its common activity.
-
-    Setting the limit a bit above this value gives a pretty good
-    configuration that works in most of the cases.
-
- c. If the max_usage is much less than the limit, but the failcnt value
-    is growing, then the group tries to allocate a big chunk of resource
-    at once.
-
- d. If the max_usage is much less than the limit, but the failcnt value
-    is 0, then this group is given too high limit, that it does not
-    require. It is better to lower the limit a bit leaving more resource
-    for other groups.
-
-
-
-4. Communication with the control groups subsystem (cgroups)
-
-All the resource controllers that are using cgroups and resource counters
-should provide files (in the cgroup filesystem) to work with the resource
-counter fields. They are recommended to adhere to the following rules:
-
- a. File names
-
- 	Field name	File name
-	---------------------------------------------------
-	usage		usage_in_<unit_of_measurement>
-	max_usage	max_usage_in_<unit_of_measurement>
-	limit		limit_in_<unit_of_measurement>
-	failcnt		failcnt
-	lock		no file :)
-
- b. Reading from file should show the corresponding field value in the
-    appropriate format.
-
- c. Writing to file
-
- 	Field		Expected behavior
-	----------------------------------
-	usage		prohibited
-	max_usage	reset to usage
-	limit		set the limit
-	failcnt		reset to zero
-
-
-
-5. Usage example
-
- a. Declare a task group (take a look at cgroups subsystem for this) and
-    fold a res_counter into it
-
-	struct my_group {
-		struct res_counter res;
-
-		<other fields>
-	}
-
- b. Put hooks in resource allocation/release paths
-
- 	int alloc_something(...)
-	{
-		if (res_counter_charge(res_counter_ptr, amount) < 0)
-			return -ENOMEM;
-
-		<allocate the resource and return to the caller>
-	}
-
-	void release_something(...)
-	{
-		res_counter_uncharge(res_counter_ptr, amount);
-
-		<release the resource>
-	}
-
-    In order to keep the usage value self-consistent, both the
-    "res_counter_ptr" and the "amount" in release_something() should be
-    the same as they were in the alloc_something() when the releasing
-    resource was allocated.
-
- c. Provide the way to read res_counter values and set them (the cgroups
-    still can help with it).
-
- c. Compile and run :)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 0129f89cf98d..bcc853eccc85 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -16,7 +16,6 @@
 #define _LINUX_HUGETLB_CGROUP_H
 
 #include <linux/mmdebug.h>
-#include <linux/res_counter.h>
 
 struct hugetlb_cgroup;
 /*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19df5d857411..bf8fb1a05597 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+
+struct page_counter {
+	atomic_long_t count;
+	unsigned long limit;
+	struct page_counter *parent;
+
+	/* legacy */
+	unsigned long watermark;
+	unsigned long limited;
+};
+
+#if BITS_PER_LONG == 32
+#define PAGE_COUNTER_MAX ULONG_MAX
+#else
+#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
+#endif
+
+static inline void page_counter_init(struct page_counter *counter,
+				     struct page_counter *parent)
+{
+	atomic_long_set(&counter->count, 0);
+	counter->limit = PAGE_COUNTER_MAX;
+	counter->parent = parent;
+}
+
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
+			struct page_counter **fail);
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_limit(struct page_counter *counter, unsigned long limit);
+int page_counter_memparse(const char *buf, unsigned long *nr_pages);
+
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -471,9 +503,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 	/*
 	 * __GFP_NOFAIL allocations will move on even if charging is not
 	 * possible. Therefore we don't even try, and have this allocation
-	 * unaccounted. We could in theory charge it with
-	 * res_counter_charge_nofail, but we hope those allocations are rare,
-	 * and won't be worth the trouble.
+	 * unaccounted. We could in theory charge it forcibly, but we hope
+	 * those allocations are rare, and won't be worth the trouble.
 	 */
 	if (gfp & __GFP_NOFAIL)
 		return true;
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
deleted file mode 100644
index 56b7bc32db4f..000000000000
--- a/include/linux/res_counter.h
+++ /dev/null
@@ -1,223 +0,0 @@
-#ifndef __RES_COUNTER_H__
-#define __RES_COUNTER_H__
-
-/*
- * Resource Counters
- * Contain common data types and routines for resource accounting
- *
- * Copyright 2007 OpenVZ SWsoft Inc
- *
- * Author: Pavel Emelianov <xemul@openvz.org>
- *
- * See Documentation/cgroups/resource_counter.txt for more
- * info about what this counter is.
- */
-
-#include <linux/spinlock.h>
-#include <linux/errno.h>
-
-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
-
-struct res_counter {
-	/*
-	 * the current resource consumption level
-	 */
-	unsigned long long usage;
-	/*
-	 * the maximal value of the usage from the counter creation
-	 */
-	unsigned long long max_usage;
-	/*
-	 * the limit that usage cannot exceed
-	 */
-	unsigned long long limit;
-	/*
-	 * the limit that usage can be exceed
-	 */
-	unsigned long long soft_limit;
-	/*
-	 * the number of unsuccessful attempts to consume the resource
-	 */
-	unsigned long long failcnt;
-	/*
-	 * the lock to protect all of the above.
-	 * the routines below consider this to be IRQ-safe
-	 */
-	spinlock_t lock;
-	/*
-	 * Parent counter, used for hierarchial resource accounting
-	 */
-	struct res_counter *parent;
-};
-
-#define RES_COUNTER_MAX ULLONG_MAX
-
-/**
- * Helpers to interact with userspace
- * res_counter_read_u64() - returns the value of the specified member.
- * res_counter_read/_write - put/get the specified fields from the
- * res_counter struct to/from the user
- *
- * @counter:     the counter in question
- * @member:  the field to work with (see RES_xxx below)
- * @buf:     the buffer to opeate on,...
- * @nbytes:  its size...
- * @pos:     and the offset.
- */
-
-u64 res_counter_read_u64(struct res_counter *counter, int member);
-
-ssize_t res_counter_read(struct res_counter *counter, int member,
-		const char __user *buf, size_t nbytes, loff_t *pos,
-		int (*read_strategy)(unsigned long long val, char *s));
-
-int res_counter_memparse_write_strategy(const char *buf,
-					unsigned long long *res);
-
-/*
- * the field descriptors. one for each member of res_counter
- */
-
-enum {
-	RES_USAGE,
-	RES_MAX_USAGE,
-	RES_LIMIT,
-	RES_FAILCNT,
-	RES_SOFT_LIMIT,
-};
-
-/*
- * helpers for accounting
- */
-
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
-
-/*
- * charge - try to consume more resource.
- *
- * @counter: the counter
- * @val: the amount of the resource. each controller defines its own
- *       units, e.g. numbers, bytes, Kbytes, etc
- *
- * returns 0 on success and <0 if the counter->usage will exceed the
- * counter->limit
- *
- * charge_nofail works the same, except that it charges the resource
- * counter unconditionally, and returns < 0 if the after the current
- * charge we are over limit.
- */
-
-int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
-int res_counter_charge_nofail(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
-
-/*
- * uncharge - tell that some portion of the resource is released
- *
- * @counter: the counter
- * @val: the amount of the resource
- *
- * these calls check for usage underflow and show a warning on the console
- *
- * returns the total charges still present in @counter.
- */
-
-u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
-
-u64 res_counter_uncharge_until(struct res_counter *counter,
-			       struct res_counter *top,
-			       unsigned long val);
-/**
- * res_counter_margin - calculate chargeable space of a counter
- * @cnt: the counter
- *
- * Returns the difference between the hard limit and the current usage
- * of resource counter @cnt.
- */
-static inline unsigned long long res_counter_margin(struct res_counter *cnt)
-{
-	unsigned long long margin;
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->limit > cnt->usage)
-		margin = cnt->limit - cnt->usage;
-	else
-		margin = 0;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return margin;
-}
-
-/**
- * Get the difference between the usage and the soft limit
- * @cnt: The counter
- *
- * Returns 0 if usage is less than or equal to soft limit
- * The difference between usage and soft limit, otherwise.
- */
-static inline unsigned long long
-res_counter_soft_limit_excess(struct res_counter *cnt)
-{
-	unsigned long long excess;
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= cnt->soft_limit)
-		excess = 0;
-	else
-		excess = cnt->usage - cnt->soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return excess;
-}
-
-static inline void res_counter_reset_max(struct res_counter *cnt)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->max_usage = cnt->usage;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-}
-
-static inline void res_counter_reset_failcnt(struct res_counter *cnt)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->failcnt = 0;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-}
-
-static inline int res_counter_set_limit(struct res_counter *cnt,
-		unsigned long long limit)
-{
-	unsigned long flags;
-	int ret = -EBUSY;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= limit) {
-		cnt->limit = limit;
-		ret = 0;
-	}
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return ret;
-}
-
-static inline int
-res_counter_set_soft_limit(struct res_counter *cnt,
-				unsigned long long soft_limit)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->soft_limit = soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return 0;
-}
-
-#endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 515a4d01e932..f41749982668 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -55,7 +55,6 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/memcontrol.h>
-#include <linux/res_counter.h>
 #include <linux/static_key.h>
 #include <linux/aio.h>
 #include <linux/sched.h>
@@ -1066,7 +1065,7 @@ enum cg_proto_flags {
 };
 
 struct cg_proto {
-	struct res_counter	memory_allocated;	/* Current allocated memory. */
+	struct page_counter	memory_allocated;	/* Current allocated memory. */
 	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
 	int			memory_pressure;
 	long			sysctl_mem[3];
@@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
 					      unsigned long amt,
 					      int *parent_status)
 {
-	struct res_counter *fail;
-	int ret;
+	page_counter_charge(&prot->memory_allocated, amt, NULL);
 
-	ret = res_counter_charge_nofail(&prot->memory_allocated,
-					amt << PAGE_SHIFT, &fail);
-	if (ret < 0)
+	if (atomic_long_read(&prot->memory_allocated.count) >
+	    prot->memory_allocated.limit)
 		*parent_status = OVER_LIMIT;
 }
 
 static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
 					      unsigned long amt)
 {
-	res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT);
-}
-
-static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
-{
-	u64 ret;
-	ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
-	return ret >> PAGE_SHIFT;
+	page_counter_uncharge(&prot->memory_allocated, amt);
 }
 
 static inline long
 sk_memory_allocated(const struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
+
 	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return memcg_memory_allocated_read(sk->sk_cgrp);
+		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
 
 	return atomic_long_read(prot->memory_allocated);
 }
@@ -1259,7 +1250,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
 		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
 		/* update the root cgroup regardless */
 		atomic_long_add_return(amt, prot->memory_allocated);
-		return memcg_memory_allocated_read(sk->sk_cgrp);
+		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
 	}
 
 	return atomic_long_add_return(amt, prot->memory_allocated);
diff --git a/init/Kconfig b/init/Kconfig
index 0471be99ec38..1cf42b563834 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -975,15 +975,8 @@ config CGROUP_CPUACCT
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
 
-config RESOURCE_COUNTERS
-	bool "Resource counters"
-	help
-	  This option enables controller independent resource accounting
-	  infrastructure that works with cgroups.
-
 config MEMCG
 	bool "Memory Resource Controller for Control Groups"
-	depends on RESOURCE_COUNTERS
 	select EVENTFD
 	help
 	  Provides a memory resource controller that manages both anonymous
@@ -1051,7 +1044,7 @@ config MEMCG_KMEM
 
 config CGROUP_HUGETLB
 	bool "HugeTLB Resource Controller for Control Groups"
-	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
+	depends on MEMCG && HUGETLB_PAGE
 	default n
 	help
 	  Provides a cgroup Resource Controller for HugeTLB pages.
diff --git a/kernel/Makefile b/kernel/Makefile
index 726e18443da0..245953354974 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -58,7 +58,6 @@ obj-$(CONFIG_USER_NS) += user_namespace.o
 obj-$(CONFIG_PID_NS) += pid_namespace.o
 obj-$(CONFIG_DEBUG_SYNCHRO_TEST) += synchro-test.o
 obj-$(CONFIG_IKCONFIG) += configs.o
-obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
 obj-$(CONFIG_SMP) += stop_machine.o
 obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
 obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
deleted file mode 100644
index e791130f85a7..000000000000
--- a/kernel/res_counter.c
+++ /dev/null
@@ -1,211 +0,0 @@
-/*
- * resource cgroups
- *
- * Copyright 2007 OpenVZ SWsoft Inc
- *
- * Author: Pavel Emelianov <xemul@openvz.org>
- *
- */
-
-#include <linux/types.h>
-#include <linux/parser.h>
-#include <linux/fs.h>
-#include <linux/res_counter.h>
-#include <linux/uaccess.h>
-#include <linux/mm.h>
-
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
-{
-	spin_lock_init(&counter->lock);
-	counter->limit = RES_COUNTER_MAX;
-	counter->soft_limit = RES_COUNTER_MAX;
-	counter->parent = parent;
-}
-
-static u64 res_counter_uncharge_locked(struct res_counter *counter,
-				       unsigned long val)
-{
-	if (WARN_ON(counter->usage < val))
-		val = counter->usage;
-
-	counter->usage -= val;
-	return counter->usage;
-}
-
-static int res_counter_charge_locked(struct res_counter *counter,
-				     unsigned long val, bool force)
-{
-	int ret = 0;
-
-	if (counter->usage + val > counter->limit) {
-		counter->failcnt++;
-		ret = -ENOMEM;
-		if (!force)
-			return ret;
-	}
-
-	counter->usage += val;
-	if (counter->usage > counter->max_usage)
-		counter->max_usage = counter->usage;
-	return ret;
-}
-
-static int __res_counter_charge(struct res_counter *counter, unsigned long val,
-				struct res_counter **limit_fail_at, bool force)
-{
-	int ret, r;
-	unsigned long flags;
-	struct res_counter *c, *u;
-
-	r = ret = 0;
-	*limit_fail_at = NULL;
-	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
-		spin_lock(&c->lock);
-		r = res_counter_charge_locked(c, val, force);
-		spin_unlock(&c->lock);
-		if (r < 0 && !ret) {
-			ret = r;
-			*limit_fail_at = c;
-			if (!force)
-				break;
-		}
-	}
-
-	if (ret < 0 && !force) {
-		for (u = counter; u != c; u = u->parent) {
-			spin_lock(&u->lock);
-			res_counter_uncharge_locked(u, val);
-			spin_unlock(&u->lock);
-		}
-	}
-	local_irq_restore(flags);
-
-	return ret;
-}
-
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
-{
-	return __res_counter_charge(counter, val, limit_fail_at, false);
-}
-
-int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
-			      struct res_counter **limit_fail_at)
-{
-	return __res_counter_charge(counter, val, limit_fail_at, true);
-}
-
-u64 res_counter_uncharge_until(struct res_counter *counter,
-			       struct res_counter *top,
-			       unsigned long val)
-{
-	unsigned long flags;
-	struct res_counter *c;
-	u64 ret = 0;
-
-	local_irq_save(flags);
-	for (c = counter; c != top; c = c->parent) {
-		u64 r;
-		spin_lock(&c->lock);
-		r = res_counter_uncharge_locked(c, val);
-		if (c == counter)
-			ret = r;
-		spin_unlock(&c->lock);
-	}
-	local_irq_restore(flags);
-	return ret;
-}
-
-u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
-{
-	return res_counter_uncharge_until(counter, NULL, val);
-}
-
-static inline unsigned long long *
-res_counter_member(struct res_counter *counter, int member)
-{
-	switch (member) {
-	case RES_USAGE:
-		return &counter->usage;
-	case RES_MAX_USAGE:
-		return &counter->max_usage;
-	case RES_LIMIT:
-		return &counter->limit;
-	case RES_FAILCNT:
-		return &counter->failcnt;
-	case RES_SOFT_LIMIT:
-		return &counter->soft_limit;
-	};
-
-	BUG();
-	return NULL;
-}
-
-ssize_t res_counter_read(struct res_counter *counter, int member,
-		const char __user *userbuf, size_t nbytes, loff_t *pos,
-		int (*read_strategy)(unsigned long long val, char *st_buf))
-{
-	unsigned long long *val;
-	char buf[64], *s;
-
-	s = buf;
-	val = res_counter_member(counter, member);
-	if (read_strategy)
-		s += read_strategy(*val, s);
-	else
-		s += sprintf(s, "%llu\n", *val);
-	return simple_read_from_buffer((void __user *)userbuf, nbytes,
-			pos, buf, s - buf);
-}
-
-#if BITS_PER_LONG == 32
-u64 res_counter_read_u64(struct res_counter *counter, int member)
-{
-	unsigned long flags;
-	u64 ret;
-
-	spin_lock_irqsave(&counter->lock, flags);
-	ret = *res_counter_member(counter, member);
-	spin_unlock_irqrestore(&counter->lock, flags);
-
-	return ret;
-}
-#else
-u64 res_counter_read_u64(struct res_counter *counter, int member)
-{
-	return *res_counter_member(counter, member);
-}
-#endif
-
-int res_counter_memparse_write_strategy(const char *buf,
-					unsigned long long *resp)
-{
-	char *end;
-	unsigned long long res;
-
-	/* return RES_COUNTER_MAX(unlimited) if "-1" is specified */
-	if (*buf == '-') {
-		int rc = kstrtoull(buf + 1, 10, &res);
-
-		if (rc)
-			return rc;
-		if (res != 1)
-			return -EINVAL;
-		*resp = RES_COUNTER_MAX;
-		return 0;
-	}
-
-	res = memparse(buf, &end);
-	if (*end != '\0')
-		return -EINVAL;
-
-	if (PAGE_ALIGN(res) >= res)
-		res = PAGE_ALIGN(res);
-	else
-		res = RES_COUNTER_MAX;
-
-	*resp = res;
-
-	return 0;
-}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a67c26e0f360..e619b6b62f1f 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -14,6 +14,7 @@
  */
 
 #include <linux/cgroup.h>
+#include <linux/memcontrol.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
@@ -23,7 +24,7 @@ struct hugetlb_cgroup {
 	/*
 	 * the counter to account for hugepages from hugetlb.
 	 */
-	struct res_counter hugepage[HUGE_MAX_HSTATE];
+	struct page_counter hugepage[HUGE_MAX_HSTATE];
 };
 
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -60,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 	int idx;
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-		if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
+		if (atomic_long_read(&h_cg->hugepage[idx].count))
 			return true;
 	}
 	return false;
@@ -79,12 +80,12 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	if (parent_h_cgroup) {
 		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
-			res_counter_init(&h_cgroup->hugepage[idx],
-					 &parent_h_cgroup->hugepage[idx]);
+			page_counter_init(&h_cgroup->hugepage[idx],
+					  &parent_h_cgroup->hugepage[idx]);
 	} else {
 		root_h_cgroup = h_cgroup;
 		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
-			res_counter_init(&h_cgroup->hugepage[idx], NULL);
+			page_counter_init(&h_cgroup->hugepage[idx], NULL);
 	}
 	return &h_cgroup->css;
 }
@@ -108,9 +109,8 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
 static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
 				       struct page *page)
 {
-	int csize;
-	struct res_counter *counter;
-	struct res_counter *fail_res;
+	unsigned int nr_pages;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *page_hcg;
 	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
 
@@ -123,15 +123,15 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
 	if (!page_hcg || page_hcg != h_cg)
 		goto out;
 
-	csize = PAGE_SIZE << compound_order(page);
+	nr_pages = 1 << compound_order(page);
 	if (!parent) {
 		parent = root_h_cgroup;
 		/* root has no limit */
-		res_counter_charge_nofail(&parent->hugepage[idx],
-					  csize, &fail_res);
+		page_counter_charge(&parent->hugepage[idx], nr_pages, NULL);
 	}
 	counter = &h_cg->hugepage[idx];
-	res_counter_uncharge_until(counter, counter->parent, csize);
+	/* Take the pages off the local counter */
+	page_counter_cancel(counter, nr_pages);
 
 	set_hugetlb_cgroup(page, parent);
 out:
@@ -166,9 +166,8 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
 				 struct hugetlb_cgroup **ptr)
 {
 	int ret = 0;
-	struct res_counter *fail_res;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *h_cg = NULL;
-	unsigned long csize = nr_pages * PAGE_SIZE;
 
 	if (hugetlb_cgroup_disabled())
 		goto done;
@@ -187,7 +186,7 @@ again:
 	}
 	rcu_read_unlock();
 
-	ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
+	ret = page_counter_charge(&h_cg->hugepage[idx], nr_pages, &counter);
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
@@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 				  struct page *page)
 {
 	struct hugetlb_cgroup *h_cg;
-	unsigned long csize = nr_pages * PAGE_SIZE;
 
 	if (hugetlb_cgroup_disabled())
 		return;
@@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 	if (unlikely(!h_cg))
 		return;
 	set_hugetlb_cgroup(page, NULL);
-	res_counter_uncharge(&h_cg->hugepage[idx], csize);
+	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
 	return;
 }
 
 void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 				    struct hugetlb_cgroup *h_cg)
 {
-	unsigned long csize = nr_pages * PAGE_SIZE;
-
 	if (hugetlb_cgroup_disabled() || !h_cg)
 		return;
 
 	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
 		return;
 
-	res_counter_uncharge(&h_cg->hugepage[idx], csize);
+	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
 	return;
 }
 
+enum {
+	RES_USAGE,
+	RES_LIMIT,
+	RES_MAX_USAGE,
+	RES_FAILCNT,
+};
+
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
-	int idx, name;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 
-	idx = MEMFILE_IDX(cft->private);
-	name = MEMFILE_ATTR(cft->private);
+	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
 
-	return res_counter_read_u64(&h_cg->hugepage[idx], name);
+	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_USAGE:
+		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+	case RES_LIMIT:
+		return (u64)counter->limit * PAGE_SIZE;
+	case RES_MAX_USAGE:
+		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_FAILCNT:
+		return counter->limited;
+	default:
+		BUG();
+	}
 }
 
 static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
-	int idx, name, ret;
-	unsigned long long val;
+	int ret, idx;
+	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
 
+	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
+		return -EINVAL;
+
 	buf = strstrip(buf);
+	ret = page_counter_memparse(buf, &nr_pages);
+	if (ret)
+		return ret;
+
 	idx = MEMFILE_IDX(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_LIMIT:
-		if (hugetlb_cgroup_is_root(h_cg)) {
-			/* Can't set limit on root */
-			ret = -EINVAL;
-			break;
-		}
-		/* This function does all necessary parse...reuse it */
-		ret = res_counter_memparse_write_strategy(buf, &val);
-		if (ret)
-			break;
-		val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
-		ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
+		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));
+		ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
 		break;
 	default:
 		ret = -EINVAL;
@@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
-	int idx, name, ret = 0;
+	int ret = 0;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
 
-	idx = MEMFILE_IDX(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
+	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		res_counter_reset_max(&h_cg->hugepage[idx]);
+		counter->watermark = atomic_long_read(&counter->count);
 		break;
 	case RES_FAILCNT:
-		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
+		counter->limited = 0;
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2def11f1ec1..dfd3b15a57e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,7 +25,6 @@
  * GNU General Public License for more details.
  */
 
-#include <linux/res_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
@@ -66,6 +65,117 @@
 
 #include <trace/events/vmscan.h>
 
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+{
+	long new;
+
+	new = atomic_long_sub_return(nr_pages, &counter->count);
+
+	if (WARN_ON(unlikely(new < 0)))
+		atomic_long_set(&counter->count, 0);
+
+	return new > 1;
+}
+
+int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
+			struct page_counter **fail)
+{
+	struct page_counter *c;
+
+	for (c = counter; c; c = c->parent) {
+		for (;;) {
+			unsigned long count;
+			unsigned long new;
+
+			count = atomic_long_read(&c->count);
+
+			new = count + nr_pages;
+			if (new > c->limit) {
+				c->limited++;
+				if (fail) {
+					*fail = c;
+					goto failed;
+				}
+			}
+
+			if (atomic_long_cmpxchg(&c->count, count, new) != count)
+				continue;
+
+			if (new > c->watermark)
+				c->watermark = new;
+
+			break;
+		}
+	}
+	return 0;
+
+failed:
+	for (c = counter; c != *fail; c = c->parent)
+		page_counter_cancel(c, nr_pages);
+
+	return -ENOMEM;
+}
+
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
+{
+	struct page_counter *c;
+	int ret = 1;
+
+	for (c = counter; c; c = c->parent) {
+		int remainder;
+
+		remainder = page_counter_cancel(c, nr_pages);
+		if (c == counter && !remainder)
+			ret = 0;
+	}
+
+	return ret;
+}
+
+int page_counter_limit(struct page_counter *counter, unsigned long limit)
+{
+	for (;;) {
+		unsigned long count;
+		unsigned long old;
+
+		count = atomic_long_read(&counter->count);
+
+		old = xchg(&counter->limit, limit);
+
+		if (atomic_long_read(&counter->count) != count) {
+			counter->limit = old;
+			continue;
+		}
+
+		if (count > limit) {
+			counter->limit = old;
+			return -EBUSY;
+		}
+
+		return 0;
+	}
+}
+
+int page_counter_memparse(const char *buf, unsigned long *nr_pages)
+{
+	char unlimited[] = "-1";
+	char *end;
+	u64 bytes;
+
+	if (!strncmp(buf, unlimited, sizeof(unlimited))) {
+		*nr_pages = PAGE_COUNTER_MAX;
+		return 0;
+	}
+
+	bytes = memparse(buf, &end);
+	if (*end != '\0')
+		return -EINVAL;
+
+	*nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
+
+	return 0;
+}
+
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
@@ -165,7 +275,7 @@ struct mem_cgroup_per_zone {
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
 
 	struct rb_node		tree_node;	/* RB tree node */
-	unsigned long long	usage_in_excess;/* Set to the value by which */
+	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
 	bool			on_tree;
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
@@ -198,7 +308,7 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
 
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
-	u64 threshold;
+	unsigned long threshold;
 };
 
 /* For threshold */
@@ -284,24 +394,18 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
  */
 struct mem_cgroup {
 	struct cgroup_subsys_state css;
-	/*
-	 * the counter to account for memory usage
-	 */
-	struct res_counter res;
+
+	/* Accounted resources */
+	struct page_counter memory;
+	struct page_counter memsw;
+	struct page_counter kmem;
+
+	unsigned long soft_limit;
 
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
 	/*
-	 * the counter to account for mem+swap usage.
-	 */
-	struct res_counter memsw;
-
-	/*
-	 * the counter to account for kernel memory usage.
-	 */
-	struct res_counter kmem;
-	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
@@ -647,7 +751,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 	 * This check can't live in kmem destruction function,
 	 * since the charges will outlive the cgroup
 	 */
-	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
+	WARN_ON(atomic_long_read(&memcg->kmem.count));
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -703,7 +807,7 @@ soft_limit_tree_from_page(struct page *page)
 
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 					 struct mem_cgroup_tree_per_zone *mctz,
-					 unsigned long long new_usage_in_excess)
+					 unsigned long new_usage_in_excess)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
@@ -752,10 +856,21 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
+static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+{
+	unsigned long nr_pages = atomic_long_read(&memcg->memory.count);
+	unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
+	unsigned long excess = 0;
+
+	if (nr_pages > soft_limit)
+		excess = nr_pages - soft_limit;
+
+	return excess;
+}
 
 static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 {
-	unsigned long long excess;
+	unsigned long excess;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_tree_per_zone *mctz;
 
@@ -766,7 +881,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		mz = mem_cgroup_page_zoneinfo(memcg, page);
-		excess = res_counter_soft_limit_excess(&memcg->res);
+		excess = soft_limit_excess(memcg);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
@@ -822,7 +937,7 @@ retry:
 	 * position in the tree.
 	 */
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
+	if (!soft_limit_excess(mz->memcg) ||
 	    !css_tryget_online(&mz->memcg->css))
 		goto retry;
 done:
@@ -1478,7 +1593,7 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 	return inactive * inactive_ratio < active;
 }
 
-#define mem_cgroup_from_res_counter(counter, member)	\
+#define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
 /**
@@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
  */
 static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
-	unsigned long long margin;
+	unsigned long margin = 0;
+	unsigned long count;
+	unsigned long limit;
 
-	margin = res_counter_margin(&memcg->res);
-	if (do_swap_account)
-		margin = min(margin, res_counter_margin(&memcg->memsw));
-	return margin >> PAGE_SHIFT;
+	count = atomic_long_read(&memcg->memory.count);
+	limit = ACCESS_ONCE(memcg->memory.limit);
+	if (count < limit)
+		margin = limit - count;
+
+	if (do_swap_account) {
+		count = atomic_long_read(&memcg->memsw.count);
+		limit = ACCESS_ONCE(memcg->memsw.limit);
+		if (count < limit)
+			margin = min(margin, limit - count);
+	}
+
+	return margin;
 }
 
 int mem_cgroup_swappiness(struct mem_cgroup *memcg)
@@ -1636,18 +1762,15 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 
 	rcu_read_unlock();
 
-	pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
-		res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
-		res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
-		res_counter_read_u64(&memcg->res, RES_FAILCNT));
-	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
-		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
-		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
-		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
-	pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n",
-		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
-		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
-		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
+	pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
+		K((u64)atomic_long_read(&memcg->memory.count)),
+		K((u64)memcg->memory.limit), memcg->memory.limited);
+	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
+		K((u64)atomic_long_read(&memcg->memsw.count)),
+		K((u64)memcg->memsw.limit), memcg->memsw.limited);
+	pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
+		K((u64)atomic_long_read(&memcg->kmem.count)),
+		K((u64)memcg->kmem.limit), memcg->kmem.limited);
 
 	for_each_mem_cgroup_tree(iter, memcg) {
 		pr_info("Memory cgroup stats for ");
@@ -1685,30 +1808,19 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
 }
 
 /*
- * Return the memory (and swap, if configured) limit for a memcg.
+ * Return the memory (and swap, if configured) maximum consumption for a memcg.
  */
-static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
 {
-	u64 limit;
+	unsigned long limit;
 
-	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-
-	/*
-	 * Do not consider swap space if we cannot swap due to swappiness
-	 */
+	limit = memcg->memory.limit;
 	if (mem_cgroup_swappiness(memcg)) {
-		u64 memsw;
+		unsigned long memsw_limit;
 
-		limit += total_swap_pages << PAGE_SHIFT;
-		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-
-		/*
-		 * If memsw is finite and limits the amount of swap space
-		 * available to this memcg, return that limit.
-		 */
-		limit = min(limit, memsw);
+		memsw_limit = memcg->memsw.limit;
+		limit = min(limit + total_swap_pages, memsw_limit);
 	}
-
 	return limit;
 }
 
@@ -1732,7 +1844,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	}
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
-	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -1935,7 +2047,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		.priority = 0,
 	};
 
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
+	excess = soft_limit_excess(root_memcg);
 
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1966,7 +2078,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
 						     zone, &nr_scanned);
 		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
+		if (!soft_limit_excess(root_memcg))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -2293,33 +2405,31 @@ static DEFINE_MUTEX(percpu_charge_mutex);
 static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	struct memcg_stock_pcp *stock;
-	bool ret = true;
+	bool ret = false;
 
 	if (nr_pages > CHARGE_BATCH)
-		return false;
+		return ret;
 
 	stock = &get_cpu_var(memcg_stock);
-	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
 		stock->nr_pages -= nr_pages;
-	else /* need to call res_counter_charge */
-		ret = false;
+		ret = true;
+	}
 	put_cpu_var(memcg_stock);
 	return ret;
 }
 
 /*
- * Returns stocks cached in percpu to res_counter and reset cached information.
+ * Returns stocks cached in percpu and reset cached information.
  */
 static void drain_stock(struct memcg_stock_pcp *stock)
 {
 	struct mem_cgroup *old = stock->cached;
 
 	if (stock->nr_pages) {
-		unsigned long bytes = stock->nr_pages * PAGE_SIZE;
-
-		res_counter_uncharge(&old->res, bytes);
+		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_swap_account)
-			res_counter_uncharge(&old->memsw, bytes);
+			page_counter_uncharge(&old->memsw, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
 	stock->cached = NULL;
@@ -2348,7 +2458,7 @@ static void __init memcg_stock_init(void)
 }
 
 /*
- * Cache charges(val) which is from res_counter, to local per_cpu area.
+ * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2408,8 +2518,7 @@ out:
 /*
  * Tries to drain stocked charges in other cpus. This function is asynchronous
  * and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back to res_counter later but cannot wait for
- * it.
+ * expects some charges will be back later but cannot wait for it.
  */
 static void drain_all_stock_async(struct mem_cgroup *root_memcg)
 {
@@ -2483,9 +2592,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
-	struct res_counter *fail_res;
+	struct page_counter *counter;
 	unsigned long nr_reclaimed;
-	unsigned long long size;
 	bool may_swap = true;
 	bool drained = false;
 	int ret = 0;
@@ -2496,17 +2604,16 @@ retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
 
-	size = batch * PAGE_SIZE;
-	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+	if (!page_counter_charge(&memcg->memory, batch, &counter)) {
 		if (!do_swap_account)
 			goto done_restock;
-		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+		if (!page_counter_charge(&memcg->memsw, batch, &counter))
 			goto done_restock;
-		res_counter_uncharge(&memcg->res, size);
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+		page_counter_uncharge(&memcg->memory, batch);
+		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
 		may_swap = false;
 	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+		mem_over_limit = mem_cgroup_from_counter(counter, memory);
 
 	if (batch > nr_pages) {
 		batch = nr_pages;
@@ -2587,32 +2694,12 @@ done:
 
 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	unsigned long bytes = nr_pages * PAGE_SIZE;
-
 	if (mem_cgroup_is_root(memcg))
 		return;
 
-	res_counter_uncharge(&memcg->res, bytes);
+	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_swap_account)
-		res_counter_uncharge(&memcg->memsw, bytes);
-}
-
-/*
- * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
- * This is useful when moving usage to parent cgroup.
- */
-static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
-					unsigned int nr_pages)
-{
-	unsigned long bytes = nr_pages * PAGE_SIZE;
-
-	if (mem_cgroup_is_root(memcg))
-		return;
-
-	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
-	if (do_swap_account)
-		res_counter_uncharge_until(&memcg->memsw,
-						memcg->memsw.parent, bytes);
+		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 
 /*
@@ -2736,8 +2823,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		unlock_page_lru(page, isolated);
 }
 
-static DEFINE_MUTEX(set_limit_mutex);
-
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * The memcg_slab_mutex is held whenever a per memcg kmem cache is created or
@@ -2786,16 +2871,17 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 }
 #endif
 
-static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
+			     unsigned long nr_pages)
 {
-	struct res_counter *fail_res;
+	struct page_counter *counter;
 	int ret = 0;
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
-	if (ret)
+	ret = page_counter_charge(&memcg->kmem, nr_pages, &counter);
+	if (ret < 0)
 		return ret;
 
-	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
+	ret = try_charge(memcg, gfp, nr_pages);
 	if (ret == -EINTR)  {
 		/*
 		 * try_charge() chose to bypass to root due to OOM kill or
@@ -2812,25 +2898,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
-		res_counter_charge_nofail(&memcg->res, size, &fail_res);
+		page_counter_charge(&memcg->memory, nr_pages, NULL);
 		if (do_swap_account)
-			res_counter_charge_nofail(&memcg->memsw, size,
-						  &fail_res);
+			page_counter_charge(&memcg->memsw, nr_pages, NULL);
 		ret = 0;
 	} else if (ret)
-		res_counter_uncharge(&memcg->kmem, size);
+		page_counter_uncharge(&memcg->kmem, nr_pages);
 
 	return ret;
 }
 
-static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
+				unsigned long nr_pages)
 {
-	res_counter_uncharge(&memcg->res, size);
+	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_swap_account)
-		res_counter_uncharge(&memcg->memsw, size);
+		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	/* Not down to 0 */
-	if (res_counter_uncharge(&memcg->kmem, size))
+	if (page_counter_uncharge(&memcg->kmem, nr_pages))
 		return;
 
 	/*
@@ -3107,19 +3193,21 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,
 
 int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
 {
+	unsigned int nr_pages = 1 << order;
 	int res;
 
-	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
-				PAGE_SIZE << order);
+	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages);
 	if (!res)
-		atomic_add(1 << order, &cachep->memcg_params->nr_pages);
+		atomic_add(nr_pages, &cachep->memcg_params->nr_pages);
 	return res;
 }
 
 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
 {
-	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
-	atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
+	unsigned int nr_pages = 1 << order;
+
+	memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages);
+	atomic_sub(nr_pages, &cachep->memcg_params->nr_pages);
 }
 
 /*
@@ -3240,7 +3328,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
 		return true;
 	}
 
-	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
+	ret = memcg_charge_kmem(memcg, gfp, 1 << order);
 	if (!ret)
 		*_memcg = memcg;
 
@@ -3257,7 +3345,7 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
 
 	/* The page allocation failed. Revert */
 	if (!page) {
-		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+		memcg_uncharge_kmem(memcg, 1 << order);
 		return;
 	}
 	/*
@@ -3290,7 +3378,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 		return;
 
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+	memcg_uncharge_kmem(memcg, 1 << order);
 }
 #else
 static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
@@ -3468,8 +3556,12 @@ static int mem_cgroup_move_parent(struct page *page,
 
 	ret = mem_cgroup_move_account(page, nr_pages,
 				pc, child, parent);
-	if (!ret)
-		__mem_cgroup_cancel_local_charge(child, nr_pages);
+	if (!ret) {
+		/* Take charge off the local counters */
+		page_counter_cancel(&child->memory, nr_pages);
+		if (do_swap_account)
+			page_counter_cancel(&child->memsw, nr_pages);
+	}
 
 	if (nr_pages > 1)
 		compound_unlock_irqrestore(page, flags);
@@ -3499,7 +3591,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
  *
  * Returns 0 on success, -EINVAL on failure.
  *
- * The caller must have charged to @to, IOW, called res_counter_charge() about
+ * The caller must have charged to @to, IOW, called page_counter_charge() about
  * both res and memsw, and called css_get().
  */
 static int mem_cgroup_move_swap_account(swp_entry_t entry,
@@ -3515,7 +3607,7 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 		mem_cgroup_swap_statistics(to, true);
 		/*
 		 * This function is only called from task migration context now.
-		 * It postpones res_counter and refcount handling till the end
+		 * It postpones page_counter and refcount handling till the end
 		 * of task migration(mem_cgroup_clear_mc()) for performance
 		 * improvement. But we cannot postpone css_get(to)  because if
 		 * the process that has been moved to @to does swap-in, the
@@ -3573,49 +3665,42 @@ void mem_cgroup_print_bad_page(struct page *page)
 }
 #endif
 
+static DEFINE_MUTEX(set_limit_mutex);
+
 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
-				unsigned long long val)
+				   unsigned long limit)
 {
+	unsigned long curusage;
+	unsigned long oldusage;
+	bool enlarge = false;
 	int retry_count;
-	u64 memswlimit, memlimit;
-	int ret = 0;
-	int children = mem_cgroup_count_children(memcg);
-	u64 curusage, oldusage;
-	int enlarge;
+	int ret;
 
 	/*
 	 * For keeping hierarchical_reclaim simple, how long we should retry
 	 * is depends on callers. We set our retry-count to be function
 	 * of # of children which we should visit in this loop.
 	 */
-	retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
+	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+		      mem_cgroup_count_children(memcg);
 
-	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	oldusage = atomic_long_read(&memcg->memory.count);
 
-	enlarge = 0;
-	while (retry_count) {
+	do {
 		if (signal_pending(current)) {
 			ret = -EINTR;
 			break;
 		}
-		/*
-		 * Rather than hide all in some function, I do this in
-		 * open coded manner. You see what this really does.
-		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
-		 */
+
 		mutex_lock(&set_limit_mutex);
-		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-		if (memswlimit < val) {
-			ret = -EINVAL;
+		if (limit > memcg->memsw.limit) {
 			mutex_unlock(&set_limit_mutex);
+			ret = -EINVAL;
 			break;
 		}
-
-		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-		if (memlimit < val)
-			enlarge = 1;
-
-		ret = res_counter_set_limit(&memcg->res, val);
+		if (limit > memcg->memory.limit)
+			enlarge = true;
+		ret = page_counter_limit(&memcg->memory, limit);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3623,13 +3708,14 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
 
-		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+		curusage = atomic_long_read(&memcg->memory.count);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
 		else
 			oldusage = curusage;
-	}
+	} while (retry_count);
+
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
 
@@ -3637,38 +3723,35 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 }
 
 static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
-					unsigned long long val)
+					 unsigned long limit)
 {
+	unsigned long curusage;
+	unsigned long oldusage;
+	bool enlarge = false;
 	int retry_count;
-	u64 memlimit, memswlimit, oldusage, curusage;
-	int children = mem_cgroup_count_children(memcg);
-	int ret = -EBUSY;
-	int enlarge = 0;
+	int ret;
 
 	/* see mem_cgroup_resize_res_limit */
-	retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
-	oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
-	while (retry_count) {
+	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+		      mem_cgroup_count_children(memcg);
+
+	oldusage = atomic_long_read(&memcg->memsw.count);
+
+	do {
 		if (signal_pending(current)) {
 			ret = -EINTR;
 			break;
 		}
-		/*
-		 * Rather than hide all in some function, I do this in
-		 * open coded manner. You see what this really does.
-		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
-		 */
+
 		mutex_lock(&set_limit_mutex);
-		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-		if (memlimit > val) {
-			ret = -EINVAL;
+		if (limit < memcg->memory.limit) {
 			mutex_unlock(&set_limit_mutex);
+			ret = -EINVAL;
 			break;
 		}
-		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-		if (memswlimit < val)
-			enlarge = 1;
-		ret = res_counter_set_limit(&memcg->memsw, val);
+		if (limit > memcg->memsw.limit)
+			enlarge = true;
+		ret = page_counter_limit(&memcg->memsw, limit);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3676,15 +3759,17 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
 
-		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+		curusage = atomic_long_read(&memcg->memsw.count);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
 		else
 			oldusage = curusage;
-	}
+	} while (retry_count);
+
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
+
 	return ret;
 }
 
@@ -3697,7 +3782,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	unsigned long reclaimed;
 	int loop = 0;
 	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
+	unsigned long excess;
 	unsigned long nr_scanned;
 
 	if (order > 0)
@@ -3751,7 +3836,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			} while (1);
 		}
 		__mem_cgroup_remove_exceeded(mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
+		excess = soft_limit_excess(mz->memcg);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
@@ -3844,7 +3929,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
 	int node, zid;
-	u64 usage;
 
 	do {
 		/* This is for making all *used* pages to be on LRU. */
@@ -3876,9 +3960,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 		 * right after the check. RES_USAGE should be safe as we always
 		 * charge before adding to the LRU.
 		 */
-		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
-			res_counter_read_u64(&memcg->kmem, RES_USAGE);
-	} while (usage > 0);
+	} while (atomic_long_read(&memcg->memory.count) -
+		 atomic_long_read(&memcg->kmem.count) > 0);
 }
 
 /*
@@ -3918,7 +4001,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
-	while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
+	while (nr_retries && atomic_long_read(&memcg->memory.count)) {
 		int progress;
 
 		if (signal_pending(current))
@@ -3989,8 +4072,8 @@ out:
 	return retval;
 }
 
-static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
-					       enum mem_cgroup_stat_index idx)
+static unsigned long tree_stat(struct mem_cgroup *memcg,
+			       enum mem_cgroup_stat_index idx)
 {
 	struct mem_cgroup *iter;
 	long val = 0;
@@ -4008,55 +4091,72 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 {
 	u64 val;
 
-	if (!mem_cgroup_is_root(memcg)) {
+	if (mem_cgroup_is_root(memcg)) {
+		val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE);
+		val += tree_stat(memcg, MEM_CGROUP_STAT_RSS);
+		if (swap)
+			val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
+	} else {
 		if (!swap)
-			return res_counter_read_u64(&memcg->res, RES_USAGE);
+			val = atomic_long_read(&memcg->memory.count);
 		else
-			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+			val = atomic_long_read(&memcg->memsw.count);
 	}
-
-	/*
-	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
-	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
-	 */
-	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
-	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
-
-	if (swap)
-		val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
-
 	return val << PAGE_SHIFT;
 }
 
+enum {
+	RES_USAGE,
+	RES_LIMIT,
+	RES_MAX_USAGE,
+	RES_FAILCNT,
+	RES_SOFT_LIMIT,
+};
 
 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	enum res_type type = MEMFILE_TYPE(cft->private);
-	int name = MEMFILE_ATTR(cft->private);
+	struct page_counter *counter;
 
-	switch (type) {
+	switch (MEMFILE_TYPE(cft->private)) {
 	case _MEM:
-		if (name == RES_USAGE)
-			return mem_cgroup_usage(memcg, false);
-		return res_counter_read_u64(&memcg->res, name);
+		counter = &memcg->memory;
+		break;
 	case _MEMSWAP:
-		if (name == RES_USAGE)
-			return mem_cgroup_usage(memcg, true);
-		return res_counter_read_u64(&memcg->memsw, name);
+		counter = &memcg->memsw;
+		break;
 	case _KMEM:
-		return res_counter_read_u64(&memcg->kmem, name);
+		counter = &memcg->kmem;
 		break;
 	default:
 		BUG();
 	}
+
+	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_USAGE:
+		if (counter == &memcg->memory)
+			return mem_cgroup_usage(memcg, false);
+		if (counter == &memcg->memsw)
+			return mem_cgroup_usage(memcg, true);
+		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+	case RES_LIMIT:
+		return (u64)counter->limit * PAGE_SIZE;
+	case RES_MAX_USAGE:
+		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_FAILCNT:
+		return counter->limited;
+	case RES_SOFT_LIMIT:
+		return (u64)memcg->soft_limit * PAGE_SIZE;
+	default:
+		BUG();
+	}
 }
 
 #ifdef CONFIG_MEMCG_KMEM
 /* should be called with activate_kmem_mutex held */
 static int __memcg_activate_kmem(struct mem_cgroup *memcg,
-				 unsigned long long limit)
+				 unsigned long nr_pages)
 {
 	int err = 0;
 	int memcg_id;
@@ -4103,7 +4203,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 	 * We couldn't have accounted to this cgroup, because it hasn't got the
 	 * active bit set yet, so this should succeed.
 	 */
-	err = res_counter_set_limit(&memcg->kmem, limit);
+	err = page_counter_limit(&memcg->kmem, nr_pages);
 	VM_BUG_ON(err);
 
 	static_key_slow_inc(&memcg_kmem_enabled_key);
@@ -4119,25 +4219,25 @@ out:
 }
 
 static int memcg_activate_kmem(struct mem_cgroup *memcg,
-			       unsigned long long limit)
+			       unsigned long nr_pages)
 {
 	int ret;
 
 	mutex_lock(&activate_kmem_mutex);
-	ret = __memcg_activate_kmem(memcg, limit);
+	ret = __memcg_activate_kmem(memcg, nr_pages);
 	mutex_unlock(&activate_kmem_mutex);
 	return ret;
 }
 
 static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
-				   unsigned long long val)
+				   unsigned long limit)
 {
 	int ret;
 
 	if (!memcg_kmem_is_active(memcg))
-		ret = memcg_activate_kmem(memcg, val);
+		ret = memcg_activate_kmem(memcg, limit);
 	else
-		ret = res_counter_set_limit(&memcg->kmem, val);
+		ret = page_counter_limit(&memcg->kmem, limit);
 	return ret;
 }
 
@@ -4155,13 +4255,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 	 * after this point, because it has at least one child already.
 	 */
 	if (memcg_kmem_is_active(parent))
-		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
+		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
 	mutex_unlock(&activate_kmem_mutex);
 	return ret;
 }
 #else
 static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
-				   unsigned long long val)
+				   unsigned long limit)
 {
 	return -EINVAL;
 }
@@ -4175,110 +4275,69 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	enum res_type type;
-	int name;
-	unsigned long long val;
+	unsigned long nr_pages;
 	int ret;
 
 	buf = strstrip(buf);
-	type = MEMFILE_TYPE(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
+	ret = page_counter_memparse(buf, &nr_pages);
+	if (ret)
+		return ret;
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_LIMIT:
 		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
 			ret = -EINVAL;
 			break;
 		}
-		/* This function does all necessary parse...reuse it */
-		ret = res_counter_memparse_write_strategy(buf, &val);
-		if (ret)
+		switch (MEMFILE_TYPE(of_cft(of)->private)) {
+		case _MEM:
+			ret = mem_cgroup_resize_limit(memcg, nr_pages);
 			break;
-		if (type == _MEM)
-			ret = mem_cgroup_resize_limit(memcg, val);
-		else if (type == _MEMSWAP)
-			ret = mem_cgroup_resize_memsw_limit(memcg, val);
-		else if (type == _KMEM)
-			ret = memcg_update_kmem_limit(memcg, val);
-		else
-			return -EINVAL;
-		break;
-	case RES_SOFT_LIMIT:
-		ret = res_counter_memparse_write_strategy(buf, &val);
-		if (ret)
+		case _MEMSWAP:
+			ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
 			break;
-		/*
-		 * For memsw, soft limits are hard to implement in terms
-		 * of semantics, for now, we support soft limits for
-		 * control without swap
-		 */
-		if (type == _MEM)
-			ret = res_counter_set_soft_limit(&memcg->res, val);
-		else
-			ret = -EINVAL;
+		case _KMEM:
+			ret = memcg_update_kmem_limit(memcg, nr_pages);
+			break;
+		}
 		break;
-	default:
-		ret = -EINVAL; /* should be BUG() ? */
+	case RES_SOFT_LIMIT:
+		memcg->soft_limit = nr_pages;
+		ret = 0;
 		break;
 	}
 	return ret ?: nbytes;
 }
 
-static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
-		unsigned long long *mem_limit, unsigned long long *memsw_limit)
-{
-	unsigned long long min_limit, min_memsw_limit, tmp;
-
-	min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-	min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-	if (!memcg->use_hierarchy)
-		goto out;
-
-	while (memcg->css.parent) {
-		memcg = mem_cgroup_from_css(memcg->css.parent);
-		if (!memcg->use_hierarchy)
-			break;
-		tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
-		min_limit = min(min_limit, tmp);
-		tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-		min_memsw_limit = min(min_memsw_limit, tmp);
-	}
-out:
-	*mem_limit = min_limit;
-	*memsw_limit = min_memsw_limit;
-}
-
 static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
 				size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	int name;
-	enum res_type type;
+	struct page_counter *counter;
 
-	type = MEMFILE_TYPE(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
+	switch (MEMFILE_TYPE(of_cft(of)->private)) {
+	case _MEM:
+		counter = &memcg->memory;
+		break;
+	case _MEMSWAP:
+		counter = &memcg->memsw;
+		break;
+	case _KMEM:
+		counter = &memcg->kmem;
+		break;
+	default:
+		BUG();
+	}
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		if (type == _MEM)
-			res_counter_reset_max(&memcg->res);
-		else if (type == _MEMSWAP)
-			res_counter_reset_max(&memcg->memsw);
-		else if (type == _KMEM)
-			res_counter_reset_max(&memcg->kmem);
-		else
-			return -EINVAL;
+		counter->watermark = atomic_long_read(&counter->count);
 		break;
 	case RES_FAILCNT:
-		if (type == _MEM)
-			res_counter_reset_failcnt(&memcg->res);
-		else if (type == _MEMSWAP)
-			res_counter_reset_failcnt(&memcg->memsw);
-		else if (type == _KMEM)
-			res_counter_reset_failcnt(&memcg->kmem);
-		else
-			return -EINVAL;
+		counter->limited = 0;
 		break;
+	default:
+		BUG();
 	}
 
 	return nbytes;
@@ -4375,6 +4434,7 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
 static int memcg_stat_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long memory, memsw;
 	struct mem_cgroup *mi;
 	unsigned int i;
 
@@ -4394,14 +4454,16 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 			   mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);
 
 	/* Hierarchical information */
-	{
-		unsigned long long limit, memsw_limit;
-		memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit);
-		seq_printf(m, "hierarchical_memory_limit %llu\n", limit);
-		if (do_swap_account)
-			seq_printf(m, "hierarchical_memsw_limit %llu\n",
-				   memsw_limit);
+	memory = memsw = PAGE_COUNTER_MAX;
+	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
+		memory = min(memory, mi->memory.limit);
+		memsw = min(memsw, mi->memsw.limit);
 	}
+	seq_printf(m, "hierarchical_memory_limit %llu\n",
+		   (u64)memory * PAGE_SIZE);
+	if (do_swap_account)
+		seq_printf(m, "hierarchical_memsw_limit %llu\n",
+			   (u64)memsw * PAGE_SIZE);
 
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
 		long long val = 0;
@@ -4485,7 +4547,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
-	u64 usage;
+	unsigned long usage;
 	int i;
 
 	rcu_read_lock();
@@ -4584,10 +4646,11 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 {
 	struct mem_cgroup_thresholds *thresholds;
 	struct mem_cgroup_threshold_ary *new;
-	u64 threshold, usage;
+	unsigned long threshold;
+	unsigned long usage;
 	int i, size, ret;
 
-	ret = res_counter_memparse_write_strategy(args, &threshold);
+	ret = page_counter_memparse(args, &threshold);
 	if (ret)
 		return ret;
 
@@ -4677,7 +4740,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 {
 	struct mem_cgroup_thresholds *thresholds;
 	struct mem_cgroup_threshold_ary *new;
-	u64 usage;
+	unsigned long usage;
 	int i, j, size;
 
 	mutex_lock(&memcg->thresholds_lock);
@@ -4871,7 +4934,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 
 	memcg_kmem_mark_dead(memcg);
 
-	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
+	if (atomic_long_read(&memcg->kmem.count))
 		return;
 
 	if (memcg_kmem_test_and_clear_dead(memcg))
@@ -5351,9 +5414,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
  */
 struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
-	if (!memcg->res.parent)
+	if (!memcg->memory.parent)
 		return NULL;
-	return mem_cgroup_from_res_counter(memcg->res.parent, res);
+	return mem_cgroup_from_counter(memcg->memory.parent, memory);
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
@@ -5398,9 +5461,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
-		res_counter_init(&memcg->res, NULL);
-		res_counter_init(&memcg->memsw, NULL);
-		res_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->memsw, NULL);
+		page_counter_init(&memcg->kmem, NULL);
 	}
 
 	memcg->last_scanned_node = MAX_NUMNODES;
@@ -5438,18 +5501,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	memcg->swappiness = mem_cgroup_swappiness(parent);
 
 	if (parent->use_hierarchy) {
-		res_counter_init(&memcg->res, &parent->res);
-		res_counter_init(&memcg->memsw, &parent->memsw);
-		res_counter_init(&memcg->kmem, &parent->kmem);
+		page_counter_init(&memcg->memory, &parent->memory);
+		page_counter_init(&memcg->memsw, &parent->memsw);
+		page_counter_init(&memcg->kmem, &parent->kmem);
 
 		/*
 		 * No need to take a reference to the parent because cgroup
 		 * core guarantees its existence.
 		 */
 	} else {
-		res_counter_init(&memcg->res, NULL);
-		res_counter_init(&memcg->memsw, NULL);
-		res_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->memsw, NULL);
+		page_counter_init(&memcg->kmem, NULL);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
@@ -5520,7 +5583,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	/*
 	 * XXX: css_offline() would be where we should reparent all
 	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget_online() and res_counter charging
+	 * memcg does not do css_tryget_online() and page_counter charging
 	 * under the same RCU lock region, which means that charging
 	 * could race with offlining.  Offlining only happens to
 	 * cgroups with no tasks in them but charges can show up
@@ -5540,7 +5603,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	 * call_rcu()
 	 *   offline_css()
 	 *     reparent_charges()
-	 *                           res_counter_charge()
+	 *                           page_counter_charge()
 	 *                           css_put()
 	 *                             css_free()
 	 *                           pc->mem_cgroup = dead memcg
@@ -5575,10 +5638,10 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
-	mem_cgroup_resize_limit(memcg, ULLONG_MAX);
-	mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX);
-	memcg_update_kmem_limit(memcg, ULLONG_MAX);
-	res_counter_set_soft_limit(&memcg->res, ULLONG_MAX);
+	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
+	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
+	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
+	memcg->soft_limit = 0;
 }
 
 #ifdef CONFIG_MMU
@@ -5892,19 +5955,18 @@ static void __mem_cgroup_clear_mc(void)
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
 		if (!mem_cgroup_is_root(mc.from))
-			res_counter_uncharge(&mc.from->memsw,
-					     PAGE_SIZE * mc.moved_swap);
-
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.from->css);
+			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
 
 		/*
-		 * we charged both to->res and to->memsw, so we should
-		 * uncharge to->res.
+		 * we charged both to->memory and to->memsw, so we
+		 * should uncharge to->memory.
 		 */
 		if (!mem_cgroup_is_root(mc.to))
-			res_counter_uncharge(&mc.to->res,
-					     PAGE_SIZE * mc.moved_swap);
+			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+
+		for (i = 0; i < mc.moved_swap; i++)
+			css_put(&mc.from->css);
+
 		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
@@ -6270,7 +6332,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 	memcg = mem_cgroup_lookup(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg))
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+			page_counter_uncharge(&memcg->memsw, 1);
 		mem_cgroup_swap_statistics(memcg, false);
 		css_put(&memcg->css);
 	}
@@ -6436,11 +6498,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 
 	if (!mem_cgroup_is_root(memcg)) {
 		if (nr_mem)
-			res_counter_uncharge(&memcg->res,
-					     nr_mem * PAGE_SIZE);
+			page_counter_uncharge(&memcg->memory, nr_mem);
 		if (nr_memsw)
-			res_counter_uncharge(&memcg->memsw,
-					     nr_memsw * PAGE_SIZE);
+			page_counter_uncharge(&memcg->memsw, nr_memsw);
 		memcg_oom_recover(memcg);
 	}
 
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 1d191357bf88..9a448bdb19e9 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -9,13 +9,13 @@
 int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	/*
-	 * The root cgroup does not use res_counters, but rather,
+	 * The root cgroup does not use page_counters, but rather,
 	 * rely on the data already collected by the network
 	 * subsystem
 	 */
-	struct res_counter *res_parent = NULL;
-	struct cg_proto *cg_proto, *parent_cg;
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct page_counter *counter_parent = NULL;
+	struct cg_proto *cg_proto, *parent_cg;
 
 	cg_proto = tcp_prot.proto_cgroup(memcg);
 	if (!cg_proto)
@@ -29,9 +29,9 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 
 	parent_cg = tcp_prot.proto_cgroup(parent);
 	if (parent_cg)
-		res_parent = &parent_cg->memory_allocated;
+		counter_parent = &parent_cg->memory_allocated;
 
-	res_counter_init(&cg_proto->memory_allocated, res_parent);
+	page_counter_init(&cg_proto->memory_allocated, counter_parent);
 	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
 
 	return 0;
@@ -50,7 +50,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(tcp_destroy_cgroup);
 
-static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
+static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 {
 	struct cg_proto *cg_proto;
 	int i;
@@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
 	if (!cg_proto)
 		return -EINVAL;
 
-	if (val > RES_COUNTER_MAX)
-		val = RES_COUNTER_MAX;
-
-	ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
+	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
 	if (ret)
 		return ret;
 
 	for (i = 0; i < 3; i++)
-		cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
+		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
 						sysctl_tcp_mem[i]);
 
-	if (val == RES_COUNTER_MAX)
+	if (nr_pages == ULONG_MAX / PAGE_SIZE)
 		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
-	else if (val != RES_COUNTER_MAX) {
+	else {
 		/*
 		 * The active bit needs to be written after the static_key
 		 * update. This is what guarantees that the socket activation
@@ -102,11 +99,18 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
 	return 0;
 }
 
+enum {
+	RES_USAGE,
+	RES_LIMIT,
+	RES_MAX_USAGE,
+	RES_FAILCNT,
+};
+
 static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	unsigned long long val;
+	unsigned long nr_pages;
 	int ret = 0;
 
 	buf = strstrip(buf);
@@ -114,10 +118,10 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	switch (of_cft(of)->private) {
 	case RES_LIMIT:
 		/* see memcontrol.c */
-		ret = res_counter_memparse_write_strategy(buf, &val);
+		ret = page_counter_memparse(buf, &nr_pages);
 		if (ret)
 			break;
-		ret = tcp_update_limit(memcg, val);
+		ret = tcp_update_limit(memcg, nr_pages);
 		break;
 	default:
 		ret = -EINVAL;
@@ -126,43 +130,35 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
-{
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return default_val;
-
-	return res_counter_read_u64(&cg_proto->memory_allocated, type);
-}
-
-static u64 tcp_read_usage(struct mem_cgroup *memcg)
-{
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
-
-	return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
-}
-
 static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
 	u64 val;
 
 	switch (cft->private) {
 	case RES_LIMIT:
-		val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
+		if (!cg_proto)
+			return PAGE_COUNTER_MAX;
+		val = cg_proto->memory_allocated.limit;
+		val *= PAGE_SIZE;
 		break;
 	case RES_USAGE:
-		val = tcp_read_usage(memcg);
+		if (!cg_proto)
+			return atomic_long_read(&tcp_memory_allocated);
+		val = atomic_long_read(&cg_proto->memory_allocated.count);
+		val *= PAGE_SIZE;
 		break;
 	case RES_FAILCNT:
+		if (!cg_proto)
+			return 0;
+		val = cg_proto->memory_allocated.limited;
+		break;
 	case RES_MAX_USAGE:
-		val = tcp_read_stat(memcg, cft->private, 0);
+		if (!cg_proto)
+			return 0;
+		val = cg_proto->memory_allocated.watermark;
+		val *= PAGE_SIZE;
 		break;
 	default:
 		BUG();
@@ -183,10 +179,11 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
 
 	switch (of_cft(of)->private) {
 	case RES_MAX_USAGE:
-		res_counter_reset_max(&cg_proto->memory_allocated);
+		cg_proto->memory_allocated.watermark =
+			atomic_long_read(&cg_proto->memory_allocated.count);
 		break;
 	case RES_FAILCNT:
-		res_counter_reset_failcnt(&cg_proto->memory_allocated);
+		cg_proto->memory_allocated.limited = 0;
 		break;
 	}
 
-- 
2.1.0


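[Editor's note -- not part of the posted patch. The hunks above move the
cgroupfs read/write handlers from byte-based res_counter values to the
page-based page_counter: user input is parsed into a page count via
page_counter_memparse(), and reads multiply the page count back into bytes
(see mem_cgroup_read_u64() and tcp_cgroup_read()). The standalone C sketch
below only illustrates that boundary translation; DEMO_PAGE_SIZE, the
round-up rule in bytes_to_pages() and the demo main() are assumptions made
for illustration, since page_counter_memparse() itself is defined elsewhere
in the patch.]

/* illustration only: byte<->page translation at the cgroupfs boundary */
#include <stdio.h>

#define DEMO_PAGE_SIZE 4096ULL	/* assumed page size for the demo */

/* user writes bytes; the kernel stores pages (rounding assumed here) */
static unsigned long bytes_to_pages(unsigned long long bytes)
{
	return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}

/* reads report bytes, as mem_cgroup_read_u64()/tcp_cgroup_read() do */
static unsigned long long pages_to_bytes(unsigned long pages)
{
	return (unsigned long long)pages * DEMO_PAGE_SIZE;
}

int main(void)
{
	unsigned long long written = 64ULL << 20;	/* user writes "64M" */
	unsigned long nr_pages = bytes_to_pages(written);

	printf("stored limit:   %lu pages\n", nr_pages);	/* 16384 */
	printf("reported limit: %llu bytes\n", pages_to_bytes(nr_pages));
	return 0;
}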

* [patch] mm: memcontrol: lockless page counters
@ 2014-09-19 13:22 ` Johannes Weiner
  0 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-19 13:22 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

Memory is internally accounted in bytes, using spinlock-protected
64-bit counters, even though the smallest accounting delta is a page.
The counter interface is also convoluted and does too many things.

Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it and remove the old one.  The translation
from and to bytes then only happens when interfacing with userspace.

Aside from the locking costs, this gets rid of the icky unsigned long
long types in the very heart of memcg, which is great for 32 bit and
also makes the code a lot more readable.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
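[Editor's note -- not part of the posted patch. The heart of the new API is
the lockless, hierarchical charge path added to mm/memcontrol.c below:
page_counter_charge() speculatively adds the pages at each level with a
compare-and-swap retry loop and, if some ancestor would go over its limit,
unwinds the levels already charged via page_counter_cancel(). The userspace
C11 sketch below mirrors that scheme so it can be compiled and experimented
with; the trimmed struct layout, the pc_*() names, the limits and the demo
main() are assumptions for illustration only.]

/* illustration only: userspace mirror of the lockless charge scheme */
#include <stdatomic.h>
#include <stdio.h>

struct pc {
	atomic_long count;
	long limit;
	long watermark;		/* legacy max_usage */
	long failcnt;		/* legacy failcnt ("limited" in the patch) */
	struct pc *parent;
};

static void pc_cancel(struct pc *c, long nr_pages)
{
	atomic_fetch_sub(&c->count, nr_pages);
}

/* 0 on success; -1 if @nr_pages would push some level over its limit.
 * With a NULL @fail the charge goes through anyway, like the nofail
 * callers (socket buffers, hugetlb reparenting) in the patch. */
static int pc_charge(struct pc *counter, long nr_pages, struct pc **fail)
{
	struct pc *c;

	for (c = counter; c; c = c->parent) {
		for (;;) {
			long count = atomic_load(&c->count);
			long new = count + nr_pages;

			if (new > c->limit) {
				c->failcnt++;	/* racy bump, as in the patch */
				if (fail) {
					*fail = c;
					goto failed;
				}
			}
			/* lost a race: reread the counter and retry */
			if (!atomic_compare_exchange_weak(&c->count, &count, new))
				continue;
			if (new > c->watermark)	/* approximate, as in the patch */
				c->watermark = new;
			break;
		}
	}
	return 0;
failed:
	/* unwind the levels that were already charged */
	for (c = counter; c != *fail; c = c->parent)
		pc_cancel(c, nr_pages);
	return -1;
}

int main(void)
{
	struct pc root  = { .limit = 256 };
	struct pc child = { .limit = 1024, .parent = &root };
	struct pc *fail;

	printf("charge 200 -> %d\n", pc_charge(&child, 200, &fail));	/*  0 */
	printf("charge 100 -> %d\n", pc_charge(&child, 100, &fail));	/* -1: root over limit */
	printf("child usage = %ld pages\n", atomic_load(&child.count));	/* 200 */
	return 0;
}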
 Documentation/cgroups/hugetlb.txt          |   2 +-
 Documentation/cgroups/memory.txt           |   4 +-
 Documentation/cgroups/resource_counter.txt | 197 --------
 include/linux/hugetlb_cgroup.h             |   1 -
 include/linux/memcontrol.h                 |  37 +-
 include/linux/res_counter.h                | 223 ---------
 include/net/sock.h                         |  25 +-
 init/Kconfig                               |   9 +-
 kernel/Makefile                            |   1 -
 kernel/res_counter.c                       | 211 --------
 mm/hugetlb_cgroup.c                        | 100 ++--
 mm/memcontrol.c                            | 740 ++++++++++++++++-------------
 net/ipv4/tcp_memcontrol.c                  |  83 ++--
 13 files changed, 541 insertions(+), 1092 deletions(-)
 delete mode 100644 Documentation/cgroups/resource_counter.txt
 delete mode 100644 include/linux/res_counter.h
 delete mode 100644 kernel/res_counter.c

diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
index a9faaca1f029..106245c3aecc 100644
--- a/Documentation/cgroups/hugetlb.txt
+++ b/Documentation/cgroups/hugetlb.txt
@@ -29,7 +29,7 @@ Brief summary of control files
 
  hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
  hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes     # show current res_counter usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.failcnt		   # show the number of allocation failure due to HugeTLB limit
 
 For a system supporting two hugepage size (16M and 16G) the control
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 02ab997a1ed2..f624727ab404 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -52,9 +52,9 @@ Brief summary of control files.
  tasks				 # attach a task(thread) and show list of threads
  cgroup.procs			 # show list of processes
  cgroup.event_control		 # an interface for event_fd()
- memory.usage_in_bytes		 # show current res_counter usage for memory
+ memory.usage_in_bytes		 # show current usage for memory
 				 (See 5.5 for details)
- memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
+ memory.memsw.usage_in_bytes	 # show current usage for memory+Swap
 				 (See 5.5 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
deleted file mode 100644
index 762ca54eb929..000000000000
--- a/Documentation/cgroups/resource_counter.txt
+++ /dev/null
@@ -1,197 +0,0 @@
-
-		The Resource Counter
-
-The resource counter, declared at include/linux/res_counter.h,
-is supposed to facilitate the resource management by controllers
-by providing common stuff for accounting.
-
-This "stuff" includes the res_counter structure and routines
-to work with it.
-
-
-
-1. Crucial parts of the res_counter structure
-
- a. unsigned long long usage
-
- 	The usage value shows the amount of a resource that is consumed
-	by a group at a given time. The units of measurement should be
-	determined by the controller that uses this counter. E.g. it can
-	be bytes, items or any other unit the controller operates on.
-
- b. unsigned long long max_usage
-
- 	The maximal value of the usage over time.
-
- 	This value is useful when gathering statistical information about
-	the particular group, as it shows the actual resource requirements
-	for a particular group, not just some usage snapshot.
-
- c. unsigned long long limit
-
- 	The maximal allowed amount of resource to consume by the group. In
-	case the group requests for more resources, so that the usage value
-	would exceed the limit, the resource allocation is rejected (see
-	the next section).
-
- d. unsigned long long failcnt
-
- 	The failcnt stands for "failures counter". This is the number of
-	resource allocation attempts that failed.
-
- c. spinlock_t lock
-
- 	Protects changes of the above values.
-
-
-
-2. Basic accounting routines
-
- a. void res_counter_init(struct res_counter *rc,
-				struct res_counter *rc_parent)
-
- 	Initializes the resource counter. As usual, should be the first
-	routine called for a new counter.
-
-	The struct res_counter *parent can be used to define a hierarchical
-	child -> parent relationship directly in the res_counter structure,
-	NULL can be used to define no relationship.
-
- c. int res_counter_charge(struct res_counter *rc, unsigned long val,
-				struct res_counter **limit_fail_at)
-
-	When a resource is about to be allocated it has to be accounted
-	with the appropriate resource counter (controller should determine
-	which one to use on its own). This operation is called "charging".
-
-	This is not very important which operation - resource allocation
-	or charging - is performed first, but
-	  * if the allocation is performed first, this may create a
-	    temporary resource over-usage by the time resource counter is
-	    charged;
-	  * if the charging is performed first, then it should be uncharged
-	    on error path (if the one is called).
-
-	If the charging fails and a hierarchical dependency exists, the
-	limit_fail_at parameter is set to the particular res_counter element
-	where the charging failed.
-
- d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
-
-	When a resource is released (freed) it should be de-accounted
-	from the resource counter it was accounted to.  This is called
-	"uncharging". The return value of this function indicate the amount
-	of charges still present in the counter.
-
-	The _locked routines imply that the res_counter->lock is taken.
-
- e. u64 res_counter_uncharge_until
-		(struct res_counter *rc, struct res_counter *top,
-		 unsigned long val)
-
-	Almost same as res_counter_uncharge() but propagation of uncharge
-	stops when rc == top. This is useful when kill a res_counter in
-	child cgroup.
-
- 2.1 Other accounting routines
-
-    There are more routines that may help you with common needs, like
-    checking whether the limit is reached or resetting the max_usage
-    value. They are all declared in include/linux/res_counter.h.
-
-
-
-3. Analyzing the resource counter registrations
-
- a. If the failcnt value constantly grows, this means that the counter's
-    limit is too tight. Either the group is misbehaving and consumes too
-    many resources, or the configuration is not suitable for the group
-    and the limit should be increased.
-
- b. The max_usage value can be used to quickly tune the group. One may
-    set the limits to maximal values and either load the container with
-    a common pattern or leave one for a while. After this the max_usage
-    value shows the amount of memory the container would require during
-    its common activity.
-
-    Setting the limit a bit above this value gives a pretty good
-    configuration that works in most of the cases.
-
- c. If the max_usage is much less than the limit, but the failcnt value
-    is growing, then the group tries to allocate a big chunk of resource
-    at once.
-
- d. If the max_usage is much less than the limit, but the failcnt value
-    is 0, then this group is given too high limit, that it does not
-    require. It is better to lower the limit a bit leaving more resource
-    for other groups.
-
-
-
-4. Communication with the control groups subsystem (cgroups)
-
-All the resource controllers that are using cgroups and resource counters
-should provide files (in the cgroup filesystem) to work with the resource
-counter fields. They are recommended to adhere to the following rules:
-
- a. File names
-
- 	Field name	File name
-	---------------------------------------------------
-	usage		usage_in_<unit_of_measurement>
-	max_usage	max_usage_in_<unit_of_measurement>
-	limit		limit_in_<unit_of_measurement>
-	failcnt		failcnt
-	lock		no file :)
-
- b. Reading from file should show the corresponding field value in the
-    appropriate format.
-
- c. Writing to file
-
- 	Field		Expected behavior
-	----------------------------------
-	usage		prohibited
-	max_usage	reset to usage
-	limit		set the limit
-	failcnt		reset to zero
-
-
-
-5. Usage example
-
- a. Declare a task group (take a look at cgroups subsystem for this) and
-    fold a res_counter into it
-
-	struct my_group {
-		struct res_counter res;
-
-		<other fields>
-	}
-
- b. Put hooks in resource allocation/release paths
-
- 	int alloc_something(...)
-	{
-		if (res_counter_charge(res_counter_ptr, amount) < 0)
-			return -ENOMEM;
-
-		<allocate the resource and return to the caller>
-	}
-
-	void release_something(...)
-	{
-		res_counter_uncharge(res_counter_ptr, amount);
-
-		<release the resource>
-	}
-
-    In order to keep the usage value self-consistent, both the
-    "res_counter_ptr" and the "amount" in release_something() should be
-    the same as they were in the alloc_something() when the releasing
-    resource was allocated.
-
- c. Provide the way to read res_counter values and set them (the cgroups
-    still can help with it).
-
- c. Compile and run :)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 0129f89cf98d..bcc853eccc85 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -16,7 +16,6 @@
 #define _LINUX_HUGETLB_CGROUP_H
 
 #include <linux/mmdebug.h>
-#include <linux/res_counter.h>
 
 struct hugetlb_cgroup;
 /*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19df5d857411..bf8fb1a05597 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+
+struct page_counter {
+	atomic_long_t count;
+	unsigned long limit;
+	struct page_counter *parent;
+
+	/* legacy */
+	unsigned long watermark;
+	unsigned long limited;
+};
+
+#if BITS_PER_LONG == 32
+#define PAGE_COUNTER_MAX ULONG_MAX
+#else
+#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
+#endif
+
+static inline void page_counter_init(struct page_counter *counter,
+				     struct page_counter *parent)
+{
+	atomic_long_set(&counter->count, 0);
+	counter->limit = PAGE_COUNTER_MAX;
+	counter->parent = parent;
+}
+
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
+			struct page_counter **fail);
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_limit(struct page_counter *counter, unsigned long limit);
+int page_counter_memparse(const char *buf, unsigned long *nr_pages);
+
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -471,9 +503,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 	/*
 	 * __GFP_NOFAIL allocations will move on even if charging is not
 	 * possible. Therefore we don't even try, and have this allocation
-	 * unaccounted. We could in theory charge it with
-	 * res_counter_charge_nofail, but we hope those allocations are rare,
-	 * and won't be worth the trouble.
+	 * unaccounted. We could in theory charge it forcibly, but we hope
+	 * those allocations are rare, and won't be worth the trouble.
 	 */
 	if (gfp & __GFP_NOFAIL)
 		return true;
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
deleted file mode 100644
index 56b7bc32db4f..000000000000
--- a/include/linux/res_counter.h
+++ /dev/null
@@ -1,223 +0,0 @@
-#ifndef __RES_COUNTER_H__
-#define __RES_COUNTER_H__
-
-/*
- * Resource Counters
- * Contain common data types and routines for resource accounting
- *
- * Copyright 2007 OpenVZ SWsoft Inc
- *
- * Author: Pavel Emelianov <xemul@openvz.org>
- *
- * See Documentation/cgroups/resource_counter.txt for more
- * info about what this counter is.
- */
-
-#include <linux/spinlock.h>
-#include <linux/errno.h>
-
-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
-
-struct res_counter {
-	/*
-	 * the current resource consumption level
-	 */
-	unsigned long long usage;
-	/*
-	 * the maximal value of the usage from the counter creation
-	 */
-	unsigned long long max_usage;
-	/*
-	 * the limit that usage cannot exceed
-	 */
-	unsigned long long limit;
-	/*
-	 * the limit that usage can be exceed
-	 */
-	unsigned long long soft_limit;
-	/*
-	 * the number of unsuccessful attempts to consume the resource
-	 */
-	unsigned long long failcnt;
-	/*
-	 * the lock to protect all of the above.
-	 * the routines below consider this to be IRQ-safe
-	 */
-	spinlock_t lock;
-	/*
-	 * Parent counter, used for hierarchial resource accounting
-	 */
-	struct res_counter *parent;
-};
-
-#define RES_COUNTER_MAX ULLONG_MAX
-
-/**
- * Helpers to interact with userspace
- * res_counter_read_u64() - returns the value of the specified member.
- * res_counter_read/_write - put/get the specified fields from the
- * res_counter struct to/from the user
- *
- * @counter:     the counter in question
- * @member:  the field to work with (see RES_xxx below)
- * @buf:     the buffer to opeate on,...
- * @nbytes:  its size...
- * @pos:     and the offset.
- */
-
-u64 res_counter_read_u64(struct res_counter *counter, int member);
-
-ssize_t res_counter_read(struct res_counter *counter, int member,
-		const char __user *buf, size_t nbytes, loff_t *pos,
-		int (*read_strategy)(unsigned long long val, char *s));
-
-int res_counter_memparse_write_strategy(const char *buf,
-					unsigned long long *res);
-
-/*
- * the field descriptors. one for each member of res_counter
- */
-
-enum {
-	RES_USAGE,
-	RES_MAX_USAGE,
-	RES_LIMIT,
-	RES_FAILCNT,
-	RES_SOFT_LIMIT,
-};
-
-/*
- * helpers for accounting
- */
-
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
-
-/*
- * charge - try to consume more resource.
- *
- * @counter: the counter
- * @val: the amount of the resource. each controller defines its own
- *       units, e.g. numbers, bytes, Kbytes, etc
- *
- * returns 0 on success and <0 if the counter->usage will exceed the
- * counter->limit
- *
- * charge_nofail works the same, except that it charges the resource
- * counter unconditionally, and returns < 0 if the after the current
- * charge we are over limit.
- */
-
-int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
-int res_counter_charge_nofail(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
-
-/*
- * uncharge - tell that some portion of the resource is released
- *
- * @counter: the counter
- * @val: the amount of the resource
- *
- * these calls check for usage underflow and show a warning on the console
- *
- * returns the total charges still present in @counter.
- */
-
-u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
-
-u64 res_counter_uncharge_until(struct res_counter *counter,
-			       struct res_counter *top,
-			       unsigned long val);
-/**
- * res_counter_margin - calculate chargeable space of a counter
- * @cnt: the counter
- *
- * Returns the difference between the hard limit and the current usage
- * of resource counter @cnt.
- */
-static inline unsigned long long res_counter_margin(struct res_counter *cnt)
-{
-	unsigned long long margin;
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->limit > cnt->usage)
-		margin = cnt->limit - cnt->usage;
-	else
-		margin = 0;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return margin;
-}
-
-/**
- * Get the difference between the usage and the soft limit
- * @cnt: The counter
- *
- * Returns 0 if usage is less than or equal to soft limit
- * The difference between usage and soft limit, otherwise.
- */
-static inline unsigned long long
-res_counter_soft_limit_excess(struct res_counter *cnt)
-{
-	unsigned long long excess;
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= cnt->soft_limit)
-		excess = 0;
-	else
-		excess = cnt->usage - cnt->soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return excess;
-}
-
-static inline void res_counter_reset_max(struct res_counter *cnt)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->max_usage = cnt->usage;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-}
-
-static inline void res_counter_reset_failcnt(struct res_counter *cnt)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->failcnt = 0;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-}
-
-static inline int res_counter_set_limit(struct res_counter *cnt,
-		unsigned long long limit)
-{
-	unsigned long flags;
-	int ret = -EBUSY;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= limit) {
-		cnt->limit = limit;
-		ret = 0;
-	}
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return ret;
-}
-
-static inline int
-res_counter_set_soft_limit(struct res_counter *cnt,
-				unsigned long long soft_limit)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->soft_limit = soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return 0;
-}
-
-#endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 515a4d01e932..f41749982668 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -55,7 +55,6 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/memcontrol.h>
-#include <linux/res_counter.h>
 #include <linux/static_key.h>
 #include <linux/aio.h>
 #include <linux/sched.h>
@@ -1066,7 +1065,7 @@ enum cg_proto_flags {
 };
 
 struct cg_proto {
-	struct res_counter	memory_allocated;	/* Current allocated memory. */
+	struct page_counter	memory_allocated;	/* Current allocated memory. */
 	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
 	int			memory_pressure;
 	long			sysctl_mem[3];
@@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
 					      unsigned long amt,
 					      int *parent_status)
 {
-	struct res_counter *fail;
-	int ret;
+	page_counter_charge(&prot->memory_allocated, amt, NULL);
 
-	ret = res_counter_charge_nofail(&prot->memory_allocated,
-					amt << PAGE_SHIFT, &fail);
-	if (ret < 0)
+	if (atomic_long_read(&prot->memory_allocated.count) >
+	    prot->memory_allocated.limit)
 		*parent_status = OVER_LIMIT;
 }
 
 static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
 					      unsigned long amt)
 {
-	res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT);
-}
-
-static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
-{
-	u64 ret;
-	ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
-	return ret >> PAGE_SHIFT;
+	page_counter_uncharge(&prot->memory_allocated, amt);
 }
 
 static inline long
 sk_memory_allocated(const struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
+
 	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return memcg_memory_allocated_read(sk->sk_cgrp);
+		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
 
 	return atomic_long_read(prot->memory_allocated);
 }
@@ -1259,7 +1250,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
 		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
 		/* update the root cgroup regardless */
 		atomic_long_add_return(amt, prot->memory_allocated);
-		return memcg_memory_allocated_read(sk->sk_cgrp);
+		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
 	}
 
 	return atomic_long_add_return(amt, prot->memory_allocated);
diff --git a/init/Kconfig b/init/Kconfig
index 0471be99ec38..1cf42b563834 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -975,15 +975,8 @@ config CGROUP_CPUACCT
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
 
-config RESOURCE_COUNTERS
-	bool "Resource counters"
-	help
-	  This option enables controller independent resource accounting
-	  infrastructure that works with cgroups.
-
 config MEMCG
 	bool "Memory Resource Controller for Control Groups"
-	depends on RESOURCE_COUNTERS
 	select EVENTFD
 	help
 	  Provides a memory resource controller that manages both anonymous
@@ -1051,7 +1044,7 @@ config MEMCG_KMEM
 
 config CGROUP_HUGETLB
 	bool "HugeTLB Resource Controller for Control Groups"
-	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
+	depends on MEMCG && HUGETLB_PAGE
 	default n
 	help
 	  Provides a cgroup Resource Controller for HugeTLB pages.
diff --git a/kernel/Makefile b/kernel/Makefile
index 726e18443da0..245953354974 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -58,7 +58,6 @@ obj-$(CONFIG_USER_NS) += user_namespace.o
 obj-$(CONFIG_PID_NS) += pid_namespace.o
 obj-$(CONFIG_DEBUG_SYNCHRO_TEST) += synchro-test.o
 obj-$(CONFIG_IKCONFIG) += configs.o
-obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
 obj-$(CONFIG_SMP) += stop_machine.o
 obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
 obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
deleted file mode 100644
index e791130f85a7..000000000000
--- a/kernel/res_counter.c
+++ /dev/null
@@ -1,211 +0,0 @@
-/*
- * resource cgroups
- *
- * Copyright 2007 OpenVZ SWsoft Inc
- *
- * Author: Pavel Emelianov <xemul@openvz.org>
- *
- */
-
-#include <linux/types.h>
-#include <linux/parser.h>
-#include <linux/fs.h>
-#include <linux/res_counter.h>
-#include <linux/uaccess.h>
-#include <linux/mm.h>
-
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
-{
-	spin_lock_init(&counter->lock);
-	counter->limit = RES_COUNTER_MAX;
-	counter->soft_limit = RES_COUNTER_MAX;
-	counter->parent = parent;
-}
-
-static u64 res_counter_uncharge_locked(struct res_counter *counter,
-				       unsigned long val)
-{
-	if (WARN_ON(counter->usage < val))
-		val = counter->usage;
-
-	counter->usage -= val;
-	return counter->usage;
-}
-
-static int res_counter_charge_locked(struct res_counter *counter,
-				     unsigned long val, bool force)
-{
-	int ret = 0;
-
-	if (counter->usage + val > counter->limit) {
-		counter->failcnt++;
-		ret = -ENOMEM;
-		if (!force)
-			return ret;
-	}
-
-	counter->usage += val;
-	if (counter->usage > counter->max_usage)
-		counter->max_usage = counter->usage;
-	return ret;
-}
-
-static int __res_counter_charge(struct res_counter *counter, unsigned long val,
-				struct res_counter **limit_fail_at, bool force)
-{
-	int ret, r;
-	unsigned long flags;
-	struct res_counter *c, *u;
-
-	r = ret = 0;
-	*limit_fail_at = NULL;
-	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
-		spin_lock(&c->lock);
-		r = res_counter_charge_locked(c, val, force);
-		spin_unlock(&c->lock);
-		if (r < 0 && !ret) {
-			ret = r;
-			*limit_fail_at = c;
-			if (!force)
-				break;
-		}
-	}
-
-	if (ret < 0 && !force) {
-		for (u = counter; u != c; u = u->parent) {
-			spin_lock(&u->lock);
-			res_counter_uncharge_locked(u, val);
-			spin_unlock(&u->lock);
-		}
-	}
-	local_irq_restore(flags);
-
-	return ret;
-}
-
-int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
-{
-	return __res_counter_charge(counter, val, limit_fail_at, false);
-}
-
-int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
-			      struct res_counter **limit_fail_at)
-{
-	return __res_counter_charge(counter, val, limit_fail_at, true);
-}
-
-u64 res_counter_uncharge_until(struct res_counter *counter,
-			       struct res_counter *top,
-			       unsigned long val)
-{
-	unsigned long flags;
-	struct res_counter *c;
-	u64 ret = 0;
-
-	local_irq_save(flags);
-	for (c = counter; c != top; c = c->parent) {
-		u64 r;
-		spin_lock(&c->lock);
-		r = res_counter_uncharge_locked(c, val);
-		if (c == counter)
-			ret = r;
-		spin_unlock(&c->lock);
-	}
-	local_irq_restore(flags);
-	return ret;
-}
-
-u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
-{
-	return res_counter_uncharge_until(counter, NULL, val);
-}
-
-static inline unsigned long long *
-res_counter_member(struct res_counter *counter, int member)
-{
-	switch (member) {
-	case RES_USAGE:
-		return &counter->usage;
-	case RES_MAX_USAGE:
-		return &counter->max_usage;
-	case RES_LIMIT:
-		return &counter->limit;
-	case RES_FAILCNT:
-		return &counter->failcnt;
-	case RES_SOFT_LIMIT:
-		return &counter->soft_limit;
-	};
-
-	BUG();
-	return NULL;
-}
-
-ssize_t res_counter_read(struct res_counter *counter, int member,
-		const char __user *userbuf, size_t nbytes, loff_t *pos,
-		int (*read_strategy)(unsigned long long val, char *st_buf))
-{
-	unsigned long long *val;
-	char buf[64], *s;
-
-	s = buf;
-	val = res_counter_member(counter, member);
-	if (read_strategy)
-		s += read_strategy(*val, s);
-	else
-		s += sprintf(s, "%llu\n", *val);
-	return simple_read_from_buffer((void __user *)userbuf, nbytes,
-			pos, buf, s - buf);
-}
-
-#if BITS_PER_LONG == 32
-u64 res_counter_read_u64(struct res_counter *counter, int member)
-{
-	unsigned long flags;
-	u64 ret;
-
-	spin_lock_irqsave(&counter->lock, flags);
-	ret = *res_counter_member(counter, member);
-	spin_unlock_irqrestore(&counter->lock, flags);
-
-	return ret;
-}
-#else
-u64 res_counter_read_u64(struct res_counter *counter, int member)
-{
-	return *res_counter_member(counter, member);
-}
-#endif
-
-int res_counter_memparse_write_strategy(const char *buf,
-					unsigned long long *resp)
-{
-	char *end;
-	unsigned long long res;
-
-	/* return RES_COUNTER_MAX(unlimited) if "-1" is specified */
-	if (*buf == '-') {
-		int rc = kstrtoull(buf + 1, 10, &res);
-
-		if (rc)
-			return rc;
-		if (res != 1)
-			return -EINVAL;
-		*resp = RES_COUNTER_MAX;
-		return 0;
-	}
-
-	res = memparse(buf, &end);
-	if (*end != '\0')
-		return -EINVAL;
-
-	if (PAGE_ALIGN(res) >= res)
-		res = PAGE_ALIGN(res);
-	else
-		res = RES_COUNTER_MAX;
-
-	*resp = res;
-
-	return 0;
-}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a67c26e0f360..e619b6b62f1f 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -14,6 +14,7 @@
  */
 
 #include <linux/cgroup.h>
+#include <linux/memcontrol.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
@@ -23,7 +24,7 @@ struct hugetlb_cgroup {
 	/*
 	 * the counter to account for hugepages from hugetlb.
 	 */
-	struct res_counter hugepage[HUGE_MAX_HSTATE];
+	struct page_counter hugepage[HUGE_MAX_HSTATE];
 };
 
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -60,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 	int idx;
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-		if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
+		if (atomic_long_read(&h_cg->hugepage[idx].count))
 			return true;
 	}
 	return false;
@@ -79,12 +80,12 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	if (parent_h_cgroup) {
 		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
-			res_counter_init(&h_cgroup->hugepage[idx],
-					 &parent_h_cgroup->hugepage[idx]);
+			page_counter_init(&h_cgroup->hugepage[idx],
+					  &parent_h_cgroup->hugepage[idx]);
 	} else {
 		root_h_cgroup = h_cgroup;
 		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
-			res_counter_init(&h_cgroup->hugepage[idx], NULL);
+			page_counter_init(&h_cgroup->hugepage[idx], NULL);
 	}
 	return &h_cgroup->css;
 }
@@ -108,9 +109,8 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
 static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
 				       struct page *page)
 {
-	int csize;
-	struct res_counter *counter;
-	struct res_counter *fail_res;
+	unsigned int nr_pages;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *page_hcg;
 	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
 
@@ -123,15 +123,15 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
 	if (!page_hcg || page_hcg != h_cg)
 		goto out;
 
-	csize = PAGE_SIZE << compound_order(page);
+	nr_pages = 1 << compound_order(page);
 	if (!parent) {
 		parent = root_h_cgroup;
 		/* root has no limit */
-		res_counter_charge_nofail(&parent->hugepage[idx],
-					  csize, &fail_res);
+		page_counter_charge(&parent->hugepage[idx], nr_pages, NULL);
 	}
 	counter = &h_cg->hugepage[idx];
-	res_counter_uncharge_until(counter, counter->parent, csize);
+	/* Take the pages off the local counter */
+	page_counter_cancel(counter, nr_pages);
 
 	set_hugetlb_cgroup(page, parent);
 out:
@@ -166,9 +166,8 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
 				 struct hugetlb_cgroup **ptr)
 {
 	int ret = 0;
-	struct res_counter *fail_res;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *h_cg = NULL;
-	unsigned long csize = nr_pages * PAGE_SIZE;
 
 	if (hugetlb_cgroup_disabled())
 		goto done;
@@ -187,7 +186,7 @@ again:
 	}
 	rcu_read_unlock();
 
-	ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
+	ret = page_counter_charge(&h_cg->hugepage[idx], nr_pages, &counter);
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
@@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 				  struct page *page)
 {
 	struct hugetlb_cgroup *h_cg;
-	unsigned long csize = nr_pages * PAGE_SIZE;
 
 	if (hugetlb_cgroup_disabled())
 		return;
@@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 	if (unlikely(!h_cg))
 		return;
 	set_hugetlb_cgroup(page, NULL);
-	res_counter_uncharge(&h_cg->hugepage[idx], csize);
+	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
 	return;
 }
 
 void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 				    struct hugetlb_cgroup *h_cg)
 {
-	unsigned long csize = nr_pages * PAGE_SIZE;
-
 	if (hugetlb_cgroup_disabled() || !h_cg)
 		return;
 
 	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
 		return;
 
-	res_counter_uncharge(&h_cg->hugepage[idx], csize);
+	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
 	return;
 }
 
+enum {
+	RES_USAGE,
+	RES_LIMIT,
+	RES_MAX_USAGE,
+	RES_FAILCNT,
+};
+
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
-	int idx, name;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 
-	idx = MEMFILE_IDX(cft->private);
-	name = MEMFILE_ATTR(cft->private);
+	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
 
-	return res_counter_read_u64(&h_cg->hugepage[idx], name);
+	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_USAGE:
+		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+	case RES_LIMIT:
+		return (u64)counter->limit * PAGE_SIZE;
+	case RES_MAX_USAGE:
+		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_FAILCNT:
+		return counter->limited;
+	default:
+		BUG();
+	}
 }
 
 static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
-	int idx, name, ret;
-	unsigned long long val;
+	int ret, idx;
+	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
 
+	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
+		return -EINVAL;
+
 	buf = strstrip(buf);
+	ret = page_counter_memparse(buf, &nr_pages);
+	if (ret)
+		return ret;
+
 	idx = MEMFILE_IDX(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_LIMIT:
-		if (hugetlb_cgroup_is_root(h_cg)) {
-			/* Can't set limit on root */
-			ret = -EINVAL;
-			break;
-		}
-		/* This function does all necessary parse...reuse it */
-		ret = res_counter_memparse_write_strategy(buf, &val);
-		if (ret)
-			break;
-		val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
-		ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
+		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));
+		ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
 		break;
 	default:
 		ret = -EINVAL;
@@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
-	int idx, name, ret = 0;
+	int ret = 0;
+	struct page_counter *counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
 
-	idx = MEMFILE_IDX(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
+	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		res_counter_reset_max(&h_cg->hugepage[idx]);
+		counter->watermark = atomic_long_read(&counter->count);
 		break;
 	case RES_FAILCNT:
-		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
+		counter->limited = 0;
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2def11f1ec1..dfd3b15a57e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,7 +25,6 @@
  * GNU General Public License for more details.
  */
 
-#include <linux/res_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
@@ -66,6 +65,117 @@
 
 #include <trace/events/vmscan.h>
 
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+{
+	long new;
+
+	new = atomic_long_sub_return(nr_pages, &counter->count);
+
+	if (WARN_ON(unlikely(new < 0)))
+		atomic_long_set(&counter->count, 0);
+
+	return new > 1;
+}
+
+int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
+			struct page_counter **fail)
+{
+	struct page_counter *c;
+
+	for (c = counter; c; c = c->parent) {
+		for (;;) {
+			unsigned long count;
+			unsigned long new;
+
+			count = atomic_long_read(&c->count);
+
+			new = count + nr_pages;
+			if (new > c->limit) {
+				c->limited++;
+				if (fail) {
+					*fail = c;
+					goto failed;
+				}
+			}
+
+			if (atomic_long_cmpxchg(&c->count, count, new) != count)
+				continue;
+
+			if (new > c->watermark)
+				c->watermark = new;
+
+			break;
+		}
+	}
+	return 0;
+
+failed:
+	for (c = counter; c != *fail; c = c->parent)
+		page_counter_cancel(c, nr_pages);
+
+	return -ENOMEM;
+}
+
+int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
+{
+	struct page_counter *c;
+	int ret = 1;
+
+	for (c = counter; c; c = c->parent) {
+		int remainder;
+
+		remainder = page_counter_cancel(c, nr_pages);
+		if (c == counter && !remainder)
+			ret = 0;
+	}
+
+	return ret;
+}
+
+int page_counter_limit(struct page_counter *counter, unsigned long limit)
+{
+	for (;;) {
+		unsigned long count;
+		unsigned long old;
+
+		count = atomic_long_read(&counter->count);
+
+		old = xchg(&counter->limit, limit);
+
+		if (atomic_long_read(&counter->count) != count) {
+			counter->limit = old;
+			continue;
+		}
+
+		if (count > limit) {
+			counter->limit = old;
+			return -EBUSY;
+		}
+
+		return 0;
+	}
+}
+
+int page_counter_memparse(const char *buf, unsigned long *nr_pages)
+{
+	char unlimited[] = "-1";
+	char *end;
+	u64 bytes;
+
+	if (!strncmp(buf, unlimited, sizeof(unlimited))) {
+		*nr_pages = PAGE_COUNTER_MAX;
+		return 0;
+	}
+
+	bytes = memparse(buf, &end);
+	if (*end != '\0')
+		return -EINVAL;
+
+	*nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
+
+	return 0;
+}
+
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
@@ -165,7 +275,7 @@ struct mem_cgroup_per_zone {
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
 
 	struct rb_node		tree_node;	/* RB tree node */
-	unsigned long long	usage_in_excess;/* Set to the value by which */
+	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
 	bool			on_tree;
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
@@ -198,7 +308,7 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
 
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
-	u64 threshold;
+	unsigned long threshold;
 };
 
 /* For threshold */
@@ -284,24 +394,18 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
  */
 struct mem_cgroup {
 	struct cgroup_subsys_state css;
-	/*
-	 * the counter to account for memory usage
-	 */
-	struct res_counter res;
+
+	/* Accounted resources */
+	struct page_counter memory;
+	struct page_counter memsw;
+	struct page_counter kmem;
+
+	unsigned long soft_limit;
 
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
 	/*
-	 * the counter to account for mem+swap usage.
-	 */
-	struct res_counter memsw;
-
-	/*
-	 * the counter to account for kernel memory usage.
-	 */
-	struct res_counter kmem;
-	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
@@ -647,7 +751,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 	 * This check can't live in kmem destruction function,
 	 * since the charges will outlive the cgroup
 	 */
-	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
+	WARN_ON(atomic_long_read(&memcg->kmem.count));
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -703,7 +807,7 @@ soft_limit_tree_from_page(struct page *page)
 
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 					 struct mem_cgroup_tree_per_zone *mctz,
-					 unsigned long long new_usage_in_excess)
+					 unsigned long new_usage_in_excess)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
@@ -752,10 +856,21 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
+static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+{
+	unsigned long nr_pages = atomic_long_read(&memcg->memory.count);
+	unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
+	unsigned long excess = 0;
+
+	if (nr_pages > soft_limit)
+		excess = nr_pages - soft_limit;
+
+	return excess;
+}
 
 static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 {
-	unsigned long long excess;
+	unsigned long excess;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_tree_per_zone *mctz;
 
@@ -766,7 +881,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		mz = mem_cgroup_page_zoneinfo(memcg, page);
-		excess = res_counter_soft_limit_excess(&memcg->res);
+		excess = soft_limit_excess(memcg);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
@@ -822,7 +937,7 @@ retry:
 	 * position in the tree.
 	 */
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
+	if (!soft_limit_excess(mz->memcg) ||
 	    !css_tryget_online(&mz->memcg->css))
 		goto retry;
 done:
@@ -1478,7 +1593,7 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 	return inactive * inactive_ratio < active;
 }
 
-#define mem_cgroup_from_res_counter(counter, member)	\
+#define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
 /**
@@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
  */
 static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
-	unsigned long long margin;
+	unsigned long margin = 0;
+	unsigned long count;
+	unsigned long limit;
 
-	margin = res_counter_margin(&memcg->res);
-	if (do_swap_account)
-		margin = min(margin, res_counter_margin(&memcg->memsw));
-	return margin >> PAGE_SHIFT;
+	count = atomic_long_read(&memcg->memory.count);
+	limit = ACCESS_ONCE(memcg->memory.limit);
+	if (count < limit)
+		margin = limit - count;
+
+	if (do_swap_account) {
+		count = atomic_long_read(&memcg->memsw.count);
+		limit = ACCESS_ONCE(memcg->memsw.limit);
+		if (count < limit)
+			margin = min(margin, limit - count);
+	}
+
+	return margin;
 }
 
 int mem_cgroup_swappiness(struct mem_cgroup *memcg)
@@ -1636,18 +1762,15 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 
 	rcu_read_unlock();
 
-	pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
-		res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
-		res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
-		res_counter_read_u64(&memcg->res, RES_FAILCNT));
-	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
-		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
-		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
-		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
-	pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n",
-		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
-		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
-		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
+	pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
+		K((u64)atomic_long_read(&memcg->memory.count)),
+		K((u64)memcg->memory.limit), memcg->memory.limited);
+	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
+		K((u64)atomic_long_read(&memcg->memsw.count)),
+		K((u64)memcg->memsw.limit), memcg->memsw.limited);
+	pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
+		K((u64)atomic_long_read(&memcg->kmem.count)),
+		K((u64)memcg->kmem.limit), memcg->kmem.limited);
 
 	for_each_mem_cgroup_tree(iter, memcg) {
 		pr_info("Memory cgroup stats for ");
@@ -1685,30 +1808,19 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
 }
 
 /*
- * Return the memory (and swap, if configured) limit for a memcg.
+ * Return the memory (and swap, if configured) maximum consumption for a memcg.
  */
-static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
 {
-	u64 limit;
+	unsigned long limit;
 
-	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-
-	/*
-	 * Do not consider swap space if we cannot swap due to swappiness
-	 */
+	limit = memcg->memory.limit;
 	if (mem_cgroup_swappiness(memcg)) {
-		u64 memsw;
+		unsigned long memsw_limit;
 
-		limit += total_swap_pages << PAGE_SHIFT;
-		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-
-		/*
-		 * If memsw is finite and limits the amount of swap space
-		 * available to this memcg, return that limit.
-		 */
-		limit = min(limit, memsw);
+		memsw_limit = memcg->memsw.limit;
+		limit = min(limit + total_swap_pages, memsw_limit);
 	}
-
 	return limit;
 }
 
@@ -1732,7 +1844,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	}
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
-	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -1935,7 +2047,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		.priority = 0,
 	};
 
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
+	excess = soft_limit_excess(root_memcg);
 
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1966,7 +2078,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
 						     zone, &nr_scanned);
 		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
+		if (!soft_limit_excess(root_memcg))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -2293,33 +2405,31 @@ static DEFINE_MUTEX(percpu_charge_mutex);
 static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	struct memcg_stock_pcp *stock;
-	bool ret = true;
+	bool ret = false;
 
 	if (nr_pages > CHARGE_BATCH)
-		return false;
+		return ret;
 
 	stock = &get_cpu_var(memcg_stock);
-	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
 		stock->nr_pages -= nr_pages;
-	else /* need to call res_counter_charge */
-		ret = false;
+		ret = true;
+	}
 	put_cpu_var(memcg_stock);
 	return ret;
 }
 
 /*
- * Returns stocks cached in percpu to res_counter and reset cached information.
+ * Returns stocks cached in percpu and reset cached information.
  */
 static void drain_stock(struct memcg_stock_pcp *stock)
 {
 	struct mem_cgroup *old = stock->cached;
 
 	if (stock->nr_pages) {
-		unsigned long bytes = stock->nr_pages * PAGE_SIZE;
-
-		res_counter_uncharge(&old->res, bytes);
+		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_swap_account)
-			res_counter_uncharge(&old->memsw, bytes);
+			page_counter_uncharge(&old->memsw, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
 	stock->cached = NULL;
@@ -2348,7 +2458,7 @@ static void __init memcg_stock_init(void)
 }
 
 /*
- * Cache charges(val) which is from res_counter, to local per_cpu area.
+ * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2408,8 +2518,7 @@ out:
 /*
  * Tries to drain stocked charges in other cpus. This function is asynchronous
  * and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back to res_counter later but cannot wait for
- * it.
+ * expects some charges will be back later but cannot wait for it.
  */
 static void drain_all_stock_async(struct mem_cgroup *root_memcg)
 {
@@ -2483,9 +2592,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
-	struct res_counter *fail_res;
+	struct page_counter *counter;
 	unsigned long nr_reclaimed;
-	unsigned long long size;
 	bool may_swap = true;
 	bool drained = false;
 	int ret = 0;
@@ -2496,17 +2604,16 @@ retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
 
-	size = batch * PAGE_SIZE;
-	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+	if (!page_counter_charge(&memcg->memory, batch, &counter)) {
 		if (!do_swap_account)
 			goto done_restock;
-		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+		if (!page_counter_charge(&memcg->memsw, batch, &counter))
 			goto done_restock;
-		res_counter_uncharge(&memcg->res, size);
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+		page_counter_uncharge(&memcg->memory, batch);
+		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
 		may_swap = false;
 	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+		mem_over_limit = mem_cgroup_from_counter(counter, memory);
 
 	if (batch > nr_pages) {
 		batch = nr_pages;
@@ -2587,32 +2694,12 @@ done:
 
 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	unsigned long bytes = nr_pages * PAGE_SIZE;
-
 	if (mem_cgroup_is_root(memcg))
 		return;
 
-	res_counter_uncharge(&memcg->res, bytes);
+	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_swap_account)
-		res_counter_uncharge(&memcg->memsw, bytes);
-}
-
-/*
- * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
- * This is useful when moving usage to parent cgroup.
- */
-static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
-					unsigned int nr_pages)
-{
-	unsigned long bytes = nr_pages * PAGE_SIZE;
-
-	if (mem_cgroup_is_root(memcg))
-		return;
-
-	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
-	if (do_swap_account)
-		res_counter_uncharge_until(&memcg->memsw,
-						memcg->memsw.parent, bytes);
+		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 
 /*
@@ -2736,8 +2823,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		unlock_page_lru(page, isolated);
 }
 
-static DEFINE_MUTEX(set_limit_mutex);
-
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * The memcg_slab_mutex is held whenever a per memcg kmem cache is created or
@@ -2786,16 +2871,17 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 }
 #endif
 
-static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
+			     unsigned long nr_pages)
 {
-	struct res_counter *fail_res;
+	struct page_counter *counter;
 	int ret = 0;
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
-	if (ret)
+	ret = page_counter_charge(&memcg->kmem, nr_pages, &counter);
+	if (ret < 0)
 		return ret;
 
-	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
+	ret = try_charge(memcg, gfp, nr_pages);
 	if (ret == -EINTR)  {
 		/*
 		 * try_charge() chose to bypass to root due to OOM kill or
@@ -2812,25 +2898,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
-		res_counter_charge_nofail(&memcg->res, size, &fail_res);
+		page_counter_charge(&memcg->memory, nr_pages, NULL);
 		if (do_swap_account)
-			res_counter_charge_nofail(&memcg->memsw, size,
-						  &fail_res);
+			page_counter_charge(&memcg->memsw, nr_pages, NULL);
 		ret = 0;
 	} else if (ret)
-		res_counter_uncharge(&memcg->kmem, size);
+		page_counter_uncharge(&memcg->kmem, nr_pages);
 
 	return ret;
 }
 
-static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
+				unsigned long nr_pages)
 {
-	res_counter_uncharge(&memcg->res, size);
+	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_swap_account)
-		res_counter_uncharge(&memcg->memsw, size);
+		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	/* Not down to 0 */
-	if (res_counter_uncharge(&memcg->kmem, size))
+	if (page_counter_uncharge(&memcg->kmem, nr_pages))
 		return;
 
 	/*
@@ -3107,19 +3193,21 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,
 
 int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
 {
+	unsigned int nr_pages = 1 << order;
 	int res;
 
-	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
-				PAGE_SIZE << order);
+	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages);
 	if (!res)
-		atomic_add(1 << order, &cachep->memcg_params->nr_pages);
+		atomic_add(nr_pages, &cachep->memcg_params->nr_pages);
 	return res;
 }
 
 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
 {
-	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
-	atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
+	unsigned int nr_pages = 1 << order;
+
+	memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages);
+	atomic_sub(nr_pages, &cachep->memcg_params->nr_pages);
 }
 
 /*
@@ -3240,7 +3328,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
 		return true;
 	}
 
-	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
+	ret = memcg_charge_kmem(memcg, gfp, 1 << order);
 	if (!ret)
 		*_memcg = memcg;
 
@@ -3257,7 +3345,7 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
 
 	/* The page allocation failed. Revert */
 	if (!page) {
-		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+		memcg_uncharge_kmem(memcg, 1 << order);
 		return;
 	}
 	/*
@@ -3290,7 +3378,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 		return;
 
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+	memcg_uncharge_kmem(memcg, 1 << order);
 }
 #else
 static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
@@ -3468,8 +3556,12 @@ static int mem_cgroup_move_parent(struct page *page,
 
 	ret = mem_cgroup_move_account(page, nr_pages,
 				pc, child, parent);
-	if (!ret)
-		__mem_cgroup_cancel_local_charge(child, nr_pages);
+	if (!ret) {
+		/* Take charge off the local counters */
+		page_counter_cancel(&child->memory, nr_pages);
+		if (do_swap_account)
+			page_counter_cancel(&child->memsw, nr_pages);
+	}
 
 	if (nr_pages > 1)
 		compound_unlock_irqrestore(page, flags);
@@ -3499,7 +3591,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
  *
  * Returns 0 on success, -EINVAL on failure.
  *
- * The caller must have charged to @to, IOW, called res_counter_charge() about
+ * The caller must have charged to @to, IOW, called page_counter_charge() about
  * both res and memsw, and called css_get().
  */
 static int mem_cgroup_move_swap_account(swp_entry_t entry,
@@ -3515,7 +3607,7 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 		mem_cgroup_swap_statistics(to, true);
 		/*
 		 * This function is only called from task migration context now.
-		 * It postpones res_counter and refcount handling till the end
+		 * It postpones page_counter and refcount handling till the end
 		 * of task migration(mem_cgroup_clear_mc()) for performance
 		 * improvement. But we cannot postpone css_get(to)  because if
 		 * the process that has been moved to @to does swap-in, the
@@ -3573,49 +3665,42 @@ void mem_cgroup_print_bad_page(struct page *page)
 }
 #endif
 
+static DEFINE_MUTEX(set_limit_mutex);
+
 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
-				unsigned long long val)
+				   unsigned long limit)
 {
+	unsigned long curusage;
+	unsigned long oldusage;
+	bool enlarge = false;
 	int retry_count;
-	u64 memswlimit, memlimit;
-	int ret = 0;
-	int children = mem_cgroup_count_children(memcg);
-	u64 curusage, oldusage;
-	int enlarge;
+	int ret;
 
 	/*
 	 * For keeping hierarchical_reclaim simple, how long we should retry
 	 * is depends on callers. We set our retry-count to be function
 	 * of # of children which we should visit in this loop.
 	 */
-	retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
+	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+		      mem_cgroup_count_children(memcg);
 
-	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	oldusage = atomic_long_read(&memcg->memory.count);
 
-	enlarge = 0;
-	while (retry_count) {
+	do {
 		if (signal_pending(current)) {
 			ret = -EINTR;
 			break;
 		}
-		/*
-		 * Rather than hide all in some function, I do this in
-		 * open coded manner. You see what this really does.
-		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
-		 */
+
 		mutex_lock(&set_limit_mutex);
-		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-		if (memswlimit < val) {
-			ret = -EINVAL;
+		if (limit > memcg->memsw.limit) {
 			mutex_unlock(&set_limit_mutex);
+			ret = -EINVAL;
 			break;
 		}
-
-		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-		if (memlimit < val)
-			enlarge = 1;
-
-		ret = res_counter_set_limit(&memcg->res, val);
+		if (limit > memcg->memory.limit)
+			enlarge = true;
+		ret = page_counter_limit(&memcg->memory, limit);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3623,13 +3708,14 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
 
-		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
+		curusage = atomic_long_read(&memcg->memory.count);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
 		else
 			oldusage = curusage;
-	}
+	} while (retry_count);
+
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
 
@@ -3637,38 +3723,35 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 }
 
 static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
-					unsigned long long val)
+					 unsigned long limit)
 {
+	unsigned long curusage;
+	unsigned long oldusage;
+	bool enlarge = false;
 	int retry_count;
-	u64 memlimit, memswlimit, oldusage, curusage;
-	int children = mem_cgroup_count_children(memcg);
-	int ret = -EBUSY;
-	int enlarge = 0;
+	int ret;
 
 	/* see mem_cgroup_resize_res_limit */
-	retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
-	oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
-	while (retry_count) {
+	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
+		      mem_cgroup_count_children(memcg);
+
+	oldusage = atomic_long_read(&memcg->memsw.count);
+
+	do {
 		if (signal_pending(current)) {
 			ret = -EINTR;
 			break;
 		}
-		/*
-		 * Rather than hide all in some function, I do this in
-		 * open coded manner. You see what this really does.
-		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
-		 */
+
 		mutex_lock(&set_limit_mutex);
-		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-		if (memlimit > val) {
-			ret = -EINVAL;
+		if (limit < memcg->memory.limit) {
 			mutex_unlock(&set_limit_mutex);
+			ret = -EINVAL;
 			break;
 		}
-		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-		if (memswlimit < val)
-			enlarge = 1;
-		ret = res_counter_set_limit(&memcg->memsw, val);
+		if (limit > memcg->memsw.limit)
+			enlarge = true;
+		ret = page_counter_limit(&memcg->memsw, limit);
 		mutex_unlock(&set_limit_mutex);
 
 		if (!ret)
@@ -3676,15 +3759,17 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
 
-		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+		curusage = atomic_long_read(&memcg->memsw.count);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
 		else
 			oldusage = curusage;
-	}
+	} while (retry_count);
+
 	if (!ret && enlarge)
 		memcg_oom_recover(memcg);
+
 	return ret;
 }
 
@@ -3697,7 +3782,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	unsigned long reclaimed;
 	int loop = 0;
 	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
+	unsigned long excess;
 	unsigned long nr_scanned;
 
 	if (order > 0)
@@ -3751,7 +3836,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			} while (1);
 		}
 		__mem_cgroup_remove_exceeded(mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
+		excess = soft_limit_excess(mz->memcg);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
@@ -3844,7 +3929,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
 	int node, zid;
-	u64 usage;
 
 	do {
 		/* This is for making all *used* pages to be on LRU. */
@@ -3876,9 +3960,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 		 * right after the check. RES_USAGE should be safe as we always
 		 * charge before adding to the LRU.
 		 */
-		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
-			res_counter_read_u64(&memcg->kmem, RES_USAGE);
-	} while (usage > 0);
+	} while (atomic_long_read(&memcg->memory.count) -
+		 atomic_long_read(&memcg->kmem.count) > 0);
 }
 
 /*
@@ -3918,7 +4001,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
-	while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
+	while (nr_retries && atomic_long_read(&memcg->memory.count)) {
 		int progress;
 
 		if (signal_pending(current))
@@ -3989,8 +4072,8 @@ out:
 	return retval;
 }
 
-static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
-					       enum mem_cgroup_stat_index idx)
+static unsigned long tree_stat(struct mem_cgroup *memcg,
+			       enum mem_cgroup_stat_index idx)
 {
 	struct mem_cgroup *iter;
 	long val = 0;
@@ -4008,55 +4091,72 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 {
 	u64 val;
 
-	if (!mem_cgroup_is_root(memcg)) {
+	if (mem_cgroup_is_root(memcg)) {
+		val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE);
+		val += tree_stat(memcg, MEM_CGROUP_STAT_RSS);
+		if (swap)
+			val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
+	} else {
 		if (!swap)
-			return res_counter_read_u64(&memcg->res, RES_USAGE);
+			val = atomic_long_read(&memcg->memory.count);
 		else
-			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+			val = atomic_long_read(&memcg->memsw.count);
 	}
-
-	/*
-	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
-	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
-	 */
-	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
-	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
-
-	if (swap)
-		val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
-
 	return val << PAGE_SHIFT;
 }
 
+enum {
+	RES_USAGE,
+	RES_LIMIT,
+	RES_MAX_USAGE,
+	RES_FAILCNT,
+	RES_SOFT_LIMIT,
+};
 
 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	enum res_type type = MEMFILE_TYPE(cft->private);
-	int name = MEMFILE_ATTR(cft->private);
+	struct page_counter *counter;
 
-	switch (type) {
+	switch (MEMFILE_TYPE(cft->private)) {
 	case _MEM:
-		if (name == RES_USAGE)
-			return mem_cgroup_usage(memcg, false);
-		return res_counter_read_u64(&memcg->res, name);
+		counter = &memcg->memory;
+		break;
 	case _MEMSWAP:
-		if (name == RES_USAGE)
-			return mem_cgroup_usage(memcg, true);
-		return res_counter_read_u64(&memcg->memsw, name);
+		counter = &memcg->memsw;
+		break;
 	case _KMEM:
-		return res_counter_read_u64(&memcg->kmem, name);
+		counter = &memcg->kmem;
 		break;
 	default:
 		BUG();
 	}
+
+	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_USAGE:
+		if (counter == &memcg->memory)
+			return mem_cgroup_usage(memcg, false);
+		if (counter == &memcg->memsw)
+			return mem_cgroup_usage(memcg, true);
+		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+	case RES_LIMIT:
+		return (u64)counter->limit * PAGE_SIZE;
+	case RES_MAX_USAGE:
+		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_FAILCNT:
+		return counter->limited;
+	case RES_SOFT_LIMIT:
+		return (u64)memcg->soft_limit * PAGE_SIZE;
+	default:
+		BUG();
+	}
 }
 
 #ifdef CONFIG_MEMCG_KMEM
 /* should be called with activate_kmem_mutex held */
 static int __memcg_activate_kmem(struct mem_cgroup *memcg,
-				 unsigned long long limit)
+				 unsigned long nr_pages)
 {
 	int err = 0;
 	int memcg_id;
@@ -4103,7 +4203,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 	 * We couldn't have accounted to this cgroup, because it hasn't got the
 	 * active bit set yet, so this should succeed.
 	 */
-	err = res_counter_set_limit(&memcg->kmem, limit);
+	err = page_counter_limit(&memcg->kmem, nr_pages);
 	VM_BUG_ON(err);
 
 	static_key_slow_inc(&memcg_kmem_enabled_key);
@@ -4119,25 +4219,25 @@ out:
 }
 
 static int memcg_activate_kmem(struct mem_cgroup *memcg,
-			       unsigned long long limit)
+			       unsigned long nr_pages)
 {
 	int ret;
 
 	mutex_lock(&activate_kmem_mutex);
-	ret = __memcg_activate_kmem(memcg, limit);
+	ret = __memcg_activate_kmem(memcg, nr_pages);
 	mutex_unlock(&activate_kmem_mutex);
 	return ret;
 }
 
 static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
-				   unsigned long long val)
+				   unsigned long limit)
 {
 	int ret;
 
 	if (!memcg_kmem_is_active(memcg))
-		ret = memcg_activate_kmem(memcg, val);
+		ret = memcg_activate_kmem(memcg, limit);
 	else
-		ret = res_counter_set_limit(&memcg->kmem, val);
+		ret = page_counter_limit(&memcg->kmem, limit);
 	return ret;
 }
 
@@ -4155,13 +4255,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 	 * after this point, because it has at least one child already.
 	 */
 	if (memcg_kmem_is_active(parent))
-		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
+		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
 	mutex_unlock(&activate_kmem_mutex);
 	return ret;
 }
 #else
 static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
-				   unsigned long long val)
+				   unsigned long limit)
 {
 	return -EINVAL;
 }
@@ -4175,110 +4275,69 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	enum res_type type;
-	int name;
-	unsigned long long val;
+	unsigned long nr_pages;
 	int ret;
 
 	buf = strstrip(buf);
-	type = MEMFILE_TYPE(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
+	ret = page_counter_memparse(buf, &nr_pages);
+	if (ret)
+		return ret;
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_LIMIT:
 		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
 			ret = -EINVAL;
 			break;
 		}
-		/* This function does all necessary parse...reuse it */
-		ret = res_counter_memparse_write_strategy(buf, &val);
-		if (ret)
+		switch (MEMFILE_TYPE(of_cft(of)->private)) {
+		case _MEM:
+			ret = mem_cgroup_resize_limit(memcg, nr_pages);
 			break;
-		if (type == _MEM)
-			ret = mem_cgroup_resize_limit(memcg, val);
-		else if (type == _MEMSWAP)
-			ret = mem_cgroup_resize_memsw_limit(memcg, val);
-		else if (type == _KMEM)
-			ret = memcg_update_kmem_limit(memcg, val);
-		else
-			return -EINVAL;
-		break;
-	case RES_SOFT_LIMIT:
-		ret = res_counter_memparse_write_strategy(buf, &val);
-		if (ret)
+		case _MEMSWAP:
+			ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
 			break;
-		/*
-		 * For memsw, soft limits are hard to implement in terms
-		 * of semantics, for now, we support soft limits for
-		 * control without swap
-		 */
-		if (type == _MEM)
-			ret = res_counter_set_soft_limit(&memcg->res, val);
-		else
-			ret = -EINVAL;
+		case _KMEM:
+			ret = memcg_update_kmem_limit(memcg, nr_pages);
+			break;
+		}
 		break;
-	default:
-		ret = -EINVAL; /* should be BUG() ? */
+	case RES_SOFT_LIMIT:
+		memcg->soft_limit = nr_pages;
+		ret = 0;
 		break;
 	}
 	return ret ?: nbytes;
 }
 
-static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
-		unsigned long long *mem_limit, unsigned long long *memsw_limit)
-{
-	unsigned long long min_limit, min_memsw_limit, tmp;
-
-	min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-	min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-	if (!memcg->use_hierarchy)
-		goto out;
-
-	while (memcg->css.parent) {
-		memcg = mem_cgroup_from_css(memcg->css.parent);
-		if (!memcg->use_hierarchy)
-			break;
-		tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
-		min_limit = min(min_limit, tmp);
-		tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
-		min_memsw_limit = min(min_memsw_limit, tmp);
-	}
-out:
-	*mem_limit = min_limit;
-	*memsw_limit = min_memsw_limit;
-}
-
 static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
 				size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	int name;
-	enum res_type type;
+	struct page_counter *counter;
 
-	type = MEMFILE_TYPE(of_cft(of)->private);
-	name = MEMFILE_ATTR(of_cft(of)->private);
+	switch (MEMFILE_TYPE(of_cft(of)->private)) {
+	case _MEM:
+		counter = &memcg->memory;
+		break;
+	case _MEMSWAP:
+		counter = &memcg->memsw;
+		break;
+	case _KMEM:
+		counter = &memcg->kmem;
+		break;
+	default:
+		BUG();
+	}
 
-	switch (name) {
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		if (type == _MEM)
-			res_counter_reset_max(&memcg->res);
-		else if (type == _MEMSWAP)
-			res_counter_reset_max(&memcg->memsw);
-		else if (type == _KMEM)
-			res_counter_reset_max(&memcg->kmem);
-		else
-			return -EINVAL;
+		counter->watermark = atomic_long_read(&counter->count);
 		break;
 	case RES_FAILCNT:
-		if (type == _MEM)
-			res_counter_reset_failcnt(&memcg->res);
-		else if (type == _MEMSWAP)
-			res_counter_reset_failcnt(&memcg->memsw);
-		else if (type == _KMEM)
-			res_counter_reset_failcnt(&memcg->kmem);
-		else
-			return -EINVAL;
+		counter->limited = 0;
 		break;
+	default:
+		BUG();
 	}
 
 	return nbytes;
@@ -4375,6 +4434,7 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
 static int memcg_stat_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long memory, memsw;
 	struct mem_cgroup *mi;
 	unsigned int i;
 
@@ -4394,14 +4454,16 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 			   mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);
 
 	/* Hierarchical information */
-	{
-		unsigned long long limit, memsw_limit;
-		memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit);
-		seq_printf(m, "hierarchical_memory_limit %llu\n", limit);
-		if (do_swap_account)
-			seq_printf(m, "hierarchical_memsw_limit %llu\n",
-				   memsw_limit);
+	memory = memsw = PAGE_COUNTER_MAX;
+	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
+		memory = min(memory, mi->memory.limit);
+		memsw = min(memsw, mi->memsw.limit);
 	}
+	seq_printf(m, "hierarchical_memory_limit %llu\n",
+		   (u64)memory * PAGE_SIZE);
+	if (do_swap_account)
+		seq_printf(m, "hierarchical_memsw_limit %llu\n",
+			   (u64)memsw * PAGE_SIZE);
 
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
 		long long val = 0;
@@ -4485,7 +4547,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 {
 	struct mem_cgroup_threshold_ary *t;
-	u64 usage;
+	unsigned long usage;
 	int i;
 
 	rcu_read_lock();
@@ -4584,10 +4646,11 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 {
 	struct mem_cgroup_thresholds *thresholds;
 	struct mem_cgroup_threshold_ary *new;
-	u64 threshold, usage;
+	unsigned long threshold;
+	unsigned long usage;
 	int i, size, ret;
 
-	ret = res_counter_memparse_write_strategy(args, &threshold);
+	ret = page_counter_memparse(args, &threshold);
 	if (ret)
 		return ret;
 
@@ -4677,7 +4740,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 {
 	struct mem_cgroup_thresholds *thresholds;
 	struct mem_cgroup_threshold_ary *new;
-	u64 usage;
+	unsigned long usage;
 	int i, j, size;
 
 	mutex_lock(&memcg->thresholds_lock);
@@ -4871,7 +4934,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 
 	memcg_kmem_mark_dead(memcg);
 
-	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
+	if (atomic_long_read(&memcg->kmem.count))
 		return;
 
 	if (memcg_kmem_test_and_clear_dead(memcg))
@@ -5351,9 +5414,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
  */
 struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
-	if (!memcg->res.parent)
+	if (!memcg->memory.parent)
 		return NULL;
-	return mem_cgroup_from_res_counter(memcg->res.parent, res);
+	return mem_cgroup_from_counter(memcg->memory.parent, memory);
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
@@ -5398,9 +5461,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
-		res_counter_init(&memcg->res, NULL);
-		res_counter_init(&memcg->memsw, NULL);
-		res_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->memsw, NULL);
+		page_counter_init(&memcg->kmem, NULL);
 	}
 
 	memcg->last_scanned_node = MAX_NUMNODES;
@@ -5438,18 +5501,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	memcg->swappiness = mem_cgroup_swappiness(parent);
 
 	if (parent->use_hierarchy) {
-		res_counter_init(&memcg->res, &parent->res);
-		res_counter_init(&memcg->memsw, &parent->memsw);
-		res_counter_init(&memcg->kmem, &parent->kmem);
+		page_counter_init(&memcg->memory, &parent->memory);
+		page_counter_init(&memcg->memsw, &parent->memsw);
+		page_counter_init(&memcg->kmem, &parent->kmem);
 
 		/*
 		 * No need to take a reference to the parent because cgroup
 		 * core guarantees its existence.
 		 */
 	} else {
-		res_counter_init(&memcg->res, NULL);
-		res_counter_init(&memcg->memsw, NULL);
-		res_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->memsw, NULL);
+		page_counter_init(&memcg->kmem, NULL);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
@@ -5520,7 +5583,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	/*
 	 * XXX: css_offline() would be where we should reparent all
 	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget_online() and res_counter charging
+	 * memcg does not do css_tryget_online() and page_counter charging
 	 * under the same RCU lock region, which means that charging
 	 * could race with offlining.  Offlining only happens to
 	 * cgroups with no tasks in them but charges can show up
@@ -5540,7 +5603,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	 * call_rcu()
 	 *   offline_css()
 	 *     reparent_charges()
-	 *                           res_counter_charge()
+	 *                           page_counter_charge()
 	 *                           css_put()
 	 *                             css_free()
 	 *                           pc->mem_cgroup = dead memcg
@@ -5575,10 +5638,10 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
-	mem_cgroup_resize_limit(memcg, ULLONG_MAX);
-	mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX);
-	memcg_update_kmem_limit(memcg, ULLONG_MAX);
-	res_counter_set_soft_limit(&memcg->res, ULLONG_MAX);
+	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
+	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
+	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
+	memcg->soft_limit = 0;
 }
 
 #ifdef CONFIG_MMU
@@ -5892,19 +5955,18 @@ static void __mem_cgroup_clear_mc(void)
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
 		if (!mem_cgroup_is_root(mc.from))
-			res_counter_uncharge(&mc.from->memsw,
-					     PAGE_SIZE * mc.moved_swap);
-
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.from->css);
+			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
 
 		/*
-		 * we charged both to->res and to->memsw, so we should
-		 * uncharge to->res.
+		 * we charged both to->memory and to->memsw, so we
+		 * should uncharge to->memory.
 		 */
 		if (!mem_cgroup_is_root(mc.to))
-			res_counter_uncharge(&mc.to->res,
-					     PAGE_SIZE * mc.moved_swap);
+			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+
+		for (i = 0; i < mc.moved_swap; i++)
+			css_put(&mc.from->css);
+
 		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
@@ -6270,7 +6332,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 	memcg = mem_cgroup_lookup(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg))
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+			page_counter_uncharge(&memcg->memsw, 1);
 		mem_cgroup_swap_statistics(memcg, false);
 		css_put(&memcg->css);
 	}
@@ -6436,11 +6498,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 
 	if (!mem_cgroup_is_root(memcg)) {
 		if (nr_mem)
-			res_counter_uncharge(&memcg->res,
-					     nr_mem * PAGE_SIZE);
+			page_counter_uncharge(&memcg->memory, nr_mem);
 		if (nr_memsw)
-			res_counter_uncharge(&memcg->memsw,
-					     nr_memsw * PAGE_SIZE);
+			page_counter_uncharge(&memcg->memsw, nr_memsw);
 		memcg_oom_recover(memcg);
 	}
 
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 1d191357bf88..9a448bdb19e9 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -9,13 +9,13 @@
 int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	/*
-	 * The root cgroup does not use res_counters, but rather,
+	 * The root cgroup does not use page_counters, but rather,
 	 * rely on the data already collected by the network
 	 * subsystem
 	 */
-	struct res_counter *res_parent = NULL;
-	struct cg_proto *cg_proto, *parent_cg;
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct page_counter *counter_parent = NULL;
+	struct cg_proto *cg_proto, *parent_cg;
 
 	cg_proto = tcp_prot.proto_cgroup(memcg);
 	if (!cg_proto)
@@ -29,9 +29,9 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 
 	parent_cg = tcp_prot.proto_cgroup(parent);
 	if (parent_cg)
-		res_parent = &parent_cg->memory_allocated;
+		counter_parent = &parent_cg->memory_allocated;
 
-	res_counter_init(&cg_proto->memory_allocated, res_parent);
+	page_counter_init(&cg_proto->memory_allocated, counter_parent);
 	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
 
 	return 0;
@@ -50,7 +50,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(tcp_destroy_cgroup);
 
-static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
+static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 {
 	struct cg_proto *cg_proto;
 	int i;
@@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
 	if (!cg_proto)
 		return -EINVAL;
 
-	if (val > RES_COUNTER_MAX)
-		val = RES_COUNTER_MAX;
-
-	ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
+	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
 	if (ret)
 		return ret;
 
 	for (i = 0; i < 3; i++)
-		cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
+		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
 						sysctl_tcp_mem[i]);
 
-	if (val == RES_COUNTER_MAX)
+	if (nr_pages == ULONG_MAX / PAGE_SIZE)
 		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
-	else if (val != RES_COUNTER_MAX) {
+	else {
 		/*
 		 * The active bit needs to be written after the static_key
 		 * update. This is what guarantees that the socket activation
@@ -102,11 +99,18 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
 	return 0;
 }
 
+enum {
+	RES_USAGE,
+	RES_LIMIT,
+	RES_MAX_USAGE,
+	RES_FAILCNT,
+};
+
 static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	unsigned long long val;
+	unsigned long nr_pages;
 	int ret = 0;
 
 	buf = strstrip(buf);
@@ -114,10 +118,10 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	switch (of_cft(of)->private) {
 	case RES_LIMIT:
 		/* see memcontrol.c */
-		ret = res_counter_memparse_write_strategy(buf, &val);
+		ret = page_counter_memparse(buf, &nr_pages);
 		if (ret)
 			break;
-		ret = tcp_update_limit(memcg, val);
+		ret = tcp_update_limit(memcg, nr_pages);
 		break;
 	default:
 		ret = -EINVAL;
@@ -126,43 +130,35 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
-static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
-{
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return default_val;
-
-	return res_counter_read_u64(&cg_proto->memory_allocated, type);
-}
-
-static u64 tcp_read_usage(struct mem_cgroup *memcg)
-{
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
-
-	return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
-}
-
 static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
 	u64 val;
 
 	switch (cft->private) {
 	case RES_LIMIT:
-		val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
+		if (!cg_proto)
+			return PAGE_COUNTER_MAX;
+		val = cg_proto->memory_allocated.limit;
+		val *= PAGE_SIZE;
 		break;
 	case RES_USAGE:
-		val = tcp_read_usage(memcg);
+		if (!cg_proto)
+			return atomic_long_read(&tcp_memory_allocated);
+		val = atomic_long_read(&cg_proto->memory_allocated.count);
+		val *= PAGE_SIZE;
 		break;
 	case RES_FAILCNT:
+		if (!cg_proto)
+			return 0;
+		val = cg_proto->memory_allocated.limited;
+		break;
 	case RES_MAX_USAGE:
-		val = tcp_read_stat(memcg, cft->private, 0);
+		if (!cg_proto)
+			return 0;
+		val = cg_proto->memory_allocated.watermark;
+		val *= PAGE_SIZE;
 		break;
 	default:
 		BUG();
@@ -183,10 +179,11 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
 
 	switch (of_cft(of)->private) {
 	case RES_MAX_USAGE:
-		res_counter_reset_max(&cg_proto->memory_allocated);
+		cg_proto->memory_allocated.watermark =
+			atomic_long_read(&cg_proto->memory_allocated.count);
 		break;
 	case RES_FAILCNT:
-		res_counter_reset_failcnt(&cg_proto->memory_allocated);
+		cg_proto->memory_allocated.limited = 0;
 		break;
 	}
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-19 13:22 ` Johannes Weiner
@ 2014-09-19 13:29   ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-19 13:29 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

Hi Dave,

this patch removes the lock you saw with will-it-scale/page_fault2
entirely; is there a chance you could give it a spin?  It's based on
v3.17-rc4-mmots-2014-09-12-17-13-4 and that memcg THP fix.  That
kernel also includes the recent root-memcg revert, so you'd have to
run it in a memcg, which is as easy as:

mkdir /sys/fs/cgroup/memory/foo
echo $$ >/sys/fs/cgroup/memory/foo/tasks
perf record -g -a ./runtest.py page_fault2

Thanks!

^ permalink raw reply	[flat|nested] 53+ messages in thread


* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-19 13:22 ` Johannes Weiner
@ 2014-09-22 14:41   ` Vladimir Davydov
  -1 siblings, 0 replies; 53+ messages in thread
From: Vladimir Davydov @ 2014-09-22 14:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> Memory is internally accounted in bytes, using spinlock-protected
> 64-bit counters, even though the smallest accounting delta is a page.
> The counter interface is also convoluted and does too many things.
> 
> Introduce a new lockless word-sized page counter API, then change all
> memory accounting over to it and remove the old one.  The translation
> from and to bytes then only happens when interfacing with userspace.
> 
> Aside from the locking costs, this gets rid of the icky unsigned long
> long types in the very heart of memcg, which is great for 32 bit and
> also makes the code a lot more readable.

Overall, I like this change. A few comments below.

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/cgroups/hugetlb.txt          |   2 +-
>  Documentation/cgroups/memory.txt           |   4 +-
>  Documentation/cgroups/resource_counter.txt | 197 --------
>  include/linux/hugetlb_cgroup.h             |   1 -
>  include/linux/memcontrol.h                 |  37 +-
>  include/linux/res_counter.h                | 223 ---------
>  include/net/sock.h                         |  25 +-
>  init/Kconfig                               |   9 +-
>  kernel/Makefile                            |   1 -
>  kernel/res_counter.c                       | 211 --------
>  mm/hugetlb_cgroup.c                        | 100 ++--
>  mm/memcontrol.c                            | 740 ++++++++++++++++-------------
>  net/ipv4/tcp_memcontrol.c                  |  83 ++--
>  13 files changed, 541 insertions(+), 1092 deletions(-)
>  delete mode 100644 Documentation/cgroups/resource_counter.txt
>  delete mode 100644 include/linux/res_counter.h
>  delete mode 100644 kernel/res_counter.c
> 
[...]
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 19df5d857411..bf8fb1a05597 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> +
> +struct page_counter {

I'd place it in a separate file, say

	include/linux/page_counter.h
	mm/page_counter.c

just to keep mm/memcontrol.c clean.

> +	atomic_long_t count;
> +	unsigned long limit;
> +	struct page_counter *parent;
> +
> +	/* legacy */
> +	unsigned long watermark;
> +	unsigned long limited;

IMHO, failcnt would fit better.

> +};
> +
> +#if BITS_PER_LONG == 32
> +#define PAGE_COUNTER_MAX ULONG_MAX
> +#else
> +#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
> +#endif
> +
> +static inline void page_counter_init(struct page_counter *counter,
> +				     struct page_counter *parent)
> +{
> +	atomic_long_set(&counter->count, 0);
> +	counter->limit = PAGE_COUNTER_MAX;
> +	counter->parent = parent;
> +}
> +
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);

When I first saw this function, I couldn't tell from its name what
it's intended to do. I think

	page_counter_cancel_local_charge()

would fit better.

> +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> +			struct page_counter **fail);
> +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> +int page_counter_limit(struct page_counter *counter, unsigned long limit);

Hmm, why not page_counter_set_limit?

> +int page_counter_memparse(const char *buf, unsigned long *nr_pages);
> +
[...]
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 515a4d01e932..f41749982668 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -55,7 +55,6 @@
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
>  #include <linux/memcontrol.h>
> -#include <linux/res_counter.h>
>  #include <linux/static_key.h>
>  #include <linux/aio.h>
>  #include <linux/sched.h>
> @@ -1066,7 +1065,7 @@ enum cg_proto_flags {
>  };
>  
>  struct cg_proto {
> -	struct res_counter	memory_allocated;	/* Current allocated memory. */
> +	struct page_counter	memory_allocated;	/* Current allocated memory. */
>  	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
>  	int			memory_pressure;
>  	long			sysctl_mem[3];
> @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
>  					      unsigned long amt,
>  					      int *parent_status)
>  {
> -	struct res_counter *fail;
> -	int ret;
> +	page_counter_charge(&prot->memory_allocated, amt, NULL);
>  
> -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> -					amt << PAGE_SHIFT, &fail);
> -	if (ret < 0)
> +	if (atomic_long_read(&prot->memory_allocated.count) >
> +	    prot->memory_allocated.limit)

I don't like your equivalent of res_counter_charge_nofail.

Passing NULL to page_counter_charge might be useful if one doesn't have
a back-off strategy, but still wants to fail on hitting the limit. With
your interface, the user then has to pass something to the function,
which isn't convenient.

Besides, it depends on the internal implementation of the page_counter
struct. I'd encapsulate this.
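
E.g. a trivial helper would do (untested sketch, the name is made up):

	/* is the counter currently above its limit? */
	static inline bool page_counter_over_limit(struct page_counter *counter)
	{
		return atomic_long_read(&counter->count) > counter->limit;
	}

Then memcg_memory_allocated_add() wouldn't have to poke at ->count and
->limit directly, and the check stays in one place if the internals of
page_counter ever change.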

>  		*parent_status = OVER_LIMIT;
>  }
>  
>  static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
>  					      unsigned long amt)
>  {
> -	res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT);
> -}
> -
> -static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
> -{
> -	u64 ret;
> -	ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
> -	return ret >> PAGE_SHIFT;
> +	page_counter_uncharge(&prot->memory_allocated, amt);
>  }
>  
>  static inline long
>  sk_memory_allocated(const struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
> +
>  	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return memcg_memory_allocated_read(sk->sk_cgrp);
> +		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);

page_counter_read?

>  
>  	return atomic_long_read(prot->memory_allocated);
>  }
> @@ -1259,7 +1250,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
>  		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
>  		/* update the root cgroup regardless */
>  		atomic_long_add_return(amt, prot->memory_allocated);
> -		return memcg_memory_allocated_read(sk->sk_cgrp);
> +		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
>  	}
>  
>  	return atomic_long_add_return(amt, prot->memory_allocated);
> diff --git a/init/Kconfig b/init/Kconfig
> index 0471be99ec38..1cf42b563834 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -975,15 +975,8 @@ config CGROUP_CPUACCT
>  	  Provides a simple Resource Controller for monitoring the
>  	  total CPU consumed by the tasks in a cgroup.
>  
> -config RESOURCE_COUNTERS
> -	bool "Resource counters"
> -	help
> -	  This option enables controller independent resource accounting
> -	  infrastructure that works with cgroups.
> -
>  config MEMCG
>  	bool "Memory Resource Controller for Control Groups"
> -	depends on RESOURCE_COUNTERS
>  	select EVENTFD
>  	help
>  	  Provides a memory resource controller that manages both anonymous
> @@ -1051,7 +1044,7 @@ config MEMCG_KMEM
>  
>  config CGROUP_HUGETLB
>  	bool "HugeTLB Resource Controller for Control Groups"
> -	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
> +	depends on MEMCG && HUGETLB_PAGE

So now the hugetlb controller depends on memcg only because it needs the
page_counter, which is supposed to be kind of an independent facility.
Is that OK?

>  	default n
>  	help
>  	  Provides a cgroup Resource Controller for HugeTLB pages.
[...]
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index a67c26e0f360..e619b6b62f1f 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
[...]
> @@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>  				  struct page *page)
>  {
>  	struct hugetlb_cgroup *h_cg;
> -	unsigned long csize = nr_pages * PAGE_SIZE;
>  
>  	if (hugetlb_cgroup_disabled())
>  		return;
> @@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>  	if (unlikely(!h_cg))
>  		return;
>  	set_hugetlb_cgroup(page, NULL);
> -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
>  	return;
>  }
>  
>  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>  				    struct hugetlb_cgroup *h_cg)
>  {
> -	unsigned long csize = nr_pages * PAGE_SIZE;
> -
>  	if (hugetlb_cgroup_disabled() || !h_cg)
>  		return;
>  
>  	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
>  		return;
>  
> -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
>  	return;
>  }
>  
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +};
> +
>  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
>  				   struct cftype *cft)
>  {
> -	int idx, name;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
>  
> -	idx = MEMFILE_IDX(cft->private);
> -	name = MEMFILE_ATTR(cft->private);
> +	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
>  
> -	return res_counter_read_u64(&h_cg->hugepage[idx], name);
> +	switch (MEMFILE_ATTR(cft->private)) {
> +	case RES_USAGE:
> +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;

page_counter_read?

> +	case RES_LIMIT:
> +		return (u64)counter->limit * PAGE_SIZE;

page_counter_get_limit?

> +	case RES_MAX_USAGE:
> +		return (u64)counter->watermark * PAGE_SIZE;

page_counter_read_watermark?

> +	case RES_FAILCNT:
> +		return counter->limited;
> +	default:
> +		BUG();
> +	}
>  }
>  
>  static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>  				    char *buf, size_t nbytes, loff_t off)
>  {
> -	int idx, name, ret;
> -	unsigned long long val;
> +	int ret, idx;
> +	unsigned long nr_pages;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
>  
> +	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> +		return -EINVAL;
> +
>  	buf = strstrip(buf);
> +	ret = page_counter_memparse(buf, &nr_pages);
> +	if (ret)
> +		return ret;
> +
>  	idx = MEMFILE_IDX(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_LIMIT:
> -		if (hugetlb_cgroup_is_root(h_cg)) {
> -			/* Can't set limit on root */
> -			ret = -EINVAL;
> -			break;
> -		}
> -		/* This function does all necessary parse...reuse it */
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> -		if (ret)
> -			break;
> -		val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
> -		ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> +		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));

This is incorrect: huge_page_shift() is the huge page size shift (e.g.
21 for 2M pages), not a number of base pages. Here we should have
something like:

	nr_pages = ALIGN(nr_pages, 1UL << huge_page_order(&hstates[idx]));

> +		ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
>  		break;
>  	default:
>  		ret = -EINVAL;
> @@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
>  				    char *buf, size_t nbytes, loff_t off)
>  {
> -	int idx, name, ret = 0;
> +	int ret = 0;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
>  
> -	idx = MEMFILE_IDX(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
> +	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_MAX_USAGE:
> -		res_counter_reset_max(&h_cg->hugepage[idx]);
> +		counter->watermark = atomic_long_read(&counter->count);

page_counter_reset_watermark?

>  		break;
>  	case RES_FAILCNT:
> -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> +		counter->limited = 0;

page_counter_reset_failcnt?
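
i.e. something like this (a sketch), so that hugetlb_cgroup.c doesn't
have to poke at the fields directly:

	static inline void page_counter_reset_watermark(struct page_counter *counter)
	{
		counter->watermark = atomic_long_read(&counter->count);
	}

	static inline void page_counter_reset_failcnt(struct page_counter *counter)
	{
		counter->limited = 0;	/* 'limited' as named in this patch */
	}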

>  		break;
>  	default:
>  		ret = -EINVAL;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..dfd3b15a57e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -25,7 +25,6 @@
>   * GNU General Public License for more details.
>   */
>  
> -#include <linux/res_counter.h>
>  #include <linux/memcontrol.h>
>  #include <linux/cgroup.h>
>  #include <linux/mm.h>
> @@ -66,6 +65,117 @@
>  
>  #include <trace/events/vmscan.h>
>  
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	long new;
> +
> +	new = atomic_long_sub_return(nr_pages, &counter->count);
> +
> +	if (WARN_ON(unlikely(new < 0)))

PAGE_COUNTER_MAX on 32 bit is ULONG_MAX, right? Then a count above
LONG_MAX is perfectly legal and just reads back negative here, so the
WARN_ON is incorrect.

> +		atomic_long_set(&counter->count, 0);
> +
> +	return new > 1;

Nobody outside page_counter internals uses this retval. Why is it public
then?

BTW, why not new > 0?

> +}
> +
> +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> +			struct page_counter **fail)
> +{
> +	struct page_counter *c;
> +
> +	for (c = counter; c; c = c->parent) {
> +		for (;;) {
> +			unsigned long count;
> +			unsigned long new;
> +
> +			count = atomic_long_read(&c->count);
> +
> +			new = count + nr_pages;
> +			if (new > c->limit) {
> +				c->limited++;
> +				if (fail) {

So we increase 'limited' even when the charge isn't actually refused.
Sounds weird.
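
If the intent is to count only charges that actually get refused, I'd
expect something more like (sketch):

	new = count + nr_pages;
	if (new > c->limit && fail) {
		c->limited++;
		*fail = c;
		goto failed;
	}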

> +					*fail = c;
> +					goto failed;
> +				}
> +			}
> +
> +			if (atomic_long_cmpxchg(&c->count, count, new) != count)
> +				continue;
> +
> +			if (new > c->watermark)
> +				c->watermark = new;
> +
> +			break;
> +		}
> +	}
> +	return 0;
> +
> +failed:
> +	for (c = counter; c != *fail; c = c->parent)
> +		page_counter_cancel(c, nr_pages);
> +
> +	return -ENOMEM;
> +}
> +
> +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	struct page_counter *c;
> +	int ret = 1;
> +
> +	for (c = counter; c; c = c->parent) {
> +		int remainder;
> +
> +		remainder = page_counter_cancel(c, nr_pages);
> +		if (c == counter && !remainder)
> +			ret = 0;
> +	}
> +
> +	return ret;

The only user of this retval is memcg_uncharge_kmem, which is going to
be removed by your "mm: memcontrol: remove obsolete kmemcg pinning
tricks" patch. Is it still worth having?

> +}
> +
> +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> +{
> +	for (;;) {
> +		unsigned long count;
> +		unsigned long old;
> +
> +		count = atomic_long_read(&counter->count);
> +
> +		old = xchg(&counter->limit, limit);
> +
> +		if (atomic_long_read(&counter->count) != count) {
> +			counter->limit = old;

I wonder what can happen if two threads execute this function
concurrently - e.g. a writer that fails with -EBUSY restores its stale
'old' value and thereby wipes out a limit a concurrent writer has just
set successfully... or maybe it's not supposed to be smp-safe?

> +			continue;
> +		}
> +
> +		if (count > limit) {
> +			counter->limit = old;
> +			return -EBUSY;
> +		}
> +
> +		return 0;
> +	}
> +}
[...]
> @@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
>   */
>  static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  {
> -	unsigned long long margin;

Why is it still ULL?

> +	unsigned long margin = 0;
> +	unsigned long count;
> +	unsigned long limit;
>  
> -	margin = res_counter_margin(&memcg->res);
> -	if (do_swap_account)
> -		margin = min(margin, res_counter_margin(&memcg->memsw));
> -	return margin >> PAGE_SHIFT;
> +	count = atomic_long_read(&memcg->memory.count);
> +	limit = ACCESS_ONCE(memcg->memory.limit);
> +	if (count < limit)
> +		margin = limit - count;
> +
> +	if (do_swap_account) {
> +		count = atomic_long_read(&memcg->memsw.count);
> +		limit = ACCESS_ONCE(memcg->memsw.limit);
> +		if (count < limit)
> +			margin = min(margin, limit - count);
> +	}
> +
> +	return margin;
>  }
[...]
> @@ -4155,13 +4255,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
>  	 * after this point, because it has at least one child already.
>  	 */
>  	if (memcg_kmem_is_active(parent))
> -		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
> +		ret = __memcg_activate_kmem(memcg, ULONG_MAX);

PAGE_COUNTER_MAX?

>  	mutex_unlock(&activate_kmem_mutex);
>  	return ret;
>  }
>  #else
>  static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
> -				   unsigned long long val)
> +				   unsigned long limit)
>  {
>  	return -EINVAL;
>  }
[...]
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index 1d191357bf88..9a448bdb19e9 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
[...]
> @@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
>  	if (!cg_proto)
>  		return -EINVAL;
>  
> -	if (val > RES_COUNTER_MAX)
> -		val = RES_COUNTER_MAX;
> -
> -	ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
> +	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
>  	if (ret)
>  		return ret;
>  
>  	for (i = 0; i < 3; i++)
> -		cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
> +		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
>  						sysctl_tcp_mem[i]);
>  
> -	if (val == RES_COUNTER_MAX)
> +	if (nr_pages == ULONG_MAX / PAGE_SIZE)

PAGE_COUNTER_MAX?

>  		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> -	else if (val != RES_COUNTER_MAX) {
> +	else {
>  		/*
>  		 * The active bit needs to be written after the static_key
>  		 * update. This is what guarantees that the socket activation
[...]
> @@ -126,43 +130,35 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  	return ret ?: nbytes;
>  }
>  
> -static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return default_val;
> -
> -	return res_counter_read_u64(&cg_proto->memory_allocated, type);
> -}
> -
> -static u64 tcp_read_usage(struct mem_cgroup *memcg)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
> -
> -	return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
> -}
> -
>  static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
>  	u64 val;
>  
>  	switch (cft->private) {
>  	case RES_LIMIT:
> -		val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
> +		if (!cg_proto)
> +			return PAGE_COUNTER_MAX;
> +		val = cg_proto->memory_allocated.limit;
> +		val *= PAGE_SIZE;
>  		break;
>  	case RES_USAGE:
> -		val = tcp_read_usage(memcg);
> +		if (!cg_proto)
> +			return atomic_long_read(&tcp_memory_allocated);

Forgot << PAGE_SHIFT?

> +		val = atomic_long_read(&cg_proto->memory_allocated.count);
> +		val *= PAGE_SIZE;
>  		break;
>  	case RES_FAILCNT:
> +		if (!cg_proto)
> +			return 0;
> +		val = cg_proto->memory_allocated.limited;
> +		break;
>  	case RES_MAX_USAGE:
> -		val = tcp_read_stat(memcg, cft->private, 0);
> +		if (!cg_proto)
> +			return 0;
> +		val = cg_proto->memory_allocated.watermark;
> +		val *= PAGE_SIZE;
>  		break;
>  	default:
>  		BUG();

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-19 13:22 ` Johannes Weiner
@ 2014-09-22 14:44   ` Michal Hocko
  -1 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-22 14:44 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Fri 19-09-14 09:22:08, Johannes Weiner wrote:
> Memory is internally accounted in bytes, using spinlock-protected
> 64-bit counters, even though the smallest accounting delta is a page.
> The counter interface is also convoluted and does too many things.
> 
> Introduce a new lockless word-sized page counter API, then change all
> memory accounting over to it and remove the old one.  The translation
> from and to bytes then only happens when interfacing with userspace.

Dunno why, but I thought other controllers used res_counter as well. But
this doesn't seem to be the case, so this is a perfectly reasonable way
forward.

I have only glanced through the patch and it mostly seems good to me
(I have to look more closely at the atomicity of the hierarchical
operations).

Nevertheless I think that the counter should live outside of memcg (it
is ugly, and bad in general, to make the HUGETLB controller depend on
MEMCG just to have a counter). If you made kernel/page_counter.c and let
both controllers select CONFIG_PAGE_COUNTER, then you would not need the
dependency on MEMCG, and I would find it cleaner in general.

> Aside from the locking costs, this gets rid of the icky unsigned long
> long types in the very heart of memcg, which is great for 32 bit and
> also makes the code a lot more readable.

Definitely. Nice work!

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/cgroups/hugetlb.txt          |   2 +-
>  Documentation/cgroups/memory.txt           |   4 +-
>  Documentation/cgroups/resource_counter.txt | 197 --------
>  include/linux/hugetlb_cgroup.h             |   1 -
>  include/linux/memcontrol.h                 |  37 +-
>  include/linux/res_counter.h                | 223 ---------
>  include/net/sock.h                         |  25 +-
>  init/Kconfig                               |   9 +-
>  kernel/Makefile                            |   1 -
>  kernel/res_counter.c                       | 211 --------
>  mm/hugetlb_cgroup.c                        | 100 ++--
>  mm/memcontrol.c                            | 740 ++++++++++++++++-------------
>  net/ipv4/tcp_memcontrol.c                  |  83 ++--
>  13 files changed, 541 insertions(+), 1092 deletions(-)
>  delete mode 100644 Documentation/cgroups/resource_counter.txt
>  delete mode 100644 include/linux/res_counter.h
>  delete mode 100644 kernel/res_counter.c
> 
> diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
> index a9faaca1f029..106245c3aecc 100644
> --- a/Documentation/cgroups/hugetlb.txt
> +++ b/Documentation/cgroups/hugetlb.txt
> @@ -29,7 +29,7 @@ Brief summary of control files
>  
>   hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
>   hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
> - hugetlb.<hugepagesize>.usage_in_bytes     # show current res_counter usage for "hugepagesize" hugetlb
> + hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
>   hugetlb.<hugepagesize>.failcnt		   # show the number of allocation failure due to HugeTLB limit
>  
>  For a system supporting two hugepage size (16M and 16G) the control
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 02ab997a1ed2..f624727ab404 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -52,9 +52,9 @@ Brief summary of control files.
>   tasks				 # attach a task(thread) and show list of threads
>   cgroup.procs			 # show list of processes
>   cgroup.event_control		 # an interface for event_fd()
> - memory.usage_in_bytes		 # show current res_counter usage for memory
> + memory.usage_in_bytes		 # show current usage for memory
>  				 (See 5.5 for details)
> - memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
> + memory.memsw.usage_in_bytes	 # show current usage for memory+Swap
>  				 (See 5.5 for details)
>   memory.limit_in_bytes		 # set/show limit of memory usage
>   memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> deleted file mode 100644
> index 762ca54eb929..000000000000
> --- a/Documentation/cgroups/resource_counter.txt
> +++ /dev/null
> @@ -1,197 +0,0 @@
> -
> -		The Resource Counter
> -
> -The resource counter, declared at include/linux/res_counter.h,
> -is supposed to facilitate the resource management by controllers
> -by providing common stuff for accounting.
> -
> -This "stuff" includes the res_counter structure and routines
> -to work with it.
> -
> -
> -
> -1. Crucial parts of the res_counter structure
> -
> - a. unsigned long long usage
> -
> - 	The usage value shows the amount of a resource that is consumed
> -	by a group at a given time. The units of measurement should be
> -	determined by the controller that uses this counter. E.g. it can
> -	be bytes, items or any other unit the controller operates on.
> -
> - b. unsigned long long max_usage
> -
> - 	The maximal value of the usage over time.
> -
> - 	This value is useful when gathering statistical information about
> -	the particular group, as it shows the actual resource requirements
> -	for a particular group, not just some usage snapshot.
> -
> - c. unsigned long long limit
> -
> - 	The maximal allowed amount of resource to consume by the group. In
> -	case the group requests for more resources, so that the usage value
> -	would exceed the limit, the resource allocation is rejected (see
> -	the next section).
> -
> - d. unsigned long long failcnt
> -
> - 	The failcnt stands for "failures counter". This is the number of
> -	resource allocation attempts that failed.
> -
> - c. spinlock_t lock
> -
> - 	Protects changes of the above values.
> -
> -
> -
> -2. Basic accounting routines
> -
> - a. void res_counter_init(struct res_counter *rc,
> -				struct res_counter *rc_parent)
> -
> - 	Initializes the resource counter. As usual, should be the first
> -	routine called for a new counter.
> -
> -	The struct res_counter *parent can be used to define a hierarchical
> -	child -> parent relationship directly in the res_counter structure,
> -	NULL can be used to define no relationship.
> -
> - c. int res_counter_charge(struct res_counter *rc, unsigned long val,
> -				struct res_counter **limit_fail_at)
> -
> -	When a resource is about to be allocated it has to be accounted
> -	with the appropriate resource counter (controller should determine
> -	which one to use on its own). This operation is called "charging".
> -
> -	This is not very important which operation - resource allocation
> -	or charging - is performed first, but
> -	  * if the allocation is performed first, this may create a
> -	    temporary resource over-usage by the time resource counter is
> -	    charged;
> -	  * if the charging is performed first, then it should be uncharged
> -	    on error path (if the one is called).
> -
> -	If the charging fails and a hierarchical dependency exists, the
> -	limit_fail_at parameter is set to the particular res_counter element
> -	where the charging failed.
> -
> - d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
> -
> -	When a resource is released (freed) it should be de-accounted
> -	from the resource counter it was accounted to.  This is called
> -	"uncharging". The return value of this function indicate the amount
> -	of charges still present in the counter.
> -
> -	The _locked routines imply that the res_counter->lock is taken.
> -
> - e. u64 res_counter_uncharge_until
> -		(struct res_counter *rc, struct res_counter *top,
> -		 unsigned long val)
> -
> -	Almost same as res_counter_uncharge() but propagation of uncharge
> -	stops when rc == top. This is useful when kill a res_counter in
> -	child cgroup.
> -
> - 2.1 Other accounting routines
> -
> -    There are more routines that may help you with common needs, like
> -    checking whether the limit is reached or resetting the max_usage
> -    value. They are all declared in include/linux/res_counter.h.
> -
> -
> -
> -3. Analyzing the resource counter registrations
> -
> - a. If the failcnt value constantly grows, this means that the counter's
> -    limit is too tight. Either the group is misbehaving and consumes too
> -    many resources, or the configuration is not suitable for the group
> -    and the limit should be increased.
> -
> - b. The max_usage value can be used to quickly tune the group. One may
> -    set the limits to maximal values and either load the container with
> -    a common pattern or leave one for a while. After this the max_usage
> -    value shows the amount of memory the container would require during
> -    its common activity.
> -
> -    Setting the limit a bit above this value gives a pretty good
> -    configuration that works in most of the cases.
> -
> - c. If the max_usage is much less than the limit, but the failcnt value
> -    is growing, then the group tries to allocate a big chunk of resource
> -    at once.
> -
> - d. If the max_usage is much less than the limit, but the failcnt value
> -    is 0, then this group is given too high limit, that it does not
> -    require. It is better to lower the limit a bit leaving more resource
> -    for other groups.
> -
> -
> -
> -4. Communication with the control groups subsystem (cgroups)
> -
> -All the resource controllers that are using cgroups and resource counters
> -should provide files (in the cgroup filesystem) to work with the resource
> -counter fields. They are recommended to adhere to the following rules:
> -
> - a. File names
> -
> - 	Field name	File name
> -	---------------------------------------------------
> -	usage		usage_in_<unit_of_measurement>
> -	max_usage	max_usage_in_<unit_of_measurement>
> -	limit		limit_in_<unit_of_measurement>
> -	failcnt		failcnt
> -	lock		no file :)
> -
> - b. Reading from file should show the corresponding field value in the
> -    appropriate format.
> -
> - c. Writing to file
> -
> - 	Field		Expected behavior
> -	----------------------------------
> -	usage		prohibited
> -	max_usage	reset to usage
> -	limit		set the limit
> -	failcnt		reset to zero
> -
> -
> -
> -5. Usage example
> -
> - a. Declare a task group (take a look at cgroups subsystem for this) and
> -    fold a res_counter into it
> -
> -	struct my_group {
> -		struct res_counter res;
> -
> -		<other fields>
> -	}
> -
> - b. Put hooks in resource allocation/release paths
> -
> - 	int alloc_something(...)
> -	{
> -		if (res_counter_charge(res_counter_ptr, amount) < 0)
> -			return -ENOMEM;
> -
> -		<allocate the resource and return to the caller>
> -	}
> -
> -	void release_something(...)
> -	{
> -		res_counter_uncharge(res_counter_ptr, amount);
> -
> -		<release the resource>
> -	}
> -
> -    In order to keep the usage value self-consistent, both the
> -    "res_counter_ptr" and the "amount" in release_something() should be
> -    the same as they were in the alloc_something() when the releasing
> -    resource was allocated.
> -
> - c. Provide the way to read res_counter values and set them (the cgroups
> -    still can help with it).
> -
> - c. Compile and run :)
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 0129f89cf98d..bcc853eccc85 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -16,7 +16,6 @@
>  #define _LINUX_HUGETLB_CGROUP_H
>  
>  #include <linux/mmdebug.h>
> -#include <linux/res_counter.h>
>  
>  struct hugetlb_cgroup;
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 19df5d857411..bf8fb1a05597 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> +
> +struct page_counter {
> +	atomic_long_t count;
> +	unsigned long limit;
> +	struct page_counter *parent;
> +
> +	/* legacy */
> +	unsigned long watermark;
> +	unsigned long limited;
> +};
> +
> +#if BITS_PER_LONG == 32
> +#define PAGE_COUNTER_MAX ULONG_MAX
> +#else
> +#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
> +#endif
> +
> +static inline void page_counter_init(struct page_counter *counter,
> +				     struct page_counter *parent)
> +{
> +	atomic_long_set(&counter->count, 0);
> +	counter->limit = PAGE_COUNTER_MAX;
> +	counter->parent = parent;
> +}
> +
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> +			struct page_counter **fail);
> +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> +int page_counter_memparse(const char *buf, unsigned long *nr_pages);
> +
>  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
>  void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> @@ -471,9 +503,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
>  	/*
>  	 * __GFP_NOFAIL allocations will move on even if charging is not
>  	 * possible. Therefore we don't even try, and have this allocation
> -	 * unaccounted. We could in theory charge it with
> -	 * res_counter_charge_nofail, but we hope those allocations are rare,
> -	 * and won't be worth the trouble.
> +	 * unaccounted. We could in theory charge it forcibly, but we hope
> +	 * those allocations are rare, and won't be worth the trouble.
>  	 */
>  	if (gfp & __GFP_NOFAIL)
>  		return true;
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> deleted file mode 100644
> index 56b7bc32db4f..000000000000
> --- a/include/linux/res_counter.h
> +++ /dev/null
> @@ -1,223 +0,0 @@
> -#ifndef __RES_COUNTER_H__
> -#define __RES_COUNTER_H__
> -
> -/*
> - * Resource Counters
> - * Contain common data types and routines for resource accounting
> - *
> - * Copyright 2007 OpenVZ SWsoft Inc
> - *
> - * Author: Pavel Emelianov <xemul@openvz.org>
> - *
> - * See Documentation/cgroups/resource_counter.txt for more
> - * info about what this counter is.
> - */
> -
> -#include <linux/spinlock.h>
> -#include <linux/errno.h>
> -
> -/*
> - * The core object. the cgroup that wishes to account for some
> - * resource may include this counter into its structures and use
> - * the helpers described beyond
> - */
> -
> -struct res_counter {
> -	/*
> -	 * the current resource consumption level
> -	 */
> -	unsigned long long usage;
> -	/*
> -	 * the maximal value of the usage from the counter creation
> -	 */
> -	unsigned long long max_usage;
> -	/*
> -	 * the limit that usage cannot exceed
> -	 */
> -	unsigned long long limit;
> -	/*
> -	 * the limit that usage can be exceed
> -	 */
> -	unsigned long long soft_limit;
> -	/*
> -	 * the number of unsuccessful attempts to consume the resource
> -	 */
> -	unsigned long long failcnt;
> -	/*
> -	 * the lock to protect all of the above.
> -	 * the routines below consider this to be IRQ-safe
> -	 */
> -	spinlock_t lock;
> -	/*
> -	 * Parent counter, used for hierarchial resource accounting
> -	 */
> -	struct res_counter *parent;
> -};
> -
> -#define RES_COUNTER_MAX ULLONG_MAX
> -
> -/**
> - * Helpers to interact with userspace
> - * res_counter_read_u64() - returns the value of the specified member.
> - * res_counter_read/_write - put/get the specified fields from the
> - * res_counter struct to/from the user
> - *
> - * @counter:     the counter in question
> - * @member:  the field to work with (see RES_xxx below)
> - * @buf:     the buffer to opeate on,...
> - * @nbytes:  its size...
> - * @pos:     and the offset.
> - */
> -
> -u64 res_counter_read_u64(struct res_counter *counter, int member);
> -
> -ssize_t res_counter_read(struct res_counter *counter, int member,
> -		const char __user *buf, size_t nbytes, loff_t *pos,
> -		int (*read_strategy)(unsigned long long val, char *s));
> -
> -int res_counter_memparse_write_strategy(const char *buf,
> -					unsigned long long *res);
> -
> -/*
> - * the field descriptors. one for each member of res_counter
> - */
> -
> -enum {
> -	RES_USAGE,
> -	RES_MAX_USAGE,
> -	RES_LIMIT,
> -	RES_FAILCNT,
> -	RES_SOFT_LIMIT,
> -};
> -
> -/*
> - * helpers for accounting
> - */
> -
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> -
> -/*
> - * charge - try to consume more resource.
> - *
> - * @counter: the counter
> - * @val: the amount of the resource. each controller defines its own
> - *       units, e.g. numbers, bytes, Kbytes, etc
> - *
> - * returns 0 on success and <0 if the counter->usage will exceed the
> - * counter->limit
> - *
> - * charge_nofail works the same, except that it charges the resource
> - * counter unconditionally, and returns < 0 if the after the current
> - * charge we are over limit.
> - */
> -
> -int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> -int res_counter_charge_nofail(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> -
> -/*
> - * uncharge - tell that some portion of the resource is released
> - *
> - * @counter: the counter
> - * @val: the amount of the resource
> - *
> - * these calls check for usage underflow and show a warning on the console
> - *
> - * returns the total charges still present in @counter.
> - */
> -
> -u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
> -
> -u64 res_counter_uncharge_until(struct res_counter *counter,
> -			       struct res_counter *top,
> -			       unsigned long val);
> -/**
> - * res_counter_margin - calculate chargeable space of a counter
> - * @cnt: the counter
> - *
> - * Returns the difference between the hard limit and the current usage
> - * of resource counter @cnt.
> - */
> -static inline unsigned long long res_counter_margin(struct res_counter *cnt)
> -{
> -	unsigned long long margin;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->limit > cnt->usage)
> -		margin = cnt->limit - cnt->usage;
> -	else
> -		margin = 0;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return margin;
> -}
> -
> -/**
> - * Get the difference between the usage and the soft limit
> - * @cnt: The counter
> - *
> - * Returns 0 if usage is less than or equal to soft limit
> - * The difference between usage and soft limit, otherwise.
> - */
> -static inline unsigned long long
> -res_counter_soft_limit_excess(struct res_counter *cnt)
> -{
> -	unsigned long long excess;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= cnt->soft_limit)
> -		excess = 0;
> -	else
> -		excess = cnt->usage - cnt->soft_limit;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return excess;
> -}
> -
> -static inline void res_counter_reset_max(struct res_counter *cnt)
> -{
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->max_usage = cnt->usage;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -}
> -
> -static inline void res_counter_reset_failcnt(struct res_counter *cnt)
> -{
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->failcnt = 0;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -}
> -
> -static inline int res_counter_set_limit(struct res_counter *cnt,
> -		unsigned long long limit)
> -{
> -	unsigned long flags;
> -	int ret = -EBUSY;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= limit) {
> -		cnt->limit = limit;
> -		ret = 0;
> -	}
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return ret;
> -}
> -
> -static inline int
> -res_counter_set_soft_limit(struct res_counter *cnt,
> -				unsigned long long soft_limit)
> -{
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->soft_limit = soft_limit;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return 0;
> -}
> -
> -#endif
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 515a4d01e932..f41749982668 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -55,7 +55,6 @@
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
>  #include <linux/memcontrol.h>
> -#include <linux/res_counter.h>
>  #include <linux/static_key.h>
>  #include <linux/aio.h>
>  #include <linux/sched.h>
> @@ -1066,7 +1065,7 @@ enum cg_proto_flags {
>  };
>  
>  struct cg_proto {
> -	struct res_counter	memory_allocated;	/* Current allocated memory. */
> +	struct page_counter	memory_allocated;	/* Current allocated memory. */
>  	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
>  	int			memory_pressure;
>  	long			sysctl_mem[3];
> @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
>  					      unsigned long amt,
>  					      int *parent_status)
>  {
> -	struct res_counter *fail;
> -	int ret;
> +	page_counter_charge(&prot->memory_allocated, amt, NULL);
>  
> -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> -					amt << PAGE_SHIFT, &fail);
> -	if (ret < 0)
> +	if (atomic_long_read(&prot->memory_allocated.count) >
> +	    prot->memory_allocated.limit)
>  		*parent_status = OVER_LIMIT;
>  }
>  
>  static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
>  					      unsigned long amt)
>  {
> -	res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT);
> -}
> -
> -static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
> -{
> -	u64 ret;
> -	ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
> -	return ret >> PAGE_SHIFT;
> +	page_counter_uncharge(&prot->memory_allocated, amt);
>  }
>  
>  static inline long
>  sk_memory_allocated(const struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
> +
>  	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return memcg_memory_allocated_read(sk->sk_cgrp);
> +		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
>  
>  	return atomic_long_read(prot->memory_allocated);
>  }
> @@ -1259,7 +1250,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
>  		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
>  		/* update the root cgroup regardless */
>  		atomic_long_add_return(amt, prot->memory_allocated);
> -		return memcg_memory_allocated_read(sk->sk_cgrp);
> +		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
>  	}
>  
>  	return atomic_long_add_return(amt, prot->memory_allocated);
> diff --git a/init/Kconfig b/init/Kconfig
> index 0471be99ec38..1cf42b563834 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -975,15 +975,8 @@ config CGROUP_CPUACCT
>  	  Provides a simple Resource Controller for monitoring the
>  	  total CPU consumed by the tasks in a cgroup.
>  
> -config RESOURCE_COUNTERS
> -	bool "Resource counters"
> -	help
> -	  This option enables controller independent resource accounting
> -	  infrastructure that works with cgroups.
> -
>  config MEMCG
>  	bool "Memory Resource Controller for Control Groups"
> -	depends on RESOURCE_COUNTERS
>  	select EVENTFD
>  	help
>  	  Provides a memory resource controller that manages both anonymous
> @@ -1051,7 +1044,7 @@ config MEMCG_KMEM
>  
>  config CGROUP_HUGETLB
>  	bool "HugeTLB Resource Controller for Control Groups"
> -	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
> +	depends on MEMCG && HUGETLB_PAGE
>  	default n
>  	help
>  	  Provides a cgroup Resource Controller for HugeTLB pages.
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 726e18443da0..245953354974 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -58,7 +58,6 @@ obj-$(CONFIG_USER_NS) += user_namespace.o
>  obj-$(CONFIG_PID_NS) += pid_namespace.o
>  obj-$(CONFIG_DEBUG_SYNCHRO_TEST) += synchro-test.o
>  obj-$(CONFIG_IKCONFIG) += configs.o
> -obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
>  obj-$(CONFIG_SMP) += stop_machine.o
>  obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
>  obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> deleted file mode 100644
> index e791130f85a7..000000000000
> --- a/kernel/res_counter.c
> +++ /dev/null
> @@ -1,211 +0,0 @@
> -/*
> - * resource cgroups
> - *
> - * Copyright 2007 OpenVZ SWsoft Inc
> - *
> - * Author: Pavel Emelianov <xemul@openvz.org>
> - *
> - */
> -
> -#include <linux/types.h>
> -#include <linux/parser.h>
> -#include <linux/fs.h>
> -#include <linux/res_counter.h>
> -#include <linux/uaccess.h>
> -#include <linux/mm.h>
> -
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent)
> -{
> -	spin_lock_init(&counter->lock);
> -	counter->limit = RES_COUNTER_MAX;
> -	counter->soft_limit = RES_COUNTER_MAX;
> -	counter->parent = parent;
> -}
> -
> -static u64 res_counter_uncharge_locked(struct res_counter *counter,
> -				       unsigned long val)
> -{
> -	if (WARN_ON(counter->usage < val))
> -		val = counter->usage;
> -
> -	counter->usage -= val;
> -	return counter->usage;
> -}
> -
> -static int res_counter_charge_locked(struct res_counter *counter,
> -				     unsigned long val, bool force)
> -{
> -	int ret = 0;
> -
> -	if (counter->usage + val > counter->limit) {
> -		counter->failcnt++;
> -		ret = -ENOMEM;
> -		if (!force)
> -			return ret;
> -	}
> -
> -	counter->usage += val;
> -	if (counter->usage > counter->max_usage)
> -		counter->max_usage = counter->usage;
> -	return ret;
> -}
> -
> -static int __res_counter_charge(struct res_counter *counter, unsigned long val,
> -				struct res_counter **limit_fail_at, bool force)
> -{
> -	int ret, r;
> -	unsigned long flags;
> -	struct res_counter *c, *u;
> -
> -	r = ret = 0;
> -	*limit_fail_at = NULL;
> -	local_irq_save(flags);
> -	for (c = counter; c != NULL; c = c->parent) {
> -		spin_lock(&c->lock);
> -		r = res_counter_charge_locked(c, val, force);
> -		spin_unlock(&c->lock);
> -		if (r < 0 && !ret) {
> -			ret = r;
> -			*limit_fail_at = c;
> -			if (!force)
> -				break;
> -		}
> -	}
> -
> -	if (ret < 0 && !force) {
> -		for (u = counter; u != c; u = u->parent) {
> -			spin_lock(&u->lock);
> -			res_counter_uncharge_locked(u, val);
> -			spin_unlock(&u->lock);
> -		}
> -	}
> -	local_irq_restore(flags);
> -
> -	return ret;
> -}
> -
> -int res_counter_charge(struct res_counter *counter, unsigned long val,
> -			struct res_counter **limit_fail_at)
> -{
> -	return __res_counter_charge(counter, val, limit_fail_at, false);
> -}
> -
> -int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
> -			      struct res_counter **limit_fail_at)
> -{
> -	return __res_counter_charge(counter, val, limit_fail_at, true);
> -}
> -
> -u64 res_counter_uncharge_until(struct res_counter *counter,
> -			       struct res_counter *top,
> -			       unsigned long val)
> -{
> -	unsigned long flags;
> -	struct res_counter *c;
> -	u64 ret = 0;
> -
> -	local_irq_save(flags);
> -	for (c = counter; c != top; c = c->parent) {
> -		u64 r;
> -		spin_lock(&c->lock);
> -		r = res_counter_uncharge_locked(c, val);
> -		if (c == counter)
> -			ret = r;
> -		spin_unlock(&c->lock);
> -	}
> -	local_irq_restore(flags);
> -	return ret;
> -}
> -
> -u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
> -{
> -	return res_counter_uncharge_until(counter, NULL, val);
> -}
> -
> -static inline unsigned long long *
> -res_counter_member(struct res_counter *counter, int member)
> -{
> -	switch (member) {
> -	case RES_USAGE:
> -		return &counter->usage;
> -	case RES_MAX_USAGE:
> -		return &counter->max_usage;
> -	case RES_LIMIT:
> -		return &counter->limit;
> -	case RES_FAILCNT:
> -		return &counter->failcnt;
> -	case RES_SOFT_LIMIT:
> -		return &counter->soft_limit;
> -	};
> -
> -	BUG();
> -	return NULL;
> -}
> -
> -ssize_t res_counter_read(struct res_counter *counter, int member,
> -		const char __user *userbuf, size_t nbytes, loff_t *pos,
> -		int (*read_strategy)(unsigned long long val, char *st_buf))
> -{
> -	unsigned long long *val;
> -	char buf[64], *s;
> -
> -	s = buf;
> -	val = res_counter_member(counter, member);
> -	if (read_strategy)
> -		s += read_strategy(*val, s);
> -	else
> -		s += sprintf(s, "%llu\n", *val);
> -	return simple_read_from_buffer((void __user *)userbuf, nbytes,
> -			pos, buf, s - buf);
> -}
> -
> -#if BITS_PER_LONG == 32
> -u64 res_counter_read_u64(struct res_counter *counter, int member)
> -{
> -	unsigned long flags;
> -	u64 ret;
> -
> -	spin_lock_irqsave(&counter->lock, flags);
> -	ret = *res_counter_member(counter, member);
> -	spin_unlock_irqrestore(&counter->lock, flags);
> -
> -	return ret;
> -}
> -#else
> -u64 res_counter_read_u64(struct res_counter *counter, int member)
> -{
> -	return *res_counter_member(counter, member);
> -}
> -#endif
> -
> -int res_counter_memparse_write_strategy(const char *buf,
> -					unsigned long long *resp)
> -{
> -	char *end;
> -	unsigned long long res;
> -
> -	/* return RES_COUNTER_MAX(unlimited) if "-1" is specified */
> -	if (*buf == '-') {
> -		int rc = kstrtoull(buf + 1, 10, &res);
> -
> -		if (rc)
> -			return rc;
> -		if (res != 1)
> -			return -EINVAL;
> -		*resp = RES_COUNTER_MAX;
> -		return 0;
> -	}
> -
> -	res = memparse(buf, &end);
> -	if (*end != '\0')
> -		return -EINVAL;
> -
> -	if (PAGE_ALIGN(res) >= res)
> -		res = PAGE_ALIGN(res);
> -	else
> -		res = RES_COUNTER_MAX;
> -
> -	*resp = res;
> -
> -	return 0;
> -}
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index a67c26e0f360..e619b6b62f1f 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -14,6 +14,7 @@
>   */
>  
>  #include <linux/cgroup.h>
> +#include <linux/memcontrol.h>
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/hugetlb_cgroup.h>
> @@ -23,7 +24,7 @@ struct hugetlb_cgroup {
>  	/*
>  	 * the counter to account for hugepages from hugetlb.
>  	 */
> -	struct res_counter hugepage[HUGE_MAX_HSTATE];
> +	struct page_counter hugepage[HUGE_MAX_HSTATE];
>  };
>  
>  #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> @@ -60,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
>  	int idx;
>  
>  	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
> -		if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
> +		if (atomic_long_read(&h_cg->hugepage[idx].count))
>  			return true;
>  	}
>  	return false;
> @@ -79,12 +80,12 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  
>  	if (parent_h_cgroup) {
>  		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> -			res_counter_init(&h_cgroup->hugepage[idx],
> -					 &parent_h_cgroup->hugepage[idx]);
> +			page_counter_init(&h_cgroup->hugepage[idx],
> +					  &parent_h_cgroup->hugepage[idx]);
>  	} else {
>  		root_h_cgroup = h_cgroup;
>  		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> -			res_counter_init(&h_cgroup->hugepage[idx], NULL);
> +			page_counter_init(&h_cgroup->hugepage[idx], NULL);
>  	}
>  	return &h_cgroup->css;
>  }
> @@ -108,9 +109,8 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
>  static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
>  				       struct page *page)
>  {
> -	int csize;
> -	struct res_counter *counter;
> -	struct res_counter *fail_res;
> +	unsigned int nr_pages;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *page_hcg;
>  	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
>  
> @@ -123,15 +123,15 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
>  	if (!page_hcg || page_hcg != h_cg)
>  		goto out;
>  
> -	csize = PAGE_SIZE << compound_order(page);
> +	nr_pages = 1 << compound_order(page);
>  	if (!parent) {
>  		parent = root_h_cgroup;
>  		/* root has no limit */
> -		res_counter_charge_nofail(&parent->hugepage[idx],
> -					  csize, &fail_res);
> +		page_counter_charge(&parent->hugepage[idx], nr_pages, NULL);
>  	}
>  	counter = &h_cg->hugepage[idx];
> -	res_counter_uncharge_until(counter, counter->parent, csize);
> +	/* Take the pages off the local counter */
> +	page_counter_cancel(counter, nr_pages);
>  
>  	set_hugetlb_cgroup(page, parent);
>  out:
> @@ -166,9 +166,8 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
>  				 struct hugetlb_cgroup **ptr)
>  {
>  	int ret = 0;
> -	struct res_counter *fail_res;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = NULL;
> -	unsigned long csize = nr_pages * PAGE_SIZE;
>  
>  	if (hugetlb_cgroup_disabled())
>  		goto done;
> @@ -187,7 +186,7 @@ again:
>  	}
>  	rcu_read_unlock();
>  
> -	ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> +	ret = page_counter_charge(&h_cg->hugepage[idx], nr_pages, &counter);
>  	css_put(&h_cg->css);
>  done:
>  	*ptr = h_cg;
> @@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>  				  struct page *page)
>  {
>  	struct hugetlb_cgroup *h_cg;
> -	unsigned long csize = nr_pages * PAGE_SIZE;
>  
>  	if (hugetlb_cgroup_disabled())
>  		return;
> @@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>  	if (unlikely(!h_cg))
>  		return;
>  	set_hugetlb_cgroup(page, NULL);
> -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
>  	return;
>  }
>  
>  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>  				    struct hugetlb_cgroup *h_cg)
>  {
> -	unsigned long csize = nr_pages * PAGE_SIZE;
> -
>  	if (hugetlb_cgroup_disabled() || !h_cg)
>  		return;
>  
>  	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
>  		return;
>  
> -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
>  	return;
>  }
>  
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +};
> +
>  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
>  				   struct cftype *cft)
>  {
> -	int idx, name;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
>  
> -	idx = MEMFILE_IDX(cft->private);
> -	name = MEMFILE_ATTR(cft->private);
> +	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
>  
> -	return res_counter_read_u64(&h_cg->hugepage[idx], name);
> +	switch (MEMFILE_ATTR(cft->private)) {
> +	case RES_USAGE:
> +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> +	case RES_LIMIT:
> +		return (u64)counter->limit * PAGE_SIZE;
> +	case RES_MAX_USAGE:
> +		return (u64)counter->watermark * PAGE_SIZE;
> +	case RES_FAILCNT:
> +		return counter->limited;
> +	default:
> +		BUG();
> +	}
>  }
>  
>  static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>  				    char *buf, size_t nbytes, loff_t off)
>  {
> -	int idx, name, ret;
> -	unsigned long long val;
> +	int ret, idx;
> +	unsigned long nr_pages;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
>  
> +	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> +		return -EINVAL;
> +
>  	buf = strstrip(buf);
> +	ret = page_counter_memparse(buf, &nr_pages);
> +	if (ret)
> +		return ret;
> +
>  	idx = MEMFILE_IDX(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_LIMIT:
> -		if (hugetlb_cgroup_is_root(h_cg)) {
> -			/* Can't set limit on root */
> -			ret = -EINVAL;
> -			break;
> -		}
> -		/* This function does all necessary parse...reuse it */
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> -		if (ret)
> -			break;
> -		val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
> -		ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> +		nr_pages = ALIGN(nr_pages, 1UL << huge_page_order(&hstates[idx]));
> +		ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
>  		break;
>  	default:
>  		ret = -EINVAL;
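
A worked example of the unit handling here, purely for illustration (not part
of the patch): with 4kB base pages, writing "1G" to
hugetlb.2MB.limit_in_bytes makes page_counter_memparse() return 262144 base
pages; one 2MB hugepage spans 1 << huge_page_order() == 512 base pages, so
the limit is rounded to a multiple of 512 pages and reads back as exactly 1G.
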
> @@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
>  				    char *buf, size_t nbytes, loff_t off)
>  {
> -	int idx, name, ret = 0;
> +	int ret = 0;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
>  
> -	idx = MEMFILE_IDX(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
> +	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_MAX_USAGE:
> -		res_counter_reset_max(&h_cg->hugepage[idx]);
> +		counter->watermark = atomic_long_read(&counter->count);
>  		break;
>  	case RES_FAILCNT:
> -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> +		counter->limited = 0;
>  		break;
>  	default:
>  		ret = -EINVAL;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..dfd3b15a57e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -25,7 +25,6 @@
>   * GNU General Public License for more details.
>   */
>  
> -#include <linux/res_counter.h>
>  #include <linux/memcontrol.h>
>  #include <linux/cgroup.h>
>  #include <linux/mm.h>
> @@ -66,6 +65,117 @@
>  
>  #include <trace/events/vmscan.h>
>  
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	long new;
> +
> +	new = atomic_long_sub_return(nr_pages, &counter->count);
> +
> +	if (WARN_ON(unlikely(new < 0)))
> +		atomic_long_set(&counter->count, 0);
> +
> +	return new > 0;
> +}
> +
> +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> +			struct page_counter **fail)
> +{
> +	struct page_counter *c;
> +
> +	for (c = counter; c; c = c->parent) {
> +		for (;;) {
> +			unsigned long count;
> +			unsigned long new;
> +
> +			count = atomic_long_read(&c->count);
> +
> +			new = count + nr_pages;
> +			if (new > c->limit) {
> +				c->limited++;
> +				if (fail) {
> +					*fail = c;
> +					goto failed;
> +				}
> +			}
> +
> +			if (atomic_long_cmpxchg(&c->count, count, new) != count)
> +				continue;
> +
> +			if (new > c->watermark)
> +				c->watermark = new;
> +
> +			break;
> +		}
> +	}
> +	return 0;
> +
> +failed:
> +	for (c = counter; c != *fail; c = c->parent)
> +		page_counter_cancel(c, nr_pages);
> +
> +	return -ENOMEM;
> +}
> +
> +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	struct page_counter *c;
> +	int ret = 1;
> +
> +	for (c = counter; c; c = c->parent) {
> +		int remainder;
> +
> +		remainder = page_counter_cancel(c, nr_pages);
> +		if (c == counter && !remainder)
> +			ret = 0;
> +	}
> +
> +	return ret;
> +}
> +
> +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> +{
> +	for (;;) {
> +		unsigned long count;
> +		unsigned long old;
> +
> +		count = atomic_long_read(&counter->count);
> +
> +		old = xchg(&counter->limit, limit);
> +
> +		if (atomic_long_read(&counter->count) != count) {
> +			counter->limit = old;
> +			continue;
> +		}
> +
> +		if (count > limit) {
> +			counter->limit = old;
> +			return -EBUSY;
> +		}
> +
> +		return 0;
> +	}
> +}
> +
> +int page_counter_memparse(const char *buf, unsigned long *nr_pages)
> +{
> +	char unlimited[] = "-1";
> +	char *end;
> +	u64 bytes;
> +
> +	if (!strncmp(buf, unlimited, sizeof(unlimited))) {
> +		*nr_pages = PAGE_COUNTER_MAX;
> +		return 0;
> +	}
> +
> +	bytes = memparse(buf, &end);
> +	if (*end != '\0')
> +		return -EINVAL;
> +
> +	*nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
> +
> +	return 0;
> +}
> +
>  struct cgroup_subsys memory_cgrp_subsys __read_mostly;
>  EXPORT_SYMBOL(memory_cgrp_subsys);
>  
> @@ -165,7 +275,7 @@ struct mem_cgroup_per_zone {
>  	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
>  
>  	struct rb_node		tree_node;	/* RB tree node */
> -	unsigned long long	usage_in_excess;/* Set to the value by which */
> +	unsigned long		usage_in_excess;/* Set to the value by which */
>  						/* the soft limit is exceeded*/
>  	bool			on_tree;
>  	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
> @@ -198,7 +308,7 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
>  
>  struct mem_cgroup_threshold {
>  	struct eventfd_ctx *eventfd;
> -	u64 threshold;
> +	unsigned long threshold;
>  };
>  
>  /* For threshold */
> @@ -284,24 +394,18 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>   */
>  struct mem_cgroup {
>  	struct cgroup_subsys_state css;
> -	/*
> -	 * the counter to account for memory usage
> -	 */
> -	struct res_counter res;
> +
> +	/* Accounted resources */
> +	struct page_counter memory;
> +	struct page_counter memsw;
> +	struct page_counter kmem;
> +
> +	unsigned long soft_limit;
>  
>  	/* vmpressure notifications */
>  	struct vmpressure vmpressure;
>  
>  	/*
> -	 * the counter to account for mem+swap usage.
> -	 */
> -	struct res_counter memsw;
> -
> -	/*
> -	 * the counter to account for kernel memory usage.
> -	 */
> -	struct res_counter kmem;
> -	/*
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> @@ -647,7 +751,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
>  	 * This check can't live in kmem destruction function,
>  	 * since the charges will outlive the cgroup
>  	 */
> -	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
> +	WARN_ON(atomic_long_read(&memcg->kmem.count));
>  }
>  #else
>  static void disarm_kmem_keys(struct mem_cgroup *memcg)
> @@ -703,7 +807,7 @@ soft_limit_tree_from_page(struct page *page)
>  
>  static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
>  					 struct mem_cgroup_tree_per_zone *mctz,
> -					 unsigned long long new_usage_in_excess)
> +					 unsigned long new_usage_in_excess)
>  {
>  	struct rb_node **p = &mctz->rb_root.rb_node;
>  	struct rb_node *parent = NULL;
> @@ -752,10 +856,21 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
>  	spin_unlock_irqrestore(&mctz->lock, flags);
>  }
>  
> +static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
> +{
> +	unsigned long nr_pages = atomic_long_read(&memcg->memory.count);
> +	unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
> +	unsigned long excess = 0;
> +
> +	if (nr_pages > soft_limit)
> +		excess = nr_pages - soft_limit;
> +
> +	return excess;
> +}
>  
>  static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
>  {
> -	unsigned long long excess;
> +	unsigned long excess;
>  	struct mem_cgroup_per_zone *mz;
>  	struct mem_cgroup_tree_per_zone *mctz;
>  
> @@ -766,7 +881,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
>  	 */
>  	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>  		mz = mem_cgroup_page_zoneinfo(memcg, page);
> -		excess = res_counter_soft_limit_excess(&memcg->res);
> +		excess = soft_limit_excess(memcg);
>  		/*
>  		 * We have to update the tree if mz is on RB-tree or
>  		 * mem is over its softlimit.
> @@ -822,7 +937,7 @@ retry:
>  	 * position in the tree.
>  	 */
>  	__mem_cgroup_remove_exceeded(mz, mctz);
> -	if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
> +	if (!soft_limit_excess(mz->memcg) ||
>  	    !css_tryget_online(&mz->memcg->css))
>  		goto retry;
>  done:
> @@ -1478,7 +1593,7 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
>  	return inactive * inactive_ratio < active;
>  }
>  
> -#define mem_cgroup_from_res_counter(counter, member)	\
> +#define mem_cgroup_from_counter(counter, member)	\
>  	container_of(counter, struct mem_cgroup, member)
>  
>  /**
> @@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
>   */
>  static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  {
> -	unsigned long long margin;
> +	unsigned long margin = 0;
> +	unsigned long count;
> +	unsigned long limit;
>  
> -	margin = res_counter_margin(&memcg->res);
> -	if (do_swap_account)
> -		margin = min(margin, res_counter_margin(&memcg->memsw));
> -	return margin >> PAGE_SHIFT;
> +	count = atomic_long_read(&memcg->memory.count);
> +	limit = ACCESS_ONCE(memcg->memory.limit);
> +	if (count < limit)
> +		margin = limit - count;
> +
> +	if (do_swap_account) {
> +		count = atomic_long_read(&memcg->memsw.count);
> +		limit = ACCESS_ONCE(memcg->memsw.limit);
> +		if (count < limit)
> +			margin = min(margin, limit - count);
> +	}
> +
> +	return margin;
>  }
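
As a quick arithmetic illustration of the rewritten margin calculation
(made-up numbers): with a memory count of 200 pages against a limit of 300
the margin is 100; if memsw stands at 450 against a limit of 500, the
swap-accounted margin is 50, and the function returns min(100, 50) = 50 pages.
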
>  
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> @@ -1636,18 +1762,15 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  
>  	rcu_read_unlock();
>  
> -	pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
> -		res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
> -		res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
> -		res_counter_read_u64(&memcg->res, RES_FAILCNT));
> -	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
> -		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> -		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> -		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> -	pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n",
> -		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> -		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> -		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
> +	pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
> +		K((u64)atomic_long_read(&memcg->memory.count)),
> +		K((u64)memcg->memory.limit), memcg->memory.limited);
> +	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
> +		K((u64)atomic_long_read(&memcg->memsw.count)),
> +		K((u64)memcg->memsw.limit), memcg->memsw.limited);
> +	pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
> +		K((u64)atomic_long_read(&memcg->kmem.count)),
> +		K((u64)memcg->kmem.limit), memcg->kmem.limited);
>  
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		pr_info("Memory cgroup stats for ");
> @@ -1685,30 +1808,19 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
>  }
>  
>  /*
> - * Return the memory (and swap, if configured) limit for a memcg.
> + * Return the maximum amount of memory (and swap, if configured) a memcg can use.
>   */
> -static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> +static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  {
> -	u64 limit;
> +	unsigned long limit;
>  
> -	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -
> -	/*
> -	 * Do not consider swap space if we cannot swap due to swappiness
> -	 */
> +	limit = memcg->memory.limit;
>  	if (mem_cgroup_swappiness(memcg)) {
> -		u64 memsw;
> +		unsigned long memsw_limit;
>  
> -		limit += total_swap_pages << PAGE_SHIFT;
> -		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -
> -		/*
> -		 * If memsw is finite and limits the amount of swap space
> -		 * available to this memcg, return that limit.
> -		 */
> -		limit = min(limit, memsw);
> +		memsw_limit = memcg->memsw.limit;
> +		limit = min(limit + total_swap_pages, memsw_limit);
>  	}
> -
>  	return limit;
>  }
>  
> @@ -1732,7 +1844,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	}
>  
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
> -	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct css_task_iter it;
>  		struct task_struct *task;
> @@ -1935,7 +2047,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  		.priority = 0,
>  	};
>  
> -	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
> +	excess = soft_limit_excess(root_memcg);
>  
>  	while (1) {
>  		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
> @@ -1966,7 +2078,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
>  						     zone, &nr_scanned);
>  		*total_scanned += nr_scanned;
> -		if (!res_counter_soft_limit_excess(&root_memcg->res))
> +		if (!soft_limit_excess(root_memcg))
>  			break;
>  	}
>  	mem_cgroup_iter_break(root_memcg, victim);
> @@ -2293,33 +2405,31 @@ static DEFINE_MUTEX(percpu_charge_mutex);
>  static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>  	struct memcg_stock_pcp *stock;
> -	bool ret = true;
> +	bool ret = false;
>  
>  	if (nr_pages > CHARGE_BATCH)
> -		return false;
> +		return ret;
>  
>  	stock = &get_cpu_var(memcg_stock);
> -	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
> +	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
>  		stock->nr_pages -= nr_pages;
> -	else /* need to call res_counter_charge */
> -		ret = false;
> +		ret = true;
> +	}
>  	put_cpu_var(memcg_stock);
>  	return ret;
>  }
>  
>  /*
> - * Returns stocks cached in percpu to res_counter and reset cached information.
> + * Returns stocks cached in percpu and reset cached information.
>   */
>  static void drain_stock(struct memcg_stock_pcp *stock)
>  {
>  	struct mem_cgroup *old = stock->cached;
>  
>  	if (stock->nr_pages) {
> -		unsigned long bytes = stock->nr_pages * PAGE_SIZE;
> -
> -		res_counter_uncharge(&old->res, bytes);
> +		page_counter_uncharge(&old->memory, stock->nr_pages);
>  		if (do_swap_account)
> -			res_counter_uncharge(&old->memsw, bytes);
> +			page_counter_uncharge(&old->memsw, stock->nr_pages);
>  		stock->nr_pages = 0;
>  	}
>  	stock->cached = NULL;
> @@ -2348,7 +2458,7 @@ static void __init memcg_stock_init(void)
>  }
>  
>  /*
> - * Cache charges(val) which is from res_counter, to local per_cpu area.
> + * Cache charges (nr_pages) to the local per-cpu area.
>   * This will be consumed by consume_stock() function, later.
>   */
>  static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> @@ -2408,8 +2518,7 @@ out:
>  /*
>   * Tries to drain stocked charges in other cpus. This function is asynchronous
>   * and just put a work per cpu for draining localy on each cpu. Caller can
> - * expects some charges will be back to res_counter later but cannot wait for
> - * it.
> + * expect some charges will be back later but cannot wait for it.
>   */
>  static void drain_all_stock_async(struct mem_cgroup *root_memcg)
>  {
> @@ -2483,9 +2592,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
> -	struct res_counter *fail_res;
> +	struct page_counter *counter;
>  	unsigned long nr_reclaimed;
> -	unsigned long long size;
>  	bool may_swap = true;
>  	bool drained = false;
>  	int ret = 0;
> @@ -2496,17 +2604,16 @@ retry:
>  	if (consume_stock(memcg, nr_pages))
>  		goto done;
>  
> -	size = batch * PAGE_SIZE;
> -	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> +	if (!page_counter_charge(&memcg->memory, batch, &counter)) {
>  		if (!do_swap_account)
>  			goto done_restock;
> -		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> +		if (!page_counter_charge(&memcg->memsw, batch, &counter))
>  			goto done_restock;
> -		res_counter_uncharge(&memcg->res, size);
> -		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> +		page_counter_uncharge(&memcg->memory, batch);
> +		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>  		may_swap = false;
>  	} else
> -		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> +		mem_over_limit = mem_cgroup_from_counter(counter, memory);
>  
>  	if (batch > nr_pages) {
>  		batch = nr_pages;
> @@ -2587,32 +2694,12 @@ done:
>  
>  static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> -	unsigned long bytes = nr_pages * PAGE_SIZE;
> -
>  	if (mem_cgroup_is_root(memcg))
>  		return;
>  
> -	res_counter_uncharge(&memcg->res, bytes);
> +	page_counter_uncharge(&memcg->memory, nr_pages);
>  	if (do_swap_account)
> -		res_counter_uncharge(&memcg->memsw, bytes);
> -}
> -
> -/*
> - * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
> - * This is useful when moving usage to parent cgroup.
> - */
> -static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
> -					unsigned int nr_pages)
> -{
> -	unsigned long bytes = nr_pages * PAGE_SIZE;
> -
> -	if (mem_cgroup_is_root(memcg))
> -		return;
> -
> -	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
> -	if (do_swap_account)
> -		res_counter_uncharge_until(&memcg->memsw,
> -						memcg->memsw.parent, bytes);
> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>  }
>  
>  /*
> @@ -2736,8 +2823,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>  		unlock_page_lru(page, isolated);
>  }
>  
> -static DEFINE_MUTEX(set_limit_mutex);
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  /*
>   * The memcg_slab_mutex is held whenever a per memcg kmem cache is created or
> @@ -2786,16 +2871,17 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
>  }
>  #endif
>  
> -static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
> +			     unsigned long nr_pages)
>  {
> -	struct res_counter *fail_res;
> +	struct page_counter *counter;
>  	int ret = 0;
>  
> -	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
> -	if (ret)
> +	ret = page_counter_charge(&memcg->kmem, nr_pages, &counter);
> +	if (ret < 0)
>  		return ret;
>  
> -	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
> +	ret = try_charge(memcg, gfp, nr_pages);
>  	if (ret == -EINTR)  {
>  		/*
>  		 * try_charge() chose to bypass to root due to OOM kill or
> @@ -2812,25 +2898,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>  		 * when the allocation triggers should have been already
>  		 * directed to the root cgroup in memcontrol.h
>  		 */
> -		res_counter_charge_nofail(&memcg->res, size, &fail_res);
> +		page_counter_charge(&memcg->memory, nr_pages, NULL);
>  		if (do_swap_account)
> -			res_counter_charge_nofail(&memcg->memsw, size,
> -						  &fail_res);
> +			page_counter_charge(&memcg->memsw, nr_pages, NULL);
>  		ret = 0;
>  	} else if (ret)
> -		res_counter_uncharge(&memcg->kmem, size);
> +		page_counter_uncharge(&memcg->kmem, nr_pages);
>  
>  	return ret;
>  }
>  
> -static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
> +				unsigned long nr_pages)
>  {
> -	res_counter_uncharge(&memcg->res, size);
> +	page_counter_uncharge(&memcg->memory, nr_pages);
>  	if (do_swap_account)
> -		res_counter_uncharge(&memcg->memsw, size);
> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>  
>  	/* Not down to 0 */
> -	if (res_counter_uncharge(&memcg->kmem, size))
> +	if (page_counter_uncharge(&memcg->kmem, nr_pages))
>  		return;
>  
>  	/*
> @@ -3107,19 +3193,21 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,
>  
>  int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
>  {
> +	unsigned int nr_pages = 1 << order;
>  	int res;
>  
> -	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
> -				PAGE_SIZE << order);
> +	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages);
>  	if (!res)
> -		atomic_add(1 << order, &cachep->memcg_params->nr_pages);
> +		atomic_add(nr_pages, &cachep->memcg_params->nr_pages);
>  	return res;
>  }
>  
>  void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
>  {
> -	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
> -	atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
> +	unsigned int nr_pages = 1 << order;
> +
> +	memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages);
> +	atomic_sub(nr_pages, &cachep->memcg_params->nr_pages);
>  }
>  
>  /*
> @@ -3240,7 +3328,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
>  		return true;
>  	}
>  
> -	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
> +	ret = memcg_charge_kmem(memcg, gfp, 1 << order);
>  	if (!ret)
>  		*_memcg = memcg;
>  
> @@ -3257,7 +3345,7 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
>  
>  	/* The page allocation failed. Revert */
>  	if (!page) {
> -		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
> +		memcg_uncharge_kmem(memcg, 1 << order);
>  		return;
>  	}
>  	/*
> @@ -3290,7 +3378,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
>  		return;
>  
>  	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> -	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
> +	memcg_uncharge_kmem(memcg, 1 << order);
>  }
>  #else
>  static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
> @@ -3468,8 +3556,12 @@ static int mem_cgroup_move_parent(struct page *page,
>  
>  	ret = mem_cgroup_move_account(page, nr_pages,
>  				pc, child, parent);
> -	if (!ret)
> -		__mem_cgroup_cancel_local_charge(child, nr_pages);
> +	if (!ret) {
> +		/* Take charge off the local counters */
> +		page_counter_cancel(&child->memory, nr_pages);
> +		if (do_swap_account)
> +			page_counter_cancel(&child->memsw, nr_pages);
> +	}
>  
>  	if (nr_pages > 1)
>  		compound_unlock_irqrestore(page, flags);
> @@ -3499,7 +3591,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
>   *
>   * Returns 0 on success, -EINVAL on failure.
>   *
> - * The caller must have charged to @to, IOW, called res_counter_charge() about
> + * The caller must have charged to @to, IOW, called page_counter_charge() about
>   * both res and memsw, and called css_get().
>   */
>  static int mem_cgroup_move_swap_account(swp_entry_t entry,
> @@ -3515,7 +3607,7 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
>  		mem_cgroup_swap_statistics(to, true);
>  		/*
>  		 * This function is only called from task migration context now.
> -		 * It postpones res_counter and refcount handling till the end
> +		 * It postpones page_counter and refcount handling till the end
>  		 * of task migration(mem_cgroup_clear_mc()) for performance
>  		 * improvement. But we cannot postpone css_get(to)  because if
>  		 * the process that has been moved to @to does swap-in, the
> @@ -3573,49 +3665,42 @@ void mem_cgroup_print_bad_page(struct page *page)
>  }
>  #endif
>  
> +static DEFINE_MUTEX(set_limit_mutex);
> +
>  static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> -				unsigned long long val)
> +				   unsigned long limit)
>  {
> +	unsigned long curusage;
> +	unsigned long oldusage;
> +	bool enlarge = false;
>  	int retry_count;
> -	u64 memswlimit, memlimit;
> -	int ret = 0;
> -	int children = mem_cgroup_count_children(memcg);
> -	u64 curusage, oldusage;
> -	int enlarge;
> +	int ret;
>  
>  	/*
>  	 * For keeping hierarchical_reclaim simple, how long we should retry
>  	 * is depends on callers. We set our retry-count to be function
>  	 * of # of children which we should visit in this loop.
>  	 */
> -	retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
> +	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
> +		      mem_cgroup_count_children(memcg);
>  
> -	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +	oldusage = atomic_long_read(&memcg->memory.count);
>  
> -	enlarge = 0;
> -	while (retry_count) {
> +	do {
>  		if (signal_pending(current)) {
>  			ret = -EINTR;
>  			break;
>  		}
> -		/*
> -		 * Rather than hide all in some function, I do this in
> -		 * open coded manner. You see what this really does.
> -		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> -		 */
> +
>  		mutex_lock(&set_limit_mutex);
> -		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -		if (memswlimit < val) {
> -			ret = -EINVAL;
> +		if (limit > memcg->memsw.limit) {
>  			mutex_unlock(&set_limit_mutex);
> +			ret = -EINVAL;
>  			break;
>  		}
> -
> -		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -		if (memlimit < val)
> -			enlarge = 1;
> -
> -		ret = res_counter_set_limit(&memcg->res, val);
> +		if (limit > memcg->memory.limit)
> +			enlarge = true;
> +		ret = page_counter_limit(&memcg->memory, limit);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -3623,13 +3708,14 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  
>  		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
>  
> -		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +		curusage = atomic_long_read(&memcg->memory.count);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
>  			retry_count--;
>  		else
>  			oldusage = curusage;
> -	}
> +	} while (retry_count);
> +
>  	if (!ret && enlarge)
>  		memcg_oom_recover(memcg);
>  
> @@ -3637,38 +3723,35 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  }
>  
>  static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> -					unsigned long long val)
> +					 unsigned long limit)
>  {
> +	unsigned long curusage;
> +	unsigned long oldusage;
> +	bool enlarge = false;
>  	int retry_count;
> -	u64 memlimit, memswlimit, oldusage, curusage;
> -	int children = mem_cgroup_count_children(memcg);
> -	int ret = -EBUSY;
> -	int enlarge = 0;
> +	int ret;
>  
>  	/* see mem_cgroup_resize_res_limit */
> -	retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
> -	oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> -	while (retry_count) {
> +	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
> +		      mem_cgroup_count_children(memcg);
> +
> +	oldusage = atomic_long_read(&memcg->memsw.count);
> +
> +	do {
>  		if (signal_pending(current)) {
>  			ret = -EINTR;
>  			break;
>  		}
> -		/*
> -		 * Rather than hide all in some function, I do this in
> -		 * open coded manner. You see what this really does.
> -		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> -		 */
> +
>  		mutex_lock(&set_limit_mutex);
> -		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -		if (memlimit > val) {
> -			ret = -EINVAL;
> +		if (limit < memcg->memory.limit) {
>  			mutex_unlock(&set_limit_mutex);
> +			ret = -EINVAL;
>  			break;
>  		}
> -		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -		if (memswlimit < val)
> -			enlarge = 1;
> -		ret = res_counter_set_limit(&memcg->memsw, val);
> +		if (limit > memcg->memsw.limit)
> +			enlarge = true;
> +		ret = page_counter_limit(&memcg->memsw, limit);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -3676,15 +3759,17 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  
>  		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
>  
> -		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +		curusage = atomic_long_read(&memcg->memsw.count);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
>  			retry_count--;
>  		else
>  			oldusage = curusage;
> -	}
> +	} while (retry_count);
> +
>  	if (!ret && enlarge)
>  		memcg_oom_recover(memcg);
> +
>  	return ret;
>  }
>  
> @@ -3697,7 +3782,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	unsigned long reclaimed;
>  	int loop = 0;
>  	struct mem_cgroup_tree_per_zone *mctz;
> -	unsigned long long excess;
> +	unsigned long excess;
>  	unsigned long nr_scanned;
>  
>  	if (order > 0)
> @@ -3751,7 +3836,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  			} while (1);
>  		}
>  		__mem_cgroup_remove_exceeded(mz, mctz);
> -		excess = res_counter_soft_limit_excess(&mz->memcg->res);
> +		excess = soft_limit_excess(mz->memcg);
>  		/*
>  		 * One school of thought says that we should not add
>  		 * back the node to the tree if reclaim returns 0.
> @@ -3844,7 +3929,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>  static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
>  {
>  	int node, zid;
> -	u64 usage;
>  
>  	do {
>  		/* This is for making all *used* pages to be on LRU. */
> @@ -3876,9 +3960,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
>  		 * right after the check. RES_USAGE should be safe as we always
>  		 * charge before adding to the LRU.
>  		 */
> -		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
> -			res_counter_read_u64(&memcg->kmem, RES_USAGE);
> -	} while (usage > 0);
> +	} while (atomic_long_read(&memcg->memory.count) -
> +		 atomic_long_read(&memcg->kmem.count) > 0);
>  }
>  
>  /*
> @@ -3918,7 +4001,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  	/* we call try-to-free pages for make this cgroup empty */
>  	lru_add_drain_all();
>  	/* try to free all pages in this cgroup */
> -	while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
> +	while (nr_retries && atomic_long_read(&memcg->memory.count)) {
>  		int progress;
>  
>  		if (signal_pending(current))
> @@ -3989,8 +4072,8 @@ out:
>  	return retval;
>  }
>  
> -static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
> -					       enum mem_cgroup_stat_index idx)
> +static unsigned long tree_stat(struct mem_cgroup *memcg,
> +			       enum mem_cgroup_stat_index idx)
>  {
>  	struct mem_cgroup *iter;
>  	long val = 0;
> @@ -4008,55 +4091,72 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  {
>  	u64 val;
>  
> -	if (!mem_cgroup_is_root(memcg)) {
> +	if (mem_cgroup_is_root(memcg)) {
> +		val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE);
> +		val += tree_stat(memcg, MEM_CGROUP_STAT_RSS);
> +		if (swap)
> +			val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
> +	} else {
>  		if (!swap)
> -			return res_counter_read_u64(&memcg->res, RES_USAGE);
> +			val = atomic_long_read(&memcg->memory.count);
>  		else
> -			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +			val = atomic_long_read(&memcg->memsw.count);
>  	}
> -
> -	/*
> -	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
> -	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
> -	 */
> -	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
> -	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
> -
> -	if (swap)
> -		val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
> -
>  	return val << PAGE_SHIFT;
>  }
>  
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +	RES_SOFT_LIMIT,
> +};
>  
>  static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  			       struct cftype *cft)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	enum res_type type = MEMFILE_TYPE(cft->private);
> -	int name = MEMFILE_ATTR(cft->private);
> +	struct page_counter *counter;
>  
> -	switch (type) {
> +	switch (MEMFILE_TYPE(cft->private)) {
>  	case _MEM:
> -		if (name == RES_USAGE)
> -			return mem_cgroup_usage(memcg, false);
> -		return res_counter_read_u64(&memcg->res, name);
> +		counter = &memcg->memory;
> +		break;
>  	case _MEMSWAP:
> -		if (name == RES_USAGE)
> -			return mem_cgroup_usage(memcg, true);
> -		return res_counter_read_u64(&memcg->memsw, name);
> +		counter = &memcg->memsw;
> +		break;
>  	case _KMEM:
> -		return res_counter_read_u64(&memcg->kmem, name);
> +		counter = &memcg->kmem;
>  		break;
>  	default:
>  		BUG();
>  	}
> +
> +	switch (MEMFILE_ATTR(cft->private)) {
> +	case RES_USAGE:
> +		if (counter == &memcg->memory)
> +			return mem_cgroup_usage(memcg, false);
> +		if (counter == &memcg->memsw)
> +			return mem_cgroup_usage(memcg, true);
> +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> +	case RES_LIMIT:
> +		return (u64)counter->limit * PAGE_SIZE;
> +	case RES_MAX_USAGE:
> +		return (u64)counter->watermark * PAGE_SIZE;
> +	case RES_FAILCNT:
> +		return counter->limited;
> +	case RES_SOFT_LIMIT:
> +		return (u64)memcg->soft_limit * PAGE_SIZE;
> +	default:
> +		BUG();
> +	}
>  }
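
The unit conversion at the cgroupfs boundary may deserve a spelled-out
example (illustrative only): a limit of 262144 pages reads back through
memory.limit_in_bytes as 262144 * PAGE_SIZE = 1073741824 on a 4kB-page
system, while usage for the root cgroup is still assembled from the
CACHE/RSS(/SWAP) statistics rather than read off the page counter.
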
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  /* should be called with activate_kmem_mutex held */
>  static int __memcg_activate_kmem(struct mem_cgroup *memcg,
> -				 unsigned long long limit)
> +				 unsigned long nr_pages)
>  {
>  	int err = 0;
>  	int memcg_id;
> @@ -4103,7 +4203,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
>  	 * We couldn't have accounted to this cgroup, because it hasn't got the
>  	 * active bit set yet, so this should succeed.
>  	 */
> -	err = res_counter_set_limit(&memcg->kmem, limit);
> +	err = page_counter_limit(&memcg->kmem, nr_pages);
>  	VM_BUG_ON(err);
>  
>  	static_key_slow_inc(&memcg_kmem_enabled_key);
> @@ -4119,25 +4219,25 @@ out:
>  }
>  
>  static int memcg_activate_kmem(struct mem_cgroup *memcg,
> -			       unsigned long long limit)
> +			       unsigned long nr_pages)
>  {
>  	int ret;
>  
>  	mutex_lock(&activate_kmem_mutex);
> -	ret = __memcg_activate_kmem(memcg, limit);
> +	ret = __memcg_activate_kmem(memcg, nr_pages);
>  	mutex_unlock(&activate_kmem_mutex);
>  	return ret;
>  }
>  
>  static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
> -				   unsigned long long val)
> +				   unsigned long limit)
>  {
>  	int ret;
>  
>  	if (!memcg_kmem_is_active(memcg))
> -		ret = memcg_activate_kmem(memcg, val);
> +		ret = memcg_activate_kmem(memcg, limit);
>  	else
> -		ret = res_counter_set_limit(&memcg->kmem, val);
> +		ret = page_counter_limit(&memcg->kmem, limit);
>  	return ret;
>  }
>  
> @@ -4155,13 +4255,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
>  	 * after this point, because it has at least one child already.
>  	 */
>  	if (memcg_kmem_is_active(parent))
> -		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
> +		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
>  	mutex_unlock(&activate_kmem_mutex);
>  	return ret;
>  }
>  #else
>  static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
> -				   unsigned long long val)
> +				   unsigned long limit)
>  {
>  	return -EINVAL;
>  }
> @@ -4175,110 +4275,69 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	enum res_type type;
> -	int name;
> -	unsigned long long val;
> +	unsigned long nr_pages;
>  	int ret;
>  
>  	buf = strstrip(buf);
> -	type = MEMFILE_TYPE(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
> +	ret = page_counter_memparse(buf, &nr_pages);
> +	if (ret)
> +		return ret;
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_LIMIT:
>  		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
>  			ret = -EINVAL;
>  			break;
>  		}
> -		/* This function does all necessary parse...reuse it */
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> -		if (ret)
> +		switch (MEMFILE_TYPE(of_cft(of)->private)) {
> +		case _MEM:
> +			ret = mem_cgroup_resize_limit(memcg, nr_pages);
>  			break;
> -		if (type == _MEM)
> -			ret = mem_cgroup_resize_limit(memcg, val);
> -		else if (type == _MEMSWAP)
> -			ret = mem_cgroup_resize_memsw_limit(memcg, val);
> -		else if (type == _KMEM)
> -			ret = memcg_update_kmem_limit(memcg, val);
> -		else
> -			return -EINVAL;
> -		break;
> -	case RES_SOFT_LIMIT:
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> -		if (ret)
> +		case _MEMSWAP:
> +			ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
>  			break;
> -		/*
> -		 * For memsw, soft limits are hard to implement in terms
> -		 * of semantics, for now, we support soft limits for
> -		 * control without swap
> -		 */
> -		if (type == _MEM)
> -			ret = res_counter_set_soft_limit(&memcg->res, val);
> -		else
> -			ret = -EINVAL;
> +		case _KMEM:
> +			ret = memcg_update_kmem_limit(memcg, nr_pages);
> +			break;
> +		}
>  		break;
> -	default:
> -		ret = -EINVAL; /* should be BUG() ? */
> +	case RES_SOFT_LIMIT:
> +		memcg->soft_limit = nr_pages;
> +		ret = 0;
>  		break;
>  	}
>  	return ret ?: nbytes;
>  }
>  
> -static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
> -		unsigned long long *mem_limit, unsigned long long *memsw_limit)
> -{
> -	unsigned long long min_limit, min_memsw_limit, tmp;
> -
> -	min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -	min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -	if (!memcg->use_hierarchy)
> -		goto out;
> -
> -	while (memcg->css.parent) {
> -		memcg = mem_cgroup_from_css(memcg->css.parent);
> -		if (!memcg->use_hierarchy)
> -			break;
> -		tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -		min_limit = min(min_limit, tmp);
> -		tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -		min_memsw_limit = min(min_memsw_limit, tmp);
> -	}
> -out:
> -	*mem_limit = min_limit;
> -	*memsw_limit = min_memsw_limit;
> -}
> -
>  static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
>  				size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	int name;
> -	enum res_type type;
> +	struct page_counter *counter;
>  
> -	type = MEMFILE_TYPE(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
> +	switch (MEMFILE_TYPE(of_cft(of)->private)) {
> +	case _MEM:
> +		counter = &memcg->memory;
> +		break;
> +	case _MEMSWAP:
> +		counter = &memcg->memsw;
> +		break;
> +	case _KMEM:
> +		counter = &memcg->kmem;
> +		break;
> +	default:
> +		BUG();
> +	}
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_MAX_USAGE:
> -		if (type == _MEM)
> -			res_counter_reset_max(&memcg->res);
> -		else if (type == _MEMSWAP)
> -			res_counter_reset_max(&memcg->memsw);
> -		else if (type == _KMEM)
> -			res_counter_reset_max(&memcg->kmem);
> -		else
> -			return -EINVAL;
> +		counter->watermark = atomic_long_read(&counter->count);
>  		break;
>  	case RES_FAILCNT:
> -		if (type == _MEM)
> -			res_counter_reset_failcnt(&memcg->res);
> -		else if (type == _MEMSWAP)
> -			res_counter_reset_failcnt(&memcg->memsw);
> -		else if (type == _KMEM)
> -			res_counter_reset_failcnt(&memcg->kmem);
> -		else
> -			return -EINVAL;
> +		counter->limited = 0;
>  		break;
> +	default:
> +		BUG();
>  	}
>  
>  	return nbytes;
> @@ -4375,6 +4434,7 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
>  static int memcg_stat_show(struct seq_file *m, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long memory, memsw;
>  	struct mem_cgroup *mi;
>  	unsigned int i;
>  
> @@ -4394,14 +4454,16 @@ static int memcg_stat_show(struct seq_file *m, void *v)
>  			   mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);
>  
>  	/* Hierarchical information */
> -	{
> -		unsigned long long limit, memsw_limit;
> -		memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit);
> -		seq_printf(m, "hierarchical_memory_limit %llu\n", limit);
> -		if (do_swap_account)
> -			seq_printf(m, "hierarchical_memsw_limit %llu\n",
> -				   memsw_limit);
> +	memory = memsw = PAGE_COUNTER_MAX;
> +	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
> +		memory = min(memory, mi->memory.limit);
> +		memsw = min(memsw, mi->memsw.limit);
>  	}
> +	seq_printf(m, "hierarchical_memory_limit %llu\n",
> +		   (u64)memory * PAGE_SIZE);
> +	if (do_swap_account)
> +		seq_printf(m, "hierarchical_memsw_limit %llu\n",
> +			   (u64)memsw * PAGE_SIZE);
>  
>  	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>  		long long val = 0;
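
Illustrating the open-coded hierarchy walk above with invented numbers: a
child limited to 262144 pages under a parent limited to 131072 pages reports
hierarchical_memory_limit as 131072 * PAGE_SIZE = 536870912, i.e. the
minimum limit along the path to the root.
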
> @@ -4485,7 +4547,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
>  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>  {
>  	struct mem_cgroup_threshold_ary *t;
> -	u64 usage;
> +	unsigned long usage;
>  	int i;
>  
>  	rcu_read_lock();
> @@ -4584,10 +4646,11 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
>  {
>  	struct mem_cgroup_thresholds *thresholds;
>  	struct mem_cgroup_threshold_ary *new;
> -	u64 threshold, usage;
> +	unsigned long threshold;
> +	unsigned long usage;
>  	int i, size, ret;
>  
> -	ret = res_counter_memparse_write_strategy(args, &threshold);
> +	ret = page_counter_memparse(args, &threshold);
>  	if (ret)
>  		return ret;
>  
> @@ -4677,7 +4740,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
>  {
>  	struct mem_cgroup_thresholds *thresholds;
>  	struct mem_cgroup_threshold_ary *new;
> -	u64 usage;
> +	unsigned long usage;
>  	int i, j, size;
>  
>  	mutex_lock(&memcg->thresholds_lock);
> @@ -4871,7 +4934,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
>  
>  	memcg_kmem_mark_dead(memcg);
>  
> -	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
> +	if (atomic_long_read(&memcg->kmem.count))
>  		return;
>  
>  	if (memcg_kmem_test_and_clear_dead(memcg))
> @@ -5351,9 +5414,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>   */
>  struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  {
> -	if (!memcg->res.parent)
> +	if (!memcg->memory.parent)
>  		return NULL;
> -	return mem_cgroup_from_res_counter(memcg->res.parent, res);
> +	return mem_cgroup_from_counter(memcg->memory.parent, memory);
>  }
>  EXPORT_SYMBOL(parent_mem_cgroup);
>  
> @@ -5398,9 +5461,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	/* root ? */
>  	if (parent_css == NULL) {
>  		root_mem_cgroup = memcg;
> -		res_counter_init(&memcg->res, NULL);
> -		res_counter_init(&memcg->memsw, NULL);
> -		res_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->memory, NULL);
> +		page_counter_init(&memcg->memsw, NULL);
> +		page_counter_init(&memcg->kmem, NULL);
>  	}
>  
>  	memcg->last_scanned_node = MAX_NUMNODES;
> @@ -5438,18 +5501,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	memcg->swappiness = mem_cgroup_swappiness(parent);
>  
>  	if (parent->use_hierarchy) {
> -		res_counter_init(&memcg->res, &parent->res);
> -		res_counter_init(&memcg->memsw, &parent->memsw);
> -		res_counter_init(&memcg->kmem, &parent->kmem);
> +		page_counter_init(&memcg->memory, &parent->memory);
> +		page_counter_init(&memcg->memsw, &parent->memsw);
> +		page_counter_init(&memcg->kmem, &parent->kmem);
>  
>  		/*
>  		 * No need to take a reference to the parent because cgroup
>  		 * core guarantees its existence.
>  		 */
>  	} else {
> -		res_counter_init(&memcg->res, NULL);
> -		res_counter_init(&memcg->memsw, NULL);
> -		res_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->memory, NULL);
> +		page_counter_init(&memcg->memsw, NULL);
> +		page_counter_init(&memcg->kmem, NULL);
>  		/*
>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>  		 * much sense so let cgroup subsystem know about this
> @@ -5520,7 +5583,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	/*
>  	 * XXX: css_offline() would be where we should reparent all
>  	 * memory to prepare the cgroup for destruction.  However,
> -	 * memcg does not do css_tryget_online() and res_counter charging
> +	 * memcg does not do css_tryget_online() and page_counter charging
>  	 * under the same RCU lock region, which means that charging
>  	 * could race with offlining.  Offlining only happens to
>  	 * cgroups with no tasks in them but charges can show up
> @@ -5540,7 +5603,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	 * call_rcu()
>  	 *   offline_css()
>  	 *     reparent_charges()
> -	 *                           res_counter_charge()
> +	 *                           page_counter_charge()
>  	 *                           css_put()
>  	 *                             css_free()
>  	 *                           pc->mem_cgroup = dead memcg
> @@ -5575,10 +5638,10 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  
> -	mem_cgroup_resize_limit(memcg, ULLONG_MAX);
> -	mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX);
> -	memcg_update_kmem_limit(memcg, ULLONG_MAX);
> -	res_counter_set_soft_limit(&memcg->res, ULLONG_MAX);
> +	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
> +	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
> +	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
> +	memcg->soft_limit = 0;
>  }
>  
>  #ifdef CONFIG_MMU
> @@ -5892,19 +5955,18 @@ static void __mem_cgroup_clear_mc(void)
>  	if (mc.moved_swap) {
>  		/* uncharge swap account from the old cgroup */
>  		if (!mem_cgroup_is_root(mc.from))
> -			res_counter_uncharge(&mc.from->memsw,
> -					     PAGE_SIZE * mc.moved_swap);
> -
> -		for (i = 0; i < mc.moved_swap; i++)
> -			css_put(&mc.from->css);
> +			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
>  
>  		/*
> -		 * we charged both to->res and to->memsw, so we should
> -		 * uncharge to->res.
> +		 * we charged both to->memory and to->memsw, so we
> +		 * should uncharge to->memory.
>  		 */
>  		if (!mem_cgroup_is_root(mc.to))
> -			res_counter_uncharge(&mc.to->res,
> -					     PAGE_SIZE * mc.moved_swap);
> +			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
> +
> +		for (i = 0; i < mc.moved_swap; i++)
> +			css_put(&mc.from->css);
> +
>  		/* we've already done css_get(mc.to) */
>  		mc.moved_swap = 0;
>  	}
> @@ -6270,7 +6332,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
>  	memcg = mem_cgroup_lookup(id);
>  	if (memcg) {
>  		if (!mem_cgroup_is_root(memcg))
> -			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> +			page_counter_uncharge(&memcg->memsw, 1);
>  		mem_cgroup_swap_statistics(memcg, false);
>  		css_put(&memcg->css);
>  	}
> @@ -6436,11 +6498,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>  
>  	if (!mem_cgroup_is_root(memcg)) {
>  		if (nr_mem)
> -			res_counter_uncharge(&memcg->res,
> -					     nr_mem * PAGE_SIZE);
> +			page_counter_uncharge(&memcg->memory, nr_mem);
>  		if (nr_memsw)
> -			res_counter_uncharge(&memcg->memsw,
> -					     nr_memsw * PAGE_SIZE);
> +			page_counter_uncharge(&memcg->memsw, nr_memsw);
>  		memcg_oom_recover(memcg);
>  	}
>  
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index 1d191357bf88..9a448bdb19e9 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
> @@ -9,13 +9,13 @@
>  int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  {
>  	/*
> -	 * The root cgroup does not use res_counters, but rather,
> +	 * The root cgroup does not use page_counters, but rather,
>  	 * rely on the data already collected by the network
>  	 * subsystem
>  	 */
> -	struct res_counter *res_parent = NULL;
> -	struct cg_proto *cg_proto, *parent_cg;
>  	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> +	struct page_counter *counter_parent = NULL;
> +	struct cg_proto *cg_proto, *parent_cg;
>  
>  	cg_proto = tcp_prot.proto_cgroup(memcg);
>  	if (!cg_proto)
> @@ -29,9 +29,9 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  
>  	parent_cg = tcp_prot.proto_cgroup(parent);
>  	if (parent_cg)
> -		res_parent = &parent_cg->memory_allocated;
> +		counter_parent = &parent_cg->memory_allocated;
>  
> -	res_counter_init(&cg_proto->memory_allocated, res_parent);
> +	page_counter_init(&cg_proto->memory_allocated, counter_parent);
>  	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
>  
>  	return 0;
> @@ -50,7 +50,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
>  }
>  EXPORT_SYMBOL(tcp_destroy_cgroup);
>  
> -static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
> +static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
>  {
>  	struct cg_proto *cg_proto;
>  	int i;
> @@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
>  	if (!cg_proto)
>  		return -EINVAL;
>  
> -	if (val > RES_COUNTER_MAX)
> -		val = RES_COUNTER_MAX;
> -
> -	ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
> +	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
>  	if (ret)
>  		return ret;
>  
>  	for (i = 0; i < 3; i++)
> -		cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
> +		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
>  						sysctl_tcp_mem[i]);
>  
> -	if (val == RES_COUNTER_MAX)
> +	if (nr_pages == ULONG_MAX / PAGE_SIZE)
>  		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> -	else if (val != RES_COUNTER_MAX) {
> +	else {
>  		/*
>  		 * The active bit needs to be written after the static_key
>  		 * update. This is what guarantees that the socket activation
> @@ -102,11 +99,18 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
>  	return 0;
>  }
>  
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +};
> +
>  static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	unsigned long long val;
> +	unsigned long nr_pages;
>  	int ret = 0;
>  
>  	buf = strstrip(buf);
> @@ -114,10 +118,10 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  	switch (of_cft(of)->private) {
>  	case RES_LIMIT:
>  		/* see memcontrol.c */
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> +		ret = page_counter_memparse(buf, &nr_pages);
>  		if (ret)
>  			break;
> -		ret = tcp_update_limit(memcg, val);
> +		ret = tcp_update_limit(memcg, nr_pages);
>  		break;
>  	default:
>  		ret = -EINVAL;
> @@ -126,43 +130,35 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  	return ret ?: nbytes;
>  }
>  
> -static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return default_val;
> -
> -	return res_counter_read_u64(&cg_proto->memory_allocated, type);
> -}
> -
> -static u64 tcp_read_usage(struct mem_cgroup *memcg)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
> -
> -	return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
> -}
> -
>  static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
>  	u64 val;
>  
>  	switch (cft->private) {
>  	case RES_LIMIT:
> -		val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
> +		if (!cg_proto)
> +			return PAGE_COUNTER_MAX;
> +		val = cg_proto->memory_allocated.limit;
> +		val *= PAGE_SIZE;
>  		break;
>  	case RES_USAGE:
> -		val = tcp_read_usage(memcg);
> +		if (!cg_proto)
> +			return atomic_long_read(&tcp_memory_allocated);
> +		val = atomic_long_read(&cg_proto->memory_allocated.count);
> +		val *= PAGE_SIZE;
>  		break;
>  	case RES_FAILCNT:
> +		if (!cg_proto)
> +			return 0;
> +		val = cg_proto->memory_allocated.limited;
> +		break;
>  	case RES_MAX_USAGE:
> -		val = tcp_read_stat(memcg, cft->private, 0);
> +		if (!cg_proto)
> +			return 0;
> +		val = cg_proto->memory_allocated.watermark;
> +		val *= PAGE_SIZE;
>  		break;
>  	default:
>  		BUG();
> @@ -183,10 +179,11 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
>  
>  	switch (of_cft(of)->private) {
>  	case RES_MAX_USAGE:
> -		res_counter_reset_max(&cg_proto->memory_allocated);
> +		cg_proto->memory_allocated.watermark =
> +			atomic_long_read(&cg_proto->memory_allocated.count);
>  		break;
>  	case RES_FAILCNT:
> -		res_counter_reset_failcnt(&cg_proto->memory_allocated);
> +		cg_proto->memory_allocated.limited = 0;
>  		break;
>  	}
>  
> -- 
> 2.1.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
@ 2014-09-22 14:44   ` Michal Hocko
  0 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-22 14:44 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Fri 19-09-14 09:22:08, Johannes Weiner wrote:
> Memory is internally accounted in bytes, using spinlock-protected
> 64-bit counters, even though the smallest accounting delta is a page.
> The counter interface is also convoluted and does too many things.
> 
> Introduce a new lockless word-sized page counter API, then change all
> memory accounting over to it and remove the old one.  The translation
> from and to bytes then only happens when interfacing with userspace.

Dunno why, but I thought other controllers used res_counter as well.
That doesn't seem to be the case, though, so this is a perfectly
reasonable way forward.

I have only glanced through the patch and it mostly seems good to me
(I still have to look more closely at the atomicity of the hierarchical
operations).
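
Regarding the hierarchical part, the charge path in the patch is a
per-level cmpxchg loop, so the hierarchy is never updated as one atomic
unit; a child can briefly carry a charge that an ancestor then rejects
before the partial charges are unwound. A minimal userspace model of
that structure (C11 atomics and simplified names, not the kernel code
itself):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct pc {
        atomic_long count;
        long limit;
        struct pc *parent;
};

/*
 * Model of page_counter_charge(): each level runs its own cmpxchg
 * retry loop, so the tree as a whole is not charged atomically.
 */
static bool pc_try_charge(struct pc *counter, long nr_pages, struct pc **fail)
{
        struct pc *c;

        for (c = counter; c; c = c->parent) {
                long old = atomic_load(&c->count);

                for (;;) {
                        long new = old + nr_pages;

                        if (new > c->limit) {
                                *fail = c;
                                goto unwind;
                        }
                        /* retry is per level; already-charged levels keep their charge */
                        if (atomic_compare_exchange_weak(&c->count, &old, new))
                                break;
                }
        }
        return true;

unwind:
        /* undo the charges below the level that rejected us */
        for (c = counter; c != *fail; c = c->parent)
                atomic_fetch_sub(&c->count, nr_pages);
        return false;
}

int main(void)
{
        struct pc root = { .limit = 4 };
        struct pc child = { .limit = 8, .parent = &root };
        struct pc *fail = NULL;

        /* 5 pages fit under the child's limit but not under the root's */
        if (!pc_try_charge(&child, 5, &fail))
                printf("charge rejected by counter with limit %ld\n", fail->limit);
        printf("child=%ld root=%ld\n",
               atomic_load(&child.count), atomic_load(&root.count));
        return 0;
}

Each level on its own stays consistent thanks to the retry loop; it is
the cross-level window between charge and unwind that I want to stare
at some more.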

Nevertheless I think that the counter should live outside of memcg (it
is ugly and bad in general to make the HUGETLB controller depend on
MEMCG just to have a counter). If you made kernel/page_counter.c and
let both controllers select CONFIG_PAGE_COUNTER, then you would not
need a dependency on MEMCG and I would find it cleaner in general.
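
Concretely, the interface you add to memcontrol.h is already
self-contained, so the split should be mostly mechanical: the struct
and prototypes move into their own header, the implementation into
kernel/page_counter.c, and MEMCG resp. CGROUP_HUGETLB select
CONFIG_PAGE_COUNTER instead of CGROUP_HUGETLB depending on MEMCG. A
sketch of the header (hypothetical include/linux/page_counter.h,
declarations copied from this patch, includes approximate):

/* hypothetical include/linux/page_counter.h, carved out of this patch */
#ifndef _LINUX_PAGE_COUNTER_H
#define _LINUX_PAGE_COUNTER_H

#include <linux/atomic.h>
#include <linux/kernel.h>

struct page_counter {
        atomic_long_t count;
        unsigned long limit;
        struct page_counter *parent;

        /* legacy */
        unsigned long watermark;
        unsigned long limited;
};

#if BITS_PER_LONG == 32
#define PAGE_COUNTER_MAX ULONG_MAX
#else
#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
#endif

static inline void page_counter_init(struct page_counter *counter,
                                     struct page_counter *parent)
{
        atomic_long_set(&counter->count, 0);
        counter->limit = PAGE_COUNTER_MAX;
        counter->parent = parent;
}

int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
                        struct page_counter **fail);
int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
int page_counter_limit(struct page_counter *counter, unsigned long limit);
int page_counter_memparse(const char *buf, unsigned long *nr_pages);

#endif /* _LINUX_PAGE_COUNTER_H */

The hugetlb controller would then only need this header and would
never have to touch memcontrol.h at all.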

> Aside from the locking costs, this gets rid of the icky unsigned long
> long types in the very heart of memcg, which is great for 32 bit and
> also makes the code a lot more readable.

Definitely. Nice work!

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/cgroups/hugetlb.txt          |   2 +-
>  Documentation/cgroups/memory.txt           |   4 +-
>  Documentation/cgroups/resource_counter.txt | 197 --------
>  include/linux/hugetlb_cgroup.h             |   1 -
>  include/linux/memcontrol.h                 |  37 +-
>  include/linux/res_counter.h                | 223 ---------
>  include/net/sock.h                         |  25 +-
>  init/Kconfig                               |   9 +-
>  kernel/Makefile                            |   1 -
>  kernel/res_counter.c                       | 211 --------
>  mm/hugetlb_cgroup.c                        | 100 ++--
>  mm/memcontrol.c                            | 740 ++++++++++++++++-------------
>  net/ipv4/tcp_memcontrol.c                  |  83 ++--
>  13 files changed, 541 insertions(+), 1092 deletions(-)
>  delete mode 100644 Documentation/cgroups/resource_counter.txt
>  delete mode 100644 include/linux/res_counter.h
>  delete mode 100644 kernel/res_counter.c
> 
> diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
> index a9faaca1f029..106245c3aecc 100644
> --- a/Documentation/cgroups/hugetlb.txt
> +++ b/Documentation/cgroups/hugetlb.txt
> @@ -29,7 +29,7 @@ Brief summary of control files
>  
>   hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
>   hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
> - hugetlb.<hugepagesize>.usage_in_bytes     # show current res_counter usage for "hugepagesize" hugetlb
> + hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
>   hugetlb.<hugepagesize>.failcnt		   # show the number of allocation failure due to HugeTLB limit
>  
>  For a system supporting two hugepage size (16M and 16G) the control
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 02ab997a1ed2..f624727ab404 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -52,9 +52,9 @@ Brief summary of control files.
>   tasks				 # attach a task(thread) and show list of threads
>   cgroup.procs			 # show list of processes
>   cgroup.event_control		 # an interface for event_fd()
> - memory.usage_in_bytes		 # show current res_counter usage for memory
> + memory.usage_in_bytes		 # show current usage for memory
>  				 (See 5.5 for details)
> - memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
> + memory.memsw.usage_in_bytes	 # show current usage for memory+Swap
>  				 (See 5.5 for details)
>   memory.limit_in_bytes		 # set/show limit of memory usage
>   memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> deleted file mode 100644
> index 762ca54eb929..000000000000
> --- a/Documentation/cgroups/resource_counter.txt
> +++ /dev/null
> @@ -1,197 +0,0 @@
> -
> -		The Resource Counter
> -
> -The resource counter, declared at include/linux/res_counter.h,
> -is supposed to facilitate the resource management by controllers
> -by providing common stuff for accounting.
> -
> -This "stuff" includes the res_counter structure and routines
> -to work with it.
> -
> -
> -
> -1. Crucial parts of the res_counter structure
> -
> - a. unsigned long long usage
> -
> - 	The usage value shows the amount of a resource that is consumed
> -	by a group at a given time. The units of measurement should be
> -	determined by the controller that uses this counter. E.g. it can
> -	be bytes, items or any other unit the controller operates on.
> -
> - b. unsigned long long max_usage
> -
> - 	The maximal value of the usage over time.
> -
> - 	This value is useful when gathering statistical information about
> -	the particular group, as it shows the actual resource requirements
> -	for a particular group, not just some usage snapshot.
> -
> - c. unsigned long long limit
> -
> - 	The maximal allowed amount of resource to consume by the group. In
> -	case the group requests for more resources, so that the usage value
> -	would exceed the limit, the resource allocation is rejected (see
> -	the next section).
> -
> - d. unsigned long long failcnt
> -
> - 	The failcnt stands for "failures counter". This is the number of
> -	resource allocation attempts that failed.
> -
> - c. spinlock_t lock
> -
> - 	Protects changes of the above values.
> -
> -
> -
> -2. Basic accounting routines
> -
> - a. void res_counter_init(struct res_counter *rc,
> -				struct res_counter *rc_parent)
> -
> - 	Initializes the resource counter. As usual, should be the first
> -	routine called for a new counter.
> -
> -	The struct res_counter *parent can be used to define a hierarchical
> -	child -> parent relationship directly in the res_counter structure,
> -	NULL can be used to define no relationship.
> -
> - c. int res_counter_charge(struct res_counter *rc, unsigned long val,
> -				struct res_counter **limit_fail_at)
> -
> -	When a resource is about to be allocated it has to be accounted
> -	with the appropriate resource counter (controller should determine
> -	which one to use on its own). This operation is called "charging".
> -
> -	This is not very important which operation - resource allocation
> -	or charging - is performed first, but
> -	  * if the allocation is performed first, this may create a
> -	    temporary resource over-usage by the time resource counter is
> -	    charged;
> -	  * if the charging is performed first, then it should be uncharged
> -	    on error path (if the one is called).
> -
> -	If the charging fails and a hierarchical dependency exists, the
> -	limit_fail_at parameter is set to the particular res_counter element
> -	where the charging failed.
> -
> - d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
> -
> -	When a resource is released (freed) it should be de-accounted
> -	from the resource counter it was accounted to.  This is called
> -	"uncharging". The return value of this function indicate the amount
> -	of charges still present in the counter.
> -
> -	The _locked routines imply that the res_counter->lock is taken.
> -
> - e. u64 res_counter_uncharge_until
> -		(struct res_counter *rc, struct res_counter *top,
> -		 unsigned long val)
> -
> -	Almost same as res_counter_uncharge() but propagation of uncharge
> -	stops when rc == top. This is useful when kill a res_counter in
> -	child cgroup.
> -
> - 2.1 Other accounting routines
> -
> -    There are more routines that may help you with common needs, like
> -    checking whether the limit is reached or resetting the max_usage
> -    value. They are all declared in include/linux/res_counter.h.
> -
> -
> -
> -3. Analyzing the resource counter registrations
> -
> - a. If the failcnt value constantly grows, this means that the counter's
> -    limit is too tight. Either the group is misbehaving and consumes too
> -    many resources, or the configuration is not suitable for the group
> -    and the limit should be increased.
> -
> - b. The max_usage value can be used to quickly tune the group. One may
> -    set the limits to maximal values and either load the container with
> -    a common pattern or leave one for a while. After this the max_usage
> -    value shows the amount of memory the container would require during
> -    its common activity.
> -
> -    Setting the limit a bit above this value gives a pretty good
> -    configuration that works in most of the cases.
> -
> - c. If the max_usage is much less than the limit, but the failcnt value
> -    is growing, then the group tries to allocate a big chunk of resource
> -    at once.
> -
> - d. If the max_usage is much less than the limit, but the failcnt value
> -    is 0, then this group is given too high limit, that it does not
> -    require. It is better to lower the limit a bit leaving more resource
> -    for other groups.
> -
> -
> -
> -4. Communication with the control groups subsystem (cgroups)
> -
> -All the resource controllers that are using cgroups and resource counters
> -should provide files (in the cgroup filesystem) to work with the resource
> -counter fields. They are recommended to adhere to the following rules:
> -
> - a. File names
> -
> - 	Field name	File name
> -	---------------------------------------------------
> -	usage		usage_in_<unit_of_measurement>
> -	max_usage	max_usage_in_<unit_of_measurement>
> -	limit		limit_in_<unit_of_measurement>
> -	failcnt		failcnt
> -	lock		no file :)
> -
> - b. Reading from file should show the corresponding field value in the
> -    appropriate format.
> -
> - c. Writing to file
> -
> - 	Field		Expected behavior
> -	----------------------------------
> -	usage		prohibited
> -	max_usage	reset to usage
> -	limit		set the limit
> -	failcnt		reset to zero
> -
> -
> -
> -5. Usage example
> -
> - a. Declare a task group (take a look at cgroups subsystem for this) and
> -    fold a res_counter into it
> -
> -	struct my_group {
> -		struct res_counter res;
> -
> -		<other fields>
> -	}
> -
> - b. Put hooks in resource allocation/release paths
> -
> - 	int alloc_something(...)
> -	{
> -		if (res_counter_charge(res_counter_ptr, amount) < 0)
> -			return -ENOMEM;
> -
> -		<allocate the resource and return to the caller>
> -	}
> -
> -	void release_something(...)
> -	{
> -		res_counter_uncharge(res_counter_ptr, amount);
> -
> -		<release the resource>
> -	}
> -
> -    In order to keep the usage value self-consistent, both the
> -    "res_counter_ptr" and the "amount" in release_something() should be
> -    the same as they were in the alloc_something() when the releasing
> -    resource was allocated.
> -
> - c. Provide the way to read res_counter values and set them (the cgroups
> -    still can help with it).
> -
> - c. Compile and run :)
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 0129f89cf98d..bcc853eccc85 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -16,7 +16,6 @@
>  #define _LINUX_HUGETLB_CGROUP_H
>  
>  #include <linux/mmdebug.h>
> -#include <linux/res_counter.h>
>  
>  struct hugetlb_cgroup;
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 19df5d857411..bf8fb1a05597 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> +
> +struct page_counter {
> +	atomic_long_t count;
> +	unsigned long limit;
> +	struct page_counter *parent;
> +
> +	/* legacy */
> +	unsigned long watermark;
> +	unsigned long limited;
> +};
> +
> +#if BITS_PER_LONG == 32
> +#define PAGE_COUNTER_MAX ULONG_MAX
> +#else
> +#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
> +#endif
> +
> +static inline void page_counter_init(struct page_counter *counter,
> +				     struct page_counter *parent)
> +{
> +	atomic_long_set(&counter->count, 0);
> +	counter->limit = PAGE_COUNTER_MAX;
> +	counter->parent = parent;
> +}
> +
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> +			struct page_counter **fail);
> +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> +int page_counter_memparse(const char *buf, unsigned long *nr_pages);
> +
>  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
>  void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> @@ -471,9 +503,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
>  	/*
>  	 * __GFP_NOFAIL allocations will move on even if charging is not
>  	 * possible. Therefore we don't even try, and have this allocation
> -	 * unaccounted. We could in theory charge it with
> -	 * res_counter_charge_nofail, but we hope those allocations are rare,
> -	 * and won't be worth the trouble.
> +	 * unaccounted. We could in theory charge it forcibly, but we hope
> +	 * those allocations are rare, and won't be worth the trouble.
>  	 */
>  	if (gfp & __GFP_NOFAIL)
>  		return true;
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> deleted file mode 100644
> index 56b7bc32db4f..000000000000
> --- a/include/linux/res_counter.h
> +++ /dev/null
> @@ -1,223 +0,0 @@
> -#ifndef __RES_COUNTER_H__
> -#define __RES_COUNTER_H__
> -
> -/*
> - * Resource Counters
> - * Contain common data types and routines for resource accounting
> - *
> - * Copyright 2007 OpenVZ SWsoft Inc
> - *
> - * Author: Pavel Emelianov <xemul@openvz.org>
> - *
> - * See Documentation/cgroups/resource_counter.txt for more
> - * info about what this counter is.
> - */
> -
> -#include <linux/spinlock.h>
> -#include <linux/errno.h>
> -
> -/*
> - * The core object. the cgroup that wishes to account for some
> - * resource may include this counter into its structures and use
> - * the helpers described beyond
> - */
> -
> -struct res_counter {
> -	/*
> -	 * the current resource consumption level
> -	 */
> -	unsigned long long usage;
> -	/*
> -	 * the maximal value of the usage from the counter creation
> -	 */
> -	unsigned long long max_usage;
> -	/*
> -	 * the limit that usage cannot exceed
> -	 */
> -	unsigned long long limit;
> -	/*
> -	 * the limit that usage can be exceed
> -	 */
> -	unsigned long long soft_limit;
> -	/*
> -	 * the number of unsuccessful attempts to consume the resource
> -	 */
> -	unsigned long long failcnt;
> -	/*
> -	 * the lock to protect all of the above.
> -	 * the routines below consider this to be IRQ-safe
> -	 */
> -	spinlock_t lock;
> -	/*
> -	 * Parent counter, used for hierarchial resource accounting
> -	 */
> -	struct res_counter *parent;
> -};
> -
> -#define RES_COUNTER_MAX ULLONG_MAX
> -
> -/**
> - * Helpers to interact with userspace
> - * res_counter_read_u64() - returns the value of the specified member.
> - * res_counter_read/_write - put/get the specified fields from the
> - * res_counter struct to/from the user
> - *
> - * @counter:     the counter in question
> - * @member:  the field to work with (see RES_xxx below)
> - * @buf:     the buffer to opeate on,...
> - * @nbytes:  its size...
> - * @pos:     and the offset.
> - */
> -
> -u64 res_counter_read_u64(struct res_counter *counter, int member);
> -
> -ssize_t res_counter_read(struct res_counter *counter, int member,
> -		const char __user *buf, size_t nbytes, loff_t *pos,
> -		int (*read_strategy)(unsigned long long val, char *s));
> -
> -int res_counter_memparse_write_strategy(const char *buf,
> -					unsigned long long *res);
> -
> -/*
> - * the field descriptors. one for each member of res_counter
> - */
> -
> -enum {
> -	RES_USAGE,
> -	RES_MAX_USAGE,
> -	RES_LIMIT,
> -	RES_FAILCNT,
> -	RES_SOFT_LIMIT,
> -};
> -
> -/*
> - * helpers for accounting
> - */
> -
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> -
> -/*
> - * charge - try to consume more resource.
> - *
> - * @counter: the counter
> - * @val: the amount of the resource. each controller defines its own
> - *       units, e.g. numbers, bytes, Kbytes, etc
> - *
> - * returns 0 on success and <0 if the counter->usage will exceed the
> - * counter->limit
> - *
> - * charge_nofail works the same, except that it charges the resource
> - * counter unconditionally, and returns < 0 if the after the current
> - * charge we are over limit.
> - */
> -
> -int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> -int res_counter_charge_nofail(struct res_counter *counter,
> -		unsigned long val, struct res_counter **limit_fail_at);
> -
> -/*
> - * uncharge - tell that some portion of the resource is released
> - *
> - * @counter: the counter
> - * @val: the amount of the resource
> - *
> - * these calls check for usage underflow and show a warning on the console
> - *
> - * returns the total charges still present in @counter.
> - */
> -
> -u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
> -
> -u64 res_counter_uncharge_until(struct res_counter *counter,
> -			       struct res_counter *top,
> -			       unsigned long val);
> -/**
> - * res_counter_margin - calculate chargeable space of a counter
> - * @cnt: the counter
> - *
> - * Returns the difference between the hard limit and the current usage
> - * of resource counter @cnt.
> - */
> -static inline unsigned long long res_counter_margin(struct res_counter *cnt)
> -{
> -	unsigned long long margin;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->limit > cnt->usage)
> -		margin = cnt->limit - cnt->usage;
> -	else
> -		margin = 0;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return margin;
> -}
> -
> -/**
> - * Get the difference between the usage and the soft limit
> - * @cnt: The counter
> - *
> - * Returns 0 if usage is less than or equal to soft limit
> - * The difference between usage and soft limit, otherwise.
> - */
> -static inline unsigned long long
> -res_counter_soft_limit_excess(struct res_counter *cnt)
> -{
> -	unsigned long long excess;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= cnt->soft_limit)
> -		excess = 0;
> -	else
> -		excess = cnt->usage - cnt->soft_limit;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return excess;
> -}
> -
> -static inline void res_counter_reset_max(struct res_counter *cnt)
> -{
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->max_usage = cnt->usage;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -}
> -
> -static inline void res_counter_reset_failcnt(struct res_counter *cnt)
> -{
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->failcnt = 0;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -}
> -
> -static inline int res_counter_set_limit(struct res_counter *cnt,
> -		unsigned long long limit)
> -{
> -	unsigned long flags;
> -	int ret = -EBUSY;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= limit) {
> -		cnt->limit = limit;
> -		ret = 0;
> -	}
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return ret;
> -}
> -
> -static inline int
> -res_counter_set_soft_limit(struct res_counter *cnt,
> -				unsigned long long soft_limit)
> -{
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->soft_limit = soft_limit;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> -	return 0;
> -}
> -
> -#endif
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 515a4d01e932..f41749982668 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -55,7 +55,6 @@
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
>  #include <linux/memcontrol.h>
> -#include <linux/res_counter.h>
>  #include <linux/static_key.h>
>  #include <linux/aio.h>
>  #include <linux/sched.h>
> @@ -1066,7 +1065,7 @@ enum cg_proto_flags {
>  };
>  
>  struct cg_proto {
> -	struct res_counter	memory_allocated;	/* Current allocated memory. */
> +	struct page_counter	memory_allocated;	/* Current allocated memory. */
>  	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
>  	int			memory_pressure;
>  	long			sysctl_mem[3];
> @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
>  					      unsigned long amt,
>  					      int *parent_status)
>  {
> -	struct res_counter *fail;
> -	int ret;
> +	page_counter_charge(&prot->memory_allocated, amt, NULL);
>  
> -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> -					amt << PAGE_SHIFT, &fail);
> -	if (ret < 0)
> +	if (atomic_long_read(&prot->memory_allocated.count) >
> +	    prot->memory_allocated.limit)
>  		*parent_status = OVER_LIMIT;
>  }
>  
>  static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
>  					      unsigned long amt)
>  {
> -	res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT);
> -}
> -
> -static inline u64 memcg_memory_allocated_read(struct cg_proto *prot)
> -{
> -	u64 ret;
> -	ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE);
> -	return ret >> PAGE_SHIFT;
> +	page_counter_uncharge(&prot->memory_allocated, amt);
>  }
>  
>  static inline long
>  sk_memory_allocated(const struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
> +
>  	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return memcg_memory_allocated_read(sk->sk_cgrp);
> +		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
>  
>  	return atomic_long_read(prot->memory_allocated);
>  }
> @@ -1259,7 +1250,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
>  		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
>  		/* update the root cgroup regardless */
>  		atomic_long_add_return(amt, prot->memory_allocated);
> -		return memcg_memory_allocated_read(sk->sk_cgrp);
> +		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
>  	}
>  
>  	return atomic_long_add_return(amt, prot->memory_allocated);
> diff --git a/init/Kconfig b/init/Kconfig
> index 0471be99ec38..1cf42b563834 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -975,15 +975,8 @@ config CGROUP_CPUACCT
>  	  Provides a simple Resource Controller for monitoring the
>  	  total CPU consumed by the tasks in a cgroup.
>  
> -config RESOURCE_COUNTERS
> -	bool "Resource counters"
> -	help
> -	  This option enables controller independent resource accounting
> -	  infrastructure that works with cgroups.
> -
>  config MEMCG
>  	bool "Memory Resource Controller for Control Groups"
> -	depends on RESOURCE_COUNTERS
>  	select EVENTFD
>  	help
>  	  Provides a memory resource controller that manages both anonymous
> @@ -1051,7 +1044,7 @@ config MEMCG_KMEM
>  
>  config CGROUP_HUGETLB
>  	bool "HugeTLB Resource Controller for Control Groups"
> -	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
> +	depends on MEMCG && HUGETLB_PAGE
>  	default n
>  	help
>  	  Provides a cgroup Resource Controller for HugeTLB pages.
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 726e18443da0..245953354974 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -58,7 +58,6 @@ obj-$(CONFIG_USER_NS) += user_namespace.o
>  obj-$(CONFIG_PID_NS) += pid_namespace.o
>  obj-$(CONFIG_DEBUG_SYNCHRO_TEST) += synchro-test.o
>  obj-$(CONFIG_IKCONFIG) += configs.o
> -obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o
>  obj-$(CONFIG_SMP) += stop_machine.o
>  obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
>  obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> deleted file mode 100644
> index e791130f85a7..000000000000
> --- a/kernel/res_counter.c
> +++ /dev/null
> @@ -1,211 +0,0 @@
> -/*
> - * resource cgroups
> - *
> - * Copyright 2007 OpenVZ SWsoft Inc
> - *
> - * Author: Pavel Emelianov <xemul@openvz.org>
> - *
> - */
> -
> -#include <linux/types.h>
> -#include <linux/parser.h>
> -#include <linux/fs.h>
> -#include <linux/res_counter.h>
> -#include <linux/uaccess.h>
> -#include <linux/mm.h>
> -
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent)
> -{
> -	spin_lock_init(&counter->lock);
> -	counter->limit = RES_COUNTER_MAX;
> -	counter->soft_limit = RES_COUNTER_MAX;
> -	counter->parent = parent;
> -}
> -
> -static u64 res_counter_uncharge_locked(struct res_counter *counter,
> -				       unsigned long val)
> -{
> -	if (WARN_ON(counter->usage < val))
> -		val = counter->usage;
> -
> -	counter->usage -= val;
> -	return counter->usage;
> -}
> -
> -static int res_counter_charge_locked(struct res_counter *counter,
> -				     unsigned long val, bool force)
> -{
> -	int ret = 0;
> -
> -	if (counter->usage + val > counter->limit) {
> -		counter->failcnt++;
> -		ret = -ENOMEM;
> -		if (!force)
> -			return ret;
> -	}
> -
> -	counter->usage += val;
> -	if (counter->usage > counter->max_usage)
> -		counter->max_usage = counter->usage;
> -	return ret;
> -}
> -
> -static int __res_counter_charge(struct res_counter *counter, unsigned long val,
> -				struct res_counter **limit_fail_at, bool force)
> -{
> -	int ret, r;
> -	unsigned long flags;
> -	struct res_counter *c, *u;
> -
> -	r = ret = 0;
> -	*limit_fail_at = NULL;
> -	local_irq_save(flags);
> -	for (c = counter; c != NULL; c = c->parent) {
> -		spin_lock(&c->lock);
> -		r = res_counter_charge_locked(c, val, force);
> -		spin_unlock(&c->lock);
> -		if (r < 0 && !ret) {
> -			ret = r;
> -			*limit_fail_at = c;
> -			if (!force)
> -				break;
> -		}
> -	}
> -
> -	if (ret < 0 && !force) {
> -		for (u = counter; u != c; u = u->parent) {
> -			spin_lock(&u->lock);
> -			res_counter_uncharge_locked(u, val);
> -			spin_unlock(&u->lock);
> -		}
> -	}
> -	local_irq_restore(flags);
> -
> -	return ret;
> -}
> -
> -int res_counter_charge(struct res_counter *counter, unsigned long val,
> -			struct res_counter **limit_fail_at)
> -{
> -	return __res_counter_charge(counter, val, limit_fail_at, false);
> -}
> -
> -int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
> -			      struct res_counter **limit_fail_at)
> -{
> -	return __res_counter_charge(counter, val, limit_fail_at, true);
> -}
> -
> -u64 res_counter_uncharge_until(struct res_counter *counter,
> -			       struct res_counter *top,
> -			       unsigned long val)
> -{
> -	unsigned long flags;
> -	struct res_counter *c;
> -	u64 ret = 0;
> -
> -	local_irq_save(flags);
> -	for (c = counter; c != top; c = c->parent) {
> -		u64 r;
> -		spin_lock(&c->lock);
> -		r = res_counter_uncharge_locked(c, val);
> -		if (c == counter)
> -			ret = r;
> -		spin_unlock(&c->lock);
> -	}
> -	local_irq_restore(flags);
> -	return ret;
> -}
> -
> -u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
> -{
> -	return res_counter_uncharge_until(counter, NULL, val);
> -}
> -
> -static inline unsigned long long *
> -res_counter_member(struct res_counter *counter, int member)
> -{
> -	switch (member) {
> -	case RES_USAGE:
> -		return &counter->usage;
> -	case RES_MAX_USAGE:
> -		return &counter->max_usage;
> -	case RES_LIMIT:
> -		return &counter->limit;
> -	case RES_FAILCNT:
> -		return &counter->failcnt;
> -	case RES_SOFT_LIMIT:
> -		return &counter->soft_limit;
> -	};
> -
> -	BUG();
> -	return NULL;
> -}
> -
> -ssize_t res_counter_read(struct res_counter *counter, int member,
> -		const char __user *userbuf, size_t nbytes, loff_t *pos,
> -		int (*read_strategy)(unsigned long long val, char *st_buf))
> -{
> -	unsigned long long *val;
> -	char buf[64], *s;
> -
> -	s = buf;
> -	val = res_counter_member(counter, member);
> -	if (read_strategy)
> -		s += read_strategy(*val, s);
> -	else
> -		s += sprintf(s, "%llu\n", *val);
> -	return simple_read_from_buffer((void __user *)userbuf, nbytes,
> -			pos, buf, s - buf);
> -}
> -
> -#if BITS_PER_LONG == 32
> -u64 res_counter_read_u64(struct res_counter *counter, int member)
> -{
> -	unsigned long flags;
> -	u64 ret;
> -
> -	spin_lock_irqsave(&counter->lock, flags);
> -	ret = *res_counter_member(counter, member);
> -	spin_unlock_irqrestore(&counter->lock, flags);
> -
> -	return ret;
> -}
> -#else
> -u64 res_counter_read_u64(struct res_counter *counter, int member)
> -{
> -	return *res_counter_member(counter, member);
> -}
> -#endif
> -
> -int res_counter_memparse_write_strategy(const char *buf,
> -					unsigned long long *resp)
> -{
> -	char *end;
> -	unsigned long long res;
> -
> -	/* return RES_COUNTER_MAX(unlimited) if "-1" is specified */
> -	if (*buf == '-') {
> -		int rc = kstrtoull(buf + 1, 10, &res);
> -
> -		if (rc)
> -			return rc;
> -		if (res != 1)
> -			return -EINVAL;
> -		*resp = RES_COUNTER_MAX;
> -		return 0;
> -	}
> -
> -	res = memparse(buf, &end);
> -	if (*end != '\0')
> -		return -EINVAL;
> -
> -	if (PAGE_ALIGN(res) >= res)
> -		res = PAGE_ALIGN(res);
> -	else
> -		res = RES_COUNTER_MAX;
> -
> -	*resp = res;
> -
> -	return 0;
> -}
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index a67c26e0f360..e619b6b62f1f 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -14,6 +14,7 @@
>   */
>  
>  #include <linux/cgroup.h>
> +#include <linux/memcontrol.h>
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/hugetlb_cgroup.h>
> @@ -23,7 +24,7 @@ struct hugetlb_cgroup {
>  	/*
>  	 * the counter to account for hugepages from hugetlb.
>  	 */
> -	struct res_counter hugepage[HUGE_MAX_HSTATE];
> +	struct page_counter hugepage[HUGE_MAX_HSTATE];
>  };
>  
>  #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> @@ -60,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
>  	int idx;
>  
>  	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
> -		if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
> +		if (atomic_long_read(&h_cg->hugepage[idx].count))
>  			return true;
>  	}
>  	return false;
> @@ -79,12 +80,12 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  
>  	if (parent_h_cgroup) {
>  		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> -			res_counter_init(&h_cgroup->hugepage[idx],
> -					 &parent_h_cgroup->hugepage[idx]);
> +			page_counter_init(&h_cgroup->hugepage[idx],
> +					  &parent_h_cgroup->hugepage[idx]);
>  	} else {
>  		root_h_cgroup = h_cgroup;
>  		for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> -			res_counter_init(&h_cgroup->hugepage[idx], NULL);
> +			page_counter_init(&h_cgroup->hugepage[idx], NULL);
>  	}
>  	return &h_cgroup->css;
>  }
> @@ -108,9 +109,8 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
>  static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
>  				       struct page *page)
>  {
> -	int csize;
> -	struct res_counter *counter;
> -	struct res_counter *fail_res;
> +	unsigned int nr_pages;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *page_hcg;
>  	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
>  
> @@ -123,15 +123,15 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
>  	if (!page_hcg || page_hcg != h_cg)
>  		goto out;
>  
> -	csize = PAGE_SIZE << compound_order(page);
> +	nr_pages = 1 << compound_order(page);
>  	if (!parent) {
>  		parent = root_h_cgroup;
>  		/* root has no limit */
> -		res_counter_charge_nofail(&parent->hugepage[idx],
> -					  csize, &fail_res);
> +		page_counter_charge(&parent->hugepage[idx], nr_pages, NULL);
>  	}
>  	counter = &h_cg->hugepage[idx];
> -	res_counter_uncharge_until(counter, counter->parent, csize);
> +	/* Take the pages off the local counter */
> +	page_counter_cancel(counter, nr_pages);
>  
>  	set_hugetlb_cgroup(page, parent);
>  out:
> @@ -166,9 +166,8 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
>  				 struct hugetlb_cgroup **ptr)
>  {
>  	int ret = 0;
> -	struct res_counter *fail_res;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = NULL;
> -	unsigned long csize = nr_pages * PAGE_SIZE;
>  
>  	if (hugetlb_cgroup_disabled())
>  		goto done;
> @@ -187,7 +186,7 @@ again:
>  	}
>  	rcu_read_unlock();
>  
> -	ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> +	ret = page_counter_charge(&h_cg->hugepage[idx], nr_pages, &counter);
>  	css_put(&h_cg->css);
>  done:
>  	*ptr = h_cg;
> @@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>  				  struct page *page)
>  {
>  	struct hugetlb_cgroup *h_cg;
> -	unsigned long csize = nr_pages * PAGE_SIZE;
>  
>  	if (hugetlb_cgroup_disabled())
>  		return;
> @@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>  	if (unlikely(!h_cg))
>  		return;
>  	set_hugetlb_cgroup(page, NULL);
> -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
>  	return;
>  }
>  
>  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>  				    struct hugetlb_cgroup *h_cg)
>  {
> -	unsigned long csize = nr_pages * PAGE_SIZE;
> -
>  	if (hugetlb_cgroup_disabled() || !h_cg)
>  		return;
>  
>  	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
>  		return;
>  
> -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
>  	return;
>  }
>  
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +};
> +
>  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
>  				   struct cftype *cft)
>  {
> -	int idx, name;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
>  
> -	idx = MEMFILE_IDX(cft->private);
> -	name = MEMFILE_ATTR(cft->private);
> +	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
>  
> -	return res_counter_read_u64(&h_cg->hugepage[idx], name);
> +	switch (MEMFILE_ATTR(cft->private)) {
> +	case RES_USAGE:
> +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> +	case RES_LIMIT:
> +		return (u64)counter->limit * PAGE_SIZE;
> +	case RES_MAX_USAGE:
> +		return (u64)counter->watermark * PAGE_SIZE;
> +	case RES_FAILCNT:
> +		return counter->limited;
> +	default:
> +		BUG();
> +	}
>  }
>  
>  static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>  				    char *buf, size_t nbytes, loff_t off)
>  {
> -	int idx, name, ret;
> -	unsigned long long val;
> +	int ret, idx;
> +	unsigned long nr_pages;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
>  
> +	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> +		return -EINVAL;
> +
>  	buf = strstrip(buf);
> +	ret = page_counter_memparse(buf, &nr_pages);
> +	if (ret)
> +		return ret;
> +
>  	idx = MEMFILE_IDX(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_LIMIT:
> -		if (hugetlb_cgroup_is_root(h_cg)) {
> -			/* Can't set limit on root */
> -			ret = -EINVAL;
> -			break;
> -		}
> -		/* This function does all necessary parse...reuse it */
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> -		if (ret)
> -			break;
> -		val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
> -		ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> +		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));
> +		ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
>  		break;
>  	default:
>  		ret = -EINVAL;
> @@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
>  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
>  				    char *buf, size_t nbytes, loff_t off)
>  {
> -	int idx, name, ret = 0;
> +	int ret = 0;
> +	struct page_counter *counter;
>  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
>  
> -	idx = MEMFILE_IDX(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
> +	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_MAX_USAGE:
> -		res_counter_reset_max(&h_cg->hugepage[idx]);
> +		counter->watermark = atomic_long_read(&counter->count);
>  		break;
>  	case RES_FAILCNT:
> -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> +		counter->limited = 0;
>  		break;
>  	default:
>  		ret = -EINVAL;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..dfd3b15a57e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -25,7 +25,6 @@
>   * GNU General Public License for more details.
>   */
>  
> -#include <linux/res_counter.h>
>  #include <linux/memcontrol.h>
>  #include <linux/cgroup.h>
>  #include <linux/mm.h>
> @@ -66,6 +65,117 @@
>  
>  #include <trace/events/vmscan.h>
>  
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	long new;
> +
> +	new = atomic_long_sub_return(nr_pages, &counter->count);
> +
> +	if (WARN_ON(unlikely(new < 0)))
> +		atomic_long_set(&counter->count, 0);
> +
> +	return new > 1;
> +}
> +
> +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> +			struct page_counter **fail)
> +{
> +	struct page_counter *c;
> +
> +	for (c = counter; c; c = c->parent) {
> +		for (;;) {
> +			unsigned long count;
> +			unsigned long new;
> +
> +			count = atomic_long_read(&c->count);
> +
> +			new = count + nr_pages;
> +			if (new > c->limit) {
> +				c->limited++;
> +				if (fail) {
> +					*fail = c;
> +					goto failed;
> +				}
> +			}
> +
> +			if (atomic_long_cmpxchg(&c->count, count, new) != count)
> +				continue;
> +
> +			if (new > c->watermark)
> +				c->watermark = new;
> +
> +			break;
> +		}
> +	}
> +	return 0;
> +
> +failed:
> +	for (c = counter; c != *fail; c = c->parent)
> +		page_counter_cancel(c, nr_pages);
> +
> +	return -ENOMEM;
> +}
> +
> +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	struct page_counter *c;
> +	int ret = 1;
> +
> +	for (c = counter; c; c = c->parent) {
> +		int remainder;
> +
> +		remainder = page_counter_cancel(c, nr_pages);
> +		if (c == counter && !remainder)
> +			ret = 0;
> +	}
> +
> +	return ret;
> +}
> +
> +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> +{
> +	for (;;) {
> +		unsigned long count;
> +		unsigned long old;
> +
> +		count = atomic_long_read(&counter->count);
> +
> +		old = xchg(&counter->limit, limit);
> +
> +		if (atomic_long_read(&counter->count) != count) {
> +			counter->limit = old;
> +			continue;
> +		}
> +
> +		if (count > limit) {
> +			counter->limit = old;
> +			return -EBUSY;
> +		}
> +
> +		return 0;
> +	}
> +}
> +
> +int page_counter_memparse(const char *buf, unsigned long *nr_pages)
> +{
> +	char unlimited[] = "-1";
> +	char *end;
> +	u64 bytes;
> +
> +	if (!strncmp(buf, unlimited, sizeof(unlimited))) {
> +		*nr_pages = PAGE_COUNTER_MAX;
> +		return 0;
> +	}
> +
> +	bytes = memparse(buf, &end);
> +	if (*end != '\0')
> +		return -EINVAL;
> +
> +	*nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
> +
> +	return 0;
> +}
> +
>  struct cgroup_subsys memory_cgrp_subsys __read_mostly;
>  EXPORT_SYMBOL(memory_cgrp_subsys);
>  
> @@ -165,7 +275,7 @@ struct mem_cgroup_per_zone {
>  	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
>  
>  	struct rb_node		tree_node;	/* RB tree node */
> -	unsigned long long	usage_in_excess;/* Set to the value by which */
> +	unsigned long		usage_in_excess;/* Set to the value by which */
>  						/* the soft limit is exceeded*/
>  	bool			on_tree;
>  	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
> @@ -198,7 +308,7 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
>  
>  struct mem_cgroup_threshold {
>  	struct eventfd_ctx *eventfd;
> -	u64 threshold;
> +	unsigned long threshold;
>  };
>  
>  /* For threshold */
> @@ -284,24 +394,18 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>   */
>  struct mem_cgroup {
>  	struct cgroup_subsys_state css;
> -	/*
> -	 * the counter to account for memory usage
> -	 */
> -	struct res_counter res;
> +
> +	/* Accounted resources */
> +	struct page_counter memory;
> +	struct page_counter memsw;
> +	struct page_counter kmem;
> +
> +	unsigned long soft_limit;
>  
>  	/* vmpressure notifications */
>  	struct vmpressure vmpressure;
>  
>  	/*
> -	 * the counter to account for mem+swap usage.
> -	 */
> -	struct res_counter memsw;
> -
> -	/*
> -	 * the counter to account for kernel memory usage.
> -	 */
> -	struct res_counter kmem;
> -	/*
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> @@ -647,7 +751,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
>  	 * This check can't live in kmem destruction function,
>  	 * since the charges will outlive the cgroup
>  	 */
> -	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
> +	WARN_ON(atomic_long_read(&memcg->kmem.count));
>  }
>  #else
>  static void disarm_kmem_keys(struct mem_cgroup *memcg)
> @@ -703,7 +807,7 @@ soft_limit_tree_from_page(struct page *page)
>  
>  static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
>  					 struct mem_cgroup_tree_per_zone *mctz,
> -					 unsigned long long new_usage_in_excess)
> +					 unsigned long new_usage_in_excess)
>  {
>  	struct rb_node **p = &mctz->rb_root.rb_node;
>  	struct rb_node *parent = NULL;
> @@ -752,10 +856,21 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
>  	spin_unlock_irqrestore(&mctz->lock, flags);
>  }
>  
> +static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
> +{
> +	unsigned long nr_pages = atomic_long_read(&memcg->memory.count);
> +	unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
> +	unsigned long excess = 0;
> +
> +	if (nr_pages > soft_limit)
> +		excess = nr_pages - soft_limit;
> +
> +	return excess;
> +}
>  
>  static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
>  {
> -	unsigned long long excess;
> +	unsigned long excess;
>  	struct mem_cgroup_per_zone *mz;
>  	struct mem_cgroup_tree_per_zone *mctz;
>  
> @@ -766,7 +881,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
>  	 */
>  	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>  		mz = mem_cgroup_page_zoneinfo(memcg, page);
> -		excess = res_counter_soft_limit_excess(&memcg->res);
> +		excess = soft_limit_excess(memcg);
>  		/*
>  		 * We have to update the tree if mz is on RB-tree or
>  		 * mem is over its softlimit.
> @@ -822,7 +937,7 @@ retry:
>  	 * position in the tree.
>  	 */
>  	__mem_cgroup_remove_exceeded(mz, mctz);
> -	if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
> +	if (!soft_limit_excess(mz->memcg) ||
>  	    !css_tryget_online(&mz->memcg->css))
>  		goto retry;
>  done:
> @@ -1478,7 +1593,7 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
>  	return inactive * inactive_ratio < active;
>  }
>  
> -#define mem_cgroup_from_res_counter(counter, member)	\
> +#define mem_cgroup_from_counter(counter, member)	\
>  	container_of(counter, struct mem_cgroup, member)
>  
>  /**
> @@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
>   */
>  static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  {
> -	unsigned long long margin;
> +	unsigned long margin = 0;
> +	unsigned long count;
> +	unsigned long limit;
>  
> -	margin = res_counter_margin(&memcg->res);
> -	if (do_swap_account)
> -		margin = min(margin, res_counter_margin(&memcg->memsw));
> -	return margin >> PAGE_SHIFT;
> +	count = atomic_long_read(&memcg->memory.count);
> +	limit = ACCESS_ONCE(memcg->memory.limit);
> +	if (count < limit)
> +		margin = limit - count;
> +
> +	if (do_swap_account) {
> +		count = atomic_long_read(&memcg->memsw.count);
> +		limit = ACCESS_ONCE(memcg->memsw.limit);
> +		if (count < limit)
> +			margin = min(margin, limit - count);
> +	}
> +
> +	return margin;
>  }
>  
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> @@ -1636,18 +1762,15 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  
>  	rcu_read_unlock();
>  
> -	pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
> -		res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
> -		res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
> -		res_counter_read_u64(&memcg->res, RES_FAILCNT));
> -	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
> -		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> -		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> -		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> -	pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n",
> -		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> -		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> -		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
> +	pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
> +		K((u64)atomic_long_read(&memcg->memory.count)),
> +		K((u64)memcg->memory.limit), memcg->memory.limited);
> +	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
> +		K((u64)atomic_long_read(&memcg->memsw.count)),
> +		K((u64)memcg->memsw.limit), memcg->memsw.limited);
> +	pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
> +		K((u64)atomic_long_read(&memcg->kmem.count)),
> +		K((u64)memcg->kmem.limit), memcg->kmem.limited);
>  
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		pr_info("Memory cgroup stats for ");
> @@ -1685,30 +1808,19 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
>  }
>  
>  /*
> - * Return the memory (and swap, if configured) limit for a memcg.
> + * Return the memory (and swap, if configured) maximum consumption for a memcg.
>   */
> -static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> +static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  {
> -	u64 limit;
> +	unsigned long limit;
>  
> -	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -
> -	/*
> -	 * Do not consider swap space if we cannot swap due to swappiness
> -	 */
> +	limit = memcg->memory.limit;
>  	if (mem_cgroup_swappiness(memcg)) {
> -		u64 memsw;
> +		unsigned long memsw_limit;
>  
> -		limit += total_swap_pages << PAGE_SHIFT;
> -		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -
> -		/*
> -		 * If memsw is finite and limits the amount of swap space
> -		 * available to this memcg, return that limit.
> -		 */
> -		limit = min(limit, memsw);
> +		memsw_limit = memcg->memsw.limit;
> +		limit = min(limit + total_swap_pages, memsw_limit);
>  	}
> -
>  	return limit;
>  }
>  
> @@ -1732,7 +1844,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	}
>  
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
> -	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct css_task_iter it;
>  		struct task_struct *task;
> @@ -1935,7 +2047,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  		.priority = 0,
>  	};
>  
> -	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
> +	excess = soft_limit_excess(root_memcg);
>  
>  	while (1) {
>  		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
> @@ -1966,7 +2078,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
>  						     zone, &nr_scanned);
>  		*total_scanned += nr_scanned;
> -		if (!res_counter_soft_limit_excess(&root_memcg->res))
> +		if (!soft_limit_excess(root_memcg))
>  			break;
>  	}
>  	mem_cgroup_iter_break(root_memcg, victim);
> @@ -2293,33 +2405,31 @@ static DEFINE_MUTEX(percpu_charge_mutex);
>  static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>  	struct memcg_stock_pcp *stock;
> -	bool ret = true;
> +	bool ret = false;
>  
>  	if (nr_pages > CHARGE_BATCH)
> -		return false;
> +		return ret;
>  
>  	stock = &get_cpu_var(memcg_stock);
> -	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
> +	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
>  		stock->nr_pages -= nr_pages;
> -	else /* need to call res_counter_charge */
> -		ret = false;
> +		ret = true;
> +	}
>  	put_cpu_var(memcg_stock);
>  	return ret;
>  }
>  
>  /*
> - * Returns stocks cached in percpu to res_counter and reset cached information.
> + * Returns stocks cached in percpu and reset cached information.
>   */
>  static void drain_stock(struct memcg_stock_pcp *stock)
>  {
>  	struct mem_cgroup *old = stock->cached;
>  
>  	if (stock->nr_pages) {
> -		unsigned long bytes = stock->nr_pages * PAGE_SIZE;
> -
> -		res_counter_uncharge(&old->res, bytes);
> +		page_counter_uncharge(&old->memory, stock->nr_pages);
>  		if (do_swap_account)
> -			res_counter_uncharge(&old->memsw, bytes);
> +			page_counter_uncharge(&old->memsw, stock->nr_pages);
>  		stock->nr_pages = 0;
>  	}
>  	stock->cached = NULL;
> @@ -2348,7 +2458,7 @@ static void __init memcg_stock_init(void)
>  }
>  
>  /*
> - * Cache charges(val) which is from res_counter, to local per_cpu area.
> + * Cache charges(val) to local per_cpu area.
>   * This will be consumed by consume_stock() function, later.
>   */
>  static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> @@ -2408,8 +2518,7 @@ out:
>  /*
>   * Tries to drain stocked charges in other cpus. This function is asynchronous
>   * and just put a work per cpu for draining localy on each cpu. Caller can
> - * expects some charges will be back to res_counter later but cannot wait for
> - * it.
> + * expects some charges will be back later but cannot wait for it.
>   */
>  static void drain_all_stock_async(struct mem_cgroup *root_memcg)
>  {
> @@ -2483,9 +2592,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
> -	struct res_counter *fail_res;
> +	struct page_counter *counter;
>  	unsigned long nr_reclaimed;
> -	unsigned long long size;
>  	bool may_swap = true;
>  	bool drained = false;
>  	int ret = 0;
> @@ -2496,17 +2604,16 @@ retry:
>  	if (consume_stock(memcg, nr_pages))
>  		goto done;
>  
> -	size = batch * PAGE_SIZE;
> -	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> +	if (!page_counter_charge(&memcg->memory, batch, &counter)) {
>  		if (!do_swap_account)
>  			goto done_restock;
> -		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> +		if (!page_counter_charge(&memcg->memsw, batch, &counter))
>  			goto done_restock;
> -		res_counter_uncharge(&memcg->res, size);
> -		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> +		page_counter_uncharge(&memcg->memory, batch);
> +		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>  		may_swap = false;
>  	} else
> -		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> +		mem_over_limit = mem_cgroup_from_counter(counter, memory);
>  
>  	if (batch > nr_pages) {
>  		batch = nr_pages;
> @@ -2587,32 +2694,12 @@ done:
>  
>  static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> -	unsigned long bytes = nr_pages * PAGE_SIZE;
> -
>  	if (mem_cgroup_is_root(memcg))
>  		return;
>  
> -	res_counter_uncharge(&memcg->res, bytes);
> +	page_counter_uncharge(&memcg->memory, nr_pages);
>  	if (do_swap_account)
> -		res_counter_uncharge(&memcg->memsw, bytes);
> -}
> -
> -/*
> - * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
> - * This is useful when moving usage to parent cgroup.
> - */
> -static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
> -					unsigned int nr_pages)
> -{
> -	unsigned long bytes = nr_pages * PAGE_SIZE;
> -
> -	if (mem_cgroup_is_root(memcg))
> -		return;
> -
> -	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
> -	if (do_swap_account)
> -		res_counter_uncharge_until(&memcg->memsw,
> -						memcg->memsw.parent, bytes);
> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>  }
>  
>  /*
> @@ -2736,8 +2823,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>  		unlock_page_lru(page, isolated);
>  }
>  
> -static DEFINE_MUTEX(set_limit_mutex);
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  /*
>   * The memcg_slab_mutex is held whenever a per memcg kmem cache is created or
> @@ -2786,16 +2871,17 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
>  }
>  #endif
>  
> -static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
> +			     unsigned long nr_pages)
>  {
> -	struct res_counter *fail_res;
> +	struct page_counter *counter;
>  	int ret = 0;
>  
> -	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
> -	if (ret)
> +	ret = page_counter_charge(&memcg->kmem, nr_pages, &counter);
> +	if (ret < 0)
>  		return ret;
>  
> -	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT);
> +	ret = try_charge(memcg, gfp, nr_pages);
>  	if (ret == -EINTR)  {
>  		/*
>  		 * try_charge() chose to bypass to root due to OOM kill or
> @@ -2812,25 +2898,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>  		 * when the allocation triggers should have been already
>  		 * directed to the root cgroup in memcontrol.h
>  		 */
> -		res_counter_charge_nofail(&memcg->res, size, &fail_res);
> +		page_counter_charge(&memcg->memory, nr_pages, NULL);
>  		if (do_swap_account)
> -			res_counter_charge_nofail(&memcg->memsw, size,
> -						  &fail_res);
> +			page_counter_charge(&memcg->memsw, nr_pages, NULL);
>  		ret = 0;
>  	} else if (ret)
> -		res_counter_uncharge(&memcg->kmem, size);
> +		page_counter_uncharge(&memcg->kmem, nr_pages);
>  
>  	return ret;
>  }
>  
> -static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
> +				unsigned long nr_pages)
>  {
> -	res_counter_uncharge(&memcg->res, size);
> +	page_counter_uncharge(&memcg->memory, nr_pages);
>  	if (do_swap_account)
> -		res_counter_uncharge(&memcg->memsw, size);
> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>  
>  	/* Not down to 0 */
> -	if (res_counter_uncharge(&memcg->kmem, size))
> +	if (page_counter_uncharge(&memcg->kmem, nr_pages))
>  		return;
>  
>  	/*
> @@ -3107,19 +3193,21 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg,
>  
>  int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
>  {
> +	unsigned int nr_pages = 1 << order;
>  	int res;
>  
> -	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
> -				PAGE_SIZE << order);
> +	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages);
>  	if (!res)
> -		atomic_add(1 << order, &cachep->memcg_params->nr_pages);
> +		atomic_add(nr_pages, &cachep->memcg_params->nr_pages);
>  	return res;
>  }
>  
>  void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
>  {
> -	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
> -	atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
> +	unsigned int nr_pages = 1 << order;
> +
> +	memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages);
> +	atomic_sub(nr_pages, &cachep->memcg_params->nr_pages);
>  }
>  
>  /*
> @@ -3240,7 +3328,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
>  		return true;
>  	}
>  
> -	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
> +	ret = memcg_charge_kmem(memcg, gfp, 1 << order);
>  	if (!ret)
>  		*_memcg = memcg;
>  
> @@ -3257,7 +3345,7 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
>  
>  	/* The page allocation failed. Revert */
>  	if (!page) {
> -		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
> +		memcg_uncharge_kmem(memcg, 1 << order);
>  		return;
>  	}
>  	/*
> @@ -3290,7 +3378,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
>  		return;
>  
>  	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> -	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
> +	memcg_uncharge_kmem(memcg, 1 << order);
>  }
>  #else
>  static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
> @@ -3468,8 +3556,12 @@ static int mem_cgroup_move_parent(struct page *page,
>  
>  	ret = mem_cgroup_move_account(page, nr_pages,
>  				pc, child, parent);
> -	if (!ret)
> -		__mem_cgroup_cancel_local_charge(child, nr_pages);
> +	if (!ret) {
> +		/* Take charge off the local counters */
> +		page_counter_cancel(&child->memory, nr_pages);
> +		if (do_swap_account)
> +			page_counter_cancel(&child->memsw, nr_pages);
> +	}
>  
>  	if (nr_pages > 1)
>  		compound_unlock_irqrestore(page, flags);
> @@ -3499,7 +3591,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
>   *
>   * Returns 0 on success, -EINVAL on failure.
>   *
> - * The caller must have charged to @to, IOW, called res_counter_charge() about
> + * The caller must have charged to @to, IOW, called page_counter_charge() about
>   * both res and memsw, and called css_get().
>   */
>  static int mem_cgroup_move_swap_account(swp_entry_t entry,
> @@ -3515,7 +3607,7 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
>  		mem_cgroup_swap_statistics(to, true);
>  		/*
>  		 * This function is only called from task migration context now.
> -		 * It postpones res_counter and refcount handling till the end
> +		 * It postpones page_counter and refcount handling till the end
>  		 * of task migration(mem_cgroup_clear_mc()) for performance
>  		 * improvement. But we cannot postpone css_get(to)  because if
>  		 * the process that has been moved to @to does swap-in, the
> @@ -3573,49 +3665,42 @@ void mem_cgroup_print_bad_page(struct page *page)
>  }
>  #endif
>  
> +static DEFINE_MUTEX(set_limit_mutex);
> +
>  static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> -				unsigned long long val)
> +				   unsigned long limit)
>  {
> +	unsigned long curusage;
> +	unsigned long oldusage;
> +	bool enlarge = false;
>  	int retry_count;
> -	u64 memswlimit, memlimit;
> -	int ret = 0;
> -	int children = mem_cgroup_count_children(memcg);
> -	u64 curusage, oldusage;
> -	int enlarge;
> +	int ret;
>  
>  	/*
>  	 * For keeping hierarchical_reclaim simple, how long we should retry
>  	 * is depends on callers. We set our retry-count to be function
>  	 * of # of children which we should visit in this loop.
>  	 */
> -	retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
> +	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
> +		      mem_cgroup_count_children(memcg);
>  
> -	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +	oldusage = atomic_long_read(&memcg->memory.count);
>  
> -	enlarge = 0;
> -	while (retry_count) {
> +	do {
>  		if (signal_pending(current)) {
>  			ret = -EINTR;
>  			break;
>  		}
> -		/*
> -		 * Rather than hide all in some function, I do this in
> -		 * open coded manner. You see what this really does.
> -		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> -		 */
> +
>  		mutex_lock(&set_limit_mutex);
> -		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -		if (memswlimit < val) {
> -			ret = -EINVAL;
> +		if (limit > memcg->memsw.limit) {
>  			mutex_unlock(&set_limit_mutex);
> +			ret = -EINVAL;
>  			break;
>  		}
> -
> -		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -		if (memlimit < val)
> -			enlarge = 1;
> -
> -		ret = res_counter_set_limit(&memcg->res, val);
> +		if (limit > memcg->memory.limit)
> +			enlarge = true;
> +		ret = page_counter_limit(&memcg->memory, limit);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -3623,13 +3708,14 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  
>  		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
>  
> -		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +		curusage = atomic_long_read(&memcg->memory.count);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
>  			retry_count--;
>  		else
>  			oldusage = curusage;
> -	}
> +	} while (retry_count);
> +
>  	if (!ret && enlarge)
>  		memcg_oom_recover(memcg);
>  
> @@ -3637,38 +3723,35 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  }
>  
>  static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> -					unsigned long long val)
> +					 unsigned long limit)
>  {
> +	unsigned long curusage;
> +	unsigned long oldusage;
> +	bool enlarge = false;
>  	int retry_count;
> -	u64 memlimit, memswlimit, oldusage, curusage;
> -	int children = mem_cgroup_count_children(memcg);
> -	int ret = -EBUSY;
> -	int enlarge = 0;
> +	int ret;
>  
>  	/* see mem_cgroup_resize_res_limit */
> -	retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
> -	oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> -	while (retry_count) {
> +	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
> +		      mem_cgroup_count_children(memcg);
> +
> +	oldusage = atomic_long_read(&memcg->memsw.count);
> +
> +	do {
>  		if (signal_pending(current)) {
>  			ret = -EINTR;
>  			break;
>  		}
> -		/*
> -		 * Rather than hide all in some function, I do this in
> -		 * open coded manner. You see what this really does.
> -		 * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> -		 */
> +
>  		mutex_lock(&set_limit_mutex);
> -		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -		if (memlimit > val) {
> -			ret = -EINVAL;
> +		if (limit < memcg->memory.limit) {
>  			mutex_unlock(&set_limit_mutex);
> +			ret = -EINVAL;
>  			break;
>  		}
> -		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -		if (memswlimit < val)
> -			enlarge = 1;
> -		ret = res_counter_set_limit(&memcg->memsw, val);
> +		if (limit > memcg->memsw.limit)
> +			enlarge = true;
> +		ret = page_counter_limit(&memcg->memsw, limit);
>  		mutex_unlock(&set_limit_mutex);
>  
>  		if (!ret)
> @@ -3676,15 +3759,17 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  
>  		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
>  
> -		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +		curusage = atomic_long_read(&memcg->memsw.count);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
>  			retry_count--;
>  		else
>  			oldusage = curusage;
> -	}
> +	} while (retry_count);
> +
>  	if (!ret && enlarge)
>  		memcg_oom_recover(memcg);
> +
>  	return ret;
>  }
>  
> @@ -3697,7 +3782,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	unsigned long reclaimed;
>  	int loop = 0;
>  	struct mem_cgroup_tree_per_zone *mctz;
> -	unsigned long long excess;
> +	unsigned long excess;
>  	unsigned long nr_scanned;
>  
>  	if (order > 0)
> @@ -3751,7 +3836,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  			} while (1);
>  		}
>  		__mem_cgroup_remove_exceeded(mz, mctz);
> -		excess = res_counter_soft_limit_excess(&mz->memcg->res);
> +		excess = soft_limit_excess(mz->memcg);
>  		/*
>  		 * One school of thought says that we should not add
>  		 * back the node to the tree if reclaim returns 0.
> @@ -3844,7 +3929,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>  static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
>  {
>  	int node, zid;
> -	u64 usage;
>  
>  	do {
>  		/* This is for making all *used* pages to be on LRU. */
> @@ -3876,9 +3960,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
>  		 * right after the check. RES_USAGE should be safe as we always
>  		 * charge before adding to the LRU.
>  		 */
> -		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
> -			res_counter_read_u64(&memcg->kmem, RES_USAGE);
> -	} while (usage > 0);
> +	} while (atomic_long_read(&memcg->memory.count) -
> +		 atomic_long_read(&memcg->kmem.count) > 0);
>  }
>  
>  /*
> @@ -3918,7 +4001,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  	/* we call try-to-free pages for make this cgroup empty */
>  	lru_add_drain_all();
>  	/* try to free all pages in this cgroup */
> -	while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
> +	while (nr_retries && atomic_long_read(&memcg->memory.count)) {
>  		int progress;
>  
>  		if (signal_pending(current))
> @@ -3989,8 +4072,8 @@ out:
>  	return retval;
>  }
>  
> -static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
> -					       enum mem_cgroup_stat_index idx)
> +static unsigned long tree_stat(struct mem_cgroup *memcg,
> +			       enum mem_cgroup_stat_index idx)
>  {
>  	struct mem_cgroup *iter;
>  	long val = 0;
> @@ -4008,55 +4091,72 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  {
>  	u64 val;
>  
> -	if (!mem_cgroup_is_root(memcg)) {
> +	if (mem_cgroup_is_root(memcg)) {
> +		val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE);
> +		val += tree_stat(memcg, MEM_CGROUP_STAT_RSS);
> +		if (swap)
> +			val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
> +	} else {
>  		if (!swap)
> -			return res_counter_read_u64(&memcg->res, RES_USAGE);
> +			val = atomic_long_read(&memcg->memory.count);
>  		else
> -			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +			val = atomic_long_read(&memcg->memsw.count);
>  	}
> -
> -	/*
> -	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
> -	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
> -	 */
> -	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
> -	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
> -
> -	if (swap)
> -		val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
> -
>  	return val << PAGE_SHIFT;
>  }
>  
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +	RES_SOFT_LIMIT,
> +};
>  
>  static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  			       struct cftype *cft)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	enum res_type type = MEMFILE_TYPE(cft->private);
> -	int name = MEMFILE_ATTR(cft->private);
> +	struct page_counter *counter;
>  
> -	switch (type) {
> +	switch (MEMFILE_TYPE(cft->private)) {
>  	case _MEM:
> -		if (name == RES_USAGE)
> -			return mem_cgroup_usage(memcg, false);
> -		return res_counter_read_u64(&memcg->res, name);
> +		counter = &memcg->memory;
> +		break;
>  	case _MEMSWAP:
> -		if (name == RES_USAGE)
> -			return mem_cgroup_usage(memcg, true);
> -		return res_counter_read_u64(&memcg->memsw, name);
> +		counter = &memcg->memsw;
> +		break;
>  	case _KMEM:
> -		return res_counter_read_u64(&memcg->kmem, name);
> +		counter = &memcg->kmem;
>  		break;
>  	default:
>  		BUG();
>  	}
> +
> +	switch (MEMFILE_ATTR(cft->private)) {
> +	case RES_USAGE:
> +		if (counter == &memcg->memory)
> +			return mem_cgroup_usage(memcg, false);
> +		if (counter == &memcg->memsw)
> +			return mem_cgroup_usage(memcg, true);
> +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> +	case RES_LIMIT:
> +		return (u64)counter->limit * PAGE_SIZE;
> +	case RES_MAX_USAGE:
> +		return (u64)counter->watermark * PAGE_SIZE;
> +	case RES_FAILCNT:
> +		return counter->limited;
> +	case RES_SOFT_LIMIT:
> +		return (u64)memcg->soft_limit * PAGE_SIZE;
> +	default:
> +		BUG();
> +	}
>  }
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  /* should be called with activate_kmem_mutex held */
>  static int __memcg_activate_kmem(struct mem_cgroup *memcg,
> -				 unsigned long long limit)
> +				 unsigned long nr_pages)
>  {
>  	int err = 0;
>  	int memcg_id;
> @@ -4103,7 +4203,7 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
>  	 * We couldn't have accounted to this cgroup, because it hasn't got the
>  	 * active bit set yet, so this should succeed.
>  	 */
> -	err = res_counter_set_limit(&memcg->kmem, limit);
> +	err = page_counter_limit(&memcg->kmem, nr_pages);
>  	VM_BUG_ON(err);
>  
>  	static_key_slow_inc(&memcg_kmem_enabled_key);
> @@ -4119,25 +4219,25 @@ out:
>  }
>  
>  static int memcg_activate_kmem(struct mem_cgroup *memcg,
> -			       unsigned long long limit)
> +			       unsigned long nr_pages)
>  {
>  	int ret;
>  
>  	mutex_lock(&activate_kmem_mutex);
> -	ret = __memcg_activate_kmem(memcg, limit);
> +	ret = __memcg_activate_kmem(memcg, nr_pages);
>  	mutex_unlock(&activate_kmem_mutex);
>  	return ret;
>  }
>  
>  static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
> -				   unsigned long long val)
> +				   unsigned long limit)
>  {
>  	int ret;
>  
>  	if (!memcg_kmem_is_active(memcg))
> -		ret = memcg_activate_kmem(memcg, val);
> +		ret = memcg_activate_kmem(memcg, limit);
>  	else
> -		ret = res_counter_set_limit(&memcg->kmem, val);
> +		ret = page_counter_limit(&memcg->kmem, limit);
>  	return ret;
>  }
>  
> @@ -4155,13 +4255,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
>  	 * after this point, because it has at least one child already.
>  	 */
>  	if (memcg_kmem_is_active(parent))
> -		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
> +		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
>  	mutex_unlock(&activate_kmem_mutex);
>  	return ret;
>  }
>  #else
>  static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
> -				   unsigned long long val)
> +				   unsigned long limit)
>  {
>  	return -EINVAL;
>  }
> @@ -4175,110 +4275,69 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	enum res_type type;
> -	int name;
> -	unsigned long long val;
> +	unsigned long nr_pages;
>  	int ret;
>  
>  	buf = strstrip(buf);
> -	type = MEMFILE_TYPE(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
> +	ret = page_counter_memparse(buf, &nr_pages);
> +	if (ret)
> +		return ret;
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_LIMIT:
>  		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
>  			ret = -EINVAL;
>  			break;
>  		}
> -		/* This function does all necessary parse...reuse it */
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> -		if (ret)
> +		switch (MEMFILE_TYPE(of_cft(of)->private)) {
> +		case _MEM:
> +			ret = mem_cgroup_resize_limit(memcg, nr_pages);
>  			break;
> -		if (type == _MEM)
> -			ret = mem_cgroup_resize_limit(memcg, val);
> -		else if (type == _MEMSWAP)
> -			ret = mem_cgroup_resize_memsw_limit(memcg, val);
> -		else if (type == _KMEM)
> -			ret = memcg_update_kmem_limit(memcg, val);
> -		else
> -			return -EINVAL;
> -		break;
> -	case RES_SOFT_LIMIT:
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> -		if (ret)
> +		case _MEMSWAP:
> +			ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages);
>  			break;
> -		/*
> -		 * For memsw, soft limits are hard to implement in terms
> -		 * of semantics, for now, we support soft limits for
> -		 * control without swap
> -		 */
> -		if (type == _MEM)
> -			ret = res_counter_set_soft_limit(&memcg->res, val);
> -		else
> -			ret = -EINVAL;
> +		case _KMEM:
> +			ret = memcg_update_kmem_limit(memcg, nr_pages);
> +			break;
> +		}
>  		break;
> -	default:
> -		ret = -EINVAL; /* should be BUG() ? */
> +	case RES_SOFT_LIMIT:
> +		memcg->soft_limit = nr_pages;
> +		ret = 0;
>  		break;
>  	}
>  	return ret ?: nbytes;
>  }
>  
> -static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
> -		unsigned long long *mem_limit, unsigned long long *memsw_limit)
> -{
> -	unsigned long long min_limit, min_memsw_limit, tmp;
> -
> -	min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -	min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -	if (!memcg->use_hierarchy)
> -		goto out;
> -
> -	while (memcg->css.parent) {
> -		memcg = mem_cgroup_from_css(memcg->css.parent);
> -		if (!memcg->use_hierarchy)
> -			break;
> -		tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -		min_limit = min(min_limit, tmp);
> -		tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> -		min_memsw_limit = min(min_memsw_limit, tmp);
> -	}
> -out:
> -	*mem_limit = min_limit;
> -	*memsw_limit = min_memsw_limit;
> -}
> -
>  static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
>  				size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	int name;
> -	enum res_type type;
> +	struct page_counter *counter;
>  
> -	type = MEMFILE_TYPE(of_cft(of)->private);
> -	name = MEMFILE_ATTR(of_cft(of)->private);
> +	switch (MEMFILE_TYPE(of_cft(of)->private)) {
> +	case _MEM:
> +		counter = &memcg->memory;
> +		break;
> +	case _MEMSWAP:
> +		counter = &memcg->memsw;
> +		break;
> +	case _KMEM:
> +		counter = &memcg->kmem;
> +		break;
> +	default:
> +		BUG();
> +	}
>  
> -	switch (name) {
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
>  	case RES_MAX_USAGE:
> -		if (type == _MEM)
> -			res_counter_reset_max(&memcg->res);
> -		else if (type == _MEMSWAP)
> -			res_counter_reset_max(&memcg->memsw);
> -		else if (type == _KMEM)
> -			res_counter_reset_max(&memcg->kmem);
> -		else
> -			return -EINVAL;
> +		counter->watermark = atomic_long_read(&counter->count);
>  		break;
>  	case RES_FAILCNT:
> -		if (type == _MEM)
> -			res_counter_reset_failcnt(&memcg->res);
> -		else if (type == _MEMSWAP)
> -			res_counter_reset_failcnt(&memcg->memsw);
> -		else if (type == _KMEM)
> -			res_counter_reset_failcnt(&memcg->kmem);
> -		else
> -			return -EINVAL;
> +		counter->limited = 0;
>  		break;
> +	default:
> +		BUG();
>  	}
>  
>  	return nbytes;
> @@ -4375,6 +4434,7 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
>  static int memcg_stat_show(struct seq_file *m, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long memory, memsw;
>  	struct mem_cgroup *mi;
>  	unsigned int i;
>  
> @@ -4394,14 +4454,16 @@ static int memcg_stat_show(struct seq_file *m, void *v)
>  			   mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);
>  
>  	/* Hierarchical information */
> -	{
> -		unsigned long long limit, memsw_limit;
> -		memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit);
> -		seq_printf(m, "hierarchical_memory_limit %llu\n", limit);
> -		if (do_swap_account)
> -			seq_printf(m, "hierarchical_memsw_limit %llu\n",
> -				   memsw_limit);
> +	memory = memsw = PAGE_COUNTER_MAX;
> +	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
> +		memory = min(memory, mi->memory.limit);
> +		memsw = min(memsw, mi->memsw.limit);
>  	}
> +	seq_printf(m, "hierarchical_memory_limit %llu\n",
> +		   (u64)memory * PAGE_SIZE);
> +	if (do_swap_account)
> +		seq_printf(m, "hierarchical_memsw_limit %llu\n",
> +			   (u64)memsw * PAGE_SIZE);
>  
>  	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>  		long long val = 0;
> @@ -4485,7 +4547,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
>  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
>  {
>  	struct mem_cgroup_threshold_ary *t;
> -	u64 usage;
> +	unsigned long usage;
>  	int i;
>  
>  	rcu_read_lock();
> @@ -4584,10 +4646,11 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
>  {
>  	struct mem_cgroup_thresholds *thresholds;
>  	struct mem_cgroup_threshold_ary *new;
> -	u64 threshold, usage;
> +	unsigned long threshold;
> +	unsigned long usage;
>  	int i, size, ret;
>  
> -	ret = res_counter_memparse_write_strategy(args, &threshold);
> +	ret = page_counter_memparse(args, &threshold);
>  	if (ret)
>  		return ret;
>  
> @@ -4677,7 +4740,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
>  {
>  	struct mem_cgroup_thresholds *thresholds;
>  	struct mem_cgroup_threshold_ary *new;
> -	u64 usage;
> +	unsigned long usage;
>  	int i, j, size;
>  
>  	mutex_lock(&memcg->thresholds_lock);
> @@ -4871,7 +4934,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
>  
>  	memcg_kmem_mark_dead(memcg);
>  
> -	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
> +	if (atomic_long_read(&memcg->kmem.count))
>  		return;
>  
>  	if (memcg_kmem_test_and_clear_dead(memcg))
> @@ -5351,9 +5414,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>   */
>  struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  {
> -	if (!memcg->res.parent)
> +	if (!memcg->memory.parent)
>  		return NULL;
> -	return mem_cgroup_from_res_counter(memcg->res.parent, res);
> +	return mem_cgroup_from_counter(memcg->memory.parent, memory);
>  }
>  EXPORT_SYMBOL(parent_mem_cgroup);
>  
> @@ -5398,9 +5461,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	/* root ? */
>  	if (parent_css == NULL) {
>  		root_mem_cgroup = memcg;
> -		res_counter_init(&memcg->res, NULL);
> -		res_counter_init(&memcg->memsw, NULL);
> -		res_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->memory, NULL);
> +		page_counter_init(&memcg->memsw, NULL);
> +		page_counter_init(&memcg->kmem, NULL);
>  	}
>  
>  	memcg->last_scanned_node = MAX_NUMNODES;
> @@ -5438,18 +5501,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	memcg->swappiness = mem_cgroup_swappiness(parent);
>  
>  	if (parent->use_hierarchy) {
> -		res_counter_init(&memcg->res, &parent->res);
> -		res_counter_init(&memcg->memsw, &parent->memsw);
> -		res_counter_init(&memcg->kmem, &parent->kmem);
> +		page_counter_init(&memcg->memory, &parent->memory);
> +		page_counter_init(&memcg->memsw, &parent->memsw);
> +		page_counter_init(&memcg->kmem, &parent->kmem);
>  
>  		/*
>  		 * No need to take a reference to the parent because cgroup
>  		 * core guarantees its existence.
>  		 */
>  	} else {
> -		res_counter_init(&memcg->res, NULL);
> -		res_counter_init(&memcg->memsw, NULL);
> -		res_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->memory, NULL);
> +		page_counter_init(&memcg->memsw, NULL);
> +		page_counter_init(&memcg->kmem, NULL);
>  		/*
>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>  		 * much sense so let cgroup subsystem know about this
> @@ -5520,7 +5583,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	/*
>  	 * XXX: css_offline() would be where we should reparent all
>  	 * memory to prepare the cgroup for destruction.  However,
> -	 * memcg does not do css_tryget_online() and res_counter charging
> +	 * memcg does not do css_tryget_online() and page_counter charging
>  	 * under the same RCU lock region, which means that charging
>  	 * could race with offlining.  Offlining only happens to
>  	 * cgroups with no tasks in them but charges can show up
> @@ -5540,7 +5603,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	 * call_rcu()
>  	 *   offline_css()
>  	 *     reparent_charges()
> -	 *                           res_counter_charge()
> +	 *                           page_counter_charge()
>  	 *                           css_put()
>  	 *                             css_free()
>  	 *                           pc->mem_cgroup = dead memcg
> @@ -5575,10 +5638,10 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  
> -	mem_cgroup_resize_limit(memcg, ULLONG_MAX);
> -	mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX);
> -	memcg_update_kmem_limit(memcg, ULLONG_MAX);
> -	res_counter_set_soft_limit(&memcg->res, ULLONG_MAX);
> +	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
> +	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
> +	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
> +	memcg->soft_limit = 0;
>  }
>  
>  #ifdef CONFIG_MMU
> @@ -5892,19 +5955,18 @@ static void __mem_cgroup_clear_mc(void)
>  	if (mc.moved_swap) {
>  		/* uncharge swap account from the old cgroup */
>  		if (!mem_cgroup_is_root(mc.from))
> -			res_counter_uncharge(&mc.from->memsw,
> -					     PAGE_SIZE * mc.moved_swap);
> -
> -		for (i = 0; i < mc.moved_swap; i++)
> -			css_put(&mc.from->css);
> +			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
>  
>  		/*
> -		 * we charged both to->res and to->memsw, so we should
> -		 * uncharge to->res.
> +		 * we charged both to->memory and to->memsw, so we
> +		 * should uncharge to->memory.
>  		 */
>  		if (!mem_cgroup_is_root(mc.to))
> -			res_counter_uncharge(&mc.to->res,
> -					     PAGE_SIZE * mc.moved_swap);
> +			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
> +
> +		for (i = 0; i < mc.moved_swap; i++)
> +			css_put(&mc.from->css);
> +
>  		/* we've already done css_get(mc.to) */
>  		mc.moved_swap = 0;
>  	}
> @@ -6270,7 +6332,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
>  	memcg = mem_cgroup_lookup(id);
>  	if (memcg) {
>  		if (!mem_cgroup_is_root(memcg))
> -			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> +			page_counter_uncharge(&memcg->memsw, 1);
>  		mem_cgroup_swap_statistics(memcg, false);
>  		css_put(&memcg->css);
>  	}
> @@ -6436,11 +6498,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>  
>  	if (!mem_cgroup_is_root(memcg)) {
>  		if (nr_mem)
> -			res_counter_uncharge(&memcg->res,
> -					     nr_mem * PAGE_SIZE);
> +			page_counter_uncharge(&memcg->memory, nr_mem);
>  		if (nr_memsw)
> -			res_counter_uncharge(&memcg->memsw,
> -					     nr_memsw * PAGE_SIZE);
> +			page_counter_uncharge(&memcg->memsw, nr_memsw);
>  		memcg_oom_recover(memcg);
>  	}
>  
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index 1d191357bf88..9a448bdb19e9 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
> @@ -9,13 +9,13 @@
>  int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  {
>  	/*
> -	 * The root cgroup does not use res_counters, but rather,
> +	 * The root cgroup does not use page_counters, but rather,
>  	 * rely on the data already collected by the network
>  	 * subsystem
>  	 */
> -	struct res_counter *res_parent = NULL;
> -	struct cg_proto *cg_proto, *parent_cg;
>  	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> +	struct page_counter *counter_parent = NULL;
> +	struct cg_proto *cg_proto, *parent_cg;
>  
>  	cg_proto = tcp_prot.proto_cgroup(memcg);
>  	if (!cg_proto)
> @@ -29,9 +29,9 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  
>  	parent_cg = tcp_prot.proto_cgroup(parent);
>  	if (parent_cg)
> -		res_parent = &parent_cg->memory_allocated;
> +		counter_parent = &parent_cg->memory_allocated;
>  
> -	res_counter_init(&cg_proto->memory_allocated, res_parent);
> +	page_counter_init(&cg_proto->memory_allocated, counter_parent);
>  	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
>  
>  	return 0;
> @@ -50,7 +50,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
>  }
>  EXPORT_SYMBOL(tcp_destroy_cgroup);
>  
> -static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
> +static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
>  {
>  	struct cg_proto *cg_proto;
>  	int i;
> @@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
>  	if (!cg_proto)
>  		return -EINVAL;
>  
> -	if (val > RES_COUNTER_MAX)
> -		val = RES_COUNTER_MAX;
> -
> -	ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
> +	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
>  	if (ret)
>  		return ret;
>  
>  	for (i = 0; i < 3; i++)
> -		cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
> +		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
>  						sysctl_tcp_mem[i]);
>  
> -	if (val == RES_COUNTER_MAX)
> +	if (nr_pages == ULONG_MAX / PAGE_SIZE)
>  		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> -	else if (val != RES_COUNTER_MAX) {
> +	else {
>  		/*
>  		 * The active bit needs to be written after the static_key
>  		 * update. This is what guarantees that the socket activation
> @@ -102,11 +99,18 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
>  	return 0;
>  }
>  
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +};
> +
>  static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	unsigned long long val;
> +	unsigned long nr_pages;
>  	int ret = 0;
>  
>  	buf = strstrip(buf);
> @@ -114,10 +118,10 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  	switch (of_cft(of)->private) {
>  	case RES_LIMIT:
>  		/* see memcontrol.c */
> -		ret = res_counter_memparse_write_strategy(buf, &val);
> +		ret = page_counter_memparse(buf, &nr_pages);
>  		if (ret)
>  			break;
> -		ret = tcp_update_limit(memcg, val);
> +		ret = tcp_update_limit(memcg, nr_pages);
>  		break;
>  	default:
>  		ret = -EINVAL;
> @@ -126,43 +130,35 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  	return ret ?: nbytes;
>  }
>  
> -static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return default_val;
> -
> -	return res_counter_read_u64(&cg_proto->memory_allocated, type);
> -}
> -
> -static u64 tcp_read_usage(struct mem_cgroup *memcg)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
> -
> -	return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
> -}
> -
>  static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
>  	u64 val;
>  
>  	switch (cft->private) {
>  	case RES_LIMIT:
> -		val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
> +		if (!cg_proto)
> +			return PAGE_COUNTER_MAX;
> +		val = cg_proto->memory_allocated.limit;
> +		val *= PAGE_SIZE;
>  		break;
>  	case RES_USAGE:
> -		val = tcp_read_usage(memcg);
> +		if (!cg_proto)
> +			return atomic_long_read(&tcp_memory_allocated);
> +		val = atomic_long_read(&cg_proto->memory_allocated.count);
> +		val *= PAGE_SIZE;
>  		break;
>  	case RES_FAILCNT:
> +		if (!cg_proto)
> +			return 0;
> +		val = cg_proto->memory_allocated.limited;
> +		break;
>  	case RES_MAX_USAGE:
> -		val = tcp_read_stat(memcg, cft->private, 0);
> +		if (!cg_proto)
> +			return 0;
> +		val = cg_proto->memory_allocated.watermark;
> +		val *= PAGE_SIZE;
>  		break;
>  	default:
>  		BUG();
> @@ -183,10 +179,11 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
>  
>  	switch (of_cft(of)->private) {
>  	case RES_MAX_USAGE:
> -		res_counter_reset_max(&cg_proto->memory_allocated);
> +		cg_proto->memory_allocated.watermark =
> +			atomic_long_read(&cg_proto->memory_allocated.count);
>  		break;
>  	case RES_FAILCNT:
> -		res_counter_reset_failcnt(&cg_proto->memory_allocated);
> +		cg_proto->memory_allocated.limited = 0;
>  		break;
>  	}
>  
> -- 
> 2.1.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-22 14:44   ` Michal Hocko
@ 2014-09-22 15:50     ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-22 15:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Mon, Sep 22, 2014 at 04:44:36PM +0200, Michal Hocko wrote:
> On Fri 19-09-14 09:22:08, Johannes Weiner wrote:
> > Memory is internally accounted in bytes, using spinlock-protected
> > 64-bit counters, even though the smallest accounting delta is a page.
> > The counter interface is also convoluted and does too many things.
> > 
> > Introduce a new lockless word-sized page counter API, then change all
> > memory accounting over to it and remove the old one.  The translation
> > from and to bytes then only happens when interfacing with userspace.
> 
> Dunno why but I thought other controllers use res_counter as well. But
> this doesn't seem to be the case, so this is a perfectly reasonable way
> forward.

You were fooled by its generic name!  It really is a lot less generic
than what it was designed for, and there are no new users in sight.

> I have only glanced through the patch and it mostly seems good to me 
> (I have to look more closely on the atomicity of hierarchical operations).
> 
> Nevertheless I think that the counter should live outside of memcg (it
> is ugly and bad in general to make HUGETLB controller depend on MEMCG
> just to have a counter). If you made kernel/page_counter.c and let both
> controllers select CONFIG_PAGE_COUNTER then you do not need a dependency
> on MEMCG and I would find it cleaner in general.

The reason I did it this way is because the hugetlb controller simply
accounts and limits a certain type of memory and in the future I would
like to make it a memcg extension, just like kmem and swap.

Once that is done, page counters can become fully private, but until
then I think it's a good idea to make them part of memcg to express
this relationship and to ensure we are moving in the same direction.

> > Aside from the locking costs, this gets rid of the icky unsigned long
> > long types in the very heart of memcg, which is great for 32 bit and
> > also makes the code a lot more readable.
> 
> Definitely. Nice work!

Thanks!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-22 15:50     ` Johannes Weiner
@ 2014-09-22 17:28       ` Michal Hocko
  -1 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-22 17:28 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Mon 22-09-14 11:50:49, Johannes Weiner wrote:
> On Mon, Sep 22, 2014 at 04:44:36PM +0200, Michal Hocko wrote:
> > On Fri 19-09-14 09:22:08, Johannes Weiner wrote:
[...]
> > Nevertheless I think that the counter should live outside of memcg (it
> > is ugly and bad in general to make HUGETLB controller depend on MEMCG
> > just to have a counter). If you made kernel/page_counter.c and let both
> > controllers select CONFIG_PAGE_COUNTER then you do not need a dependency
> > on MEMCG and I would find it cleaner in general.
> 
> The reason I did it this way is because the hugetlb controller simply
> accounts and limits a certain type of memory and in the future I would
> like to make it a memcg extension, just like kmem and swap.

I am not sure this is the right way to go. Hugetlb has always been
"special" and I do not see any advantage in pulling its specialness into
memcg proper. It would just make the code more complicated. I can also
imagine users who simply do not want to pay the memcg overhead and use
only the hugetlb controller.

Besides, it is not like a separate page_counter with a clear interface
would cause more maintenance overhead, so I really do not see any reason
to pull it into memcg.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-22 14:41   ` Vladimir Davydov
@ 2014-09-22 18:57     ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-22 18:57 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

Hi Vladimir,

On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 19df5d857411..bf8fb1a05597 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> >  };
> >  
> >  #ifdef CONFIG_MEMCG
> > +
> > +struct page_counter {
> 
> I'd place it in a separate file, say
> 
> 	include/linux/page_counter.h
> 	mm/page_counter.c
> 
> just to keep mm/memcontrol.c clean.

The page counters are the very core of the memory controller and, as I
said to Michal, I want to integrate the hugetlb controller into memcg
as well, at which point there won't be any outside users anymore.  So
I think this is the right place for it.

> > +	atomic_long_t count;
> > +	unsigned long limit;
> > +	struct page_counter *parent;
> > +
> > +	/* legacy */
> > +	unsigned long watermark;
> > +	unsigned long limited;
> 
> IMHO, failcnt would fit better.

I never liked the failcnt name, but also have to admit that "limited"
is crap.  Let's leave it at failcnt for now.

> > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> 
> When I first saw this function, I couldn't tell by looking at its
> name what it's intended to do. I think
> 
> 	page_counter_cancel_local_charge()
> 
> would fit better.

It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?

> > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > +			struct page_counter **fail);
> > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> 
> Hmm, why not page_counter_set_limit?

Limit is used as a verb here, "to limit".  Getters and setters are
usually wrappers around unusual/complex data structure access, but
this function does a lot more, so I'm not fond of _set_limit().

> > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> >  					      unsigned long amt,
> >  					      int *parent_status)
> >  {
> > -	struct res_counter *fail;
> > -	int ret;
> > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> >  
> > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > -					amt << PAGE_SHIFT, &fail);
> > -	if (ret < 0)
> > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > +	    prot->memory_allocated.limit)
> 
> I don't like your equivalent of res_counter_charge_nofail.
> 
> Passing NULL to page_counter_charge might be useful if one doesn't have
> a back-off strategy, but still wants to fail on hitting the limit. With
> your interface the user must pass something to the function then, which
> isn't convenient.
> 
> Besides, it depends on the internal implementation of the page_counter
> struct. I'd encapsulate this.

Thinking about this more, I don't like my version either; not because
of how @fail must always be passed, but because of how it changes the
behavior.  I changed the API to

void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
                            struct page_counter **fail);

We could make @fail optional in the try_charge(), but all callsites
pass it at this time, so for now I kept it mandatory for simplicity.
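
For clarity, here is roughly how I picture a fallible callsite using
the new pair.  This is only a sketch against the prototypes above: the
function name is made up and it assumes try_charge() keeps the
0-on-success convention of the current page_counter_charge().

static int example_try_charge(struct mem_cgroup *memcg,
			      unsigned long nr_pages)
{
	struct page_counter *fail;

	if (!page_counter_try_charge(&memcg->memory, nr_pages, &fail))
		return 0;
	/*
	 * @fail points at whichever counter in the hierarchy hit its
	 * limit; reclaim would target mem_cgroup_from_counter(fail,
	 * memory) before retrying.
	 */
	return -ENOMEM;
}

Callers that must not fail - like the kmem bypass or the socket memory
accounting - keep using the unconditional page_counter_charge() and
simply go over the limit.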

What do you think?

> > @@ -975,15 +975,8 @@ config CGROUP_CPUACCT
> >  	  Provides a simple Resource Controller for monitoring the
> >  	  total CPU consumed by the tasks in a cgroup.
> >  
> > -config RESOURCE_COUNTERS
> > -	bool "Resource counters"
> > -	help
> > -	  This option enables controller independent resource accounting
> > -	  infrastructure that works with cgroups.
> > -
> >  config MEMCG
> >  	bool "Memory Resource Controller for Control Groups"
> > -	depends on RESOURCE_COUNTERS
> >  	select EVENTFD
> >  	help
> >  	  Provides a memory resource controller that manages both anonymous
> > @@ -1051,7 +1044,7 @@ config MEMCG_KMEM
> >  
> >  config CGROUP_HUGETLB
> >  	bool "HugeTLB Resource Controller for Control Groups"
> > -	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
> > +	depends on MEMCG && HUGETLB_PAGE
> 
> So now the hugetlb controller depends on memcg only because it needs the
> page_counter, which is supposed to be kind of independent. Is that OK?

As mentioned before, I want to integrate the hugetlb controller into
memcg anyway, so this should be fine, and it keeps things simpler.

> > @@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> >  	if (unlikely(!h_cg))
> >  		return;
> >  	set_hugetlb_cgroup(page, NULL);
> > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> >  	return;
> >  }
> >  
> >  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> >  				    struct hugetlb_cgroup *h_cg)
> >  {
> > -	unsigned long csize = nr_pages * PAGE_SIZE;
> > -
> >  	if (hugetlb_cgroup_disabled() || !h_cg)
> >  		return;
> >  
> >  	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
> >  		return;
> >  
> > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> >  	return;
> >  }
> >  
> > +enum {
> > +	RES_USAGE,
> > +	RES_LIMIT,
> > +	RES_MAX_USAGE,
> > +	RES_FAILCNT,
> > +};
> > +
> >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> >  				   struct cftype *cft)
> >  {
> > -	int idx, name;
> > +	struct page_counter *counter;
> >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> >  
> > -	idx = MEMFILE_IDX(cft->private);
> > -	name = MEMFILE_ATTR(cft->private);
> > +	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> >  
> > -	return res_counter_read_u64(&h_cg->hugepage[idx], name);
> > +	switch (MEMFILE_ATTR(cft->private)) {
> > +	case RES_USAGE:
> > +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> 
> page_counter_read?
> 
> > +	case RES_LIMIT:
> > +		return (u64)counter->limit * PAGE_SIZE;
> 
> page_counter_get_limit?
> 
> > +	case RES_MAX_USAGE:
> > +		return (u64)counter->watermark * PAGE_SIZE;
> 
> page_counter_read_watermark?

I added page_counter_read() to abstract away the fact that we use a
signed counter internally, but it still returns the number of pages as
unsigned long.

The entire counter API is based on pages now, and only the userspace
interface translates back and forth into bytes, so that's where the
translation should be located.
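
For reference, page_counter_read() is just a thin wrapper along these
lines (sketch, not necessarily the final spelling):

static inline unsigned long page_counter_read(struct page_counter *counter)
{
	return atomic_long_read(&counter->count);
}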

> >  static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >  				    char *buf, size_t nbytes, loff_t off)
> >  {
> > -	int idx, name, ret;
> > -	unsigned long long val;
> > +	int ret, idx;
> > +	unsigned long nr_pages;
> >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> >  
> > +	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> > +		return -EINVAL;
> > +
> >  	buf = strstrip(buf);
> > +	ret = page_counter_memparse(buf, &nr_pages);
> > +	if (ret)
> > +		return ret;
> > +
> >  	idx = MEMFILE_IDX(of_cft(of)->private);
> > -	name = MEMFILE_ATTR(of_cft(of)->private);
> >  
> > -	switch (name) {
> > +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> >  	case RES_LIMIT:
> > -		if (hugetlb_cgroup_is_root(h_cg)) {
> > -			/* Can't set limit on root */
> > -			ret = -EINVAL;
> > -			break;
> > -		}
> > -		/* This function does all necessary parse...reuse it */
> > -		ret = res_counter_memparse_write_strategy(buf, &val);
> > -		if (ret)
> > -			break;
> > -		val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
> > -		ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> > +		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));
> 
> This is incorrect. Here we should have something like:
>
> 	nr_pages = ALIGN(nr_pages, 1UL << huge_page_order(&hstates[idx]));

Good catch, thanks.

> > @@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
> >  				    char *buf, size_t nbytes, loff_t off)
> >  {
> > -	int idx, name, ret = 0;
> > +	int ret = 0;
> > +	struct page_counter *counter;
> >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> >  
> > -	idx = MEMFILE_IDX(of_cft(of)->private);
> > -	name = MEMFILE_ATTR(of_cft(of)->private);
> > +	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> >  
> > -	switch (name) {
> > +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> >  	case RES_MAX_USAGE:
> > -		res_counter_reset_max(&h_cg->hugepage[idx]);
> > +		counter->watermark = atomic_long_read(&counter->count);
> 
> page_counter_reset_watermark?

Yes, that operation deserves a wrapper.
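
Something along these lines, building on the page_counter_read()
helper mentioned above (sketch):

static inline void page_counter_reset_watermark(struct page_counter *counter)
{
	counter->watermark = page_counter_read(counter);
}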

> >  		break;
> >  	case RES_FAILCNT:
> > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > +		counter->limited = 0;
> 
> page_counter_reset_failcnt?

That would be more obscure than counter->failcnt = 0, I think.

> > @@ -66,6 +65,117 @@
> >  
> >  #include <trace/events/vmscan.h>
> >  
> > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> > +{
> > +	long new;
> > +
> > +	new = atomic_long_sub_return(nr_pages, &counter->count);
> > +
> > +	if (WARN_ON(unlikely(new < 0)))
> 
> Max value on 32 bit is ULONG_MAX, right? Then the WARN_ON is incorrect.

Since this is a page counter, this would overflow at 8 terabytes on 32
bit.  So even though the maximum is ULONG_MAX, in practice we should
never even reach LONG_MAX, and ULONG_MAX was only chosen for backward
compatibility with the default unlimited value.

This is actually not the only place that assumes we never go negative;
the userspace read functions' u64 cast of a long would sign-extend any
negative value and return ludicrous numbers.
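
(For illustration: a count of -1 would be reported as
(u64)-1 * PAGE_SIZE, i.e. roughly 1.8e19 "bytes" with 4k pages on 64 bit.)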

But thinking about this some more, it's probably not worth having these
gotchas in the code just to maintain the default unlimited value.  I
changed PAGE_COUNTER_MAX to LONG_MAX on 32 bit and LONG_MAX / PAGE_SIZE
on 64 bit.

> > +		atomic_long_set(&counter->count, 0);
> > +
> > +	return new > 1;
> 
> Nobody outside page_counter internals uses this retval. Why is it public
> then?

kmemcg still uses this for the pinning trick, but I'll update the
patch that removes it to also change the interface.

> BTW, why not new > 0?

That's a plain bug - probably a left-over from rephrasing this
underflow test several times.  Thanks for catching.
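
For the record, with the old "return new > 1" a counter left at exactly
one page would already look as if it had dropped to zero; the fix is
simply

	return new > 0;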

> > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > +			struct page_counter **fail)
> > +{
> > +	struct page_counter *c;
> > +
> > +	for (c = counter; c; c = c->parent) {
> > +		for (;;) {
> > +			unsigned long count;
> > +			unsigned long new;
> > +
> > +			count = atomic_long_read(&c->count);
> > +
> > +			new = count + nr_pages;
> > +			if (new > c->limit) {
> > +				c->limited++;
> > +				if (fail) {
> 
> So we increase 'limited' even if ain't limited. Sounds weird.

The old code actually did that too, but I removed it now in the
transition to separate charge and try_charge functions.
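
Concretely, the split in the delta below is

	void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
	int page_counter_try_charge(struct page_counter *counter,
				    unsigned long nr_pages,
				    struct page_counter **fail);

where charge() adds unconditionally (it may push the counter over its
limit), and try_charge() is the only path that checks the limit and
bumps failcnt.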

> > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > +{
> > +	for (;;) {
> > +		unsigned long count;
> > +		unsigned long old;
> > +
> > +		count = atomic_long_read(&counter->count);
> > +
> > +		old = xchg(&counter->limit, limit);
> > +
> > +		if (atomic_long_read(&counter->count) != count) {
> > +			counter->limit = old;
> 
> I wonder what can happen if two threads execute this function
> concurrently... or may be it's not supposed to be smp-safe?

memcg already holds the set_limit_mutex here.  I updated the tcp and
hugetlb controllers accordingly to take limit locks as well.
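
The callsites now all follow the same pattern, e.g. for hugetlb in the
delta below:

	mutex_lock(&hugetlb_limit_mutex);
	ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
	mutex_unlock(&hugetlb_limit_mutex);

so the xchg()-and-recheck in page_counter_limit() only has to cope with
concurrent charges, not with concurrent limit writers.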

> > @@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
> >   */
> >  static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >  {
> > -	unsigned long long margin;
> 
> Why is it still ULL?

Hm?  This is a removal.  Too many longs...

> > @@ -4155,13 +4255,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
> >  	 * after this point, because it has at least one child already.
> >  	 */
> >  	if (memcg_kmem_is_active(parent))
> > -		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
> > +		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
> 
> PAGE_COUNTER_MAX?

Good catch, thanks.  That was a left-over from several iterations of
trying to find a value that's suitable for both 32 bit and 64 bit.

> > @@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
> >  	if (!cg_proto)
> >  		return -EINVAL;
> >  
> > -	if (val > RES_COUNTER_MAX)
> > -		val = RES_COUNTER_MAX;
> > -
> > -	ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
> > +	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
> >  	if (ret)
> >  		return ret;
> >  
> >  	for (i = 0; i < 3; i++)
> > -		cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
> > +		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
> >  						sysctl_tcp_mem[i]);
> >  
> > -	if (val == RES_COUNTER_MAX)
> > +	if (nr_pages == ULONG_MAX / PAGE_SIZE)
> 
> PAGE_COUNTER_MAX?

Same.

> > @@ -126,43 +130,35 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
> >  	return ret ?: nbytes;
> >  }
> >  
> > -static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
> > -{
> > -	struct cg_proto *cg_proto;
> > -
> > -	cg_proto = tcp_prot.proto_cgroup(memcg);
> > -	if (!cg_proto)
> > -		return default_val;
> > -
> > -	return res_counter_read_u64(&cg_proto->memory_allocated, type);
> > -}
> > -
> > -static u64 tcp_read_usage(struct mem_cgroup *memcg)
> > -{
> > -	struct cg_proto *cg_proto;
> > -
> > -	cg_proto = tcp_prot.proto_cgroup(memcg);
> > -	if (!cg_proto)
> > -		return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
> > -
> > -	return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
> > -}
> > -
> >  static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
> >  {
> >  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
> >  	u64 val;
> >  
> >  	switch (cft->private) {
> >  	case RES_LIMIT:
> > -		val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
> > +		if (!cg_proto)
> > +			return PAGE_COUNTER_MAX;
> > +		val = cg_proto->memory_allocated.limit;
> > +		val *= PAGE_SIZE;
> >  		break;
> >  	case RES_USAGE:
> > -		val = tcp_read_usage(memcg);
> > +		if (!cg_proto)
> > +			return atomic_long_read(&tcp_memory_allocated);
> 
> Forgot << PAGE_SHIFT?

Yes, indeed.
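
I.e. the RES_USAGE branch in tcp_cgroup_read() now reads (see the delta
below):

	case RES_USAGE:
		if (!cg_proto)
			val = atomic_long_read(&tcp_memory_allocated);
		else
			val = page_counter_read(&cg_proto->memory_allocated);
		val *= PAGE_SIZE;
		break;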

Thanks for your thorough review, Vladimir!

Here is the delta patch:

---
From 3601538347af756b9bfe05142ab21b5ae44f8cdd Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 22 Sep 2014 13:54:24 -0400
Subject: [patch] mm: memcontrol: lockless page counters fix

- renamed limited to failcnt again [vladimir]
- base page counter range on LONG_MAX [vladimir]
- page_counter_read() [vladimir]
- page_counter_sub() [vladimir]
- rework the nofail charging [vladimir]
- page_counter_reset_watermark() [vladimir]
- fixed hugepage limit page alignment [vladimir]
- fixed page_counter_sub() return value [vladimir]
- fixed kmem's idea of unlimited [vladimir]
- fixed tcp memcontrol's idea of unlimited [vladimir]
- fixed tcp memcontrol's usage reporting [vladimir]
- serialize page_counter_limit() callsites [vladimir]
---
 include/linux/memcontrol.h |  24 ++++++---
 include/net/sock.h         |   8 +--
 mm/hugetlb_cgroup.c        |  22 ++++----
 mm/memcontrol.c            | 123 +++++++++++++++++++++++++--------------------
 net/ipv4/tcp_memcontrol.c  |  18 ++++---
 5 files changed, 115 insertions(+), 80 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf8fb1a05597..a8b939376a5d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,13 +62,13 @@ struct page_counter {
 
 	/* legacy */
 	unsigned long watermark;
-	unsigned long limited;
+	unsigned long failcnt;
 };
 
 #if BITS_PER_LONG == 32
-#define PAGE_COUNTER_MAX ULONG_MAX
+#define PAGE_COUNTER_MAX LONG_MAX
 #else
-#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
+#define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE)
 #endif
 
 static inline void page_counter_init(struct page_counter *counter,
@@ -79,13 +79,25 @@ static inline void page_counter_init(struct page_counter *counter,
 	counter->parent = parent;
 }
 
-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
-int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
-			struct page_counter **fail);
+static inline unsigned long page_counter_read(struct page_counter *counter)
+{
+	return atomic_long_read(&counter->count);
+}
+
+int page_counter_sub(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_try_charge(struct page_counter *counter,
+			    unsigned long nr_pages,
+			    struct page_counter **fail);
 int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_limit(struct page_counter *counter, unsigned long limit);
 int page_counter_memparse(const char *buf, unsigned long *nr_pages);
 
+static inline void page_counter_reset_watermark(struct page_counter *counter)
+{
+	counter->watermark = page_counter_read(counter);
+}
+
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
diff --git a/include/net/sock.h b/include/net/sock.h
index f41749982668..9aa435de3ef1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1217,9 +1217,9 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
 					      unsigned long amt,
 					      int *parent_status)
 {
-	page_counter_charge(&prot->memory_allocated, amt, NULL);
+	page_counter_charge(&prot->memory_allocated, amt);
 
-	if (atomic_long_read(&prot->memory_allocated.count) >
+	if (page_counter_read(&prot->memory_allocated) >
 	    prot->memory_allocated.limit)
 		*parent_status = OVER_LIMIT;
 }
@@ -1236,7 +1236,7 @@ sk_memory_allocated(const struct sock *sk)
 	struct proto *prot = sk->sk_prot;
 
 	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
+		return page_counter_read(&sk->sk_cgrp->memory_allocated);
 
 	return atomic_long_read(prot->memory_allocated);
 }
@@ -1250,7 +1250,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
 		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
 		/* update the root cgroup regardless */
 		atomic_long_add_return(amt, prot->memory_allocated);
-		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
+		return page_counter_read(&sk->sk_cgrp->memory_allocated);
 	}
 
 	return atomic_long_add_return(amt, prot->memory_allocated);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index e619b6b62f1f..abd1e8dc7b46 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -61,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 	int idx;
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-		if (atomic_long_read(&h_cg->hugepage[idx].count))
+		if (page_counter_read(&h_cg->hugepage[idx]))
 			return true;
 	}
 	return false;
@@ -127,11 +127,11 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
 	if (!parent) {
 		parent = root_h_cgroup;
 		/* root has no limit */
-		page_counter_charge(&parent->hugepage[idx], nr_pages, NULL);
+		page_counter_charge(&parent->hugepage[idx], nr_pages);
 	}
 	counter = &h_cg->hugepage[idx];
 	/* Take the pages off the local counter */
-	page_counter_cancel(counter, nr_pages);
+	page_counter_sub(counter, nr_pages);
 
 	set_hugetlb_cgroup(page, parent);
 out:
@@ -186,7 +186,7 @@ again:
 	}
 	rcu_read_unlock();
 
-	ret = page_counter_charge(&h_cg->hugepage[idx], nr_pages, &counter);
+	ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter);
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
@@ -254,18 +254,20 @@ static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 
 	switch (MEMFILE_ATTR(cft->private)) {
 	case RES_USAGE:
-		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+		return (u64)page_counter_read(counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->limit * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
-		return counter->limited;
+		return counter->failcnt;
 	default:
 		BUG();
 	}
 }
 
+static DEFINE_MUTEX(hugetlb_limit_mutex);
+
 static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
@@ -285,8 +287,10 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_LIMIT:
-		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));
+		nr_pages = ALIGN(nr_pages, 1UL<<huge_page_order(&hstates[idx]));
+		mutex_lock(&hugetlb_limit_mutex);
 		ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
+		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
 		ret = -EINVAL;
@@ -306,10 +310,10 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		counter->watermark = atomic_long_read(&counter->count);
+		page_counter_reset_watermark(counter);
 		break;
 	case RES_FAILCNT:
-		counter->limited = 0;
+		counter->failcnt = 0;
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dfd3b15a57e8..9dec20b3c928 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,7 +65,7 @@
 
 #include <trace/events/vmscan.h>
 
-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+int page_counter_sub(struct page_counter *counter, unsigned long nr_pages)
 {
 	long new;
 
@@ -74,28 +74,41 @@ int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
 	if (WARN_ON(unlikely(new < 0)))
 		atomic_long_set(&counter->count, 0);
 
-	return new > 1;
+	return new > 0;
 }
 
-int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
-			struct page_counter **fail)
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
+{
+	struct page_counter *c;
+
+	for (c = counter; c; c = c->parent) {
+		long new;
+
+		new = atomic_long_add_return(nr_pages, &c->count);
+
+		if (new > c->watermark)
+			c->watermark = new;
+	}
+}
+
+int page_counter_try_charge(struct page_counter *counter,
+			    unsigned long nr_pages,
+			    struct page_counter **fail)
 {
 	struct page_counter *c;
 
 	for (c = counter; c; c = c->parent) {
 		for (;;) {
-			unsigned long count;
-			unsigned long new;
+			long count;
+			long new;
 
 			count = atomic_long_read(&c->count);
 
 			new = count + nr_pages;
 			if (new > c->limit) {
-				c->limited++;
-				if (fail) {
-					*fail = c;
-					goto failed;
-				}
+				c->failcnt++;
+				*fail = c;
+				goto failed;
 			}
 
 			if (atomic_long_cmpxchg(&c->count, count, new) != count)
@@ -111,7 +124,7 @@ int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
 
 failed:
 	for (c = counter; c != *fail; c = c->parent)
-		page_counter_cancel(c, nr_pages);
+		page_counter_sub(c, nr_pages);
 
 	return -ENOMEM;
 }
@@ -124,7 +137,7 @@ int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 	for (c = counter; c; c = c->parent) {
 		int remainder;
 
-		remainder = page_counter_cancel(c, nr_pages);
+		remainder = page_counter_sub(c, nr_pages);
 		if (c == counter && !remainder)
 			ret = 0;
 	}
@@ -135,8 +148,8 @@ int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 int page_counter_limit(struct page_counter *counter, unsigned long limit)
 {
 	for (;;) {
-		unsigned long count;
 		unsigned long old;
+		long count;
 
 		count = atomic_long_read(&counter->count);
 
@@ -751,7 +764,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 	 * This check can't live in kmem destruction function,
 	 * since the charges will outlive the cgroup
 	 */
-	WARN_ON(atomic_long_read(&memcg->kmem.count));
+	WARN_ON(page_counter_read(&memcg->kmem));
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -858,7 +871,7 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 
 static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
 {
-	unsigned long nr_pages = atomic_long_read(&memcg->memory.count);
+	unsigned long nr_pages = page_counter_read(&memcg->memory);
 	unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
 	unsigned long excess = 0;
 
@@ -1609,13 +1622,13 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 	unsigned long count;
 	unsigned long limit;
 
-	count = atomic_long_read(&memcg->memory.count);
+	count = page_counter_read(&memcg->memory);
 	limit = ACCESS_ONCE(memcg->memory.limit);
 	if (count < limit)
 		margin = limit - count;
 
 	if (do_swap_account) {
-		count = atomic_long_read(&memcg->memsw.count);
+		count = page_counter_read(&memcg->memsw);
 		limit = ACCESS_ONCE(memcg->memsw.limit);
 		if (count < limit)
 			margin = min(margin, limit - count);
@@ -1763,14 +1776,14 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 	rcu_read_unlock();
 
 	pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
-		K((u64)atomic_long_read(&memcg->memory.count)),
-		K((u64)memcg->memory.limit), memcg->memory.limited);
+		K((u64)page_counter_read(&memcg->memory)),
+		K((u64)memcg->memory.limit), memcg->memory.failcnt);
 	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
-		K((u64)atomic_long_read(&memcg->memsw.count)),
-		K((u64)memcg->memsw.limit), memcg->memsw.limited);
+		K((u64)page_counter_read(&memcg->memsw)),
+		K((u64)memcg->memsw.limit), memcg->memsw.failcnt);
 	pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
-		K((u64)atomic_long_read(&memcg->kmem.count)),
-		K((u64)memcg->kmem.limit), memcg->kmem.limited);
+		K((u64)page_counter_read(&memcg->kmem)),
+		K((u64)memcg->kmem.limit), memcg->kmem.failcnt);
 
 	for_each_mem_cgroup_tree(iter, memcg) {
 		pr_info("Memory cgroup stats for ");
@@ -2604,10 +2617,10 @@ retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
 
-	if (!page_counter_charge(&memcg->memory, batch, &counter)) {
+	if (!page_counter_try_charge(&memcg->memory, batch, &counter)) {
 		if (!do_swap_account)
 			goto done_restock;
-		if (!page_counter_charge(&memcg->memsw, batch, &counter))
+		if (!page_counter_try_charge(&memcg->memsw, batch, &counter))
 			goto done_restock;
 		page_counter_uncharge(&memcg->memory, batch);
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
@@ -2877,7 +2890,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
 	struct page_counter *counter;
 	int ret = 0;
 
-	ret = page_counter_charge(&memcg->kmem, nr_pages, &counter);
+	ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
 	if (ret < 0)
 		return ret;
 
@@ -2898,9 +2911,9 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
 		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
-		page_counter_charge(&memcg->memory, nr_pages, NULL);
+		page_counter_charge(&memcg->memory, nr_pages);
 		if (do_swap_account)
-			page_counter_charge(&memcg->memsw, nr_pages, NULL);
+			page_counter_charge(&memcg->memsw, nr_pages);
 		ret = 0;
 	} else if (ret)
 		page_counter_uncharge(&memcg->kmem, nr_pages);
@@ -3558,9 +3571,9 @@ static int mem_cgroup_move_parent(struct page *page,
 				pc, child, parent);
 	if (!ret) {
 		/* Take charge off the local counters */
-		page_counter_cancel(&child->memory, nr_pages);
+		page_counter_sub(&child->memory, nr_pages);
 		if (do_swap_account)
-			page_counter_cancel(&child->memsw, nr_pages);
+			page_counter_sub(&child->memsw, nr_pages);
 	}
 
 	if (nr_pages > 1)
@@ -3665,7 +3678,7 @@ void mem_cgroup_print_bad_page(struct page *page)
 }
 #endif
 
-static DEFINE_MUTEX(set_limit_mutex);
+static DEFINE_MUTEX(memcg_limit_mutex);
 
 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 				   unsigned long limit)
@@ -3684,7 +3697,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
 		      mem_cgroup_count_children(memcg);
 
-	oldusage = atomic_long_read(&memcg->memory.count);
+	oldusage = page_counter_read(&memcg->memory);
 
 	do {
 		if (signal_pending(current)) {
@@ -3692,23 +3705,23 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			break;
 		}
 
-		mutex_lock(&set_limit_mutex);
+		mutex_lock(&memcg_limit_mutex);
 		if (limit > memcg->memsw.limit) {
-			mutex_unlock(&set_limit_mutex);
+			mutex_unlock(&memcg_limit_mutex);
 			ret = -EINVAL;
 			break;
 		}
 		if (limit > memcg->memory.limit)
 			enlarge = true;
 		ret = page_counter_limit(&memcg->memory, limit);
-		mutex_unlock(&set_limit_mutex);
+		mutex_unlock(&memcg_limit_mutex);
 
 		if (!ret)
 			break;
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
 
-		curusage = atomic_long_read(&memcg->memory.count);
+		curusage = page_counter_read(&memcg->memory);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
@@ -3735,7 +3748,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
 		      mem_cgroup_count_children(memcg);
 
-	oldusage = atomic_long_read(&memcg->memsw.count);
+	oldusage = page_counter_read(&memcg->memsw);
 
 	do {
 		if (signal_pending(current)) {
@@ -3743,23 +3756,23 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			break;
 		}
 
-		mutex_lock(&set_limit_mutex);
+		mutex_lock(&memcg_limit_mutex);
 		if (limit < memcg->memory.limit) {
-			mutex_unlock(&set_limit_mutex);
+			mutex_unlock(&memcg_limit_mutex);
 			ret = -EINVAL;
 			break;
 		}
 		if (limit > memcg->memsw.limit)
 			enlarge = true;
 		ret = page_counter_limit(&memcg->memsw, limit);
-		mutex_unlock(&set_limit_mutex);
+		mutex_unlock(&memcg_limit_mutex);
 
 		if (!ret)
 			break;
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
 
-		curusage = atomic_long_read(&memcg->memsw.count);
+		curusage = page_counter_read(&memcg->memsw);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
@@ -3960,8 +3973,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 		 * right after the check. RES_USAGE should be safe as we always
 		 * charge before adding to the LRU.
 		 */
-	} while (atomic_long_read(&memcg->memory.count) -
-		 atomic_long_read(&memcg->kmem.count) > 0);
+	} while (page_counter_read(&memcg->memory) -
+		 page_counter_read(&memcg->kmem) > 0);
 }
 
 /*
@@ -4001,7 +4014,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
-	while (nr_retries && atomic_long_read(&memcg->memory.count)) {
+	while (nr_retries && page_counter_read(&memcg->memory)) {
 		int progress;
 
 		if (signal_pending(current))
@@ -4098,9 +4111,9 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 			val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
 	} else {
 		if (!swap)
-			val = atomic_long_read(&memcg->memory.count);
+			val = page_counter_read(&memcg->memory);
 		else
-			val = atomic_long_read(&memcg->memsw.count);
+			val = page_counter_read(&memcg->memsw);
 	}
 	return val << PAGE_SHIFT;
 }
@@ -4139,13 +4152,13 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 			return mem_cgroup_usage(memcg, false);
 		if (counter == &memcg->memsw)
 			return mem_cgroup_usage(memcg, true);
-		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+		return (u64)page_counter_read(counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->limit * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
-		return counter->limited;
+		return counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return (u64)memcg->soft_limit * PAGE_SIZE;
 	default:
@@ -4234,10 +4247,12 @@ static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
 {
 	int ret;
 
+	mutex_lock(&memcg_limit_mutex);
 	if (!memcg_kmem_is_active(memcg))
 		ret = memcg_activate_kmem(memcg, limit);
 	else
 		ret = page_counter_limit(&memcg->kmem, limit);
+	mutex_unlock(&memcg_limit_mutex);
 	return ret;
 }
 
@@ -4255,7 +4270,7 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 	 * after this point, because it has at least one child already.
 	 */
 	if (memcg_kmem_is_active(parent))
-		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
+		ret = __memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
 	mutex_unlock(&activate_kmem_mutex);
 	return ret;
 }
@@ -4331,10 +4346,10 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		counter->watermark = atomic_long_read(&counter->count);
+		page_counter_reset_watermark(counter);
 		break;
 	case RES_FAILCNT:
-		counter->limited = 0;
+		counter->failcnt = 0;
 		break;
 	default:
 		BUG();
@@ -4934,7 +4949,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 
 	memcg_kmem_mark_dead(memcg);
 
-	if (atomic_long_read(&memcg->kmem.count))
+	if (page_counter_read(&memcg->kmem))
 		return;
 
 	if (memcg_kmem_test_and_clear_dead(memcg))
@@ -5603,7 +5618,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	 * call_rcu()
 	 *   offline_css()
 	 *     reparent_charges()
-	 *                           page_counter_charge()
+	 *                           page_counter_try_charge()
 	 *                           css_put()
 	 *                             css_free()
 	 *                           pc->mem_cgroup = dead memcg
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 9a448bdb19e9..272327134a1b 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -68,7 +68,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
 						sysctl_tcp_mem[i]);
 
-	if (nr_pages == ULONG_MAX / PAGE_SIZE)
+	if (nr_pages == PAGE_COUNTER_MAX)
 		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
 	else {
 		/*
@@ -106,6 +106,8 @@ enum {
 	RES_FAILCNT,
 };
 
+static DEFINE_MUTEX(tcp_limit_mutex);
+
 static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
@@ -121,7 +123,9 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 		ret = page_counter_memparse(buf, &nr_pages);
 		if (ret)
 			break;
+		mutex_lock(&tcp_limit_mutex);
 		ret = tcp_update_limit(memcg, nr_pages);
+		mutex_unlock(&tcp_limit_mutex);
 		break;
 	default:
 		ret = -EINVAL;
@@ -145,14 +149,15 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 		break;
 	case RES_USAGE:
 		if (!cg_proto)
-			return atomic_long_read(&tcp_memory_allocated);
-		val = atomic_long_read(&cg_proto->memory_allocated.count);
+			val = atomic_long_read(&tcp_memory_allocated);
+		else
+			val = page_counter_read(&cg_proto->memory_allocated);
 		val *= PAGE_SIZE;
 		break;
 	case RES_FAILCNT:
 		if (!cg_proto)
 			return 0;
-		val = cg_proto->memory_allocated.limited;
+		val = cg_proto->memory_allocated.failcnt;
 		break;
 	case RES_MAX_USAGE:
 		if (!cg_proto)
@@ -179,11 +184,10 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
 
 	switch (of_cft(of)->private) {
 	case RES_MAX_USAGE:
-		cg_proto->memory_allocated.watermark =
-			atomic_long_read(&cg_proto->memory_allocated.count);
+		page_counter_reset_watermark(&cg_proto->memory_allocated);
 		break;
 	case RES_FAILCNT:
-		cg_proto->memory_allocated.limited = 0;
+		cg_proto->memory_allocated.failcnt = 0;
 		break;
 	}
 
-- 
2.1.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
@ 2014-09-22 18:57     ` Johannes Weiner
  0 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-22 18:57 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

Hi Vladimir,

On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 19df5d857411..bf8fb1a05597 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> >  };
> >  
> >  #ifdef CONFIG_MEMCG
> > +
> > +struct page_counter {
> 
> I'd place it in a separate file, say
> 
> 	include/linux/page_counter.h
> 	mm/page_counter.c
> 
> just to keep mm/memcontrol.c clean.

The page counters are the very core of the memory controller and, as I
said to Michal, I want to integrate the hugetlb controller into memcg
as well, at which point there won't be any outside users anymore.  So
I think this is the right place for it.

> > +	atomic_long_t count;
> > +	unsigned long limit;
> > +	struct page_counter *parent;
> > +
> > +	/* legacy */
> > +	unsigned long watermark;
> > +	unsigned long limited;
> 
> IMHO, failcnt would fit better.

I never liked the failcnt name, but also have to admit that "limited"
is crap.  Let's leave it at failcnt for now.

> > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> 
> When I first saw this function, I couldn't realize by looking at its
> name what it's intended to do. I think
> 
> 	page_counter_cancel_local_charge()
> 
> would fit better.

It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?

> > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > +			struct page_counter **fail);
> > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> 
> Hmm, why not page_counter_set_limit?

Limit is used as a verb here, "to limit".  Getters and setters are
usually wrappers around unusual/complex data structure access, but
this function does a lot more, so I'm not fond of _set_limit().

> > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> >  					      unsigned long amt,
> >  					      int *parent_status)
> >  {
> > -	struct res_counter *fail;
> > -	int ret;
> > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> >  
> > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > -					amt << PAGE_SHIFT, &fail);
> > -	if (ret < 0)
> > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > +	    prot->memory_allocated.limit)
> 
> I don't like your equivalent of res_counter_charge_nofail.
> 
> Passing NULL to page_counter_charge might be useful if one doesn't have
> a back-off strategy, but still want to fail on hitting the limit. With
> your interface the user must pass something to the function then, which
> isn't convenient.
> 
> Besides, it depends on the internal implementation of the page_counter
> struct. I'd encapsulate this.

Thinking about this more, I don't like my version either; not because
of how @fail must always be passed, but because of how it changes the
behavior.  I changed the API to

void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
                            struct page_counter **fail);

We could make @fail optional in the try_charge(), but all callsites
pass it at this time, so for now I kept it mandatory for simplicity.

What do you think?

> > @@ -975,15 +975,8 @@ config CGROUP_CPUACCT
> >  	  Provides a simple Resource Controller for monitoring the
> >  	  total CPU consumed by the tasks in a cgroup.
> >  
> > -config RESOURCE_COUNTERS
> > -	bool "Resource counters"
> > -	help
> > -	  This option enables controller independent resource accounting
> > -	  infrastructure that works with cgroups.
> > -
> >  config MEMCG
> >  	bool "Memory Resource Controller for Control Groups"
> > -	depends on RESOURCE_COUNTERS
> >  	select EVENTFD
> >  	help
> >  	  Provides a memory resource controller that manages both anonymous
> > @@ -1051,7 +1044,7 @@ config MEMCG_KMEM
> >  
> >  config CGROUP_HUGETLB
> >  	bool "HugeTLB Resource Controller for Control Groups"
> > -	depends on RESOURCE_COUNTERS && HUGETLB_PAGE
> > +	depends on MEMCG && HUGETLB_PAGE
> 
> So now the hugetlb controller depends on memcg only because it needs the
> page_counter, which is supposed to be a kind of independent. Is that OK?

As mentioned before, I want to integrate the hugetlb controller into
memcg anyway, so this should be fine, and it keeps things simpler.

> > @@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> >  	if (unlikely(!h_cg))
> >  		return;
> >  	set_hugetlb_cgroup(page, NULL);
> > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> >  	return;
> >  }
> >  
> >  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> >  				    struct hugetlb_cgroup *h_cg)
> >  {
> > -	unsigned long csize = nr_pages * PAGE_SIZE;
> > -
> >  	if (hugetlb_cgroup_disabled() || !h_cg)
> >  		return;
> >  
> >  	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
> >  		return;
> >  
> > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> >  	return;
> >  }
> >  
> > +enum {
> > +	RES_USAGE,
> > +	RES_LIMIT,
> > +	RES_MAX_USAGE,
> > +	RES_FAILCNT,
> > +};
> > +
> >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> >  				   struct cftype *cft)
> >  {
> > -	int idx, name;
> > +	struct page_counter *counter;
> >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> >  
> > -	idx = MEMFILE_IDX(cft->private);
> > -	name = MEMFILE_ATTR(cft->private);
> > +	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> >  
> > -	return res_counter_read_u64(&h_cg->hugepage[idx], name);
> > +	switch (MEMFILE_ATTR(cft->private)) {
> > +	case RES_USAGE:
> > +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> 
> page_counter_read?
> 
> > +	case RES_LIMIT:
> > +		return (u64)counter->limit * PAGE_SIZE;
> 
> page_counter_get_limit?
> 
> > +	case RES_MAX_USAGE:
> > +		return (u64)counter->watermark * PAGE_SIZE;
> 
> page_counter_read_watermark?

I added page_counter_read() to abstract away the fact that we use a
signed counter internally, but it still returns the number of pages as
unsigned long.

The entire counter API is based on pages now, and only the userspace
interface translates back and forth into bytes, so that's where the
translation should be located.

> >  static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >  				    char *buf, size_t nbytes, loff_t off)
> >  {
> > -	int idx, name, ret;
> > -	unsigned long long val;
> > +	int ret, idx;
> > +	unsigned long nr_pages;
> >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> >  
> > +	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> > +		return -EINVAL;
> > +
> >  	buf = strstrip(buf);
> > +	ret = page_counter_memparse(buf, &nr_pages);
> > +	if (ret)
> > +		return ret;
> > +
> >  	idx = MEMFILE_IDX(of_cft(of)->private);
> > -	name = MEMFILE_ATTR(of_cft(of)->private);
> >  
> > -	switch (name) {
> > +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> >  	case RES_LIMIT:
> > -		if (hugetlb_cgroup_is_root(h_cg)) {
> > -			/* Can't set limit on root */
> > -			ret = -EINVAL;
> > -			break;
> > -		}
> > -		/* This function does all necessary parse...reuse it */
> > -		ret = res_counter_memparse_write_strategy(buf, &val);
> > -		if (ret)
> > -			break;
> > -		val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx]));
> > -		ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> > +		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));
> 
> This is incorrect. Here we should have something like:
>
> 	nr_pages = ALIGN(nr_pages, 1UL << huge_page_order(&hstates[idx]));

Good catch, thanks.

> > @@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
> >  				    char *buf, size_t nbytes, loff_t off)
> >  {
> > -	int idx, name, ret = 0;
> > +	int ret = 0;
> > +	struct page_counter *counter;
> >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> >  
> > -	idx = MEMFILE_IDX(of_cft(of)->private);
> > -	name = MEMFILE_ATTR(of_cft(of)->private);
> > +	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> >  
> > -	switch (name) {
> > +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> >  	case RES_MAX_USAGE:
> > -		res_counter_reset_max(&h_cg->hugepage[idx]);
> > +		counter->watermark = atomic_long_read(&counter->count);
> 
> page_counter_reset_watermark?

Yes, that operation deserves a wrapper.

> >  		break;
> >  	case RES_FAILCNT:
> > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > +		counter->limited = 0;
> 
> page_counter_reset_failcnt?

That would be more obscure than counter->failcnt = 0, I think.

> > @@ -66,6 +65,117 @@
> >  
> >  #include <trace/events/vmscan.h>
> >  
> > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> > +{
> > +	long new;
> > +
> > +	new = atomic_long_sub_return(nr_pages, &counter->count);
> > +
> > +	if (WARN_ON(unlikely(new < 0)))
> 
> Max value on 32 bit is ULONG_MAX, right? Then the WARN_ON is incorrect.

Since this is a page counter, this would only overflow at 8 terabytes
(2^31 4k pages) on 32 bit.  So even though the maximum is ULONG_MAX, in
practice we should never even reach LONG_MAX, and ULONG_MAX was only
chosen for backward compatibility with the default unlimited value.

This is actually not the only place that assumes we never go negative;
the userspace read functions' u64 cast of a long would sign-extend any
negative value and return ludicrous numbers.

But thinking about this some more, it's probably not worth having these
gotchas in the code just to maintain the default unlimited value.  I
changed PAGE_COUNTER_MAX to LONG_MAX on 32 bit and LONG_MAX / PAGE_SIZE
on 64 bit.

> > +		atomic_long_set(&counter->count, 0);
> > +
> > +	return new > 1;
> 
> Nobody outside page_counter internals uses this retval. Why is it public
> then?

kmemcg still uses this for the pinning trick, but I'll update the
patch that removes it to also change the interface.

> BTW, why not new > 0?

That's a plain bug - probably a left-over from rephrasing this
underflow test several times.  Thanks for catching.

> > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > +			struct page_counter **fail)
> > +{
> > +	struct page_counter *c;
> > +
> > +	for (c = counter; c; c = c->parent) {
> > +		for (;;) {
> > +			unsigned long count;
> > +			unsigned long new;
> > +
> > +			count = atomic_long_read(&c->count);
> > +
> > +			new = count + nr_pages;
> > +			if (new > c->limit) {
> > +				c->limited++;
> > +				if (fail) {
> 
> So we increase 'limited' even if ain't limited. Sounds weird.

The old code actually did that too, but I removed it now in the
transition to separate charge and try_charge functions.

> > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > +{
> > +	for (;;) {
> > +		unsigned long count;
> > +		unsigned long old;
> > +
> > +		count = atomic_long_read(&counter->count);
> > +
> > +		old = xchg(&counter->limit, limit);
> > +
> > +		if (atomic_long_read(&counter->count) != count) {
> > +			counter->limit = old;
> 
> I wonder what can happen if two threads execute this function
> concurrently... or may be it's not supposed to be smp-safe?

memcg already holds the set_limit_mutex here.  I updated the tcp and
hugetlb controllers accordingly to take limit locks as well.

> > @@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
> >   */
> >  static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >  {
> > -	unsigned long long margin;
> 
> Why is it still ULL?

Hm?  This is a removal.  Too many longs...

> > @@ -4155,13 +4255,13 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
> >  	 * after this point, because it has at least one child already.
> >  	 */
> >  	if (memcg_kmem_is_active(parent))
> > -		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
> > +		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
> 
> PAGE_COUNTER_MAX?

Good catch, thanks.  That was a left-over from several iterations of
trying to find a value that's suitable for both 32 bit and 64 bit.

> > @@ -60,20 +60,17 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
> >  	if (!cg_proto)
> >  		return -EINVAL;
> >  
> > -	if (val > RES_COUNTER_MAX)
> > -		val = RES_COUNTER_MAX;
> > -
> > -	ret = res_counter_set_limit(&cg_proto->memory_allocated, val);
> > +	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
> >  	if (ret)
> >  		return ret;
> >  
> >  	for (i = 0; i < 3; i++)
> > -		cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT,
> > +		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
> >  						sysctl_tcp_mem[i]);
> >  
> > -	if (val == RES_COUNTER_MAX)
> > +	if (nr_pages == ULONG_MAX / PAGE_SIZE)
> 
> PAGE_COUNTER_MAX?

Same.

> > @@ -126,43 +130,35 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
> >  	return ret ?: nbytes;
> >  }
> >  
> > -static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
> > -{
> > -	struct cg_proto *cg_proto;
> > -
> > -	cg_proto = tcp_prot.proto_cgroup(memcg);
> > -	if (!cg_proto)
> > -		return default_val;
> > -
> > -	return res_counter_read_u64(&cg_proto->memory_allocated, type);
> > -}
> > -
> > -static u64 tcp_read_usage(struct mem_cgroup *memcg)
> > -{
> > -	struct cg_proto *cg_proto;
> > -
> > -	cg_proto = tcp_prot.proto_cgroup(memcg);
> > -	if (!cg_proto)
> > -		return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
> > -
> > -	return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE);
> > -}
> > -
> >  static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
> >  {
> >  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
> >  	u64 val;
> >  
> >  	switch (cft->private) {
> >  	case RES_LIMIT:
> > -		val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX);
> > +		if (!cg_proto)
> > +			return PAGE_COUNTER_MAX;
> > +		val = cg_proto->memory_allocated.limit;
> > +		val *= PAGE_SIZE;
> >  		break;
> >  	case RES_USAGE:
> > -		val = tcp_read_usage(memcg);
> > +		if (!cg_proto)
> > +			return atomic_long_read(&tcp_memory_allocated);
> 
> Forgot << PAGE_SHIFT?

Yes, indeed.

Thanks for your thorough review, Vladimir!

Here is the delta patch:

---
From 3601538347af756b9bfe05142ab21b5ae44f8cdd Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 22 Sep 2014 13:54:24 -0400
Subject: [patch] mm: memcontrol: lockless page counters fix

- renamed limited to failcnt again [vladimir]
- base page counter range on LONG_MAX [vladimir]
- page_counter_read() [vladimir]
- page_counter_sub() [vladimir]
- rework the nofail charging [vladimir]
- page_counter_reset_watermark() [vladimir]
- fixed hugepage limit page alignment [vladimir]
- fixed page_counter_sub() return value [vladimir]
- fixed kmem's idea of unlimited [vladimir]
- fixed tcp memcontrol's idea of unlimited [vladimir]
- fixed tcp memcontrol's usage reporting [vladimir]
- serialize page_counter_limit() callsites [vladimir]
---
 include/linux/memcontrol.h |  24 ++++++---
 include/net/sock.h         |   8 +--
 mm/hugetlb_cgroup.c        |  22 ++++----
 mm/memcontrol.c            | 123 +++++++++++++++++++++++++--------------------
 net/ipv4/tcp_memcontrol.c  |  18 ++++---
 5 files changed, 115 insertions(+), 80 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf8fb1a05597..a8b939376a5d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,13 +62,13 @@ struct page_counter {
 
 	/* legacy */
 	unsigned long watermark;
-	unsigned long limited;
+	unsigned long failcnt;
 };
 
 #if BITS_PER_LONG == 32
-#define PAGE_COUNTER_MAX ULONG_MAX
+#define PAGE_COUNTER_MAX LONG_MAX
 #else
-#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
+#define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE)
 #endif
 
 static inline void page_counter_init(struct page_counter *counter,
@@ -79,13 +79,25 @@ static inline void page_counter_init(struct page_counter *counter,
 	counter->parent = parent;
 }
 
-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
-int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
-			struct page_counter **fail);
+static inline unsigned long page_counter_read(struct page_counter *counter)
+{
+	return atomic_long_read(&counter->count);
+}
+
+int page_counter_sub(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_try_charge(struct page_counter *counter,
+			    unsigned long nr_pages,
+			    struct page_counter **fail);
 int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_limit(struct page_counter *counter, unsigned long limit);
 int page_counter_memparse(const char *buf, unsigned long *nr_pages);
 
+static inline void page_counter_reset_watermark(struct page_counter *counter)
+{
+	counter->watermark = page_counter_read(counter);
+}
+
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
diff --git a/include/net/sock.h b/include/net/sock.h
index f41749982668..9aa435de3ef1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1217,9 +1217,9 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
 					      unsigned long amt,
 					      int *parent_status)
 {
-	page_counter_charge(&prot->memory_allocated, amt, NULL);
+	page_counter_charge(&prot->memory_allocated, amt);
 
-	if (atomic_long_read(&prot->memory_allocated.count) >
+	if (page_counter_read(&prot->memory_allocated) >
 	    prot->memory_allocated.limit)
 		*parent_status = OVER_LIMIT;
 }
@@ -1236,7 +1236,7 @@ sk_memory_allocated(const struct sock *sk)
 	struct proto *prot = sk->sk_prot;
 
 	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
+		return page_counter_read(&sk->sk_cgrp->memory_allocated);
 
 	return atomic_long_read(prot->memory_allocated);
 }
@@ -1250,7 +1250,7 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
 		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
 		/* update the root cgroup regardless */
 		atomic_long_add_return(amt, prot->memory_allocated);
-		return atomic_long_read(&sk->sk_cgrp->memory_allocated.count);
+		return page_counter_read(&sk->sk_cgrp->memory_allocated);
 	}
 
 	return atomic_long_add_return(amt, prot->memory_allocated);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index e619b6b62f1f..abd1e8dc7b46 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -61,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 	int idx;
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-		if (atomic_long_read(&h_cg->hugepage[idx].count))
+		if (page_counter_read(&h_cg->hugepage[idx]))
 			return true;
 	}
 	return false;
@@ -127,11 +127,11 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
 	if (!parent) {
 		parent = root_h_cgroup;
 		/* root has no limit */
-		page_counter_charge(&parent->hugepage[idx], nr_pages, NULL);
+		page_counter_charge(&parent->hugepage[idx], nr_pages);
 	}
 	counter = &h_cg->hugepage[idx];
 	/* Take the pages off the local counter */
-	page_counter_cancel(counter, nr_pages);
+	page_counter_sub(counter, nr_pages);
 
 	set_hugetlb_cgroup(page, parent);
 out:
@@ -186,7 +186,7 @@ again:
 	}
 	rcu_read_unlock();
 
-	ret = page_counter_charge(&h_cg->hugepage[idx], nr_pages, &counter);
+	ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter);
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
@@ -254,18 +254,20 @@ static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 
 	switch (MEMFILE_ATTR(cft->private)) {
 	case RES_USAGE:
-		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+		return (u64)page_counter_read(counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->limit * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
-		return counter->limited;
+		return counter->failcnt;
 	default:
 		BUG();
 	}
 }
 
+static DEFINE_MUTEX(hugetlb_limit_mutex);
+
 static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
@@ -285,8 +287,10 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_LIMIT:
-		nr_pages = ALIGN(nr_pages, huge_page_shift(&hstates[idx]));
+		nr_pages = ALIGN(nr_pages, 1UL<<huge_page_order(&hstates[idx]));
+		mutex_lock(&hugetlb_limit_mutex);
 		ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages);
+		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
 		ret = -EINVAL;
@@ -306,10 +310,10 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		counter->watermark = atomic_long_read(&counter->count);
+		page_counter_reset_watermark(counter);
 		break;
 	case RES_FAILCNT:
-		counter->limited = 0;
+		counter->failcnt = 0;
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dfd3b15a57e8..9dec20b3c928 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,7 +65,7 @@
 
 #include <trace/events/vmscan.h>
 
-int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
+int page_counter_sub(struct page_counter *counter, unsigned long nr_pages)
 {
 	long new;
 
@@ -74,28 +74,41 @@ int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
 	if (WARN_ON(unlikely(new < 0)))
 		atomic_long_set(&counter->count, 0);
 
-	return new > 1;
+	return new > 0;
 }
 
-int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
-			struct page_counter **fail)
+void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
+{
+	struct page_counter *c;
+
+	for (c = counter; c; c = c->parent) {
+		long new;
+
+		new = atomic_long_add_return(nr_pages, &c->count);
+
+		if (new > c->watermark)
+			c->watermark = new;
+	}
+}
+
+int page_counter_try_charge(struct page_counter *counter,
+			    unsigned long nr_pages,
+			    struct page_counter **fail)
 {
 	struct page_counter *c;
 
 	for (c = counter; c; c = c->parent) {
 		for (;;) {
-			unsigned long count;
-			unsigned long new;
+			long count;
+			long new;
 
 			count = atomic_long_read(&c->count);
 
 			new = count + nr_pages;
 			if (new > c->limit) {
-				c->limited++;
-				if (fail) {
-					*fail = c;
-					goto failed;
-				}
+				c->failcnt++;
+				*fail = c;
+				goto failed;
 			}
 
 			if (atomic_long_cmpxchg(&c->count, count, new) != count)
@@ -111,7 +124,7 @@ int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
 
 failed:
 	for (c = counter; c != *fail; c = c->parent)
-		page_counter_cancel(c, nr_pages);
+		page_counter_sub(c, nr_pages);
 
 	return -ENOMEM;
 }
@@ -124,7 +137,7 @@ int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 	for (c = counter; c; c = c->parent) {
 		int remainder;
 
-		remainder = page_counter_cancel(c, nr_pages);
+		remainder = page_counter_sub(c, nr_pages);
 		if (c == counter && !remainder)
 			ret = 0;
 	}
@@ -135,8 +148,8 @@ int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 int page_counter_limit(struct page_counter *counter, unsigned long limit)
 {
 	for (;;) {
-		unsigned long count;
 		unsigned long old;
+		long count;
 
 		count = atomic_long_read(&counter->count);
 
@@ -751,7 +764,7 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 	 * This check can't live in kmem destruction function,
 	 * since the charges will outlive the cgroup
 	 */
-	WARN_ON(atomic_long_read(&memcg->kmem.count));
+	WARN_ON(page_counter_read(&memcg->kmem));
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -858,7 +871,7 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 
 static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
 {
-	unsigned long nr_pages = atomic_long_read(&memcg->memory.count);
+	unsigned long nr_pages = page_counter_read(&memcg->memory);
 	unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit);
 	unsigned long excess = 0;
 
@@ -1609,13 +1622,13 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 	unsigned long count;
 	unsigned long limit;
 
-	count = atomic_long_read(&memcg->memory.count);
+	count = page_counter_read(&memcg->memory);
 	limit = ACCESS_ONCE(memcg->memory.limit);
 	if (count < limit)
 		margin = limit - count;
 
 	if (do_swap_account) {
-		count = atomic_long_read(&memcg->memsw.count);
+		count = page_counter_read(&memcg->memsw);
 		limit = ACCESS_ONCE(memcg->memsw.limit);
 		if (count < limit)
 			margin = min(margin, limit - count);
@@ -1763,14 +1776,14 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 	rcu_read_unlock();
 
 	pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n",
-		K((u64)atomic_long_read(&memcg->memory.count)),
-		K((u64)memcg->memory.limit), memcg->memory.limited);
+		K((u64)page_counter_read(&memcg->memory)),
+		K((u64)memcg->memory.limit), memcg->memory.failcnt);
 	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n",
-		K((u64)atomic_long_read(&memcg->memsw.count)),
-		K((u64)memcg->memsw.limit), memcg->memsw.limited);
+		K((u64)page_counter_read(&memcg->memsw)),
+		K((u64)memcg->memsw.limit), memcg->memsw.failcnt);
 	pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n",
-		K((u64)atomic_long_read(&memcg->kmem.count)),
-		K((u64)memcg->kmem.limit), memcg->kmem.limited);
+		K((u64)page_counter_read(&memcg->kmem)),
+		K((u64)memcg->kmem.limit), memcg->kmem.failcnt);
 
 	for_each_mem_cgroup_tree(iter, memcg) {
 		pr_info("Memory cgroup stats for ");
@@ -2604,10 +2617,10 @@ retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
 
-	if (!page_counter_charge(&memcg->memory, batch, &counter)) {
+	if (!page_counter_try_charge(&memcg->memory, batch, &counter)) {
 		if (!do_swap_account)
 			goto done_restock;
-		if (!page_counter_charge(&memcg->memsw, batch, &counter))
+		if (!page_counter_try_charge(&memcg->memsw, batch, &counter))
 			goto done_restock;
 		page_counter_uncharge(&memcg->memory, batch);
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
@@ -2877,7 +2890,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
 	struct page_counter *counter;
 	int ret = 0;
 
-	ret = page_counter_charge(&memcg->kmem, nr_pages, &counter);
+	ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
 	if (ret < 0)
 		return ret;
 
@@ -2898,9 +2911,9 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
 		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
-		page_counter_charge(&memcg->memory, nr_pages, NULL);
+		page_counter_charge(&memcg->memory, nr_pages);
 		if (do_swap_account)
-			page_counter_charge(&memcg->memsw, nr_pages, NULL);
+			page_counter_charge(&memcg->memsw, nr_pages);
 		ret = 0;
 	} else if (ret)
 		page_counter_uncharge(&memcg->kmem, nr_pages);
@@ -3558,9 +3571,9 @@ static int mem_cgroup_move_parent(struct page *page,
 				pc, child, parent);
 	if (!ret) {
 		/* Take charge off the local counters */
-		page_counter_cancel(&child->memory, nr_pages);
+		page_counter_sub(&child->memory, nr_pages);
 		if (do_swap_account)
-			page_counter_cancel(&child->memsw, nr_pages);
+			page_counter_sub(&child->memsw, nr_pages);
 	}
 
 	if (nr_pages > 1)
@@ -3665,7 +3678,7 @@ void mem_cgroup_print_bad_page(struct page *page)
 }
 #endif
 
-static DEFINE_MUTEX(set_limit_mutex);
+static DEFINE_MUTEX(memcg_limit_mutex);
 
 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 				   unsigned long limit)
@@ -3684,7 +3697,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
 		      mem_cgroup_count_children(memcg);
 
-	oldusage = atomic_long_read(&memcg->memory.count);
+	oldusage = page_counter_read(&memcg->memory);
 
 	do {
 		if (signal_pending(current)) {
@@ -3692,23 +3705,23 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 			break;
 		}
 
-		mutex_lock(&set_limit_mutex);
+		mutex_lock(&memcg_limit_mutex);
 		if (limit > memcg->memsw.limit) {
-			mutex_unlock(&set_limit_mutex);
+			mutex_unlock(&memcg_limit_mutex);
 			ret = -EINVAL;
 			break;
 		}
 		if (limit > memcg->memory.limit)
 			enlarge = true;
 		ret = page_counter_limit(&memcg->memory, limit);
-		mutex_unlock(&set_limit_mutex);
+		mutex_unlock(&memcg_limit_mutex);
 
 		if (!ret)
 			break;
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
 
-		curusage = atomic_long_read(&memcg->memory.count);
+		curusage = page_counter_read(&memcg->memory);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
@@ -3735,7 +3748,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	retry_count = MEM_CGROUP_RECLAIM_RETRIES *
 		      mem_cgroup_count_children(memcg);
 
-	oldusage = atomic_long_read(&memcg->memsw.count);
+	oldusage = page_counter_read(&memcg->memsw);
 
 	do {
 		if (signal_pending(current)) {
@@ -3743,23 +3756,23 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 			break;
 		}
 
-		mutex_lock(&set_limit_mutex);
+		mutex_lock(&memcg_limit_mutex);
 		if (limit < memcg->memory.limit) {
-			mutex_unlock(&set_limit_mutex);
+			mutex_unlock(&memcg_limit_mutex);
 			ret = -EINVAL;
 			break;
 		}
 		if (limit > memcg->memsw.limit)
 			enlarge = true;
 		ret = page_counter_limit(&memcg->memsw, limit);
-		mutex_unlock(&set_limit_mutex);
+		mutex_unlock(&memcg_limit_mutex);
 
 		if (!ret)
 			break;
 
 		try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
 
-		curusage = atomic_long_read(&memcg->memsw.count);
+		curusage = page_counter_read(&memcg->memsw);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
 			retry_count--;
@@ -3960,8 +3973,8 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 		 * right after the check. RES_USAGE should be safe as we always
 		 * charge before adding to the LRU.
 		 */
-	} while (atomic_long_read(&memcg->memory.count) -
-		 atomic_long_read(&memcg->kmem.count) > 0);
+	} while (page_counter_read(&memcg->memory) -
+		 page_counter_read(&memcg->kmem) > 0);
 }
 
 /*
@@ -4001,7 +4014,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
-	while (nr_retries && atomic_long_read(&memcg->memory.count)) {
+	while (nr_retries && page_counter_read(&memcg->memory)) {
 		int progress;
 
 		if (signal_pending(current))
@@ -4098,9 +4111,9 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 			val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP);
 	} else {
 		if (!swap)
-			val = atomic_long_read(&memcg->memory.count);
+			val = page_counter_read(&memcg->memory);
 		else
-			val = atomic_long_read(&memcg->memsw.count);
+			val = page_counter_read(&memcg->memsw);
 	}
 	return val << PAGE_SHIFT;
 }
@@ -4139,13 +4152,13 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 			return mem_cgroup_usage(memcg, false);
 		if (counter == &memcg->memsw)
 			return mem_cgroup_usage(memcg, true);
-		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
+		return (u64)page_counter_read(counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->limit * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
-		return counter->limited;
+		return counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return (u64)memcg->soft_limit * PAGE_SIZE;
 	default:
@@ -4234,10 +4247,12 @@ static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
 {
 	int ret;
 
+	mutex_lock(&memcg_limit_mutex);
 	if (!memcg_kmem_is_active(memcg))
 		ret = memcg_activate_kmem(memcg, limit);
 	else
 		ret = page_counter_limit(&memcg->kmem, limit);
+	mutex_unlock(&memcg_limit_mutex);
 	return ret;
 }
 
@@ -4255,7 +4270,7 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 	 * after this point, because it has at least one child already.
 	 */
 	if (memcg_kmem_is_active(parent))
-		ret = __memcg_activate_kmem(memcg, ULONG_MAX);
+		ret = __memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
 	mutex_unlock(&activate_kmem_mutex);
 	return ret;
 }
@@ -4331,10 +4346,10 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
-		counter->watermark = atomic_long_read(&counter->count);
+		page_counter_reset_watermark(counter);
 		break;
 	case RES_FAILCNT:
-		counter->limited = 0;
+		counter->failcnt = 0;
 		break;
 	default:
 		BUG();
@@ -4934,7 +4949,7 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 
 	memcg_kmem_mark_dead(memcg);
 
-	if (atomic_long_read(&memcg->kmem.count))
+	if (page_counter_read(&memcg->kmem))
 		return;
 
 	if (memcg_kmem_test_and_clear_dead(memcg))
@@ -5603,7 +5618,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	 * call_rcu()
 	 *   offline_css()
 	 *     reparent_charges()
-	 *                           page_counter_charge()
+	 *                           page_counter_try_charge()
 	 *                           css_put()
 	 *                             css_free()
 	 *                           pc->mem_cgroup = dead memcg
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 9a448bdb19e9..272327134a1b 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -68,7 +68,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
 						sysctl_tcp_mem[i]);
 
-	if (nr_pages == ULONG_MAX / PAGE_SIZE)
+	if (nr_pages == PAGE_COUNTER_MAX)
 		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
 	else {
 		/*
@@ -106,6 +106,8 @@ enum {
 	RES_FAILCNT,
 };
 
+static DEFINE_MUTEX(tcp_limit_mutex);
+
 static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
@@ -121,7 +123,9 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 		ret = page_counter_memparse(buf, &nr_pages);
 		if (ret)
 			break;
+		mutex_lock(&tcp_limit_mutex);
 		ret = tcp_update_limit(memcg, nr_pages);
+		mutex_unlock(&tcp_limit_mutex);
 		break;
 	default:
 		ret = -EINVAL;
@@ -145,14 +149,15 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 		break;
 	case RES_USAGE:
 		if (!cg_proto)
-			return atomic_long_read(&tcp_memory_allocated);
-		val = atomic_long_read(&cg_proto->memory_allocated.count);
+			val = atomic_long_read(&tcp_memory_allocated);
+		else
+			val = page_counter_read(&cg_proto->memory_allocated);
 		val *= PAGE_SIZE;
 		break;
 	case RES_FAILCNT:
 		if (!cg_proto)
 			return 0;
-		val = cg_proto->memory_allocated.limited;
+		val = cg_proto->memory_allocated.failcnt;
 		break;
 	case RES_MAX_USAGE:
 		if (!cg_proto)
@@ -179,11 +184,10 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
 
 	switch (of_cft(of)->private) {
 	case RES_MAX_USAGE:
-		cg_proto->memory_allocated.watermark =
-			atomic_long_read(&cg_proto->memory_allocated.count);
+		page_counter_reset_watermark(&cg_proto->memory_allocated);
 		break;
 	case RES_FAILCNT:
-		cg_proto->memory_allocated.limited = 0;
+		cg_proto->memory_allocated.failcnt = 0;
 		break;
 	}
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-22 17:28       ` Michal Hocko
@ 2014-09-22 19:58         ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-22 19:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Mon, Sep 22, 2014 at 07:28:00PM +0200, Michal Hocko wrote:
> On Mon 22-09-14 11:50:49, Johannes Weiner wrote:
> > On Mon, Sep 22, 2014 at 04:44:36PM +0200, Michal Hocko wrote:
> > > On Fri 19-09-14 09:22:08, Johannes Weiner wrote:
> [...]
> > > Nevertheless I think that the counter should live outside of memcg (it
> > > is ugly and bad in general to make HUGETLB controller depend on MEMCG
> > > just to have a counter). If you made kernel/page_counter.c and led both
> > > containers select CONFIG_PAGE_COUNTER then you do not need a dependency
> > > on MEMCG and I would find it cleaner in general.
> > 
> > The reason I did it this way is because the hugetlb controller simply
> > accounts and limits a certain type of memory and in the future I would
> > like to make it a memcg extension, just like kmem and swap.
> 
> I am not sure this is the right way to go. Hugetlb has always been
> "special" and I do not see any advantage to pull its specialness into
> memcg proper.
>
> It would just make the code more complicated. I can also imagine
> users who simply do not want to pay memcg overhead and use only
> hugetlb controller.

We already group user memory, kernel memory, and swap space together,
so what makes hugetlb-backed memory special?

It's much easier to organize the code if all those closely related
things are grouped together.  It's also better for the user interface
to have a single memory controller.

We're also close to the point where we don't differentiate between the
root group and dedicated groups in terms of performance; Dave's tests
only fell apart at fairly high concurrency, and I'm already getting rid
of the lock he saw contended.

The downsides of fragmenting our configuration- and testspace, our
user interface, and our code base by far outweigh the benefits of
offering a dedicated hugetlb controller.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-19 13:22 ` Johannes Weiner
@ 2014-09-23  7:46   ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 53+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-23  7:46 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm
  Cc: Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

(2014/09/19 22:22), Johannes Weiner wrote:
> Memory is internally accounted in bytes, using spinlock-protected
> 64-bit counters, even though the smallest accounting delta is a page.
> The counter interface is also convoluted and does too many things.
> 
> Introduce a new lockless word-sized page counter API, then change all
> memory accounting over to it and remove the old one.  The translation
> from and to bytes then only happens when interfacing with userspace.
> 
> Aside from the locking costs, this gets rid of the icky unsigned long
> long types in the very heart of memcg, which is great for 32 bit and
> also makes the code a lot more readable.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I like this patch because I hate res_counter very much.

a few nitpick comments..

<snip>

> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 19df5d857411..bf8fb1a05597 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
>   };
>   
>   #ifdef CONFIG_MEMCG
> +
> +struct page_counter {
> +	atomic_long_t count;
> +	unsigned long limit;
> +	struct page_counter *parent;
> +
> +	/* legacy */
> +	unsigned long watermark;
> +	unsigned long limited;
> +};

I guess all attributes should be on the same cache line. How about aligning
this to a cache line?
And the legacy values do not really need to be atomic, by design, right?
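
Just to illustrate what I mean (a rough sketch, assuming
____cacheline_aligned_in_smp is acceptable here):

	struct page_counter {
		atomic_long_t count;
		unsigned long limit;
		struct page_counter *parent;

		/* legacy */
		unsigned long watermark;
		unsigned long limited;
	} ____cacheline_aligned_in_smp;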

> +
> +#if BITS_PER_LONG == 32
> +#define PAGE_COUNTER_MAX ULONG_MAX
> +#else
> +#define PAGE_COUNTER_MAX (ULONG_MAX / PAGE_SIZE)
> +#endif
> +
<snip>

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..dfd3b15a57e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -25,7 +25,6 @@
>    * GNU General Public License for more details.
>    */
>   
> -#include <linux/res_counter.h>
>   #include <linux/memcontrol.h>
>   #include <linux/cgroup.h>
>   #include <linux/mm.h>
> @@ -66,6 +65,117 @@
>   
>   #include <trace/events/vmscan.h>
>   
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	long new;
> +
> +	new = atomic_long_sub_return(nr_pages, &counter->count);
> +
> +	if (WARN_ON(unlikely(new < 0)))
> +		atomic_long_set(&counter->count, 0);

 WARN_ON_ONCE()?
 Or I would prefer atomic_long_add(nr_pages, &counter->count) rather than
 setting it to 0, because if a buggy call's "nr_pages" is big enough,
 subsequent calls to page_counter_cancel() will show more logs.
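
 IOW, something along these lines (only a sketch):

	new = atomic_long_sub_return(nr_pages, &counter->count);

	/* put the bogus amount back instead of clamping to 0 */
	if (WARN_ON(unlikely(new < 0)))
		atomic_long_add(nr_pages, &counter->count);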

> +
> +	return new > 1;
> +}
> +
> +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> +			struct page_counter **fail)
> +{
> +	struct page_counter *c;
> +
> +	for (c = counter; c; c = c->parent) {
> +		for (;;) {
> +			unsigned long count;
> +			unsigned long new;
> +
> +			count = atomic_long_read(&c->count);
> +
> +			new = count + nr_pages;
> +			if (new > c->limit) {
> +				c->limited++;
> +				if (fail) {
> +					*fail = c;
> +					goto failed;
> +				}
  seeing res_counter(), the ret code for this case should be -ENOMEM.
> +			}
> +
> +			if (atomic_long_cmpxchg(&c->count, count, new) != count)
> +				continue;
> +
> +			if (new > c->watermark)
> +				c->watermark = new;
> +
> +			break;
> +		}
> +	}
> +	return 0;
> +
> +failed:
> +	for (c = counter; c != *fail; c = c->parent)
> +		page_counter_cancel(c, nr_pages);
> +
> +	return -ENOMEM;
> +}
> +
> +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
> +{
> +	struct page_counter *c;
> +	int ret = 1;
> +
> +	for (c = counter; c; c = c->parent) {
> +		int remainder;
> +
> +		remainder = page_counter_cancel(c, nr_pages);
> +		if (c == counter && !remainder)
> +			ret = 0;
> +	}
> +
> +	return ret;
> +}
> +
> +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> +{
> +	for (;;) {
> +		unsigned long count;
> +		unsigned long old;
> +
> +		count = atomic_long_read(&counter->count);
> +
> +		old = xchg(&counter->limit, limit);
> +
> +		if (atomic_long_read(&counter->count) != count) {
> +			counter->limit = old;
> +			continue;
> +		}
> +
> +		if (count > limit) {
> +			counter->limit = old;
> +			return -EBUSY;
> +		}
> +
> +		return 0;
> +	}
> +}

I think the whole "updating limit" operation should be mutually exclusive.
It seems there will be trouble if multiple updaters come at once.
So, "xchg" isn't required; callers should have their own locks.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-22 18:57     ` Johannes Weiner
  (?)
@ 2014-09-23 11:06       ` Vladimir Davydov
  -1 siblings, 0 replies; 53+ messages in thread
From: Vladimir Davydov @ 2014-09-23 11:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 19df5d857411..bf8fb1a05597 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > >  };
> > >  
> > >  #ifdef CONFIG_MEMCG
> > > +
> > > +struct page_counter {
> > 
> > I'd place it in a separate file, say
> > 
> > 	include/linux/page_counter.h
> > 	mm/page_counter.c
> > 
> > just to keep mm/memcontrol.c clean.
> 
> The page counters are the very core of the memory controller and, as I
> said to Michal, I want to integrate the hugetlb controller into memcg
> as well, at which point there won't be any outside users anymore.  So
> I think this is the right place for it.

Hmm, there might be memcg users out there that don't want to pay for
hugetlb accounting. Or is the overhead supposed to be negligible?

Anyway, I still don't think it's a good idea to keep all the definitions
in the same file. memcontrol.c is already huge. Adding more code to it
is not desirable, especially if it can naturally live in a separate
file. And since the page_counter is independent of the memcg core and
*looks* generic, I believe we should keep it separately.

> > > +	atomic_long_t count;
> > > +	unsigned long limit;
> > > +	struct page_counter *parent;
> > > +
> > > +	/* legacy */
> > > +	unsigned long watermark;
> > > +	unsigned long limited;
> > 
> > IMHO, failcnt would fit better.
> 
> I never liked the failcnt name, but also have to admit that "limited"
> is crap.  Let's leave it at failcnt for now.
> 
> > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> > 
> > When I first saw this function, I couldn't realize by looking at its
> > name what it's intended to do. I think
> > 
> > 	page_counter_cancel_local_charge()
> > 
> > would fit better.
> 
> It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?

The _sub suffix doesn't match _charge/_uncharge. Maybe
page_counter_local_uncharge, or _uncharge_local?

> 
> > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > +			struct page_counter **fail);
> > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > 
> > Hmm, why not page_counter_set_limit?
> 
> Limit is used as a verb here, "to limit".  Getters and setters are
> usually wrappers around unusual/complex data structure access,

Not necessarily. Look at percpu_counter_read e.g. It's a one-line
getter, which we could easily live w/o, but still it's there.

> but this function does a lot more, so I'm not fond of _set_limit().

Nevertheless, everything it does can be perfectly described in one
sentence "it tries to set the new value of the limit", so it does
function as a setter. And if there's a setter, there must be a getter
IMO.
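
E.g. (hypothetical name, mirroring page_counter_read):

	static inline unsigned long page_counter_get_limit(struct page_counter *counter)
	{
		return ACCESS_ONCE(counter->limit);
	}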

> > > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> > >  					      unsigned long amt,
> > >  					      int *parent_status)
> > >  {
> > > -	struct res_counter *fail;
> > > -	int ret;
> > > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> > >  
> > > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > > -					amt << PAGE_SHIFT, &fail);
> > > -	if (ret < 0)
> > > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > > +	    prot->memory_allocated.limit)
> > 
> > I don't like your equivalent of res_counter_charge_nofail.
> > 
> > Passing NULL to page_counter_charge might be useful if one doesn't have
> > a back-off strategy, but still want to fail on hitting the limit. With
> > your interface the user must pass something to the function then, which
> > isn't convenient.
> > 
> > Besides, it depends on the internal implementation of the page_counter
> > struct. I'd encapsulate this.
> 
> Thinking about this more, I don't like my version either; not because
> of how @fail must always be passed, but because of how it changes the
> behavior.  I changed the API to
> 
> void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
>                             struct page_counter **fail);

That looks good to me. I would also add something like

  bool page_counter_exceeds_limit(struct page_counter *counter);

to use instead of this

+	if (atomic_long_read(&prot->memory_allocated.count) >
+	    prot->memory_allocated.limit)
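
which could be as trivial as (sketch):

	static inline bool page_counter_exceeds_limit(struct page_counter *counter)
	{
		return page_counter_read(counter) > counter->limit;
	}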

> We could make @fail optional in the try_charge(), but all callsites
> pass it at this time, so for now I kept it mandatory for simplicity.

It doesn't really matter to me. Both variants are fine.

[...]
> > > @@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> > >  	if (unlikely(!h_cg))
> > >  		return;
> > >  	set_hugetlb_cgroup(page, NULL);
> > > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> > >  	return;
> > >  }
> > >  
> > >  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> > >  				    struct hugetlb_cgroup *h_cg)
> > >  {
> > > -	unsigned long csize = nr_pages * PAGE_SIZE;
> > > -
> > >  	if (hugetlb_cgroup_disabled() || !h_cg)
> > >  		return;
> > >  
> > >  	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
> > >  		return;
> > >  
> > > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> > >  	return;
> > >  }
> > >  
> > > +enum {
> > > +	RES_USAGE,
> > > +	RES_LIMIT,
> > > +	RES_MAX_USAGE,
> > > +	RES_FAILCNT,
> > > +};
> > > +
> > >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> > >  				   struct cftype *cft)
> > >  {
> > > -	int idx, name;
> > > +	struct page_counter *counter;
> > >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> > >  
> > > -	idx = MEMFILE_IDX(cft->private);
> > > -	name = MEMFILE_ATTR(cft->private);
> > > +	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> > >  
> > > -	return res_counter_read_u64(&h_cg->hugepage[idx], name);
> > > +	switch (MEMFILE_ATTR(cft->private)) {
> > > +	case RES_USAGE:
> > > +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> > 
> > page_counter_read?
> > 
> > > +	case RES_LIMIT:
> > > +		return (u64)counter->limit * PAGE_SIZE;
> > 
> > page_counter_get_limit?
> > 
> > > +	case RES_MAX_USAGE:
> > > +		return (u64)counter->watermark * PAGE_SIZE;
> > 
> > page_counter_read_watermark?
> 
> I added page_counter_read() to abstract away the fact that we use a
> signed counter internally, but it still returns the number of pages as
> unsigned long.

That's exactly what I meant.

> The entire counter API is based on pages now, and only the userspace
> interface translates back and forth into bytes, so that's where the
> translation should be located.

Absolutely right.

[...]
> > > @@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> > >  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
> > >  				    char *buf, size_t nbytes, loff_t off)
> > >  {
> > > -	int idx, name, ret = 0;
> > > +	int ret = 0;
> > > +	struct page_counter *counter;
> > >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> > >  
> > > -	idx = MEMFILE_IDX(of_cft(of)->private);
> > > -	name = MEMFILE_ATTR(of_cft(of)->private);
> > > +	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> > >  
> > > -	switch (name) {
> > > +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> > >  	case RES_MAX_USAGE:
> > > -		res_counter_reset_max(&h_cg->hugepage[idx]);
> > > +		counter->watermark = atomic_long_read(&counter->count);
> > 
> > page_counter_reset_watermark?
> 
> Yes, that operation deserves a wrapper.
> 
> > >  		break;
> > >  	case RES_FAILCNT:
> > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > +		counter->limited = 0;
> > 
> > page_counter_reset_failcnt?
> 
> That would be more obscure than counter->failcnt = 0, I think.

There's one thing that bothers me about this patch. Before, all the
functions operating on res_counter were mutually smp-safe; now they
aren't. E.g. if the failcnt reset races with the failcnt increment from
page_counter_try_charge, the reset might be skipped. You only use the
atomic type for the counter, but my guess is that failcnt and watermark
should be atomic too, at least if we're not going to get rid of them
soon. Otherwise, it should be clearly stated that failcnt and watermark
are racy.

Anyway, that's where the usefulness of res_counter_reset_failcnt
reveals itself. If one decides to make it race-free one day, they won't have to
modify code outside the page_counter definition.
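
E.g. a trivial wrapper would do (sketch, name taken from the res_counter
counterpart):

	static inline void page_counter_reset_failcnt(struct page_counter *counter)
	{
		counter->failcnt = 0;
	}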

> > > @@ -66,6 +65,117 @@
> > >  
> > >  #include <trace/events/vmscan.h>
> > >  
> > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> > > +{
> > > +	long new;
> > > +
> > > +	new = atomic_long_sub_return(nr_pages, &counter->count);
> > > +
> > > +	if (WARN_ON(unlikely(new < 0)))
> > 
> > Max value on 32 bit is ULONG_MAX, right? Then the WARN_ON is incorrect.
> 
> Since this is a page counter, this would overflow at 8 petabyte on 32
> bit.  So even though the maximum is ULONG_MAX, in practice we should
> never even reach LONG_MAX, and ULONG_MAX was only chosen for backward
> compatibility with the default unlimited value.
> 
> This is actually not the only place that assumes we never go negative;
> the userspace read functions' u64 cast of a long would sign-extend any
> negative value and return ludicrous numbers.
> 
> But thinking longer about this, it's probably not worth to have these
> gotchas in the code just to maintain the default unlimited value.  I
> changed PAGE_COUNTER_MAX to LONG_MAX and LONG_MAX / PAGE_SIZE, resp.

That sounds sane. We only have to maintain the user interface, not the
internal implementation.

> > > +		atomic_long_set(&counter->count, 0);
> > > +
> > > +	return new > 1;
> > 
> > Nobody outside page_counter internals uses this retval. Why is it public
> > then?
> 
> kmemcg still uses this for the pinning trick, but I'll update the
> patch that removes it to also change the interface.
> 
> > BTW, why not new > 0?
> 
> That's a plain bug - probably a left-over from rephrasing this
> underflow test several times.  Thanks for catching.
> 
> > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > +			struct page_counter **fail)
> > > +{
> > > +	struct page_counter *c;
> > > +
> > > +	for (c = counter; c; c = c->parent) {
> > > +		for (;;) {
> > > +			unsigned long count;
> > > +			unsigned long new;
> > > +
> > > +			count = atomic_long_read(&c->count);
> > > +
> > > +			new = count + nr_pages;
> > > +			if (new > c->limit) {
> > > +				c->limited++;
> > > +				if (fail) {
> > 
> > So we increase 'limited' even if ain't limited. Sounds weird.
> 
> The old code actually did that too, but I removed it now in the
> transition to separate charge and try_charge functions.
> 
> > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > +{
> > > +	for (;;) {
> > > +		unsigned long count;
> > > +		unsigned long old;
> > > +
> > > +		count = atomic_long_read(&counter->count);
> > > +
> > > +		old = xchg(&counter->limit, limit);
> > > +
> > > +		if (atomic_long_read(&counter->count) != count) {
> > > +			counter->limit = old;
> > 
> > I wonder what can happen if two threads execute this function
> > concurrently... or may be it's not supposed to be smp-safe?
> 
> memcg already holds the set_limit_mutex here.  I updated the tcp and
> hugetlb controllers accordingly to take limit locks as well.

I would prefer page_counter to handle it internally, because we won't
need the set_limit_mutex once memsw is converted to plain swap
accounting. Besides, memcg_update_kmem_limit doesn't take it. Any chance
to achieve that w/o spinlocks, using only atomic variables?

> > > @@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
> > >   */
> > >  static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > > -	unsigned long long margin;
> > 
> > Why is it still ULL?
> 
> Hm?  This is a removal.  Too many longs...

Sorry, I missed the minus sign.

[...]
> Here is the delta patch:
[...]

If I were you, I'd separate the patch introducing the page_counter API
and implementation from the rest. I think it'd ease the review.

A couple of extra notes about the patch:

 - I think having comments on the function definitions would be nice.

 - Your implementation of try_charge uses CAS, but this is a really
   costly operation (the most costly of all atomic primitives). Have
   you considered using FAA? Something like this:

   /* FAA instead of a CAS loop; back the charge out again on failure */
   static int try_charge(struct page_counter *pc, unsigned long nr)
   {
           unsigned long limit = pc->limit;
           long count;

           count = atomic_long_add_return(nr, &pc->count);
           if (count > limit) {
                   atomic_long_sub(nr, &pc->count);
                   return -ENOMEM;
           }
           return 0;
   }

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
@ 2014-09-23 11:06       ` Vladimir Davydov
  0 siblings, 0 replies; 53+ messages in thread
From: Vladimir Davydov @ 2014-09-23 11:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 19df5d857411..bf8fb1a05597 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > >  };
> > >  
> > >  #ifdef CONFIG_MEMCG
> > > +
> > > +struct page_counter {
> > 
> > I'd place it in a separate file, say
> > 
> > 	include/linux/page_counter.h
> > 	mm/page_counter.c
> > 
> > just to keep mm/memcontrol.c clean.
> 
> The page counters are the very core of the memory controller and, as I
> said to Michal, I want to integrate the hugetlb controller into memcg
> as well, at which point there won't be any outside users anymore.  So
> I think this is the right place for it.

Hmm, there might be memcg users out there that don't want to pay for
hugetlb accounting. Or is the overhead supposed to be negligible?

Anyway, I still don't think it's a good idea to keep all the definitions
in the same file. memcontrol.c is already huge. Adding more code to it
is not desirable, especially if it can naturally live in a separate
file. And since the page_counter is independent of the memcg core and
*looks* generic, I believe we should keep it separately.

> > > +	atomic_long_t count;
> > > +	unsigned long limit;
> > > +	struct page_counter *parent;
> > > +
> > > +	/* legacy */
> > > +	unsigned long watermark;
> > > +	unsigned long limited;
> > 
> > IMHO, failcnt would fit better.
> 
> I never liked the failcnt name, but also have to admit that "limited"
> is crap.  Let's leave it at failcnt for now.
> 
> > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> > 
> > When I first saw this function, I couldn't realize by looking at its
> > name what it's intended to do. I think
> > 
> > 	page_counter_cancel_local_charge()
> > 
> > would fit better.
> 
> It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?

The _sub suffix doesn't match _charge/_uncharge. Maybe
page_counter_local_uncharge, or _uncharge_local?

> 
> > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > +			struct page_counter **fail);
> > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > 
> > Hmm, why not page_counter_set_limit?
> 
> Limit is used as a verb here, "to limit".  Getters and setters are
> usually wrappers around unusual/complex data structure access,

Not necessarily. Look at percpu_counter_read e.g. It's a one-line
getter, which we could easily live w/o, but still it's there.

> but this function does a lot more, so I'm not fond of _set_limit().

Nevertheless, everything it does can be perfectly described in one
sentence "it tries to set the new value of the limit", so it does
function as a setter. And if there's a setter, there must be a getter
IMO.

> > > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> > >  					      unsigned long amt,
> > >  					      int *parent_status)
> > >  {
> > > -	struct res_counter *fail;
> > > -	int ret;
> > > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> > >  
> > > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > > -					amt << PAGE_SHIFT, &fail);
> > > -	if (ret < 0)
> > > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > > +	    prot->memory_allocated.limit)
> > 
> > I don't like your equivalent of res_counter_charge_nofail.
> > 
> > Passing NULL to page_counter_charge might be useful if one doesn't have
> > a back-off strategy, but still want to fail on hitting the limit. With
> > your interface the user must pass something to the function then, which
> > isn't convenient.
> > 
> > Besides, it depends on the internal implementation of the page_counter
> > struct. I'd encapsulate this.
> 
> Thinking about this more, I don't like my version either; not because
> of how @fail must always be passed, but because of how it changes the
> behavior.  I changed the API to
> 
> void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
>                             struct page_counter **fail);

That looks good to me. I would also add something like

  bool page_counter_exceeds_limit(struct page_counter *counter);

to use instead of this

+	if (atomic_long_read(&prot->memory_allocated.count) >
+	    prot->memory_allocated.limit)

> We could make @fail optional in the try_charge(), but all callsites
> pass it at this time, so for now I kept it mandatory for simplicity.

It doesn't really matter to me. Both variants are fine.

[...]
> > > @@ -222,61 +220,73 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> > >  	if (unlikely(!h_cg))
> > >  		return;
> > >  	set_hugetlb_cgroup(page, NULL);
> > > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> > >  	return;
> > >  }
> > >  
> > >  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> > >  				    struct hugetlb_cgroup *h_cg)
> > >  {
> > > -	unsigned long csize = nr_pages * PAGE_SIZE;
> > > -
> > >  	if (hugetlb_cgroup_disabled() || !h_cg)
> > >  		return;
> > >  
> > >  	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
> > >  		return;
> > >  
> > > -	res_counter_uncharge(&h_cg->hugepage[idx], csize);
> > > +	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
> > >  	return;
> > >  }
> > >  
> > > +enum {
> > > +	RES_USAGE,
> > > +	RES_LIMIT,
> > > +	RES_MAX_USAGE,
> > > +	RES_FAILCNT,
> > > +};
> > > +
> > >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> > >  				   struct cftype *cft)
> > >  {
> > > -	int idx, name;
> > > +	struct page_counter *counter;
> > >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> > >  
> > > -	idx = MEMFILE_IDX(cft->private);
> > > -	name = MEMFILE_ATTR(cft->private);
> > > +	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> > >  
> > > -	return res_counter_read_u64(&h_cg->hugepage[idx], name);
> > > +	switch (MEMFILE_ATTR(cft->private)) {
> > > +	case RES_USAGE:
> > > +		return (u64)atomic_long_read(&counter->count) * PAGE_SIZE;
> > 
> > page_counter_read?
> > 
> > > +	case RES_LIMIT:
> > > +		return (u64)counter->limit * PAGE_SIZE;
> > 
> > page_counter_get_limit?
> > 
> > > +	case RES_MAX_USAGE:
> > > +		return (u64)counter->watermark * PAGE_SIZE;
> > 
> > page_counter_read_watermark?
> 
> I added page_counter_read() to abstract away the fact that we use a
> signed counter internally, but it still returns the number of pages as
> unsigned long.

That's exactly what I meant.

> The entire counter API is based on pages now, and only the userspace
> interface translates back and forth into bytes, so that's where the
> translation should be located.

Absolutely right.
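
So the read side would be a trivial helper plus the existing byte
conversion at the cgroup file level, i.e. something like (a sketch,
not your actual delta):

    static inline unsigned long page_counter_read(struct page_counter *counter)
    {
        /* the count is a signed long internally, but never negative */
        return atomic_long_read(&counter->count);
    }

    /* hugetlb_cgroup_read_u64(), RES_USAGE case */
    return (u64)page_counter_read(counter) * PAGE_SIZE;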

[...]
> > > @@ -288,18 +298,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> > >  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
> > >  				    char *buf, size_t nbytes, loff_t off)
> > >  {
> > > -	int idx, name, ret = 0;
> > > +	int ret = 0;
> > > +	struct page_counter *counter;
> > >  	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> > >  
> > > -	idx = MEMFILE_IDX(of_cft(of)->private);
> > > -	name = MEMFILE_ATTR(of_cft(of)->private);
> > > +	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> > >  
> > > -	switch (name) {
> > > +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> > >  	case RES_MAX_USAGE:
> > > -		res_counter_reset_max(&h_cg->hugepage[idx]);
> > > +		counter->watermark = atomic_long_read(&counter->count);
> > 
> > page_counter_reset_watermark?
> 
> Yes, that operation deserves a wrapper.
> 
> > >  		break;
> > >  	case RES_FAILCNT:
> > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > +		counter->limited = 0;
> > 
> > page_counter_reset_failcnt?
> 
> That would be more obscure than counter->failcnt = 0, I think.

There's one thing that bothers me about this patch. Before, all the
functions operating on res_counter were mutually smp-safe, now they
aren't. E.g. if the failcnt reset races with the failcnt increment from
page_counter_try_charge, the reset might be skipped. You only use the
atomic type for the counter, but my guess is that failcnt and watermark
should be atomic too, at least if we're not going to get rid of them
soon. Otherwise, it should be clearly stated that failcnt and watermark
are racy.

Anyway, that's where the usefulness of res_counter_reset_failcnt
shows. If one decides to make it race-free one day, they won't have to
modify code outside the page_counter definition.
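
Those wrappers would be one-liners anyway, e.g. (a sketch, assuming
the limited -> failcnt rename discussed above):

    static inline void page_counter_reset_watermark(struct page_counter *counter)
    {
        /* snapshot the current usage as the new high-water mark */
        counter->watermark = atomic_long_read(&counter->count);
    }

    static inline void page_counter_reset_failcnt(struct page_counter *counter)
    {
        counter->failcnt = 0;
    }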

> > > @@ -66,6 +65,117 @@
> > >  
> > >  #include <trace/events/vmscan.h>
> > >  
> > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> > > +{
> > > +	long new;
> > > +
> > > +	new = atomic_long_sub_return(nr_pages, &counter->count);
> > > +
> > > +	if (WARN_ON(unlikely(new < 0)))
> > 
> > Max value on 32 bit is ULONG_MAX, right? Then the WARN_ON is incorrect.
> 
> Since this is a page counter, this would overflow at 8 petabyte on 32
> bit.  So even though the maximum is ULONG_MAX, in practice we should
> never even reach LONG_MAX, and ULONG_MAX was only chosen for backward
> compatibility with the default unlimited value.
> 
> This is actually not the only place that assumes we never go negative;
> the userspace read functions' u64 cast of a long would sign-extend any
> negative value and return ludicrous numbers.
> 
> But thinking longer about this, it's probably not worth to have these
> gotchas in the code just to maintain the default unlimited value.  I
> changed PAGE_COUNTER_MAX to LONG_MAX and LONG_MAX / PAGE_SIZE, resp.

That sounds sane. We only have to maintain the user interface, not the
internal implementation.
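
IOW something like this, if I understand the "resp." correctly:

    /* default "unlimited" value that keeps the signed counter from overflowing */
    #if BITS_PER_LONG == 32
    #define PAGE_COUNTER_MAX	LONG_MAX
    #else
    #define PAGE_COUNTER_MAX	(LONG_MAX / PAGE_SIZE)
    #endif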

> > > +		atomic_long_set(&counter->count, 0);
> > > +
> > > +	return new > 1;
> > 
> > Nobody outside page_counter internals uses this retval. Why is it public
> > then?
> 
> kmemcg still uses this for the pinning trick, but I'll update the
> patch that removes it to also change the interface.
> 
> > BTW, why not new > 0?
> 
> That's a plain bug - probably a left-over from rephrasing this
> underflow test several times.  Thanks for catching.
> 
> > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > +			struct page_counter **fail)
> > > +{
> > > +	struct page_counter *c;
> > > +
> > > +	for (c = counter; c; c = c->parent) {
> > > +		for (;;) {
> > > +			unsigned long count;
> > > +			unsigned long new;
> > > +
> > > +			count = atomic_long_read(&c->count);
> > > +
> > > +			new = count + nr_pages;
> > > +			if (new > c->limit) {
> > > +				c->limited++;
> > > +				if (fail) {
> > 
> > So we increase 'limited' even if ain't limited. Sounds weird.
> 
> The old code actually did that too, but I removed it now in the
> transition to separate charge and try_charge functions.
> 
> > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > +{
> > > +	for (;;) {
> > > +		unsigned long count;
> > > +		unsigned long old;
> > > +
> > > +		count = atomic_long_read(&counter->count);
> > > +
> > > +		old = xchg(&counter->limit, limit);
> > > +
> > > +		if (atomic_long_read(&counter->count) != count) {
> > > +			counter->limit = old;
> > 
> > I wonder what can happen if two threads execute this function
> > concurrently... or may be it's not supposed to be smp-safe?
> 
> memcg already holds the set_limit_mutex here.  I updated the tcp and
> hugetlb controllers accordingly to take limit locks as well.

I would prefer page_counter to handle it internally, because we won't
need the set_limit_mutex once memsw is converted to plain swap
accounting. Besides, memcg_update_kmem_limit doesn't take it. Any chance
to achieve that w/o spinlocks, using only atomic variables?

> > > @@ -1490,12 +1605,23 @@ int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
> > >   */
> > >  static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > > -	unsigned long long margin;
> > 
> > Why is it still ULL?
> 
> Hm?  This is a removal.  Too many longs...

Sorry, I missed the minus sign.

[...]
> Here is the delta patch:
[...]

If I were you, I'd separate the patch introducing the page_counter API
and implementation from the rest. I think it'd ease the review.

A couple of extra notes about the patch:

 - I think having comments to function definitions would be nice.

 - Your implementation of try_charge uses CAS, but this is a really
   costly operation (the most costly of all atomic primitives). Have
   you considered using FAA? Something like this:

   try_charge(pc, nr):

     limit = pc->limit;
     count = atomic_add_return(&pc->count, nr);
     if (count > limit) {
         atomic_sub(&pc->count, nr);
         return -ENOMEM;
     }
     return 0;

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-22 19:58         ` Johannes Weiner
  (?)
@ 2014-09-23 13:25           ` Michal Hocko
  -1 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-23 13:25 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Mon 22-09-14 15:58:29, Johannes Weiner wrote:
> On Mon, Sep 22, 2014 at 07:28:00PM +0200, Michal Hocko wrote:
> > On Mon 22-09-14 11:50:49, Johannes Weiner wrote:
> > > On Mon, Sep 22, 2014 at 04:44:36PM +0200, Michal Hocko wrote:
> > > > On Fri 19-09-14 09:22:08, Johannes Weiner wrote:
> > [...]
> > > > Nevertheless I think that the counter should live outside of memcg (it
> > > > is ugly and bad in general to make HUGETLB controller depend on MEMCG
> > > > just to have a counter). If you made kernel/page_counter.c and led both
> > > > containers select CONFIG_PAGE_COUNTER then you do not need a dependency
> > > > on MEMCG and I would find it cleaner in general.
> > > 
> > > The reason I did it this way is because the hugetlb controller simply
> > > accounts and limits a certain type of memory and in the future I would
> > > like to make it a memcg extension, just like kmem and swap.
> > 
> > I am not sure this is the right way to go. Hugetlb has always been
> > "special" and I do not see any advantage to pull its specialness into
> > memcg proper.
> >
> > It would just make the code more complicated. I can also imagine
> > users who simply do not want to pay memcg overhead and use only
> > hugetlb controller.
> 
> We already group user memory, kernel memory, and swap space together,
> what makes hugetlb-backed memory special?

There is only a little overlap between LRU backed and kmem accounted
memory with hugetlb which has always been standing aside from the rest
of the memory management code (THP being a successor which fits in much
better and which is already covered by memcg). It has basically its own
code path for every aspect of its object life cycle and internal data
structures which are in many ways not compatible with regular user or
kmem memory. Merging the controllers would require merging hugetlb code
closer to the MM code. Until then it just doesn't make sense to me.

> It's much easier to organize the code if all those closely related
> things are grouped together.

Could you be more specific please? How would pulling in hugetlb details
help other !hugetlb code paths?

> It's also better for the user interface to have a single memory
> controller.

I have seen so much confusion coming from hugetlb vs. THP that I think
quite the opposite is true. Besides that we would need a separate limit
for hugetlb accounted memory anyway so having a small and specialized
controller for specialized memory sounds like a proper way to go.

Finally, as mentioned in previous email, you might have users interested
only in hugetlb controller with memcg disabled.

> We're also close to the point where we don't differentiate between the
> root group and dedicated groups in terms of performance, Dave's tests
> fell apart at fairly high concurrency, and I'm already getting rid of
> the lock he saw contended.

Sure but this has nothing to do with it. Hugetlb can safely use the same
lockless counter as a replacement for res_counter and benefit from it
even though the contention hasn't been seen/reported yet.

> The downsides of fragmenting our configuration- and testspace, our
> user interface, and our code base by far outweigh the benefits of
> offering a dedicated hugetlb controller.

Could you be more specific please? Hugetlb has to be configured and
tested separately whether it would be in a separate controller or not.

Last but not least, even if this turns out to make some sense in
the future please do not mix those things together here. Your
res_counter -> page_counter transition makes a lot of sense for both
controllers. And it is a huge improvement. I do not see any reason
to pull a conceptually nontrivial merging/dependency of two separate
controllers into the picture. If you think it makes some sense then
bring that up later for a separate discussion.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 11:06       ` Vladimir Davydov
@ 2014-09-23 13:28         ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-23 13:28 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue, Sep 23, 2014 at 03:06:34PM +0400, Vladimir Davydov wrote:
> On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> > On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 19df5d857411..bf8fb1a05597 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > > >  };
> > > >  
> > > >  #ifdef CONFIG_MEMCG
> > > > +
> > > > +struct page_counter {
> > > 
> > > I'd place it in a separate file, say
> > > 
> > > 	include/linux/page_counter.h
> > > 	mm/page_counter.c
> > > 
> > > just to keep mm/memcontrol.c clean.
> > 
> > The page counters are the very core of the memory controller and, as I
> > said to Michal, I want to integrate the hugetlb controller into memcg
> > as well, at which point there won't be any outside users anymore.  So
> > I think this is the right place for it.
> 
> Hmm, there might be memcg users out there that don't want to pay for
> hugetlb accounting. Or is the overhead supposed to be negligible?

Yes.  But if it gets in the way, it creates pressure to optimize it.
That's the same reason why I've been trying to integrate memcg into
the rest of the VM for over two years now - aside from resulting in
much more unified code, it forces us to compete, and it increases our
testing exposure by several orders of magnitude.

The only reason we are discussing lockless page counters right now is
because we got rid of "memcg specialness" and exposed res_counters to
the rest of the world; and boy did that instantly raise the bar on us.

> Anyway, I still don't think it's a good idea to keep all the definitions
> in the same file. memcontrol.c is already huge. Adding more code to it
> is not desirable, especially if it can naturally live in a separate
> file. And since the page_counter is independent of the memcg core and
> *looks* generic, I believe we should keep it separately.

It's less code than what I just deleted, and half of it seems
redundant when integrated into memcg.  This code would benefit a lot
from being part of memcg, and memcg could reduce its public API.

There are tangible costs associated with having a separate pile of
bitrot depend on our public interface.  Over 90% of the recent changes
to the hugetlb controller were done by Tejun as part of changing the
cgroup interfaces.  And I went through several WTFs switching that
code to the new page counter API.  The cost of maintaining a unified
codebase is negligible in comparison.

> > > > +	atomic_long_t count;
> > > > +	unsigned long limit;
> > > > +	struct page_counter *parent;
> > > > +
> > > > +	/* legacy */
> > > > +	unsigned long watermark;
> > > > +	unsigned long limited;
> > > 
> > > IMHO, failcnt would fit better.
> > 
> > I never liked the failcnt name, but also have to admit that "limited"
> > is crap.  Let's leave it at failcnt for now.
> > 
> > > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> > > 
> > > When I first saw this function, I couldn't realize by looking at its
> > > name what it's intended to do. I think
> > > 
> > > 	page_counter_cancel_local_charge()
> > > 
> > > would fit better.
> > 
> > It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?
> 
> The _sub suffix doesn't match _charge/_uncharge. May be
> page_counter_local_uncharge, or _uncharge_local?

I always think of a charge as the full hierarchical quantity, but this
function only clips that one counter and so anything "uncharge" sounds
terribly wrong to me.  But I can't think of anything great, either.

Any more ideas? :)

> > > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > > +			struct page_counter **fail);
> > > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > > 
> > > Hmm, why not page_counter_set_limit?
> > 
> > Limit is used as a verb here, "to limit".  Getters and setters are
> > usually wrappers around unusual/complex data structure access,
> 
> Not necessarily. Look at percpu_counter_read e.g. It's a one-line
> getter, which we could easily live w/o, but still it's there.

It abstracts an unusual and error-prone access to a counter value,
i.e. reading an unsigned quantity out of a signed variable.

> > but this function does a lot more, so I'm not fond of _set_limit().
> 
> Nevertheless, everything it does can be perfectly described in one
> sentence "it tries to set the new value of the limit", so it does
> function as a setter. And if there's a setter, there must be a getter
> IMO.

That's oversimplifying things.  Setting a limit requires enforcing a
whole bunch of rules and synchronization, whereas reading a limit is
accessing an unsigned long.

In general I agree that we should strive for symmetry and follow the
principle of least surprise, but in terms of complexity these two
operations are very different, and providing a getter on principle
would not actually improve readability in this case.

> > > > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> > > >  					      unsigned long amt,
> > > >  					      int *parent_status)
> > > >  {
> > > > -	struct res_counter *fail;
> > > > -	int ret;
> > > > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> > > >  
> > > > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > > > -					amt << PAGE_SHIFT, &fail);
> > > > -	if (ret < 0)
> > > > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > > > +	    prot->memory_allocated.limit)
> > > 
> > > I don't like your equivalent of res_counter_charge_nofail.
> > > 
> > > Passing NULL to page_counter_charge might be useful if one doesn't have
> > > a back-off strategy, but still want to fail on hitting the limit. With
> > > your interface the user must pass something to the function then, which
> > > isn't convenient.
> > > 
> > > Besides, it depends on the internal implementation of the page_counter
> > > struct. I'd encapsulate this.
> > 
> > Thinking about this more, I don't like my version either; not because
> > of how @fail must always be passed, but because of how it changes the
> > behavior.  I changed the API to
> > 
> > void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> > int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
> >                             struct page_counter **fail);
> 
> That looks good to me. I would also add something like
> 
>   bool page_counter_exceeds_limit(struct page_counter *counter);
> 
> to use instead of this
> 
> +	if (atomic_long_read(&prot->memory_allocated.count) >
> +	    prot->memory_allocated.limit)

I really don't see the point in obscuring a simple '>' behind a
function call.  What follows this is that somebody adds it for the
soft limit, and later for any other type of relational comparison.

> > > >  		break;
> > > >  	case RES_FAILCNT:
> > > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > > +		counter->limited = 0;
> > > 
> > > page_counter_reset_failcnt?
> > 
> > That would be more obscure than counter->failcnt = 0, I think.
> 
> There's one thing that bothers me about this patch. Before, all the
> functions operating on res_counter were mutually smp-safe, now they
> aren't. E.g. if the failcnt reset races with the failcnt increment from
> page_counter_try_charge, the reset might be skipped. You only use the
> atomic type for the counter, but my guess is that failcnt and watermark
> should be atomic too, at least if we're not going to get rid of them
> soon. Otherwise, it should be clearly stated that failcnt and watermark
> are racy.

It's fair enough that the raciness should be documented, but both
counters are such roundabout metrics to begin with that it really
doesn't matter.  What's the difference between a failcnt of 590 and
600 in practical terms?  And what does it matter if the counter
watermark is off by a few pages when there are per-cpu caches on top
of the counters, and the majority of workloads peg the watermark to
the limit during startup anyway?
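
To be concrete, the watermark maintenance amounts to nothing more than
a check like this (a sketch; 'new' is the counter value right after
the charge):

    if (new > c->watermark)
        /*
         * Racy: a concurrent charge can slip in between the check
         * and the store, so the recorded peak can lag the real one
         * by a few pages - which is fine, per the above.
         */
        c->watermark = new;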

> Anyway, that's where the usefulness of res_counter_reset_failcnt
> shows. If one decides to make it race-free one day, they won't have to
> modify code outside the page_counter definition.

A major problem with cgroups overall was that it was designed for a
lot of hypotheticals that are irrelevant in practice but incur very
high costs.  Multiple orthogonal hierarchies is the best example, but
using locked byte counters that can be used to account all manner of
resources, with accurate watermarks and limit failures, when all we
need to count is pages and nobody cares about accurate watermarks and
limit failures, is another one.

It's very unlikely that failcnt and watermark will have to be atomic
ever again, so there is very little hypothetical upside to wrapping a
'= 0' in a function.  But such indirection comes at a real cost.

> > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > > +{
> > > > +	for (;;) {
> > > > +		unsigned long count;
> > > > +		unsigned long old;
> > > > +
> > > > +		count = atomic_long_read(&counter->count);
> > > > +
> > > > +		old = xchg(&counter->limit, limit);
> > > > +
> > > > +		if (atomic_long_read(&counter->count) != count) {
> > > > +			counter->limit = old;
> > > 
> > > I wonder what can happen if two threads execute this function
> > > concurrently... or may be it's not supposed to be smp-safe?
> > 
> > memcg already holds the set_limit_mutex here.  I updated the tcp and
> > hugetlb controllers accordingly to take limit locks as well.
> 
> I would prefer page_counter to handle it internally, because we won't
> need the set_limit_mutex once memsw is converted to plain swap
> accounting.
>
> Besides, memcg_update_kmem_limit doesn't take it. Any chance to
> achieve that w/o spinlocks, using only atomic variables?

We still need it to serialize concurrent access to the memory limit,
and I updated the patch to have kmem take it as well.  It's such a
cold path that using a lockless scheme and worrying about coherency
between updates is not worth it, I think.
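
IOW the callers keep doing the obvious thing (a sketch, the counter
name is just for illustration):

    mutex_lock(&set_limit_mutex);
    ret = page_counter_limit(&memcg->memory, nr_pages);
    mutex_unlock(&set_limit_mutex);

and page_counter_limit() itself only needs to cope with charges racing
against a single limit update.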

> If I were you, I'd separate the patch introducing the page_counter API
> and implementation from the rest. I think it'd ease the review.
> 
> A couple of extra notes about the patch:
> 
>  - I think having comments to function definitions would be nice.
> 
>  - Your implementation of try_charge uses CAS, but this is a really
>    costly operation (the most costly of all atomic primitives). Have
>    you considered using FAA? Something like this:
> 
>    try_charge(pc, nr):
> 
>      limit = pc->limit;
>      count = atomic_add_return(&pc->count, nr);
>      if (count > limit) {
>          atomic_sub(&pc->count, nr);
>          return -ENOMEM;
>      }
>      return 0;

The thing I was worried about there is when you have varying charge
sizes, and a failing big-charge can lock out a small-charge that would
otherwise succeed and force it into early reclaim.  But thinking more
about it, the error is limited to the biggest charge size, which is
THP in our case, and it doesn't matter if we reclaim 2MB too early.

I'll reconsider, thanks for your input!
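
For the record, the hierarchical version of that would look roughly
like this (just a sketch, with the failcnt and watermark updates left
out):

    int page_counter_try_charge(struct page_counter *counter,
                                unsigned long nr_pages,
                                struct page_counter **fail)
    {
        struct page_counter *c;

        for (c = counter; c; c = c->parent) {
            long new;

            new = atomic_long_add_return(nr_pages, &c->count);
            if (new > c->limit) {
                /* back out the speculative charge at this level */
                atomic_long_sub(nr_pages, &c->count);
                *fail = c;
                goto failed;
            }
        }
        return 0;

    failed:
        /* unwind the levels that had already been charged */
        for (c = counter; c != *fail; c = c->parent)
            atomic_long_sub(nr_pages, &c->count);
        return -ENOMEM;
    }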

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
@ 2014-09-23 13:28         ` Johannes Weiner
  0 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-23 13:28 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue, Sep 23, 2014 at 03:06:34PM +0400, Vladimir Davydov wrote:
> On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> > On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 19df5d857411..bf8fb1a05597 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > > >  };
> > > >  
> > > >  #ifdef CONFIG_MEMCG
> > > > +
> > > > +struct page_counter {
> > > 
> > > I'd place it in a separate file, say
> > > 
> > > 	include/linux/page_counter.h
> > > 	mm/page_counter.c
> > > 
> > > just to keep mm/memcontrol.c clean.
> > 
> > The page counters are the very core of the memory controller and, as I
> > said to Michal, I want to integrate the hugetlb controller into memcg
> > as well, at which point there won't be any outside users anymore.  So
> > I think this is the right place for it.
> 
> Hmm, there might be memcg users out there that don't want to pay for
> hugetlb accounting. Or is the overhead supposed to be negligible?

Yes.  But if it gets in the way, it creates pressure to optimize it.
That's the same reason why I've been trying to integrate memcg into
the rest of the VM for over two years now - aside from resulting in
much more unified code, it forces us to compete, and it increases our
testing exposure by several orders of magnitude.

The only reason we are discussing lockless page counters right now is
because we got rid of "memcg specialness" and exposed res_counters to
the rest of the world; and boy did that instantly raise the bar on us.

> Anyway, I still don't think it's a good idea to keep all the definitions
> in the same file. memcontrol.c is already huge. Adding more code to it
> is not desirable, especially if it can naturally live in a separate
> file. And since the page_counter is independent of the memcg core and
> *looks* generic, I believe we should keep it separately.

It's less code than what I just deleted, and half of it seems
redundant when integrated into memcg.  This code would benefit a lot
from being part of memcg, and memcg could reduce its public API.

There are tangible costs associated with having a separate pile of
bitrot depend on our public interface.  Over 90% of the recent changes
to the hugetlb controller were done by Tejun as part of changing the
cgroup interfaces.  And I went through several WTFs switching that
code to the new page counter API.  The cost of maintaining a unified
codebase is negligible in comparison.

> > > > +	atomic_long_t count;
> > > > +	unsigned long limit;
> > > > +	struct page_counter *parent;
> > > > +
> > > > +	/* legacy */
> > > > +	unsigned long watermark;
> > > > +	unsigned long limited;
> > > 
> > > IMHO, failcnt would fit better.
> > 
> > I never liked the failcnt name, but also have to admit that "limited"
> > is crap.  Let's leave it at failcnt for now.
> > 
> > > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> > > 
> > > When I first saw this function, I couldn't realize by looking at its
> > > name what it's intended to do. I think
> > > 
> > > 	page_counter_cancel_local_charge()
> > > 
> > > would fit better.
> > 
> > It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?
> 
> The _sub suffix doesn't match _charge/_uncharge. May be
> page_counter_local_uncharge, or _uncharge_local?

I always think of a charge as the full hierarchical quantity, but this
function only clips that one counter and so anything "uncharge" sounds
terribly wrong to me.  But I can't think of anything great, either.

Any more ideas? :)

> > > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > > +			struct page_counter **fail);
> > > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > > 
> > > Hmm, why not page_counter_set_limit?
> > 
> > Limit is used as a verb here, "to limit".  Getters and setters are
> > usually wrappers around unusual/complex data structure access,
> 
> Not necessarily. Look at percpu_counter_read e.g. It's a one-line
> getter, which we could easily live w/o, but still it's there.

It abstracts an unusual and error-prone access to a counter value,
i.e. reading an unsigned quantity out of a signed variable.

> > but this function does a lot more, so I'm not fond of _set_limit().
> 
> Nevertheless, everything it does can be perfectly described in one
> sentence "it tries to set the new value of the limit", so it does
> function as a setter. And if there's a setter, there must be a getter
> IMO.

That's oversimplifying things.  Setting a limit requires enforcing a
whole bunch of rules and synchronization, whereas reading a limit is
accessing an unsigned long.

In general I agree that we should strive for symmetry and follow the
principle of least surprise, but in terms of complexity these two
operations are very different, and providing a getter on principle
would not actually improve readability in this case.

> > > > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> > > >  					      unsigned long amt,
> > > >  					      int *parent_status)
> > > >  {
> > > > -	struct res_counter *fail;
> > > > -	int ret;
> > > > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> > > >  
> > > > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > > > -					amt << PAGE_SHIFT, &fail);
> > > > -	if (ret < 0)
> > > > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > > > +	    prot->memory_allocated.limit)
> > > 
> > > I don't like your equivalent of res_counter_charge_nofail.
> > > 
> > > Passing NULL to page_counter_charge might be useful if one doesn't have
> > > a back-off strategy, but still want to fail on hitting the limit. With
> > > your interface the user must pass something to the function then, which
> > > isn't convenient.
> > > 
> > > Besides, it depends on the internal implementation of the page_counter
> > > struct. I'd encapsulate this.
> > 
> > Thinking about this more, I don't like my version either; not because
> > of how @fail must always be passed, but because of how it changes the
> > behavior.  I changed the API to
> > 
> > void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> > int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
> >                             struct page_counter **fail);
> 
> That looks good to me. I would also add something like
> 
>   bool page_counter_exceeds_limit(struct page_counter *counter);
> 
> to use instead of this
> 
> +	if (atomic_long_read(&prot->memory_allocated.count) >
> +	    prot->memory_allocated.limit)

I really don't see the point in obscuring a simple '<' behind a
function call.  What follows this is that somebody adds it for the
soft limit, and later for any other type of relational comparison.

> > > >  		break;
> > > >  	case RES_FAILCNT:
> > > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > > +		counter->limited = 0;
> > > 
> > > page_counter_reset_failcnt?
> > 
> > That would be more obscure than counter->failcnt = 0, I think.
> 
> There's one thing that bothers me about this patch. Before, all the
> functions operating on res_counter were mutually smp-safe, now they
> aren't. E.g. if the failcnt reset races with the falcnt increment from
> page_counter_try_charge, the reset might be skipped. You only use the
> atomic type for the counter, but my guess is that failcnt and watermark
> should be atomic too, at least if we're not going to get rid of them
> soon. Otherwise, it should be clearly stated that failcnt and watermark
> are racy.

It's fair enough that the raciness should be documented, but both
counters are such roundabout metrics to begin with that it really
doesn't matter.  What's the difference between a failcnt of 590 and
600 in practical terms?  And what does it matter if the counter
watermark is off by a few pages when there are per-cpu caches on top
of the counters, and the majority of workloads peg the watermark to
the limit during startup anyway?

> Anyway, that's where the usefulness of res_counter_reset_failcnt
> reveals. If one decides to make it race-free one day, they won't have to
> modify code outside the page_counter definition.

A major problem with cgroups overall was that it was designed for a
lot of hypotheticals that are irrelevant in practice but incur very
high costs.  Multiple orthogonal hierarchies is the best example, but
using locked byte counters that can be used to account all manner of
resources, with accurate watermarks and limit failures, when all we
need to count is pages and nobody cares about accurate watermarks and
limit failures, is another one.

It's very unlikely that failcnt and watermark will have to me atomic
ever again, so there is very little hypothetical upside to wrapping a
'= 0' in a function.  But such indirection comes at a real cost.

> > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > > +{
> > > > +	for (;;) {
> > > > +		unsigned long count;
> > > > +		unsigned long old;
> > > > +
> > > > +		count = atomic_long_read(&counter->count);
> > > > +
> > > > +		old = xchg(&counter->limit, limit);
> > > > +
> > > > +		if (atomic_long_read(&counter->count) != count) {
> > > > +			counter->limit = old;
> > > 
> > > I wonder what can happen if two threads execute this function
> > > concurrently... or may be it's not supposed to be smp-safe?
> > 
> > memcg already holds the set_limit_mutex here.  I updated the tcp and
> > hugetlb controllers accordingly to take limit locks as well.
> 
> I would prefer page_counter to handle it internally, because we won't
> need the set_limit_mutex once memsw is converted to plain swap
> accounting.
>
> Besides, memcg_update_kmem_limit doesn't take it. Any chance to
> achieve that w/o spinlocks, using only atomic variables?

We still need it to serialize concurrent access to the memory limit,
and I updated the patch to have kmem take it as well.  It's such a
cold path that using a lockless scheme and worrying about coherency
between updates is not worth it, I think.

> If I were you, I'd separate the patch introducing the page_counter API
> and implementation from the rest. I think it'd ease the review.
> 
> A couple of extra notes about the patch:
> 
>  - I think having comments to function definitions would be nice.
> 
>  - Your implementation of try_charge uses CAS, but this is a really
>    costly operation (the most costly of all atomic primitives). Have
>    you considered using FAA? Something like this:
> 
>    try_charge(pc, nr):
> 
>      limit = pc->limit;
>      count = atomic_add_return(&pc->count, nr);
>      if (count > limit) {
>          atomic_sub(&pc->count, nr);
>          return -ENOMEM;
>      }
>      return 0;

The thing I was worried about there is when you have varying charge
sizes, and a failing big-charge can lock out a small-charge that would
otherwise succeed and force it into early reclaim.  But thinking more
about it, the error is limited to the biggest charge size, which is
THP in our case, and it doesn't matter if we reclaim 2MB too early.

I'll reconsider, thanks for your input!
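
For the archive, the fetch-and-add variant would look roughly like this,
using the count/limit/watermark fields from the patch and failcnt as the
agreed rename of 'limited' (per-counter only, no hierarchy walk, and the
failcnt/watermark updates stay racy as discussed above; a sketch, not
the actual patch):

static int page_counter_try_charge_faa(struct page_counter *counter,
				       unsigned long nr_pages)
{
	long new;

	new = atomic_long_add_return(nr_pages, &counter->count);
	if (new > (long)counter->limit) {
		/* back out; the overshoot is bounded by the failing charge */
		atomic_long_sub(nr_pages, &counter->count);
		counter->failcnt++;		/* racy, see above */
		return -ENOMEM;
	}
	if ((unsigned long)new > counter->watermark)
		counter->watermark = new;	/* racy, see above */
	return 0;
}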

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 13:25           ` Michal Hocko
@ 2014-09-23 14:05             ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-23 14:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue, Sep 23, 2014 at 03:25:53PM +0200, Michal Hocko wrote:
> On Mon 22-09-14 15:58:29, Johannes Weiner wrote:
> > On Mon, Sep 22, 2014 at 07:28:00PM +0200, Michal Hocko wrote:
> > > On Mon 22-09-14 11:50:49, Johannes Weiner wrote:
> > > > On Mon, Sep 22, 2014 at 04:44:36PM +0200, Michal Hocko wrote:
> > > > > On Fri 19-09-14 09:22:08, Johannes Weiner wrote:
> > > [...]
> > > > > Nevertheless I think that the counter should live outside of memcg (it
> > > > > is ugly and bad in general to make HUGETLB controller depend on MEMCG
> > > > > just to have a counter). If you made kernel/page_counter.c and led both
> > > > > containers select CONFIG_PAGE_COUNTER then you do not need a dependency
> > > > > on MEMCG and I would find it cleaner in general.
> > > > 
> > > > The reason I did it this way is because the hugetlb controller simply
> > > > accounts and limits a certain type of memory and in the future I would
> > > > like to make it a memcg extension, just like kmem and swap.
> > > 
> > > I am not sure this is the right way to go. Hugetlb has always been
> > > "special" and I do not see any advantage to pull its specialness into
> > > memcg proper.
> > >
> > > It would just make the code more complicated. I can also imagine
> > > users who simply do not want to pay memcg overhead and use only
> > > hugetlb controller.
> > 
> > We already group user memory, kernel memory, and swap space together,
> > what makes hugetlb-backed memory special?
> 
> There is only a little overlap between LRU backed and kmem accounted
> memory with hugetlb which has always been standing aside from the rest
> of the memory management code (THP being a successor which fits in much
> better and which is already covered by memcg). It has basically its own
> code path for every aspect of its object life cycle and internal data
> structures which are in many ways not compatible with regular user or
> kmem memory. Merging the controllers would require merging the hugetlb
> code closer to the MM code. Until then it just doesn't make sense to me.

Just look at the hugetlb controller code and think about what would be
left if it were simply another page counter in struct mem_cgroup.

There is a glaring memory leak in its css_alloc() method because
nobody ever looks at this code.  The controller was missed in the
reparenting removal patches because it's just not on the radar.

This is so painfully obvious if you actually work on this code, I
don't know why we are even discussing this.

> > It's also better for the user interface to have a single memory
> > controller.
> 
> I have seen so much confusion coming from hugetlb vs. THP that I think
> the quite opposite is true. Besides that we would need a separate limit
> for hugetlb accounted memory anyway so having a small and specialized
> controller for specialized memory sounds like a proper way to go.
> 
> Finally, as mentioned in previous email, you might have users interested
> only in hugetlb controller with memcg disabled.

They use a global spinlock to allocate and charge these pages; I think
they'll be fine with memcg.

> > We're also close to the point where we don't differentiate between the
> > root group and dedicated groups in terms of performance, Dave's tests
> > fell apart at fairly high concurrency, and I'm already getting rid of
> > the lock he saw contended.
> 
> Sure but this has nothing to do with it. Hugetlb can safely use the same
> lockless counter as a replacement for res_counter and benefit from it
> even though the contention hasn't been seen/reported yet.

It doesn't even use these counters the right way; just look at what it
does during reparenting.  And as per above, it should also be included
in the reparenting removal, but I haven't even configured the thing.

> > The downsides of fragmenting our configuration- and testspace, our
> > user interface, and our code base by far outweigh the benefits of
> > offering a dedicated hugetlb controller.
> 
> Could you be more specific please? Hugetlb has to be configured and
> tested separately whether it would be in a separate controller or not.
> 
> Last but not least, even if this turns out to make some sense in
> the future please do not mix those things together here. Your
> res_counter -> page_counter transition makes a lot of sense for both
> controllers. And it is a huge improvement. I do not see any reason
> to pull a conceptually nontrivial merging/dependency of two separate
> controllers into the picture. If you think it makes some sense then
> bring that up later for a separate discussion.

That's one way to put it.  But the way I see it is that I remove a
generic resource counter and replace it with a pure memory counter
which I put where we account and limit memory - with one exception
that is hardly worth creating a dedicated library file for.

I only explained my plans of merging all memory controllers because I
assumed we could eventually be on the same page when it comes to this code.

But regardless of that, my approach immediately simplifies Kconfig,
Makefiles, #includes, and you haven't made a good point why the
hugetlb controller depending on memcg would harm anybody in real life.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 14:05             ` Johannes Weiner
  (?)
@ 2014-09-23 14:28               ` Michal Hocko
  -1 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-23 14:28 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue 23-09-14 10:05:26, Johannes Weiner wrote:
[...]
> That's one way to put it.  But the way I see it is that I remove a
> generic resource counter and replace it with a pure memory counter
> which I put where we account and limit memory - with one exception
> that is hardly worth creating a dedicated library file for.

So you seem to completely ignore people doing CONFIG_CGROUP_HUGETLB &&
!CONFIG_MEMCG without any justification, hiding it behind a performance
improvement which those users haven't even asked for yet.

All that just to not have one additional header and c file hidden by
CONFIG_PAGE_COUNTER selected by both controllers. No special
configuration option is really needed for CONFIG_PAGE_COUNTER.

> I only explained my plans of merging all memory controllers because I
> assumed we could ever be on the same page when it comes to this code.

I doubt this is a good plan but I might be wrong here. The main point
stands: there is no good reason to make hugetlb depend on memcg right
now, and such a change _shouldn't_ be hidden behind an unrelated change.
From the hugetlb controller's point of view this is just a different
counter which doesn't depend on memcg. I am really surprised you are
pushing for this so hard right now, because it only distracts from the
main motivation of your patch.

> But regardless of that, my approach immediately simplifies Kconfig,
> Makefiles, #includes, and you haven't made a good point why the
> hugetlb controller depending on memcg would harm anybody in real life.

Usecases for the hugetlb and memcg controllers have basically zero
overlap: they simply manage different resources, so it is natural to
use one without the other.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 13:28         ` Johannes Weiner
  (?)
@ 2014-09-23 15:21           ` Vladimir Davydov
  -1 siblings, 0 replies; 53+ messages in thread
From: Vladimir Davydov @ 2014-09-23 15:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue, Sep 23, 2014 at 09:28:01AM -0400, Johannes Weiner wrote:
> On Tue, Sep 23, 2014 at 03:06:34PM +0400, Vladimir Davydov wrote:
> > On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> > > On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > > > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > index 19df5d857411..bf8fb1a05597 100644
> > > > > --- a/include/linux/memcontrol.h
> > > > > +++ b/include/linux/memcontrol.h
> > > > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > > > >  };
> > > > >  
> > > > >  #ifdef CONFIG_MEMCG
> > > > > +
> > > > > +struct page_counter {
> > > > 
> > > > I'd place it in a separate file, say
> > > > 
> > > > 	include/linux/page_counter.h
> > > > 	mm/page_counter.c
> > > > 
> > > > just to keep mm/memcontrol.c clean.
> > > 
> > > The page counters are the very core of the memory controller and, as I
> > > said to Michal, I want to integrate the hugetlb controller into memcg
> > > as well, at which point there won't be any outside users anymore.  So
> > > I think this is the right place for it.
> > 
> > Hmm, there might be memcg users out there that don't want to pay for
> > hugetlb accounting. Or is the overhead supposed to be negligible?
> 
> Yes.  But if it gets in the way, it creates pressure to optimize it.

There will always be an overhead no matter how we optimize it.

I think we should only merge them if it could really help simplify the
code, for instance if they were dependent on each other. Anyway, I'm not
an expert in the hugetlb cgroup, so I can't judge whether it's good or
not. I believe you should raise this topic separately and see if others
agree with you.

> That's the same reason why I've been trying to integrate memcg into
> the rest of the VM for over two years now - aside from resulting in
> much more unified code, it forces us to compete, and it increases our
> testing exposure by several orders of magnitude.
> 
> The only reason we are discussing lockless page counters right now is
> because we got rid of "memcg specialness" and exposed res_counters to
> the rest of the world; and boy did that instantly raise the bar on us.
> 
> > Anyway, I still don't think it's a good idea to keep all the definitions
> > in the same file. memcontrol.c is already huge. Adding more code to it
> > is not desirable, especially if it can naturally live in a separate
> > file. And since the page_counter is independent of the memcg core and
> > *looks* generic, I believe we should keep it separately.
> 
> It's less code than what I just deleted, and half of it seems
> redundant when integrated into memcg.  This code would benefit a lot
> from being part of memcg, and memcg could reduce its public API.

I think I understand. You are afraid that other users of the
page_counter will show up one day, and you won't be able to modify its
API freely. That's reasonable. But we can solve that while still keeping
page_counter separate. For example, put the header under mm/ and state
clearly that it's for memcontrol.c and nobody is allowed to use it w/o
special permission.

My point is that it's easier to maintain the code divided into logical
parts. And page_counter seems to be such a logical part.

Come to think of placing page_counter.h under mm/, another question
springs to mind. Why do you force the kmem.tcp code to use the
page_counter instead of the res_counter? Nobody seems to use it and it
should go away sooner or later. Maybe it's worth leaving kmem.tcp
using res_counter? I think we could isolate kmem.tcp under a separate
config option depending on RES_COUNTER, and mark them both as
deprecated somehow.

> There are tangible costs associated with having a separate pile of
> bitrot depend on our public interface.  Over 90% of the recent changes
> to the hugetlb controller were done by Tejun as part of changing the
> cgroup interfaces.  And I went through several WTFs switching that
> code to the new page counter API.  The cost of maintaining a unified
> codebase is negligible in comparison.
> 
> > > > > +	atomic_long_t count;
> > > > > +	unsigned long limit;
> > > > > +	struct page_counter *parent;
> > > > > +
> > > > > +	/* legacy */
> > > > > +	unsigned long watermark;
> > > > > +	unsigned long limited;
> > > > 
> > > > IMHO, failcnt would fit better.
> > > 
> > > I never liked the failcnt name, but also have to admit that "limited"
> > > is crap.  Let's leave it at failcnt for now.
> > > 
> > > > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> > > > 
> > > > When I first saw this function, I couldn't realize by looking at its
> > > > name what it's intended to do. I think
> > > > 
> > > > 	page_counter_cancel_local_charge()
> > > > 
> > > > would fit better.
> > > 
> > > It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?
> > 
> > The _sub suffix doesn't match _charge/_uncharge. May be
> > page_counter_local_uncharge, or _uncharge_local?
> 
> I always think of a charge as the full hierarchical quantity, but this
> function only clips that one counter and so anything "uncharge" sounds
> terribly wrong to me.  But I can't think of anything great, either.
> 
> Any more ideas? :)

Not yet :(

> > > > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > > > +			struct page_counter **fail);
> > > > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > > > 
> > > > Hmm, why not page_counter_set_limit?
> > > 
> > > Limit is used as a verb here, "to limit".  Getters and setters are
> > > usually wrappers around unusual/complex data structure access,
> > 
> > Not necessarily. Look at percpu_counter_read e.g. It's a one-line
> > getter, which we could easily live w/o, but still it's there.
> 
> It abstracts an unusual and error-prone access to a counter value,
> i.e. reading an unsigned quantity out of a signed variable.

Wait, percpu_counter_read's return value has type s64 and it returns
percpu_counter->count, which is also s64. So it's 100% equivalent to
plainly reading percpu_counter->count.
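
For reference, the whole function is just this (from
include/linux/percpu_counter.h, the SMP variant, quoting from memory):

static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
	return fbc->count;
}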

> > > but this function does a lot more, so I'm not fond of _set_limit().
> > 
> > Nevertheless, everything it does can be perfectly described in one
> > sentence "it tries to set the new value of the limit", so it does
> > function as a setter. And if there's a setter, there must be a getter
> > IMO.
> 
> That's oversimplifying things.  Setting a limit requires enforcing a
> whole bunch of rules and synchronization, whereas reading a limit is
> accessing an unsigned long.

Let's look at percpu_counter once again (excuse me for sticking to it,
but it seems to be a good example): percpu_counter_set requires
enforcing a whole bunch of rules and synchronization, whereas reading
the percpu_counter value is accessing an s64. Nevertheless, they're
*semantically* a getter and a setter. The same is fair for the
page_counter IMO.

I don't want to force you to introduce the getter; it's not that
important to me. Just reasoning.
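
To be clear, all I'm thinking of is a trivial wrapper along these lines
(the name is just an example, not something in your patch):

static inline unsigned long page_counter_get_limit(struct page_counter *counter)
{
	return counter->limit;
}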

> In general I agree that we should strive for symmetry and follow the
> principle of least surprise, but in terms of complexity these two
> operations are very different, and providing a getter on principle
> would not actually improve readability in this case.
> 
> > > > > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> > > > >  					      unsigned long amt,
> > > > >  					      int *parent_status)
> > > > >  {
> > > > > -	struct res_counter *fail;
> > > > > -	int ret;
> > > > > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> > > > >  
> > > > > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > > > > -					amt << PAGE_SHIFT, &fail);
> > > > > -	if (ret < 0)
> > > > > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > > > > +	    prot->memory_allocated.limit)
> > > > 
> > > > I don't like your equivalent of res_counter_charge_nofail.
> > > > 
> > > > Passing NULL to page_counter_charge might be useful if one doesn't have
> > > > a back-off strategy, but still want to fail on hitting the limit. With
> > > > your interface the user must pass something to the function then, which
> > > > isn't convenient.
> > > > 
> > > > Besides, it depends on the internal implementation of the page_counter
> > > > struct. I'd encapsulate this.
> > > 
> > > Thinking about this more, I don't like my version either; not because
> > > of how @fail must always be passed, but because of how it changes the
> > > behavior.  I changed the API to
> > > 
> > > void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> > > int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
> > >                             struct page_counter **fail);
> > 
> > That looks good to me. I would also add something like
> > 
> >   bool page_counter_exceeds_limit(struct page_counter *counter);
> > 
> > to use instead of this
> > 
> > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > +	    prot->memory_allocated.limit)
> 
> I really don't see the point in obscuring a simple '<' behind a
> function call.  What follows this is that somebody adds it for the
> soft limit, and later for any other type of relational comparison.

I think I agree with you at this point.

> > > > >  		break;
> > > > >  	case RES_FAILCNT:
> > > > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > > > +		counter->limited = 0;
> > > > 
> > > > page_counter_reset_failcnt?
> > > 
> > > That would be more obscure than counter->failcnt = 0, I think.
> > 
> > There's one thing that bothers me about this patch. Before, all the
> > functions operating on res_counter were mutually smp-safe, now they
> > aren't. E.g. if the failcnt reset races with the failcnt increment from
> > page_counter_try_charge, the reset might be skipped. You only use the
> > atomic type for the counter, but my guess is that failcnt and watermark
> > should be atomic too, at least if we're not going to get rid of them
> > soon. Otherwise, it should be clearly stated that failcnt and watermark
> > are racy.
> 
> It's fair enough that the raciness should be documented, but both
> counters are such roundabout metrics to begin with that it really
> doesn't matter.  What's the difference between a failcnt of 590 and
> 600 in practical terms?  And what does it matter if the counter
> watermark is off by a few pages when there are per-cpu caches on top
> of the counters, and the majority of workloads peg the watermark to
> the limit during startup anyway?

Suppose failcnt=42000. The user resets it and gets 42001 instead of 0.
That's a huge difference.

> > Anyway, that's where the usefulness of res_counter_reset_failcnt
> > reveals itself. If one decides to make it race-free one day, they won't have to
> > modify code outside the page_counter definition.
> 
> A major problem with cgroups overall was that it was designed for a
> lot of hypotheticals that are irrelevant in practice but incur very
> high costs.  Multiple orthogonal hierarchies is the best example, but
> using locked byte counters that can be used to account all manner of
> resources, with accurate watermarks and limit failures, when all we
> need to count is pages and nobody cares about accurate watermarks and
> limit failures, is another one.
> 
> It's very unlikely that failcnt and watermark will have to be atomic
> ever again, so there is very little hypothetical upside to wrapping a
> '= 0' in a function.  But such indirection comes at a real cost.

Hmm, why? I mean why would making failcnt atomic be problematic? IMO it
wouldn't complicate the code, nor would it affect performance. And
why does page_counter_reset_failcnt() come at a real cost? It's +4 lines
to your patch.
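
Something along these lines is all I mean (a sketch assuming failcnt
were switched to atomic_long_t, which your patch doesn't do; the first
helper name is made up):

/* charge failure path */
static inline void page_counter_fail(struct page_counter *counter)
{
	atomic_long_inc(&counter->failcnt);
}

/* RES_FAILCNT write handler */
static inline void page_counter_reset_failcnt(struct page_counter *counter)
{
	atomic_long_set(&counter->failcnt, 0);
}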

> > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > > > +{
> > > > > +	for (;;) {
> > > > > +		unsigned long count;
> > > > > +		unsigned long old;
> > > > > +
> > > > > +		count = atomic_long_read(&counter->count);
> > > > > +
> > > > > +		old = xchg(&counter->limit, limit);
> > > > > +
> > > > > +		if (atomic_long_read(&counter->count) != count) {
> > > > > +			counter->limit = old;
> > > > 
> > > > I wonder what can happen if two threads execute this function
> > > > concurrently... or may be it's not supposed to be smp-safe?
> > > 
> > > memcg already holds the set_limit_mutex here.  I updated the tcp and
> > > hugetlb controllers accordingly to take limit locks as well.
> > 
> > I would prefer page_counter to handle it internally, because we won't
> > need the set_limit_mutex once memsw is converted to plain swap
> > accounting.
> >
> > Besides, memcg_update_kmem_limit doesn't take it. Any chance to
> > achieve that w/o spinlocks, using only atomic variables?
> 
> We still need it to serialize concurrent access to the memory limit,
> and I updated the patch to have kmem take it as well.  It's such a
> cold path that using a lockless scheme and worrying about coherency
> between updates is not worth it, I think.

OK, it's not that important actually. Please state explicitly in the
comment on the function definition that it needs external
synchronization then.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
@ 2014-09-23 15:21           ` Vladimir Davydov
  0 siblings, 0 replies; 53+ messages in thread
From: Vladimir Davydov @ 2014-09-23 15:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue, Sep 23, 2014 at 09:28:01AM -0400, Johannes Weiner wrote:
> On Tue, Sep 23, 2014 at 03:06:34PM +0400, Vladimir Davydov wrote:
> > On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> > > On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > > > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > index 19df5d857411..bf8fb1a05597 100644
> > > > > --- a/include/linux/memcontrol.h
> > > > > +++ b/include/linux/memcontrol.h
> > > > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > > > >  };
> > > > >  
> > > > >  #ifdef CONFIG_MEMCG
> > > > > +
> > > > > +struct page_counter {
> > > > 
> > > > I'd place it in a separate file, say
> > > > 
> > > > 	include/linux/page_counter.h
> > > > 	mm/page_counter.c
> > > > 
> > > > just to keep mm/memcontrol.c clean.
> > > 
> > > The page counters are the very core of the memory controller and, as I
> > > said to Michal, I want to integrate the hugetlb controller into memcg
> > > as well, at which point there won't be any outside users anymore.  So
> > > I think this is the right place for it.
> > 
> > Hmm, there might be memcg users out there that don't want to pay for
> > hugetlb accounting. Or is the overhead supposed to be negligible?
> 
> Yes.  But if it gets in the way, it creates pressure to optimize it.

There always will be an overhead no matter how we optimize it.

I think we should only merge them if it could really help simplify the
code, for instance if they were dependant on each other. Anyway, I'm not
an expert in the hugetlb cgroup, so I can't judge whether it's good or
not. I believe you should raise this topic separately and see if others
agree with you.

> That's the same reason why I've been trying to integrate memcg into
> the rest of the VM for over two years now - aside from resulting in
> much more unified code, it forces us to compete, and it increases our
> testing exposure by several orders of magnitude.
> 
> The only reason we are discussing lockless page counters right now is
> because we got rid of "memcg specialness" and exposed res_counters to
> the rest of the world; and boy did that instantly raise the bar on us.
> 
> > Anyway, I still don't think it's a good idea to keep all the definitions
> > in the same file. memcontrol.c is already huge. Adding more code to it
> > is not desirable, especially if it can naturally live in a separate
> > file. And since the page_counter is independent of the memcg core and
> > *looks* generic, I believe we should keep it separately.
> 
> It's less code than what I just deleted, and half of it seems
> redundant when integrated into memcg.  This code would benefit a lot
> from being part of memcg, and memcg could reduce its public API.

I think I understand. You are afraid that other users of the
page_counter will show up one day, and you won't be able to modify its
API freely. That's reasonable. But we can solve it while still keeping
page_counter separately. For example, put the header to mm/ and state
clearly that it's for memcontrol.c and nobody is allowed to use it w/o a
special permission.

My point is that it's easier to maintain the code divided in logical
parts. And page_counter seems to be such a logical part.

Coming to think about placing page_counter.h to mm/, another question
springs into my mind. Why do you force kmem.tcp code to use the
page_counter instead of the res_counter? Nobody seems to use it and it
should pass away sooner or later. May be it's worth leaving kmem.tcp
using res_counter? I think we could isolate kmem.tcp under a separate
config option depending on the RES_COUNTER, and mark them both as
deprecated somehow.

> There are tangible costs associated with having a separate pile of
> bitrot depend on our public interface.  Over 90% of the recent changes
> to the hugetlb controller were done by Tejun as part of changing the
> cgroup interfaces.  And I went through several WTFs switching that
> code to the new page counter API.  The cost of maintaining a unified
> codebase is negligible in comparison.
> 
> > > > > +	atomic_long_t count;
> > > > > +	unsigned long limit;
> > > > > +	struct page_counter *parent;
> > > > > +
> > > > > +	/* legacy */
> > > > > +	unsigned long watermark;
> > > > > +	unsigned long limited;
> > > > 
> > > > IMHO, failcnt would fit better.
> > > 
> > > I never liked the failcnt name, but also have to admit that "limited"
> > > is crap.  Let's leave it at failcnt for now.
> > > 
> > > > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> > > > 
> > > > When I first saw this function, I couldn't realize by looking at its
> > > > name what it's intended to do. I think
> > > > 
> > > > 	page_counter_cancel_local_charge()
> > > > 
> > > > would fit better.
> > > 
> > > It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?
> > 
> > The _sub suffix doesn't match _charge/_uncharge. May be
> > page_counter_local_uncharge, or _uncharge_local?
> 
> I always think of a charge as the full hierarchical quantity, but this
> function only clips that one counter and so anything "uncharge" sounds
> terribly wrong to me.  But I can't think of anything great, either.
> 
> Any more ideas? :)

Not yet :(

> > > > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > > > +			struct page_counter **fail);
> > > > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > > > 
> > > > Hmm, why not page_counter_set_limit?
> > > 
> > > Limit is used as a verb here, "to limit".  Getters and setters are
> > > usually wrappers around unusual/complex data structure access,
> > 
> > Not necessarily. Look at percpu_counter_read e.g. It's a one-line
> > getter, which we could easily live w/o, but still it's there.
> 
> It abstracts an unusual and error-prone access to a counter value,
> i.e. reading an unsigned quantity out of a signed variable.

Wait, percpu_counter_read retval has type s64 and it returns
percpu_counter->count, which is also s64. So it's a 100% equivalent of
plain reading of percpu_counter->count.

> > > but this function does a lot more, so I'm not fond of _set_limit().
> > 
> > Nevertheless, everything it does can be perfectly described in one
> > sentence "it tries to set the new value of the limit", so it does
> > function as a setter. And if there's a setter, there must be a getter
> > IMO.
> 
> That's oversimplifying things.  Setting a limit requires enforcing a
> whole bunch of rules and synchronization, whereas reading a limit is
> accessing an unsigned long.

Let's look at percpu_counter once again (excuse me for sticking to it,
but it seems to be a good example): percpu_counter_set requires
enforcing a whole bunch of rules and synchronization, whereas reading
percpu_counter value is accessing an s64. Nevertheless, they're
*semantically* a getter and a setter. The same is fair for the
page_counter IMO.

I don't want to enforce you to introduce the getter, it's not that
important to me. Just reasoning.

> In general I agree that we should strive for symmetry and follow the
> principle of least surprise, but in terms of complexity these two
> operations are very different, and providing a getter on principle
> would not actually improve readability in this case.
> 
> > > > > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> > > > >  					      unsigned long amt,
> > > > >  					      int *parent_status)
> > > > >  {
> > > > > -	struct res_counter *fail;
> > > > > -	int ret;
> > > > > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> > > > >  
> > > > > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > > > > -					amt << PAGE_SHIFT, &fail);
> > > > > -	if (ret < 0)
> > > > > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > > > > +	    prot->memory_allocated.limit)
> > > > 
> > > > I don't like your equivalent of res_counter_charge_nofail.
> > > > 
> > > > Passing NULL to page_counter_charge might be useful if one doesn't have
> > > > a back-off strategy, but still want to fail on hitting the limit. With
> > > > your interface the user must pass something to the function then, which
> > > > isn't convenient.
> > > > 
> > > > Besides, it depends on the internal implementation of the page_counter
> > > > struct. I'd encapsulate this.
> > > 
> > > Thinking about this more, I don't like my version either; not because
> > > of how @fail must always be passed, but because of how it changes the
> > > behavior.  I changed the API to
> > > 
> > > void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> > > int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
> > >                             struct page_counter **fail);
> > 
> > That looks good to me. I would also add something like
> > 
> >   bool page_counter_exceeds_limit(struct page_counter *counter);
> > 
> > to use instead of this
> > 
> > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > +	    prot->memory_allocated.limit)
> 
> I really don't see the point in obscuring a simple '<' behind a
> function call.  What follows this is that somebody adds it for the
> soft limit, and later for any other type of relational comparison.

I think I agree with you at this point.

> > > > >  		break;
> > > > >  	case RES_FAILCNT:
> > > > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > > > +		counter->limited = 0;
> > > > 
> > > > page_counter_reset_failcnt?
> > > 
> > > That would be more obscure than counter->failcnt = 0, I think.
> > 
> > There's one thing that bothers me about this patch. Before, all the
> > functions operating on res_counter were mutually smp-safe, now they
> > aren't. E.g. if the failcnt reset races with the falcnt increment from
> > page_counter_try_charge, the reset might be skipped. You only use the
> > atomic type for the counter, but my guess is that failcnt and watermark
> > should be atomic too, at least if we're not going to get rid of them
> > soon. Otherwise, it should be clearly stated that failcnt and watermark
> > are racy.
> 
> It's fair enough that the raciness should be documented, but both
> counters are such roundabout metrics to begin with that it really
> doesn't matter.  What's the difference between a failcnt of 590 and
> 600 in practical terms?  And what does it matter if the counter
> watermark is off by a few pages when there are per-cpu caches on top
> of the counters, and the majority of workloads peg the watermark to
> the limit during startup anyway?

Suppose failcnt=42000. The user resets it and gets 42001 instead of 0.
That's a huge difference.

> > Anyway, that's where the usefulness of res_counter_reset_failcnt
> > reveals. If one decides to make it race-free one day, they won't have to
> > modify code outside the page_counter definition.
> 
> A major problem with cgroups overall was that it was designed for a
> lot of hypotheticals that are irrelevant in practice but incur very
> high costs.  Multiple orthogonal hierarchies is the best example, but
> using locked byte counters that can be used to account all manner of
> resources, with accurate watermarks and limit failures, when all we
> need to count is pages and nobody cares about accurate watermarks and
> limit failures, is another one.
> 
> It's very unlikely that failcnt and watermark will have to me atomic
> ever again, so there is very little hypothetical upside to wrapping a
> '= 0' in a function.  But such indirection comes at a real cost.

Hmm, why? I mean why could making failcnt atomic be problematic? IMO it
wouldn't complicate the code, neither would it affect performance. And
why does page_counter_reset_failcnt() come at a real cost? It's +4 lines
to your patch.

> > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > > > +{
> > > > > +	for (;;) {
> > > > > +		unsigned long count;
> > > > > +		unsigned long old;
> > > > > +
> > > > > +		count = atomic_long_read(&counter->count);
> > > > > +
> > > > > +		old = xchg(&counter->limit, limit);
> > > > > +
> > > > > +		if (atomic_long_read(&counter->count) != count) {
> > > > > +			counter->limit = old;
> > > > 
> > > > I wonder what can happen if two threads execute this function
> > > > concurrently... or may be it's not supposed to be smp-safe?
> > > 
> > > memcg already holds the set_limit_mutex here.  I updated the tcp and
> > > hugetlb controllers accordingly to take limit locks as well.
> > 
> > I would prefer page_counter to handle it internally, because we won't
> > need the set_limit_mutex once memsw is converted to plain swap
> > accounting.
> >
> > Besides, memcg_update_kmem_limit doesn't take it. Any chance to
> > achieve that w/o spinlocks, using only atomic variables?
> 
> We still need it to serialize concurrent access to the memory limit,
> and I updated the patch to have kmem take it as well.  It's such a
> cold path that using a lockless scheme and worrying about coherency
> between updates is not worth it, I think.

OK, it's not that important actually. Please state explicitly in the
comment to the function definition that it needs external
synchronization then.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
@ 2014-09-23 15:21           ` Vladimir Davydov
  0 siblings, 0 replies; 53+ messages in thread
From: Vladimir Davydov @ 2014-09-23 15:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko, Greg Thelen,
	Dave Hansen, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 23, 2014 at 09:28:01AM -0400, Johannes Weiner wrote:
> On Tue, Sep 23, 2014 at 03:06:34PM +0400, Vladimir Davydov wrote:
> > On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> > > On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > > > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > index 19df5d857411..bf8fb1a05597 100644
> > > > > --- a/include/linux/memcontrol.h
> > > > > +++ b/include/linux/memcontrol.h
> > > > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > > > >  };
> > > > >  
> > > > >  #ifdef CONFIG_MEMCG
> > > > > +
> > > > > +struct page_counter {
> > > > 
> > > > I'd place it in a separate file, say
> > > > 
> > > > 	include/linux/page_counter.h
> > > > 	mm/page_counter.c
> > > > 
> > > > just to keep mm/memcontrol.c clean.
> > > 
> > > The page counters are the very core of the memory controller and, as I
> > > said to Michal, I want to integrate the hugetlb controller into memcg
> > > as well, at which point there won't be any outside users anymore.  So
> > > I think this is the right place for it.
> > 
> > Hmm, there might be memcg users out there that don't want to pay for
> > hugetlb accounting. Or is the overhead supposed to be negligible?
> 
> Yes.  But if it gets in the way, it creates pressure to optimize it.

There always will be an overhead no matter how we optimize it.

I think we should only merge them if it could really help simplify the
code, for instance if they were dependant on each other. Anyway, I'm not
an expert in the hugetlb cgroup, so I can't judge whether it's good or
not. I believe you should raise this topic separately and see if others
agree with you.

> That's the same reason why I've been trying to integrate memcg into
> the rest of the VM for over two years now - aside from resulting in
> much more unified code, it forces us to compete, and it increases our
> testing exposure by several orders of magnitude.
> 
> The only reason we are discussing lockless page counters right now is
> because we got rid of "memcg specialness" and exposed res_counters to
> the rest of the world; and boy did that instantly raise the bar on us.
> 
> > Anyway, I still don't think it's a good idea to keep all the definitions
> > in the same file. memcontrol.c is already huge. Adding more code to it
> > is not desirable, especially if it can naturally live in a separate
> > file. And since the page_counter is independent of the memcg core and
> > *looks* generic, I believe we should keep it separately.
> 
> It's less code than what I just deleted, and half of it seems
> redundant when integrated into memcg.  This code would benefit a lot
> from being part of memcg, and memcg could reduce its public API.

I think I understand. You are afraid that other users of the
page_counter will show up one day, and you won't be able to modify its
API freely. That's reasonable. But we can solve it while still keeping
page_counter separately. For example, put the header to mm/ and state
clearly that it's for memcontrol.c and nobody is allowed to use it w/o a
special permission.

My point is that it's easier to maintain the code divided in logical
parts. And page_counter seems to be such a logical part.

Coming to think about placing page_counter.h to mm/, another question
springs into my mind. Why do you force kmem.tcp code to use the
page_counter instead of the res_counter? Nobody seems to use it and it
should pass away sooner or later. May be it's worth leaving kmem.tcp
using res_counter? I think we could isolate kmem.tcp under a separate
config option depending on the RES_COUNTER, and mark them both as
deprecated somehow.

> There are tangible costs associated with having a separate pile of
> bitrot depend on our public interface.  Over 90% of the recent changes
> to the hugetlb controller were done by Tejun as part of changing the
> cgroup interfaces.  And I went through several WTFs switching that
> code to the new page counter API.  The cost of maintaining a unified
> codebase is negligible in comparison.
> 
> > > > > +	atomic_long_t count;
> > > > > +	unsigned long limit;
> > > > > +	struct page_counter *parent;
> > > > > +
> > > > > +	/* legacy */
> > > > > +	unsigned long watermark;
> > > > > +	unsigned long limited;
> > > > 
> > > > IMHO, failcnt would fit better.
> > > 
> > > I never liked the failcnt name, but also have to admit that "limited"
> > > is crap.  Let's leave it at failcnt for now.
> > > 
> > > > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
> > > > 
> > > > When I first saw this function, I couldn't realize by looking at its
> > > > name what it's intended to do. I think
> > > > 
> > > > 	page_counter_cancel_local_charge()
> > > > 
> > > > would fit better.
> > > 
> > > It's a fairly unwieldy name.  How about page_counter_sub()?  local_sub()?
> > 
> > The _sub suffix doesn't match _charge/_uncharge. May be
> > page_counter_local_uncharge, or _uncharge_local?
> 
> I always think of a charge as the full hierarchical quantity, but this
> function only clips that one counter and so anything "uncharge" sounds
> terribly wrong to me.  But I can't think of anything great, either.
> 
> Any more ideas? :)

Not yet :(

> > > > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > > > +			struct page_counter **fail);
> > > > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > > > 
> > > > Hmm, why not page_counter_set_limit?
> > > 
> > > Limit is used as a verb here, "to limit".  Getters and setters are
> > > usually wrappers around unusual/complex data structure access,
> > 
> > Not necessarily. Look at percpu_counter_read e.g. It's a one-line
> > getter, which we could easily live w/o, but still it's there.
> 
> It abstracts an unusual and error-prone access to a counter value,
> i.e. reading an unsigned quantity out of a signed variable.

Wait, percpu_counter_read retval has type s64 and it returns
percpu_counter->count, which is also s64. So it's a 100% equivalent of
plain reading of percpu_counter->count.

> > > but this function does a lot more, so I'm not fond of _set_limit().
> > 
> > Nevertheless, everything it does can be perfectly described in one
> > sentence "it tries to set the new value of the limit", so it does
> > function as a setter. And if there's a setter, there must be a getter
> > IMO.
> 
> That's oversimplifying things.  Setting a limit requires enforcing a
> whole bunch of rules and synchronization, whereas reading a limit is
> accessing an unsigned long.

Let's look at percpu_counter once again (excuse me for sticking to it,
but it seems to be a good example): percpu_counter_set requires
enforcing a whole bunch of rules and synchronization, whereas reading
percpu_counter value is accessing an s64. Nevertheless, they're
*semantically* a getter and a setter. The same is fair for the
page_counter IMO.

I don't want to enforce you to introduce the getter, it's not that
important to me. Just reasoning.

> In general I agree that we should strive for symmetry and follow the
> principle of least surprise, but in terms of complexity these two
> operations are very different, and providing a getter on principle
> would not actually improve readability in this case.
> 
> > > > > @@ -1218,34 +1217,26 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> > > > >  					      unsigned long amt,
> > > > >  					      int *parent_status)
> > > > >  {
> > > > > -	struct res_counter *fail;
> > > > > -	int ret;
> > > > > +	page_counter_charge(&prot->memory_allocated, amt, NULL);
> > > > >  
> > > > > -	ret = res_counter_charge_nofail(&prot->memory_allocated,
> > > > > -					amt << PAGE_SHIFT, &fail);
> > > > > -	if (ret < 0)
> > > > > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > > > > +	    prot->memory_allocated.limit)
> > > > 
> > > > I don't like your equivalent of res_counter_charge_nofail.
> > > > 
> > > > Passing NULL to page_counter_charge might be useful if one doesn't have
> > > > a back-off strategy, but still want to fail on hitting the limit. With
> > > > your interface the user must pass something to the function then, which
> > > > isn't convenient.
> > > > 
> > > > Besides, it depends on the internal implementation of the page_counter
> > > > struct. I'd encapsulate this.
> > > 
> > > Thinking about this more, I don't like my version either; not because
> > > of how @fail must always be passed, but because of how it changes the
> > > behavior.  I changed the API to
> > > 
> > > void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> > > int page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
> > >                             struct page_counter **fail);
> > 
> > That looks good to me. I would also add something like
> > 
> >   bool page_counter_exceeds_limit(struct page_counter *counter);
> > 
> > to use instead of this
> > 
> > +	if (atomic_long_read(&prot->memory_allocated.count) >
> > +	    prot->memory_allocated.limit)
> 
> I really don't see the point in obscuring a simple '<' behind a
> function call.  What follows this is that somebody adds it for the
> soft limit, and later for any other type of relational comparison.

I think I agree with you at this point.

> > > > >  		break;
> > > > >  	case RES_FAILCNT:
> > > > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > > > +		counter->limited = 0;
> > > > 
> > > > page_counter_reset_failcnt?
> > > 
> > > That would be more obscure than counter->failcnt = 0, I think.
> > 
> > There's one thing that bothers me about this patch. Before, all the
> > functions operating on res_counter were mutually smp-safe, now they
> > aren't. E.g. if the failcnt reset races with the falcnt increment from
> > page_counter_try_charge, the reset might be skipped. You only use the
> > atomic type for the counter, but my guess is that failcnt and watermark
> > should be atomic too, at least if we're not going to get rid of them
> > soon. Otherwise, it should be clearly stated that failcnt and watermark
> > are racy.
> 
> It's fair enough that the raciness should be documented, but both
> counters are such roundabout metrics to begin with that it really
> doesn't matter.  What's the difference between a failcnt of 590 and
> 600 in practical terms?  And what does it matter if the counter
> watermark is off by a few pages when there are per-cpu caches on top
> of the counters, and the majority of workloads peg the watermark to
> the limit during startup anyway?

Suppose failcnt=42000. The user resets it and gets 42001 instead of 0.
That's a huge difference.
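
To spell the interleaving out, here is a minimal userspace sketch of how
the reset gets lost (made-up names, plain C, just an illustration - not
the kernel code):

#include <stdio.h>

struct counter {
	unsigned long failcnt;	/* plain, non-atomic, as in the patch */
};

int main(void)
{
	struct counter c = { .failcnt = 42000 };

	/* charger: the "c->failcnt++" read-modify-write begins */
	unsigned long snap = c.failcnt;		/* reads 42000 */

	/* user: writes 0 to the failcnt file right now */
	c.failcnt = 0;				/* reset */

	/* charger: finishes the increment with the stale value */
	c.failcnt = snap + 1;			/* 42001, the reset is gone */

	printf("failcnt after reset: %lu\n", c.failcnt);
	return 0;
}

With a plain unsigned long both orderings of the stores are possible; if
failcnt were an atomic_long_t updated with atomic operations, the reset
would land either before or after the increment, but could never be
swallowed like this.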

> > Anyway, that's where the usefulness of res_counter_reset_failcnt
> > reveals itself. If one decides to make it race-free one day, they won't have to
> > modify code outside the page_counter definition.
> 
> A major problem with cgroups overall was that it was designed for a
> lot of hypotheticals that are irrelevant in practice but incur very
> high costs.  Multiple orthogonal hierarchies is the best example, but
> using locked byte counters that can be used to account all manner of
> resources, with accurate watermarks and limit failures, when all we
> need to count is pages and nobody cares about accurate watermarks and
> limit failures, is another one.
> 
> It's very unlikely that failcnt and watermark will have to be atomic
> ever again, so there is very little hypothetical upside to wrapping a
> '= 0' in a function.  But such indirection comes at a real cost.

Hmm, why? I mean why could making failcnt atomic be problematic? IMO it
wouldn't complicate the code, nor would it affect performance. And
why does page_counter_reset_failcnt() come at a real cost? It's +4 lines
to your patch.

> > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > > > +{
> > > > > +	for (;;) {
> > > > > +		unsigned long count;
> > > > > +		unsigned long old;
> > > > > +
> > > > > +		count = atomic_long_read(&counter->count);
> > > > > +
> > > > > +		old = xchg(&counter->limit, limit);
> > > > > +
> > > > > +		if (atomic_long_read(&counter->count) != count) {
> > > > > +			counter->limit = old;
> > > > 
> > > > I wonder what can happen if two threads execute this function
> > > > concurrently... or may be it's not supposed to be smp-safe?
> > > 
> > > memcg already holds the set_limit_mutex here.  I updated the tcp and
> > > hugetlb controllers accordingly to take limit locks as well.
> > 
> > I would prefer page_counter to handle it internally, because we won't
> > need the set_limit_mutex once memsw is converted to plain swap
> > accounting.
> >
> > Besides, memcg_update_kmem_limit doesn't take it. Any chance to
> > achieve that w/o spinlocks, using only atomic variables?
> 
> We still need it to serialize concurrent access to the memory limit,
> and I updated the patch to have kmem take it as well.  It's such a
> cold path that using a lockless scheme and worrying about coherency
> between updates is not worth it, I think.

OK, it's not that important actually. Please state explicitly in the
comment to the function definition that it needs external
synchronization then.
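
Just to illustrate the kind of interleaving I mean, two racing limit
updates can leave a stale value behind; roughly (userspace sketch with
made-up names, not the kernel code):

#include <stdio.h>

static unsigned long limit = 100;	/* models counter->limit */

static unsigned long xchg_limit(unsigned long new)
{
	unsigned long old = limit;	/* stands in for xchg() */
	limit = new;
	return old;
}

int main(void)
{
	/* thread A wants limit=200, thread B wants limit=300 */
	unsigned long a_old = xchg_limit(200);	/* a_old = 100 */
	unsigned long b_old = xchg_limit(300);	/* b_old = 200 */

	/* A sees the count change under it and puts back what it saw */
	limit = a_old;				/* limit = 100 again */

	/* ... which silently throws away B's successful update */
	printf("limit is %lu although B set 300\n", limit);
	(void)b_old;
	return 0;
}

So either all callers keep serializing externally, like memcg does with
set_limit_mutex, or the retry loop has to be made robust against
concurrent writers.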

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 15:21           ` Vladimir Davydov
@ 2014-09-23 17:05             ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-23 17:05 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue, Sep 23, 2014 at 07:21:50PM +0400, Vladimir Davydov wrote:
> On Tue, Sep 23, 2014 at 09:28:01AM -0400, Johannes Weiner wrote:
> > On Tue, Sep 23, 2014 at 03:06:34PM +0400, Vladimir Davydov wrote:
> > > On Mon, Sep 22, 2014 at 02:57:36PM -0400, Johannes Weiner wrote:
> > > > On Mon, Sep 22, 2014 at 06:41:58PM +0400, Vladimir Davydov wrote:
> > > > > On Fri, Sep 19, 2014 at 09:22:08AM -0400, Johannes Weiner wrote:
> > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > index 19df5d857411..bf8fb1a05597 100644
> > > > > > --- a/include/linux/memcontrol.h
> > > > > > +++ b/include/linux/memcontrol.h
> > > > > > @@ -54,6 +54,38 @@ struct mem_cgroup_reclaim_cookie {
> > > > > >  };
> > > > > >  
> > > > > >  #ifdef CONFIG_MEMCG
> > > > > > +
> > > > > > +struct page_counter {
> > > > > 
> > > > > I'd place it in a separate file, say
> > > > > 
> > > > > 	include/linux/page_counter.h
> > > > > 	mm/page_counter.c
> > > > > 
> > > > > just to keep mm/memcontrol.c clean.
> > > > 
> > > > The page counters are the very core of the memory controller and, as I
> > > > said to Michal, I want to integrate the hugetlb controller into memcg
> > > > as well, at which point there won't be any outside users anymore.  So
> > > > I think this is the right place for it.
> > > 
> > > Hmm, there might be memcg users out there that don't want to pay for
> > > hugetlb accounting. Or is the overhead supposed to be negligible?
> > 
> > Yes.  But if it gets in the way, it creates pressure to optimize it.
> 
> There always will be an overhead no matter how we optimize it.
> 
> I think we should only merge them if it could really help simplify the
> code, for instance if they were dependent on each other. Anyway, I'm not
> an expert in the hugetlb cgroup, so I can't judge whether it's good or
> not. I believe you should raise this topic separately and see if others
> agree with you.

It's not a dependency; it's that there is a lot of redundancy in the
code because they do pretty much the same thing, and ongoing breakage
from stringing such a foreign object along.  Those two things have been
a problem with memcg from the beginning and created a lot of grief.

But I agree that it's a separate topic.  The only question for now is
whether the page counter should be in its own file.  They are pretty
much half of what the memory controller does (account & limit, track & reclaim), so
they are not misplaced in memcontrol.c, and there is one measly user
outside of memcg proper, which is not hurt by having to compile memcg
into the kernel.

> > That's the same reason why I've been trying to integrate memcg into
> > the rest of the VM for over two years now - aside from resulting in
> > much more unified code, it forces us to compete, and it increases our
> > testing exposure by several orders of magnitude.
> > 
> > The only reason we are discussing lockless page counters right now is
> > because we got rid of "memcg specialness" and exposed res_counters to
> > the rest of the world; and boy did that instantly raise the bar on us.
> > 
> > > Anyway, I still don't think it's a good idea to keep all the definitions
> > > in the same file. memcontrol.c is already huge. Adding more code to it
> > > is not desirable, especially if it can naturally live in a separate
> > > file. And since the page_counter is independent of the memcg core and
> > > *looks* generic, I believe we should keep it separately.
> > 
> > It's less code than what I just deleted, and half of it seems
> > redundant when integrated into memcg.  This code would benefit a lot
> > from being part of memcg, and memcg could reduce its public API.
> 
> I think I understand. You are afraid that other users of the
> page_counter will show up one day, and you won't be able to modify its
> API freely. That's reasonable. But we can solve it while still keeping
> page_counter separately. For example, put the header to mm/ and state
> clearly that it's for memcontrol.c and nobody is allowed to use it w/o a
> special permission.
> 
> My point is that it's easier to maintain the code divided in logical
> parts. And page_counter seems to be such a logical part.

It's not about freely changing it; it's that there is no user outside
of the memory controller proper, and there is none in sight.

What makes memcontrol.c hard to deal with is that different things are
interspersed.  The page_counter is in its own little section in there,
and the compiler can optimize and inline the important fastpaths.

> Coming to think about placing page_counter.h to mm/, another question
> springs into my mind. Why do you force kmem.tcp code to use the
> page_counter instead of the res_counter? Nobody seems to use it and it
> should pass away sooner or later. Maybe it's worth leaving kmem.tcp
> using res_counter? I think we could isolate kmem.tcp under a separate
> config option depending on the RES_COUNTER, and mark them both as
> deprecated somehow.

What we usually do is keep changing deprecated or unused code with
minimal/compile-only testing.  Eventually somebody will notice that
something major broke while doing this but that no one has complained
in months or even years, at which point we remove it.

I'm curious why you reach the conclusion that we should string *more*
code along for unused interfaces, rather than less?

> > > > > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > > > > +			struct page_counter **fail);
> > > > > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > > > > 
> > > > > Hmm, why not page_counter_set_limit?
> > > > 
> > > > Limit is used as a verb here, "to limit".  Getters and setters are
> > > > usually wrappers around unusual/complex data structure access,
> > > 
> > > Not necessarily. Look at percpu_counter_read e.g. It's a one-line
> > > getter, which we could easily live w/o, but still it's there.
> > 
> > It abstracts an unusual and error-prone access to a counter value,
> > i.e. reading an unsigned quantity out of a signed variable.
> 
> Wait, percpu_counter_read retval has type s64 and it returns
> percpu_counter->count, which is also s64. So it's a 100% equivalent of
> plain reading of percpu_counter->count.

My bad, I read page_counter instead of percpu_counter.  Nonetheless, I
did add page_counter_read().

> > > > but this function does a lot more, so I'm not fond of _set_limit().
> > > 
> > > Nevertheless, everything it does can be perfectly described in one
> > > sentence "it tries to set the new value of the limit", so it does
> > > function as a setter. And if there's a setter, there must be a getter
> > > IMO.
> > 
> > That's oversimplifying things.  Setting a limit requires enforcing a
> > whole bunch of rules and synchronization, whereas reading a limit is
> > accessing an unsigned long.
> 
> Let's look at percpu_counter once again (excuse me for sticking to it,
> but it seems to be a good example): percpu_counter_set requires
> enforcing a whole bunch of rules and synchronization, whereas reading
> percpu_counter value is accessing an s64. Nevertheless, they're
> *semantically* a getter and a setter. The same is true for the
> page_counter IMO.
> 
> I don't want to force you to introduce the getter; it's not that
> important to me. Just reasoning.

You can always find such comparisons, but the only thing that counts
is the engineering merit, which I don't see for the page limit.

percpu_counter_read() is more obvious, because it's an API that is
expected to be widely used, and the "counter" is actually a complex
data structure.  That accessor might choose to do postprocessing like
underflow or percpu-variance correction at some point, and then it can
change callers all over the tree in a single place.

None of this really applies to the page counter limit, however.

> > > > > >  		break;
> > > > > >  	case RES_FAILCNT:
> > > > > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > > > > +		counter->limited = 0;
> > > > > 
> > > > > page_counter_reset_failcnt?
> > > > 
> > > > That would be more obscure than counter->failcnt = 0, I think.
> > > 
> > > There's one thing that bothers me about this patch. Before, all the
> > > functions operating on res_counter were mutually smp-safe, now they
> > > aren't. E.g. if the failcnt reset races with the failcnt increment from
> > > page_counter_try_charge, the reset might be skipped. You only use the
> > > atomic type for the counter, but my guess is that failcnt and watermark
> > > should be atomic too, at least if we're not going to get rid of them
> > > soon. Otherwise, it should be clearly stated that failcnt and watermark
> > > are racy.
> > 
> > It's fair enough that the raciness should be documented, but both
> > counters are such roundabout metrics to begin with that it really
> > doesn't matter.  What's the difference between a failcnt of 590 and
> > 600 in practical terms?  And what does it matter if the counter
> > watermark is off by a few pages when there are per-cpu caches on top
> > of the counters, and the majority of workloads peg the watermark to
> > the limit during startup anyway?
> 
> Suppose failcnt=42000. The user resets it and gets 42001 instead of 0.
> That's a huge difference.

I guess that's true, but I still have a really hard time caring.  Who
resets this in the middle of ongoing execution?  Who resets this at
all instead of just comparing before/after snapshots, like with all
other mm stats?  And is anybody even using these pointless metrics?

I'm inclined to just put stop_machine() into the reset functions...

> > > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > > > > +{
> > > > > > +	for (;;) {
> > > > > > +		unsigned long count;
> > > > > > +		unsigned long old;
> > > > > > +
> > > > > > +		count = atomic_long_read(&counter->count);
> > > > > > +
> > > > > > +		old = xchg(&counter->limit, limit);
> > > > > > +
> > > > > > +		if (atomic_long_read(&counter->count) != count) {
> > > > > > +			counter->limit = old;
> > > > > 
> > > > > I wonder what can happen if two threads execute this function
> > > > > concurrently... or may be it's not supposed to be smp-safe?
> > > > 
> > > > memcg already holds the set_limit_mutex here.  I updated the tcp and
> > > > hugetlb controllers accordingly to take limit locks as well.
> > > 
> > > I would prefer page_counter to handle it internally, because we won't
> > > need the set_limit_mutex once memsw is converted to plain swap
> > > accounting.
> > >
> > > Besides, memcg_update_kmem_limit doesn't take it. Any chance to
> > > achieve that w/o spinlocks, using only atomic variables?
> > 
> > We still need it to serialize concurrent access to the memory limit,
> > and I updated the patch to have kmem take it as well.  It's such a
> > cold path that using a lockless scheme and worrying about coherency
> > between updates is not worth it, I think.
> 
> OK, it's not that important actually. Please state explicitly in the
> comment to the function definition that it needs external
> synchronization then.

Yeah, I documented everything now.

How about the following update?  Don't be thrown by the
page_counter_cancel(), I went back to it until we find something more
suitable.  But as long as it's documented and has only 1.5 callsites,
it shouldn't matter all that much TBH.

Thanks for your invaluable feedback so far, and sorry if the original
patch was hard to review.  I'll try to break it up; to me it's usually
easier to verify new functions by looking at the callers in the same
patch, but I can probably remove the res_counter in a follow-up patch.
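
In case it is useful for review, below is a tiny standalone userspace
model of the fetch-and-add try_charge scheme (C11 atomics stand in for
atomic_long_t, the pc/pc_try_charge names are made up, and it is only an
illustration of the unwind behaviour, not the kernel code):

#include <stdatomic.h>
#include <stdio.h>

struct pc {
	atomic_long count;
	long limit;
	long failcnt;			/* racy on purpose, see above */
	struct pc *parent;
};

static int pc_try_charge(struct pc *counter, long nr, struct pc **fail)
{
	struct pc *c;

	for (c = counter; c; c = c->parent) {
		long new = atomic_fetch_add(&c->count, nr) + nr;

		if (new > c->limit) {
			/* back out of this level, then unwind the ancestors */
			atomic_fetch_sub(&c->count, nr);
			c->failcnt++;
			*fail = c;
			goto failed;
		}
	}
	return 0;
failed:
	for (c = counter; c != *fail; c = c->parent)
		atomic_fetch_sub(&c->count, nr);
	return -1;
}

int main(void)
{
	struct pc root = { .limit = 2 };
	struct pc child = { .limit = 4, .parent = &root };
	struct pc *fail = NULL;
	int ret;

	atomic_init(&root.count, 0);
	atomic_init(&child.count, 0);

	ret = pc_try_charge(&child, 2, &fail);
	printf("charge 2 -> %d\n", ret);		/* 0, fits both limits */

	ret = pc_try_charge(&child, 1, &fail);
	printf("charge 1 -> %d in %s\n", ret,
	       fail == &root ? "root" : "child");	/* -1, root is full */

	printf("child=%ld root=%ld\n",
	       (long)atomic_load(&child.count),
	       (long)atomic_load(&root.count));		/* both back to 2 */
	return 0;
}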

---
From f90a203fd445a7f76c7b273294d4a8f117691d05 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 23 Sep 2014 11:48:48 -0400
Subject: [patch] mm: memcontrol: lockless page counters fix 2

- page_counter_sub -> page_counter_cancel [johannes]
- document page counter API [vladimir]
- WARN_ON_ONCE and revert on counter underflow [kame]
- convert page_counter_try_charge() from CAS to FAA [vladimir]
---
 include/linux/memcontrol.h |   2 +-
 mm/hugetlb_cgroup.c        |   2 +-
 mm/memcontrol.c            | 100 +++++++++++++++++++++++++++++++++------------
 3 files changed, 76 insertions(+), 28 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a8b939376a5d..1bda77dff591 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -84,7 +84,7 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
 	return atomic_long_read(&counter->count);
 }
 
-int page_counter_sub(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_try_charge(struct page_counter *counter,
 			    unsigned long nr_pages,
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index abd1e8dc7b46..aae47a24ec0e 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -131,7 +131,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg,
 	}
 	counter = &h_cg->hugepage[idx];
 	/* Take the pages off the local counter */
-	page_counter_sub(counter, nr_pages);
+	page_counter_cancel(counter, nr_pages);
 
 	set_hugetlb_cgroup(page, parent);
 out:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec2210965686..70839678d805 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,18 +65,32 @@
 
 #include <trace/events/vmscan.h>
 
-int page_counter_sub(struct page_counter *counter, unsigned long nr_pages)
+/**
+ * page_counter_cancel - take pages out of the local counter
+ * @counter: counter
+ * @nr_pages: number of pages to cancel
+ *
+ * Returns whether there are remaining pages in the counter.
+ */
+int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
 {
 	long new;
 
 	new = atomic_long_sub_return(nr_pages, &counter->count);
 
-	if (WARN_ON(unlikely(new < 0)))
-		atomic_long_set(&counter->count, 0);
+	if (WARN_ON_ONCE(unlikely(new < 0)))
+		atomic_long_add(nr_pages, &counter->count);
 
 	return new > 0;
 }
 
+/**
+ * page_counter_charge - hierarchically charge pages
+ * @counter: counter
+ * @nr_pages: number of pages to charge
+ *
+ * NOTE: This may exceed the configured counter limits.
+ */
 void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 {
 	struct page_counter *c;
@@ -91,6 +105,15 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 	}
 }
 
+/**
+ * page_counter_try_charge - try to hierarchically charge pages
+ * @counter: counter
+ * @nr_pages: number of pages to charge
+ * @fail: points first counter to hit its limit, if any
+ *
+ * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
+ * its ancestors has hit its limit.
+ */
 int page_counter_try_charge(struct page_counter *counter,
 			    unsigned long nr_pages,
 			    struct page_counter **fail)
@@ -98,37 +121,44 @@ int page_counter_try_charge(struct page_counter *counter,
 	struct page_counter *c;
 
 	for (c = counter; c; c = c->parent) {
-		for (;;) {
-			long count;
-			long new;
-
-			count = atomic_long_read(&c->count);
-
-			new = count + nr_pages;
-			if (new > c->limit) {
-				c->failcnt++;
-				*fail = c;
-				goto failed;
-			}
-
-			if (atomic_long_cmpxchg(&c->count, count, new) != count)
-				continue;
-
-			if (new > c->watermark)
-				c->watermark = new;
+		long new;
 
-			break;
+		new = atomic_long_add_return(nr_pages, &c->count);
+		if (new > c->limit) {
+			atomic_long_sub(nr_pages, &c->count);
+			/*
+			 * This is racy, but the failcnt is only a
+			 * ballpark metric anyway.
+			 */
+			c->failcnt++;
+			*fail = c;
+			goto failed;
 		}
+		/*
+		 * This is racy, but with the per-cpu caches on top
+		 * this is a ballpark metric as well, and with lazy
+		 * cache reclaim, the majority of workloads peg the
+		 * watermark to the group limit soon after launch.
+		 */
+		if (new > c->watermark)
+			c->watermark = new;
 	}
 	return 0;
 
 failed:
 	for (c = counter; c != *fail; c = c->parent)
-		page_counter_sub(c, nr_pages);
+		page_counter_cancel(c, nr_pages);
 
 	return -ENOMEM;
 }
 
+/**
+ * page_counter_uncharge - hierarchically uncharge pages
+ * @counter: counter
+ * @nr_pages: number of pages to uncharge
+ *
+ * Returns whether there are remaining charges in @counter.
+ */
 int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 {
 	struct page_counter *c;
@@ -137,7 +167,7 @@ int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 	for (c = counter; c; c = c->parent) {
 		int remainder;
 
-		remainder = page_counter_sub(c, nr_pages);
+		remainder = page_counter_cancel(c, nr_pages);
 		if (c == counter && !remainder)
 			ret = 0;
 	}
@@ -145,6 +175,16 @@ int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 	return ret;
 }
 
+/**
+ * page_counter_limit - limit the number of pages allowed
+ * @counter: counter
+ * @limit: limit to set
+ *
+ * Returns 0 on success, -EBUSY if the current number of pages on the
+ * counter already exceeds the specified limit.
+ *
+ * The caller must serialize invocations on the same counter.
+ */
 int page_counter_limit(struct page_counter *counter, unsigned long limit)
 {
 	for (;;) {
@@ -169,6 +209,14 @@ int page_counter_limit(struct page_counter *counter, unsigned long limit)
 	}
 }
 
+/**
+ * page_counter_memparse - memparse() for page counter limits
+ * @buf: string to parse
+ * @nr_pages: returns the result in number of pages
+ *
+ * Returns -EINVAL, or 0 and @nr_pages on success.  @nr_pages will be
+ * limited to %PAGE_COUNTER_MAX.
+ */
 int page_counter_memparse(const char *buf, unsigned long *nr_pages)
 {
 	char unlimited[] = "-1";
@@ -3572,9 +3620,9 @@ static int mem_cgroup_move_parent(struct page *page,
 				pc, child, parent);
 	if (!ret) {
 		/* Take charge off the local counters */
-		page_counter_sub(&child->memory, nr_pages);
+		page_counter_cancel(&child->memory, nr_pages);
 		if (do_swap_account)
-			page_counter_sub(&child->memsw, nr_pages);
+			page_counter_cancel(&child->memsw, nr_pages);
 	}
 
 	if (nr_pages > 1)
-- 
2.1.0



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 14:28               ` Michal Hocko
@ 2014-09-23 22:33                 ` David Rientjes
  -1 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2014-09-23 22:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Greg Thelen, Dave Hansen, cgroups,
	linux-kernel

On Tue, 23 Sep 2014, Michal Hocko wrote:

> On Tue 23-09-14 10:05:26, Johannes Weiner wrote:
> [...]
> > That's one way to put it.  But the way I see it is that I remove a
> > generic resource counter and replace it with a pure memory counter
> > which I put where we account and limit memory - with one exception
> > that is hardly worth creating a dedicated library file for.
> 
> So you seem to completely ignore people doing CONFIG_CGROUP_HUGETLB &&
> !CONFIG_MEMCG without any justification, hiding it behind a performance
> improvement which those users didn't even ask for yet.
> 
> All that just to not have one additional header and c file hidden by
> CONFIG_PAGE_COUNTER selected by both controllers. No special
> configuration option is really needed for CONFIG_PAGE_COUNTER.
> 

I'm hoping that if there is a merge, there will not be an implicit reliance 
on struct page_cgroup for the hugetlb cgroup.  We boot a lot of our 
machines with very large numbers of hugetlb pages on the kernel command 
line (>95% of memory) and can save hundreds of megabytes (meaning more 
overcommit hugepages!) by freeing unneeded and unused struct page_cgroup 
for CONFIG_SPARSEMEM.

> > I only explained my plans of merging all memory controllers because I
> > assumed we could ever be on the same page when it comes to this code.
> 
> I doubt this is a good plan but I might be wrong here. The main point
> stands: there is no good reason to make hugetlb depend on memcg right now
> and such a change _shouldn't_ be hidden behind an unrelated change. From
> hugetlb container point of view this is just a different counter which
> doesn't depend on memcg. I am really surprised you are pushing for this
> so hard right now because it only distracts from the main motivation of
> your patch.
> 

I could very easily imagine a user who would like to use hugetlb cgroup 
without memcg if hugetlb cgroup would charge reserved but unmapped hugetlb 
pages to its cgroup as well.  It's quite disappointing that the hugetlb 
cgroup allows a user to map 100% of a machine's hugetlb pages from the 
reserved pool while its hugetlb cgroup limit is much smaller since this 
prevents anybody else from using the global resource simply because 
someone has reserved but not faulted hugepages.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 17:05             ` Johannes Weiner
@ 2014-09-24  8:02               ` Vladimir Davydov
  -1 siblings, 0 replies; 53+ messages in thread
From: Vladimir Davydov @ 2014-09-24  8:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Dave Hansen, cgroups, linux-kernel

On Tue, Sep 23, 2014 at 01:05:25PM -0400, Johannes Weiner wrote:
> On Tue, Sep 23, 2014 at 07:21:50PM +0400, Vladimir Davydov wrote:
> > On Tue, Sep 23, 2014 at 09:28:01AM -0400, Johannes Weiner wrote:
> > > On Tue, Sep 23, 2014 at 03:06:34PM +0400, Vladimir Davydov wrote:
[...]
> > > > Anyway, I still don't think it's a good idea to keep all the definitions
> > > > in the same file. memcontrol.c is already huge. Adding more code to it
> > > > is not desirable, especially if it can naturally live in a separate
> > > > file. And since the page_counter is independent of the memcg core and
> > > > *looks* generic, I believe we should keep it separately.
> > > 
> > > It's less code than what I just deleted, and half of it seems
> > > redundant when integrated into memcg.  This code would benefit a lot
> > > from being part of memcg, and memcg could reduce its public API.
> > 
> > I think I understand. You are afraid that other users of the
> > page_counter will show up one day, and you won't be able to modify its
> > API freely. That's reasonable. But we can solve it while still keeping
> > page_counter separately. For example, put the header to mm/ and state
> > clearly that it's for memcontrol.c and nobody is allowed to use it w/o a
> > special permission.
> > 
> > My point is that it's easier to maintain the code divided in logical
> > parts. And page_counter seems to be such a logical part.
> 
> It's not about freely changing it, it's that there is no user outside
> of memory controlling proper, and there is none in sight.
> 
> What makes memcontrol.c hard to deal with is that different things are
> interspersed.  The page_counter is in its own little section in there,
> and the compiler can optimize and inline the important fastpaths.

I'm afraid we may end up like kernel/sched, which remained a bunch of .c
files included into one another for a long time until it was finally
split properly into logical parts.

> > Coming to think about placing page_counter.h to mm/, another question
> > springs into my mind. Why do you force kmem.tcp code to use the
> > page_counter instead of the res_counter? Nobody seems to use it and it
> > should pass away sooner or later. May be it's worth leaving kmem.tcp
> > using res_counter? I think we could isolate kmem.tcp under a separate
> > config option depending on the RES_COUNTER, and mark them both as
> > deprecated somehow.
> 
> What we usually do is keep changing deprecated or unused code with
> minimal/compile-only testing.  Eventually somebody will notice that
> something major broke while doing this but that no one has complained
> in months or even years, at which point we remove it.
> 
> I'm curious why you reach the conclusion that we should string *more*
> code along for unused interfaces, rather than less?

This would reduce the patch footprint. The remaining code (res_counter)
would be disabled by default anyway, so it wouldn't result in any binary
growth. And it wouldn't really affect the code base, because it lives
peacefully in a separate file. OTOH this would allow you to make
page_counter private to memcontrol.c sooner, w/o waiting until kmem.tcp
is gone.

> > > > > > > +int page_counter_charge(struct page_counter *counter, unsigned long nr_pages,
> > > > > > > +			struct page_counter **fail);
> > > > > > > +int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
> > > > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit);
> > > > > > 
> > > > > > Hmm, why not page_counter_set_limit?
> > > > > 
> > > > > Limit is used as a verb here, "to limit".  Getters and setters are
> > > > > usually wrappers around unusual/complex data structure access,
> > > > 
> > > > Not necessarily. Look at percpu_counter_read e.g. It's a one-line
> > > > getter, which we could easily live w/o, but still it's there.
> > > 
> > > It abstracts an unusual and error-prone access to a counter value,
> > > i.e. reading an unsigned quantity out of a signed variable.
> > 
> > Wait, percpu_counter_read retval has type s64 and it returns
> > percpu_counter->count, which is also s64. So it's a 100% equivalent of
> > plain reading of percpu_counter->count.
> 
> My bad, I read page_counter instead of percpu_counter.  Nonetheless, I
> did add page_counter_read().
> 
> > > > > but this function does a lot more, so I'm not fond of _set_limit().
> > > > 
> > > > Nevertheless, everything it does can be perfectly described in one
> > > > sentence "it tries to set the new value of the limit", so it does
> > > > function as a setter. And if there's a setter, there must be a getter
> > > > IMO.
> > > 
> > > That's oversimplifying things.  Setting a limit requires enforcing a
> > > whole bunch of rules and synchronization, whereas reading a limit is
> > > accessing an unsigned long.
> > 
> > Let's look at percpu_counter once again (excuse me for sticking to it,
> > but it seems to be a good example): percpu_counter_set requires
> > enforcing a whole bunch of rules and synchronization, whereas reading
> > percpu_counter value is accessing an s64. Nevertheless, they're
> > *semantically* a getter and a setter. The same is fair for the
> > page_counter IMO.
> > 
> > I don't want to enforce you to introduce the getter, it's not that
> > important to me. Just reasoning.
> 
> You can always find such comparisons, but the only thing that counts
> is the engineering merit, which I don't see for the page limit.
> 
> percpu_counter_read() is more obvious, because it's an API that is
> expected to be widely used, and the "counter" is actually a complex
> data structure.  That accessor might choose to do postprocessing like
> underflow or percpu-variance correction at some point, and then it can
> change callers all over the tree in a single place.
> 
> None of this really applies to the page counter limit, however.

OK, you convinced me.

> > > > > > >  		break;
> > > > > > >  	case RES_FAILCNT:
> > > > > > > -		res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> > > > > > > +		counter->limited = 0;
> > > > > > 
> > > > > > page_counter_reset_failcnt?
> > > > > 
> > > > > That would be more obscure than counter->failcnt = 0, I think.
> > > > 
> > > > There's one thing that bothers me about this patch. Before, all the
> > > > functions operating on res_counter were mutually smp-safe, now they
> > > > aren't. E.g. if the failcnt reset races with the failcnt increment from
> > > > page_counter_try_charge, the reset might be skipped. You only use the
> > > > atomic type for the counter, but my guess is that failcnt and watermark
> > > > should be atomic too, at least if we're not going to get rid of them
> > > > soon. Otherwise, it should be clearly stated that failcnt and watermark
> > > > are racy.
> > > 
> > > It's fair enough that the raciness should be documented, but both
> > > counters are such roundabout metrics to begin with that it really
> > > doesn't matter.  What's the difference between a failcnt of 590 and
> > > 600 in practical terms?  And what does it matter if the counter
> > > watermark is off by a few pages when there are per-cpu caches on top
> > > of the counters, and the majority of workloads peg the watermark to
> > > the limit during startup anyway?
> > 
> > Suppose failcnt=42000. The user resets it and gets 42001 instead of 0.
> > That's a huge difference.
> 
> I guess that's true, but I still have a really hard time caring.  Who
> resets this in the middle of ongoing execution?  Who resets this at
> all instead of just comparing before/after snapshots, like with all
> other mm stats?  And is anybody even using these pointless metrics?

I don't know about the watermark, but the failcnt can be really useful
when investigating why your container started to behave badly. It might
also be used for testing that memcg limits work properly. Not sure if
anybody would need to reset it, though.

> I'm inclined to just put stop_machine() into the reset functions...

I don't see why making it atomic could be worse. Anyway, I think this
means we need getter and reset functions for them, because they aren't
as trivial as the limit.
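
For illustration, here is a minimal sketch (not the kernel code) of how a
plain, non-atomic failcnt can lose a reset against a concurrent increment:

/*
 * CPU A (try_charge hits the limit)      CPU B (userspace reset)
 *
 *   tmp = c->failcnt;     // reads 42000
 *                                          c->failcnt = 0;
 *   c->failcnt = tmp + 1; // stores 42001: the reset is lost
 */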

> > > > > > > +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> > > > > > > +{
> > > > > > > +	for (;;) {
> > > > > > > +		unsigned long count;
> > > > > > > +		unsigned long old;
> > > > > > > +
> > > > > > > +		count = atomic_long_read(&counter->count);
> > > > > > > +
> > > > > > > +		old = xchg(&counter->limit, limit);
> > > > > > > +
> > > > > > > +		if (atomic_long_read(&counter->count) != count) {
> > > > > > > +			counter->limit = old;
> > > > > > 
> > > > > > I wonder what can happen if two threads execute this function
> > > > > > concurrently... or may be it's not supposed to be smp-safe?
> > > > > 
> > > > > memcg already holds the set_limit_mutex here.  I updated the tcp and
> > > > > hugetlb controllers accordingly to take limit locks as well.
> > > > 
> > > > I would prefer page_counter to handle it internally, because we won't
> > > > need the set_limit_mutex once memsw is converted to plain swap
> > > > accounting.
> > > >
> > > > Besides, memcg_update_kmem_limit doesn't take it. Any chance to
> > > > achieve that w/o spinlocks, using only atomic variables?
> > > 
> > > We still need it to serialize concurrent access to the memory limit,
> > > and I updated the patch to have kmem take it as well.  It's such a
> > > cold path that using a lockless scheme and worrying about coherency
> > > between updates is not worth it, I think.
> > 
> > OK, it's not that important actually. Please state explicitly in the
> > comment to the function definition that it needs external
> > synchronization then.
> 
> Yeah, I documented everything now.

Thank you.
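
To summarize the conclusion above, page_counter_limit() relies on external
serialization. A minimal sketch of the expected calling convention, using
hypothetical names (memcg itself reuses its existing set_limit_mutex):

/* hypothetical example lock, standing in for memcg's set_limit_mutex */
static DEFINE_MUTEX(example_limit_mutex);

static int example_set_limit(struct page_counter *counter,
			     unsigned long new_limit)
{
	int ret;

	/* page_counter_limit() expects callers to serialize limit updates */
	mutex_lock(&example_limit_mutex);
	ret = page_counter_limit(counter, new_limit);
	mutex_unlock(&example_limit_mutex);

	return ret;
}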

> How about the following update?  Don't be thrown by the
> page_counter_cancel(), I went back to it until we find something more
> suitable.  But as long as it's documented and has only 1.5 callsites,
> it shouldn't matter all that much TBH.

A couple of minor comments inline.

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ec2210965686..70839678d805 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -65,18 +65,32 @@
>  
>  #include <trace/events/vmscan.h>
>  
> -int page_counter_sub(struct page_counter *counter, unsigned long nr_pages)
> +/**
> + * page_counter_cancel - take pages out of the local counter
> + * @counter: counter
> + * @nr_pages: number of pages to cancel
> + *
> + * Returns whether there are remaining pages in the counter.
> + */
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
>  {
>  	long new;
>  
>  	new = atomic_long_sub_return(nr_pages, &counter->count);
>  
> -	if (WARN_ON(unlikely(new < 0)))
> -		atomic_long_set(&counter->count, 0);
> +	if (WARN_ON_ONCE(unlikely(new < 0)))

WARN_ON is already unlikely by itself; there is no need to nest another
unlikely() inside it.
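
Assuming the usual WARN_ON_ONCE() definition, which already applies
unlikely() to its condition internally, the check could simply read:

	if (WARN_ON_ONCE(new < 0))
		atomic_long_add(nr_pages, &counter->count);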

> +		atomic_long_add(nr_pages, &counter->count);
>  
>  	return new > 0;
>  }
>  
> +/**
> + * page_counter_charge - hierarchically charge pages
> + * @counter: counter
> + * @nr_pages: number of pages to charge
> + *
> + * NOTE: This may exceed the configured counter limits.
> + */
>  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  {
>  	struct page_counter *c;
> @@ -91,6 +105,15 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  	}
>  }
>  
> +/**
> + * page_counter_try_charge - try to hierarchically charge pages
> + * @counter: counter
> + * @nr_pages: number of pages to charge
> + * @fail: points first counter to hit its limit, if any
> + *
> + * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
> + * its ancestors has hit its limit.
> + */
>  int page_counter_try_charge(struct page_counter *counter,
>  			    unsigned long nr_pages,
>  			    struct page_counter **fail)
> @@ -98,37 +121,44 @@ int page_counter_try_charge(struct page_counter *counter,
>  	struct page_counter *c;
>  
>  	for (c = counter; c; c = c->parent) {
> -		for (;;) {
> -			long count;
> -			long new;
> -
> -			count = atomic_long_read(&c->count);
> -
> -			new = count + nr_pages;
> -			if (new > c->limit) {
> -				c->failcnt++;
> -				*fail = c;
> -				goto failed;
> -			}
> -
> -			if (atomic_long_cmpxchg(&c->count, count, new) != count)
> -				continue;
> -
> -			if (new > c->watermark)
> -				c->watermark = new;
> +		long new;
>  
> -			break;
> +		new = atomic_long_add_return(nr_pages, &c->count);
> +		if (new > c->limit) {

I'd also add a comment explaining that this is racy too and may
result in false positives, but that this isn't critical for our use
case, as you pointed out in your previous e-mail. Just to forestall
further questions.

> +			atomic_long_sub(nr_pages, &c->count);
> +			/*
> +			 * This is racy, but the failcnt is only a
> +			 * ballpark metric anyway.
> +			 */
> +			c->failcnt++;
> +			*fail = c;
> +			goto failed;
>  		}
> +		/*
> +		 * This is racy, but with the per-cpu caches on top
> +		 * this is a ballpark metric as well, and with lazy
> +		 * cache reclaim, the majority of workloads peg the
> +		 * watermark to the group limit soon after launch.
> +		 */
> +		if (new > c->watermark)
> +			c->watermark = new;
>  	}
>  	return 0;
>  
>  failed:
>  	for (c = counter; c != *fail; c = c->parent)
> -		page_counter_sub(c, nr_pages);
> +		page_counter_cancel(c, nr_pages);
>  
>  	return -ENOMEM;
>  }
[...]

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 17:05             ` Johannes Weiner
@ 2014-09-24 13:33               ` Michal Hocko
  -1 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-24 13:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vladimir Davydov, linux-mm, Greg Thelen, Dave Hansen, cgroups,
	linux-kernel

On Tue 23-09-14 13:05:25, Johannes Weiner wrote:
[...]
> How about the following update?  Don't be thrown by the
> page_counter_cancel(), I went back to it until we find something more
> suitable.  But as long as it's documented and has only 1.5 callsites,
> it shouldn't matter all that much TBH.
> 
> Thanks for your invaluable feedback so far, and sorry if the original
> patch was hard to review.  I'll try to break it up, to me it's usually
> easier to verify new functions by looking at the callers in the same
> patch, but I can probably remove the res_counter in a follow-up patch.

The original patch was really huge and rather hard to review. Having
the res_counter removal in a separate patch would definitely be helpful.
I would even lobby for having the new page_counter in a separate patch,
with a detailed description of the semantics and expected usage. Lockless
schemes are always tricky and hard to review.

[...]
> @@ -98,37 +121,44 @@ int page_counter_try_charge(struct page_counter *counter,
>  	struct page_counter *c;
>  
>  	for (c = counter; c; c = c->parent) {
> -		for (;;) {
> -			long count;
> -			long new;
> -
> -			count = atomic_long_read(&c->count);
> -
> -			new = count + nr_pages;
> -			if (new > c->limit) {
> -				c->failcnt++;
> -				*fail = c;
> -				goto failed;
> -			}
> -
> -			if (atomic_long_cmpxchg(&c->count, count, new) != count)
> -				continue;
> -
> -			if (new > c->watermark)
> -				c->watermark = new;
> +		long new;
>  
> -			break;
> +		new = atomic_long_add_return(nr_pages, &c->count);
> +		if (new > c->limit) {
> +			atomic_long_sub(nr_pages, &c->count);
> +			/*
> +			 * This is racy, but the failcnt is only a
> +			 * ballpark metric anyway.
> +			 */
> +			c->failcnt++;
> +			*fail = c;
> +			goto failed;
>  		}

I like this much more, because the retry loop might lead to starvation.
As you pointed out in the other email, this implementation might lead
to premature reclaim instead, but I find the former issue more probable
because it can happen even when we are far away from the limit (e.g.
in an unlimited - root - memcg).
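
To make the premature-reclaim case concrete, consider a hypothetical
interleaving (illustrative numbers only) with limit = 100 pages and
count = 85:

/*
 * CPU A charges 20:  atomic_long_add_return() -> 105  (> limit, backs out)
 * CPU B charges 5:   atomic_long_add_return() -> 110  (> limit, fails too)
 *
 * B's charge would have fit on its own (85 + 5 = 90 <= 100), but it saw
 * A's transient overshoot and goes off to reclaim prematurely.
 */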

> +		/*
> +		 * This is racy, but with the per-cpu caches on top
> +		 * this is a ballpark metric as well, and with lazy
> +		 * cache reclaim, the majority of workloads peg the
> +		 * watermark to the group limit soon after launch.
> +		 */
> +		if (new > c->watermark)
> +			c->watermark = new;
>  	}
>  	return 0;

Btw. are you planning to post another version (possibly split up)
anytime soon, so that it would make sense to wait for it, or should I
continue with this version?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-23 17:05             ` Johannes Weiner
@ 2014-09-24 14:16               ` Michal Hocko
  -1 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-24 14:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vladimir Davydov, linux-mm, Greg Thelen, Dave Hansen, cgroups,
	linux-kernel

On Tue 23-09-14 13:05:25, Johannes Weiner wrote:
[...]
>  #include <trace/events/vmscan.h>
>  
> -int page_counter_sub(struct page_counter *counter, unsigned long nr_pages)
> +/**
> + * page_counter_cancel - take pages out of the local counter
> + * @counter: counter
> + * @nr_pages: number of pages to cancel
> + *
> + * Returns whether there are remaining pages in the counter.
> + */
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
>  {
>  	long new;
>  
>  	new = atomic_long_sub_return(nr_pages, &counter->count);
>  
> -	if (WARN_ON(unlikely(new < 0)))
> -		atomic_long_set(&counter->count, 0);
> +	if (WARN_ON_ONCE(unlikely(new < 0)))
> +		atomic_long_add(nr_pages, &counter->count);
>  
>  	return new > 0;
>  }

I am not sure I understand this correctly.

The original res_counter code has protection against < 0 because it used
unsigned longs and wanted to protect from really disturbing effects of
underflow I guess (this wasn't documented anywhere). But you are using
long so even underflow shouldn't be a big problem so why do we need a
fixup?

The only way how we can end up < 0 would be a cancel without pairing
charge AFAICS. A charge should always appear before uncharge
because both of them are using atomics which imply memory barriers
(atomic_*_return). So do I understand correctly that your motivation
is to fix up those cancel-without-charge automatically? This would
definitely ask for a fat comment. Or am I missing something?

Besides that do we need to have any memory barrier there?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-24 13:33               ` Michal Hocko
@ 2014-09-24 16:51                 ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-24 16:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, linux-mm, Greg Thelen, Dave Hansen, cgroups,
	linux-kernel

On Wed, Sep 24, 2014 at 03:33:16PM +0200, Michal Hocko wrote:
> On Tue 23-09-14 13:05:25, Johannes Weiner wrote:
> [...]
> > How about the following update?  Don't be thrown by the
> > page_counter_cancel(), I went back to it until we find something more
> > suitable.  But as long as it's documented and has only 1.5 callsites,
> > it shouldn't matter all that much TBH.
> > 
> > Thanks for your invaluable feedback so far, and sorry if the original
> > patch was hard to review.  I'll try to break it up, to me it's usually
> > easier to verify new functions by looking at the callers in the same
> > patch, but I can probably remove the res_counter in a follow-up patch.
> 
> The original patch was really huge and rather hard to review. Having
> the res_counter removal in a separate patch would definitely be helpful.

Sorry, I just saw your follow-up after sending out v2.  Yes, I split
out the res_counter removal, so the patch is a lot smaller.

> I would even lobby for having the new page_counter in a separate patch,
> with a detailed description of the semantics and expected usage. Lockless
> schemes are always tricky and hard to review.

I was considering that (before someone explicitly asked for it) but
ended up thinking it's better to have the API go in along with the
main user, which often helps understanding.  The users of the API are
unchanged, except for requiring callers to serialize limit updates.
And I commented all race conditions inside the counter code, hopefully
that helps, but let me know if things are unclear in v2.

Thanks

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-24 14:16               ` Michal Hocko
@ 2014-09-24 17:00                 ` Johannes Weiner
  -1 siblings, 0 replies; 53+ messages in thread
From: Johannes Weiner @ 2014-09-24 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, linux-mm, Greg Thelen, Dave Hansen, cgroups,
	linux-kernel

On Wed, Sep 24, 2014 at 04:16:33PM +0200, Michal Hocko wrote:
> On Tue 23-09-14 13:05:25, Johannes Weiner wrote:
> [...]
> >  #include <trace/events/vmscan.h>
> >  
> > -int page_counter_sub(struct page_counter *counter, unsigned long nr_pages)
> > +/**
> > + * page_counter_cancel - take pages out of the local counter
> > + * @counter: counter
> > + * @nr_pages: number of pages to cancel
> > + *
> > + * Returns whether there are remaining pages in the counter.
> > + */
> > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> >  {
> >  	long new;
> >  
> >  	new = atomic_long_sub_return(nr_pages, &counter->count);
> >  
> > -	if (WARN_ON(unlikely(new < 0)))
> > -		atomic_long_set(&counter->count, 0);
> > +	if (WARN_ON_ONCE(unlikely(new < 0)))
> > +		atomic_long_add(nr_pages, &counter->count);
> >  
> >  	return new > 0;
> >  }
> 
> I am not sure I understand this correctly.
> 
> The original res_counter code has protection against < 0 because it used
> unsigned longs and wanted to protect from really disturbing effects of
> underflow I guess (this wasn't documented anywhere). But you are using
> long so even underflow shouldn't be a big problem so why do we need a
> fixup?

Immediate issues might be bogus numbers showing up in userspace or
endless looping during reparenting.  Negative values are just not
defined for that counter, so I want to mitigate exposing them.

It's not completely leak-free, as you can see, but I don't think it'd
be worth weighing down the hot path any more than this just to
mitigate the unlikely consequences of a kernel bug.

> The only way how we can end up < 0 would be a cancel without pairing
> charge AFAICS. A charge should always appear before uncharge
> because both of them are using atomics which imply memory barriers
> (atomic_*_return). So do I understand correctly that your motivation
> is to fix up those cancel-without-charge automatically? This would
> definitely ask for a fat comment. Or am I missing something?

This function is also used by the uncharge path, so any imbalance in
accounting, not just from spurious cancels, is caught that way.

As you said, these are all atomics, so it has nothing to do with
memory ordering.  It's simply catching logical underflows.
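
For context, here is a sketch of how the uncharge path can reuse
page_counter_cancel() at every level of the hierarchy (reconstructed from
the discussion here, not the literal patch):

int page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
{
	struct page_counter *c;
	int ret = 1;

	for (c = counter; c; c = c->parent) {
		int remainder;

		/* an underflow at any level is caught (and warned about) here */
		remainder = page_counter_cancel(c, nr_pages);
		if (c == counter && !remainder)
			ret = 0;	/* the local counter dropped to zero */
	}

	return ret;
}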

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] mm: memcontrol: lockless page counters
  2014-09-24 17:00                 ` Johannes Weiner
@ 2014-09-25 13:07                   ` Michal Hocko
  -1 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2014-09-25 13:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vladimir Davydov, linux-mm, Greg Thelen, Dave Hansen, cgroups,
	linux-kernel

On Wed 24-09-14 13:00:17, Johannes Weiner wrote:
> On Wed, Sep 24, 2014 at 04:16:33PM +0200, Michal Hocko wrote:
> > On Tue 23-09-14 13:05:25, Johannes Weiner wrote:
> > [...]
> > >  #include <trace/events/vmscan.h>
> > >  
> > > -int page_counter_sub(struct page_counter *counter, unsigned long nr_pages)
> > > +/**
> > > + * page_counter_cancel - take pages out of the local counter
> > > + * @counter: counter
> > > + * @nr_pages: number of pages to cancel
> > > + *
> > > + * Returns whether there are remaining pages in the counter.
> > > + */
> > > +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> > >  {
> > >  	long new;
> > >  
> > >  	new = atomic_long_sub_return(nr_pages, &counter->count);
> > >  
> > > -	if (WARN_ON(unlikely(new < 0)))
> > > -		atomic_long_set(&counter->count, 0);
> > > +	if (WARN_ON_ONCE(unlikely(new < 0)))
> > > +		atomic_long_add(nr_pages, &counter->count);
> > >  
> > >  	return new > 0;
> > >  }
> > 
> > I am not sure I understand this correctly.
> > 
> > The original res_counter code has protection against < 0 because it used
> > unsigned longs and wanted to protect from really disturbing effects of
> > underflow I guess (this wasn't documented anywhere). But you are using
> > long so even underflow shouldn't be a big problem so why do we need a
> > fixup?
> 
> Immediate issues might be bogus numbers showing up in userspace or
> endless looping during reparenting.  Negative values are just not
> defined for that counter, so I want to mitigate exposing them.
> 
> It's not completely leak-free, as you can see, but I don't think it'd
> be worth weighing down the hot path any more than this just to
> mitigate the unlikely consequences of a kernel bug.
> 
> > The only way how we can end up < 0 would be a cancel without pairing
> > charge AFAICS. A charge should always appear before uncharge
> > because both of them are using atomics which imply memory barriers
> > (atomic_*_return). So do I understand correctly that your motivation
> > is to fix up those cancel-without-charge automatically? This would
> > definitely ask for a fat comment. Or am I missing something?
> 
> This function is also used by the uncharge path, so any imbalance in
> accounting, not just from spurious cancels, is caught that way.
> 
> As you said, these are all atomics, so it has nothing to do with
> memory ordering.  It's simply catching logical underflows.

OK, I think we should document this in the changelog and/or in the
comment. These things are easy to forget...

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2014-09-25 13:07 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-19 13:22 [patch] mm: memcontrol: lockless page counters Johannes Weiner
2014-09-19 13:22 ` Johannes Weiner
2014-09-19 13:29 ` Johannes Weiner
2014-09-19 13:29   ` Johannes Weiner
2014-09-22 14:41 ` Vladimir Davydov
2014-09-22 14:41   ` Vladimir Davydov
2014-09-22 18:57   ` Johannes Weiner
2014-09-22 18:57     ` Johannes Weiner
2014-09-22 18:57     ` Johannes Weiner
2014-09-23 11:06     ` Vladimir Davydov
2014-09-23 11:06       ` Vladimir Davydov
2014-09-23 11:06       ` Vladimir Davydov
2014-09-23 13:28       ` Johannes Weiner
2014-09-23 13:28         ` Johannes Weiner
2014-09-23 15:21         ` Vladimir Davydov
2014-09-23 15:21           ` Vladimir Davydov
2014-09-23 15:21           ` Vladimir Davydov
2014-09-23 17:05           ` Johannes Weiner
2014-09-23 17:05             ` Johannes Weiner
2014-09-23 17:05             ` Johannes Weiner
2014-09-24  8:02             ` Vladimir Davydov
2014-09-24  8:02               ` Vladimir Davydov
2014-09-24 13:33             ` Michal Hocko
2014-09-24 13:33               ` Michal Hocko
2014-09-24 13:33               ` Michal Hocko
2014-09-24 16:51               ` Johannes Weiner
2014-09-24 16:51                 ` Johannes Weiner
2014-09-24 14:16             ` Michal Hocko
2014-09-24 14:16               ` Michal Hocko
2014-09-24 17:00               ` Johannes Weiner
2014-09-24 17:00                 ` Johannes Weiner
2014-09-25 13:07                 ` Michal Hocko
2014-09-25 13:07                   ` Michal Hocko
2014-09-22 14:44 ` Michal Hocko
2014-09-22 14:44   ` Michal Hocko
2014-09-22 15:50   ` Johannes Weiner
2014-09-22 15:50     ` Johannes Weiner
2014-09-22 17:28     ` Michal Hocko
2014-09-22 17:28       ` Michal Hocko
2014-09-22 19:58       ` Johannes Weiner
2014-09-22 19:58         ` Johannes Weiner
2014-09-23 13:25         ` Michal Hocko
2014-09-23 13:25           ` Michal Hocko
2014-09-23 13:25           ` Michal Hocko
2014-09-23 14:05           ` Johannes Weiner
2014-09-23 14:05             ` Johannes Weiner
2014-09-23 14:28             ` Michal Hocko
2014-09-23 14:28               ` Michal Hocko
2014-09-23 14:28               ` Michal Hocko
2014-09-23 22:33               ` David Rientjes
2014-09-23 22:33                 ` David Rientjes
2014-09-23  7:46 ` Kamezawa Hiroyuki
2014-09-23  7:46   ` Kamezawa Hiroyuki
