* [PATCH RESEND 0/8] More stuff to charge to kmemcg
@ 2016-05-24  8:49 Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 1/8] mm: remove pointless struct in struct page definition Vladimir Davydov
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, linux-fsdevel, netdev,
	x86, linux-kernel

[resending with all relevant lists in Cc]

Hi,

This patch set implements per-kmemcg accounting of page tables
(x86-only), pipe buffers, and unix socket buffers.

Basically, this is v2 of my earlier attempt [1], addressing Andrew's
comments, namely: the lack of comments on non-standard _mapcount usage,
extra overhead even when kmemcg is unused, and wrong handling of stolen
pipe buffer pages.

Patches 1-3 are just cleanups that are not supposed to introduce any
functional changes. Patches 4 and 5 move charge/uncharge to the generic
page allocator paths for the sake of accounting pipe and unix socket
buffers. Patches 6-8 make x86 page tables, pipe buffers, and unix
socket buffers accountable.

[1] http://lkml.kernel.org/r/%3Ccover.1443262808.git.vdavydov@parallels.com%3E

Thanks,

Vladimir Davydov (8):
  mm: remove pointless struct in struct page definition
  mm: clean up non-standard page->_mapcount users
  mm: memcontrol: cleanup kmem charge functions
  mm: charge/uncharge kmemcg from generic page allocator paths
  mm: memcontrol: teach uncharge_list to deal with kmem pages
  arch: x86: charge page tables to kmemcg
  pipe: account to kmemcg
  af_unix: charge buffers to kmemcg

 arch/x86/include/asm/pgalloc.h |  12 ++++-
 arch/x86/mm/pgtable.c          |  11 ++--
 fs/pipe.c                      |  32 ++++++++---
 include/linux/gfp.h            |  10 +---
 include/linux/memcontrol.h     | 103 +++---------------------------------
 include/linux/mm_types.h       |  73 ++++++++++++-------------
 include/linux/page-flags.h     |  78 +++++++++++++--------------
 kernel/fork.c                  |   6 +--
 mm/memcontrol.c                | 117 ++++++++++++++++++++++++++++-------------
 mm/page_alloc.c                |  63 +++++-----------------
 mm/slab.h                      |  16 ++++--
 mm/slab_common.c               |   2 +-
 mm/slub.c                      |   6 +--
 mm/vmalloc.c                   |   6 +--
 net/unix/af_unix.c             |   1 +
 scripts/tags.sh                |   3 ++
 16 files changed, 245 insertions(+), 294 deletions(-)

-- 
2.1.4


* [PATCH RESEND 1/8] mm: remove pointless struct in struct page definition
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 2/8] mm: clean up non-standard page->_mapcount users Vladimir Davydov
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, linux-fsdevel, netdev,
	x86, linux-kernel

... to reduce the indentation level, thus leaving more space for comments.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 include/linux/mm_types.h | 68 +++++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d553855503e6..3cc5977a9cab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -60,51 +60,47 @@ struct page {
 	};
 
 	/* Second double word */
-	struct {
-		union {
-			pgoff_t index;		/* Our offset within mapping. */
-			void *freelist;		/* sl[aou]b first free object */
-			/* page_deferred_list().prev	-- second tail page */
-		};
+	union {
+		pgoff_t index;		/* Our offset within mapping. */
+		void *freelist;		/* sl[aou]b first free object */
+		/* page_deferred_list().prev	-- second tail page */
+	};
 
-		union {
+	union {
 #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
 	defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
-			/* Used for cmpxchg_double in slub */
-			unsigned long counters;
+		/* Used for cmpxchg_double in slub */
+		unsigned long counters;
 #else
-			/*
-			 * Keep _refcount separate from slub cmpxchg_double
-			 * data.  As the rest of the double word is protected by
-			 * slab_lock but _refcount is not.
-			 */
-			unsigned counters;
+		/*
+		 * Keep _refcount separate from slub cmpxchg_double data.
+		 * As the rest of the double word is protected by slab_lock
+		 * but _refcount is not.
+		 */
+		unsigned counters;
 #endif
+		struct {
 
-			struct {
-
-				union {
-					/*
-					 * Count of ptes mapped in mms, to show
-					 * when page is mapped & limit reverse
-					 * map searches.
-					 */
-					atomic_t _mapcount;
-
-					struct { /* SLUB */
-						unsigned inuse:16;
-						unsigned objects:15;
-						unsigned frozen:1;
-					};
-					int units;	/* SLOB */
-				};
+			union {
 				/*
-				 * Usage count, *USE WRAPPER FUNCTION*
-				 * when manual accounting. See page_ref.h
+				 * Count of ptes mapped in mms, to show when
+				 * page is mapped & limit reverse map searches.
 				 */
-				atomic_t _refcount;
+				atomic_t _mapcount;
+
+				unsigned int active;		/* SLAB */
+				struct {			/* SLUB */
+					unsigned inuse:16;
+					unsigned objects:15;
+					unsigned frozen:1;
+				};
+				int units;			/* SLOB */
 			};
-			unsigned int active;	/* SLAB */
+			/*
+			 * Usage count, *USE WRAPPER FUNCTION* when manual
+			 * accounting. See page_ref.h
+			 */
+			atomic_t _refcount;
 		};
 	};
 
-- 
2.1.4


* [PATCH RESEND 2/8] mm: clean up non-standard page->_mapcount users
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 1/8] mm: remove pointless struct in struct page definition Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 3/8] mm: memcontrol: cleanup kmem charge functions Vladimir Davydov
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, linux-fsdevel, netdev,
	x86, linux-kernel

 - Add a proper comment to page->_mapcount.
 - Introduce a macro for generating the helper functions (see the
   hand-expanded sketch below).
 - Place all special page->_mapcount values next to each other so that
   readers can see all possible values at a glance and we don't get
   duplicates.
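
For reference, here is roughly what PAGE_MAPCOUNT_OPS(Buddy, BUDDY)
expands to (hand-expanded for illustration; it is equivalent to the
open-coded helpers being removed):

	static __always_inline int PageBuddy(struct page *page)
	{
		return atomic_read(&page->_mapcount) ==
					PAGE_BUDDY_MAPCOUNT_VALUE;
	}

	static __always_inline void __SetPageBuddy(struct page *page)
	{
		VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
		atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
	}

	static __always_inline void __ClearPageBuddy(struct page *page)
	{
		VM_BUG_ON_PAGE(!PageBuddy(page), page);
		atomic_set(&page->_mapcount, -1);
	}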

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 include/linux/mm_types.h   |  5 ++++
 include/linux/page-flags.h | 73 ++++++++++++++++++++--------------------------
 scripts/tags.sh            |  3 ++
 3 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc5977a9cab..16bdef7943e3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -85,6 +85,11 @@ struct page {
 				/*
 				 * Count of ptes mapped in mms, to show when
 				 * page is mapped & limit reverse map searches.
+				 *
+				 * Extra information about page type may be
+				 * stored here for pages that are never mapped,
+				 * in which case the value MUST BE <= -2.
+				 * See page-flags.h for more details.
 				 */
 				atomic_t _mapcount;
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e5a32445f930..9940ade6a25e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -593,54 +593,45 @@ TESTPAGEFLAG_FALSE(DoubleMap)
 #endif
 
 /*
- * PageBuddy() indicate that the page is free and in the buddy system
- * (see mm/page_alloc.c).
- *
- * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
- * -2 so that an underflow of the page_mapcount() won't be mistaken
- * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
- * efficiently by most CPU architectures.
+ * For pages that are never mapped to userspace, page->_mapcount may
+ * be used for storing extra information about page type. Any value
+ * used for this purpose must be <= -2, but it's better not to start
+ * too close to -2 so that an underflow of page_mapcount() won't be
+ * mistaken for a special page.
  */
-#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
-
-static inline int PageBuddy(struct page *page)
-{
-	return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
+#define PAGE_MAPCOUNT_OPS(uname, lname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return atomic_read(&page->_mapcount) ==				\
+				PAGE_##lname##_MAPCOUNT_VALUE;		\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)		\
+{									\
+	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);	\
+	atomic_set(&page->_mapcount, PAGE_##lname##_MAPCOUNT_VALUE);	\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+	VM_BUG_ON_PAGE(!Page##uname(page), page);			\
+	atomic_set(&page->_mapcount, -1);				\
 }
 
-static inline void __SetPageBuddy(struct page *page)
-{
-	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
-	atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
-}
+/*
+ * PageBuddy() indicates that the page is free and in the buddy system
+ * (see mm/page_alloc.c).
+ */
+#define PAGE_BUDDY_MAPCOUNT_VALUE		(-128)
+PAGE_MAPCOUNT_OPS(Buddy, BUDDY)
 
-static inline void __ClearPageBuddy(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageBuddy(page), page);
-	atomic_set(&page->_mapcount, -1);
-}
+/*
+ * PageBalloon() is set on pages that are on the balloon page list
+ * (see mm/balloon_compaction.c).
+ */
+#define PAGE_BALLOON_MAPCOUNT_VALUE		(-256)
+PAGE_MAPCOUNT_OPS(Balloon, BALLOON)
 
 extern bool is_free_buddy_page(struct page *page);
 
-#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)
-
-static inline int PageBalloon(struct page *page)
-{
-	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE;
-}
-
-static inline void __SetPageBalloon(struct page *page)
-{
-	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
-	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
-}
-
-static inline void __ClearPageBalloon(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageBalloon(page), page);
-	atomic_set(&page->_mapcount, -1);
-}
-
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/scripts/tags.sh b/scripts/tags.sh
index f72f48f638ae..ed7eef24ef89 100755
--- a/scripts/tags.sh
+++ b/scripts/tags.sh
@@ -185,6 +185,9 @@ regex_c=(
 	'/\<CLEARPAGEFLAG_NOOP(\([[:alnum:]_]*\).*/ClearPage\1/'
 	'/\<__CLEARPAGEFLAG_NOOP(\([[:alnum:]_]*\).*/__ClearPage\1/'
 	'/\<TESTCLEARFLAG_FALSE(\([[:alnum:]_]*\).*/TestClearPage\1/'
+	'/^PAGE_MAPCOUNT_OPS(\([[:alnum:]_]*\).*/Page\1/'
+	'/^PAGE_MAPCOUNT_OPS(\([[:alnum:]_]*\).*/__SetPage\1/'
+	'/^PAGE_MAPCOUNT_OPS(\([[:alnum:]_]*\).*/__ClearPage\1/'
 	'/^TASK_PFA_TEST([^,]*, *\([[:alnum:]_]*\))/task_\1/'
 	'/^TASK_PFA_SET([^,]*, *\([[:alnum:]_]*\))/task_set_\1/'
 	'/^TASK_PFA_CLEAR([^,]*, *\([[:alnum:]_]*\))/task_clear_\1/'
-- 
2.1.4


* [PATCH RESEND 3/8] mm: memcontrol: cleanup kmem charge functions
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 1/8] mm: remove pointless struct in struct page definition Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 2/8] mm: clean up non-standard page->_mapcount users Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 4/8] mm: charge/uncharge kmemcg from generic page allocator paths Vladimir Davydov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, linux-fsdevel, netdev,
	x86, linux-kernel

 - Move the memcg_kmem_enabled check out to the caller. This reduces the
   number of function definitions, making the code easier to follow. At
   the same time it doesn't result in code bloat, because each of these
   functions is used in only one or two places.
 - Move the __GFP_ACCOUNT check to the caller as well, so that one
   doesn't have to dive deep into the memcg implementation to see which
   allocations are charged and which are not (the resulting call-site
   pattern is sketched after this list).
 - Refresh comments.
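
The resulting call-site convention, sketched (both checks become
explicit at the call site, exactly as in the page_alloc.c hunks below):

	/* before: checks buried in the inline wrapper */
	if (page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
		__free_pages(page, order);
		page = NULL;
	}

	/* after: the caller tests the static key and the gfp flag itself */
	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) &&
	    page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
		__free_pages(page, order);
		page = NULL;
	}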

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 include/linux/memcontrol.h | 103 +++------------------------------------------
 mm/memcontrol.c            |  75 ++++++++++++++++++++++++---------
 mm/page_alloc.c            |   9 ++--
 mm/slab.h                  |  16 +++++--
 4 files changed, 80 insertions(+), 123 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a805474df4ab..2d03975c7dc0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -754,6 +754,13 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 }
 #endif
 
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+void memcg_kmem_put_cache(struct kmem_cache *cachep);
+int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
+			    struct mem_cgroup *memcg);
+int memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
+void memcg_kmem_uncharge(struct page *page, int order);
+
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 extern struct static_key_false memcg_kmem_enabled_key;
 
@@ -775,22 +782,6 @@ static inline bool memcg_kmem_enabled(void)
 }
 
 /*
- * In general, we'll do everything in our power to not incur in any overhead
- * for non-memcg users for the kmem functions. Not even a function call, if we
- * can avoid it.
- *
- * Therefore, we'll inline all those functions so that in the best case, we'll
- * see that kmemcg is off for everybody and proceed quickly.  If it is on,
- * we'll still do most of the flag checking inline. We check a lot of
- * conditions, but because they are pretty simple, they are expected to be
- * fast.
- */
-int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
-			      struct mem_cgroup *memcg);
-int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
-void __memcg_kmem_uncharge(struct page *page, int order);
-
-/*
  * helper for accessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
  * will return -1 when this is not a kmem-limited memcg.
@@ -800,67 +791,6 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
-struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
-void __memcg_kmem_put_cache(struct kmem_cache *cachep);
-
-static inline bool __memcg_kmem_bypass(void)
-{
-	if (!memcg_kmem_enabled())
-		return true;
-	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
-		return true;
-	return false;
-}
-
-/**
- * memcg_kmem_charge: charge a kmem page
- * @page: page to charge
- * @gfp: reclaim mode
- * @order: allocation order
- *
- * Returns 0 on success, an error code on failure.
- */
-static __always_inline int memcg_kmem_charge(struct page *page,
-					     gfp_t gfp, int order)
-{
-	if (__memcg_kmem_bypass())
-		return 0;
-	if (!(gfp & __GFP_ACCOUNT))
-		return 0;
-	return __memcg_kmem_charge(page, gfp, order);
-}
-
-/**
- * memcg_kmem_uncharge: uncharge a kmem page
- * @page: page to uncharge
- * @order: allocation order
- */
-static __always_inline void memcg_kmem_uncharge(struct page *page, int order)
-{
-	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge(page, order);
-}
-
-/**
- * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
- * @cachep: the original global kmem cache
- *
- * All memory allocated from a per-memcg cache is charged to the owner memcg.
- */
-static __always_inline struct kmem_cache *
-memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
-{
-	if (__memcg_kmem_bypass())
-		return cachep;
-	return __memcg_kmem_get_cache(cachep, gfp);
-}
-
-static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-	if (memcg_kmem_enabled())
-		__memcg_kmem_put_cache(cachep);
-}
-
 /**
  * memcg_kmem_update_page_stat - update kmem page state statistics
  * @page: the page
@@ -883,15 +813,6 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
-static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
-{
-	return 0;
-}
-
-static inline void memcg_kmem_uncharge(struct page *page, int order)
-{
-}
-
 static inline int memcg_cache_id(struct mem_cgroup *memcg)
 {
 	return -1;
@@ -905,16 +826,6 @@ static inline void memcg_put_cache_ids(void)
 {
 }
 
-static inline struct kmem_cache *
-memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
-{
-	return cachep;
-}
-
-static inline void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-}
-
 static inline void memcg_kmem_update_page_stat(struct page *page,
 				enum mem_cgroup_stat_index idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b3f16ab4b431..482b4a0c97e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2268,20 +2268,30 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	current->memcg_kmem_skip_account = 0;
 }
 
-/*
+static inline bool memcg_kmem_bypass(void)
+{
+	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+		return true;
+	return false;
+}
+
+/**
+ * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
+ * @cachep: the original global kmem cache
+ *
  * Return the kmem_cache we're supposed to use for a slab allocation.
  * We try to use the current memcg's version of the cache.
  *
- * If the cache does not exist yet, if we are the first user of it,
- * we either create it immediately, if possible, or create it asynchronously
- * in a workqueue.
- * In the latter case, we will let the current allocation go through with
- * the original cache.
+ * If the cache does not exist yet (i.e. we are its first user), we
+ * create it asynchronously in a workqueue and let the current allocation
+ * go through with the original cache.
  *
- * Can't be called in interrupt context or from kernel threads.
- * This function needs to be called with rcu_read_lock() held.
+ * This function takes a reference to the cache it returns to ensure it
+ * won't get destroyed while we are working with it. Once the caller is
+ * done with it, memcg_kmem_put_cache() must be called to release the
+ * reference.
  */
-struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
@@ -2289,10 +2299,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 
 	VM_BUG_ON(!is_root_cache(cachep));
 
-	if (cachep->flags & SLAB_ACCOUNT)
-		gfp |= __GFP_ACCOUNT;
-
-	if (!(gfp & __GFP_ACCOUNT))
+	if (memcg_kmem_bypass())
 		return cachep;
 
 	if (current->memcg_kmem_skip_account)
@@ -2325,14 +2332,27 @@ out:
 	return cachep;
 }
 
-void __memcg_kmem_put_cache(struct kmem_cache *cachep)
+/**
+ * memcg_kmem_put_cache: drop reference taken by memcg_kmem_get_cache
+ * @cachep: the cache returned by memcg_kmem_get_cache
+ */
+void memcg_kmem_put_cache(struct kmem_cache *cachep)
 {
 	if (!is_root_cache(cachep))
 		css_put(&cachep->memcg_params.memcg->css);
 }
 
-int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
-			      struct mem_cgroup *memcg)
+/**
+ * memcg_kmem_charge_memcg: charge a kmem page
+ * @page: page to charge
+ * @gfp: reclaim mode
+ * @order: allocation order
+ * @memcg: memory cgroup to charge
+ *
+ * Returns 0 on success, an error code on failure.
+ */
+int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
+			    struct mem_cgroup *memcg)
 {
 	unsigned int nr_pages = 1 << order;
 	struct page_counter *counter;
@@ -2353,19 +2373,34 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 	return 0;
 }
 
-int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+/**
+ * memcg_kmem_charge: charge a kmem page to the current memory cgroup
+ * @page: page to charge
+ * @gfp: reclaim mode
+ * @order: allocation order
+ *
+ * Returns 0 on success, an error code on failure.
+ */
+int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 {
 	struct mem_cgroup *memcg;
 	int ret = 0;
 
+	if (memcg_kmem_bypass())
+		return 0;
+
 	memcg = get_mem_cgroup_from_mm(current->mm);
 	if (!mem_cgroup_is_root(memcg))
-		ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
 	css_put(&memcg->css);
 	return ret;
 }
-
-void __memcg_kmem_uncharge(struct page *page, int order)
+/**
+ * memcg_kmem_uncharge: uncharge a kmem page
+ * @page: page to uncharge
+ * @order: allocation order
+ */
+void memcg_kmem_uncharge(struct page *page, int order)
 {
 	struct mem_cgroup *memcg = page->mem_cgroup;
 	unsigned int nr_pages = 1 << order;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 520c3576249b..f948f4e02e89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4001,7 +4001,8 @@ struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
 	struct page *page;
 
 	page = alloc_pages(gfp_mask, order);
-	if (page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
+	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) &&
+	    page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
 		__free_pages(page, order);
 		page = NULL;
 	}
@@ -4013,7 +4014,8 @@ struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 	struct page *page;
 
 	page = alloc_pages_node(nid, gfp_mask, order);
-	if (page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
+	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) &&
+	    page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
 		__free_pages(page, order);
 		page = NULL;
 	}
@@ -4026,7 +4028,8 @@ struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
  */
 void __free_kmem_pages(struct page *page, unsigned int order)
 {
-	memcg_kmem_uncharge(page, order);
+	if (memcg_kmem_enabled())
+		memcg_kmem_uncharge(page, order);
 	__free_pages(page, order);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index dedb1a920fb8..a98859400bc0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -253,8 +253,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	if (is_root_cache(s))
 		return 0;
 
-	ret = __memcg_kmem_charge_memcg(page, gfp, order,
-					s->memcg_params.memcg);
+	ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
 	if (ret)
 		return ret;
 
@@ -268,6 +267,9 @@ static __always_inline int memcg_charge_slab(struct page *page,
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
+	if (!memcg_kmem_enabled())
+		return;
+
 	memcg_kmem_update_page_stat(page,
 			(s->flags & SLAB_RECLAIM_ACCOUNT) ?
 			MEMCG_SLAB_RECLAIMABLE : MEMCG_SLAB_UNRECLAIMABLE,
@@ -390,7 +392,11 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 	if (should_failslab(s, flags))
 		return NULL;
 
-	return memcg_kmem_get_cache(s, flags);
+	if (memcg_kmem_enabled() &&
+	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
+		return memcg_kmem_get_cache(s);
+
+	return s;
 }
 
 static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
@@ -407,7 +413,9 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 					 s->flags, flags);
 		kasan_slab_alloc(s, object, flags);
 	}
-	memcg_kmem_put_cache(s);
+
+	if (memcg_kmem_enabled())
+		memcg_kmem_put_cache(s);
 }
 
 #ifndef CONFIG_SLOB
-- 
2.1.4


* [PATCH RESEND 4/8] mm: charge/uncharge kmemcg from generic page allocator paths
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
                   ` (2 preceding siblings ...)
  2016-05-24  8:49 ` [PATCH RESEND 3/8] mm: memcontrol: cleanup kmem charge functions Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 5/8] mm: memcontrol: teach uncharge_list to deal with kmem pages Vladimir Davydov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, linux-fsdevel, netdev,
	x86, linux-kernel

Currently, to charge a non-slab allocation to kmemcg one has to use the
alloc_kmem_pages helper with the __GFP_ACCOUNT flag. A page allocated
with this helper must eventually be freed with free_kmem_pages,
otherwise it won't be uncharged.

This API suits its current users fine, but it turns out to be impossible
to use together with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as is the case with pipe or unix
socket buffers.

To overcome this limitation, this patch moves charging/uncharging to the
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps the alloc/free_kmem_pages helpers. This
way, one can use any of the available page allocation functions to get
the allocated page charged to kmemcg - it's enough to pass
__GFP_ACCOUNT, just as with kmalloc and friends. A charged page is
automatically uncharged on free.

To make it possible, we need to mark pages charged to kmemcg somehow. To
avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to be
mapped to userspace, it should work just fine. There are other (ab)users
of page->_mapcount - buddy and balloon pages - but we don't conflict
with them.

In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to the generic page allocator paths. If kmemcg is
used, it adds one gfp flags check on the alloc path and one
page->_mapcount check on the free path, which shouldn't hurt
performance, because the data accessed are hot.
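
With this in place, charging a non-slab allocation boils down to
passing __GFP_ACCOUNT to any page allocation function, e.g. (sketch,
error handling trimmed):

	struct page *page;

	/* charged to the current cgroup; the allocator sets PageKmemcg */
	page = alloc_pages(GFP_KERNEL_ACCOUNT, order);
	if (!page)
		return -ENOMEM;

	/* ... the page may be passed around and freed with put_page ... */

	/* no special helper needed: free_pages_prepare uncharges it */
	__free_pages(page, order);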

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 include/linux/gfp.h        | 10 +------
 include/linux/page-flags.h |  7 +++++
 kernel/fork.c              |  6 ++---
 mm/page_alloc.c            | 66 +++++++++-------------------------------------
 mm/slab_common.c           |  2 +-
 mm/slub.c                  |  6 ++---
 mm/vmalloc.c               |  6 ++---
 7 files changed, 31 insertions(+), 72 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..c29e9d347bc6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -78,8 +78,7 @@ struct vm_area_struct;
  * __GFP_THISNODE forces the allocation to be satisified from the requested
  *   node with no fallbacks or placement policy enforcements.
  *
- * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg (only relevant
- *   to kmem allocations).
+ * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
@@ -486,10 +485,6 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
 	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
-extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
-extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
-					  unsigned int order);
-
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
 
@@ -513,9 +508,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc,
 			       unsigned int fragsz, gfp_t gfp_mask);
 extern void __free_page_frag(void *addr);
 
-extern void __free_kmem_pages(struct page *page, unsigned int order);
-extern void free_kmem_pages(unsigned long addr, unsigned int order);
-
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9940ade6a25e..b51e75a47e82 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -630,6 +630,13 @@ PAGE_MAPCOUNT_OPS(Buddy, BUDDY)
 #define PAGE_BALLOON_MAPCOUNT_VALUE		(-256)
 PAGE_MAPCOUNT_OPS(Balloon, BALLOON)
 
+/*
+ * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
+ * pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
+ */
+#define PAGE_KMEMCG_MAPCOUNT_VALUE		(-512)
+PAGE_MAPCOUNT_OPS(Kmemcg, KMEMCG)
+
 extern bool is_free_buddy_page(struct page *page);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 66cc2e0e137e..3f3c30f80786 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -162,8 +162,8 @@ void __weak arch_release_thread_info(struct thread_info *ti)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
-	struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
-						  THREAD_SIZE_ORDER);
+	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
+					     THREAD_SIZE_ORDER);
 
 	if (page)
 		memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
@@ -178,7 +178,7 @@ static inline void free_thread_info(struct thread_info *ti)
 
 	memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
 				    -(1 << THREAD_SIZE_ORDER));
-	__free_kmem_pages(page, THREAD_SIZE_ORDER);
+	__free_pages(page, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f948f4e02e89..c9ee98f1d3cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -63,6 +63,7 @@
 #include <linux/sched/rt.h>
 #include <linux/page_owner.h>
 #include <linux/kthread.h>
+#include <linux/memcontrol.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -1017,6 +1018,10 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	}
 	if (PageAnonHead(page))
 		page->mapping = NULL;
+	if (memcg_kmem_enabled() && PageKmemcg(page)) {
+		memcg_kmem_uncharge(page, order);
+		__ClearPageKmemcg(page);
+	}
 	if (check_free)
 		bad += free_pages_check(page);
 	if (bad)
@@ -3833,6 +3838,14 @@ no_zone:
 	}
 
 out:
+	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page) {
+		if (unlikely(memcg_kmem_charge(page, gfp_mask, order))) {
+			__free_pages(page, order);
+			page = NULL;
+		} else
+			__SetPageKmemcg(page);
+	}
+
 	if (kmemcheck_enabled && page)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
 
@@ -3988,59 +4001,6 @@ void __free_page_frag(void *addr)
 }
 EXPORT_SYMBOL(__free_page_frag);
 
-/*
- * alloc_kmem_pages charges newly allocated pages to the kmem resource counter
- * of the current memory cgroup if __GFP_ACCOUNT is set, other than that it is
- * equivalent to alloc_pages.
- *
- * It should be used when the caller would like to use kmalloc, but since the
- * allocation is large, it has to fall back to the page allocator.
- */
-struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
-{
-	struct page *page;
-
-	page = alloc_pages(gfp_mask, order);
-	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) &&
-	    page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
-		__free_pages(page, order);
-		page = NULL;
-	}
-	return page;
-}
-
-struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
-{
-	struct page *page;
-
-	page = alloc_pages_node(nid, gfp_mask, order);
-	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) &&
-	    page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
-		__free_pages(page, order);
-		page = NULL;
-	}
-	return page;
-}
-
-/*
- * __free_kmem_pages and free_kmem_pages will free pages allocated with
- * alloc_kmem_pages.
- */
-void __free_kmem_pages(struct page *page, unsigned int order)
-{
-	if (memcg_kmem_enabled())
-		memcg_kmem_uncharge(page, order);
-	__free_pages(page, order);
-}
-
-void free_kmem_pages(unsigned long addr, unsigned int order)
-{
-	if (addr != 0) {
-		VM_BUG_ON(!virt_addr_valid((void *)addr));
-		__free_kmem_pages(virt_to_page((void *)addr), order);
-	}
-}
-
 static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a65dad7fdcd1..3bf8f63a6abe 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1012,7 +1012,7 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 	struct page *page;
 
 	flags |= __GFP_COMP;
-	page = alloc_kmem_pages(flags, order);
+	page = alloc_pages(flags, order);
 	ret = page ? page_address(page) : NULL;
 	kmemleak_alloc(ret, size, 1, flags);
 	kasan_kmalloc_large(ret, size, flags);
diff --git a/mm/slub.c b/mm/slub.c
index 538c8584ba8d..ac7b5a55e4c9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2867,7 +2867,7 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
 		if (unlikely(!PageSlab(page))) {
 			BUG_ON(!PageCompound(page));
 			kfree_hook(object);
-			__free_kmem_pages(page, compound_order(page));
+			__free_pages(page, compound_order(page));
 			p[size] = NULL; /* mark object processed */
 			return size;
 		}
@@ -3575,7 +3575,7 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	void *ptr = NULL;
 
 	flags |= __GFP_COMP | __GFP_NOTRACK;
-	page = alloc_kmem_pages_node(node, flags, get_order(size));
+	page = alloc_pages_node(node, flags, get_order(size));
 	if (page)
 		ptr = page_address(page);
 
@@ -3656,7 +3656,7 @@ void kfree(const void *x)
 	if (unlikely(!PageSlab(page))) {
 		BUG_ON(!PageCompound(page));
 		kfree_hook(x);
-		__free_kmem_pages(page, compound_order(page));
+		__free_pages(page, compound_order(page));
 		return;
 	}
 	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index cf7ad1a53be0..a806236ecd62 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1501,7 +1501,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
 			struct page *page = area->pages[i];
 
 			BUG_ON(!page);
-			__free_kmem_pages(page, 0);
+			__free_pages(page, 0);
 		}
 
 		kvfree(area->pages);
@@ -1628,9 +1628,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		struct page *page;
 
 		if (node == NUMA_NO_NODE)
-			page = alloc_kmem_pages(alloc_mask, order);
+			page = alloc_pages(alloc_mask, order);
 		else
-			page = alloc_kmem_pages_node(node, alloc_mask, order);
+			page = alloc_pages_node(node, alloc_mask, order);
 
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vunmap() */
-- 
2.1.4


* [PATCH RESEND 5/8] mm: memcontrol: teach uncharge_list to deal with kmem pages
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
                   ` (3 preceding siblings ...)
  2016-05-24  8:49 ` [PATCH RESEND 4/8] mm: charge/uncharge kmemcg from generic page allocator paths Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 6/8] arch: x86: charge page tables to kmemcg Vladimir Davydov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, linux-fsdevel, netdev,
	x86, linux-kernel

Page table pages are batch-freed in release_pages on most
architectures. If we want to charge them to kmemcg (as is done later in
this series), we need to teach mem_cgroup_uncharge_list to handle kmem
pages.
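
With that, uncharge_list classifies each page roughly like this (a
condensed sketch of the hunk below; kmem pages skip the LRU statistics
and, on the legacy hierarchy, are additionally uncharged from the kmem
counter in uncharge_batch):

	if (!PageKmemcg(page)) {	/* ordinary LRU page */
		unsigned int nr_pages = 1;

		if (PageTransHuge(page)) {
			nr_pages <<= compound_order(page);
			nr_huge += nr_pages;
		}
		if (PageAnon(page))
			nr_anon += nr_pages;
		else
			nr_file += nr_pages;
		pgpgout++;		/* vmstat event, LRU pages only */
	} else {			/* kmem page */
		nr_kmem += 1 << compound_order(page);
	}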

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 mm/memcontrol.c | 42 ++++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 482b4a0c97e4..89a421ee4713 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5432,15 +5432,18 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 
 static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 			   unsigned long nr_anon, unsigned long nr_file,
-			   unsigned long nr_huge, struct page *dummy_page)
+			   unsigned long nr_huge, unsigned long nr_kmem,
+			   struct page *dummy_page)
 {
-	unsigned long nr_pages = nr_anon + nr_file;
+	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
 	unsigned long flags;
 
 	if (!mem_cgroup_is_root(memcg)) {
 		page_counter_uncharge(&memcg->memory, nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&memcg->memsw, nr_pages);
+		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
+			page_counter_uncharge(&memcg->kmem, nr_kmem);
 		memcg_oom_recover(memcg);
 	}
 
@@ -5463,6 +5466,7 @@ static void uncharge_list(struct list_head *page_list)
 	unsigned long nr_anon = 0;
 	unsigned long nr_file = 0;
 	unsigned long nr_huge = 0;
+	unsigned long nr_kmem = 0;
 	unsigned long pgpgout = 0;
 	struct list_head *next;
 	struct page *page;
@@ -5473,8 +5477,6 @@ static void uncharge_list(struct list_head *page_list)
 	 */
 	next = page_list->next;
 	do {
-		unsigned int nr_pages = 1;
-
 		page = list_entry(next, struct page, lru);
 		next = page->lru.next;
 
@@ -5493,31 +5495,35 @@ static void uncharge_list(struct list_head *page_list)
 		if (memcg != page->mem_cgroup) {
 			if (memcg) {
 				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-					       nr_huge, page);
-				pgpgout = nr_anon = nr_file = nr_huge = 0;
+					       nr_huge, nr_kmem, page);
+				pgpgout = nr_anon = nr_file =
+					nr_huge = nr_kmem = 0;
 			}
 			memcg = page->mem_cgroup;
 		}
 
-		if (PageTransHuge(page)) {
-			nr_pages <<= compound_order(page);
-			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-			nr_huge += nr_pages;
-		}
+		if (!PageKmemcg(page)) {
+			unsigned int nr_pages = 1;
 
-		if (PageAnon(page))
-			nr_anon += nr_pages;
-		else
-			nr_file += nr_pages;
+			if (PageTransHuge(page)) {
+				nr_pages <<= compound_order(page);
+				VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+				nr_huge += nr_pages;
+			}
+			if (PageAnon(page))
+				nr_anon += nr_pages;
+			else
+				nr_file += nr_pages;
+			pgpgout++;
+		} else
+			nr_kmem += 1 << compound_order(page);
 
 		page->mem_cgroup = NULL;
-
-		pgpgout++;
 	} while (next != page_list);
 
 	if (memcg)
 		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-			       nr_huge, page);
+			       nr_huge, nr_kmem, page);
 }
 
 /**
-- 
2.1.4


* [PATCH RESEND 6/8] arch: x86: charge page tables to kmemcg
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
                   ` (4 preceding siblings ...)
  2016-05-24  8:49 ` [PATCH RESEND 5/8] mm: memcontrol: teach uncharge_list to deal with kmem pages Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 7/8] pipe: account " Vladimir Davydov
  2016-05-24  8:49 ` [PATCH RESEND 8/8] af_unix: charge buffers " Vladimir Davydov
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Johannes Weiner,
	Michal Hocko, linux-mm, linux-fsdevel, netdev, x86, linux-kernel

Page tables can bite a relatively big chunk off system memory and their
allocations are easy to trigger from userspace, so they should be
accounted to kmemcg.

This patch marks page table allocations as __GFP_ACCOUNT for x86. Note
we must not charge allocations of kernel page tables, because they can
be shared among processes from different cgroups, so accounting them to
a particular one can pin other cgroups for an indefinitely long time.
Hence we clear the __GFP_ACCOUNT flag if a page table is allocated for
the kernel.
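
The pattern used at each allocation site is (see the hunks below;
kernel threads allocate through init_mm, so kernel page tables stay
unaccounted):

	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO;

	if (mm == &init_mm)
		gfp &= ~__GFP_ACCOUNT;	/* never charge kernel page tables */
	page = alloc_pages(gfp, 0);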

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/pgalloc.h | 12 ++++++++++--
 arch/x86/mm/pgtable.c          | 11 ++++++++---
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index bf7f8b55b0f9..2f531633cb16 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -81,7 +81,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *page;
-	page = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO;
+
+	if (mm == &init_mm)
+		gfp &= ~__GFP_ACCOUNT;
+	page = alloc_pages(gfp, 0);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -125,7 +129,11 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT;
+
+	if (mm == &init_mm)
+		gfp &= ~__GFP_ACCOUNT;
+	return (pud_t *)get_zeroed_page(gfp);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 4eb287e25043..421ac6b74d11 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -6,7 +6,8 @@
 #include <asm/fixmap.h>
 #include <asm/mtrr.h>
 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
+#define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | __GFP_REPEAT | \
+		     __GFP_ZERO)
 
 #ifdef CONFIG_HIGHPTE
 #define PGALLOC_USER_GFP __GFP_HIGHMEM
@@ -18,7 +19,7 @@ gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
 
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	return (pte_t *)__get_free_page(PGALLOC_GFP);
+	return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -207,9 +208,13 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
 {
 	int i;
 	bool failed = false;
+	gfp_t gfp = PGALLOC_GFP;
+
+	if (mm == &init_mm)
+		gfp &= ~__GFP_ACCOUNT;
 
 	for(i = 0; i < PREALLOCATED_PMDS; i++) {
-		pmd_t *pmd = (pmd_t *)__get_free_page(PGALLOC_GFP);
+		pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
 		if (!pmd)
 			failed = true;
 		if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
-- 
2.1.4


* [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
                   ` (5 preceding siblings ...)
  2016-05-24  8:49 ` [PATCH RESEND 6/8] arch: x86: charge page tables to kmemcg Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24 12:59   ` Eric Dumazet
  2016-05-24  8:49 ` [PATCH RESEND 8/8] af_unix: charge buffers " Vladimir Davydov
  7 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Johannes Weiner, Michal Hocko, linux-mm,
	linux-fsdevel, netdev, x86, linux-kernel

Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.

This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they get charged to kmemcg. Note,
since a pipe buffer page can be "stolen" and reused for other purposes,
including mapping to userspace, we clear PageKmemcg (thus resetting
page->_mapcount) and uncharge the page in anon_pipe_buf_steal, which is
introduced by this patch.
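
To illustrate the effect from userspace, here is a toy program (demo
only, not part of the patch): a process that fills a pipe and sits on
it pins one page per 4K of buffered data, and with this patch those
pages are charged to the writer's cgroup (visible e.g. in the cgroup's
memory.kmem.usage_in_bytes):

	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		int fds[2];
		char buf[4096];
		int i;

		if (pipe(fds))
			exit(1);
		memset(buf, 0, sizeof(buf));
		/* a pipe holds 16 one-page buffers by default (PIPE_DEF_BUFFERS) */
		for (i = 0; i < 16; i++)
			if (write(fds[1], buf, sizeof(buf)) != sizeof(buf))
				exit(1);
		pause();	/* keep ~64K of pipe buffers charged */
		return 0;
	}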

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/pipe.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 0d3f5165cb0b..4b32928f5426 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -21,6 +21,7 @@
 #include <linux/audit.h>
 #include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 		put_page(page);
 }
 
+static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
+			       struct pipe_buffer *buf)
+{
+	struct page *page = buf->page;
+
+	if (page_count(page) == 1) {
+		if (memcg_kmem_enabled()) {
+			memcg_kmem_uncharge(page, 0);
+			__ClearPageKmemcg(page);
+		}
+		__SetPageLocked(page);
+		return 0;
+	}
+	return 1;
+}
+
 /**
  * generic_pipe_buf_steal - attempt to take ownership of a &pipe_buffer
  * @pipe:	the pipe that the buffer belongs to
@@ -219,7 +236,7 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
 	.can_merge = 1,
 	.confirm = generic_pipe_buf_confirm,
 	.release = anon_pipe_buf_release,
-	.steal = generic_pipe_buf_steal,
+	.steal = anon_pipe_buf_steal,
 	.get = generic_pipe_buf_get,
 };
 
@@ -227,7 +244,7 @@ static const struct pipe_buf_operations packet_pipe_buf_ops = {
 	.can_merge = 0,
 	.confirm = generic_pipe_buf_confirm,
 	.release = anon_pipe_buf_release,
-	.steal = generic_pipe_buf_steal,
+	.steal = anon_pipe_buf_steal,
 	.get = generic_pipe_buf_get,
 };
 
@@ -405,7 +422,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			int copied;
 
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 				if (unlikely(!page)) {
 					ret = ret ? : -ENOMEM;
 					break;
@@ -611,7 +628,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 {
 	struct pipe_inode_info *pipe;
 
-	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
+	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT);
 	if (pipe) {
 		unsigned long pipe_bufs = PIPE_DEF_BUFFERS;
 		struct user_struct *user = get_current_user();
@@ -619,7 +636,9 @@ struct pipe_inode_info *alloc_pipe_info(void)
 		if (!too_many_pipe_buffers_hard(user)) {
 			if (too_many_pipe_buffers_soft(user))
 				pipe_bufs = 1;
-			pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * pipe_bufs, GFP_KERNEL);
+			pipe->bufs = kcalloc(pipe_bufs,
+					     sizeof(struct pipe_buffer),
+					     GFP_KERNEL_ACCOUNT);
 		}
 
 		if (pipe->bufs) {
@@ -1010,7 +1029,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long nr_pages)
 	if (nr_pages < pipe->nrbufs)
 		return -EBUSY;
 
-	bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL | __GFP_NOWARN);
+	bufs = kcalloc(nr_pages, sizeof(*bufs),
+		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (unlikely(!bufs))
 		return -ENOMEM;
 
-- 
2.1.4


* [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
  2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
                   ` (6 preceding siblings ...)
  2016-05-24  8:49 ` [PATCH RESEND 7/8] pipe: account " Vladimir Davydov
@ 2016-05-24  8:49 ` Vladimir Davydov
  2016-05-24 13:02   ` Eric Dumazet
  7 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David S. Miller, Johannes Weiner, Michal Hocko, linux-mm,
	linux-fsdevel, netdev, x86, linux-kernel

Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context,
all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock->sk_allocation mask.
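
For context, sk_allocation is the gfp mask the networking core uses for
this socket's buffer allocations; for instance, the blocking send path
does, in essence (simplified sketch of sock_alloc_send_pskb):

	skb = alloc_skb_with_frags(header_len, data_len, max_page_order,
				   &err, sk->sk_allocation);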

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: "David S. Miller" <davem@davemloft.net>
---
 net/unix/af_unix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80aa6a3e6817..022bdd3ab7d9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -769,6 +769,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	lockdep_set_class(&sk->sk_receive_queue.lock,
 				&af_unix_sk_receive_queue_lock_key);
 
+	sk->sk_allocation	= GFP_KERNEL_ACCOUNT;
 	sk->sk_write_space	= unix_write_space;
 	sk->sk_max_ack_backlog	= net->unx.sysctl_max_dgram_qlen;
 	sk->sk_destruct		= unix_sock_destructor;
-- 
2.1.4


* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-24  8:49 ` [PATCH RESEND 7/8] pipe: account " Vladimir Davydov
@ 2016-05-24 12:59   ` Eric Dumazet
  2016-05-24 16:13     ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-24 12:59 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Alexander Viro, Johannes Weiner, Michal Hocko,
	linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> Pipes can consume a significant amount of system memory, hence they
> should be accounted to kmemcg.
> 
> This patch marks pipe_inode_info and anonymous pipe buffer page
> allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
> Note, since a pipe buffer page can be "stolen" and get reused for other
> purposes, including mapping to userspace, we clear PageKmemcg thus
> resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
> is introduced by this patch.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/pipe.c | 32 ++++++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 0d3f5165cb0b..4b32928f5426 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -21,6 +21,7 @@
>  #include <linux/audit.h>
>  #include <linux/syscalls.h>
>  #include <linux/fcntl.h>
> +#include <linux/memcontrol.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/ioctls.h>
> @@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
>  		put_page(page);
>  }
>  
> +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> +			       struct pipe_buffer *buf)
> +{
> +	struct page *page = buf->page;
> +
> +	if (page_count(page) == 1) {

This looks racy: some cpu could have temporarily elevated the page count.

> +		if (memcg_kmem_enabled()) {
> +			memcg_kmem_uncharge(page, 0);
> +			__ClearPageKmemcg(page);
> +		}
> +		__SetPageLocked(page);
> +		return 0;
> +	}
> +	return 1;
> +}
> +





* Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
  2016-05-24  8:49 ` [PATCH RESEND 8/8] af_unix: charge buffers " Vladimir Davydov
@ 2016-05-24 13:02   ` Eric Dumazet
  2016-05-24 16:36     ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-24 13:02 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, David S. Miller, Johannes Weiner, Michal Hocko,
	linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> Unix sockets can consume a significant amount of system memory, hence
> they should be accounted to kmemcg.
> 
> Since unix socket buffers are always allocated from process context,
> all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> sock->sk_allocation mask.

I have two questions:

1) What happens when a buffer, allocated from socket <A> lands in a
different socket <B>, maybe owned by another user/process.

Who owns it now, in term of kmemcg accounting ?

2) Has performance impact been evaluated ?

Thanks.




* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-24 12:59   ` Eric Dumazet
@ 2016-05-24 16:13     ` Vladimir Davydov
  2016-05-24 20:04       ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24 16:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Alexander Viro, Johannes Weiner, Michal Hocko,
	linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
...
> > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > +			       struct pipe_buffer *buf)
> > +{
> > +	struct page *page = buf->page;
> > +
> > +	if (page_count(page) == 1) {
> 
> > This looks racy: some cpu could have temporarily elevated the page count.

All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
with a refcount of 1 in ->steal, that means that we are its only user
and the page can't be spliced to another pipe.

In fact, I just copied the code from generic_pipe_buf_steal, adding
kmemcg related checks along the way, so it should be fine.
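
For reference, the generic helper mentioned above reads as follows
(fs/pipe.c, unchanged by this series):

	int generic_pipe_buf_steal(struct pipe_inode_info *pipe,
				   struct pipe_buffer *buf)
	{
		struct page *page = buf->page;

		/*
		 * A reference of one is golden, that means that the owner of this
		 * page is the only one holding a reference to it. lock the page
		 * and return OK.
		 */
		if (page_count(page) == 1) {
			lock_page(page);
			return 0;
		}

		return 1;
	}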

Thanks,
Vladimir

> 
> > +		if (memcg_kmem_enabled()) {
> > +			memcg_kmem_uncharge(page, 0);
> > +			__ClearPageKmemcg(page);
> > +		}
> > +		__SetPageLocked(page);
> > +		return 0;
> > +	}
> > +	return 1;
> > +}


* Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
  2016-05-24 13:02   ` Eric Dumazet
@ 2016-05-24 16:36     ` Vladimir Davydov
  2016-08-23 13:48       ` Sudeep K N
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-24 16:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, David S. Miller, Johannes Weiner, Michal Hocko,
	linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> > Unix sockets can consume a significant amount of system memory, hence
> > they should be accounted to kmemcg.
> > 
> > Since unix socket buffers are always allocated from process context,
> > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> > sock->sk_allocation mask.
> 
> I have two questions : 
> 
> 1) What happens when a buffer, allocated from socket <A>, lands in a
> different socket <B>, maybe owned by another user/process?
>
> Who owns it now, in terms of kmemcg accounting?

We never move memcg charges. E.g. if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first. Or if two processes are working with the
same directory tree, inodes and dentries will be charged to the first
user. The same holds for unix socket buffers - they will be charged to
the sender.
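
(For the record, the af_unix side of this is essentially a one-liner;
sketching from the diffstat in the cover letter, it boils down to
something like the following in unix_create1():

	/* make buffer allocations for this socket kmemcg-accounted */
	sk->sk_allocation = GFP_KERNEL_ACCOUNT;

so whoever creates - and writes to - the socket is the one charged.)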

> 
> 2) Has performance impact been evaluated ?

I ran netperf STREAM_STREAM with default options in a kmemcg on
a 4 core x 2 HT box. The results are below:

 # clients            bandwidth (10^6bits/sec)
                    base              patched
         1      67643 +-  725      64874 +-  353    - 4.0 %
         4     193585 +- 2516     186715 +- 1460    - 3.5 %
         8     194820 +-  377     187443 +- 1229    - 3.7 %

So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.

Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).
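
(For those unfamiliar with the jump label trick: memcg_kmem_enabled() is
a static branch, roughly

	extern struct static_key_false memcg_kmem_enabled_key;

	static inline bool memcg_kmem_enabled(void)
	{
		return static_branch_unlikely(&memcg_kmem_enabled_key);
	}

so with kmem accounting unused, the check compiles down to a patched-out
jump instruction.)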

Thanks,
Vladimir


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-24 16:13     ` Vladimir Davydov
@ 2016-05-24 20:04       ` Eric Dumazet
  2016-05-25 10:30         ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-24 20:04 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Alexander Viro, Johannes Weiner, Michal Hocko,
	linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> ...
> > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > +			       struct pipe_buffer *buf)
> > > +{
> > > +	struct page *page = buf->page;
> > > +
> > > +	if (page_count(page) == 1) {
> > 
> > This looks racy : some cpu could have temporarily elevated page count.
> 
> All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> with a refcount of 1 in ->steal, that means that we are its only user
> and it can't be spliced to another pipe.
> 
> In fact, I just copied the code from generic_pipe_buf_steal, adding
> kmemcg related checks along the way, so it should be fine.

So you guarantee that no other cpu might have done
get_page_unless_zero() right before this test ?




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-24 20:04       ` Eric Dumazet
@ 2016-05-25 10:30         ` Vladimir Davydov
  2016-05-26  7:04           ` Minchan Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-25 10:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Alexander Viro, Johannes Weiner, Michal Hocko,
	linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > ...
> > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > +			       struct pipe_buffer *buf)
> > > > +{
> > > > +	struct page *page = buf->page;
> > > > +
> > > > +	if (page_count(page) == 1) {
> > > 
> > > This looks racy : some cpu could have temporarily elevated page count.
> > 
> > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > with a refcount of 1 in ->steal, that means that we are its only user
> > and it can't be spliced to another pipe.
> > 
> > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > kmemcg related checks along the way, so it should be fine.
> 
> So you guarantee that no other cpu might have done
> get_page_unless_zero() right before this test ?

Each pipe_buffer holds a reference to its page. If we find the page's
refcount to be 1 here, then it can be referenced only by our
pipe_buffer. And the refcount cannot be increased by a parallel thread,
because we hold pipe_lock, which rules out splice, and otherwise it's
impossible to reach the page, as it is not on the LRU. Given that, I
think this is guaranteed to be safe.

Thanks,
Vladimir


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-25 10:30         ` Vladimir Davydov
@ 2016-05-26  7:04           ` Minchan Kim
  2016-05-26 13:59             ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Minchan Kim @ 2016-05-26  7:04 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Eric Dumazet, Andrew Morton, Alexander Viro, Johannes Weiner,
	Michal Hocko, linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > ...
> > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > +			       struct pipe_buffer *buf)
> > > > > +{
> > > > > +	struct page *page = buf->page;
> > > > > +
> > > > > +	if (page_count(page) == 1) {
> > > > 
> > > > This looks racy : some cpu could have temporarily elevated page count.
> > > 
> > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > > with a refcount of 1 in ->steal, that means that we are its only user
> > > and it can't be spliced to another pipe.
> > > 
> > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > kmemcg related checks along the way, so it should be fine.
> > 
> > So you guarantee that no other cpu might have done
> > get_page_unless_zero() right before this test ?
> 
> Each pipe_buffer holds a reference to its page. If we find the page's
> refcount to be 1 here, then it can be referenced only by our
> pipe_buffer. And the refcount cannot be increased by a parallel thread,
> because we hold pipe_lock, which rules out splice, and otherwise it's
> impossible to reach the page, as it is not on the LRU. Given that, I
> think this is guaranteed to be safe.

I don't know the kmemcg internals or the pipe stuff, so my comment might
be totally crap.

No one can guarantee that no CPU holds a reference to a page.
Look at the get_page_unless_zero use cases.

1. balloon_page_isolate

It can take a reference on a random page and then verify that the page
is a balloon page. Otherwise, it just puts the reference.

2. page_idle_get_page

It has a PageLRU check, but that check is racy, so it can take a
reference on a random page and then verify it under zone->lru_lock.
If it's not an LRU page, it just puts the reference.
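
A minimal sketch of that speculative pattern (the helper name is made
up; the pfn_valid and locking details are elided):

static struct page *probe_page_by_pfn(unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);

	/* speculative reference: fails only if the page is free */
	if (!get_page_unless_zero(page))
		return NULL;

	/* only now check whether it is the kind of page we wanted */
	if (!PageLRU(page)) {
		put_page(page);		/* the transient ref is dropped here */
		return NULL;
	}
	return page;
}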


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-26  7:04           ` Minchan Kim
@ 2016-05-26 13:59             ` Vladimir Davydov
  2016-05-26 14:15               ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-26 13:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Eric Dumazet, Andrew Morton, Alexander Viro, Johannes Weiner,
	Michal Hocko, linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Thu, May 26, 2016 at 04:04:55PM +0900, Minchan Kim wrote:
> On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> > On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > > ...
> > > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > > +			       struct pipe_buffer *buf)
> > > > > > +{
> > > > > > +	struct page *page = buf->page;
> > > > > > +
> > > > > > +	if (page_count(page) == 1) {
> > > > > 
> > > > > This looks racy : some cpu could have temporarily elevated page count.
> > > > 
> > > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > > > with a refcount of 1 in ->steal, that means that we are its only user
> > > > and it can't be spliced to another pipe.
> > > > 
> > > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > > kmemcg related checks along the way, so it should be fine.
> > > 
> > > So you guarantee that no other cpu might have done
> > > get_page_unless_zero() right before this test ?
> > 
> > Each pipe_buffer holds a reference to its page. If we find the page's
> > refcount to be 1 here, then it can be referenced only by our
> > pipe_buffer. And the refcount cannot be increased by a parallel thread,
> > because we hold pipe_lock, which rules out splice, and otherwise it's
> > impossible to reach the page, as it is not on the LRU. Given that, I
> > think this is guaranteed to be safe.
> 
> I don't know the kmemcg internals or the pipe stuff, so my comment might
> be totally crap.
>
> No one can guarantee that no CPU holds a reference to a page.
> Look at the get_page_unless_zero use cases.
>
> 1. balloon_page_isolate
>
> It can take a reference on a random page and then verify that the page
> is a balloon page. Otherwise, it just puts the reference.
>
> 2. page_idle_get_page
>
> It has a PageLRU check, but that check is racy, so it can take a
> reference on a random page and then verify it under zone->lru_lock.
> If it's not an LRU page, it just puts the reference.

Well, I see your concern now - even if a page is not on the LRU and we
have locked all the structs pointing to it, it can always be accessed
by PFN from a completely unrelated thread, as in the examples you gave
above. That's a fair point.

However, I still think it's OK in the case of pipe buffers. What can
happen if somebody takes a transient reference to a pipe buffer page? At
worst, we'll see page_count > 1 due to the temporary ref and abort
stealing, falling back on copying instead. That's OK, because stealing
is not guaranteed. Can a function that takes a transient ref to a page
by PFN mistakenly assume that this is a page it's interested in? I don't
think so, because this page has no marks on it except the special
_mapcount value, which should only be set on kmemcg pages.

Thanks,
Vladimir


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-26 13:59             ` Vladimir Davydov
@ 2016-05-26 14:15               ` Eric Dumazet
  2016-05-27 15:03                 ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2016-05-26 14:15 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Minchan Kim, Andrew Morton, Alexander Viro, Johannes Weiner,
	Michal Hocko, linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Thu, 2016-05-26 at 16:59 +0300, Vladimir Davydov wrote:
> On Thu, May 26, 2016 at 04:04:55PM +0900, Minchan Kim wrote:
> > On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> > > On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > > > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > > > ...
> > > > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > > > +			       struct pipe_buffer *buf)
> > > > > > > +{
> > > > > > > +	struct page *page = buf->page;
> > > > > > > +
> > > > > > > +	if (page_count(page) == 1) {
> > > > > > 
> > > > > > This looks racy : some cpu could have temporarily elevated page count.
> > > > > 
> > > > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > > > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > > > > with a refcount of 1 in ->steal, that means that we are its only user
> > > > > and it can't be spliced to another pipe.
> > > > > 
> > > > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > > > kmemcg related checks along the way, so it should be fine.
> > > > 
> > > > So you guarantee that no other cpu might have done
> > > > get_page_unless_zero() right before this test ?
> > > 
> > > Each pipe_buffer holds a reference to its page. If we find the page's
> > > refcount to be 1 here, then it can be referenced only by our
> > > pipe_buffer. And the refcount cannot be increased by a parallel thread,
> > > because we hold pipe_lock, which rules out splice, and otherwise it's
> > > impossible to reach the page, as it is not on the LRU. Given that, I
> > > think this is guaranteed to be safe.
> > 
> > I don't know the kmemcg internals or the pipe stuff, so my comment might
> > be totally crap.
> >
> > No one can guarantee that no CPU holds a reference to a page.
> > Look at the get_page_unless_zero use cases.
> >
> > 1. balloon_page_isolate
> >
> > It can take a reference on a random page and then verify that the page
> > is a balloon page. Otherwise, it just puts the reference.
> >
> > 2. page_idle_get_page
> >
> > It has a PageLRU check, but that check is racy, so it can take a
> > reference on a random page and then verify it under zone->lru_lock.
> > If it's not an LRU page, it just puts the reference.
> 
> Well, I see your concern now - even if a page is not on the LRU and we
> have locked all the structs pointing to it, it can always be accessed
> by PFN from a completely unrelated thread, as in the examples you gave
> above. That's a fair point.
>
> However, I still think it's OK in the case of pipe buffers. What can
> happen if somebody takes a transient reference to a pipe buffer page? At
> worst, we'll see page_count > 1 due to the temporary ref and abort
> stealing, falling back on copying instead. That's OK, because stealing
> is not guaranteed. Can a function that takes a transient ref to a page
> by PFN mistakenly assume that this is a page it's interested in? I don't
> think so, because this page has no marks on it except the special
> _mapcount value, which should only be set on kmemcg pages.

Well, all this information deserves to be in the changelog.

Maybe in 6 months, this will be incredibly useful for bug hunting.

Pipes can be used to exchange data (or pages) between processes in
different domains.

If kmemcg is not precise, attackers could use this to force some
processes to consume their entire budget and eventually become unable
to allocate new pages.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 7/8] pipe: account to kmemcg
  2016-05-26 14:15               ` Eric Dumazet
@ 2016-05-27 15:03                 ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2016-05-27 15:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Minchan Kim, Andrew Morton, Alexander Viro, Johannes Weiner,
	Michal Hocko, linux-mm, linux-fsdevel, netdev, x86, linux-kernel

On Thu, May 26, 2016 at 07:15:49AM -0700, Eric Dumazet wrote:
> On Thu, 2016-05-26 at 16:59 +0300, Vladimir Davydov wrote:
> > On Thu, May 26, 2016 at 04:04:55PM +0900, Minchan Kim wrote:
> > > On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> > > > On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > > > > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > > > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > > > > ...
> > > > > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > > > > +			       struct pipe_buffer *buf)
> > > > > > > > +{
> > > > > > > > +	struct page *page = buf->page;
> > > > > > > > +
> > > > > > > > +	if (page_count(page) == 1) {
> > > > > > > 
> > > > > > > This looks racy : some cpu could have temporarily elevated page count.
> > > > > > 
> > > > > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > > > > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > > > > > with a refcount of 1 in ->steal, that means that we are its only user
> > > > > > and it can't be spliced to another pipe.
> > > > > > 
> > > > > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > > > > kmemcg related checks along the way, so it should be fine.
> > > > > 
> > > > > So you guarantee that no other cpu might have done
> > > > > get_page_unless_zero() right before this test ?
> > > > 
> > > > Each pipe_buffer holds a reference to its page. If we find the page's
> > > > refcount to be 1 here, then it can be referenced only by our
> > > > pipe_buffer. And the refcount cannot be increased by a parallel thread,
> > > > because we hold pipe_lock, which rules out splice, and otherwise it's
> > > > impossible to reach the page, as it is not on the LRU. Given that, I
> > > > think this is guaranteed to be safe.
> > > 
> > > I don't know the kmemcg internals or the pipe stuff, so my comment might
> > > be totally crap.
> > >
> > > No one can guarantee that no CPU holds a reference to a page.
> > > Look at the get_page_unless_zero use cases.
> > >
> > > 1. balloon_page_isolate
> > >
> > > It can take a reference on a random page and then verify that the page
> > > is a balloon page. Otherwise, it just puts the reference.
> > >
> > > 2. page_idle_get_page
> > >
> > > It has a PageLRU check, but that check is racy, so it can take a
> > > reference on a random page and then verify it under zone->lru_lock.
> > > If it's not an LRU page, it just puts the reference.
> > 
> > Well, I see your concern now - even if a page is not on the LRU and we
> > have locked all the structs pointing to it, it can always be accessed
> > by PFN from a completely unrelated thread, as in the examples you gave
> > above. That's a fair point.
> >
> > However, I still think it's OK in the case of pipe buffers. What can
> > happen if somebody takes a transient reference to a pipe buffer page? At
> > worst, we'll see page_count > 1 due to the temporary ref and abort
> > stealing, falling back on copying instead. That's OK, because stealing
> > is not guaranteed. Can a function that takes a transient ref to a page
> > by PFN mistakenly assume that this is a page it's interested in? I don't
> > think so, because this page has no marks on it except the special
> > _mapcount value, which should only be set on kmemcg pages.
> 
> Well, all this information deserves to be in the changelog.
>
> Maybe in 6 months, this will be incredibly useful for bug hunting.
>
> Pipes can be used to exchange data (or pages) between processes in
> different domains.
>
> If kmemcg is not precise, attackers could use this to force some
> processes to consume their entire budget and eventually become unable
> to allocate new pages.
> 

Here is the patch with an updated changelog.
---
From: Vladimir Davydov <vdavydov@virtuozzo.com>
Subject: [PATCH] pipe: account to kmemcg

Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.

This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg (thus
resetting page->_mapcount) and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.

A note regarding the anon_pipe_buf_steal implementation. We allow
stealing the page if its ref count equals 1. It looks racy, but it is
correct for anonymous pipe buffer pages, because:

 - We lock out all other pipe users, because ->steal is called with
   pipe_lock held, so the page can't be spliced to another pipe from
   under us.

 - The page is not on LRU and it never was.

 - Thus a parallel thread can access it only by PFN. Although this is
   quite possible (e.g. see page_idle_get_page and balloon_page_isolate),
   it is not dangerous, because all such functions do is increase the
   page's ref count, check if the page is the one they are looking for,
   and decrease the ref count if it isn't. Since our page is clean except
   for the PageKmemcg mark, which doesn't conflict with other _mapcount
   users, the worst that can happen is that we see page_count > 1 due to
   a transient ref, in which case we false-positively abort ->steal,
   which is still fine, because ->steal is not guaranteed to succeed.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/pipe.c b/fs/pipe.c
index 0d3f5165cb0b..4b32928f5426 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -21,6 +21,7 @@
 #include <linux/audit.h>
 #include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 		put_page(page);
 }
 
+static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
+			       struct pipe_buffer *buf)
+{
+	struct page *page = buf->page;
+
+	if (page_count(page) == 1) {
+		if (memcg_kmem_enabled()) {
+			memcg_kmem_uncharge(page, 0);
+			__ClearPageKmemcg(page);
+		}
+		__SetPageLocked(page);
+		return 0;
+	}
+	return 1;
+}
+
 /**
  * generic_pipe_buf_steal - attempt to take ownership of a &pipe_buffer
  * @pipe:	the pipe that the buffer belongs to
@@ -219,7 +236,7 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
 	.can_merge = 1,
 	.confirm = generic_pipe_buf_confirm,
 	.release = anon_pipe_buf_release,
-	.steal = generic_pipe_buf_steal,
+	.steal = anon_pipe_buf_steal,
 	.get = generic_pipe_buf_get,
 };
 
@@ -227,7 +244,7 @@ static const struct pipe_buf_operations packet_pipe_buf_ops = {
 	.can_merge = 0,
 	.confirm = generic_pipe_buf_confirm,
 	.release = anon_pipe_buf_release,
-	.steal = generic_pipe_buf_steal,
+	.steal = anon_pipe_buf_steal,
 	.get = generic_pipe_buf_get,
 };
 
@@ -405,7 +422,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			int copied;
 
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 				if (unlikely(!page)) {
 					ret = ret ? : -ENOMEM;
 					break;
@@ -611,7 +628,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 {
 	struct pipe_inode_info *pipe;
 
-	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
+	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT);
 	if (pipe) {
 		unsigned long pipe_bufs = PIPE_DEF_BUFFERS;
 		struct user_struct *user = get_current_user();
@@ -619,7 +636,9 @@ struct pipe_inode_info *alloc_pipe_info(void)
 		if (!too_many_pipe_buffers_hard(user)) {
 			if (too_many_pipe_buffers_soft(user))
 				pipe_bufs = 1;
-			pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * pipe_bufs, GFP_KERNEL);
+			pipe->bufs = kcalloc(pipe_bufs,
+					     sizeof(struct pipe_buffer),
+					     GFP_KERNEL_ACCOUNT);
 		}
 
 		if (pipe->bufs) {
@@ -1010,7 +1029,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long nr_pages)
 	if (nr_pages < pipe->nrbufs)
 		return -EBUSY;
 
-	bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL | __GFP_NOWARN);
+	bufs = kcalloc(nr_pages, sizeof(*bufs),
+		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (unlikely(!bufs))
 		return -ENOMEM;
 


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
  2016-05-24 16:36     ` Vladimir Davydov
@ 2016-08-23 13:48       ` Sudeep K N
  2016-08-23 16:44         ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Sudeep K N @ 2016-08-23 13:48 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Eric Dumazet, Andrew Morton, David S. Miller, Johannes Weiner,
	Michal Hocko, linux-mm, linux-fsdevel, netdev, x86, open list,
	Ingo Molnar, Peter Zijlstra, Sudeep Holla

On Tue, May 24, 2016 at 5:36 PM, Vladimir Davydov
<vdavydov@virtuozzo.com> wrote:
> On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
>> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
>> > Unix sockets can consume a significant amount of system memory, hence
>> > they should be accounted to kmemcg.
>> >
>> > Since unix socket buffers are always allocated from process context,
>> > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
>> > sock->sk_allocation mask.
>>
>> I have two questions :
>>
>> 1) What happens when a buffer, allocated from socket <A>, lands in a
>> different socket <B>, maybe owned by another user/process?
>>
>> Who owns it now, in terms of kmemcg accounting?
>
> We never move memcg charges. E.g. if two processes from different
> cgroups are sharing a memory region, each page will be charged to the
> process which touched it first. Or if two processes are working with the
> same directory tree, inodes and dentries will be charged to the first
> user. The same holds for unix socket buffers - they will be charged to
> the sender.
>
>>
>> 2) Has performance impact been evaluated ?
>
> I ran netperf STREAM_STREAM with default options in a kmemcg on
> a 4 core x 2 HT box. The results are below:
>
>  # clients            bandwidth (10^6bits/sec)
>                     base              patched
>          1      67643 +-  725      64874 +-  353    - 4.0 %
>          4     193585 +- 2516     186715 +- 1460    - 3.5 %
>          8     194820 +-  377     187443 +- 1229    - 3.7 %
>
> So the accounting doesn't come for free - it takes ~4% of performance.
> I believe we could optimize it by using per cpu batching not only on
> charge, but also on uncharge in memcg core, but that's beyond the scope
> of this patch set - I'll take a look at this later.
>
> Anyway, if performance impact is found to be unacceptable, it is always
> possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
> or not use memory cgroups at runtime at all (thanks to jump labels
> there'll be no overhead even if they are compiled in).
>

I started seeing an almost 10% degradation in the hackbench score with
v4.8-rc1. Bisecting it pointed to this patch, i.e. commit 3aa9799e1364
("af_unix: charge buffers to kmemcg") in the mainline.

As per the commit log, it seems that's expected, but I was not sure about
the margin. I also see that the hackbench score is more inconsistent
after this patch, but I may be wrong as that's based on limited
observation.

Is this something we can ignore, given that hackbench is synthetic
compared to the real workloads in which this patch provides a gain?

-- 
Regards,
Sudeep


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
  2016-08-23 13:48       ` Sudeep K N
@ 2016-08-23 16:44         ` Vladimir Davydov
  2016-08-23 16:50           ` Sudeep Holla
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2016-08-23 16:44 UTC (permalink / raw)
  To: Sudeep K N
  Cc: Eric Dumazet, Andrew Morton, David S. Miller, Johannes Weiner,
	Michal Hocko, linux-mm, linux-fsdevel, netdev, x86, open list,
	Ingo Molnar, Peter Zijlstra, Sudeep Holla

Hello,

On Tue, Aug 23, 2016 at 02:48:11PM +0100, Sudeep K N wrote:
> On Tue, May 24, 2016 at 5:36 PM, Vladimir Davydov
> <vdavydov@virtuozzo.com> wrote:
> > On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
> >> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> >> > Unix sockets can consume a significant amount of system memory, hence
> >> > they should be accounted to kmemcg.
> >> >
> >> > Since unix socket buffers are always allocated from process context,
> >> > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> >> > sock->sk_allocation mask.
> >>
> >> I have two questions :
> >>
> >> 1) What happens when a buffer, allocated from socket <A>, lands in a
> >> different socket <B>, maybe owned by another user/process?
> >>
> >> Who owns it now, in terms of kmemcg accounting?
> >
> > We never move memcg charges. E.g. if two processes from different
> > cgroups are sharing a memory region, each page will be charged to the
> > process which touched it first. Or if two processes are working with the
> > same directory tree, inodes and dentries will be charged to the first
> > user. The same holds for unix socket buffers - they will be charged to
> > the sender.
> >
> >>
> >> 2) Has performance impact been evaluated ?
> >
> > I ran netperf STREAM_STREAM with default options in a kmemcg on
> > a 4 core x 2 HT box. The results are below:
> >
> >  # clients            bandwidth (10^6bits/sec)
> >                     base              patched
> >          1      67643 +-  725      64874 +-  353    - 4.0 %
> >          4     193585 +- 2516     186715 +- 1460    - 3.5 %
> >          8     194820 +-  377     187443 +- 1229    - 3.7 %
> >
> > So the accounting doesn't come for free - it takes ~4% of performance.
> > I believe we could optimize it by using per cpu batching not only on
> > charge, but also on uncharge in memcg core, but that's beyond the scope
> > of this patch set - I'll take a look at this later.
> >
> > Anyway, if performance impact is found to be unacceptable, it is always
> > possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
> > or not use memory cgroups at runtime at all (thanks to jump labels
> > there'll be no overhead even if they are compiled in).
> >
> 
> I started seeing an almost 10% degradation in the hackbench score with
> v4.8-rc1. Bisecting it pointed to this patch, i.e. commit 3aa9799e1364
> ("af_unix: charge buffers to kmemcg") in the mainline.
>
> As per the commit log, it seems that's expected, but I was not sure about
> the margin. I also see that the hackbench score is more inconsistent
> after this patch, but I may be wrong as that's based on limited
> observation.
>
> Is this something we can ignore, given that hackbench is synthetic
> compared to the real workloads in which this patch provides a gain?

AFAIU hackbench essentially measures the rate of sending data over a
unix socket back and forth between processes running on different cpus,
so it isn't a surprise that the patch resulted in a degradation, as it
makes every skb page allocation/deallocation inc/dec an atomic counter
inside memcg. The more processes/cpus running in the same cgroup are
involved in this test, the more significant the overhead of this atomic
counter is going to be.

The degradation is not unavoidable - it can be fixed by making kmem
charge/uncharge code use per-cpu batches. The infrastructure for this
already exists in memcontrol.c. If it were not for the legacy
mem_cgroup->kmem counter (which is actually useless and will be dropped
in cgroup v2), the issue would be pretty easy to fix. However, this
legacy counter makes a possible implementation quite messy, so I'd like
to postpone it until cgroup v2 has finally settled down.
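
To give an idea of what I mean by per-cpu batches, memcontrol.c already
does something along these lines for charges (a heavily simplified
sketch; irq, preemption, and drain handling are omitted):

struct memcg_stock_pcp {
	struct mem_cgroup *cached;	/* memcg the stock belongs to */
	unsigned int nr_pages;		/* pre-charged pages on this cpu */
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

/*
 * Try to satisfy a charge from the per-cpu stock, avoiding the shared
 * atomic page counter in the common case.
 */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		return true;
	}
	return false;
}

Extending the same trick to uncharge (and to the kmem counter) is what
would get rid of most of the atomic operations hackbench is hitting.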

Regarding your problem. As a workaround you can either start your
workload in the root memory cgroup or disable kmem accounting for memory
cgroups altogether (via cgroup.memory=nokmem boot option). If you find
the issue critical, I don't mind reverting the patch - we can always
re-apply it once per-cpu batches are implemented for kmem charges.

Thanks,
Vladimir


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg
  2016-08-23 16:44         ` Vladimir Davydov
@ 2016-08-23 16:50           ` Sudeep Holla
  0 siblings, 0 replies; 22+ messages in thread
From: Sudeep Holla @ 2016-08-23 16:50 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Sudeep Holla, Eric Dumazet, Andrew Morton, David S. Miller,
	Johannes Weiner, Michal Hocko, linux-mm, linux-fsdevel, netdev,
	x86, open list, Ingo Molnar, Peter Zijlstra



On 23/08/16 17:44, Vladimir Davydov wrote:
> Hello,
>
> On Tue, Aug 23, 2016 at 02:48:11PM +0100, Sudeep K N wrote:
>> On Tue, May 24, 2016 at 5:36 PM, Vladimir Davydov
>> <vdavydov@virtuozzo.com> wrote:
>>> On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
>>>> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
>>>>> Unix sockets can consume a significant amount of system memory, hence
>>>>> they should be accounted to kmemcg.
>>>>>
>>>>> Since unix socket buffers are always allocated from process context,
>>>>> all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
>>>>> sock->sk_allocation mask.
>>>>
>>>> I have two questions :
>>>>
>>>> 1) What happens when a buffer, allocated from socket <A>, lands in a
>>>> different socket <B>, maybe owned by another user/process?
>>>>
>>>> Who owns it now, in terms of kmemcg accounting?
>>>
>>> We never move memcg charges. E.g. if two processes from different
>>> cgroups are sharing a memory region, each page will be charged to the
>>> process which touched it first. Or if two processes are working with the
>>> same directory tree, inodes and dentries will be charged to the first
>>> user. The same holds for unix socket buffers - they will be charged to
>>> the sender.
>>>
>>>>
>>>> 2) Has performance impact been evaluated ?
>>>
>>> I ran netperf STREAM_STREAM with default options in a kmemcg on
>>> a 4 core x 2 HT box. The results are below:
>>>
>>>  # clients            bandwidth (10^6bits/sec)
>>>                     base              patched
>>>          1      67643 +-  725      64874 +-  353    - 4.0 %
>>>          4     193585 +- 2516     186715 +- 1460    - 3.5 %
>>>          8     194820 +-  377     187443 +- 1229    - 3.7 %
>>>
>>> So the accounting doesn't come for free - it takes ~4% of performance.
>>> I believe we could optimize it by using per cpu batching not only on
>>> charge, but also on uncharge in memcg core, but that's beyond the scope
>>> of this patch set - I'll take a look at this later.
>>>
>>> Anyway, if performance impact is found to be unacceptable, it is always
>>> possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
>>> or not use memory cgroups at runtime at all (thanks to jump labels
>>> there'll be no overhead even if they are compiled in).
>>>
>>
>> I started seeing an almost 10% degradation in the hackbench score with
>> v4.8-rc1. Bisecting it pointed to this patch, i.e. commit 3aa9799e1364
>> ("af_unix: charge buffers to kmemcg") in the mainline.
>>
>> As per the commit log, it seems that's expected, but I was not sure about
>> the margin. I also see that the hackbench score is more inconsistent
>> after this patch, but I may be wrong as that's based on limited
>> observation.
>>
>> Is this something we can ignore, given that hackbench is synthetic
>> compared to the real workloads in which this patch provides a gain?
>
> AFAIU hackbench essentially measures the rate of sending data over a
> unix socket back and forth between processes running on different cpus,
> so it isn't a surprise that the patch resulted in a degradation, as it
> makes every skb page allocation/deallocation inc/dec an atomic counter
> inside memcg. The more processes/cpus running in the same cgroup are
> involved in this test, the more significant the overhead of this atomic
> counter is going to be.
>

Understood.

> The degradation is not unavoidable - it can be fixed by making kmem
> charge/uncharge code use per-cpu batches. The infrastructure for this
> already exists in memcontrol.c. If it were not for the legacy
> mem_cgroup->kmem counter (which is actually useless and will be dropped
> in cgroup v2), the issue would be pretty easy to fix. However, this
> legacy counter makes a possible implementation quite messy, so I'd like
> to postpone it until cgroup v2 has finally settled down.
>

Sure

> Regarding your problem. As a workaround you can either start your
> workload in the root memory cgroup or disable kmem accounting for memory
> cgroups altogether (via cgroup.memory=nokmem boot option). If you find
> the issue critical, I don't mind reverting the patch - we can always
> re-apply it once per-cpu batches are implemented for kmem charges.
>

I did try "cgroup.memory=nokmem" as specified in the commit message, but
I found the results to be not very consistent. I need to check again to
be sure.

I am not asking for a revert; I just wanted to know whether this is
expected, so that we can adjust the scores when comparing, especially if
we are using hackbench as a benchmark during development.

-- 
Regards,
Sudeep


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-08-23 16:50 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-24  8:49 [PATCH RESEND 0/8] More stuff to charge to kmemcg Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 1/8] mm: remove pointless struct in struct page definition Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 2/8] mm: clean up non-standard page->_mapcount users Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 3/8] mm: memcontrol: cleanup kmem charge functions Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 4/8] mm: charge/uncharge kmemcg from generic page allocator paths Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 5/8] mm: memcontrol: teach uncharge_list to deal with kmem pages Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 6/8] arch: x86: charge page tables to kmemcg Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 7/8] pipe: account " Vladimir Davydov
2016-05-24 12:59   ` Eric Dumazet
2016-05-24 16:13     ` Vladimir Davydov
2016-05-24 20:04       ` Eric Dumazet
2016-05-25 10:30         ` Vladimir Davydov
2016-05-26  7:04           ` Minchan Kim
2016-05-26 13:59             ` Vladimir Davydov
2016-05-26 14:15               ` Eric Dumazet
2016-05-27 15:03                 ` Vladimir Davydov
2016-05-24  8:49 ` [PATCH RESEND 8/8] af_unix: charge buffers " Vladimir Davydov
2016-05-24 13:02   ` Eric Dumazet
2016-05-24 16:36     ` Vladimir Davydov
2016-08-23 13:48       ` Sudeep K N
2016-08-23 16:44         ` Vladimir Davydov
2016-08-23 16:50           ` Sudeep Holla

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).