* [PATCH 0/5] memcg: charge page tables (x86) and pipe buffers
@ 2015-09-26 10:45 Vladimir Davydov
  2015-09-26 10:45 ` [PATCH 1/5] mm: uncharge kmem pages from generic free_page path Vladimir Davydov
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-26 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

Hi,

There are at least two object types left that can be allocated by an
unprivileged process without being charged to memcg - pipe buffers and
page tables. This patch set adds accounting for both.

Comments are welcome.

Thanks,

Vladimir Davydov (5):
  mm: uncharge kmem pages from generic free_page path
  fs: charge pipe buffers to memcg
  memcg: teach uncharge_list to uncharge kmem pages
  mm: add __get_free_kmem_pages helper
  x86: charge page table pages to memcg

 arch/x86/include/asm/pgalloc.h |  5 +++--
 arch/x86/mm/pgtable.c          |  8 ++++----
 fs/pipe.c                      |  2 +-
 include/linux/gfp.h            |  4 +---
 include/linux/page-flags.h     | 22 ++++++++++++++++++++++
 kernel/fork.c                  |  2 +-
 mm/memcontrol.c                | 21 ++++++++++++++-------
 mm/page_alloc.c                | 38 ++++++++++++++++++++------------------
 mm/slub.c                      |  2 +-
 9 files changed, 67 insertions(+), 37 deletions(-)

-- 
2.1.4


* [PATCH 1/5] mm: uncharge kmem pages from generic free_page path
  2015-09-26 10:45 [PATCH 0/5] memcg: charge page tables (x86) and pipe buffers Vladimir Davydov
@ 2015-09-26 10:45 ` Vladimir Davydov
  2015-09-29 22:43   ` Andrew Morton
  2015-09-30 19:51   ` Greg Thelen
  2015-09-26 10:45 ` [PATCH 2/5] fs: charge pipe buffers to memcg Vladimir Davydov
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-26 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

Currently, to charge a page to kmemcg one should use alloc_kmem_pages
helper. When the page is not needed anymore it must be freed with
free_kmem_pages helper, which will uncharge the page before freeing it.
Such a design is acceptable for thread info pages and kmalloc large
allocations, which are currently the only users of alloc_kmem_pages, but
it gets extremely inconvenient if one wants to make use of batched free
(e.g. to charge page tables - see release_pages) or page reference
counter (pipe buffers - see anon_pipe_buf_release).

To overcome this limitation, this patch moves kmemcg uncharge code to
the generic free path and zaps free_kmem_pages helper. To distinguish
kmem pages from other page types, it makes alloc_kmem_pages initialize
page->_mapcount to a special value and introduces a new PageKmem helper,
which returns true if it sees this value.
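
To illustrate the intended usage after this change (a rough sketch, not
taken verbatim from the patch):

	struct page *page;

	page = alloc_kmem_pages(GFP_KERNEL, 0);	/* charged and marked PageKmem */
	if (!page)
		return -ENOMEM;
	...
	__free_pages(page, 0);	/* uncharged in free_pages_prepare(); put_page()
				 * and release_pages() end up there as well */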

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/gfp.h        |  3 ---
 include/linux/page-flags.h | 22 ++++++++++++++++++++++
 kernel/fork.c              |  2 +-
 mm/page_alloc.c            | 26 ++++++++------------------
 mm/slub.c                  |  2 +-
 mm/swap.c                  |  3 ++-
 6 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f92cbd2f4450..b46147c45966 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -384,9 +384,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc,
 			       unsigned int fragsz, gfp_t gfp_mask);
 extern void __free_page_frag(void *addr);
 
-extern void __free_kmem_pages(struct page *page, unsigned int order);
-extern void free_kmem_pages(unsigned long addr, unsigned int order);
-
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 416509e26d6d..a190719c2f46 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -594,6 +594,28 @@ static inline void __ClearPageBalloon(struct page *page)
 }
 
 /*
+ * PageKmem() returns true if the page was allocated with alloc_kmem_pages().
+ */
+#define PAGE_KMEM_MAPCOUNT_VALUE (-512)
+
+static inline int PageKmem(struct page *page)
+{
+	return atomic_read(&page->_mapcount) == PAGE_KMEM_MAPCOUNT_VALUE;
+}
+
+static inline void __SetPageKmem(struct page *page)
+{
+	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
+	atomic_set(&page->_mapcount, PAGE_KMEM_MAPCOUNT_VALUE);
+}
+
+static inline void __ClearPageKmem(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageKmem(page), page);
+	atomic_set(&page->_mapcount, -1);
+}
+
+/*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 2845623fb582..c23f8a17e99e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -169,7 +169,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b9f253..88d85367c81e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -942,6 +942,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 
 	if (PageAnon(page))
 		page->mapping = NULL;
+	if (PageKmem(page)) {
+		memcg_kmem_uncharge_pages(page, order);
+		__ClearPageKmem(page);
+	}
 	bad += free_pages_check(page);
 	for (i = 1; i < (1 << order); i++) {
 		if (compound)
@@ -3434,6 +3438,8 @@ struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
 		return NULL;
 	page = alloc_pages(gfp_mask, order);
 	memcg_kmem_commit_charge(page, memcg, order);
+	if (page)
+		__SetPageKmem(page);
 	return page;
 }
 
@@ -3446,27 +3452,11 @@ struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 		return NULL;
 	page = alloc_pages_node(nid, gfp_mask, order);
 	memcg_kmem_commit_charge(page, memcg, order);
+	if (page)
+		__SetPageKmem(page);
 	return page;
 }
 
-/*
- * __free_kmem_pages and free_kmem_pages will free pages allocated with
- * alloc_kmem_pages.
- */
-void __free_kmem_pages(struct page *page, unsigned int order)
-{
-	memcg_kmem_uncharge_pages(page, order);
-	__free_pages(page, order);
-}
-
-void free_kmem_pages(unsigned long addr, unsigned int order)
-{
-	if (addr != 0) {
-		VM_BUG_ON(!virt_addr_valid((void *)addr));
-		__free_kmem_pages(virt_to_page((void *)addr), order);
-	}
-}
-
 static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
 {
 	if (addr) {
diff --git a/mm/slub.c b/mm/slub.c
index f614b5dc396b..f5248a7d9438 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3516,7 +3516,7 @@ void kfree(const void *x)
 	if (unlikely(!PageSlab(page))) {
 		BUG_ON(!PageCompound(page));
 		kfree_hook(x);
-		__free_kmem_pages(page, compound_order(page));
+		__free_pages(page, compound_order(page));
 		return;
 	}
 	slab_free(page->slab_cache, page, object, _RET_IP_);
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..8d8d03118a18 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -64,7 +64,8 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	mem_cgroup_uncharge(page);
+	if (!PageKmem(page))
+		mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
-- 
2.1.4


* [PATCH 2/5] fs: charge pipe buffers to memcg
  2015-09-26 10:45 [PATCH 0/5] memcg: charge page tables (x86) and pipe buffers Vladimir Davydov
  2015-09-26 10:45 ` [PATCH 1/5] mm: uncharge kmem pages from generic free_page path Vladimir Davydov
@ 2015-09-26 10:45 ` Vladimir Davydov
  2015-09-29 22:57   ` Andrew Morton
  2015-09-26 10:45 ` [PATCH 3/5] memcg: teach uncharge_list to uncharge kmem pages Vladimir Davydov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-26 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

Pipe buffers can be generated unrestrictedly by an unprivileged
userspace process, so they shouldn't go unaccounted.
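
For context, the release side of an anonymous pipe buffer is a plain
reference drop (roughly, from fs/pipe.c of this era), which is why the
uncharge has to happen in the generic free path added by the previous
patch rather than in a dedicated free_kmem_pages() call:

	static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
					  struct pipe_buffer *buf)
	{
		struct page *page = buf->page;

		/* keep one page cached on the pipe, drop the rest */
		if (page_count(page) == 1 && !pipe->tmp_page)
			pipe->tmp_page = page;
		else
			page_cache_release(page);
	}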

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 fs/pipe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 8865f7963700..6880884b70b0 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -400,7 +400,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			int copied;
 
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_kmem_pages(GFP_HIGHUSER, 0);
 				if (unlikely(!page)) {
 					ret = ret ? : -ENOMEM;
 					break;
-- 
2.1.4


* [PATCH 3/5] memcg: teach uncharge_list to uncharge kmem pages
  2015-09-26 10:45 [PATCH 0/5] memcg: charge page tables (x86) and pipe buffers Vladimir Davydov
  2015-09-26 10:45 ` [PATCH 1/5] mm: uncharge kmem pages from generic free_page path Vladimir Davydov
  2015-09-26 10:45 ` [PATCH 2/5] fs: charge pipe buffers to memcg Vladimir Davydov
@ 2015-09-26 10:45 ` Vladimir Davydov
  2015-09-26 10:45 ` [PATCH 4/5] mm: add __get_free_kmem_pages helper Vladimir Davydov
  2015-09-26 10:45 ` [PATCH 5/5] x86: charge page table pages to memcg Vladimir Davydov
  4 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-26 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

Page table pages are batch-freed in release_pages on most architectures.
If we want to charge them to kmemcg (which is done later in this series),
we need to teach mem_cgroup_uncharge_list to handle kmem pages. With the
PageKmem helper introduced earlier in the series, this is trivial.
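
For reference, the batched path in question looks roughly like this
(heavily abridged from mm/swap.c):

	void release_pages(struct page **pages, int nr, bool cold)
	{
		LIST_HEAD(pages_to_free);
		...
			/* page refcount dropped to zero: queue it for freeing */
			list_add(&page->lru, &pages_to_free);
		...
		mem_cgroup_uncharge_list(&pages_to_free);
		free_hot_cold_page_list(&pages_to_free, cold);
	}

so kmem pages freed this way pass through mem_cgroup_uncharge_list()
before they reach the page allocator.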

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c | 21 ++++++++++++++-------
 mm/swap.c       |  3 +--
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba34e09..a61fe1604f49 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5420,15 +5420,18 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 
 static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 			   unsigned long nr_anon, unsigned long nr_file,
-			   unsigned long nr_huge, struct page *dummy_page)
+			   unsigned long nr_huge, unsigned long nr_kmem,
+			   struct page *dummy_page)
 {
-	unsigned long nr_pages = nr_anon + nr_file;
+	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
 	unsigned long flags;
 
 	if (!mem_cgroup_is_root(memcg)) {
 		page_counter_uncharge(&memcg->memory, nr_pages);
 		if (do_swap_account)
 			page_counter_uncharge(&memcg->memsw, nr_pages);
+		if (nr_kmem)
+			page_counter_uncharge(&memcg->kmem, nr_kmem);
 		memcg_oom_recover(memcg);
 	}
 
@@ -5451,6 +5454,7 @@ static void uncharge_list(struct list_head *page_list)
 	unsigned long nr_anon = 0;
 	unsigned long nr_file = 0;
 	unsigned long nr_huge = 0;
+	unsigned long nr_kmem = 0;
 	unsigned long pgpgout = 0;
 	struct list_head *next;
 	struct page *page;
@@ -5477,19 +5481,22 @@ static void uncharge_list(struct list_head *page_list)
 		if (memcg != page->mem_cgroup) {
 			if (memcg) {
 				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-					       nr_huge, page);
-				pgpgout = nr_anon = nr_file = nr_huge = 0;
+					       nr_huge, nr_kmem, page);
+				pgpgout = nr_anon = nr_file =
+					nr_huge = nr_kmem = 0;
 			}
 			memcg = page->mem_cgroup;
 		}
 
-		if (PageTransHuge(page)) {
+		if (!PageKmem(page) && PageTransHuge(page)) {
 			nr_pages <<= compound_order(page);
 			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			nr_huge += nr_pages;
 		}
 
-		if (PageAnon(page))
+		if (PageKmem(page))
+			nr_kmem += 1 << compound_order(page);
+		else if (PageAnon(page))
 			nr_anon += nr_pages;
 		else
 			nr_file += nr_pages;
@@ -5501,7 +5508,7 @@ static void uncharge_list(struct list_head *page_list)
 
 	if (memcg)
 		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-			       nr_huge, page);
+			       nr_huge, nr_kmem, page);
 }
 
 /**
diff --git a/mm/swap.c b/mm/swap.c
index 8d8d03118a18..983f692a47fd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -64,8 +64,7 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	if (!PageKmem(page))
-		mem_cgroup_uncharge(page);
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
-- 
2.1.4


* [PATCH 4/5] mm: add __get_free_kmem_pages helper
  2015-09-26 10:45 [PATCH 0/5] memcg: charge page tables (x86) and pipe buffers Vladimir Davydov
                   ` (2 preceding siblings ...)
  2015-09-26 10:45 ` [PATCH 3/5] memcg: teach uncharge_list to uncharge kmem pages Vladimir Davydov
@ 2015-09-26 10:45 ` Vladimir Davydov
  2015-09-26 10:45 ` [PATCH 5/5] x86: charge page table pages to memcg Vladimir Davydov
  4 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-26 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

This helper works exactly like __get_free_pages, except that it also
tries to charge the newly allocated pages to kmemcg. It will be used by
the next patch.
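
Intended usage mirrors __get_free_pages (a sketch, not taken from the
patch):

	unsigned long addr;

	addr = __get_free_kmem_pages(GFP_KERNEL | __GFP_ZERO, 0);
	if (!addr)
		return -ENOMEM;
	...
	free_pages(addr, 0);	/* uncharge happens in the generic free path */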

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/gfp.h |  1 +
 mm/page_alloc.c     | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b46147c45966..34dc0db54b59 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -362,6 +362,7 @@ extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
 					  unsigned int order);
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
+extern unsigned long __get_free_kmem_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
 
 void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88d85367c81e..e4a3a7aa8e42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3296,6 +3296,18 @@ unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
 }
 EXPORT_SYMBOL(__get_free_pages);
 
+unsigned long __get_free_kmem_pages(gfp_t gfp_mask, unsigned int order)
+{
+	struct page *page;
+
+	VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);
+
+	page = alloc_kmem_pages(gfp_mask, order);
+	if (!page)
+		return 0;
+	return (unsigned long) page_address(page);
+}
+
 unsigned long get_zeroed_page(gfp_t gfp_mask)
 {
 	return __get_free_pages(gfp_mask | __GFP_ZERO, 0);
-- 
2.1.4


* [PATCH 5/5] x86: charge page table pages to memcg
  2015-09-26 10:45 [PATCH 0/5] memcg: charge page tables (x86) and pipe buffers Vladimir Davydov
                   ` (3 preceding siblings ...)
  2015-09-26 10:45 ` [PATCH 4/5] mm: add __get_free_kmem_pages helper Vladimir Davydov
@ 2015-09-26 10:45 ` Vladimir Davydov
  4 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-26 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

As noted in the comment to commit dc6c9a35b66b5 ("mm: account pmd page
tables to the process"), "unprivileged process can allocate significant
amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by
oom-killer and memory cgroup". While the above-mentioned commit fixed the
problem for the oom-killer, this patch attempts to fix it for the memory
cgroup on x86 by making pte_alloc_one and friends use alloc_kmem_pages
instead of alloc_pages, so that page table pages get charged to kmemcg.
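
To see where the >500 MiB figure comes from: on x86_64 each 4 KiB pte
page maps 2 MiB of virtual address space (512 entries x 4 KiB), so a
sparse MAP_NORESERVE mapping touched once every 2 MiB allocates one pte
page per 2 MiB of address space. An illustrative user-space sketch (not
taken from the referenced commit):

	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned long i, len = 256UL << 30;	/* 256 GiB of address space */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		for (i = 0; i < len; i += 2UL << 20)
			p[i] = 1;	/* ~128K pte pages = ~512 MiB of page tables */
		pause();
		return 0;
	}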

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 arch/x86/include/asm/pgalloc.h | 5 +++--
 arch/x86/mm/pgtable.c          | 8 ++++----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index bf7f8b55b0f9..944c543836d5 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -81,7 +81,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *page;
-	page = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
+	page = alloc_kmem_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -125,7 +125,8 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pud_t *)__get_free_kmem_pages(GFP_KERNEL|__GFP_REPEAT|
+					      __GFP_ZERO, 0);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fb0a9dd1d6e4..c2f0d57aa7e8 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -25,7 +25,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
 
-	pte = alloc_pages(__userpte_alloc_gfp, 0);
+	pte = alloc_kmem_pages(__userpte_alloc_gfp, 0);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
@@ -209,7 +209,7 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
 	bool failed = false;
 
 	for(i = 0; i < PREALLOCATED_PMDS; i++) {
-		pmd_t *pmd = (pmd_t *)__get_free_page(PGALLOC_GFP);
+		pmd_t *pmd = (pmd_t *)__get_free_kmem_pages(PGALLOC_GFP, 0);
 		if (!pmd)
 			failed = true;
 		if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
@@ -323,7 +323,7 @@ static inline pgd_t *_pgd_alloc(void)
 	 * We allocate one page for pgd.
 	 */
 	if (!SHARED_KERNEL_PMD)
-		return (pgd_t *)__get_free_page(PGALLOC_GFP);
+		return (pgd_t *)__get_free_kmem_pages(PGALLOC_GFP, 0);
 
 	/*
 	 * Now PAE kernel is not running as a Xen domain. We can allocate
@@ -342,7 +342,7 @@ static inline void _pgd_free(pgd_t *pgd)
 #else
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_kmem_pages(PGALLOC_GFP, 0);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
-- 
2.1.4


* Re: [PATCH 1/5] mm: uncharge kmem pages from generic free_page path
  2015-09-26 10:45 ` [PATCH 1/5] mm: uncharge kmem pages from generic free_page path Vladimir Davydov
@ 2015-09-29 22:43   ` Andrew Morton
  2015-09-30 16:46     ` Vladimir Davydov
  2015-09-30 19:51   ` Greg Thelen
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2015-09-29 22:43 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

On Sat, 26 Sep 2015 13:45:53 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:

> Currently, to charge a page to kmemcg one should use alloc_kmem_pages
> helper. When the page is not needed anymore it must be freed with
> free_kmem_pages helper, which will uncharge the page before freeing it.
> Such a design is acceptable for thread info pages and kmalloc large
> allocations, which are currently the only users of alloc_kmem_pages, but
> it gets extremely inconvenient if one wants to make use of batched free
> (e.g. to charge page tables - see release_pages) or page reference
> counter (pipe buffers - see anon_pipe_buf_release).
> 
> To overcome this limitation, this patch moves kmemcg uncharge code to
> the generic free path and zaps free_kmem_pages helper. To distinguish
> kmem pages from other page types, it makes alloc_kmem_pages initialize
> page->_mapcount to a special value and introduces a new PageKmem helper,
> which returns true if it sees this value.

As far as I can tell, this new use of page._mapcount is OK, but...

- The documentation for _mapcount needs to be updated (mm_types.h)

- Don't believe the documentation!  Because someone else may have
  done what you tried to do.  Please manually audit mm/ for _mapcount
  uses.

- One such use is "For recording whether a page is in the buddy
  system, we set ->_mapcount PAGE_BUDDY_MAPCOUNT_VALUE".  Please update
  the comment for this while you're in there.  (Including description
  of the state's lifetime).

- And please update _mapcount docs for PageBalloon()

- Why is the code accessing ->_mapcount directly?  afaict page_mapcount()
  and friends will work OK?

- The patch adds overhead to all kernels, even non-kmemcg and
  non-memcg kernels.  Bad.  Fixable?

- PAGE_BUDDY_MAPCOUNT_VALUE, PAGE_BALLOON_MAPCOUNT_VALUE and
  PAGE_KMEM_MAPCOUNT_VALUE should all be put next to each other so
  readers can see all the possible values and so we don't get
  duplicates, etc.



* Re: [PATCH 2/5] fs: charge pipe buffers to memcg
  2015-09-26 10:45 ` [PATCH 2/5] fs: charge pipe buffers to memcg Vladimir Davydov
@ 2015-09-29 22:57   ` Andrew Morton
  2015-09-30 16:49     ` Vladimir Davydov
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2015-09-29 22:57 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

On Sat, 26 Sep 2015 13:45:54 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:

> Pipe buffers can be generated unrestrictedly by an unprivileged
> userspace process, so they shouldn't go unaccounted.
> 
> ...
>
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -400,7 +400,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  			int copied;
>  
>  			if (!page) {
> -				page = alloc_page(GFP_HIGHUSER);
> +				page = alloc_kmem_pages(GFP_HIGHUSER, 0);
>  				if (unlikely(!page)) {
>  					ret = ret ? : -ENOMEM;
>  					break;

This seems broken.  We have a pipe buffer page which has a weird
->_mapcount.  Now it gets stolen (generic_pipe_buf_steal()) and spliced
into pagecache.  Then the page gets mmapped and MM starts playing with
its ->_mapcount?



* Re: [PATCH 1/5] mm: uncharge kmem pages from generic free_page path
  2015-09-29 22:43   ` Andrew Morton
@ 2015-09-30 16:46     ` Vladimir Davydov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-30 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

On Tue, Sep 29, 2015 at 03:43:47PM -0700, Andrew Morton wrote:
> On Sat, 26 Sep 2015 13:45:53 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:
> 
> > Currently, to charge a page to kmemcg one should use alloc_kmem_pages
> > helper. When the page is not needed anymore it must be freed with
> > free_kmem_pages helper, which will uncharge the page before freeing it.
> > Such a design is acceptable for thread info pages and kmalloc large
> > allocations, which are currently the only users of alloc_kmem_pages, but
> > it gets extremely inconvenient if one wants to make use of batched free
> > (e.g. to charge page tables - see release_pages) or page reference
> > counter (pipe buffers - see anon_pipe_buf_release).
> > 
> > To overcome this limitation, this patch moves kmemcg uncharge code to
> > the generic free path and zaps free_kmem_pages helper. To distinguish
> > kmem pages from other page types, it makes alloc_kmem_pages initialize
> > page->_mapcount to a special value and introduces a new PageKmem helper,
> > which returns true if it sees this value.
> 
> As far as I can tell, this new use of page._mapcount is OK, but...
> 
> - The documentation for _mapcount needs to be updated (mm_types.h)
> 
> - Don't believe the documentation!  Because someone else may have
>   done what you tried to do.  Please manually audit mm/ for _mapcount
>   uses.

OK, I rechecked mm/. Here is the list of (ab)users of the
page->_mapcount field:

 - free pages in buddy (PAGE_BUDDY_MAPCOUNT_VALUE)
 - balloon pages (PAGE_BALLOON_MAPCOUNT_VALUE)
 - compound tail pages (use _mapcount for reference counting)

None of them needs PageKmem set by design. The _mapcount is also
overloaded by slab, but the latter doesn't need alloc_kmem_pages for
kmemcg accounting.

However, there is an (ab)user of _mapcount outside mm/: arch/s390, which
stores its private info in page table pages' _mapcount. AFAICS this
shouldn't result in any conflicts with the PageKmem helper introduced by
this patch set, because s390 doesn't use the generic tlb_remove_page, but
it looks nasty anyway and at least needs a comment. I'll look into what
we can do about that.

> 
> - One such use is "For recording whether a page is in the buddy
>   system, we set ->_mapcount PAGE_BUDDY_MAPCOUNT_VALUE".  Please update
>   the comment for this while you're in there.  (Including description
>   of the state's lifetime).
> 
> - And please update _mapcount docs for PageBalloon()
> 
> - Why is the code accessing ->_mapcount directly?  afaict page_mapcount()
>   and friends will work OK?

page_mapcount() lives in mm.h, which isn't included by page-flags.h.
Anyway, I don't think it's a good idea to use the page_mapcount() helper
here, because it returns not the raw value of _mapcount but the actual
count of page mappings (i.e. _mapcount + 1), which is IMO confusing when
the page is never mapped and _mapcount is (ab)used for storing a flag.
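
For reference, page_mapcount() is roughly (include/linux/mm.h):

	static inline int page_mapcount(struct page *page)
	{
		return atomic_read(&page->_mapcount) + 1;
	}

so for a freshly allocated, never-mapped page it reports 0, not -1.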

> 
> - The patch adds overhead to all kernels, even non-kmemcg and
>   non-memcg kernels.  Bad.  Fixable?

If kmemcg is not used, memcg_kmem_uncharge_pages is a no-op (thanks to
jump labels), so the overhead added is at worst a load + comparison +
store (i.e. if PageKmem(page), then __ClearPageKmem(page)). In the common
case (!PageKmem) it is just a load + comparison. Given that
page->_mapcount is accessed in free_pages_check anyway, the overhead
should be negligible IMO.

I can, of course, move all PageKmem stuff under CONFIG_MEMCG_KMEM, but
is that vague performance benefit worth obscuring the code?

> 
> - PAGE_BUDDY_MAPCOUNT_VALUE, PAGE_BALLOON_MAPCOUNT_VALUE and
>   PAGE_KMEM_MAPCOUNT_VALUE should all be put next to each other so
>   readers can see all the possible values and so we don't get
>   duplicates, etc.

Right, will do.

Thanks,
Vladimir

* Re: [PATCH 2/5] fs: charge pipe buffers to memcg
  2015-09-29 22:57   ` Andrew Morton
@ 2015-09-30 16:49     ` Vladimir Davydov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-09-30 16:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, linux-mm, linux-kernel

On Tue, Sep 29, 2015 at 03:57:11PM -0700, Andrew Morton wrote:
> On Sat, 26 Sep 2015 13:45:54 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:
> 
> > Pipe buffers can be generated unrestrictedly by an unprivileged
> > userspace process, so they shouldn't go unaccounted.
> > 
> > ...
> >
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -400,7 +400,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
> >  			int copied;
> >  
> >  			if (!page) {
> > -				page = alloc_page(GFP_HIGHUSER);
> > +				page = alloc_kmem_pages(GFP_HIGHUSER, 0);
> >  				if (unlikely(!page)) {
> >  					ret = ret ? : -ENOMEM;
> >  					break;
> 
> This seems broken.  We have a pipe buffer page which has a weird
> ->_mapcount.  Now it gets stolen (generic_pipe_buf_steal()) and spliced
> into pagecache.  Then the page gets mmapped and MM starts playing with
> its ->_mapcount?

Right you are! I completely forgot about the vmsplice case. Surely, we
need to uncharge the page in the ->steal method and clear PageKmem. Will
fix that.
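
Something along these lines, perhaps (an untested sketch; the final fix
may well look different):

	static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
				       struct pipe_buffer *buf)
	{
		struct page *page = buf->page;

		/* only the last reference holder can actually steal the page */
		if (page_count(page) == 1 && PageKmem(page)) {
			memcg_kmem_uncharge_pages(page, 0);
			__ClearPageKmem(page);
		}
		return generic_pipe_buf_steal(pipe, buf);
	}

with anon_pipe_buf_ops.steal pointed at it instead of
generic_pipe_buf_steal.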

Thanks,
Vladimir

* Re: [PATCH 1/5] mm: uncharge kmem pages from generic free_page path
  2015-09-26 10:45 ` [PATCH 1/5] mm: uncharge kmem pages from generic free_page path Vladimir Davydov
  2015-09-29 22:43   ` Andrew Morton
@ 2015-09-30 19:51   ` Greg Thelen
  2015-10-01 18:52     ` Vladimir Davydov
  1 sibling, 1 reply; 12+ messages in thread
From: Greg Thelen @ 2015-09-30 19:51 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Tejun Heo,
	linux-mm, linux-kernel


Vladimir Davydov wrote:

> Currently, to charge a page to kmemcg one should use alloc_kmem_pages
> helper. When the page is not needed anymore it must be freed with
> free_kmem_pages helper, which will uncharge the page before freeing it.
> Such a design is acceptable for thread info pages and kmalloc large
> allocations, which are currently the only users of alloc_kmem_pages, but
> it gets extremely inconvenient if one wants to make use of batched free
> (e.g. to charge page tables - see release_pages) or page reference
> counter (pipe buffers - see anon_pipe_buf_release).
>
> To overcome this limitation, this patch moves kmemcg uncharge code to
> the generic free path and zaps free_kmem_pages helper. To distinguish
> kmem pages from other page types, it makes alloc_kmem_pages initialize
> page->_mapcount to a special value and introduces a new PageKmem helper,
> which returns true if it sees this value.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  include/linux/gfp.h        |  3 ---
>  include/linux/page-flags.h | 22 ++++++++++++++++++++++
>  kernel/fork.c              |  2 +-
>  mm/page_alloc.c            | 26 ++++++++------------------
>  mm/slub.c                  |  2 +-
>  mm/swap.c                  |  3 ++-
>  6 files changed, 34 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f92cbd2f4450..b46147c45966 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -384,9 +384,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc,
>  			       unsigned int fragsz, gfp_t gfp_mask);
>  extern void __free_page_frag(void *addr);
>  
> -extern void __free_kmem_pages(struct page *page, unsigned int order);
> -extern void free_kmem_pages(unsigned long addr, unsigned int order);
> -
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 416509e26d6d..a190719c2f46 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -594,6 +594,28 @@ static inline void __ClearPageBalloon(struct page *page)
>  }
>  
>  /*
> + * PageKmem() returns true if the page was allocated with alloc_kmem_pages().
> + */
> +#define PAGE_KMEM_MAPCOUNT_VALUE (-512)
> +
> +static inline int PageKmem(struct page *page)
> +{
> +	return atomic_read(&page->_mapcount) == PAGE_KMEM_MAPCOUNT_VALUE;
> +}
> +
> +static inline void __SetPageKmem(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
> +	atomic_set(&page->_mapcount, PAGE_KMEM_MAPCOUNT_VALUE);
> +}

What do you think about several special mapcount values for various
types of kmem?

It helps users and administrators break down memory usage.

A nice equation is:
  memory.usage_in_bytes = memory.stat[file + anon + unevictable + kmem]

Next, it's helpful to be able to break down kmem into:
  kmem = stack + pgtable + slab + ...

On one hand (and in the kernel I use internally) we can use separate
per-memcg counters for each kmem type, then reconstitute memory.kmem as
needed by adding them together. But keeping a single kernel kmem counter
is workable if there is a way to break down the memory charged to a
container (e.g. by walking a /proc/kpageflags-ish or per-memcg
memory.kpageflags-ish file).
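
E.g. something like (names and values purely illustrative, not from the
series):

	#define PAGE_KMEM_STACK_MAPCOUNT_VALUE		(-512)
	#define PAGE_KMEM_PGTABLE_MAPCOUNT_VALUE	(-513)
	#define PAGE_KMEM_OTHER_MAPCOUNT_VALUE		(-514)

	static inline int PageKmem(struct page *page)
	{
		int v = atomic_read(&page->_mapcount);

		return v <= PAGE_KMEM_STACK_MAPCOUNT_VALUE &&
		       v >= PAGE_KMEM_OTHER_MAPCOUNT_VALUE;
	}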

> +static inline void __ClearPageKmem(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(!PageKmem(page), page);
> +	atomic_set(&page->_mapcount, -1);
> +}
> +
> +/*
>   * If network-based swap is enabled, sl*b must keep track of whether pages
>   * were allocated from pfmemalloc reserves.
>   */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2845623fb582..c23f8a17e99e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -169,7 +169,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> -	free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> +	free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48aaf7b9f253..88d85367c81e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -942,6 +942,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>  
>  	if (PageAnon(page))
>  		page->mapping = NULL;
> +	if (PageKmem(page)) {
> +		memcg_kmem_uncharge_pages(page, order);
> +		__ClearPageKmem(page);
> +	}
>  	bad += free_pages_check(page);
>  	for (i = 1; i < (1 << order); i++) {
>  		if (compound)
> @@ -3434,6 +3438,8 @@ struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
>  		return NULL;
>  	page = alloc_pages(gfp_mask, order);
>  	memcg_kmem_commit_charge(page, memcg, order);
> +	if (page)
> +		__SetPageKmem(page);
>  	return page;
>  }
>  
> @@ -3446,27 +3452,11 @@ struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  		return NULL;
>  	page = alloc_pages_node(nid, gfp_mask, order);
>  	memcg_kmem_commit_charge(page, memcg, order);
> +	if (page)
> +		__SetPageKmem(page);
>  	return page;
>  }
>  
> -/*
> - * __free_kmem_pages and free_kmem_pages will free pages allocated with
> - * alloc_kmem_pages.
> - */
> -void __free_kmem_pages(struct page *page, unsigned int order)
> -{
> -	memcg_kmem_uncharge_pages(page, order);
> -	__free_pages(page, order);
> -}
> -
> -void free_kmem_pages(unsigned long addr, unsigned int order)
> -{
> -	if (addr != 0) {
> -		VM_BUG_ON(!virt_addr_valid((void *)addr));
> -		__free_kmem_pages(virt_to_page((void *)addr), order);
> -	}
> -}
> -
>  static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
>  {
>  	if (addr) {
> diff --git a/mm/slub.c b/mm/slub.c
> index f614b5dc396b..f5248a7d9438 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3516,7 +3516,7 @@ void kfree(const void *x)
>  	if (unlikely(!PageSlab(page))) {
>  		BUG_ON(!PageCompound(page));
>  		kfree_hook(x);
> -		__free_kmem_pages(page, compound_order(page));
> +		__free_pages(page, compound_order(page));
>  		return;
>  	}
>  	slab_free(page->slab_cache, page, object, _RET_IP_);
> diff --git a/mm/swap.c b/mm/swap.c
> index 983f692a47fd..8d8d03118a18 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -64,7 +64,8 @@ static void __page_cache_release(struct page *page)
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
> -	mem_cgroup_uncharge(page);
> +	if (!PageKmem(page))
> +		mem_cgroup_uncharge(page);
>  }
>  
>  static void __put_single_page(struct page *page)


* Re: [PATCH 1/5] mm: uncharge kmem pages from generic free_page path
  2015-09-30 19:51   ` Greg Thelen
@ 2015-10-01 18:52     ` Vladimir Davydov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2015-10-01 18:52 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Tejun Heo,
	linux-mm, linux-kernel

On Wed, Sep 30, 2015 at 12:51:18PM -0700, Greg Thelen wrote:
> 
> Vladimir Davydov wrote:
...
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 416509e26d6d..a190719c2f46 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -594,6 +594,28 @@ static inline void __ClearPageBalloon(struct page *page)
> >  }
> >  
> >  /*
> > + * PageKmem() returns true if the page was allocated with alloc_kmem_pages().
> > + */
> > +#define PAGE_KMEM_MAPCOUNT_VALUE (-512)
> > +
> > +static inline int PageKmem(struct page *page)
> > +{
> > +	return atomic_read(&page->_mapcount) == PAGE_KMEM_MAPCOUNT_VALUE;
> > +}
> > +
> > +static inline void __SetPageKmem(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
> > +	atomic_set(&page->_mapcount, PAGE_KMEM_MAPCOUNT_VALUE);
> > +}
> 
> What do you think about several special mapcount values for various
> types of kmem?
> 
> It's helps user and administrators break down memory usage.
> 
> A nice equation is:
>   memory.usage_in_bytes = memory.stat[file + anon + unevictable + kmem]
> 
> Next, it's helpful to be able to breakdown kmem into:
>   kmem = stack + pgtable + slab + ...
> 
> On one hand (and the kernel I use internally) we can use separate per
> memcg counters for each kmem type.  Then reconstitute memory.kmem as
> needed by adding them together.  But using keeping a single kernel kmem
> counter is workable if there is a way to breakdown the memory charge to
> a container (e.g. by walking /proc/kpageflags-ish or per memcg
> memory.kpageflags-ish file).

I don't think that storing information about the kmem type in the page
struct just to report it via /proc/kpageflags is a good idea, because the
number of unused bits left in the page struct is limited, so we'd better
(ab)use them carefully, only when it's really difficult to get along
without them.

OTOH I do agree that some extra info showing what "kmem" is actually
used for could be helpful. To accumulate this info we can always use
per-cpu counters, which are pretty cheap and won't degrade performance,
and then report it via memory.stat. Furthermore, it will be more
convenient for administrators to read this info in a human-readable
format than to parse /proc/kpageflags, which in addition takes a long
time on systems with a lot of RAM.

Thanks,
Vladimir
