All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
@ 2015-12-13 20:14 ` Cyrill Gorcunov
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-13 20:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Quentin Casasnovas, Vegard Nossum, Linus Torvalds,
	Willy Tarreau, Andy Lutomirski, Kees Cook, Vladimir Davydov,
	Konstantin Khlebnikov, Pavel Emelyanov, Peter Zijlstra,
	Cyrill Gorcunov

[-- Attachment #1: mm-rlimit-data-2 --]
[-- Type: text/plain, Size: 7381 bytes --]

When inspecting a vague code inside prctl(PR_SET_MM_MEM)
call (which testing the RLIMIT_DATA value to figure out
if we're allowed to assign new @start_brk, @brk, @start_data,
@end_data from mm_struct) it's been commited that RLIMIT_DATA
in a form it's implemented now doesn't do anything useful
because most of user-space libraries use mmap() syscall
for dynamic memory allocations.

So Linus suggested to convert RLIMIT_DATA rlimit into something
suitable for anonymous memory accounting. Here we introduce
new @anon_vm member into mm descriptor which is updated
on every vm_stat_account() call.

Tests for RLIMIT_DATA are done in three places:
 - mmap_region: when user calls mmap() helper
 - do_brk: for brk() helper
 - vma_to_resize: when mremap() is used

The do_brk() is also used in vm_brk() which is called
when the kernel loads Elf and aout files to execute.

Because test for limit is done in do_brk helper we
no longer need to call check_data_rlimit here.

v2:
 - update doc
 - add may_expand_anon_vm helper
 - call for RLIMIT_DATA test in mremap and do_brk

CC: Quentin Casasnovas <quentin.casasnovas@oracle.com>
CC: Vegard Nossum <vegard.nossum@oracle.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Willy Tarreau <w@1wt.eu>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Kees Cook <keescook@google.com>
CC: Vladimir Davydov <vdavydov@virtuozzo.com>
CC: Konstantin Khlebnikov <koct9i@gmail.com>
CC: Pavel Emelyanov <xemul@virtuozzo.com>
CC: Vladimir Davydov <vdavydov@virtuozzo.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 Documentation/filesystems/proc.txt |    1 
 fs/proc/task_mmu.c                 |    2 +
 include/linux/mm_types.h           |    1 
 mm/mmap.c                          |   46 ++++++++++++++++++++++---------------
 mm/mremap.c                        |    5 ++++
 5 files changed, 37 insertions(+), 18 deletions(-)

Index: linux-ml.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-ml.git.orig/Documentation/filesystems/proc.txt
+++ linux-ml.git/Documentation/filesystems/proc.txt
@@ -233,6 +233,7 @@ Table 1-2: Contents of the status files
  VmHWM                       peak resident set size ("high water mark")
  VmRSS                       size of memory portions
  VmData                      size of data, stack, and text segments
+ VmAnon                      size of anonymous private memory (not file backended)
  VmStk                       size of data, stack, and text segments
  VmExe                       size of text segment
  VmLib                       size of shared library code
Index: linux-ml.git/fs/proc/task_mmu.c
===================================================================
--- linux-ml.git.orig/fs/proc/task_mmu.c
+++ linux-ml.git/fs/proc/task_mmu.c
@@ -53,6 +53,7 @@ void task_mem(struct seq_file *m, struct
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
+		"VmAnon:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
 		"VmLib:\t%8lu kB\n"
@@ -66,6 +67,7 @@ void task_mem(struct seq_file *m, struct
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
+		mm->anon_vm << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
 		pmds >> 10,
Index: linux-ml.git/include/linux/mm_types.h
===================================================================
--- linux-ml.git.orig/include/linux/mm_types.h
+++ linux-ml.git/include/linux/mm_types.h
@@ -429,6 +429,7 @@ struct mm_struct {
 	unsigned long shared_vm;	/* Shared pages (files) */
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
 	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
+	unsigned long anon_vm;		/* Anonymous pages mapped */
 	unsigned long def_flags;
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
Index: linux-ml.git/mm/mmap.c
===================================================================
--- linux-ml.git.orig/mm/mmap.c
+++ linux-ml.git/mm/mmap.c
@@ -309,16 +309,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	if (brk < min_brk)
 		goto out;
 
-	/*
-	 * Check against rlimit here. If this check is done later after the test
-	 * of oldbrk with newbrk then it can escape the test and let the data
-	 * segment grow beyond its set limit the in case where the limit is
-	 * not page aligned -Ram Gupta
-	 */
-	if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
-			      mm->end_data, mm->start_data))
-		goto out;
-
 	newbrk = PAGE_ALIGN(brk);
 	oldbrk = PAGE_ALIGN(mm->brk);
 	if (oldbrk == newbrk)
@@ -1223,6 +1213,9 @@ void vm_stat_account(struct mm_struct *m
 			mm->exec_vm += pages;
 	} else if (flags & stack_flags)
 		mm->stack_vm += pages;
+
+	if (anon_accountable_mapping(file, flags))
+		mm->anon_vm += pages;
 }
 #endif /* CONFIG_PROC_FS */
 
@@ -1578,6 +1571,14 @@ unsigned long mmap_region(struct file *f
 	}
 
 	/*
+	 * For anon mappings make sure we don't exceed the limit.
+	 */
+	if (anon_accountable_mapping(file, vm_flags)) {
+		if (!may_expand_anon_vm(mm, len >> PAGE_SHIFT))
+			return -ENOMEM;
+	}
+
+	/*
 	 * Can we just expand an old mapping?
 	 */
 	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
@@ -2760,7 +2761,8 @@ static unsigned long do_brk(unsigned lon
 	}
 
 	/* Check against address space limits *after* clearing old maps... */
-	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
+	if (!may_expand_vm(mm, len >> PAGE_SHIFT) ||
+	    !may_expand_anon_vm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
 	if (mm->map_count > sysctl_max_map_count)
@@ -2795,6 +2797,7 @@ static unsigned long do_brk(unsigned lon
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
+	mm->anon_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
@@ -2982,20 +2985,27 @@ out:
 	return NULL;
 }
 
+static inline int __may_expand_vm(unsigned int limit,
+				  unsigned long cur,
+				  unsigned long npages)
+{
+	unsigned long lim = rlimit(limit) >> PAGE_SHIFT;
+
+	return ((cur + npages) > lim) ? 0 : 1;
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
  */
 int may_expand_vm(struct mm_struct *mm, unsigned long npages)
 {
-	unsigned long cur = mm->total_vm;	/* pages */
-	unsigned long lim;
-
-	lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
+	return __may_expand_vm(RLIMIT_AS, mm->total_vm, npages);
+}
 
-	if (cur + npages > lim)
-		return 0;
-	return 1;
+int may_expand_anon_vm(struct mm_struct *mm, unsigned long npages)
+{
+	return __may_expand_vm(RLIMIT_DATA, mm->anon_vm, npages);
 }
 
 static int special_mapping_fault(struct vm_area_struct *vma,
Index: linux-ml.git/mm/mremap.c
===================================================================
--- linux-ml.git.orig/mm/mremap.c
+++ linux-ml.git/mm/mremap.c
@@ -382,6 +382,11 @@ static struct vm_area_struct *vma_to_res
 	if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
 		return ERR_PTR(-ENOMEM);
 
+	if (anon_accountable_mapping(vma->vm_file, vma->vm_flags)) {
+		if (!may_expand_anon_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
+			return ERR_PTR(-ENOMEM);
+	}
+
 	if (vma->vm_flags & VM_ACCOUNT) {
 		unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
 		if (security_vm_enough_memory_mm(mm, charged))


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
@ 2015-12-13 20:14 ` Cyrill Gorcunov
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-13 20:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Quentin Casasnovas, Vegard Nossum, Linus Torvalds,
	Willy Tarreau, Andy Lutomirski, Kees Cook, Vladimir Davydov,
	Konstantin Khlebnikov, Pavel Emelyanov, Peter Zijlstra,
	Cyrill Gorcunov

[-- Attachment #1: mm-rlimit-data-2 --]
[-- Type: text/plain, Size: 7606 bytes --]

When inspecting a vague code inside prctl(PR_SET_MM_MEM)
call (which testing the RLIMIT_DATA value to figure out
if we're allowed to assign new @start_brk, @brk, @start_data,
@end_data from mm_struct) it's been commited that RLIMIT_DATA
in a form it's implemented now doesn't do anything useful
because most of user-space libraries use mmap() syscall
for dynamic memory allocations.

So Linus suggested to convert RLIMIT_DATA rlimit into something
suitable for anonymous memory accounting. Here we introduce
new @anon_vm member into mm descriptor which is updated
on every vm_stat_account() call.

Tests for RLIMIT_DATA are done in three places:
 - mmap_region: when user calls mmap() helper
 - do_brk: for brk() helper
 - vma_to_resize: when mremap() is used

The do_brk() is also used in vm_brk() which is called
when the kernel loads Elf and aout files to execute.

Because test for limit is done in do_brk helper we
no longer need to call check_data_rlimit here.

v2:
 - update doc
 - add may_expand_anon_vm helper
 - call for RLIMIT_DATA test in mremap and do_brk

CC: Quentin Casasnovas <quentin.casasnovas@oracle.com>
CC: Vegard Nossum <vegard.nossum@oracle.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Willy Tarreau <w@1wt.eu>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Kees Cook <keescook@google.com>
CC: Vladimir Davydov <vdavydov@virtuozzo.com>
CC: Konstantin Khlebnikov <koct9i@gmail.com>
CC: Pavel Emelyanov <xemul@virtuozzo.com>
CC: Vladimir Davydov <vdavydov@virtuozzo.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 Documentation/filesystems/proc.txt |    1 
 fs/proc/task_mmu.c                 |    2 +
 include/linux/mm_types.h           |    1 
 mm/mmap.c                          |   46 ++++++++++++++++++++++---------------
 mm/mremap.c                        |    5 ++++
 5 files changed, 37 insertions(+), 18 deletions(-)

Index: linux-ml.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-ml.git.orig/Documentation/filesystems/proc.txt
+++ linux-ml.git/Documentation/filesystems/proc.txt
@@ -233,6 +233,7 @@ Table 1-2: Contents of the status files
  VmHWM                       peak resident set size ("high water mark")
  VmRSS                       size of memory portions
  VmData                      size of data, stack, and text segments
+ VmAnon                      size of anonymous private memory (not file backended)
  VmStk                       size of data, stack, and text segments
  VmExe                       size of text segment
  VmLib                       size of shared library code
Index: linux-ml.git/fs/proc/task_mmu.c
===================================================================
--- linux-ml.git.orig/fs/proc/task_mmu.c
+++ linux-ml.git/fs/proc/task_mmu.c
@@ -53,6 +53,7 @@ void task_mem(struct seq_file *m, struct
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
+		"VmAnon:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
 		"VmLib:\t%8lu kB\n"
@@ -66,6 +67,7 @@ void task_mem(struct seq_file *m, struct
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
+		mm->anon_vm << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
 		pmds >> 10,
Index: linux-ml.git/include/linux/mm_types.h
===================================================================
--- linux-ml.git.orig/include/linux/mm_types.h
+++ linux-ml.git/include/linux/mm_types.h
@@ -429,6 +429,7 @@ struct mm_struct {
 	unsigned long shared_vm;	/* Shared pages (files) */
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
 	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
+	unsigned long anon_vm;		/* Anonymous pages mapped */
 	unsigned long def_flags;
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
Index: linux-ml.git/mm/mmap.c
===================================================================
--- linux-ml.git.orig/mm/mmap.c
+++ linux-ml.git/mm/mmap.c
@@ -309,16 +309,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	if (brk < min_brk)
 		goto out;
 
-	/*
-	 * Check against rlimit here. If this check is done later after the test
-	 * of oldbrk with newbrk then it can escape the test and let the data
-	 * segment grow beyond its set limit the in case where the limit is
-	 * not page aligned -Ram Gupta
-	 */
-	if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
-			      mm->end_data, mm->start_data))
-		goto out;
-
 	newbrk = PAGE_ALIGN(brk);
 	oldbrk = PAGE_ALIGN(mm->brk);
 	if (oldbrk == newbrk)
@@ -1223,6 +1213,9 @@ void vm_stat_account(struct mm_struct *m
 			mm->exec_vm += pages;
 	} else if (flags & stack_flags)
 		mm->stack_vm += pages;
+
+	if (anon_accountable_mapping(file, flags))
+		mm->anon_vm += pages;
 }
 #endif /* CONFIG_PROC_FS */
 
@@ -1578,6 +1571,14 @@ unsigned long mmap_region(struct file *f
 	}
 
 	/*
+	 * For anon mappings make sure we don't exceed the limit.
+	 */
+	if (anon_accountable_mapping(file, vm_flags)) {
+		if (!may_expand_anon_vm(mm, len >> PAGE_SHIFT))
+			return -ENOMEM;
+	}
+
+	/*
 	 * Can we just expand an old mapping?
 	 */
 	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
@@ -2760,7 +2761,8 @@ static unsigned long do_brk(unsigned lon
 	}
 
 	/* Check against address space limits *after* clearing old maps... */
-	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
+	if (!may_expand_vm(mm, len >> PAGE_SHIFT) ||
+	    !may_expand_anon_vm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
 	if (mm->map_count > sysctl_max_map_count)
@@ -2795,6 +2797,7 @@ static unsigned long do_brk(unsigned lon
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
+	mm->anon_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
@@ -2982,20 +2985,27 @@ out:
 	return NULL;
 }
 
+static inline int __may_expand_vm(unsigned int limit,
+				  unsigned long cur,
+				  unsigned long npages)
+{
+	unsigned long lim = rlimit(limit) >> PAGE_SHIFT;
+
+	return ((cur + npages) > lim) ? 0 : 1;
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
  */
 int may_expand_vm(struct mm_struct *mm, unsigned long npages)
 {
-	unsigned long cur = mm->total_vm;	/* pages */
-	unsigned long lim;
-
-	lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
+	return __may_expand_vm(RLIMIT_AS, mm->total_vm, npages);
+}
 
-	if (cur + npages > lim)
-		return 0;
-	return 1;
+int may_expand_anon_vm(struct mm_struct *mm, unsigned long npages)
+{
+	return __may_expand_vm(RLIMIT_DATA, mm->anon_vm, npages);
 }
 
 static int special_mapping_fault(struct vm_area_struct *vma,
Index: linux-ml.git/mm/mremap.c
===================================================================
--- linux-ml.git.orig/mm/mremap.c
+++ linux-ml.git/mm/mremap.c
@@ -382,6 +382,11 @@ static struct vm_area_struct *vma_to_res
 	if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
 		return ERR_PTR(-ENOMEM);
 
+	if (anon_accountable_mapping(vma->vm_file, vma->vm_flags)) {
+		if (!may_expand_anon_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
+			return ERR_PTR(-ENOMEM);
+	}
+
 	if (vma->vm_flags & VM_ACCOUNT) {
 		unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
 		if (security_vm_enough_memory_mm(mm, charged))

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
  2015-12-13 20:14 ` Cyrill Gorcunov
@ 2015-12-14  8:12   ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 16+ messages in thread
From: Konstantin Khlebnikov @ 2015-12-14  8:12 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Linux Kernel Mailing List, linux-mm, Quentin Casasnovas,
	Vegard Nossum, Linus Torvalds, Willy Tarreau, Andy Lutomirski,
	Kees Cook, Vladimir Davydov, Pavel Emelyanov, Peter Zijlstra,
	Cyrill Gorcunov

On Sun, Dec 13, 2015 at 11:14 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> When inspecting a vague code inside prctl(PR_SET_MM_MEM)
> call (which testing the RLIMIT_DATA value to figure out
> if we're allowed to assign new @start_brk, @brk, @start_data,
> @end_data from mm_struct) it's been commited that RLIMIT_DATA
> in a form it's implemented now doesn't do anything useful
> because most of user-space libraries use mmap() syscall
> for dynamic memory allocations.
>
> So Linus suggested to convert RLIMIT_DATA rlimit into something
> suitable for anonymous memory accounting. Here we introduce
> new @anon_vm member into mm descriptor which is updated
> on every vm_stat_account() call.

I dont like this part. VmData already here but  you add new VmAnon
which has even more vague meaning. And after that RLIMIT_DATA limits
VmAnon not VmData.

I have alternative solution. have a look. (patch follows)

>
> Tests for RLIMIT_DATA are done in three places:
>  - mmap_region: when user calls mmap() helper
>  - do_brk: for brk() helper
>  - vma_to_resize: when mremap() is used
>
> The do_brk() is also used in vm_brk() which is called
> when the kernel loads Elf and aout files to execute.
>
> Because test for limit is done in do_brk helper we
> no longer need to call check_data_rlimit here.
>
> v2:
>  - update doc
>  - add may_expand_anon_vm helper
>  - call for RLIMIT_DATA test in mremap and do_brk
>
> CC: Quentin Casasnovas <quentin.casasnovas@oracle.com>
> CC: Vegard Nossum <vegard.nossum@oracle.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Willy Tarreau <w@1wt.eu>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Kees Cook <keescook@google.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Konstantin Khlebnikov <koct9i@gmail.com>
> CC: Pavel Emelyanov <xemul@virtuozzo.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> ---
>  Documentation/filesystems/proc.txt |    1
>  fs/proc/task_mmu.c                 |    2 +
>  include/linux/mm_types.h           |    1
>  mm/mmap.c                          |   46 ++++++++++++++++++++++---------------
>  mm/mremap.c                        |    5 ++++
>  5 files changed, 37 insertions(+), 18 deletions(-)
>
> Index: linux-ml.git/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux-ml.git.orig/Documentation/filesystems/proc.txt
> +++ linux-ml.git/Documentation/filesystems/proc.txt
> @@ -233,6 +233,7 @@ Table 1-2: Contents of the status files
>   VmHWM                       peak resident set size ("high water mark")
>   VmRSS                       size of memory portions
>   VmData                      size of data, stack, and text segments
> + VmAnon                      size of anonymous private memory (not file backended)
>   VmStk                       size of data, stack, and text segments
>   VmExe                       size of text segment
>   VmLib                       size of shared library code
> Index: linux-ml.git/fs/proc/task_mmu.c
> ===================================================================
> --- linux-ml.git.orig/fs/proc/task_mmu.c
> +++ linux-ml.git/fs/proc/task_mmu.c
> @@ -53,6 +53,7 @@ void task_mem(struct seq_file *m, struct
>                 "VmHWM:\t%8lu kB\n"
>                 "VmRSS:\t%8lu kB\n"
>                 "VmData:\t%8lu kB\n"
> +               "VmAnon:\t%8lu kB\n"
>                 "VmStk:\t%8lu kB\n"
>                 "VmExe:\t%8lu kB\n"
>                 "VmLib:\t%8lu kB\n"
> @@ -66,6 +67,7 @@ void task_mem(struct seq_file *m, struct
>                 hiwater_rss << (PAGE_SHIFT-10),
>                 total_rss << (PAGE_SHIFT-10),
>                 data << (PAGE_SHIFT-10),
> +               mm->anon_vm << (PAGE_SHIFT-10),
>                 mm->stack_vm << (PAGE_SHIFT-10), text, lib,
>                 ptes >> 10,
>                 pmds >> 10,
> Index: linux-ml.git/include/linux/mm_types.h
> ===================================================================
> --- linux-ml.git.orig/include/linux/mm_types.h
> +++ linux-ml.git/include/linux/mm_types.h
> @@ -429,6 +429,7 @@ struct mm_struct {
>         unsigned long shared_vm;        /* Shared pages (files) */
>         unsigned long exec_vm;          /* VM_EXEC & ~VM_WRITE */
>         unsigned long stack_vm;         /* VM_GROWSUP/DOWN */
> +       unsigned long anon_vm;          /* Anonymous pages mapped */
>         unsigned long def_flags;
>         unsigned long start_code, end_code, start_data, end_data;
>         unsigned long start_brk, brk, start_stack;
> Index: linux-ml.git/mm/mmap.c
> ===================================================================
> --- linux-ml.git.orig/mm/mmap.c
> +++ linux-ml.git/mm/mmap.c
> @@ -309,16 +309,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>         if (brk < min_brk)
>                 goto out;
>
> -       /*
> -        * Check against rlimit here. If this check is done later after the test
> -        * of oldbrk with newbrk then it can escape the test and let the data
> -        * segment grow beyond its set limit the in case where the limit is
> -        * not page aligned -Ram Gupta
> -        */
> -       if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
> -                             mm->end_data, mm->start_data))
> -               goto out;
> -
>         newbrk = PAGE_ALIGN(brk);
>         oldbrk = PAGE_ALIGN(mm->brk);
>         if (oldbrk == newbrk)
> @@ -1223,6 +1213,9 @@ void vm_stat_account(struct mm_struct *m
>                         mm->exec_vm += pages;
>         } else if (flags & stack_flags)
>                 mm->stack_vm += pages;
> +
> +       if (anon_accountable_mapping(file, flags))
> +               mm->anon_vm += pages;
>  }
>  #endif /* CONFIG_PROC_FS */
>
> @@ -1578,6 +1571,14 @@ unsigned long mmap_region(struct file *f
>         }
>
>         /*
> +        * For anon mappings make sure we don't exceed the limit.
> +        */
> +       if (anon_accountable_mapping(file, vm_flags)) {
> +               if (!may_expand_anon_vm(mm, len >> PAGE_SHIFT))
> +                       return -ENOMEM;
> +       }
> +
> +       /*
>          * Can we just expand an old mapping?
>          */
>         vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> @@ -2760,7 +2761,8 @@ static unsigned long do_brk(unsigned lon
>         }
>
>         /* Check against address space limits *after* clearing old maps... */
> -       if (!may_expand_vm(mm, len >> PAGE_SHIFT))
> +       if (!may_expand_vm(mm, len >> PAGE_SHIFT) ||
> +           !may_expand_anon_vm(mm, len >> PAGE_SHIFT))
>                 return -ENOMEM;
>
>         if (mm->map_count > sysctl_max_map_count)
> @@ -2795,6 +2797,7 @@ static unsigned long do_brk(unsigned lon
>  out:
>         perf_event_mmap(vma);
>         mm->total_vm += len >> PAGE_SHIFT;
> +       mm->anon_vm += len >> PAGE_SHIFT;
>         if (flags & VM_LOCKED)
>                 mm->locked_vm += (len >> PAGE_SHIFT);
>         vma->vm_flags |= VM_SOFTDIRTY;
> @@ -2982,20 +2985,27 @@ out:
>         return NULL;
>  }
>
> +static inline int __may_expand_vm(unsigned int limit,
> +                                 unsigned long cur,
> +                                 unsigned long npages)
> +{
> +       unsigned long lim = rlimit(limit) >> PAGE_SHIFT;
> +
> +       return ((cur + npages) > lim) ? 0 : 1;
> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
>   */
>  int may_expand_vm(struct mm_struct *mm, unsigned long npages)
>  {
> -       unsigned long cur = mm->total_vm;       /* pages */
> -       unsigned long lim;
> -
> -       lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
> +       return __may_expand_vm(RLIMIT_AS, mm->total_vm, npages);
> +}
>
> -       if (cur + npages > lim)
> -               return 0;
> -       return 1;
> +int may_expand_anon_vm(struct mm_struct *mm, unsigned long npages)
> +{
> +       return __may_expand_vm(RLIMIT_DATA, mm->anon_vm, npages);
>  }
>
>  static int special_mapping_fault(struct vm_area_struct *vma,
> Index: linux-ml.git/mm/mremap.c
> ===================================================================
> --- linux-ml.git.orig/mm/mremap.c
> +++ linux-ml.git/mm/mremap.c
> @@ -382,6 +382,11 @@ static struct vm_area_struct *vma_to_res
>         if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
>                 return ERR_PTR(-ENOMEM);
>
> +       if (anon_accountable_mapping(vma->vm_file, vma->vm_flags)) {
> +               if (!may_expand_anon_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
> +                       return ERR_PTR(-ENOMEM);
> +       }
> +
>         if (vma->vm_flags & VM_ACCOUNT) {
>                 unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
>                 if (security_vm_enough_memory_mm(mm, charged))
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
@ 2015-12-14  8:12   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 16+ messages in thread
From: Konstantin Khlebnikov @ 2015-12-14  8:12 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Linux Kernel Mailing List, linux-mm, Quentin Casasnovas,
	Vegard Nossum, Linus Torvalds, Willy Tarreau, Andy Lutomirski,
	Kees Cook, Vladimir Davydov, Pavel Emelyanov, Peter Zijlstra,
	Cyrill Gorcunov

On Sun, Dec 13, 2015 at 11:14 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> When inspecting a vague code inside prctl(PR_SET_MM_MEM)
> call (which testing the RLIMIT_DATA value to figure out
> if we're allowed to assign new @start_brk, @brk, @start_data,
> @end_data from mm_struct) it's been commited that RLIMIT_DATA
> in a form it's implemented now doesn't do anything useful
> because most of user-space libraries use mmap() syscall
> for dynamic memory allocations.
>
> So Linus suggested to convert RLIMIT_DATA rlimit into something
> suitable for anonymous memory accounting. Here we introduce
> new @anon_vm member into mm descriptor which is updated
> on every vm_stat_account() call.

I dont like this part. VmData already here but  you add new VmAnon
which has even more vague meaning. And after that RLIMIT_DATA limits
VmAnon not VmData.

I have alternative solution. have a look. (patch follows)

>
> Tests for RLIMIT_DATA are done in three places:
>  - mmap_region: when user calls mmap() helper
>  - do_brk: for brk() helper
>  - vma_to_resize: when mremap() is used
>
> The do_brk() is also used in vm_brk() which is called
> when the kernel loads Elf and aout files to execute.
>
> Because test for limit is done in do_brk helper we
> no longer need to call check_data_rlimit here.
>
> v2:
>  - update doc
>  - add may_expand_anon_vm helper
>  - call for RLIMIT_DATA test in mremap and do_brk
>
> CC: Quentin Casasnovas <quentin.casasnovas@oracle.com>
> CC: Vegard Nossum <vegard.nossum@oracle.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Willy Tarreau <w@1wt.eu>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Kees Cook <keescook@google.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Konstantin Khlebnikov <koct9i@gmail.com>
> CC: Pavel Emelyanov <xemul@virtuozzo.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> ---
>  Documentation/filesystems/proc.txt |    1
>  fs/proc/task_mmu.c                 |    2 +
>  include/linux/mm_types.h           |    1
>  mm/mmap.c                          |   46 ++++++++++++++++++++++---------------
>  mm/mremap.c                        |    5 ++++
>  5 files changed, 37 insertions(+), 18 deletions(-)
>
> Index: linux-ml.git/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux-ml.git.orig/Documentation/filesystems/proc.txt
> +++ linux-ml.git/Documentation/filesystems/proc.txt
> @@ -233,6 +233,7 @@ Table 1-2: Contents of the status files
>   VmHWM                       peak resident set size ("high water mark")
>   VmRSS                       size of memory portions
>   VmData                      size of data, stack, and text segments
> + VmAnon                      size of anonymous private memory (not file backended)
>   VmStk                       size of data, stack, and text segments
>   VmExe                       size of text segment
>   VmLib                       size of shared library code
> Index: linux-ml.git/fs/proc/task_mmu.c
> ===================================================================
> --- linux-ml.git.orig/fs/proc/task_mmu.c
> +++ linux-ml.git/fs/proc/task_mmu.c
> @@ -53,6 +53,7 @@ void task_mem(struct seq_file *m, struct
>                 "VmHWM:\t%8lu kB\n"
>                 "VmRSS:\t%8lu kB\n"
>                 "VmData:\t%8lu kB\n"
> +               "VmAnon:\t%8lu kB\n"
>                 "VmStk:\t%8lu kB\n"
>                 "VmExe:\t%8lu kB\n"
>                 "VmLib:\t%8lu kB\n"
> @@ -66,6 +67,7 @@ void task_mem(struct seq_file *m, struct
>                 hiwater_rss << (PAGE_SHIFT-10),
>                 total_rss << (PAGE_SHIFT-10),
>                 data << (PAGE_SHIFT-10),
> +               mm->anon_vm << (PAGE_SHIFT-10),
>                 mm->stack_vm << (PAGE_SHIFT-10), text, lib,
>                 ptes >> 10,
>                 pmds >> 10,
> Index: linux-ml.git/include/linux/mm_types.h
> ===================================================================
> --- linux-ml.git.orig/include/linux/mm_types.h
> +++ linux-ml.git/include/linux/mm_types.h
> @@ -429,6 +429,7 @@ struct mm_struct {
>         unsigned long shared_vm;        /* Shared pages (files) */
>         unsigned long exec_vm;          /* VM_EXEC & ~VM_WRITE */
>         unsigned long stack_vm;         /* VM_GROWSUP/DOWN */
> +       unsigned long anon_vm;          /* Anonymous pages mapped */
>         unsigned long def_flags;
>         unsigned long start_code, end_code, start_data, end_data;
>         unsigned long start_brk, brk, start_stack;
> Index: linux-ml.git/mm/mmap.c
> ===================================================================
> --- linux-ml.git.orig/mm/mmap.c
> +++ linux-ml.git/mm/mmap.c
> @@ -309,16 +309,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>         if (brk < min_brk)
>                 goto out;
>
> -       /*
> -        * Check against rlimit here. If this check is done later after the test
> -        * of oldbrk with newbrk then it can escape the test and let the data
> -        * segment grow beyond its set limit the in case where the limit is
> -        * not page aligned -Ram Gupta
> -        */
> -       if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
> -                             mm->end_data, mm->start_data))
> -               goto out;
> -
>         newbrk = PAGE_ALIGN(brk);
>         oldbrk = PAGE_ALIGN(mm->brk);
>         if (oldbrk == newbrk)
> @@ -1223,6 +1213,9 @@ void vm_stat_account(struct mm_struct *m
>                         mm->exec_vm += pages;
>         } else if (flags & stack_flags)
>                 mm->stack_vm += pages;
> +
> +       if (anon_accountable_mapping(file, flags))
> +               mm->anon_vm += pages;
>  }
>  #endif /* CONFIG_PROC_FS */
>
> @@ -1578,6 +1571,14 @@ unsigned long mmap_region(struct file *f
>         }
>
>         /*
> +        * For anon mappings make sure we don't exceed the limit.
> +        */
> +       if (anon_accountable_mapping(file, vm_flags)) {
> +               if (!may_expand_anon_vm(mm, len >> PAGE_SHIFT))
> +                       return -ENOMEM;
> +       }
> +
> +       /*
>          * Can we just expand an old mapping?
>          */
>         vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> @@ -2760,7 +2761,8 @@ static unsigned long do_brk(unsigned lon
>         }
>
>         /* Check against address space limits *after* clearing old maps... */
> -       if (!may_expand_vm(mm, len >> PAGE_SHIFT))
> +       if (!may_expand_vm(mm, len >> PAGE_SHIFT) ||
> +           !may_expand_anon_vm(mm, len >> PAGE_SHIFT))
>                 return -ENOMEM;
>
>         if (mm->map_count > sysctl_max_map_count)
> @@ -2795,6 +2797,7 @@ static unsigned long do_brk(unsigned lon
>  out:
>         perf_event_mmap(vma);
>         mm->total_vm += len >> PAGE_SHIFT;
> +       mm->anon_vm += len >> PAGE_SHIFT;
>         if (flags & VM_LOCKED)
>                 mm->locked_vm += (len >> PAGE_SHIFT);
>         vma->vm_flags |= VM_SOFTDIRTY;
> @@ -2982,20 +2985,27 @@ out:
>         return NULL;
>  }
>
> +static inline int __may_expand_vm(unsigned int limit,
> +                                 unsigned long cur,
> +                                 unsigned long npages)
> +{
> +       unsigned long lim = rlimit(limit) >> PAGE_SHIFT;
> +
> +       return ((cur + npages) > lim) ? 0 : 1;
> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
>   */
>  int may_expand_vm(struct mm_struct *mm, unsigned long npages)
>  {
> -       unsigned long cur = mm->total_vm;       /* pages */
> -       unsigned long lim;
> -
> -       lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
> +       return __may_expand_vm(RLIMIT_AS, mm->total_vm, npages);
> +}
>
> -       if (cur + npages > lim)
> -               return 0;
> -       return 1;
> +int may_expand_anon_vm(struct mm_struct *mm, unsigned long npages)
> +{
> +       return __may_expand_vm(RLIMIT_DATA, mm->anon_vm, npages);
>  }
>
>  static int special_mapping_fault(struct vm_area_struct *vma,
> Index: linux-ml.git/mm/mremap.c
> ===================================================================
> --- linux-ml.git.orig/mm/mremap.c
> +++ linux-ml.git/mm/mremap.c
> @@ -382,6 +382,11 @@ static struct vm_area_struct *vma_to_res
>         if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
>                 return ERR_PTR(-ENOMEM);
>
> +       if (anon_accountable_mapping(vma->vm_file, vma->vm_flags)) {
> +               if (!may_expand_anon_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
> +                       return ERR_PTR(-ENOMEM);
> +       }
> +
>         if (vma->vm_flags & VM_ACCOUNT) {
>                 unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
>                 if (security_vm_enough_memory_mm(mm, charged))
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RFC] mm: rework virtual memory accounting
  2015-12-14  8:12   ` Konstantin Khlebnikov
@ 2015-12-14  8:12     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 16+ messages in thread
From: Konstantin Khlebnikov @ 2015-12-14  8:12 UTC (permalink / raw)
  To: Cyrill Gorcunov, linux-mm, linux-kernel

Here several rated changes bundled together:
* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas

This way code looks cleaner: now code/stack/data
classification depends only on vm_flags state:

VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared",
but that might be strange beasts like readonly-private or VM_IO areas.

RLIMIT_AS limits whole address space "VmSize"
RLIMIT_STACK limits stack "VmStk" (but each vma individually)
RLIMIT_DATA now limits "VmData"

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 arch/ia64/kernel/perfmon.c |    3 +-
 fs/proc/task_mmu.c         |    7 ++---
 include/linux/mm.h         |   13 ++-------
 include/linux/mm_types.h   |    2 +
 kernel/fork.c              |    5 +--
 mm/debug.c                 |    4 +--
 mm/mmap.c                  |   63 ++++++++++++++++++++++----------------------
 mm/mprotect.c              |    8 ++++--
 mm/mremap.c                |    7 +++--
 9 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 60e02f7747ff..9cd607b06964 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2332,8 +2332,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 	 */
 	insert_vm_struct(mm, vma);
 
-	vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
-							vma_pages(vma));
+	vm_stat_account(vma->vm_mm, vma->vm_flags, vma_pages(vma));
 	up_write(&task->mm->mmap_sem);
 
 	/*
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187b3b5f242e..58e73af17af9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
 
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
-	unsigned long data, text, lib, swap, ptes, pmds;
+	unsigned long text, lib, swap, ptes, pmds;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
 	/*
@@ -39,7 +39,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	if (hiwater_rss < mm->hiwater_rss)
 		hiwater_rss = mm->hiwater_rss;
 
-	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
 	text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
 	lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
 	swap = get_mm_counter(mm, MM_SWAPENTS);
@@ -65,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
-		data << (PAGE_SHIFT-10),
+		mm->data_vm << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
 		pmds >> 10,
@@ -85,7 +84,7 @@ unsigned long task_statm(struct mm_struct *mm,
 	*shared = get_mm_counter(mm, MM_FILEPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
-	*data = mm->total_vm - mm->shared_vm;
+	*data = mm->data_vm + mm->stack_vm;
 	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
 	return mm->total_vm;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00bad7793788..6816a48b6542 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1898,7 +1898,9 @@ extern void mm_drop_all_locks(struct mm_struct *mm);
 extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
 extern struct file *get_mm_exe_file(struct mm_struct *mm);
 
-extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
+extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages);
+extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages);
+
 extern struct vm_area_struct *_install_special_mapping(struct mm_struct *mm,
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags,
@@ -2116,15 +2118,6 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
 
-#ifdef CONFIG_PROC_FS
-void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
-#else
-static inline void vm_stat_account(struct mm_struct *mm,
-			unsigned long flags, struct file *file, long pages)
-{
-	mm->total_vm += pages;
-}
-#endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 extern bool _debug_pagealloc_enabled;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f8d1492a114f..fa114a255b11 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -426,7 +426,7 @@ struct mm_struct {
 	unsigned long total_vm;		/* Total pages mapped */
 	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
 	unsigned long pinned_vm;	/* Refcount permanently increased */
-	unsigned long shared_vm;	/* Shared pages (files) */
+	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED/GROWSDOWN */
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
 	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
 	unsigned long def_flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index fce002ee3ddf..1205601c2174 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -413,7 +413,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
 
 	mm->total_vm = oldmm->total_vm;
-	mm->shared_vm = oldmm->shared_vm;
+	mm->data_vm = oldmm->data_vm;
 	mm->exec_vm = oldmm->exec_vm;
 	mm->stack_vm = oldmm->stack_vm;
 
@@ -432,8 +432,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		struct file *file;
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
-			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
-							-vma_pages(mpnt));
+			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
 		}
 		charge = 0;
diff --git a/mm/debug.c b/mm/debug.c
index 668aa35191ca..5d2072ed8d5e 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -175,7 +175,7 @@ void dump_mm(const struct mm_struct *mm)
 		"mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
 		"pgd %p mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n"
 		"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
-		"pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n"
+		"pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n"
 		"start_code %lx end_code %lx start_data %lx end_data %lx\n"
 		"start_brk %lx brk %lx start_stack %lx\n"
 		"arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
@@ -209,7 +209,7 @@ void dump_mm(const struct mm_struct *mm)
 		mm_nr_pmds((struct mm_struct *)mm),
 		mm->map_count,
 		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
-		mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm,
+		mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
 		mm->start_code, mm->end_code, mm->start_data, mm->end_data,
 		mm->start_brk, mm->brk, mm->start_stack,
 		mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
diff --git a/mm/mmap.c b/mm/mmap.c
index 2ce04a649f6b..e38f9d4af9c8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1208,24 +1208,6 @@ none:
 	return NULL;
 }
 
-#ifdef CONFIG_PROC_FS
-void vm_stat_account(struct mm_struct *mm, unsigned long flags,
-						struct file *file, long pages)
-{
-	const unsigned long stack_flags
-		= VM_STACK_FLAGS & (VM_GROWSUP|VM_GROWSDOWN);
-
-	mm->total_vm += pages;
-
-	if (file) {
-		mm->shared_vm += pages;
-		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
-			mm->exec_vm += pages;
-	} else if (flags & stack_flags)
-		mm->stack_vm += pages;
-}
-#endif /* CONFIG_PROC_FS */
-
 /*
  * If a hint addr is less than mmap_min_addr change hint to be as
  * low as possible but still greater than mmap_min_addr
@@ -1544,7 +1526,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long charged = 0;
 
 	/* Check against address space limit. */
-	if (!may_expand_vm(mm, len >> PAGE_SHIFT)) {
+	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
 		unsigned long nr_pages;
 
 		/*
@@ -1556,7 +1538,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 		nr_pages = count_vma_pages_range(mm, addr, addr + len);
 
-		if (!may_expand_vm(mm, (len >> PAGE_SHIFT) - nr_pages))
+		if (!may_expand_vm(mm, vm_flags,
+					(len >> PAGE_SHIFT) - nr_pages))
 			return -ENOMEM;
 	}
 
@@ -1655,7 +1638,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
-	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
+	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm)))
@@ -2102,7 +2085,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
 	unsigned long new_start, actual_size;
 
 	/* address space limit tests */
-	if (!may_expand_vm(mm, grow))
+	if (!may_expand_vm(mm, vma->vm_flags, grow))
 		return -ENOMEM;
 
 	/* Stack limit test */
@@ -2199,8 +2182,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				spin_lock(&mm->page_table_lock);
 				if (vma->vm_flags & VM_LOCKED)
 					mm->locked_vm += grow;
-				vm_stat_account(mm, vma->vm_flags,
-						vma->vm_file, grow);
+				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
 				anon_vma_interval_tree_post_update_vma(vma);
@@ -2275,8 +2257,7 @@ int expand_downwards(struct vm_area_struct *vma,
 				spin_lock(&mm->page_table_lock);
 				if (vma->vm_flags & VM_LOCKED)
 					mm->locked_vm += grow;
-				vm_stat_account(mm, vma->vm_flags,
-						vma->vm_file, grow);
+				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;
@@ -2390,7 +2371,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
 
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += nrpages;
-		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
+		vm_stat_account(mm, vma->vm_flags, -nrpages);
 		vma = remove_vma(vma);
 	} while (vma);
 	vm_unacct_memory(nr_accounted);
@@ -2760,7 +2741,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 	}
 
 	/* Check against address space limits *after* clearing old maps... */
-	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
+	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
 	if (mm->map_count > sysctl_max_map_count)
@@ -2795,6 +2776,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
+	mm->data_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
@@ -2986,7 +2968,7 @@ out:
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
  */
-int may_expand_vm(struct mm_struct *mm, unsigned long npages)
+bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages)
 {
 	unsigned long cur = mm->total_vm;	/* pages */
 	unsigned long lim;
@@ -2994,8 +2976,25 @@ int may_expand_vm(struct mm_struct *mm, unsigned long npages)
 	lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
 
 	if (cur + npages > lim)
-		return 0;
-	return 1;
+		return false;
+
+	if ((flags & (VM_WRITE | VM_SHARED | (VM_STACK_FLAGS &
+				(VM_GROWSUP | VM_GROWSDOWN)))) == VM_WRITE)
+		return mm->data_vm + npages <= rlimit(RLIMIT_DATA);
+
+	return true;
+}
+
+void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
+{
+	mm->total_vm += npages;
+
+	if ((flags & (VM_EXEC | VM_WRITE)) == VM_EXEC)
+		mm->exec_vm += npages;
+	else if (flags & (VM_STACK_FLAGS & (VM_GROWSUP | VM_GROWSDOWN)))
+		mm->stack_vm += npages;
+	else if ((flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
+		mm->data_vm += npages;
 }
 
 static int special_mapping_fault(struct vm_area_struct *vma,
@@ -3077,7 +3076,7 @@ static struct vm_area_struct *__install_special_mapping(
 	if (ret)
 		goto out;
 
-	mm->total_vm += len >> PAGE_SHIFT;
+	vm_stat_account(mm, vma->vm_flags, len >> PAGE_SHIFT);
 
 	perf_event_mmap(vma);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ef5be8eaab00..c764402c464f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -278,6 +278,10 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * even if read-only so there is no need to account for them here
 	 */
 	if (newflags & VM_WRITE) {
+		/* Check space limits when area turns into data. */
+		if (!may_expand_vm(mm, newflags, nrpages) &&
+				may_expand_vm(mm, oldflags, nrpages))
+			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
 						VM_SHARED|VM_NORESERVE))) {
 			charged = nrpages;
@@ -334,8 +338,8 @@ success:
 		populate_vma_page_range(vma, start, end, NULL);
 	}
 
-	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
-	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
+	vm_stat_account(mm, oldflags, -nrpages);
+	vm_stat_account(mm, newflags, nrpages);
 	perf_event_mmap(vma);
 	return 0;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index c25bc6268e46..4fe3248426d1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -317,7 +317,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	 * If this were a serious issue, we'd add a flag to do_munmap().
 	 */
 	hiwater_vm = mm->hiwater_vm;
-	vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
+	vm_stat_account(mm, vma->vm_flags, new_len >> PAGE_SHIFT);
 
 	if (do_munmap(mm, old_addr, old_len) < 0) {
 		/* OOM: unable to split vma, just get accounts right */
@@ -379,7 +379,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 			return ERR_PTR(-EAGAIN);
 	}
 
-	if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
+	if (!may_expand_vm(mm, vma->vm_flags,
+				(new_len - old_len) >> PAGE_SHIFT))
 		return ERR_PTR(-ENOMEM);
 
 	if (vma->vm_flags & VM_ACCOUNT) {
@@ -541,7 +542,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 				goto out;
 			}
 
-			vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
+			vm_stat_account(mm, vma->vm_flags, pages);
 			if (vma->vm_flags & VM_LOCKED) {
 				mm->locked_vm += pages;
 				locked = true;


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH RFC] mm: rework virtual memory accounting
@ 2015-12-14  8:12     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 16+ messages in thread
From: Konstantin Khlebnikov @ 2015-12-14  8:12 UTC (permalink / raw)
  To: Cyrill Gorcunov, linux-mm, linux-kernel

Here several rated changes bundled together:
* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas

This way code looks cleaner: now code/stack/data
classification depends only on vm_flags state:

VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared",
but that might be strange beasts like readonly-private or VM_IO areas.

RLIMIT_AS limits whole address space "VmSize"
RLIMIT_STACK limits stack "VmStk" (but each vma individually)
RLIMIT_DATA now limits "VmData"

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 arch/ia64/kernel/perfmon.c |    3 +-
 fs/proc/task_mmu.c         |    7 ++---
 include/linux/mm.h         |   13 ++-------
 include/linux/mm_types.h   |    2 +
 kernel/fork.c              |    5 +--
 mm/debug.c                 |    4 +--
 mm/mmap.c                  |   63 ++++++++++++++++++++++----------------------
 mm/mprotect.c              |    8 ++++--
 mm/mremap.c                |    7 +++--
 9 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 60e02f7747ff..9cd607b06964 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2332,8 +2332,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 	 */
 	insert_vm_struct(mm, vma);
 
-	vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
-							vma_pages(vma));
+	vm_stat_account(vma->vm_mm, vma->vm_flags, vma_pages(vma));
 	up_write(&task->mm->mmap_sem);
 
 	/*
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187b3b5f242e..58e73af17af9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
 
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
-	unsigned long data, text, lib, swap, ptes, pmds;
+	unsigned long text, lib, swap, ptes, pmds;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
 	/*
@@ -39,7 +39,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	if (hiwater_rss < mm->hiwater_rss)
 		hiwater_rss = mm->hiwater_rss;
 
-	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
 	text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
 	lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
 	swap = get_mm_counter(mm, MM_SWAPENTS);
@@ -65,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
-		data << (PAGE_SHIFT-10),
+		mm->data_vm << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
 		pmds >> 10,
@@ -85,7 +84,7 @@ unsigned long task_statm(struct mm_struct *mm,
 	*shared = get_mm_counter(mm, MM_FILEPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
-	*data = mm->total_vm - mm->shared_vm;
+	*data = mm->data_vm + mm->stack_vm;
 	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
 	return mm->total_vm;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00bad7793788..6816a48b6542 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1898,7 +1898,9 @@ extern void mm_drop_all_locks(struct mm_struct *mm);
 extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
 extern struct file *get_mm_exe_file(struct mm_struct *mm);
 
-extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
+extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages);
+extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages);
+
 extern struct vm_area_struct *_install_special_mapping(struct mm_struct *mm,
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags,
@@ -2116,15 +2118,6 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
 
-#ifdef CONFIG_PROC_FS
-void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
-#else
-static inline void vm_stat_account(struct mm_struct *mm,
-			unsigned long flags, struct file *file, long pages)
-{
-	mm->total_vm += pages;
-}
-#endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 extern bool _debug_pagealloc_enabled;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f8d1492a114f..fa114a255b11 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -426,7 +426,7 @@ struct mm_struct {
 	unsigned long total_vm;		/* Total pages mapped */
 	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
 	unsigned long pinned_vm;	/* Refcount permanently increased */
-	unsigned long shared_vm;	/* Shared pages (files) */
+	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED/GROWSDOWN */
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
 	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
 	unsigned long def_flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index fce002ee3ddf..1205601c2174 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -413,7 +413,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
 
 	mm->total_vm = oldmm->total_vm;
-	mm->shared_vm = oldmm->shared_vm;
+	mm->data_vm = oldmm->data_vm;
 	mm->exec_vm = oldmm->exec_vm;
 	mm->stack_vm = oldmm->stack_vm;
 
@@ -432,8 +432,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		struct file *file;
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
-			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
-							-vma_pages(mpnt));
+			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
 		}
 		charge = 0;
diff --git a/mm/debug.c b/mm/debug.c
index 668aa35191ca..5d2072ed8d5e 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -175,7 +175,7 @@ void dump_mm(const struct mm_struct *mm)
 		"mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
 		"pgd %p mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n"
 		"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
-		"pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n"
+		"pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n"
 		"start_code %lx end_code %lx start_data %lx end_data %lx\n"
 		"start_brk %lx brk %lx start_stack %lx\n"
 		"arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
@@ -209,7 +209,7 @@ void dump_mm(const struct mm_struct *mm)
 		mm_nr_pmds((struct mm_struct *)mm),
 		mm->map_count,
 		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
-		mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm,
+		mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
 		mm->start_code, mm->end_code, mm->start_data, mm->end_data,
 		mm->start_brk, mm->brk, mm->start_stack,
 		mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
diff --git a/mm/mmap.c b/mm/mmap.c
index 2ce04a649f6b..e38f9d4af9c8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1208,24 +1208,6 @@ none:
 	return NULL;
 }
 
-#ifdef CONFIG_PROC_FS
-void vm_stat_account(struct mm_struct *mm, unsigned long flags,
-						struct file *file, long pages)
-{
-	const unsigned long stack_flags
-		= VM_STACK_FLAGS & (VM_GROWSUP|VM_GROWSDOWN);
-
-	mm->total_vm += pages;
-
-	if (file) {
-		mm->shared_vm += pages;
-		if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC)
-			mm->exec_vm += pages;
-	} else if (flags & stack_flags)
-		mm->stack_vm += pages;
-}
-#endif /* CONFIG_PROC_FS */
-
 /*
  * If a hint addr is less than mmap_min_addr change hint to be as
  * low as possible but still greater than mmap_min_addr
@@ -1544,7 +1526,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long charged = 0;
 
 	/* Check against address space limit. */
-	if (!may_expand_vm(mm, len >> PAGE_SHIFT)) {
+	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
 		unsigned long nr_pages;
 
 		/*
@@ -1556,7 +1538,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 		nr_pages = count_vma_pages_range(mm, addr, addr + len);
 
-		if (!may_expand_vm(mm, (len >> PAGE_SHIFT) - nr_pages))
+		if (!may_expand_vm(mm, vm_flags,
+					(len >> PAGE_SHIFT) - nr_pages))
 			return -ENOMEM;
 	}
 
@@ -1655,7 +1638,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
-	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
+	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm)))
@@ -2102,7 +2085,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
 	unsigned long new_start, actual_size;
 
 	/* address space limit tests */
-	if (!may_expand_vm(mm, grow))
+	if (!may_expand_vm(mm, vma->vm_flags, grow))
 		return -ENOMEM;
 
 	/* Stack limit test */
@@ -2199,8 +2182,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				spin_lock(&mm->page_table_lock);
 				if (vma->vm_flags & VM_LOCKED)
 					mm->locked_vm += grow;
-				vm_stat_account(mm, vma->vm_flags,
-						vma->vm_file, grow);
+				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
 				anon_vma_interval_tree_post_update_vma(vma);
@@ -2275,8 +2257,7 @@ int expand_downwards(struct vm_area_struct *vma,
 				spin_lock(&mm->page_table_lock);
 				if (vma->vm_flags & VM_LOCKED)
 					mm->locked_vm += grow;
-				vm_stat_account(mm, vma->vm_flags,
-						vma->vm_file, grow);
+				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;
@@ -2390,7 +2371,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
 
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += nrpages;
-		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
+		vm_stat_account(mm, vma->vm_flags, -nrpages);
 		vma = remove_vma(vma);
 	} while (vma);
 	vm_unacct_memory(nr_accounted);
@@ -2760,7 +2741,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 	}
 
 	/* Check against address space limits *after* clearing old maps... */
-	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
+	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
 	if (mm->map_count > sysctl_max_map_count)
@@ -2795,6 +2776,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
+	mm->data_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
@@ -2986,7 +2968,7 @@ out:
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
  */
-int may_expand_vm(struct mm_struct *mm, unsigned long npages)
+bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages)
 {
 	unsigned long cur = mm->total_vm;	/* pages */
 	unsigned long lim;
@@ -2994,8 +2976,25 @@ int may_expand_vm(struct mm_struct *mm, unsigned long npages)
 	lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
 
 	if (cur + npages > lim)
-		return 0;
-	return 1;
+		return false;
+
+	if ((flags & (VM_WRITE | VM_SHARED | (VM_STACK_FLAGS &
+				(VM_GROWSUP | VM_GROWSDOWN)))) == VM_WRITE)
+		return mm->data_vm + npages <= rlimit(RLIMIT_DATA);
+
+	return true;
+}
+
+void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
+{
+	mm->total_vm += npages;
+
+	if ((flags & (VM_EXEC | VM_WRITE)) == VM_EXEC)
+		mm->exec_vm += npages;
+	else if (flags & (VM_STACK_FLAGS & (VM_GROWSUP | VM_GROWSDOWN)))
+		mm->stack_vm += npages;
+	else if ((flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
+		mm->data_vm += npages;
 }
 
 static int special_mapping_fault(struct vm_area_struct *vma,
@@ -3077,7 +3076,7 @@ static struct vm_area_struct *__install_special_mapping(
 	if (ret)
 		goto out;
 
-	mm->total_vm += len >> PAGE_SHIFT;
+	vm_stat_account(mm, vma->vm_flags, len >> PAGE_SHIFT);
 
 	perf_event_mmap(vma);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ef5be8eaab00..c764402c464f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -278,6 +278,10 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * even if read-only so there is no need to account for them here
 	 */
 	if (newflags & VM_WRITE) {
+		/* Check space limits when area turns into data. */
+		if (!may_expand_vm(mm, newflags, nrpages) &&
+				may_expand_vm(mm, oldflags, nrpages))
+			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
 						VM_SHARED|VM_NORESERVE))) {
 			charged = nrpages;
@@ -334,8 +338,8 @@ success:
 		populate_vma_page_range(vma, start, end, NULL);
 	}
 
-	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
-	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
+	vm_stat_account(mm, oldflags, -nrpages);
+	vm_stat_account(mm, newflags, nrpages);
 	perf_event_mmap(vma);
 	return 0;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index c25bc6268e46..4fe3248426d1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -317,7 +317,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	 * If this were a serious issue, we'd add a flag to do_munmap().
 	 */
 	hiwater_vm = mm->hiwater_vm;
-	vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
+	vm_stat_account(mm, vma->vm_flags, new_len >> PAGE_SHIFT);
 
 	if (do_munmap(mm, old_addr, old_len) < 0) {
 		/* OOM: unable to split vma, just get accounts right */
@@ -379,7 +379,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 			return ERR_PTR(-EAGAIN);
 	}
 
-	if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
+	if (!may_expand_vm(mm, vma->vm_flags,
+				(new_len - old_len) >> PAGE_SHIFT))
 		return ERR_PTR(-ENOMEM);
 
 	if (vma->vm_flags & VM_ACCOUNT) {
@@ -541,7 +542,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 				goto out;
 			}
 
-			vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
+			vm_stat_account(mm, vma->vm_flags, pages);
 			if (vma->vm_flags & VM_LOCKED) {
 				mm->locked_vm += pages;
 				locked = true;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] mm: rework virtual memory accounting
  2015-12-14  8:12     ` Konstantin Khlebnikov
@ 2015-12-14  9:08       ` Cyrill Gorcunov
  -1 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-14  9:08 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-mm, linux-kernel, Linus Torvalds

On Mon, Dec 14, 2015 at 11:12:38AM +0300, Konstantin Khlebnikov wrote:
> Here several rated changes bundled together:
> * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
> * replace mm->shared_vm with better defined mm->data_vm
> * account anonymous executable areas as executable
> * account file-backed growsdown/up areas as stack
> * drop struct file* argument from vm_stat_account
> * enforce RLIMIT_DATA for size of data areas
> 
> This way code looks cleaner: now code/stack/data
> classification depends only on vm_flags state:
> 
> VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
> VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
> VM_WRITE & ~VM_SHARED & !stack -> data (VmData)
> 
> The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared",
> but that might be strange beasts like readonly-private or VM_IO areas.
> 
> RLIMIT_AS limits whole address space "VmSize"
> RLIMIT_STACK limits stack "VmStk" (but each vma individually)
> RLIMIT_DATA now limits "VmData"
> 
> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>

Looks OK to me. Lets wait for Linus' opinion.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] mm: rework virtual memory accounting
@ 2015-12-14  9:08       ` Cyrill Gorcunov
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-14  9:08 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-mm, linux-kernel, Linus Torvalds

On Mon, Dec 14, 2015 at 11:12:38AM +0300, Konstantin Khlebnikov wrote:
> Here several rated changes bundled together:
> * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
> * replace mm->shared_vm with better defined mm->data_vm
> * account anonymous executable areas as executable
> * account file-backed growsdown/up areas as stack
> * drop struct file* argument from vm_stat_account
> * enforce RLIMIT_DATA for size of data areas
> 
> This way code looks cleaner: now code/stack/data
> classification depends only on vm_flags state:
> 
> VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
> VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
> VM_WRITE & ~VM_SHARED & !stack -> data (VmData)
> 
> The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared",
> but that might be strange beasts like readonly-private or VM_IO areas.
> 
> RLIMIT_AS limits whole address space "VmSize"
> RLIMIT_STACK limits stack "VmStk" (but each vma individually)
> RLIMIT_DATA now limits "VmData"
> 
> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>

Looks OK to me. Lets wait for Linus' opinion.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
  2015-12-13 20:14 ` Cyrill Gorcunov
@ 2015-12-14 14:51   ` Quentin Casasnovas
  -1 siblings, 0 replies; 16+ messages in thread
From: Quentin Casasnovas @ 2015-12-14 14:51 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, linux-mm, Quentin Casasnovas, Vegard Nossum,
	Linus Torvalds, Willy Tarreau, Andy Lutomirski, Kees Cook,
	Vladimir Davydov, Konstantin Khlebnikov, Pavel Emelyanov,
	Peter Zijlstra, Cyrill Gorcunov

On Sun, Dec 13, 2015 at 11:14:19PM +0300, Cyrill Gorcunov wrote:
> When inspecting a vague code inside prctl(PR_SET_MM_MEM)
> call (which testing the RLIMIT_DATA value to figure out
> if we're allowed to assign new @start_brk, @brk, @start_data,
> @end_data from mm_struct) it's been commited that RLIMIT_DATA
> in a form it's implemented now doesn't do anything useful
> because most of user-space libraries use mmap() syscall
> for dynamic memory allocations.
> 
> So Linus suggested to convert RLIMIT_DATA rlimit into something
> suitable for anonymous memory accounting. Here we introduce
> new @anon_vm member into mm descriptor which is updated
> on every vm_stat_account() call.
> 
> Tests for RLIMIT_DATA are done in three places:
>  - mmap_region: when user calls mmap() helper
>  - do_brk: for brk() helper
>  - vma_to_resize: when mremap() is used
> 
> The do_brk() is also used in vm_brk() which is called
> when the kernel loads Elf and aout files to execute.
> 
> Because test for limit is done in do_brk helper we
> no longer need to call check_data_rlimit here.
> 
> v2:
>  - update doc
>  - add may_expand_anon_vm helper
>  - call for RLIMIT_DATA test in mremap and do_brk
> 
> CC: Quentin Casasnovas <quentin.casasnovas@oracle.com>
> CC: Vegard Nossum <vegard.nossum@oracle.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Willy Tarreau <w@1wt.eu>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Kees Cook <keescook@google.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Konstantin Khlebnikov <koct9i@gmail.com>
> CC: Pavel Emelyanov <xemul@virtuozzo.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> ---
>  Documentation/filesystems/proc.txt |    1 
>  fs/proc/task_mmu.c                 |    2 +
>  include/linux/mm_types.h           |    1 
>  mm/mmap.c                          |   46 ++++++++++++++++++++++---------------
>  mm/mremap.c                        |    5 ++++
>  5 files changed, 37 insertions(+), 18 deletions(-)
> 
> Index: linux-ml.git/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux-ml.git.orig/Documentation/filesystems/proc.txt
> +++ linux-ml.git/Documentation/filesystems/proc.txt
> @@ -233,6 +233,7 @@ Table 1-2: Contents of the status files
>   VmHWM                       peak resident set size ("high water mark")
>   VmRSS                       size of memory portions
>   VmData                      size of data, stack, and text segments
> + VmAnon                      size of anonymous private memory (not file backended)
>   VmStk                       size of data, stack, and text segments
>   VmExe                       size of text segment
>   VmLib                       size of shared library code
> Index: linux-ml.git/fs/proc/task_mmu.c
> ===================================================================
> --- linux-ml.git.orig/fs/proc/task_mmu.c
> +++ linux-ml.git/fs/proc/task_mmu.c
> @@ -53,6 +53,7 @@ void task_mem(struct seq_file *m, struct
>  		"VmHWM:\t%8lu kB\n"
>  		"VmRSS:\t%8lu kB\n"
>  		"VmData:\t%8lu kB\n"
> +		"VmAnon:\t%8lu kB\n"
>  		"VmStk:\t%8lu kB\n"
>  		"VmExe:\t%8lu kB\n"
>  		"VmLib:\t%8lu kB\n"
> @@ -66,6 +67,7 @@ void task_mem(struct seq_file *m, struct
>  		hiwater_rss << (PAGE_SHIFT-10),
>  		total_rss << (PAGE_SHIFT-10),
>  		data << (PAGE_SHIFT-10),
> +		mm->anon_vm << (PAGE_SHIFT-10),
>  		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
>  		ptes >> 10,
>  		pmds >> 10,
> Index: linux-ml.git/include/linux/mm_types.h
> ===================================================================
> --- linux-ml.git.orig/include/linux/mm_types.h
> +++ linux-ml.git/include/linux/mm_types.h
> @@ -429,6 +429,7 @@ struct mm_struct {
>  	unsigned long shared_vm;	/* Shared pages (files) */
>  	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
>  	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
> +	unsigned long anon_vm;		/* Anonymous pages mapped */
>  	unsigned long def_flags;
>  	unsigned long start_code, end_code, start_data, end_data;
>  	unsigned long start_brk, brk, start_stack;
> Index: linux-ml.git/mm/mmap.c
> ===================================================================
> --- linux-ml.git.orig/mm/mmap.c
> +++ linux-ml.git/mm/mmap.c
> @@ -309,16 +309,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>  	if (brk < min_brk)
>  		goto out;
>  
> -	/*
> -	 * Check against rlimit here. If this check is done later after the test
> -	 * of oldbrk with newbrk then it can escape the test and let the data
> -	 * segment grow beyond its set limit the in case where the limit is
> -	 * not page aligned -Ram Gupta
> -	 */
> -	if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
> -			      mm->end_data, mm->start_data))
> -		goto out;
> -
>  	newbrk = PAGE_ALIGN(brk);
>  	oldbrk = PAGE_ALIGN(mm->brk);
>  	if (oldbrk == newbrk)
> @@ -1223,6 +1213,9 @@ void vm_stat_account(struct mm_struct *m
>  			mm->exec_vm += pages;
>  	} else if (flags & stack_flags)
>  		mm->stack_vm += pages;
> +
> +	if (anon_accountable_mapping(file, flags))
> +		mm->anon_vm += pages;
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> @@ -1578,6 +1571,14 @@ unsigned long mmap_region(struct file *f
>  	}
>  
>  	/*
> +	 * For anon mappings make sure we don't exceed the limit.
> +	 */
> +	if (anon_accountable_mapping(file, vm_flags)) {
> +		if (!may_expand_anon_vm(mm, len >> PAGE_SHIFT))
> +			return -ENOMEM;
> +	}
> +
> +	/*
>  	 * Can we just expand an old mapping?
>  	 */
>  	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> @@ -2760,7 +2761,8 @@ static unsigned long do_brk(unsigned lon
>  	}
>  
>  	/* Check against address space limits *after* clearing old maps... */
> -	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
> +	if (!may_expand_vm(mm, len >> PAGE_SHIFT) ||
> +	    !may_expand_anon_vm(mm, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>  
>  	if (mm->map_count > sysctl_max_map_count)
> @@ -2795,6 +2797,7 @@ static unsigned long do_brk(unsigned lon
>  out:
>  	perf_event_mmap(vma);
>  	mm->total_vm += len >> PAGE_SHIFT;
> +	mm->anon_vm += len >> PAGE_SHIFT;
>  	if (flags & VM_LOCKED)
>  		mm->locked_vm += (len >> PAGE_SHIFT);
>  	vma->vm_flags |= VM_SOFTDIRTY;
> @@ -2982,20 +2985,27 @@ out:
>  	return NULL;
>  }
>  
> +static inline int __may_expand_vm(unsigned int limit,
> +				  unsigned long cur,
> +				  unsigned long npages)
> +{
> +	unsigned long lim = rlimit(limit) >> PAGE_SHIFT;
> +
> +	return ((cur + npages) > lim) ? 0 : 1;
> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
>   */
>  int may_expand_vm(struct mm_struct *mm, unsigned long npages)
>  {
> -	unsigned long cur = mm->total_vm;	/* pages */
> -	unsigned long lim;
> -
> -	lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
> +	return __may_expand_vm(RLIMIT_AS, mm->total_vm, npages);
> +}
>  
> -	if (cur + npages > lim)
> -		return 0;
> -	return 1;
> +int may_expand_anon_vm(struct mm_struct *mm, unsigned long npages)
> +{
> +	return __may_expand_vm(RLIMIT_DATA, mm->anon_vm, npages);
>  }
>

Do we want to fold may_expand_anon_vm() into may_expand_vm() (potentially
passing it the flags/struct file if needed) so there is just one such
helper function?  Rationale being that it then gets hard to see what
restricts what, and it's easy to miss one place.

For example, I couldn't find anything preventing a user to
mmap(MAP_GROWSDOWN) and uses that as a base to get pages that would not be
accounted for in your patch (making it a poor-man mremap()).

I only had a quick look so apologies if this is handled and I missed it :)

Quentin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
@ 2015-12-14 14:51   ` Quentin Casasnovas
  0 siblings, 0 replies; 16+ messages in thread
From: Quentin Casasnovas @ 2015-12-14 14:51 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, linux-mm, Quentin Casasnovas, Vegard Nossum,
	Linus Torvalds, Willy Tarreau, Andy Lutomirski, Kees Cook,
	Vladimir Davydov, Konstantin Khlebnikov, Pavel Emelyanov,
	Peter Zijlstra, Cyrill Gorcunov

On Sun, Dec 13, 2015 at 11:14:19PM +0300, Cyrill Gorcunov wrote:
> When inspecting a vague code inside prctl(PR_SET_MM_MEM)
> call (which testing the RLIMIT_DATA value to figure out
> if we're allowed to assign new @start_brk, @brk, @start_data,
> @end_data from mm_struct) it's been commited that RLIMIT_DATA
> in a form it's implemented now doesn't do anything useful
> because most of user-space libraries use mmap() syscall
> for dynamic memory allocations.
> 
> So Linus suggested to convert RLIMIT_DATA rlimit into something
> suitable for anonymous memory accounting. Here we introduce
> new @anon_vm member into mm descriptor which is updated
> on every vm_stat_account() call.
> 
> Tests for RLIMIT_DATA are done in three places:
>  - mmap_region: when user calls mmap() helper
>  - do_brk: for brk() helper
>  - vma_to_resize: when mremap() is used
> 
> The do_brk() is also used in vm_brk() which is called
> when the kernel loads Elf and aout files to execute.
> 
> Because test for limit is done in do_brk helper we
> no longer need to call check_data_rlimit here.
> 
> v2:
>  - update doc
>  - add may_expand_anon_vm helper
>  - call for RLIMIT_DATA test in mremap and do_brk
> 
> CC: Quentin Casasnovas <quentin.casasnovas@oracle.com>
> CC: Vegard Nossum <vegard.nossum@oracle.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Willy Tarreau <w@1wt.eu>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Kees Cook <keescook@google.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Konstantin Khlebnikov <koct9i@gmail.com>
> CC: Pavel Emelyanov <xemul@virtuozzo.com>
> CC: Vladimir Davydov <vdavydov@virtuozzo.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> ---
>  Documentation/filesystems/proc.txt |    1 
>  fs/proc/task_mmu.c                 |    2 +
>  include/linux/mm_types.h           |    1 
>  mm/mmap.c                          |   46 ++++++++++++++++++++++---------------
>  mm/mremap.c                        |    5 ++++
>  5 files changed, 37 insertions(+), 18 deletions(-)
> 
> Index: linux-ml.git/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux-ml.git.orig/Documentation/filesystems/proc.txt
> +++ linux-ml.git/Documentation/filesystems/proc.txt
> @@ -233,6 +233,7 @@ Table 1-2: Contents of the status files
>   VmHWM                       peak resident set size ("high water mark")
>   VmRSS                       size of memory portions
>   VmData                      size of data, stack, and text segments
> + VmAnon                      size of anonymous private memory (not file backended)
>   VmStk                       size of data, stack, and text segments
>   VmExe                       size of text segment
>   VmLib                       size of shared library code
> Index: linux-ml.git/fs/proc/task_mmu.c
> ===================================================================
> --- linux-ml.git.orig/fs/proc/task_mmu.c
> +++ linux-ml.git/fs/proc/task_mmu.c
> @@ -53,6 +53,7 @@ void task_mem(struct seq_file *m, struct
>  		"VmHWM:\t%8lu kB\n"
>  		"VmRSS:\t%8lu kB\n"
>  		"VmData:\t%8lu kB\n"
> +		"VmAnon:\t%8lu kB\n"
>  		"VmStk:\t%8lu kB\n"
>  		"VmExe:\t%8lu kB\n"
>  		"VmLib:\t%8lu kB\n"
> @@ -66,6 +67,7 @@ void task_mem(struct seq_file *m, struct
>  		hiwater_rss << (PAGE_SHIFT-10),
>  		total_rss << (PAGE_SHIFT-10),
>  		data << (PAGE_SHIFT-10),
> +		mm->anon_vm << (PAGE_SHIFT-10),
>  		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
>  		ptes >> 10,
>  		pmds >> 10,
> Index: linux-ml.git/include/linux/mm_types.h
> ===================================================================
> --- linux-ml.git.orig/include/linux/mm_types.h
> +++ linux-ml.git/include/linux/mm_types.h
> @@ -429,6 +429,7 @@ struct mm_struct {
>  	unsigned long shared_vm;	/* Shared pages (files) */
>  	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
>  	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
> +	unsigned long anon_vm;		/* Anonymous pages mapped */
>  	unsigned long def_flags;
>  	unsigned long start_code, end_code, start_data, end_data;
>  	unsigned long start_brk, brk, start_stack;
> Index: linux-ml.git/mm/mmap.c
> ===================================================================
> --- linux-ml.git.orig/mm/mmap.c
> +++ linux-ml.git/mm/mmap.c
> @@ -309,16 +309,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>  	if (brk < min_brk)
>  		goto out;
>  
> -	/*
> -	 * Check against rlimit here. If this check is done later after the test
> -	 * of oldbrk with newbrk then it can escape the test and let the data
> -	 * segment grow beyond its set limit the in case where the limit is
> -	 * not page aligned -Ram Gupta
> -	 */
> -	if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk,
> -			      mm->end_data, mm->start_data))
> -		goto out;
> -
>  	newbrk = PAGE_ALIGN(brk);
>  	oldbrk = PAGE_ALIGN(mm->brk);
>  	if (oldbrk == newbrk)
> @@ -1223,6 +1213,9 @@ void vm_stat_account(struct mm_struct *m
>  			mm->exec_vm += pages;
>  	} else if (flags & stack_flags)
>  		mm->stack_vm += pages;
> +
> +	if (anon_accountable_mapping(file, flags))
> +		mm->anon_vm += pages;
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> @@ -1578,6 +1571,14 @@ unsigned long mmap_region(struct file *f
>  	}
>  
>  	/*
> +	 * For anon mappings make sure we don't exceed the limit.
> +	 */
> +	if (anon_accountable_mapping(file, vm_flags)) {
> +		if (!may_expand_anon_vm(mm, len >> PAGE_SHIFT))
> +			return -ENOMEM;
> +	}
> +
> +	/*
>  	 * Can we just expand an old mapping?
>  	 */
>  	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> @@ -2760,7 +2761,8 @@ static unsigned long do_brk(unsigned lon
>  	}
>  
>  	/* Check against address space limits *after* clearing old maps... */
> -	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
> +	if (!may_expand_vm(mm, len >> PAGE_SHIFT) ||
> +	    !may_expand_anon_vm(mm, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>  
>  	if (mm->map_count > sysctl_max_map_count)
> @@ -2795,6 +2797,7 @@ static unsigned long do_brk(unsigned lon
>  out:
>  	perf_event_mmap(vma);
>  	mm->total_vm += len >> PAGE_SHIFT;
> +	mm->anon_vm += len >> PAGE_SHIFT;
>  	if (flags & VM_LOCKED)
>  		mm->locked_vm += (len >> PAGE_SHIFT);
>  	vma->vm_flags |= VM_SOFTDIRTY;
> @@ -2982,20 +2985,27 @@ out:
>  	return NULL;
>  }
>  
> +static inline int __may_expand_vm(unsigned int limit,
> +				  unsigned long cur,
> +				  unsigned long npages)
> +{
> +	unsigned long lim = rlimit(limit) >> PAGE_SHIFT;
> +
> +	return ((cur + npages) > lim) ? 0 : 1;
> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
>   */
>  int may_expand_vm(struct mm_struct *mm, unsigned long npages)
>  {
> -	unsigned long cur = mm->total_vm;	/* pages */
> -	unsigned long lim;
> -
> -	lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
> +	return __may_expand_vm(RLIMIT_AS, mm->total_vm, npages);
> +}
>  
> -	if (cur + npages > lim)
> -		return 0;
> -	return 1;
> +int may_expand_anon_vm(struct mm_struct *mm, unsigned long npages)
> +{
> +	return __may_expand_vm(RLIMIT_DATA, mm->anon_vm, npages);
>  }
>

Do we want to fold may_expand_anon_vm() into may_expand_vm() (potentially
passing it the flags/struct file if needed) so there is just one such
helper function?  Rationale being that it then gets hard to see what
restricts what, and it's easy to miss one place.

For example, I couldn't find anything preventing a user to
mmap(MAP_GROWSDOWN) and uses that as a base to get pages that would not be
accounted for in your patch (making it a poor-man mremap()).

I only had a quick look so apologies if this is handled and I missed it :)

Quentin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
  2015-12-14 14:51   ` Quentin Casasnovas
@ 2015-12-14 15:11     ` Cyrill Gorcunov
  -1 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-14 15:11 UTC (permalink / raw)
  To: Quentin Casasnovas
  Cc: linux-kernel, linux-mm, Vegard Nossum, Linus Torvalds,
	Willy Tarreau, Andy Lutomirski, Kees Cook, Vladimir Davydov,
	Konstantin Khlebnikov, Pavel Emelyanov, Peter Zijlstra

On Mon, Dec 14, 2015 at 03:51:26PM +0100, Quentin Casasnovas wrote:
...
> 
> Do we want to fold may_expand_anon_vm() into may_expand_vm() (potentially
> passing it the flags/struct file if needed) so there is just one such
> helper function?  Rationale being that it then gets hard to see what
> restricts what, and it's easy to miss one place.

I tried to make the patch small as possible (because otherwise indeed
I would have to pass @vm_file|@file as additional argument). This won't
be a problem but may_expand_vm is called way more times than
may_expand_anon_vm. That's the only rationale I followed.

> For example, I couldn't find anything preventing a user to
> mmap(MAP_GROWSDOWN) and uses that as a base to get pages that would not be
> accounted for in your patch (making it a poor-man mremap()).

growsup/down stand for stack usage iirc, so it was intentionally
not accounted here.

> 
> I only had a quick look so apologies if this is handled and I missed it :)

thanks for feedback! also take a look on Kostya's patch, I think it's
even better approach (and I like it more than mine).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
@ 2015-12-14 15:11     ` Cyrill Gorcunov
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-14 15:11 UTC (permalink / raw)
  To: Quentin Casasnovas
  Cc: linux-kernel, linux-mm, Vegard Nossum, Linus Torvalds,
	Willy Tarreau, Andy Lutomirski, Kees Cook, Vladimir Davydov,
	Konstantin Khlebnikov, Pavel Emelyanov, Peter Zijlstra

On Mon, Dec 14, 2015 at 03:51:26PM +0100, Quentin Casasnovas wrote:
...
> 
> Do we want to fold may_expand_anon_vm() into may_expand_vm() (potentially
> passing it the flags/struct file if needed) so there is just one such
> helper function?  Rationale being that it then gets hard to see what
> restricts what, and it's easy to miss one place.

I tried to make the patch small as possible (because otherwise indeed
I would have to pass @vm_file|@file as additional argument). This won't
be a problem but may_expand_vm is called way more times than
may_expand_anon_vm. That's the only rationale I followed.

> For example, I couldn't find anything preventing a user to
> mmap(MAP_GROWSDOWN) and uses that as a base to get pages that would not be
> accounted for in your patch (making it a poor-man mremap()).

growsup/down stand for stack usage iirc, so it was intentionally
not accounted here.

> 
> I only had a quick look so apologies if this is handled and I missed it :)

thanks for feedback! also take a look on Kostya's patch, I think it's
even better approach (and I like it more than mine).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
  2015-12-14 15:11     ` Cyrill Gorcunov
@ 2015-12-14 15:32       ` Quentin Casasnovas
  -1 siblings, 0 replies; 16+ messages in thread
From: Quentin Casasnovas @ 2015-12-14 15:32 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Quentin Casasnovas, linux-kernel, linux-mm, Vegard Nossum,
	Linus Torvalds, Willy Tarreau, Andy Lutomirski, Kees Cook,
	Vladimir Davydov, Konstantin Khlebnikov, Pavel Emelyanov,
	Peter Zijlstra

On Mon, Dec 14, 2015 at 06:11:16PM +0300, Cyrill Gorcunov wrote:
> On Mon, Dec 14, 2015 at 03:51:26PM +0100, Quentin Casasnovas wrote:
> ...
> > 
> > Do we want to fold may_expand_anon_vm() into may_expand_vm() (potentially
> > passing it the flags/struct file if needed) so there is just one such
> > helper function?  Rationale being that it then gets hard to see what
> > restricts what, and it's easy to miss one place.
> 
> I tried to make the patch small as possible (because otherwise indeed
> I would have to pass @vm_file|@file as additional argument). This won't
> be a problem but may_expand_vm is called way more times than
> may_expand_anon_vm. That's the only rationale I followed.
>
> > For example, I couldn't find anything preventing a user to
> > mmap(MAP_GROWSDOWN) and uses that as a base to get pages that would not be
> > accounted for in your patch (making it a poor-man mremap()).
> 
> growsup/down stand for stack usage iirc, so it was intentionally
> not accounted here.
>

Right, but in the same vein of Linus saying RLIMIT_DATA is/was useless
because everyone could use mmap() instead of brk() to get anonymous memory,
what's the point of restricting "almost-all" anonymous memory if one can
just use MAP_GROWSDOWN/UP and cause repeated page faults to extend that
mapping, circumventing your checks?  That makes the new restriction as
useless as what RLIMIT_DATA used to be, doesn't it?

> > 
> > I only had a quick look so apologies if this is handled and I missed it :)
> 
> thanks for feedback! also take a look on Kostya's patch, I think it's
> even better approach (and I like it more than mine).

Ha I'm not subscribed to LKML so I missed those, I suppose you can ignore
my comments then! :)

Quentin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
@ 2015-12-14 15:32       ` Quentin Casasnovas
  0 siblings, 0 replies; 16+ messages in thread
From: Quentin Casasnovas @ 2015-12-14 15:32 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Quentin Casasnovas, linux-kernel, linux-mm, Vegard Nossum,
	Linus Torvalds, Willy Tarreau, Andy Lutomirski, Kees Cook,
	Vladimir Davydov, Konstantin Khlebnikov, Pavel Emelyanov,
	Peter Zijlstra

On Mon, Dec 14, 2015 at 06:11:16PM +0300, Cyrill Gorcunov wrote:
> On Mon, Dec 14, 2015 at 03:51:26PM +0100, Quentin Casasnovas wrote:
> ...
> > 
> > Do we want to fold may_expand_anon_vm() into may_expand_vm() (potentially
> > passing it the flags/struct file if needed) so there is just one such
> > helper function?  Rationale being that it then gets hard to see what
> > restricts what, and it's easy to miss one place.
> 
> I tried to make the patch small as possible (because otherwise indeed
> I would have to pass @vm_file|@file as additional argument). This won't
> be a problem but may_expand_vm is called way more times than
> may_expand_anon_vm. That's the only rationale I followed.
>
> > For example, I couldn't find anything preventing a user to
> > mmap(MAP_GROWSDOWN) and uses that as a base to get pages that would not be
> > accounted for in your patch (making it a poor-man mremap()).
> 
> growsup/down stand for stack usage iirc, so it was intentionally
> not accounted here.
>

Right, but in the same vein of Linus saying RLIMIT_DATA is/was useless
because everyone could use mmap() instead of brk() to get anonymous memory,
what's the point of restricting "almost-all" anonymous memory if one can
just use MAP_GROWSDOWN/UP and cause repeated page faults to extend that
mapping, circumventing your checks?  That makes the new restriction as
useless as what RLIMIT_DATA used to be, doesn't it?

> > 
> > I only had a quick look so apologies if this is handled and I missed it :)
> 
> thanks for feedback! also take a look on Kostya's patch, I think it's
> even better approach (and I like it more than mine).

Ha I'm not subscribed to LKML so I missed those, I suppose you can ignore
my comments then! :)

Quentin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
  2015-12-14 15:32       ` Quentin Casasnovas
@ 2015-12-14 15:43         ` Cyrill Gorcunov
  -1 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-14 15:43 UTC (permalink / raw)
  To: Quentin Casasnovas
  Cc: linux-kernel, linux-mm, Vegard Nossum, Linus Torvalds,
	Willy Tarreau, Andy Lutomirski, Kees Cook, Vladimir Davydov,
	Konstantin Khlebnikov, Pavel Emelyanov, Peter Zijlstra

On Mon, Dec 14, 2015 at 04:32:34PM +0100, Quentin Casasnovas wrote:
> > 
> > growsup/down stand for stack usage iirc, so it was intentionally
> > not accounted here.
> >
> 
> Right, but in the same vein of Linus saying RLIMIT_DATA is/was useless
> because everyone could use mmap() instead of brk() to get anonymous memory,
> what's the point of restricting "almost-all" anonymous memory if one can
> just use MAP_GROWSDOWN/UP and cause repeated page faults to extend that
> mapping, circumventing your checks?  That makes the new restriction as
> useless as what RLIMIT_DATA used to be, doesn't it?

Not as it were before, but true, using growsdown/up will give a way
to allocate memory not limited byt rlimit-data. (Also I just noted
that I modified mm.h as well, where anon_accountable_mapping
was implemented but forgot to add it into quilt, so this patch
on its own won't compile, don't apply it).

> > > 
> > > I only had a quick look so apologies if this is handled and I missed it :)
> > 
> > thanks for feedback! also take a look on Kostya's patch, I think it's
> > even better approach (and I like it more than mine).
> 
> Ha I'm not subscribed to LKML so I missed those, I suppose you can ignore
> my comments then! :)

https://lkml.org/lkml/2015/12/14/72

Take a look.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA
@ 2015-12-14 15:43         ` Cyrill Gorcunov
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov @ 2015-12-14 15:43 UTC (permalink / raw)
  To: Quentin Casasnovas
  Cc: linux-kernel, linux-mm, Vegard Nossum, Linus Torvalds,
	Willy Tarreau, Andy Lutomirski, Kees Cook, Vladimir Davydov,
	Konstantin Khlebnikov, Pavel Emelyanov, Peter Zijlstra

On Mon, Dec 14, 2015 at 04:32:34PM +0100, Quentin Casasnovas wrote:
> > 
> > growsup/down stand for stack usage iirc, so it was intentionally
> > not accounted here.
> >
> 
> Right, but in the same vein of Linus saying RLIMIT_DATA is/was useless
> because everyone could use mmap() instead of brk() to get anonymous memory,
> what's the point of restricting "almost-all" anonymous memory if one can
> just use MAP_GROWSDOWN/UP and cause repeated page faults to extend that
> mapping, circumventing your checks?  That makes the new restriction as
> useless as what RLIMIT_DATA used to be, doesn't it?

Not as it were before, but true, using growsdown/up will give a way
to allocate memory not limited byt rlimit-data. (Also I just noted
that I modified mm.h as well, where anon_accountable_mapping
was implemented but forgot to add it into quilt, so this patch
on its own won't compile, don't apply it).

> > > 
> > > I only had a quick look so apologies if this is handled and I missed it :)
> > 
> > thanks for feedback! also take a look on Kostya's patch, I think it's
> > even better approach (and I like it more than mine).
> 
> Ha I'm not subscribed to LKML so I missed those, I suppose you can ignore
> my comments then! :)

https://lkml.org/lkml/2015/12/14/72

Take a look.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-12-14 15:43 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-13 20:14 [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA Cyrill Gorcunov
2015-12-13 20:14 ` Cyrill Gorcunov
2015-12-14  8:12 ` Konstantin Khlebnikov
2015-12-14  8:12   ` Konstantin Khlebnikov
2015-12-14  8:12   ` [PATCH RFC] mm: rework virtual memory accounting Konstantin Khlebnikov
2015-12-14  8:12     ` Konstantin Khlebnikov
2015-12-14  9:08     ` Cyrill Gorcunov
2015-12-14  9:08       ` Cyrill Gorcunov
2015-12-14 14:51 ` [RFC 1/2] [RFC] mm: Account anon mappings as RLIMIT_DATA Quentin Casasnovas
2015-12-14 14:51   ` Quentin Casasnovas
2015-12-14 15:11   ` Cyrill Gorcunov
2015-12-14 15:11     ` Cyrill Gorcunov
2015-12-14 15:32     ` Quentin Casasnovas
2015-12-14 15:32       ` Quentin Casasnovas
2015-12-14 15:43       ` Cyrill Gorcunov
2015-12-14 15:43         ` Cyrill Gorcunov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.