* [PATCHv2 0/2] remap_file_pages() decommission
@ 2014-05-08 12:41 ` Kirill A. Shutemov
  0 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-08 12:41 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-kernel, linux-mm, peterz, mingo, Kirill A. Shutemov

Hi Andrew and Linus,

These two patches demonstrate how we can get rid of nonlinear mappings.

The first patch documents the deprecation of remap_file_pages(2) and adds a
printk to the syscall code. The patch could be propagated to stable kernels
if the approach with remap_file_pages() emulation is okay.

The second patch replaces remap_file_pages(2) with an emulation. I didn't
find any real code (apart from LTP) to test it on, so I wrote a simple test
case. See the commit message for numbers.

I will prepare a separate patchset to clean up all nonlinear mapping
leftovers if the approach with emulation is desirable.

Comments?

Kirill A. Shutemov (2):
  mm: mark remap_file_pages() syscall as deprecated
  mm: replace remap_file_pages() syscall with emulation

 Documentation/vm/remap_file_pages.txt |  27 ++++
 include/linux/fs.h                    |   8 +-
 mm/Makefile                           |   2 +-
 mm/fremap.c                           | 282 ----------------------------------
 mm/mmap.c                             |  66 ++++++++
 mm/nommu.c                            |   8 -
 6 files changed, 100 insertions(+), 293 deletions(-)
 create mode 100644 Documentation/vm/remap_file_pages.txt
 delete mode 100644 mm/fremap.c

-- 
2.0.0.rc2


* [PATCH 1/2] mm: mark remap_file_pages() syscall as deprecated
  2014-05-08 12:41 ` Kirill A. Shutemov
@ 2014-05-08 12:41   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-08 12:41 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-kernel, linux-mm, peterz, mingo, Kirill A. Shutemov

The remap_file_pages() system call is used to create a nonlinear mapping,
that is, a mapping in which the pages of the file are mapped into a
nonsequential order in memory. The advantage of using remap_file_pages()
over using repeated calls to mmap(2) is that the former approach does not
require the kernel to create additional VMA (Virtual Memory Area) data
structures.

Supporting nonlinear mappings requires a significant amount of non-trivial
code in the kernel virtual memory subsystem, including hot paths. Also, to
make nonlinear mappings work, the kernel needs a way to distinguish normal
page table entries from entries with a file offset (pte_file). The kernel
reserves a flag in the PTE for this purpose. PTE flags are a scarce
resource, especially on some CPU architectures. It would be nice to free up
the flag for other uses.
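
As a rough illustration of that distinction, here is a toy model of a
non-present "file" PTE. The bit positions and all toy_* names are
assumptions made up for this sketch and do not match any real
architecture; only the pgoff_to_pte()/pte_file() naming echoes the kernel
helpers used in mm/fremap.c.

```c
#include <assert.h>
#include <stdint.h>

/* Toy layout, NOT a real architecture's: bit 0 = present, bit 1 = file. */
#define TOY_PRESENT	(UINT64_C(1) << 0)
#define TOY_FILE	(UINT64_C(1) << 1)
#define TOY_SHIFT	2	/* file offset lives above the two flag bits */

typedef uint64_t toy_pte_t;

/* Encode a file page offset as a non-present entry with the file bit set. */
static inline toy_pte_t toy_pgoff_to_pte(uint64_t pgoff)
{
	return (pgoff << TOY_SHIFT) | TOY_FILE;
}

/* True only for non-present entries that carry a file offset. */
static inline int toy_pte_file(toy_pte_t pte)
{
	return !(pte & TOY_PRESENT) && (pte & TOY_FILE);
}

/* Recover the file page offset stored by toy_pgoff_to_pte(). */
static inline uint64_t toy_pte_to_pgoff(toy_pte_t pte)
{
	return pte >> TOY_SHIFT;
}
```

In this toy layout one bit of every entry is spent on telling file entries
apart from other non-present entries, which is the kind of flag the
paragraph above would like to reclaim.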

Fortunately, there are not many users of remap_file_pages() in the wild.
The only known user is one enterprise RDBMS implementation, which uses the
syscall on 32-bit systems to map files bigger than can linearly fit into
the 32-bit virtual address space. This use case is not critical anymore
since 64-bit systems are widely available.

The plan is to deprecate the syscall and replace it with an emulation.
The emulation will create new VMAs instead of nonlinear mappings. It will
be slower for the rare users of remap_file_pages(), but the ABI is
preserved.

One side effect of the emulation (apart from performance) is that a user can
hit the vm.max_map_count limit more easily due to the additional VMAs. See
the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.
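
To see how close a process is to that limit, something like the following
userspace sketch could be used. It assumes procfs is mounted at /proc, and
it treats one line of /proc/self/maps as one mapping, which is only an
approximation of the kernel's own per-mm count. The function names are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read vm.max_map_count through procfs; returns -1 if it cannot be read. */
long read_max_map_count(void)
{
	FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
	long v = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

/* Count the caller's mappings: one line of /proc/self/maps per mapping. */
long count_own_mappings(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	long n = 0;
	int c;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			n++;
	fclose(f);
	return n;
}
```

Comparing count_own_mappings() against read_max_map_count() before a burst
of remap_file_pages() calls gives a rough idea of the remaining headroom.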

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/remap_file_pages.txt | 28 ++++++++++++++++++++++++++++
 mm/fremap.c                           |  4 ++++
 2 files changed, 32 insertions(+)
 create mode 100644 Documentation/vm/remap_file_pages.txt

diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
new file mode 100644
index 000000000000..560e4363a55d
--- /dev/null
+++ b/Documentation/vm/remap_file_pages.txt
@@ -0,0 +1,28 @@
+The remap_file_pages() system call is used to create a nonlinear mapping,
+that is, a mapping in which the pages of the file are mapped into a
+nonsequential order in memory. The advantage of using remap_file_pages()
+over using repeated calls to mmap(2) is that the former approach does not
+require the kernel to create additional VMA (Virtual Memory Area) data
+structures.
+
+Supporting nonlinear mappings requires a significant amount of non-trivial
+code in the kernel virtual memory subsystem, including hot paths. To make
+nonlinear mappings work, the kernel also needs a way to distinguish normal
+page table entries from entries with a file offset (pte_file). The kernel
+reserves a PTE flag for this purpose. PTE flags are a scarce resource,
+especially on some CPU architectures. It would be nice to free up the flag.
+
+Fortunately, there are not many users of remap_file_pages() in the wild.
+It's only known that one enterprise RDBMS implementation uses the syscall
+on 32-bit systems to map files bigger than can linearly fit into 32-bit
+virtual address space. This use-case is not critical anymore since 64-bit
+systems are widely available.
+
+The plan is to deprecate the syscall and replace it with an emulation.
+The emulation will create new VMAs instead of nonlinear mappings. It's
+going to work slower for rare users of remap_file_pages() but ABI is
+preserved.
+
+One side effect of emulation (apart from performance) is that user can hit
+vm.max_map_count limit more easily due to additional VMAs. See comment for
+DEFAULT_MAX_MAP_COUNT for more details on the limit.
diff --git a/mm/fremap.c b/mm/fremap.c
index 34feba60a17e..12c3bb63b7f9 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -152,6 +152,10 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	int has_write_lock = 0;
 	vm_flags_t vm_flags = 0;
 
+	pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
+			"See Documentation/vm/remap_file_pages.txt.\n",
+			current->comm, current->pid);
+
 	if (prot)
 		return err;
 	/*
-- 
2.0.0.rc2


* [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-08 12:41 ` Kirill A. Shutemov
@ 2014-05-08 12:41   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-08 12:41 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-kernel, linux-mm, peterz, mingo, Kirill A. Shutemov

remap_file_pages(2) was invented to efficiently map parts of a huge file
into the limited 32-bit virtual address space, as in database workloads.

Nonlinear mappings are a pain to support, and it seems there are no
legitimate use cases nowadays since 64-bit systems are widely available.

Let's drop it and get rid of all this special-cased code.

The patch replaces the syscall with an emulation which creates a new VMA on
each remap_file_pages() call, unless it can be merged with an adjacent
one.
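
The effect of the emulation can be approximated from userspace by mapping
the same file over part of an existing shared mapping with MAP_FIXED. The
sketch below is only an illustration under that assumption:
remap_with_mmap() and remap_demo() are hypothetical names, and unlike the
in-kernel emulation, which takes the file and protection from the existing
VMA, the caller here has to pass fd and prot explicitly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Re-point `size` bytes at `addr` to file page `pgoff` via MAP_FIXED. */
static int remap_with_mmap(void *addr, size_t size, int prot, int fd,
			   size_t pgoff)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(addr, size, prot, MAP_SHARED | MAP_FIXED, fd,
		       (off_t)pgoff * page);

	return p == MAP_FAILED ? -1 : 0;
}

/*
 * Map two pages of a temporary file, tag each page, swap the two pages in
 * the mapping and check what is visible. Returns 0 on success, -1 on error.
 */
int remap_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char path[] = "/tmp/remap-demo-XXXXXX";
	int prot = PROT_READ | PROT_WRITE;
	unsigned char *p;
	int fd, ok;

	assert(page > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the open fd keeps the file alive */
	if (ftruncate(fd, 2 * page) < 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, 2 * page, prot, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}
	p[0] = 'A';	/* first byte of file page 0 */
	p[page] = 'B';	/* first byte of file page 1 */

	/* Swap the pages: each call replaces part of the original mapping. */
	ok = remap_with_mmap(p, page, prot, fd, 1) == 0 &&
	     remap_with_mmap(p + page, page, prot, fd, 0) == 0 &&
	     p[0] == 'B' && p[page] == 'A';

	munmap(p, 2 * page);
	close(fd);
	return ok ? 0 : -1;
}
```

Each remap_with_mmap() call here can leave an extra VMA behind when the
new file offset is not contiguous with its neighbours, which is the cost
the emulation accepts in exchange for dropping nonlinear mappings.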

I didn't find *any* real code that uses remap_file_pages(2) to test the
emulation's impact on. I've checked Debian code search and the source of all
packages in ALT Linux. No real users: only libc wrappers, mentions in strace,
gdb, valgrind and that kind of stuff.

There are a few basic tests in LTP for the syscall. They work just fine
with the emulation.

To test the performance impact, I've written a small test case which
demonstrates pretty much the worst-case scenario: map a 4G shmfs file, write
the page's pgoff to the beginning of every page, remap the pages in reverse
order, read every page.

The test creates 1 million VMAs if the emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.

Before:		23.3 ( +-  4.31% ) seconds
After:		43.9 ( +-  0.85% ) seconds
Slowdown:	1.88x
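
The figures above can be reproduced with a few lines of arithmetic. This
is only a sketch: the 4096-byte page size comes from the test case below,
the helper names are made up for illustration, and 65530 is assumed to be
the default vm.max_map_count on the system in question.

```c
#include <assert.h>

#define TEST_SIZE	(4096ULL * 1024 * 1024)	/* 4 GiB, as in the test */
#define TEST_PAGE	4096ULL			/* page size the test uses */
#define ASSUMED_LIMIT	65530ULL		/* assumed default max_map_count */

/* Every page is remapped to a non-adjacent offset, so one VMA per page. */
static inline unsigned long long worst_case_vmas(void)
{
	return TEST_SIZE / TEST_PAGE;
}

/* Ratio of the two wall-clock times quoted above. */
static inline double slowdown(double before, double after)
{
	return after / before;
}
```

worst_case_vmas() gives 1048576, which is above the assumed default limit
and below the 1100000 used for the test run, and slowdown(23.3, 43.9) is
about 1.88.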

I believe we can live with that.

Test case:

	#define _GNU_SOURCE
	#include <assert.h>
	#include <stdlib.h>
	#include <stdio.h>
	#include <sys/mman.h>

	#define MB	(1024UL * 1024)
	#define SIZE	(4096 * MB)

	int main(int argc, char **argv)
	{
		unsigned long *p;
		long i, pass;

		for (pass = 0; pass < 10; pass++) {
			p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
					MAP_SHARED | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED) {
				perror("mmap");
				return -1;
			}

			for (i = 0; i < SIZE / 4096; i++)
				p[i * 4096 / sizeof(*p)] = i;

			for (i = 0; i < SIZE / 4096; i++) {
				if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
						0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
					perror("remap_file_pages");
					return -1;
				}
			}

			for (i = SIZE / 4096 - 1; i >= 0; i--)
				assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

			munmap(p, SIZE);
		}

		return 0;
	}

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/remap_file_pages.txt |   7 +-
 include/linux/fs.h                    |   8 +-
 mm/Makefile                           |   2 +-
 mm/fremap.c                           | 286 ----------------------------------
 mm/mmap.c                             |  66 ++++++++
 mm/nommu.c                            |   8 -
 6 files changed, 76 insertions(+), 301 deletions(-)
 delete mode 100644 mm/fremap.c

diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
index 560e4363a55d..f609142f406a 100644
--- a/Documentation/vm/remap_file_pages.txt
+++ b/Documentation/vm/remap_file_pages.txt
@@ -18,10 +18,9 @@ on 32-bit systems to map files bigger than can linearly fit into 32-bit
 virtual address space. This use-case is not critical anymore since 64-bit
 systems are widely available.
 
-The plan is to deprecate the syscall and replace it with an emulation.
-The emulation will create new VMAs instead of nonlinear mappings. It's
-going to work slower for rare users of remap_file_pages() but ABI is
-preserved.
+The syscall is deprecated and is now replaced with an emulation. The
+emulation creates new VMAs instead of nonlinear mappings. It's going to
+work slower for rare users of remap_file_pages() but ABI is preserved.
 
 One side effect of emulation (apart from performance) is that user can hit
 vm.max_map_count limit more easily due to additional VMAs. See comment for
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 878031227c57..b7cda7d95ea0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2401,8 +2401,12 @@ extern int sb_min_blocksize(struct super_block *, int);
 
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
-extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr,
-		unsigned long size, pgoff_t pgoff);
+static inline int generic_file_remap_pages(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long size, pgoff_t pgoff)
+{
+	BUG();
+	return 0;
+}
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long);
diff --git a/mm/Makefile b/mm/Makefile
index b484452dac57..27e3e30be39b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -3,7 +3,7 @@
 #
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
 			   vmalloc.o pagewalk.o pgtable-generic.o
 
diff --git a/mm/fremap.c b/mm/fremap.c
deleted file mode 100644
index 12c3bb63b7f9..000000000000
--- a/mm/fremap.c
+++ /dev/null
@@ -1,286 +0,0 @@
-/*
- *   linux/mm/fremap.c
- * 
- * Explicit pagetable population and nonlinear (random) mappings support.
- *
- * started by Ingo Molnar, Copyright (C) 2002, 2003
- */
-#include <linux/export.h>
-#include <linux/backing-dev.h>
-#include <linux/mm.h>
-#include <linux/swap.h>
-#include <linux/file.h>
-#include <linux/mman.h>
-#include <linux/pagemap.h>
-#include <linux/swapops.h>
-#include <linux/rmap.h>
-#include <linux/syscalls.h>
-#include <linux/mmu_notifier.h>
-
-#include <asm/mmu_context.h>
-#include <asm/cacheflush.h>
-#include <asm/tlbflush.h>
-
-#include "internal.h"
-
-static int mm_counter(struct page *page)
-{
-	return PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
-}
-
-static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long addr, pte_t *ptep)
-{
-	pte_t pte = *ptep;
-	struct page *page;
-	swp_entry_t entry;
-
-	if (pte_present(pte)) {
-		flush_cache_page(vma, addr, pte_pfn(pte));
-		pte = ptep_clear_flush(vma, addr, ptep);
-		page = vm_normal_page(vma, addr, pte);
-		if (page) {
-			if (pte_dirty(pte))
-				set_page_dirty(page);
-			update_hiwater_rss(mm);
-			dec_mm_counter(mm, mm_counter(page));
-			page_remove_rmap(page);
-			page_cache_release(page);
-		}
-	} else {	/* zap_pte() is not called when pte_none() */
-		if (!pte_file(pte)) {
-			update_hiwater_rss(mm);
-			entry = pte_to_swp_entry(pte);
-			if (non_swap_entry(entry)) {
-				if (is_migration_entry(entry)) {
-					page = migration_entry_to_page(entry);
-					dec_mm_counter(mm, mm_counter(page));
-				}
-			} else {
-				free_swap_and_cache(entry);
-				dec_mm_counter(mm, MM_SWAPENTS);
-			}
-		}
-		pte_clear_not_present_full(mm, addr, ptep, 0);
-	}
-}
-
-/*
- * Install a file pte to a given virtual memory address, release any
- * previously existing mapping.
- */
-static int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long pgoff, pgprot_t prot)
-{
-	int err = -ENOMEM;
-	pte_t *pte, ptfile;
-	spinlock_t *ptl;
-
-	pte = get_locked_pte(mm, addr, &ptl);
-	if (!pte)
-		goto out;
-
-	ptfile = pgoff_to_pte(pgoff);
-
-	if (!pte_none(*pte)) {
-		if (pte_present(*pte) && pte_soft_dirty(*pte))
-			pte_file_mksoft_dirty(ptfile);
-		zap_pte(mm, vma, addr, pte);
-	}
-
-	set_pte_at(mm, addr, pte, ptfile);
-	/*
-	 * We don't need to run update_mmu_cache() here because the "file pte"
-	 * being installed by install_file_pte() is not a real pte - it's a
-	 * non-present entry (like a swap entry), noting what file offset should
-	 * be mapped there when there's a fault (in a non-linear vma where
-	 * that's not obvious).
-	 */
-	pte_unmap_unlock(pte, ptl);
-	err = 0;
-out:
-	return err;
-}
-
-int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
-			     unsigned long size, pgoff_t pgoff)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	int err;
-
-	do {
-		err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot);
-		if (err)
-			return err;
-
-		size -= PAGE_SIZE;
-		addr += PAGE_SIZE;
-		pgoff++;
-	} while (size);
-
-	return 0;
-}
-EXPORT_SYMBOL(generic_file_remap_pages);
-
-/**
- * sys_remap_file_pages - remap arbitrary pages of an existing VM_SHARED vma
- * @start: start of the remapped virtual memory range
- * @size: size of the remapped virtual memory range
- * @prot: new protection bits of the range (see NOTE)
- * @pgoff: to-be-mapped page of the backing store file
- * @flags: 0 or MAP_NONBLOCKED - the later will cause no IO.
- *
- * sys_remap_file_pages remaps arbitrary pages of an existing VM_SHARED vma
- * (shared backing store file).
- *
- * This syscall works purely via pagetables, so it's the most efficient
- * way to map the same (large) file into a given virtual window. Unlike
- * mmap()/mremap() it does not create any new vmas. The new mappings are
- * also safe across swapout.
- *
- * NOTE: the @prot parameter right now is ignored (but must be zero),
- * and the vma's default protection is used. Arbitrary protections
- * might be implemented in the future.
- */
-SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
-		unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
-{
-	struct mm_struct *mm = current->mm;
-	struct address_space *mapping;
-	struct vm_area_struct *vma;
-	int err = -EINVAL;
-	int has_write_lock = 0;
-	vm_flags_t vm_flags = 0;
-
-	pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
-			"See Documentation/vm/remap_file_pages.txt.\n",
-			current->comm, current->pid);
-
-	if (prot)
-		return err;
-	/*
-	 * Sanitize the syscall parameters:
-	 */
-	start = start & PAGE_MASK;
-	size = size & PAGE_MASK;
-
-	/* Does the address range wrap, or is the span zero-sized? */
-	if (start + size <= start)
-		return err;
-
-	/* Does pgoff wrap? */
-	if (pgoff + (size >> PAGE_SHIFT) < pgoff)
-		return err;
-
-	/* Can we represent this offset inside this architecture's pte's? */
-#if PTE_FILE_MAX_BITS < BITS_PER_LONG
-	if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS))
-		return err;
-#endif
-
-	/* We need down_write() to change vma->vm_flags. */
-	down_read(&mm->mmap_sem);
- retry:
-	vma = find_vma(mm, start);
-
-	/*
-	 * Make sure the vma is shared, that it supports prefaulting,
-	 * and that the remapped range is valid and fully within
-	 * the single existing vma.
-	 */
-	if (!vma || !(vma->vm_flags & VM_SHARED))
-		goto out;
-
-	if (!vma->vm_ops || !vma->vm_ops->remap_pages)
-		goto out;
-
-	if (start < vma->vm_start || start + size > vma->vm_end)
-		goto out;
-
-	/* Must set VM_NONLINEAR before any pages are populated. */
-	if (!(vma->vm_flags & VM_NONLINEAR)) {
-		/*
-		 * vm_private_data is used as a swapout cursor
-		 * in a VM_NONLINEAR vma.
-		 */
-		if (vma->vm_private_data)
-			goto out;
-
-		/* Don't need a nonlinear mapping, exit success */
-		if (pgoff == linear_page_index(vma, start)) {
-			err = 0;
-			goto out;
-		}
-
-		if (!has_write_lock) {
-get_write_lock:
-			up_read(&mm->mmap_sem);
-			down_write(&mm->mmap_sem);
-			has_write_lock = 1;
-			goto retry;
-		}
-		mapping = vma->vm_file->f_mapping;
-		/*
-		 * page_mkclean doesn't work on nonlinear vmas, so if
-		 * dirty pages need to be accounted, emulate with linear
-		 * vmas.
-		 */
-		if (mapping_cap_account_dirty(mapping)) {
-			unsigned long addr;
-			struct file *file = get_file(vma->vm_file);
-			/* mmap_region may free vma; grab the info now */
-			vm_flags = vma->vm_flags;
-
-			addr = mmap_region(file, start, size, vm_flags, pgoff);
-			fput(file);
-			if (IS_ERR_VALUE(addr)) {
-				err = addr;
-			} else {
-				BUG_ON(addr != start);
-				err = 0;
-			}
-			goto out_freed;
-		}
-		mutex_lock(&mapping->i_mmap_mutex);
-		flush_dcache_mmap_lock(mapping);
-		vma->vm_flags |= VM_NONLINEAR;
-		vma_interval_tree_remove(vma, &mapping->i_mmap);
-		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
-		flush_dcache_mmap_unlock(mapping);
-		mutex_unlock(&mapping->i_mmap_mutex);
-	}
-
-	if (vma->vm_flags & VM_LOCKED) {
-		/*
-		 * drop PG_Mlocked flag for over-mapped range
-		 */
-		if (!has_write_lock)
-			goto get_write_lock;
-		vm_flags = vma->vm_flags;
-		munlock_vma_pages_range(vma, start, start + size);
-		vma->vm_flags = vm_flags;
-	}
-
-	mmu_notifier_invalidate_range_start(mm, start, start + size);
-	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size);
-
-	/*
-	 * We can't clear VM_NONLINEAR because we'd have to do
-	 * it after ->populate completes, and that would prevent
-	 * downgrading the lock.  (Locks can't be upgraded).
-	 */
-
-out:
-	if (vma)
-		vm_flags = vma->vm_flags;
-out_freed:
-	if (likely(!has_write_lock))
-		up_read(&mm->mmap_sem);
-	else
-		up_write(&mm->mmap_sem);
-	if (!err && ((vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK)))
-		mm_populate(start, size);
-
-	return err;
-}
diff --git a/mm/mmap.c b/mm/mmap.c
index b1202cf81f4b..a490e5203eb9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2579,6 +2579,72 @@ SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 	return vm_munmap(addr, len);
 }
 
+
+/*
+ * Emulation of deprecated remap_file_pages() syscall.
+ */
+SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
+		unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+{
+
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	unsigned long populate = 0;
+	unsigned long ret = -EINVAL;
+
+	pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
+			"See Documentation/vm/remap_file_pages.txt.\n",
+			current->comm, current->pid);
+
+	if (prot)
+		return ret;
+	start = start & PAGE_MASK;
+	size = size & PAGE_MASK;
+
+	if (start + size <= start)
+		return ret;
+
+	/* Does pgoff wrap? */
+	if (pgoff + (size >> PAGE_SHIFT) < pgoff)
+		return ret;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+
+	if (!vma || !(vma->vm_flags & VM_SHARED))
+		goto out;
+
+	if (start < vma->vm_start || start + size > vma->vm_end)
+		goto out;
+
+	if (pgoff == linear_page_index(vma, start)) {
+		ret = 0;
+		goto out;
+	}
+
+	prot |= vma->vm_flags & VM_READ ? PROT_READ : 0;
+	prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0;
+	prot |= vma->vm_flags & VM_EXEC ? PROT_EXEC : 0;
+
+	flags &= MAP_NONBLOCK;
+	flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
+	if (vma->vm_flags & VM_LOCKED) {
+		flags |= MAP_LOCKED;
+		/* drop PG_Mlocked flag for over-mapped range */
+		munlock_vma_pages_range(vma, start, start + size);
+	}
+
+	ret = do_mmap_pgoff(vma->vm_file, start, size,
+			prot, flags, pgoff, &populate);
+out:
+	up_write(&mm->mmap_sem);
+	if (populate)
+		mm_populate(ret, populate);
+	if (!IS_ERR_VALUE(ret))
+		ret = 0;
+	return ret;
+}
+
 static inline void verify_mm_writelocked(struct mm_struct *mm)
 {
 #ifdef CONFIG_DEBUG_VM
diff --git a/mm/nommu.c b/mm/nommu.c
index 85f8d6698d48..d20b7fea2852 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1996,14 +1996,6 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_map_pages);
 
-int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
-			     unsigned long size, pgoff_t pgoff)
-{
-	BUG();
-	return 0;
-}
-EXPORT_SYMBOL(generic_file_remap_pages);
-
 static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long addr, void *buf, int len, int write)
 {
-- 
2.0.0.rc2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
@ 2014-05-08 12:41   ` Kirill A. Shutemov
  0 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-08 12:41 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-kernel, linux-mm, peterz, mingo, Kirill A. Shutemov

remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.

Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.

Let's drop it and get rid of all these special-cased code.

The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.

I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in strace,
gdb, valgrind and this kind of stuff.

There are few basic tests in LTP for the syscall. They work just fine
with emulation.

To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.

The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.

Before:		23.3 ( +-  4.31% ) seconds
After:		43.9 ( +-  0.85% ) seconds
Slowdown:	1.88x

I believe we can live with that.

Test case:

	#define _GNU_SOURCE
	#include <assert.h>
	#include <stdlib.h>
	#include <stdio.h>
	#include <sys/mman.h>

	#define MB	(1024UL * 1024)
	#define SIZE	(4096 * MB)

	int main(int argc, char **argv)
	{
		unsigned long *p;
		long i, pass;

		for (pass = 0; pass < 10; pass++) {
			p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
					MAP_SHARED | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED) {
				perror("mmap");
				return -1;
			}

			for (i = 0; i < SIZE / 4096; i++)
				p[i * 4096 / sizeof(*p)] = i;

			for (i = 0; i < SIZE / 4096; i++) {
				if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
						0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
					perror("remap_file_pages");
					return -1;
				}
			}

			for (i = SIZE / 4096 - 1; i >= 0; i--)
				assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

			munmap(p, SIZE);
		}

		return 0;
	}

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/remap_file_pages.txt |   7 +-
 include/linux/fs.h                    |   8 +-
 mm/Makefile                           |   2 +-
 mm/fremap.c                           | 286 ----------------------------------
 mm/mmap.c                             |  66 ++++++++
 mm/nommu.c                            |   8 -
 6 files changed, 76 insertions(+), 301 deletions(-)
 delete mode 100644 mm/fremap.c

diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
index 560e4363a55d..f609142f406a 100644
--- a/Documentation/vm/remap_file_pages.txt
+++ b/Documentation/vm/remap_file_pages.txt
@@ -18,10 +18,9 @@ on 32-bit systems to map files bigger than can linearly fit into 32-bit
 virtual address space. This use-case is not critical anymore since 64-bit
 systems are widely available.
 
-The plan is to deprecate the syscall and replace it with an emulation.
-The emulation will create new VMAs instead of nonlinear mappings. It's
-going to work slower for rare users of remap_file_pages() but ABI is
-preserved.
+The syscall is deprecated and replaced with an emulation now. The
+emulation creates new VMAs instead of nonlinear mappings. It's going to
+work slower for rare users of remap_file_pages() but ABI is preserved.
 
 One side effect of emulation (apart from performance) is that user can hit
 vm.max_map_count limit more easily due to additional VMAs. See comment for
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 878031227c57..b7cda7d95ea0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2401,8 +2401,12 @@ extern int sb_min_blocksize(struct super_block *, int);
 
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
-extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr,
-		unsigned long size, pgoff_t pgoff);
+static inline int generic_file_remap_pages(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long size, pgoff_t pgoff)
+{
+	BUG();
+	return 0;
+}
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long);
diff --git a/mm/Makefile b/mm/Makefile
index b484452dac57..27e3e30be39b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -3,7 +3,7 @@
 #
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
 			   vmalloc.o pagewalk.o pgtable-generic.o
 
diff --git a/mm/fremap.c b/mm/fremap.c
deleted file mode 100644
index 12c3bb63b7f9..000000000000
--- a/mm/fremap.c
+++ /dev/null
@@ -1,286 +0,0 @@
-/*
- *   linux/mm/fremap.c
- * 
- * Explicit pagetable population and nonlinear (random) mappings support.
- *
- * started by Ingo Molnar, Copyright (C) 2002, 2003
- */
-#include <linux/export.h>
-#include <linux/backing-dev.h>
-#include <linux/mm.h>
-#include <linux/swap.h>
-#include <linux/file.h>
-#include <linux/mman.h>
-#include <linux/pagemap.h>
-#include <linux/swapops.h>
-#include <linux/rmap.h>
-#include <linux/syscalls.h>
-#include <linux/mmu_notifier.h>
-
-#include <asm/mmu_context.h>
-#include <asm/cacheflush.h>
-#include <asm/tlbflush.h>
-
-#include "internal.h"
-
-static int mm_counter(struct page *page)
-{
-	return PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
-}
-
-static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long addr, pte_t *ptep)
-{
-	pte_t pte = *ptep;
-	struct page *page;
-	swp_entry_t entry;
-
-	if (pte_present(pte)) {
-		flush_cache_page(vma, addr, pte_pfn(pte));
-		pte = ptep_clear_flush(vma, addr, ptep);
-		page = vm_normal_page(vma, addr, pte);
-		if (page) {
-			if (pte_dirty(pte))
-				set_page_dirty(page);
-			update_hiwater_rss(mm);
-			dec_mm_counter(mm, mm_counter(page));
-			page_remove_rmap(page);
-			page_cache_release(page);
-		}
-	} else {	/* zap_pte() is not called when pte_none() */
-		if (!pte_file(pte)) {
-			update_hiwater_rss(mm);
-			entry = pte_to_swp_entry(pte);
-			if (non_swap_entry(entry)) {
-				if (is_migration_entry(entry)) {
-					page = migration_entry_to_page(entry);
-					dec_mm_counter(mm, mm_counter(page));
-				}
-			} else {
-				free_swap_and_cache(entry);
-				dec_mm_counter(mm, MM_SWAPENTS);
-			}
-		}
-		pte_clear_not_present_full(mm, addr, ptep, 0);
-	}
-}
-
-/*
- * Install a file pte to a given virtual memory address, release any
- * previously existing mapping.
- */
-static int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long pgoff, pgprot_t prot)
-{
-	int err = -ENOMEM;
-	pte_t *pte, ptfile;
-	spinlock_t *ptl;
-
-	pte = get_locked_pte(mm, addr, &ptl);
-	if (!pte)
-		goto out;
-
-	ptfile = pgoff_to_pte(pgoff);
-
-	if (!pte_none(*pte)) {
-		if (pte_present(*pte) && pte_soft_dirty(*pte))
-			pte_file_mksoft_dirty(ptfile);
-		zap_pte(mm, vma, addr, pte);
-	}
-
-	set_pte_at(mm, addr, pte, ptfile);
-	/*
-	 * We don't need to run update_mmu_cache() here because the "file pte"
-	 * being installed by install_file_pte() is not a real pte - it's a
-	 * non-present entry (like a swap entry), noting what file offset should
-	 * be mapped there when there's a fault (in a non-linear vma where
-	 * that's not obvious).
-	 */
-	pte_unmap_unlock(pte, ptl);
-	err = 0;
-out:
-	return err;
-}
-
-int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
-			     unsigned long size, pgoff_t pgoff)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	int err;
-
-	do {
-		err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot);
-		if (err)
-			return err;
-
-		size -= PAGE_SIZE;
-		addr += PAGE_SIZE;
-		pgoff++;
-	} while (size);
-
-	return 0;
-}
-EXPORT_SYMBOL(generic_file_remap_pages);
-
-/**
- * sys_remap_file_pages - remap arbitrary pages of an existing VM_SHARED vma
- * @start: start of the remapped virtual memory range
- * @size: size of the remapped virtual memory range
- * @prot: new protection bits of the range (see NOTE)
- * @pgoff: to-be-mapped page of the backing store file
- * @flags: 0 or MAP_NONBLOCKED - the later will cause no IO.
- *
- * sys_remap_file_pages remaps arbitrary pages of an existing VM_SHARED vma
- * (shared backing store file).
- *
- * This syscall works purely via pagetables, so it's the most efficient
- * way to map the same (large) file into a given virtual window. Unlike
- * mmap()/mremap() it does not create any new vmas. The new mappings are
- * also safe across swapout.
- *
- * NOTE: the @prot parameter right now is ignored (but must be zero),
- * and the vma's default protection is used. Arbitrary protections
- * might be implemented in the future.
- */
-SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
-		unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
-{
-	struct mm_struct *mm = current->mm;
-	struct address_space *mapping;
-	struct vm_area_struct *vma;
-	int err = -EINVAL;
-	int has_write_lock = 0;
-	vm_flags_t vm_flags = 0;
-
-	pr_warn_once("%s (%d) uses depricated remap_file_pages() syscall. "
-			"See Documentation/vm/remap_file_pages.txt.\n",
-			current->comm, current->pid);
-
-	if (prot)
-		return err;
-	/*
-	 * Sanitize the syscall parameters:
-	 */
-	start = start & PAGE_MASK;
-	size = size & PAGE_MASK;
-
-	/* Does the address range wrap, or is the span zero-sized? */
-	if (start + size <= start)
-		return err;
-
-	/* Does pgoff wrap? */
-	if (pgoff + (size >> PAGE_SHIFT) < pgoff)
-		return err;
-
-	/* Can we represent this offset inside this architecture's pte's? */
-#if PTE_FILE_MAX_BITS < BITS_PER_LONG
-	if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS))
-		return err;
-#endif
-
-	/* We need down_write() to change vma->vm_flags. */
-	down_read(&mm->mmap_sem);
- retry:
-	vma = find_vma(mm, start);
-
-	/*
-	 * Make sure the vma is shared, that it supports prefaulting,
-	 * and that the remapped range is valid and fully within
-	 * the single existing vma.
-	 */
-	if (!vma || !(vma->vm_flags & VM_SHARED))
-		goto out;
-
-	if (!vma->vm_ops || !vma->vm_ops->remap_pages)
-		goto out;
-
-	if (start < vma->vm_start || start + size > vma->vm_end)
-		goto out;
-
-	/* Must set VM_NONLINEAR before any pages are populated. */
-	if (!(vma->vm_flags & VM_NONLINEAR)) {
-		/*
-		 * vm_private_data is used as a swapout cursor
-		 * in a VM_NONLINEAR vma.
-		 */
-		if (vma->vm_private_data)
-			goto out;
-
-		/* Don't need a nonlinear mapping, exit success */
-		if (pgoff == linear_page_index(vma, start)) {
-			err = 0;
-			goto out;
-		}
-
-		if (!has_write_lock) {
-get_write_lock:
-			up_read(&mm->mmap_sem);
-			down_write(&mm->mmap_sem);
-			has_write_lock = 1;
-			goto retry;
-		}
-		mapping = vma->vm_file->f_mapping;
-		/*
-		 * page_mkclean doesn't work on nonlinear vmas, so if
-		 * dirty pages need to be accounted, emulate with linear
-		 * vmas.
-		 */
-		if (mapping_cap_account_dirty(mapping)) {
-			unsigned long addr;
-			struct file *file = get_file(vma->vm_file);
-			/* mmap_region may free vma; grab the info now */
-			vm_flags = vma->vm_flags;
-
-			addr = mmap_region(file, start, size, vm_flags, pgoff);
-			fput(file);
-			if (IS_ERR_VALUE(addr)) {
-				err = addr;
-			} else {
-				BUG_ON(addr != start);
-				err = 0;
-			}
-			goto out_freed;
-		}
-		mutex_lock(&mapping->i_mmap_mutex);
-		flush_dcache_mmap_lock(mapping);
-		vma->vm_flags |= VM_NONLINEAR;
-		vma_interval_tree_remove(vma, &mapping->i_mmap);
-		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
-		flush_dcache_mmap_unlock(mapping);
-		mutex_unlock(&mapping->i_mmap_mutex);
-	}
-
-	if (vma->vm_flags & VM_LOCKED) {
-		/*
-		 * drop PG_Mlocked flag for over-mapped range
-		 */
-		if (!has_write_lock)
-			goto get_write_lock;
-		vm_flags = vma->vm_flags;
-		munlock_vma_pages_range(vma, start, start + size);
-		vma->vm_flags = vm_flags;
-	}
-
-	mmu_notifier_invalidate_range_start(mm, start, start + size);
-	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size);
-
-	/*
-	 * We can't clear VM_NONLINEAR because we'd have to do
-	 * it after ->populate completes, and that would prevent
-	 * downgrading the lock.  (Locks can't be upgraded).
-	 */
-
-out:
-	if (vma)
-		vm_flags = vma->vm_flags;
-out_freed:
-	if (likely(!has_write_lock))
-		up_read(&mm->mmap_sem);
-	else
-		up_write(&mm->mmap_sem);
-	if (!err && ((vm_flags & VM_LOCKED) || !(flags & MAP_NONBLOCK)))
-		mm_populate(start, size);
-
-	return err;
-}
diff --git a/mm/mmap.c b/mm/mmap.c
index b1202cf81f4b..a490e5203eb9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2579,6 +2579,72 @@ SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 	return vm_munmap(addr, len);
 }
 
+
+/*
+ * Emulation of deprecated remap_file_pages() syscall.
+ */
+SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
+		unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+{
+
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	unsigned long populate = 0;
+	unsigned long ret = -EINVAL;
+
+	pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
+			"See Documentation/vm/remap_file_pages.txt.\n",
+			current->comm, current->pid);
+
+	if (prot)
+		return ret;
+	start = start & PAGE_MASK;
+	size = size & PAGE_MASK;
+
+	if (start + size <= start)
+		return ret;
+
+	/* Does pgoff wrap? */
+	if (pgoff + (size >> PAGE_SHIFT) < pgoff)
+		return ret;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+
+	if (!vma || !(vma->vm_flags & VM_SHARED))
+		goto out;
+
+	if (start < vma->vm_start || start + size > vma->vm_end)
+		goto out;
+
+	if (pgoff == linear_page_index(vma, start)) {
+		ret = 0;
+		goto out;
+	}
+
+	prot |= vma->vm_flags & VM_READ ? PROT_READ : 0;
+	prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0;
+	prot |= vma->vm_flags & VM_EXEC ? PROT_EXEC : 0;
+
+	flags &= MAP_NONBLOCK;
+	flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
+	if (vma->vm_flags & VM_LOCKED) {
+		flags |= MAP_LOCKED;
+		/* drop PG_Mlocked flag for over-mapped range */
+		munlock_vma_pages_range(vma, start, start + size);
+	}
+
+	ret = do_mmap_pgoff(vma->vm_file, start, size,
+			prot, flags, pgoff, &populate);
+out:
+	up_write(&mm->mmap_sem);
+	if (populate)
+		mm_populate(ret, populate);
+	if (!IS_ERR_VALUE(ret))
+		ret = 0;
+	return ret;
+}
+
 static inline void verify_mm_writelocked(struct mm_struct *mm)
 {
 #ifdef CONFIG_DEBUG_VM
diff --git a/mm/nommu.c b/mm/nommu.c
index 85f8d6698d48..d20b7fea2852 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1996,14 +1996,6 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_map_pages);
 
-int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
-			     unsigned long size, pgoff_t pgoff)
-{
-	BUG();
-	return 0;
-}
-EXPORT_SYMBOL(generic_file_remap_pages);
-
 static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long addr, void *buf, int len, int write)
 {
-- 
2.0.0.rc2

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-08 12:41 ` Kirill A. Shutemov
@ 2014-05-08 15:35   ` Linus Torvalds
  -1 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2014-05-08 15:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Ingo Molnar

On Thu, May 8, 2014 at 5:41 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> The second patch replaces remap_file_pages(2) with an emulation. I didn't
> find any real code (apart from LTP) to test it on. So I wrote a simple test case.
> See commit message for numbers.

I'm certainly ok with this. It removes code even in the "no cleanup of
the core VM" case, and performance doesn't seem to be horrible.

That said, I *really* want to get at least some minimal testing on
something that actually uses it as more than just a test-program. I'm
sure somebody inside RH has to have a 32-bit Oracle setup for
performance testing. Guys?

             Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-08 12:41 ` Kirill A. Shutemov
@ 2014-05-08 15:44   ` Armin Rigo
  -1 siblings, 0 replies; 54+ messages in thread
From: Armin Rigo @ 2014-05-08 15:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, peterz, mingo

Hi everybody,

Here is a note from the PyPy project (mentioned earlier in this
thread, and at https://lwn.net/Articles/587923/ ).

Yes, we use remap_file_pages() heavily on the x86-64 architecture.
However, the individual calls to remap_file_pages() are not
performance-critical, so it is easy to switch to using multiple
mmap()s.  We need to perform more measurements to know exactly what
the overhead would be, in terms notably of kernel memory.

However, an issue with that approach is the upper bound on the number
of VMAs.  By default, it is not large enough.  Right now, it is
possible to remap say 10% of the individual pages from an anonymous
mmap of multiple GBs in size; but doing so with individual calls to
mmap hits this arbitrary limit.  I have no particular weight to give
for or against keeping remap_file_pages() in the kernel, but if it is
removed or emulated, it would be a plus if the programs would run on a
machine with the default configuration --- i.e. if you remove or
emulate remap_file_pages(), please increase the default limit as well.
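
For illustration, the "multiple mmap()s" approach mentioned above could
look roughly like the sketch below. It is not PyPy's code. The helper
names and the temporary file path are made up, and it assumes the memory
is backed by a file descriptor (an anonymous mapping would first have to
be replaced by a shared file or shm object). Each MAP_FIXED call that
cannot be merged with a neighbour adds one VMA, which is where the limit
comes in.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): make the window at addr show
 * file page pgoff of fd by mapping over it with MAP_FIXED.
 */
static int remap_with_mmap(int fd, void *addr, size_t size,
			   size_t pgoff, size_t page_size)
{
	void *p;

	/* MAP_FIXED needs a page-aligned target address. */
	assert((uintptr_t)addr % page_size == 0);

	p = mmap(addr, size, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, fd, (off_t)(pgoff * page_size));
	return p == MAP_FAILED ? -1 : 0;
}

/*
 * Map a two-page temporary file, tag each page, then swap the two pages
 * in the window.  Returns 0 if the window shows the pages swapped.
 */
static int swap_two_pages_demo(void)
{
	size_t ps = (size_t)sysconf(_SC_PAGESIZE);
	char path[] = "/tmp/remap-demo-XXXXXX";	/* illustrative path */
	int fd = mkstemp(path);
	char *win;
	int ok;

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)(2 * ps)) < 0) {
		close(fd);
		return -1;
	}

	win = mmap(NULL, 2 * ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (win == MAP_FAILED) {
		close(fd);
		return -1;
	}
	win[0] = 'a';	/* tag file page 0 */
	win[ps] = 'b';	/* tag file page 1 */

	ok = remap_with_mmap(fd, win, ps, 1, ps) == 0 &&
	     remap_with_mmap(fd, win + ps, ps, 0, ps) == 0 &&
	     win[0] == 'b' && win[ps] == 'a';

	munmap(win, 2 * ps);
	close(fd);
	return ok ? 0 : -1;
}
```

After the two calls the window holds three VMAs' worth of history but only
two live ones, one per page, because the swapped offsets are not contiguous
and cannot be merged.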


A bientôt,

Armin.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-08 15:44   ` Armin Rigo
@ 2014-05-08 16:02     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-08 16:02 UTC (permalink / raw)
  To: Armin Rigo
  Cc: Kirill A. Shutemov, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, peterz, mingo

Armin Rigo wrote:
> Hi everybody,
> 
> Here is a note from the PyPy project (mentioned earlier in this
> thread, and at https://lwn.net/Articles/587923/ ).
> 
> Yes, we use remap_file_pages() heavily on the x86-64 architecture.
> However, the individual calls to remap_file_pages() are not
> performance-critical, so it is easy to switch to using multiple
> mmap()s.  We need to perform more measurements to know exactly what
> the overhead would be, in terms notably of kernel memory.
> 
> However, an issue with that approach is the upper bound on the number
> of VMAs.  By default, it is not large enough.  Right now, it is
> possible to remap say 10% of the individual pages from an anonymous
> mmap of multiple GBs in size; but doing so with individual calls to
> mmap hits this arbitrary limit.

The limit is not totally random. We use the ELF format for coredumps, and ELF
has a limitation (a 16-bit field) on the number of sections it can store.

With ELF extended numbering we can bypass the 16-bit limit, but some userspace
may be surprised by that.
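
To make the 16-bit constraint concrete, here is a small sketch using the
glibc <elf.h> definitions. The function names are made up for
illustration. The margin of 5 is what I recall the kernel using
(MAPCOUNT_ELF_CORE_MARGIN) when deriving the default limit, so treat that
value as an assumption. In a core file the count in question is stored in
e_phnum, and PN_XNUM is the marker value that signals extended numbering.

```c
#include <assert.h>
#include <elf.h>
#include <limits.h>

/* Assumed headroom kept below the 16-bit ceiling. */
#define ELF_CORE_MARGIN 5

/* Width in bits of the ELF header field that holds the header count. */
static unsigned int phnum_field_bits(void)
{
	return (unsigned int)(sizeof(((Elf64_Ehdr *)0)->e_phnum) * CHAR_BIT);
}

/* Default map count that still fits the 16-bit field with some margin. */
static unsigned long default_max_map_count(void)
{
	return USHRT_MAX - ELF_CORE_MARGIN;
}

/* Counts at or above PN_XNUM cannot be stored in e_phnum directly. */
static int needs_extended_numbering(unsigned long nr_segments)
{
	assert(phnum_field_bits() == 16);
	return nr_segments >= PN_XNUM;
}
```

With these definitions default_max_map_count() is 65530, and a process
with a million mappings would need extended numbering in its core file.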

> I have no particular weight to give
> for or against keeping remap_file_pages() in the kernel, but if it is
> removed or emulated, it would be a plus if the programs would run on a
> machine with the default configuration --- i.e. if you remove or
> emulate remap_file_pages(), please increase the default limit as well.

It's fine with me. Andrew?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-08 16:02     ` Kirill A. Shutemov
@ 2014-05-08 16:08       ` Linus Torvalds
  -1 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2014-05-08 16:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Armin Rigo, Andrew Morton, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Ingo Molnar

On Thu, May 8, 2014 at 9:02 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
>> i.e. if you remove or
>> emulate remap_file_pages(), please increase the default limit as well.
>
> It's fine with me. Andrew?

Not Andrew, but one thing we might look at is to make the limit
per-user rather than per-vm.

Because the vma limit isn't _just_ about the ELF core dump format
(although the default value for it is), it's also about making it
harder for people to use up tons of kernel memory in non-obvious ways.

(There are possibly also latency issues for process exit or big
munmap, I'm not sure how big a deal that is any more. Our find_vma()
should certainly scale fine, so the most obvious "tons of vma's"
problems are long gone)

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-08 12:41   ` Kirill A. Shutemov
@ 2014-05-08 21:57     ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2014-05-08 21:57 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Linus Torvalds, linux-kernel, linux-mm, peterz, mingo

On Thu,  8 May 2014 15:41:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> remap_file_pages(2) was invented to be able to efficiently map parts of a
> huge file into a limited 32-bit virtual address space, such as in database
> workloads.
> 
> Nonlinear mappings are a pain to support and it seems there are no
> legitimate use-cases nowadays since 64-bit systems are widely available.
> 
> Let's drop it and get rid of all this special-cased code.
> 
> The patch replaces the syscall with an emulation which creates a new VMA on
> each remap_file_pages(), unless it can be merged with an adjacent
> one.
> 
> I didn't find *any* real code that uses remap_file_pages(2) to test
> emulation impact on. I've checked Debian code search and source of all
> packages in ALT Linux. No real users: libc wrappers, mentions in strace,
> gdb, valgrind and this kind of stuff.
> 
> There are a few basic tests in LTP for the syscall. They work just fine
> with emulation.
> 
> To test performance impact, I've written a small test case which
> demonstrates pretty much the worst-case scenario: map a 4G shmfs file, write
> the pgoff of every page to the beginning of that page, remap the pages in
> reverse order, read every page.
> 
> The test creates about 1 million VMAs if emulation is in use, so I had to
> set vm.max_map_count to 1100000 to avoid -ENOMEM.
> 
> Before:		23.3 ( +-  4.31% ) seconds
> After:		43.9 ( +-  0.85% ) seconds
> Slowdown:	1.88x
> 
> I believe we can live with that.
> 

There's still all the special-case goop around the place to be cleaned
up - VM_NONLINEAR is a decent search term.  As is "grep nonlinear
mm/*.c".  And although this cleanup is the main reason for the
patchset, let's not do it now - we can do all that if/after this patch
gets merged.

I'll queue the patches for some linux-next exposure and shall send
[1/2] Linuswards for 3.16 if nothing terrible happens.  Once we've
sorted out the too-many-vmas issue we'll need to work out when to merge
[2/2].



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-08 16:08       ` Linus Torvalds
@ 2014-05-09 14:05         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-09 14:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Armin Rigo, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Ingo Molnar

Linus Torvalds wrote:
> On Thu, May 8, 2014 at 9:02 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> >> i.e. if you remove or
> >> emulate remap_file_pages(), please increase the default limit as well.
> >
> > It's fine with me. Andrew?
> 
> Not Andrew, but one thing we might look at is to make the limit
> per-user rather than per-vm.

Hm. I'm confused here. Do we have any limit enforced per-user?
I only see things like rlimits, which are copied from the parent.
Is that what you want?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-09 14:05         ` Kirill A. Shutemov
@ 2014-05-09 15:14           ` Linus Torvalds
  -1 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2014-05-09 15:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Armin Rigo, Andrew Morton, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Ingo Molnar

On Fri, May 9, 2014 at 7:05 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> Hm. I'm confused here. Do we have any limit enforced per-user?

Sure we do. See "struct user_struct". We limit the max number of
processes, open files, signals etc.

> I only see things like rlimits, which are copied from the parent.
> Is that what you want?

No, rlimits are per process (although in some cases what they limit
is counted per user despite the _limits_ of those resources then
being settable per thread).

So I was just thinking that if we raise the per-mm default limits,
maybe we should add a global per-user limit to make it harder for a
user to use tons and tons of VMAs.

          Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread
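
To illustrate the distinction drawn in the message above, here is a minimal
user-space sketch, assuming Linux with glibc. RLIMIT_NPROC is one of the
cases described: the limit value belongs to the calling process and can be
changed with setrlimit(), while the number it is compared against is a
count of processes kept for the real user ID. The helper names
nproc_soft_limit() and lower_nproc_soft_limit() are made up for this
example and do not come from the thread or the kernel.

```c
#include <sys/resource.h>

/*
 * Read the soft RLIMIT_NPROC value of the calling process.
 * Returns 0 and stores the limit in *out on success, -1 on failure.
 */
int nproc_soft_limit(rlim_t *out)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NPROC, &rl) != 0)
		return -1;
	*out = rl.rlim_cur;
	return 0;
}

/*
 * Set the soft RLIMIT_NPROC of the calling process to new_soft.
 * This changes the limit value for this process (and children it
 * creates later) only; other processes of the same user keep their
 * own value, even though they are all counted together.
 * Returns 0 on success, -1 on failure.
 */
int lower_nproc_soft_limit(rlim_t new_soft)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NPROC, &rl) != 0)
		return -1;
	if (new_soft > rl.rlim_max)	/* cannot exceed the hard limit */
		return -1;
	rl.rlim_cur = new_soft;
	return setrlimit(RLIMIT_NPROC, &rl);
}
```

A per-user VMA limit as proposed here would differ from this in that both
the count and the limit value would be tied to the user rather than to a
process.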

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-09 15:14           ` Linus Torvalds
@ 2014-05-09 18:19             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-09 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Armin Rigo, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Ingo Molnar

On Fri, May 09, 2014 at 08:14:08AM -0700, Linus Torvalds wrote:
> On Fri, May 9, 2014 at 7:05 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Hm. I'm confused here. Do we have any limit enforced per-user?
> 
> Sure we do. See "struct user_struct". We limit the max number of
> processes, open files, signals etc.

Okay, got it.

BTW, nobody seems to use the 'files' field of user_struct:

From 8bb8a0c740ad66126be4d3c092493e1ecc2189ef Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 9 May 2014 20:25:55 +0300
Subject: [PATCH] kernel: drop unused field 'files' from user_struct

Nobody seems to have used it for a long time. Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/sched.h | 1 -
 kernel/user.c         | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c79f757..f0503ffa7a59 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -745,7 +745,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
 struct user_struct {
 	atomic_t __count;	/* reference count */
 	atomic_t processes;	/* How many processes does this user have? */
-	atomic_t files;		/* How many open files does this user have? */
 	atomic_t sigpending;	/* How many pending signals does this user have? */
 #ifdef CONFIG_INOTIFY_USER
 	atomic_t inotify_watches; /* How many inotify watches does this user have? */
diff --git a/kernel/user.c b/kernel/user.c
index 294fc6a94168..4efa39350e44 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -87,7 +87,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
 struct user_struct root_user = {
 	.__count	= ATOMIC_INIT(1),
 	.processes	= ATOMIC_INIT(1),
-	.files		= ATOMIC_INIT(0),
 	.sigpending	= ATOMIC_INIT(0),
 	.locked_shm     = 0,
 	.uid		= GLOBAL_ROOT_UID,
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-08 15:44   ` Armin Rigo
@ 2014-05-12  3:36     ` Andi Kleen
  -1 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2014-05-12  3:36 UTC (permalink / raw)
  To: Armin Rigo
  Cc: Kirill A. Shutemov, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, peterz, mingo

Armin Rigo <arigo@tunes.org> writes:

> Here is a note from the PyPy project (mentioned earlier in this
> thread, and at https://lwn.net/Articles/587923/ ).

Your use is completely bogus. remap_file_pages() pins everything 
and disables any swapping for the area.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-12  3:36     ` Andi Kleen
@ 2014-05-12  5:16       ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 54+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-12  5:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Armin Rigo, Kirill A. Shutemov, Andrew Morton, Linus Torvalds,
	Linux Kernel Mailing List, linux-mm, peterz, Ingo Molnar

On Mon, May 12, 2014 at 7:36 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Armin Rigo <arigo@tunes.org> writes:
>
>> Here is a note from the PyPy project (mentioned earlier in this
>> thread, and at https://lwn.net/Articles/587923/ ).
>
> Your use is completely bogus. remap_file_pages() pins everything
> and disables any swapping for the area.

Wait, what's wrong with swapping pages from non-linear vmas?
try_to_unmap() can handle them, though not very efficiently.

Some time ago I was thinking about tracking rmap for non-linear vmas with
something like a second-level tree of sub-vmas stored in the non-linear vma.
This could be done using the existing vm_area_struct, and in the rmap tree
everything would look just like normal. We'd waste some kernel memory, but
it would also remove complexity from rmap and make non-linear vmas usable
for all filesystems, not just for shmem.

But it's not worth it. I ACK killing it.

Maybe we should keep a flag on the vma and hide/merge them in proc/maps.
Bloating files/dirs in proc might be a bigger problem than a non-existent
performance regression.

^ permalink raw reply	[flat|nested] 54+ messages in thread
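
The trade-off discussed above (one VMA per remapped range, and the resulting
growth of the process's maps listing) can be observed from user space. The
following is a rough sketch, assuming Linux: it builds a "reversed" view of
a file with plain mmap(MAP_FIXED) calls, which is one plausible user-space
equivalent of a nonlinear mapping and not the code from the patch. The
function names map_reversed() and count_vmas() are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map npages pages of fd so that page i of the returned window shows
 * file page (npages - 1 - i). Returns the window base or MAP_FAILED.
 */
void *map_reversed(int fd, size_t npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)psz;
	char *base;
	size_t i;

	/* Reserve a contiguous window with one ordinary linear mapping. */
	base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return MAP_FAILED;

	/*
	 * Overlay each page with its own mapping at a different file
	 * offset. Neighbouring pages have non-contiguous offsets, so the
	 * kernel is not expected to merge them into a single VMA.
	 */
	for (i = 0; i < npages; i++) {
		off_t off = (off_t)(npages - 1 - i) * psz;
		void *p = mmap(base + i * psz, (size_t)psz,
			       PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_FIXED, fd, off);

		if (p == MAP_FAILED) {
			munmap(base, len);
			return MAP_FAILED;
		}
	}
	return base;
}

/* Count lines in /proc/self/maps (one per VMA); -1 if it cannot be read. */
long count_vmas(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	long n = 0;
	int c;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			n++;
	fclose(f);
	return n;
}
```

Calling count_vmas() before and after map_reversed() on a file of a few
dozen pages should show the listing grow by roughly one line per page,
which is the kind of bloat the message above is concerned about.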

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-12  3:36     ` Andi Kleen
@ 2014-05-12  7:50       ` Armin Rigo
  -1 siblings, 0 replies; 54+ messages in thread
From: Armin Rigo @ 2014-05-12  7:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kirill A. Shutemov, Andrew Morton, Linus Torvalds, linux-kernel,
	linux-mm, peterz, mingo

Hi Andi,

On 12 May 2014 05:36, Andi Kleen <andi@firstfloor.org> wrote:
>> Here is a note from the PyPy project (mentioned earlier in this
>> thread, and at https://lwn.net/Articles/587923/ ).
>
> Your use is completely bogus. remap_file_pages() pins everything
> and disables any swapping for the area.

? No.  Trying this example: http://bpaste.net/show/fCUTnR9mDzJ2IEKrQLAR/

...really allocates 4GB of RAM, and on a 4GB machine it causes some
swapping.  It seems to work fine.  I'm not sure I understand you.
I'm also not sure that a property as essential as "disables swapping"
should be omitted from the man page; if so, that would be a real man
page bug.


A bientôt,

Armin.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-09 15:14           ` Linus Torvalds
@ 2014-05-12 12:43             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-12 12:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Armin Rigo, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Ingo Molnar

On Fri, May 09, 2014 at 08:14:08AM -0700, Linus Torvalds wrote:
> On Fri, May 9, 2014 at 7:05 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Hm. I'm confused here. Do we have any limit enforced per-user?
> 
> Sure we do. See "struct user_struct". We limit the max number of
> processes, open files, signals etc.
> 
> > I only see things like rlimits, which are copied from the parent.
> > Is that what you want?
> 
> No, rlimits are per process (although in some cases what they limit
> is counted per user despite the _limits_ of those resources then
> being settable per thread).
> 
> So I was just thinking that if we raise the per-mm default limits,
> maybe we should add a global per-user limit to make it harder for a
> user to use tons and tons of VMAs.

Here's the first attempt.

I'm not completely happy about current_user(). It means we rely on the
user of the mm owner task always being the same as the user of current.
I'm not sure that's always the case.

The other option is to make MM_OWNER always on and look up the proper user
through task_cred_xxx(rcu_dereference(mm->owner), user).

From 5ee6f6dd721ada8eb66c84a91003ac1e3eb2970a Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 12 May 2014 15:13:12 +0300
Subject: [PATCH] mm: add per-user limit on mapping count

We're going to increase the per-mm map_count limit. To avoid non-obvious
memory abuse through creating a lot of VMAs, let's introduce a per-user
limit.

The limit is implemented as a sysctl. For now the value of the limit is
pretty arbitrary -- 2^20.

sizeof(vm_area_struct) with my kernel config (DEBUG_KERNEL=n) is 184
bytes. It means that with this limit a user can use up to 184 MiB of RAM
in VMAs.

The limit does not apply to root (INIT_USER).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/unicore32/include/asm/mmu_context.h |  2 +-
 include/linux/sched.h                    | 27 +++++++++++++++++++++++++++
 include/linux/sched/sysctl.h             |  1 +
 kernel/fork.c                            |  3 ++-
 kernel/sysctl.c                          |  8 ++++++++
 mm/mmap.c                                | 17 +++++++++--------
 mm/mremap.c                              |  2 +-
 mm/nommu.c                               |  7 ++++---
 8 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/arch/unicore32/include/asm/mmu_context.h b/arch/unicore32/include/asm/mmu_context.h
index ef470a7a3d0f..f370d74339da 100644
--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -76,7 +76,7 @@ do { \
 			mm->mmap = NULL; \
 		rb_erase(&high_vma->vm_rb, &mm->mm_rb); \
 		vmacache_invalidate(mm); \
-		mm->map_count--; \
+		dec_map_count(mm); \
 		remove_vma(high_vma); \
 	} \
 } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c79f757..f9f12c503d14 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -56,6 +56,7 @@ struct sched_param {
 #include <linux/llist.h>
 #include <linux/uidgid.h>
 #include <linux/gfp.h>
+#include <linux/sched/sysctl.h>
 
 #include <asm/processor.h>
 
@@ -747,6 +748,7 @@ struct user_struct {
 	atomic_t processes;	/* How many processes does this user have? */
 	atomic_t files;		/* How many open files does this user have? */
 	atomic_t sigpending;	/* How many pending signals does this user have? */
+	atomic_t map_count;	/* How many mappings does this user have? */
 #ifdef CONFIG_INOTIFY_USER
 	atomic_t inotify_watches; /* How many inotify watches does this user have? */
 	atomic_t inotify_devs;	/* How many inotify devs does this user have opened? */
@@ -2991,4 +2993,29 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+static inline void inc_map_count(struct mm_struct *mm)
+{
+	mm->map_count++;
+	atomic_inc(&current_user()->map_count);
+}
+
+static inline void dec_map_count(struct mm_struct *mm)
+{
+	mm->map_count--;
+	atomic_dec(&current_user()->map_count);
+}
+
+static inline bool map_count_check(struct mm_struct *mm, int limit_offset)
+{
+	struct user_struct *user = current_user();
+	if (mm->map_count > sysctl_max_map_count + limit_offset)
+		return true;
+	if (user == INIT_USER)
+		return false;
+	if (atomic_read(&user->map_count) >
+			sysctl_max_map_count_per_user + limit_offset)
+		return true;
+	return false;
+}
+
 #endif
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 8045a554cafb..ce66c4697dbf 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -30,6 +30,7 @@ enum { sysctl_hung_task_timeout_secs = 0 };
 #define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
 
 extern int sysctl_max_map_count;
+extern long sysctl_max_map_count_per_user;
 
 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
diff --git a/kernel/fork.c b/kernel/fork.c
index 54a8d26f612f..8ea1c538c79e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -454,7 +454,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		rb_link = &tmp->vm_rb.rb_right;
 		rb_parent = &tmp->vm_rb;
 
-		mm->map_count++;
+		inc_map_count(mm);
 		retval = copy_page_range(mm, oldmm, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
@@ -600,6 +600,7 @@ void __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
+	atomic_sub(mm->map_count, &current_user()->map_count);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 74f5b580fe34..4efe2ed927f2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1316,6 +1316,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "max_map_count_per_user",
+		.data		= &sysctl_max_map_count_per_user,
+		.maxlen		= sizeof(sysctl_max_map_count_per_user),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
 #else
 	{
 		.procname	= "nr_trim_pages",
diff --git a/mm/mmap.c b/mm/mmap.c
index b1202cf81f4b..8e2d581347f6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -89,6 +89,7 @@ int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;  /* heuristic ove
 int sysctl_overcommit_ratio __read_mostly = 50;	/* default is 50% */
 unsigned long sysctl_overcommit_kbytes __read_mostly;
 int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+long sysctl_max_map_count_per_user __read_mostly = 1UL << 20;
 unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
 unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
 /*
@@ -652,7 +653,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 
-	mm->map_count++;
+	inc_map_count(mm);
 	validate_mm(mm);
 }
 
@@ -669,7 +670,7 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 			   &prev, &rb_link, &rb_parent))
 		BUG();
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
-	mm->map_count++;
+	inc_map_count(mm);
 }
 
 static inline void
@@ -865,7 +866,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
-		mm->map_count--;
+		dec_map_count(mm);
 		mpol_put(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
 		/*
@@ -1259,7 +1260,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
                return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (mm->map_count > sysctl_max_map_count)
+	if (map_count_check(mm, 0))
 		return -ENOMEM;
 
 	/* Obtain the address to map to. we verify (or select) it and ensure
@@ -2378,7 +2379,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	vma->vm_prev = NULL;
 	do {
 		vma_rb_erase(vma, &mm->mm_rb);
-		mm->map_count--;
+		dec_map_count(mm);
 		tail_vma = vma;
 		vma = vma->vm_next;
 	} while (vma && vma->vm_start < end);
@@ -2468,7 +2469,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	      unsigned long addr, int new_below)
 {
-	if (mm->map_count >= sysctl_max_map_count)
+	if (map_count_check(mm, -1))
 		return -ENOMEM;
 
 	return __split_vma(mm, vma, addr, new_below);
@@ -2517,7 +2518,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 		 * not exceed its limit; but let map_count go just above
 		 * its limit temporarily, to help free resources as expected.
 		 */
-		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+		if (end < vma->vm_end && map_count_check(mm, -1))
 			return -ENOMEM;
 
 		error = __split_vma(mm, vma, start, 0);
@@ -2637,7 +2638,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (mm->map_count > sysctl_max_map_count)
+	if (map_count_check(mm, 0))
 		return -ENOMEM;
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180e9f21..f0e34e87828d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -252,7 +252,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	 * We'd prefer to avoid failure later on in do_munmap:
 	 * which may split one vma into three before unmapping.
 	 */
-	if (mm->map_count >= sysctl_max_map_count - 3)
+	if (map_count_check(mm, -4))
 		return -ENOMEM;
 
 	/*
diff --git a/mm/nommu.c b/mm/nommu.c
index 85f8d6698d48..5b60bd88405c 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -64,6 +64,7 @@ int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 unsigned long sysctl_overcommit_kbytes __read_mostly;
 int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
+long sysctl_max_map_count_per_user __read_mostly = 1UL << 20;
 int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
 unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
 unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
@@ -710,7 +711,7 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 
 	BUG_ON(!vma->vm_region);
 
-	mm->map_count++;
+	inc_map_count(mm);
 	vma->vm_mm = mm;
 
 	protect_vma(vma, vma->vm_flags);
@@ -779,7 +780,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 
 	protect_vma(vma, 0);
 
-	mm->map_count--;
+	dec_map_count(mm);
 	for (i = 0; i < VMACACHE_SIZE; i++) {
 		/* if the vma is cached, invalidate the entire cache */
 		if (curr->vmacache[i] == vma) {
@@ -1554,7 +1555,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (vma->vm_file)
 		return -ENOMEM;
 
-	if (mm->map_count >= sysctl_max_map_count)
+	if (map_count_check(mm, -1))
 		return -ENOMEM;
 
 	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 54+ messages in thread
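
The two-level check introduced by the patch above can be modelled in user
space to see how the limit_offset argument reproduces the old comparisons.
This is only a sketch under assumptions: the struct and function names
(limits, user_model, mm_model, map_count_exceeded, vma_bytes) are invented
for the example, root is modelled with a plain flag instead of INIT_USER,
and no locking or atomics are involved.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two sysctl values. */
struct limits {
	long per_mm;	/* models sysctl_max_map_count */
	long per_user;	/* models sysctl_max_map_count_per_user */
};

struct user_model {
	long map_count;	/* mappings summed over all of the user's mms */
	bool is_root;	/* models the user == INIT_USER test */
};

struct mm_model {
	long map_count;	/* mappings in this address space */
	struct user_model *user;
};

/* Account one new mapping against both the mm and its user. */
void model_inc(struct mm_model *mm)
{
	mm->map_count++;
	mm->user->map_count++;
}

/* Drop one mapping from both counters. */
void model_dec(struct mm_model *mm)
{
	mm->map_count--;
	mm->user->map_count--;
}

/*
 * Mirrors the logic of map_count_check() in the patch: true means
 * "too many mappings". An offset of 0 gives "count > limit"; an offset
 * of -1 gives "count > limit - 1", which is "count >= limit".
 */
bool map_count_exceeded(const struct mm_model *mm, const struct limits *lim,
			long limit_offset)
{
	if (mm->map_count > lim->per_mm + limit_offset)
		return true;
	if (mm->user->is_root)
		return false;
	return mm->user->map_count > lim->per_user + limit_offset;
}

/* Worst-case bytes of VMA metadata for nr_vmas structures of vma_size. */
unsigned long long vma_bytes(unsigned long long nr_vmas,
			     unsigned long long vma_size)
{
	return nr_vmas * vma_size;
}
```

With the figures from the commit message, vma_bytes(1ULL << 20, 184) gives
192937984 bytes, which is exactly 184 MiB.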

* Re: [PATCHv2 0/2] remap_file_pages() decommission
  2014-05-12 12:43             ` Kirill A. Shutemov
@ 2014-05-12 14:59               ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 54+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-12 14:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Kirill A. Shutemov, Armin Rigo, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Ingo Molnar

On Mon, May 12, 2014 at 4:43 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Fri, May 09, 2014 at 08:14:08AM -0700, Linus Torvalds wrote:
>> On Fri, May 9, 2014 at 7:05 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> >
>> > Hm. I'm confused here. Do we have any limit enforced per-user?
>>
>> Sure we do. See "struct user_struct". We limit the max number of
>> processes, open files, signals etc.
>>
>> > I only see things like rlimits, which are copied from the parent.
>> > Is that what you want?
>>
>> No, rlimits are per process (although in some cases what they limit
>> is counted per user despite the _limits_ of those resources then
>> being settable per thread).
>>
>> So I was just thinking that if we raise the per-mm default limits,
>> maybe we should add a global per-user limit to make it harder for a
>> user to use tons and tons of VMAs.
>
> Here's the first attempt.
>
> I'm not completely happy about current_user(). It means we rely on the
> user of the mm owner task always being the same as the user of current.
> I'm not sure that's always the case.
>
> The other option is to make MM_OWNER always on and look up the proper user
> through task_cred_xxx(rcu_dereference(mm->owner), user).
>
> From 5ee6f6dd721ada8eb66c84a91003ac1e3eb2970a Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 12 May 2014 15:13:12 +0300
> Subject: [PATCH] mm: add per-user limit on mapping count
>
> We're going to increase the per-mm map_count limit. To avoid non-obvious
> memory abuse through creating a lot of VMAs, let's introduce a per-user
> limit.
>
> The limit is implemented as a sysctl. For now the value of the limit is
> pretty arbitrary -- 2^20.
>
> sizeof(vm_area_struct) with my kernel config (DEBUG_KERNEL=n) is 184
> bytes. It means that with this limit a user can use up to 184 MiB of RAM
> in VMAs.
>
> The limit does not apply to root (INIT_USER).

I don't like this.

Maybe we could just account VMAs into OOM-badness points and let
OOM-killer do its job?

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -170,7 +170,9 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
         * task's rss, pagetable and swap space use.
         */
        points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_ptes) +
-                get_mm_counter(p->mm, MM_SWAPENTS);
+                get_mm_counter(p->mm, MM_SWAPENTS) +
+                (long)p->mm->map_count *
+                       sizeof(struct vm_area_struct) / PAGE_SIZE;
        task_unlock(p);

        /*

^ permalink raw reply	[flat|nested] 54+ messages in thread
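
To get a feel for how much the term proposed above would add to a task's
badness score, the arithmetic can be tried in user space. This is a sketch
only: the function name vma_badness_pages() is made up, and the sizes used
in the usage note (184 bytes per vm_area_struct, taken from the figure
quoted earlier in the thread, and a 4096-byte page) are assumptions that
depend on kernel configuration and architecture.

```c
/*
 * Pages of VMA metadata that the proposed term would add to the
 * badness points, using the same integer arithmetic as the diff:
 * map_count * sizeof(struct vm_area_struct) / PAGE_SIZE.
 * The division truncates, so small VMA counts contribute nothing.
 */
long vma_badness_pages(long map_count, unsigned long vma_size,
		       unsigned long page_size)
{
	return (long)((unsigned long)map_count * vma_size / page_size);
}
```

Under those assumptions, a task at the current default per-mm limit of
65530 mappings would gain 2943 points, a task with 2^20 mappings would
gain 47104 points, and a task with 22 or fewer mappings would gain none
because of the truncating division.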

* Re: [PATCHv2 0/2] remap_file_pages() decommission
@ 2014-05-12 14:59               ` Konstantin Khlebnikov
  0 siblings, 0 replies; 54+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-12 14:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Kirill A. Shutemov, Armin Rigo, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Ingo Molnar

On Mon, May 12, 2014 at 4:43 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Fri, May 09, 2014 at 08:14:08AM -0700, Linus Torvalds wrote:
>> On Fri, May 9, 2014 at 7:05 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> >
>> > Hm. I'm confused here. Do we have any limit forced per-user?
>>
>> Sure we do. See "struct user_struct". We limit max number of
>> processes, open files, signals etc.
>>
>> > I only see things like rlimits which are copied from parrent.
>> > Is it what you want?
>>
>> No, rlimits are per process (although in some cases what they limit
>> are counted per user despite the _limits_ of those resources then
>> being settable per thread).
>>
>> So I was just thinking that if we raise the per-mm default limits,
>> maybe we should add a global per-user limit to make it harder for a
>> user to use tons and toms of vma's.
>
> Here's the first attempt.
>
> I'm not completely happy about current_user(). It means we rely on that
> user of mm owner task is always equal to user of current. Not sure if it's
> always the case.
>
> Other option is to make MM_OWNER is always on and lookup proper user
> through task_cred_xxx(rcu_dereference(mm->owner), user).
>
> From 5ee6f6dd721ada8eb66c84a91003ac1e3eb2970a Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 12 May 2014 15:13:12 +0300
> Subject: [PATCH] mm: add per-user limit on mapping count
>
> We're going to increase per-mm map_count. To avoid non-obvious memory
> abuse by creating a lot of VMA's, let's introduce per-user limit.
>
> The limit is implemented as sysctl. For now value of limit is pretty
> arbitrary -- 2^20.
>
> sizeof(vm_area_struct) with my kernel config (DEBUG_KERNEL=n) is 184
> bytes. It means with the limit user can use up to 184 MiB of RAM in
> VMAs.
>
> The limit is not applicable for root (INIT_USER).

I don't like this.

Maybe we could just account VMAs into OOM-badness points and let
OOM-killer do its job?

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -170,7 +170,9 @@ unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg,
         * task's rss, pagetable and swap space use.
         */
        points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_ptes) +
-                get_mm_counter(p->mm, MM_SWAPENTS);
+                get_mm_counter(p->mm, MM_SWAPENTS) +
+                (long)p->mm->map_count *
+                       sizeof(struct vm_area_struct) / PAGE_SIZE;
        task_unlock(p);

        /*
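
For scale, the arithmetic behind both proposals can be sketched in userspace.
This is only an illustration, not kernel code: the 184-byte VMA size is the
figure from the commit message quoted above, the 4 KiB page size is an
assumption, and the helper names vma_bytes() and vma_badness_pages() are
hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Figure quoted in the commit message (DEBUG_KERNEL=n); config dependent. */
#define VMA_SIZE_BYTES    184UL
/* Assumption: 4 KiB pages, as on a typical x86-64 configuration. */
#define ASSUMED_PAGE_SIZE 4096UL

/* Kernel memory consumed by map_count VMAs, in bytes. */
static unsigned long vma_bytes(unsigned long map_count)
{
	/* Guard against overflow of the multiplication below. */
	assert(map_count <= ULONG_MAX / VMA_SIZE_BYTES);
	return map_count * VMA_SIZE_BYTES;
}

/*
 * Pages that the VMAs would add to the badness score. Multiply first,
 * then divide, so the result rounds down like the expression in the diff.
 */
static unsigned long vma_badness_pages(unsigned long map_count)
{
	return vma_bytes(map_count) / ASSUMED_PAGE_SIZE;
}
```

Under these assumptions, 2^20 VMAs occupy 184 MiB and would add 47104 pages
to the score, while a process with only a handful of VMAs adds nothing
because the integer division rounds down.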

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-08 21:57     ` Andrew Morton
@ 2014-05-12 15:11       ` Sasha Levin
  -1 siblings, 0 replies; 54+ messages in thread
From: Sasha Levin @ 2014-05-12 15:11 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Linus Torvalds, linux-kernel, linux-mm, peterz, mingo

On 05/08/2014 05:57 PM, Andrew Morton wrote:
> On Thu,  8 May 2014 15:41:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
>> > remap_file_pages(2) was invented to efficiently map parts of a huge
>> > file into a limited 32-bit virtual address space, such as in database
>> > workloads.
>> > 
>> > Nonlinear mappings are a pain to support and it seems there are no
>> > legitimate use cases nowadays since 64-bit systems are widely available.
>> > 
>> > Let's drop it and get rid of all this special-cased code.
>> > 
>> > The patch replaces the syscall with an emulation which creates a new VMA
>> > on each remap_file_pages() call, unless it can be merged with an adjacent
>> > one.
>> > 
>> > I didn't find *any* real code that uses remap_file_pages(2) to test the
>> > emulation's impact on. I've checked Debian code search and the source of
>> > all packages in ALT Linux. No real users: libc wrappers, mentions in
>> > strace, gdb, valgrind and this kind of stuff.
>> > 
>> > There are a few basic tests in LTP for the syscall. They work just fine
>> > with the emulation.
>> > 
>> > To test the performance impact, I've written a small test case which
>> > demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
>> > write each page's pgoff to the beginning of that page, remap the pages in
>> > reverse order, read every page.
>> > 
>> > The test creates 1 million VMAs if the emulation is in use, so I had to
>> > set vm.max_map_count to 1100000 to avoid -ENOMEM.
>> > 
>> > Before:		23.3 ( +-  4.31% ) seconds
>> > After:		43.9 ( +-  0.85% ) seconds
>> > Slowdown:	1.88x
>> > 
>> > I believe we can live with that.
>> > 
> There's still all the special-case goop around the place to be cleaned
> up - VM_NONLINEAR is a decent search term.  As is "grep nonlinear
> mm/*.c".  And although this cleanup is the main reason for the
> patchset, let's not do it now - we can do all that if/after this patch
> gets merged.
> 
> I'll queue the patches for some linux-next exposure and shall send
> [1/2] Linuswards for 3.16 if nothing terrible happens.  Once we've
> sorted out the too-many-vmas issue we'll need to work out when to merge
> [2/2].

It seems that since no one is really using it, it's also impossible to
properly test it. I've sent a fix that deals with panics in error paths
that are very easy to trigger, but I'm worried that there are a lot more
of those hiding over there.

Since we can't find any actual users, testing suites are very incomplete
w.r.t this syscall, and the amount of work required to "remove" it is
non-trivial, can we just kill this syscall off?

It sounds to me like a better option than to ship a new, buggy and possibly
security dangerous version which we can't even test.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-12 15:11       ` Sasha Levin
@ 2014-05-12 17:05         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-12 17:05 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrew Morton, Kirill A. Shutemov, Linus Torvalds, linux-kernel,
	linux-mm, peterz, mingo

On Mon, May 12, 2014 at 11:11:48AM -0400, Sasha Levin wrote:
> On 05/08/2014 05:57 PM, Andrew Morton wrote:
> > On Thu,  8 May 2014 15:41:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > 
> >> > remap_file_pages(2) was invented to efficiently map parts of a huge
> >> > file into a limited 32-bit virtual address space, such as in database
> >> > workloads.
> >> > 
> >> > Nonlinear mappings are a pain to support and it seems there are no
> >> > legitimate use cases nowadays since 64-bit systems are widely available.
> >> > 
> >> > Let's drop it and get rid of all this special-cased code.
> >> > 
> >> > The patch replaces the syscall with an emulation which creates a new VMA
> >> > on each remap_file_pages() call, unless it can be merged with an adjacent
> >> > one.
> >> > 
> >> > I didn't find *any* real code that uses remap_file_pages(2) to test the
> >> > emulation's impact on. I've checked Debian code search and the source of
> >> > all packages in ALT Linux. No real users: libc wrappers, mentions in
> >> > strace, gdb, valgrind and this kind of stuff.
> >> > 
> >> > There are a few basic tests in LTP for the syscall. They work just fine
> >> > with the emulation.
> >> > 
> >> > To test the performance impact, I've written a small test case which
> >> > demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
> >> > write each page's pgoff to the beginning of that page, remap the pages in
> >> > reverse order, read every page.
> >> > 
> >> > The test creates 1 million VMAs if the emulation is in use, so I had to
> >> > set vm.max_map_count to 1100000 to avoid -ENOMEM.
> >> > 
> >> > Before:		23.3 ( +-  4.31% ) seconds
> >> > After:		43.9 ( +-  0.85% ) seconds
> >> > Slowdown:	1.88x
> >> > 
> >> > I believe we can live with that.
> >> > 
> > There's still all the special-case goop around the place to be cleaned
> > up - VM_NONLINEAR is a decent search term.  As is "grep nonlinear
> > mm/*.c".  And although this cleanup is the main reason for the
> > patchset, let's not do it now - we can do all that if/after this patch
> > gets merged.
> > 
> > I'll queue the patches for some linux-next exposure and shall send
> > [1/2] Linuswards for 3.16 if nothing terrible happens.  Once we've
> > sorted out the too-many-vmas issue we'll need to work out when to merge
> > [2/2].
> 
> It seems that since no one is really using it, it's also impossible to
> properly test it. I've sent a fix that deals with panics in error paths
> that are very easy to trigger, but I'm worried that there are a lot more
> of those hiding over there.

Sorry for that.

> Since we can't find any actual users, testing suites are very incomplete
> w.r.t this syscall, and the amount of work required to "remove" it is
> non-trivial, can we just kill this syscall off?
> 
> It sounds to me like a better option than to ship a new, buggy and possibly
> security dangerous version which we can't even test.

Taking into account your employment, is it possible to check how the RDBMS
(old but still supported 32-bit versions) would react to -ENOSYS here?

I would like to get rid of it completely, but I thought it's not an option
for compatibility reasons.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-12 15:11       ` Sasha Levin
@ 2014-05-13  7:32         ` Armin Rigo
  -1 siblings, 0 replies; 54+ messages in thread
From: Armin Rigo @ 2014-05-13  7:32 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrew Morton, Kirill A. Shutemov, Linus Torvalds, linux-kernel,
	linux-mm, peterz, mingo

Hi Sasha,

On 12 May 2014 17:11, Sasha Levin <sasha.levin@oracle.com> wrote:
> Since we can't find any actual users,

The PyPy project doesn't count as an "actual user"?  It's not just an
idea in the air.  It's beta code that is already released (and open
source):

http://morepypy.blogspot.ch/2014/04/stm-results-and-second-call-for.html

The core library is available from there (see the test suite in c7/test/):

https://bitbucket.org/pypy/stmgc

I already reacted to the discussion here by making remap_file_pages()
optional (#undef USE_REMAP_FILE_PAGES) but didn't measure the
performance impact of this, if any (I expect it to be reasonable).
Still, if you're looking for a real piece of code using
remap_file_pages(), it's one.
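
A build-time switch of the kind described above could look roughly like the
sketch below. It is not taken from stmgc: only the macro name
USE_REMAP_FILE_PAGES comes from this message, while map_file_page() and
reverse_remap_check() are hypothetical helpers. With the macro undefined,
each page is remapped with mmap(MAP_FIXED), which is, as far as I understand
it, close to what the proposed emulation does (one VMA per remapped page
unless it can be merged). The check reuses the reverse-order pattern of the
test case described in the commit message.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical build switch; left undefined, so plain mmap() is used. */
/* #define USE_REMAP_FILE_PAGES */

/* Make the page at addr show file page pgoff of fd. Returns 0 or -1. */
static int map_file_page(void *addr, size_t page_size, int fd, size_t pgoff)
{
#ifdef USE_REMAP_FILE_PAGES
	(void)fd;
	/* prot must be 0; addr must lie inside an existing MAP_SHARED mapping. */
	return remap_file_pages(addr, page_size, 0, pgoff, 0);
#else
	void *p = mmap(addr, page_size, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_FIXED, fd, (off_t)(pgoff * page_size));
	return p == MAP_FAILED ? -1 : 0;
#endif
}

/*
 * Map npages of a temporary file, tag each page with its index, remap the
 * window in reverse order and verify the result. Returns 0 on success.
 */
static int reverse_remap_check(size_t npages)
{
	char path[] = "/tmp/remap-demo-XXXXXX";
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * page_size;
	size_t i;
	char *base;
	int ret = -1;
	int fd;

	assert(npages > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives on until fd is closed */
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}
	base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED) {
		close(fd);
		return -1;
	}
	/* Write each page's offset (in pages) at the beginning of that page. */
	for (i = 0; i < npages; i++)
		*(size_t *)(base + i * page_size) = i;
	/* Remap: window page i now shows file page npages - 1 - i. */
	for (i = 0; i < npages; i++)
		if (map_file_page(base + i * page_size, page_size, fd,
				  npages - 1 - i) != 0)
			goto out;
	/* Read every page back and check it is the expected one. */
	for (i = 0; i < npages; i++)
		if (*(size_t *)(base + i * page_size) != npages - 1 - i)
			goto out;
	ret = 0;
out:
	munmap(base, len);
	close(fd);
	return ret;
}
```

Calling reverse_remap_check(8) should return 0 on either path. Only the
mmap() path was considered when writing this sketch; the remap_file_pages()
branch depends on the running kernel still providing the syscall.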


A bientôt,

Armin.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-13  7:32         ` Armin Rigo
@ 2014-05-13 12:57           ` Sasha Levin
  -1 siblings, 0 replies; 54+ messages in thread
From: Sasha Levin @ 2014-05-13 12:57 UTC (permalink / raw)
  To: Armin Rigo
  Cc: Andrew Morton, Kirill A. Shutemov, Linus Torvalds, linux-kernel,
	linux-mm, peterz, mingo

On 05/13/2014 03:32 AM, Armin Rigo wrote:
> Hi Sasha,
> 
> On 12 May 2014 17:11, Sasha Levin <sasha.levin@oracle.com> wrote:
>> Since we can't find any actual users,
> 
> The PyPy project doesn't count as an "actual user"?  It's not just an
> idea in the air.  It's beta code that is already released (and open
> source):
> 
> http://morepypy.blogspot.ch/2014/04/stm-results-and-second-call-for.html
> 
> The core library is available from there (see the test suite in c7/test/):
> 
> https://bitbucket.org/pypy/stmgc
> 
> I already reacted to the discussion here by making remap_file_pages()
> optional (#undef USE_REMAP_FILE_PAGES) but didn't measure the
> performance impact of this, if any (I expect it to be reasonable).
> Still, if you're looking for a real piece of code using
> remap_file_pages(), it's one.

Oh, I don't have anything against PyPy, I just wasn't aware it used
remap_file_pages() (I think I've missed the discussion in the parallel
thread).

Indeed it is a user. Have you tried it with a kernel that is running
Kirill's patch set to replace remap_file_pages()?


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-12 17:05         ` Kirill A. Shutemov
@ 2014-05-14 20:52           ` Sasha Levin
  -1 siblings, 0 replies; 54+ messages in thread
From: Sasha Levin @ 2014-05-14 20:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Kirill A. Shutemov, Linus Torvalds, linux-kernel,
	linux-mm, peterz, mingo

On 05/12/2014 01:05 PM, Kirill A. Shutemov wrote:
> On Mon, May 12, 2014 at 11:11:48AM -0400, Sasha Levin wrote:
>> On 05/08/2014 05:57 PM, Andrew Morton wrote:
>>> On Thu,  8 May 2014 15:41:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>>>
>>>>> remap_file_pages(2) was invented to efficiently map parts of a huge
>>>>> file into a limited 32-bit virtual address space, such as in database
>>>>> workloads.
>>>>>
>>>>> Nonlinear mappings are a pain to support and it seems there are no
>>>>> legitimate use cases nowadays since 64-bit systems are widely available.
>>>>>
>>>>> Let's drop it and get rid of all this special-cased code.
>>>>>
>>>>> The patch replaces the syscall with an emulation which creates a new VMA
>>>>> on each remap_file_pages() call, unless it can be merged with an adjacent
>>>>> one.
>>>>>
>>>>> I didn't find *any* real code that uses remap_file_pages(2) to test the
>>>>> emulation's impact on. I've checked Debian code search and the source of
>>>>> all packages in ALT Linux. No real users: libc wrappers, mentions in
>>>>> strace, gdb, valgrind and this kind of stuff.
>>>>>
>>>>> There are a few basic tests in LTP for the syscall. They work just fine
>>>>> with the emulation.
>>>>>
>>>>> To test the performance impact, I've written a small test case which
>>>>> demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
>>>>> write each page's pgoff to the beginning of that page, remap the pages in
>>>>> reverse order, read every page.
>>>>>
>>>>> The test creates 1 million VMAs if the emulation is in use, so I had to
>>>>> set vm.max_map_count to 1100000 to avoid -ENOMEM.
>>>>>
>>>>> Before:		23.3 ( +-  4.31% ) seconds
>>>>> After:		43.9 ( +-  0.85% ) seconds
>>>>> Slowdown:	1.88x
>>>>>
>>>>> I believe we can live with that.
>>>>>
>>> There's still all the special-case goop around the place to be cleaned
>>> up - VM_NONLINEAR is a decent search term.  As is "grep nonlinear
>>> mm/*.c".  And although this cleanup is the main reason for the
>>> patchset, let's not do it now - we can do all that if/after this patch
>>> gets merged.
>>>
>>> I'll queue the patches for some linux-next exposure and shall send
>>> [1/2] Linuswards for 3.16 if nothing terrible happens.  Once we've
>>> sorted out the too-many-vmas issue we'll need to work out when to merge
>>> [2/2].
>>
>> It seems that since no one is really using it, it's also impossible to
>> properly test it. I've sent a fix that deals with panics in error paths
>> that are very easy to trigger, but I'm worried that there are a lot more
>> of those hiding over there.
> 
> Sorry for that.
> 
>> Since we can't find any actual users, testing suites are very incomplete
>> w.r.t this syscall, and the amount of work required to "remove" it is
>> non-trivial, can we just kill this syscall off?
>>
>> It sounds to me like a better option than to ship a new, buggy and possibly
>> security dangerous version which we can't even test.
> 
> Taking into account your employment, is it possible to check how the RDBMS
> (old but still supported 32-bit versions) would react to -ENOSYS here?

Alrighty, I got an answer:

1. remap_file_pages() only works when the "VLM" feature of the db is enabled,
so those databases can work just fine without it, but be limited to 3-4GB of
memory. This is not needed at all on 64bit machines.

2. As of OL7 (kernel 3.8), there will not be a 32bit kernel build. I'm still
waiting for an answer on whether they will do a 32bit DB build for a 64bit
kernel, but that has never happened before and seems unlikely.

3. They're basically saying that by the time upstream releases a kernel without
remap_file_pages() no one will need it here.

To sum it up, they're fine with removing remap_file_pages().


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-14 20:52           ` Sasha Levin
@ 2014-05-14 21:17             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-05-14 21:17 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Sasha Levin, Kirill A. Shutemov, linux-kernel, linux-mm, peterz, mingo

On Wed, May 14, 2014 at 04:52:17PM -0400, Sasha Levin wrote:
> On 05/12/2014 01:05 PM, Kirill A. Shutemov wrote:
> > Taking into account your employment, is it possible to check how the RDBMS
> > (old but still supported 32-bit versions) would react to -ENOSYS here?
> 
> Alrighty, I got an answer:
> 
> 1. remap_file_pages() only works when the "VLM" feature of the db is enabled,
> so those databases can work just fine without it, but be limited to 3-4GB of
> memory. This is not needed at all on 64bit machines.

Okay. And it seems the user needs to enable it manually with the option
USE_INDIRECT_DATA_BUFFERS=TRUE.

http://docs.oracle.com/cd/B28359_01/server.111/b32009/appi_vlm.htm

> 2. As of OL7 (kernel 3.8), there will not be a 32bit kernel build. I'm still
> waiting for an answer on whether they will do a 32bit DB build for a 64bit
> kernel, but that has never happened before and seems unlikely.
> 
> 3. They're basically saying that by the time upstream releases a kernel without
> remap_file_pages() no one will need it here.
> 
> To sum it up, they're fine with removing remap_file_pages().

Andrew, Linus, what will we do here: live with the emulation or just kill the
syscall? Or maybe kill the syscall after a few releases with the emulation?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation
  2014-05-14 21:17             ` Kirill A. Shutemov
@ 2014-05-14 21:40               ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2014-05-14 21:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Sasha Levin, Kirill A. Shutemov, linux-kernel,
	linux-mm, peterz, mingo

On Thu, 15 May 2014 00:17:48 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Wed, May 14, 2014 at 04:52:17PM -0400, Sasha Levin wrote:
> > On 05/12/2014 01:05 PM, Kirill A. Shutemov wrote:
> > > Taking into account your employment, is it possible to check how the RDBMS
> > > (old but still supported 32-bit versions) would react to -ENOSYS here?
> > 
> > Alrighty, I got an answer:
> > 
> > 1. remap_file_pages() only works when the "VLM" feature of the db is enabled,
> > so those databases can work just fine without it, but be limited to 3-4GB of
> > memory. This is not needed at all on 64bit machines.
> 
> Okay. And it seems the user needs to enable it manually with the option
> USE_INDIRECT_DATA_BUFFERS=TRUE.
> 
> http://docs.oracle.com/cd/B28359_01/server.111/b32009/appi_vlm.htm
> 
> > 2. As of OL7 (kernel 3.8), there will not be a 32bit kernel build. I'm still
> > waiting for an answer on whether they will do a 32bit DB build for a 64bit kernel,
> > but that never happened before and seems unlikely.
> > 
> > 3. They're basically saying that by the time upstream releases a kernel without
> > remap_file_pages() no one will need it here.
> > 
> > To sum it up, they're fine with removing remap_file_pages().
> 
> Andrew, Linus, what should we do here: live with the emulation or just kill
> the syscall? Or maybe kill the syscall after a few releases with the emulation?

Well we can put the printk in there initially to gather more
information.

If it appears necessary then we can include the emulation, but retain
the "this-is-going-away" printk then remove the emulation later on.
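
For readers wondering what "reacting to -ENOSYS" could look like on the
application side, here is a minimal userspace sketch. It is not taken from
the RDBMS discussed above or from either patch. The helper names
remap_one_page() and remap_demo() are hypothetical, and falling back on any
failure (rather than on ENOSYS alone) is a simplifying assumption so that
the sketch also copes with sandboxes that refuse the syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: make the page at `addr` (inside an existing
 * MAP_SHARED mapping of `fd`) show file page `pgoff`.
 * Tries remap_file_pages() first. If that fails (a kernel without the
 * syscall would report ENOSYS), falls back to a MAP_FIXED mmap(), which
 * either succeeds or reports the real error.
 * Returns 0 on success, -1 on failure.
 */
int remap_one_page(void *addr, int fd, size_t pgoff, int prot)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* The prot argument of remap_file_pages() must be 0. */
    if (remap_file_pages(addr, (size_t)page, 0, pgoff, 0) == 0)
        return 0;

    void *p = mmap(addr, (size_t)page, prot, MAP_SHARED | MAP_FIXED,
                   fd, (off_t)pgoff * page);
    return p == MAP_FAILED ? -1 : 0;
}

/*
 * Demo: a two-page temporary file ('A' page, 'B' page) is mapped linearly,
 * then window page 0 is redirected to file page 1.
 * Returns 0 if both window pages then show 'B'.
 */
int remap_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *tmp = tmpfile();
    if (!tmp)
        return -1;
    int fd = fileno(tmp);
    if (ftruncate(fd, 2 * page) != 0) {
        fclose(tmp);
        return -1;
    }

    char *win = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) {
        fclose(tmp);
        return -1;
    }
    memset(win, 'A', (size_t)page);         /* file page 0 */
    memset(win + page, 'B', (size_t)page);  /* file page 1 */

    int rc = remap_one_page(win, fd, 1, PROT_READ | PROT_WRITE);
    if (rc == 0 && (win[0] != 'B' || win[page] != 'B'))
        rc = -1;

    munmap(win, 2 * (size_t)page);
    fclose(tmp);
    return rc;
}
```

Calling remap_demo() from a main() is enough to try it. On a kernel that
carries the proposed printk, the first call would presumably also leave the
deprecation message in the kernel log.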


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/2] mm: mark remap_file_pages() syscall as deprecated
@ 2014-06-12  5:48     ` Michael Kerrisk
  0 siblings, 0 replies; 54+ messages in thread
From: Michael Kerrisk @ 2014-06-12  5:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel, linux-mm,
	Peter Zijlstra, Ingo Molnar, Linux API, Michael Kerrisk-manpages

Hi Kirill,

On Thu, May 8, 2014 at 2:41 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> The remap_file_pages() system call is used to create a nonlinear mapping,
> that is, a mapping in which the pages of the file are mapped into a
> nonsequential order in memory. The advantage of using remap_file_pages()
> over using repeated calls to mmap(2) is that the former approach does not
> require the kernel to create additional VMA (Virtual Memory Area) data
> structures.
>
> Supporting nonlinear mappings requires a significant amount of non-trivial
> code in the kernel virtual memory subsystem, including hot paths. Also, to
> make nonlinear mappings work, the kernel needs a way to distinguish normal
> page table entries from entries with a file offset (pte_file). The kernel
> reserves a PTE flag for this purpose. PTE flags are scarce, especially on
> some CPU architectures. It would be nice to free up the flag for other uses.
>
> Fortunately, there are not many users of remap_file_pages() in the wild.
> The only known user is one enterprise RDBMS implementation, which uses the
> syscall on 32-bit systems to map files bigger than can linearly fit into the
> 32-bit virtual address space. This use case is no longer critical since
> 64-bit systems are widely available.
>
> The plan is to deprecate the syscall and replace it with an emulation.
> The emulation will create new VMAs instead of nonlinear mappings. It will
> be slower for the rare users of remap_file_pages(), but the ABI is
> preserved.
>
> One side effect of the emulation (apart from performance) is that a user can
> hit the vm.max_map_count limit more easily due to the additional VMAs. See
> the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.

Best to CC linux-api@
(https://www.kernel.org/doc/man-pages/linux-api-ml.html) on patches
like this, as well as the man-pages maintainer, so that something goes
into the man page. I added the following into the man page:

       Note: this system call is (since Linux 3.16) deprecated and will
       eventually be replaced by a slower in-kernel emulation. Those few
       applications that use this system call should consider migrating
       to alternatives.

Okay?

Cheers,

Michael
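
As an illustration of one such alternative, the sketch below builds a
"nonlinear" window out of repeated mmap(2) calls with MAP_FIXED, which is
the userspace counterpart of the "new VMAs instead of nonlinear mappings"
approach described in the commit message. It is an assumption-laden example
and not code from the patches: map_pages_in_order(), count_vmas() and
vma_demo() are hypothetical names, and counting lines of /proc/self/maps is
only a rough proxy for the number of VMAs that vm.max_map_count limits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: build a contiguous window in which window page i
 * shows file page order[i]. Returns the window base or MAP_FAILED.
 */
void *map_pages_in_order(int fd, const size_t *order, size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = npages * (size_t)page;

    /* Reserve address space first so MAP_FIXED cannot clobber other mappings. */
    char *base = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    /* One mmap() per page: non-adjacent file offsets end up in separate VMAs. */
    for (size_t i = 0; i < npages; i++) {
        void *p = mmap(base + i * (size_t)page, (size_t)page,
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                       fd, (off_t)order[i] * page);
        if (p == MAP_FAILED) {
            munmap(base, len);
            return MAP_FAILED;
        }
    }
    return base;
}

/* Rough VMA count: number of lines in /proc/self/maps, or -1 if unreadable. */
long count_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;
    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}

/* Demo: map a 4-page temporary file in reverse page order. Returns 0 on success. */
int vma_demo(void)
{
    enum { NPAGES = 4 };
    const size_t order[NPAGES] = { 3, 2, 1, 0 };
    long page = sysconf(_SC_PAGESIZE);

    FILE *tmp = tmpfile();
    if (!tmp)
        return -1;
    int fd = fileno(tmp);
    int rc = ftruncate(fd, NPAGES * page);

    /* Tag the first byte of each file page with its own index. */
    for (size_t i = 0; rc == 0 && i < NPAGES; i++) {
        char tag = (char)('0' + i);
        if (pwrite(fd, &tag, 1, (off_t)i * page) != 1)
            rc = -1;
    }

    if (rc == 0) {
        long before = count_vmas();
        char *win = map_pages_in_order(fd, order, NPAGES);
        if (win == MAP_FAILED) {
            rc = -1;
        } else {
            printf("VMAs before: %ld, after: %ld\n", before, count_vmas());
            for (size_t i = 0; i < NPAGES; i++)
                if (win[i * (size_t)page] != (char)('0' + order[i]))
                    rc = -1;
            munmap(win, NPAGES * (size_t)page);
        }
    }
    fclose(tmp);
    return rc;
}
```

Calling vma_demo() from a main() prints the two counts. Because each
out-of-order page needs its own mapping, a large nonlinear window built
this way consumes many more VMAs than a single mapping would, which is the
vm.max_map_count side effect mentioned in the commit message.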


> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  Documentation/vm/remap_file_pages.txt | 28 ++++++++++++++++++++++++++++
>  mm/fremap.c                           |  4 ++++
>  2 files changed, 32 insertions(+)
>  create mode 100644 Documentation/vm/remap_file_pages.txt
>
> diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
> new file mode 100644
> index 000000000000..560e4363a55d
> --- /dev/null
> +++ b/Documentation/vm/remap_file_pages.txt
> @@ -0,0 +1,28 @@
> +The remap_file_pages() system call is used to create a nonlinear mapping,
> +that is, a mapping in which the pages of the file are mapped into a
> +nonsequential order in memory. The advantage of using remap_file_pages()
> +over using repeated calls to mmap(2) is that the former approach does not
> +require the kernel to create additional VMA (Virtual Memory Area) data
> +structures.
> +
> +Supporting nonlinear mappings requires a significant amount of non-trivial
> +code in the kernel virtual memory subsystem, including hot paths. Also, to
> +make nonlinear mappings work, the kernel needs a way to distinguish normal
> +page table entries from entries with a file offset (pte_file). The kernel
> +reserves a PTE flag for this purpose. PTE flags are scarce, especially on
> +some CPU architectures. It would be nice to free up the flag for other uses.
> +
> +Fortunately, there are not many users of remap_file_pages() in the wild.
> +The only known user is one enterprise RDBMS implementation, which uses the
> +syscall on 32-bit systems to map files bigger than can linearly fit into the
> +32-bit virtual address space. This use case is no longer critical since
> +64-bit systems are widely available.
> +
> +The plan is to deprecate the syscall and replace it with an emulation.
> +The emulation will create new VMAs instead of nonlinear mappings. It will
> +be slower for the rare users of remap_file_pages(), but the ABI is
> +preserved.
> +
> +One side effect of the emulation (apart from performance) is that a user can
> +hit the vm.max_map_count limit more easily due to the additional VMAs. See
> +the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.
> diff --git a/mm/fremap.c b/mm/fremap.c
> index 34feba60a17e..12c3bb63b7f9 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -152,6 +152,10 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
>         int has_write_lock = 0;
>         vm_flags_t vm_flags = 0;
>
> +       pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
> +                       "See Documentation/vm/remap_file_pages.txt.\n",
> +                       current->comm, current->pid);
> +
>         if (prot)
>                 return err;
>         /*
> --
> 2.0.0.rc2
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/2] mm: mark remap_file_pages() syscall as deprecated
@ 2014-06-12  9:40       ` Kirill A. Shutemov
  0 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2014-06-12  9:40 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Kirill A. Shutemov, Andrew Morton, Linus Torvalds, Linux Kernel,
	linux-mm, Peter Zijlstra, Ingo Molnar, Linux API,
	Michael Kerrisk-manpages

Michael Kerrisk wrote:
> Hi Kirill,
> 
> On Thu, May 8, 2014 at 2:41 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > The remap_file_pages() system call is used to create a nonlinear mapping,
> > that is, a mapping in which the pages of the file are mapped into a
> > nonsequential order in memory. The advantage of using remap_file_pages()
> > over using repeated calls to mmap(2) is that the former approach does not
> > require the kernel to create additional VMA (Virtual Memory Area) data
> > structures.
> >
> > Supporting nonlinear mappings requires a significant amount of non-trivial
> > code in the kernel virtual memory subsystem, including hot paths. Also, to
> > make nonlinear mappings work, the kernel needs a way to distinguish normal
> > page table entries from entries with a file offset (pte_file). The kernel
> > reserves a PTE flag for this purpose. PTE flags are scarce, especially on
> > some CPU architectures. It would be nice to free up the flag for other uses.
> > >
> > Fortunately, there are not many users of remap_file_pages() in the wild.
> > The only known user is one enterprise RDBMS implementation, which uses the
> > syscall on 32-bit systems to map files bigger than can linearly fit into the
> > 32-bit virtual address space. This use case is no longer critical since
> > 64-bit systems are widely available.
> > >
> > The plan is to deprecate the syscall and replace it with an emulation.
> > The emulation will create new VMAs instead of nonlinear mappings. It will
> > be slower for the rare users of remap_file_pages(), but the ABI is
> > preserved.
> > >
> > One side effect of the emulation (apart from performance) is that a user can
> > hit the vm.max_map_count limit more easily due to the additional VMAs. See
> > the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.
> 
> Best to CC linux-api@
> (https://www.kernel.org/doc/man-pages/linux-api-ml.html) on patches
> like this, as well as the man-pages maintainer, so that something goes
> into the man page. I added the following into the man page:
> 
>        Note: this system call is (since Linux 3.16) deprecated and will
>        eventually be replaced by a slower in-kernel emulation. Those few
>        applications that use this system call should consider migrating
>        to alternatives.
> 
> Okay?

Yep. Looks okay to me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/2] mm: mark remap_file_pages() syscall as deprecated
  2014-06-12  9:40       ` Kirill A. Shutemov
@ 2014-06-12  9:44         ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 54+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-06-12  9:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel, linux-mm,
	Peter Zijlstra, Ingo Molnar, Linux API

Hi Kirill,

On Thu, Jun 12, 2014 at 11:40 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Michael Kerrisk wrote:
>> Hi Kirill,
>>
>> On Thu, May 8, 2014 at 2:41 PM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > The remap_file_pages() system call is used to create a nonlinear mapping,
>> > that is, a mapping in which the pages of the file are mapped into a
>> > nonsequential order in memory. The advantage of using remap_file_pages()
>> > over using repeated calls to mmap(2) is that the former approach does not
>> > require the kernel to create additional VMA (Virtual Memory Area) data
>> > structures.
>> >
>> > Supporting nonlinear mappings requires a significant amount of
>> > non-trivial code in the kernel virtual memory subsystem, including hot
>> > paths. Also, to make nonlinear mappings work, the kernel needs a way to
>> > distinguish normal page table entries from entries with a file offset
>> > (pte_file). The kernel reserves a flag in the PTE for this purpose. PTE
>> > flags are a scarce resource, especially on some CPU architectures. It
>> > would be nice to free up the flag for other uses.
>> >
>> > Fortunately, there are not many users of remap_file_pages() in the wild.
>> > Only one enterprise RDBMS implementation is known to use the syscall, on
>> > 32-bit systems, to map files bigger than can fit linearly into a 32-bit
>> > virtual address space. This use case is no longer critical since 64-bit
>> > systems are widely available.
>> >
>> > The plan is to deprecate the syscall and replace it with an emulation.
>> > The emulation will create new VMAs instead of nonlinear mappings. It
>> > will be slower for the rare users of remap_file_pages(), but the ABI is
>> > preserved.
>> >
>> > One side effect of emulation (apart from performance) is that a user can
>> > hit the vm.max_map_count limit more easily due to the additional VMAs.
>> > See the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.
>>
>> Best to CC linux-api@
>> (https://www.kernel.org/doc/man-pages/linux-api-ml.html) on patches
>> like this, as well as the man-pages maintainer, so that something goes
>> into the man page. I added the following into the man page:
>>
>>        Note:  this  system  call  is (since Linux 3.16) deprecated and
>>        will eventually be replaced by a  slower  in-kernel  emulation.
>>        Those  few  applications  that use this system call should con‐
>>        sider migrating to alternatives.
>>
>> Okay?
>
> Yep. Looks okay to me.

Thanks for checking.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
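
For readers following the thread: the quoted commit message contrasts
remap_file_pages() with repeated calls to mmap(2). A minimal userspace
sketch of the mmap(2)-based approach is given here. It is an illustration
only and is not the kernel emulation from patch 2/2; the helper names
map_pages_in_order() and make_test_file(), and the /tmp path, are
hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build one contiguous window in which slot i shows
 * file page order[i]. Returns the window base, or MAP_FAILED on error.
 */
void *map_pages_in_order(int fd, const size_t *order, size_t npages)
{
    long psz = sysconf(_SC_PAGESIZE);
    assert(psz > 0);

    /* Reserve address space for the whole window first. */
    char *base = mmap(NULL, npages * psz, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    /* Overlay each slot with the requested file page using MAP_FIXED. */
    for (size_t i = 0; i < npages; i++) {
        void *p = mmap(base + i * psz, psz, PROT_READ,
                       MAP_SHARED | MAP_FIXED, fd, (off_t)(order[i] * psz));
        if (p == MAP_FAILED) {
            munmap(base, npages * psz);
            return MAP_FAILED;
        }
    }
    return base;
}

/*
 * Hypothetical helper: create an unlinked temporary file of npages pages,
 * where every byte of page i has the value i. Returns an fd, or -1.
 */
int make_test_file(size_t npages)
{
    char path[] = "/tmp/remap_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    long psz = sysconf(_SC_PAGESIZE);
    char *buf = malloc(psz);
    if (!buf) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < npages; i++) {
        memset(buf, (int)i, psz);
        if (write(fd, buf, psz) != psz) {
            free(buf);
            close(fd);
            return -1;
        }
    }
    free(buf);
    return fd;
}
```

With order = {2, 0, 1} on a three-page file from make_test_file(3), the
first byte of each window page reads 2, 0 and 1 respectively. Slots whose
file offsets are not contiguous with their neighbours are expected to end
up as separate VMAs, which is the reason the quoted text warns about
reaching vm.max_map_count sooner.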

end of thread, other threads:[~2014-06-12  9:44 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
2014-05-08 12:41 [PATCHv2 0/2] remap_file_pages() decommission Kirill A. Shutemov
2014-05-08 12:41 ` Kirill A. Shutemov
2014-05-08 12:41 ` [PATCH 1/2] mm: mark remap_file_pages() syscall as deprecated Kirill A. Shutemov
2014-05-08 12:41   ` Kirill A. Shutemov
2014-06-12  5:48   ` Michael Kerrisk
2014-06-12  5:48     ` Michael Kerrisk
2014-06-12  5:48     ` Michael Kerrisk
2014-06-12  9:40     ` Kirill A. Shutemov
2014-06-12  9:40       ` Kirill A. Shutemov
2014-06-12  9:40       ` Kirill A. Shutemov
2014-06-12  9:44       ` Michael Kerrisk (man-pages)
2014-06-12  9:44         ` Michael Kerrisk (man-pages)
2014-05-08 12:41 ` [PATCH 2/2] mm: replace remap_file_pages() syscall with emulation Kirill A. Shutemov
2014-05-08 12:41   ` Kirill A. Shutemov
2014-05-08 21:57   ` Andrew Morton
2014-05-08 21:57     ` Andrew Morton
2014-05-12 15:11     ` Sasha Levin
2014-05-12 15:11       ` Sasha Levin
2014-05-12 17:05       ` Kirill A. Shutemov
2014-05-12 17:05         ` Kirill A. Shutemov
2014-05-14 20:52         ` Sasha Levin
2014-05-14 20:52           ` Sasha Levin
2014-05-14 21:17           ` Kirill A. Shutemov
2014-05-14 21:17             ` Kirill A. Shutemov
2014-05-14 21:40             ` Andrew Morton
2014-05-14 21:40               ` Andrew Morton
2014-05-13  7:32       ` Armin Rigo
2014-05-13  7:32         ` Armin Rigo
2014-05-13 12:57         ` Sasha Levin
2014-05-13 12:57           ` Sasha Levin
2014-05-08 15:35 ` [PATCHv2 0/2] remap_file_pages() decommission Linus Torvalds
2014-05-08 15:35   ` Linus Torvalds
2014-05-08 15:44 ` Armin Rigo
2014-05-08 15:44   ` Armin Rigo
2014-05-08 16:02   ` Kirill A. Shutemov
2014-05-08 16:02     ` Kirill A. Shutemov
2014-05-08 16:08     ` Linus Torvalds
2014-05-08 16:08       ` Linus Torvalds
2014-05-09 14:05       ` Kirill A. Shutemov
2014-05-09 14:05         ` Kirill A. Shutemov
2014-05-09 15:14         ` Linus Torvalds
2014-05-09 15:14           ` Linus Torvalds
2014-05-09 18:19           ` Kirill A. Shutemov
2014-05-09 18:19             ` Kirill A. Shutemov
2014-05-12 12:43           ` Kirill A. Shutemov
2014-05-12 12:43             ` Kirill A. Shutemov
2014-05-12 14:59             ` Konstantin Khlebnikov
2014-05-12 14:59               ` Konstantin Khlebnikov
2014-05-12  3:36   ` Andi Kleen
2014-05-12  3:36     ` Andi Kleen
2014-05-12  5:16     ` Konstantin Khlebnikov
2014-05-12  5:16       ` Konstantin Khlebnikov
2014-05-12  7:50     ` Armin Rigo
2014-05-12  7:50       ` Armin Rigo
