linux-kernel.vger.kernel.org archive mirror
* [RFC v2] Support volatile range for anon vma
@ 2012-10-30  1:29 Minchan Kim
  2012-10-31 21:35 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Minchan Kim @ 2012-10-30  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Minchan Kim, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

This patch introduces new madvise behaviors, MADV_VOLATILE and
MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
version, which considers only tmpfs, while this patch considers only
anonymous pages, so it cannot replace John's work. If the idea below
proves reasonable, I hope we can unify both concepts via
madvise/fadvise.

The rationale is as follows.
Many allocators call munmap(2) when the user calls free(3) and the
pointer lies in an mmaped area. But munmap isn't cheap because it has
to clean up all pte entries and unlink the vma, so the overhead grows
linearly with the mmaped area's size.

Robert Love's volatile concept could be very useful for reducing
free(3) overhead. Allocators can call madvise(MADV_VOLATILE) instead of
munmap(2) (of course, they need to manage volatile mmaped areas to
avoid address-space shortage and sometimes end up unmapping them).
madvise(MADV_VOLATILE|NOVOLATILE) is a very cheap operation because

1) it just marks a flag in the VMA, and
2) if memory pressure happens, the VM can discard the pages of a
   volatile VMA instead of swapping them out when they are selected
   as victims by the normal VM aging policy, and
3) a freed mmaped area doesn't contain any meaningful data, so there
   is no point in swapping it out.

The allocator should call madvise(MADV_NOVOLATILE) before handing
that area back to the user. Otherwise, accessing the volatile range
will raise a SIGBUS error.
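
For illustration, here is a minimal userspace sketch of the intended
usage. The MADV_VOLATILE/MADV_NOVOLATILE values are taken from this
patch's mman-common.h hunk (they do not exist in mainline headers),
and the purge-reporting return value is this patch's semantics:

#include <stdio.h>
#include <sys/mman.h>

#define MADV_VOLATILE   5       /* from this patch, not mainline */
#define MADV_NOVOLATILE 6       /* from this patch, not mainline */

int main(void)
{
        size_t len = 1 << 20;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        p[0] = 'x';     /* chunk handed out to the user */

        /* free(3) path: mark the chunk reclaimable instead of munmap(). */
        madvise(p, len, MADV_VOLATILE);

        /*
         * malloc(3) path: make the range stable again before reuse.
         * With this patch, a non-zero return means at least one page
         * was discarded while the range was volatile.
         */
        int purged = madvise(p, len, MADV_NOVOLATILE);
        printf("purged: %d\n", purged);

        p[0] = 'y';     /* safe to touch again */
        munmap(p, len);
        return 0;
}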

The downside is that we have to age the anon lru list even when we have
no swap, because I don't want to discard volatile pages with top priority
when memory pressure happens. Volatile in this patch means "we don't
need to swap these out because the user can handle data disappearing
suddenly", NOT "they are useless, so hurry up and reclaim them".
So I want to apply the same aging rule as normal pages to them.

Anon background aging on a non-swap system would be the trade-off for
getting this feature. We actually did it until [1] was merged two years
ago, and I believe the free(3) performance gain will beat the overhead
of anon lru aging once all allocators start to use madvise.
(This patch doesn't include background aging for the non-swap case,
 but it's trivial to add if we so decide.)

I hope to hear opinions from others before diving into glibc or bionic.
Any comments are welcome.

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system

Changelog
 * from RFC v1
   * Add a clarifying comment about purge - Christoph
   * Change patch description

Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/asm-generic/mman-common.h |    3 +
 include/linux/mm.h                |    9 ++-
 include/linux/mm_types.h          |    5 ++
 include/linux/rmap.h              |   24 ++++++-
 mm/ksm.c                          |    4 +-
 mm/madvise.c                      |   32 +++++++++-
 mm/memory.c                       |    2 +
 mm/migrate.c                      |    6 +-
 mm/rmap.c                         |  126 +++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                       |    3 +
 10 files changed, 202 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index d030d2c..5f8090d 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -34,6 +34,9 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_VOLATILE	5		/* pages will disappear suddenly */
+#define MADV_NOVOLATILE 6		/* pages will not disappear */
+
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..78c8c08 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -120,6 +120,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
+/*
+ * Konstantin recently removed a few flags; once that is merged
+ * we will have room for a new flag even with 32-bit vm_flags.
+ * Thanks, Konstantin!
+ */
+#define VM_VOLATILE	0x100000000
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
 
@@ -143,7 +150,7 @@ extern unsigned int kobjsize(const void *objp);
  * Special vmas that are non-mergable, non-mlock()able.
  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
  */
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
+#define VM_SPECIAL (VM_IO|VM_DONTEXPAND|VM_RESERVED|VM_PFNMAP|VM_VOLATILE)
 
 /*
  * mapping from the currently active vm_flags protection bits (the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf78672..c813daa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -279,6 +279,11 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	/*
+	 * True if one or more pages in this vma have been reclaimed.
+	 * It's protected by anon_vma->mutex.
+	 */
+	bool purged;
 };
 
 struct core_thread {
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3fce545..65b9f33 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -67,6 +67,10 @@ struct anon_vma_chain {
 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
 };
 
+
+void volatile_lock(struct vm_area_struct *vma);
+void volatile_unlock(struct vm_area_struct *vma);
+
 #ifdef CONFIG_MMU
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
@@ -170,12 +174,14 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
 int try_to_unmap(struct page *, enum ttu_flags flags);
 int try_to_unmap_one(struct page *, struct vm_area_struct *,
-			unsigned long address, enum ttu_flags flags);
+			unsigned long address, enum ttu_flags flags,
+			bool *is_volatile);
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
@@ -194,6 +200,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	return ptep;
 }
 
+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
+				unsigned long, spinlock_t **);
+
+static inline pte_t *page_check_volatile_address(struct page *page,
+					struct mm_struct *mm,
+					unsigned long address,
+					spinlock_t **ptlp)
+{
+	pte_t *ptep;
+
+	__cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
+					mm, address, ptlp));
+	return ptep;
+}
+
 /*
  * Used by swapoff to help locate where page is expected in vma.
  */
@@ -257,5 +278,6 @@ static inline int page_mkclean(struct page *page)
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
 #define SWAP_MLOCK	3
+#define SWAP_DISCARD	4
 
 #endif	/* _LINUX_RMAP_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 47c8853..22c54d2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1653,6 +1653,7 @@ int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
 	struct rmap_item *rmap_item;
 	int ret = SWAP_AGAIN;
 	int search_new_forks = 0;
+	bool dummy_volatile;
 
 	VM_BUG_ON(!PageKsm(page));
 	VM_BUG_ON(!PageLocked(page));
@@ -1682,7 +1683,8 @@ again:
 				continue;
 
 			ret = try_to_unmap_one(page, vma,
-					rmap_item->address, flags);
+					rmap_item->address, flags,
+					&dummy_volatile);
 			if (ret != SWAP_AGAIN || !page_mapped(page)) {
 				anon_vma_unlock(anon_vma);
 				goto out;
diff --git a/mm/madvise.c b/mm/madvise.c
index 14d260f..53cd77f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_VOLATILE:
+		if (vma->vm_flags & VM_LOCKED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_VOLATILE;
+		vma->purged = false;
+		break;
+	case MADV_NOVOLATILE:
+		if (!(vma->vm_flags & VM_VOLATILE)) {
+			error = -EINVAL;
+			goto out;
+		}
+
+		new_flags &= ~VM_VOLATILE;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -118,9 +134,15 @@ static long madvise_behavior(struct vm_area_struct * vma,
 success:
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * For VOLATILE, we additionally need the volatile_lock.
 	 */
+	if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
+		volatile_lock(vma);
 	vma->vm_flags = new_flags;
-
+	if (behavior == MADV_NOVOLATILE)
+		error = vma->purged;
+	if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
+		volatile_unlock(vma);
 out:
 	if (error == -ENOMEM)
 		error = -EAGAIN;
@@ -310,6 +332,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_VOLATILE:
+	case MADV_NOVOLATILE:
 		return 1;
 
 	default:
@@ -383,7 +407,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 	if (start & ~PAGE_MASK)
 		goto out;
-	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+	if (behavior != MADV_VOLATILE && behavior != MADV_NOVOLATILE)
+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+	else
+		len = len_in & PAGE_MASK;
 
 	/* Check to see whether len was rounded up from small -ve to zero */
 	if (len_in && !len)
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..26b3f73 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3441,6 +3441,8 @@ int handle_pte_fault(struct mm_struct *mm,
 	entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
+			if (unlikely(vma->vm_flags & VM_VOLATILE))
+				return VM_FAULT_SIGBUS;
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
 					return do_linear_fault(mm, vma, address,
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..d1b51af 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+			TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 skip_unmap:
 	if (!page_mapped(page))
@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..778abfc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
+			  unsigned long address, spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	swp_entry_t entry = { .val = page_private(page) };
+
+	if (unlikely(PageHuge(page))) {
+		pte = huge_pte_offset(mm, address);
+		ptl = &mm->page_table_lock;
+		goto check;
+	}
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return NULL;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return NULL;
+	if (pmd_trans_huge(*pmd))
+		return NULL;
+
+	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
+check:
+	spin_lock(ptl);
+	if (PageAnon(page)) {
+		if (!pte_present(*pte) && entry.val ==
+				pte_to_swp_entry(*pte).val) {
+			*ptlp = ptl;
+			return pte;
+		}
+	} else {
+		if (pte_none(*pte)) {
+			*ptlp = ptl;
+			return pte;
+		}
+	}
+	pte_unmap_unlock(pte, ptl);
+	return NULL;
+}
+
 /*
  * Check that @page is mapped at @address into @mm.
  *
@@ -1218,12 +1269,42 @@ out:
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
+		unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+
+	pte = page_check_volatile_address(page, mm, address, &ptl);
+	if (!pte)
+		return 0;
+
+	/* Nuke the page table entry. */
+	flush_cache_page(vma, address, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, address, pte);
+
+	if (PageAnon(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		if (PageSwapCache(page)) {
+			dec_mm_counter(mm, MM_SWAPENTS);
+			swap_free(entry);
+		}
+	}
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, address);
+	return 1;
+}
+
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
  */
 int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
-		     unsigned long address, enum ttu_flags flags)
+		     unsigned long address, enum ttu_flags flags,
+		     bool *is_volatile)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte;
@@ -1235,6 +1316,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if (!pte)
 		goto out;
 
+	if (!(vma->vm_flags & VM_VOLATILE))
+		*is_volatile = false;
 	/*
 	 * If the page is mlock()d, we cannot swap it out.
 	 * If it's recently referenced (perhaps page_referenced
@@ -1494,6 +1577,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 	struct anon_vma *anon_vma;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
+	bool is_volatile = true;
+
+	if (flags & TTU_IGNORE_VOLATILE)
+		is_volatile = false;
 
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
@@ -1512,17 +1599,32 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 		 * temporary VMAs until after exec() completes.
 		 */
 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
-				is_vma_temporary_stack(vma))
+				is_vma_temporary_stack(vma)) {
+			is_volatile = false;
 			continue;
+		}
 
 		address = vma_address(page, vma);
 		if (address == -EFAULT)
 			continue;
-		ret = try_to_unmap_one(page, vma, address, flags);
+		ret = try_to_unmap_one(page, vma, address, flags, &is_volatile);
 		if (ret != SWAP_AGAIN || !page_mapped(page))
 			break;
 	}
 
+	if (page_mapped(page) || is_volatile == false)
+		goto out;
+
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long address;
+
+		address = vma_address(page, vma);
+		if (try_to_zap_one(page, vma, address))
+			vma->purged = true;
+	}
+	ret = SWAP_DISCARD;
+out:
 	page_unlock_anon_vma(anon_vma);
 	return ret;
 }
@@ -1553,13 +1655,14 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
+	bool dummy;
 
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
 			continue;
-		ret = try_to_unmap_one(page, vma, address, flags);
+		ret = try_to_unmap_one(page, vma, address, flags, &dummy);
 		if (ret != SWAP_AGAIN || !page_mapped(page))
 			goto out;
 	}
@@ -1651,6 +1754,7 @@ out:
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
+ * SWAP_DISCARD - page is volatile.
  */
 int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
@@ -1665,7 +1769,7 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 		ret = try_to_unmap_anon(page, flags);
 	else
 		ret = try_to_unmap_file(page, flags);
-	if (ret != SWAP_MLOCK && !page_mapped(page))
+	if (ret != SWAP_MLOCK && !page_mapped(page) && ret != SWAP_DISCARD)
 		ret = SWAP_SUCCESS;
 	return ret;
 }
@@ -1707,6 +1811,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
 	anon_vma_free(anon_vma);
 }
 
+void volatile_lock(struct vm_area_struct *vma)
+{
+	if (vma->anon_vma)
+		anon_vma_lock(vma->anon_vma);
+}
+
+void volatile_unlock(struct vm_area_struct *vma)
+{
+	if (vma->anon_vma)
+		anon_vma_unlock(vma->anon_vma);
+}
+
 #ifdef CONFIG_MIGRATION
 /*
  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99b434b..d5b60d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -789,6 +789,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, TTU_UNMAP)) {
+			case SWAP_DISCARD:
+				goto discard_page;
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -857,6 +859,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
+discard_page:
 		/*
 		 * If the page has buffers, try to free the buffer mappings
 		 * associated with this page. If we succeed we try to free
-- 
1.7.9.5



* Re: [RFC v2] Support volatile range for anon vma
  2012-10-30  1:29 [RFC v2] Support volatile range for anon vma Minchan Kim
@ 2012-10-31 21:35 ` Andrew Morton
  2012-10-31 21:59   ` Paul Turner
  2012-11-01  0:21   ` Minchan Kim
  2012-11-02  1:43 ` Bob Liu
  2012-11-22  0:36 ` John Stultz
  2 siblings, 2 replies; 30+ messages in thread
From: Andrew Morton @ 2012-10-31 21:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, John Stultz, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, sanjay, pjt,
	David Rientjes

On Tue, 30 Oct 2012 10:29:54 +0900
Minchan Kim <minchan@kernel.org> wrote:

> This patch introduces new madvise behaviors, MADV_VOLATILE and
> MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
> version, which considers only tmpfs, while this patch considers only
> anonymous pages, so it cannot replace John's work. If the idea below
> proves reasonable, I hope we can unify both concepts via
> madvise/fadvise.
> 
> The rationale is as follows.
> Many allocators call munmap(2) when the user calls free(3) and the
> pointer lies in an mmaped area. But munmap isn't cheap because it has
> to clean up all pte entries and unlink the vma, so the overhead grows
> linearly with the mmaped area's size.

Presumably the userspace allocator will internally manage memory in
large chunks, so the munmap() call frequency will be much lower than
the free() call frequency.  So the performance gains from this change
might be very small.

The whole point of the patch is to improve performance, but we have no
evidence that it was successful in doing that!  I do think we'll need
good quantitative testing results before proceeding with such a patch,
please.

Also, it is very desirable that we involve the relevant userspace
(glibc, etc) developers in this.  And I understand that the google
tcmalloc project will probably have interest in this - I've cc'ed
various people@google in the hope that they can provide input (please).

Also, it is a userspace API change.  Please cc mtk.manpages@gmail.com.

Also, I assume that you have userspace test code.  At some stage,
please consider adding a case to tools/testing/selftests.  Such a test
would require the creation of memory pressure, which is rather contrary
to the selftests' current philosophy of being a bunch of short-running
little tests.  Perhaps you can come up with something.  But I suggest
that such work be done later, once it becomes clearer that this code is
actually headed into the kernel.
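
A sketch of what such a test might look like, assuming this patch's
flag values and semantics, with memory pressure generated crudely by
dirtying a large throwaway mapping:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define MADV_VOLATILE   5       /* from this patch, not mainline */
#define MADV_NOVOLATILE 6       /* from this patch, not mainline */

int main(void)
{
        size_t len = 64UL << 20;
        char *victim = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(victim, 0xaa, len);              /* populate the range */
        madvise(victim, len, MADV_VOLATILE);    /* now discardable */

        /* Crude memory pressure: dirty a big throwaway mapping. */
        size_t hog_len = 512UL << 20;
        char *hog = mmap(NULL, hog_len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(hog, 0xbb, hog_len);

        /* Non-zero means pages were discarded under pressure. */
        printf("purged: %d\n", madvise(victim, len, MADV_NOVOLATILE));
        return 0;
}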

> The allocator should call madvise(MADV_NOVOLATILE) before handing
> that area back to the user. Otherwise, accessing the volatile range
> will raise a SIGBUS error.

Well, why?  It would be easy enough for the fault handler to give
userspace a new, zeroed page at that address.

Or we could simply leave the old page in place at that address.  If the
page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
page (or all the not-yet-reclaimed pages) back to userspace at their
old addresses.

Various options suggest themselves here.  You've chosen one of them but
I would like to see a pretty exhaustive description of the reasoning
behind that decision.

Also, I wonder about the interaction with other vma manipulation
operations.  For example, can a VMA get split when in the MADV_VOLATILE
state?  If so, what happens?  

Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
but I bet this wasn't tested ;)

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_VOLATILE:
> +		if (vma->vm_flags & VM_LOCKED) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags |= VM_VOLATILE;
> +		vma->purged = false;
> +		break;
> +	case MADV_NOVOLATILE:
> +		if (!(vma->vm_flags & VM_VOLATILE)) {
> +			error = -EINVAL;
> +			goto out;

I wonder if this really should return an error.  Other madvise()
options don't do this, and running MADV_NOVOLATILE against a
not-volatile area seems pretty benign and has clearly defined before-
and after- states.

> +		}
> +
> +		new_flags &= ~VM_VOLATILE;
> +		break;
>  	}
>  
>  	if (new_flags == vma->vm_flags) {



* Re: [RFC v2] Support volatile range for anon vma
  2012-10-31 21:35 ` Andrew Morton
@ 2012-10-31 21:59   ` Paul Turner
  2012-10-31 22:56     ` KOSAKI Motohiro
  2012-11-01  0:50     ` Minchan Kim
  2012-11-01  0:21   ` Minchan Kim
  1 sibling, 2 replies; 30+ messages in thread
From: Paul Turner @ 2012-10-31 21:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim <minchan@kernel.org> wrote:
>
> > This patch introduces new madvise behaviors, MADV_VOLATILE and
> > MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
> > version, which considers only tmpfs, while this patch considers only
> > anonymous pages, so it cannot replace John's work. If the idea below
> > proves reasonable, I hope we can unify both concepts via
> > madvise/fadvise.
> >
> > The rationale is as follows.
> > Many allocators call munmap(2) when the user calls free(3) and the
> > pointer lies in an mmaped area. But munmap isn't cheap because it has
> > to clean up all pte entries and unlink the vma, so the overhead grows
> > linearly with the mmaped area's size.
>
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency.  So the performance gains from this change
> might be very small.

I don't think I strictly understand the motivation from a
malloc-standpoint here.

These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
to perform discards on Linux.    For any reasonable allocator (short
of binding malloc --> mmap, free --> unmap) this seems a better
choice.

Note also, from a performance standpoint, I doubt any allocator (which
cares about performance) is going to want to pay the cost of even a
null syscall on typical malloc/free usage (consider: a tcmalloc
malloc/free pair is currently <20ns).  Given that this cost is
amortized once you start doing discards on larger blocks, MADV_DONTNEED
seems a preferable interface:
- You don't need to reconstruct an arena when you do want to allocate
since there's no munmap/mmap for the region to change about
- There are no syscalls involved in later reallocating the block.

The only real additional cost is address-space.  Are you strongly
concerned about the 32-bit case?

>
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that!  I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.
>
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this.  And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).
>
> Also, it is a userspace API change.  Please cc mtk.manpages@gmail.com.
>
> Also, I assume that you have userspace test code.  At some stage,
> please consider adding a case to tools/testing/selftests.  Such a test
> would require the creation of memory pressure, which is rather contrary
> to the selftests' current philosophy of being a bunch of short-running
> little tests.  Perhaps you can come up with something.  But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.
>
> > The allocator should call madvise(MADV_NOVOLATILE) before handing
> > that area back to the user. Otherwise, accessing the volatile range
> > will raise a SIGBUS error.
>
> Well, why?  It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Note: MADV_DONTNEED already has this (nice) property.

>
> Or we could simply leave the old page in place at that address.  If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
>
> Various options suggest themselves here.  You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.
>
> Also, I wonder about the interaction with other vma manipulation
> operations.  For example, can a VMA get split when in the MADV_VOLATILE
> state?  If so, what happens?
>
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)
>
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >               if (error)
> >                       goto out;
> >               break;
> > +     case MADV_VOLATILE:
> > +             if (vma->vm_flags & VM_LOCKED) {
> > +                     error = -EINVAL;
> > +                     goto out;
> > +             }
> > +             new_flags |= VM_VOLATILE;
> > +             vma->purged = false;
> > +             break;
> > +     case MADV_NOVOLATILE:
> > +             if (!(vma->vm_flags & VM_VOLATILE)) {
> > +                     error = -EINVAL;
> > +                     goto out;
>
> I wonder if this really should return an error.  Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.
>
> > +             }
> > +
> > +             new_flags &= ~VM_VOLATILE;
> > +             break;
> >       }
> >
> >       if (new_flags == vma->vm_flags) {
>


* Re: [RFC v2] Support volatile range for anon vma
  2012-10-31 21:59   ` Paul Turner
@ 2012-10-31 22:56     ` KOSAKI Motohiro
  2012-11-01  1:15       ` Paul Turner
                         ` (2 more replies)
  2012-11-01  0:50     ` Minchan Kim
  1 sibling, 3 replies; 30+ messages in thread
From: KOSAKI Motohiro @ 2012-10-31 22:56 UTC (permalink / raw)
  To: Paul Turner
  Cc: Andrew Morton, Minchan Kim, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KAMEZAWA Hiroyuki, sanjay,
	David Rientjes

>> > The allocator should call madvise(MADV_NOVOLATILE) before handing
>> > that area back to the user. Otherwise, accessing the volatile range
>> > will raise a SIGBUS error.
>>
>> Well, why?  It would be easy enough for the fault handler to give
>> userspace a new, zeroed page at that address.
>
> Note: MADV_DONTNEED already has this (nice) property.

I don't think I strictly understand this patch, but maybe I can answer
why userland and malloc folks don't like MADV_DONTNEED.

glibc malloc discards freed memory using MADV_DONTNEED, just as
tcmalloc does, and it is often the source of a large performance
decrease, because MADV_DONTNEED discards memory immediately and the
next malloc() call falls into the page fault and page-size memset()
path; using DONTNEED thus increases the zero-fill and cache-miss rates.

At free() time, malloc has no knowledge of when the next big malloc()
will be called, so immediate discarding may or may not yield a
performance gain. (Ah, OK, the ratio is not 5:5, so it is usually
worth it, but not every time.)
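
A minimal illustration of that path (standard MADV_DONTNEED behavior on
a private anonymous mapping; the first touch after the discard takes a
page fault and gets a fresh zero-filled page):

#include <assert.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        p[0] = 'x';                     /* page is populated */
        madvise(p, len, MADV_DONTNEED); /* backing dropped immediately */

        /* Refault: page fault plus zero fill on the next touch. */
        assert(p[0] == 0);

        munmap(p, len);
        return 0;
}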


In the past, several developers tried to avoid this situation by, for
example:

- making a zero-page daemon to avoid the page-size zero fill at page fault
- making new vma or page flags that mark memory as discardable without
  swap, handled by vmscan (like this patch and/or MADV_FREE)
- making a new process option that avoids page zero fill in the page
  fault path (yes, it is a big incompatibility and insecure, but some
  embedded folks thought that was an acceptable downside)
- etc.


BTW, I'm not sure this patch is better for malloc, because the current
MADV_DONTNEED doesn't need mmap_sem and works very effectively with
many threads. Taking mmap_sem might bring worse performance than
DONTNEED. Dunno.


* Re: [RFC v2] Support volatile range for anon vma
  2012-10-31 21:35 ` Andrew Morton
  2012-10-31 21:59   ` Paul Turner
@ 2012-11-01  0:21   ` Minchan Kim
  1 sibling, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2012-11-01  0:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, John Stultz, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, sanjay, pjt,
	David Rientjes

Hi Andrew,

On Wed, Oct 31, 2012 at 02:35:24PM -0700, Andrew Morton wrote:
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > This patch introduces new madvise behaviors, MADV_VOLATILE and
> > MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
> > version, which considers only tmpfs, while this patch considers only
> > anonymous pages, so it cannot replace John's work. If the idea below
> > proves reasonable, I hope we can unify both concepts via
> > madvise/fadvise.
> > 
> > The rationale is as follows.
> > Many allocators call munmap(2) when the user calls free(3) and the
> > pointer lies in an mmaped area. But munmap isn't cheap because it has
> > to clean up all pte entries and unlink the vma, so the overhead grows
> > linearly with the mmaped area's size.
> 
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency.  So the performance gains from this change
> might be very small.
> 
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that!  I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.

Absolutely. That's why I sent it as an RFC.
At this stage, I would like to reach a consensus on whether this idea
makes sense before investigating further, because we have a large pool
of experienced developers, and one of them might know whether this is
really needed or not.

> 
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this.  And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).

Thanks! I should have done that. Such input is exactly what I need now.

> 
> Also, it is a userspace API change.  Please cc mtk.manpages@gmail.com.

This is an RFC, so nothing is fixed yet.
I will Cc him after the open issues are resolved and the
interface is fixed.

> 
> Also, I assume that you have userspace test code.  At some stage,
> please consider adding a case to tools/testing/selftests.  Such a test
> would require the creation of memory pressure, which is rather contrary
> to the selftests' current philosophy of being a bunch of short-running
> little tests.  Perhaps you can come up with something.  But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.

Yes.

> 
> > The allocator should call madvise(MADV_NOVOLATILE) before handing
> > that area back to the user. Otherwise, accessing the volatile range
> > will raise a SIGBUS error.
> 
> Well, why?  It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Absolutely. It would be convenient, but in fact I am considering unifying
this with John Stultz's fallocate volatile range, which considers only
tmpfs pages, so madvise/fadvise might be a better candidate for the API.
In the tmpfs case, John implemented it as returning a zero page when
someone accesses the volatile region, as you mentioned, but at this
Kernel Summit Hugh wanted it to return SIGBUS instead, and I think that
makes debugging easier.

Another option is to put a flag in the API that indicates whether the VM
will return a zero page or SIGBUS when the user accesses the volatile
range, so users can choose what they want.

> 
> Or we could simply leave the old page in place at that address.  If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
> 
> Various options suggest themselves here.  You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.

Will do.

> 
> Also, I wonder about the interaction with other vma manipulation
> operations.  For example, can a VMA get split when in the MADV_VOLATILE
> state?  If so, what happens?  

Both VMAs would be volatile, although one of them may never have had
pages reclaimed. I understand it's not optimal, but I expect users will
not perform such operations (e.g., mprotect, mremap) frequently on a
volatile vma.

If they do, maybe I need to come up with something, but it wouldn't be easy.

> 
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)

Yes, I didn't consider that yet. AFAIK, nonlinear vmas apply to
file-mapped vmas, while this patch considers only anon vmas, which is a
good first step.

> 
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >  		if (error)
> >  			goto out;
> >  		break;
> > +	case MADV_VOLATILE:
> > +		if (vma->vm_flags & VM_LOCKED) {
> > +			error = -EINVAL;
> > +			goto out;
> > +		}
> > +		new_flags |= VM_VOLATILE;
> > +		vma->purged = false;
> > +		break;
> > +	case MADV_NOVOLATILE:
> > +		if (!(vma->vm_flags & VM_VOLATILE)) {
> > +			error = -EINVAL;
> > +			goto out;
> 
> I wonder if this really should return an error.  Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.

Will fix.

Thanks for the review, Andrew.

> 
> > +		}
> > +
> > +		new_flags &= ~VM_VOLATILE;
> > +		break;
> >  	}
> >  
> >  	if (new_flags == vma->vm_flags) {
> 

-- 
Kind regards,
Minchan Kim


* Re: [RFC v2] Support volatile range for anon vma
  2012-10-31 21:59   ` Paul Turner
  2012-10-31 22:56     ` KOSAKI Motohiro
@ 2012-11-01  0:50     ` Minchan Kim
  2012-11-01  1:22       ` Paul Turner
  1 sibling, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2012-11-01  0:50 UTC (permalink / raw)
  To: Paul Turner
  Cc: Andrew Morton, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

Hello,

On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 30 Oct 2012 10:29:54 +0900
> > Minchan Kim <minchan@kernel.org> wrote:
> >
> > > This patch introduces new madvise behaviors, MADV_VOLATILE and
> > > MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
> > > version, which considers only tmpfs, while this patch considers only
> > > anonymous pages, so it cannot replace John's work. If the idea below
> > > proves reasonable, I hope we can unify both concepts via
> > > madvise/fadvise.
> > >
> > > The rationale is as follows.
> > > Many allocators call munmap(2) when the user calls free(3) and the
> > > pointer lies in an mmaped area. But munmap isn't cheap because it has
> > > to clean up all pte entries and unlink the vma, so the overhead grows
> > > linearly with the mmaped area's size.
> >
> > Presumably the userspace allocator will internally manage memory in
> > large chunks, so the munmap() call frequency will be much lower than
> > the free() call frequency.  So the performance gains from this change
> > might be very small.
> 
> I don't think I strictly understand the motivation from a
> malloc-standpoint here.
> 
> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
> to perform discards on Linux.    For any reasonable allocator (short
> of binding malloc --> mmap, free --> unmap) this seems a better
> choice.
> 
> Note also, from a performance standpoint, I doubt any allocator (which
> cares about performance) is going to want to pay the cost of even a
> null syscall on typical malloc/free usage (consider: a tcmalloc

Good point.

> malloc/free pair is currently <20ns).  Given that this cost is
> amortized once you start doing discards on larger blocks, MADV_DONTNEED
> seems a preferable interface:
> - You don't need to reconstruct an arena when you do want to allocate
> since there's no munmap/mmap for the region to change about
> - There are no syscalls involved in later reallocating the block.

The above benefits apply to MADV_VOLATILE, too.
But as you pointed out, there is a bit more overhead than DONTNEED
because the allocator should call madvise(MADV_NOVOLATILE) before allocation.
Although madvise(NOVOLATILE) just marks a vma flag, it does need mmap_sem,
which could be a problem on parallel malloc/free workloads, as KOSAKI pointed out.

In that case, we can change the semantics so malloc doesn't need to call
madvise(NOVOLATILE) before allocating. Then the page fault handler has to
check whether the fault happened via access to a volatile vma. If so,
it could return a zero page instead of SIGBUS and mark the vma as no
longer volatile.
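
A rough sketch of that alternative (not part of the posted patch; the
helper name is made up, and clearing vm_flags here glosses over the
fact that fault handlers hold only the read side of mmap_sem):

/*
 * Would be called from handle_pte_fault() instead of returning
 * VM_FAULT_SIGBUS: drop the vma's volatile state and let the normal
 * anonymous fault path hand back a zeroed page.
 */
static void volatile_fault_fixup(struct vm_area_struct *vma)
{
        volatile_lock(vma);             /* the patch's anon_vma lock */
        vma->vm_flags &= ~VM_VOLATILE;
        volatile_unlock(vma);
        /* caller falls through to do_anonymous_page() */
}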

> 
> The only real additional cost is address-space.  Are you strongly
> concerned about the 32-bit case?

No. I believe allocators have logic to clean them up once the address
space is almost full.

Thanks, Paul.

-- 
Kind regards,
Minchan Kim


* Re: [RFC v2] Support volatile range for anon vma
  2012-10-31 22:56     ` KOSAKI Motohiro
@ 2012-11-01  1:15       ` Paul Turner
  2012-11-01  1:46         ` Minchan Kim
  2012-11-01  1:25       ` Minchan Kim
  2012-11-05 23:54       ` Arun Sharma
  2 siblings, 1 reply; 30+ messages in thread
From: Paul Turner @ 2012-11-01  1:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Minchan Kim, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KAMEZAWA Hiroyuki, sanjay,
	David Rientjes

On Wed, Oct 31, 2012 at 3:56 PM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
>>> > The allocator should call madvise(MADV_NOVOLATILE) before handing
>>> > that area back to the user. Otherwise, accessing the volatile range
>>> > will raise a SIGBUS error.
>>>
>>> Well, why?  It would be easy enough for the fault handler to give
>>> userspace a new, zeroed page at that address.
>>
>> Note: MADV_DONTNEED already has this (nice) property.
>
> I don't think I strictly understand this patch, but maybe I can answer
> why userland and malloc folks don't like MADV_DONTNEED.
>
> glibc malloc discards freed memory using MADV_DONTNEED, just as
> tcmalloc does, and it is often the source of a large performance
> decrease, because MADV_DONTNEED discards memory immediately and the
> next malloc() call falls into the page fault and page-size memset()
> path; using DONTNEED thus increases the zero-fill and cache-miss rates.
>
> At free() time, malloc has no knowledge of when the next big malloc()
> will be called, so immediate discarding may or may not yield a
> performance gain. (Ah, OK, the ratio is not 5:5, so it is usually
> worth it, but not every time.)
>

Ah; in tcmalloc, allocations (and their associated free-lists) are
binned into separate lists as a function of object size, which helps to
mitigate this.

I'd make a separate, more general argument here:
if I'm allocating a large (multi-kilobyte) object, the cost of what I'm
about to do with that object is likely fairly large -- the fault/zero
cost is probably a fairly small proportional cost, which limits the
optimization's value.

>
> In the past, several developers tried to avoid this situation by, for
> example:
>
> - making a zero-page daemon to avoid the page-size zero fill at page fault
> - making new vma or page flags that mark memory as discardable without
>   swap, handled by vmscan (like this patch and/or MADV_FREE)
> - making a new process option that avoids page zero fill in the page
>   fault path (yes, it is a big incompatibility and insecure, but some
>   embedded folks thought that was an acceptable downside)
> - etc.
>
>
> BTW, I'm not sure this patch is better for malloc, because the current
> MADV_DONTNEED doesn't need mmap_sem and works very effectively with
> many threads. Taking mmap_sem might bring worse performance than
> DONTNEED. Dunno.

MADV_VOLATILE also seems to end up looking quite similar to a
user-visible (range-based) cleancache.

A second popular use-case for such semantics is the case of
discardable cache elements (e.g. web browser).  I suspect we'd want to
at least mention these in the changelog.  (Alternatively, what does a
cleancache-backed-fs exposing these semantics look like?)


* Re: [RFC v2] Support volatile range for anon vma
  2012-11-01  0:50     ` Minchan Kim
@ 2012-11-01  1:22       ` Paul Turner
  2012-11-01  1:33         ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Turner @ 2012-11-01  1:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

On Wed, Oct 31, 2012 at 5:50 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hello,
>
> On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
>> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>> >
>> > On Tue, 30 Oct 2012 10:29:54 +0900
>> > Minchan Kim <minchan@kernel.org> wrote:
>> >
>> > > This patch introduces new madvise behaviors, MADV_VOLATILE and
>> > > MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
>> > > version, which considers only tmpfs, while this patch considers only
>> > > anonymous pages, so it cannot replace John's work. If the idea below
>> > > proves reasonable, I hope we can unify both concepts via
>> > > madvise/fadvise.
>> > >
>> > > The rationale is as follows.
>> > > Many allocators call munmap(2) when the user calls free(3) and the
>> > > pointer lies in an mmaped area. But munmap isn't cheap because it has
>> > > to clean up all pte entries and unlink the vma, so the overhead grows
>> > > linearly with the mmaped area's size.
>> >
>> > Presumably the userspace allocator will internally manage memory in
>> > large chunks, so the munmap() call frequency will be much lower than
>> > the free() call frequency.  So the performance gains from this change
>> > might be very small.
>>
>> I don't think I strictly understand the motivation from a
>> malloc-standpoint here.
>>
>> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
>> to perform discards on Linux.    For any reasonable allocator (short
>> of binding malloc --> mmap, free --> unmap) this seems a better
>> choice.
>>
>> Note also, from a performance standpoint, I doubt any allocator (which
>> cares about performance) is going to want to pay the cost of even a
>> null syscall on typical malloc/free usage (consider: a tcmalloc
>
> Good point.
>
>> malloc/free pair is currently <20ns).  Given that this cost is
>> amortized once you start doing discards on larger blocks, MADV_DONTNEED
>> seems a preferable interface:
>> - You don't need to reconstruct an arena when you do want to allocate
>> since there's no munmap/mmap for the region to change about
>> - There are no syscalls involved in later reallocating the block.
>
> The above benefits apply to MADV_VOLATILE, too.
> But as you pointed out, there is a bit more overhead than DONTNEED
> because the allocator should call madvise(MADV_NOVOLATILE) before allocation.
> Although madvise(NOVOLATILE) just marks a vma flag, it does need mmap_sem,
> which could be a problem on parallel malloc/free workloads, as KOSAKI pointed out.
>
> In that case, we can change the semantics so malloc doesn't need to call
> madvise(NOVOLATILE) before allocating. Then the page fault handler has to
> check whether the fault happened via access to a volatile vma. If so,
> it could return a zero page instead of SIGBUS and mark the vma as no
> longer volatile.

I think being able to determine whether the backing was discarded
(as part of an atomic transition to non-volatile) would be a required
property to make this useful for non-malloc use-cases.
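
For instance, a discardable-cache user might want something like this
(a sketch assuming the patch's semantics, where a non-zero return from
madvise(MADV_NOVOLATILE) means pages were purged; the function names
are illustrative):

#include <sys/mman.h>

#define MADV_VOLATILE   5       /* from the patch, not mainline */
#define MADV_NOVOLATILE 6       /* from the patch, not mainline */

/* Pin a cache entry before reading it; returns -1 if the kernel
 * discarded it while it was volatile, so the caller must regenerate. */
int cache_pin(void *buf, size_t len)
{
        return madvise(buf, len, MADV_NOVOLATILE) ? -1 : 0;
}

/* Done reading: make the entry reclaimable again. */
void cache_unpin(void *buf, size_t len)
{
        madvise(buf, len, MADV_VOLATILE);
}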

>
>>
>> The only real additional cost is address-space.  Are you strongly
>> concerned about the 32-bit case?
>
> No. I believe allocators have logic to clean them up once the address
> space is almost full.
>
> Thanks, Paul.
>
> --
> Kind regards,
> Minchan Kim


* Re: [RFC v2] Support volatile range for anon vma
  2012-10-31 22:56     ` KOSAKI Motohiro
  2012-11-01  1:15       ` Paul Turner
@ 2012-11-01  1:25       ` Minchan Kim
  2012-11-01  2:01         ` KOSAKI Motohiro
  2012-11-05 23:54       ` Arun Sharma
  2 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2012-11-01  1:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Paul Turner, Andrew Morton, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KAMEZAWA Hiroyuki, sanjay,
	David Rientjes

Hi KOSAKI,

On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> >> > The allocator should call madvise(MADV_NOVOLATILE) before handing
> >> > that area back to the user. Otherwise, accessing the volatile range
> >> > will raise a SIGBUS error.
> >>
> >> Well, why?  It would be easy enough for the fault handler to give
> >> userspace a new, zeroed page at that address.
> >
> > Note: MADV_DONTNEED already has this (nice) property.
> 
> I don't think I strictly understand this patch, but maybe I can answer
> why userland and malloc folks don't like MADV_DONTNEED.
> 
> glibc malloc discards freed memory using MADV_DONTNEED, just as
> tcmalloc does, and it is often the source of a large performance
> decrease, because MADV_DONTNEED discards memory immediately and the
> next malloc() call falls into the page fault and page-size memset()
> path; using DONTNEED thus increases the zero-fill and cache-miss rates.
> 
> At free() time, malloc has no knowledge of when the next big malloc()
> will be called, so immediate discarding may or may not yield a
> performance gain. (Ah, OK, the ratio is not 5:5, so it is usually
> worth it, but not every time.)
> 
> 
> In the past, several developers tried to avoid this situation by, for
> example:
> 
> - making a zero-page daemon to avoid the page-size zero fill at page fault
> - making new vma or page flags that mark memory as discardable without
>   swap, handled by vmscan (like this patch and/or MADV_FREE)

Thanks for the information.
You made me realize I'm not the first person to think of this idea.
Rik already tried it (https://lkml.org/lkml/2007/4/17/53) with a new page
flag, and other OSes already have such a feature. And John's concept was
already tried a long time ago (https://lkml.org/lkml/2005/11/1/384).

Hmm, I looked over Rik's thread but couldn't find why it wasn't merged
at that time. Does anyone know?

> - making a new process option that avoids page zero fill in the page
>   fault path (yes, it is a big incompatibility and insecure, but some
>   embedded folks thought that was an acceptable downside)
> - etc.
> 
> 
> BTW, I'm not sure this patch is better for malloc, because the current
> MADV_DONTNEED doesn't need mmap_sem and works very effectively with
> many threads. Taking mmap_sem might bring worse performance than
> DONTNEED. Dunno.

It's a good point.

Quoting my reply to Paul:
"
In that case, we can change the semantics so malloc doesn't need to call
madvise(NOVOLATILE) before allocating. Then the page fault handler has to
check whether the fault happened via access to a volatile vma. If so,
it could return a zero page instead of SIGBUS and mark the vma as no
longer volatile.
"
> 

-- 
Kind regards,
Minchan Kim


* Re: [RFC v2] Support volatile range for anon vma
  2012-11-01  1:22       ` Paul Turner
@ 2012-11-01  1:33         ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2012-11-01  1:33 UTC (permalink / raw)
  To: Paul Turner
  Cc: Andrew Morton, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

On Wed, Oct 31, 2012 at 06:22:58PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 5:50 PM, Minchan Kim <minchan@kernel.org> wrote:
> > Hello,
> >
> > On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
> >> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
> >> <akpm@linux-foundation.org> wrote:
> >> >
> >> > On Tue, 30 Oct 2012 10:29:54 +0900
> >> > Minchan Kim <minchan@kernel.org> wrote:
> >> >
> >> > > This patch introduces new madvise behaviors, MADV_VOLATILE and
> >> > > MADV_NOVOLATILE, for anonymous pages. It differs from John Stultz's
> >> > > version, which considers only tmpfs, while this patch considers only
> >> > > anonymous pages, so it cannot replace John's work. If the idea below
> >> > > proves reasonable, I hope we can unify both concepts via
> >> > > madvise/fadvise.
> >> > >
> >> > > The rationale is as follows.
> >> > > Many allocators call munmap(2) when the user calls free(3) and the
> >> > > pointer lies in an mmaped area. But munmap isn't cheap because it has
> >> > > to clean up all pte entries and unlink the vma, so the overhead grows
> >> > > linearly with the mmaped area's size.
> >> >
> >> > Presumably the userspace allocator will internally manage memory in
> >> > large chunks, so the munmap() call frequency will be much lower than
> >> > the free() call frequency.  So the performance gains from this change
> >> > might be very small.
> >>
> >> I don't think I strictly understand the motivation from a
> >> malloc-standpoint here.
> >>
> >> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
> >> to perform discards on Linux.    For any reasonable allocator (short
> >> of binding malloc --> mmap, free --> unmap) this seems a better
> >> choice.
> >>
> >> Note also, from a performance standpoint, I doubt any allocator (which
> >> cares about performance) is going to want to pay the cost of even a
> >> null syscall on typical malloc/free usage (consider: a tcmalloc
> >
> > Good point.
> >
> >> malloc/free pair is currently <20ns).  Given that this cost is
> >> amortized once you start doing discards on larger blocks, MADV_DONTNEED
> >> seems a preferable interface:
> >> - You don't need to reconstruct an arena when you do want to allocate
> >> since there's no munmap/mmap for the region to change about
> >> - There are no syscalls involved in later reallocating the block.
> >
> > The above benefits apply to MADV_VOLATILE, too.
> > But as you pointed out, there is a bit more overhead than DONTNEED
> > because the allocator should call madvise(MADV_NOVOLATILE) before allocation.
> > Although madvise(NOVOLATILE) just marks a vma flag, it does need mmap_sem,
> > which could be a problem on parallel malloc/free workloads, as KOSAKI pointed out.
> >
> > In that case, we can change the semantics so malloc doesn't need to call
> > madvise(NOVOLATILE) before allocating. Then the page fault handler has to
> > check whether the fault happened via access to a volatile vma. If so,
> > it could return a zero page instead of SIGBUS and mark the vma as no
> > longer volatile.
> 
> I think being able to determine whether the backing was discarded
> (as part of an atomic transition to non-volatile) would be a required
> property to make this useful for non-malloc use-cases.
> 

Absolutely.

-- 
Kind regards,
Minchan Kim


* Re: [RFC v2] Support volatile range for anon vma
  2012-11-01  1:15       ` Paul Turner
@ 2012-11-01  1:46         ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2012-11-01  1:46 UTC (permalink / raw)
  To: Paul Turner
  Cc: KOSAKI Motohiro, Andrew Morton, linux-kernel, linux-mm,
	John Stultz, Christoph Lameter, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Mike Hommey, Taras Glek,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

On Wed, Oct 31, 2012 at 06:15:33PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 3:56 PM, KOSAKI Motohiro
> <kosaki.motohiro@gmail.com> wrote:
> >>> > The allocator should call madvise(MADV_NOVOLATILE) before handing
> >>> > that area back to the user. Otherwise, accessing the volatile range
> >>> > will raise a SIGBUS error.
> >>>
> >>> Well, why?  It would be easy enough for the fault handler to give
> >>> userspace a new, zeroed page at that address.
> >>
> >> Note: MADV_DONTNEED already has this (nice) property.
> >
> > I don't think I fully understand this patch, but maybe I can answer why
> > userland and malloc folks don't like MADV_DONTNEED.
> >
> > glibc malloc discards freed memory by using MADV_DONTNEED,
> > as tcmalloc does, and it is often a source of large performance decreases,
> > because MADV_DONTNEED discards memory immediately and
> > the next malloc() call then falls into the page fault and page-sized memset() path.
> > So using DONTNEED increases the zero-fill and cache miss rate.
> >
> > At free() time, malloc has no knowledge of when the next big malloc()
> > will be called, so immediate discarding may or may not be a performance
> > gain. (Ok, the ratio is not 50:50; usually it is worth it, but not every time.)
> >
> 
> Ah; in tcmalloc, allocations (and their associated free-lists) are
> binned into separate lists as a function of object size, which helps
> mitigate this.
> 
> I'd make a separate, more general argument here:
> If I'm allocating a large (multi-kilobyte) object, the cost of whatever I'm
> about to do with that object is likely fairly large -- the fault/zero
> cost is probably a fairly small proportional cost, which limits the
> optimization's value.

While looking at Rik's old attempt, which shares the same goal with a
different implementation, I found this number.

https://lkml.org/lkml/2007/4/20/390

I believe the optimization is valuable. Of course, I need similar testing
to prove it.
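
For reference, the zero-fill effect KOSAKI describes is easy to reproduce
with a trivial userspace program (my own sketch, not from this patch):
after MADV_DONTNEED, every page of the next touch pays a minor fault plus
a zero-fill, which is the path a reuse-friendly scheme would avoid.

#include <string.h>
#include <sys/mman.h>

#define LEN (64UL << 20)			/* 64MB arena chunk */

int main(void)
{
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, LEN);		/* fault the pages in once */
	madvise(p, LEN, MADV_DONTNEED);	/* "free": pages are dropped now */
	memset(p, 1, LEN);		/* every page faults and zero-fills again */
	munmap(p, LEN);
	return 0;
}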

> 
> >
> > In the past, several developers tried to avoid this situation by, e.g.,
> >
> > - making a zero-page daemon to avoid page-sized zero fill at page fault
> > - making new vma or page flags that mark memory as discardable w/o swap,
> >   handled by vmscan (like this patch and/or MADV_FREE)
> > - making a new process option that avoids page zero fill in the page fault path
> >   (yes, it is a big incompatibility and insecure, but some embedded folks thought
> >    it an acceptable downside)
> > - etc.
> >
> >
> > btw, I'm not sure this patch is better for malloc, because the current MADV_DONTNEED
> > doesn't need mmap_sem and works very effectively with many threads.
> > Taking mmap_sem might bring worse performance than DONTNEED. dunno.
> 
> MADV_VOLATILE also seems to end up looking quite similar to a
> user-visible (range-based) cleancache.
> 
> A second popular use-case for such semantics is the case of
> discardable cache elements (e.g. web browser).  I suspect we'd want to
> at least mention these in the changelog.  (Alternatively, what does a
> cleancache-backed-fs exposing these semantics look like?)
> 

That's John Stultz's trial (http://lwn.net/Articles/518130/; there was another
attempt a long time ago: https://lkml.org/lkml/2005/11/1/384), and I want to
expand the concept from file-backed pages to anonymous pages, so this patch
is a trial for anonymous pages. Hence the use case of my patch has focused on
the malloc/free case.
I hope both can be unified.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-11-01  1:25       ` Minchan Kim
@ 2012-11-01  2:01         ` KOSAKI Motohiro
  0 siblings, 0 replies; 30+ messages in thread
From: KOSAKI Motohiro @ 2012-11-01  2:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Paul Turner, Andrew Morton, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KAMEZAWA Hiroyuki, sanjay,
	David Rientjes

>> - making a zero-page daemon to avoid page-sized zero fill at page fault
>> - making new vma or page flags that mark memory as discardable w/o swap,
>>   handled by vmscan (like this patch and/or MADV_FREE)
>
> Thanks for the information.
> I realized from it that I'm not the first person to think of this idea.
> Rik already tried it (https://lkml.org/lkml/2007/4/17/53) with a new page flag,
> and other OSes already have such a feature. And John's concept was
> already tried a long time ago (https://lkml.org/lkml/2005/11/1/384).
>
> Hmm, I looked over Rik's thread but couldn't find why it wasn't merged
> at that time. Does anyone know?

Dunno. And I like the volatile feature more than the old one. But a bold
remark: please don't trust me 100%; I haven't reviewed your patch's code
in detail and I don't fully understand it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-10-30  1:29 [RFC v2] Support volatile range for anon vma Minchan Kim
  2012-10-31 21:35 ` Andrew Morton
@ 2012-11-02  1:43 ` Bob Liu
  2012-11-02  2:37   ` Minchan Kim
  2012-11-22  0:36 ` John Stultz
  2 siblings, 1 reply; 30+ messages in thread
From: Bob Liu @ 2012-11-02  1:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Tue, Oct 30, 2012 at 9:29 AM, Minchan Kim <minchan@kernel.org> wrote:
> This patch introduces the new madvise behaviors MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It differs from
> John Stultz's version, which considers only tmpfs, while this patch
> considers only anonymous pages, so it cannot cover John's case.
> If the idea below proves reasonable, I hope we can unify both
> concepts via madvise/fadvise.
>
> The rationale is as follows.
> Many allocators call munmap(2) when the user calls free(3) if ptr is
> in an mmaped area. But munmap isn't cheap because it has to clean up
> all pte entries and unlink a vma, so the overhead increases
> linearly with the mmaped area's size.
>

I have a question.
Pte entries are cleaned up during munmap(), so if user space tries to
access the unmapped address, a page fault will be generated.

What happens if MADV_VOLATILE is used instead?
The pte entries are not cleaned up, so user space can still access the
memory before the VM discards the pages?

> The volatile concept of Robert Love could be very useful for reducing
> free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
> munmap(2) (of course, they need to manage the volatile mmaped areas to
> avoid address space shortage and sometimes end up unmapping them).
> The madvise(MADV_VOLATILE|NOVOLATILE) is a very cheap operation because
>
> 1) it just marks the flag in the VMA and
> 2) if memory pressure happens, the VM can discard pages of a volatile VMA
>    instead of swapping them out when volatile pages are selected as victims
>    by the normal VM aging policy.
> 3) a freed mmaped area doesn't include any meaningful data, so there
>    is no point in swapping it out.
>
> The allocator should call madvise(MADV_NOVOLATILE) before reusing such an
> area for allocation to the user. Otherwise, accessing the volatile range
> will hit a SIGBUS error.
>
> The downside is that we have to age the anon lru list although we don't
> have swap, because I don't want to discard volatile pages with top priority
> when memory pressure happens: volatile in this patch means "we don't
> need to swap out because the user can handle data that
> disappear suddenly", NOT "they are useless, so hurry up and reclaim them".
> So I want to apply the same aging rule as for normal pages to them.
>
> Anon background aging on a non-swap system would be the trade-off for
> getting this feature. We had even done it until two years ago, until merge
> [1], and I believe the free(3) performance gain will beat the loss from
> anon lru aging's overhead once all allocators start to use madvise.
> (This patch doesn't include background aging in the non-swap case,
>  but it's trivial if we decide to do it.)
>
> I hope to see opinions from others before diving into glibc or bionic.
> Any comments are welcome.
>
> [1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
>
> Changelog
>  * from RFC v1
>    * add a clear comment about purge - Christoph
>    * Change patch description
>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/asm-generic/mman-common.h |    3 +
>  include/linux/mm.h                |    9 ++-
>  include/linux/mm_types.h          |    5 ++
>  include/linux/rmap.h              |   24 ++++++-
>  mm/ksm.c                          |    4 +-
>  mm/madvise.c                      |   32 +++++++++-
>  mm/memory.c                       |    2 +
>  mm/migrate.c                      |    6 +-
>  mm/rmap.c                         |  126 +++++++++++++++++++++++++++++++++++--
>  mm/vmscan.c                       |    3 +
>  10 files changed, 202 insertions(+), 12 deletions(-)
>
> diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
> index d030d2c..5f8090d 100644
> --- a/include/asm-generic/mman-common.h
> +++ b/include/asm-generic/mman-common.h
> @@ -34,6 +34,9 @@
>  #define MADV_SEQUENTIAL        2               /* expect sequential page references */
>  #define MADV_WILLNEED  3               /* will need these pages */
>  #define MADV_DONTNEED  4               /* don't need these pages */
> +#define MADV_VOLATILE  5               /* pages will disappear suddenly */
> +#define MADV_NOVOLATILE 6              /* pages will not disappear */
> +
>
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_REMOVE    9               /* remove these pages & resources */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 311be90..78c8c08 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -120,6 +120,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_PFN_AT_MMAP 0x40000000      /* PFNMAP vma that is fully mapped at mmap time */
>  #define VM_MERGEABLE   0x80000000      /* KSM may merge identical pages */
>
> +/*
> + * Recently, Konstantin removed a few flags but not merged yet
> + * so we will get a room for new flag for supporting 32 bit.
> + * Thanks, Konstantin!
> + */
> +#define VM_VOLATILE    0x100000000
> +
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP      (VM_RAND_READ | VM_SEQ_READ)
>
> @@ -143,7 +150,7 @@ extern unsigned int kobjsize(const void *objp);
>   * Special vmas that are non-mergable, non-mlock()able.
>   * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
>   */
> -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
> +#define VM_SPECIAL (VM_IO|VM_DONTEXPAND|VM_RESERVED|VM_PFNMAP|VM_VOLATILE)
>
>  /*
>   * mapping from the currently active vm_flags protection bits (the
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index bf78672..c813daa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -279,6 +279,11 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>         struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
>  #endif
> +       /*
> +        * True if more than a page in this vma is reclaimed.
> +        * It's protected by anon_vma->mutex.
> +        */
> +       bool purged;
>  };
>
>  struct core_thread {
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 3fce545..65b9f33 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -67,6 +67,10 @@ struct anon_vma_chain {
>         struct list_head same_anon_vma; /* locked by anon_vma->mutex */
>  };
>
> +
> +void volatile_lock(struct vm_area_struct *vma);
> +void volatile_unlock(struct vm_area_struct *vma);
> +
>  #ifdef CONFIG_MMU
>  static inline void get_anon_vma(struct anon_vma *anon_vma)
>  {
> @@ -170,12 +174,14 @@ enum ttu_flags {
>         TTU_IGNORE_MLOCK = (1 << 8),    /* ignore mlock */
>         TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
>         TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> +       TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
>  };
>  #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>
>  int try_to_unmap(struct page *, enum ttu_flags flags);
>  int try_to_unmap_one(struct page *, struct vm_area_struct *,
> -                       unsigned long address, enum ttu_flags flags);
> +                       unsigned long address, enum ttu_flags flags,
> +                       bool *is_volatile);
>
>  /*
>   * Called from mm/filemap_xip.c to unmap empty zero page
> @@ -194,6 +200,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
>         return ptep;
>  }
>
> +pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
> +                               unsigned long, spinlock_t **);
> +
> +static inline pte_t *page_check_volatile_address(struct page *page,
> +                                       struct mm_struct *mm,
> +                                       unsigned long address,
> +                                       spinlock_t **ptlp)
> +{
> +       pte_t *ptep;
> +
> +       __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
> +                                       mm, address, ptlp));
> +       return ptep;
> +}
> +
>  /*
>   * Used by swapoff to help locate where page is expected in vma.
>   */
> @@ -257,5 +278,6 @@ static inline int page_mkclean(struct page *page)
>  #define SWAP_AGAIN     1
>  #define SWAP_FAIL      2
>  #define SWAP_MLOCK     3
> +#define SWAP_DISCARD   4
>
>  #endif /* _LINUX_RMAP_H */
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 47c8853..22c54d2 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1653,6 +1653,7 @@ int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
>         struct rmap_item *rmap_item;
>         int ret = SWAP_AGAIN;
>         int search_new_forks = 0;
> +       bool dummy_volatile;
>
>         VM_BUG_ON(!PageKsm(page));
>         VM_BUG_ON(!PageLocked(page));
> @@ -1682,7 +1683,8 @@ again:
>                                 continue;
>
>                         ret = try_to_unmap_one(page, vma,
> -                                       rmap_item->address, flags);
> +                                       rmap_item->address, flags,
> +                                       &dummy_volatile);
>                         if (ret != SWAP_AGAIN || !page_mapped(page)) {
>                                 anon_vma_unlock(anon_vma);
>                                 goto out;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 14d260f..53cd77f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
>                 if (error)
>                         goto out;
>                 break;
> +       case MADV_VOLATILE:
> +               if (vma->vm_flags & VM_LOCKED) {
> +                       error = -EINVAL;
> +                       goto out;
> +               }
> +               new_flags |= VM_VOLATILE;
> +               vma->purged = false;
> +               break;
> +       case MADV_NOVOLATILE:
> +               if (!(vma->vm_flags & VM_VOLATILE)) {
> +                       error = -EINVAL;
> +                       goto out;
> +               }
> +
> +               new_flags &= ~VM_VOLATILE;
> +               break;
>         }
>
>         if (new_flags == vma->vm_flags) {
> @@ -118,9 +134,15 @@ static long madvise_behavior(struct vm_area_struct * vma,
>  success:
>         /*
>          * vm_flags is protected by the mmap_sem held in write mode.
> +        * In case of VOLATILE, we need volatile_lock, additionally.
>          */
> +       if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
> +               volatile_lock(vma);
>         vma->vm_flags = new_flags;
> -
> +       if (behavior == MADV_NOVOLATILE)
> +               error = vma->purged;
> +       if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
> +               volatile_unlock(vma);
>  out:
>         if (error == -ENOMEM)
>                 error = -EAGAIN;
> @@ -310,6 +332,8 @@ madvise_behavior_valid(int behavior)
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> +       case MADV_VOLATILE:
> +       case MADV_NOVOLATILE:
>                 return 1;
>
>         default:
> @@ -383,7 +407,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>
>         if (start & ~PAGE_MASK)
>                 goto out;
> -       len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> +
> +       if (behavior != MADV_VOLATILE && behavior != MADV_NOVOLATILE)
> +               len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> +       else
> +               len = len_in & PAGE_MASK;
>
>         /* Check to see whether len was rounded up from small -ve to zero */
>         if (len_in && !len)
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..26b3f73 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3441,6 +3441,8 @@ int handle_pte_fault(struct mm_struct *mm,
>         entry = *pte;
>         if (!pte_present(entry)) {
>                 if (pte_none(entry)) {
> +                       if (unlikely(vma->vm_flags & VM_VOLATILE))
> +                               return VM_FAULT_SIGBUS;
>                         if (vma->vm_ops) {
>                                 if (likely(vma->vm_ops->fault))
>                                         return do_linear_fault(mm, vma, address,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 77ed2d7..d1b51af 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>         }
>
>         /* Establish migration ptes or remove ptes */
> -       try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +       try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +                       TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
>  skip_unmap:
>         if (!page_mapped(page))
> @@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>         if (PageAnon(hpage))
>                 anon_vma = page_get_anon_vma(hpage);
>
> -       try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +       try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +                               TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
>         if (!page_mapped(hpage))
>                 rc = move_to_new_page(new_hpage, hpage, 1, mode);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..778abfc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>         return vma_address(page, vma);
>  }
>
> +pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
> +                         unsigned long address, spinlock_t **ptlp)
> +{
> +       pgd_t *pgd;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +       spinlock_t *ptl;
> +
> +       swp_entry_t entry = { .val = page_private(page) };
> +
> +       if (unlikely(PageHuge(page))) {
> +               pte = huge_pte_offset(mm, address);
> +               ptl = &mm->page_table_lock;
> +               goto check;
> +       }
> +
> +       pgd = pgd_offset(mm, address);
> +       if (!pgd_present(*pgd))
> +               return NULL;
> +
> +       pud = pud_offset(pgd, address);
> +       if (!pud_present(*pud))
> +               return NULL;
> +
> +       pmd = pmd_offset(pud, address);
> +       if (!pmd_present(*pmd))
> +               return NULL;
> +       if (pmd_trans_huge(*pmd))
> +               return NULL;
> +
> +       pte = pte_offset_map(pmd, address);
> +       ptl = pte_lockptr(mm, pmd);
> +check:
> +       spin_lock(ptl);
> +       if (PageAnon(page)) {
> +               if (!pte_present(*pte) && entry.val ==
> +                               pte_to_swp_entry(*pte).val) {
> +                       *ptlp = ptl;
> +                       return pte;
> +               }
> +       } else {
> +               if (pte_none(*pte)) {
> +                       *ptlp = ptl;
> +                       return pte;
> +               }
> +       }
> +       pte_unmap_unlock(pte, ptl);
> +       return NULL;
> +}
> +
>  /*
>   * Check that @page is mapped at @address into @mm.
>   *
> @@ -1218,12 +1269,42 @@ out:
>                 mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>
> +int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
> +               unsigned long address)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       pte_t *pte;
> +       pte_t pteval;
> +       spinlock_t *ptl;
> +
> +       pte = page_check_volatile_address(page, mm, address, &ptl);
> +       if (!pte)
> +               return 0;
> +
> +       /* Nuke the page table entry. */
> +       flush_cache_page(vma, address, page_to_pfn(page));
> +       pteval = ptep_clear_flush(vma, address, pte);
> +
> +       if (PageAnon(page)) {
> +               swp_entry_t entry = { .val = page_private(page) };
> +               if (PageSwapCache(page)) {
> +                       dec_mm_counter(mm, MM_SWAPENTS);
> +                       swap_free(entry);
> +               }
> +       }
> +
> +       pte_unmap_unlock(pte, ptl);
> +       mmu_notifier_invalidate_page(mm, address);
> +       return 1;
> +}
> +
>  /*
>   * Subfunctions of try_to_unmap: try_to_unmap_one called
>   * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
>   */
>  int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> -                    unsigned long address, enum ttu_flags flags)
> +                    unsigned long address, enum ttu_flags flags,
> +                    bool *is_volatile)
>  {
>         struct mm_struct *mm = vma->vm_mm;
>         pte_t *pte;
> @@ -1235,6 +1316,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         if (!pte)
>                 goto out;
>
> +       if (!(vma->vm_flags & VM_VOLATILE))
> +               *is_volatile = false;
>         /*
>          * If the page is mlock()d, we cannot swap it out.
>          * If it's recently referenced (perhaps page_referenced
> @@ -1494,6 +1577,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>         struct anon_vma *anon_vma;
>         struct anon_vma_chain *avc;
>         int ret = SWAP_AGAIN;
> +       bool is_volatile = true;
> +
> +       if (flags & TTU_IGNORE_VOLATILE)
> +               is_volatile = false;
>
>         anon_vma = page_lock_anon_vma(page);
>         if (!anon_vma)
> @@ -1512,17 +1599,32 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>                  * temporary VMAs until after exec() completes.
>                  */
>                 if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> -                               is_vma_temporary_stack(vma))
> +                               is_vma_temporary_stack(vma)) {
> +                       is_volatile = false;
>                         continue;
> +               }
>
>                 address = vma_address(page, vma);
>                 if (address == -EFAULT)
>                         continue;
> -               ret = try_to_unmap_one(page, vma, address, flags);
> +               ret = try_to_unmap_one(page, vma, address, flags, &is_volatile);
>                 if (ret != SWAP_AGAIN || !page_mapped(page))
>                         break;
>         }
>
> +       if (page_mapped(page) || is_volatile == false)
> +               goto out;
> +
> +       list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +               struct vm_area_struct *vma = avc->vma;
> +               unsigned long address;
> +
> +               address = vma_address(page, vma);
> +               if (try_to_zap_one(page, vma, address))
> +                       vma->purged = true;
> +       }
> +       ret = SWAP_DISCARD;
> +out:
>         page_unlock_anon_vma(anon_vma);
>         return ret;
>  }
> @@ -1553,13 +1655,14 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
>         unsigned long max_nl_cursor = 0;
>         unsigned long max_nl_size = 0;
>         unsigned int mapcount;
> +       bool dummy;
>
>         mutex_lock(&mapping->i_mmap_mutex);
>         vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
>                 unsigned long address = vma_address(page, vma);
>                 if (address == -EFAULT)
>                         continue;
> -               ret = try_to_unmap_one(page, vma, address, flags);
> +               ret = try_to_unmap_one(page, vma, address, flags, &dummy);
>                 if (ret != SWAP_AGAIN || !page_mapped(page))
>                         goto out;
>         }
> @@ -1651,6 +1754,7 @@ out:
>   * SWAP_AGAIN  - we missed a mapping, try again later
>   * SWAP_FAIL   - the page is unswappable
>   * SWAP_MLOCK  - page is mlocked.
> + * SWAP_DISCARD - page is volatile.
>   */
>  int try_to_unmap(struct page *page, enum ttu_flags flags)
>  {
> @@ -1665,7 +1769,7 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
>                 ret = try_to_unmap_anon(page, flags);
>         else
>                 ret = try_to_unmap_file(page, flags);
> -       if (ret != SWAP_MLOCK && !page_mapped(page))
> +       if (ret != SWAP_MLOCK && !page_mapped(page) && ret != SWAP_DISCARD)
>                 ret = SWAP_SUCCESS;
>         return ret;
>  }
> @@ -1707,6 +1811,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
>         anon_vma_free(anon_vma);
>  }
>
> +void volatile_lock(struct vm_area_struct *vma)
> +{
> +       if (vma->anon_vma)
> +               anon_vma_lock(vma->anon_vma);
> +}
> +
> +void volatile_unlock(struct vm_area_struct *vma)
> +{
> +       if (vma->anon_vma)
> +               anon_vma_unlock(vma->anon_vma);
> +}
> +
>  #ifdef CONFIG_MIGRATION
>  /*
>   * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99b434b..d5b60d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -789,6 +789,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                  */
>                 if (page_mapped(page) && mapping) {
>                         switch (try_to_unmap(page, TTU_UNMAP)) {
> +                       case SWAP_DISCARD:
> +                               goto discard_page;
>                         case SWAP_FAIL:
>                                 goto activate_locked;
>                         case SWAP_AGAIN:
> @@ -857,6 +859,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         }
>                 }
>
> +discard_page:
>                 /*
>                  * If the page has buffers, try to free the buffer mappings
>                  * associated with this page. If we succeed we try to free
> --
> 1.7.9.5
>

-- 
Thanks,
--Bob

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-11-02  1:43 ` Bob Liu
@ 2012-11-02  2:37   ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2012-11-02  2:37 UTC (permalink / raw)
  To: Bob Liu
  Cc: Andrew Morton, linux-kernel, linux-mm, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

Hi Bob,

On Fri, Nov 02, 2012 at 09:43:01AM +0800, Bob Liu wrote:
> On Tue, Oct 30, 2012 at 9:29 AM, Minchan Kim <minchan@kernel.org> wrote:
> > This patch introduces the new madvise behaviors MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It differs from
> > John Stultz's version, which considers only tmpfs, while this patch
> > considers only anonymous pages, so it cannot cover John's case.
> > If the idea below proves reasonable, I hope we can unify both
> > concepts via madvise/fadvise.
> >
> > The rationale is as follows.
> > Many allocators call munmap(2) when the user calls free(3) if ptr is
> > in an mmaped area. But munmap isn't cheap because it has to clean up
> > all pte entries and unlink a vma, so the overhead increases
> > linearly with the mmaped area's size.
> >
> 
> I have a question.
> Pte entries are cleaned up during munmap(), so if user space tries to
> access the unmapped address, a page fault will be generated.
>
> What happens if MADV_VOLATILE is used instead?

1) If the user accesses pages of a volatile vma which were discarded by
   the VM due to memory pressure, he gets SIGBUS.

2) If the user accesses pages of a volatile vma which still live in memory
   (not discarded), the access succeeds.

3) If the user accesses pages of a volatile vma which were swapped out
   before calling madvise(MADV_VOLATILE), it's okay.

> The pte entries are not cleaned up, so user space can still access the
> memory before the VM discards the pages?

Yeb.
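
For clarity, here is a sketch of the allocator-side usage I have in mind
(the MADV_VOLATILE/MADV_NOVOLATILE values come from this patch's
mman-common.h hunk and are not in mainline headers, so they are defined
locally; chunk_free/chunk_reuse are made-up helper names for illustration):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE	5	/* from this patch, not mainline */
#define MADV_NOVOLATILE	6	/* from this patch, not mainline */
#endif

/* free(3) path: keep the mapping; the VM may discard the pages later */
static void chunk_free(void *ptr, size_t len)
{
	madvise(ptr, len, MADV_VOLATILE);
}

/*
 * malloc(3) path: make the range stable again before handing it out.
 * With this patch madvise(MADV_NOVOLATILE) returns vma->purged, so a
 * non-zero return means the old contents were discarded and the chunk
 * must be treated as uninitialized by the caller.
 */
static int chunk_reuse(void *ptr, size_t len)
{
	return madvise(ptr, len, MADV_NOVOLATILE);
}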

> [rest of quoted mail snipped; it repeats the full patch already quoted above]

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-10-31 22:56     ` KOSAKI Motohiro
  2012-11-01  1:15       ` Paul Turner
  2012-11-01  1:25       ` Minchan Kim
@ 2012-11-05 23:54       ` Arun Sharma
  2012-11-06  1:49         ` Minchan Kim
  2 siblings, 1 reply; 30+ messages in thread
From: Arun Sharma @ 2012-11-05 23:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Paul Turner, Andrew Morton, Minchan Kim, linux-kernel, linux-mm,
	John Stultz, Christoph Lameter, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Mike Hommey, Taras Glek,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> glibc malloc discards freed memory by using MADV_DONTNEED,
> as tcmalloc does, and it is often a source of large performance decreases,
> because MADV_DONTNEED discards memory immediately and
> the next malloc() call then falls into the page fault and page-sized memset() path.
> So using DONTNEED increases the zero-fill and cache miss rate.

The memcg-based solution that I posted a few months ago is working well
for us. We see significantly less CPU spent zeroing pages.

Not everyone was comfortable with the security implications of recycling
pages between processes in a memcg, although it was disabled by default
and had to be explicitly opted into.

Also, memory allocators have a second motivation for using madvise: to
create virtually contiguous regions of memory from a fragmented address
space, without increasing the RSS.

 -Arun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-11-05 23:54       ` Arun Sharma
@ 2012-11-06  1:49         ` Minchan Kim
  2012-11-06  2:03           ` Arun Sharma
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2012-11-06  1:49 UTC (permalink / raw)
  To: Arun Sharma
  Cc: KOSAKI Motohiro, Paul Turner, Andrew Morton, linux-kernel,
	linux-mm, John Stultz, Christoph Lameter, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Mike Hommey, Taras Glek,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

Hello,

On Mon, Nov 05, 2012 at 03:54:43PM -0800, Arun Sharma wrote:
> On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> > glibc malloc discards freed memory by using MADV_DONTNEED,
> > as tcmalloc does, and it is often a source of large performance decreases,
> > because MADV_DONTNEED discards memory immediately and
> > the next malloc() call then falls into the page fault and page-sized memset() path.
> > So using DONTNEED increases the zero-fill and cache miss rate.
> 
> The memcg-based solution that I posted a few months ago is working well
> for us. We see significantly less CPU spent zeroing pages.
>
> Not everyone was comfortable with the security implications of recycling
> pages between processes in a memcg, although it was disabled by default
> and had to be explicitly opted into.
>
> Also, memory allocators have a second motivation for using madvise: to
> create virtually contiguous regions of memory from a fragmented address
> space, without increasing the RSS.

I don't get it. How do we create contiguous regions with madvise?
Just out of curiosity, could you elaborate on that use case? :)

> 
>  -Arun
> 

-- 
Kind Regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-11-06  1:49         ` Minchan Kim
@ 2012-11-06  2:03           ` Arun Sharma
  0 siblings, 0 replies; 30+ messages in thread
From: Arun Sharma @ 2012-11-06  2:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Paul Turner, Andrew Morton, linux-kernel,
	linux-mm, John Stultz, Christoph Lameter, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Mike Hommey, Taras Glek,
	KAMEZAWA Hiroyuki, sanjay, David Rientjes

On 11/5/12 5:49 PM, Minchan Kim wrote:

>> Also, memory allocators have a second motivation in using madvise: to
>> create virtually contiguous regions of memory from a fragmented address
>> space, without increasing the RSS.
>
> I don't get it. How do we create a contiguous region with madvise?
> Just out of curiosity.
> Could you elaborate on that use case? :)

By using a new anonymous map and faulting pages in.

The fragmented virtual memory is released via MADV_DONTNEED, and if the 
malloc/free activity on the system is dominated by one process, chances 
are that the newly faulted-in page is one released by the same 
process :)

The net effect is that physical pages within a single address space are 
rearranged so larger allocations can be satisfied.
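
To make that concrete, here is a minimal sketch of the trick (the
function and its parameters are illustrative, not from any real
allocator):

#include <string.h>
#include <sys/mman.h>

/*
 * Return the physical pages backing a fragmented region to the kernel
 * without unmapping it, then satisfy a large allocation from a fresh
 * anonymous mapping. The newly faulted-in pages are likely the frames
 * just released, so physical memory is rearranged into a virtually
 * contiguous region without growing RSS.
 */
static void *defrag_alloc(void *fragmented, size_t frag_len, size_t want)
{
	/* Drop the backing pages; the virtual range stays mapped. */
	madvise(fragmented, frag_len, MADV_DONTNEED);

	/* A new anonymous map provides a virtually contiguous range. */
	void *p = mmap(NULL, want, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Touching the pages faults them in, reusing the freed frames. */
	memset(p, 0, want);
	return p;
}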

  -Arun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-10-30  1:29 [RFC v2] Support volatile range for anon vma Minchan Kim
  2012-10-31 21:35 ` Andrew Morton
  2012-11-02  1:43 ` Bob Liu
@ 2012-11-22  0:36 ` John Stultz
  2012-11-29  4:18   ` John Stultz
  2012-12-03 23:50   ` Minchan Kim
  2 siblings, 2 replies; 30+ messages in thread
From: John Stultz @ 2012-11-22  0:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 10/29/2012 06:29 PM, Minchan Kim wrote:
> This patch introduces the new madvise behaviors MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different from
> John Stultz's version, which considers only tmpfs, while this patch
> considers only anonymous pages, so it cannot cover John's.
> If the idea below proves reasonable, I hope we can unify both
> concepts via madvise/fadvise.
>
> The rationale is as follows.
> Many allocators call munmap(2) when the user calls free(3) if ptr is
> in an mmapped area. But munmap isn't cheap because it has to clean up
> all pte entries and unlink a vma, so the overhead increases
> linearly with the mmapped area's size.
>
> Robert Love's volatile concept could be very useful for reducing
> free(3) overhead. Allocators can call madvise(MADV_VOLATILE) instead of
> munmap(2) (of course, they need to manage volatile mmapped areas to
> avoid address space shortage and sometimes end up unmapping them).
> madvise(MADV_VOLATILE|NOVOLATILE) is a very cheap operation because
>
> 1) it just marks a flag in the VMA, and
> 2) if memory pressure happens, the VM can discard pages of a volatile VMA
>     instead of swapping them out when volatile pages are selected as victims
>     by the normal VM aging policy.
> 3) a freed mmapped area doesn't contain any meaningful data, so there
>     is no point in swapping it out.
>
> The allocator should call madvise(MADV_NOVOLATILE) before reusing that
> area for an allocation handed to the user. Otherwise, accessing the
> volatile range will raise a SIGBUS error.
>
> The downside is that we have to age the anon lru list although we don't
> have swap, because I don't want to discard volatile pages with top priority
> when memory pressure happens: volatile in this patch means "we don't
> need to swap these out because the user can handle data that
> disappears suddenly", NOT "they are useless, so hurry up and reclaim them".
> So I want to apply the same aging rule as for normal pages to them.
>
> Anon background aging on a non-swap system would be a trade-off for
> getting a good feature. We even did it until two years ago, until merge
> [1], and I believe free(3)'s performance gain will beat the loss from anon
> lru aging's overhead once all the allocators start to use madvise.
> (This patch doesn't include background aging for the non-swap case,
>   but it's trivial if we decide to add it.)

Hey Minchan!
     So I've been looking at your patch for a bit, and I'm still trying 
to fully grok it and the rmap code. Overall this approach looks pretty 
interesting, and while your patch description focused on malloc/free 
behavior, I suspect your patch would satisfy what the mozilla folks are 
looking for, and while it's not quite sufficient yet for Android, the 
interface semantics are very close to what I've been wanting (my test 
cases were easily mapped over).

The two major issues for me are:
1) As you noted, this approach currently doesn't work on non-swap 
systems, as we don't try to shrink the anonymous page lrus. This is a 
big problem, as it makes it unusable for almost all Android systems. You 
suggest we may want to change how the anonymous lru is aged, and I had a 
patch earlier that tried to change some of the anonymous lru aging rules 
for volatile pages, but it's not quite right for what you have here. So 
I'd be interested in hearing how you think the anonymous lru aging 
should happen without swap.

2) Being able to use this with tmpfs files. I'm currently trying to 
better understand the rmap code, looking to see if there's a way to 
have try_to_unmap_file() work similarly to try_to_unmap_anon(), to 
allow users to madvise() on mmapped tmpfs files. This would provide a 
very similar interface to what I've been proposing with 
fadvise/fallocate, but just using process virtual addresses instead of 
(fd, offset) pairs.  The benefit of (fd, offset) pairs for Android is 
that it's easier to manage shared volatile ranges between two processes 
that are sharing data via an mmapped tmpfs file (although this actual 
use case may be fairly rare).  I believe we should still be able to 
rework the ashmem internals to use madvise (which would provide legacy 
support for existing android apps), so then it's just a question of 
whether we could eventually convince Android apps to use the madvise 
interface directly, rather than the ashmem unpin ioctl.

The other concern with the madvise-on-mmapped-files approach is that 
there's no easy way I can see to limit it to tmpfs files. I know some 
have been interested in having a fallocate(VOLATILE) interface for 
non-tmpfs files, but I'm not sure I see the benefit there yet. I have 
noticed folks mixing the idea of volatile pages being purged under memory 
pressure with the idea of volatile files, which might be purged from 
disk under disk pressure. While I think the second idea is interesting, 
I do think it's completely separate from the volatile memory concept.

Anyway, I'd be interested in your thoughts on these two issues. Thanks 
so much for sending out this patch; it's given me quite a bit to chew on, 
and I too hope we can merge our different approaches together.

thanks
-john


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-11-22  0:36 ` John Stultz
@ 2012-11-29  4:18   ` John Stultz
  2012-12-04  0:00     ` Minchan Kim
  2012-12-03 23:50   ` Minchan Kim
  1 sibling, 1 reply; 30+ messages in thread
From: John Stultz @ 2012-11-29  4:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 11/21/2012 04:36 PM, John Stultz wrote:
> 2) Being able to use this with tmpfs files. I'm currently trying to 
> better understand the rmap code, looking to see if there's a way to 
> have try_to_unmap_file() work similarly to try_to_unmap_anon(), to 
> allow users to madvise() on mmapped tmpfs files. This would provide 
> a very similar interface to what I've been proposing with 
> fadvise/fallocate, but just using process virtual addresses instead of 
> (fd, offset) pairs.  The benefit of (fd, offset) pairs for Android 
> is that it's easier to manage shared volatile ranges between two 
> processes that are sharing data via an mmapped tmpfs file (although 
> this actual use case may be fairly rare).  I believe we should still 
> be able to rework the ashmem internals to use madvise (which would 
> provide legacy support for existing android apps), so then it's just a 
> question of whether we could eventually convince Android apps to use 
> the madvise interface directly, rather than the ashmem unpin ioctl.

Hey Minchan,
     I've been playing around with your patch trying to better 
understand your approach and to extend it to support tmpfs files. In 
doing so I've found a few bugs, and have some rough fixes I wanted to 
share. There are still a few edge cases I need to deal with (the 
vma-purged flag isn't being properly handled through vma merge/split 
operations), but it's starting to come along.

Anyway, take a look at the tree here and let me know what you think.
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

I'm sure much is wrong with the tree, but with it I can now mark tmpfs 
file pages as volatile/nonvolatile and see them purged under pressure. 
Unfortunately it's not limited to tmpfs, so persistent files will also 
work, but the state of the underlying files on purge is undefined. 
Hopefully I can find a way to limit it to non-persistent filesystems for 
now, and if needed find a way to extend it to persistent filesystems in 
a sane way later.

thanks
-john


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-11-22  0:36 ` John Stultz
  2012-11-29  4:18   ` John Stultz
@ 2012-12-03 23:50   ` Minchan Kim
  1 sibling, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2012-12-03 23:50 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

Hi John,

Sorry for the long delay.
At last, I've got a chance to look at this again.

On Wed, Nov 21, 2012 at 04:36:42PM -0800, John Stultz wrote:
> On 10/29/2012 06:29 PM, Minchan Kim wrote:
> >This patch introduces the new madvise behaviors MADV_VOLATILE and
> >MADV_NOVOLATILE for anonymous pages. It's different from
> >John Stultz's version, which considers only tmpfs, while this patch
> >considers only anonymous pages, so it cannot cover John's.
> >If the idea below proves reasonable, I hope we can unify both
> >concepts via madvise/fadvise.
> >
> >The rationale is as follows.
> >Many allocators call munmap(2) when the user calls free(3) if ptr is
> >in an mmapped area. But munmap isn't cheap because it has to clean up
> >all pte entries and unlink a vma, so the overhead increases
> >linearly with the mmapped area's size.
> >
> >Robert Love's volatile concept could be very useful for reducing
> >free(3) overhead. Allocators can call madvise(MADV_VOLATILE) instead of
> >munmap(2) (of course, they need to manage volatile mmapped areas to
> >avoid address space shortage and sometimes end up unmapping them).
> >madvise(MADV_VOLATILE|NOVOLATILE) is a very cheap operation because
> >
> >1) it just marks a flag in the VMA, and
> >2) if memory pressure happens, the VM can discard pages of a volatile VMA
> >    instead of swapping them out when volatile pages are selected as victims
> >    by the normal VM aging policy.
> >3) a freed mmapped area doesn't contain any meaningful data, so there
> >    is no point in swapping it out.
> >
> >The allocator should call madvise(MADV_NOVOLATILE) before reusing that
> >area for an allocation handed to the user. Otherwise, accessing the
> >volatile range will raise a SIGBUS error.
> >
> >The downside is that we have to age the anon lru list although we don't
> >have swap, because I don't want to discard volatile pages with top priority
> >when memory pressure happens: volatile in this patch means "we don't
> >need to swap these out because the user can handle data that
> >disappears suddenly", NOT "they are useless, so hurry up and reclaim them".
> >So I want to apply the same aging rule as for normal pages to them.
> >
> >Anon background aging on a non-swap system would be a trade-off for
> >getting a good feature. We even did it until two years ago, until merge
> >[1], and I believe free(3)'s performance gain will beat the loss from anon
> >lru aging's overhead once all the allocators start to use madvise.
> >(This patch doesn't include background aging for the non-swap case,
> > but it's trivial if we decide to add it.)
> 
> Hey Minchan!
>     So I've been looking at your patch for a bit, and I'm still
> trying to fully grok it and the rmap code. Overall this approach
> looks pretty interesting, and while your patch description focused
> on malloc/free behavior, I suspect your patch would satisfy what the
> mozilla folks are looking for, and while it's not quite sufficient
> yet for Android, the interface semantics are very close to what I've
> been wanting (my test cases were easily mapped over).
> 
> The two major issues for me are:
> 1) As you noted, this approach currently doesn't work on non-swap
> systems, as we don't try to shrink the anonymous page lrus. This is
> a big problem, as it makes it unusable for almost all Android systems.
> You suggest we may want to change how the anonymous lru is aged, and
> I had a patch earlier that tried to change some of the anonymous lru
> aging rules for volatile pages, but it's not quite right for what you
> have here. So I'd be interested in hearing how you think the anonymous
> lru aging should happen without swap.

As I mentioned, anon LRU aging should happen to get the benefit
of VOLATILE. I expect user-space allocators (ex, glibc) or VM heap management
to start it, so the gain would be higher than the loss.
Otherwise, we would have to pick volatile pages and move them onto another LRU
when madvise is called. That would be a big overhead, as you have already measured.
So the design is the same as rmap's, which moves the overhead from a frequent
place (ex, fork, pgfault) to a rare place (ie, reclaim).
I will accept it if there is another good idea which is able to minimise
madvise's overhead.

> 
> 2) Being able to use this with tmpfs files. I'm currently trying to
> better understand the rmap code, looking to see if there's a way to
> have try_to_unmap_file() work similarly to try_to_unmap_anon(), to
> allow users to madvise() on mmapped tmpfs files. This would
> provide a very similar interface to what I've been proposing with
> fadvise/fallocate, but just using process virtual addresses instead
> of (fd, offset) pairs.  The benefit of (fd, offset) pairs for
> Android is that it's easier to manage shared volatile ranges between
> two processes that are sharing data via an mmapped tmpfs file
> (although this actual use case may be fairly rare).  I believe we
> should still be able to rework the ashmem internals to use madvise
> (which would provide legacy support for existing android apps), so
> then it's just a question of whether we could eventually convince
> Android apps to use the madvise interface directly, rather than the
> ashmem unpin ioctl.

I didn't look at ashmem in detail, so let's pass that question to the android
folks. But apparently, it's very important. If we can't, we might
separate madvise(anon)/fallocate(file).

> 
> The other concern with the madvise-on-mmapped-files approach is that
> there's no easy way I can see to limit it to tmpfs files. I know

The first thing I can think of is PG_swapbacked. I guess it would be set
only for swappable data.
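
A rough sketch of that idea in the unmap path (the placement and the
SWAP_FAIL fallback are my assumptions, not from the posted patch):

/*
 * Hypothetical check in the try_to_unmap()-style path: only pages that
 * are swap-backed (anonymous or tmpfs, PG_swapbacked set) may be
 * discarded as volatile; persistent file pages fall back to normal
 * reclaim so on-disk data is never silently purged.
 */
if (vma->vm_flags & VM_VOLATILE) {
	if (!PageSwapBacked(page))
		return SWAP_FAIL;	/* not anon/tmpfs: don't discard */
	/* ... otherwise proceed with the SWAP_DISCARD handling ... */
}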

> some have been interested in having a fallocate(VOLATILE) interface
> for non-tmpfs files, but I'm not sure I see the benefit there yet. I
> have noticed folks mixing the idea of volatile pages being purged
> under memory pressure with the idea of volatile files, which might
> be purged from disk under disk pressure. While I think the second
> idea is interesting, I do think it's completely separate from the
> volatile memory concept.
> 
> Anyway, I'd be interested in your thoughts on these two issues.
> Thanks so much for sending out this patch; it's given me quite a bit
> to chew on, and I too hope we can merge our different approaches
> together.

Thanks, John!

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-11-29  4:18   ` John Stultz
@ 2012-12-04  0:00     ` Minchan Kim
  2012-12-04  0:57       ` John Stultz
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2012-12-04  0:00 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> On 11/21/2012 04:36 PM, John Stultz wrote:
> >2) Being able to use this with tmpfs files. I'm currently trying
> >to better understand the rmap code, looking to see if there's a
> >way to have try_to_unmap_file() work similarly to
> >try_to_unmap_anon(), to allow users to madvise() on mmapped
> >tmpfs files. This would provide a very similar interface to
> >what I've been proposing with fadvise/fallocate, but just using
> >process virtual addresses instead of (fd, offset) pairs.  The
> >benefit of (fd, offset) pairs for Android is that it's easier to
> >manage shared volatile ranges between two processes that are
> >sharing data via an mmapped tmpfs file (although this actual use
> >case may be fairly rare).  I believe we should still be able to
> >rework the ashmem internals to use madvise (which would provide
> >legacy support for existing android apps), so then it's just a
> >question of whether we could eventually convince Android apps to
> >use the madvise interface directly, rather than the ashmem unpin
> >ioctl.
> 
> Hey Minchan,
>     I've been playing around with your patch trying to better
> understand your approach and to extend it to support tmpfs files. In
> doing so I've found a few bugs, and have some rough fixes I wanted
> to share. There are still a few edge cases I need to deal with (the
> vma-purged flag isn't being properly handled through vma merge/split
> operations), but it's starting to come along.

Hmm, my patch doesn't allow a volatile vma to merge with another one,
because it adds VM_VOLATILE to VM_SPECIAL, so I guess merging isn't the problem.
In the split case, __split_vma copies the old vma to the new vma like this

        *new = *vma;

So the problem shouldn't happen, I guess.
Did you see a real problem with that?
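
For reference, the mechanism relied on here is roughly this (a sketch
only; the exact VM_SPECIAL bit set varies by kernel version, and
VM_VOLATILE is from the RFC patch):

/* include/linux/mm.h sketch: vma_merge() refuses to merge a vma whose
 * flags contain any VM_SPECIAL bit, so adding VM_VOLATILE here keeps
 * volatile vmas from being merged with their neighbors. */
#define VM_SPECIAL \
	(VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_VOLATILE)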

> 
> Anyway, take a look at the tree here and let me know what you think.
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> 
> I'm sure much is wrong with the tree, but with it I can now mark
> tmpfs file pages as volatile/nonvolatile and see them purged under
> pressure. Unfortunately it's not limited to tmpfs, so persistent
> files will also work, but the state of the underlying files on purge
> is undefined. Hopefully I can find a way to limit it to
> non-persistent filesystems for now, and if needed find a way to
> extend it to persistent filesystems in a sane way later.

I will take a look.
Thanks.

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-04  0:00     ` Minchan Kim
@ 2012-12-04  0:57       ` John Stultz
  2012-12-04  7:22         ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: John Stultz @ 2012-12-04  0:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 12/03/2012 04:00 PM, Minchan Kim wrote:
> On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
>> On 11/21/2012 04:36 PM, John Stultz wrote:
>>> 2) Being able to use this with tmpfs files. I'm currently trying
>>> to better understand the rmap code, looking to see if there's a
>>> way to have try_to_unmap_file() work similarly to
>>> try_to_unmap_anon(), to allow users to madvise() on mmapped
>>> tmpfs files. This would provide a very similar interface to
>>> what I've been proposing with fadvise/fallocate, but just using
>>> process virtual addresses instead of (fd, offset) pairs.  The
>>> benefit of (fd, offset) pairs for Android is that it's easier to
>>> manage shared volatile ranges between two processes that are
>>> sharing data via an mmapped tmpfs file (although this actual use
>>> case may be fairly rare).  I believe we should still be able to
>>> rework the ashmem internals to use madvise (which would provide
>>> legacy support for existing android apps), so then it's just a
>>> question of whether we could eventually convince Android apps to
>>> use the madvise interface directly, rather than the ashmem unpin
>>> ioctl.
>> Hey Minchan,
>>     I've been playing around with your patch trying to better
>> understand your approach and to extend it to support tmpfs files. In
>> doing so I've found a few bugs, and have some rough fixes I wanted
>> to share. There are still a few edge cases I need to deal with (the
>> vma-purged flag isn't being properly handled through vma merge/split
>> operations), but it's starting to come along.
> Hmm, my patch doesn't allow a volatile vma to merge with another one,
> because it adds VM_VOLATILE to VM_SPECIAL, so I guess merging isn't the problem.
> In the split case, __split_vma copies the old vma to the new vma like this
>
>          *new = *vma;
>
> So the problem shouldn't happen, I guess.
> Did you see a real problem with that?
Yes, depending on the pattern in which MADV_VOLATILE and MADV_NOVOLATILE 
are applied, we can get a result where data is purged, but we aren't 
notified of it.  Also, since madvise returns early if it encounters an 
error, in the case where you have checkerboard volatile regions (say 
every other page is volatile), which you mark non-volatile with one 
large MADV_NOVOLATILE call, the first volatile vma will be marked 
non-volatile, but since it returns purged, the madvise loop will stop 
and the following volatile regions will be left volatile.

The patches in the git tree below, which handle the purged state better, 
seem to work for my tests, as far as resolving any overlapping calls. Of 
course there may yet still be problems I've not found.

>> Anyway, take a look at the tree here and let me know what you think.
>> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

Eager to hear what you think!

Thanks again!
-john


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-04  0:57       ` John Stultz
@ 2012-12-04  7:22         ` Minchan Kim
  2012-12-04 19:13           ` John Stultz
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2012-12-04  7:22 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:
> On 12/03/2012 04:00 PM, Minchan Kim wrote:
> >On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> >>On 11/21/2012 04:36 PM, John Stultz wrote:
> >>>2) Being able to use this with tmpfs files. I'm currently trying
> >>>to better understand the rmap code, looking to see if there's a
> >>>way to have try_to_unmap_file() work similarly to
> >>>try_to_unmap_anon(), to allow allow users to madvise() on mmapped
> >>>tmpfs files. This would provide a very similar interface as to
> >>>what I've been proposing with fadvise/fallocate, but just using
> >>>process virtual addresses instead of (fd, offset) pairs.   The
> >>>benefit with (fd,offset) pairs for Android is that its easier to
> >>>manage shared volatile ranges between two processes that are
> >>>sharing data via an mmapped tmpfs file (although this actual use
> >>>case may be fairly rare).  I believe we should still be able to
> >>>rework the ashmem internals to use madvise (which would provide
> >>>legacy support for existing android apps), so then its just a
> >>>question of if we could then eventually convince Android apps to
> >>>use the madvise interface directly, rather then the ashmem unpin
> >>>ioctl.
> >>Hey Minchan,
> >>     I've been playing around with your patch trying to better
> >>understand your approach and to extend it to support tmpfs files. In
> >>doing so I've found a few bugs, and have some rough fixes I wanted
> >>to share. There's still a few edge cases I need to deal with (the
> >>vma-purged flag isn't being properly handled through vma merge/split
> >>operations), but its starting to come along.
> >Hmm, my patch doesn't allow to merge volatile with another one by
> >inserting VM_VOLATILE into VM_SPECIAL so I guess merge isn't problem.
> >In case of split, __split_vma copy old vma to new vma like this
> >
> >         *new = *vma;
> >
> >So the problem shouldn't happen, I guess.
> >Did you see the real problem about that?
> Yes, depending on the pattern that MADV_VOLATILE and MADV_NOVOLATILE
> is applied, we can get a result where data is purged, but we aren't
> notified of it.  Also, since madvise returns early if it encounters
> an error, in the case where you have checkerboard volatile regions
> (say every other page is volatile), which you mark non-volatile with
> one large MADV_NOVOLATILE call, the first volatile vma will be
> marked non-volatile, but since it returns purged, the madvise loop
> will stop and the following volatile regions will be left volatile.
> 
> The patches in the git tree below which handle the perged state
> better seem to work for my tests, as far as resolving any
> overlapping calls. Of course there may yet still be problems I've
> not found.
> 
> >>Anyway, take a look at the tree here and let me know what you think.
> >>http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> 
> Eager to hear what you think!

Below two patches look good to me.

[rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
[rmap: ClearPageDirty() when returning SWAP_DISCARD]

[madvise: Fix NOVOLATILE bug]
I can't understand the description of the patch.
Could you elaborate on it with an example?

[madvise: Fixup vma->purged handling]
I included VM_VOLATILE in VM_SPECIAL intentionally.
If the comment on VM_SPECIAL is right, merging of volatile vmas shouldn't happen.
So I guess you are seeing another problem. When I looked at my source code today,
the locking scheme/purge handling was totally broken. I will look at it. Maybe the
bug you are seeing is related to that. Part of your patch is needed. It could be
a separate patch. I will merge it.

diff --git a/mm/madvise.c b/mm/madvise.c
index 65af0b5..5469d76 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -139,8 +139,10 @@ success:
        if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
                volatile_lock(vma);
        vma->vm_flags = new_flags;
-       if (behavior == MADV_NOVOLATILE)
+       if (behavior == MADV_NOVOLATILE) {
                error = vma->purged;
+               vma->purged = 0;
+       }
        if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
                volatile_unlock(vma);
 out:

First of all, let me resolve the above issue; then let's talk about tmpfs volatile.
Thanks for the fix, John!

> 
> Thanks again!
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-04  7:22         ` Minchan Kim
@ 2012-12-04 19:13           ` John Stultz
  2012-12-05  4:18             ` Minchan Kim
  2012-12-05  7:01             ` Minchan Kim
  0 siblings, 2 replies; 30+ messages in thread
From: John Stultz @ 2012-12-04 19:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 12/03/2012 11:22 PM, Minchan Kim wrote:
> On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:
>> On 12/03/2012 04:00 PM, Minchan Kim wrote:
>>> On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
>>>> On 11/21/2012 04:36 PM, John Stultz wrote:
>>>>> 2) Being able to use this with tmpfs files. I'm currently trying
>>>>> to better understand the rmap code, looking to see if there's a
>>>>> way to have try_to_unmap_file() work similarly to
>>>>> try_to_unmap_anon(), to allow users to madvise() on mmapped
>>>>> tmpfs files. This would provide a very similar interface to
>>>>> what I've been proposing with fadvise/fallocate, but just using
>>>>> process virtual addresses instead of (fd, offset) pairs.  The
>>>>> benefit of (fd, offset) pairs for Android is that it's easier to
>>>>> manage shared volatile ranges between two processes that are
>>>>> sharing data via an mmapped tmpfs file (although this actual use
>>>>> case may be fairly rare).  I believe we should still be able to
>>>>> rework the ashmem internals to use madvise (which would provide
>>>>> legacy support for existing android apps), so then it's just a
>>>>> question of whether we could eventually convince Android apps to
>>>>> use the madvise interface directly, rather than the ashmem unpin
>>>>> ioctl.
>>>> Hey Minchan,
>>>>     I've been playing around with your patch trying to better
>>>> understand your approach and to extend it to support tmpfs files. In
>>>> doing so I've found a few bugs, and have some rough fixes I wanted
>>>> to share. There are still a few edge cases I need to deal with (the
>>>> vma-purged flag isn't being properly handled through vma merge/split
>>>> operations), but it's starting to come along.
>>> Hmm, my patch doesn't allow a volatile vma to merge with another one,
>>> because it adds VM_VOLATILE to VM_SPECIAL, so I guess merging isn't the problem.
>>> In the split case, __split_vma copies the old vma to the new vma like this
>>>
>>>          *new = *vma;
>>>
>>> So the problem shouldn't happen, I guess.
>>> Did you see a real problem with that?
>> Yes, depending on the pattern in which MADV_VOLATILE and MADV_NOVOLATILE
>> are applied, we can get a result where data is purged, but we aren't
>> notified of it.  Also, since madvise returns early if it encounters
>> an error, in the case where you have checkerboard volatile regions
>> (say every other page is volatile), which you mark non-volatile with
>> one large MADV_NOVOLATILE call, the first volatile vma will be
>> marked non-volatile, but since it returns purged, the madvise loop
>> will stop and the following volatile regions will be left volatile.
>>
>> The patches in the git tree below, which handle the purged state
>> better, seem to work for my tests, as far as resolving any
>> overlapping calls. Of course there may yet still be problems I've
>> not found.
>>
>>>> Anyway, take a look at the tree here and let me know what you think.
>>>> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
>> Eager to hear what you think!
> Below two patches look good to me.
>
> [rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
> [rmap: ClearPageDirty() when returning SWAP_DISCARD]
>
> [madvise: Fix NOVOLATILE bug]
> I can't understand the description of the patch.
> Could you elaborate on it with an example?
The case I ran into here is if you have a range where you mark every 
other page as volatile, then mark all the pages in that range as 
non-volatile in one madvise call.

sys_madvise() will then find the first vma in the range and call 
madvise_vma(), which marks the first vma non-volatile and returns the 
purged state.  If the page has been purged, the sys_madvise code will note 
that as an error, and break out of the vma iteration loop, leaving the 
following vmas in the range volatile.
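
For concreteness, a minimal reproducer of that pattern might look like
this (assuming the MADV_VOLATILE/MADV_NOVOLATILE constants from the RFC
patch are available; the page count is arbitrary):

#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 8	/* arbitrary region size for illustration */

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int i;

	if (buf == MAP_FAILED)
		return 1;

	/* Mark every other page volatile: the vmas end up checkerboarded. */
	for (i = 0; i < NPAGES; i += 2)
		madvise(buf + i * pagesz, pagesz, MADV_VOLATILE);

	/* ... memory pressure may purge some volatile pages here ... */

	/*
	 * One large NOVOLATILE call over the whole range: if the first
	 * volatile vma was purged, madvise_vma() returns the purged state,
	 * sys_madvise() treats it as an error and stops iterating, and the
	 * later vmas in the range are incorrectly left volatile.
	 */
	return madvise(buf, NPAGES * pagesz, MADV_NOVOLATILE);
}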

> [madvise: Fixup vma->purged handling]
> I included VM_VOLATILE in VM_SPECIAL intentionally.
> If the comment on VM_SPECIAL is right, merging of volatile vmas shouldn't happen.
> So I guess you are seeing another problem. When I looked at my source code today,
> the locking scheme/purge handling was totally broken. I will look at it. Maybe the
> bug you are seeing is related to that. Part of your patch is needed. It could be
> a separate patch. I will merge it.
I don't think the problem is when vmas being marked VM_VOLATILE are 
being merged; it's that when we mark the vma as *non-volatile* and 
remove the VM_VOLATILE flag, we merge the now non-volatile vmas with 
neighboring vmas. So preserving the purged flag during that merge is 
important. Again, the example I used to trigger this was an alternating 
pattern of volatile and non-volatile vmas, then marking the entire range 
non-volatile (though sometimes in two overlapping passes).

thanks
-john


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-04 19:13           ` John Stultz
@ 2012-12-05  4:18             ` Minchan Kim
  2012-12-08  0:49               ` John Stultz
  2012-12-05  7:01             ` Minchan Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2012-12-05  4:18 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> On 12/03/2012 11:22 PM, Minchan Kim wrote:
> >On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:
> >>On 12/03/2012 04:00 PM, Minchan Kim wrote:
> >>>On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> >>>>On 11/21/2012 04:36 PM, John Stultz wrote:
> >>>>>2) Being able to use this with tmpfs files. I'm currently trying
> >>>>>to better understand the rmap code, looking to see if there's a
> >>>>>way to have try_to_unmap_file() work similarly to
> >>>>>try_to_unmap_anon(), to allow users to madvise() on mmapped
> >>>>>tmpfs files. This would provide a very similar interface to
> >>>>>what I've been proposing with fadvise/fallocate, but just using
> >>>>>process virtual addresses instead of (fd, offset) pairs.  The
> >>>>>benefit of (fd, offset) pairs for Android is that it's easier to
> >>>>>manage shared volatile ranges between two processes that are
> >>>>>sharing data via an mmapped tmpfs file (although this actual use
> >>>>>case may be fairly rare).  I believe we should still be able to
> >>>>>rework the ashmem internals to use madvise (which would provide
> >>>>>legacy support for existing android apps), so then it's just a
> >>>>>question of whether we could eventually convince Android apps to
> >>>>>use the madvise interface directly, rather than the ashmem unpin
> >>>>>ioctl.
> >>>>Hey Minchan,
> >>>>    I've been playing around with your patch trying to better
> >>>>understand your approach and to extend it to support tmpfs files. In
> >>>>doing so I've found a few bugs, and have some rough fixes I wanted
> >>>>to share. There are still a few edge cases I need to deal with (the
> >>>>vma-purged flag isn't being properly handled through vma merge/split
> >>>>operations), but it's starting to come along.
> >>>Hmm, my patch doesn't allow a volatile vma to merge with another one,
> >>>because it adds VM_VOLATILE to VM_SPECIAL, so I guess merging isn't the problem.
> >>>In the split case, __split_vma copies the old vma to the new vma like this
> >>>
> >>>         *new = *vma;
> >>>
> >>>So the problem shouldn't happen, I guess.
> >>>Did you see a real problem with that?
> >>Yes, depending on the pattern in which MADV_VOLATILE and MADV_NOVOLATILE
> >>are applied, we can get a result where data is purged, but we aren't
> >>notified of it.  Also, since madvise returns early if it encounters
> >>an error, in the case where you have checkerboard volatile regions
> >>(say every other page is volatile), which you mark non-volatile with
> >>one large MADV_NOVOLATILE call, the first volatile vma will be
> >>marked non-volatile, but since it returns purged, the madvise loop
> >>will stop and the following volatile regions will be left volatile.
> >>
> >>The patches in the git tree below, which handle the purged state
> >>better, seem to work for my tests, as far as resolving any
> >>overlapping calls. Of course there may yet still be problems I've
> >>not found.
> >>
> >>>>Anyway, take a look at the tree here and let me know what you think.
> >>>>http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> >>Eager to hear what you think!
> >Below two patches look good to me.
> >
> >[rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
> >[rmap: ClearPageDirty() when returning SWAP_DISCARD]
> >
> >[madvise: Fix NOVOLATILE bug]
> >I can't understand the description of the patch.
> >Could you elaborate on it with an example?
> The case I ran into here is if you have a range where you mark every
> other page as volatile, then mark all the pages in that range as
> non-volatile in one madvise call.
> 
> sys_madvise() will then find the first vma in the range and call
> madvise_vma(), which marks the first vma non-volatile and returns the
> purged state.  If the page has been purged, the sys_madvise code will
> note that as an error, and break out of the vma iteration loop,
> leaving the following vmas in the range volatile.
> 
> >[madvise: Fixup vma->purged handling]
> >I included VM_VOLATILE in VM_SPECIAL intentionally.
> >If the comment on VM_SPECIAL is right, merging of volatile vmas shouldn't happen.
> >So I guess you are seeing another problem. When I looked at my source code today,
> >the locking scheme/purge handling was totally broken. I will look at it. Maybe the
> >bug you are seeing is related to that. Part of your patch is needed. It could be
> >a separate patch. I will merge it.
> I don't think the problem is when vmas being marked VM_VOLATILE are
> being merged; it's that when we mark the vma as *non-volatile* and
> remove the VM_VOLATILE flag, we merge the now non-volatile vmas with
> neighboring vmas. So preserving the purged flag during that merge is
> important. Again, the example I used to trigger this was an
> alternating pattern of volatile and non-volatile vmas, then marking
> the entire range non-volatile (though sometimes in two overlapping
> passes).

Understood. Thanks.
Does the patch below solve your problems? It's simpler than yours.
Anyway, neither yours nor mine is the right fix.
As I mentioned, the locking scheme is broken.
We need anon_vma_lock to handle purged, and we should consider the fork
case, too.

diff --git a/mm/madvise.c b/mm/madvise.c
index 965a53d..5fa3254 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -41,7 +41,8 @@ static int madvise_need_mmap_write(int behavior)
  */
 static long madvise_behavior(struct vm_area_struct * vma,
 		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end, int behavior)
+		     unsigned long start, unsigned long end,
+		     int behavior, bool *purged)
 {
 	struct mm_struct * mm = vma->vm_mm;
 	int error = 0;
@@ -151,7 +152,7 @@ success:
 		volatile_lock(vma);
 	vma->vm_flags = new_flags;
 	if (behavior == MADV_NOVOLATILE) {
-		error = vma->purged;
+		*purged |= vma->purged;
 		vma->purged = false;
 	}
 	if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
@@ -309,7 +310,7 @@ static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
 
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		unsigned long start, unsigned long end, int behavior)
+	unsigned long start, unsigned long end, int behavior, bool *purged)
 {
 	switch (behavior) {
 	case MADV_REMOVE:
@@ -319,7 +320,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
-		return madvise_behavior(vma, prev, start, end, behavior);
+		return madvise_behavior(vma, prev, start, end, behavior, purged);
 	}
 }
 
@@ -405,6 +406,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	int error = -EINVAL;
 	int write;
 	size_t len;
+	bool purged = false;
 
 #ifdef CONFIG_MEMORY_FAILURE
 	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
@@ -468,7 +470,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
+		error = madvise_vma(vma, &prev, start, tmp, behavior, &purged);
 		if (error)
 			goto out;
 		start = tmp;
@@ -488,5 +490,7 @@ out:
 	else
 		up_read(&current->mm->mmap_sem);
 
+	if (!error && purged)
+		error = 1;
 	return error;
 }
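
For what it's worth, my reading of the calling convention after this
change (a sketch of the userspace side, not from the thread):

int ret = madvise(chunk, len, MADV_NOVOLATILE);
if (ret < 0) {
	/* hard failure: the range may be left partially volatile */
} else if (ret == 1) {
	/* success, but something in the range was purged: treat the
	 * memory as lost and regenerate or zero-fill it */
} else {
	/* success, contents intact */
}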
> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-04 19:13           ` John Stultz
  2012-12-05  4:18             ` Minchan Kim
@ 2012-12-05  7:01             ` Minchan Kim
  2012-12-08  0:20               ` John Stultz
  1 sibling, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2012-12-05  7:01 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

Hi John,

On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> On 12/03/2012 11:22 PM, Minchan Kim wrote:
> >On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:
> >>On 12/03/2012 04:00 PM, Minchan Kim wrote:
> >>>On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> >>>>On 11/21/2012 04:36 PM, John Stultz wrote:
> >>>>>2) Being able to use this with tmpfs files. I'm currently trying
> >>>>>to better understand the rmap code, looking to see if there's a
> >>>>>way to have try_to_unmap_file() work similarly to
> >>>>>try_to_unmap_anon(), to allow users to madvise() on mmapped
> >>>>>tmpfs files. This would provide a very similar interface to
> >>>>>what I've been proposing with fadvise/fallocate, but just using
> >>>>>process virtual addresses instead of (fd, offset) pairs.  The
> >>>>>benefit of (fd, offset) pairs for Android is that it's easier to
> >>>>>manage shared volatile ranges between two processes that are
> >>>>>sharing data via an mmapped tmpfs file (although this actual use
> >>>>>case may be fairly rare).  I believe we should still be able to
> >>>>>rework the ashmem internals to use madvise (which would provide
> >>>>>legacy support for existing android apps), so then it's just a
> >>>>>question of whether we could eventually convince Android apps to
> >>>>>use the madvise interface directly, rather than the ashmem unpin
> >>>>>ioctl.
> >>>>Hey Minchan,
> >>>>    I've been playing around with your patch trying to better
> >>>>understand your approach and to extend it to support tmpfs files. In
> >>>>doing so I've found a few bugs, and have some rough fixes I wanted
> >>>>to share. There are still a few edge cases I need to deal with (the
> >>>>vma-purged flag isn't being properly handled through vma merge/split
> >>>>operations), but it's starting to come along.
> >>>Hmm, my patch doesn't allow a volatile vma to merge with another one,
> >>>because it adds VM_VOLATILE to VM_SPECIAL, so I guess merging isn't the problem.
> >>>In the split case, __split_vma copies the old vma to the new vma like this
> >>>
> >>>         *new = *vma;
> >>>
> >>>So the problem shouldn't happen, I guess.
> >>>Did you see a real problem with that?
> >>Yes, depending on the pattern in which MADV_VOLATILE and MADV_NOVOLATILE
> >>are applied, we can get a result where data is purged, but we aren't
> >>notified of it.  Also, since madvise returns early if it encounters
> >>an error, in the case where you have checkerboard volatile regions
> >>(say every other page is volatile), which you mark non-volatile with
> >>one large MADV_NOVOLATILE call, the first volatile vma will be
> >>marked non-volatile, but since it returns purged, the madvise loop
> >>will stop and the following volatile regions will be left volatile.
> >>
> >>The patches in the git tree below, which handle the purged state
> >>better, seem to work for my tests, as far as resolving any
> >>overlapping calls. Of course there may yet still be problems I've
> >>not found.
> >>
> >>>>Anyway, take a look at the tree here and let me know what you think.
> >>>>http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> >>Eager to hear what you think!
> >Below two patches look good to me.
> >
> >[rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
> >[rmap: ClearPageDirty() when returning SWAP_DISCARD]
> >
> >[madvise: Fix NOVOLATILE bug]
> >I can't understand the description of the patch.
> >Could you elaborate on it with an example?
> The case I ran into here is if you have a range where you mark every
> other page as volatile, then mark all the pages in that range as
> non-volatile in one madvise call.
> 
> sys_madvise() will then find the first vma in the range and call
> madvise_vma(), which marks the first vma non-volatile and returns the
> purged state.  If the page has been purged, the sys_madvise code will
> note that as an error, and break out of the vma iteration loop,
> leaving the following vmas in the range volatile.
> 
> >[madvise: Fixup vma->purged handling]
> >I included VM_VOLATILE in VM_SPECIAL intentionally.
> >If the comment on VM_SPECIAL is right, merging of volatile vmas shouldn't happen.
> >So I guess you are seeing another problem. When I looked at my source code today,
> >the locking scheme/purge handling was totally broken. I will look at it. Maybe the
> >bug you are seeing is related to that. Part of your patch is needed. It could be
> >a separate patch. I will merge it.
> I don't think the problem is when vmas being marked VM_VOLATILE are
> being merged; it's that when we mark the vma as *non-volatile* and
> remove the VM_VOLATILE flag, we merge the now non-volatile vmas with
> neighboring vmas. So preserving the purged flag during that merge is
> important. Again, the example I used to trigger this was an
> alternating pattern of volatile and non-volatile vmas, then marking
> the entire range non-volatile (though sometimes in two overlapping
> passes).

If I understand correctly, you mean the following.

chunk1 = mmap(8M);
chunk2 = chunk1 + 2M;
chunk3 = chunk2 + 2M;
chunk4 = chunk3 + 2M;

madvise(chunk1, 2M, VOLATILE);
madvise(chunk4, 2M, VOLATILE);

/*
 * V : volatile vma
 * N : non volatile vma
 * So now the vmas are VNVN.
 */
And chunk4 is purged.

int ret = madvise(chunk1, 8M, NOVOLATILE);
ASSERT(ret == 1);
/* And you expect VNVN->N ?*/

Right?
If so, why should the non-volatile semantics allow a range that crosses
over non-volatile areas? I would like to fail such a case, because
MADV_REMOVE likewise fails in the middle of the operation if it
encounters VM_LOCKED.

What do you think about it?

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-05  7:01             ` Minchan Kim
@ 2012-12-08  0:20               ` John Stultz
  2012-12-11  4:34                 ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: John Stultz @ 2012-12-08  0:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 12/04/2012 11:01 PM, Minchan Kim wrote:
> Hi John,
>
> On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
>>
>> I don't think the problem is when vmas being marked VM_VOLATILE are
>> being merged; it's that when we mark the vma as *non-volatile* and
>> remove the VM_VOLATILE flag, we merge the now non-volatile vmas with
>> neighboring vmas. So preserving the purged flag during that merge is
>> important. Again, the example I used to trigger this was an
>> alternating pattern of volatile and non-volatile vmas, then marking
>> the entire range non-volatile (though sometimes in two overlapping
>> passes).
> If I understand correctly, you mean the following.
>
> chunk1 = mmap(8M);
> chunk2 = chunk1 + 2M;
> chunk3 = chunk2 + 2M;
> chunk4 = chunk3 + 2M;
>
> madvise(chunk1, 2M, VOLATILE);
> madvise(chunk4, 2M, VOLATILE);
>
> /*
>   * V : volatile vma
>   * N : non volatile vma
>   * So now the vmas are VNVN.
>   */
> And chunk4 is purged.
>
> int ret = madvise(chunk1, 8M, NOVOLATILE);
> ASSERT(ret == 1);
> /* And you expect VNVN->N ?*/
>
> Right?

Yes. That's exactly right.

> If so, why should the non-volatile semantics allow a range that crosses
> over non-volatile areas? I would like to fail such a case, because
> MADV_REMOVE likewise fails in the middle of the operation if it
> encounters VM_LOCKED.
>
> What do you think about it?
Right, so I think this issue may be a problematic part of the VMA-based 
approach.  While marking an area as nonvolatile twice might not 
make a lot of sense, I think userland applications would not appreciate 
the constraint that madvise(VOLATILE/NONVOLATILE) calls be made in 
perfect pairs of identical sizes.

For instance, if a browser has rendered a web page, but the page is so 
large that only a sliding window/view of that page is visible at one 
time, it may want to mark the regions not currently in the view as 
volatile.  So it would be nice (albeit naive) for that application if, 
when the view location changed, it could just mark the new region as 
non-volatile, and any region not in the current view as volatile.  This 
would be easier than trying to calculate the diff of the old view region 
boundaries vs the new and modifying only the ranges that changed. 
Granted, doing so might be more efficient, but I'm not sure we can be 
sure every similar case would be more efficient.

So in my mind, double-clearing a flag should be allowed (as well as 
double-setting), as well as allowing for setting/clearing overlapping 
regions.
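
A sketch of that naive sliding-window pattern (the buffer layout and
window arithmetic are made up for illustration, and the MADV_* constants
are from the RFC patch):

#include <sys/mman.h>

/*
 * On every scroll, pin the visible window and mark everything else
 * volatile, without diffing against the previous window position.
 * Overlapping and repeated VOLATILE/NOVOLATILE calls must be legal
 * for this to work.
 */
static void update_view(char *page_buf, size_t buf_len,
			size_t view_off, size_t view_len)
{
	/* Pin the visible window (it may already be non-volatile). */
	madvise(page_buf + view_off, view_len, MADV_NOVOLATILE);

	/* Everything outside the window becomes purgeable. */
	if (view_off > 0)
		madvise(page_buf, view_off, MADV_VOLATILE);
	if (view_off + view_len < buf_len)
		madvise(page_buf + view_off + view_len,
			buf_len - (view_off + view_len), MADV_VOLATILE);
}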

Aside from whether the behavior should be allowed or not, the error mode 
of madvise is problematic as well, since failures can happen midway 
through the operation, leaving the vmas in the specified range 
inconsistent. Since it's usually only advisory, such inconsistent states 
aren't really problematic, and repeating the last action is probably fine.

The problem with NOVOLATILE's purged state, with vmas, is that if we 
hit an error midway through, it's hard to figure out what the state of 
the pages is for the range specified. Some of them could have been 
purged and set to non-volatile, while some may not be purged, and still 
left volatile. You can't just repeat the last action and get a sane 
result (as we lose the purged flag state).

With my earlier fallocate implementations, I tried to avoid this by 
making any memory allocations that might be required before making any 
state changes, so there wasn't a chance of a partial failure from 
-ENOMEM.  (It was also simpler because in my own range management code 
there were only volatile ranges; non-volatility was simply the absence 
of a volatile range. With vmas we have to manage both volatile and 
nonvolatile vmas.)  I'm not sure how this could be done with the vma 
method other than by maybe reworking the merge/split logic, but I'm wary 
of mucking with that too much as I know it's performance sensitive.

Your thoughts?  Am I just being too set in my way of thinking here?

thanks
-john


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-05  4:18             ` Minchan Kim
@ 2012-12-08  0:49               ` John Stultz
  2012-12-11  4:40                 ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: John Stultz @ 2012-12-08  0:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 12/04/2012 08:18 PM, Minchan Kim wrote:
> On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
>> I don't think the problem is when vmas being marked VM_VOLATILE are
>> being merged; it's that when we mark the vma as *non-volatile* and
>> remove the VM_VOLATILE flag, we merge the now non-volatile vmas with
>> neighboring vmas. So preserving the purged flag during that merge is
>> important. Again, the example I used to trigger this was an
>> alternating pattern of volatile and non-volatile vmas, then marking
>> the entire range non-volatile (though sometimes in two overlapping
>> passes).
> Understood. Thanks.
> Does the patch below solve your problems? It's simpler than yours.

Yea, this is nicer than my fix.
Although I still need the purged handling in the vma merge code to 
see the behavior I expect in my tests.

I've integrated your patch and repushed my queue here:
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

git://git.linaro.org/people/jstultz/android-dev.git dev/minchan-anonvol

> Anyway, neither yours nor mine is the right fix.
> As I mentioned, the locking scheme is broken.
> We need anon_vma_lock to handle purged, and we should consider the fork
> case, too.
Hrm. I'm sure you're right, as I've not yet fully grasped all the 
locking rules here.  Could you clarify how it is broken? And why is the 
anon_vma_lock needed to manage the purged state that is part of the vma 
itself?

thanks
-john


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-08  0:20               ` John Stultz
@ 2012-12-11  4:34                 ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2012-12-11  4:34 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Fri, Dec 07, 2012 at 04:20:30PM -0800, John Stultz wrote:
> On 12/04/2012 11:01 PM, Minchan Kim wrote:
> >Hi John,
> >
> >On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> >>
> >>I don't think the problem is when vmas being marked VM_VOLATILE are
> >>being merged; it's that when we mark the vma as *non-volatile* and
> >>remove the VM_VOLATILE flag, we merge the now non-volatile vmas with
> >>neighboring vmas. So preserving the purged flag during that merge is
> >>important. Again, the example I used to trigger this was an
> >>alternating pattern of volatile and non-volatile vmas, then marking
> >>the entire range non-volatile (though sometimes in two overlapping
> >>passes).
> >If I understand correctly, you mean following as.
> >
> >chunk1 = mmap(8M)
> >chunk2 = chunk1 + 2M;
> >chunk3 = chunk2 + 2M
> >chunk4 = chunk3 + 2M
> >
> >madvise(chunk1, 2M, VOLATILE);
> >madvise(chunk4, 2M, VOLATILE);
> >
> >/*
> >  * V : volatile vma
> >  * N : non volatile vma
> >  * So now the vma layout is VNVN.
> >  */
> >And chunk4 is purged.
> >
> >int ret = madvise(chunk1, 8M, NOVOLATILE);
> >ASSERT(ret == 1);
> >/* And you expect VNVN -> N? */
> >
> >Right?
> 
> Yes. That's exactly right.
> 
> >If so, why should the non-volatile operation's semantics allow a range
> >that crosses over non-volatile areas? I would like to fail such a case,
> >because MADV_REMOVE likewise fails in the middle of the operation if it
> >encounters VM_LOCKED.
> >
> >What do you think about it?
> Right, so I think this issue may be a problematic part of the
> VMA-based approach.  While marking an area as nonvolatile twice might
> not make a lot of sense, I think userland applications would not
> appreciate the constraint that madvise(VOLATILE/NONVOLATILE) calls
> be made in perfect pairs of identical sizes.
> 
> For instance, if a browser has rendered a web page, but the page is
> so large that only a sliding window/view of that page is visible at
> one time, it may want to mark the regions not currently in the view
> as volatile.   So it would be nice (albeit naive) for that
> application that when the view location changed, it would just mark
> the new region as non-volatile, and any region not in the current
> view as volatile.  This would be easier then trying to calculate the
> diff of the old view region boundaries vs the new and modifying only
> the ranges that changed. Granted, doing so might be more efficient,
> but I'm not sure we can be sure every similar case would be more
> efficient.
> 
> So in my mind, double-clearing a flag should be allowed (as well as
> double-setting), as well as allowing for setting/clearing
> overlapping regions.

It might, and as you said, that doesn't match normal madvise operation
semantics. So if users really want it, we might need another interface,
such as a new system call along the lines of mlock.

Although we could implement it, my concern is mmap_sem hold time.
For the VMA approach, we need the exclusive mmap_sem, which ends up
preventing concurrent page fault handling, so it would undermine anon
volatile's goal for user-space allocators. So I would like to avoid
doing more work under the exclusive mmap_sem as far as possible.

Of course, you could argue that if we don't support such semantics,
users have to call madvise(NOVOLATILE) several times over several
ranges, which could be worse. Right. But I suggest the plumbers
implement the range management smartly and we keep the kernel
implementation simple and fast.
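
A rough sketch, with made-up names, of that user-side bookkeeping: the
allocator remembers exactly which spans it marked volatile and clears
each one back individually, so the kernel never sees a NOVOLATILE call
that crosses a non-volatile vma:

#include <stdlib.h>
#include <sys/mman.h>

#define MADV_VOLATILE   19	/* placeholder value from the RFC */
#define MADV_NOVOLATILE 20	/* placeholder value from the RFC */

struct vrange {
	void *addr;
	size_t len;
	struct vrange *next;
};

static struct vrange *volatile_list;

static void chunk_free(void *addr, size_t len)
{
	struct vrange *r = malloc(sizeof(*r));

	if (!r)
		return;
	r->addr = addr;
	r->len = len;
	r->next = volatile_list;
	volatile_list = r;
	madvise(addr, len, MADV_VOLATILE);	/* cheap: just flags the vma */
}

static void chunk_reuse(struct vrange *r)
{
	/* exact pair of the earlier VOLATILE call on the same span */
	madvise(r->addr, r->len, MADV_NOVOLATILE);
}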

I'm not set on this. If users really want such semantics, I can support
it with a new system call.

Frankly speaking, I would like to remove the madvise(NOVOLATILE) call.
If you already saw the patch I sent this morning, you can guess what it
is. The problem with anon volatile plus madvise(NOVOLATILE) is the time
delay between the allocator handing out a free chunk and the user
actually accessing the memory. Normally, when the allocator returns a
free chunk to its caller, the allocator should call madvise(NOVOLATILE),
but the user might not touch the memory until long after. During that
window the pages could be swapped out. So I decided to remove
madvise(NOVOLATILE) and handle the transition at the first page fault.
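
In allocator terms, the window looks like this (a sketch only;
pop_free_chunk stands in for the allocator's own free-list lookup and
MADV_NOVOLATILE is a placeholder for the RFC's advice value):

#include <sys/mman.h>

#define MADV_NOVOLATILE 20	/* placeholder value from the RFC */

extern void *pop_free_chunk(size_t len);

static void *chunk_alloc(size_t len)
{
	void *p = pop_free_chunk(len);	/* still marked VOLATILE */

	/*
	 * t0: after this call the pages are ordinary anon pages again.
	 * The caller may not touch p until long after t0; in that window
	 * the pages can be swapped out, paying swap I/O for data the
	 * caller will overwrite anyway.  Clearing volatility at the
	 * first fault instead closes this window.
	 */
	madvise(p, len, MADV_NOVOLATILE);
	return p;
}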

Yes. The same rule can't be applied to tmpfs volatile, though, which
does need the NOVOLATILE semantics. Hmm, I am leaning toward a new
system call.

int mvolatile(const void *addr, size_t len, int mode);
int munvolatile(const void *addr, size_t len);

If someone calls mvolatile with AUTO mode, it would work like my anon
volatile, while in MANUAL mode the user must call munvolatile before
reusing the range. That might meet both your goal and mine. But adding
a new system call is a last resort. :)
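
To make the two modes concrete, usage might look like the sketch below.
Everything in it is hypothetical: the mode constants, the convention
that munvolatile returns 1 when the range was purged, and the helper
names:

#include <stddef.h>

#define MVOLATILE_AUTO   0	/* hypothetical */
#define MVOLATILE_MANUAL 1	/* hypothetical */

extern int mvolatile(const void *addr, size_t len, int mode);
extern int munvolatile(const void *addr, size_t len);
extern void regenerate(char *cache, size_t clen);

static void demo(char *chunk, size_t len, char *cache, size_t clen)
{
	/* AUTO: the first fault clears volatility; suits free()/malloc(). */
	mvolatile(chunk, len, MVOLATILE_AUTO);

	/* MANUAL: the caller pairs the calls explicitly; suits the
	 * tmpfs-style users that really need NOVOLATILE semantics. */
	mvolatile(cache, clen, MVOLATILE_MANUAL);
	/* ... memory pressure may purge the range here ... */
	if (munvolatile(cache, clen) == 1)
		regenerate(cache, clen);	/* contents were purged */
}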

> 
> Aside from whether the behavior should be allowed or not, the error
> mode of madvise is problematic as well, since failures can happen
> midway through the operation, leaving the vmas in the specified range
> inconsistent. Since it's usually only advisory, such inconsistent
> states aren't really problematic, and repeating the last action is
> probably fine.

True.

> 
> The problem with NOVOLATILE's purged state, with vmas, is that if
> we hit an error midway through, it's hard to figure out what the
> state of the pages are for the range specified. Some of them could
> have been purged and set to non-volatile, while some may not be
> purged, and still left volatile. You can't just repeat the last
> action and get a sane result (as we lose the purged flag state).

Agreed.

> 
> With my earlier fallocate implementations, I tried to avoid this by
> making any memory allocations that might be required before making
> any state changes, so there wasn't a chance for a partial failure
> from -ENOMEM.  (It was also simpler because in my own range
> management code there were only volatile ranges,  non-volatility was
> simply the absence of a volatile range. With vmas we have to manage
> both volatile and nonvolatile vmas).  I'm not sure how this could be
> done with the vma method other than by maybe reworking the
> merge/split logic, but I'm wary of mucking with that too much as I
> know it's performance-sensitive.
> 
> Your thoughts?  Am I just being too set in my way of thinking here?

Your claim makes sense. Two options:

1) Go separate ways with each interface (madvise vs. fadvise or fallocate).
2) A new system call to unify them.

Hmm, I would like to wait for more input from the user-space allocator
folks, because they might have requirements similar to yours.

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] Support volatile range for anon vma
  2012-12-08  0:49               ` John Stultz
@ 2012-12-11  4:40                 ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2012-12-11  4:40 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Fri, Dec 07, 2012 at 04:49:56PM -0800, John Stultz wrote:
> On 12/04/2012 08:18 PM, Minchan Kim wrote:
> >On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> >>I don't think the problem is when vmas being marked VM_VOLATILE are
> >>being merged, it's that when we mark the vma as *non-volatile*, and
> >>remove the VM_VOLATILE flag we merge the non-volatile vmas with
> >>neighboring vmas. So preserving the purged flag during that merge is
> >>important. Again, the example I used to trigger this was an
> >>alternating pattern of volatile and non volatile vmas, then marking
> >>the entire range non-volatile (though sometimes in two overlapping
> >>passes).
> >Understood. Thanks.
> >Does the patch below solve your problems? It's simpler than yours.
> 
> Yeah, this is nicer than my fix.
> Although I still need the purged handling in the vma merge code for
> me to see the behavior I expect in my tests.
> 
> I've integrated your patch and repushed my queue here:
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> 
> git://git.linaro.org/people/jstultz/android-dev.git dev/minchan-anonvol
> 
> >Anyway, neither yours nor mine is the right fix.
> >As I mentioned, the locking scheme is broken.
> >We need anon_vma_lock to handle purged, and we should consider the fork
> >case, too.
> Hrm. I'm sure you're right, as I've not yet fully grasped all the
> locking rules here.  Could you clarify how it is broken? And why is
> the anon_vma_lock needed to manage the purged state that is part of
> the vma itself?

If you don't hold the anon_vma lock, merge/split/fork can race with
try_to_unmap, so vma->purged and vma->vm_flags would lose consistency.
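
A rough sketch of the kind of serialization I mean; vma->purged is the
field from the RFC patch and the lock helpers follow the current
anon_vma API, so read this as pseudocode rather than a patch:

/* Reclaim side: record that a volatile page was discarded. */
static void mark_purged(struct vm_area_struct *vma)
{
	anon_vma_lock(vma->anon_vma);	/* excludes merge/split/fork, */
	vma->purged = true;		/* which must take the same   */
	anon_vma_unlock(vma->anon_vma);	/* lock before rewriting vmas */
}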

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-12-11  4:40 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-30  1:29 [RFC v2] Support volatile range for anon vma Minchan Kim
2012-10-31 21:35 ` Andrew Morton
2012-10-31 21:59   ` Paul Turner
2012-10-31 22:56     ` KOSAKI Motohiro
2012-11-01  1:15       ` Paul Turner
2012-11-01  1:46         ` Minchan Kim
2012-11-01  1:25       ` Minchan Kim
2012-11-01  2:01         ` KOSAKI Motohiro
2012-11-05 23:54       ` Arun Sharma
2012-11-06  1:49         ` Minchan Kim
2012-11-06  2:03           ` Arun Sharma
2012-11-01  0:50     ` Minchan Kim
2012-11-01  1:22       ` Paul Turner
2012-11-01  1:33         ` Minchan Kim
2012-11-01  0:21   ` Minchan Kim
2012-11-02  1:43 ` Bob Liu
2012-11-02  2:37   ` Minchan Kim
2012-11-22  0:36 ` John Stultz
2012-11-29  4:18   ` John Stultz
2012-12-04  0:00     ` Minchan Kim
2012-12-04  0:57       ` John Stultz
2012-12-04  7:22         ` Minchan Kim
2012-12-04 19:13           ` John Stultz
2012-12-05  4:18             ` Minchan Kim
2012-12-08  0:49               ` John Stultz
2012-12-11  4:40                 ` Minchan Kim
2012-12-05  7:01             ` Minchan Kim
2012-12-08  0:20               ` John Stultz
2012-12-11  4:34                 ` Minchan Kim
2012-12-03 23:50   ` Minchan Kim
