* [rfc patch 1/3] mm: always pass mapping in zap_details
@ 2009-09-30 21:09 Johannes Weiner
  2009-09-30 21:09 ` [rfc patch 2/3] mm: serialize truncation unmap against try_to_unmap() Johannes Weiner
  2009-09-30 21:09 ` [rfc patch 3/3] mm: munlock COW pages on truncation unmap Johannes Weiner
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Weiner @ 2009-09-30 21:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Hugh Dickins, Mel Gorman,
	Lee Schermerhorn, Peter Zijlstra

Currently, unmap_mapping_range() fills in the address space information
in zap_details only when it wants to filter out private COW pages, in
which case the zapping loops compare page->mapping against it.

For unmapping private COW pages on truncation, we will need this
information in the opposite case as well, to filter out shared pages.

Demultiplex this one check_mapping member into an explicit mode flag
and always pass along the address space.
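
For illustration only (not part of the patch): a sketch of how the two
existing modes of unmap_mapping_range() translate into the new fields,
based on the hunks below.

	/* invalidate only: unmap shared pages, keep private COW copies */
	unmap_mapping_range(mapping, holebegin, holelen, 0);
	/* -> details.mapping = mapping, details.keep_private = true */

	/* truncate: zap everything, including private COW copies */
	unmap_mapping_range(mapping, holebegin, holelen, 1);
	/* -> details.mapping = mapping, details.keep_private = false */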

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm.h |    3 ++-
 mm/memory.c        |   11 ++++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -732,7 +732,8 @@ extern void user_shm_unlock(size_t, stru
  */
 struct zap_details {
 	struct vm_area_struct *nonlinear_vma;	/* Check page->index if set */
-	struct address_space *check_mapping;	/* Check page->mapping if set */
+	struct address_space *mapping;		/* Backing address space */
+	bool keep_private;			/* Do not touch private pages */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -824,8 +824,8 @@ static unsigned long zap_pte_range(struc
 				 * invalidate cache without truncating:
 				 * unmap shared but keep private pages.
 				 */
-				if (details->check_mapping &&
-				    details->check_mapping != page->mapping)
+				if (details->keep_private &&
+				    details->mapping != page->mapping)
 					continue;
 				/*
 				 * Each page->index must be checked when
@@ -863,7 +863,7 @@ static unsigned long zap_pte_range(struc
 			continue;
 		}
 		/*
-		 * If details->check_mapping, we leave swap entries;
+		 * If details->keep_private, we leave swap entries;
 		 * if details->nonlinear_vma, we leave file entries.
 		 */
 		if (unlikely(details))
@@ -936,7 +936,7 @@ static unsigned long unmap_page_range(st
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping && !details->nonlinear_vma)
+	if (details && !details->keep_private && !details->nonlinear_vma)
 		details = NULL;
 
 	BUG_ON(addr >= end);
@@ -2433,7 +2433,8 @@ void unmap_mapping_range(struct address_
 			hlen = ULONG_MAX - hba + 1;
 	}
 
-	details.check_mapping = even_cows? NULL: mapping;
+	details.mapping = mapping;
+	details.keep_private = !even_cows;
 	details.nonlinear_vma = NULL;
 	details.first_index = hba;
 	details.last_index = hba + hlen - 1;


* [rfc patch 2/3] mm: serialize truncation unmap against try_to_unmap()
  2009-09-30 21:09 [rfc patch 1/3] mm: always pass mapping in zap_details Johannes Weiner
@ 2009-09-30 21:09 ` Johannes Weiner
  2009-09-30 21:09 ` [rfc patch 3/3] mm: munlock COW pages on truncation unmap Johannes Weiner
  1 sibling, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2009-09-30 21:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Hugh Dickins, Mel Gorman,
	Lee Schermerhorn, Peter Zijlstra

To munlock private COW pages on truncation unmap, we must serialize
against concurrent reclaimers doing try_to_unmap() so they don't
re-mlock the page before we free it.

Grabbing the page lock is not possible when zapping the page table
entries, so prevent lazy mlock in the reclaimer by holding onto the
anon_vma lock while unmapping a VMA.

The anon_vma may only show up after we have tried to lock it.  Pass it
down in zap_details so that the zapping loops can check whether the
lock was actually acquired.
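
Roughly, the serialization looks like this (a sketch; the reclaim side
is how I read the current rmap code, where try_to_unmap() on an
anonymous page goes through page_lock_anon_vma()):

	truncation unmap			reclaim via try_to_unmap()
	----------------			--------------------------
	spin_lock(&anon_vma->lock)
	zap_page_range()			page_lock_anon_vma(page)
	  munlock private COW page		  spin_lock(&anon_vma->lock)
	  clear pte, free page			  ...blocks until the unmap
	spin_unlock(&anon_vma->lock)		  is done and can not
						  re-mlock the page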

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -999,6 +999,7 @@ unsigned long unmap_vmas(struct mmu_gath
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	struct anon_vma *anon_vma = details? details->anon_vma: NULL;
 	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
@@ -1056,8 +1057,9 @@ unsigned long unmap_vmas(struct mmu_gath
 			tlb_finish_mmu(*tlbp, tlb_start, start);
 
 			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
+			    (i_mmap_lock && spin_needbreak(i_mmap_lock)) ||
+			    (anon_vma && spin_needbreak(&anon_vma->lock))) {
+				if (i_mmap_lock || anon_vma) {
 					*tlbp = NULL;
 					goto out;
 				}
@@ -2327,9 +2329,14 @@ again:
 		}
 	}
 
+	details->anon_vma = vma->anon_vma;
+	if (details->anon_vma)
+		spin_lock(&details->anon_vma->lock);
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
 	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	if (details->anon_vma)
+		spin_unlock(&details->anon_vma->lock);
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -733,6 +733,7 @@ extern void user_shm_unlock(size_t, stru
 struct zap_details {
 	struct vm_area_struct *nonlinear_vma;	/* Check page->index if set */
 	struct address_space *mapping;		/* Backing address space */
+	struct anon_vma *anon_vma;		/* Rmap for private COW pages */
 	bool keep_private;			/* Do not touch private pages */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */


* [rfc patch 3/3] mm: munlock COW pages on truncation unmap
  2009-09-30 21:09 [rfc patch 1/3] mm: always pass mapping in zap_details Johannes Weiner
  2009-09-30 21:09 ` [rfc patch 2/3] mm: serialize truncation unmap against try_to_unmap() Johannes Weiner
@ 2009-09-30 21:09 ` Johannes Weiner
  2009-10-02  2:40   ` KOSAKI Motohiro
  1 sibling, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2009-09-30 21:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Hugh Dickins, Mel Gorman,
	Lee Schermerhorn, Peter Zijlstra

When truncating, VMAs are not explicitly munlocked before the unmap.
The truncation code munlocks only page cache pages, so we can end up
freeing mlocked private COW pages.

This patch makes sure we munlock them and move them off the
unevictable list before dropping the page table reference.  We know
they are going away with the last reference, so simply clearing the
mlock (and accounting for it) is okay.

We cannot grab the page lock from the unmapping context, so we
optimistically try to move the page to the evictable list ourselves
and make sure a racing reclaimer moves it instead if we fail.

Rare case: the anon_vma is unlocked when we encounter private pages,
because the first one in the VMA was faulted in only after we tried
locking the anon_vma.  But we can handle it: on the second unmapping
iteration, the page cache will already be truncated and vma->anon_vma
will be stable, so just skip the page when the anon_vma is not
present.
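
To illustrate the intended ordering (a sketch only, pieced together
from the hunks below and the existing putback_lru_page() behaviour):

	zap_pte_range()				putback_lru_page()
	---------------				------------------
	TestClearPageMlocked(page)		SetPageLRU(page), lru lock
	isolate_lru_page(page)			smp_mb()   (hunk below)
	  fails if PG_lru is not set yet	if (!PageMlocked(page))
	putback_lru_page(page)			  move to evictable list
	  only if isolation succeeded

Whichever side observes the other's update rescues the page, so it can
not end up munlocked but stranded on the unevictable list.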

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/memory.c |   47 +++++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c |    7 +++++++
 2 files changed, 48 insertions(+), 6 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -819,13 +819,13 @@ static unsigned long zap_pte_range(struc
 
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(details) && page) {
+				int private = details->mapping != page->mapping;
 				/*
 				 * unmap_shared_mapping_pages() wants to
 				 * invalidate cache without truncating:
 				 * unmap shared but keep private pages.
 				 */
-				if (details->keep_private &&
-				    details->mapping != page->mapping)
+				if (details->keep_private && private)
 					continue;
 				/*
 				 * Each page->index must be checked when
@@ -835,6 +835,43 @@ static unsigned long zap_pte_range(struc
 				    (page->index < details->first_index ||
 				     page->index > details->last_index))
 					continue;
+				/*
+				 * When truncating, private COW pages may be
+				 * mlocked in VM_LOCKED VMAs, so they need
+				 * munlocking here before getting freed.
+				 *
+				 * Skip them completely if we don't have the
+				 * anon_vma locked.  We will get it the second
+				 * time.  When page cache is truncated, no more
+				 * private pages can show up against this VMA
+				 * and the anon_vma is either present or will
+				 * never be.
+				 *
+				 * Otherwise, we still have to synchronize
+				 * against concurrent reclaimers.  We can not
+				 * grab the page lock, but with correct
+				 * ordering of page flag accesses we can get
+				 * away without it.
+				 *
+				 * A concurrent isolator may add the page to
+				 * the unevictable list, set PG_lru and then
+				 * recheck PG_mlocked to verify it chose the
+				 * right list and conditionally move it again.
+				 *
+				 * TestClearPageMlocked() provides one half of
+				 * the barrier: when we do not see the page on
+				 * the LRU and fail isolation, the isolator
+				 * must see PG_mlocked cleared and move the
+				 * page on its own back to the evictable list.
+				 */
+				if (private && !details->anon_vma)
+					continue;
+				if (private && TestClearPageMlocked(page)) {
+					dec_zone_page_state(page, NR_MLOCK);
+					count_vm_event(UNEVICTABLE_PGCLEARED);
+					if (!isolate_lru_page(page))
+						putback_lru_page(page);
+				}
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
@@ -866,7 +903,8 @@ static unsigned long zap_pte_range(struc
 		 * If details->keep_private, we leave swap entries;
 		 * if details->nonlinear_vma, we leave file entries.
 		 */
-		if (unlikely(details))
+		if (unlikely(details && (details->keep_private ||
+					 details->nonlinear_vma)))
 			continue;
 		if (pte_file(ptent)) {
 			if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
@@ -936,9 +974,6 @@ static unsigned long unmap_page_range(st
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->keep_private && !details->nonlinear_vma)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -544,6 +544,13 @@ redo:
 		 */
 		lru = LRU_UNEVICTABLE;
 		add_page_to_unevictable_list(page);
+		/*
+		 * See the TestClearPageMlocked() in zap_pte_range():
+		 * if a racing unmapper did not see the above setting
+		 * of PG_lru, we must see its clearing of PG_mlocked
+		 * and move the page back to the evictable list.
+		 */
+		smp_mb();
 	}
 
 	/*


* Re: [rfc patch 3/3] mm: munlock COW pages on truncation unmap
  2009-09-30 21:09 ` [rfc patch 3/3] mm: munlock COW pages on truncation unmap Johannes Weiner
@ 2009-10-02  2:40   ` KOSAKI Motohiro
  2009-10-02 23:38     ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: KOSAKI Motohiro @ 2009-10-02  2:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Hugh Dickins,
	Mel Gorman, Lee Schermerhorn, Peter Zijlstra

Hi

Thanks for the very interesting patches.
I have a question.


> @@ -835,6 +835,43 @@ static unsigned long zap_pte_range(struc
>  				    (page->index < details->first_index ||
>  				     page->index > details->last_index))
>  					continue;
> +				/*
> +				 * When truncating, private COW pages may be
> +				 * mlocked in VM_LOCKED VMAs, so they need
> +				 * munlocking here before getting freed.
> +				 *
> +				 * Skip them completely if we don't have the
> +				 * anon_vma locked.  We will get it the second
> +				 * time.  When page cache is truncated, no more
> +				 * private pages can show up against this VMA
> +				 * and the anon_vma is either present or will
> +				 * never be.
> +				 *
> +				 * Otherwise, we still have to synchronize
> +				 * against concurrent reclaimers.  We can not
> +				 * grab the page lock, but with correct
> +				 * ordering of page flag accesses we can get
> +				 * away without it.
> +				 *
> +				 * A concurrent isolator may add the page to
> +				 * the unevictable list, set PG_lru and then
> +				 * recheck PG_mlocked to verify it chose the
> +				 * right list and conditionally move it again.
> +				 *
> +				 * TestClearPageMlocked() provides one half of
> +				 * the barrier: when we do not see the page on
> +				 * the LRU and fail isolation, the isolator
> +				 * must see PG_mlocked cleared and move the
> +				 * page on its own back to the evictable list.
> +				 */
> +				if (private && !details->anon_vma)
> +					continue;
> +				if (private && TestClearPageMlocked(page)) {
> +					dec_zone_page_state(page, NR_MLOCK);
> +					count_vm_event(UNEVICTABLE_PGCLEARED);
> +					if (!isolate_lru_page(page))
> +						putback_lru_page(page);
> +				}
>  			}
>  			ptent = ptep_get_and_clear_full(mm, addr, pte,
>  							tlb->fullmm);

Umm..
I haven't understood this yet.

(1) unmap_mapping_range() is called twice.

	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
	truncate_inode_pages(mapping, new);
	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);

(2) PG_mlocked is turned on from both mlock() and vmscan.
(3) vmscan grabs the anon_vma, but mlock() doesn't.
(4) after truncate_inode_pages(), we don't need to think about the COW
    race, because find_get_page() can never succeed.  But the first
    unmap_mapping_range() does race against COW breaking.

So, is grabbing the anon_vma really sufficient?
Or did you intend the following?

	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 0);
	truncate_inode_pages(mapping, new);
	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);



> @@ -544,6 +544,13 @@ redo:
>  		 */
>  		lru = LRU_UNEVICTABLE;
>  		add_page_to_unevictable_list(page);
> +		/*
> +		 * See the TestClearPageMlocked() in zap_pte_range():
> +		 * if a racing unmapper did not see the above setting
> +		 * of PG_lru, we must see its clearing of PG_locked
> +		 * and move the page back to the evictable list.
> +		 */
> +		smp_mb();
>  	}

add_page_to_unevictable_list() takes a spinlock.  Why do we need an
additional explicit memory barrier?





* Re: [rfc patch 3/3] mm: munlock COW pages on truncation unmap
  2009-10-02  2:40   ` KOSAKI Motohiro
@ 2009-10-02 23:38     ` Johannes Weiner
  2009-10-03 13:56       ` KOSAKI Motohiro
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2009-10-02 23:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Hugh Dickins, Mel Gorman,
	Lee Schermerhorn, Peter Zijlstra

Hello KOSAKI-san,

On Fri, Oct 02, 2009 at 11:40:34AM +0900, KOSAKI Motohiro wrote:
> > @@ -835,6 +835,43 @@ static unsigned long zap_pte_range(struc
> >  				    (page->index < details->first_index ||
> >  				     page->index > details->last_index))
> >  					continue;
> > +				/*
> > +				 * When truncating, private COW pages may be
> > +				 * mlocked in VM_LOCKED VMAs, so they need
> > +				 * munlocking here before getting freed.
> > +				 *
> > +				 * Skip them completely if we don't have the
> > +				 * anon_vma locked.  We will get it the second
> > +				 * time.  When page cache is truncated, no more
> > +				 * private pages can show up against this VMA
> > +				 * and the anon_vma is either present or will
> > +				 * never be.
> > +				 *
> > +				 * Otherwise, we still have to synchronize
> > +				 * against concurrent reclaimers.  We can not
> > +				 * grab the page lock, but with correct
> > +				 * ordering of page flag accesses we can get
> > +				 * away without it.
> > +				 *
> > +				 * A concurrent isolator may add the page to
> > +				 * the unevictable list, set PG_lru and then
> > +				 * recheck PG_mlocked to verify it chose the
> > +				 * right list and conditionally move it again.
> > +				 *
> > +				 * TestClearPageMlocked() provides one half of
> > +				 * the barrier: when we do not see the page on
> > +				 * the LRU and fail isolation, the isolator
> > +				 * must see PG_mlocked cleared and move the
> > +				 * page on its own back to the evictable list.
> > +				 */
> > +				if (private && !details->anon_vma)
> > +					continue;
> > +				if (private && TestClearPageMlocked(page)) {
> > +					dec_zone_page_state(page, NR_MLOCK);
> > +					count_vm_event(UNEVICTABLE_PGCLEARED);
> > +					if (!isolate_lru_page(page))
> > +						putback_lru_page(page);
> > +				}
> >  			}
> >  			ptent = ptep_get_and_clear_full(mm, addr, pte,
> >  							tlb->fullmm);
> 
> Umm..
> I haven't understand this.
> 
> (1) unmap_mapping_range() is called twice.
> 
> 	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> 	truncate_inode_pages(mapping, new);
> 	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> 
> (2) PG_mlock is turned on from mlock() and vmscan.
> (3) vmscan grab anon_vma, but mlock don't grab anon_vma.

You are right, I was so focused on the LRU side that I missed an
obvious window here: an _explicit_ mlock can still happen between the
PG_mlocked clearing section and releasing the page.

If we race with it, the put_page() in __mlock_vma_pages_range() might
free the freshly mlocked page.
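
Roughly the window I mean (a sketch; the mlock side is from my reading
of __mlock_vma_pages_range(), which takes its own page reference via
get_user_pages() and drops it with put_page()):

	truncation unmap			explicit mlock()
	----------------			----------------
	TestClearPageMlocked(page)
						get_user_pages() -> page
						mlock_vma_page(page)
						  sets PG_mlocked again
	clear pte, drop our references
						put_page()
						  last reference, frees a
						  page that is PG_mlocked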

> (4) after truncate_inode_pages(), we don't need to think vs-COW, because
>     find_get_page() never success. but first unmap_mapping_range()
>     have vs-COW racing.

Yes, we can race with COW breaking, but I cannot see a problem there.
It clears the old page's mlock, but also with an atomic
TestClearPageMlocked().  And the new page is mapped and mlocked under
the pte lock, and only if we didn't clear the pte in the meantime.

> So, Is anon_vma grabbing really sufficient?

No, the explicit mlocking race exists, I think.

> Or, you intent to the following?
> 
> 	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 0);
> 	truncate_inode_pages(mapping, new);
> 	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);

As mentioned above, I don't see how it would make a difference.

> > @@ -544,6 +544,13 @@ redo:
> >  		 */
> >  		lru = LRU_UNEVICTABLE;
> >  		add_page_to_unevictable_list(page);
> > +		/*
> > +		 * See the TestClearPageMlocked() in zap_pte_range():
> > +		 * if a racing unmapper did not see the above setting
> > +		 * of PG_lru, we must see its clearing of PG_locked
> > +		 * and move the page back to the evictable list.
> > +		 */
> > +		smp_mb();
> >  	}
> 
> add_page_to_unevictable() have a spin lock. Why do we need additionl
> explicit memory barrier?

It sets PG_lru under spinlock and tests PG_mlocked after the unlock.
The following sections from memory-barriers.txt made me nervous:

 (5) LOCK operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the LOCK operation will appear to happen after the LOCK
     operation with respect to the other components of the system.

 (6) UNLOCK operations.

     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the UNLOCK operation will appear to happen before
     the UNLOCK operation with respect to the other components of the system.

     Memory operations that occur after an UNLOCK operation may appear to
     happen before it completes.

So the only guarantee this gives us is that both the PG_lru setting
and the PG_mlocked test happen after the LOCK, and that the PG_lru
setting finishes before the UNLOCK, no?  I wanted to make sure this
does not happen:

	LOCK, test PG_mlocked, set PG_lru, UNLOCK

I don't know whether there is a data dependency between those two
operations.  They go to the same word, but I could also imagine
setting one bit being independent of reading another one.  Hmm.  Help.

Thanks,
	Hannes


* Re: [rfc patch 3/3] mm: munlock COW pages on truncation unmap
  2009-10-02 23:38     ` Johannes Weiner
@ 2009-10-03 13:56       ` KOSAKI Motohiro
  2009-10-05 19:32         ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: KOSAKI Motohiro @ 2009-10-03 13:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Hugh Dickins, Mel Gorman,
	Lee Schermerhorn, Peter Zijlstra

>> Umm..
>> I haven't understand this.
>>
>> (1) unmap_mapping_range() is called twice.
>>
>>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
>>       truncate_inode_pages(mapping, new);
>>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
>>
>> (2) PG_mlock is turned on from mlock() and vmscan.
>> (3) vmscan grab anon_vma, but mlock don't grab anon_vma.
>
> You are right, I was so focused on the LRU side that I missed an
> obvious window here: an _explicit_ mlock can still happen between the
> PG_mlocked clearing section and releasing the page.
>
> If we race with it, the put_page() in __mlock_vma_pages_range() might
> free the freshly mlocked page.
>
>> (4) after truncate_inode_pages(), we don't need to think vs-COW, because
>>     find_get_page() never success. but first unmap_mapping_range()
>>     have vs-COW racing.
>
> Yes, we can race with COW breaking, but I can not see a problem there.
> It clears the old page's mlock, but also with an atomic
> TestClearPageMlocked().  And the new page is mapped and mlocked under
> pte lock and only if we didn't clear the pte in the meantime.

Ah, you are right.


>> So, Is anon_vma grabbing really sufficient?
>
> No, the explicit mlocking race exists, I think.
>
>> Or, you intent to the following?
>>
>>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 0);
>>       truncate_inode_pages(mapping, new);
>>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
>
> As mentioned above, I don't see how it would make a difference.

Yes, sorry, please forget this.


>> > @@ -544,6 +544,13 @@ redo:
>> >              */
>> >             lru = LRU_UNEVICTABLE;
>> >             add_page_to_unevictable_list(page);
>> > +           /*
>> > +            * See the TestClearPageMlocked() in zap_pte_range():
>> > +            * if a racing unmapper did not see the above setting
>> > +            * of PG_lru, we must see its clearing of PG_locked
>> > +            * and move the page back to the evictable list.
>> > +            */
>> > +           smp_mb();
>> >     }
>>
>> add_page_to_unevictable() have a spin lock. Why do we need additionl
>> explicit memory barrier?
>
> It sets PG_lru under spinlock and tests PG_mlocked after the unlock.
> The following sections from memory-barriers.txt made me nervous:
>
>  (5) LOCK operations.
>
>     This acts as a one-way permeable barrier.  It guarantees that all memory
>     operations after the LOCK operation will appear to happen after the LOCK
>     operation with respect to the other components of the system.
>
>  (6) UNLOCK operations.
>
>     This also acts as a one-way permeable barrier.  It guarantees that all
>     memory operations before the UNLOCK operation will appear to happen before
>     the UNLOCK operation with respect to the other components of the system.
>
>     Memory operations that occur after an UNLOCK operation may appear to
>     happen before it completes.
>
> So the only garuantee this gives us is that both PG_lru setting and
> PG_mlocked testing happen after LOCK and PG_lru setting finishes
> before UNLOCK, no?  I wanted to make sure this does not happen:
>
>        LOCK, test PG_mlocked, set PG_lru, UNLOCK
>
> I don't know whether there is a data dependency between those two
> operations.  They go to the same word, but I could also imagine
> setting one bit is independent of reading another one.  Humm.  Help.

Ahh, yes!  You are right.
We really do need this barrier.

However, I think this issue doesn't depend on the zap_pte_range()
patch.  Other TestClearPageMlocked() callers have the same problem,
because putback_lru_page() doesn't have any exclusion, right?
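
(The pattern I mean, simplified and not verbatim from any single
caller:

	if (TestClearPageMlocked(page)) {
		dec_zone_page_state(page, NR_MLOCK);
		if (!isolate_lru_page(page))
			putback_lru_page(page);
	}

where putback_lru_page() only re-checks PG_mlocked after setting
PG_lru, without any further exclusion.)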


* Re: [rfc patch 3/3] mm: munlock COW pages on truncation unmap
  2009-10-03 13:56       ` KOSAKI Motohiro
@ 2009-10-05 19:32         ` Johannes Weiner
  2009-10-06  1:11           ` KOSAKI Motohiro
  2009-10-06  7:44           ` KOSAKI Motohiro
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Weiner @ 2009-10-05 19:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Hugh Dickins, Mel Gorman,
	Lee Schermerhorn, Peter Zijlstra, Andrew Morton

Hi,

On Sat, Oct 03, 2009 at 10:56:55PM +0900, KOSAKI Motohiro wrote:
> >> Umm..
> >> I haven't understand this.
> >>
> >> (1) unmap_mapping_range() is called twice.
> >>
> >>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> >>       truncate_inode_pages(mapping, new);
> >>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> >>
> >> (2) PG_mlock is turned on from mlock() and vmscan.
> >> (3) vmscan grab anon_vma, but mlock don't grab anon_vma.
> >
> > You are right, I was so focused on the LRU side that I missed an
> > obvious window here: an _explicit_ mlock can still happen between the
> > PG_mlocked clearing section and releasing the page.

Okay, so what are the opinions on this?  Would you consider taking my
patches to fix the most likely issues?  Dropping them in favor of
looking for a complete fix?  Reverting the warning on freeing
PG_mlocked pages?

> >> > @@ -544,6 +544,13 @@ redo:
> >> >              */
> >> >             lru = LRU_UNEVICTABLE;
> >> >             add_page_to_unevictable_list(page);
> >> > +           /*
> >> > +            * See the TestClearPageMlocked() in zap_pte_range():
> >> > +            * if a racing unmapper did not see the above setting
> >> > +            * of PG_lru, we must see its clearing of PG_locked
> >> > +            * and move the page back to the evictable list.
> >> > +            */
> >> > +           smp_mb();
> >> >     }
> >>
> >> add_page_to_unevictable() have a spin lock. Why do we need additionl
> >> explicit memory barrier?
> >
> > It sets PG_lru under spinlock and tests PG_mlocked after the unlock.
> > The following sections from memory-barriers.txt made me nervous:
> >
> >  (5) LOCK operations.
> >
> >     This acts as a one-way permeable barrier.  It guarantees that all memory
> >     operations after the LOCK operation will appear to happen after the LOCK
> >     operation with respect to the other components of the system.
> >
> >  (6) UNLOCK operations.
> >
> >     This also acts as a one-way permeable barrier.  It guarantees that all
> >     memory operations before the UNLOCK operation will appear to happen before
> >     the UNLOCK operation with respect to the other components of the system.
> >
> >     Memory operations that occur after an UNLOCK operation may appear to
> >     happen before it completes.
> >
> > So the only garuantee this gives us is that both PG_lru setting and
> > PG_mlocked testing happen after LOCK and PG_lru setting finishes
> > before UNLOCK, no?  I wanted to make sure this does not happen:
> >
> >        LOCK, test PG_mlocked, set PG_lru, UNLOCK
> >
> > I don't know whether there is a data dependency between those two
> > operations.  They go to the same word, but I could also imagine
> > setting one bit is independent of reading another one.  Humm.  Help.
> 
> Ahh, Yes! you are right.
> We really need this barrier.
> 
> However, I think this issue doesn't depend on zap_pte_range patch.
> Other TestClearPageMlocked(page) caller have the same problem, because
> putback_lru_page() doesn't have any exclusion, right?

You are right, it's an issue on its own.  Please find a stand-alone
patch below.

Thanks,

	Hannes

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: order evictable rescue in LRU putback

Isolators putting a page back to the LRU do not hold the page lock,
and if the page is mlocked, another thread might munlock it
concurrently.

Expecting this, the putback code re-checks the evictability of a page
when it just moved it to the unevictable list in order to correct its
decision.

The problem, however, is that ordering is not guaranteed between
setting PG_lru when moving the page to the list and checking
PG_mlocked afterwards:

	#0 putback			#1 munlock

	spin_lock()
					if (TestClearPageMlocked())
					  if (PageLRU())
					    move to evictable list
	SetPageLRU()
	spin_unlock()
	if (!PageMlocked())
	  move to evictable list

The PageMlocked() reading may get reordered before SetPageLRU() in #0,
resulting in #0 not moving the still mlocked page, and in #1 failing
to isolate and move the page as well.  The evictable page is now
stranded on the unevictable list.

TestClearPageMlocked() in #1 already provides full memory barrier
semantics.

This patch adds an explicit full barrier to force ordering between
SetPageLRU() and PageMlocked() in #0 so that either one of the
competitors rescues the page.
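
With the barrier in place, the race resolves correctly either way
(sketch):

	#0 putback			#1 munlock

	SetPageLRU()
	smp_mb()			if (TestClearPageMlocked())
	if (!PageMlocked())		  if (PageLRU())
	  move to evictable list	    move to evictable list

Since both sides are now fully ordered, at least one of them must
observe the other's store: either #0 sees PG_mlocked already cleared
and moves the page, or #1 sees PG_lru already set and moves it itself.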

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/vmscan.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -544,6 +544,16 @@ redo:
 		 */
 		lru = LRU_UNEVICTABLE;
 		add_page_to_unevictable_list(page);
+		/*
+		 * When racing with an mlock clearing (page is
+		 * unlocked), make sure that if the other thread does
+		 * not observe our setting of PG_lru and fails
+		 * isolation, we see PG_mlocked cleared below and move
+		 * the page back to the evictable list.
+		 *
+		 * The other side is TestClearPageMlocked().
+		 */
+		smp_mb();
 	}
 
 	/*


* Re: [rfc patch 3/3] mm: munlock COW pages on truncation unmap
  2009-10-05 19:32         ` Johannes Weiner
@ 2009-10-06  1:11           ` KOSAKI Motohiro
  2009-10-06  7:44           ` KOSAKI Motohiro
  1 sibling, 0 replies; 9+ messages in thread
From: KOSAKI Motohiro @ 2009-10-06  1:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Hugh Dickins,
	Mel Gorman, Lee Schermerhorn, Peter Zijlstra, Andrew Morton

> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: mm: order evictable rescue in LRU putback
> 
> Isolators putting a page back to the LRU do not hold the page lock,
> and if the page is mlocked, another thread might munlock it
> concurrently.
> 
> Expecting this, the putback code re-checks the evictability of a page
> when it just moved it to the unevictable list in order to correct its
> decision.
> 
> The problem, however, is that ordering is not garuanteed between
> setting PG_lru when moving the page to the list and checking
> PG_mlocked afterwards:
> 
> 	#0 putback			#1 munlock
> 
> 	spin_lock()
> 					if (TestClearPageMlocked())
> 					  if (PageLRU())
> 					    move to evictable list
> 	SetPageLRU()
> 	spin_unlock()
> 	if (!PageMlocked())
> 	  move to evictable list
> 
> The PageMlocked() reading may get reordered before SetPageLRU() in #0,
> resulting in #0 not moving the still mlocked page, and in #1 failing
> to isolate and move the page as well.  The evictable page is now
> stranded on the unevictable list.
> 
> TestClearPageMlocked() in #1 already provides full memory barrier
> semantics.
> 
> This patch adds an explicit full barrier to force ordering between
> SetPageLRU() and PageMlocked() in #0 so that either one of the
> competitors rescues the page.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  mm/vmscan.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -544,6 +544,16 @@ redo:
>  		 */
>  		lru = LRU_UNEVICTABLE;
>  		add_page_to_unevictable_list(page);
> +		/*
> +		 * When racing with an mlock clearing (page is
> +		 * unlocked), make sure that if the other thread does
> +		 * not observe our setting of PG_lru and fails
> +		 * isolation, we see PG_mlocked cleared below and move
> +		 * the page back to the evictable list.
> +		 *
> +		 * The other side is TestClearPageMlocked().
> +		 */
> +		smp_mb();
>  	}

IA64 is the most relaxed CPU-reordering architecture.  I usually test
on it and my tests found no problem.
So I don't think this issue occurs in the real world, but I think
this patch is right.

Hannes, you are great.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>







* Re: [rfc patch 3/3] mm: munlock COW pages on truncation unmap
  2009-10-05 19:32         ` Johannes Weiner
  2009-10-06  1:11           ` KOSAKI Motohiro
@ 2009-10-06  7:44           ` KOSAKI Motohiro
  1 sibling, 0 replies; 9+ messages in thread
From: KOSAKI Motohiro @ 2009-10-06  7:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Hugh Dickins,
	Mel Gorman, Lee Schermerhorn, Peter Zijlstra, Andrew Morton

> Hi,
> 
> On Sat, Oct 03, 2009 at 10:56:55PM +0900, KOSAKI Motohiro wrote:
> > >> Umm..
> > >> I haven't understand this.
> > >>
> > >> (1) unmap_mapping_range() is called twice.
> > >>
> > >>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> > >>       truncate_inode_pages(mapping, new);
> > >>       unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> > >>
> > >> (2) PG_mlock is turned on from mlock() and vmscan.
> > >> (3) vmscan grab anon_vma, but mlock don't grab anon_vma.
> > >
> > > You are right, I was so focused on the LRU side that I missed an
> > > obvious window here: an _explicit_ mlock can still happen between the
> > > PG_mlocked clearing section and releasing the page.
> 
> Okay, so what are the opinions on this?  Would you consider my patches
> to fix the most likely issues?  Dropping them in favor of looking for
> a complete fix?  Revert the warning on freeing PG_mlocked pages?

Honestly, I don't have any good idea.  But luckily, we have enough
time.  The false-positive warning is not such a big problem, so I
prefer looking for a complete solution.

Thanks.


