linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
@ 2012-09-10 16:19 Tim Chen
  2012-09-11  2:36 ` Minchan Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Tim Chen @ 2012-09-10 16:19 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Minchan Kim, Johannes Weiner,
	KAMEZAWA Hiroyuki, Andrea Arcangeli, David Rientjes,
	Michal Hocko, Xiao Guangrong, Paul Gortmaker
  Cc: Kirill A. Shutemov, Andi Kleen, linux-mm, linux-kernel, Alex Shi,
	Matthew Wilcox, Fengguang Wu

We gather the pages that need to be unmapped in shrink_page_list.  We batch
the unmap to reduce how often we have to acquire the tree lock protecting
the mapping's radix tree.  This is possible because successive pages likely
share the same mapping in the __remove_mapping_batch routine.  Batching
avoids excessive cache bouncing of the tree lock when page reclamations are
occurring simultaneously.

Tim
---
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
--- 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aac5672..d4ab646 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -600,6 +600,85 @@ cannot_free:
 	return 0;
 }
 
+/* Same as __remove_mapping, but batching operations to minimize locking */
+/* Pages to be unmapped should be locked first */
+static int __remove_mapping_batch(struct list_head *unmap_pages,
+				  struct list_head *ret_pages,
+				  struct list_head *free_pages)
+{
+	struct address_space *mapping, *next;
+	LIST_HEAD(swap_pages);
+	swp_entry_t swap;
+	struct page *page;
+	int nr_reclaimed;
+
+	mapping = NULL;
+	nr_reclaimed = 0;
+	while (!list_empty(unmap_pages)) {
+
+		page = lru_to_page(unmap_pages);
+		BUG_ON(!PageLocked(page));
+
+		list_del(&page->lru);
+		next = page_mapping(page);
+		if (mapping != next) {
+			if (mapping)
+				spin_unlock_irq(&mapping->tree_lock);
+			mapping = next;
+			spin_lock_irq(&mapping->tree_lock);
+		}
+
+		if (!page_freeze_refs(page, 2))
+			goto cannot_free;
+		if (unlikely(PageDirty(page))) {
+			page_unfreeze_refs(page, 2);
+			goto cannot_free;
+		}
+
+		if (PageSwapCache(page)) {
+			__delete_from_swap_cache(page);
+			/* swapcache_free need to be called without tree_lock */
+			list_add(&page->lru, &swap_pages);
+		} else {
+			void (*freepage)(struct page *);
+
+			freepage = mapping->a_ops->freepage;
+
+			__delete_from_page_cache(page);
+			mem_cgroup_uncharge_cache_page(page);
+
+			if (freepage != NULL)
+				freepage(page);
+
+			unlock_page(page);
+			nr_reclaimed++;
+			list_add(&page->lru, free_pages);
+		}
+		continue;
+cannot_free:
+		unlock_page(page);
+		list_add(&page->lru, ret_pages);
+		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
+
+	}
+
+	if (mapping)
+		spin_unlock_irq(&mapping->tree_lock);
+
+	while (!list_empty(&swap_pages)) {
+		page = lru_to_page(&swap_pages);
+		list_del(&page->lru);
+
+		swap.val = page_private(page);
+		swapcache_free(swap, page);
+
+		unlock_page(page);
+		nr_reclaimed++;
+		list_add(&page->lru, free_pages);
+	}
+
+	return nr_reclaimed;
+}
 /*
  * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
  * someone else has a ref on the page, abort and return 0.  If it was
@@ -771,6 +850,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(unmap_pages);
 	int pgactivate = 0;
 	unsigned long nr_dirty = 0;
 	unsigned long nr_congested = 0;
@@ -969,17 +1049,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping)
 			goto keep_locked;
 
-		/*
-		 * At this point, we have no other references and there is
-		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
-		 */
-		__clear_page_locked(page);
+		/* remove pages from mapping in batch at end of loop */
+		list_add(&page->lru, &unmap_pages);
+		continue;
+
 free_it:
 		nr_reclaimed++;
 
@@ -1014,6 +1090,9 @@ keep_lumpy:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
+	nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages,
+					       &free_pages);
+
 	/*
 	 * Tag a zone as congested if all the dirty pages encountered were
 	 * backed by a congested BDI. In this case, reclaimers should just

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
  2012-09-10 16:19 [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list Tim Chen
@ 2012-09-11  2:36 ` Minchan Kim
  2012-09-11 11:05 ` Mel Gorman
  2012-09-12 19:28 ` Andrew Morton
  2 siblings, 0 replies; 6+ messages in thread
From: Minchan Kim @ 2012-09-11  2:36 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, Michal Hocko, Xiao Guangrong,
	Paul Gortmaker, Kirill A. Shutemov, Andi Kleen, linux-mm,
	linux-kernel, Alex Shi, Matthew Wilcox, Fengguang Wu

Hi Tim,

On Mon, Sep 10, 2012 at 09:19:25AM -0700, Tim Chen wrote:
> We gather the pages that need to be unmapped in shrink_page_list.  We batch
> the unmap to reduce how often we have to acquire the tree lock protecting
> the mapping's radix tree.  This is possible because successive pages likely
> share the same mapping in the __remove_mapping_batch routine.  Batching
> avoids excessive cache bouncing of the tree lock when page reclamations are
> occurring simultaneously.
> 
> Tim
> ---
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> --- 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index aac5672..d4ab646 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -600,6 +600,85 @@ cannot_free:
>  	return 0;
>  }
>  
> +/* Same as __remove_mapping, but batching operations to minimize locking */
> +/* Pages to be unmapped should be locked first */

Use the comment style below.

/*
 * blah, blah,
 */

> +static int __remove_mapping_batch(struct list_head *unmap_pages,
> +				  struct list_head *ret_pages,
> +				  struct list_head *free_pages)
> +{
> +	struct address_space *mapping, *next;
> +	LIST_HEAD(swap_pages);
> +	swp_entry_t swap;
> +	struct page *page;
> +	int nr_reclaimed;

Use unsigned long instead of int.

> +
> +	mapping = NULL;
> +	nr_reclaimed = 0;
> +	while (!list_empty(unmap_pages)) {
> +
> +		page = lru_to_page(unmap_pages);
> +		BUG_ON(!PageLocked(page));
> +
> +		list_del(&page->lru);
> +		next = page_mapping(page);
> +		if (mapping != next) {
> +			if (mapping)
> +				spin_unlock_irq(&mapping->tree_lock);
> +			mapping = next;
> +			spin_lock_irq(&mapping->tree_lock);
> +		}
> +
> +		if (!page_freeze_refs(page, 2))
> +			goto cannot_free;
> +		if (unlikely(PageDirty(page))) {
> +			page_unfreeze_refs(page, 2);
> +			goto cannot_free;
> +		}
> +
> +		if (PageSwapCache(page)) {
> +			__delete_from_swap_cache(page);
> +			/* swapcache_free need to be called without tree_lock */
> +			list_add(&page->lru, &swap_pages);
> +		} else {
> +			void (*freepage)(struct page *);
> +
> +			freepage = mapping->a_ops->freepage;
> +
> +			__delete_from_page_cache(page);
> +			mem_cgroup_uncharge_cache_page(page);
> +
> +			if (freepage != NULL)
> +				freepage(page);
> +
> +			unlock_page(page);

If you use unlock_page here, you undo the optimization of commit a978d6f5.
You can use __clear_page_locked instead.
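Something like this (untested sketch, just to illustrate; it also keeps the
original comment from shrink_page_list next to the non-atomic clear):

			if (freepage != NULL)
				freepage(page);

			/*
			 * At this point, we have no other references and
			 * there is no way to pick any more up (removed from
			 * LRU, removed from pagecache).  Can use non-atomic
			 * bitops now, and we don't have to worry about waking
			 * up a process waiting on the page lock, because
			 * there are no references.
			 */
			__clear_page_locked(page);
			nr_reclaimed++;
			list_add(&page->lru, free_pages);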


> +			nr_reclaimed++;
> +			list_add(&page->lru, free_pages);
> +		}
> +		continue;
> +cannot_free:
> +		unlock_page(page);
> +		list_add(&page->lru, ret_pages);
> +		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> +
> +	}
> +
> +	if (mapping)
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +	while (!list_empty(&swap_pages)) {
> +		page = lru_to_page(&swap_pages);
> +		list_del(&page->lru);
> +
> +		swap.val = page_private(page);
> +		swapcache_free(swap, page);
> +
> +		unlock_page(page);

Ditto.

> +		nr_reclaimed++;
> +		list_add(&page->lru, free_pages);
> +	}
> +
> +	return nr_reclaimed;
> +}
>  /*
>   * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
>   * someone else has a ref on the page, abort and return 0.  If it was
> @@ -771,6 +850,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> +	LIST_HEAD(unmap_pages);
>  	int pgactivate = 0;
>  	unsigned long nr_dirty = 0;
>  	unsigned long nr_congested = 0;
> @@ -969,17 +1049,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (!mapping || !__remove_mapping(mapping, page))
> +		if (!mapping)
>  			goto keep_locked;
>  
> -		/*
> -		 * At this point, we have no other references and there is
> -		 * no way to pick any more up (removed from LRU, removed
> -		 * from pagecache). Can use non-atomic bitops now (and
> -		 * we obviously don't have to worry about waking up a process
> -		 * waiting on the page lock, because there are no references.
> -		 */

Please don't drop this description; move it into your function along with
the __clear_page_locked call.

> -		__clear_page_locked(page);
> +		/* remove pages from mapping in batch at end of loop */
> +		list_add(&page->lru, &unmap_pages);
> +		continue;
> +
>  free_it:
>  		nr_reclaimed++;
>  
> @@ -1014,6 +1090,9 @@ keep_lumpy:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages,
> +					       &free_pages);
> +
>  	/*
>  	 * Tag a zone as congested if all the dirty pages encountered were
>  	 * backed by a congested BDI. In this case, reclaimers should just

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
  2012-09-10 16:19 [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list Tim Chen
  2012-09-11  2:36 ` Minchan Kim
@ 2012-09-11 11:05 ` Mel Gorman
  2012-09-13 16:06   ` Tim Chen
  2012-09-12 19:28 ` Andrew Morton
  2 siblings, 1 reply; 6+ messages in thread
From: Mel Gorman @ 2012-09-11 11:05 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Minchan Kim, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, Michal Hocko, Xiao Guangrong,
	Paul Gortmaker, Kirill A. Shutemov, Andi Kleen, linux-mm,
	linux-kernel, Alex Shi, Matthew Wilcox, Fengguang Wu

On Mon, Sep 10, 2012 at 09:19:25AM -0700, Tim Chen wrote:
> We gather the pages that need to be unmapped in shrink_page_list.  We batch
> the unmap to reduce how often we have to acquire the tree lock protecting
> the mapping's radix tree.  This is possible because successive pages likely
> share the same mapping in the __remove_mapping_batch routine.  Batching
> avoids excessive cache bouncing of the tree lock when page reclamations are
> occurring simultaneously.
> 

Ok, I get the intention of the patch at least. Based on the description
and without seeing the code I worry that this depends on pages belonging
to the same mapping being adjacent in the LRU list.

It would also be nice to get a better description of the workload in
your leader and ideally a perf reporting showing a reduced amount of time
acquiring the tree lock.

> Tim
> ---
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Patch formatting error. Your signed-off-by should be above the --- and
there is no need for your signature.
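i.e. the body of the mail should be laid out roughly like this (placeholder
text in <>):

<changelog text describing the change>

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aac5672..d4ab646 100644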

> --- 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index aac5672..d4ab646 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -600,6 +600,85 @@ cannot_free:
>  	return 0;
>  }
>  
> +/* Same as __remove_mapping, but batching operations to minimize locking */
> +/* Pages to be unmapped should be locked first */

/*
 * multi-line comments should
 * look like this
 */

> +static int __remove_mapping_batch(struct list_head *unmap_pages,
> +				  struct list_head *ret_pages,
> +				  struct list_head *free_pages)

The return values should be explained in the comments. What's in
ret_pages, what's in free_pages?

/*
 * When this function returns, unmap_pages will be empty. ret_pages
 * will contain ..... . free_pages will contain ...
 */

unmap_pages is a misleading name. In the context of vmscan.c "unmap"
means that the page is unmapped from the userspace page tables. Here you
are removing the pages from the address space radix tree.
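For the comment, something along these lines would do (wording is only a
suggestion):

/*
 * Remove the pages on @unmap_pages (all of which must be locked) from their
 * mapping's radix tree, batching tree_lock acquisition across consecutive
 * pages that share a mapping.  Pages that could not be freed are unlocked
 * and moved to @ret_pages; pages that were freed are unlocked and moved to
 * @free_pages.  Returns the number of pages reclaimed.
 */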

> +{
> +	struct address_space *mapping, *next;
> +	LIST_HEAD(swap_pages);
> +	swp_entry_t swap;
> +	struct page *page;
> +	int nr_reclaimed;
> +
> +	mapping = NULL;
> +	nr_reclaimed = 0;
> +	while (!list_empty(unmap_pages)) {
> +

unnecessary whitespace.

> +		page = lru_to_page(unmap_pages);
> +		BUG_ON(!PageLocked(page));
> +
> +		list_del(&page->lru);
> +		next = page_mapping(page);
> +		if (mapping != next) {
> +			if (mapping)
> +				spin_unlock_irq(&mapping->tree_lock);
> +			mapping = next;
> +			spin_lock_irq(&mapping->tree_lock);
> +		}

Ok, so for the batching to work the pages do need to be adjacent on the
LRU list. There is nothing wrong with this as such but this limitation
should be called out in the changelog.

> +
> +		if (!page_freeze_refs(page, 2))
> +			goto cannot_free;
> +		if (unlikely(PageDirty(page))) {
> +			page_unfreeze_refs(page, 2);
> +			goto cannot_free;
> +		}
> +
> +		if (PageSwapCache(page)) {
> +			__delete_from_swap_cache(page);
> +			/* swapcache_free need to be called without tree_lock */
> +			list_add(&page->lru, &swap_pages);
> +		} else {
> +			void (*freepage)(struct page *);
> +
> +			freepage = mapping->a_ops->freepage;
> +
> +			__delete_from_page_cache(page);
> +			mem_cgroup_uncharge_cache_page(page);
> +
> +			if (freepage != NULL)
> +				freepage(page);
> +
> +			unlock_page(page);

no longer using __clear_page_locked in the success path, why?

> +			nr_reclaimed++;
> +			list_add(&page->lru, free_pages);
> +		}
> +		continue;
> +cannot_free:
> +		unlock_page(page);
> +		list_add(&page->lru, ret_pages);
> +		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> +
> +	}
> +
> +	if (mapping)
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +	while (!list_empty(&swap_pages)) {
> +		page = lru_to_page(&swap_pages);
> +		list_del(&page->lru);
> +
> +		swap.val = page_private(page);
> +		swapcache_free(swap, page);
> +
> +		unlock_page(page);
> +		nr_reclaimed++;
> +		list_add(&page->lru, free_pages);
> +	}
> +
> +	return nr_reclaimed;
> +}
>  /*
>   * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
>   * someone else has a ref on the page, abort and return 0.  If it was
> @@ -771,6 +850,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> +	LIST_HEAD(unmap_pages);
>  	int pgactivate = 0;
>  	unsigned long nr_dirty = 0;
>  	unsigned long nr_congested = 0;
> @@ -969,17 +1049,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (!mapping || !__remove_mapping(mapping, page))
> +		if (!mapping)
>  			goto keep_locked;
>  
> -		/*
> -		 * At this point, we have no other references and there is
> -		 * no way to pick any more up (removed from LRU, removed
> -		 * from pagecache). Can use non-atomic bitops now (and
> -		 * we obviously don't have to worry about waking up a process
> -		 * waiting on the page lock, because there are no references.
> -		 */
> -		__clear_page_locked(page);
> +		/* remove pages from mapping in batch at end of loop */
> +		list_add(&page->lru, &unmap_pages);
> +		continue;
> +
>  free_it:
>  		nr_reclaimed++;
>  

One *massive* change here that is not called out in the changelog is that
the reclaim path now holds the page lock on multiple pages at the same
time waiting for them to be batch unlocked in __remove_mapping_batch.
This is suspicious for two reasons.

The first suspicion is that it is expected that there are filesystems
that lock multiple pages in page->index order and page reclaim tries to
lock pages in a random order.  You are "ok" because you trylock the pages
but there should be a comment explaining the situation and why you're
ok.

My *far* greater concern is that the hold time for a locked page is
now potentially much longer. You could lock a bunch of filesystem pages
and then call pageout() on a swapcache page that takes a long time to
write. This potentially causes a filesystem (or flusher threads etc)
to stall on lock_page and that could cause all sorts of latency trouble.
It will be hard to hit this bug and diagnose it but I believe it's
there.

That second risk *really* must be commented upon and ideally reviewed by
the filesystem people. However, I very strongly suspect that the outcome
of such a review will be a suggestion to unlock the pages and reacquire
the lock in __remove_mapping_batch(). Bear in mind that if you take this
approach that you *must* use trylock when reacquiring the page lock and
handle being unable to lock the page.
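As a rough illustration only (untested, and the caller would need to cope
with pages coming back on ret_pages), the reacquisition could look like:

	while (!list_empty(unmap_pages)) {
		page = lru_to_page(unmap_pages);
		list_del(&page->lru);

		/* pages arrive unlocked in this scheme; retake the lock */
		if (!trylock_page(page)) {
			/* lost the race; hand the page back to the caller */
			list_add(&page->lru, ret_pages);
			continue;
		}

		/* ... recheck mapping/refs/dirty state under the lock ... */
	}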


> @@ -1014,6 +1090,9 @@ keep_lumpy:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages,
> +					       &free_pages);
> +
>  	/*
>  	 * Tag a zone as congested if all the dirty pages encountered were
>  	 * backed by a congested BDI. In this case, reclaimers should just
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
  2012-09-10 16:19 [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list Tim Chen
  2012-09-11  2:36 ` Minchan Kim
  2012-09-11 11:05 ` Mel Gorman
@ 2012-09-12 19:28 ` Andrew Morton
  2 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2012-09-12 19:28 UTC (permalink / raw)
  To: Tim Chen
  Cc: Mel Gorman, Minchan Kim, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, Michal Hocko, Xiao Guangrong,
	Paul Gortmaker, Kirill A. Shutemov, Andi Kleen, linux-mm,
	linux-kernel, Alex Shi, Matthew Wilcox, Fengguang Wu

On Mon, 10 Sep 2012 09:19:25 -0700
Tim Chen <tim.c.chen@linux.intel.com> wrote:

> @@ -1014,6 +1090,9 @@ keep_lumpy:

I worry that your tree has the label "keep_lumpy" in it!  Are you
developing against recent kernels?

>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages,
> +					       &free_pages);
> +
>  	/*
>  	 * Tag a zone as congested if all the dirty pages encountered were
>  	 * backed by a congested BDI. In this case, reclaimers should just
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
  2012-09-11 11:05 ` Mel Gorman
@ 2012-09-13 16:06   ` Tim Chen
  2012-09-14  8:56     ` Mel Gorman
  0 siblings, 1 reply; 6+ messages in thread
From: Tim Chen @ 2012-09-13 16:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, Michal Hocko, Xiao Guangrong,
	Paul Gortmaker, Kirill A. Shutemov, Andi Kleen, linux-mm,
	linux-kernel, Alex Shi, Matthew Wilcox, Fengguang Wu

On Tue, 2012-09-11 at 12:05 +0100, Mel Gorman wrote:

> 
> One *massive* change here that is not called out in the changelog is that
> the reclaim path now holds the page lock on multiple pages at the same
> time waiting for them to be batch unlocked in __remove_mapping_batch.
> This is suspicious for two reasons.
> 
> The first suspicion is that it is expected that there are filesystems
> that lock multiple pages in page->index order and page reclaim tries to
> lock pages in a random order.  You are "ok" because you trylock the pages
> but there should be a comment explaining the situation and why you're
> ok.
> 
> My *far* greater concern is that the hold time for a locked page is
> now potentially much longer. You could lock a bunch of filesystem pages
> and then call pageout() on a swapcache page that takes a long time to
> write. This potentially causes a filesystem (or flusher threads etc)
> to stall on lock_page and that could cause all sorts of latency trouble.
> It will be hard to hit this bug and diagnose it but I believe it's
> there.
> 
> That second risk *really* must be commented upon and ideally reviewed by
> the filesystem people. However, I very strongly suspect that the outcome
> of such a review will be a suggestion to unlock the pages and reacquire
> the lock in __remove_mapping_batch(). Bear in mind that if you take this
> approach that you *must* use trylock when reacquiring the page lock and
> handle being unable to lock the page.
> 

Mel,

Thanks for your detailed comments and analysis.  If I unlock the pages,
will flusher threads be the only things that will touch them?  Or do I
have to worry about other things potentially being done to the pages that
would make it invalid for me to unmap the pages later and put them on the
free list?

Tim


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
  2012-09-13 16:06   ` Tim Chen
@ 2012-09-14  8:56     ` Mel Gorman
  0 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2012-09-14  8:56 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Minchan Kim, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, Michal Hocko, Xiao Guangrong,
	Paul Gortmaker, Kirill A. Shutemov, Andi Kleen, linux-mm,
	linux-kernel, Alex Shi, Matthew Wilcox, Fengguang Wu

On Thu, Sep 13, 2012 at 09:06:10AM -0700, Tim Chen wrote:
> On Tue, 2012-09-11 at 12:05 +0100, Mel Gorman wrote:
> 
> > 
> > One *massive* change here that is not called out in the changelog is that
> > the reclaim path now holds the page lock on multiple pages at the same
> > time waiting for them to be batch unlocked in __remove_mapping_batch.
> > This is suspicious for two reasons.
> > 
> > The first suspicion is that it is expected that there are filesystems
> > that lock multiple pages in page->index order and page reclaim tries to
> > lock pages in a random order.  You are "ok" because you trylock the pages
> > but there should be a comment explaining the situation and why you're
> > ok.
> > 
> > My *far* greater concern is that the hold time for a locked page is
> > now potentially much longer. You could lock a bunch of filesystem pages
> > and then call pageout() on a swapcache page that takes a long time to
> > write. This potentially causes a filesystem (or flusher threads etc)
> > to stall on lock_page and that could cause all sorts of latency trouble.
> > It will be hard to hit this bug and diagnose it but I believe it's
> > there.
> > 
> > That second risk *really* must be commented upon and ideally reviewed by
> > the filesystem people. However, I very strongly suspect that the outcome
> > of such a review will be a suggestion to unlock the pages and reacquire
> > the lock in __remove_mapping_batch(). Bear in mind that if you take this
> > approach that you *must* use trylock when reacquiring the page lock and
> > handle being unable to lock the page.
> > 
> 
> Mel,
> 
> Thanks for your detailed comments and analysis.  If I unlock the pages,
> will flusher threads be the only things that will touch them? 

I don't think so. The pages are still in the mapping's radix tree, so they
can potentially still be read() from and written to, which can recreate
empty buffers. Something like:

writepage handler
-> block_write_full_page
  -> block_write_full_page_endio
    -> __block_write_full_page
      -> create_empty_buffers

Offhand I would also expect it's possible to fault the page again once
the page lock is released.

> Or do I
> have to worry about other things potentially being done to the pages that
> would make it invalid for me to unmap the pages later and put them on the
> free list?
> 

I expect that you'll have to double check that the page is still eligible
to be removed from the mapping and added to the free list.  This could get
complex because you then have to retry reclaim with those pages without
any batching and it may offset the advantage you are measuring.

There is another option worth considering as well.  I
was complaining about holding multiple locks because of the potentially
unbounded length of time you hold those locks.  It now occurs to me that
you could hold these locks but call __remove_mapping_batch() and drain
the list before calling long-lived operations like pageout(). That might
be easier to implement.
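i.e. in shrink_page_list, roughly (untested, just to show the idea):

		if (PageDirty(page)) {
			/*
			 * pageout() can block for a long time, so drain the
			 * batch first rather than holding a pile of page
			 * locks across the writeout.
			 */
			nr_reclaimed += __remove_mapping_batch(&unmap_pages,
							       &ret_pages,
							       &free_pages);
			/* ... then carry on with the existing pageout() path */
		}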

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread

Thread overview: 6+ messages
2012-09-10 16:19 [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list Tim Chen
2012-09-11  2:36 ` Minchan Kim
2012-09-11 11:05 ` Mel Gorman
2012-09-13 16:06   ` Tim Chen
2012-09-14  8:56     ` Mel Gorman
2012-09-12 19:28 ` Andrew Morton
