All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
@ 2014-01-22 10:53 ` Raghavendra K T
  0 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-01-22 10:53 UTC (permalink / raw)
  To: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus
  Cc: linux-mm, linux-kernel, Raghavendra K T

max_sane_readahead() returns zero on a CPU whose NUMA node has no local
memory. Fix that by returning a sanitized number of pages, viz. the
minimum of (requested pages, 4k).

Result:
fadvise experiment with FADV_WILLNEED on an x240 NUMA machine
(32GB* 4G RAM) with a 1GB test file (12 iterations) yielded:

Kernel     Avg      Stddev
base      7.2963    1.10 %
patched   7.2972    1.18 %

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 Changes in V5:
 - Drop the 4k limit for normal readahead. (Jan Kara)

 Changes in V4:
 - Check for total node memory to decide whether we don't
   have local memory (Jan Kara)
 - Add a 4k-page limit on readahead for normal and remote readahead (Linus)
   (Linus's suggestion was a 16MB limit).

 Changes in V3:
 - Drop iterating over numa nodes that calculates total free pages (Linus)

 We agree that we have no control over the allocation for readahead on a
 particular NUMA node, and hence for remote readahead we cannot further
 sanitize based on the potential free pages of that node. We also do not
 want to iterate through all nodes to find the total free pages.

 Suggestions and comments welcome
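
 For reference, the fadvise experiment above can be reproduced with a small
 user-space program along these lines (a sketch, not the benchmark actually
 used; the file path is supplied by the caller):

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Ask the kernel to read the whole file ahead into the page cache.
 * POSIX_FADV_WILLNEED reaches force_page_cache_readahead(), which
 * calls max_sane_readahead() -- the function changed by this patch.
 * Returns 0 on success, non-zero on failure.
 */
static int readahead_whole_file(const char *path)
{
	struct stat st;
	int ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return 1;
	if (fstat(fd, &st) != 0) {
		close(fd);
		return 1;
	}
	ret = posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
	close(fd);
	return ret;
}
```

 Timing repeated runs of such a program over a cold page cache is what the
 Avg/Stddev numbers above measure.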

 mm/readahead.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 7cdbb44..9d2afd0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -237,14 +237,32 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	return ret;
 }
 
+#define MAX_REMOTE_READAHEAD   4096UL
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+	unsigned long local_free_page;
+	int nid;
+
+	nid = numa_node_id();
+	if (node_present_pages(nid)) {
+		/*
+		 * We sanitize readahead size depending on free memory in
+		 * the local node.
+		 */
+		local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
+				 + node_page_state(nid, NR_FREE_PAGES);
+		return min(nr, local_free_page / 2);
+	}
+	/*
+	 * Readahead onto remote memory is better than no readahead when local
+	 * numa node does not have memory. We limit the readahead to 4k
+	 * pages though to avoid trashing page cache.
+	 */
+	return min(nr, MAX_REMOTE_READAHEAD);
 }
 
 /*
-- 
1.7.11.7
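
The control flow of the patched max_sane_readahead() can be simulated in
user space. In the sketch below every sim_-prefixed name is an illustrative
stand-in for the kernel symbol of the same role, not a real kernel API:

```c
#include <assert.h>

#define SIM_MAX_REMOTE_READAHEAD 4096UL	/* mirrors MAX_REMOTE_READAHEAD */

static unsigned long sim_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Simulated clamp: when the local node has present pages, bound the
 * request by half of its (inactive-file + free) pages, as before the
 * patch; on a memoryless node, fall back to the fixed 4096-page cap
 * instead of returning zero.
 */
static unsigned long sim_max_sane_readahead(unsigned long nr,
					    unsigned long node_present_pages,
					    unsigned long nr_inactive_file,
					    unsigned long nr_free_pages)
{
	if (node_present_pages)
		return sim_min(nr, (nr_inactive_file + nr_free_pages) / 2);
	return sim_min(nr, SIM_MAX_REMOTE_READAHEAD);
}
```

For example, a node with 10000 inactive-file and 6000 free pages clamps a
100000-page request to 8000 pages, while a memoryless node now yields 4096
pages rather than 0.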


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-01-22 10:53 ` Raghavendra K T
@ 2014-02-03  8:30   ` Raghavendra K T
  -1 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-02-03  8:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Raghavendra K T, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel

On 01/22/2014 04:23 PM, Raghavendra K T wrote:
> max_sane_readahead() returns zero on a CPU whose NUMA node has no local
> memory. Fix that by returning a sanitized number of pages, viz. the
> minimum of (requested pages, 4k).
>
> Result:
> fadvise experiment with FADV_WILLNEED on an x240 NUMA machine
> (32GB* 4G RAM) with a 1GB test file (12 iterations) yielded:
>
> Kernel     Avg      Stddev
> base      7.2963    1.10 %
> patched   7.2972    1.18 %
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---

Could you please let me know what you think of the patch?



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-01-22 10:53 ` Raghavendra K T
@ 2014-02-06 22:51   ` Andrew Morton
  -1 siblings, 0 replies; 64+ messages in thread
From: Andrew Morton @ 2014-02-06 22:51 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Fengguang Wu, David Cohen, Al Viro, Damien Ramonda, Jan Kara,
	Linus, linux-mm, linux-kernel

On Wed, 22 Jan 2014 16:23:45 +0530 Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> max_sane_readahead() returns zero on a CPU whose NUMA node has no local
> memory. Fix that by returning a sanitized number of pages, viz. the
> minimum of (requested pages, 4k).

um, fix what?  The changelog should describe the user-visible impact of
the current implementation.  There are a whole bunch of reasons for
this, but I tire of typing them in day after day after day.

> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -237,14 +237,32 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  	return ret;
>  }
>  
> +#define MAX_REMOTE_READAHEAD   4096UL
>  /*
>   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
>   * sensible upper limit.
>   */
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
> -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> +	unsigned long local_free_page;
> +	int nid;
> +
> +	nid = numa_node_id();
> +	if (node_present_pages(nid)) {
> +		/*
> +		 * We sanitize readahead size depending on free memory in
> +		 * the local node.
> +		 */
> +		local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
> +				 + node_page_state(nid, NR_FREE_PAGES);
> +		return min(nr, local_free_page / 2);
> +	}
> +	/*
> +	 * Readahead onto remote memory is better than no readahead when local
> +	 * numa node does not have memory. We limit the readahead to 4k
> +	 * pages though to avoid trashing page cache.
> +	 */
> +	return min(nr, MAX_REMOTE_READAHEAD);
>  }

Looks reasonable to me.  Please send along a fixed up changelog.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-06 22:51   ` Andrew Morton
@ 2014-02-06 22:58     ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-06 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Raghavendra K T, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel

On Thu, 6 Feb 2014, Andrew Morton wrote:

> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -237,14 +237,32 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >  	return ret;
> >  }
> >  
> > +#define MAX_REMOTE_READAHEAD   4096UL
> >  /*
> >   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> >   * sensible upper limit.
> >   */
> >  unsigned long max_sane_readahead(unsigned long nr)
> >  {
> > -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > +	unsigned long local_free_page;
> > +	int nid;
> > +
> > +	nid = numa_node_id();

If you're intending this to be cached for your calls into 
node_page_state() you need nid = ACCESS_ONCE(numa_node_id()).

What's the downside of just using numa_mem_id() here instead, which is 
usually "local memory to this memoryless node cpu", and forgetting about 
testing node_present_pages(nid)?

> > +	if (node_present_pages(nid)) {
> > +		/*
> > +		 * We sanitize readahead size depending on free memory in
> > +		 * the local node.
> > +		 */
> > +		local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
> > +				 + node_page_state(nid, NR_FREE_PAGES);
> > +		return min(nr, local_free_page / 2);
> > +	}
> > +	/*
> > +	 * Readahead onto remote memory is better than no readahead when local
> > +	 * numa node does not have memory. We limit the readahead to 4k
> > +	 * pages though to avoid trashing page cache.
> > +	 */
> > +	return min(nr, MAX_REMOTE_READAHEAD);
> >  }

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-06 22:58     ` David Rientjes
@ 2014-02-06 23:22       ` Andrew Morton
  -1 siblings, 0 replies; 64+ messages in thread
From: Andrew Morton @ 2014-02-06 23:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Raghavendra K T, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel

On Thu, 6 Feb 2014 14:58:21 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> > > +#define MAX_REMOTE_READAHEAD   4096UL
> > >  /*
> > >   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > >   * sensible upper limit.
> > >   */
> > >  unsigned long max_sane_readahead(unsigned long nr)
> > >  {
> > > -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > +	unsigned long local_free_page;
> > > +	int nid;
> > > +
> > > +	nid = numa_node_id();
> 
> If you're intending this to be cached for your calls into 
> node_page_state() you need nid = ACCESS_ONCE(numa_node_id()).

ugh.  That's too subtle and we didn't even document it.

We could put the ACCESS_ONCE inside numa_node_id() I assume but we
still have the same problem as smp_processor_id(): the numa_node_id()
return value is wrong as soon as you obtain it if running preemptibly. 

We could plaster Big Fat Warnings all over the place or we could treat
numa_node_id() and derivatives in the same way as smp_processor_id()
(which is a huge pain).  Or something else, but we've left a big hand
grenade here and Raghavendra won't be the last one to pull the pin?
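
For context, ACCESS_ONCE() in kernels of this era is essentially a volatile
cast (defined in include/linux/compiler.h). The user-space sketch below
shows the pattern; snapshot_nid() is a hypothetical helper for illustration
only:

```c
#include <assert.h>

/* Simplified form of the kernel's ACCESS_ONCE(): force exactly one
 * load of x and keep the compiler from caching or re-reading it. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/*
 * Take one snapshot of the current node id; later updates to *nid_ptr
 * (e.g. after migration to another cpu) do not change the value already
 * read into 'nid'.
 */
static int snapshot_nid(int *nid_ptr)
{
	int nid = ACCESS_ONCE(*nid_ptr);
	return nid;
}
```

Even with the single forced load, the snapshot can be stale by the time it
is used if the task runs preemptibly, which is the hand grenade being
discussed above.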

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-06 23:22       ` Andrew Morton
@ 2014-02-06 23:48         ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-06 23:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Raghavendra K T, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel

On Thu, 6 Feb 2014, Andrew Morton wrote:

> On Thu, 6 Feb 2014 14:58:21 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> 
> > > > +#define MAX_REMOTE_READAHEAD   4096UL
> > > >  /*
> > > >   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > > >   * sensible upper limit.
> > > >   */
> > > >  unsigned long max_sane_readahead(unsigned long nr)
> > > >  {
> > > > -	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > > -		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > > +	unsigned long local_free_page;
> > > > +	int nid;
> > > > +
> > > > +	nid = numa_node_id();
> > 
> > If you're intending this to be cached for your calls into 
> > node_page_state() you need nid = ACCESS_ONCE(numa_node_id()).
> 
> ugh.  That's too subtle and we didn't even document it.
> 
> We could put the ACCESS_ONCE inside numa_node_id() I assume but we
> still have the same problem as smp_processor_id(): the numa_node_id()
> return value is wrong as soon as you obtain it if running preemptibly. 
> 
> We could plaster Big Fat Warnings all over the place or we could treat
> numa_node_id() and derivatives in the same way as smp_processor_id()
> (which is a huge pain).  Or something else, but we've left a big hand
> grenade here and Raghavendra won't be the last one to pull the pin?
> 

Normally it wouldn't matter because there's no significant downside to it 
racing: things like mempolicies, which use numa_node_id() extensively, would 
result in, oops, a page allocation on the wrong node.

This stands out to me, though, because you're expecting the calculation to 
be correct for a specific node.

The patch is still wrong, though, it should just do

	int node = ACCESS_ONCE(numa_mem_id());
	return min(nr, (node_page_state(node, NR_INACTIVE_FILE) +
		        node_page_state(node, NR_FREE_PAGES)) / 2);

since we want to readahead based on the cpu's local node, the comment 
saying we're reading ahead onto "remote memory" is wrong since a 
memoryless node has local affinity to numa_mem_id().

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-06 23:48         ` David Rientjes
@ 2014-02-06 23:58           ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-06 23:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Raghavendra K T, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel

On Thu, 6 Feb 2014, David Rientjes wrote:

> > > > > +#define MAX_REMOTE_READAHEAD   4096UL

> Normally it wouldn't matter because there's no significant downside to it 
> racing, things like mempolicies which use numa_node_id() extensively would 
> result in, oops, a page allocation on the wrong node.
> 
> This stands out to me, though, because you're expecting the calculation to 
> be correct for a specific node.
> 
> The patch is still wrong, though, it should just do
> 
> 	int node = ACCESS_ONCE(numa_mem_id());
> 	return min(nr, (node_page_state(node, NR_INACTIVE_FILE) +
> 		        node_page_state(node, NR_FREE_PAGES)) / 2);
> 
> since we want to readahead based on the cpu's local node, the comment 
> saying we're reading ahead onto "remote memory" is wrong since a 
> memoryless node has local affinity to numa_mem_id().
> 

Oops, I forgot about MAX_REMOTE_READAHEAD, which needs to be factored in 
as well, but this handles the bound on the local node's statistics.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-06 23:58           ` David Rientjes
@ 2014-02-07 10:42             ` Raghavendra K T
  -1 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-02-07 10:42 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Fengguang Wu, David Cohen, Al Viro, Damien Ramonda, Jan Kara,
	Linus, linux-mm, linux-kernel

On 02/07/2014 05:28 AM, David Rientjes wrote:
> On Thu, 6 Feb 2014, David Rientjes wrote:
>
>>>>>> +#define MAX_REMOTE_READAHEAD   4096UL
>
>> Normally it wouldn't matter because there's no significant downside to it
>> racing, things like mempolicies which use numa_node_id() extensively would
>> result in, oops, a page allocation on the wrong node.
>>
>> This stands out to me, though, because you're expecting the calculation to
>> be correct for a specific node.
>>
>> The patch is still wrong, though, it should just do
>>
>> 	int node = ACCESS_ONCE(numa_mem_id());
>> 	return min(nr, (node_page_state(node, NR_INACTIVE_FILE) +
>> 		        node_page_state(node, NR_FREE_PAGES)) / 2);
>>
>> since we want to readahead based on the cpu's local node, the comment
>> saying we're reading ahead onto "remote memory" is wrong since a
>> memoryless node has local affinity to numa_mem_id().
>>
>
> Oops, forgot about the MAX_REMOTE_READAHEAD which needs to be factored in
> as well, but this handles the bound on local node's statistics.
>

So, following the discussion, the TODO list for my patch is:

1) Update the changelog with the user-visible impact of the patch
(Andrew's suggestion).
2) Add ACCESS_ONCE to numa_node_id().
3) Change the "readahead into remote memory" part of the documentation,
which is misleading.

(I feel no need to add numa_mem_id() since we would specifically limit
the readahead with MAX_REMOTE_READAHEAD in memoryless-CPU cases.)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-07 10:42             ` Raghavendra K T
@ 2014-02-07 20:41               ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-07 20:41 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel

On Fri, 7 Feb 2014, Raghavendra K T wrote:

> So following discussion TODO for my patch is:
> 
> 1) Update the changelog with user visible impact of the patch.
> (Andrew's suggestion)
> 2) Add ACCESS_ONCE to numa_node_id().
> 3) Change the "readahead into remote memory" part of the documentation
> which is misleading.
> 
> ( I feel no need to add numa_mem_id() since we would specifically limit
> the readahead with MAX_REMOTE_READAHEAD in memoryless cpu cases).
> 

I don't understand what you're saying, numa_mem_id() is local memory to 
current's cpu.  When a node consists only of cpus and not memory it is not 
true that all memory is then considered remote, you won't find that in any 
specification that defines memory affinity including the ACPI spec.  I can 
trivially define all cpus on my system to be on memoryless nodes and 
having that affect readahead behavior when, in fact, there is affinity 
would be ridiculous.
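David's point can be sketched in userspace: numa_mem_id() already names the node from which a memoryless-node cpu's allocations are satisfied. The two-node topology and fallback below are invented stand-ins (the kernel computes the fallback per-cpu at boot, not like this):

```c
#include <assert.h>

/* Hypothetical topology: node 0 has memory, node 1 has only cpus. */
static unsigned long present_pages[2] = { 262144, 0 };
static int current_cpu_node = 1;	/* this cpu sits on memoryless node 1 */

static int numa_node_id(void) { return current_cpu_node; }

/* Nearest node that actually has memory: local affinity, not "remote". */
static int numa_mem_id(void)
{
	int nid = numa_node_id();
	return present_pages[nid] ? nid : 0;
}
```

For a cpu on the memoryless node, numa_node_id() and numa_mem_id() differ, and the latter is the right node to base readahead statistics on.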

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-07 20:41               ` David Rientjes
@ 2014-02-10  8:21                 ` Raghavendra K T
  -1 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-02-10  8:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel

On 02/08/2014 02:11 AM, David Rientjes wrote:
> On Fri, 7 Feb 2014, Raghavendra K T wrote:
>> 3) Change the "readahead into remote memory" part of the documentation
>> which is misleading.
>>
>> ( I feel no need to add numa_mem_id() since we would specifically limit
>> the readahead with MAX_REMOTE_READAHEAD in memoryless cpu cases).
>>
>
> I don't understand what you're saying, numa_mem_id() is local memory to
> current's cpu.  When a node consists only of cpus and not memory it is not
> true that all memory is then considered remote, you won't find that in any
> specification that defines memory affinity including the ACPI spec.  I can
> trivially define all cpus on my system to be on memoryless nodes and
> having that affect readahead behavior when, in fact, there is affinity
> would be ridiculous.
>
As you rightly pointed out, I'll drop the "remote memory" term and use
something like:

"* Ensure readahead success on a memoryless node cpu. But we limit
  * the readahead to 4k pages to avoid thrashing the page cache." ..

Regarding ACCESS_ONCE, we would have to add it inside the function, and
even then nothing prevents us from running on a different cpu with a
different node (as Andrew pointed out), so I have not included it in the
current patch that I am posting.
Moreover, this case is hopefully not fatal, since the result is just a
hint for how much readahead we can do.

So there are several possible implementations:

(1) Use numa_mem_id(), apply the free-page limit, and use the 4k page
limit in all cases.
(Jan had reservations about this case.)

(2) For the normal case: use the free-memory calculation and do not
     apply the 4k limit (no change).
    For the memoryless-cpu case: use numa_mem_id() for a more accurate
     calculation of the limit, and also apply the 4k limit.

(3) For the normal case: use the free-memory calculation and do not
     apply the 4k limit (no change).
    For the memoryless case: apply the 4k page limit.

(4) Use numa_mem_id() and apply only the free-page limit.

So, I'll be resending the patch with changelog and comment changes
based on your and Andrew's feedback (the type (3) implementation).





^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5 RESEND] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-06 22:51   ` Andrew Morton
@ 2014-02-10  8:29     ` Raghavendra K T
  -1 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-02-10  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Raghavendra K T, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus, linux-mm, linux-kernel,
	rientjes, nacc

* Andrew Morton <akpm@linux-foundation.org> [2014-02-06 14:51:05]:

> On Wed, 22 Jan 2014 16:23:45 +0530 Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
> 
> 
> Looks reasonable to me.  Please send along a fixed up changelog.
> 

Hi Andrew,
Sorry, it took some time to get hold of a memoryless system and measure
the benefit. I am resending the patch with changelog and comment changes
based on your and David's suggestions.

----8<--- 
>From fc8186b5c33a34810a34f5aadd50082463117636 Mon Sep 17 00:00:00 2001
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Date: Mon, 25 Nov 2013 14:29:03 +0530
Subject: [RFC PATCH V5 RESEND] mm readahead: Fix readahead fail for no local
 memory and limit readahead pages

Currently max_sane_readahead() returns zero on a cpu with no local
memory node, which leads to readahead failure. Fix the readahead failure
by returning the minimum of (requested pages, 4k). Users running an
application that needs readahead, such as a streaming application, on a
memoryless cpu see a considerable boost in performance.

Result:
A fadvise experiment with FADV_WILLNEED on a PPC machine having a
memoryless cpu, with a 1GB testfile (12 iterations), yielded a 46.66%
improvement:

kernel             Avg        Stddev
base_ppc         11.946833     1.34%
patched_ppc       6.3720833    1.80%

The result below shows that there is no impact on the normal NUMA case
with the patch.

fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB testfile
32GB* 4G RAM  numa machine ( 12 iterations) yielded

Kernel     Avg      Stddev
base      7.2963    1.10 %
patched   7.2972    1.18 %

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 
 Changes in V5:
 - Updated the changelog with benefit seen (Andrew)
 - Drop the "remote memory" term in the comment, since a memoryless CPU
   will have affinity to numa_mem_id() (David)
 - Drop the 4k limit for normal readahead. (Jan Kara)

 Changes in V4:
 - Check for total node memory to decide whether we don't
   have local memory (Jan Kara)
 - Add 4k page limit on readahead for normal and remote readahead (Linus)
   (Linus suggestion was 16MB limit).

 Changes in V3:
 - Drop iterating over numa nodes that calculates total free pages (Linus)

 Agreed that we have no control over which numa node a readahead
 allocation lands on, and hence for remote readahead we cannot further
 sanitize based on the potential free pages of that node. We also do
 not want to iterate through all nodes to find the total free pages.

 Suggestions and comments welcome
 mm/readahead.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 0de2360..4c7343b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -233,14 +233,31 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	return 0;
 }
 
+#define MAX_REMOTE_READAHEAD   4096UL
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
-		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+	unsigned long local_free_page;
+	int nid;
+
+	nid = numa_node_id();
+	if (node_present_pages(nid)) {
+		/*
+		 * We sanitize readahead size depending on free memory in
+		 * the local node.
+		 */
+		local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
+				 + node_page_state(nid, NR_FREE_PAGES);
+		return min(nr, local_free_page / 2);
+	}
+	/*
+	 * Ensure readahead success on a memoryless node cpu. But we limit
+	 * the readahead to 4k pages to avoid thrashing the page cache.
+	 */
+	return min(nr, MAX_REMOTE_READAHEAD);
 }
 
 /*
-- 
1.7.11.7
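The patched logic can be exercised outside the kernel with stubbed-out node state. In the sketch below the topology, statistics, and helper stubs are invented stand-ins; only max_sane_readahead() mirrors the patch:

```c
#include <assert.h>

#define MAX_REMOTE_READAHEAD	4096UL

/* Invented stand-ins for the kernel's per-node state. */
enum { NR_INACTIVE_FILE, NR_FREE_PAGES };
static unsigned long stats[2][2] = {
	{ 3000, 5000 },		/* node 0: inactive file, free pages */
	{ 0, 0 },		/* node 1: memoryless */
};
static unsigned long present[2] = { 262144, 0 };
static int cur_node;

static int numa_node_id(void) { return cur_node; }
static unsigned long node_page_state(int nid, int item) { return stats[nid][item]; }
static unsigned long node_present_pages(int nid) { return present[nid]; }
static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }

/* Mirrors the patched max_sane_readahead(). */
unsigned long max_sane_readahead(unsigned long nr)
{
	unsigned long local_free_page;
	int nid = numa_node_id();

	if (node_present_pages(nid)) {
		/* Sanitize readahead size by free memory in the local node. */
		local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
				  + node_page_state(nid, NR_FREE_PAGES);
		return min_ul(nr, local_free_page / 2);
	}
	/* Memoryless node: allow up to 4k pages instead of returning 0. */
	return min_ul(nr, MAX_REMOTE_READAHEAD);
}
```

On the memoryless node the function now returns a capped nonzero bound where the unpatched code returned 0.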


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-10  8:21                 ` Raghavendra K T
@ 2014-02-10 10:05                   ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-10 10:05 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm, linux-kernel

On Mon, 10 Feb 2014, Raghavendra K T wrote:

> As you rightly pointed , I 'll drop remote memory term and use
> something like  :
> 
> "* Ensure readahead success on a memoryless node cpu. But we limit
>  * the readahead to 4k pages to avoid trashing page cache." ..
> 

I don't know how to proceed here after pointing it out twice, I'm afraid.

numa_mem_id() is local memory for a memoryless node.  node_present_pages() 
has no place in your patch.

> Regarding ACCESS_ONCE, since we will have to add
> inside the function and still there is nothing that could prevent us
> getting run on different cpu with a different node (as Andrew ponted), I have
> not included in current patch that I am posting.
> Moreover this case is hopefully not fatal since it is just a hint for
> readahead we can do.
> 

I have no idea why you think the ACCESS_ONCE() is a problem.  It's relying 
on gcc's implementation to ensure that the equation is done only for one 
node.  It has absolutely nothing to do with the fact that the process may 
be moved to another cpu upon returning or even immediately after the 
calculation is done.  Is it possible that node0 has 80% of memory free and 
node1 has 80% of memory inactive?  Well, then your equation doesn't work 
quite so well if the process moves.

There is no downside whatsoever to using it, I have no idea why you think 
it's better without it.
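The shape David is arguing for can be sketched in userspace with stubbed statistics (the stubs and numbers are invented; only the final expression mirrors his earlier suggestion):

```c
#include <assert.h>

enum { NR_INACTIVE_FILE, NR_FREE_PAGES };
static unsigned long stats[2] = { 3000, 5000 };	/* one node's made-up stats */

static int numa_mem_id(void) { return 0; }	/* stub: nearest memory node */
static unsigned long node_page_state(int nid, int item)
{
	(void)nid;			/* single-node stub ignores nid */
	return stats[item];
}
static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }

/* Read the node id once, then bound readahead by half of that node's
 * inactive-file plus free pages. */
unsigned long max_sane_readahead(unsigned long nr)
{
	int node = numa_mem_id();
	return min_ul(nr, (node_page_state(node, NR_INACTIVE_FILE) +
			   node_page_state(node, NR_FREE_PAGES)) / 2);
}
```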

> So there are many possible implementation:
> (1) use numa_mem_id(), apply freepage limit  and use 4k page limit for all
> case
> (Jan had reservation about this case)
> 
> (2)for normal case:    use free memory calculation and do not apply 4k
>     limit (no change).
>    for memoryless cpu case:  use numa_mem_id for more accurate
>     calculation of limit and also apply 4k limit.
> 
> (3) for normal case:   use free memory calculation and do not apply 4k
>     limit (no change).
>     for memoryless case: apply 4k page limit
> 
> (4) use numa_mem_id() and apply only free page limit..
> 
> So, I ll be resending the patch with changelog and comment changes
> based on your and Andrew's feedback (type (3) implementation).
> 

It's frustrating to have to say something three times.  Ask yourself what 
happens if ALL NODES WITH CPUS DO NOT HAVE MEMORY?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-10 10:05                   ` David Rientjes
@ 2014-02-10 12:25                     ` Raghavendra K T
  -1 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-02-10 12:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm, linux-kernel

On 02/10/2014 03:35 PM, David Rientjes wrote:
> On Mon, 10 Feb 2014, Raghavendra K T wrote:
>
>> As you rightly pointed , I 'll drop remote memory term and use
>> something like  :
>>
>> "* Ensure readahead success on a memoryless node cpu. But we limit
>>   * the readahead to 4k pages to avoid trashing page cache." ..
>>
>
> I don't know how to proceed here after pointing it out twice, I'm afraid.
>
> numa_mem_id() is local memory for a memoryless node.  node_present_pages()
> has no place in your patch.

Hi David, I appreciate your pointer regarding numa_mem_id(). I did not
mean to be dismissive .. sorry if the conversation came across that way.

So I understand you are suggesting implementations like the ones below.

1) I have no problem with the approach below and could post it in the
next version.
(But this does not include the 4k limit Linus mentioned applying.)

unsigned long max_sane_readahead(unsigned long nr)
{
	unsigned long local_free_page;
	int nid;

	nid = numa_mem_id();

	/*
	 * We sanitize readahead size depending on free memory in
	 * the local node.
	 */
	local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
			  + node_page_state(nid, NR_FREE_PAGES);
	return min(nr, local_free_page / 2);
}

2) I did not go for the version below because Honza (Jan Kara) had some
concerns about the 4k limit in the normal case, and since I am not the
expert, I was waiting for opinions.

unsigned long max_sane_readahead(unsigned long nr)
{
	unsigned long local_free_page, sane_nr;
	int nid;

	nid = numa_mem_id();
	/* limit the max readahead to 4k pages */
	sane_nr = min(nr, MAX_REMOTE_READAHEAD);

	/*
	 * We sanitize readahead size depending on free memory in
	 * the local node.
	 */
	local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
			  + node_page_state(nid, NR_FREE_PAGES);
	return min(sane_nr, local_free_page / 2);
}

>
>> Regarding ACCESS_ONCE, since we will have to add
>> inside the function and still there is nothing that could prevent us
>> getting run on different cpu with a different node (as Andrew ponted), I have
>> not included in current patch that I am posting.
>> Moreover this case is hopefully not fatal since it is just a hint for
>> readahead we can do.
>>
>
> I have no idea why you think the ACCESS_ONCE() is a problem.  It's relying
> on gcc's implementation to ensure that the equation is done only for one
> node.  It has absolutely nothing to do with the fact that the process may
> be moved to another cpu upon returning or even immediately after the
> calculation is done.  Is it possible that node0 has 80% of memory free and
> node1 has 80% of memory inactive?  Well, then your equation doesn't work
> quite so well if the process moves.
>
> There is no downside whatsoever to using it, I have no idea why you think
> it's better without it.

I have no problem introducing ACCESS_ONCE either, but I skipped it only
after I got the error below:

mm/readahead.c: In function ‘max_sane_readahead’:
mm/readahead.c:246: error: lvalue required as unary ‘&’ operand
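That error is expected: ACCESS_ONCE() takes the address of its argument, and a function call's return value is a temporary with no address. A hedged sketch of the distinction, where the macro is reproduced from kernels of that era and the numa_mem_id() stub is invented:

```c
#include <assert.h>

/* The kernel's ACCESS_ONCE of that era: casts through a volatile
 * pointer, so its argument must be addressable (an lvalue). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int stub_node = 3;
static int numa_mem_id(void) { return stub_node; }	/* invented stub */

int read_node_once(void)
{
	/* ACCESS_ONCE(numa_mem_id()) does not compile: &(numa_mem_id())
	 * is exactly the "lvalue required" error.  Calling the function
	 * once and storing the result in a local gives the same
	 * single-read guarantee, and the local is an lvalue if
	 * ACCESS_ONCE is still wanted on later reads. */
	int nid = numa_mem_id();
	return ACCESS_ONCE(nid);
}
```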

>
>> So there are many possible implementation:
>> (1) use numa_mem_id(), apply freepage limit  and use 4k page limit for all
>> case
>> (Jan had reservation about this case)
>>
>> (2)for normal case:    use free memory calculation and do not apply 4k
>>      limit (no change).
>>     for memoryless cpu case:  use numa_mem_id for more accurate
>>      calculation of limit and also apply 4k limit.
>>
>> (3) for normal case:   use free memory calculation and do not apply 4k
>>      limit (no change).
>>      for memoryless case: apply 4k page limit
>>
>> (4) use numa_mem_id() and apply only free page limit..
>>
>> So, I ll be resending the patch with changelog and comment changes
>> based on your and Andrew's feedback (type (3) implementation).
>>
>
> It's frustrating to have to say something three times.  Ask yourself what
> happens if ALL NODES WITH CPUS DO NOT HAVE MEMORY?
>

True, and this is the reason we could go for implementation (1) posted
above. It was just that I did not want to float a new version without
knowing whether Andrew was expecting a new patch or only changelog
updates.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-10 12:25                     ` Raghavendra K T
@ 2014-02-10 21:35                       ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-10 21:35 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm, linux-kernel

On Mon, 10 Feb 2014, Raghavendra K T wrote:

> So I understood that you are suggesting implementations like below
> 
> 1) I do not have problem with the below approach, I could post this in
> next version.
> ( But this did not include 4k limit Linus mentioned to apply)
> 
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         unsigned long local_free_page;
>         int nid;
> 
>         nid = numa_mem_id();
> 
>         /*
>          * We sanitize readahead size depending on free memory in
>          * the local node.
>          */
>         local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>                           + node_page_state(nid, NR_FREE_PAGES);
>         return min(nr, local_free_page / 2);
> }
> 
> 2) I did not go for below because Honza (Jan Kara) had some
> concerns for 4k limit for normal case, and since I am not
> the expert, I was waiting for opinions.
> 
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         unsigned long local_free_page, sane_nr;
>         int nid;
> 
>         nid = numa_mem_id();
> 	/* limit the max readahead to 4k pages */
> 	sane_nr = min(nr, MAX_REMOTE_READAHEAD);
> 
>         /*
>          * We sanitize readahead size depending on free memory in
>          * the local node.
>          */
>         local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>                           + node_page_state(nid, NR_FREE_PAGES);
>         return min(sane_nr, local_free_page / 2);
> }
> 

I have no opinion on the 4k-page limit; either of the above is just fine.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-10 21:35                       ` David Rientjes
@ 2014-02-13  7:07                         ` Raghavendra K T
  -1 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-02-13  7:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm, linux-kernel

On 02/11/2014 03:05 AM, David Rientjes wrote:
> On Mon, 10 Feb 2014, Raghavendra K T wrote:
>
>> So I understood that you are suggesting implementations like below
>>
>> 1) I do not have problem with the below approach, I could post this in
>> next version.
>> ( But this did not include 4k limit Linus mentioned to apply)
>>
>> unsigned long max_sane_readahead(unsigned long nr)
>> {
>>          unsigned long local_free_page;
>>          int nid;
>>
>>          nid = numa_mem_id();
>>
>>          /*
>>           * We sanitize readahead size depending on free memory in
>>           * the local node.
>>           */
>>          local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>>                            + node_page_state(nid, NR_FREE_PAGES);
>>          return min(nr, local_free_page / 2);
>> }
>>
>> 2) I did not go for below because Honza (Jan Kara) had some
>> concerns for 4k limit for normal case, and since I am not
>> the expert, I was waiting for opinions.
>>
>> unsigned long max_sane_readahead(unsigned long nr)
>> {
>>          unsigned long local_free_page, sane_nr;
>>          int nid;
>>
>>          nid = numa_mem_id();
>> 	/* limit the max readahead to 4k pages */
>> 	sane_nr = min(nr, MAX_REMOTE_READAHEAD);
>>
>>          /*
>>           * We sanitize readahead size depending on free memory in
>>           * the local node.
>>           */
>>          local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
>>                            + node_page_state(nid, NR_FREE_PAGES);
>>          return min(sane_nr, local_free_page / 2);
>> }
>>
>
> I have no opinion on the 4KB pages, either of the above is just fine.
>

I was able to test implementation (1) on the system where the readahead
problem occurred. Unfortunately, it did not help.

The reason seems to be the CONFIG_HAVE_MEMORYLESS_NODES dependency of
numa_mem_id(). The PPC machine on which I am facing the problem has a
topology like this:

numactl -H
---------
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
...
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 8 9 10 11 32 33 34 35 ...
node 1 size: 8071 MB
node 1 free: 2479 MB
node distances:
node   0   1
   0:  10  20
   1:  20  10

So it seems numa_mem_id() does not help for all configs.
Am I missing something?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-13  7:07                         ` Raghavendra K T
@ 2014-02-13  8:05                           ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-13  8:05 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, Nishanth Aravamudan,
	linux-mm, linux-kernel

On Thu, 13 Feb 2014, Raghavendra K T wrote:

> I was able to test (1) implementation on the system where readahead problem
> occurred. Unfortunately it did not help.
> 
> Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
> numa_mem_id(). The PPC machine I am facing problem has topology like
> this:
> 
> numactl -H
> ---------
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> ...
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 8 9 10 11 32 33 34 35 ...
> node 1 size: 8071 MB
> node 1 free: 2479 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> So it seems numa_mem_id() does not help for all the configs..
> Am I missing something ?
> 

You need the patch from http://marc.info/?l=linux-mm&m=139093411119013 
first.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-13  8:05                           ` David Rientjes
@ 2014-02-13 10:04                             ` Raghavendra K T
  -1 siblings, 0 replies; 64+ messages in thread
From: Raghavendra K T @ 2014-02-13 10:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, Nishanth Aravamudan,
	linux-mm, linux-kernel

On 02/13/2014 01:35 PM, David Rientjes wrote:
> On Thu, 13 Feb 2014, Raghavendra K T wrote:
>
>> I was able to test (1) implementation on the system where readahead problem
>> occurred. Unfortunately it did not help.
>>
>> Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
>> numa_mem_id(). The PPC machine I am facing problem has topology like
>> this:
[...]
>>
>> So it seems numa_mem_id() does not help for all the configs..
>> Am I missing something ?
>>
>
> You need the patch from http://marc.info/?l=linux-mm&m=139093411119013
> first.

Thanks David; unfortunately, even after applying that patch, I do not see
any improvement.

Interestingly, numa_mem_id() still seems to return the value of a
memoryless node. Maybe the per-cpu _numa_mem_ values are not set
properly. I need to dig further...


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-13  8:05                           ` David Rientjes
@ 2014-02-13 21:06                             ` Andrew Morton
  -1 siblings, 0 replies; 64+ messages in thread
From: Andrew Morton @ 2014-02-13 21:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Raghavendra K T, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, Nishanth Aravamudan,
	linux-mm, linux-kernel, Anton Blanchard, Benjamin Herrenschmidt

On Thu, 13 Feb 2014 00:05:31 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> On Thu, 13 Feb 2014, Raghavendra K T wrote:
> 
> > I was able to test (1) implementation on the system where readahead problem
> > occurred. Unfortunately it did not help.
> > 
> > Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
> > numa_mem_id(). The PPC machine I am facing problem has topology like
> > this:
> > 
> > numactl -H
> > ---------
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> > ...
> > node 0 size: 0 MB
> > node 0 free: 0 MB
> > node 1 cpus: 8 9 10 11 32 33 34 35 ...
> > node 1 size: 8071 MB
> > node 1 free: 2479 MB
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > So it seems numa_mem_id() does not help for all the configs..
> > Am I missing something ?
> > 
> 
> You need the patch from http://marc.info/?l=linux-mm&m=139093411119013 
> first.

That (un-signed-off) powerpc patch appears to be moribund.  What's up?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-13 21:06                             ` Andrew Morton
@ 2014-02-13 21:42                               ` Nishanth Aravamudan
  -1 siblings, 0 replies; 64+ messages in thread
From: Nishanth Aravamudan @ 2014-02-13 21:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Raghavendra K T, Fengguang Wu, David Cohen,
	Al Viro, Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm,
	linux-kernel, Anton Blanchard, Benjamin Herrenschmidt

On 13.02.2014 [13:06:43 -0800], Andrew Morton wrote:
> On Thu, 13 Feb 2014 00:05:31 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> 
> > On Thu, 13 Feb 2014, Raghavendra K T wrote:
> > 
> > > I was able to test (1) implementation on the system where readahead problem
> > > occurred. Unfortunately it did not help.
> > > 
> > > Reason seem to be that CONFIG_HAVE_MEMORYLESS_NODES dependency of
> > > numa_mem_id(). The PPC machine I am facing problem has topology like
> > > this:
> > > 
> > > numactl -H
> > > ---------
> > > available: 2 nodes (0-1)
> > > node 0 cpus: 0 1 2 3 4 5 6 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> > > ...
> > > node 0 size: 0 MB
> > > node 0 free: 0 MB
> > > node 1 cpus: 8 9 10 11 32 33 34 35 ...
> > > node 1 size: 8071 MB
> > > node 1 free: 2479 MB
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > > 
> > > So it seems numa_mem_id() does not help for all the configs..
> > > Am I missing something ?
> > > 
> > 
> > You need the patch from http://marc.info/?l=linux-mm&m=139093411119013 
> > first.
> 
> That (un-signed-off) powerpc patch appears to be moribund.  What's up?

Gah, thanks for catching that, Andrew; I'm not sure what went wrong. I've
appended my S-o-b. I've asked Ben to take a look, but I think he's still
catching up on his queue after travelling.

-Nish


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-13 10:04                             ` Raghavendra K T
@ 2014-02-13 22:41                               ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-13 22:41 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, Linus Torvalds, Nishanth Aravamudan,
	linux-mm, linux-kernel

On Thu, 13 Feb 2014, Raghavendra K T wrote:

> Thanks David, unfortunately even after applying that patch, I do not see
> the improvement.
> 
> Interestingly numa_mem_id() seem to still return the value of a
> memoryless node.
> May be  per cpu _numa_mem_ values are not set properly. Need to dig out ....
> 

I believe ppc will be relying on __build_all_zonelists() to set 
numa_mem_id() to be the proper node, and that relies on the ordering of 
the zonelist built for the memoryless node.  It would be very strange if 
local_memory_node() is returning a memoryless node because it is the first 
zone for node_zonelist(GFP_KERNEL) (why would a memoryless node be on the 
zonelist at all?).

I think the real problem is that build_all_zonelists() is only called at 
init when the boot cpu is online so it's only setting numa_mem_id() 
properly for the boot cpu.  Does it return a node with memory if you 
toggle /proc/sys/vm/numa_zonelist_order?  Do

	echo node > /proc/sys/vm/numa_zonelist_order
	echo zone > /proc/sys/vm/numa_zonelist_order
	echo default > /proc/sys/vm/numa_zonelist_order

and check if it returns the proper value at either point.  This will force 
build_all_zonelists() and numa_mem_id() to point to the proper node since 
all cpus are now online.

So the prerequisite for CONFIG_HAVE_MEMORYLESS_NODES is that there is an 
arch-specific set_numa_mem() that makes this mapping correct like ia64 
does.  If that's the case, then it's (1) completely undocumented and (2) 
Nishanth's patch is incomplete because anything that adds 
CONFIG_HAVE_MEMORYLESS_NODES needs to do the proper set_numa_mem() for it 
to be any different than numa_node_id().

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-13 22:41                               ` David Rientjes
@ 2014-02-14  0:14                                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 64+ messages in thread
From: Nishanth Aravamudan @ 2014-02-14  0:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Raghavendra K T, Andrew Morton, Fengguang Wu, David Cohen,
	Al Viro, Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm,
	linux-kernel

On 13.02.2014 [14:41:04 -0800], David Rientjes wrote:
> On Thu, 13 Feb 2014, Raghavendra K T wrote:
> 
> > Thanks David, unfortunately even after applying that patch, I do not see
> > the improvement.
> > 
> > Interestingly numa_mem_id() seem to still return the value of a
> > memoryless node.
> > May be  per cpu _numa_mem_ values are not set properly. Need to dig out ....
> > 
> 
> I believe ppc will be relying on __build_all_zonelists() to set 
> numa_mem_id() to be the proper node, and that relies on the ordering of 
> the zonelist built for the memoryless node.  It would be very strange if 
> local_memory_node() is returning a memoryless node because it is the first 
> zone for node_zonelist(GFP_KERNEL) (why would a memoryless node be on the 
> zonelist at all?).
> 
> I think the real problem is that build_all_zonelists() is only called at 
> init when the boot cpu is online so it's only setting numa_mem_id() 
> properly for the boot cpu.  Does it return a node with memory if you 
> toggle /proc/sys/vm/numa_zonelist_order?  Do
> 
> 	echo node > /proc/sys/vm/numa_zonelist_order
> 	echo zone > /proc/sys/vm/numa_zonelist_order
> 	echo default > /proc/sys/vm/numa_zonelist_order
> 
> and check if it returns the proper value at either point.  This will force 
> build_all_zonelists() and numa_mem_id() to point to the proper node since 
> all cpus are now online.
> 
> So the prerequisite for CONFIG_HAVE_MEMORYLESS_NODES is that there is an 
> arch-specific set_numa_mem() that makes this mapping correct like ia64 
> does.  If that's the case, then it's (1) completely undocumented and (2) 
> Nishanth's patch is incomplete because anything that adds 
> CONFIG_HAVE_MEMORYLESS_NODES needs to do the proper set_numa_mem() for it 
> to be any different than numa_node_id().

I'm working on this latter bit now. I tried to mirror ia64, but it looks
like it has CONFIG_USE_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
It seems like CONFIG_USE_PERCPU_NUMA_NODE_ID and
CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?

I'll keep working, but would appreciate any further insight.

-Nish


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
@ 2014-02-14  0:14                                 ` Nishanth Aravamudan
  0 siblings, 0 replies; 64+ messages in thread
From: Nishanth Aravamudan @ 2014-02-14  0:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Raghavendra K T, Andrew Morton, Fengguang Wu, David Cohen,
	Al Viro, Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm,
	linux-kernel

On 13.02.2014 [14:41:04 -0800], David Rientjes wrote:
> On Thu, 13 Feb 2014, Raghavendra K T wrote:
> 
> > Thanks David, unfortunately even after applying that patch, I do not see
> > the improvement.
> > 
> > Interestingly numa_mem_id() seem to still return the value of a
> > memoryless node.
> > May be  per cpu _numa_mem_ values are not set properly. Need to dig out ....
> > 
> 
> I believe ppc will be relying on __build_all_zonelists() to set 
> numa_mem_id() to be the proper node, and that relies on the ordering of 
> the zonelist built for the memoryless node.  It would be very strange if 
> local_memory_node() is returning a memoryless node because it is the first 
> zone for node_zonelist(GFP_KERNEL) (why would a memoryless node be on the 
> zonelist at all?).
> 
> I think the real problem is that build_all_zonelists() is only called at 
> init when the boot cpu is online so it's only setting numa_mem_id() 
> properly for the boot cpu.  Does it return a node with memory if you 
> toggle /proc/sys/vm/numa_zonelist_order?  Do
> 
> 	echo node > /proc/sys/vm/numa_zonelist_order
> 	echo zone > /proc/sys/vm/numa_zonelist_order
> 	echo default > /proc/sys/vm/numa_zonelist_order
> 
> and check if it returns the proper value at either point.  This will force 
> build_all_zonelists() and numa_mem_id() to point to the proper node since 
> all cpus are now online.
> 
> So the prerequisite for CONFIG_HAVE_MEMORYLESS_NODES is that there is an 
> arch-specific set_numa_mem() that makes this mapping correct like ia64 
> does.  If that's the case, then it's (1) completely undocumented and (2) 
> Nishanth's patch is incomplete because anything that adds 
> CONFIG_HAVE_MEMORYLESS_NODES needs to do the proper set_numa_mem() for it 
> to be any different than numa_node_id().

I'm working on this latter bit now. I tried to mirror ia64, but it looks
like they have CONFIG_USER_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
It seems like CONFIG_USER_PERCPU_NUMA_NODE_ID and
CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?

I'll keep working, but would appreciate any further insight.

-Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14  0:14                                 ` Nishanth Aravamudan
@ 2014-02-14  0:37                                   ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2014-02-14  0:37 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Raghavendra K T, Andrew Morton, Fengguang Wu,
	David Cohen, Al Viro, Damien Ramonda, Jan Kara, linux-mm,
	Linux Kernel Mailing List

Is this whole thread still just for the crazy and pointless
"max_sane_readahead()"?

Or is there some *real* reason we should care?

Because if it really is just for max_sane_readahead(), then for the
love of God, let us just do this

 unsigned long max_sane_readahead(unsigned long nr)
 {
        return min(nr, 128);
 }

and bury this whole idiotic thread.

Seriously, if your IO subsystem needs more than 512kB of read-ahead to
get full performance, your IO subsystem is just bad, and has latencies
that are long enough that you should just replace it. There's no real
reason to bend over backwards for that, and the whole complexity and
fragility of the insane "let's try to figure out how much memory this
node has" is just not worth it. The read-ahead should be small enough
that we should never need to care, and large enough that you get
reasonable IO throughput. The above does that.

Goddammit, there's a reason the whole "Gordian knot" parable is
famous. We're making this all too damn complicated FOR NO GOOD REASON.

Just cut the rope, people. Our aim is not to generate some kind of job
security by making things as complicated as possible.

                 Linus

On Thu, Feb 13, 2014 at 4:14 PM, Nishanth Aravamudan
<nacc@linux.vnet.ibm.com> wrote:
>
> I'm working on this latter bit now. I tried to mirror ia64, but it looks
> like they have CONFIG_USE_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
> It seems like CONFIG_USE_PERCPU_NUMA_NODE_ID and
> CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?
>
> I'll keep working, but would appreciate any further insight.
>
> -Nish
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14  0:37                                   ` Linus Torvalds
@ 2014-02-14  0:45                                     ` Andrew Morton
  -1 siblings, 0 replies; 64+ messages in thread
From: Andrew Morton @ 2014-02-14  0:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nishanth Aravamudan, David Rientjes, Raghavendra K T,
	Fengguang Wu, David Cohen, Al Viro, Damien Ramonda, Jan Kara,
	linux-mm, Linux Kernel Mailing List

On Thu, 13 Feb 2014 16:37:53 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

>  unsigned long max_sane_readahead(unsigned long nr)
>  {
>         return min(nr, 128);
>  }

I bet nobody will notice.

It should be 128*4096/PAGE_CACHE_SIZE so that variations in PAGE_SIZE
don't affect readahead behaviour.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14  0:37                                   ` Linus Torvalds
@ 2014-02-14  4:32                                     ` Nishanth Aravamudan
  -1 siblings, 0 replies; 64+ messages in thread
From: Nishanth Aravamudan @ 2014-02-14  4:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Raghavendra K T, Andrew Morton, Fengguang Wu,
	David Cohen, Al Viro, Damien Ramonda, Jan Kara, linux-mm,
	Linux Kernel Mailing List

Hi Linus,

On 13.02.2014 [16:37:53 -0800], Linus Torvalds wrote:
> Is this whole thread still just for the crazy and pointless
> "max_sane_readahead()"?
> 
> Or is there some *real* reason we should care?

There is an open issue on powerpc with memoryless nodes (inasmuch as we
can have them, but the kernel doesn't support it properly). There is a
separate discussion going on on linuxppc-dev about what is necessary for
CONFIG_HAVE_MEMORYLESS_NODES to be supported.

> Because if it really is just for max_sane_readahead(), then for the
> love of God, let us just do this
> 
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
>         return min(nr, 128);
>  }
> 
> and bury this whole idiotic thread.

Agreed that for the readahead case the above is probably more than
sufficient.

Apologies for hijacking the thread, my comments below were purely about
the memoryless node support, not about readahead specifically.

Thanks,
Nish

> Seriously, if your IO subsystem needs more than 512kB of read-ahead to
> get full performance, your IO subsystem is just bad, and has latencies
> that are long enough that you should just replace it. There's no real
> reason to bend over backwards for that, and the whole complexity and
> fragility of the insane "let's try to figure out how much memory this
> node has" is just not worth it. The read-ahead should be small enough
> that we should never need to care, and large enough that you get
> reasonable IO throughput. The above does that.
> 
> Goddammit, there's a reason the whole "Gordian knot" parable is
> famous. We're making this all too damn complicated FOR NO GOOD REASON.
> 
> Just cut the rope, people. Our aim is not to generate some kind of job
> security by making things as complicated as possible.
> 
>                  Linus
> 
> On Thu, Feb 13, 2014 at 4:14 PM, Nishanth Aravamudan
> <nacc@linux.vnet.ibm.com> wrote:
> >
> > I'm working on this latter bit now. I tried to mirror ia64, but it looks
> > like they have CONFIG_USE_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
> > It seems like CONFIG_USE_PERCPU_NUMA_NODE_ID and
> > CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?
> >
> > I'll keep working, but would appreciate any further insight.
> >
> > -Nish
> >
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-13 22:41                               ` David Rientjes
@ 2014-02-14  5:47                                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 64+ messages in thread
From: Nishanth Aravamudan @ 2014-02-14  5:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Raghavendra K T, Andrew Morton, Fengguang Wu, David Cohen,
	Al Viro, Damien Ramonda, Jan Kara, Linus Torvalds, linux-mm,
	linux-kernel

On 13.02.2014 [14:41:04 -0800], David Rientjes wrote:
> On Thu, 13 Feb 2014, Raghavendra K T wrote:
> 
> > Thanks David, unfortunately even after applying that patch, I do not see
> > the improvement.
> > 
> > Interestingly, numa_mem_id() seems to still return the value of a
> > memoryless node.
> > Maybe the per-cpu _numa_mem_ values are not set properly. Need to dig further....
> > 
> 
> I believe ppc will be relying on __build_all_zonelists() to set 
> numa_mem_id() to be the proper node, and that relies on the ordering of 
> the zonelist built for the memoryless node.  It would be very strange if 
> local_memory_node() is returning a memoryless node because it is the first 
> zone for node_zonelist(GFP_KERNEL) (why would a memoryless node be on the 
> zonelist at all?).
> 
> I think the real problem is that build_all_zonelists() is only called at 
> init when the boot cpu is online so it's only setting numa_mem_id() 
> properly for the boot cpu.  Does it return a node with memory if you 
> toggle /proc/sys/vm/numa_zonelist_order?  Do
> 
> 	echo node > /proc/sys/vm/numa_zonelist_order
> 	echo zone > /proc/sys/vm/numa_zonelist_order
> 	echo default > /proc/sys/vm/numa_zonelist_order
> 
> and check if it returns the proper value at either point.  This will force 
> build_all_zonelists() and numa_mem_id() to point to the proper node since 
> all cpus are now online.

Yep, after massaging the code to allow CONFIG_USE_PERCPU_NUMA_NODE_ID,
you're right that the memory node is wrong. The cpu node is right (they
are all on node 0), but that could be lucky. The memory node is right
for the boot cpu. I did notice that some CPUs now think the cpu node is
1, which is wrong.

> So the prerequisite for CONFIG_HAVE_MEMORYLESS_NODES is that there is an 
> arch-specific set_numa_mem() that makes this mapping correct like ia64 
> does.  If that's the case, then it's (1) completely undocumented and (2) 
> Nishanth's patch is incomplete because anything that adds 
> CONFIG_HAVE_MEMORYLESS_NODES needs to do the proper set_numa_mem() for it 
> to be any different than numa_node_id().

I'll work on getting the set_numa_mem() and set_numa_node() correct for
powerpc.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14  0:37                                   ` Linus Torvalds
@ 2014-02-14  7:43                                     ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2014-02-14  7:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nishanth Aravamudan, David Rientjes, Raghavendra K T,
	Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, Jan Kara, linux-mm, Linux Kernel Mailing List

On Thu 13-02-14 16:37:53, Linus Torvalds wrote:
> Is this whole thread still just for the crazy and pointless
> "max_sane_readahead()"?
> 
> Or is there some *real* reason we should care?
> 
> Because if it really is just for max_sane_readahead(), then for the
> love of God, let us just do this
> 
>  unsigned long max_sane_readahead(unsigned long nr)
>  {
>         return min(nr, 128);
>  }
> 
> and bury this whole idiotic thread.
  max_sane_readahead() is also used for limiting amount of readahead for
[fm]advise(2) WILLNEED, and that is used e.g. by a dynamic linker to preload
shared libraries into memory. So I'm convinced this usecase *will* notice
the change - effectively we limit preloading of shared libraries to the
first 512KB in the file but libraries get accessed in a rather random manner.

Maybe limits for WILLNEED and for standard readahead should be different.
It makes sense to me, and people seem to keep forgetting that
max_sane_readahead() limits also WILLNEED preloading.

								Honza

> On Thu, Feb 13, 2014 at 4:14 PM, Nishanth Aravamudan
> <nacc@linux.vnet.ibm.com> wrote:
> >
> > I'm working on this latter bit now. I tried to mirror ia64, but it looks
> > like they have CONFIG_USE_PERCPU_NUMA_NODE_ID, which powerpc doesn't.
> > It seems like CONFIG_USE_PERCPU_NUMA_NODE_ID and
> > CONFIG_HAVE_MEMORYLESS_NODES should be tied together in Kconfig?
> >
> > I'll keep working, but would appreciate any further insight.
> >
> > -Nish
> >
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14  4:32                                     ` Nishanth Aravamudan
@ 2014-02-14 10:54                                       ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-14 10:54 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Linus Torvalds, Raghavendra K T, Andrew Morton, Fengguang Wu,
	David Cohen, Al Viro, Damien Ramonda, Jan Kara, linux-mm,
	Linux Kernel Mailing List

On Thu, 13 Feb 2014, Nishanth Aravamudan wrote:

> There is an open issue on powerpc with memoryless nodes (inasmuch as we
> can have them, but the kernel doesn't support it properly). There is a
> separate discussion going on on linuxppc-dev about what is necessary for
> CONFIG_HAVE_MEMORYLESS_NODES to be supported.
> 

Yeah, and this is causing problems with the slub allocator as well.

> Apologies for hijacking the thread, my comments below were purely about
> the memoryless node support, not about readahead specifically.
> 

Neither you nor Raghavendra have any reason to apologize to anybody.  
Memoryless node support on powerpc isn't working very well right now and 
you're trying to fix it, that fix is needed both in this thread and in 
your fixes for slub.  It's great to see both of you working hard on your 
platform to make it work the best.

I think what you'll need to do in addition to your 
CONFIG_HAVE_MEMORYLESS_NODES fix, which is obviously needed, is to enable 
CONFIG_USE_PERCPU_NUMA_NODE_ID for the same NUMA configurations and then 
use set_numa_node() or set_cpu_numa_node() to properly store the mapping 
between cpu and node rather than numa_cpu_lookup_table.  Then you should 
be able to do away with your own implementation of cpu_to_node().

After that, I think it should be as simple as doing

	set_numa_node(cpu_to_node(cpu));
	set_numa_mem(local_memory_node(cpu_to_node(cpu)));

probably before taking vector_lock in smp_callin().  The cpu-to-node 
mapping should be done much earlier in boot while the nodes are being 
initialized, I don't think there should be any problem there.

While you're at it, I think you'll also want to add a comment that
setting up the cpu sibling mask must be done before the smp_wmb() before 
notify_cpu_starting(cpu), it's crucial to have before the cpu is brought 
online and why we need the store memory barrier.

But, again, please don't apologize for developing your architecture and 
attacking bugs as they arise, it's very admirable and I'm happy to help in 
any way that I can.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14 10:54                                       ` David Rientjes
@ 2014-02-17 19:28                                         ` Nishanth Aravamudan
  -1 siblings, 0 replies; 64+ messages in thread
From: Nishanth Aravamudan @ 2014-02-17 19:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Raghavendra K T, Andrew Morton, Fengguang Wu,
	David Cohen, Al Viro, Damien Ramonda, Jan Kara, linux-mm,
	Linux Kernel Mailing List

On 14.02.2014 [02:54:06 -0800], David Rientjes wrote:
> On Thu, 13 Feb 2014, Nishanth Aravamudan wrote:
> 
> > There is an open issue on powerpc with memoryless nodes (inasmuch as we
> > can have them, but the kernel doesn't support it properly). There is a
> > separate discussion going on on linuxppc-dev about what is necessary for
> > CONFIG_HAVE_MEMORYLESS_NODES to be supported.
> > 
> 
> Yeah, and this is causing problems with the slub allocator as well.
> 
> > Apologies for hijacking the thread, my comments below were purely about
> > the memoryless node support, not about readahead specifically.
> > 
> 
> Neither you nor Raghavendra have any reason to apologize to anybody.  
> Memoryless node support on powerpc isn't working very well right now and 
> you're trying to fix it, that fix is needed both in this thread and in 
> your fixes for slub.  It's great to see both of you working hard on your 
> platform to make it work the best.
> 
> I think what you'll need to do in addition to your 
> CONFIG_HAVE_MEMORYLESS_NODES fix, which is obviously needed, is to enable 
> CONFIG_USE_PERCPU_NUMA_NODE_ID for the same NUMA configurations and then 
> use set_numa_node() or set_cpu_numa_node() to properly store the mapping 
> between cpu and node rather than numa_cpu_lookup_table.  Then you should 
> be able to do away with your own implementation of cpu_to_node().
> 
> After that, I think it should be as simple as doing
> 
> 	set_numa_node(cpu_to_node(cpu));
> 	set_numa_mem(local_memory_node(cpu_to_node(cpu)));
> 
> probably before taking vector_lock in smp_callin().  The cpu-to-node 
> mapping should be done much earlier in boot while the nodes are being 
> initialized, I don't think there should be any problem there.

vector_lock/smp_callin are ia64-specific things, I believe? I think the
equivalent is just in start_secondary() for powerpc? (which in fact is
what calls smp_callin on powerpc).

Here is what I'm running into now:

setup_arch ->
	do_init_bootmem ->
		cpu_numa_callback ->
			numa_setup_cpu ->
				map_cpu_to_node -> 
					update_numa_cpu_lookup_table

This currently updates the powerpc-specific numa_cpu_lookup_table. I
would like to update that function to use set_cpu_numa_node() and
set_cpu_numa_mem(), but local_memory_node() is not yet functional
because build_all_zonelists is called later in start_kernel. Would it
make sense for first_zones_zonelist() to return NUMA_NO_NODE if we
don't have a zone?

Thanks,
Nish


^ permalink raw reply	[flat|nested] 64+ messages in thread


* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14  7:43                                     ` Jan Kara
@ 2014-02-17 22:57                                       ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2014-02-17 22:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nishanth Aravamudan, David Rientjes, Raghavendra K T,
	Andrew Morton, Fengguang Wu, David Cohen, Al Viro,
	Damien Ramonda, linux-mm, Linux Kernel Mailing List

On Thu, Feb 13, 2014 at 11:43 PM, Jan Kara <jack@suse.cz> wrote:
>
>   max_sane_readahead() is also used for limiting the amount of readahead for
> [fm]advise(2) WILLNEED, and that is used e.g. by a dynamic linker to preload
> shared libraries into memory. So I'm convinced this usecase *will* notice
> the change - effectively we limit preloading of shared libraries to the
> first 512KB of the file, but libraries get accessed in a rather random manner.
>
> Maybe the limits for WILLNEED and for standard readahead should be different.
> It makes sense to me, and people seem to keep forgetting that
> max_sane_readahead() also limits WILLNEED preloading.

Good point. But it's probably overly complex to have two different
limits. The "512kB" thing was entirely random - the only real issue is
that it should be small enough that it won't be a problem on any
reasonable current machines, and big enough to get perfectly fine IO
patterns unless your IO subsystem sucks so bad that it's not even
worth worrying about.

If we just add a third requirement that it be "big enough that
reasonable uses of [fm]advise() will work well enough", then your
shared library example might well be grounds for saying "let's just do
2MB instead". That's still small enough that it won't really hurt any
modern machines.

And if it means that WILLNEED won't necessarily always read the whole
file for big files - well, we never guaranteed that to begin with.

                                Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread


* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-14  4:32                                     ` Nishanth Aravamudan
@ 2014-02-17 22:59                                       ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2014-02-17 22:59 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Raghavendra K T, Andrew Morton, Fengguang Wu,
	David Cohen, Al Viro, Damien Ramonda, Jan Kara, linux-mm,
	Linux Kernel Mailing List

On Thu, Feb 13, 2014 at 8:32 PM, Nishanth Aravamudan
<nacc@linux.vnet.ibm.com> wrote:
>
> Agreed that for the readahead case the above is probably more than
> sufficient.
>
> Apologies for hijacking the thread, my comments below were purely about
> the memoryless node support, not about readahead specifically.

Ok, no problem. I just wanted to make sure that we're not going down
some fragile rat's nest just for something silly that wasn't worth it.

                 Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread


* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-17 19:28                                         ` Nishanth Aravamudan
@ 2014-02-17 23:14                                           ` David Rientjes
  -1 siblings, 0 replies; 64+ messages in thread
From: David Rientjes @ 2014-02-17 23:14 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Linus Torvalds, Raghavendra K T, Andrew Morton, Fengguang Wu,
	David Cohen, Al Viro, Damien Ramonda, Jan Kara, linux-mm,
	Linux Kernel Mailing List

On Mon, 17 Feb 2014, Nishanth Aravamudan wrote:

> Here is what I'm running into now:
> 
> setup_arch ->
> 	do_init_bootmem ->
> 		cpu_numa_callback ->
> 			numa_setup_cpu ->
> 				map_cpu_to_node -> 
> 					update_numa_cpu_lookup_table
> 
> Which currently updates the powerpc-specific numa_cpu_lookup_table. I
> would like to update that function to use set_cpu_numa_node() and
> set_cpu_numa_mem(), but local_memory_node() is not yet functional
> because build_all_zonelists is called later in start_kernel. Would it
> make sense for first_zones_zonelist() to return NUMA_NO_NODE if we
> don't have a zone?
> 

Hmm, I don't think we'll want to modify the generic first_zones_zonelist() 
for a special case that is only true during boot.  Instead, would it make 
sense to modify numa_setup_cpu() to use the generic cpu_to_node() instead 
of using a powerpc mapping and then do the set_cpu_numa_mem() after 
paging_init() when the zonelists will have been built and zones without 
present pages are properly excluded?

^ permalink raw reply	[flat|nested] 64+ messages in thread


* Re: [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages
  2014-02-17 23:14                                           ` David Rientjes
@ 2014-02-18  1:31                                             ` Nishanth Aravamudan
  -1 siblings, 0 replies; 64+ messages in thread
From: Nishanth Aravamudan @ 2014-02-18  1:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Raghavendra K T, Andrew Morton, Fengguang Wu,
	David Cohen, Al Viro, Damien Ramonda, Jan Kara, linux-mm,
	Linux Kernel Mailing List

On 17.02.2014 [15:14:06 -0800], David Rientjes wrote:
> On Mon, 17 Feb 2014, Nishanth Aravamudan wrote:
> 
> > Here is what I'm running into now:
> > 
> > setup_arch ->
> > 	do_init_bootmem ->
> > 		cpu_numa_callback ->
> > 			numa_setup_cpu ->
> > 				map_cpu_to_node -> 
> > 					update_numa_cpu_lookup_table
> > 
> > Which currently updates the powerpc-specific numa_cpu_lookup_table. I
> > would like to update that function to use set_cpu_numa_node() and
> > set_cpu_numa_mem(), but local_memory_node() is not yet functional
> > because build_all_zonelists is called later in start_kernel. Would it
> > make sense for first_zones_zonelist() to return NUMA_NO_NODE if we
> > don't have a zone?
> > 
> 
> Hmm, I don't think we'll want to modify the generic first_zones_zonelist() 
> for a special case that is only true during boot.  Instead, would it make 
> sense to modify numa_setup_cpu() to use the generic cpu_to_node() instead 
> of using a powerpc mapping and then do the set_cpu_numa_mem() after 
> paging_init() when the zonelists will have been built and zones without 
> present pages are properly excluded?

Sorry, I was unclear in my e-mail. I meant to modify
local_memory_node(), not first_zones_zonelist(). Well, it only needs the
following, I think?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3758a0..5de4337 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3650,6 +3650,8 @@ int local_memory_node(int node)
                                   gfp_zone(GFP_KERNEL),
                                   NULL,
                                   &zone);
+       if (!zone)
+               return NUMA_NO_NODE;
        return zone->node;
 }
 #endif

I think that condition should only happen during boot -- maybe even
deserving of an unlikely() -- but I don't think the above is considered a
hot-path. If the above isn't palatable, I can look into your suggestion
instead.

Thanks,
Nish


^ permalink raw reply related	[flat|nested] 64+ messages in thread


end of thread, other threads:[~2014-02-18  1:31 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-22 10:53 [RFC PATCH V5] mm readahead: Fix readahead fail for no local memory and limit readahead pages Raghavendra K T
2014-02-03  8:30 ` Raghavendra K T
2014-02-06 22:51 ` Andrew Morton
2014-02-06 22:58   ` David Rientjes
2014-02-06 23:22     ` Andrew Morton
2014-02-06 23:48       ` David Rientjes
2014-02-06 23:58         ` David Rientjes
2014-02-07 10:42           ` Raghavendra K T
2014-02-07 20:41             ` David Rientjes
2014-02-10  8:21               ` Raghavendra K T
2014-02-10 10:05                 ` David Rientjes
2014-02-10 12:25                   ` Raghavendra K T
2014-02-10 21:35                     ` David Rientjes
2014-02-13  7:07                       ` Raghavendra K T
2014-02-13  8:05                         ` David Rientjes
2014-02-13 10:04                           ` Raghavendra K T
2014-02-13 22:41                             ` David Rientjes
2014-02-14  0:14                               ` Nishanth Aravamudan
2014-02-14  0:37                                 ` Linus Torvalds
2014-02-14  0:45                                   ` Andrew Morton
2014-02-14  4:32                                   ` Nishanth Aravamudan
2014-02-14 10:54                                     ` David Rientjes
2014-02-17 19:28                                       ` Nishanth Aravamudan
2014-02-17 23:14                                         ` David Rientjes
2014-02-18  1:31                                           ` Nishanth Aravamudan
2014-02-17 22:59                                     ` Linus Torvalds
2014-02-14  7:43                                   ` Jan Kara
2014-02-17 22:57                                     ` Linus Torvalds
2014-02-14  5:47                               ` Nishanth Aravamudan
2014-02-13 21:06                           ` Andrew Morton
2014-02-13 21:42                             ` Nishanth Aravamudan
2014-02-10  8:29   ` [RFC PATCH V5 RESEND] " Raghavendra K T
