linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] mm: reduce /proc/pagetypeinfo ovehead
@ 2019-10-25  7:26 Michal Hocko
  2019-10-25  7:26 ` [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
  2019-10-25  7:26 ` [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Michal Hocko @ 2019-10-25  7:26 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML

Hi,
Waiman Long has reported [1] that reading /proc/pagetypeinfo can
severely interfere with the system and it might lead even to hard lockup
detector firing up on a very large machines. Nevertheless small machines
are not completely fine either because the operation requires to take
the zone->lock IRQ safe spinlock and thus to interfere with both the IRQ
delivery and the page allocator. The file is world readable which makes
this kinda bad.

The immediate danger is addressed by making the file root readable only.
This is a debugging aid so general audience shouldn't require it for a
general operation. This is done in the first patch.

The potentially excessive time spent for free_list iteration is handled
by capping the iteration loop. This should be fine for existing usecases
because low numbers are usually of the primary interest. This is
implemented in patch 2.

I am reposting these two patches with dropped RFC (previously posted
[2]) and asking for inclusion. I have also dropped Mel's Ack from the
second patch because there were quite some changes since he reviewed. 

[1] http://lkml.kernel.org/r/20191022162156.17316-1-longman@redhat.com
[2] http://lkml.kernel.org/r/20191023102737.32274-1-mhocko@kernel.org


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-25  7:26 [PATCH 0/2] mm: reduce /proc/pagetypeinfo ovehead Michal Hocko
@ 2019-10-25  7:26 ` Michal Hocko
  2019-10-25  7:33   ` Vlastimil Babka
  2019-10-25  8:18   ` David Hildenbrand
  2019-10-25  7:26 ` [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
  1 sibling, 2 replies; 14+ messages in thread
From: Michal Hocko @ 2019-10-25  7:26 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko, David Rientjes

From: Michal Hocko <mhocko@suse.com>

/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt to fragmentation. It is not very useful for
any other use so normal users really do not need to read this file.

Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that
a) interferes with the page allocator and its users and b) can lead to
hard lockups on large machines which have very long free_list.

Reduce both issues by simply not exporting the file to regular users.

Reported-by: Waiman Long <longman@redhat.com>
Cc: stable
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..4e885ecd44d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
-	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
+	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
 	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
 	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25  7:26 [PATCH 0/2] mm: reduce /proc/pagetypeinfo ovehead Michal Hocko
  2019-10-25  7:26 ` [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
@ 2019-10-25  7:26 ` Michal Hocko
  2019-10-25  7:35   ` Vlastimil Babka
                     ` (3 more replies)
  1 sibling, 4 replies; 14+ messages in thread
From: Michal Hocko @ 2019-10-25  7:26 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here. The primary reason to look at the outuput is to see
how pageblocks are spread among different migratetypes and low number of
pages is much more interesting therefore putting a bound on the number
of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
[...]
Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648

instead of
Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648

The limit has been chosen arbitrary and it is a subject of a future
change should there be a need for that.

While we are at it, also drop the zone lock after each free_list
iteration which will help with the IRQ and page allocator responsiveness
even further as the IRQ lock held time is always bound to those 100k
pages.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmstat.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..ddb89f4e0486 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			unsigned long freecount = 0;
 			struct free_area *area;
 			struct list_head *curr;
+			bool overflow = false;
 
 			area = &(zone->free_area[order]);
 
-			list_for_each(curr, &area->free_list[mtype])
-				freecount++;
-			seq_printf(m, "%6lu ", freecount);
+			list_for_each(curr, &area->free_list[mtype]) {
+				/*
+				 * Cap the free_list iteration because it might
+				 * be really large and we are under a spinlock
+				 * so a long time spent here could trigger a
+				 * hard lockup detector. Anyway this is a
+				 * debugging tool so knowing there is a handful
+				 * of pages in this order should be more than
+				 * sufficient
+				 */
+				if (++freecount >= 100000) {
+					overflow = true;
+					break;
+				}
+			}
+			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
+			spin_unlock_irq(&zone->lock);
+			cond_resched();
+			spin_lock_irq(&zone->lock);
 		}
 		seq_putc(m, '\n');
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-25  7:26 ` [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
@ 2019-10-25  7:33   ` Vlastimil Babka
  2019-10-25 21:58     ` Andrew Morton
  2019-10-25  8:18   ` David Hildenbrand
  1 sibling, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2019-10-25  7:33 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko, David Rientjes, stable

On 10/25/19 9:26 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state wrt to fragmentation. It is not very useful for
> any other use so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable

Cc: <stable@vger.kernel.org>

> Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Waiman Long <longman@redhat.com>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..4e885ecd44d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
>  #endif
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
> -	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
> +	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
>  	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
>  	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
>  #endif
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25  7:26 ` [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
@ 2019-10-25  7:35   ` Vlastimil Babka
  2019-10-25  8:21   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2019-10-25  7:35 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko

On 10/25/19 9:26 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
> 
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the outuput is to see
> how pageblocks are spread among different migratetypes and low number of
> pages is much more interesting therefore putting a bound on the number
> of pages on the free_list sounds like a reasonable tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> 
> The limit has been chosen arbitrary and it is a subject of a future
> change should there be a need for that.
> 
> While we are at it, also drop the zone lock after each free_list
> iteration which will help with the IRQ and page allocator responsiveness
> even further as the IRQ lock held time is always bound to those 100k
> pages.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmstat.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..ddb89f4e0486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  			unsigned long freecount = 0;
>  			struct free_area *area;
>  			struct list_head *curr;
> +			bool overflow = false;
>  
>  			area = &(zone->free_area[order]);
>  
> -			list_for_each(curr, &area->free_list[mtype])
> -				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			list_for_each(curr, &area->free_list[mtype]) {
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than
> +				 * sufficient
> +				 */
> +				if (++freecount >= 100000) {
> +					overflow = true;
> +					break;
> +				}
> +			}
> +			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
>  		}
>  		seq_putc(m, '\n');
>  	}
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-25  7:26 ` [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
  2019-10-25  7:33   ` Vlastimil Babka
@ 2019-10-25  8:18   ` David Hildenbrand
  1 sibling, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2019-10-25  8:18 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko, David Rientjes

On 25.10.19 09:26, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state wrt to fragmentation. It is not very useful for
> any other use so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable
> Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Waiman Long <longman@redhat.com>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   mm/vmstat.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..4e885ecd44d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
>   #endif
>   #ifdef CONFIG_PROC_FS
>   	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
> -	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
> +	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
>   	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
>   	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
>   #endif
> 

Looks good too me (the ack list is already long enough :) )

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25  7:26 ` [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
  2019-10-25  7:35   ` Vlastimil Babka
@ 2019-10-25  8:21   ` David Hildenbrand
  2019-10-25 12:52   ` Rafael Aquini
  2019-10-25 21:08   ` David Rientjes
  3 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2019-10-25  8:21 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko

On 25.10.19 09:26, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
> 
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the outuput is to see
> how pageblocks are spread among different migratetypes and low number of
> pages is much more interesting therefore putting a bound on the number
> of pages on the free_list sounds like a reasonable tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> 
> The limit has been chosen arbitrary and it is a subject of a future
> change should there be a need for that.
> 
> While we are at it, also drop the zone lock after each free_list
> iteration which will help with the IRQ and page allocator responsiveness
> even further as the IRQ lock held time is always bound to those 100k
> pages.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   mm/vmstat.c | 23 ++++++++++++++++++++---
>   1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..ddb89f4e0486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>   			unsigned long freecount = 0;
>   			struct free_area *area;
>   			struct list_head *curr;
> +			bool overflow = false;
>   
>   			area = &(zone->free_area[order]);
>   
> -			list_for_each(curr, &area->free_list[mtype])
> -				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			list_for_each(curr, &area->free_list[mtype]) {
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than

"of this order" ?

> +				 * sufficient

s/sufficient"/sufficient." ?

> +				 */
> +				if (++freecount >= 100000) {
> +					overflow = true;
> +					break;
> +				}
> +			}
> +			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
>   		}
>   		seq_putc(m, '\n');
>   	}
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25  7:26 ` [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
  2019-10-25  7:35   ` Vlastimil Babka
  2019-10-25  8:21   ` David Hildenbrand
@ 2019-10-25 12:52   ` Rafael Aquini
  2019-10-25 21:08   ` David Rientjes
  3 siblings, 0 replies; 14+ messages in thread
From: Rafael Aquini @ 2019-10-25 12:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, linux-mm, LKML,
	Michal Hocko

On Fri, Oct 25, 2019 at 09:26:10AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
> 
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the outuput is to see
> how pageblocks are spread among different migratetypes and low number of
> pages is much more interesting therefore putting a bound on the number
> of pages on the free_list sounds like a reasonable tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> 
> The limit has been chosen arbitrary and it is a subject of a future
> change should there be a need for that.
> 
> While we are at it, also drop the zone lock after each free_list
> iteration which will help with the IRQ and page allocator responsiveness
> even further as the IRQ lock held time is always bound to those 100k
> pages.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..ddb89f4e0486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  			unsigned long freecount = 0;
>  			struct free_area *area;
>  			struct list_head *curr;
> +			bool overflow = false;
>  
>  			area = &(zone->free_area[order]);
>  
> -			list_for_each(curr, &area->free_list[mtype])
> -				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			list_for_each(curr, &area->free_list[mtype]) {
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than
> +				 * sufficient
> +				 */
> +				if (++freecount >= 100000) {
> +					overflow = true;
> +					break;
> +				}
> +			}
> +			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
>  		}
>  		seq_putc(m, '\n');
>  	}
> -- 
> 2.20.1
> 
Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25  7:26 ` [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
                     ` (2 preceding siblings ...)
  2019-10-25 12:52   ` Rafael Aquini
@ 2019-10-25 21:08   ` David Rientjes
  3 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2019-10-25 21:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko

On Fri, 25 Oct 2019, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
> 
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the outuput is to see
> how pageblocks are spread among different migratetypes and low number of
> pages is much more interesting therefore putting a bound on the number
> of pages on the free_list sounds like a reasonable tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> 
> The limit has been chosen arbitrary and it is a subject of a future
> change should there be a need for that.
> 
> While we are at it, also drop the zone lock after each free_list
> iteration which will help with the IRQ and page allocator responsiveness
> even further as the IRQ lock held time is always bound to those 100k
> pages.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I think 100k is a very reasonable threshold.

Acked-by: David Rientjes <rientjes@google.com>

> ---
>  mm/vmstat.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..ddb89f4e0486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  			unsigned long freecount = 0;
>  			struct free_area *area;
>  			struct list_head *curr;
> +			bool overflow = false;
>  
>  			area = &(zone->free_area[order]);
>  
> -			list_for_each(curr, &area->free_list[mtype])
> -				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			list_for_each(curr, &area->free_list[mtype]) {
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than
> +				 * sufficient
> +				 */
> +				if (++freecount >= 100000) {

I suppose it's most precise to check freecount > 1000000 to print >100000, 
but I doubt anybody cares :)

> +					overflow = true;
> +					break;
> +				}
> +			}
> +			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
>  		}
>  		seq_putc(m, '\n');
>  	}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-25  7:33   ` Vlastimil Babka
@ 2019-10-25 21:58     ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2019-10-25 21:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML, Michal Hocko,
	David Rientjes, stable

On Fri, 25 Oct 2019 09:33:26 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 10/25/19 9:26 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > /proc/pagetypeinfo is a debugging tool to examine internal page
> > allocator state wrt to fragmentation. It is not very useful for
> > any other use so normal users really do not need to read this file.
> > 
> > Waiman Long has noticed that reading this file can have negative side
> > effects because zone->lock is necessary for gathering data and that
> > a) interferes with the page allocator and its users and b) can lead to
> > hard lockups on large machines which have very long free_list.
> > 
> > Reduce both issues by simply not exporting the file to regular users.
> > 
> > Reported-by: Waiman Long <longman@redhat.com>
> > Cc: stable
> 
> Cc: <stable@vger.kernel.org>

As we don't really know how much damage this will cause, it would be
nice to let it bake in mainline for a month or three before committing
it to the -stable trees.  But we don't have a process for that, apart
from remembering to poke Greg at a suitable date.

Oh well.  I guess that if someone is truly harmed by this change, they
can just chmod /proc/pagetypeinfo back to 0444.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25 11:18 Qian Cai
  2019-10-25 11:43 ` Michal Hocko
  2019-10-25 12:00 ` Mel Gorman
@ 2019-10-25 21:25 ` Andrew Morton
  2 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2019-10-25 21:25 UTC (permalink / raw)
  To: Qian Cai
  Cc: Michal Hocko, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko

On Fri, 25 Oct 2019 07:18:37 -0400 Qian Cai <cai@lca.pw> wrote:

> 
> 
> > On Oct 25, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the outuput is to see
> > how pageblocks are spread among different migratetypes and low number of
> > pages is much more interesting therefore putting a bound on the number
> > of pages on the free_list sounds like a reasonable tradeoff.
> > 
> > The new output will simply tell
> > [...]
> > Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> It was mentioned that developers could use this file is to see the movement of those numbers for debugging, so this supposed to introduce regressions as there is no movement anymore for those 100k+ items?

There are risks, but we have to do something!

I'd be inclined to remove the ">" though - that'll unnecessarily break
code which parses this data.

otoh, maybe that's a feature: breaking those parsers will alert people
to go in and find out what changed.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25 11:18 Qian Cai
  2019-10-25 11:43 ` Michal Hocko
@ 2019-10-25 12:00 ` Mel Gorman
  2019-10-25 21:25 ` Andrew Morton
  2 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2019-10-25 12:00 UTC (permalink / raw)
  To: Qian Cai
  Cc: Michal Hocko, Andrew Morton, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko

On Fri, Oct 25, 2019 at 07:18:37AM -0400, Qian Cai wrote:
> ???
> 
> > On Oct 25, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the outuput is to see
> > how pageblocks are spread among different migratetypes and low number of
> > pages is much more interesting therefore putting a bound on the number
> > of pages on the free_list sounds like a reasonable tradeoff.
> > 
> > The new output will simply tell
> > [...]
> > Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> It was mentioned that developers could use this file is to see the movement of those numbers for debugging, so this supposed to introduce regressions as there is no movement anymore for those 100k+ items?

They would need to be more explicit on what their requirements are. When
the file was first introduced, the main interesting point was to see
an approximate distribution of orders by migration type. The exact
count was not particularly interesting other than knowing whether
there was large numbers at lower orders that could not be recovered
by compaction/drop_caches/hugetlbfs-allocation/memhog-abuse etc and
the distribution of pageblocks by migration type overall. That was my
perspective at least when developing fragmentation avoidance, lumpy
reclaim and later compaction.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-25 11:18 Qian Cai
@ 2019-10-25 11:43 ` Michal Hocko
  2019-10-25 12:00 ` Mel Gorman
  2019-10-25 21:25 ` Andrew Morton
  2 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2019-10-25 11:43 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML

On Fri 25-10-19 07:18:37, Qian Cai wrote:
> 
> 
> > On Oct 25, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the outuput is to see
> > how pageblocks are spread among different migratetypes and low number of
> > pages is much more interesting therefore putting a bound on the number
> > of pages on the free_list sounds like a reasonable tradeoff.
> > 
> > The new output will simply tell
> > [...]
> > Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> It was mentioned that developers could use this file is to see the
> movement of those numbers for debugging, so this supposed to introduce
> regressions as there is no movement anymore for those 100k+ items?

Can you provide an explicit example please? As the changelog mentions
it is the "low numbers" that is really interesting when debugging
fragmentation issues. Because we are running out of a respective migrate
type and it is interesting to see why and what compaction is doing etc.

Having more than 100k pages on the respective migrate type list is a
good sign that the migrate type is not in problems. Comparing multiple
migrate types with >100k pages doesn't really need a large precision
AFAIU.

Now, I do agree that some debugging tools might get confused by > in the
output but can we wait for reports and see whether anybody actually
cares? It is more likely that fixing one off debugging tools wouldn't be
a big deal.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re:  [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
@ 2019-10-25 11:18 Qian Cai
  2019-10-25 11:43 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Qian Cai @ 2019-10-25 11:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko



> On Oct 25, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the outuput is to see
> how pageblocks are spread among different migratetypes and low number of
> pages is much more interesting therefore putting a bound on the number
> of pages on the free_list sounds like a reasonable tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648

It was mentioned that developers could use this file is to see the movement of those numbers for debugging, so this supposed to introduce regressions as there is no movement anymore for those 100k+ items?

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-10-25 21:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-25  7:26 [PATCH 0/2] mm: reduce /proc/pagetypeinfo ovehead Michal Hocko
2019-10-25  7:26 ` [PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
2019-10-25  7:33   ` Vlastimil Babka
2019-10-25 21:58     ` Andrew Morton
2019-10-25  8:18   ` David Hildenbrand
2019-10-25  7:26 ` [PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
2019-10-25  7:35   ` Vlastimil Babka
2019-10-25  8:21   ` David Hildenbrand
2019-10-25 12:52   ` Rafael Aquini
2019-10-25 21:08   ` David Rientjes
2019-10-25 11:18 Qian Cai
2019-10-25 11:43 ` Michal Hocko
2019-10-25 12:00 ` Mel Gorman
2019-10-25 21:25 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).