* [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
@ 2019-10-22 16:21 Waiman Long
  2019-10-22 16:57 ` Michal Hocko
  2019-10-22 21:59 ` Andrew Morton
  0 siblings, 2 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-22 16:21 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, Waiman Long

The pagetypeinfo_showfree_print() function prints out the number of
free blocks for each of the page orders and migrate types. The current
code just iterates each of the free lists to get the counts.  There are
bug reports about hard lockup panics when reading the /proc/pagetypeinfo
file simply because it takes too long to iterate all the free lists within
a zone while holding the zone lock with irqs disabled.

Given the fact that /proc/pagetypeinfo is readable by all, the possibility
of crashing a system by the simple act of reading /proc/pagetypeinfo
by any user is a security problem that needs to be addressed.

There is a free_area structure associated with each page order. There
is also a nr_free count within the free_area for all the different
migration types combined. Tracking the number of free list entries
for each migration type will probably add some overhead to the fast
paths like moving pages from one migration type to another which may
not be desirable.

We can actually skip iterating the list of one of the migration types
and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
is usually the largest one on large memory systems, this is the one
to be skipped. Since the printing order is migration-type => order, we
will have to store the counts in an internal 2D array before printing
them out.

Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
zone lock for too long, blocking other zone lock waiters from running.
This can be problematic for systems with large amounts of memory.
So a check is added to temporarily release the lock and reschedule if
more than 64k of list entries have been iterated for each order. With
a MAX_ORDER of 11, the worst case is iterating about 700k list
entries before releasing the lock.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/vmstat.c | 51 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 41 insertions(+), 10 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..40c9a1494709 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
 	int order, mtype;
+	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
 
-	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
-		seq_printf(m, "Node %4d, zone %8s, type %12s ",
-					pgdat->node_id,
-					zone->name,
-					migratetype_names[mtype]);
-		for (order = 0; order < MAX_ORDER; ++order) {
+	lockdep_assert_held(&zone->lock);
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * MIGRATE_MOVABLE is usually the largest one in large memory
+	 * systems. We skip iterating that list. Instead, we compute it by
+	 * subtracting the total of the rests from free_area->nr_free.
+	 */
+	for (order = 0; order < MAX_ORDER; ++order) {
+		unsigned long nr_total = 0;
+		struct free_area *area = &(zone->free_area[order]);
+
+		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
 			unsigned long freecount = 0;
-			struct free_area *area;
 			struct list_head *curr;
 
-			area = &(zone->free_area[order]);
-
+			if (mtype == MIGRATE_MOVABLE)
+				continue;
 			list_for_each(curr, &area->free_list[mtype])
 				freecount++;
-			seq_printf(m, "%6lu ", freecount);
+			nfree[order][mtype] = freecount;
+			nr_total += freecount;
 		}
+		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
+
+		/*
+		 * If we have already iterated more than 64k of list
+		 * entries, we might have hold the zone lock for too long.
+		 * Temporarily release the lock and reschedule before
+		 * continuing so that other lock waiters have a chance
+		 * to run.
+		 */
+		if (nr_total > (1 << 16)) {
+			spin_unlock_irq(&zone->lock);
+			cond_resched();
+			spin_lock_irq(&zone->lock);
+		}
+	}
+
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+		seq_printf(m, "Node %4d, zone %8s, type %12s ",
+					pgdat->node_id,
+					zone->name,
+					migratetype_names[mtype]);
+		for (order = 0; order < MAX_ORDER; ++order)
+			seq_printf(m, "%6lu ", nfree[order][mtype]);
 		seq_putc(m, '\n');
 	}
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 16:21 [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Waiman Long
@ 2019-10-22 16:57 ` Michal Hocko
  2019-10-22 18:00   ` Waiman Long
  2019-10-23  8:31   ` Mel Gorman
  2019-10-22 21:59 ` Andrew Morton
  1 sibling, 2 replies; 67+ messages in thread
From: Michal Hocko @ 2019-10-22 16:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini,
	Mel Gorman

[Cc Mel]

On Tue 22-10-19 12:21:56, Waiman Long wrote:
> The pagetypeinfo_showfree_print() function prints out the number of
> free blocks for each of the page orders and migrate types. The current
> code just iterates each of the free lists to get the counts.  There are
> bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> file simply because it takes too long to iterate all the free lists within
> a zone while holding the zone lock with irqs disabled.
> 
> Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> of crashing a system by the simple act of reading /proc/pagetypeinfo
> by any user is a security problem that needs to be addressed.

Should we make the file 0400? It is a useful thing when debugging but
not something regular users would really need for life.

> There is a free_area structure associated with each page order. There
> is also a nr_free count within the free_area for all the different
> migration types combined. Tracking the number of free list entries
> for each migration type will probably add some overhead to the fast
> paths like moving pages from one migration type to another which may
> not be desirable.

Have you tried to measure that overhead?
 
> We can actually skip iterating the list of one of the migration types
> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> is usually the largest one on large memory systems, this is the one
> to be skipped. Since the printing order is migration-type => order, we
> will have to store the counts in an internal 2D array before printing
> them out.
> 
> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> zone lock for too long blocking out other zone lock waiters from being
> run. This can be problematic for systems with large amount of memory.
> So a check is added to temporarily release the lock and reschedule if
> more than 64k of list entries have been iterated for each order. With
> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> entries before releasing the lock.

But you are still iterating through the whole free_list at once so if it
gets really large then this is still possible. I think it would be
preferable to use per migratetype nr_free if it doesn't cause any
regressions.

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/vmstat.c | 51 +++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 41 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..40c9a1494709 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
>  
> -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> -					pgdat->node_id,
> -					zone->name,
> -					migratetype_names[mtype]);
> -		for (order = 0; order < MAX_ORDER; ++order) {
> +	lockdep_assert_held(&zone->lock);
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> +	 * systems. We skip iterating that list. Instead, we compute it by
> +	 * subtracting the total of the rests from free_area->nr_free.
> +	 */
> +	for (order = 0; order < MAX_ORDER; ++order) {
> +		unsigned long nr_total = 0;
> +		struct free_area *area = &(zone->free_area[order]);
> +
> +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  			unsigned long freecount = 0;
> -			struct free_area *area;
>  			struct list_head *curr;
>  
> -			area = &(zone->free_area[order]);
> -
> +			if (mtype == MIGRATE_MOVABLE)
> +				continue;
>  			list_for_each(curr, &area->free_list[mtype])
>  				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			nfree[order][mtype] = freecount;
> +			nr_total += freecount;
>  		}
> +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> +
> +		/*
> +		 * If we have already iterated more than 64k of list
> +		 * entries, we might have hold the zone lock for too long.
> +		 * Temporarily release the lock and reschedule before
> +		 * continuing so that other lock waiters have a chance
> +		 * to run.
> +		 */
> +		if (nr_total > (1 << 16)) {
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
> +		}
> +	}
> +
> +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> +					pgdat->node_id,
> +					zone->name,
> +					migratetype_names[mtype]);
> +		for (order = 0; order < MAX_ORDER; ++order)
> +			seq_printf(m, "%6lu ", nfree[order][mtype]);
>  		seq_putc(m, '\n');
>  	}
>  }
> -- 
> 2.18.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 16:57 ` Michal Hocko
@ 2019-10-22 18:00   ` Waiman Long
  2019-10-22 18:40     ` Waiman Long
  2019-10-23  8:31   ` Mel Gorman
  1 sibling, 1 reply; 67+ messages in thread
From: Waiman Long @ 2019-10-22 18:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini,
	Mel Gorman

On 10/22/19 12:57 PM, Michal Hocko wrote:
> [Cc Mel]
>
> On Tue 22-10-19 12:21:56, Waiman Long wrote:
>> The pagetypeinfo_showfree_print() function prints out the number of
>> free blocks for each of the page orders and migrate types. The current
>> code just iterates the each of the free lists to get counts.  There are
>> bug reports about hard lockup panics when reading the /proc/pagetyeinfo
>> file just because it look too long to iterate all the free lists within
>> a zone while holing the zone lock with irq disabled.
>>
>> Given the fact that /proc/pagetypeinfo is readable by all, the possiblity
>> of crashing a system by the simple act of reading /proc/pagetypeinfo
>> by any user is a security problem that needs to be addressed.
> Should we make the file 0400? It is a useful thing when debugging but
> not something regular users would really need for life.
>
I am not against doing that, but it may break existing applications that
somehow need to read pagetypeinfo. That is why I didn't try to advocate
changing the protection.


>> There is a free_area structure associated with each page order. There
>> is also a nr_free count within the free_area for all the different
>> migration types combined. Tracking the number of free list entries
>> for each migration type will probably add some overhead to the fast
>> paths like moving pages from one migration type to another which may
>> not be desirable.
> Have you tried to measure that overhead?

I haven't tried to measure the performance impact yet. I did think
about tracking nr_free for each of the migration types within a
free_area. That would require auditing the code to make sure that all the
intra-free_area migrations are properly accounted for. I can work on it
if people prefer this alternative.


>  
>> We can actually skip iterating the list of one of the migration types
>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
>> is usually the largest one on large memory systems, this is the one
>> to be skipped. Since the printing order is migration-type => order, we
>> will have to store the counts in an internal 2D array before printing
>> them out.
>>
>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>> zone lock for too long blocking out other zone lock waiters from being
>> run. This can be problematic for systems with large amount of memory.
>> So a check is added to temporarily release the lock and reschedule if
>> more than 64k of list entries have been iterated for each order. With
>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>> entries before releasing the lock.
> But you are still iterating through the whole free_list at once so if it
> gets really large then this is still possible. I think it would be
> preferable to use per migratetype nr_free if it doesn't cause any
> regressions.
>
Yes, it is still theoretically possible. I will take a further look at
having a per-migrate-type nr_free. BTW, there is one more place where the
free lists are being iterated with the zone lock held - mark_free_pages().

Cheers,
Longman



* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 18:00   ` Waiman Long
@ 2019-10-22 18:40     ` Waiman Long
  2019-10-23  0:52         ` David Rientjes
  0 siblings, 1 reply; 67+ messages in thread
From: Waiman Long @ 2019-10-22 18:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini,
	Mel Gorman

On 10/22/19 2:00 PM, Waiman Long wrote:
> On 10/22/19 12:57 PM, Michal Hocko wrote:
>
>>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
>>> is usually the largest one on large memory systems, this is the one
>>> to be skipped. Since the printing order is migration-type => order, we
>>> will have to store the counts in an internal 2D array before printing
>>> them out.
>>>
>>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>>> zone lock for too long blocking out other zone lock waiters from being
>>> run. This can be problematic for systems with large amount of memory.
>>> So a check is added to temporarily release the lock and reschedule if
>>> more than 64k of list entries have been iterated for each order. With
>>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>>> entries before releasing the lock.
>> But you are still iterating through the whole free_list at once so if it
>> gets really large then this is still possible. I think it would be
>> preferable to use per migratetype nr_free if it doesn't cause any
>> regressions.
>>
> Yes, it is still theoretically possible. I will take a further look at
> having per-migrate type nr_free. BTW, there is one more place where the
> free lists are being iterated with zone lock held - mark_free_pages().

Looking deeper into the code, the exact migration type is not stored in
the page itself. An initially movable page can be stolen and put into
another migration type. So on a delete or move from a free_area, we don't
know exactly what migration type the page is coming from. IOW, it is
hard to get accurate counts of the number of entries in each list.

I am not saying this is impossible, but doing it may require stealing
some bits from the page structure to store this information, which is
probably not worth the benefit we can get from it. So if you have any
good suggestions on how to do it without too much cost, please let me
know. Otherwise, I will probably stay with the current patch.

Cheers,
Longman




* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 16:21 [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Waiman Long
  2019-10-22 16:57 ` Michal Hocko
@ 2019-10-22 21:59 ` Andrew Morton
  2019-10-23  6:15   ` Michal Hocko
  2019-10-23 14:30   ` Waiman Long
  1 sibling, 2 replies; 67+ messages in thread
From: Andrew Morton @ 2019-10-22 21:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini

On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <longman@redhat.com> wrote:

> The pagetypeinfo_showfree_print() function prints out the number of
> free blocks for each of the page orders and migrate types. The current
> code just iterates each of the free lists to get the counts.  There are
> bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> file simply because it takes too long to iterate all the free lists within
> a zone while holding the zone lock with irqs disabled.
> 
> Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> of crashing a system by the simple act of reading /proc/pagetypeinfo
> by any user is a security problem that needs to be addressed.

Yes.

> There is a free_area structure associated with each page order. There
> is also a nr_free count within the free_area for all the different
> migration types combined. Tracking the number of free list entries
> for each migration type will probably add some overhead to the fast
> paths like moving pages from one migration type to another which may
> not be desirable.
> 
> We can actually skip iterating the list of one of the migration types
> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> is usually the largest one on large memory systems, this is the one
> to be skipped. Since the printing order is migration-type => order, we
> will have to store the counts in an internal 2D array before printing
> them out.
> 
> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> zone lock for too long blocking out other zone lock waiters from being
> run. This can be problematic for systems with large amount of memory.
> So a check is added to temporarily release the lock and reschedule if
> more than 64k of list entries have been iterated for each order. With
> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> entries before releasing the lock.
> 
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];

600+ bytes is a bit much.  I guess it's OK in this situation.

> -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> -					pgdat->node_id,
> -					zone->name,
> -					migratetype_names[mtype]);
> -		for (order = 0; order < MAX_ORDER; ++order) {
> +	lockdep_assert_held(&zone->lock);
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> +	 * systems. We skip iterating that list. Instead, we compute it by
> +	 * subtracting the total of the rests from free_area->nr_free.
> +	 */
> +	for (order = 0; order < MAX_ORDER; ++order) {
> +		unsigned long nr_total = 0;
> +		struct free_area *area = &(zone->free_area[order]);
> +
> +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  			unsigned long freecount = 0;
> -			struct free_area *area;
>  			struct list_head *curr;
>  
> -			area = &(zone->free_area[order]);
> -
> +			if (mtype == MIGRATE_MOVABLE)
> +				continue;
>  			list_for_each(curr, &area->free_list[mtype])
>  				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			nfree[order][mtype] = freecount;
> +			nr_total += freecount;
>  		}
> +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> +
> +		/*
> +		 * If we have already iterated more than 64k of list
> +		 * entries, we might have hold the zone lock for too long.
> +		 * Temporarily release the lock and reschedule before
> +		 * continuing so that other lock waiters have a chance
> +		 * to run.
> +		 */
> +		if (nr_total > (1 << 16)) {
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
> +		}
> +	}
> +
> +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> +					pgdat->node_id,
> +					zone->name,
> +					migratetype_names[mtype]);
> +		for (order = 0; order < MAX_ORDER; ++order)
> +			seq_printf(m, "%6lu ", nfree[order][mtype]);
>  		seq_putc(m, '\n');

This is not exactly a thing of beauty :( Presumably there might still
be situations where the irq-off times remain excessive.

Why are we actually holding zone->lock so much?  Can we get away with
holding it across the list_for_each() loop and nothing else?  If so,
this still isn't a bulletproof fix.  Maybe just terminate the list
walk if freecount reaches 1024.  Would anyone really care?

Sigh.  I wonder if anyone really uses this thing for anything
important.  Can we just remove it all?



* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 18:40     ` Waiman Long
@ 2019-10-23  0:52         ` David Rientjes
  0 siblings, 0 replies; 67+ messages in thread
From: David Rientjes @ 2019-10-23  0:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, Mel Gorman

On Tue, 22 Oct 2019, Waiman Long wrote:

> >>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> >>> is usually the largest one on large memory systems, this is the one
> >>> to be skipped. Since the printing order is migration-type => order, we
> >>> will have to store the counts in an internal 2D array before printing
> >>> them out.
> >>>
> >>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> >>> zone lock for too long blocking out other zone lock waiters from being
> >>> run. This can be problematic for systems with large amount of memory.
> >>> So a check is added to temporarily release the lock and reschedule if
> >>> more than 64k of list entries have been iterated for each order. With
> >>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> >>> entries before releasing the lock.
> >> But you are still iterating through the whole free_list at once so if it
> >> gets really large then this is still possible. I think it would be
> >> preferable to use per migratetype nr_free if it doesn't cause any
> >> regressions.
> >>
> > Yes, it is still theoretically possible. I will take a further look at
> > having per-migrate type nr_free. BTW, there is one more place where the
> > free lists are being iterated with zone lock held - mark_free_pages().
> 
> Looking deeper into the code, the exact migration type is not stored in
> the page itself. An initial movable page can be stolen to be put into
> another migration type. So in a delete or move from free_area, we don't
> know exactly what migration type the page is coming from. IOW, it is
> hard to get accurate counts of the number of entries in each lists.
> 

I think the suggestion is to maintain a nr_free count of the free_list for
each order for each migratetype so anytime a page is added or deleted from
the list, the nr_free is adjusted.  Then the free_area's nr_free becomes
the sum of its migratetypes' nr_free at that order.  That's possible to do
if you track the migratetype per page, as you said, or, like pcp pages,
track it as part of page->index.  It's a trade-off on whether you want to
impact the performance of maintaining these new nr_frees anytime you
manipulate the freelists.

I think Vlastimil and I discussed per order per migratetype nr_frees in 
the past and it could be a worthwhile improvement for other reasons, 
specifically it leads to heuristics that can be used to determine how 
fragmented a certain migratetype is for a zone, i.e. a very quick way to
determine what ratio of pages over all MIGRATE_UNMOVABLE pageblocks are 
free.

Or maybe there are other reasons why these nr_frees can't be maintained 
anymore?  (I had a patch to do it on 4.3.)

You may also find systems where MIGRATE_MOVABLE is not actually the 
longest free_list compared to other migratetypes on a severely fragmented 
system, so special casing MIGRATE_MOVABLE might not be the best way 
forward.


* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 21:59 ` Andrew Morton
@ 2019-10-23  6:15   ` Michal Hocko
  2019-10-23 14:30   ` Waiman Long
  1 sibling, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2019-10-23  6:15 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Mel Gorman
  Cc: Waiman Long, linux-mm, linux-kernel, Johannes Weiner,
	Roman Gushchin, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini

On Tue 22-10-19 14:59:02, Andrew Morton wrote:
> On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <longman@redhat.com> wrote:
[...]
> > -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > -					pgdat->node_id,
> > -					zone->name,
> > -					migratetype_names[mtype]);
> > -		for (order = 0; order < MAX_ORDER; ++order) {
> > +	lockdep_assert_held(&zone->lock);
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	/*
> > +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> > +	 * systems. We skip iterating that list. Instead, we compute it by
> > +	 * subtracting the total of the rests from free_area->nr_free.
> > +	 */
> > +	for (order = 0; order < MAX_ORDER; ++order) {
> > +		unsigned long nr_total = 0;
> > +		struct free_area *area = &(zone->free_area[order]);
> > +
> > +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> >  			unsigned long freecount = 0;
> > -			struct free_area *area;
> >  			struct list_head *curr;
> >  
> > -			area = &(zone->free_area[order]);
> > -
> > +			if (mtype == MIGRATE_MOVABLE)
> > +				continue;
> >  			list_for_each(curr, &area->free_list[mtype])
> >  				freecount++;
> > -			seq_printf(m, "%6lu ", freecount);
> > +			nfree[order][mtype] = freecount;
> > +			nr_total += freecount;
> >  		}
> > +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > +
> > +		/*
> > +		 * If we have already iterated more than 64k of list
> > +		 * entries, we might have hold the zone lock for too long.
> > +		 * Temporarily release the lock and reschedule before
> > +		 * continuing so that other lock waiters have a chance
> > +		 * to run.
> > +		 */
> > +		if (nr_total > (1 << 16)) {
> > +			spin_unlock_irq(&zone->lock);
> > +			cond_resched();
> > +			spin_lock_irq(&zone->lock);
> > +		}
> > +	}
> > +
> > +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > +					pgdat->node_id,
> > +					zone->name,
> > +					migratetype_names[mtype]);
> > +		for (order = 0; order < MAX_ORDER; ++order)
> > +			seq_printf(m, "%6lu ", nfree[order][mtype]);
> >  		seq_putc(m, '\n');
> 
> This is not exactly a thing of beauty :( Presumably there might still
> be situations where the irq-off times remain excessive.

Yes. It is the list_for_each over the free_list that needs the lock and
that is the actual problem here. This can be really large with a _lot_
of memory, and this is why I objected to the patch: it doesn't
really address that problem. I would like to hear from Mel and Vlastimil
how they would feel about making free_list fully migrate type aware
(including nr_free).

> Why are we actually holding zone->lock so much?  Can we get away with
> holding it across the list_for_each() loop and nothing else?  If so,
> this still isn't a bulletproof fix.  Maybe just terminate the list
> walk if freecount reaches 1024.  Would anyone really care?
> 
> Sigh.  I wonder if anyone really uses this thing for anything
> important.  Can we just remove it all?

Vlastimil would know much better but I have seen this being used for
fragmentation related debugging. That should imply that 0400 should be
sufficient and a quick and easily backportable fix for the most pressing
immediate problem.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 16:57 ` Michal Hocko
  2019-10-22 18:00   ` Waiman Long
@ 2019-10-23  8:31   ` Mel Gorman
  2019-10-23  9:04     ` Michal Hocko
  1 sibling, 1 reply; 67+ messages in thread
From: Mel Gorman @ 2019-10-23  8:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, Mel Gorman

On Tue, Oct 22, 2019 at 06:57:45PM +0200, Michal Hocko wrote:
> [Cc Mel]
> 
> On Tue 22-10-19 12:21:56, Waiman Long wrote:
> > The pagetypeinfo_showfree_print() function prints out the number of
> > free blocks for each of the page orders and migrate types. The current
> > code just iterates each of the free lists to get counts.  There are
> > bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> > file just because it takes too long to iterate all the free lists within
> > a zone while holding the zone lock with irqs disabled.
> > 
> > Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > by any user is a security problem that needs to be addressed.
> 
> Should we make the file 0400? It is a useful thing when debugging but
> not something regular users would really need for life.
> 

I think this would be useful in general. The information is not that
useful outside of debugging. Even then it's only useful when trying to
get a handle on why a path like compaction is taking too long.

> > There is a free_area structure associated with each page order. There
> > is also a nr_free count within the free_area for all the different
> > migration types combined. Tracking the number of free list entries
> > for each migration type will probably add some overhead to the fast
> > paths like moving pages from one migration type to another which may
> > not be desirable.
> 
> Have you tried to measure that overhead?
>  

I would prefer this option not be taken. It would increase the cost of
watermark calculations which is a relatively fast path.

> > We can actually skip iterating the list of one of the migration types
> > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > is usually the largest one on large memory systems, this is the one
> > to be skipped. Since the printing order is migration-type => order, we
> > will have to store the counts in an internal 2D array before printing
> > them out.
> > 
> > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > zone lock for too long blocking out other zone lock waiters from being
> > run. This can be problematic for systems with a large amount of memory.
> > So a check is added to temporarily release the lock and reschedule if
> > more than 64k of list entries have been iterated for each order. With
> > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > entries before releasing the lock.
> 
> But you are still iterating through the whole free_list at once so if it
> gets really large then this is still possible. I think it would be
> preferable to use per migratetype nr_free if it doesn't cause any
> regressions.
> 

I think it will. The patch as it is contains the overhead within the
reader of the pagetypeinfo proc file which is a non-critical path. The
page allocator paths, on the other hand, are very important.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23  8:31   ` Mel Gorman
@ 2019-10-23  9:04     ` Michal Hocko
  2019-10-23  9:56       ` Mel Gorman
  0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-23  9:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, Mel Gorman

On Wed 23-10-19 09:31:43, Mel Gorman wrote:
> On Tue, Oct 22, 2019 at 06:57:45PM +0200, Michal Hocko wrote:
> > [Cc Mel]
> > 
> > On Tue 22-10-19 12:21:56, Waiman Long wrote:
> > > The pagetypeinfo_showfree_print() function prints out the number of
> > > free blocks for each of the page orders and migrate types. The current
> > > code just iterates each of the free lists to get counts.  There are
> > > bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> > > file just because it takes too long to iterate all the free lists within
> > > a zone while holding the zone lock with irqs disabled.
> > > 
> > > Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > by any user is a security problem that needs to be addressed.
> > 
> > Should we make the file 0400? It is a useful thing when debugging but
> > not something regular users would really need for life.
> > 
> 
> I think this would be useful in general. The information is not that
> useful outside of debugging. Even then it's only useful when trying to
> get a handle on why a path like compaction is taking too long.

So can we go with this to address the security aspect of this and have
something trivial to backport?

> > > There is a free_area structure associated with each page order. There
> > > is also a nr_free count within the free_area for all the different
> > > migration types combined. Tracking the number of free list entries
> > > for each migration type will probably add some overhead to the fast
> > > paths like moving pages from one migration type to another which may
> > > not be desirable.
> > 
> > Have you tried to measure that overhead?
> >  
> 
> I would prefer this option not be taken. It would increase the cost of
> watermark calculations which is a relatively fast path.

Is the change for the wmark check going to require more than
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0b2e0306720..5d95313ba4a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3448,9 +3448,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		struct free_area *area = &z->free_area[o];
 		int mt;
 
-		if (!area->nr_free)
-			continue;
-
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
 			if (!free_area_empty(area, mt))
 				return true;

Is this really going to be visible in practice? Sure we are going to do
more checks but most orders tend to have at least some memory in a
reasonably balanced system and we can hardly expect an optimal
allocation path on those that are not.
 
> > > We can actually skip iterating the list of one of the migration types
> > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > is usually the largest one on large memory systems, this is the one
> > > to be skipped. Since the printing order is migration-type => order, we
> > > will have to store the counts in an internal 2D array before printing
> > > them out.
> > > 
> > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > zone lock for too long blocking out other zone lock waiters from being
> > > run. This can be problematic for systems with a large amount of memory.
> > > So a check is added to temporarily release the lock and reschedule if
> > > more than 64k of list entries have been iterated for each order. With
> > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > entries before releasing the lock.
> > 
> > But you are still iterating through the whole free_list at once so if it
> > gets really large then this is still possible. I think it would be
> > preferable to use per migratetype nr_free if it doesn't cause any
> > regressions.
> > 
> 
> I think it will. The patch as it is contains the overhead within the
> reader of the pagetypeinfo proc file which is a non-critical path. The
> > page allocator paths, on the other hand, are very important.

As pointed out in other email. The problem with this patch is that it
hasn't really removed the iteration over the whole free_list which is
the primary problem. So I think that we should either consider this a
non-issue and make it "admin knows this is potentially expensive" or do
something like Andrew was suggesting if we do not want to change the
nr_free accounting.

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..83c0295ecddc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1386,8 +1386,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 
 			area = &(zone->free_area[order]);
 
-			list_for_each(curr, &area->free_list[mtype])
+			list_for_each(curr, &area->free_list[mtype]) {
 				freecount++;
+				if (freecount > BIG_NUMBER) {
+					seq_printf(m, ">%6lu ", freecount);
+					spin_unlock_irq(&zone->lock);
+					cond_resched();
+					spin_lock_irq(&zone->lock);
+					continue;
+				}
+			}
 			seq_printf(m, "%6lu ", freecount);
 		}
 		seq_putc(m, '\n');
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23  9:04     ` Michal Hocko
@ 2019-10-23  9:56       ` Mel Gorman
  2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
                           ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Mel Gorman @ 2019-10-23  9:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Wed, Oct 23, 2019 at 11:04:22AM +0200, Michal Hocko wrote:
> On Wed 23-10-19 09:31:43, Mel Gorman wrote:
> > On Tue, Oct 22, 2019 at 06:57:45PM +0200, Michal Hocko wrote:
> > > [Cc Mel]
> > > 
> > > On Tue 22-10-19 12:21:56, Waiman Long wrote:
> > > > The pagetypeinfo_showfree_print() function prints out the number of
> > > > free blocks for each of the page orders and migrate types. The current
> > > > code just iterates each of the free lists to get counts.  There are
> > > > bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> > > > file just because it takes too long to iterate all the free lists within
> > > > a zone while holding the zone lock with irqs disabled.
> > > > 
> > > > Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> > > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > > by any user is a security problem that needs to be addressed.
> > > 
> > > Should we make the file 0400? It is a useful thing when debugging but
> > > not something regular users would really need for life.
> > > 
> > 
> > I think this would be useful in general. The information is not that
> > useful outside of debugging. Even then it's only useful when trying to
> > get a handle on why a path like compaction is taking too long.
> 
> > So can we go with this to address the security aspect of this and have
> > something trivial to backport?
> 

Yes.

> > > > There is a free_area structure associated with each page order. There
> > > > is also a nr_free count within the free_area for all the different
> > > > migration types combined. Tracking the number of free list entries
> > > > for each migration type will probably add some overhead to the fast
> > > > paths like moving pages from one migration type to another which may
> > > > not be desirable.
> > > 
> > > Have you tried to measure that overhead?
> > >  
> > 
> > I would prefer this option not be taken. It would increase the cost of
> > watermark calculations which is a relatively fast path.
> 
> Is the change for the wmark check going to require more than
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..5d95313ba4a5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3448,9 +3448,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  		struct free_area *area = &z->free_area[o];
>  		int mt;
>  
> -		if (!area->nr_free)
> -			continue;
> -
>  		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
>  			if (!free_area_empty(area, mt))
>  				return true;
> 
> Is this really going to be visible in practice? Sure we are going to do
> more checks but most orders tend to have at least some memory in a
> reasonably balanced system and we can hardly expect an optimal
> allocation path on those that are not.
>  

You also have to iterate over them all later in the same function.  If the
free counts are per migrate type then they would have to be iterated over
every time.

Similarly, there would be multiple places where all the counters would
have to be iterated -- find_suitable_fallback, show_free_areas,
fast_isolate_freepages, fill_contig_page_info, zone_init_free_lists etc.

It'd be a small cost but given that it's aimed at fixing a problem with
reading pagetypeinfo, is it really worth it? I don't think so.

> > > > We can actually skip iterating the list of one of the migration types
> > > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > > is usually the largest one on large memory systems, this is the one
> > > > to be skipped. Since the printing order is migration-type => order, we
> > > > will have to store the counts in an internal 2D array before printing
> > > > them out.
> > > > 
> > > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > > zone lock for too long blocking out other zone lock waiters from being
> > > > run. This can be problematic for systems with a large amount of memory.
> > > > So a check is added to temporarily release the lock and reschedule if
> > > > more than 64k of list entries have been iterated for each order. With
> > > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > > entries before releasing the lock.
> > > 
> > > But you are still iterating through the whole free_list at once so if it
> > > gets really large then this is still possible. I think it would be
> > > preferable to use per migratetype nr_free if it doesn't cause any
> > > regressions.
> > > 
> > 
> > I think it will. The patch as it is contains the overhead within the
> > reader of the pagetypeinfo proc file which is a non-critical path. The
> > > page allocator paths, on the other hand, are very important.
> 
> As pointed out in other email. The problem with this patch is that it
> hasn't really removed the iteration over the whole free_list which is
> the primary problem. So I think that we should either consider this a
> non-issue and make it "admin knows this is potentially expensive" or do
> something like Andrew was suggesting if we do not want to change the
> nr_free accounting.
> 

Again, the cost is when reading a proc file. From what Andrew said,
the lock is necessary to safely walk the list. If anything, I would
be ok with limiting the length of the walk but honestly, I would also
be ok with simply deleting the proc file. The utility for debugging a
problem with it is limited now (it was more important when fragmentation
avoidance was first introduced) and there is little an admin can do with
the information. I can't remember the last time I asked for the contents
of the file when trying to debug a problem. There is a possibility that
someone will complain but I'm not aware of any utility that reads the
information and does something useful with it. In the unlikely event
something breaks, the file can be re-added with a limited walk.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23  9:56       ` Mel Gorman
@ 2019-10-23 10:27         ` Michal Hocko
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
                             ` (2 more replies)
  2019-10-23 12:42         ` [PATCH] " Qian Cai
  2019-10-23 13:25         ` Vlastimil Babka
  2 siblings, 3 replies; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 10:27 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML

On Wed 23-10-19 10:56:08, Mel Gorman wrote:
> On Wed, Oct 23, 2019 at 11:04:22AM +0200, Michal Hocko wrote:
> > So can we go with this to address the security aspect of this and have
> > something trivial to backport?
> > 
> 
> Yes.

Ok, patch 1 in reply to this email.

> > > > > There is a free_area structure associated with each page order. There
> > > > > is also a nr_free count within the free_area for all the different
> > > > > migration types combined. Tracking the number of free list entries
> > > > > for each migration type will probably add some overhead to the fast
> > > > > paths like moving pages from one migration type to another which may
> > > > > not be desirable.
> > > > 
> > > > Have you tried to measure that overhead?
> > > >  
> > > 
> > > I would prefer this option not be taken. It would increase the cost of
> > > watermark calculations which is a relatively fast path.
> > 
> > Is the change for the wmark check going to require more than
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c0b2e0306720..5d95313ba4a5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3448,9 +3448,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> >  		struct free_area *area = &z->free_area[o];
> >  		int mt;
> >  
> > -		if (!area->nr_free)
> > -			continue;
> > -
> >  		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> >  			if (!free_area_empty(area, mt))
> >  				return true;
> > 
> > Is this really going to be visible in practice? Sure we are going to do
> > more checks but most orders tend to have at least some memory in a
> > reasonably balanced system and we can hardly expect an optimal
> > allocation path on those that are not.
> >  
> 
> You also have to iterate over them all later in the same function.  If the
> free counts are per migrate type then they would have to be iterated over
> every time.
> 
> Similarly, there would be multiple places where all the counters would
> have to be iterated -- find_suitable_fallback, show_free_areas,
> fast_isolate_freepages, fill_contig_page_info, zone_init_free_lists etc.
> 
> It'd be a small cost but given that it's aimed at fixing a problem with
> reading pagetypeinfo, is it really worth it? I don't think so.

Fair enough.

[...]
> > As pointed out in other email. The problem with this patch is that it
> > hasn't really removed the iteration over the whole free_list which is
> > the primary problem. So I think that we should either consider this a
> > non-issue and make it "admin knows this is potentially expensive" or do
> > something like Andrew was suggesting if we do not want to change the
> > nr_free accounting.
> > 
> 
> Again, the cost is when reading a proc file. From what Andrew said,
> the lock is necessary to safely walk the list. If anything, I would
> be ok with limiting the length of the walk but honestly, I would also
> be ok with simply deleting the proc file. The utility for debugging a
> problem with it is limited now (it was more important when fragmentation
> avoidance was first introduced) and there is little an admin can do with
> the information. I can't remember the last time I asked for the contents
> of the file when trying to debug a problem. There is a possibility that
> someone will complain but I'm not aware of any utility that reads the
> information and does something useful with it. In the unlikely event
> something breaks, the file can be re-added with a limited walk.

I went with a bound on the pages iterated over in the free_list. See
patch 2.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
@ 2019-10-23 10:27           ` Michal Hocko
  2019-10-23 13:13             ` Mel Gorman
                               ` (5 more replies)
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
  2019-10-24  8:20           ` [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Michal Hocko
  2 siblings, 6 replies; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 10:27 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt to fragmentation. It is not very useful for
any other use so normal users really do not need to read this file.

Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that
a) interferes with the page allocator and its users and b) can lead to
hard lockups on large machines which have very long free_list.

Reduce both issues by simply not exporting the file to regular users.

Reported-by: Waiman Long <longman@redhat.com>
Cc: stable
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..4e885ecd44d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
-	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
+	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
 	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
 	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
@ 2019-10-23 10:27           ` Michal Hocko
  2019-10-23 13:15             ` Mel Gorman
                               ` (6 more replies)
  2019-10-24  8:20           ` [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Michal Hocko
  2 siblings, 7 replies; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 10:27 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes therefore putting
a bound on the number of pages on the free_list sounds like a reasonable
tradeoff.

The new output will simply tell
[...]
Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648

instead of
Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648

The limit has been chosen arbitrarily and it is subject to a future
change should there be a need for that.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmstat.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..762034fc3b83 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 
 			area = &(zone->free_area[order]);
 
-			list_for_each(curr, &area->free_list[mtype])
+			list_for_each(curr, &area->free_list[mtype]) {
 				freecount++;
+				/*
+				 * Cap the free_list iteration because it might
+				 * be really large and we are under a spinlock
+				 * so a long time spent here could trigger a
+				 * hard lockup detector. Anyway this is a
+				 * debugging tool so knowing there is a handful
+				 * of pages in this order should be more than
+				 * sufficient
+				 */
+				if (freecount > 100000) {
+					seq_printf(m, ">%6lu ", freecount);
+					spin_unlock_irq(&zone->lock);
+					cond_resched();
+					spin_lock_irq(&zone->lock);
+					continue;
+				}
+			}
 			seq_printf(m, "%6lu ", freecount);
 		}
 		seq_putc(m, '\n');
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23  9:56       ` Mel Gorman
  2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
@ 2019-10-23 12:42         ` Qian Cai
  2019-10-23 13:25         ` Vlastimil Babka
  2 siblings, 0 replies; 67+ messages in thread
From: Qian Cai @ 2019-10-23 12:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Mel Gorman, Waiman Long, Andrew Morton, linux-mm,
	linux-kernel, Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini



> On Oct 23, 2019, at 5:56 AM, Mel Gorman <mgorman@suse.de> wrote:
> 
> Again, the cost is when reading a proc file. From what Andrew said,
> the lock is necessary to safely walk the list. If anything, I would
> be ok with limiting the length of the walk but honestly, I would also
> be ok with simply deleting the proc file. The utility for debugging a
> problem with it is limited now (it was more important when fragmentation
> avoidance was first introduced) and there is little an admin can do with
> the information. I can't remember the last time I asked for the contents
> of the file when trying to debug a problem. There is a possibility that
> someone will complain but I'm not aware of any utility that reads the
> information and does something useful with it. In the unlikely event
> something breaks, the file can be re-added with a limited walk.

Agree. It is better to remove this file altogether in linux-next first for a while to see if anyone complains loudly.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
@ 2019-10-23 13:13             ` Mel Gorman
  2019-10-23 13:27             ` Vlastimil Babka
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 67+ messages in thread
From: Mel Gorman @ 2019-10-23 13:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Waiman Long, Johannes Weiner, Roman Gushchin,
	Vlastimil Babka, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML, Michal Hocko

On Wed, Oct 23, 2019 at 12:27:36PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state wrt to fragmentation. It is not very useful for
> any other use so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
@ 2019-10-23 13:15             ` Mel Gorman
  2019-10-23 13:32             ` Vlastimil Babka
                               ` (5 subsequent siblings)
  6 siblings, 0 replies; 67+ messages in thread
From: Mel Gorman @ 2019-10-23 13:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Waiman Long, Johannes Weiner, Roman Gushchin,
	Vlastimil Babka, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML, Michal Hocko

On Wed, Oct 23, 2019 at 12:27:37PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
> 
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> 
> The limit has been chosen arbitrarily and it is subject to a future
> change should there be a need for that.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

You could have used need_resched() instead of unconditionally dropping the
lock but that's very minor for a proc file and it would allow a parallel
allocation to go ahead so

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23  9:56       ` Mel Gorman
  2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
  2019-10-23 12:42         ` [PATCH] " Qian Cai
@ 2019-10-23 13:25         ` Vlastimil Babka
  2 siblings, 0 replies; 67+ messages in thread
From: Vlastimil Babka @ 2019-10-23 13:25 UTC (permalink / raw)
  To: Mel Gorman, Michal Hocko
  Cc: Mel Gorman, Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini,
	David Rientjes

On 10/23/19 11:56 AM, Mel Gorman wrote:
> You also have to iterate over them all later in the same function. If the
> free counts are per migrate type then they would have to be iterated over
> every time.
> 
> Similarly, there would be multiple places where all the counters would
> have to be iterated -- find_suitable_fallback, show_free_areas,
> fast_isolate_freepages, fill_contig_page_info, zone_init_free_lists etc.
> 
> It'd be a small cost but given that it's aimed at fixing a problem with
> reading pagetypeinfo, is it really worth it? I don't think so.


I think the largest issue would be that 1) the migratetype would have to
be stored somewhere (ok, perhaps that's not an issue as free pages have
plenty of space in struct page), and 2) free page merging code
(__free_one_page()) would have to start looking at the migratetype and
fix up the counters - merging between migratetypes is not prohibited,
for good reasons. IIRC David's patch was missing that part. So I would
also prefer to avoid that.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
  2019-10-23 13:13             ` Mel Gorman
@ 2019-10-23 13:27             ` Vlastimil Babka
  2019-10-23 14:52             ` Waiman Long
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 67+ messages in thread
From: Vlastimil Babka @ 2019-10-23 13:27 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko

On 10/23/19 12:27 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state with respect to fragmentation. It is not very useful
> for anything else, so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmstat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..4e885ecd44d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
>  #endif
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
> -	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
> +	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
>  	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
>  	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
>  #endif
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
  2019-10-23 13:15             ` Mel Gorman
@ 2019-10-23 13:32             ` Vlastimil Babka
  2019-10-23 13:37               ` Michal Hocko
  2019-10-23 13:46               ` Rafael Aquini
  2019-10-23 14:56             ` Waiman Long
                               ` (4 subsequent siblings)
  6 siblings, 2 replies; 67+ messages in thread
From: Vlastimil Babka @ 2019-10-23 13:32 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko

On 10/23/19 12:27 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
> 
> Considering that pagetypeinfo is a debugging tool, we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes, therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> 
> The limit has been chosen arbitrarily; it is subject to future
> change should there be a need for that.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Hmm dunno, I would rather e.g. hide the file behind some config or boot
option than do this. Or move it to /sys/kernel/debug ?

> ---
>  mm/vmstat.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..762034fc3b83 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  
>  			area = &(zone->free_area[order]);
>  
> -			list_for_each(curr, &area->free_list[mtype])
> +			list_for_each(curr, &area->free_list[mtype]) {
>  				freecount++;
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than
> +				 * sufficient
> +				 */
> +				if (freecount > 100000) {
> +					seq_printf(m, ">%6lu ", freecount);
> +					spin_unlock_irq(&zone->lock);
> +					cond_resched();
> +					spin_lock_irq(&zone->lock);
> +					continue;
> +				}
> +			}
>  			seq_printf(m, "%6lu ", freecount);
>  		}
>  		seq_putc(m, '\n');
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 13:32             ` Vlastimil Babka
@ 2019-10-23 13:37               ` Michal Hocko
  2019-10-23 13:48                 ` Vlastimil Babka
  2019-10-23 13:46               ` Rafael Aquini
  1 sibling, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 13:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML

On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
> On 10/23/19 12:27 PM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> > This is not really nice because it blocks both any interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> > 
> > Considering that pagetypeinfo is a debugging tool, we do not really need
> > exact numbers here. The primary reason to look at the output is to see
> > how pageblocks are spread among different migratetypes, therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> > 
> > The new output will simply tell
> > [...]
> > Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> > 
> > instead of
> > Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> > 
> > The limit has been chosen arbitrarily; it is subject to future
> > change should there be a need for that.
> > 
> > Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> option than do this. Or move it to /sys/kernel/debug ?

But those wouldn't really help prevent the lockup, right?
Besides that, who would enable that config, and how much of a difference
would root-only vs. debugfs make?

Is the incomplete value a real problem?

> > ---
> >  mm/vmstat.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  
> >  			area = &(zone->free_area[order]);
> >  
> > -			list_for_each(curr, &area->free_list[mtype])
> > +			list_for_each(curr, &area->free_list[mtype]) {
> >  				freecount++;
> > +				/*
> > +				 * Cap the free_list iteration because it might
> > +				 * be really large and we are under a spinlock
> > +				 * so a long time spent here could trigger a
> > +				 * hard lockup detector. Anyway this is a
> > +				 * debugging tool so knowing there is a handful
> > +				 * of pages in this order should be more than
> > +				 * sufficient
> > +				 */
> > +				if (freecount > 100000) {
> > +					seq_printf(m, ">%6lu ", freecount);
> > +					spin_unlock_irq(&zone->lock);
> > +					cond_resched();
> > +					spin_lock_irq(&zone->lock);
> > +					continue;
> > +				}
> > +			}
> >  			seq_printf(m, "%6lu ", freecount);
> >  		}
> >  		seq_putc(m, '\n');
> > 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 13:32             ` Vlastimil Babka
  2019-10-23 13:37               ` Michal Hocko
@ 2019-10-23 13:46               ` Rafael Aquini
  1 sibling, 0 replies; 67+ messages in thread
From: Rafael Aquini @ 2019-10-23 13:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long,
	Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, linux-mm, LKML,
	Michal Hocko

On Wed, Oct 23, 2019 at 03:32:05PM +0200, Vlastimil Babka wrote:
> On 10/23/19 12:27 PM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> > This is not really nice because it blocks both any interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> > 
> > Considering that pagetypeinfo is a debugging tool, we do not really need
> > exact numbers here. The primary reason to look at the output is to see
> > how pageblocks are spread among different migratetypes, therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> > 
> > The new output will simply tell
> > [...]
> > Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> > 
> > instead of
> > Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> > 
> > The limit has been chosen arbitrarily; it is subject to future
> > change should there be a need for that.
> > 
> > Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> option than do this. Or move it to /sys/kernel/debug ?
>

You beat me to it. I was going to suggest moving it to debugfs, as well.


 
> > ---
> >  mm/vmstat.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  
> >  			area = &(zone->free_area[order]);
> >  
> > -			list_for_each(curr, &area->free_list[mtype])
> > +			list_for_each(curr, &area->free_list[mtype]) {
> >  				freecount++;
> > +				/*
> > +				 * Cap the free_list iteration because it might
> > +				 * be really large and we are under a spinlock
> > +				 * so a long time spent here could trigger a
> > +				 * hard lockup detector. Anyway this is a
> > +				 * debugging tool so knowing there is a handful
> > +				 * of pages in this order should be more than
> > +				 * sufficient
> > +				 */
> > +				if (freecount > 100000) {
> > +					seq_printf(m, ">%6lu ", freecount);
> > +					spin_unlock_irq(&zone->lock);
> > +					cond_resched();
> > +					spin_lock_irq(&zone->lock);
> > +					continue;
> > +				}
> > +			}
> >  			seq_printf(m, "%6lu ", freecount);
> >  		}
> >  		seq_putc(m, '\n');
> > 
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 13:37               ` Michal Hocko
@ 2019-10-23 13:48                 ` Vlastimil Babka
  2019-10-23 14:31                   ` Michal Hocko
  0 siblings, 1 reply; 67+ messages in thread
From: Vlastimil Babka @ 2019-10-23 13:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML

On 10/23/19 3:37 PM, Michal Hocko wrote:
> On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
>> On 10/23/19 12:27 PM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
>>> This is not really nice because it blocks both any interrupts on that
>>> cpu and the page allocator. On large machines this might even trigger
>>> the hard lockup detector.
>>>
>>> Considering that pagetypeinfo is a debugging tool, we do not really need
>>> exact numbers here. The primary reason to look at the output is to see
>>> how pageblocks are spread among different migratetypes, therefore putting
>>> a bound on the number of pages on the free_list sounds like a reasonable
>>> tradeoff.
>>>
>>> The new output will simply tell
>>> [...]
>>> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
>>>
>>> instead of
>>> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
>>>
>>> The limit has been chosen arbitrarily; it is subject to future
>>> change should there be a need for that.
>>>
>>> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>
>> Hmm dunno, I would rather e.g. hide the file behind some config or boot
>> option than do this. Or move it to /sys/kernel/debug ?
> 
> But those wouldn't really help prevent the lockup, right?

No, but it would perhaps help ensure that only people who know what they
are doing (or have been told to by a developer, e.g. on linux-mm) will try
to collect the data, and not some automatic monitoring tools taking
periodic snapshots of stuff in /proc that looks interesting.

> Besides that who would enable that config and how much of a difference
> would root only vs. debugfs make?

I would hope those tools don't scrape debugfs as much as /proc, but I
might be wrong of course :)

> Is the incomplete value a real problem?

Hmm, perhaps not. If the overflow happens only for one migratetype, one
can also use /proc/buddyinfo to get the exact count, as was proposed
in this thread for the Movable migratetype.

>>> ---
>>>  mm/vmstat.c | 19 ++++++++++++++++++-
>>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 4e885ecd44d1..762034fc3b83 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>>  
>>>  			area = &(zone->free_area[order]);
>>>  
>>> -			list_for_each(curr, &area->free_list[mtype])
>>> +			list_for_each(curr, &area->free_list[mtype]) {
>>>  				freecount++;
>>> +				/*
>>> +				 * Cap the free_list iteration because it might
>>> +				 * be really large and we are under a spinlock
>>> +				 * so a long time spent here could trigger a
>>> +				 * hard lockup detector. Anyway this is a
>>> +				 * debugging tool so knowing there is a handful
>>> +				 * of pages in this order should be more than
>>> +				 * sufficient
>>> +				 */
>>> +				if (freecount > 100000) {
>>> +					seq_printf(m, ">%6lu ", freecount);
>>> +					spin_unlock_irq(&zone->lock);
>>> +					cond_resched();
>>> +					spin_lock_irq(&zone->lock);
>>> +					continue;
>>> +				}
>>> +			}
>>>  			seq_printf(m, "%6lu ", freecount);
>>>  		}
>>>  		seq_putc(m, '\n');
>>>
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-22 21:59 ` Andrew Morton
  2019-10-23  6:15   ` Michal Hocko
@ 2019-10-23 14:30   ` Waiman Long
  2019-10-23 14:48       ` Qian Cai
  1 sibling, 1 reply; 67+ messages in thread
From: Waiman Long @ 2019-10-23 14:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini

On 10/22/19 5:59 PM, Andrew Morton wrote:
> On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> The pagetypeinfo_showfree_print() function prints out the number of
>> free blocks for each of the page orders and migrate types. The current
>> code just iterates each of the free lists to get the counts.  There are
>> bug reports about hard lockup panics when reading the /proc/pagetypeinfo
>> file just because it takes too long to iterate all the free lists within
>> a zone while holding the zone lock with irq disabled.
>>
>> Given the fact that /proc/pagetypeinfo is readable by all, the possibility
>> of crashing a system by the simple act of reading /proc/pagetypeinfo
>> by any user is a security problem that needs to be addressed.
> Yes.
>
>> There is a free_area structure associated with each page order. There
>> is also a nr_free count within the free_area for all the different
>> migration types combined. Tracking the number of free list entries
>> for each migration type will probably add some overhead to the fast
>> paths like moving pages from one migration type to another which may
>> not be desirable.
>>
>> We can actually skip iterating the list of one of the migration types
>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
>> is usually the largest one on large memory systems, this is the one
>> to be skipped. Since the printing order is migration-type => order, we
>> will have to store the counts in an internal 2D array before printing
>> them out.
>>
>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>> zone lock for too long, blocking other zone lock waiters from being
>> run. This can be problematic for systems with a large amount of memory.
>> So a check is added to temporarily release the lock and reschedule if
>> more than 64k of list entries have been iterated for each order. With
>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>> entries before releasing the lock.
>>
>> ...
>>
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  					pg_data_t *pgdat, struct zone *zone)
>>  {
>>  	int order, mtype;
>> +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> 600+ bytes is a bit much.  I guess it's OK in this situation.
>
This function is called by reading /proc/pagetypeinfo. The call stack is
rather shallow:

PID: 58188  TASK: ffff938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
 #0 [ffff9483bf488e48] crash_nmi_callback at ffffffffb8c551d7
 #1 [ffff9483bf488e58] nmi_handle at ffffffffb931d8cc
 #2 [ffff9483bf488eb0] do_nmi at ffffffffb931dba8
 #3 [ffff9483bf488ef0] end_repeat_nmi at ffffffffb931cd69
    [exception RIP: pagetypeinfo_showfree_print+0x73]
    RIP: ffffffffb8db7173  RSP: ffff938b9fcbfda0  RFLAGS: 00000006
    RAX: fffff0c9946d7020  RBX: ffff96073ffd5528  RCX: 0000000000000000
    RDX: 00000000001c7764  RSI: ffffffffb9676ab1  RDI: 0000000000000000
    RBP: ffff938b9fcbfdd0   R8: 000000000000000a   R9: 00000000fffffffe
    R10: 0000000000000000  R11: ffff938b9fcbfc36  R12: ffff942b97758240
    R13: ffffffffb942f730  R14: ffff96073ffd5000  R15: ffff96073ffd5180
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff938b9fcbfda0] pagetypeinfo_showfree_print at ffffffffb8db7173
 #5 [ffff938b9fcbfdd8] walk_zones_in_node at ffffffffb8db74df
 #6 [ffff938b9fcbfe20] pagetypeinfo_show at ffffffffb8db7a29
 #7 [ffff938b9fcbfe48] seq_read at ffffffffb8e45c3c
 #8 [ffff938b9fcbfeb8] proc_reg_read at ffffffffb8e95070
 #9 [ffff938b9fcbfed8] vfs_read at ffffffffb8e1f2af
#10 [ffff938b9fcbff08] sys_read at ffffffffb8e2017f
#11 [ffff938b9fcbff50] system_call_fastpath at ffffffffb932579b

So we should not be in any risk of overflowing the stack.

>> -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>> -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> -					pgdat->node_id,
>> -					zone->name,
>> -					migratetype_names[mtype]);
>> -		for (order = 0; order < MAX_ORDER; ++order) {
>> +	lockdep_assert_held(&zone->lock);
>> +	lockdep_assert_irqs_disabled();
>> +
>> +	/*
>> +	 * MIGRATE_MOVABLE is usually the largest one in large memory
>> +	 * systems. We skip iterating that list. Instead, we compute it by
>> +	 * subtracting the total of the rests from free_area->nr_free.
>> +	 */
>> +	for (order = 0; order < MAX_ORDER; ++order) {
>> +		unsigned long nr_total = 0;
>> +		struct free_area *area = &(zone->free_area[order]);
>> +
>> +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>>  			unsigned long freecount = 0;
>> -			struct free_area *area;
>>  			struct list_head *curr;
>>  
>> -			area = &(zone->free_area[order]);
>> -
>> +			if (mtype == MIGRATE_MOVABLE)
>> +				continue;
>>  			list_for_each(curr, &area->free_list[mtype])
>>  				freecount++;
>> -			seq_printf(m, "%6lu ", freecount);
>> +			nfree[order][mtype] = freecount;
>> +			nr_total += freecount;
>>  		}
>> +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
>> +
>> +		/*
>> +		 * If we have already iterated more than 64k of list
>> +		 * entries, we might have hold the zone lock for too long.
>> +		 * Temporarily release the lock and reschedule before
>> +		 * continuing so that other lock waiters have a chance
>> +		 * to run.
>> +		 */
>> +		if (nr_total > (1 << 16)) {
>> +			spin_unlock_irq(&zone->lock);
>> +			cond_resched();
>> +			spin_lock_irq(&zone->lock);
>> +		}
>> +	}
>> +
>> +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>> +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> +					pgdat->node_id,
>> +					zone->name,
>> +					migratetype_names[mtype]);
>> +		for (order = 0; order < MAX_ORDER; ++order)
>> +			seq_printf(m, "%6lu ", nfree[order][mtype]);
>>  		seq_putc(m, '\n');
> This is not exactly a thing of beauty :( Presumably there might still
> be situations where the irq-off times remain excessive.
Yes, that is still possible.
>
> Why are we actually holding zone->lock so much?  Can we get away with
> holding it across the list_for_each() loop and nothing else?  If so,

We can certainly do that with the risk that the counts will be less
reliable for a given order. I can send a v2 patch if you think this is
safer.


> this still isn't a bulletproof fix.  Maybe just terminate the list
> walk if freecount reaches 1024.  Would anyone really care?
>
> Sigh.  I wonder if anyone really uses this thing for anything
> important.  Can we just remove it all?
>
Removing it would break the kernel API.

Another alternative is to mark the migration type in the page structure
so that we can do per-migration-type nr_free tracking. That would be a
major change to the mm code.

I consider this patch the lesser of the two evils.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 13:48                 ` Vlastimil Babka
@ 2019-10-23 14:31                   ` Michal Hocko
  2019-10-23 16:20                     ` Vlastimil Babka
  0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 14:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML

On Wed 23-10-19 15:48:36, Vlastimil Babka wrote:
> On 10/23/19 3:37 PM, Michal Hocko wrote:
> > On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
> >> On 10/23/19 12:27 PM, Michal Hocko wrote:
> >>> From: Michal Hocko <mhocko@suse.com>
> >>>
> >>> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> >>> This is not really nice because it blocks both any interrupts on that
> >>> cpu and the page allocator. On large machines this might even trigger
> >>> the hard lockup detector.
> >>>
> >>> Considering that pagetypeinfo is a debugging tool, we do not really need
> >>> exact numbers here. The primary reason to look at the output is to see
> >>> how pageblocks are spread among different migratetypes, therefore putting
> >>> a bound on the number of pages on the free_list sounds like a reasonable
> >>> tradeoff.
> >>>
> >>> The new output will simply tell
> >>> [...]
> >>> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> >>>
> >>> instead of
> >>> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> >>>
> >>> The limit has been chosen arbitrarily; it is subject to future
> >>> change should there be a need for that.
> >>>
> >>> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> >>> Signed-off-by: Michal Hocko <mhocko@suse.com>
> >>
> >> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> >> option than do this. Or move it to /sys/kernel/debug ?
> > 
> > But those wouldn't really help to prevent from the lockup, right?
> 
> No, but it would perhaps help ensure that only people who know what they
> are doing (or been told so by a developer e.g. on linux-mm) will try to
> collect the data, and not some automatic monitoring tools taking
> periodic snapshots of stuff in /proc that looks interesting.

Well, we do trust root not to do harm, right?

> > Besides that who would enable that config and how much of a difference
> > would root only vs. debugfs make?
> 
> I would hope those tools don't scrap debugfs as much as /proc, but I
> might be wrong of course :)
> 
> > Is the incomplete value a real problem?
> 
> Hmm perhaps not. If the overflow happens only for one migratetype, one
> can use also /proc/buddyinfo to get to the exact count, as was proposed
> in this thread for Movable migratetype.

Let's say this won't be the case. What is the worst that the
imprecision could cause? In other words, does it really matter whether
we have 100k pages on the free list of a specific migrate type and
order, or say 200k?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 14:30   ` Waiman Long
@ 2019-10-23 14:48       ` Qian Cai
  0 siblings, 0 replies; 67+ messages in thread
From: Qian Cai @ 2019-10-23 14:48 UTC (permalink / raw)
  To: Waiman Long, Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini

On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> On 10/22/19 5:59 PM, Andrew Morton wrote:
> > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <longman@redhat.com> wrote:
> > 
> > > The pagetypeinfo_showfree_print() function prints out the number of
> > > free blocks for each of the page orders and migrate types. The current
> > > code just iterates each of the free lists to get the counts.  There are
> > > bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> > > file just because it takes too long to iterate all the free lists within
> > > a zone while holding the zone lock with irq disabled.
> > > 
> > > Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > by any user is a security problem that needs to be addressed.
> > 
> > Yes.
> > 
> > > There is a free_area structure associated with each page order. There
> > > is also a nr_free count within the free_area for all the different
> > > migration types combined. Tracking the number of free list entries
> > > for each migration type will probably add some overhead to the fast
> > > paths like moving pages from one migration type to another which may
> > > not be desirable.
> > > 
> > > We can actually skip iterating the list of one of the migration types
> > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > is usually the largest one on large memory systems, this is the one
> > > to be skipped. Since the printing order is migration-type => order, we
> > > will have to store the counts in an internal 2D array before printing
> > > them out.
> > > 
> > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > zone lock for too long, blocking other zone lock waiters from being
> > > run. This can be problematic for systems with a large amount of memory.
> > > So a check is added to temporarily release the lock and reschedule if
> > > more than 64k of list entries have been iterated for each order. With
> > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > entries before releasing the lock.
> > > 
> > > ...
> > > 
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >  					pg_data_t *pgdat, struct zone *zone)
> > >  {
> > >  	int order, mtype;
> > > +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> > 
> > 600+ bytes is a bit much.  I guess it's OK in this situation.
> > 
> 
> This function is called by reading /proc/pagetypeinfo. The call stack is
> rather shallow:
> 
> PID: 58188  TASK: ffff938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
>  #0 [ffff9483bf488e48] crash_nmi_callback at ffffffffb8c551d7
>  #1 [ffff9483bf488e58] nmi_handle at ffffffffb931d8cc
>  #2 [ffff9483bf488eb0] do_nmi at ffffffffb931dba8
>  #3 [ffff9483bf488ef0] end_repeat_nmi at ffffffffb931cd69
>     [exception RIP: pagetypeinfo_showfree_print+0x73]
>     RIP: ffffffffb8db7173  RSP: ffff938b9fcbfda0  RFLAGS: 00000006
>     RAX: fffff0c9946d7020  RBX: ffff96073ffd5528  RCX: 0000000000000000
>     RDX: 00000000001c7764  RSI: ffffffffb9676ab1  RDI: 0000000000000000
>     RBP: ffff938b9fcbfdd0   R8: 000000000000000a   R9: 00000000fffffffe
>     R10: 0000000000000000  R11: ffff938b9fcbfc36  R12: ffff942b97758240
>     R13: ffffffffb942f730  R14: ffff96073ffd5000  R15: ffff96073ffd5180
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff938b9fcbfda0] pagetypeinfo_showfree_print at ffffffffb8db7173
>  #5 [ffff938b9fcbfdd8] walk_zones_in_node at ffffffffb8db74df
>  #6 [ffff938b9fcbfe20] pagetypeinfo_show at ffffffffb8db7a29
>  #7 [ffff938b9fcbfe48] seq_read at ffffffffb8e45c3c
>  #8 [ffff938b9fcbfeb8] proc_reg_read at ffffffffb8e95070
>  #9 [ffff938b9fcbfed8] vfs_read at ffffffffb8e1f2af
> #10 [ffff938b9fcbff08] sys_read at ffffffffb8e2017f
> #11 [ffff938b9fcbff50] system_call_fastpath at ffffffffb932579b
> 
> So we should not be in any risk of overflowing the stack.
> 
> > > -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > -					pgdat->node_id,
> > > -					zone->name,
> > > -					migratetype_names[mtype]);
> > > -		for (order = 0; order < MAX_ORDER; ++order) {
> > > +	lockdep_assert_held(&zone->lock);
> > > +	lockdep_assert_irqs_disabled();
> > > +
> > > +	/*
> > > +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> > > +	 * systems. We skip iterating that list. Instead, we compute it by
> > > +	 * subtracting the total of the rest from free_area->nr_free.
> > > +	 */
> > > +	for (order = 0; order < MAX_ORDER; ++order) {
> > > +		unsigned long nr_total = 0;
> > > +		struct free_area *area = &(zone->free_area[order]);
> > > +
> > > +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > >  			unsigned long freecount = 0;
> > > -			struct free_area *area;
> > >  			struct list_head *curr;
> > >  
> > > -			area = &(zone->free_area[order]);
> > > -
> > > +			if (mtype == MIGRATE_MOVABLE)
> > > +				continue;
> > >  			list_for_each(curr, &area->free_list[mtype])
> > >  				freecount++;
> > > -			seq_printf(m, "%6lu ", freecount);
> > > +			nfree[order][mtype] = freecount;
> > > +			nr_total += freecount;
> > >  		}
> > > +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > > +
> > > +		/*
> > > +		 * If we have already iterated more than 64k of list
> > > +		 * entries, we might have held the zone lock for too long.
> > > +		 * Temporarily release the lock and reschedule before
> > > +		 * continuing so that other lock waiters have a chance
> > > +		 * to run.
> > > +		 */
> > > +		if (nr_total > (1 << 16)) {
> > > +			spin_unlock_irq(&zone->lock);
> > > +			cond_resched();
> > > +			spin_lock_irq(&zone->lock);
> > > +		}
> > > +	}
> > > +
> > > +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > +					pgdat->node_id,
> > > +					zone->name,
> > > +					migratetype_names[mtype]);
> > > +		for (order = 0; order < MAX_ORDER; ++order)
> > > +			seq_printf(m, "%6lu ", nfree[order][mtype]);
> > >  		seq_putc(m, '\n');
> > 
> > This is not exactly a thing of beauty :( Presumably there might still
> > be situations where the irq-off times remain excessive.
> 
> Yes, that is still possible.
> > 
> > Why are we actually holding zone->lock so much?  Can we get away with
> > holding it across the list_for_each() loop and nothing else?  If so,
> 
> We can certainly do that with the risk that the counts will be less
> reliable for a given order. I can send a v2 patch if you think this is
> safer.
> 
> 
> > this still isn't a bulletproof fix.  Maybe just terminate the list
> > walk if freecount reaches 1024.  Would anyone really care?
> > 
> > Sigh.  I wonder if anyone really uses this thing for anything
> > important.  Can we just remove it all?
> > 
> 
> Removing it would break a kernel API.

Who cares about breaking this part of the API when essentially nobody uses
this file?

> 
> Another alternative is to mark the migration type in the page structure
> so that we can do per-migration type nr_free tracking. That will be a
> major change to the mm code.
> 
> I consider this patch the lesser of the two evils.
> 
> Cheers,
> Longman
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread


* Re: [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
  2019-10-23 13:13             ` Mel Gorman
  2019-10-23 13:27             ` Vlastimil Babka
@ 2019-10-23 14:52             ` Waiman Long
  2019-10-23 15:10             ` Rafael Aquini
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-23 14:52 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko

On 10/23/19 6:27 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state wrt fragmentation. It is not very useful for
> anything else, so normal users really do not need to read this file.
>
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
>
> Reduce both issues by simply not exporting the file to regular users.
>
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..4e885ecd44d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
>  #endif
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
> -	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
> +	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
>  	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
>  	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
>  #endif

Acked-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
  2019-10-23 13:15             ` Mel Gorman
  2019-10-23 13:32             ` Vlastimil Babka
@ 2019-10-23 14:56             ` Waiman Long
  2019-10-23 15:21               ` Waiman Long
  2019-10-23 16:10               ` Michal Hocko
  2019-10-23 16:15             ` Vlastimil Babka
                               ` (3 subsequent siblings)
  6 siblings, 2 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-23 14:56 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko

On 10/23/19 6:27 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> This is not really nice because it blocks both interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
>
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
>
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
>
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
>
> The limit has been chosen arbitrarily and is subject to future
> change should there be a need for that.
>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..762034fc3b83 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  
>  			area = &(zone->free_area[order]);
>  
> -			list_for_each(curr, &area->free_list[mtype])
> +			list_for_each(curr, &area->free_list[mtype]) {
>  				freecount++;
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than
> +				 * sufficient
> +				 */
> +				if (freecount > 100000) {
> +					seq_printf(m, ">%6lu ", freecount);
> +					spin_unlock_irq(&zone->lock);
> +					cond_resched();
> +					spin_lock_irq(&zone->lock);
> +					continue;
list_for_each() is a for loop. The continue statement will just iterate
the rest with the possibility that curr will be stale. Should we use
goto to jump after the seq_print() below?
> +				}
> +			}
>  			seq_printf(m, "%6lu ", freecount);
>  		}
>  		seq_putc(m, '\n');

Cheers,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 14:48       ` Qian Cai
  (?)
@ 2019-10-23 15:01       ` Waiman Long
  2019-10-23 15:05           ` Qian Cai
                           ` (2 more replies)
  -1 siblings, 3 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-23 15:01 UTC (permalink / raw)
  To: Qian Cai, Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini

On 10/23/19 10:48 AM, Qian Cai wrote:
>>> this still isn't a bulletproof fix.  Maybe just terminate the list
>>> walk if freecount reaches 1024.  Would anyone really care?
>>>
>>> Sigh.  I wonder if anyone really uses this thing for anything
>>> important.  Can we just remove it all?
>>>
>> Removing it would break a kernel API.
> Who cares about breaking this part of the API when essentially nobody uses
> this file?
>
There are certainly tools that use /proc/pagetypeinfo and this is how
the problem was found. I am not against removing it, but we have to be
careful and deprecate it in a way that minimizes user impact.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 14:48       ` Qian Cai
  (?)
  (?)
@ 2019-10-23 15:03       ` Rafael Aquini
  2019-10-23 15:51           ` Qian Cai
  -1 siblings, 1 reply; 67+ messages in thread
From: Rafael Aquini @ 2019-10-23 15:03 UTC (permalink / raw)
  To: Qian Cai
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman

On Wed, Oct 23, 2019 at 10:48:13AM -0400, Qian Cai wrote:
> On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> > On 10/22/19 5:59 PM, Andrew Morton wrote:
> > > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <longman@redhat.com> wrote:
> > > 
> > > > The pagetypeinfo_showfree_print() function prints out the number of
> > > > free blocks for each of the page orders and migrate types. The current
> > > > code just iterates each of the free lists to get counts.  There are
> > > > bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> > > > file just because it takes too long to iterate all the free lists within
> > > > a zone while holding the zone lock with irq disabled.
> > > > 
> > > > Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> > > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > > by any user is a security problem that needs to be addressed.
> > > 
> > > Yes.
> > > 
> > > > There is a free_area structure associated with each page order. There
> > > > is also a nr_free count within the free_area for all the different
> > > > migration types combined. Tracking the number of free list entries
> > > > for each migration type will probably add some overhead to the fast
> > > > paths like moving pages from one migration type to another which may
> > > > not be desirable.
> > > > 
> > > > We can actually skip iterating the list of one of the migration types
> > > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > > is usually the largest one on large memory systems, this is the one
> > > > to be skipped. Since the printing order is migration-type => order, we
> > > > will have to store the counts in an internal 2D array before printing
> > > > them out.
> > > > 
> > > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > > zone lock for too long, blocking out other zone lock waiters from being
> > > > run. This can be problematic for systems with large amounts of memory.
> > > > So a check is added to temporarily release the lock and reschedule if
> > > > more than 64k of list entries have been iterated for each order. With
> > > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > > entries before releasing the lock.
> > > > 
> > > > ...
> > > > 
> > > > --- a/mm/vmstat.c
> > > > +++ b/mm/vmstat.c
> > > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > > >  					pg_data_t *pgdat, struct zone *zone)
> > > >  {
> > > >  	int order, mtype;
> > > > +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> > > 
> > > 600+ bytes is a bit much.  I guess it's OK in this situation.
> > > 
> > 
> > This function is called by reading /proc/pagetypeinfo. The call stack is
> > rather shallow:
> > 
> > PID: 58188  TASK: ffff938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
> >  #0 [ffff9483bf488e48] crash_nmi_callback at ffffffffb8c551d7
> >  #1 [ffff9483bf488e58] nmi_handle at ffffffffb931d8cc
> >  #2 [ffff9483bf488eb0] do_nmi at ffffffffb931dba8
> >  #3 [ffff9483bf488ef0] end_repeat_nmi at ffffffffb931cd69
> >     [exception RIP: pagetypeinfo_showfree_print+0x73]
> >     RIP: ffffffffb8db7173  RSP: ffff938b9fcbfda0  RFLAGS: 00000006
> >     RAX: fffff0c9946d7020  RBX: ffff96073ffd5528  RCX: 0000000000000000
> >     RDX: 00000000001c7764  RSI: ffffffffb9676ab1  RDI: 0000000000000000
> >     RBP: ffff938b9fcbfdd0   R8: 000000000000000a   R9: 00000000fffffffe
> >     R10: 0000000000000000  R11: ffff938b9fcbfc36  R12: ffff942b97758240
> >     R13: ffffffffb942f730  R14: ffff96073ffd5000  R15: ffff96073ffd5180
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > --- <NMI exception stack> ---
> >  #4 [ffff938b9fcbfda0] pagetypeinfo_showfree_print at ffffffffb8db7173
> >  #5 [ffff938b9fcbfdd8] walk_zones_in_node at ffffffffb8db74df
> >  #6 [ffff938b9fcbfe20] pagetypeinfo_show at ffffffffb8db7a29
> >  #7 [ffff938b9fcbfe48] seq_read at ffffffffb8e45c3c
> >  #8 [ffff938b9fcbfeb8] proc_reg_read at ffffffffb8e95070
> >  #9 [ffff938b9fcbfed8] vfs_read at ffffffffb8e1f2af
> > #10 [ffff938b9fcbff08] sys_read at ffffffffb8e2017f
> > #11 [ffff938b9fcbff50] system_call_fastpath at ffffffffb932579b
> > 
> > So we should not be in any risk of overflowing the stack.
> > 
> > > > -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > > -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > > -					pgdat->node_id,
> > > > -					zone->name,
> > > > -					migratetype_names[mtype]);
> > > > -		for (order = 0; order < MAX_ORDER; ++order) {
> > > > +	lockdep_assert_held(&zone->lock);
> > > > +	lockdep_assert_irqs_disabled();
> > > > +
> > > > +	/*
> > > > +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> > > > +	 * systems. We skip iterating that list. Instead, we compute it by
> > > > +	 * subtracting the total of the rest from free_area->nr_free.
> > > > +	 */
> > > > +	for (order = 0; order < MAX_ORDER; ++order) {
> > > > +		unsigned long nr_total = 0;
> > > > +		struct free_area *area = &(zone->free_area[order]);
> > > > +
> > > > +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > >  			unsigned long freecount = 0;
> > > > -			struct free_area *area;
> > > >  			struct list_head *curr;
> > > >  
> > > > -			area = &(zone->free_area[order]);
> > > > -
> > > > +			if (mtype == MIGRATE_MOVABLE)
> > > > +				continue;
> > > >  			list_for_each(curr, &area->free_list[mtype])
> > > >  				freecount++;
> > > > -			seq_printf(m, "%6lu ", freecount);
> > > > +			nfree[order][mtype] = freecount;
> > > > +			nr_total += freecount;
> > > >  		}
> > > > +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > > > +
> > > > +		/*
> > > > +		 * If we have already iterated more than 64k of list
> > > > +		 * entries, we might have held the zone lock for too long.
> > > > +		 * Temporarily release the lock and reschedule before
> > > > +		 * continuing so that other lock waiters have a chance
> > > > +		 * to run.
> > > > +		 */
> > > > +		if (nr_total > (1 << 16)) {
> > > > +			spin_unlock_irq(&zone->lock);
> > > > +			cond_resched();
> > > > +			spin_lock_irq(&zone->lock);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > > +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > > +					pgdat->node_id,
> > > > +					zone->name,
> > > > +					migratetype_names[mtype]);
> > > > +		for (order = 0; order < MAX_ORDER; ++order)
> > > > +			seq_printf(m, "%6lu ", nfree[order][mtype]);
> > > >  		seq_putc(m, '\n');
> > > 
> > > This is not exactly a thing of beauty :( Presumably there might still
> > > be situations where the irq-off times remain excessive.
> > 
> > Yes, that is still possible.
> > > 
> > > Why are we actually holding zone->lock so much?  Can we get away with
> > > holding it across the list_for_each() loop and nothing else?  If so,
> > 
> > We can certainly do that with the risk that the counts will be less
> > reliable for a given order. I can send a v2 patch if you think this is
> > safer.
> > 
> > 
> > > this still isn't a bulletproof fix.  Maybe just terminate the list
> > > walk if freecount reaches 1024.  Would anyone really care?
> > > 
> > > Sigh.  I wonder if anyone really uses this thing for anything
> > > important.  Can we just remove it all?
> > > 
> > 
> > Removing it would break a kernel API.
> 
> Who cares about breaking this part of the API when essentially nobody uses
> this file?
>

Seems that _some_ are using it, aren't they? Otherwise we wouldn't be
seeing complaints. Moving it out of /proc to /sys/kernel/debug looks 
like a reasonable compromise with those who care about the interface.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 15:01       ` Waiman Long
@ 2019-10-23 15:05           ` Qian Cai
  2019-10-23 22:30         ` Andrew Morton
  2019-10-24  3:33         ` Feng Tang
  2 siblings, 0 replies; 67+ messages in thread
From: Qian Cai @ 2019-10-23 15:05 UTC (permalink / raw)
  To: Waiman Long, Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini

On Wed, 2019-10-23 at 11:01 -0400, Waiman Long wrote:
> On 10/23/19 10:48 AM, Qian Cai wrote:
> > > > this still isn't a bulletproof fix.  Maybe just terminate the list
> > > > walk if freecount reaches 1024.  Would anyone really care?
> > > > 
> > > > Sigh.  I wonder if anyone really uses this thing for anything
> > > > important.  Can we just remove it all?
> > > > 
> > > 
> > > Removing it would break a kernel API.
> > 
> > Who cares about breaking this part of the API when essentially nobody uses
> > this file?
> > 
> 
> There are certainly tools that use /proc/pagetypeinfo and this is how
> the problem was found. I am not against removing it, but we have to be
> careful and deprecate it in a way that minimizes user impact.

What are those tools?

^ permalink raw reply	[flat|nested] 67+ messages in thread


* Re: [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
                               ` (2 preceding siblings ...)
  2019-10-23 14:52             ` Waiman Long
@ 2019-10-23 15:10             ` Rafael Aquini
  2019-10-23 16:15             ` Vlastimil Babka
  2019-10-24 19:01               ` David Rientjes
  5 siblings, 0 replies; 67+ messages in thread
From: Rafael Aquini @ 2019-10-23 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, linux-mm, LKML,
	Michal Hocko

On Wed, Oct 23, 2019 at 12:27:36PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state wrt fragmentation. It is not very useful for
> anything else, so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..4e885ecd44d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
>  #endif
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
> -	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
> +	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
>  	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
>  	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
>  #endif
> -- 
> 2.20.1
>
 
Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 14:56             ` Waiman Long
@ 2019-10-23 15:21               ` Waiman Long
  2019-10-23 16:10               ` Michal Hocko
  1 sibling, 0 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-23 15:21 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML, Michal Hocko

On 10/23/19 10:56 AM, Waiman Long wrote:
> On 10/23/19 6:27 AM, Michal Hocko wrote:
>> From: Michal Hocko <mhocko@suse.com>
>>
>> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
>> This is not really nice because it blocks both any interrupts on that
>> cpu and the page allocator. On large machines this might even trigger
>> the hard lockup detector.
>>
>> Considering the pagetypeinfo is a debugging tool we do not really need
>> exact numbers here. The primary reason to look at the outuput is to see
>> how pageblocks are spread among different migratetypes therefore putting
>> a bound on the number of pages on the free_list sounds like a reasonable
>> tradeoff.
>>
>> The new output will simply tell
>> [...]
>> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
>>
>> instead of
>> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
>>
>> The limit has been chosen arbitrary and it is a subject of a future
>> change should there be a need for that.
>>
>> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>> ---
>>  mm/vmstat.c | 19 ++++++++++++++++++-
>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 4e885ecd44d1..762034fc3b83 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  
>>  			area = &(zone->free_area[order]);
>>  
>> -			list_for_each(curr, &area->free_list[mtype])
>> +			list_for_each(curr, &area->free_list[mtype]) {
>>  				freecount++;
>> +				/*
>> +				 * Cap the free_list iteration because it might
>> +				 * be really large and we are under a spinlock
>> +				 * so a long time spent here could trigger a
>> +				 * hard lockup detector. Anyway this is a
>> +				 * debugging tool so knowing there is a handful
>> +				 * of pages in this order should be more than
>> +				 * sufficient
>> +				 */
>> +				if (freecount > 100000) {
>> +					seq_printf(m, ">%6lu ", freecount);

It will print ">100001" which seems a bit awkward and will be incorrect if
the count is exactly 100001. Could you just hardcode ">100000" into the seq_printf()?

Cheers,
Longman



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 15:03       ` Rafael Aquini
@ 2019-10-23 15:51           ` Qian Cai
  0 siblings, 0 replies; 67+ messages in thread
From: Qian Cai @ 2019-10-23 15:51 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman

On Wed, 2019-10-23 at 11:03 -0400, Rafael Aquini wrote:
> > > > this still isn't a bulletproof fix.  Maybe just terminate the list
> > > > walk if freecount reaches 1024.  Would anyone really care?
> > > > 
> > > > Sigh.  I wonder if anyone really uses this thing for anything
> > > > important.  Can we just remove it all?
> > > > 
> > > 
> > > Removing it will be a breakage of kernel API.
> > 
> > Who cares about breaking this part of the API that essentially nobody will use
> > this file?
> > 
> 
> Seems that _some_ are using it, aren't they? Otherwise we wouldn't be
> seeing complaints. Moving it out of /proc to /sys/kernel/debug looks 
> like a reasonable compromise with those that care about the interface.

No, there are some known testing tools that will blindly read this file just
because it exists, which is not a good reason to keep the file.

^ permalink raw reply	[flat|nested] 67+ messages in thread


* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 14:56             ` Waiman Long
  2019-10-23 15:21               ` Waiman Long
@ 2019-10-23 16:10               ` Michal Hocko
  2019-10-23 16:17                 ` Waiman Long
  1 sibling, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 16:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Roman Gushchin,
	Vlastimil Babka, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML

On Wed 23-10-19 10:56:30, Waiman Long wrote:
> On 10/23/19 6:27 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> >
> > pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> > This is not really nice because it blocks both any interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> >
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the outuput is to see
> > how pageblocks are spread among different migratetypes therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> >
> > The new output will simply tell
> > [...]
> > Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> >
> > instead of
> > Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> >
> > The limit has been chosen arbitrary and it is a subject of a future
> > change should there be a need for that.
> >
> > Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/vmstat.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  
> >  			area = &(zone->free_area[order]);
> >  
> > -			list_for_each(curr, &area->free_list[mtype])
> > +			list_for_each(curr, &area->free_list[mtype]) {
> >  				freecount++;
> > +				/*
> > +				 * Cap the free_list iteration because it might
> > +				 * be really large and we are under a spinlock
> > +				 * so a long time spent here could trigger a
> > +				 * hard lockup detector. Anyway this is a
> > +				 * debugging tool so knowing there is a handful
> > +				 * of pages in this order should be more than
> > +				 * sufficient
> > +				 */
> > +				if (freecount > 100000) {
> > +					seq_printf(m, ">%6lu ", freecount);
> > +					spin_unlock_irq(&zone->lock);
> > +					cond_resched();
> > +					spin_lock_irq(&zone->lock);
> > +					continue;
> list_for_each() is a for loop. The continue statement will just iterate
> the rest with the possibility that curr will be stale. Should we use a
> goto to jump after the seq_printf() below?

You are right. Kinda brown paper bag material. Sorry about that. What
about this on top?
--- 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 762034fc3b83..c156ce24a322 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,11 +1383,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			unsigned long freecount = 0;
 			struct free_area *area;
 			struct list_head *curr;
+			bool overflow = false;
 
 			area = &(zone->free_area[order]);
 
 			list_for_each(curr, &area->free_list[mtype]) {
-				freecount++;
 				/*
 				 * Cap the free_list iteration because it might
 				 * be really large and we are under a spinlock
@@ -1397,15 +1397,15 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 				 * of pages in this order should be more than
 				 * sufficient
 				 */
-				if (freecount > 100000) {
-					seq_printf(m, ">%6lu ", freecount);
+				if (++freecount >= 100000) {
+					overflow = true;
 					spin_unlock_irq(&zone->lock);
 					cond_resched();
 					spin_lock_irq(&zone->lock);
-					continue;
+					break;
 				}
 			}
-			seq_printf(m, "%6lu ", freecount);
+			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
 		}
 		seq_putc(m, '\n');
 	}

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
                               ` (3 preceding siblings ...)
  2019-10-23 15:10             ` Rafael Aquini
@ 2019-10-23 16:15             ` Vlastimil Babka
  2019-10-24 19:01               ` David Rientjes
  5 siblings, 0 replies; 67+ messages in thread
From: Vlastimil Babka @ 2019-10-23 16:15 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko, Linux API

+ linux-api

On 10/23/19 12:27 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state with respect to fragmentation. It is not very useful
> for anything else, so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_list.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..4e885ecd44d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
>  #endif
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
> -	proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
> +	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
>  	proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
>  	proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
>  #endif
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
                               ` (2 preceding siblings ...)
  2019-10-23 14:56             ` Waiman Long
@ 2019-10-23 16:15             ` Vlastimil Babka
  2019-10-23 16:41             ` Michal Hocko
                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 67+ messages in thread
From: Vlastimil Babka @ 2019-10-23 16:15 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko, Linux API

+ linux-api

On 10/23/19 12:27 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
> 
> Considering the pagetypeinfo is a debugging tool we do not really need
> exact numbers here. The primary reason to look at the outuput is to see
> how pageblocks are spread among different migratetypes therefore putting
> a bound on the number of pages on the free_list sounds like a reasonable
> tradeoff.
> 
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
> 
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
> 
> The limit has been chosen arbitrary and it is a subject of a future
> change should there be a need for that.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..762034fc3b83 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  
>  			area = &(zone->free_area[order]);
>  
> -			list_for_each(curr, &area->free_list[mtype])
> +			list_for_each(curr, &area->free_list[mtype]) {
>  				freecount++;
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than
> +				 * sufficient
> +				 */
> +				if (freecount > 100000) {
> +					seq_printf(m, ">%6lu ", freecount);
> +					spin_unlock_irq(&zone->lock);
> +					cond_resched();
> +					spin_lock_irq(&zone->lock);
> +					continue;
> +				}
> +			}
>  			seq_printf(m, "%6lu ", freecount);
>  		}
>  		seq_putc(m, '\n');
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 16:10               ` Michal Hocko
@ 2019-10-23 16:17                 ` Waiman Long
  2019-10-23 16:21                   ` Waiman Long
  0 siblings, 1 reply; 67+ messages in thread
From: Waiman Long @ 2019-10-23 16:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Roman Gushchin,
	Vlastimil Babka, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML

On 10/23/19 12:10 PM, Michal Hocko wrote:
> On Wed 23-10-19 10:56:30, Waiman Long wrote:
>> On 10/23/19 6:27 AM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
>>> This is not really nice because it blocks both any interrupts on that
>>> cpu and the page allocator. On large machines this might even trigger
>>> the hard lockup detector.
>>>
>>> Considering the pagetypeinfo is a debugging tool we do not really need
>>> exact numbers here. The primary reason to look at the outuput is to see
>>> how pageblocks are spread among different migratetypes therefore putting
>>> a bound on the number of pages on the free_list sounds like a reasonable
>>> tradeoff.
>>>
>>> The new output will simply tell
>>> [...]
>>> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
>>>
>>> instead of
>>> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
>>>
>>> The limit has been chosen arbitrary and it is a subject of a future
>>> change should there be a need for that.
>>>
>>> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>> ---
>>>  mm/vmstat.c | 19 ++++++++++++++++++-
>>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 4e885ecd44d1..762034fc3b83 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>>  
>>>  			area = &(zone->free_area[order]);
>>>  
>>> -			list_for_each(curr, &area->free_list[mtype])
>>> +			list_for_each(curr, &area->free_list[mtype]) {
>>>  				freecount++;
>>> +				/*
>>> +				 * Cap the free_list iteration because it might
>>> +				 * be really large and we are under a spinlock
>>> +				 * so a long time spent here could trigger a
>>> +				 * hard lockup detector. Anyway this is a
>>> +				 * debugging tool so knowing there is a handful
>>> +				 * of pages in this order should be more than
>>> +				 * sufficient
>>> +				 */
>>> +				if (freecount > 100000) {
>>> +					seq_printf(m, ">%6lu ", freecount);
>>> +					spin_unlock_irq(&zone->lock);
>>> +					cond_resched();
>>> +					spin_lock_irq(&zone->lock);
>>> +					continue;
>> list_for_each() is a for loop. The continue statement will just iterate
>> the rests with the possibility that curr will be stale. Should we use
>> goto to jump after the seq_print() below?
> You are right. Kinda brown paper back material. Sorry about that. What
> about this on top?
> --- 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 762034fc3b83..c156ce24a322 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,11 +1383,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  			unsigned long freecount = 0;
>  			struct free_area *area;
>  			struct list_head *curr;
> +			bool overflow = false;
>  
>  			area = &(zone->free_area[order]);
>  
>  			list_for_each(curr, &area->free_list[mtype]) {
> -				freecount++;
>  				/*
>  				 * Cap the free_list iteration because it might
>  				 * be really large and we are under a spinlock
> @@ -1397,15 +1397,15 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  				 * of pages in this order should be more than
>  				 * sufficient
>  				 */
> -				if (freecount > 100000) {
> -					seq_printf(m, ">%6lu ", freecount);
> +				if (++freecount >= 100000) {
> +					overflow = true;
>  					spin_unlock_irq(&zone->lock);
>  					cond_resched();
>  					spin_lock_irq(&zone->lock);
> -					continue;
> +					break;
>  				}
>  			}
> -			seq_printf(m, "%6lu ", freecount);
> +			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
>  		}
>  		seq_putc(m, '\n');
>  	}
>
Yes, that looks good to me. There is still a small chance that the
description will be a bit off if it is exactly 100,000. However, it is
not a big deal and I can live with that.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 14:31                   ` Michal Hocko
@ 2019-10-23 16:20                     ` Vlastimil Babka
  0 siblings, 0 replies; 67+ messages in thread
From: Vlastimil Babka @ 2019-10-23 16:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML

On 10/23/19 4:31 PM, Michal Hocko wrote:
> On Wed 23-10-19 15:48:36, Vlastimil Babka wrote:
>> On 10/23/19 3:37 PM, Michal Hocko wrote:
>>>
>>> But those wouldn't really help to prevent from the lockup, right?
>>
>> No, but it would perhaps help ensure that only people who know what they
>> are doing (or been told so by a developer e.g. on linux-mm) will try to
>> collect the data, and not some automatic monitoring tools taking
>> periodic snapshots of stuff in /proc that looks interesting.
> 
> Well, we do trust root doesn't do harm, right?

Perhaps too much :)

>>> Besides that who would enable that config and how much of a difference
>>> would root only vs. debugfs make?
>>
>> I would hope those tools don't scrap debugfs as much as /proc, but I
>> might be wrong of course :)
>>
>>> Is the incomplete value a real problem?
>>
>> Hmm perhaps not. If the overflow happens only for one migratetype, one
>> can use also /proc/buddyinfo to get to the exact count, as was proposed
>> in this thread for Movable migratetype.
> 
> Let's say this won't be the case. What is the worst case that the
> imprecision would cause? In other words. Does it really matter whether
> we have 100k pages on the free list of the specific migrate type for
> order or say 200k?

Probably not, it rather matters for which order the count approaches zero.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 16:17                 ` Waiman Long
@ 2019-10-23 16:21                   ` Waiman Long
  0 siblings, 0 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-23 16:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Roman Gushchin,
	Vlastimil Babka, Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, linux-mm, LKML

On 10/23/19 12:17 PM, Waiman Long wrote:
> On 10/23/19 12:10 PM, Michal Hocko wrote:
>> On Wed 23-10-19 10:56:30, Waiman Long wrote:
>>> On 10/23/19 6:27 AM, Michal Hocko wrote:
>>>> From: Michal Hocko <mhocko@suse.com>
>>>>
>>>> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
>>>> This is not really nice because it blocks both any interrupts on that
>>>> cpu and the page allocator. On large machines this might even trigger
>>>> the hard lockup detector.
>>>>
>>>> Considering the pagetypeinfo is a debugging tool we do not really need
>>>> exact numbers here. The primary reason to look at the outuput is to see
>>>> how pageblocks are spread among different migratetypes therefore putting
>>>> a bound on the number of pages on the free_list sounds like a reasonable
>>>> tradeoff.
>>>>
>>>> The new output will simply tell
>>>> [...]
>>>> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
>>>>
>>>> instead of
>>>> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
>>>>
>>>> The limit has been chosen arbitrary and it is a subject of a future
>>>> change should there be a need for that.
>>>>
>>>> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
>>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>>> ---
>>>>  mm/vmstat.c | 19 ++++++++++++++++++-
>>>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>>> index 4e885ecd44d1..762034fc3b83 100644
>>>> --- a/mm/vmstat.c
>>>> +++ b/mm/vmstat.c
>>>> @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>>>  
>>>>  			area = &(zone->free_area[order]);
>>>>  
>>>> -			list_for_each(curr, &area->free_list[mtype])
>>>> +			list_for_each(curr, &area->free_list[mtype]) {
>>>>  				freecount++;
>>>> +				/*
>>>> +				 * Cap the free_list iteration because it might
>>>> +				 * be really large and we are under a spinlock
>>>> +				 * so a long time spent here could trigger a
>>>> +				 * hard lockup detector. Anyway this is a
>>>> +				 * debugging tool so knowing there is a handful
>>>> +				 * of pages in this order should be more than
>>>> +				 * sufficient
>>>> +				 */
>>>> +				if (freecount > 100000) {
>>>> +					seq_printf(m, ">%6lu ", freecount);
>>>> +					spin_unlock_irq(&zone->lock);
>>>> +					cond_resched();
>>>> +					spin_lock_irq(&zone->lock);
>>>> +					continue;
>>> list_for_each() is a for loop. The continue statement will just iterate
>>> the rests with the possibility that curr will be stale. Should we use
>>> goto to jump after the seq_print() below?
>> You are right. Kinda brown paper back material. Sorry about that. What
>> about this on top?
>> --- 
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 762034fc3b83..c156ce24a322 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1383,11 +1383,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  			unsigned long freecount = 0;
>>  			struct free_area *area;
>>  			struct list_head *curr;
>> +			bool overflow = false;
>>  
>>  			area = &(zone->free_area[order]);
>>  
>>  			list_for_each(curr, &area->free_list[mtype]) {
>> -				freecount++;
>>  				/*
>>  				 * Cap the free_list iteration because it might
>>  				 * be really large and we are under a spinlock
>> @@ -1397,15 +1397,15 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  				 * of pages in this order should be more than
>>  				 * sufficient
>>  				 */
>> -				if (freecount > 100000) {
>> -					seq_printf(m, ">%6lu ", freecount);
>> +				if (++freecount >= 100000) {
>> +					overflow = true;
>>  					spin_unlock_irq(&zone->lock);
>>  					cond_resched();
>>  					spin_lock_irq(&zone->lock);
>> -					continue;
>> +					break;
>>  				}
>>  			}
>> -			seq_printf(m, "%6lu ", freecount);
>> +			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
>>  		}
>>  		seq_putc(m, '\n');
>>  	}
>>
> Yes, that looks good to me. There is still a small chance that the
> description will be a bit off if it is exactly 100,000. However, it is
> not a big deal and I can live with that.

Alternatively, you can do

if (++freecount > 100000) {
        :
    freecount--;
    break;
}

Cheers,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
                               ` (3 preceding siblings ...)
  2019-10-23 16:15             ` Vlastimil Babka
@ 2019-10-23 16:41             ` Michal Hocko
  2019-10-23 16:47               ` Waiman Long
  2019-10-23 17:34             ` [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo Waiman Long
  2019-10-23 17:34             ` [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo Waiman Long
  6 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 16:41 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML

With the brown paper bag bug fixed. I have also added a note about a low
number of pages being more important, as per Vlastimil's feedback.

From 0282f604144a5c06fdf3cf0bb2df532411e7f8c9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 23 Oct 2019 12:13:02 +0200
Subject: [PATCH] mm, vmstat: reduce zone->lock holding time by
 /proc/pagetypeinfo

pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering that pagetypeinfo is a debugging tool, we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes, and a low number of
pages is much more interesting, therefore putting a bound on the number
of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
[...]
Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648

instead of
Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648

The limit has been chosen arbitrarily and is subject to change in the
future should there be a need for that.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmstat.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..c156ce24a322 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			unsigned long freecount = 0;
 			struct free_area *area;
 			struct list_head *curr;
+			bool overflow = false;
 
 			area = &(zone->free_area[order]);
 
-			list_for_each(curr, &area->free_list[mtype])
-				freecount++;
-			seq_printf(m, "%6lu ", freecount);
+			list_for_each(curr, &area->free_list[mtype]) {
+				/*
+				 * Cap the free_list iteration because it might
+				 * be really large and we are under a spinlock
+				 * so a long time spent here could trigger a
+				 * hard lockup detector. Anyway this is a
+				 * debugging tool so knowing there is a handful
+				 * of pages in this order should be more than
+				 * sufficient
+				 */
+				if (++freecount >= 100000) {
+					overflow = true;
+					spin_unlock_irq(&zone->lock);
+					cond_resched();
+					spin_lock_irq(&zone->lock);
+					break;
+				}
+			}
+			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
 		}
 		seq_putc(m, '\n');
 	}
-- 
2.20.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  2019-10-23 16:41             ` Michal Hocko
@ 2019-10-23 16:47               ` Waiman Long
  0 siblings, 0 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-23 16:47 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML

On 10/23/19 12:41 PM, Michal Hocko wrote:
> With a brown paper bag bug fixed. I have also added a note about low
> number of pages being more important as per Vlastimil's feedback
>
> From 0282f604144a5c06fdf3cf0bb2df532411e7f8c9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 23 Oct 2019 12:13:02 +0200
> Subject: [PATCH] mm, vmstat: reduce zone->lock holding time by
>  /proc/pagetypeinfo
>
> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> This is not really nice because it blocks both any interrupts on that
> cpu and the page allocator. On large machines this might even trigger
> the hard lockup detector.
>
> Considering that pagetypeinfo is a debugging tool, we do not really need
> exact numbers here. The primary reason to look at the output is to see
> how pageblocks are spread among different migratetypes, and low numbers
> of pages are much more interesting, so putting a bound on the number of
> pages iterated on the free_list sounds like a reasonable tradeoff.
>
> The new output will simply tell
> [...]
> Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
>
> instead of
> Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
>
> The limit has been chosen arbitrarily and is subject to future
> change should the need arise.
>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmstat.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4e885ecd44d1..c156ce24a322 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  			unsigned long freecount = 0;
>  			struct free_area *area;
>  			struct list_head *curr;
> +			bool overflow = false;
>  
>  			area = &(zone->free_area[order]);
>  
> -			list_for_each(curr, &area->free_list[mtype])
> -				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			list_for_each(curr, &area->free_list[mtype]) {
> +				/*
> +				 * Cap the free_list iteration because it might
> +				 * be really large and we are under a spinlock
> +				 * so a long time spent here could trigger a
> +				 * hard lockup detector. Anyway this is a
> +				 * debugging tool so knowing there is a handful
> +				 * of pages in this order should be more than
> +				 * sufficient
> +				 */
> +				if (++freecount >= 100000) {
> +					overflow = true;
> +					spin_unlock_irq(&zone->lock);
> +					cond_resched();
> +					spin_lock_irq(&zone->lock);
> +					break;
> +				}
> +			}
> +			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
>  		}
>  		seq_putc(m, '\n');
>  	}

Reviewed-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
                               ` (4 preceding siblings ...)
  2019-10-23 16:41             ` Michal Hocko
@ 2019-10-23 17:34             ` Waiman Long
  2019-10-23 18:01               ` Michal Hocko
  2019-10-23 17:34             ` [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo Waiman Long
  6 siblings, 1 reply; 67+ messages in thread
From: Waiman Long @ 2019-10-23 17:34 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Mel Gorman
  Cc: linux-mm, linux-kernel, linux-api, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini,
	Waiman Long

With a threshold of 100,000, it is still possible that the zone lock
will be held for a very long time in the worst-case scenario where all
the counts are just below the threshold. With up to 6 migration types
and 11 orders, that means up to 6.6 million list iterations.

Track the total number of list iterations done since the acquisition
of the zone lock and release it whenever 100000 iterations or more have
been completed. This will cap the lock hold time to no more than 200,000
list iterations.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/vmstat.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 57ba091e5460..c5b82fdf54af 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
 	int order, mtype;
+	unsigned long iteration_count = 0;
 
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
 		seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 				 * of pages in this order should be more than
 				 * sufficient
 				 */
-				if (++freecount >= 100000) {
+				if (++freecount > 100000) {
 					overflow = true;
-					spin_unlock_irq(&zone->lock);
-					cond_resched();
-					spin_lock_irq(&zone->lock);
+					freecount--;
 					break;
 				}
 			}
 			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
+			/*
+			 * Take a break and release the zone lock when
+			 * 100000 or more entries have been iterated.
+			 */
+			iteration_count += freecount;
+			if (iteration_count >= 100000) {
+				iteration_count = 0;
+				spin_unlock_irq(&zone->lock);
+				cond_resched();
+				spin_lock_irq(&zone->lock);
+			}
 		}
 		seq_putc(m, '\n');
 	}
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
                               ` (5 preceding siblings ...)
  2019-10-23 17:34             ` [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo Waiman Long
@ 2019-10-23 17:34             ` Waiman Long
  2019-10-23 18:02               ` Michal Hocko
  6 siblings, 1 reply; 67+ messages in thread
From: Waiman Long @ 2019-10-23 17:34 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Mel Gorman
  Cc: linux-mm, linux-kernel, linux-api, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini,
	Waiman Long

Now that the free block counts for each migration type in
/proc/pagetypeinfo may not be exact when they exceed 100,000, users
cannot tell how much larger the counts actually are. As the free_area
structure already tracks the total free block count in nr_free, we may
as well print it out at no additional cost. That will give users a
rough idea of where the upper bounds are.

If there is no overflow, the presence of the total counts will also
enable us to check if the nr_free counts match the total number of
entries in the free lists.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/vmstat.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c5b82fdf54af..172946d8f358 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
 	int order, mtype;
+	struct free_area *area;
 	unsigned long iteration_count = 0;
 
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
@@ -1382,7 +1383,6 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					migratetype_names[mtype]);
 		for (order = 0; order < MAX_ORDER; ++order) {
 			unsigned long freecount = 0;
-			struct free_area *area;
 			struct list_head *curr;
 			bool overflow = false;
 
@@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 		}
 		seq_putc(m, '\n');
 	}
+
+	/*
+	 * List total free blocks per order
+	 */
+	seq_printf(m, "Node %4d, zone %8s, total             ",
+		   pgdat->node_id, zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		area = &(zone->free_area[order]);
+		seq_printf(m, "%6lu ", area->nr_free);
+	}
+	seq_putc(m, '\n');
 }
 
 /* Print out the free pages at each order for each migratetype */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo
  2019-10-23 17:34             ` [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo Waiman Long
@ 2019-10-23 18:01               ` Michal Hocko
  2019-10-23 18:14                 ` Waiman Long
  0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 18:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, linux-api,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Wed 23-10-19 13:34:22, Waiman Long wrote:
> With a threshold of 100000, it is still possible that the zone lock
> will be held for a very long time in the worst case scenario where all
> the counts are just below the threshold. With up to 6 migration types
> and 11 orders, it means up to 6.6 millions.
> 
> Track the total number of list iterations done since the acquisition
> of the zone lock and release it whenever 100000 iterations or more have
> been completed. This will cap the lock hold time to no more than 200,000
> list iterations.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/vmstat.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 57ba091e5460..c5b82fdf54af 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	unsigned long iteration_count = 0;
>  
>  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  				 * of pages in this order should be more than
>  				 * sufficient
>  				 */
> -				if (++freecount >= 100000) {
> +				if (++freecount > 100000) {
>  					overflow = true;
> -					spin_unlock_irq(&zone->lock);
> -					cond_resched();
> -					spin_lock_irq(&zone->lock);
> +					freecount--;
>  					break;
>  				}
>  			}
>  			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> +			/*
> +			 * Take a break and release the zone lock when
> +			 * 100000 or more entries have been iterated.
> +			 */
> +			iteration_count += freecount;
> +			if (iteration_count >= 100000) {
> +				iteration_count = 0;
> +				spin_unlock_irq(&zone->lock);
> +				cond_resched();
> +				spin_lock_irq(&zone->lock);
> +			}

Aren't you overengineering this a bit? If you are still worried then we
can simply cond_resched for each order
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c156ce24a322..ddb89f4e0486 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 				 */
 				if (++freecount >= 100000) {
 					overflow = true;
-					spin_unlock_irq(&zone->lock);
-					cond_resched();
-					spin_lock_irq(&zone->lock);
 					break;
 				}
 			}
 			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
+			spin_unlock_irq(&zone->lock);
+			cond_resched();
+			spin_lock_irq(&zone->lock);
 		}
 		seq_putc(m, '\n');
 	}

I do not have a strong opinion here but I can fold this into my patch 2.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo
  2019-10-23 17:34             ` [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo Waiman Long
@ 2019-10-23 18:02               ` Michal Hocko
  2019-10-23 18:07                 ` Waiman Long
  0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 18:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, linux-api,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Wed 23-10-19 13:34:23, Waiman Long wrote:
[...]
> @@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  		}
>  		seq_putc(m, '\n');
>  	}
> +
> +	/*
> +	 * List total free blocks per order
> +	 */
> +	seq_printf(m, "Node %4d, zone %8s, total             ",
> +		   pgdat->node_id, zone->name);
> +	for (order = 0; order < MAX_ORDER; ++order) {
> +		area = &(zone->free_area[order]);
> +		seq_printf(m, "%6lu ", area->nr_free);
> +	}
> +	seq_putc(m, '\n');

This is essentially duplicating /proc/buddyinfo. Do we really need that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo
  2019-10-23 18:02               ` Michal Hocko
@ 2019-10-23 18:07                 ` Waiman Long
  0 siblings, 0 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-23 18:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, linux-api,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On 10/23/19 2:02 PM, Michal Hocko wrote:
> On Wed 23-10-19 13:34:23, Waiman Long wrote:
> [...]
>> @@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  		}
>>  		seq_putc(m, '\n');
>>  	}
>> +
>> +	/*
>> +	 * List total free blocks per order
>> +	 */
>> +	seq_printf(m, "Node %4d, zone %8s, total             ",
>> +		   pgdat->node_id, zone->name);
>> +	for (order = 0; order < MAX_ORDER; ++order) {
>> +		area = &(zone->free_area[order]);
>> +		seq_printf(m, "%6lu ", area->nr_free);
>> +	}
>> +	seq_putc(m, '\n');
> This is essentially duplicating /proc/buddyinfo. Do we really need that?

Yes, you are right. As the information is available elsewhere. I am fine
with dropping this.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo
  2019-10-23 18:01               ` Michal Hocko
@ 2019-10-23 18:14                 ` Waiman Long
  2019-10-23 20:02                   ` Michal Hocko
  0 siblings, 1 reply; 67+ messages in thread
From: Waiman Long @ 2019-10-23 18:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, linux-api,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On 10/23/19 2:01 PM, Michal Hocko wrote:
> On Wed 23-10-19 13:34:22, Waiman Long wrote:
>> With a threshold of 100000, it is still possible that the zone lock
>> will be held for a very long time in the worst case scenario where all
>> the counts are just below the threshold. With up to 6 migration types
>> and 11 orders, it means up to 6.6 millions.
>>
>> Track the total number of list iterations done since the acquisition
>> of the zone lock and release it whenever 100000 iterations or more have
>> been completed. This will cap the lock hold time to no more than 200,000
>> list iterations.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  mm/vmstat.c | 18 ++++++++++++++----
>>  1 file changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 57ba091e5460..c5b82fdf54af 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  					pg_data_t *pgdat, struct zone *zone)
>>  {
>>  	int order, mtype;
>> +	unsigned long iteration_count = 0;
>>  
>>  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>>  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  				 * of pages in this order should be more than
>>  				 * sufficient
>>  				 */
>> -				if (++freecount >= 100000) {
>> +				if (++freecount > 100000) {
>>  					overflow = true;
>> -					spin_unlock_irq(&zone->lock);
>> -					cond_resched();
>> -					spin_lock_irq(&zone->lock);
>> +					freecount--;
>>  					break;
>>  				}
>>  			}
>>  			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
>> +			/*
>> +			 * Take a break and release the zone lock when
>> +			 * 100000 or more entries have been iterated.
>> +			 */
>> +			iteration_count += freecount;
>> +			if (iteration_count >= 100000) {
>> +				iteration_count = 0;
>> +				spin_unlock_irq(&zone->lock);
>> +				cond_resched();
>> +				spin_lock_irq(&zone->lock);
>> +			}
> Aren't you overengineering this a bit? If you are still worried then we
> can simply cond_resched for each order
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c156ce24a322..ddb89f4e0486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  				 */
>  				if (++freecount >= 100000) {
>  					overflow = true;
> -					spin_unlock_irq(&zone->lock);
> -					cond_resched();
> -					spin_lock_irq(&zone->lock);
>  					break;
>  				}
>  			}
>  			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
>  		}
>  		seq_putc(m, '\n');
>  	}
>
> I do not have a strong opinion here but I can fold this into my patch 2.

If the free list is empty or is very short, there is probably no need to
release and reacquire the lock. How about adding a check for a lower
bound like:

if (freecount > 1000) {
    spin_unlock_irq(&zone->lock);
    cond_resched();
    spin_lock_irq(&zone->lock);
}

Cheers,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo
  2019-10-23 18:14                 ` Waiman Long
@ 2019-10-23 20:02                   ` Michal Hocko
  0 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2019-10-23 20:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, linux-api,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Wed 23-10-19 14:14:14, Waiman Long wrote:
> On 10/23/19 2:01 PM, Michal Hocko wrote:
> > On Wed 23-10-19 13:34:22, Waiman Long wrote:
> >> With a threshold of 100000, it is still possible that the zone lock
> >> will be held for a very long time in the worst case scenario where all
> >> the counts are just below the threshold. With up to 6 migration types
> >> and 11 orders, it means up to 6.6 millions.
> >>
> >> Track the total number of list iterations done since the acquisition
> >> of the zone lock and release it whenever 100000 iterations or more have
> >> been completed. This will cap the lock hold time to no more than 200,000
> >> list iterations.
> >>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> >>  mm/vmstat.c | 18 ++++++++++++++----
> >>  1 file changed, 14 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 57ba091e5460..c5b82fdf54af 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >>  					pg_data_t *pgdat, struct zone *zone)
> >>  {
> >>  	int order, mtype;
> >> +	unsigned long iteration_count = 0;
> >>  
> >>  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> >>  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> >> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >>  				 * of pages in this order should be more than
> >>  				 * sufficient
> >>  				 */
> >> -				if (++freecount >= 100000) {
> >> +				if (++freecount > 100000) {
> >>  					overflow = true;
> >> -					spin_unlock_irq(&zone->lock);
> >> -					cond_resched();
> >> -					spin_lock_irq(&zone->lock);
> >> +					freecount--;
> >>  					break;
> >>  				}
> >>  			}
> >>  			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> >> +			/*
> >> +			 * Take a break and release the zone lock when
> >> +			 * 100000 or more entries have been iterated.
> >> +			 */
> >> +			iteration_count += freecount;
> >> +			if (iteration_count >= 100000) {
> >> +				iteration_count = 0;
> >> +				spin_unlock_irq(&zone->lock);
> >> +				cond_resched();
> >> +				spin_lock_irq(&zone->lock);
> >> +			}
> > Aren't you overengineering this a bit? If you are still worried then we
> > can simply cond_resched for each order
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index c156ce24a322..ddb89f4e0486 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  				 */
> >  				if (++freecount >= 100000) {
> >  					overflow = true;
> > -					spin_unlock_irq(&zone->lock);
> > -					cond_resched();
> > -					spin_lock_irq(&zone->lock);
> >  					break;
> >  				}
> >  			}
> >  			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> > +			spin_unlock_irq(&zone->lock);
> > +			cond_resched();
> > +			spin_lock_irq(&zone->lock);
> >  		}
> >  		seq_putc(m, '\n');
> >  	}
> >
> > I do not have a strong opinion here but I can fold this into my patch 2.
> 
> If the free list is empty or is very short, there is probably no need to
> release and reacquire the lock. How about adding a check for a lower
> bound like:

Again, does it really make any sense to micro-optimize something like
this? It is a debugging tool. I would rather go simple.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 15:01       ` Waiman Long
  2019-10-23 15:05           ` Qian Cai
@ 2019-10-23 22:30         ` Andrew Morton
  2019-10-24  5:33           ` Qian Cai
  2019-10-24  3:33         ` Feng Tang
  2 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2019-10-23 22:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Qian Cai, linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini

On Wed, 23 Oct 2019 11:01:04 -0400 Waiman Long <longman@redhat.com> wrote:

> On 10/23/19 10:48 AM, Qian Cai wrote:
> >>> this still isn't a bulletproof fix.  Maybe just terminate the list
> >>> walk if freecount reaches 1024.  Would anyone really care?
> >>>
> >>> Sigh.  I wonder if anyone really uses this thing for anything
> >>> important.  Can we just remove it all?
> >>>
> >> Removing it will be a breakage of kernel API.
> > Who cares about breaking this part of the API when essentially nobody
> > uses this file?
> >
> There are certainly tools that use /proc/pagetypeinfo, and this is how
> the problem was found. I am not against removing it, but we have to be
> careful and deprecate it in a way that minimizes user impact.

Yes, removing things is hard.  One approach is to add printk_once(this
is going away, please email us if you use it) then wait a few years. 
Backport that one-liner into -stable kernels to hopefully speed up the
process.

Meanwhile, we need to fix the DoS opportunity.  How about my suggestion
that we limit the count to 1024, see if anyone notices?  I bet they
don't!

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 15:01       ` Waiman Long
  2019-10-23 15:05           ` Qian Cai
  2019-10-23 22:30         ` Andrew Morton
@ 2019-10-24  3:33         ` Feng Tang
  2019-10-24  4:34           ` Qian Cai
  2 siblings, 1 reply; 67+ messages in thread
From: Feng Tang @ 2019-10-24  3:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Qian Cai, Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Wed, Oct 23, 2019 at 11:01:04PM +0800, Waiman Long wrote:
> On 10/23/19 10:48 AM, Qian Cai wrote:
> >>> this still isn't a bulletproof fix.  Maybe just terminate the list
> >>> walk if freecount reaches 1024.  Would anyone really care?
> >>>
> >>> Sigh.  I wonder if anyone really uses this thing for anything
> >>> important.  Can we just remove it all?
> >>>
> >> Removing it will be a breakage of kernel API.
> > Who cares about breaking this part of the API when essentially nobody
> > uses this file?
> >
> There are certainly tools that use /proc/pagetypeinfo, and this is how
> the problem was found. I am not against removing it, but we have to be
> careful and deprecate it in a way that minimizes user impact.

We have been using /proc/pagetypeinfo for debugging, mainly on
client platforms like phones/tablets. We have hit problems like very slow
system response or OOMs, and many of them could be related to
memory pressure or fragmentation issues, where monitoring /proc/pagetypeinfo
is very useful for debugging.

So I think Michal's suggestion to change it to 0400 is a good idea.

Thanks,
Feng

 
> Cheers,
> Longman
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24  3:33         ` Feng Tang
@ 2019-10-24  4:34           ` Qian Cai
  2019-10-24  5:34             ` Feng Tang
  0 siblings, 1 reply; 67+ messages in thread
From: Qian Cai @ 2019-10-24  4:34 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini



> On Oct 23, 2019, at 11:33 PM, Feng Tang <feng.tang@intel.com> wrote:
> 
> We have been using /proc/pagetypeinfo for debugging, mainly on
> client platforms like phones/tablets. We have hit problems like very slow
> system response or OOMs, and many of them could be related to
> memory pressure or fragmentation issues, where monitoring /proc/pagetypeinfo
> is very useful for debugging.

This description of the use case is rather vague. Which part of the information in that file is critical for debugging an OOM or a slow system, and not readily available in places like /proc/zoneinfo, /proc/buddyinfo, sysrq-m, or OOM traces?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 22:30         ` Andrew Morton
@ 2019-10-24  5:33           ` Qian Cai
  2019-10-24  7:42             ` Michal Hocko
  0 siblings, 1 reply; 67+ messages in thread
From: Qian Cai @ 2019-10-24  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Waiman Long, linux-mm, linux-kernel, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini



> On Oct 23, 2019, at 6:30 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> Yes, removing things is hard.  One approach is to add printk_once(this
> is going away, please email us if you use it) then wait a few years. 
> Backport that one-liner into -stable kernels to hopefully speed up the
> process.

Although it still looks like overkill to me given that,

1) Mel has given a green light to remove it.
2) Nobody has justified any sensible reason to keep it, apart from it probably only being triggered by some testing tools blindly reading procfs entries.

it is still better than wasting developers’ time beating a dead horse.

> 
> Meanwhile, we need to fix the DoS opportunity.  How about my suggestion
> that we limit the count to 1024, see if anyone notices?  I bet they
> don't!

The DoS has probably been there since the file was introduced almost 10 years ago, so I suspect it is not that easily exploitable.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24  4:34           ` Qian Cai
@ 2019-10-24  5:34             ` Feng Tang
  2019-10-24 10:51               ` Qian Cai
  0 siblings, 1 reply; 67+ messages in thread
From: Feng Tang @ 2019-10-24  5:34 UTC (permalink / raw)
  To: Qian Cai
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Thu, Oct 24, 2019 at 12:34:41PM +0800, Qian Cai wrote:
> 
> 
> > On Oct 23, 2019, at 11:33 PM, Feng Tang <feng.tang@intel.com> wrote:
> > 
> > We have been using /proc/pagetypeinfo for debugging, mainly on
> > client platforms like phones/tablets. We have hit problems like very slow
> > system response or OOMs, and many of them could be related to
> > memory pressure or fragmentation issues, where monitoring /proc/pagetypeinfo
> > is very useful for debugging.
> 
> This description of use case is rather vague. Which part of the information in that file is critical to debug an OOM or slow system that is not readily available in places like /proc/zoneinfo, /proc/buddyinfo, sysrq-m, or OOM trace etc?

One example: there was a platform with limited DRAM, so it reserved
some CMA memory for the camera's big buffer allocations, while letting
that memory be shared by others when the camera was not in use. After
running for a long time, the CMA region got fragmented and the camera
app started to fail due to buffer allocation failures. During
debugging, we kept monitoring the buddy info for the different migrate
types.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24  5:33           ` Qian Cai
@ 2019-10-24  7:42             ` Michal Hocko
  2019-10-24 11:11               ` Qian Cai
  0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-24  7:42 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Waiman Long, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Thu 24-10-19 01:33:01, Qian Cai wrote:
> 
> 
> > On Oct 23, 2019, at 6:30 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > Yes, removing things is hard.  One approach is to add printk_once(this
> > is going away, please email us if you use it) then wait a few years. 
> > Backport that one-liner into -stable kernels to hopefully speed up the
> > process.
> 
> Although it still looks like overkill to me given that,
> 
> 1) Mel has given a green light to remove it.
> 2) Nobody has justified any sensible reason to keep it, apart from it
> probably only being triggered by some testing tools blindly reading
> procfs entries.

It's been useful for debugging memory fragmentation problems, and we do
not have anything else that provides similar information. Considering
that making it root-only is trivial, and reducing the lock hold times
likewise, I do not really see any strong reason to dump it at this
moment.
 
> it is still better than wasting developers’ time beating a dead horse.
> 
> > 
> > Meanwhile, we need to fix the DoS opportunity.  How about my suggestion
> > that we limit the count to 1024, see if anyone notices?  I bet they
> > don't!
> 
> The DoS has probably been there since the file was introduced almost 10
> years ago, so I suspect it is not that easily exploitable.

Yes you need _tons_ of memory. Reading the file on my 3TB system takes
sys     0m3.673s

The situation might be worse if the system is terribly fragmented which
is not the case here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
  2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
@ 2019-10-24  8:20           ` Michal Hocko
  2019-10-24 16:16             ` Waiman Long
  2 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-24  8:20 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, Waiman Long
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML

On Wed 23-10-19 12:27:35, Michal Hocko wrote:
[...]
> I went with a bound to the pages iterated over in the free_list. See
> patch 2.

I will fold http://lkml.kernel.org/r/20191023180121.GN17610@dhcp22.suse.cz
to patch 2 unless there are any objections. If there are no further
comments I will send the two patches without an RFC tomorrow.

Thanks for all the feedback.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24  5:34             ` Feng Tang
@ 2019-10-24 10:51               ` Qian Cai
  2019-10-25  1:38                 ` Feng Tang
  0 siblings, 1 reply; 67+ messages in thread
From: Qian Cai @ 2019-10-24 10:51 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini



> On Oct 24, 2019, at 1:34 AM, Feng Tang <feng.tang@intel.com> wrote:
> 
> One example is, there was a platform with limited DRAM, so it preset
> some CMA memory for the camera's big buffer allocations, while letting
> that memory be shared by others when the camera was not in use. After
> a long time running, the CMA region got fragmented and the camera
> app started to fail due to buffer allocation failures. During
> debugging, we kept monitoring the buddy info for the different migrate
> types.

For CMA-specific debugging, why is the CMA debugfs not enough?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24  7:42             ` Michal Hocko
@ 2019-10-24 11:11               ` Qian Cai
  2019-10-24 13:38                 ` Michal Hocko
  0 siblings, 1 reply; 67+ messages in thread
From: Qian Cai @ 2019-10-24 11:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Waiman Long, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini



> On Oct 24, 2019, at 3:42 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> > It's been useful for debugging memory fragmentation problems and we do
> > not have anything that would provide similar information. Considering

Actually, zoneinfo and buddyinfo are somewhat similar to it. Why is the
extra information in pagetypeinfo still useful for debugging today’s
real-world scenarios?

> that making it root only is trivial, and reducing the lock hold times
> likewise, I do not really see any strong reason to dump it at this
> moment.

There is no need to hurry this, and clearly this is rather a good
opportunity to discuss consolidating memory fragmentation debugging
to ease maintenance in the long term.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24 11:11               ` Qian Cai
@ 2019-10-24 13:38                 ` Michal Hocko
  2019-10-24 14:55                   ` Qian Cai
  0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2019-10-24 13:38 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Waiman Long, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Thu 24-10-19 07:11:52, Qian Cai wrote:
> 
> 
> > On Oct 24, 2019, at 3:42 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > It's been useful for debugging memory fragmentation problems and we do
> > not have anything that would provide similar information. Considering
> 
> Actually, zoneinfo and buddyinfo are somewhat similar to it. Why
> is the extra information in pagetypeinfo still useful for debugging
> today’s real-world scenarios?

Because the migrate type is the information that helps to understand
fragmentation-related issues. I am pretty sure Vlastimil would tell you
much more.

> > that making it root only is trivial, and reducing the lock hold times
> > likewise, I do not really see any strong reason to dump it at this
> > moment.
> 
> There is no need to hurry this, and clearly this is rather a good
> opportunity to discuss consolidating memory fragmentation debugging
> to ease maintenance in the long term.

Right. I think we should address the issue at hand first and then
handle any possible consolidations.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24 13:38                 ` Michal Hocko
@ 2019-10-24 14:55                   ` Qian Cai
  0 siblings, 0 replies; 67+ messages in thread
From: Qian Cai @ 2019-10-24 14:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Waiman Long, linux-mm, linux-kernel,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini



> On Oct 24, 2019, at 9:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> Because the migrate type is the information that helps to understand
> fragmentation-related issues. I am pretty sure Vlastimil would tell you
> much more.

Then, I don’t see the point of having two files (pagetypeinfo and
buddyinfo) exporting similar information; figuring that out is probably
more important than fixing the 10-year-old DoS.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24  8:20           ` [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Michal Hocko
@ 2019-10-24 16:16             ` Waiman Long
  0 siblings, 0 replies; 67+ messages in thread
From: Waiman Long @ 2019-10-24 16:16 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman
  Cc: Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini, linux-mm, LKML

On 10/24/19 4:20 AM, Michal Hocko wrote:
> On Wed 23-10-19 12:27:35, Michal Hocko wrote:
> [...]
>> I went with a bound to the pages iterated over in the free_list. See
>> patch 2.
> I will fold http://lkml.kernel.org/r/20191023180121.GN17610@dhcp22.suse.cz
> to patch 2 unless there are any objections. If there are no further
> comments I will send the two patches without an RFC tomorrow.
>
> Thanks for all the feedback.

I am fine with your change. My concern is to make sure that there is a
reasonable bound to the worst case scenario. With that change, the upper
bound is iterating 100,000 list entries. I think Andrew suggested
lowering it to 1024. That I think may be too low, but I don't mind if it
is lowered somewhat from the current value.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users
  2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
@ 2019-10-24 19:01               ` David Rientjes
  2019-10-23 13:27             ` Vlastimil Babka
                                 ` (4 subsequent siblings)
  5 siblings, 0 replies; 67+ messages in thread
From: David Rientjes @ 2019-10-24 19:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Waiman Long, Johannes Weiner,
	Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
	Jann Horn, Song Liu, Greg Kroah-Hartman, Rafael Aquini, linux-mm,
	LKML, Michal Hocko

On Wed, 23 Oct 2019, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> /proc/pagetypeinfo is a debugging tool to examine internal page
> allocator state wrt fragmentation. It is not very useful for
> anything else, so normal users really do not need to read this file.
> 
> Waiman Long has noticed that reading this file can have negative side
> effects because zone->lock is necessary for gathering data and that
> a) interferes with the page allocator and its users and b) can lead to
> hard lockups on large machines which have very long free_lists.
> 
> Reduce both issues by simply not exporting the file to regular users.
> 
> Reported-by: Waiman Long <longman@redhat.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
  2019-10-24 10:51               ` Qian Cai
@ 2019-10-25  1:38                 ` Feng Tang
  0 siblings, 0 replies; 67+ messages in thread
From: Feng Tang @ 2019-10-25  1:38 UTC (permalink / raw)
  To: Qian Cai
  Cc: Waiman Long, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu, Greg Kroah-Hartman,
	Rafael Aquini

On Thu, Oct 24, 2019 at 06:51:54PM +0800, Qian Cai wrote:
> 
> 
> > On Oct 24, 2019, at 1:34 AM, Feng Tang <feng.tang@intel.com> wrote:
> > 
> > One example is, there was a platform with limited DRAM, so it preset
> > some CMA memory for the camera's big buffer allocations, while letting
> > that memory be shared by others when the camera was not in use. After
> > a long time running, the CMA region got fragmented and the camera
> > app started to fail due to buffer allocation failures. During
> > debugging, we kept monitoring the buddy info for the different migrate
> > types.
> 
> For CMA-specific debugging, why is the CMA debugfs not enough?

We care about how the numbers for the different orders change for CMA
and other migrate types; sometimes I just use one simple command
	watch -d 'cat /proc/pagetypeinfo'
which shows the changes directly and dynamically.

Thanks,
Feng



^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2019-10-25  1:38 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-22 16:21 [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Waiman Long
2019-10-22 16:57 ` Michal Hocko
2019-10-22 18:00   ` Waiman Long
2019-10-22 18:40     ` Waiman Long
2019-10-23  0:52       ` David Rientjes
2019-10-23  0:52         ` David Rientjes
2019-10-23  8:31   ` Mel Gorman
2019-10-23  9:04     ` Michal Hocko
2019-10-23  9:56       ` Mel Gorman
2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
2019-10-23 13:13             ` Mel Gorman
2019-10-23 13:27             ` Vlastimil Babka
2019-10-23 14:52             ` Waiman Long
2019-10-23 15:10             ` Rafael Aquini
2019-10-23 16:15             ` Vlastimil Babka
2019-10-24 19:01             ` David Rientjes
2019-10-24 19:01               ` David Rientjes
2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
2019-10-23 13:15             ` Mel Gorman
2019-10-23 13:32             ` Vlastimil Babka
2019-10-23 13:37               ` Michal Hocko
2019-10-23 13:48                 ` Vlastimil Babka
2019-10-23 14:31                   ` Michal Hocko
2019-10-23 16:20                     ` Vlastimil Babka
2019-10-23 13:46               ` Rafael Aquini
2019-10-23 14:56             ` Waiman Long
2019-10-23 15:21               ` Waiman Long
2019-10-23 16:10               ` Michal Hocko
2019-10-23 16:17                 ` Waiman Long
2019-10-23 16:21                   ` Waiman Long
2019-10-23 16:15             ` Vlastimil Babka
2019-10-23 16:41             ` Michal Hocko
2019-10-23 16:47               ` Waiman Long
2019-10-23 17:34             ` [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo Waiman Long
2019-10-23 18:01               ` Michal Hocko
2019-10-23 18:14                 ` Waiman Long
2019-10-23 20:02                   ` Michal Hocko
2019-10-23 17:34             ` [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo Waiman Long
2019-10-23 18:02               ` Michal Hocko
2019-10-23 18:07                 ` Waiman Long
2019-10-24  8:20           ` [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Michal Hocko
2019-10-24 16:16             ` Waiman Long
2019-10-23 12:42         ` [PATCH] " Qian Cai
2019-10-23 13:25         ` Vlastimil Babka
2019-10-22 21:59 ` Andrew Morton
2019-10-23  6:15   ` Michal Hocko
2019-10-23 14:30   ` Waiman Long
2019-10-23 14:48     ` Qian Cai
2019-10-23 14:48       ` Qian Cai
2019-10-23 15:01       ` Waiman Long
2019-10-23 15:05         ` Qian Cai
2019-10-23 15:05           ` Qian Cai
2019-10-23 22:30         ` Andrew Morton
2019-10-24  5:33           ` Qian Cai
2019-10-24  7:42             ` Michal Hocko
2019-10-24 11:11               ` Qian Cai
2019-10-24 13:38                 ` Michal Hocko
2019-10-24 14:55                   ` Qian Cai
2019-10-24  3:33         ` Feng Tang
2019-10-24  4:34           ` Qian Cai
2019-10-24  5:34             ` Feng Tang
2019-10-24 10:51               ` Qian Cai
2019-10-25  1:38                 ` Feng Tang
2019-10-23 15:03       ` Rafael Aquini
2019-10-23 15:51         ` Qian Cai
2019-10-23 15:51           ` Qian Cai
