* [PATCH v1 0/2] stability fixes for vmalloc allocator
@ 2019-01-24 11:56 Uladzislau Rezki (Sony)
  2019-01-24 11:56 ` [PATCH v1 1/2] mm/vmalloc: fix kernel BUG at mm/vmalloc.c:512! Uladzislau Rezki (Sony)
  2019-01-24 11:56 ` [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy() Uladzislau Rezki (Sony)
  0 siblings, 2 replies; 10+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-01-24 11:56 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Matthew Wilcox, linux-mm
  Cc: LKML, Thomas Garnier, Oleksiy Avramchenko, Steven Rostedt,
	Joel Fernandes, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	Uladzislau Rezki (Sony)

Hello.

The vmalloc test driver (https://lkml.org/lkml/2019/1/2/52) has been
added to linux-next, so I would like to fix some stability issues I
identified using it. I explained those issues in detail here:

https://lkml.org/lkml/2018/10/19/786

There are two patches; I think they are ready to go in, unless you
have any comments.

Thank you!

--
Vlad Rezki

Uladzislau Rezki (Sony) (2):
  mm/vmalloc: fix kernel BUG at mm/vmalloc.c:512!
  mm: add priority threshold to __purge_vmap_area_lazy()

 mm/vmalloc.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

-- 
2.11.0


* [PATCH v1 1/2] mm/vmalloc: fix kernel BUG at mm/vmalloc.c:512!
  2019-01-24 11:56 [PATCH v1 0/2] stability fixes for vmalloc allocator Uladzislau Rezki (Sony)
@ 2019-01-24 11:56 ` Uladzislau Rezki (Sony)
  2019-01-24 11:56 ` [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy() Uladzislau Rezki (Sony)
  1 sibling, 0 replies; 10+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-01-24 11:56 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Matthew Wilcox, linux-mm
  Cc: LKML, Thomas Garnier, Oleksiy Avramchenko, Steven Rostedt,
	Joel Fernandes, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	Uladzislau Rezki (Sony)

One of the vmalloc stress test cases triggers a kernel BUG():

<snip>
[60.562151] ------------[ cut here ]------------
[60.562154] kernel BUG at mm/vmalloc.c:512!
[60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
[60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390
<snip>

This can happen when a big align request makes the calculated
address overflow, i.e. it becomes 0 after ALIGN()'s fixup.

Fix it by checking whether the calculated address is within the
vstart/vend range.
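
To illustrate (a minimal user-space sketch, not part of the patch; it
assumes the standard ALIGN() round-up semantics):

<snip>
#include <stdio.h>

/* The same round-up the kernel's ALIGN() macro performs. */
#define ALIGN(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

int main(void)
{
	/* A candidate address close to the top of the address space. */
	unsigned long addr = 0xfffffffffffff000UL;

	/*
	 * A big align request wraps the unsigned value around to 0,
	 * which is below vstart, so the "addr + size > vend" check
	 * alone does not catch it.
	 */
	printf("%lx\n", ALIGN(addr, 0x100000UL));	/* prints 0 */
	return 0;
}
<snip>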

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1c512fff8a56..fb4fb5fcee74 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -498,7 +498,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	}
 
 found:
-	if (addr + size > vend)
+	/*
+	 * Check the calculated address also against vstart,
+	 * because a big align request can make it wrap to 0.
+	 */
+	if (addr + size > vend || addr < vstart)
 		goto overflow;
 
 	va->va_start = addr;
-- 
2.11.0


* [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-01-24 11:56 [PATCH v1 0/2] stability fixes for vmalloc allocator Uladzislau Rezki (Sony)
  2019-01-24 11:56 ` [PATCH v1 1/2] mm/vmalloc: fix kernel BUG at mm/vmalloc.c:512! Uladzislau Rezki (Sony)
@ 2019-01-24 11:56 ` Uladzislau Rezki (Sony)
  2019-01-28 20:04   ` Andrew Morton
  2019-01-28 22:45   ` Joel Fernandes
  1 sibling, 2 replies; 10+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-01-24 11:56 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Matthew Wilcox, linux-mm
  Cc: LKML, Thomas Garnier, Oleksiy Avramchenko, Steven Rostedt,
	Joel Fernandes, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	Uladzislau Rezki (Sony)

Commit 763b218ddfaf ("mm: add preempt points into
__purge_vmap_area_lazy()") introduced some preempt points, one
of which prioritizes an allocation over the lazy freeing of
vmap areas.

Prioritizing an allocation over freeing does not work well all
the time; it should rather be a compromise.

1) The number of lazy pages directly influences the busy list
length, and thus operations like allocation, lookup, unmap,
remove, etc.

2) Under heavy stress of the vmalloc subsystem, I ran into a
situation where memory usage kept increasing, hitting an
out_of_memory -> panic state, because the logic that frees
vmap areas in __purge_vmap_area_lazy() was completely blocked.

Establish a threshold past which freeing is prioritized back
over allocation, creating a balance between the two.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fb4fb5fcee74..abe83f885069 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	struct llist_node *valist;
 	struct vmap_area *va;
 	struct vmap_area *n_va;
-	bool do_free = false;
+	int resched_threshold;
 
 	lockdep_assert_held(&vmap_purge_lock);
 
 	valist = llist_del_all(&vmap_purge_list);
+	if (unlikely(valist == NULL))
+		return false;
+
+	/*
+	 * TODO: calculate the flush range without looping.
+	 * The list can be up to lazy_max_pages() elements.
+	 */
 	llist_for_each_entry(va, valist, purge_list) {
 		if (va->va_start < start)
 			start = va->va_start;
 		if (va->va_end > end)
 			end = va->va_end;
-		do_free = true;
 	}
 
-	if (!do_free)
-		return false;
-
 	flush_tlb_kernel_range(start, end);
+	resched_threshold = (int) lazy_max_pages() << 1;
 
 	spin_lock(&vmap_area_lock);
 	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
@@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 
 		__free_vmap_area(va);
 		atomic_sub(nr, &vmap_lazy_nr);
-		cond_resched_lock(&vmap_area_lock);
+
+		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
+			cond_resched_lock(&vmap_area_lock);
 	}
 	spin_unlock(&vmap_area_lock);
 	return true;
-- 
2.11.0


* Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-01-24 11:56 ` [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy() Uladzislau Rezki (Sony)
@ 2019-01-28 20:04   ` Andrew Morton
  2019-01-29 16:17     ` Uladzislau Rezki
  2019-01-28 22:45   ` Joel Fernandes
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2019-01-28 20:04 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: Michal Hocko, Matthew Wilcox, linux-mm, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Steven Rostedt, Joel Fernandes,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

On Thu, 24 Jan 2019 12:56:48 +0100 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:

> Commit 763b218ddfaf ("mm: add preempt points into
> __purge_vmap_area_lazy()") introduced some preempt points, one
> of which prioritizes an allocation over the lazy freeing of
> vmap areas.
> 
> Prioritizing an allocation over freeing does not work well all
> the time; it should rather be a compromise.
> 
> 1) The number of lazy pages directly influences the busy list
> length, and thus operations like allocation, lookup, unmap,
> remove, etc.
> 
> 2) Under heavy stress of the vmalloc subsystem, I ran into a
> situation where memory usage kept increasing, hitting an
> out_of_memory -> panic state, because the logic that frees
> vmap areas in __purge_vmap_area_lazy() was completely blocked.
> 
> Establish a threshold past which freeing is prioritized back
> over allocation, creating a balance between the two.

It would be useful to credit the vmalloc test driver for this
discovery, and perhaps to identify specifically which test triggered
the kernel misbehaviour.  Please send along suitable words and I'll add
them.


> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	struct llist_node *valist;
>  	struct vmap_area *va;
>  	struct vmap_area *n_va;
> -	bool do_free = false;
> +	int resched_threshold;
>  
>  	lockdep_assert_held(&vmap_purge_lock);
>  
>  	valist = llist_del_all(&vmap_purge_list);
> +	if (unlikely(valist == NULL))
> +		return false;

Why this change?

> +	/*
> +	 * TODO: calculate the flush range without looping.
> +	 * The list can be up to lazy_max_pages() elements.
> +	 */

How important is this?

>  	llist_for_each_entry(va, valist, purge_list) {
>  		if (va->va_start < start)
>  			start = va->va_start;
>  		if (va->va_end > end)
>  			end = va->va_end;
> -		do_free = true;
>  	}
>  
> -	if (!do_free)
> -		return false;
> -
>  	flush_tlb_kernel_range(start, end);
> +	resched_threshold = (int) lazy_max_pages() << 1;

Is the typecast really needed?

Perhaps resched_threshold should have unsigned long type and perhaps
vmap_lazy_nr should be atomic_long_t?

>  	spin_lock(&vmap_area_lock);
>  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  
>  		__free_vmap_area(va);
>  		atomic_sub(nr, &vmap_lazy_nr);
> -		cond_resched_lock(&vmap_area_lock);
> +
> +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> +			cond_resched_lock(&vmap_area_lock);
>  	}
>  	spin_unlock(&vmap_area_lock);
>  	return true;


* Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-01-24 11:56 ` [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy() Uladzislau Rezki (Sony)
  2019-01-28 20:04   ` Andrew Morton
@ 2019-01-28 22:45   ` Joel Fernandes
  2019-01-29 17:39     ` Uladzislau Rezki
  1 sibling, 1 reply; 10+ messages in thread
From: Joel Fernandes @ 2019-01-28 22:45 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: Andrew Morton, Michal Hocko, Matthew Wilcox, linux-mm, LKML,
	Thomas Garnier, Oleksiy Avramchenko, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

On Thu, Jan 24, 2019 at 12:56:48PM +0100, Uladzislau Rezki (Sony) wrote:
> Commit 763b218ddfaf ("mm: add preempt points into
> __purge_vmap_area_lazy()") introduced some preempt points, one
> of which prioritizes an allocation over the lazy freeing of
> vmap areas.
> 
> Prioritizing an allocation over freeing does not work well all
> the time; it should rather be a compromise.
> 
> 1) The number of lazy pages directly influences the busy list
> length, and thus operations like allocation, lookup, unmap,
> remove, etc.
> 
> 2) Under heavy stress of the vmalloc subsystem, I ran into a
> situation where memory usage kept increasing, hitting an
> out_of_memory -> panic state, because the logic that frees
> vmap areas in __purge_vmap_area_lazy() was completely blocked.
> 
> Establish a threshold past which freeing is prioritized back
> over allocation, creating a balance between the two.

I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
is greater than half of lazy_max_pages(), which IIUC will be more likely if
the number of CPUs is large.

In fact, when vmap_lazy_nr is high, that's when the latency will be the worst
so one could say that that's when you *should* reschedule since the frees are
taking too long and hurting real-time tasks.

Could this be better solved by tweaking lazy_max_pages() such that purging is
more aggressive?

Another approach could be to detect the scenario you brought up (allocations
happening faster than free), somehow, and avoid a reschedule?

thanks,

 - Joel

> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index fb4fb5fcee74..abe83f885069 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	struct llist_node *valist;
>  	struct vmap_area *va;
>  	struct vmap_area *n_va;
> -	bool do_free = false;
> +	int resched_threshold;
>  
>  	lockdep_assert_held(&vmap_purge_lock);
>  
>  	valist = llist_del_all(&vmap_purge_list);
> +	if (unlikely(valist == NULL))
> +		return false;
> +
> +	/*
> +	 * TODO: calculate the flush range without looping.
> +	 * The list can be up to lazy_max_pages() elements.
> +	 */
>  	llist_for_each_entry(va, valist, purge_list) {
>  		if (va->va_start < start)
>  			start = va->va_start;
>  		if (va->va_end > end)
>  			end = va->va_end;
> -		do_free = true;
>  	}
>  
> -	if (!do_free)
> -		return false;
> -
>  	flush_tlb_kernel_range(start, end);
> +	resched_threshold = (int) lazy_max_pages() << 1;
>  
>  	spin_lock(&vmap_area_lock);
>  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  
>  		__free_vmap_area(va);
>  		atomic_sub(nr, &vmap_lazy_nr);
> -		cond_resched_lock(&vmap_area_lock);
> +
> +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> +			cond_resched_lock(&vmap_area_lock);
>  	}
>  	spin_unlock(&vmap_area_lock);
>  	return true;
> -- 
> 2.11.0
> 


* Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-01-28 20:04   ` Andrew Morton
@ 2019-01-29 16:17     ` Uladzislau Rezki
  2019-01-29 18:03       ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Uladzislau Rezki @ 2019-01-29 16:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Uladzislau Rezki (Sony),
	Michal Hocko, Matthew Wilcox, linux-mm, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Steven Rostedt, Joel Fernandes,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

On Mon, Jan 28, 2019 at 12:04:29PM -0800, Andrew Morton wrote:
> On Thu, 24 Jan 2019 12:56:48 +0100 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:
> 
> > Commit 763b218ddfaf ("mm: add preempt points into
> > __purge_vmap_area_lazy()") introduced some preempt points, one
> > of which prioritizes an allocation over the lazy freeing of
> > vmap areas.
> > 
> > Prioritizing an allocation over freeing does not work well all
> > the time; it should rather be a compromise.
> > 
> > 1) The number of lazy pages directly influences the busy list
> > length, and thus operations like allocation, lookup, unmap,
> > remove, etc.
> > 
> > 2) Under heavy stress of the vmalloc subsystem, I ran into a
> > situation where memory usage kept increasing, hitting an
> > out_of_memory -> panic state, because the logic that frees
> > vmap areas in __purge_vmap_area_lazy() was completely blocked.
> > 
> > Establish a threshold past which freeing is prioritized back
> > over allocation, creating a balance between the two.
> 
> It would be useful to credit the vmalloc test driver for this
> discovery, and perhaps to identify specifically which test triggered
> the kernel misbehaviour.  Please send along suitable words and I'll add
> them.
> 
Please see more details about the testing below:

<snip>
Using the vmalloc test driver in "stress mode", i.e. when all available
test cases are run simultaneously on all online CPUs to put pressure on
the vmalloc subsystem, my HiKey 960 board runs out of memory because the
__purge_vmap_area_lazy() logic is simply not able to free pages in time.

How I run it:

1) You should build your kernel with CONFIG_TEST_VMALLOC=m
2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

During this test, "vmap_lazy_nr" grows far beyond the acceptable
lazy_max_pages() threshold, which leads to an enormous busy list size
and other problems, including increased allocation time.
<snip>
> 
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  	struct llist_node *valist;
> >  	struct vmap_area *va;
> >  	struct vmap_area *n_va;
> > -	bool do_free = false;
> > +	int resched_threshold;
> >  
> >  	lockdep_assert_held(&vmap_purge_lock);
> >  
> >  	valist = llist_del_all(&vmap_purge_list);
> > +	if (unlikely(valist == NULL))
> > +		return false;
> 
> Why this change?
> 
I decided to refactor a bit, simplify, and get rid of the unneeded
do_free check logic. I think it is more straightforward to just check
once whether the list is empty, instead of writing to "do_free" "n"
times in a loop.

I can drop it, or upload it as a separate patch. What is your view?

> > +	/*
> > +	 * TODO: calculate the flush range without looping.
> > +	 * The list can be up to lazy_max_pages() elements.
> > +	 */
> 
> How important is this?
> 
It depends on the number of lazy pages (vmap_lazy_nr) in the list we
iterate. For example, on my 8-core ARM system with 4 GB of RAM, I see
that __purge_vmap_area_lazy() can take up to 12 milliseconds because of
the long list. That is why there is the cond_resched_lock().

As for the first loop's execution time, it takes ~4-5 milliseconds to
find the flush range. That is probably not so important, since it is
not done in atomic context, meaning it can be interrupted or preempted.
Still, it increases the execution time of the current process that does:

vfree()/etc -> __purge_vmap_area_lazy().

On the other hand, if we could calculate that range at runtime, i.e.
track min/max of va->va_start and va->va_end when a VA is added to the
vmap_purge_list, we could get rid of that loop. But this is just an
idea.
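
A rough sketch of that idea (the helper name here is made up, and the
synchronization that the lockless llist_add() path would need is left
out):

<snip>
/* Running flush range, maintained as areas are queued for lazy purge. */
static unsigned long purge_flush_start = ULONG_MAX;
static unsigned long purge_flush_end;

static void queue_vmap_area_lazy(struct vmap_area *va)
{
	/*
	 * Track the union of all lazily freed ranges here, so that
	 * __purge_vmap_area_lazy() no longer needs its first loop to
	 * compute the range for flush_tlb_kernel_range().
	 */
	if (va->va_start < purge_flush_start)
		purge_flush_start = va->va_start;
	if (va->va_end > purge_flush_end)
		purge_flush_end = va->va_end;

	llist_add(&va->purge_list, &vmap_purge_list);
}
<snip>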

> >  	llist_for_each_entry(va, valist, purge_list) {
> >  		if (va->va_start < start)
> >  			start = va->va_start;
> >  		if (va->va_end > end)
> >  			end = va->va_end;
> > -		do_free = true;
> >  	}
> >  
> > -	if (!do_free)
> > -		return false;
> > -
> >  	flush_tlb_kernel_range(start, end);
> > +	resched_threshold = (int) lazy_max_pages() << 1;
> 
> Is the typecast really needed?
> 
> > Perhaps resched_threshold should have unsigned long type and perhaps
> vmap_lazy_nr should be atomic_long_t?
> 
I think so, especially since atomic_t is a 32-bit integer on both 32-
and 64-bit systems, while lazy_max_pages() deals with unsigned long,
which is 8 bytes on 64-bit systems; thus vmap_lazy_nr should be 8
bytes on 64-bit systems as well.

Should I send it as a separate patch? What is your view?
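
For reference, a sketch of what that conversion might look like (not a
tested patch, just the shape of it):

<snip>
static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);

	/* ... in __purge_vmap_area_lazy(), no cast needed anymore: */
	unsigned long resched_threshold = lazy_max_pages() << 1;

	/* ... per freed area: */
	atomic_long_sub(nr, &vmap_lazy_nr);
	if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
		cond_resched_lock(&vmap_area_lock);
<snip>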

> >  	spin_lock(&vmap_area_lock);
> >  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> > @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  
> >  		__free_vmap_area(va);
> >  		atomic_sub(nr, &vmap_lazy_nr);
> > -		cond_resched_lock(&vmap_area_lock);
> > +
> > +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> > +			cond_resched_lock(&vmap_area_lock);
> >  	}
> >  	spin_unlock(&vmap_area_lock);
> >  	return true;
> 

Thank you for your comments and review.

--
Vlad Rezki


* Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-01-28 22:45   ` Joel Fernandes
@ 2019-01-29 17:39     ` Uladzislau Rezki
  2019-03-06 16:25       ` Joel Fernandes
  0 siblings, 1 reply; 10+ messages in thread
From: Uladzislau Rezki @ 2019-01-29 17:39 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki (Sony),
	Andrew Morton, Michal Hocko, Matthew Wilcox, linux-mm, LKML,
	Thomas Garnier, Oleksiy Avramchenko, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

On Mon, Jan 28, 2019 at 05:45:28PM -0500, Joel Fernandes wrote:
> On Thu, Jan 24, 2019 at 12:56:48PM +0100, Uladzislau Rezki (Sony) wrote:
> > Commit 763b218ddfaf ("mm: add preempt points into
> > __purge_vmap_area_lazy()") introduced some preempt points, one
> > of which prioritizes an allocation over the lazy freeing of
> > vmap areas.
> > 
> > Prioritizing an allocation over freeing does not work well all
> > the time; it should rather be a compromise.
> > 
> > 1) The number of lazy pages directly influences the busy list
> > length, and thus operations like allocation, lookup, unmap,
> > remove, etc.
> > 
> > 2) Under heavy stress of the vmalloc subsystem, I ran into a
> > situation where memory usage kept increasing, hitting an
> > out_of_memory -> panic state, because the logic that frees
> > vmap areas in __purge_vmap_area_lazy() was completely blocked.
> > 
> > Establish a threshold past which freeing is prioritized back
> > over allocation, creating a balance between the two.
> 
> I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
> is greater than half of lazy_max_pages(), which IIUC will be more likely if
> the number of CPUs is large.
> 
The threshold we establish is two times lazy_max_pages(); i.e. with 4
system CPUs, lazy_max_pages() is 24576, so the threshold is 49152 if
PAGE_SIZE is 4096.

It means we allow rescheduling as long as vmap_lazy_nr < 49152. If
vmap_lazy_nr is higher, we forbid rescheduling and free areas until it
drops below the threshold again, to stabilize the system. By doing
that, we do not allow vmap_lazy_nr to grow enormously.
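
For reference, the numbers follow from lazy_max_pages(), which scales
with the log2 of the CPU count (paraphrased from mm/vmalloc.c of that
time, with the arithmetic worked out for this example):

<snip>
static unsigned long lazy_max_pages(void)
{
	unsigned int log;

	log = fls(num_online_cpus());

	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}

/*
 * With 4 online CPUs and PAGE_SIZE == 4096:
 *   fls(4) = 3, so lazy_max_pages() = 3 * 8192 = 24576,
 *   and resched_threshold = 24576 << 1 = 49152.
 */
<snip>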

>
> In fact, when vmap_lazy_nr is high, that's when the latency will be the worst
> so one could say that that's when you *should* reschedule since the frees are
> taking too long and hurting real-time tasks.
> 
> Could this be better solved by tweaking lazy_max_pages() such that purging is
> more aggressive?
> 
> Another approach could be to detect the scenario you brought up (allocations
> happening faster than free), somehow, and avoid a reschedule?
> 
This is what I am trying to achieve with this change.

Thank you for your comments.

--
Vlad Rezki
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 18 ++++++++++++------
> >  1 file changed, 12 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index fb4fb5fcee74..abe83f885069 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  	struct llist_node *valist;
> >  	struct vmap_area *va;
> >  	struct vmap_area *n_va;
> > -	bool do_free = false;
> > +	int resched_threshold;
> >  
> >  	lockdep_assert_held(&vmap_purge_lock);
> >  
> >  	valist = llist_del_all(&vmap_purge_list);
> > +	if (unlikely(valist == NULL))
> > +		return false;
> > +
> > +	/*
> > +	 * TODO: calculate the flush range without looping.
> > +	 * The list can be up to lazy_max_pages() elements.
> > +	 */
> >  	llist_for_each_entry(va, valist, purge_list) {
> >  		if (va->va_start < start)
> >  			start = va->va_start;
> >  		if (va->va_end > end)
> >  			end = va->va_end;
> > -		do_free = true;
> >  	}
> >  
> > -	if (!do_free)
> > -		return false;
> > -
> >  	flush_tlb_kernel_range(start, end);
> > +	resched_threshold = (int) lazy_max_pages() << 1;
> >  
> >  	spin_lock(&vmap_area_lock);
> >  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> > @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  
> >  		__free_vmap_area(va);
> >  		atomic_sub(nr, &vmap_lazy_nr);
> > -		cond_resched_lock(&vmap_area_lock);
> > +
> > +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> > +			cond_resched_lock(&vmap_area_lock);
> >  	}
> >  	spin_unlock(&vmap_area_lock);
> >  	return true;
> > -- 
> > 2.11.0
> > 


* Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-01-29 16:17     ` Uladzislau Rezki
@ 2019-01-29 18:03       ` Andrew Morton
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2019-01-29 18:03 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Michal Hocko, Matthew Wilcox, linux-mm, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Steven Rostedt, Joel Fernandes,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

On Tue, 29 Jan 2019 17:17:54 +0100 Uladzislau Rezki <urezki@gmail.com> wrote:

> > > +	resched_threshold = (int) lazy_max_pages() << 1;
> > 
> > Is the typecast really needed?
> > 
> > Perhaps resched_threshold should have unsigned long type and perhaps
> > vmap_lazy_nr should be atomic_long_t?
> > 
> I think so, especially since atomic_t is a 32-bit integer on both 32-
> and 64-bit systems, while lazy_max_pages() deals with unsigned long,
> which is 8 bytes on 64-bit systems; thus vmap_lazy_nr should be 8
> bytes on 64-bit systems as well.
> 
> Should I send it as a separate patch? What is your view?

Sounds good.  When convenient, please.


* Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-01-29 17:39     ` Uladzislau Rezki
@ 2019-03-06 16:25       ` Joel Fernandes
  2019-03-07 11:15         ` Uladzislau Rezki
  0 siblings, 1 reply; 10+ messages in thread
From: Joel Fernandes @ 2019-03-06 16:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Michal Hocko, Matthew Wilcox, linux-mm, LKML,
	Thomas Garnier, Oleksiy Avramchenko, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

On Tue, Jan 29, 2019 at 06:39:36PM +0100, Uladzislau Rezki wrote:
> On Mon, Jan 28, 2019 at 05:45:28PM -0500, Joel Fernandes wrote:
> > On Thu, Jan 24, 2019 at 12:56:48PM +0100, Uladzislau Rezki (Sony) wrote:
> > > Commit 763b218ddfaf ("mm: add preempt points into
> > > __purge_vmap_area_lazy()") introduced some preempt points, one
> > > of which prioritizes an allocation over the lazy freeing of
> > > vmap areas.
> > > 
> > > Prioritizing an allocation over freeing does not work well all
> > > the time; it should rather be a compromise.
> > > 
> > > 1) The number of lazy pages directly influences the busy list
> > > length, and thus operations like allocation, lookup, unmap,
> > > remove, etc.
> > > 
> > > 2) Under heavy stress of the vmalloc subsystem, I ran into a
> > > situation where memory usage kept increasing, hitting an
> > > out_of_memory -> panic state, because the logic that frees
> > > vmap areas in __purge_vmap_area_lazy() was completely blocked.
> > > 
> > > Establish a threshold past which freeing is prioritized back
> > > over allocation, creating a balance between the two.
> > 
> > I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
> > is greater than half of lazy_max_pages(), which IIUC will be more likely if
> > the number of CPUs is large.
> > 
> The threshold we establish is two times lazy_max_pages(); i.e. with 4
> system CPUs, lazy_max_pages() is 24576, so the threshold is 49152 if
> PAGE_SIZE is 4096.
> 
> It means we allow rescheduling as long as vmap_lazy_nr < 49152. If
> vmap_lazy_nr is higher, we forbid rescheduling and free areas until it
> drops below the threshold again, to stabilize the system. By doing
> that, we do not allow vmap_lazy_nr to grow enormously.

Sorry for the late reply.

This sounds reasonable. Such an extreme situation, with vmap_lazy_nr at
twice lazy_max_pages(), is probably only possible under a stress test
anyway, since (hopefully) the try_purge_vmap_area_lazy() call happens
often enough to keep vmap_lazy_nr low.

Have you experimented to find the highest threshold that prevents the
issues you're seeing? Have you tried 3x or 4x lazy_max_pages()?

I also wonder what the cost of a global TLB flush is these days on the most
common Linux architectures, whether the whole lazy vmap_area purging scheme
is starting to get a bit dated, and whether we could do the purging inline
as areas are freed. There is a cost to having this mechanism too, as you
said: as the list size grows, all other operations also take time.
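
Roughly, I am thinking of something like this (a hypothetical sketch,
the function name is made up; note it trades the batched single TLB
flush for one flush per free, which is exactly the cost in question):

<snip>
static void free_vmap_area_inline(struct vmap_area *va)
{
	/* Flush only this area's range instead of a batched union. */
	flush_tlb_kernel_range(va->va_start, va->va_end);

	spin_lock(&vmap_area_lock);
	__free_vmap_area(va);
	spin_unlock(&vmap_area_lock);
}
<snip>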

thanks,

 - Joel


> > In fact, when vmap_lazy_nr is high, that's when the latency will be the worst
> > so one could say that that's when you *should* reschedule since the frees are
> > taking too long and hurting real-time tasks.
> > 
> > Could this be better solved by tweaking lazy_max_pages() such that purging is
> > more aggressive?
> > 
> > Another approach could be to detect the scenario you brought up (allocations
> > happening faster than free), somehow, and avoid a reschedule?
> > 
> This is what I am trying to achieve with this change.
> 
> Thank you for your comments.
> 
> --
> Vlad Rezki
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  mm/vmalloc.c | 18 ++++++++++++------
> > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index fb4fb5fcee74..abe83f885069 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> > >  	struct llist_node *valist;
> > >  	struct vmap_area *va;
> > >  	struct vmap_area *n_va;
> > > -	bool do_free = false;
> > > +	int resched_threshold;
> > >  
> > >  	lockdep_assert_held(&vmap_purge_lock);
> > >  
> > >  	valist = llist_del_all(&vmap_purge_list);
> > > +	if (unlikely(valist == NULL))
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * TODO: calculate the flush range without looping.
> > > +	 * The list can be up to lazy_max_pages() elements.
> > > +	 */
> > >  	llist_for_each_entry(va, valist, purge_list) {
> > >  		if (va->va_start < start)
> > >  			start = va->va_start;
> > >  		if (va->va_end > end)
> > >  			end = va->va_end;
> > > -		do_free = true;
> > >  	}
> > >  
> > > -	if (!do_free)
> > > -		return false;
> > > -
> > >  	flush_tlb_kernel_range(start, end);
> > > +	resched_threshold = (int) lazy_max_pages() << 1;
> > >  
> > >  	spin_lock(&vmap_area_lock);
> > >  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> > > @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> > >  
> > >  		__free_vmap_area(va);
> > >  		atomic_sub(nr, &vmap_lazy_nr);
> > > -		cond_resched_lock(&vmap_area_lock);
> > > +
> > > +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> > > +			cond_resched_lock(&vmap_area_lock);
> > >  	}
> > >  	spin_unlock(&vmap_area_lock);
> > >  	return true;
> > > -- 
> > > 2.11.0
> > > 


* Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
  2019-03-06 16:25       ` Joel Fernandes
@ 2019-03-07 11:15         ` Uladzislau Rezki
  0 siblings, 0 replies; 10+ messages in thread
From: Uladzislau Rezki @ 2019-03-07 11:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Andrew Morton, Michal Hocko, Matthew Wilcox,
	linux-mm, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, Tejun Heo

> > > 
> > > I'm a bit concerned that this will introduce the latency back if vmap_lazy_nr
> > > is greater than half of lazy_max_pages(), which IIUC will be more likely if
> > > the number of CPUs is large.
> > > 
> > The threshold we establish is two times lazy_max_pages(); i.e. with 4
> > system CPUs, lazy_max_pages() is 24576, so the threshold is 49152 if
> > PAGE_SIZE is 4096.
> > 
> > It means we allow rescheduling as long as vmap_lazy_nr < 49152. If
> > vmap_lazy_nr is higher, we forbid rescheduling and free areas until it
> > drops below the threshold again, to stabilize the system. By doing
> > that, we do not allow vmap_lazy_nr to grow enormously.
> 
> Sorry for the late reply.
> 
> This sounds reasonable. Such an extreme situation, with vmap_lazy_nr at
> twice lazy_max_pages(), is probably only possible under a stress test
> anyway, since (hopefully) the try_purge_vmap_area_lazy() call happens
> often enough to keep vmap_lazy_nr low.
> 
> Have you experimented to find the highest threshold that prevents the
> issues you're seeing? Have you tried 3x or 4x lazy_max_pages()?
> 
I do not think it makes sense to go with a 3x/4x/etc. threshold, for many
reasons. One of them is that we just need to prevent that skew and return
to a reasonable balance.

>
> I also wonder what the cost of a global TLB flush is these days on the most
> common Linux architectures, whether the whole lazy vmap_area purging scheme
> is starting to get a bit dated, and whether we could do the purging inline
> as areas are freed. There is a cost to having this mechanism too, as you
> said: as the list size grows, all other operations also take time.
> 
I guess if we went with flushing the TLB on every single vmap_area free,
then I think we would be in trouble, though I have not analyzed how that
approach impacts performance.

I agree about the cost of having such a mechanism. Basically, it is one of
the sources of increased fragmentation (though not the only one). On the
other hand, the KVA allocator has to be capable of effective and fast
allocation even under that condition.

I am working on v2 of https://lkml.org/lkml/2018/10/19/786. When I complete
some other job-related tasks, I will upload a new RFC.

--
Vlad Rezki
