From: Uladzislau Rezki <urezki@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Thomas Garnier <thgarnie@google.com>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Joel Fernandes <joelaf@google.com>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy()
Date: Tue, 29 Jan 2019 17:17:54 +0100	[thread overview]
Message-ID: <20190129161754.phdr3puhp4pjrnao@pc636> (raw)
In-Reply-To: <20190128120429.17819bd348753c2d7ed3a7b9@linux-foundation.org>

On Mon, Jan 28, 2019 at 12:04:29PM -0800, Andrew Morton wrote:
> On Thu, 24 Jan 2019 12:56:48 +0100 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:
> 
> > commit 763b218ddfaf ("mm: add preempt points into
> > __purge_vmap_area_lazy()")
> > 
> > introduced some preemption points; one of them prioritizes an
> > allocation over the lazy freeing of vmap areas.
> > 
> > Prioritizing allocation over freeing does not work well
> > all the time, i.e. it should rather be a compromise.
> > 
> > 1) The number of lazy pages directly influences the busy list
> > length, and thus operations like allocation, lookup, unmap, remove, etc.
> > 
> > 2) Under heavy stress on the vmalloc subsystem, I ran into a
> > situation where memory usage keeps growing, hitting the
> > out_of_memory -> panic state, because the logic that frees vmap
> > areas in __purge_vmap_area_lazy() is completely blocked.
> > 
> > Establish a threshold; once it is passed, freeing is prioritized
> > back over allocation, creating a balance between the two.
> 
> It would be useful to credit the vmalloc test driver for this
> discovery, and perhaps to identify specifically which test triggered
> the kernel misbehaviour.  Please send along suitable words and I'll add
> them.
> 
Please see below for more details of the testing:

<snip>
Using the vmalloc test driver in "stress mode", i.e. when all available
test cases are run simultaneously on all online CPUs to put pressure on
the vmalloc subsystem, my HiKey 960 board runs out of memory because the
__purge_vmap_area_lazy() logic simply cannot free pages in time.

How I run it:

1) Build your kernel with CONFIG_TEST_VMALLOC=m
2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

During this test, "vmap_lazy_nr" grows far beyond the acceptable
lazy_max_pages() threshold, which leads to an enormous busy list
and other problems, including increased allocation time.
<snip>
> 
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -661,23 +661,27 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  	struct llist_node *valist;
> >  	struct vmap_area *va;
> >  	struct vmap_area *n_va;
> > -	bool do_free = false;
> > +	int resched_threshold;
> >  
> >  	lockdep_assert_held(&vmap_purge_lock);
> >  
> >  	valist = llist_del_all(&vmap_purge_list);
> > +	if (unlikely(valist == NULL))
> > +		return false;
> 
> Why this change?
> 
I decided to refactor a bit: simplify the code and get rid of the
unneeded "do_free" check logic. I think it is more straightforward
to check once whether the list is empty, instead of writing to
"do_free" on every one of the "n" loop iterations.

I can drop it, or send it as a separate patch. What is your view?

> > +	/*
> > +	 * TODO: to calculate a flush range without looping.
> > +	 * The list can be up to lazy_max_pages() elements.
> > +	 */
> 
> How important is this?
> 
It depends on the number of lazy pages in the list we iterate. For
example, on my 8-core ARM system with 4 GB of RAM, I see that
__purge_vmap_area_lazy() can take up to 12 milliseconds because of the
long list. That is why the cond_resched_lock() is there.

As for the first loop, it takes ~4-5 milliseconds to work out the flush
range. It is probably not that important, since it does not run in
atomic context and can therefore be interrupted or preempted; it only
increases the execution time of the current process doing:

vfree()/etc -> __purge_vmap_area_lazy().

On the other hand, if we calculated that range at insertion time, i.e.
updated the min/max with va->va_start and va->va_end whenever a VA is
added to vmap_purge_list, we could get rid of that loop entirely. But
this is just an idea.

> >  	llist_for_each_entry(va, valist, purge_list) {
> >  		if (va->va_start < start)
> >  			start = va->va_start;
> >  		if (va->va_end > end)
> >  			end = va->va_end;
> > -		do_free = true;
> >  	}
> >  
> > -	if (!do_free)
> > -		return false;
> > -
> >  	flush_tlb_kernel_range(start, end);
> > +	resched_threshold = (int) lazy_max_pages() << 1;
> 
> Is the typecast really needed?
> 
> Perhaps resched_threshold shiould have unsigned long type and perhaps
> vmap_lazy_nr should be atomic_long_t?
> 
I think so, especially since atomic_t is a 32-bit integer on both 32-
and 64-bit systems, whereas lazy_max_pages() deals with unsigned long,
which is 8 bytes on a 64-bit system; thus vmap_lazy_nr should be 8
bytes on 64-bit as well.

Should I send it as a separate patch? What is your view?

> >  	spin_lock(&vmap_area_lock);
> >  	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
> > @@ -685,7 +689,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >  
> >  		__free_vmap_area(va);
> >  		atomic_sub(nr, &vmap_lazy_nr);
> > -		cond_resched_lock(&vmap_area_lock);
> > +
> > +		if (atomic_read(&vmap_lazy_nr) < resched_threshold)
> > +			cond_resched_lock(&vmap_area_lock);
> >  	}
> >  	spin_unlock(&vmap_area_lock);
> >  	return true;
> 

Thank you for your comments and review.

--
Vlad Rezki

Thread overview: 10+ messages
2019-01-24 11:56 [PATCH v1 0/2] stability fixes for vmalloc allocator Uladzislau Rezki (Sony)
2019-01-24 11:56 ` [PATCH v1 1/2] mm/vmalloc: fix kernel BUG at mm/vmalloc.c:512! Uladzislau Rezki (Sony)
2019-01-24 11:56 ` [PATCH v1 2/2] mm: add priority threshold to __purge_vmap_area_lazy() Uladzislau Rezki (Sony)
2019-01-28 20:04   ` Andrew Morton
2019-01-29 16:17     ` Uladzislau Rezki [this message]
2019-01-29 18:03       ` Andrew Morton
2019-01-28 22:45   ` Joel Fernandes
2019-01-29 17:39     ` Uladzislau Rezki
2019-03-06 16:25       ` Joel Fernandes
2019-03-07 11:15         ` Uladzislau Rezki
