From: Roman Gushchin <guro@fb.com>
To: Uladzislau Rezki <urezki@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Thomas Garnier <thgarnie@google.com>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Joel Fernandes <joelaf@google.com>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [RESEND PATCH 1/3] mm/vmap: keep track of free blocks for vmap allocation
Date: Thu, 4 Apr 2019 16:52:45 +0000	[thread overview]
Message-ID: <20190404165240.GA9713@tower.DHCP.thefacebook.com> (raw)
In-Reply-To: <20190404154320.pf3lkwm5zcblvsfv@pc636>

On Thu, Apr 04, 2019 at 05:43:20PM +0200, Uladzislau Rezki wrote:
> Hello, Roman!
> 
> > 
> > The patch looks really good to me! I've tried hard, but didn't find
> > any serious issues/bugs. Some small nits below.
> > 
> > Thank you for working on it!
> > 
> I try my best, thank you for your review!
> 
> > BTW, when sending a new iteration, please use "[PATCH vX]" subject prefix,
> > e.g. "[PATCH v3 1/3] mm/vmap: keep track of free blocks for vmap allocation".
> > RESEND usually means that you're sending the same version, e.g. when
> > you need to cc more people.
> > 
> Thank you for the clarification. I will fix that next time.
> 
> > 
> > On Tue, Apr 02, 2019 at 06:25:29PM +0200, Uladzislau Rezki (Sony) wrote:
> > > Currently an allocation of a new vmap area is done by iterating
> > > over the busy list (complexity O(n)) until a suitable hole is
> > > found between two busy areas. Each new allocation therefore grows
> > > the list. Due to an over-fragmented list and different permissive
> > > parameters, an allocation can take a long time; on embedded
> > > devices it can be milliseconds.
> > > 
> > > This patch organizes the KVA memory layout into free areas of the
> > > 1-ULONG_MAX range. It uses an augmented red-black tree that keeps
> > > blocks sorted by their offsets, paired with a linked list that
> > > keeps the free blocks in order of increasing addresses.
> > > 
> > > Each vmap_area object contains a "subtree_max_size" field that
> > > reflects the maximum available free block in its left or right
> > > sub-tree. This makes it possible to decide, at each node, which
> > > way to traverse toward the block that will fit and will have the
> > > lowest start address, i.e. sequential allocation.
> > 
> > I'd add here that an augmented red-black tree is used, and nodes
> > are augmented with the size of the maximum available free block.
> > 
> Will add.
> 
> > > 
> > > Allocation: to allocate a new block, the tree is searched for the
> > > lowest (left-most) block that is large enough to satisfy the
> > > requested size, alignment and vstart point. If the block is bigger
> > > than the requested size, it is split.
> > > 
> > > De-allocation: when a busy vmap area is freed, it is either merged
> > > or inserted into the tree. The red-black tree allows a spot to be
> > > found efficiently, whereas the linked list provides constant-time
> > > access to the previous and next blocks to check whether merging is
> > > possible. When a de-allocated memory chunk is merged, a large
> > > coalesced area is created.
> > > 
> > > Complexity: ~O(log(N))
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  include/linux/vmalloc.h |    6 +-
> > >  mm/vmalloc.c            | 1004 +++++++++++++++++++++++++++++++++++------------
> > >  2 files changed, 762 insertions(+), 248 deletions(-)
> > > 
> > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > index 398e9c95cd61..ad483378fdd1 100644
> > > --- a/include/linux/vmalloc.h
> > > +++ b/include/linux/vmalloc.h
> > > @@ -45,12 +45,16 @@ struct vm_struct {
> > >  struct vmap_area {
> > >  	unsigned long va_start;
> > >  	unsigned long va_end;
> > > +
> > > +	/*
> > > +	 * Largest available free size in subtree.
> > > +	 */
> > > +	unsigned long subtree_max_size;
> > >  	unsigned long flags;
> > >  	struct rb_node rb_node;         /* address sorted rbtree */
> > >  	struct list_head list;          /* address sorted list */
> > >  	struct llist_node purge_list;    /* "lazy purge" list */
> > >  	struct vm_struct *vm;
> > > -	struct rcu_head rcu_head;
> > >  };
> > >  
> > >  /*
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 755b02983d8d..3adbad3fb6c1 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -31,6 +31,7 @@
> > >  #include <linux/compiler.h>
> > >  #include <linux/llist.h>
> > >  #include <linux/bitops.h>
> > > +#include <linux/rbtree_augmented.h>
> > >  
> > >  #include <linux/uaccess.h>
> > >  #include <asm/tlbflush.h>
> > > @@ -320,9 +321,7 @@ unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
> > >  }
> > >  EXPORT_SYMBOL(vmalloc_to_pfn);
> > >  
> > > -
> > >  /*** Global kva allocator ***/
> > > -
> > 
> > Do we need this change?
> >
> This patch is not intended to refactor the code. I removed the extra empty
> lines because I touched the code around them. I can either keep that change
> or remove it. What is your opinion?

Usually it's better to separate cosmetic changes from functional ones, if
you're not directly touching these lines. Not a big deal, of course.

> 
> > >  #define VM_LAZY_FREE	0x02
> > >  #define VM_VM_AREA	0x04
> > >  
> > > @@ -331,14 +330,76 @@ static DEFINE_SPINLOCK(vmap_area_lock);
> > >  LIST_HEAD(vmap_area_list);
> > >  static LLIST_HEAD(vmap_purge_list);
> > >  static struct rb_root vmap_area_root = RB_ROOT;
> > > +static bool vmap_initialized __read_mostly;
> > > +
> > > +/*
> > > + * This kmem_cache is used for vmap_area objects. Instead of
> > > + * allocating from slab each time, we reuse an object from this
> > > + * cache to make things faster, especially in the "no edge"
> > > + * splitting of a free block.
> > > + */
> > > +static struct kmem_cache *vmap_area_cachep;
> > > +
> > > +/*
> > > + * This linked list is used together with free_vmap_area_root.
> > > + * It gives O(1) access to prev/next to perform fast coalescing.
> > > + */
> > > +static LIST_HEAD(free_vmap_area_list);
> > > +
> > > +/*
> > > + * This augmented red-black tree represents the free vmap space.
> > > + * All vmap_area objects in this tree are sorted by va->va_start
> > > + * address. It is used for allocation and for merging when a vmap
> > > + * object is released.
> > > + *
> > > + * Each vmap_area node contains the maximum available free block
> > > + * size of its sub-tree, right or left. Therefore it is possible
> > > + * to find the lowest-address match of a free area.
> > > + */
> > > +static struct rb_root free_vmap_area_root = RB_ROOT;
> > >  
> > > -/* The vmap cache globals are protected by vmap_area_lock */
> > > -static struct rb_node *free_vmap_cache;
> > > -static unsigned long cached_hole_size;
> > > -static unsigned long cached_vstart;
> > > -static unsigned long cached_align;
> > > +static __always_inline unsigned long
> > > +__va_size(struct vmap_area *va)
> > > +{
> > > +	return (va->va_end - va->va_start);
> > > +}
> > > +
> > > +static __always_inline unsigned long
> > > +get_subtree_max_size(struct rb_node *node)
> > > +{
> > > +	struct vmap_area *va;
> > >  
> > > -static unsigned long vmap_area_pcpu_hole;
> > > +	va = rb_entry_safe(node, struct vmap_area, rb_node);
> > > +	return va ? va->subtree_max_size : 0;
> > > +}
> > > +
> > > +/*
> > > + * Gets called when removing a node or rotating.
> > > + */
> > > +static __always_inline unsigned long
> > > +compute_subtree_max_size(struct vmap_area *va)
> > > +{
> > > +	unsigned long max_size = __va_size(va);
> > > +	unsigned long child_max_size;
> > > +
> > > +	child_max_size = get_subtree_max_size(va->rb_node.rb_right);
> > > +	if (child_max_size > max_size)
> > > +		max_size = child_max_size;
> > > +
> > > +	child_max_size = get_subtree_max_size(va->rb_node.rb_left);
> > > +	if (child_max_size > max_size)
> > > +		max_size = child_max_size;
> > > +
> > > +	return max_size;
> > 
> > Nit: you can use max3 instead, e.g. :
> > 
> > return max3(__va_size(va),
> > 	    get_subtree_max_size(va->rb_node.rb_left),
> > 	    get_subtree_max_size(va->rb_node.rb_right));
> > 
> Good point. Will replace it!
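
For reference, here is a stand-alone, user-space sketch of the max3 form
discussed above. The max3 macro and the node type are simplified stand-ins,
not the kernel's (the kernel gets max3() from its own headers and uses
struct vmap_area):

```c
#include <stddef.h>

/* Simplified stand-in for struct vmap_area. */
struct node {
	unsigned long va_start, va_end;
	unsigned long subtree_max_size;	/* max free size in this subtree */
	struct node *left, *right;
};

/* Local stand-in for the kernel's max3(). */
#define max3(a, b, c) ((a) > (b) ? ((a) > (c) ? (a) : (c)) \
			       : ((b) > (c) ? (b) : (c)))

static unsigned long va_size(const struct node *va)
{
	return va->va_end - va->va_start;
}

static unsigned long get_subtree_max_size(const struct node *n)
{
	/* A missing child contributes no free space. */
	return n ? n->subtree_max_size : 0;
}

/* The max3 form: own size vs. the two children's subtree maxima. */
static unsigned long compute_subtree_max_size(const struct node *va)
{
	return max3(va_size(va),
		    get_subtree_max_size(va->left),
		    get_subtree_max_size(va->right));
}
```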
> 
> > > +}
> > > +
> > > +RB_DECLARE_CALLBACKS(static, free_vmap_area_rb_augment_cb,
> > > +	struct vmap_area, rb_node, unsigned long, subtree_max_size,
> > > +	compute_subtree_max_size)
> > > +
> > > +static void purge_vmap_area_lazy(void);
> > > +static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
> > > +static unsigned long lazy_max_pages(void);
> > >  
> > >  static struct vmap_area *__find_vmap_area(unsigned long addr)
> > >  {
> > > @@ -359,41 +420,520 @@ static struct vmap_area *__find_vmap_area(unsigned long addr)
> > >  	return NULL;
> > >  }
> > >  
> > > -static void __insert_vmap_area(struct vmap_area *va)
> > > -{
> > > -	struct rb_node **p = &vmap_area_root.rb_node;
> > > -	struct rb_node *parent = NULL;
> > > -	struct rb_node *tmp;
> > > +/*
> > > + * This function returns the addresses of the parent node
> > > + * and of its left or right link, for further processing.
> > > + */
> > > +static __always_inline struct rb_node **
> > > +__find_va_links(struct vmap_area *va,
> > > +	struct rb_root *root, struct rb_node *from,
> > > +	struct rb_node **parent)
> > 
> > The function looks much cleaner now, thank you!
> > 
> > But if I understand it correctly, it returns a node (via parent)
> > and a pointer to one of two links, so that the returned value
> > is always == parent + some constant offset.
> > If so, I wonder if it's cleaner to return a parent node
> > (as rb_node*) and a bool value which will indicate if the left
> > or the right link should be used.
> > 
> > Not a strong opinion, just an idea.
> > 
> I see your point. Yes, it is possible to return a "bool" value that
> indicates the left or right path, and detect the direction from it.
> 
> On the other hand, we end up accessing the correct link anyway while
> traversing the tree. With the "bool" approach, we would need some extra
> logic on top to check where to attach.

Sure, makes sense. I'd add some comments here then.
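
To illustrate the point for other readers, a minimal user-space sketch of
the link-pointer style (simplified stand-in types and names, not the actual
__find_va_links() code): the descent already computes the address of the
child slot where the new node belongs, so the caller can attach through it
directly with no extra left/right decision.

```c
#include <stddef.h>

/* Simplified stand-in for struct vmap_area / rb_node. */
struct node {
	unsigned long va_start, va_end;
	struct node *left, *right;
};

/*
 * Walk down from *root and return the address of the NULL child
 * slot where a node starting at 'start' belongs; the parent node
 * is reported via *parent.
 */
static struct node **find_links(struct node **root, unsigned long start,
				struct node **parent)
{
	struct node **link = root;

	*parent = NULL;
	while (*link) {
		*parent = *link;
		if (start < (*link)->va_start)
			link = &(*link)->left;
		else
			link = &(*link)->right;
	}
	return link;
}

static void insert(struct node **root, struct node *n)
{
	struct node *parent;
	struct node **link = find_links(root, n->va_start, &parent);

	*link = n;	/* attach directly through the returned link */
}
```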

Thanks!
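
P.S. For anyone skimming the thread, the lowest-address fit described in
the commit message can be sketched in user space roughly like this
(simplified stand-in types; alignment and the vstart constraint are
ignored for brevity, and subtree_max_size is assumed to be kept up to
date):

```c
#include <stddef.h>

/* Simplified stand-in for struct vmap_area. */
struct node {
	unsigned long va_start, va_end;
	unsigned long subtree_max_size;	/* max free size in this subtree */
	struct node *left, *right;
};

static unsigned long size_of(const struct node *n)
{
	return n->va_end - n->va_start;
}

/*
 * Find the free block with the lowest start address that is at least
 * 'size' bytes: prefer the left subtree (lower addresses) whenever it
 * can satisfy the request, otherwise try the current node, otherwise
 * descend right.
 */
static struct node *find_lowest_fit(struct node *n, unsigned long size)
{
	while (n) {
		if (n->left && n->left->subtree_max_size >= size)
			n = n->left;
		else if (size_of(n) >= size)
			return n;
		else
			n = n->right;
	}
	return NULL;
}
```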

Thread overview: 12+ messages
2019-04-02 16:25 [RESEND PATCH 0/3] improve vmap allocation Uladzislau Rezki (Sony)
2019-04-02 16:25 ` [RESEND PATCH 1/3] mm/vmap: keep track of free blocks for " Uladzislau Rezki (Sony)
2019-04-03 21:06   ` Roman Gushchin
2019-04-04 15:43     ` Uladzislau Rezki
2019-04-04 16:52       ` Roman Gushchin [this message]
2019-04-04 17:21         ` Uladzislau Rezki
2019-04-02 16:25 ` [RESEND PATCH 2/3] mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro Uladzislau Rezki (Sony)
2019-04-03 21:15   ` Roman Gushchin
2019-04-04 15:53     ` Uladzislau Rezki
2019-04-02 16:25 ` [RESEND PATCH 3/3] mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro Uladzislau Rezki (Sony)
2019-04-03 21:10   ` Roman Gushchin
2019-04-02 19:26 ` [RESEND PATCH 0/3] improve vmap allocation Andrew Morton
