All of lore.kernel.org
* [Ksummit-discuss] [TECH TOPIC] Reworking of KVA allocator in Linux kernel
@ 2019-05-30  6:05 Theodore Ts'o
  2019-05-31  2:51 ` Matthew Wilcox
  0 siblings, 1 reply; 4+ messages in thread
From: Theodore Ts'o @ 2019-05-30  6:05 UTC (permalink / raw)
  To: ksummit-discuss; +Cc: Uladzislau Rezki

From: Uladzislau Rezki <urezki@gmail.com>

[ Note: The following abstract was submitted via the Linux Plumbers
  Conference website.  Per the instructions that were posted for the
  Maintainer's / Kernel Summit Call for Proposals[1], the proposal
  should also be posted on the ksummit-discuss list, so that people
  can comment on the proposal, and perhaps start a discussion before
  the summit.

  [1] https://lwn.net/Articles/788378/

  Please note that topic proposals for both the Kernel Summit and the
  Maintainer's Summit are still welcome, and the deadline has been
  extended to June 3rd. -- Ted ]

Hello.

I would like to give a talk about the KVA allocator in the kernel and
the improvements I have made to it.

The presentation is available here:

ftp://vps418301.ovh.net/incoming/Reworking_of_KVA_allocator_in_Linux_kernel.pdf

Thank you in advance!

--
Vlad Rezki


* Re: [Ksummit-discuss] [TECH TOPIC] Reworking of KVA allocator in Linux kernel
  2019-05-30  6:05 [Ksummit-discuss] [TECH TOPIC] Reworking of KVA allocator in Linux kernel Theodore Ts'o
@ 2019-05-31  2:51 ` Matthew Wilcox
  2019-06-02 12:05   ` Uladzislau Rezki
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2019-05-31  2:51 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Uladzislau Rezki, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

Vlad, I was under the impression this work was complete. Are there any
remaining issues to discuss?

On Thu., May 30, 2019, 02:15 Theodore Ts'o, <tytso@mit.edu> wrote:

> From: Uladzislau Rezki <urezki@gmail.com>
>
> [ Note: The following abstract was submitted via the Linux Plumbers
>   Conference website.  Per the instructions that were posted for the
>   Maintainer's / Kernel Summit Call for Proposals[1], the proposal
>   should also be posted on the ksummit-discuss list, so that people
>   can comment on the proposal, and perhaps start a discussion before
>   the summit.
>
>   [1] https://lwn.net/Articles/788378/
>
>   Please note that topic proposals for both the Kernel Summit and the
>   Maintainer's Summit are still welcome, and the deadline has been
>   extended to June 3rd. -- Ted ]
>
> Hello.
>
> I would like to give a talk about the KVA allocator in the kernel and
> the improvements I have made to it.
>
> The presentation is available here:
>
>
> ftp://vps418301.ovh.net/incoming/Reworking_of_KVA_allocator_in_Linux_kernel.pdf
>
> Thank you in advance!
>
> --
> Vlad Rezki
> _______________________________________________
> Ksummit-discuss mailing list
> Ksummit-discuss@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss
>

[-- Attachment #2: Type: text/html, Size: 2199 bytes --]


* Re: [Ksummit-discuss] [TECH TOPIC] Reworking of KVA allocator in Linux kernel
  2019-05-31  2:51 ` Matthew Wilcox
@ 2019-06-02 12:05   ` Uladzislau Rezki
  2019-06-02 18:31     ` Uladzislau Rezki
  0 siblings, 1 reply; 4+ messages in thread
From: Uladzislau Rezki @ 2019-06-02 12:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Uladzislau Rezki, ksummit-discuss

Hello, Matthew.

>
> Vlad, I was under the impression this work was complete.
>
Thank you. Actually, there was a discussion about this some time ago:

<snip>
I think our real problem is that we have no data structure that stores
free VA space.  We have the vmap_area which stores allocated space, but no
data structure to store free space.
<snip>

and that was a good argument to start examining the KVA allocator and its problems :)
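
For context, that is the core idea behind the rework: free VA ranges get their
own augmented rb-tree, where every node caches the largest free size in its
subtree, so a fitting hole can be found in O(log n) without scanning the
allocated areas. A simplified illustration of the idea (names are made up and
alignment/vstart handling is left out, so this is not the actual mm/vmalloc.c
code):

<snip>
#include <linux/rbtree.h>

/* Illustrative node describing a *free* VA range. */
struct free_va {
	struct rb_node rb_node;
	unsigned long va_start;
	unsigned long va_end;
	/* Largest free range size in the subtree rooted at this node. */
	unsigned long subtree_max_size;
};

static unsigned long va_size(struct free_va *va)
{
	return va->va_end - va->va_start;
}

static unsigned long subtree_max(struct rb_node *node)
{
	return node ?
		rb_entry(node, struct free_va, rb_node)->subtree_max_size : 0;
}

/* Find the lowest free range that can hold @size, or NULL. */
static struct free_va *find_lowest_fit(struct rb_root *root, unsigned long size)
{
	struct rb_node *node = root->rb_node;

	while (node) {
		struct free_va *va = rb_entry(node, struct free_va, rb_node);

		if (subtree_max(node->rb_left) >= size)
			node = node->rb_left;	/* a lower hole fits */
		else if (va_size(va) >= size)
			return va;		/* this hole fits */
		else
			node = node->rb_right;	/* try higher addresses */
	}
	return NULL;
}
<snip>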

>
> Are there any remaining issues to discuss?
>
If we look at it from the point of view of open issues, I do not see any.
Though there are some small things I would like to refactor; for instance,
see below:

https://lkml.org/lkml/2019/5/28/1040

Apart from that, there is still room for improvement. As an example, I would
like to reduce lock contention. In general that means making the whole logic
faster and/or reworking the locking. Below is the perf output from stressing
my box (Intel Xeon, 6 physical CPUs, ~3.8 GHz) by running 6 simultaneous
pinned jobs which do random allocations:

<snip>
  49.55%  [kernel]               [k] native_queued_spin_lock_slowpath
   7.49%  [kernel]               [k] get_page_from_freelist
   4.65%  [kernel]               [k] alloc_vmap_area
   4.15%  [kernel]               [k] _raw_spin_lock
   4.15%  [kernel]               [k] free_unref_page
   2.80%  [kernel]               [k] __alloc_pages_nodemask
   2.53%  [kernel]               [k] insert_vmap_area.constprop.48
   2.47%  [kernel]               [k] vunmap_page_range
   2.10%  [kernel]               [k] __free_pages
   1.79%  [kernel]               [k] find_vmap_area
   1.66%  [kernel]               [k] vmap_page_range_noflush
   1.29%  [kernel]               [k] alloc_pages_current
   1.28%  [kernel]               [k] __free_vmap_area
   1.25%  [kernel]               [k] prep_new_page
   0.91%  [kernel]               [k] llist_add_batch
   0.87%  [kernel]               [k] free_vmap_area_noflush
   0.79%  [kernel]               [k] __vunmap
   0.67%  [kernel]               [k] free_unref_page_prepare.part.69
   0.60%  [kernel]               [k] _cond_resched
<snip>
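
For reference, the stress jobs are nothing special. A minimal kernel-module
sketch of that kind of job would look roughly like the following (illustrative
only, not the exact test code; it is similar in spirit to the in-tree
lib/test_vmalloc.c module):

<snip>
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/vmalloc.h>
#include <linux/random.h>
#include <linux/cpumask.h>
#include <linux/err.h>

static struct task_struct *workers[NR_CPUS];

static int stress_fn(void *arg)
{
	while (!kthread_should_stop()) {
		/* Random size: 1..64 pages. */
		unsigned long nr_pages = (get_random_int() % 64) + 1;
		void *p = vmalloc(nr_pages << PAGE_SHIFT);

		if (p)
			vfree(p);
		cond_resched();
	}
	return 0;
}

static int __init stress_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		workers[cpu] = kthread_create(stress_fn, NULL,
					      "vmalloc-stress/%d", cpu);
		if (IS_ERR(workers[cpu])) {
			workers[cpu] = NULL;
			continue;
		}
		kthread_bind(workers[cpu], cpu);	/* pin the job to one CPU */
		wake_up_process(workers[cpu]);
	}
	return 0;
}

static void __exit stress_exit(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (workers[cpu])
			kthread_stop(workers[cpu]);
}

module_init(stress_init);
module_exit(stress_exit);
MODULE_LICENSE("GPL");
<snip>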

Here are some proposals:

1) We can maintain a pointer to the last free area we allocated from, giving
O(1) access to that block when the request parameters permit it. Something
like this:

<snip>
	if (last_free_area.va && vstart == last_free_area.vstart &&
		align >= last_free_area.align &&
		size >= last_free_area.size)

	/* Use last cached node and do not lookup from the root of the tree */
<snip>
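
To make that a bit more concrete, here is a slightly fuller illustration
(hypothetical names, not actual mm/vmalloc.c code). The cache would of course
have to be invalidated whenever that free block is changed or merged on the
free path:

<snip>
/* Hypothetical cache of the free block the last allocation came from. */
struct kva_cache {
	struct vmap_area *va;	/* last free block we allocated from */
	unsigned long vstart;	/* search window it was found for */
};

static struct kva_cache last_free_area;

static struct vmap_area *
kva_cache_lookup(unsigned long size, unsigned long align, unsigned long vstart)
{
	struct vmap_area *va = last_free_area.va;

	/* Reuse the cached block only if it still satisfies the request. */
	if (va && vstart == last_free_area.vstart &&
	    ALIGN(va->va_start, align) + size <= va->va_end)
		return va;

	return NULL;	/* fall back to the lookup from the root of the tree */
}
<snip>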

2) Get rid of the "busy" tree that stores allocated areas, or replace it with
something faster. We only need it to map va->va_start back to its vmap_area
object when the area is released. One possible direction is sketched below.
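
Purely as an illustration (hypothetical helpers; whether it is actually faster
under contention would have to be measured), an xarray keyed by the start page
could stand in for the rb-tree lookup on the release path:

<snip>
#include <linux/xarray.h>

/* Hypothetical replacement for the "busy" rb-tree: key is the start page. */
static DEFINE_XARRAY(busy_vas);

static int busy_insert(struct vmap_area *va, gfp_t gfp)
{
	/* xa_store() returns an error entry on failure; convert it to errno. */
	return xa_err(xa_store(&busy_vas, va->va_start >> PAGE_SHIFT, va, gfp));
}

static struct vmap_area *busy_remove(unsigned long addr)
{
	/* Returns the stored vmap_area, or NULL if the address is unknown. */
	return xa_erase(&busy_vas, addr >> PAGE_SHIFT);
}
<snip>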

3) We can remove a vmap_area node from the busy tree as soon as the object is
released. This is possible now because we allocate from a separate tree. It
will improve insertion time into the busy tree; otherwise the node stays there
until the "lazy" logic removes it:

<snip>
@@ -1754,8 +1754,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
                return;
        }
 
-       va = find_vmap_area(addr);
+       spin_lock(&vmap_area_lock);
+       va = __find_vmap_area(addr);
        BUG_ON(!va);
+       unlink_va(va, &vmap_area_root);
+       spin_unlock(&vmap_area_lock);
+
        debug_check_no_locks_freed((void *)va->va_start,
                                    (va->va_end - va->va_start));
        free_unmap_vmap_area(va);
@@ -2162,6 +2166,7 @@ struct vm_struct *remove_vm_area(const void *addr)
                va->vm = NULL;
                va->flags &= ~VM_VM_AREA;
                va->flags |= VM_LAZY_FREE;
+               unlink_va(va, &vmap_area_root);
                spin_unlock(&vmap_area_lock);
 
                kasan_free_shadow(vm); 
<snip>

All those things could be discussed on LKML. If there are higher-priority
topics to discuss, I do not want to waste the time, and we can drop my
proposed topic from the Kernel Summit.

--
Vlad Rezki


* Re: [Ksummit-discuss] [TECH TOPIC] Reworking of KVA allocator in Linux kernel
  2019-06-02 12:05   ` Uladzislau Rezki
@ 2019-06-02 18:31     ` Uladzislau Rezki
  0 siblings, 0 replies; 4+ messages in thread
From: Uladzislau Rezki @ 2019-06-02 18:31 UTC (permalink / raw)
  To: Uladzislau Rezki; +Cc: ksummit-discuss

On Sun, Jun 02, 2019 at 02:05:10PM +0200, Uladzislau Rezki wrote:
> Hello, Matthew.
> 
> >
> > Vlad, I was under the impression this work was complete.
> >
> Thank you. Actually, there was a discussion about this some time ago:
> 
> <snip>
> I think our real problem is that we have no data structure that stores
> free VA space.  We have the vmap_area which stores allocated space, but no
> data structure to store free space.
> <snip>
> 
> and that was a good argument to start examining the KVA allocator and its problems :)
> 
> >
> > Are there any remaining issues to discuss?
> >
> If we look at it from the point of view of open issues, I do not see any.
> Though there are some small things I would like to refactor; for instance,
> see below:
> 
> https://lkml.org/lkml/2019/5/28/1040
> 
> Apart from that, there is still room for improvement. As an example, I would
> like to reduce lock contention. In general that means making the whole logic
> faster and/or reworking the locking. Below is the perf output from stressing
> my box (Intel Xeon, 6 physical CPUs, ~3.8 GHz) by running 6 simultaneous
> pinned jobs which do random allocations:
> 
> <snip>
>   49.55%  [kernel]               [k] native_queued_spin_lock_slowpath
>    7.49%  [kernel]               [k] get_page_from_freelist
>    4.65%  [kernel]               [k] alloc_vmap_area
>    4.15%  [kernel]               [k] _raw_spin_lock
>    4.15%  [kernel]               [k] free_unref_page
>    2.80%  [kernel]               [k] __alloc_pages_nodemask
>    2.53%  [kernel]               [k] insert_vmap_area.constprop.48
>    2.47%  [kernel]               [k] vunmap_page_range
>    2.10%  [kernel]               [k] __free_pages
>    1.79%  [kernel]               [k] find_vmap_area
>    1.66%  [kernel]               [k] vmap_page_range_noflush
>    1.29%  [kernel]               [k] alloc_pages_current
>    1.28%  [kernel]               [k] __free_vmap_area
>    1.25%  [kernel]               [k] prep_new_page
>    0.91%  [kernel]               [k] llist_add_batch
>    0.87%  [kernel]               [k] free_vmap_area_noflush
>    0.79%  [kernel]               [k] __vunmap
>    0.67%  [kernel]               [k] free_unref_page_prepare.part.69
>    0.60%  [kernel]               [k] _cond_resched
> <snip>
> 
> Here are some proposals:
> 
> 1) We can maintain a pointer to the last free area we allocated from, giving
> O(1) access to that block when the request parameters permit it. Something
> like this:
> 
> <snip>
> 	if (last_free_area.va && vstart == last_free_area.vstart &&
> 		align >= last_free_area.align &&
> 		size >= last_free_area.size)
> 
> 	/* Use last cached node and do not lookup from the root of the tree */
> <snip>
> 
> 2) Get rid of the "busy" tree that stores allocated areas, or replace it with
> something faster. We only need it to map va->va_start back to its vmap_area
> object when the area is released.
> 
> 3) We can remove a vmap_area node from the busy tree as soon as the object is
> released. This is possible now because we allocate from a separate tree. It
> will improve insertion time into the busy tree; otherwise the node stays there
> until the "lazy" logic removes it:
> 
> <snip>
> @@ -1754,8 +1754,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
>                 return;
>         }
>  
> -       va = find_vmap_area(addr);
> +       spin_lock(&vmap_area_lock);
> +       va = __find_vmap_area(addr);
>         BUG_ON(!va);
> +       unlink_va(va, &vmap_area_root);
> +       spin_unlock(&vmap_area_lock);
> +
>         debug_check_no_locks_freed((void *)va->va_start,
>                                     (va->va_end - va->va_start));
>         free_unmap_vmap_area(va);
> @@ -2162,6 +2166,7 @@ struct vm_struct *remove_vm_area(const void *addr)
>                 va->vm = NULL;
>                 va->flags &= ~VM_VM_AREA;
>                 va->flags |= VM_LAZY_FREE;
> +               unlink_va(va, &vmap_area_root);
>                 spin_unlock(&vmap_area_lock);
>  
>                 kasan_free_shadow(vm); 
> <snip>
> 
> All those things could be discussed on LKML. If there are higher-priority
> topics to discuss, I do not want to waste the time, and we can drop my
> proposed topic from the Kernel Summit.
> 
I missed at least one thing I wanted to mention: making the allocation
per-CPU. I mean splitting the KVA space into NR_CPUS chunks and making it
lockless (or almost lockless). Probably it is possible to do that only for a
specific zone, for example the vmalloc space only. The lazy-drain logic will
also be per-CPU.
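
Very roughly, the idea looks like this (illustrative names only; the hard
parts, cross-CPU fallback when a local zone runs out and the interaction with
the lazy-drain path, are not shown):

<snip>
#include <linux/cpumask.h>
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

struct kva_cpu_zone {
	spinlock_t lock;		/* protects only this CPU's zone */
	struct rb_root free_root;	/* free VA within [start, end) */
	unsigned long start;
	unsigned long end;
	struct llist_head lazy_list;	/* per-CPU lazily freed areas */
};

static DEFINE_PER_CPU(struct kva_cpu_zone, kva_zones);

/* Split [vstart, vend) of the vmalloc space into one zone per CPU. */
static void kva_zones_init(unsigned long vstart, unsigned long vend)
{
	unsigned long chunk = (vend - vstart) / num_possible_cpus();
	int cpu;

	for_each_possible_cpu(cpu) {
		struct kva_cpu_zone *z = per_cpu_ptr(&kva_zones, cpu);

		spin_lock_init(&z->lock);
		z->free_root = RB_ROOT;
		z->start = vstart + cpu * chunk;
		z->end = z->start + chunk;
		init_llist_head(&z->lazy_list);
	}
}
<snip>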

But that is theory, and it probably looks like over-optimization :)

--
Vlad Rezki


end of thread, other threads:[~2019-06-02 18:31 UTC | newest]

Thread overview: 4+ messages
-- links below jump to the message on this page --
2019-05-30  6:05 [Ksummit-discuss] [TECH TOPIC] Reworking of KVA allocator in Linux kernel Theodore Ts'o
2019-05-31  2:51 ` Matthew Wilcox
2019-06-02 12:05   ` Uladzislau Rezki
2019-06-02 18:31     ` Uladzislau Rezki
