linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/4] Some cleanups for the KVA/vmalloc
@ 2019-05-27  9:38 Uladzislau Rezki (Sony)
  2019-05-27  9:38 ` [PATCH v3 1/4] mm/vmap: remove "node" argument Uladzislau Rezki (Sony)
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-05-27  9:38 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Roman Gushchin, Uladzislau Rezki, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

Patch [1] removes the unused "node" argument from the __alloc_vmap_area()
function, nothing more.

Patch [2] is not driven by any particular failing workload; it is simply
a better approach to handling one specific split case.

Patch [3] cleans up the merging path. On the first merge step the
mergeable node is already detached, so there is no reason to "unlink" it.
The same applies to the second step, unless the node has been merged on
the first one.

Patch [4] moves the BUG_ON()/RB_EMPTY_NODE() check under the "unlink"
logic. After [3] the merging path "unlinks" only linked nodes, therefore
removing a detached object is a bug in all cases.
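
To illustrate, after [3] and [4] the second merge step looks roughly
like below (a condensed sketch only, the real change is in patch [3]):

<snip>
	if (sibling->va_end == va->va_start) {
		sibling->va_end = va->va_end;

		/* Check and update the tree if needed. */
		augment_tree_propagate_from(sibling);

		/*
		 * Unlink "va" only if it has been merged on the first
		 * step; only then does it point to an already linked
		 * object. A detached node is neither in the tree nor
		 * in the list, so unlinking it would be a bug.
		 */
		if (merged)
			unlink_va(va, root);

		/* Free vmap_area object. */
		kmem_cache_free(vmap_area_cachep, va);
		return;
	}
<snip>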

v2->v3:
    - remove the odd comment from patch [3];

v1->v2:
    - update the commit message of patch [2];
    - fix typos in comments of patch [2];
    - make the "preload" NUMA aware in patch [2];

Uladzislau Rezki (Sony) (4):
  mm/vmap: remove "node" argument
  mm/vmap: preload a CPU with one object for split purpose
  mm/vmap: get rid of one single unlink_va() when merge
  mm/vmap: move BUG_ON() check to the unlink_va()

 mm/vmalloc.c | 115 +++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 89 insertions(+), 26 deletions(-)

-- 
2.11.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 1/4] mm/vmap: remove "node" argument
  2019-05-27  9:38 [PATCH v3 0/4] Some cleanups for the KVA/vmalloc Uladzislau Rezki (Sony)
@ 2019-05-27  9:38 ` Uladzislau Rezki (Sony)
  2019-05-28 22:33   ` Roman Gushchin
  2019-05-27  9:38 ` [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose Uladzislau Rezki (Sony)
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-05-27  9:38 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Roman Gushchin, Uladzislau Rezki, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

Remove unused argument from the __alloc_vmap_area() function.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c42872ed82ac..ea1b65fac599 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -985,7 +985,7 @@ adjust_va_to_fit_type(struct vmap_area *va,
  */
 static __always_inline unsigned long
 __alloc_vmap_area(unsigned long size, unsigned long align,
-	unsigned long vstart, unsigned long vend, int node)
+	unsigned long vstart, unsigned long vend)
 {
 	unsigned long nva_start_addr;
 	struct vmap_area *va;
@@ -1062,7 +1062,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	 * If an allocation fails, the "vend" address is
 	 * returned. Therefore trigger the overflow path.
 	 */
-	addr = __alloc_vmap_area(size, align, vstart, vend, node);
+	addr = __alloc_vmap_area(size, align, vstart, vend);
 	if (unlikely(addr == vend))
 		goto overflow;
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose
  2019-05-27  9:38 [PATCH v3 0/4] Some cleanups for the KVA/vmalloc Uladzislau Rezki (Sony)
  2019-05-27  9:38 ` [PATCH v3 1/4] mm/vmap: remove "node" argument Uladzislau Rezki (Sony)
@ 2019-05-27  9:38 ` Uladzislau Rezki (Sony)
  2019-05-28 22:42   ` Roman Gushchin
  2019-05-27  9:38 ` [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge Uladzislau Rezki (Sony)
  2019-05-27  9:38 ` [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va() Uladzislau Rezki (Sony)
  3 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-05-27  9:38 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Roman Gushchin, Uladzislau Rezki, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

Refactor the NE_FIT_TYPE split case, which requires an allocation
of one extra object in order to build the remaining space.

Introduce the ne_fit_preload()/ne_fit_preload_end() functions for
preloading one extra vmap_area object, to ensure that it is
available when the fit type is NE_FIT_TYPE.

The preload is done per CPU in non-atomic context, thus with the
GFP_KERNEL allocation mask. More permissive parameters can be
beneficial for systems which suffer from high memory pressure or
low memory conditions.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 76 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ea1b65fac599..b553047aa05b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -364,6 +364,13 @@ static LIST_HEAD(free_vmap_area_list);
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
 
+/*
+ * Preload a CPU with one object for "no edge" split case. The
+ * aim is to get rid of allocations from the atomic context, thus
+ * to use more permissive allocation masks.
+ */
+static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
+
 static __always_inline unsigned long
 va_size(struct vmap_area *va)
 {
@@ -950,9 +957,24 @@ adjust_va_to_fit_type(struct vmap_area *va,
 		 *   L V  NVA  V R
 		 * |---|-------|---|
 		 */
-		lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
-		if (unlikely(!lva))
-			return -1;
+		lva = __this_cpu_xchg(ne_fit_preload_node, NULL);
+		if (unlikely(!lva)) {
+			/*
+			 * For percpu allocator we do not do any pre-allocation
+			 * and leave it as it is. The reason is it most likely
+			 * never ends up with NE_FIT_TYPE splitting. In case of
+			 * percpu allocations offsets and sizes are aligned to
+			 * fixed align request, i.e. RE_FIT_TYPE and FL_FIT_TYPE
+			 * are its main fitting cases.
+			 *
+			 * There are a few exceptions though, as an example it is
+			 * a first allocation (early boot up) when we have "one"
+			 * big free space that has to be split.
+			 */
+			lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
+			if (!lva)
+				return -1;
+		}
 
 		/*
 		 * Build the remainder.
@@ -1023,6 +1045,48 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 }
 
 /*
+ * Preload this CPU with one extra vmap_area object to ensure
+ * that we have it available when fit type of free area is
+ * NE_FIT_TYPE.
+ *
+ * The preload is done in non-atomic context, thus it allows us
+ * to use more permissive allocation masks to be more stable under
+ * low memory condition and high memory pressure.
+ *
+ * If success it returns 1 with preemption disabled. In case
+ * of error 0 is returned with preemption not disabled. Note it
+ * has to be paired with ne_fit_preload_end().
+ */
+static int
+ne_fit_preload(int nid)
+{
+	preempt_disable();
+
+	if (!__this_cpu_read(ne_fit_preload_node)) {
+		struct vmap_area *node;
+
+		preempt_enable();
+		node = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, nid);
+		if (node == NULL)
+			return 0;
+
+		preempt_disable();
+
+		if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, node))
+			kmem_cache_free(vmap_area_cachep, node);
+	}
+
+	return 1;
+}
+
+static void
+ne_fit_preload_end(int preloaded)
+{
+	if (preloaded)
+		preempt_enable();
+}
+
+/*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend.
  */
@@ -1034,6 +1098,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	struct vmap_area *va;
 	unsigned long addr;
 	int purged = 0;
+	int preloaded;
 
 	BUG_ON(!size);
 	BUG_ON(offset_in_page(size));
@@ -1056,6 +1121,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask & GFP_RECLAIM_MASK);
 
 retry:
+	/*
+	 * Even if it fails we do not really care about that.
+	 * Just proceed as it is. "overflow" path will refill
+	 * the cache we allocate from.
+	 */
+	preloaded = ne_fit_preload(node);
 	spin_lock(&vmap_area_lock);
 
 	/*
@@ -1063,6 +1134,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	 * returned. Therefore trigger the overflow path.
 	 */
 	addr = __alloc_vmap_area(size, align, vstart, vend);
+	ne_fit_preload_end(preloaded);
+
 	if (unlikely(addr == vend))
 		goto overflow;
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge
  2019-05-27  9:38 [PATCH v3 0/4] Some cleanups for the KVA/vmalloc Uladzislau Rezki (Sony)
  2019-05-27  9:38 ` [PATCH v3 1/4] mm/vmap: remove "node" argument Uladzislau Rezki (Sony)
  2019-05-27  9:38 ` [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose Uladzislau Rezki (Sony)
@ 2019-05-27  9:38 ` Uladzislau Rezki (Sony)
  2019-05-28 22:45   ` Roman Gushchin
  2019-05-27  9:38 ` [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va() Uladzislau Rezki (Sony)
  3 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-05-27  9:38 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Roman Gushchin, Uladzislau Rezki, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

It does not make sense to try to "unlink" a node that is definitely
not linked to either the list or the tree. On the first merge step
the VA just points to the previously disconnected busy area.

On the second step, check whether the node has been merged and do
the "unlink" only if so, because then it points to an object that
must be linked.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Hillf Danton <hdanton@sina.com>
---
 mm/vmalloc.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b553047aa05b..371aba9a4bf1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -718,9 +718,6 @@ merge_or_add_vmap_area(struct vmap_area *va,
 			/* Check and update the tree if needed. */
 			augment_tree_propagate_from(sibling);
 
-			/* Remove this VA, it has been merged. */
-			unlink_va(va, root);
-
 			/* Free vmap_area object. */
 			kmem_cache_free(vmap_area_cachep, va);
 
@@ -745,12 +742,11 @@ merge_or_add_vmap_area(struct vmap_area *va,
 			/* Check and update the tree if needed. */
 			augment_tree_propagate_from(sibling);
 
-			/* Remove this VA, it has been merged. */
-			unlink_va(va, root);
+			if (merged)
+				unlink_va(va, root);
 
 			/* Free vmap_area object. */
 			kmem_cache_free(vmap_area_cachep, va);
-
 			return;
 		}
 	}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-05-27  9:38 [PATCH v3 0/4] Some cleanups for the KVA/vmalloc Uladzislau Rezki (Sony)
                   ` (2 preceding siblings ...)
  2019-05-27  9:38 ` [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge Uladzislau Rezki (Sony)
@ 2019-05-27  9:38 ` Uladzislau Rezki (Sony)
  2019-05-27 12:59   ` Steven Rostedt
  2019-05-28 22:50   ` Roman Gushchin
  3 siblings, 2 replies; 20+ messages in thread
From: Uladzislau Rezki (Sony) @ 2019-05-27  9:38 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Roman Gushchin, Uladzislau Rezki, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

Move the BUG_ON()/RB_EMPTY_NODE() check into the unlink_va()
function: if an empty node gets freed it is a BUG, thus it is
considered faulty behaviour.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 371aba9a4bf1..340959b81228 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -533,20 +533,16 @@ link_va(struct vmap_area *va, struct rb_root *root,
 static __always_inline void
 unlink_va(struct vmap_area *va, struct rb_root *root)
 {
-	/*
-	 * During merging a VA node can be empty, therefore
-	 * not linked with the tree nor list. Just check it.
-	 */
-	if (!RB_EMPTY_NODE(&va->rb_node)) {
-		if (root == &free_vmap_area_root)
-			rb_erase_augmented(&va->rb_node,
-				root, &free_vmap_area_rb_augment_cb);
-		else
-			rb_erase(&va->rb_node, root);
+	BUG_ON(RB_EMPTY_NODE(&va->rb_node));
 
-		list_del(&va->list);
-		RB_CLEAR_NODE(&va->rb_node);
-	}
+	if (root == &free_vmap_area_root)
+		rb_erase_augmented(&va->rb_node,
+			root, &free_vmap_area_rb_augment_cb);
+	else
+		rb_erase(&va->rb_node, root);
+
+	list_del(&va->list);
+	RB_CLEAR_NODE(&va->rb_node);
 }
 
 #if DEBUG_AUGMENT_PROPAGATE_CHECK
@@ -1187,8 +1183,6 @@ EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier);
 
 static void __free_vmap_area(struct vmap_area *va)
 {
-	BUG_ON(RB_EMPTY_NODE(&va->rb_node));
-
 	/*
 	 * Remove from the busy tree/list.
 	 */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-05-27  9:38 ` [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va() Uladzislau Rezki (Sony)
@ 2019-05-27 12:59   ` Steven Rostedt
  2019-05-27 14:02     ` Uladzislau Rezki
  2019-05-28 22:50   ` Roman Gushchin
  1 sibling, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2019-05-27 12:59 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: Andrew Morton, linux-mm, Roman Gushchin, Hillf Danton,
	Michal Hocko, Matthew Wilcox, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Joel Fernandes, Thomas Gleixner,
	Ingo Molnar, Tejun Heo

On Mon, 27 May 2019 11:38:42 +0200
"Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:

> Move the BUG_ON()/RB_EMPTY_NODE() check under unlink_va()
> function, it means if an empty node gets freed it is a BUG
> thus is considered as faulty behaviour.

Can we switch it to a WARN_ON()? We are trying to remove all BUG_ON()s.
If a user wants to crash on warning, there's a sysctl for that. But
crashing the system can make it hard to debug. Especially if it is hit
by someone without a serial console, and the machine just hangs in X.
That is very annoying.

With a WARN_ON, you at least get a chance to see the crash dump.

-- Steve


> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 371aba9a4bf1..340959b81228 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -533,20 +533,16 @@ link_va(struct vmap_area *va, struct rb_root *root,
>  static __always_inline void
>  unlink_va(struct vmap_area *va, struct rb_root *root)
>  {
> -	/*
> -	 * During merging a VA node can be empty, therefore
> -	 * not linked with the tree nor list. Just check it.
> -	 */
> -	if (!RB_EMPTY_NODE(&va->rb_node)) {
> -		if (root == &free_vmap_area_root)
> -			rb_erase_augmented(&va->rb_node,
> -				root, &free_vmap_area_rb_augment_cb);
> -		else
> -			rb_erase(&va->rb_node, root);
> +	BUG_ON(RB_EMPTY_NODE(&va->rb_node));
>  
> -		list_del(&va->list);
> -		RB_CLEAR_NODE(&va->rb_node);
> -	}
> +	if (root == &free_vmap_area_root)
> +		rb_erase_augmented(&va->rb_node,
> +			root, &free_vmap_area_rb_augment_cb);
> +	else
> +		rb_erase(&va->rb_node, root);
> +
> +	list_del(&va->list);
> +	RB_CLEAR_NODE(&va->rb_node);
>  }
>  
>  #if DEBUG_AUGMENT_PROPAGATE_CHECK
> @@ -1187,8 +1183,6 @@ EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier);
>  
>  static void __free_vmap_area(struct vmap_area *va)
>  {
> -	BUG_ON(RB_EMPTY_NODE(&va->rb_node));
> -
>  	/*
>  	 * Remove from the busy tree/list.
>  	 */


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-05-27 12:59   ` Steven Rostedt
@ 2019-05-27 14:02     ` Uladzislau Rezki
  0 siblings, 0 replies; 20+ messages in thread
From: Uladzislau Rezki @ 2019-05-27 14:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Uladzislau Rezki (Sony),
	Andrew Morton, linux-mm, Roman Gushchin, Hillf Danton,
	Michal Hocko, Matthew Wilcox, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Joel Fernandes, Thomas Gleixner,
	Ingo Molnar, Tejun Heo

> > Move the BUG_ON()/RB_EMPTY_NODE() check under unlink_va()
> > function, it means if an empty node gets freed it is a BUG
> > thus is considered as faulty behaviour.
> 
> Can we switch it to a WARN_ON(). We are trying to remove all BUG_ON()s.
> If a user wants to crash on warning, there's a sysctl for that. But
> crashing the system can make it hard to debug. Especially if it is hit
> by someone without a serial console, and the machine just hangs in X.
> That is very annoying.
> 
> With a WARN_ON, you at least get a chance to see the crash dump.
Yes, we can. Even though it is considered faulty behavior, that is not
a good reason to trigger a BUG. I will fix that.
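
Probably something like below, i.e. warn and bail out instead of
crashing the machine (just a quick sketch for now):

<snip>
static __always_inline void
unlink_va(struct vmap_area *va, struct rb_root *root)
{
	/* Removing a detached node is faulty, warn and bail out. */
	if (WARN_ON(RB_EMPTY_NODE(&va->rb_node)))
		return;
	...
}
<snip>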

Thank you!

--
Vlad Rezki 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] mm/vmap: remove "node" argument
  2019-05-27  9:38 ` [PATCH v3 1/4] mm/vmap: remove "node" argument Uladzislau Rezki (Sony)
@ 2019-05-28 22:33   ` Roman Gushchin
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-05-28 22:33 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Mon, May 27, 2019 at 11:38:39AM +0200, Uladzislau Rezki (Sony) wrote:
> Remove unused argument from the __alloc_vmap_area() function.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c42872ed82ac..ea1b65fac599 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -985,7 +985,7 @@ adjust_va_to_fit_type(struct vmap_area *va,
>   */
>  static __always_inline unsigned long
>  __alloc_vmap_area(unsigned long size, unsigned long align,
> -	unsigned long vstart, unsigned long vend, int node)
> +	unsigned long vstart, unsigned long vend)
>  {
>  	unsigned long nva_start_addr;
>  	struct vmap_area *va;
> @@ -1062,7 +1062,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	 * If an allocation fails, the "vend" address is
>  	 * returned. Therefore trigger the overflow path.
>  	 */
> -	addr = __alloc_vmap_area(size, align, vstart, vend, node);
> +	addr = __alloc_vmap_area(size, align, vstart, vend);
>  	if (unlikely(addr == vend))
>  		goto overflow;
>  
> -- 
> 2.11.0
> 

Reviewed-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose
  2019-05-27  9:38 ` [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose Uladzislau Rezki (Sony)
@ 2019-05-28 22:42   ` Roman Gushchin
  2019-05-29 14:27     ` Uladzislau Rezki
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-28 22:42 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Mon, May 27, 2019 at 11:38:40AM +0200, Uladzislau Rezki (Sony) wrote:
> Refactor the NE_FIT_TYPE split case when it comes to an
> allocation of one extra object. We need it in order to
> build a remaining space.
> 
> Introduce ne_fit_preload()/ne_fit_preload_end() functions
> for preloading one extra vmap_area object to ensure that
> we have it available when fit type is NE_FIT_TYPE.
> 
> The preload is done per CPU in non-atomic context thus with
> GFP_KERNEL allocation masks. More permissive parameters can
> be beneficial for systems which are suffer from high memory
> pressure or low memory condition.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 76 insertions(+), 3 deletions(-)

Hi Uladzislau!

This patch generally looks good to me (see some nits below),
but it would be really great to add some motivation, e.g. numbers.

Thanks!

> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ea1b65fac599..b553047aa05b 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -364,6 +364,13 @@ static LIST_HEAD(free_vmap_area_list);
>   */
>  static struct rb_root free_vmap_area_root = RB_ROOT;
>  
> +/*
> + * Preload a CPU with one object for "no edge" split case. The
> + * aim is to get rid of allocations from the atomic context, thus
> + * to use more permissive allocation masks.
> + */
> +static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> +
>  static __always_inline unsigned long
>  va_size(struct vmap_area *va)
>  {
> @@ -950,9 +957,24 @@ adjust_va_to_fit_type(struct vmap_area *va,
>  		 *   L V  NVA  V R
>  		 * |---|-------|---|
>  		 */
> -		lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> -		if (unlikely(!lva))
> -			return -1;
> +		lva = __this_cpu_xchg(ne_fit_preload_node, NULL);
> +		if (unlikely(!lva)) {
> +			/*
> +			 * For percpu allocator we do not do any pre-allocation
> +			 * and leave it as it is. The reason is it most likely
> +			 * never ends up with NE_FIT_TYPE splitting. In case of
> +			 * percpu allocations offsets and sizes are aligned to
> +			 * fixed align request, i.e. RE_FIT_TYPE and FL_FIT_TYPE
> +			 * are its main fitting cases.
> +			 *
> +			 * There are a few exceptions though, as an example it is
> +			 * a first allocation (early boot up) when we have "one"
> +			 * big free space that has to be split.
> +			 */
> +			lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> +			if (!lva)
> +				return -1;
> +		}
>  
>  		/*
>  		 * Build the remainder.
> @@ -1023,6 +1045,48 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
>  }
>  
>  /*
> + * Preload this CPU with one extra vmap_area object to ensure
> + * that we have it available when fit type of free area is
> + * NE_FIT_TYPE.
> + *
> + * The preload is done in non-atomic context, thus it allows us
> + * to use more permissive allocation masks to be more stable under
> + * low memory condition and high memory pressure.
> + *
> + * If success it returns 1 with preemption disabled. In case
> + * of error 0 is returned with preemption not disabled. Note it
> + * has to be paired with ne_fit_preload_end().
> + */
> +static int

Cosmetic nit: you don't need a new line here.

> +ne_fit_preload(int nid)

> +{
> +	preempt_disable();
> +
> +	if (!__this_cpu_read(ne_fit_preload_node)) {
> +		struct vmap_area *node;
> +
> +		preempt_enable();
> +		node = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, nid);
> +		if (node == NULL)
> +			return 0;
> +
> +		preempt_disable();
> +
> +		if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, node))
> +			kmem_cache_free(vmap_area_cachep, node);
> +	}
> +
> +	return 1;
> +}
> +
> +static void

Here too.

> +ne_fit_preload_end(int preloaded)
> +{
> +	if (preloaded)
> +		preempt_enable();
> +}

I'd open code it. It's used only once, but hiding preempt_disable()
behind a helper makes it harder to understand and easier to mess up.

Then ne_fit_preload() might require disabled preemption (which it can
temporarily re-enable), so that the preempt_enable()/disable() logic
will be in one place.
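
I.e. keep all the preemption toggling visible in alloc_vmap_area(),
roughly like this (untested sketch, just to illustrate the idea):

<snip>
	preempt_disable();
	ne_fit_preload(node);	/* may drop/retake preemption internally */
	spin_lock(&vmap_area_lock);
	preempt_enable();

	addr = __alloc_vmap_area(size, align, vstart, vend);
<snip>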

> +
> +/*
>   * Allocate a region of KVA of the specified size and alignment, within the
>   * vstart and vend.
>   */
> @@ -1034,6 +1098,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	struct vmap_area *va;
>  	unsigned long addr;
>  	int purged = 0;
> +	int preloaded;
>  
>  	BUG_ON(!size);
>  	BUG_ON(offset_in_page(size));
> @@ -1056,6 +1121,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask & GFP_RECLAIM_MASK);
>  
>  retry:
> +	/*
> +	 * Even if it fails we do not really care about that.
> +	 * Just proceed as it is. "overflow" path will refill
> +	 * the cache we allocate from.
> +	 */
> +	preloaded = ne_fit_preload(node);
>  	spin_lock(&vmap_area_lock);
>  
>  	/*
> @@ -1063,6 +1134,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	 * returned. Therefore trigger the overflow path.
>  	 */
>  	addr = __alloc_vmap_area(size, align, vstart, vend);
> +	ne_fit_preload_end(preloaded);
> +
>  	if (unlikely(addr == vend))
>  		goto overflow;
>  
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge
  2019-05-27  9:38 ` [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge Uladzislau Rezki (Sony)
@ 2019-05-28 22:45   ` Roman Gushchin
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-05-28 22:45 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Mon, May 27, 2019 at 11:38:41AM +0200, Uladzislau Rezki (Sony) wrote:
> It does not make sense to try to "unlink" the node that is
> definitely not linked with a list nor tree. On the first
> merge step VA just points to the previously disconnected
> busy area.
> 
> On the second step, check if the node has been merged and do
> "unlink" if so, because now it points to an object that must
> be linked.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Acked-by: Hillf Danton <hdanton@sina.com>

Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-05-27  9:38 ` [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va() Uladzislau Rezki (Sony)
  2019-05-27 12:59   ` Steven Rostedt
@ 2019-05-28 22:50   ` Roman Gushchin
  2019-05-29 13:58     ` Uladzislau Rezki
  1 sibling, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-28 22:50 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Mon, May 27, 2019 at 11:38:42AM +0200, Uladzislau Rezki (Sony) wrote:
> Move the BUG_ON()/RB_EMPTY_NODE() check under unlink_va()
> function, it means if an empty node gets freed it is a BUG
> thus is considered as faulty behaviour.

It's not exactly clear from the description, why it's better.

Also, do we really need a BUG_ON() in either place?

Isn't something like this better?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c42872ed82ac..2df0e86d6aff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1118,7 +1118,8 @@ EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier);
 
 static void __free_vmap_area(struct vmap_area *va)
 {
-       BUG_ON(RB_EMPTY_NODE(&va->rb_node));
+       if (WARN_ON_ONCE(RB_EMPTY_NODE(&va->rb_node)))
+               return;
 
        /*
         * Remove from the busy tree/list.

Thanks!

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-05-28 22:50   ` Roman Gushchin
@ 2019-05-29 13:58     ` Uladzislau Rezki
  2019-05-29 16:26       ` Roman Gushchin
  0 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki @ 2019-05-29 13:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Uladzislau Rezki (Sony),
	Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

Hello, Roman!

> > Move the BUG_ON()/RB_EMPTY_NODE() check under unlink_va()
> > function, it means if an empty node gets freed it is a BUG
> > thus is considered as faulty behaviour.
> 
> It's not exactly clear from the description, why it's better.
> 
It is rather that calling "unlink" on a node that is not linked is
faulty behavior, something clearly written in stone. We used to call
"unlink" on a detached node during merging, but after:

[PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge

that is not supposed to happen anywhere in the logic.

>
> Also, do we really need a BUG_ON() in either place?
> 
Historically we used to have the BUG_ON() there. We can get rid of it
for sure, but in that case it would be harder to make head or tail of
it when the crash occurs, sooner or later.

> Isn't something like this better?
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c42872ed82ac..2df0e86d6aff 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1118,7 +1118,8 @@ EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier);
>  
>  static void __free_vmap_area(struct vmap_area *va)
>  {
> -       BUG_ON(RB_EMPTY_NODE(&va->rb_node));
> +       if (WARN_ON_ONCE(RB_EMPTY_NODE(&va->rb_node)))
> +               return;
>
I was thinking about WARN_ON_ONCE(). The concern was that the message
could get lost in the kernel ring buffer, therefore I used the BUG_ON().
I am not sure whether we have something like a rate-limited WARN_ON();
that would be the best option, I think. At least it would indicate
whether the warning happens periodically or not.
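
E.g. something in this direction, if using the WARN_RATELIMIT() helper
from <linux/ratelimit.h> is acceptable here (just a sketch, the message
text is made up):

<snip>
	if (WARN_RATELIMIT(RB_EMPTY_NODE(&va->rb_node),
			"vmap: unlink of an already detached VA\n"))
		return;
<snip>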

Any thoughts?

Thanks for the comments!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose
  2019-05-28 22:42   ` Roman Gushchin
@ 2019-05-29 14:27     ` Uladzislau Rezki
  2019-05-29 16:34       ` Roman Gushchin
  0 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki @ 2019-05-29 14:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Uladzislau Rezki (Sony),
	Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

Hello, Roman!

> On Mon, May 27, 2019 at 11:38:40AM +0200, Uladzislau Rezki (Sony) wrote:
> > Refactor the NE_FIT_TYPE split case when it comes to an
> > allocation of one extra object. We need it in order to
> > build a remaining space.
> > 
> > Introduce ne_fit_preload()/ne_fit_preload_end() functions
> > for preloading one extra vmap_area object to ensure that
> > we have it available when fit type is NE_FIT_TYPE.
> > 
> > The preload is done per CPU in non-atomic context thus with
> > GFP_KERNEL allocation masks. More permissive parameters can
> > be beneficial for systems which are suffer from high memory
> > pressure or low memory condition.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 76 insertions(+), 3 deletions(-)
> 
> Hi Uladzislau!
> 
> This patch generally looks good to me (see some nits below),
> but it would be really great to add some motivation, e.g. numbers.
> 
The main goal of this patch is to get rid of GFP_NOWAIT, since it is
more restrictive due to the allocation happening in atomic context.
IMHO, if we can avoid using it, that is the right way to go.

On the other hand, as I mentioned before, I have not seen any issues
with it on any of my test systems during the big rework. But it could
be beneficial for tiny systems which do not have any swap and are
limited in memory size.
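
To put it shortly, the change boils down to the below (condensed from
the diff):

<snip>
	/* Before: atomic allocation under vmap_area_lock. */
	lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);

	/*
	 * After: take the object preloaded with GFP_KERNEL outside of
	 * the lock; GFP_NOWAIT is only a fallback when the per-CPU
	 * cache is empty.
	 */
	lva = __this_cpu_xchg(ne_fit_preload_node, NULL);
	if (unlikely(!lva))
		lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
<snip>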

> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index ea1b65fac599..b553047aa05b 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -364,6 +364,13 @@ static LIST_HEAD(free_vmap_area_list);
> >   */
> >  static struct rb_root free_vmap_area_root = RB_ROOT;
> >  
> > +/*
> > + * Preload a CPU with one object for "no edge" split case. The
> > + * aim is to get rid of allocations from the atomic context, thus
> > + * to use more permissive allocation masks.
> > + */
> > +static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> > +
> >  static __always_inline unsigned long
> >  va_size(struct vmap_area *va)
> >  {
> > @@ -950,9 +957,24 @@ adjust_va_to_fit_type(struct vmap_area *va,
> >  		 *   L V  NVA  V R
> >  		 * |---|-------|---|
> >  		 */
> > -		lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> > -		if (unlikely(!lva))
> > -			return -1;
> > +		lva = __this_cpu_xchg(ne_fit_preload_node, NULL);
> > +		if (unlikely(!lva)) {
> > +			/*
> > +			 * For percpu allocator we do not do any pre-allocation
> > +			 * and leave it as it is. The reason is it most likely
> > +			 * never ends up with NE_FIT_TYPE splitting. In case of
> > +			 * percpu allocations offsets and sizes are aligned to
> > +			 * fixed align request, i.e. RE_FIT_TYPE and FL_FIT_TYPE
> > +			 * are its main fitting cases.
> > +			 *
> > +			 * There are a few exceptions though, as an example it is
> > +			 * a first allocation (early boot up) when we have "one"
> > +			 * big free space that has to be split.
> > +			 */
> > +			lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> > +			if (!lva)
> > +				return -1;
> > +		}
> >  
> >  		/*
> >  		 * Build the remainder.
> > @@ -1023,6 +1045,48 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
> >  }
> >  
> >  /*
> > + * Preload this CPU with one extra vmap_area object to ensure
> > + * that we have it available when fit type of free area is
> > + * NE_FIT_TYPE.
> > + *
> > + * The preload is done in non-atomic context, thus it allows us
> > + * to use more permissive allocation masks to be more stable under
> > + * low memory condition and high memory pressure.
> > + *
> > + * If success it returns 1 with preemption disabled. In case
> > + * of error 0 is returned with preemption not disabled. Note it
> > + * has to be paired with ne_fit_preload_end().
> > + */
> > +static int
> 
> Cosmetic nit: you don't need a new line here.
> 
> > +ne_fit_preload(int nid)
> 
I can fix that.

> > +{
> > +	preempt_disable();
> > +
> > +	if (!__this_cpu_read(ne_fit_preload_node)) {
> > +		struct vmap_area *node;
> > +
> > +		preempt_enable();
> > +		node = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, nid);
> > +		if (node == NULL)
> > +			return 0;
> > +
> > +		preempt_disable();
> > +
> > +		if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, node))
> > +			kmem_cache_free(vmap_area_cachep, node);
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> > +static void
> 
> Here too.
> 
> > +ne_fit_preload_end(int preloaded)
> > +{
> > +	if (preloaded)
> > +		preempt_enable();
> > +}
I can fix that.

> 
> I'd open code it. It's used only once, but hiding preempt_disable()
> behind a helper makes it harder to understand and easier to mess.
> 
> Then ne_fit_preload() might require disabled preemption (which it can
> temporarily re-enable), so that preempt_enable()/disable() logic
> will be in one place.
> 
I see your point. One of the aims was to keep the alloc_vmap_area()
function less cluttered. But we can refactor it as you say:

<snip>
 static void
@@ -1091,7 +1089,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
                                unsigned long vstart, unsigned long vend,
                                int node, gfp_t gfp_mask)
 {
-       struct vmap_area *va;
+       struct vmap_area *va, *pva;
        unsigned long addr;
        int purged = 0;
        int preloaded;
@@ -1122,16 +1120,26 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
         * Just proceed as it is. "overflow" path will refill
         * the cache we allocate from.
         */
-       ne_fit_preload(&preloaded);
+       preempt_disable();
+       if (!__this_cpu_read(ne_fit_preload_node)) {
+               preempt_enable();
+               pva = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, node);
+               preempt_disable();
+
+               if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva)) {
+                       if (pva)
+                               kmem_cache_free(vmap_area_cachep, pva);
+               }
+       }
+
        spin_lock(&vmap_area_lock);
+       preempt_enable();
 
        /*
         * If an allocation fails, the "vend" address is
         * returned. Therefore trigger the overflow path.
         */
        addr = __alloc_vmap_area(size, align, vstart, vend);
-       ne_fit_preload_end(preloaded);
-
        if (unlikely(addr == vend))
                goto overflow;
<snip>

Do you mean something like that? If so, I can go with it, unless there
are any objections from others.

Thank you for your comments, Roman!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-05-29 13:58     ` Uladzislau Rezki
@ 2019-05-29 16:26       ` Roman Gushchin
  2019-06-03 17:35         ` Uladzislau Rezki
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-29 16:26 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Wed, May 29, 2019 at 03:58:17PM +0200, Uladzislau Rezki wrote:
> Hello, Roman!
> 
> > > Move the BUG_ON()/RB_EMPTY_NODE() check under unlink_va()
> > > function, it means if an empty node gets freed it is a BUG
> > > thus is considered as faulty behaviour.
> > 
> > It's not exactly clear from the description, why it's better.
> > 
> It is rather about if "unlink" happens on unhandled node it is
> faulty behavior. Something that clearly written in stone. We used
> to call "unlink" on detached node during merge, but after:
> 
> [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge
> 
> it is not supposed to be ever happened across the logic.
> 
> >
> > Also, do we really need a BUG_ON() in either place?
> > 
> Historically we used to have the BUG_ON there. We can get rid of it
> for sure. But in this case, it would be harder to find a head or tail
> of it when the crash occurs, soon or later.
> 
> > Isn't something like this better?
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index c42872ed82ac..2df0e86d6aff 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1118,7 +1118,8 @@ EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier);
> >  
> >  static void __free_vmap_area(struct vmap_area *va)
> >  {
> > -       BUG_ON(RB_EMPTY_NODE(&va->rb_node));
> > +       if (WARN_ON_ONCE(RB_EMPTY_NODE(&va->rb_node)))
> > +               return;
> >
> I was thinking about WARN_ON_ONCE. The concern was about if the
> message gets lost due to kernel ring buffer. Therefore i used that.
> I am not sure if we have something like WARN_ONE_RATELIMIT that
> would be the best i think. At least it would indicate if a warning
> happens periodically or not.
> 
> Any thoughts?

Hello, Uladzislau!

I don't have a strong opinion here. If you're worried about losing the message,
WARN_ON() should be fine here. I don't think that this event will happen often,
if at all.

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose
  2019-05-29 14:27     ` Uladzislau Rezki
@ 2019-05-29 16:34       ` Roman Gushchin
  2019-06-03 17:53         ` Uladzislau Rezki
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-29 16:34 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Wed, May 29, 2019 at 04:27:15PM +0200, Uladzislau Rezki wrote:
> Hello, Roman!
> 
> > On Mon, May 27, 2019 at 11:38:40AM +0200, Uladzislau Rezki (Sony) wrote:
> > > Refactor the NE_FIT_TYPE split case when it comes to an
> > > allocation of one extra object. We need it in order to
> > > build a remaining space.
> > > 
> > > Introduce ne_fit_preload()/ne_fit_preload_end() functions
> > > for preloading one extra vmap_area object to ensure that
> > > we have it available when fit type is NE_FIT_TYPE.
> > > 
> > > The preload is done per CPU in non-atomic context thus with
> > > GFP_KERNEL allocation masks. More permissive parameters can
> > > be beneficial for systems which are suffer from high memory
> > > pressure or low memory condition.
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 76 insertions(+), 3 deletions(-)
> > 
> > Hi Uladzislau!
> > 
> > This patch generally looks good to me (see some nits below),
> > but it would be really great to add some motivation, e.g. numbers.
> > 
> The main goal of this patch to get rid of using GFP_NOWAIT since it is
> more restricted due to allocation from atomic context. IMHO, if we can
> avoid of using it that is a right way to go.
> 
> From the other hand, as i mentioned before i have not seen any issues
> with that on all my test systems during big rework. But it could be
> beneficial for tiny systems where we do not have any swap and are
> limited in memory size.

Ok, that makes sense to me. Is it possible to emulate such a tiny system
on kvm and measure the benefits? Again, not a strong opinion here,
but it will be easier to justify adding a good chunk of code.

> 
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index ea1b65fac599..b553047aa05b 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -364,6 +364,13 @@ static LIST_HEAD(free_vmap_area_list);
> > >   */
> > >  static struct rb_root free_vmap_area_root = RB_ROOT;
> > >  
> > > +/*
> > > + * Preload a CPU with one object for "no edge" split case. The
> > > + * aim is to get rid of allocations from the atomic context, thus
> > > + * to use more permissive allocation masks.
> > > + */
> > > +static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> > > +
> > >  static __always_inline unsigned long
> > >  va_size(struct vmap_area *va)
> > >  {
> > > @@ -950,9 +957,24 @@ adjust_va_to_fit_type(struct vmap_area *va,
> > >  		 *   L V  NVA  V R
> > >  		 * |---|-------|---|
> > >  		 */
> > > -		lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> > > -		if (unlikely(!lva))
> > > -			return -1;
> > > +		lva = __this_cpu_xchg(ne_fit_preload_node, NULL);
> > > +		if (unlikely(!lva)) {
> > > +			/*
> > > +			 * For percpu allocator we do not do any pre-allocation
> > > +			 * and leave it as it is. The reason is it most likely
> > > +			 * never ends up with NE_FIT_TYPE splitting. In case of
> > > +			 * percpu allocations offsets and sizes are aligned to
> > > +			 * fixed align request, i.e. RE_FIT_TYPE and FL_FIT_TYPE
> > > +			 * are its main fitting cases.
> > > +			 *
> > > +			 * There are a few exceptions though, as an example it is
> > > +			 * a first allocation (early boot up) when we have "one"
> > > +			 * big free space that has to be split.
> > > +			 */
> > > +			lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> > > +			if (!lva)
> > > +				return -1;
> > > +		}
> > >  
> > >  		/*
> > >  		 * Build the remainder.
> > > @@ -1023,6 +1045,48 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
> > >  }
> > >  
> > >  /*
> > > + * Preload this CPU with one extra vmap_area object to ensure
> > > + * that we have it available when fit type of free area is
> > > + * NE_FIT_TYPE.
> > > + *
> > > + * The preload is done in non-atomic context, thus it allows us
> > > + * to use more permissive allocation masks to be more stable under
> > > + * low memory condition and high memory pressure.
> > > + *
> > > + * If success it returns 1 with preemption disabled. In case
> > > + * of error 0 is returned with preemption not disabled. Note it
> > > + * has to be paired with ne_fit_preload_end().
> > > + */
> > > +static int
> > 
> > Cosmetic nit: you don't need a new line here.
> > 
> > > +ne_fit_preload(int nid)
> > 
> I can fix that.
> 
> > > +{
> > > +	preempt_disable();
> > > +
> > > +	if (!__this_cpu_read(ne_fit_preload_node)) {
> > > +		struct vmap_area *node;
> > > +
> > > +		preempt_enable();
> > > +		node = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, nid);
> > > +		if (node == NULL)
> > > +			return 0;
> > > +
> > > +		preempt_disable();
> > > +
> > > +		if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, node))
> > > +			kmem_cache_free(vmap_area_cachep, node);
> > > +	}
> > > +
> > > +	return 1;
> > > +}
> > > +
> > > +static void
> > 
> > Here too.
> > 
> > > +ne_fit_preload_end(int preloaded)
> > > +{
> > > +	if (preloaded)
> > > +		preempt_enable();
> > > +}
> I can fix that.
> 
> > 
> > I'd open code it. It's used only once, but hiding preempt_disable()
> > behind a helper makes it harder to understand and easier to mess.
> > 
> > Then ne_fit_preload() might require disabled preemption (which it can
> > temporarily re-enable), so that preempt_enable()/disable() logic
> > will be in one place.
> > 
> I see your point. One of the aim was to make less clogged the
> alloc_vmap_area() function. But we can refactor it like you say:
> 
> <snip>
>  static void
> @@ -1091,7 +1089,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>                                 unsigned long vstart, unsigned long vend,
>                                 int node, gfp_t gfp_mask)
>  {
> -       struct vmap_area *va;
> +       struct vmap_area *va, *pva;
>         unsigned long addr;
>         int purged = 0;
>         int preloaded;
> @@ -1122,16 +1120,26 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>          * Just proceed as it is. "overflow" path will refill
>          * the cache we allocate from.
>          */
> -       ne_fit_preload(&preloaded);
> +       preempt_disable();
> +       if (!__this_cpu_read(ne_fit_preload_node)) {
> +               preempt_enable();
> +               pva = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, node);
> +               preempt_disable();
> +
> +               if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva)) {
> +                       if (pva)
> +                               kmem_cache_free(vmap_area_cachep, pva);
> +               }
> +       }
> +
>         spin_lock(&vmap_area_lock);
> +       preempt_enable();
>  
>         /*
>          * If an allocation fails, the "vend" address is
>          * returned. Therefore trigger the overflow path.
>          */
>         addr = __alloc_vmap_area(size, align, vstart, vend);
> -       ne_fit_preload_end(preloaded);
> -
>         if (unlikely(addr == vend))
>                 goto overflow;
> <snip>
> 
> Do you mean something like that? If so, i can go with that, unless there are no
> any objections from others.

Yes, it looks much better to me!

Thank you!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-05-29 16:26       ` Roman Gushchin
@ 2019-06-03 17:35         ` Uladzislau Rezki
  2019-06-03 20:30           ` Roman Gushchin
  0 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki @ 2019-06-03 17:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Uladzislau Rezki, Andrew Morton, linux-mm, Hillf Danton,
	Michal Hocko, Matthew Wilcox, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Steven Rostedt, Joel Fernandes,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

Hello, Roman!

On Wed, May 29, 2019 at 04:26:43PM +0000, Roman Gushchin wrote:
> On Wed, May 29, 2019 at 03:58:17PM +0200, Uladzislau Rezki wrote:
> > Hello, Roman!
> > 
> > > > Move the BUG_ON()/RB_EMPTY_NODE() check under unlink_va()
> > > > function, it means if an empty node gets freed it is a BUG
> > > > thus is considered as faulty behaviour.
> > > 
> > > It's not exactly clear from the description, why it's better.
> > > 
> > It is rather about if "unlink" happens on unhandled node it is
> > faulty behavior. Something that clearly written in stone. We used
> > to call "unlink" on detached node during merge, but after:
> > 
> > [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge
> > 
> > it is not supposed to be ever happened across the logic.
> > 
> > >
> > > Also, do we really need a BUG_ON() in either place?
> > > 
> > Historically we used to have the BUG_ON there. We can get rid of it
> > for sure. But in this case, it would be harder to find a head or tail
> > of it when the crash occurs, soon or later.
> > 
> > > Isn't something like this better?
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index c42872ed82ac..2df0e86d6aff 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -1118,7 +1118,8 @@ EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier);
> > >  
> > >  static void __free_vmap_area(struct vmap_area *va)
> > >  {
> > > -       BUG_ON(RB_EMPTY_NODE(&va->rb_node));
> > > +       if (WARN_ON_ONCE(RB_EMPTY_NODE(&va->rb_node)))
> > > +               return;
> > >
> > I was thinking about WARN_ON_ONCE. The concern was about if the
> > message gets lost due to kernel ring buffer. Therefore i used that.
> > I am not sure if we have something like WARN_ONE_RATELIMIT that
> > would be the best i think. At least it would indicate if a warning
> > happens periodically or not.
> > 
> > Any thoughts?
> 
> Hello, Uladzislau!
> 
> I don't have a strong opinion here. If you're worried about losing the message,
> WARN_ON() should be fine here. I don't think that this event will happen often,
> if at all.
>


If it happens then we are in trouble :) I prefer to keep it as is for
now; later on we will see. Anyway, let's keep it, and I will update it
with:

<snip>
    if (WARN_ON(RB_EMPTY_NODE(&va->rb_node)))
        return;
<snip>

Thank you for the comments!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose
  2019-05-29 16:34       ` Roman Gushchin
@ 2019-06-03 17:53         ` Uladzislau Rezki
  2019-06-03 20:53           ` Uladzislau Rezki
  0 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki @ 2019-06-03 17:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Uladzislau Rezki, Andrew Morton, linux-mm, Hillf Danton,
	Michal Hocko, Matthew Wilcox, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Steven Rostedt, Joel Fernandes,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

Hello, Roman!

On Wed, May 29, 2019 at 04:34:40PM +0000, Roman Gushchin wrote:
> On Wed, May 29, 2019 at 04:27:15PM +0200, Uladzislau Rezki wrote:
> > Hello, Roman!
> > 
> > > On Mon, May 27, 2019 at 11:38:40AM +0200, Uladzislau Rezki (Sony) wrote:
> > > > Refactor the NE_FIT_TYPE split case when it comes to an
> > > > allocation of one extra object. We need it in order to
> > > > build a remaining space.
> > > > 
> > > > Introduce ne_fit_preload()/ne_fit_preload_end() functions
> > > > for preloading one extra vmap_area object to ensure that
> > > > we have it available when fit type is NE_FIT_TYPE.
> > > > 
> > > > The preload is done per CPU in non-atomic context thus with
> > > > GFP_KERNEL allocation masks. More permissive parameters can
> > > > be beneficial for systems which are suffer from high memory
> > > > pressure or low memory condition.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > ---
> > > >  mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > > >  1 file changed, 76 insertions(+), 3 deletions(-)
> > > 
> > > Hi Uladzislau!
> > > 
> > > This patch generally looks good to me (see some nits below),
> > > but it would be really great to add some motivation, e.g. numbers.
> > > 
> > The main goal of this patch to get rid of using GFP_NOWAIT since it is
> > more restricted due to allocation from atomic context. IMHO, if we can
> > avoid of using it that is a right way to go.
> > 
> > From the other hand, as i mentioned before i have not seen any issues
> > with that on all my test systems during big rework. But it could be
> > beneficial for tiny systems where we do not have any swap and are
> > limited in memory size.
> 
> Ok, that makes sense to me. Is it possible to emulate such a tiny system
> on kvm and measure the benefits? Again, not a strong opinion here,
> but it will be easier to justify adding a good chunk of code.
> 
It seems it is not as straightforward as it looks. I tried it before,
but usually the system panics due to out of memory or just invokes
the OOM killer.

I will upload a new version where I embed the "preloading" logic
directly into the alloc_vmap_area() function.

Thanks.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va()
  2019-06-03 17:35         ` Uladzislau Rezki
@ 2019-06-03 20:30           ` Roman Gushchin
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-06-03 20:30 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Mon, Jun 03, 2019 at 07:35:28PM +0200, Uladzislau Rezki wrote:
> Hello, Roman!
> 
> On Wed, May 29, 2019 at 04:26:43PM +0000, Roman Gushchin wrote:
> > On Wed, May 29, 2019 at 03:58:17PM +0200, Uladzislau Rezki wrote:
> > > Hello, Roman!
> > > 
> > > > > Move the BUG_ON()/RB_EMPTY_NODE() check under unlink_va()
> > > > > function, it means if an empty node gets freed it is a BUG
> > > > > thus is considered as faulty behaviour.
> > > > 
> > > > It's not exactly clear from the description, why it's better.
> > > > 
> > > It is rather about if "unlink" happens on unhandled node it is
> > > faulty behavior. Something that clearly written in stone. We used
> > > to call "unlink" on detached node during merge, but after:
> > > 
> > > [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge
> > > 
> > > it is not supposed to be ever happened across the logic.
> > > 
> > > >
> > > > Also, do we really need a BUG_ON() in either place?
> > > > 
> > > Historically we used to have the BUG_ON there. We can get rid of it
> > > for sure. But in this case, it would be harder to find a head or tail
> > > of it when the crash occurs, soon or later.
> > > 
> > > > Isn't something like this better?
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index c42872ed82ac..2df0e86d6aff 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -1118,7 +1118,8 @@ EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier);
> > > >  
> > > >  static void __free_vmap_area(struct vmap_area *va)
> > > >  {
> > > > -       BUG_ON(RB_EMPTY_NODE(&va->rb_node));
> > > > +       if (WARN_ON_ONCE(RB_EMPTY_NODE(&va->rb_node)))
> > > > +               return;
> > > >
> > > I was thinking about WARN_ON_ONCE. The concern was about if the
> > > message gets lost due to kernel ring buffer. Therefore i used that.
> > > I am not sure if we have something like WARN_ONE_RATELIMIT that
> > > would be the best i think. At least it would indicate if a warning
> > > happens periodically or not.
> > > 
> > > Any thoughts?
> > 
> > Hello, Uladzislau!
> > 
> > I don't have a strong opinion here. If you're worried about losing the message,
> > WARN_ON() should be fine here. I don't think that this event will happen often,
> > if at all.
> >
> 
> 
> If it happens then we are in trouble :) I prefer to keep it here as of now,
> later on will see. Anyway, let's keep it and i will update it with:
> 
> <snip>
>     if (WARN_ON(RB_EMPTY_NODE(&va->rb_node)))
>         return;
> <snip>

Works for me. Thank you!

> 
> Thank you for the comments!

You're welcome!

Roman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose
  2019-06-03 17:53         ` Uladzislau Rezki
@ 2019-06-03 20:53           ` Uladzislau Rezki
  2019-06-03 21:06             ` Roman Gushchin
  0 siblings, 1 reply; 20+ messages in thread
From: Uladzislau Rezki @ 2019-06-03 20:53 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Roman Gushchin, Andrew Morton, linux-mm, Hillf Danton,
	Michal Hocko, Matthew Wilcox, LKML, Thomas Garnier,
	Oleksiy Avramchenko, Steven Rostedt, Joel Fernandes,
	Thomas Gleixner, Ingo Molnar, Tejun Heo

On Mon, Jun 03, 2019 at 07:53:12PM +0200, Uladzislau Rezki wrote:
> Hello, Roman!
> 
> On Wed, May 29, 2019 at 04:34:40PM +0000, Roman Gushchin wrote:
> > On Wed, May 29, 2019 at 04:27:15PM +0200, Uladzislau Rezki wrote:
> > > Hello, Roman!
> > > 
> > > > On Mon, May 27, 2019 at 11:38:40AM +0200, Uladzislau Rezki (Sony) wrote:
> > > > > Refactor the NE_FIT_TYPE split case when it comes to an
> > > > > allocation of one extra object. We need it in order to
> > > > > build a remaining space.
> > > > > 
> > > > > Introduce ne_fit_preload()/ne_fit_preload_end() functions
> > > > > for preloading one extra vmap_area object to ensure that
> > > > > we have it available when fit type is NE_FIT_TYPE.
> > > > > 
> > > > > The preload is done per CPU in non-atomic context thus with
> > > > > GFP_KERNEL allocation masks. More permissive parameters can
> > > > > be beneficial for systems which suffer from high memory
> > > > > pressure or low memory condition.
> > > > > 
> > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > ---
> > > > >  mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > > > >  1 file changed, 76 insertions(+), 3 deletions(-)
> > > > 
> > > > Hi Uladzislau!
> > > > 
> > > > This patch generally looks good to me (see some nits below),
> > > > but it would be really great to add some motivation, e.g. numbers.
> > > > 
> > > The main goal of this patch is to get rid of using GFP_NOWAIT, since it is
> > > more restricted due to allocation from atomic context. IMHO, if we can
> > > avoid using it, that is the right way to go.
> > > 
> > > On the other hand, as I mentioned before, I have not seen any issues
> > > with that on all my test systems during the big rework. But it could be
> > > beneficial for tiny systems where we do not have any swap and are
> > > limited in memory size.
> > 
> > Ok, that makes sense to me. Is it possible to emulate such a tiny system
> > on kvm and measure the benefits? Again, not a strong opinion here,
> > but it will be easier to justify adding a good chunk of code.
> > 
> It seems it is not so straightforward as it looks like. I tried it before,
> but usually the system panics due to out of memory or just invokes
> the OOM killer.
> 
> I will upload a new version of it, where I embed the "preloading" logic directly
> into the alloc_vmap_area() function.
> 
I just managed to simulate the faulty behavior of the GFP_NOWAIT restriction,
resulting in a failure of the vmalloc allocation. Under heavy load and low
memory conditions, and without swap, I can trigger the warning below on my
KVM machine:

<snip>
[  366.910037] Out of memory: Killed process 470 (bash) total-vm:21012kB, anon-rss:1700kB, file-rss:264kB, shmem-rss:0kB
[  366.910692] oom_reaper: reaped process 470 (bash), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  367.913199] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[  367.913206] CPU: 3 PID: 19951 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #999
[  367.913207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  367.913208] Call Trace:
[  367.913215]  dump_stack+0x5c/0x7b
[  367.913219]  warn_alloc+0x108/0x190
[  367.913222]  __alloc_pages_slowpath+0xdc7/0xdf0
[  367.913226]  __alloc_pages_nodemask+0x2de/0x330
[  367.913230]  cache_grow_begin+0x77/0x420
[  367.913232]  fallback_alloc+0x161/0x200
[  367.913235]  kmem_cache_alloc+0x1c9/0x570
[  367.913237]  alloc_vmap_area+0x98b/0xa20
[  367.913240]  __get_vm_area_node+0xb0/0x170
[  367.913243]  __vmalloc_node_range+0x6d/0x230
[  367.913246]  ? _do_fork+0xce/0x3d0
[  367.913248]  copy_process.part.46+0x850/0x1b90
[  367.913250]  ? _do_fork+0xce/0x3d0
[  367.913254]  _do_fork+0xce/0x3d0
[  367.913257]  ? __do_page_fault+0x2bf/0x4e0
[  367.913260]  do_syscall_64+0x55/0x130
[  367.913263]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  367.913265] RIP: 0033:0x7f2a8248d38b
[  367.913268] Code: db 45 85 f6 0f 85 95 01 00 00 64 4c 8b 04 25 10 00 00 00 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 de 00 00 00 85 c0 41 89 c5 0f 85 e5 00 00
[  367.913269] RSP: 002b:00007fff1b058c30 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  367.913271] RAX: ffffffffffffffda RBX: 00007fff1b058c30 RCX: 00007f2a8248d38b
[  367.913272] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  367.913273] RBP: 00007fff1b058c80 R08: 00007f2a83d34300 R09: 00007fff1b1890a0
[  367.913274] R10: 00007f2a83d345d0 R11: 0000000000000246 R12: 0000000000000000
[  367.913275] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[  367.913278] Mem-Info:
[  367.913282] active_anon:45795 inactive_anon:80706 isolated_anon:0
                active_file:394 inactive_file:359 isolated_file:210
                unevictable:2 dirty:0 writeback:0 unstable:0
                slab_reclaimable:2691 slab_unreclaimable:21864
                mapped:80835 shmem:80740 pagetables:50422 bounce:0
                free:12185 free_pcp:776 free_cma:0
[  367.913286] Node 0 active_anon:183180kB inactive_anon:322824kB active_file:1576kB inactive_file:1436kB unevictable:8kB isolated(anon):0kB isolated(file):840kB mapped:323340kB dirty:0kB writeback:0kB shmem:322960kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  367.913287] Node 0 DMA free:4516kB min:724kB low:904kB high:1084kB active_anon:2384kB inactive_anon:0kB active_file:48kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:1256kB pagetables:4516kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  367.913292] lowmem_reserve[]: 0 948 948 948
[  367.913294] Node 0 DMA32 free:44224kB min:44328kB low:55408kB high:66488kB active_anon:180252kB inactive_anon:322824kB active_file:992kB inactive_file:1332kB unevictable:8kB writepending:252kB present:1032064kB managed:995428kB mlocked:8kB kernel_stack:43260kB pagetables:197172kB bounce:0kB free_pcp:3252kB local_pcp:480kB free_cma:0kB
[  367.913299] lowmem_reserve[]: 0 0 0 0
[  367.913301] Node 0 DMA: 46*4kB (UM) 45*8kB (UM) 12*16kB (UM) 9*32kB (UM) 2*64kB (M) 2*128kB (UM) 2*256kB (M) 3*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 4480kB
[  367.913310] Node 0 DMA32: 966*4kB (UE) 552*8kB (UME) 648*16kB (UME) 265*32kB (UME) 75*64kB (UME) 12*128kB (ME) 1*256kB (U) 1*512kB (E) 1*1024kB (U) 2*2048kB (UM) 1*4096kB (M) = 43448kB
[  367.913322] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  367.913323] 81750 total pagecache pages
[  367.913324] 0 pages in swap cache
[  367.913325] Swap cache stats: add 0, delete 0, find 0/0
[  367.913325] Free swap  = 0kB
[  367.913326] Total swap = 0kB
[  367.913327] 262014 pages RAM
[  367.913327] 0 pages HighMem/MovableOnly
[  367.913328] 9180 pages reserved
[  367.913329] 0 pages hwpoisoned
[  372.338733] systemd-journald[195]: /dev/kmsg buffer overrun, some messages lost.
<snip>

Whereas with the "preload" logic I see only OOM-killer-related messages:

<snip>
[  136.787266] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=systemd-journal,pid=196,uid=0
[  136.787276] Out of memory: Killed process 196 (systemd-journal) total-vm:56832kB, anon-rss:512kB, file-rss:336kB, shmem-rss:820kB
[  136.790481] oom_reaper: reaped process 196 (systemd-journal), now anon-rss:0kB, file-rss:0kB, shmem-rss:820kB
<snip>

i.e. vmalloc is still able to allocate.

Probably I need to update the commit message with this simulation and finding.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 20+ messages in thread
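
As a side note, a simplified sketch of the ne_fit_preload()/ne_fit_preload_end()
pair described in the quoted commit message, i.e. the v3 approach that the
author says he will fold directly into alloc_vmap_area() in the next version.
The names match the patch description; the bodies are a rough reconstruction,
not the exact v3 code:

<snip>
static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);

/*
 * Preload one vmap_area object for this CPU in a sleepable context
 * (GFP_KERNEL), so that the NE_FIT_TYPE split case does not have to
 * allocate with GFP_NOWAIT. Returns with preemption disabled; must be
 * paired with ne_fit_preload_end().
 */
static void ne_fit_preload(int nid)
{
	preempt_disable();

	if (!__this_cpu_read(ne_fit_preload_node)) {
		struct vmap_area *pva;

		preempt_enable();
		pva = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, nid);
		preempt_disable();

		/* Another task may have preloaded meanwhile; drop the extra. */
		if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
			kmem_cache_free(vmap_area_cachep, pva);
	}
}

static void ne_fit_preload_end(void)
{
	preempt_enable();
}
<snip>

Because the preload allocation runs with preemption enabled and GFP_KERNEL,
it may reclaim and sleep, which is exactly what the GFP_NOWAIT path shown in
the log above cannot do.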

* Re: [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose
  2019-06-03 20:53           ` Uladzislau Rezki
@ 2019-06-03 21:06             ` Roman Gushchin
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-06-03 21:06 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, linux-mm, Hillf Danton, Michal Hocko,
	Matthew Wilcox, LKML, Thomas Garnier, Oleksiy Avramchenko,
	Steven Rostedt, Joel Fernandes, Thomas Gleixner, Ingo Molnar,
	Tejun Heo

On Mon, Jun 03, 2019 at 10:53:34PM +0200, Uladzislau Rezki wrote:
> On Mon, Jun 03, 2019 at 07:53:12PM +0200, Uladzislau Rezki wrote:
> > Hello, Roman!
> > 
> > On Wed, May 29, 2019 at 04:34:40PM +0000, Roman Gushchin wrote:
> > > On Wed, May 29, 2019 at 04:27:15PM +0200, Uladzislau Rezki wrote:
> > > > Hello, Roman!
> > > > 
> > > > > On Mon, May 27, 2019 at 11:38:40AM +0200, Uladzislau Rezki (Sony) wrote:
> > > > > > Refactor the NE_FIT_TYPE split case when it comes to an
> > > > > > allocation of one extra object. We need it in order to
> > > > > > build a remaining space.
> > > > > > 
> > > > > > Introduce ne_fit_preload()/ne_fit_preload_end() functions
> > > > > > for preloading one extra vmap_area object to ensure that
> > > > > > we have it available when fit type is NE_FIT_TYPE.
> > > > > > 
> > > > > > The preload is done per CPU in non-atomic context thus with
> > > > > > GFP_KERNEL allocation masks. More permissive parameters can
> > > > > > be beneficial for systems which suffer from high memory
> > > > > > pressure or low memory condition.
> > > > > > 
> > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > ---
> > > > > >  mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > > > > >  1 file changed, 76 insertions(+), 3 deletions(-)
> > > > > 
> > > > > Hi Uladzislau!
> > > > > 
> > > > > This patch generally looks good to me (see some nits below),
> > > > > but it would be really great to add some motivation, e.g. numbers.
> > > > > 
> > > > The main goal of this patch is to get rid of using GFP_NOWAIT, since it is
> > > > more restricted due to allocation from atomic context. IMHO, if we can
> > > > avoid using it, that is the right way to go.
> > > > 
> > > > On the other hand, as I mentioned before, I have not seen any issues
> > > > with that on all my test systems during the big rework. But it could be
> > > > beneficial for tiny systems where we do not have any swap and are
> > > > limited in memory size.
> > > 
> > > Ok, that makes sense to me. Is it possible to emulate such a tiny system
> > > on kvm and measure the benefits? Again, not a strong opinion here,
> > > but it will be easier to justify adding a good chunk of code.
> > > 
> > It seems it is not so straightforward as it looks like. I tried it before,
> > but usually the system panics due to out of memory or just invokes
> > the OOM killer.
> > 
> > I will upload a new version of it, where I embed the "preloading" logic directly
> > into the alloc_vmap_area() function.
> > 
> I just managed to simulate the faulty behavior of the GFP_NOWAIT restriction,
> resulting in a failure of the vmalloc allocation. Under heavy load and low
> memory conditions, and without swap, I can trigger the warning below on my
> KVM machine:
> 
> <snip>
> [  366.910037] Out of memory: Killed process 470 (bash) total-vm:21012kB, anon-rss:1700kB, file-rss:264kB, shmem-rss:0kB
> [  366.910692] oom_reaper: reaped process 470 (bash), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  367.913199] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
> [  367.913206] CPU: 3 PID: 19951 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #999
> [  367.913207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [  367.913208] Call Trace:
> [  367.913215]  dump_stack+0x5c/0x7b
> [  367.913219]  warn_alloc+0x108/0x190
> [  367.913222]  __alloc_pages_slowpath+0xdc7/0xdf0
> [  367.913226]  __alloc_pages_nodemask+0x2de/0x330
> [  367.913230]  cache_grow_begin+0x77/0x420
> [  367.913232]  fallback_alloc+0x161/0x200
> [  367.913235]  kmem_cache_alloc+0x1c9/0x570
> [  367.913237]  alloc_vmap_area+0x98b/0xa20
> [  367.913240]  __get_vm_area_node+0xb0/0x170
> [  367.913243]  __vmalloc_node_range+0x6d/0x230
> [  367.913246]  ? _do_fork+0xce/0x3d0
> [  367.913248]  copy_process.part.46+0x850/0x1b90
> [  367.913250]  ? _do_fork+0xce/0x3d0
> [  367.913254]  _do_fork+0xce/0x3d0
> [  367.913257]  ? __do_page_fault+0x2bf/0x4e0
> [  367.913260]  do_syscall_64+0x55/0x130
> [  367.913263]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  367.913265] RIP: 0033:0x7f2a8248d38b
> [  367.913268] Code: db 45 85 f6 0f 85 95 01 00 00 64 4c 8b 04 25 10 00 00 00 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 de 00 00 00 85 c0 41 89 c5 0f 85 e5 00 00
> [  367.913269] RSP: 002b:00007fff1b058c30 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
> [  367.913271] RAX: ffffffffffffffda RBX: 00007fff1b058c30 RCX: 00007f2a8248d38b
> [  367.913272] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
> [  367.913273] RBP: 00007fff1b058c80 R08: 00007f2a83d34300 R09: 00007fff1b1890a0
> [  367.913274] R10: 00007f2a83d345d0 R11: 0000000000000246 R12: 0000000000000000
> [  367.913275] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
> [  367.913278] Mem-Info:
> [  367.913282] active_anon:45795 inactive_anon:80706 isolated_anon:0
>                 active_file:394 inactive_file:359 isolated_file:210
>                 unevictable:2 dirty:0 writeback:0 unstable:0
>                 slab_reclaimable:2691 slab_unreclaimable:21864
>                 mapped:80835 shmem:80740 pagetables:50422 bounce:0
>                 free:12185 free_pcp:776 free_cma:0
> [  367.913286] Node 0 active_anon:183180kB inactive_anon:322824kB active_file:1576kB inactive_file:1436kB unevictable:8kB isolated(anon):0kB isolated(file):840kB mapped:323340kB dirty:0kB writeback:0kB shmem:322960kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [  367.913287] Node 0 DMA free:4516kB min:724kB low:904kB high:1084kB active_anon:2384kB inactive_anon:0kB active_file:48kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:1256kB pagetables:4516kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [  367.913292] lowmem_reserve[]: 0 948 948 948
> [  367.913294] Node 0 DMA32 free:44224kB min:44328kB low:55408kB high:66488kB active_anon:180252kB inactive_anon:322824kB active_file:992kB inactive_file:1332kB unevictable:8kB writepending:252kB present:1032064kB managed:995428kB mlocked:8kB kernel_stack:43260kB pagetables:197172kB bounce:0kB free_pcp:3252kB local_pcp:480kB free_cma:0kB
> [  367.913299] lowmem_reserve[]: 0 0 0 0
> [  367.913301] Node 0 DMA: 46*4kB (UM) 45*8kB (UM) 12*16kB (UM) 9*32kB (UM) 2*64kB (M) 2*128kB (UM) 2*256kB (M) 3*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 4480kB
> [  367.913310] Node 0 DMA32: 966*4kB (UE) 552*8kB (UME) 648*16kB (UME) 265*32kB (UME) 75*64kB (UME) 12*128kB (ME) 1*256kB (U) 1*512kB (E) 1*1024kB (U) 2*2048kB (UM) 1*4096kB (M) = 43448kB
> [  367.913322] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [  367.913323] 81750 total pagecache pages
> [  367.913324] 0 pages in swap cache
> [  367.913325] Swap cache stats: add 0, delete 0, find 0/0
> [  367.913325] Free swap  = 0kB
> [  367.913326] Total swap = 0kB
> [  367.913327] 262014 pages RAM
> [  367.913327] 0 pages HighMem/MovableOnly
> [  367.913328] 9180 pages reserved
> [  367.913329] 0 pages hwpoisoned
> [  372.338733] systemd-journald[195]: /dev/kmsg buffer overrun, some messages lost.
> <snip>
> 
> Whereas with the "preload" logic I see only OOM-killer-related messages:
> 
> <snip>
> [  136.787266] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=systemd-journal,pid=196,uid=0
> [  136.787276] Out of memory: Killed process 196 (systemd-journal) total-vm:56832kB, anon-rss:512kB, file-rss:336kB, shmem-rss:820kB
> [  136.790481] oom_reaper: reaped process 196 (systemd-journal), now anon-rss:0kB, file-rss:0kB, shmem-rss:820kB
> <snip>
> 
> i.e. vmalloc is still able to allocate.
> 
> Probably I need to update the commit message with this simulation and finding.

Ah, perfect! Then it makes total sense to me.

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-06-03 22:46 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-27  9:38 [PATCH v3 0/4] Some cleanups for the KVA/vmalloc Uladzislau Rezki (Sony)
2019-05-27  9:38 ` [PATCH v3 1/4] mm/vmap: remove "node" argument Uladzislau Rezki (Sony)
2019-05-28 22:33   ` Roman Gushchin
2019-05-27  9:38 ` [PATCH v3 2/4] mm/vmap: preload a CPU with one object for split purpose Uladzislau Rezki (Sony)
2019-05-28 22:42   ` Roman Gushchin
2019-05-29 14:27     ` Uladzislau Rezki
2019-05-29 16:34       ` Roman Gushchin
2019-06-03 17:53         ` Uladzislau Rezki
2019-06-03 20:53           ` Uladzislau Rezki
2019-06-03 21:06             ` Roman Gushchin
2019-05-27  9:38 ` [PATCH v3 3/4] mm/vmap: get rid of one single unlink_va() when merge Uladzislau Rezki (Sony)
2019-05-28 22:45   ` Roman Gushchin
2019-05-27  9:38 ` [PATCH v3 4/4] mm/vmap: move BUG_ON() check to the unlink_va() Uladzislau Rezki (Sony)
2019-05-27 12:59   ` Steven Rostedt
2019-05-27 14:02     ` Uladzislau Rezki
2019-05-28 22:50   ` Roman Gushchin
2019-05-29 13:58     ` Uladzislau Rezki
2019-05-29 16:26       ` Roman Gushchin
2019-06-03 17:35         ` Uladzislau Rezki
2019-06-03 20:30           ` Roman Gushchin
