* [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
@ 2016-09-29  7:34 ` Jisheng Zhang
  0 siblings, 0 replies; 33+ messages in thread
From: Jisheng Zhang @ 2016-09-29  7:34 UTC (permalink / raw)
  To: akpm, mgorman, chris, rientjes, iamjoonsoo.kim, npiggin, agnel.joel
  Cc: linux-mm, linux-kernel, linux-arm-kernel, Jisheng Zhang

On Marvell berlin arm64 platforms, I see the preemptoff tracer report
a max 26543 us latency in __purge_vmap_area_lazy(); this latency is
awfully bad for STB (set-top box) use cases. The ftrace log also shows
that __free_vmap_area() contributes most of the latency. I noticed that
Joel reported the same issue[1] on an x86 platform and suggested two
solutions, but it seems no patch was sent out for this purpose.

This patch adopts Joel's first solution, but uses 16MB per core rather
than 8MB per core for lazy_max_pages(). After this patch, the
preemptoff tracer reports a max 6455 us latency, reduced to roughly
1/4 of the original result.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1603.2/04803.html

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
---
 mm/vmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 91f44e7..66f377a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -596,7 +596,7 @@ static unsigned long lazy_max_pages(void)
 
 	log = fls(num_online_cpus());
 
-	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
+	return log * (16UL * 1024 * 1024 / PAGE_SIZE);
 }
 
 static atomic_t vmap_lazy_nr = ATOMIC_INIT(0);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 33+ messages in thread
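
For context, the effect of this change on the purge threshold can be worked
out directly from lazy_max_pages(). Below is a back-of-the-envelope sketch;
the 4-CPU count and the 4 KiB page size are illustrative assumptions, not
numbers taken from this thread:

#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumed page size */

/* mirrors the shape of lazy_max_pages(): scales with fls(num_online_cpus()) */
static unsigned long lazy_max_pages_sketch(unsigned long step_bytes, int log)
{
	return log * (step_bytes / PAGE_SIZE);
}

int main(void)
{
	int log = 3;	/* fls(num_online_cpus()) == fls(4) == 3 on a 4-CPU system */

	/* before the patch: 3 * 8192 = 24576 pages (96 MiB) may be queued lazily */
	printf("old threshold: %lu pages\n", lazy_max_pages_sketch(32UL << 20, log));
	/* after the patch:  3 * 4096 = 12288 pages (48 MiB) */
	printf("new threshold: %lu pages\n", lazy_max_pages_sketch(16UL << 20, log));
	return 0;
}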

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-09-29  7:34 ` Jisheng Zhang
@ 2016-09-29  8:18   ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2016-09-29  8:18 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: akpm, mgorman, rientjes, iamjoonsoo.kim, npiggin, agnel.joel,
	linux-mm, linux-kernel, linux-arm-kernel

On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:
> On Marvell berlin arm64 platforms, I see the preemptoff tracer report
> a max 26543 us latency at __purge_vmap_area_lazy, this latency is an
> awfully bad for STB. And the ftrace log also shows __free_vmap_area
> contributes most latency now. I noticed that Joel mentioned the same
> issue[1] on x86 platform and gave two solutions, but it seems no patch
> is sent out for this purpose.
> 
> This patch adopts Joel's first solution, but I use 16MB per core
> rather than 8MB per core for the number of lazy_max_pages. After this
> patch, the preemptoff tracer reports a max 6455us latency, reduced to
> 1/4 of original result.

My understanding is that

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 91f44e78c516..3f7c6d6969ac 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
 static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                                        int sync, int force_flush)
 {
-       static DEFINE_SPINLOCK(purge_lock);
        struct llist_node *valist;
        struct vmap_area *va;
        struct vmap_area *n_va;
@@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
         * should not expect such behaviour. This just simplifies locking for
         * the case that isn't actually used at the moment anyway.
         */
-       if (!sync && !force_flush) {
-               if (!spin_trylock(&purge_lock))
-                       return;
-       } else
-               spin_lock(&purge_lock);
-
        if (sync)
                purge_fragmented_blocks_allcpus();
 
@@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                        __free_vmap_area(va);
                spin_unlock(&vmap_area_lock);
        }
-       spin_unlock(&purge_lock);
 }
 
 /*


should now be safe. That should significantly reduce the preempt-disabled
section, I think.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-09-29  8:18   ` Chris Wilson
@ 2016-09-29  8:28     ` Jisheng Zhang
  0 siblings, 0 replies; 33+ messages in thread
From: Jisheng Zhang @ 2016-09-29  8:28 UTC (permalink / raw)
  To: Chris Wilson
  Cc: akpm, mgorman, rientjes, iamjoonsoo.kim, agnel.joel, linux-mm,
	linux-kernel, linux-arm-kernel

On Thu, 29 Sep 2016 09:18:18 +0100 Chris Wilson wrote:

> On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:
> > On Marvell berlin arm64 platforms, I see the preemptoff tracer report
> > a max 26543 us latency at __purge_vmap_area_lazy, this latency is an
> > awfully bad for STB. And the ftrace log also shows __free_vmap_area
> > contributes most latency now. I noticed that Joel mentioned the same
> > issue[1] on x86 platform and gave two solutions, but it seems no patch
> > is sent out for this purpose.
> > 
> > This patch adopts Joel's first solution, but I use 16MB per core
> > rather than 8MB per core for the number of lazy_max_pages. After this
> > patch, the preemptoff tracer reports a max 6455us latency, reduced to
> > 1/4 of original result.  
> 
> My understanding is that
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 91f44e78c516..3f7c6d6969ac 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
>  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>                                         int sync, int force_flush)
>  {
> -       static DEFINE_SPINLOCK(purge_lock);
>         struct llist_node *valist;
>         struct vmap_area *va;
>         struct vmap_area *n_va;
> @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>          * should not expect such behaviour. This just simplifies locking for
>          * the case that isn't actually used at the moment anyway.
>          */
> -       if (!sync && !force_flush) {
> -               if (!spin_trylock(&purge_lock))
> -                       return;
> -       } else
> -               spin_lock(&purge_lock);
> -
>         if (sync)
>                 purge_fragmented_blocks_allcpus();
>  
> @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>                         __free_vmap_area(va);
>                 spin_unlock(&vmap_area_lock);

Hi Chris,

Per my test, the bottleneck now is __free_vmap_area() iterating over the
valist; the iteration is protected by the vmap_area_lock spinlock. So the
larger lazy_max_pages, the longer the valist and the bigger the latency.

So besides the above patch, we still need to remove vmap_area_lock or
replace it with a mutex.

Thanks,
Jisheng

>         }
> -       spin_unlock(&purge_lock);
>  }
>  
>  /*
> 
> 
> should now be safe. That should significantly reduce the preempt-disabled
> section, I think.
> -Chris
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread
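
To put a rough number on Jisheng's point above: at the pre-patch threshold,
and assuming 4 CPUs, 4 KiB pages, single-page lazily-freed areas and on the
order of one microsecond per __free_vmap_area() call (all illustrative
assumptions, not measurements from this thread), the time spent under
vmap_area_lock lands in the same ballpark as the reported 26543 us:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;	/* assumed */
	unsigned long max_lazy = 3 * ((32UL << 20) / page_size);	/* fls(4) == 3 */
	double per_va_us = 1.0;	/* assumed cost of one __free_vmap_area() */

	/* worst case: every lazily-freed area is a single page, so the detached
	 * valist holds max_lazy entries, all freed under one lock hold */
	printf("entries walked under the lock: %lu\n", max_lazy);
	printf("estimated lock hold time     : %.1f ms\n",
	       max_lazy * per_va_us / 1000.0);
	return 0;
}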

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-09-29  8:28     ` Jisheng Zhang
@ 2016-09-29 11:07       ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2016-09-29 11:07 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: akpm, mgorman, rientjes, iamjoonsoo.kim, agnel.joel, linux-mm,
	linux-kernel, linux-arm-kernel

On Thu, Sep 29, 2016 at 04:28:08PM +0800, Jisheng Zhang wrote:
> On Thu, 29 Sep 2016 09:18:18 +0100 Chris Wilson wrote:
> 
> > On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:
> > > On Marvell berlin arm64 platforms, I see the preemptoff tracer report
> > > a max 26543 us latency at __purge_vmap_area_lazy, this latency is an
> > > awfully bad for STB. And the ftrace log also shows __free_vmap_area
> > > contributes most latency now. I noticed that Joel mentioned the same
> > > issue[1] on x86 platform and gave two solutions, but it seems no patch
> > > is sent out for this purpose.
> > > 
> > > This patch adopts Joel's first solution, but I use 16MB per core
> > > rather than 8MB per core for the number of lazy_max_pages. After this
> > > patch, the preemptoff tracer reports a max 6455us latency, reduced to
> > > 1/4 of original result.  
> > 
> > My understanding is that
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 91f44e78c516..3f7c6d6969ac 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
> >  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >                                         int sync, int force_flush)
> >  {
> > -       static DEFINE_SPINLOCK(purge_lock);
> >         struct llist_node *valist;
> >         struct vmap_area *va;
> >         struct vmap_area *n_va;
> > @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >          * should not expect such behaviour. This just simplifies locking for
> >          * the case that isn't actually used at the moment anyway.
> >          */
> > -       if (!sync && !force_flush) {
> > -               if (!spin_trylock(&purge_lock))
> > -                       return;
> > -       } else
> > -               spin_lock(&purge_lock);
> > -
> >         if (sync)
> >                 purge_fragmented_blocks_allcpus();
> >  
> > @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >                         __free_vmap_area(va);
> >                 spin_unlock(&vmap_area_lock);
> 
> Hi Chris,
> 
> Per my test, the bottleneck now is __free_vmap_area() over the valist, the
> iteration is protected with spinlock vmap_area_lock. So the larger lazy max
> pages, the longer valist, the bigger the latency.
> 
> So besides above patch, we still need to remove vmap_are_lock or replace with
> mutex.

Or follow up with

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f7c6d6969ac..67b5475f0b0a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -656,8 +656,10 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 
        if (nr) {
                spin_lock(&vmap_area_lock);
-               llist_for_each_entry_safe(va, n_va, valist, purge_list)
+               llist_for_each_entry_safe(va, n_va, valist, purge_list) {
                        __free_vmap_area(va);
+                       cond_resched_lock(&vmap_area_lock);
+               }
                spin_unlock(&vmap_area_lock);
        }
 }

?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply related	[flat|nested] 33+ messages in thread
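
For readers unfamiliar with it, cond_resched_lock(&vmap_area_lock) briefly
drops the lock and yields when a reschedule is pending (or, on SMP, when
another CPU is spinning on the lock), then re-takes it, so the lock hold and
preempt-off time is bounded by one loop iteration rather than by the whole
valist. A simplified sketch of the semantics (not a verbatim copy of the
kernel's __cond_resched_lock()):

#include <linux/sched.h>
#include <linux/spinlock.h>

/* Simplified sketch of cond_resched_lock(); returns nonzero if the lock
 * was dropped and re-acquired. */
static int cond_resched_lock_sketch(spinlock_t *lock)
{
	if (spin_needbreak(lock) || need_resched()) {
		spin_unlock(lock);	/* open a preemption window */
		cond_resched();		/* let a waiting task run */
		spin_lock(lock);	/* resume the walk where we left off */
		return 1;
	}
	return 0;
}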

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-09-29 11:07       ` Chris Wilson
@ 2016-09-29 11:18         ` Jisheng Zhang
  0 siblings, 0 replies; 33+ messages in thread
From: Jisheng Zhang @ 2016-09-29 11:18 UTC (permalink / raw)
  To: Chris Wilson
  Cc: akpm, mgorman, rientjes, iamjoonsoo.kim, agnel.joel, linux-mm,
	linux-kernel, linux-arm-kernel

On Thu, 29 Sep 2016 12:07:14 +0100 Chris Wilson wrote:

> On Thu, Sep 29, 2016 at 04:28:08PM +0800, Jisheng Zhang wrote:
> > On Thu, 29 Sep 2016 09:18:18 +0100 Chris Wilson wrote:
> >   
> > > On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:  
> > > > On Marvell berlin arm64 platforms, I see the preemptoff tracer report
> > > > a max 26543 us latency at __purge_vmap_area_lazy, this latency is an
> > > > awfully bad for STB. And the ftrace log also shows __free_vmap_area
> > > > contributes most latency now. I noticed that Joel mentioned the same
> > > > issue[1] on x86 platform and gave two solutions, but it seems no patch
> > > > is sent out for this purpose.
> > > > 
> > > > This patch adopts Joel's first solution, but I use 16MB per core
> > > > rather than 8MB per core for the number of lazy_max_pages. After this
> > > > patch, the preemptoff tracer reports a max 6455us latency, reduced to
> > > > 1/4 of original result.    
> > > 
> > > My understanding is that
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 91f44e78c516..3f7c6d6969ac 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
> > >  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> > >                                         int sync, int force_flush)
> > >  {
> > > -       static DEFINE_SPINLOCK(purge_lock);
> > >         struct llist_node *valist;
> > >         struct vmap_area *va;
> > >         struct vmap_area *n_va;
> > > @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> > >          * should not expect such behaviour. This just simplifies locking for
> > >          * the case that isn't actually used at the moment anyway.
> > >          */
> > > -       if (!sync && !force_flush) {
> > > -               if (!spin_trylock(&purge_lock))
> > > -                       return;
> > > -       } else
> > > -               spin_lock(&purge_lock);
> > > -
> > >         if (sync)
> > >                 purge_fragmented_blocks_allcpus();
> > >  
> > > @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> > >                         __free_vmap_area(va);
> > >                 spin_unlock(&vmap_area_lock);  
> > 
> > Hi Chris,
> > 
> > Per my test, the bottleneck now is __free_vmap_area() over the valist, the
> > iteration is protected with spinlock vmap_area_lock. So the larger lazy max
> > pages, the longer valist, the bigger the latency.
> > 
> > So besides above patch, we still need to remove vmap_are_lock or replace with
> > mutex.  
> 
> Or follow up with
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3f7c6d6969ac..67b5475f0b0a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -656,8 +656,10 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>  
>         if (nr) {
>                 spin_lock(&vmap_area_lock);
> -               llist_for_each_entry_safe(va, n_va, valist, purge_list)
> +               llist_for_each_entry_safe(va, n_va, valist, purge_list) {
>                         __free_vmap_area(va);
> +                       cond_resched_lock(&vmap_area_lock);

Oh, great! This seems to work fine. I'm not sure whether there is any side
effect or performance regression, but this patch plus the previous
purge_lock removal does address my problem.

Thanks,
Jisheng

> +               }
>                 spin_unlock(&vmap_area_lock);
>         }
>  }
> 
> ?
> -Chris
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-09-29  8:18   ` Chris Wilson
@ 2016-10-09  3:43     ` Joel Fernandes
  0 siblings, 0 replies; 33+ messages in thread
From: Joel Fernandes @ 2016-10-09  3:43 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Jisheng Zhang, Andrew Morton, mgorman, rientjes, iamjoonsoo.kim,
	npiggin, linux-mm, Linux Kernel Mailing List,
	Linux ARM Kernel List

On Thu, Sep 29, 2016 at 1:18 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:
>> On Marvell berlin arm64 platforms, I see the preemptoff tracer report
>> a max 26543 us latency at __purge_vmap_area_lazy, this latency is an
>> awfully bad for STB. And the ftrace log also shows __free_vmap_area
>> contributes most latency now. I noticed that Joel mentioned the same
>> issue[1] on x86 platform and gave two solutions, but it seems no patch
>> is sent out for this purpose.
>>
>> This patch adopts Joel's first solution, but I use 16MB per core
>> rather than 8MB per core for the number of lazy_max_pages. After this
>> patch, the preemptoff tracer reports a max 6455us latency, reduced to
>> 1/4 of original result.
>
> My understanding is that
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 91f44e78c516..3f7c6d6969ac 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
>  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>                                         int sync, int force_flush)
>  {
> -       static DEFINE_SPINLOCK(purge_lock);
>         struct llist_node *valist;
>         struct vmap_area *va;
>         struct vmap_area *n_va;
> @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>          * should not expect such behaviour. This just simplifies locking for
>          * the case that isn't actually used at the moment anyway.
>          */
> -       if (!sync && !force_flush) {
> -               if (!spin_trylock(&purge_lock))
> -                       return;
> -       } else
> -               spin_lock(&purge_lock);
> -
>         if (sync)
>                 purge_fragmented_blocks_allcpus();
>
> @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>                         __free_vmap_area(va);
>                 spin_unlock(&vmap_area_lock);
>         }
> -       spin_unlock(&purge_lock);
>  }
>
[..]
> should now be safe. That should significantly reduce the preempt-disabled
> section, I think.

I believe that the purge_lock is supposed to prevent concurrent purges
from happening.

Consider the case where another concurrent overflow happens in
alloc_vmap_area() between the spin_unlock and the purge:

spin_unlock(&vmap_area_lock);
if (!purged)
   purge_vmap_area_lazy();

Then the two purges would happen at the same time and could subtract
vmap_lazy_nr twice.

I had proposed changing it to a mutex in [1]. How do you feel about
that? Let me know your suggestions, thanks. I am also OK with reducing
the lazy_max_pages value.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1603.2/04803.html

Regards,
Joel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-10-09  3:43     ` Joel Fernandes
@ 2016-10-09 12:42       ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2016-10-09 12:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Jisheng Zhang, Andrew Morton, mgorman, rientjes, iamjoonsoo.kim,
	npiggin, linux-mm, Linux Kernel Mailing List,
	Linux ARM Kernel List

On Sat, Oct 08, 2016 at 08:43:51PM -0700, Joel Fernandes wrote:
> On Thu, Sep 29, 2016 at 1:18 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:
> >> On Marvell berlin arm64 platforms, I see the preemptoff tracer report
> >> a max 26543 us latency at __purge_vmap_area_lazy, this latency is an
> >> awfully bad for STB. And the ftrace log also shows __free_vmap_area
> >> contributes most latency now. I noticed that Joel mentioned the same
> >> issue[1] on x86 platform and gave two solutions, but it seems no patch
> >> is sent out for this purpose.
> >>
> >> This patch adopts Joel's first solution, but I use 16MB per core
> >> rather than 8MB per core for the number of lazy_max_pages. After this
> >> patch, the preemptoff tracer reports a max 6455us latency, reduced to
> >> 1/4 of original result.
> >
> > My understanding is that
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 91f44e78c516..3f7c6d6969ac 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
> >  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >                                         int sync, int force_flush)
> >  {
> > -       static DEFINE_SPINLOCK(purge_lock);
> >         struct llist_node *valist;
> >         struct vmap_area *va;
> >         struct vmap_area *n_va;
> > @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >          * should not expect such behaviour. This just simplifies locking for
> >          * the case that isn't actually used at the moment anyway.
> >          */
> > -       if (!sync && !force_flush) {
> > -               if (!spin_trylock(&purge_lock))
> > -                       return;
> > -       } else
> > -               spin_lock(&purge_lock);
> > -
> >         if (sync)
> >                 purge_fragmented_blocks_allcpus();
> >
> > @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >                         __free_vmap_area(va);
> >                 spin_unlock(&vmap_area_lock);
> >         }
> > -       spin_unlock(&purge_lock);
> >  }
> >
> [..]
> > should now be safe. That should significantly reduce the preempt-disabled
> > section, I think.
> 
> I believe that the purge_lock is supposed to prevent concurrent purges
> from happening.
> 
> For the case where if you have another concurrent overflow happen in
> alloc_vmap_area() between the spin_unlock and purge :
> 
> spin_unlock(&vmap_area_lock);
> if (!purged)
>    purge_vmap_area_lazy();
> 
> Then the 2 purges would happen at the same time and could subtract
> vmap_lazy_nr twice.

That itself is not the problem, as each instance of
__purge_vmap_area_lazy() operates on its own freelist, and so there will
be no double accounting.

However, removing the lock removes the serialisation, which means that
alloc_vmap_area() will not block on another thread conducting the purge,
and so it will try to reallocate before that purge is complete and the
free area is made available. It also means that we are doing the
atomic_sub(vmap_lazy_nr) too early.

That supports making the outer lock a mutex as you suggested. But I think
cond_resched_lock() is better for the vmap_area_lock (just because it
turns out to be an expensive loop and we may want the reschedule).
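
Something along these lines for the free loop (just a sketch to show the
idea, not a tested patch):

	spin_lock(&vmap_area_lock);
	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
		__free_vmap_area(va);
		/* drop the lock and reschedule if something is waiting */
		cond_resched_lock(&vmap_area_lock);
	}
	spin_unlock(&vmap_area_lock);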
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-10-09 12:42       ` Chris Wilson
  (?)
@ 2016-10-09 19:00         ` Joel Fernandes
  -1 siblings, 0 replies; 33+ messages in thread
From: Joel Fernandes @ 2016-10-09 19:00 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Joel Fernandes, Jisheng Zhang, npiggin,
	Linux Kernel Mailing List, linux-mm, rientjes, Andrew Morton,
	mgorman, iamjoonsoo.kim, Linux ARM Kernel List

On Sun, Oct 9, 2016 at 5:42 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
[..]
>> > My understanding is that
>> >
>> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> > index 91f44e78c516..3f7c6d6969ac 100644
>> > --- a/mm/vmalloc.c
>> > +++ b/mm/vmalloc.c
>> > @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
>> >  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>> >                                         int sync, int force_flush)
>> >  {
>> > -       static DEFINE_SPINLOCK(purge_lock);
>> >         struct llist_node *valist;
>> >         struct vmap_area *va;
>> >         struct vmap_area *n_va;
>> > @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>> >          * should not expect such behaviour. This just simplifies locking for
>> >          * the case that isn't actually used at the moment anyway.
>> >          */
>> > -       if (!sync && !force_flush) {
>> > -               if (!spin_trylock(&purge_lock))
>> > -                       return;
>> > -       } else
>> > -               spin_lock(&purge_lock);
>> > -
>> >         if (sync)
>> >                 purge_fragmented_blocks_allcpus();
>> >
>> > @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>> >                         __free_vmap_area(va);
>> >                 spin_unlock(&vmap_area_lock);
>> >         }
>> > -       spin_unlock(&purge_lock);
>> >  }
>> >
>> [..]
>> > should now be safe. That should significantly reduce the preempt-disabled
>> > section, I think.
>>
>> I believe that the purge_lock is supposed to prevent concurrent purges
>> from happening.
>>
>> For the case where if you have another concurrent overflow happen in
>> alloc_vmap_area() between the spin_unlock and purge :
>>
>> spin_unlock(&vmap_area_lock);
>> if (!purged)
>>    purge_vmap_area_lazy();
>>
>> Then the 2 purges would happen at the same time and could subtract
>> vmap_lazy_nr twice.
>
> That itself is not the problem, as each instance of
> __purge_vmap_area_lazy() operates on its own freelist, and so there will
> be no double accounting.
>
> However, removing the lock removes the serialisation which does mean
> that alloc_vmap_area() will not block on another thread conducting the
> purge, and so it will try to reallocate before that is complete and the
> free area made available. It also means that we are doing the
> atomic_sub(vmap_lazy_nr) too early.
>
> That supports making the outer lock a mutex as you suggested. But I think
> cond_resched_lock() is better for the vmap_area_lock (just because it
> turns out to be an expensive loop and we may want the reschedule).
> -Chris

OK. So I'll submit a patch that uses a mutex for purge_lock and
cond_resched_lock for the vmap_area_lock, as you suggested. I'll also
drop lazy_max_pages to 8MB, as Andi suggested, to reduce the lock hold
time. Let me know if you have any objections.
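
For reference, the lazy_max_pages() part would just be a change of the
multiplier, something like (untested):

	/* 8MB instead of the current 32MB */
	return log * (8UL * 1024 * 1024 / PAGE_SIZE);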

Thanks,
Joel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-10-09 19:00         ` Joel Fernandes
  (?)
@ 2016-10-09 19:26           ` Chris Wilson
  -1 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2016-10-09 19:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Joel Fernandes, Jisheng Zhang, npiggin,
	Linux Kernel Mailing List, linux-mm, rientjes, Andrew Morton,
	mgorman, iamjoonsoo.kim, Linux ARM Kernel List

On Sun, Oct 09, 2016 at 12:00:31PM -0700, Joel Fernandes wrote:
> Ok. So I'll submit a patch with mutex for purge_lock and use
> cond_resched_lock for the vmap_area_lock as you suggested. I'll also
> drop the lazy_max_pages to 8MB as Andi suggested to reduce the lock
> hold time. Let me know if you have any objections.

The downside of using a mutex here, though, is that we may be called
from contexts that cannot sleep (alloc_vmap_area), or reschedule for
that matter! If we change the notion of purged, we can forgo the mutex
in favour of spinning on the direct reclaim path. That just leaves the
complication of whether to use cond_resched_lock() or a lock around
the individual __free_vmap_area().
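
One very rough (untested) shape of that, just to illustrate what I mean
by spinning instead of taking a mutex:

	static atomic_t purging;

	if (!sync && !force_flush) {
		/* lazy path: skip if another purge is already running */
		if (atomic_cmpxchg(&purging, 0, 1))
			return;
	} else {
		/* direct reclaim / sync path: wait for the other purge */
		while (atomic_cmpxchg(&purging, 0, 1))
			cpu_relax();
	}
	...
	atomic_set(&purging, 0);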
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-10-09 19:26           ` Chris Wilson
  (?)
@ 2016-10-11  5:06             ` Joel Fernandes
  -1 siblings, 0 replies; 33+ messages in thread
From: Joel Fernandes @ 2016-10-11  5:06 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Joel Fernandes, Jisheng Zhang, npiggin,
	Linux Kernel Mailing List, linux-mm, rientjes, Andrew Morton,
	mgorman, iamjoonsoo.kim, Linux ARM Kernel List

On Sun, Oct 9, 2016 at 12:26 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Sun, Oct 09, 2016 at 12:00:31PM -0700, Joel Fernandes wrote:
>> Ok. So I'll submit a patch with mutex for purge_lock and use
>> cond_resched_lock for the vmap_area_lock as you suggested. I'll also
>> drop the lazy_max_pages to 8MB as Andi suggested to reduce the lock
>> hold time. Let me know if you have any objections.
>
> The downside of using a mutex here though, is that we may be called
> from contexts that cannot sleep (alloc_vmap_area), or reschedule for
> that matter! If we change the notion of purged, we can forgo the mutex
> in favour of spinning on the direct reclaim path. That just leaves the
> complication of whether to use cond_resched_lock() or a lock around
> the individual __free_vmap_area().

Good point, I agree with you. I think we still need to know whether
purging is in progress, to preserve the previous trylock behavior. How
about something like the following diff? (The diff is untested.)

This drops the purge lock and uses a ref count to indicate that purging
is in progress, so that callers who don't want to purge while a purge is
already running can be kept happy. Also, I am reducing vmap_lazy_nr as we
go rather than all at once, so that we don't decrement the counter too
soon now that we're no longer holding the purge lock. Lastly, I added the
cond_resched as you suggested.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb..5616ca4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -626,7 +626,7 @@ void set_iounmap_nonlazy(void)
 static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                                        int sync, int force_flush)
 {
-       static DEFINE_SPINLOCK(purge_lock);
+       static atomic_t purging;
        struct llist_node *valist;
        struct vmap_area *va;
        struct vmap_area *n_va;
@@ -638,10 +638,10 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
         * the case that isn't actually used at the moment anyway.
         */
        if (!sync && !force_flush) {
-               if (!spin_trylock(&purge_lock))
+               if (atomic_cmpxchg(&purging, 0, 1))
                        return;
        } else
-               spin_lock(&purge_lock);
+               atomic_inc(&purging);

        if (sync)
                purge_fragmented_blocks_allcpus();
@@ -655,9 +655,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
        }

-       if (nr)
-               atomic_sub(nr, &vmap_lazy_nr);
-
        if (nr || force_flush)
                flush_tlb_kernel_range(*start, *end);

@@ -665,9 +662,11 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                spin_lock(&vmap_area_lock);
                llist_for_each_entry_safe(va, n_va, valist, purge_list)
                        __free_vmap_area(va);
+               atomic_sub(1, &vmap_lazy_nr);
+               cond_resched_lock(&vmap_area_lock);
                spin_unlock(&vmap_area_lock);
        }
-       spin_unlock(&purge_lock);
+       atomic_dec(&purging);
 }

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency
  2016-10-11  5:06             ` Joel Fernandes
  (?)
@ 2016-10-11  5:34               ` Joel Fernandes
  -1 siblings, 0 replies; 33+ messages in thread
From: Joel Fernandes @ 2016-10-11  5:34 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Joel Fernandes, Jisheng Zhang, npiggin,
	Linux Kernel Mailing List, linux-mm, rientjes, Andrew Morton,
	mgorman, iamjoonsoo.kim, Linux ARM Kernel List

On Mon, Oct 10, 2016 at 10:06 PM, Joel Fernandes <agnel.joel@gmail.com> wrote:
> On Sun, Oct 9, 2016 at 12:26 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> On Sun, Oct 09, 2016 at 12:00:31PM -0700, Joel Fernandes wrote:
>>> Ok. So I'll submit a patch with mutex for purge_lock and use
>>> cond_resched_lock for the vmap_area_lock as you suggested. I'll also
>>> drop the lazy_max_pages to 8MB as Andi suggested to reduce the lock
>>> hold time. Let me know if you have any objections.
>>
>> The downside of using a mutex here though, is that we may be called
>> from contexts that cannot sleep (alloc_vmap_area), or reschedule for
>> that matter! If we change the notion of purged, we can forgo the mutex
>> in favour of spinning on the direct reclaim path. That just leaves the
>> complication of whether to use cond_resched_lock() or a lock around
>> the individual __free_vmap_area().
>
> Good point. I agree with you. I think we still need to know if purging
> is in progress to preserve previous trylock behavior. How about
> something like the following diff? (diff is untested).
>
> This drops the purge lock and uses a ref count to indicate if purging
> is in progress, so that callers who don't want to purge if purging is
> already in progress can be kept happy. Also I am reducing vmap_lazy_nr
> as we go, and, not all at once, so that we don't reduce the counter
> too soon as we're not holding purge lock anymore. Lastly, I added the
> cond_resched as you suggested.
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f2481cb..5616ca4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -626,7 +626,7 @@ void set_iounmap_nonlazy(void)
>  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>                                         int sync, int force_flush)
>  {
> -       static DEFINE_SPINLOCK(purge_lock);
> +       static atomic_t purging;
>         struct llist_node *valist;
>         struct vmap_area *va;
>         struct vmap_area *n_va;
> @@ -638,10 +638,10 @@ static void __purge_vmap_area_lazy(unsigned long
> *start, unsigned long *end,
>          * the case that isn't actually used at the moment anyway.
>          */
>         if (!sync && !force_flush) {
> -               if (!spin_trylock(&purge_lock))
> +               if (atomic_cmpxchg(&purging, 0, 1))
>                         return;
>         } else
> -               spin_lock(&purge_lock);
> +               atomic_inc(&purging);
>
>         if (sync)
>                 purge_fragmented_blocks_allcpus();
> @@ -655,9 +655,6 @@ static void __purge_vmap_area_lazy(unsigned long
> *start, unsigned long *end,
>                 nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
>         }
>
> -       if (nr)
> -               atomic_sub(nr, &vmap_lazy_nr);
> -
>         if (nr || force_flush)
>                 flush_tlb_kernel_range(*start, *end);
>
> @@ -665,9 +662,11 @@ static void __purge_vmap_area_lazy(unsigned long
> *start, unsigned long *end,
>                 spin_lock(&vmap_area_lock);
>                 llist_for_each_entry_safe(va, n_va, valist, purge_list)
>                         __free_vmap_area(va);
> +               atomic_sub(1, &vmap_lazy_nr);
> +               cond_resched_lock(&vmap_area_lock);
>                 spin_unlock(&vmap_area_lock);

For this particular hunk, I forgot the braces. Sorry, I meant to say:

 @@ -665,9 +662,11 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                 spin_lock(&vmap_area_lock);
-                llist_for_each_entry_safe(va, n_va, valist, purge_list)
+                llist_for_each_entry_safe(va, n_va, valist, purge_list) {
                   __free_vmap_area(va);
+                  atomic_sub(1, &vmap_lazy_nr);
+                  cond_resched_lock(&vmap_area_lock);
+                }
                 spin_unlock(&vmap_area_lock);


Regards,
Joel

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2016-10-11  5:34 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-29  7:34 [PATCH] mm/vmalloc: reduce the number of lazy_max_pages to reduce latency Jisheng Zhang
2016-09-29  8:18 ` Chris Wilson
2016-09-29  8:28   ` Jisheng Zhang
2016-09-29 11:07     ` Chris Wilson
2016-09-29 11:18       ` Jisheng Zhang
2016-10-09  3:43   ` Joel Fernandes
2016-10-09 12:42     ` Chris Wilson
2016-10-09 19:00       ` Joel Fernandes
2016-10-09 19:26         ` Chris Wilson
2016-10-11  5:06           ` Joel Fernandes
2016-10-11  5:34             ` Joel Fernandes
