linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/3] mm: correct return value of migrate_pages()
       [not found] <Yes>
@ 2012-07-16 16:14 ` Joonsoo Kim
  2012-07-16 16:14   ` [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
                     ` (4 more replies)
  2012-07-17 12:33 ` [PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
                   ` (11 subsequent siblings)
  12 siblings, 5 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-16 16:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Christoph Lameter

migrate_pages() should return the number of pages not migrated, or an
error code. When unmap_and_move() returns -EAGAIN, the outer loop is
re-executed without reinitialising nr_failed, which leaves nr_failed
over-counted.

So this patch corrects it by reinitialising nr_failed in the outer loop.
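
For illustration, the loop in question has roughly this shape (a
simplified sketch, not the exact source; the -ENOMEM abort path and
the function arguments are trimmed):

	for (pass = 0; pass < 10 && retry; pass++) {
		retry = 0;
		/* without the fix, nr_failed is not reset here */
		list_for_each_entry_safe(page, page2, from, lru) {
			rc = unmap_and_move(/* ... */);
			switch (rc) {
			case -EAGAIN:
				retry++;	/* triggers another pass */
				break;
			case 0:
				break;		/* migrated */
			default:
				nr_failed++;	/* page stays on 'from', so
						 * every extra pass counts
						 * the same page again */
				break;
			}
		}
	}
	return nr_failed + retry;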

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux.com>

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..294d52a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from,
 
 	for(pass = 0; pass < 10 && retry; pass++) {
 		retry = 0;
+		nr_failed = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
@ 2012-07-16 16:14   ` Joonsoo Kim
  2012-07-16 17:26     ` Christoph Lameter
  2012-07-16 17:40     ` Michal Nazarewicz
  2012-07-16 16:14   ` [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-16 16:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Sasha Levin, Christoph Lameter

do_migrate_pages() can return the number of pages not migrated.
Because the migrate_pages() syscall returns this value directly, the
syscall may return the number of pages not migrated. In the failure
case, the migrate_pages() syscall should return an error value, so
change err to -EIO.

Additionally, correct the comment above do_migrate_pages().
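
To illustrate the user-visible problem, here is a hypothetical caller
written against the migrate_pages(2) man page (the node masks and their
setup are assumed):

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/syscall.h>

	static int move_task_pages(pid_t pid, unsigned long maxnode,
				   const unsigned long *old_nodes,
				   const unsigned long *new_nodes)
	{
		long ret = syscall(SYS_migrate_pages, pid, maxnode,
				   old_nodes, new_nodes);
		if (ret == -1) {
			perror("migrate_pages");
			return -1;
		}
		/* The man page documents only 0 or -1, so we treat this
		 * as success; without the patch, ret may actually be a
		 * positive count of pages that were not migrated. */
		return 0;
	}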

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Christoph Lameter <cl@linux.com>

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..f7df271 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
  * Move pages between the two nodesets so as to preserve the physical
  * layout as much as possible.
  *
- * Returns the number of page that could not be moved.
+ * Returns error or the number of pages not migrated.
  */
 int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 		     const nodemask_t *to, int flags)
@@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
 
 	err = do_migrate_pages(mm, old, new,
 		capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
+	if (err > 0)
+		err = -EIO;
 
 	mmput(mm);
 out:
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
  2012-07-16 16:14   ` [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
@ 2012-07-16 16:14   ` Joonsoo Kim
  2012-07-16 17:29     ` Christoph Lameter
  2012-07-16 17:40     ` Michal Nazarewicz
  2012-07-16 17:14   ` [PATCH 4] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-16 16:14 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Michal Nazarewicz,
	Marek Szyprowski, Minchan Kim, Christoph Lameter

migrate_pages() can return a positive value in some failure cases,
so 'ret > 0 ? 0 : ret' may be wrong.
This fixes it and removes one dead statement.
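
Roughly, the failure path looks like this (a simplified sketch of the
loop; the isolation details are omitted):

	/* simplified shape of __alloc_contig_migrate_range() */
	while (pfn < end || !list_empty(&cc.migratepages)) {
		if (list_empty(&cc.migratepages)) {
			/* isolate more pages */
			tries = 0;
		} else if (++tries == 5) {
			ret = ret < 0 ? ret : -EBUSY;
			break;
		}
		ret = migrate_pages(&cc.migratepages, /* ... */);
	}
	/* If the last migrate_pages() call emptied the list while
	 * returning a positive fail count, the loop ends without the
	 * "++tries == 5" branch ever firing, and the old
	 * "ret > 0 ? 0 : ret" then wrongly reports success. */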

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christoph Lameter <cl@linux.com>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 			}
 			tries = 0;
 		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
 			break;
 		}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 	}
 
 	putback_lru_pages(&cc.migratepages);
-	return ret > 0 ? 0 : ret;
+	return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 4] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
  2012-07-16 16:14   ` [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
  2012-07-16 16:14   ` [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
@ 2012-07-16 17:14   ` Joonsoo Kim
  2012-07-16 17:30     ` Christoph Lameter
  2012-07-16 17:23   ` [PATCH 1/3] mm: correct return value of migrate_pages() Christoph Lameter
  2012-07-16 17:40   ` Michal Nazarewicz
  4 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-16 17:14 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Brice Goglin,
	Christoph Lameter, Minchan Kim

The move_pages() syscall may return success even when
do_move_page_to_node_array() returns a positive value, which means some
migrations failed.
This patch changes do_move_page_to_node_array() so that it does not
return a positive value, which fixes the problem.
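
For context, the positive value is dropped by the caller; as I read it,
the chunk loop in do_pages_move() has roughly this shape (a sketch, not
the verbatim source):

	err = do_move_page_to_node_array(mm, pm,
					 flags & MPOL_MF_MOVE_ALL);
	if (err < 0)
		goto out_pm;		/* only negative values abort */
	/* otherwise the per-page status is copied out and the loop
	 * continues, so a positive "pages failed" count never reaches
	 * the user and the syscall reports success */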

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Brice Goglin <brice@myri.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan@kernel.org>

diff --git a/mm/migrate.c b/mm/migrate.c
index 294d52a..adabaf4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1171,7 +1171,7 @@ set_status:
 	}
 
 	up_read(&mm->mmap_sem);
-	return err;
+	return err > 0 ? -EIO : err;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/3] mm: correct return value of migrate_pages()
  2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
                     ` (2 preceding siblings ...)
  2012-07-16 17:14   ` [PATCH 4] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
@ 2012-07-16 17:23   ` Christoph Lameter
  2012-07-16 17:32     ` JoonSoo Kim
  2012-07-16 17:40   ` Michal Nazarewicz
  4 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-07-16 17:23 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm

On Tue, 17 Jul 2012, Joonsoo Kim wrote:

> migrate_pages() should return the number of pages not migrated, or an
> error code. When unmap_and_move() returns -EAGAIN, the outer loop is
> re-executed without reinitialising nr_failed, which leaves nr_failed
> over-counted.

The intention of nr_failed was only to give an indication as to how
many attempts were made. The failed pages were on a separate queue that
seems to have vanished.

> So this patch corrects it by reinitialising nr_failed in the outer loop.

Well, yeah, it makes sense since retry is initialized there as well.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-16 16:14   ` [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
@ 2012-07-16 17:26     ` Christoph Lameter
  2012-07-16 17:40     ` Michal Nazarewicz
  1 sibling, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-16 17:26 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm, Sasha Levin

On Tue, 17 Jul 2012, Joonsoo Kim wrote:

> do_migrate_pages() can return the number of pages not migrated.
> Because the migrate_pages() syscall returns this value directly, the
> syscall may return the number of pages not migrated. In the failure
> case, the migrate_pages() syscall should return an error value, so
> change err to -EIO.

Pages are not migrated because they are busy, not because there is an
error. So let's return EBUSY.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-16 16:14   ` [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
@ 2012-07-16 17:29     ` Christoph Lameter
  2012-07-16 17:40     ` Michal Nazarewicz
  1 sibling, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-16 17:29 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, linux-kernel, linux-mm, Michal Nazarewicz,
	Marek Szyprowski, Minchan Kim

On Tue, 17 Jul 2012, Joonsoo Kim wrote:

> migrate_pages() can return a positive value in some failure cases,
> so 'ret > 0 ? 0 : ret' may be wrong.
> This fixes it and removes one dead statement.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 4] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-16 17:14   ` [PATCH 4] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
@ 2012-07-16 17:30     ` Christoph Lameter
  0 siblings, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-16 17:30 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim

On Tue, 17 Jul 2012, Joonsoo Kim wrote:

> The move_pages() syscall may return success even when
> do_move_page_to_node_array() returns a positive value, which means some
> migrations failed.
> This patch changes do_move_page_to_node_array() so that it does not
> return a positive value, which fixes the problem.
>
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> Cc: Brice Goglin <brice@myri.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan@kernel.org>
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 294d52a..adabaf4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1171,7 +1171,7 @@ set_status:
>  	}
>
>  	up_read(&mm->mmap_sem);
> -	return err;
> +	return err > 0 ? -EIO : err;
>  }

Please use EBUSY.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/3] mm: correct return value of migrate_pages()
  2012-07-16 17:23   ` [PATCH 1/3] mm: correct return value of migrate_pages() Christoph Lameter
@ 2012-07-16 17:32     ` JoonSoo Kim
  2012-07-16 17:37       ` Christoph Lameter
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-16 17:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm

2012/7/17 Christoph Lameter <cl@linux.com>:
> On Tue, 17 Jul 2012, Joonsoo Kim wrote:
>
>> migrate_pages() should return the number of pages not migrated, or an
>> error code. When unmap_and_move() returns -EAGAIN, the outer loop is
>> re-executed without reinitialising nr_failed, which leaves nr_failed
>> over-counted.
>
> The intention of nr_failed was only to give an indication as to how
> many attempts were made. The failed pages were on a separate queue that
> seems to have vanished.
>
>> So this patch corrects it by reinitialising nr_failed in the outer loop.
>
> Well, yeah, it makes sense since retry is initialized there as well.
>
> Acked-by: Christoph Lameter <cl@linux.com>

Thanks for the comment.

Additionally, I find that migrate_huge_pages() needs the identical fix
as migrate_pages().

@@ -1029,6 +1030,7 @@ int migrate_huge_pages(struct list_head *from,

        for (pass = 0; pass < 10 && retry; pass++) {
                retry = 0;
+               nr_failed = 0;

                list_for_each_entry_safe(page, page2, from, lru) {
                        cond_resched();

When I resend with this, could I include "Acked-by: Christoph Lameter
<cl@linux.com>"?

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/3] mm: correct return value of migrate_pages()
  2012-07-16 17:32     ` JoonSoo Kim
@ 2012-07-16 17:37       ` Christoph Lameter
  0 siblings, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-16 17:37 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: akpm, linux-kernel, linux-mm

On Tue, 17 Jul 2012, JoonSoo Kim wrote:

>
>         for (pass = 0; pass < 10 && retry; pass++) {
>                 retry = 0;
> +               nr_failed = 0;
>
>                 list_for_each_entry_safe(page, page2, from, lru) {
>                         cond_resched();
>
> When I resend with this, could I include "Acked-by: Christoph Lameter
> <cl@linux.com>"?

Sure.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/3] mm: correct return value of migrate_pages()
  2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
                     ` (3 preceding siblings ...)
  2012-07-16 17:23   ` [PATCH 1/3] mm: correct return value of migrate_pages() Christoph Lameter
@ 2012-07-16 17:40   ` Michal Nazarewicz
  2012-07-16 17:57     ` JoonSoo Kim
  4 siblings, 1 reply; 146+ messages in thread
From: Michal Nazarewicz @ 2012-07-16 17:40 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm, Christoph Lameter

[-- Attachment #1: Type: text/plain, Size: 1307 bytes --]

Joonsoo Kim <js1304@gmail.com> writes:
> migrate_pages() should return the number of pages not migrated, or an
> error code. When unmap_and_move() returns -EAGAIN, the outer loop is
> re-executed without reinitialising nr_failed, which leaves nr_failed
> over-counted.
>
> So this patch corrects it by reinitialising nr_failed in the outer loop.
>
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>

Acked-by: Michal Nazarewicz <mina86@mina86.com>

Actually, it makes me wonder if there is any code that uses this
information.  If not, it would be best in my opinion to make it return
zero or a negative error code, but that would have to be checked.

> diff --git a/mm/migrate.c b/mm/migrate.c
> index be26d5c..294d52a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from,
>  
>  	for(pass = 0; pass < 10 && retry; pass++) {
>  		retry = 0;
> +		nr_failed = 0;
>  
>  		list_for_each_entry_safe(page, page2, from, lru) {
>  			cond_resched();

-- 
Best regards,                                          _     _
 .o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz    (o o)
 ooo +-<mina86-mina86.com>-<jid:mina86-jabber.org>--ooO--(_)--Ooo--

[-- Attachment #2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-16 16:14   ` [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
  2012-07-16 17:26     ` Christoph Lameter
@ 2012-07-16 17:40     ` Michal Nazarewicz
  2012-07-16 17:59       ` JoonSoo Kim
  1 sibling, 1 reply; 146+ messages in thread
From: Michal Nazarewicz @ 2012-07-16 17:40 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm, Sasha Levin, Christoph Lameter

[-- Attachment #1: Type: text/plain, Size: 1709 bytes --]

Joonsoo Kim <js1304@gmail.com> writes:
> do_migrate_pages() can return the number of pages not migrated.
> Because the migrate_pages() syscall returns this value directly, the
> syscall may return the number of pages not migrated. In the failure
> case, the migrate_pages() syscall should return an error value, so
> change err to -EIO.
>
> Additionally, correct the comment above do_migrate_pages().
>
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> Cc: Sasha Levin <levinsasha928@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>

Acked-by: Michal Nazarewicz <mina86@mina86.com>

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 1d771e4..f7df271 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
>   * Move pages between the two nodesets so as to preserve the physical
>   * layout as much as possible.
>   *
> - * Returns the number of page that could not be moved.
> + * Returns error or the number of pages not migrated.
>   */
>  int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
>  		     const nodemask_t *to, int flags)
> @@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>  
>  	err = do_migrate_pages(mm, old, new,
>  		capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
> +	if (err > 0)
> +		err = -EIO;
>  
>  	mmput(mm);
>  out:

-- 
Best regards,                                          _     _
 .o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz    (o o)
 ooo +-<mina86-mina86.com>-<jid:mina86-jabber.org>--ooO--(_)--Ooo--

[-- Attachment #2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-16 16:14   ` [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
  2012-07-16 17:29     ` Christoph Lameter
@ 2012-07-16 17:40     ` Michal Nazarewicz
  2012-07-16 18:40       ` JoonSoo Kim
  1 sibling, 1 reply; 146+ messages in thread
From: Michal Nazarewicz @ 2012-07-16 17:40 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, linux-kernel, linux-mm, Marek Szyprowski, Minchan Kim,
	Christoph Lameter

[-- Attachment #1: Type: text/plain, Size: 1694 bytes --]

Joonsoo Kim <js1304@gmail.com> writes:

> migrate_pages() can return a positive value in some failure cases,
> so 'ret > 0 ? 0 : ret' may be wrong.
> This fixes it and removes one dead statement.
>
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>

Have you actually encountered this problem?  If migrate_pages() fails
with a positive value, the code that you are removing kicks in and
-EBUSY is assigned to ret (now that I look at it, I think that in the
current code the "return ret > 0 ? 0 : ret;" statement could be reduced
to "return ret;").  Your code seems to be cleaner, but the commit
message does not look accurate to me.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..02d4519 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
>  			}
>  			tries = 0;
>  		} else if (++tries == 5) {
> -			ret = ret < 0 ? ret : -EBUSY;
>  			break;
>  		}
>  
> @@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
>  	}
>  
>  	putback_lru_pages(&cc.migratepages);
> -	return ret > 0 ? 0 : ret;
> +	return ret <= 0 ? ret : -EBUSY;
>  }
>  
>  /*

-- 
Best regards,                                          _     _
 .o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz    (o o)
 ooo +-<mina86-mina86.com>-<jid:mina86-jabber.org>--ooO--(_)--Ooo--

[-- Attachment #2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/3] mm: correct return value of migrate_pages()
  2012-07-16 17:40   ` Michal Nazarewicz
@ 2012-07-16 17:57     ` JoonSoo Kim
  2012-07-16 18:05       ` Christoph Lameter
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-16 17:57 UTC (permalink / raw)
  To: Michal Nazarewicz; +Cc: akpm, linux-kernel, linux-mm, Christoph Lameter

2012/7/17 Michal Nazarewicz <mina86@tlen.pl>:
> Acked-by: Michal Nazarewicz <mina86@mina86.com>

Thanks.

> Actually, it makes me wonder if there is any code that uses this
> information.  If not, it would be best in my opinion to make it return
> zero or negative error code, but that would have to be checked.

I think that, too.
I looked at every callsite of migrate_pages() and there is no place
that really needs the fail count.
This function sometimes makes callers error-prone,
so I think changing the return value is preferable.

What do you think, Christoph?

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-16 17:40     ` Michal Nazarewicz
@ 2012-07-16 17:59       ` JoonSoo Kim
  2012-07-17 13:02         ` Michal Nazarewicz
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-16 17:59 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: akpm, linux-kernel, linux-mm, Sasha Levin, Christoph Lameter

2012/7/17 Michal Nazarewicz <mina86@tlen.pl>:
> Joonsoo Kim <js1304@gmail.com> writes:
>> do_migrate_pages() can return the number of pages not migrated.
>> Because the migrate_pages() syscall returns this value directly, the
>> syscall may return the number of pages not migrated. In the failure
>> case, the migrate_pages() syscall should return an error value, so
>> change err to -EIO.
>>
>> Additionally, correct the comment above do_migrate_pages().
>>
>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>> Cc: Sasha Levin <levinsasha928@gmail.com>
>> Cc: Christoph Lameter <cl@linux.com>
>
> Acked-by: Michal Nazarewicz <mina86@mina86.com>

Thanks.

When I resend with -EIO changed to -EBUSY,
could I include "Acked-by: Michal Nazarewicz <mina86@mina86.com>"?

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/3] mm: correct return value of migrate_pages()
  2012-07-16 17:57     ` JoonSoo Kim
@ 2012-07-16 18:05       ` Christoph Lameter
  0 siblings, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-16 18:05 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: Michal Nazarewicz, akpm, linux-kernel, linux-mm

On Tue, 17 Jul 2012, JoonSoo Kim wrote:

> > Actually, it makes me wonder if there is any code that uses this
> > information.  If not, it would be best in my opinion to make it return
> > zero or negative error code, but that would have to be checked.
>
> I think that, too.
> I looked at every callsites for migrate_pages() and there is no place
> which really need fail count.
> This function sometimes makes caller error-prone,
> so I think changing return value is preferable.
>
> How do you think, Christoph?

We could do that. I am not aware of anything using that information
either. However, the condition in which some pages were migrated and
others were not is not like a classic error. In many situations the
moving of the pages is done for performance reasons. This just means
that the best-performing memory locations could not be used for some
pages. A situation like that may be OK for an application.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-16 17:40     ` Michal Nazarewicz
@ 2012-07-16 18:40       ` JoonSoo Kim
  2012-07-17 13:16         ` Michal Nazarewicz
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-16 18:40 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: akpm, linux-kernel, linux-mm, Marek Szyprowski, Minchan Kim,
	Christoph Lameter

2012/7/17 Michal Nazarewicz <mina86@mina86.com>:
> Joonsoo Kim <js1304@gmail.com> writes:
>
>> migrate_pages() can return a positive value in some failure cases,
>> so 'ret > 0 ? 0 : ret' may be wrong.
>> This fixes it and removes one dead statement.
>>
>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Christoph Lameter <cl@linux.com>
>
> Have you actually encountered this problem?  If migrate_pages() fails
> with a positive value, the code that you are removing kicks in and
> -EBUSY is assigned to ret (now that I look at it, I think that in the
> current code the "return ret > 0 ? 0 : ret;" statement could be reduced
> to "return ret;").  Your code seems to be cleaner, but the commit
> message does not look accurate to me.
>

I have not encountered this problem yet.

If migrate_pages() with offlining false meets a KSM page, the migration fails.
In this case, the failed page is removed from the cc.migratepages list and
the failed count is returned.
So it is possible to exit the loop without ++tries == 5 ever being tested,
with ret above zero.
Is there any point I am missing?
Is there any possible scenario where "migrate_pages() returns > 0 and
cc.migratepages is empty"?

I'm not an expert in MM, so please comment on my humble opinion.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages()
       [not found] <Yes>
  2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
@ 2012-07-17 12:33 ` Joonsoo Kim
  2012-07-17 12:33   ` [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
                     ` (2 more replies)
  2012-07-27 17:55 ` [RESEND PATCH 1/4 v3] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
                   ` (10 subsequent siblings)
  12 siblings, 3 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-17 12:33 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Christoph Lameter

migrate_pages() should return the number of pages not migrated, or an
error code. When unmap_and_move() returns -EAGAIN, the outer loop is
re-executed without reinitialising nr_failed, which leaves nr_failed
over-counted.

So this patch corrects it by reinitialising nr_failed in the outer loop.

migrate_huge_pages() is an identical case to migrate_pages().

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..f495c58 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from,
 
 	for(pass = 0; pass < 10 && retry; pass++) {
 		retry = 0;
+		nr_failed = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
@@ -1029,6 +1030,7 @@ int migrate_huge_pages(struct list_head *from,
 
 	for (pass = 0; pass < 10 && retry; pass++) {
 		retry = 0;
+		nr_failed = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-17 12:33 ` [PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
@ 2012-07-17 12:33   ` Joonsoo Kim
  2012-07-17 14:28     ` Christoph Lameter
  2012-07-17 12:33   ` [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
  2012-07-17 12:33   ` [PATCH 4/4 v2] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
  2 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-17 12:33 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Sasha Levin, Christoph Lameter

do_migrate_pages() can return the number of pages not migrated.
Because the migrate_pages() syscall returns this value directly, the
syscall may return the number of pages not migrated. In the failure
case, the migrate_pages() syscall should return an error value, so
change err to -EBUSY.

Additionally, correct the comment above do_migrate_pages().

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Christoph Lameter <cl@linux.com>

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..0732729 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
  * Move pages between the two nodesets so as to preserve the physical
  * layout as much as possible.
  *
- * Returns the number of page that could not be moved.
+ * Returns error or the number of pages not migrated.
  */
 int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 		     const nodemask_t *to, int flags)
@@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
 
 	err = do_migrate_pages(mm, old, new,
 		capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
+	if (err > 0)
+		err = -EBUSY;
 
 	mmput(mm);
 out:
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-17 12:33 ` [PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
  2012-07-17 12:33   ` [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
@ 2012-07-17 12:33   ` Joonsoo Kim
  2012-07-17 13:25     ` Michal Nazarewicz
  2012-07-17 12:33   ` [PATCH 4/4 v2] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
  2 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-17 12:33 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Michal Nazarewicz,
	Marek Szyprowski, Minchan Kim, Christoph Lameter

migrate_pages() can return a positive value in some failure cases,
so 'ret > 0 ? 0 : ret' may be wrong.
This fixes it and removes one dead statement.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Christoph Lameter <cl@linux.com>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 			}
 			tries = 0;
 		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
 			break;
 		}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 	}
 
 	putback_lru_pages(&cc.migratepages);
-	return ret > 0 ? 0 : ret;
+	return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 4/4 v2] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-17 12:33 ` [PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
  2012-07-17 12:33   ` [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
  2012-07-17 12:33   ` [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
@ 2012-07-17 12:33   ` Joonsoo Kim
  2 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-17 12:33 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Brice Goglin,
	Christoph Lameter, Minchan Kim

The move_pages() syscall may return success even when
do_move_page_to_node_array() returns a positive value, which means some
migrations failed.
This patch changes do_move_page_to_node_array() so that it does not
return a positive value, which fixes the problem.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Brice Goglin <brice@myri.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan@kernel.org>

diff --git a/mm/migrate.c b/mm/migrate.c
index f495c58..eeaf409 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1172,7 +1172,7 @@ set_status:
 	}
 
 	up_read(&mm->mmap_sem);
-	return err;
+	return err > 0 ? -EBUSY : err;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-16 17:59       ` JoonSoo Kim
@ 2012-07-17 13:02         ` Michal Nazarewicz
  0 siblings, 0 replies; 146+ messages in thread
From: Michal Nazarewicz @ 2012-07-17 13:02 UTC (permalink / raw)
  To: Michal Nazarewicz, JoonSoo Kim
  Cc: akpm, linux-kernel, linux-mm, Sasha Levin, Christoph Lameter

On Mon, 16 Jul 2012 19:59:18 +0200, JoonSoo Kim <js1304@gmail.com> wrote:

> 2012/7/17 Michal Nazarewicz <mina86@tlen.pl>:
>> Joonsoo Kim <js1304@gmail.com> writes:
>>> do_migrate_pages() can return the number of pages not migrated.
>>> Because the migrate_pages() syscall returns this value directly, the
>>> syscall may return the number of pages not migrated. In the failure
>>> case, the migrate_pages() syscall should return an error value, so
>>> change err to -EIO.
>>>
>>> Additionally, correct the comment above do_migrate_pages().
>>>
>>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>>> Cc: Sasha Levin <levinsasha928@gmail.com>
>>> Cc: Christoph Lameter <cl@linux.com>
>>
>> Acked-by: Michal Nazarewicz <mina86@mina86.com>
>
> Thanks.
>
> When I resend with -EIO changed to -EBUSY,
> could I include "Acked-by: Michal Nazarewicz <mina86@mina86.com>"?

Sure thing.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-16 18:40       ` JoonSoo Kim
@ 2012-07-17 13:16         ` Michal Nazarewicz
  0 siblings, 0 replies; 146+ messages in thread
From: Michal Nazarewicz @ 2012-07-17 13:16 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: akpm, linux-kernel, linux-mm, Marek Szyprowski, Minchan Kim,
	Christoph Lameter

On Mon, 16 Jul 2012 20:40:56 +0200, JoonSoo Kim <js1304@gmail.com> wrote:

> 2012/7/17 Michal Nazarewicz <mina86@mina86.com>:
>> Joonsoo Kim <js1304@gmail.com> writes:
>>
>>> migrate_pages() can return a positive value in some failure cases,
>>> so 'ret > 0 ? 0 : ret' may be wrong.
>>> This fixes it and removes one dead statement.
>>>
>>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>> Cc: Minchan Kim <minchan@kernel.org>
>>> Cc: Christoph Lameter <cl@linux.com>
>>
>> Have you actually encountered this problem?  If migrate_pages() fails
>> with a positive value, the code that you are removing kicks in and
>> -EBUSY is assigned to ret (now that I look at it, I think that in the
>> current code the "return ret > 0 ? 0 : ret;" statement could be reduced
>> to "return ret;").  Your code seems to be cleaner, but the commit
>> message does not look accurate to me.
>>
>
> I have not encountered this problem yet.
>
> If migrate_pages() with offlining false meets a KSM page, the migration fails.
> In this case, the failed page is removed from the cc.migratepages list and
> the failed count is returned.

Good point.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-17 12:33   ` [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
@ 2012-07-17 13:25     ` Michal Nazarewicz
  2012-07-17 15:45       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Michal Nazarewicz @ 2012-07-17 13:25 UTC (permalink / raw)
  To: akpm, Joonsoo Kim
  Cc: linux-kernel, linux-mm, Marek Szyprowski, Minchan Kim, Christoph Lameter

On Tue, 17 Jul 2012 14:33:34 +0200, Joonsoo Kim <js1304@gmail.com> wrote:
> migrate_pages() can return a positive value in some failure cases,
> so 'ret > 0 ? 0 : ret' may be wrong.
> This fixes it and removes one dead statement.

How about the following message:

------------------- >8 ---------------------------------------------------
migrate_pages() can return a positive value while at the same time emptying
the list of pages it was called with.  Such a situation means that it went
through all the pages on the list, some of which failed to be migrated.

If that happens, __alloc_contig_migrate_range()'s loop may finish without
"++tries == 5" ever being checked.  This in turn means that at the end
of the function, ret may have a positive value, which should be treated
as an error.

This patch changes __alloc_contig_migrate_range() so that the return
statement converts a positive ret value into the -EBUSY error.
------------------- >8 ---------------------------------------------------

> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>

Acked-by: Michal Nazarewicz <mina86@mina86.com>

> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Acked-by: Christoph Lameter <cl@linux.com>

In fact, now that I look at it, I think that __alloc_contig_migrate_range()
should be changed even further.  I'll take a closer look at it and send
a patch (possibly through Marek ;) ).

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..02d4519 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
>  			}
>  			tries = 0;
>  		} else if (++tries == 5) {
> -			ret = ret < 0 ? ret : -EBUSY;
>  			break;
>  		}
> @@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
>  	}
>  
>  	putback_lru_pages(&cc.migratepages);
> -	return ret > 0 ? 0 : ret;
> +	return ret <= 0 ? ret : -EBUSY;
>  }
>  
>  /*


-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-17 12:33   ` [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
@ 2012-07-17 14:28     ` Christoph Lameter
  2012-07-17 15:41       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-07-17 14:28 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm, Sasha Levin

On Tue, 17 Jul 2012, Joonsoo Kim wrote:

> @@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>
>  	err = do_migrate_pages(mm, old, new,
>  		capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
> +	if (err > 0)
> +		err = -EBUSY;
>
>  	mmput(mm);
>  out:

Why not have do_migrate_pages() return EBUSY if we do not need the number
of failed/retried pages?
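
A minimal sketch of that alternative (hypothetical; it assumes the
local fail counter in do_migrate_pages() is 'busy'):

	/* at the end of do_migrate_pages(), instead of the raw count */
	if (busy > 0)
		return -EBUSY;	/* callers never see a positive count */
	return busy;		/* 0 or a negative error code */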


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-17 14:28     ` Christoph Lameter
@ 2012-07-17 15:41       ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-17 15:41 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Sasha Levin

2012/7/17 Christoph Lameter <cl@linux.com>:
> On Tue, 17 Jul 2012, Joonsoo Kim wrote:
>
>> @@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>>
>>       err = do_migrate_pages(mm, old, new,
>>               capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
>> +     if (err > 0)
>> +             err = -EBUSY;
>>
>>       mmput(mm);
>>  out:
>
> Why not have do_migrate_pages() return EBUSY if we do not need the number
> of failed/retried pages?

There is no serious reason.
do_migrate_pages() has two callsites, although the other one doesn't
use the return value.
do_migrate_pages() is commented "Return the number of page ...".
And my focus is fixing a possible error in the migrate_pages() syscall.
So, I keep returning the number of failed/retried pages.

If we really think the number of failed/retried pages is useless, then,
instead of having do_migrate_pages() return EBUSY, we can make migrate_pages()
return EBUSY. I think it is better to fix all the related code in one go.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-17 13:25     ` Michal Nazarewicz
@ 2012-07-17 15:45       ` JoonSoo Kim
  2012-07-17 15:49         ` [PATCH 3/4 v3] " Joonsoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-17 15:45 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: akpm, linux-kernel, linux-mm, Marek Szyprowski, Minchan Kim,
	Christoph Lameter

2012/7/17 Michal Nazarewicz <mina86@mina86.com>:
> On Tue, 17 Jul 2012 14:33:34 +0200, Joonsoo Kim <js1304@gmail.com> wrote:
>>
>> migrate_pages() can return a positive value in some failure cases,
>> so 'ret > 0 ? 0 : ret' may be wrong.
>> This fixes it and removes one dead statement.
>
>
> How about the following message:
>
> ------------------- >8 ---------------------------------------------------
> migrate_pages() can return a positive value while at the same time emptying
> the list of pages it was called with.  Such a situation means that it went
> through all the pages on the list, some of which failed to be migrated.
>
> If that happens, __alloc_contig_migrate_range()'s loop may finish without
> "++tries == 5" ever being checked.  This in turn means that at the end
> of the function, ret may have a positive value, which should be treated
> as an error.
>
> This patch changes __alloc_contig_migrate_range() so that the return
> statement converts a positive ret value into the -EBUSY error.
> ------------------- >8 ---------------------------------------------------

It's good.
I will resend the patch with my commit message replaced by yours.
Thanks for the help.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-17 15:45       ` JoonSoo Kim
@ 2012-07-17 15:49         ` Joonsoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-17 15:49 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Michal Nazarewicz,
	Marek Szyprowski, Minchan Kim, Christoph Lameter

migrate_pages() can return a positive value while at the same time emptying
the list of pages it was called with.  Such a situation means that it went
through all the pages on the list, some of which failed to be migrated.

If that happens, __alloc_contig_migrate_range()'s loop may finish without
"++tries == 5" ever being checked.  This in turn means that at the end
of the function, ret may have a positive value, which should be treated
as an error.

This patch changes __alloc_contig_migrate_range() so that the return
statement converts a positive ret value into the -EBUSY error.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 			}
 			tries = 0;
 		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
 			break;
 		}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 	}
 
 	putback_lru_pages(&cc.migratepages);
-	return ret > 0 ? 0 : ret;
+	return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RESEND PATCH 1/4 v3] mm: correct return value of migrate_pages() and migrate_huge_pages()
       [not found] <Yes>
  2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
  2012-07-17 12:33 ` [PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
@ 2012-07-27 17:55 ` Joonsoo Kim
  2012-07-27 17:55   ` [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
                     ` (2 more replies)
  2012-08-13 16:17 ` [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
                   ` (9 subsequent siblings)
  12 siblings, 3 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-27 17:55 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Christoph Lameter

migrate_pages() should return the number of pages not migrated, or an
error code. When unmap_and_move() returns -EAGAIN, the outer loop is
re-executed without reinitialising nr_failed, which leaves nr_failed
over-counted.

So this patch corrects it by reinitialising nr_failed in the outer loop.

migrate_huge_pages() is an identical case to migrate_pages().

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
---
[Patch 2/4]: add "Acked-by: Michal Nazarewicz <mina86@mina86.com>"
[Patch 3/4]: the commit log is changed according to Michal Nazarewicz's suggestion.
There is no other change from v2.
Just resending as a ping for Andrew.

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..f495c58 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from,
 
 	for(pass = 0; pass < 10 && retry; pass++) {
 		retry = 0;
+		nr_failed = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
@@ -1029,6 +1030,7 @@ int migrate_huge_pages(struct list_head *from,
 
 	for (pass = 0; pass < 10 && retry; pass++) {
 		retry = 0;
+		nr_failed = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-27 17:55 ` [RESEND PATCH 1/4 v3] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
@ 2012-07-27 17:55   ` Joonsoo Kim
  2012-07-27 20:57     ` Christoph Lameter
  2012-07-27 17:55   ` [RESEND PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
  2012-07-27 17:55   ` [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
  2 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-27 17:55 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Sasha Levin, Christoph Lameter

do_migrate_pages() can return the number of pages not migrated.
Because the migrate_pages() syscall returns this value directly, the
syscall may return the number of pages not migrated. In the failure
case, the migrate_pages() syscall should return an error value, so
change err to -EBUSY.

Additionally, correct the comment above do_migrate_pages().

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..0732729 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
  * Move pages between the two nodesets so as to preserve the physical
  * layout as much as possible.
  *
- * Returns the number of page that could not be moved.
+ * Returns error or the number of pages not migrated.
  */
 int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 		     const nodemask_t *to, int flags)
@@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
 
 	err = do_migrate_pages(mm, old, new,
 		capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
+	if (err > 0)
+		err = -EBUSY;
 
 	mmput(mm);
 out:
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RESEND PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range()
  2012-07-27 17:55 ` [RESEND PATCH 1/4 v3] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
  2012-07-27 17:55   ` [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
@ 2012-07-27 17:55   ` Joonsoo Kim
  2012-07-27 17:55   ` [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
  2 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-27 17:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Michal Nazarewicz,
	Marek Szyprowski, Minchan Kim, Christoph Lameter

migrate_pages() can return a positive value while at the same time emptying
the list of pages it was called with.  Such a situation means that it went
through all the pages on the list, some of which failed to be migrated.

If that happens, __alloc_contig_migrate_range()'s loop may finish without
"++tries == 5" ever being checked.  This in turn means that at the end
of the function, ret may have a positive value, which should be treated
as an error.

This patch changes __alloc_contig_migrate_range() so that the return
statement converts a positive ret value into the -EBUSY error.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 			}
 			tries = 0;
 		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
 			break;
 		}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 	}
 
 	putback_lru_pages(&cc.migratepages);
-	return ret > 0 ? 0 : ret;
+	return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-27 17:55 ` [RESEND PATCH 1/4 v3] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
  2012-07-27 17:55   ` [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
  2012-07-27 17:55   ` [RESEND PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
@ 2012-07-27 17:55   ` Joonsoo Kim
  2012-07-27 20:54     ` Christoph Lameter
  2 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-07-27 17:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Brice Goglin,
	Christoph Lameter, Minchan Kim

The move_pages() syscall may return success even when
do_move_page_to_node_array() returns a positive value, which means some
migrations failed.
This patch changes do_move_page_to_node_array() so that it does not
return a positive value, which fixes the problem.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Brice Goglin <brice@myri.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan@kernel.org>

diff --git a/mm/migrate.c b/mm/migrate.c
index f495c58..eeaf409 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1172,7 +1172,7 @@ set_status:
 	}
 
 	up_read(&mm->mmap_sem);
-	return err;
+	return err > 0 ? -EBUSY : err;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-27 17:55   ` [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
@ 2012-07-27 20:54     ` Christoph Lameter
  2012-07-28  6:09       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-07-27 20:54 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim

On Sat, 28 Jul 2012, Joonsoo Kim wrote:

> The move_pages() syscall may return success even when
> do_move_page_to_node_array() returns a positive value, which means some
> migrations failed.

Nope. It only means that the migration of some pages has failed. This may
still be considered successful for the app if it moves 10000 pages and one
fails.

This patch would break the move_pages() syscall because an error code
returned from do_move_page_to_node_array() will cause the status byte for
each page move to no longer be updated. Applications will not be able to
tell anymore which pages were successfully moved and which were not.
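
For example, a hypothetical caller that relies on the per-page status
array might look like this (the pages/nodes/status arrays and their
setup are assumed):

	#include <stdio.h>
	#include <numaif.h>	/* libnuma's move_pages(2) wrapper */

	static void move_and_report(int pid, unsigned long count,
				    void **pages, const int *nodes,
				    int *status)
	{
		unsigned long i;
		long ret = move_pages(pid, count, pages, nodes, status,
				      MPOL_MF_MOVE);

		/* Even when ret != 0, the caller inspects status[] to
		 * see which pages moved; if the kernel aborted with an
		 * error before copying status back, these entries
		 * would be stale. */
		for (i = 0; i < count; i++)
			if (status[i] < 0)
				fprintf(stderr, "page %lu: error %d\n",
					i, status[i]);
	}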

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-27 17:55   ` [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
@ 2012-07-27 20:57     ` Christoph Lameter
  2012-07-28  6:16       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-07-27 20:57 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, linux-kernel, linux-mm, Sasha Levin

On Sat, 28 Jul 2012, Joonsoo Kim wrote:

> do_migrate_pages() can return the number of pages not migrated.
> Because the migrate_pages() syscall returns this value directly, the
> syscall may return the number of pages not migrated. In the failure
> case, the migrate_pages() syscall should return an error value, so
> change err to -EBUSY.

Let's leave this alone. This would change the migrate_pages() semantics
because a successful move of N out of M pages would be marked as a
total failure although pages were in fact moved.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-27 20:54     ` Christoph Lameter
@ 2012-07-28  6:09       ` JoonSoo Kim
  2012-07-30 19:29         ` Christoph Lameter
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-28  6:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim

2012/7/28 Christoph Lameter <cl@linux.com>:
> On Sat, 28 Jul 2012, Joonsoo Kim wrote:
>
>> The move_pages() syscall may return success even when
>> do_move_page_to_node_array() returns a positive value, which means some
>> migrations failed.
>
> Nope. It only means that the migration of some pages has failed. This may
> still be considered successful for the app if it moves 10000 pages and one
> fails.
>
> This patch would break the move_pages() syscall because an error code
> returned from do_move_page_to_node_array() will cause the status byte for
> each page move to no longer be updated. Applications will not be able to
> tell anymore which pages were successfully moved and which were not.

In the case of returning non-zero, valid status is not required according
to the man page.
So, this patch would not break the move_pages() syscall.
But I agree that returning a positive value only means that the
migration of some pages has failed.
This is my mistake, so please drop this patch.
Thanks for the review.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-27 20:57     ` Christoph Lameter
@ 2012-07-28  6:16       ` JoonSoo Kim
  2012-07-30 19:30         ` Christoph Lameter
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-28  6:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, linux-mm, Sasha Levin

2012/7/28 Christoph Lameter <cl@linux.com>:
> On Sat, 28 Jul 2012, Joonsoo Kim wrote:
>
>> do_migrate_pages() can return the number of pages not migrated.
>> Because the migrate_pages() syscall returns this value directly,
>> the migrate_pages() syscall may return the number of pages not migrated.
>> In the failure case, the migrate_pages() syscall should return an error
>> value, so change err to -EBUSY.
>
> Let's leave this alone. This would change the migrate_pages() semantics,
> because a successful move of N out of M pages would be reported as a
> total failure even though pages were in fact moved.
>

Okay.
Then, do we need to fix the man page of the migrate_pages() syscall?
According to the man page, only returning 0 or -1 is valid.
Without this patch, it can return a positive value.
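
For illustration, a hedged sketch of what a caller currently has to cope
with, using libnuma's <numaif.h> wrapper; the node numbers are made up:

	#include <numaif.h>	/* migrate_pages(); link with -lnuma */
	#include <stdio.h>

	int main(void)
	{
		unsigned long old_nodes = 1UL << 0;	/* node 0 */
		unsigned long new_nodes = 1UL << 1;	/* node 1 */
		/* maxnode = highest node number in the masks plus one */
		long ret = migrate_pages(0, 2, &old_nodes, &new_nodes);

		if (ret < 0)
			perror("migrate_pages");	/* documented case */
		else if (ret > 0)	/* the undocumented positive return */
			printf("%ld pages not moved\n", ret);
		return 0;
	}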

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-28  6:09       ` JoonSoo Kim
@ 2012-07-30 19:29         ` Christoph Lameter
  2012-07-31  3:34           ` JoonSoo Kim
  2012-08-01  5:15           ` Michael Kerrisk
  0 siblings, 2 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-30 19:29 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim, Michael Kerrisk

On Sat, 28 Jul 2012, JoonSoo Kim wrote:

> 2012/7/28 Christoph Lameter <cl@linux.com>:
> > On Sat, 28 Jul 2012, Joonsoo Kim wrote:
> >
> >> The move_pages() syscall may return success in the case that
> >> do_move_page_to_node_array() returns a positive value, which means migration failed.
> >
> > Nope. It only means that the migration for some pages has failed. This may
> > still be considered successful for the app if it moves 10000 pages and one
> > failed.
> >
> > This patch would break the move_pages() syscall because an error code
> > returned from do_move_page_to_node_array() would cause the status entry for
> > each page move to no longer be updated. Applications would no longer be
> > able to tell which pages were successfully moved and which were not.
>
> In the case of a non-zero return, a valid status array is not required
> according to the man page.

I cannot find a statement like that in the man page. The return code
description is incorrect. It should say that it returns the number of
pages not moved, otherwise an error code (Michael, please fix the manpage).

> So this patch would not break the move_pages() syscall.

It changes the way the system call behaves right now.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall
  2012-07-28  6:16       ` JoonSoo Kim
@ 2012-07-30 19:30         ` Christoph Lameter
  0 siblings, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-30 19:30 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: akpm, linux-kernel, linux-mm, Sasha Levin, Michael Kerrisk

On Sat, 28 Jul 2012, JoonSoo Kim wrote:

> 2012/7/28 Christoph Lameter <cl@linux.com>:
> > On Sat, 28 Jul 2012, Joonsoo Kim wrote:
> >
> >> do_migrate_pages() can return the number of pages not migrated.
> >> Because the migrate_pages() syscall returns this value directly,
> >> the migrate_pages() syscall may return the number of pages not migrated.
> >> In the failure case, the migrate_pages() syscall should return an error
> >> value, so change err to -EBUSY.
> >
> > Let's leave this alone. This would change the migrate_pages() semantics,
> > because a successful move of N out of M pages would be reported as a
> > total failure even though pages were in fact moved.
> >
>
> Okay.
> Then, do we need to fix the man page of the migrate_pages() syscall?
> According to the man page, only returning 0 or -1 is valid.
> Without this patch, it can return a positive value.

Yes, the manpage needs updating to say that it can return the number of
pages not migrated.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-30 19:29         ` Christoph Lameter
@ 2012-07-31  3:34           ` JoonSoo Kim
  2012-07-31 14:04             ` Christoph Lameter
  2012-08-01  5:15           ` Michael Kerrisk
  1 sibling, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-07-31  3:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim, Michael Kerrisk

2012/7/31 Christoph Lameter <cl@linux.com>:
> On Sat, 28 Jul 2012, JoonSoo Kim wrote:
>
>> 2012/7/28 Christoph Lameter <cl@linux.com>:
>> > On Sat, 28 Jul 2012, Joonsoo Kim wrote:
>> >
>> >> The move_pages() syscall may return success in the case that
>> >> do_move_page_to_node_array() returns a positive value, which means migration failed.
>> >
>> > Nope. It only means that the migration for some pages has failed. This may
>> > still be considered successful for the app if it moves 10000 pages and one
>> > failed.
>> >
>> > This patch would break the move_pages() syscall because an error code
>> > returned from do_move_page_to_node_array() would cause the status entry for
>> > each page move to no longer be updated. Applications would no longer be
>> > able to tell which pages were successfully moved and which were not.
>>
>> In the case of a non-zero return, a valid status array is not required
>> according to the man page.
>
> I cannot find a statement like that in the man page. The return code
> description is incorrect. It should say that it returns the number of
> pages not moved, otherwise an error code (Michael, please fix the manpage).

In the man page, there is the following statement:
"status is an array of integers that return the status of each page.  The array
only contains valid values if move_pages() did not return an error."

And the current implementation of the move_pages() syscall doesn't return
the number of pages not moved; it just returns 0 when it encounters some
failed pages.
So, if you want to fix the man page, you should fix do_pages_move() first.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-31  3:34           ` JoonSoo Kim
@ 2012-07-31 14:04             ` Christoph Lameter
  0 siblings, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-07-31 14:04 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim, Michael Kerrisk

On Tue, 31 Jul 2012, JoonSoo Kim wrote:

> In the man page, there is the following statement:
> "status is an array of integers that return the status of each page.  The array
> only contains valid values if move_pages() did not return an error."

> And the current implementation of the move_pages() syscall doesn't return
> the number of pages not moved; it just returns 0 when it encounters some
> failed pages.
> So, if you want to fix the man page, you should fix do_pages_move() first.

Hmm... Yeah, actually that is sufficient, since the status is readily
obtainable from the status array. It would be better, though, if the
function returned the number of pages not moved in the same way as
migrate_pages() does.
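
Until then, that count can be derived in userspace from the status array.
A minimal sketch, assuming status[] was filled by a prior move_pages()
call as described in move_pages(2):

	/* status[i] is a node number on success, or a negative errno
	 * for each page that could not be moved */
	static unsigned long count_not_moved(const int *status,
					     unsigned long nr_pages)
	{
		unsigned long i, nr_failed = 0;

		for (i = 0; i < nr_pages; i++)
			if (status[i] < 0)
				nr_failed++;
		return nr_failed;
	}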




^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-07-30 19:29         ` Christoph Lameter
  2012-07-31  3:34           ` JoonSoo Kim
@ 2012-08-01  5:15           ` Michael Kerrisk
  2012-08-01 18:00             ` Christoph Lameter
  1 sibling, 1 reply; 146+ messages in thread
From: Michael Kerrisk @ 2012-08-01  5:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: JoonSoo Kim, akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Mon, Jul 30, 2012 at 9:29 PM, Christoph Lameter <cl@linux.com> wrote:
> On Sat, 28 Jul 2012, JoonSoo Kim wrote:
>
>> 2012/7/28 Christoph Lameter <cl@linux.com>:
>> > On Sat, 28 Jul 2012, Joonsoo Kim wrote:
>> >
>> >> The move_pages() syscall may return success in the case that
>> >> do_move_page_to_node_array() returns a positive value, which means migration failed.
>> >
>> > Nope. It only means that the migration for some pages has failed. This may
>> > still be considered successful for the app if it moves 10000 pages and one
>> > failed.
>> >
>> > This patch would break the move_pages() syscall because an error code
>> > returned from do_move_page_to_node_array() would cause the status entry for
>> > each page move to no longer be updated. Applications would no longer be
>> > able to tell which pages were successfully moved and which were not.
>>
>> In the case of a non-zero return, a valid status array is not required
>> according to the man page.
>
> I cannot find a statement like that in the man page. The return code
> description is incorrect. It should say that it returns the number of
> pages not moved, otherwise an error code (Michael, please fix the manpage).

Hi Christoph,

Is the patch below acceptable? (I've attached the complete page as well.)

See you in San Diego (?),

Michael

--- a/man2/migrate_pages.2
+++ b/man2/migrate_pages.2
@@ -29,7 +29,7 @@ migrate_pages \- move all pages in a process to
another set of nodes
 Link with \fI\-lnuma\fP.
 .SH DESCRIPTION
 .BR migrate_pages ()
-moves all pages of the process
+attempts to move all pages of the process
 .I pid
 that are in memory nodes
 .I old_nodes
@@ -87,7 +87,8 @@ privilege.
 .SH "RETURN VALUE"
 On success
 .BR migrate_pages ()
-returns zero.
+returns the number of pages that could not be moved
+(i.e., a return of zero means that all pages were successfully moved).
 On error, it returns \-1, and sets
 .I errno
 to indicate the error.

-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

[-- Attachment #2: migrate_pages.2 --]
[-- Type: application/octet-stream, Size: 4084 bytes --]

.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright 2009 Intel Corporation
.\"                Author: Andi Kleen
.\" Based on the move_pages manpage which was
.\" This manpage is Copyright (C) 2006 Silicon Graphics, Inc.
.\"                               Christoph Lameter
.\"
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.TH MIGRATE_PAGES 2 2012-08-01 "Linux" "Linux Programmer's Manual"
.SH NAME
migrate_pages \- move all pages in a process to another set of nodes
.SH SYNOPSIS
.nf
.B #include <numaif.h>
.sp
.BI "long migrate_pages(int " pid ", unsigned long " maxnode,
.BI "                   const unsigned long *" old_nodes,
.BI "                   const unsigned long *" new_nodes);
.fi
.sp
Link with \fI\-lnuma\fP.
.SH DESCRIPTION
.BR migrate_pages ()
attempts to move all pages of the process
.I pid
that are in memory nodes
.I old_nodes
to the memory nodes in
.IR new_nodes .
Pages not located in any node in
.I old_nodes
will not be migrated.
As far as possible,
the kernel maintains the relative topology relationship inside
.I old_nodes
during the migration to
.IR new_nodes .

The
.I old_nodes
and
.I new_nodes
arguments are pointers to bit masks of node numbers, with up to
.I maxnode
bits in each mask.
These masks are maintained as arrays of unsigned
.I long
integers (in the last
.I long
integer, the bits beyond those specified by
.I maxnode
are ignored).
The
.I maxnode
argument is the maximum node number in the bit mask plus one (this is the same
as in
.BR mbind (2),
but different from
.BR select (2)).

The
.I pid
argument is the ID of the process whose pages are to be moved.
To move pages in another process,
the caller must be privileged
.RB ( CAP_SYS_NICE )
or the real or effective user ID of the calling process must match the
real or saved-set user ID of the target process.
If
.I pid
is 0, then
.BR migrate_pages ()
moves pages of the calling process.

Pages shared with another process will only be moved if the initiating
process has the
.B CAP_SYS_NICE
privilege.
.SH "RETURN VALUE"
On success
.BR migrate_pages ()
returns the number of pages that could not be moved
(i.e., a return of zero means that all pages were successfully moved).
On error, it returns \-1, and sets
.I errno
to indicate the error.
.SH ERRORS
.TP
.B EPERM
Insufficient privilege
.RB ( CAP_SYS_NICE )
to move pages of the process specified by
.IR pid ,
or insufficient privilege
.RB ( CAP_SYS_NICE )
to access the specified target nodes.
.TP
.B ESRCH
No process matching
.I pid
could be found.
.\" FIXME There are other errors
.SH VERSIONS
The
.BR migrate_pages ()
system call first appeared on Linux in version 2.6.16.
.SH CONFORMING TO
This system call is Linux-specific.
.SH "NOTES"
For information on library support, see
.BR numa (7).

Use
.BR get_mempolicy (2)
with the
.B MPOL_F_MEMS_ALLOWED
flag to obtain the set of nodes that are allowed by
the calling process's cpuset.
Note that this information is subject to change at any
time by manual or automatic reconfiguration of the cpuset.

Use of
.BR migrate_pages ()
may result in pages whose location
(node) violates the memory policy established for the
specified addresses (see
.BR mbind (2))
and/or the specified process (see
.BR set_mempolicy (2)).
That is, memory policy does not constrain the destination
nodes used by
.BR migrate_pages ().

The
.I <numaif.h>
header is not included with glibc, but requires installing
.I libnuma-devel
or a similar package.
.SH "SEE ALSO"
.BR get_mempolicy (2),
.BR mbind (2),
.BR set_mempolicy (2),
.BR numa (3),
.BR numa_maps (5),
.BR cpuset (7),
.BR numa (7),
.BR migratepages (8),
.BR numa_stat (8);
.br
the kernel source file
.IR Documentation/vm/page_migration .

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-08-01  5:15           ` Michael Kerrisk
@ 2012-08-01 18:00             ` Christoph Lameter
  2012-08-02  5:52               ` Michael Kerrisk
  0 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-08-01 18:00 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: JoonSoo Kim, akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim

On Wed, 1 Aug 2012, Michael Kerrisk wrote:

> Is the patch below acceptable? (I've attached the complete page as well.)

Yes looks good.

> See you in San Diego (?),

Yup. I will be there too.



^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall
  2012-08-01 18:00             ` Christoph Lameter
@ 2012-08-02  5:52               ` Michael Kerrisk
  0 siblings, 0 replies; 146+ messages in thread
From: Michael Kerrisk @ 2012-08-02  5:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: JoonSoo Kim, akpm, linux-kernel, linux-mm, Brice Goglin, Minchan Kim

On Wed, Aug 1, 2012 at 8:00 PM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 1 Aug 2012, Michael Kerrisk wrote:
>
>> Is the patch below acceptable? (I've attached the complete page as well.)
>
> Yes looks good.

Thanks for checking it!

>> See you in San Diego (?),
>
> Yup. I will be there too.

See you then!

Cheers,

Michael

-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq
       [not found] <Yes>
                   ` (2 preceding siblings ...)
  2012-07-27 17:55 ` [RESEND PATCH 1/4 v3] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
@ 2012-08-13 16:17 ` Joonsoo Kim
  2012-08-13 16:17   ` [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on() Joonsoo Kim
                     ` (3 more replies)
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
                   ` (8 subsequent siblings)
  12 siblings, 4 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-13 16:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker_pool
for HIGHPRI. Although there is an NR_WORKER_POOLS enum value which
represents the number of pools, the definition of the worker_pool array in
gcwq doesn't use it.
Using it makes the code robust and prevents future mistakes.
So change the code to use this enum value.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 692d976..188eef8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -183,7 +183,8 @@ struct global_cwq {
 	struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];
 						/* L: hash of busy workers */
 
-	struct worker_pool	pools[2];	/* normal and highpri pools */
+	struct worker_pool	pools[NR_WORKER_POOLS];
+						/* normal and highpri pools */
 
 	wait_queue_head_t	rebind_hold;	/* rebind hold wait */
 } ____cacheline_aligned_in_smp;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on()
  2012-08-13 16:17 ` [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
@ 2012-08-13 16:17   ` Joonsoo Kim
  2012-08-13 16:32     ` Tejun Heo
  2012-08-13 16:17   ` [PATCH 3/5] workqueue: introduce system_highpri_wq Joonsoo Kim
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-13 16:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

We assign a cpu id into the work struct in queue_delayed_work_on().
In the current implementation, when the work comes in for the first time,
the currently running cpu id is assigned.
If we call queue_delayed_work_on() for CPU A while running on CPU B,
__queue_work(), invoked from delayed_work_timer_fn(), goes into a
sub-optimal path in the WQ_NON_REENTRANT case.
Changing this to the cpu argument prevents going into the sub-optimal path.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 188eef8..bc5c5e1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1158,6 +1158,8 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 
 			if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
 				lcpu = gcwq->cpu;
+			else if (cpu >= 0)
+				lcpu = cpu;
 			else
 				lcpu = raw_smp_processor_id();
 		} else
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 3/5] workqueue: introduce system_highpri_wq
  2012-08-13 16:17 ` [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
  2012-08-13 16:17   ` [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on() Joonsoo Kim
@ 2012-08-13 16:17   ` Joonsoo Kim
  2012-08-13 16:17   ` [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
  2012-08-13 16:17   ` [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
  3 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-13 16:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker pool
for HIGHPRI. When we handle busy workers for a gcwq, each can be a normal
worker or a highpri worker. But we don't consider this difference in
rebind_workers(); we just use system_wq even for highpri workers. This
makes a mismatch between cwq->pool and worker->pool.

It doesn't cause an error in the current implementation, but it could in
the future. So introduce system_highpri_wq in order to use the proper cwq
for highpri workers in rebind_workers(). The following patch fixes this
issue properly.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bc5c5e1..f69f094 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -269,12 +269,14 @@ struct workqueue_struct {
 };
 
 struct workqueue_struct *system_wq __read_mostly;
+struct workqueue_struct *system_highpri_wq __read_mostly;
 struct workqueue_struct *system_long_wq __read_mostly;
 struct workqueue_struct *system_nrt_wq __read_mostly;
 struct workqueue_struct *system_unbound_wq __read_mostly;
 struct workqueue_struct *system_freezable_wq __read_mostly;
 struct workqueue_struct *system_nrt_freezable_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_wq);
+EXPORT_SYMBOL_GPL(system_highpri_wq);
 EXPORT_SYMBOL_GPL(system_long_wq);
 EXPORT_SYMBOL_GPL(system_nrt_wq);
 EXPORT_SYMBOL_GPL(system_unbound_wq);
@@ -3749,6 +3751,7 @@ static int __init init_workqueues(void)
 	}
 
 	system_wq = alloc_workqueue("events", 0, 0);
+	system_highpri_wq = alloc_workqueue("events_highpri", 0, 0);
 	system_long_wq = alloc_workqueue("events_long", 0, 0);
 	system_nrt_wq = alloc_workqueue("events_nrt", WQ_NON_REENTRANT, 0);
 	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
@@ -3757,8 +3760,8 @@ static int __init init_workqueues(void)
 					      WQ_FREEZABLE, 0);
 	system_nrt_freezable_wq = alloc_workqueue("events_nrt_freezable",
 			WQ_NON_REENTRANT | WQ_FREEZABLE, 0);
-	BUG_ON(!system_wq || !system_long_wq || !system_nrt_wq ||
-	       !system_unbound_wq || !system_freezable_wq ||
+	BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
+		!system_nrt_wq || !system_unbound_wq || !system_freezable_wq ||
 		!system_nrt_freezable_wq);
 	return 0;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-13 16:17 ` [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
  2012-08-13 16:17   ` [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on() Joonsoo Kim
  2012-08-13 16:17   ` [PATCH 3/5] workqueue: introduce system_highpri_wq Joonsoo Kim
@ 2012-08-13 16:17   ` Joonsoo Kim
  2012-08-13 16:34     ` Tejun Heo
  2012-08-13 16:17   ` [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
  3 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-13 16:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

In rebind_workers(), we insert a rebind work item for each busy worker.
Currently, we use only system_wq for this. This creates a possible error
situation, as there is a mismatch between cwq->pool and worker->pool.

To prevent this, we should use system_highpri_wq for highpri workers
so that these match. This patch implements it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f69f094..e0e1d41 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -489,6 +489,11 @@ static int worker_pool_pri(struct worker_pool *pool)
 	return pool - pool->gcwq->pools;
 }
 
+static struct workqueue_struct *get_pool_system_wq(struct worker_pool *pool)
+{
+	return worker_pool_pri(pool) ? system_highpri_wq : system_wq;
+}
+
 static struct global_cwq *get_gcwq(unsigned int cpu)
 {
 	if (cpu != WORK_CPU_UNBOUND)
@@ -1450,9 +1455,10 @@ retry:
 
 		/* wq doesn't matter, use the default one */
 		debug_work_activate(rebind_work);
-		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
-			    worker->scheduled.next,
-			    work_color_to_flags(WORK_NO_COLOR));
+		insert_work(
+			get_cwq(gcwq->cpu, get_pool_system_wq(worker->pool)),
+			rebind_work, worker->scheduled.next,
+			work_color_to_flags(WORK_NO_COLOR));
 	}
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work
  2012-08-13 16:17 ` [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
                     ` (2 preceding siblings ...)
  2012-08-13 16:17   ` [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
@ 2012-08-13 16:17   ` Joonsoo Kim
  2012-08-13 16:36     ` Tejun Heo
  3 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-13 16:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

To speed up cpu-down processing, use system_highpri_wq.
As the scheduling priority of its workers is higher than that of system_wq,
and it is not contended by other normal work items on this cpu, work on it
is processed faster than on system_wq.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e0e1d41..5d50955 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3501,7 +3501,7 @@ static int __devinit workqueue_cpu_down_callback(struct notifier_block *nfb,
 	case CPU_DOWN_PREPARE:
 		/* unbinding should happen on the local CPU */
 		INIT_WORK_ONSTACK(&unbind_work, gcwq_unbind_fn);
-		schedule_work_on(cpu, &unbind_work);
+		queue_work_on(cpu, system_highpri_wq, &unbind_work);
 		flush_work(&unbind_work);
 		break;
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on()
  2012-08-13 16:17   ` [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on() Joonsoo Kim
@ 2012-08-13 16:32     ` Tejun Heo
  2012-08-13 16:54       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-13 16:32 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

Hello,

On Tue, Aug 14, 2012 at 01:17:49AM +0900, Joonsoo Kim wrote:
> We assign a cpu id into the work struct in queue_delayed_work_on().
> In the current implementation, when the work comes in for the first time,
> the currently running cpu id is assigned.
> If we call queue_delayed_work_on() for CPU A while running on CPU B,
> __queue_work(), invoked from delayed_work_timer_fn(), goes into a
> sub-optimal path in the WQ_NON_REENTRANT case.
> Changing this to the cpu argument prevents going into the sub-optimal path.

Which part is suboptimal?  Also, what if @cpu is WQ_UNBOUND?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-13 16:17   ` [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
@ 2012-08-13 16:34     ` Tejun Heo
  2012-08-13 16:57       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-13 16:34 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

On Tue, Aug 14, 2012 at 01:17:51AM +0900, Joonsoo Kim wrote:
> In rebind_workers(), we insert a rebind work item for each busy worker.
> Currently, we use only system_wq for this. This creates a possible error
> situation, as there is a mismatch between cwq->pool and worker->pool.
> 
> To prevent this, we should use system_highpri_wq for highpri workers
> so that these match. This patch implements it.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index f69f094..e0e1d41 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -489,6 +489,11 @@ static int worker_pool_pri(struct worker_pool *pool)
>  	return pool - pool->gcwq->pools;
>  }
>  
> +static struct workqueue_struct *get_pool_system_wq(struct worker_pool *pool)
> +{
> +	return worker_pool_pri(pool) ? system_highpri_wq : system_wq;
> +}
> +
>  static struct global_cwq *get_gcwq(unsigned int cpu)
>  {
>  	if (cpu != WORK_CPU_UNBOUND)
> @@ -1450,9 +1455,10 @@ retry:
>  
>  		/* wq doesn't matter, use the default one */
>  		debug_work_activate(rebind_work);
> -		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
> -			    worker->scheduled.next,
> -			    work_color_to_flags(WORK_NO_COLOR));
> +		insert_work(
> +			get_cwq(gcwq->cpu, get_pool_system_wq(worker->pool)),
> +			rebind_work, worker->scheduled.next,
> +			work_color_to_flags(WORK_NO_COLOR));

I think it would be better to just opencode system_wq selection in
rebind_workers().

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work
  2012-08-13 16:17   ` [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
@ 2012-08-13 16:36     ` Tejun Heo
  2012-08-13 17:02       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-13 16:36 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

On Tue, Aug 14, 2012 at 01:17:52AM +0900, Joonsoo Kim wrote:
> To speed up cpu-down processing, use system_highpri_wq.
> As the scheduling priority of its workers is higher than that of system_wq,
> and it is not contended by other normal work items on this cpu, work on it
> is processed faster than on system_wq.

Is this from an actual workload?  ie. do you have a test case where
this matters?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on()
  2012-08-13 16:32     ` Tejun Heo
@ 2012-08-13 16:54       ` JoonSoo Kim
  2012-08-13 17:03         ` Tejun Heo
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-13 16:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

2012/8/14 Tejun Heo <tj@kernel.org>:
> Hello,
>
> On Tue, Aug 14, 2012 at 01:17:49AM +0900, Joonsoo Kim wrote:
>> We assign a cpu id into the work struct in queue_delayed_work_on().
>> In the current implementation, when the work comes in for the first time,
>> the currently running cpu id is assigned.
>> If we call queue_delayed_work_on() for CPU A while running on CPU B,
>> __queue_work(), invoked from delayed_work_timer_fn(), goes into a
>> sub-optimal path in the WQ_NON_REENTRANT case.
>> Changing this to the cpu argument prevents going into the sub-optimal path.
>
> Which part is suboptimal?  Also, what if @cpu is WQ_UNBOUND?

Hi.
I am thinking of the following scenario.

wq = WQ_NON_REENTRANT.
queue_delayed_work_on(CPU B) is invoked on CPU A, so lcpu = CPU A, cpu = CPU B.

In this case, we call add_timer_on(CPU B), then delayed_work_timer_fn()
is invoked on CPU B.
delayed_work_timer_fn() calls __queue_work(), and then the following
comparison returns true!

                gcwq = get_gcwq(cpu);
                if (wq->flags & WQ_NON_REENTRANT &&
                    (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {

I think that if we assign cpu to lcpu, the above comparison returns
false, saving some overhead.
Is there anything I am missing?
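
For concreteness, a minimal sketch of the calling pattern in question,
assuming kernel context; the workqueue and work item names are
hypothetical:

	#include <linux/workqueue.h>
	#include <linux/jiffies.h>

	/* my_wq is assumed to have been created with WQ_NON_REENTRANT,
	 * and INIT_DELAYED_WORK(&my_dwork, fn) done elsewhere. */
	static struct workqueue_struct *my_wq;
	static struct delayed_work my_dwork;

	/* Called on CPU A with cpu_b != A.  work->data keeps CPU A (the
	 * queueing cpu), so when the timer fires on CPU B, __queue_work()
	 * sees last_gcwq != gcwq and takes the re-check path quoted above. */
	static void kick_on(int cpu_b)
	{
		queue_delayed_work_on(cpu_b, my_wq, &my_dwork, HZ);
	}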

And, do you mean @cpu is WORK_CPU_UNBOUND?

Thanks.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-13 16:34     ` Tejun Heo
@ 2012-08-13 16:57       ` JoonSoo Kim
  2012-08-13 17:05         ` Tejun Heo
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-13 16:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

> I think it would be better to just opencode system_wq selection in
> rebind_workers().

Sorry for my poor English skills.
Could you elaborate on what "opencode the system_wq selection" means?

Thanks!

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work
  2012-08-13 16:36     ` Tejun Heo
@ 2012-08-13 17:02       ` JoonSoo Kim
  2012-08-13 17:07         ` Tejun Heo
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-13 17:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

2012/8/14 Tejun Heo <tj@kernel.org>:
> On Tue, Aug 14, 2012 at 01:17:52AM +0900, Joonsoo Kim wrote:
>> To speed cpu down processing up, use system_highpri_wq.
>> As scheduling priority of workers on it is higher than system_wq and
>> it is not contended by other normal works on this cpu, work on it
>> is processed faster than system_wq.
>
> Is this from an actual workload?  ie. do you have a test case where
> this matters?

No, it is not from an actual workload.
I got this idea suddenly while implementing the other patches.
I will run a test for this tomorrow.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on()
  2012-08-13 16:54       ` JoonSoo Kim
@ 2012-08-13 17:03         ` Tejun Heo
  2012-08-13 17:43           ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-13 17:03 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: linux-kernel

Hello,

On Tue, Aug 14, 2012 at 01:54:22AM +0900, JoonSoo Kim wrote:
> wq = WQ_NON_REENTRANT.
> queue_delayed_work_on(CPU B) is invoked on CPU A, so lcpu = CPU A, cpu = CPU B.
> 
> In this case, we call add_timer_on(CPU B), then delayed_work_timer_fn()
> is invoked on CPU B.
> delayed_work_timer_fn() calls __queue_work(), and then the following
> comparison returns true!
> 
>                 gcwq = get_gcwq(cpu);
>                 if (wq->flags & WQ_NON_REENTRANT &&
>                     (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
> 
> I think that if we assign cpu to lcpu, the above comparison returns
> false, saving some overhead.
> Is there anything I am missing?

Nope, that sounds correct to me.

> And, do you mean @cpu is WORK_CPU_UNBOUND?

@cpu could be WORK_CPU_UNBOUND at that point.  The timer will be added
to the local CPU but @work->data would be pointing to WORK_CPU_UNBOUND,
again triggering the condition.  Given that @cpu being
WORK_CPU_UNBOUND is far more common than an actual CPU number, the
patch would actually increase spurious nrt lookups.  The right thing
to do is probably setting cpu to raw_smp_processor_id() beforehand.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-13 16:57       ` JoonSoo Kim
@ 2012-08-13 17:05         ` Tejun Heo
  2012-08-13 17:45           ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-13 17:05 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: linux-kernel

On Tue, Aug 14, 2012 at 01:57:10AM +0900, JoonSoo Kim wrote:
> > I think it would be better to just opencode system_wq selection in
> > rebind_workers().
> 
> Sorry for my poor English skills.
> Could you elaborate on what "opencode the system_wq selection" means?

Dropping get_pool_system_wq() and putting the logic inside
rebind_workers() directly.  I tend to dislike short helpers which are
used only once.  They tend to obfuscate while not really contributing
to anything else.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work
  2012-08-13 17:02       ` JoonSoo Kim
@ 2012-08-13 17:07         ` Tejun Heo
  2012-08-13 17:52           ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-13 17:07 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: linux-kernel

Hello,

On Tue, Aug 14, 2012 at 02:02:31AM +0900, JoonSoo Kim wrote:
> 2012/8/14 Tejun Heo <tj@kernel.org>:
> > On Tue, Aug 14, 2012 at 01:17:52AM +0900, Joonsoo Kim wrote:
> > Is this from an actual workload?  ie. do you have a test case where
> > this matters?
> 
> No, it is not from an actual workload.
> I got this idea suddenly while implementing the other patches.
> I will run a test for this tomorrow.

I don't think it's a bad idea, and I guess this wouldn't make much
difference under most workloads.  I was mostly wondering whether this
came from some crazy phone powersave optimization.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on()
  2012-08-13 17:03         ` Tejun Heo
@ 2012-08-13 17:43           ` JoonSoo Kim
  2012-08-13 18:00             ` Tejun Heo
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-13 17:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

>> And, do you mean @cpu is WORK_CPU_UNBOUND?
>
> @cpu could be WORK_CPU_UNBOUND at that point.  The timer will be added
> to the local CPU but @work->data would be pointing to WORK_CPU_UNBOUND,
> again triggering the condition.  Given that @cpu being
> WORK_CPU_UNBOUND is far more common than an actual CPU number, the
> patch would actually increase spurious nrt lookups.  The right thing
> to do is probably setting cpu to raw_smp_processor_id() beforehand.

I get your point.
Thanks for the kind explanation.

The following is an alternative implementation for this.
I think this is too rare a case for the change to help in any real workload.
But what do you think?

@@ -1156,7 +1156,9 @@ int queue_delayed_work_on(int cpu, struct
workqueue_struct *wq,
                if (!(wq->flags & WQ_UNBOUND)) {
                        struct global_cwq *gcwq = get_work_gcwq(work);

-                       if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
+                       if (!gcwq)
+                               lcpu = cpu;
+                       else if (gcwq->cpu != WORK_CPU_UNBOUND)
                                lcpu = gcwq->cpu;
                        else
                                lcpu = raw_smp_processor_id();

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-13 17:05         ` Tejun Heo
@ 2012-08-13 17:45           ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-13 17:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

2012/8/14 Tejun Heo <tj@kernel.org>:
> On Tue, Aug 14, 2012 at 01:57:10AM +0900, JoonSoo Kim wrote:
>> > I think it would be better to just opencode system_wq selection in
>> > rebind_workers().
>>
>> Sorry for my poor English skills.
>> Could you elaborate on what "opencode the system_wq selection" means?
>
> Dropping get_pool_system_wq() and putting the logic inside
> rebind_workers() directly.  I tend to dislike short helpers which are
> used only once.  They tend to obfuscate while not really contributing
> to anything else.

Okay!
I will send a new version adopting your recommendation tomorrow.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work
  2012-08-13 17:07         ` Tejun Heo
@ 2012-08-13 17:52           ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-13 17:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

2012/8/14 Tejun Heo <tj@kernel.org>:
> Hello,
>
> On Tue, Aug 14, 2012 at 02:02:31AM +0900, JoonSoo Kim wrote:
>> 2012/8/14 Tejun Heo <tj@kernel.org>:
>> > On Tue, Aug 14, 2012 at 01:17:52AM +0900, Joonsoo Kim wrote:
>> > Is this from an actual workload?  ie. do you have a test case where
>> > this matters?
>>
>> No, it is not from an actual workload.
>> I got this idea suddenly while implementing the other patches.
>> I will run a test for this tomorrow.
>
> I don't think it's a bad idea, and I guess this wouldn't make much
> difference under most workloads.  I was mostly wondering whether this
> came from some crazy phone powersave optimization.

Okay.
It is not from an actual workload,
but I do intend to optimize some crazy phone powersave :)

Thanks for the comments.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on()
  2012-08-13 17:43           ` JoonSoo Kim
@ 2012-08-13 18:00             ` Tejun Heo
  2012-08-14 18:04               ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-13 18:00 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: linux-kernel

Hello,

On Tue, Aug 14, 2012 at 02:43:28AM +0900, JoonSoo Kim wrote:
> >> And, do you mean @cpu is WORK_CPU_UNBOUND?
> >
> > @cpu could be WORK_CPU_UNBOUND at that point.  The timer will be added
> > to the local CPU but @work->data would be pointing to WORK_CPU_UNBOUND,
> > again triggering the condition.  Given that @cpu being
> > WORK_CPU_UNBOUND is far more common than an actual CPU number, the
> > patch would actually increase spurious nrt lookups.  The right thing
> > to do is probably setting cpu to raw_smp_processor_id() beforehand.
> 
> I get your point.
> Thanks for the kind explanation.
> 
> The following is an alternative implementation for this.
> I think this is too rare a case for the change to help in any real workload.
> But what do you think?
> 
> @@ -1156,7 +1156,9 @@ int queue_delayed_work_on(int cpu, struct
> workqueue_struct *wq,
>                 if (!(wq->flags & WQ_UNBOUND)) {
>                         struct global_cwq *gcwq = get_work_gcwq(work);
> 
> -                       if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
> +                       if (!gcwq)
> +                               lcpu = cpu;
> +                       else if (gcwq->cpu != WORK_CPU_UNBOUND)
>                                 lcpu = gcwq->cpu;
>                         else
>                                 lcpu = raw_smp_processor_id();

Why not just do

	if (cpu == WORK_CPU_UNBOUND)
		cpu = raw_smp_processor_id();

	if (!(wq->flags...) {
		...
		if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
			lcpu = gcwq->cpu;
		else
			lcpu = cpu;
	}
	...

	add_timer_on(timer, cpu);

Also, can you please base the patches on top of the following git
branch?

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.7

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on()
  2012-08-13 18:00             ` Tejun Heo
@ 2012-08-14 18:04               ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-14 18:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

> Why not just do
>
>         if (cpu == WORK_CPU_UNBOUND)
>                 cpu = raw_smp_processor_id();
>
>         if (!(wq->flags...) {
>                 ...
>                 if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
>                         lcpu = gcwq->cpu;
>                 else
>                         lcpu = cpu;
>         }
>         ...
>
>         add_timer_on(timer, cpu);

I looked at the code again more deeply and found that dwork->cpu is used
for tracing information.
If we make the above change, the tracing information will be biased.

> Also, can you please base the patches on top of the following git
> branch?
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.7

Sure!

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH v2 0/6] system_highpri_wq
       [not found] <Yes>
                   ` (3 preceding siblings ...)
  2012-08-13 16:17 ` [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
@ 2012-08-14 18:10 ` Joonsoo Kim
  2012-08-14 18:10   ` [PATCH v2 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
                     ` (5 more replies)
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
                   ` (7 subsequent siblings)
  12 siblings, 6 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-14 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Changes from v1
[ALL] Add cover letter
[ALL] Rebase on git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.7

[1/6] No change
[2/6] Newly added to this series.
[3/6] Add a changelog to clarify what the patch does,
      Add logic to handle the case "@cpu is WORK_CPU_UNBOUND"
[4/6] Fix awesome mistake: Set WQ_HIGHPRI for system_highpri_wq
[5/6] Adopt Tejun's comment about selecting system_wq
[6/6] No change

This patchset introduces system_highpri_wq
in order to use the proper cwq for highpri workers.

The first 3 patches are not related to that purpose;
they just fix assorted issues.
The last 3 patches serve that purpose.

Joonsoo Kim (6):
  workqueue: use enum value to set array size of pools in gcwq
  workqueue: correct req_cpu in trace_workqueue_queue_work()
  workqueue: change value of lcpu in __queue_delayed_work_on()
  workqueue: introduce system_highpri_wq
  workqueue: use system_highpri_wq for highpri workers in
    rebind_workers()
  workqueue: use system_highpri_wq for unbind_work

 kernel/workqueue.c |   34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH v2 1/6] workqueue: use enum value to set array size of pools in gcwq
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
@ 2012-08-14 18:10   ` Joonsoo Kim
  2012-08-14 18:10   ` [PATCH v2 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work() Joonsoo Kim
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-14 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker_pool
for HIGHPRI. Although there is an NR_WORKER_POOLS enum value which
represents the number of pools, the definition of the worker_pool array in
gcwq doesn't use it.
Using it makes the code robust and prevents future mistakes.
So change the code to use this enum value.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4fef952..49d8f4a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -183,7 +183,8 @@ struct global_cwq {
 	struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];
 						/* L: hash of busy workers */
 
-	struct worker_pool	pools[2];	/* normal and highpri pools */
+	struct worker_pool	pools[NR_WORKER_POOLS];
+						/* normal and highpri pools */
 
 	wait_queue_head_t	rebind_hold;	/* rebind hold wait */
 } ____cacheline_aligned_in_smp;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work()
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
  2012-08-14 18:10   ` [PATCH v2 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
@ 2012-08-14 18:10   ` Joonsoo Kim
  2012-08-14 18:34     ` Tejun Heo
  2012-08-14 18:10   ` [PATCH v2 3/6] workqueue: change value of lcpu in __queue_delayed_work_on() Joonsoo Kim
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-14 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

When we trace workqueue_queue_work(), it records the requested cpu.
But if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
the requested cpu is replaced by the local cpu.
In the @wq->flags & WQ_UNBOUND case, this replacement does not occur,
so it is reasonable to correct the bound case as well.

Use a temporary local variable to store the local cpu.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 49d8f4a..6a17ab0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1198,6 +1198,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 	struct cpu_workqueue_struct *cwq;
 	struct list_head *worklist;
 	unsigned int work_flags;
+	unsigned int lcpu;
 
 	/*
 	 * While a work item is PENDING && off queue, a task trying to
@@ -1219,7 +1220,9 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		struct global_cwq *last_gcwq;
 
 		if (cpu == WORK_CPU_UNBOUND)
-			cpu = raw_smp_processor_id();
+			lcpu = raw_smp_processor_id();
+		else
+			lcpu = cpu;
 
 		/*
 		 * It's multi cpu.  If @wq is non-reentrant and @work
@@ -1227,7 +1230,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		 * be running there, in which case the work needs to
 		 * be queued on that cpu to guarantee non-reentrance.
 		 */
-		gcwq = get_gcwq(cpu);
+		gcwq = get_gcwq(lcpu);
 		if (wq->flags & WQ_NON_REENTRANT &&
 		    (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
 			struct worker *worker;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 3/6] workqueue: change value of lcpu in __queue_delayed_work_on()
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
  2012-08-14 18:10   ` [PATCH v2 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
  2012-08-14 18:10   ` [PATCH v2 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work() Joonsoo Kim
@ 2012-08-14 18:10   ` Joonsoo Kim
  2012-08-14 19:00     ` Tejun Heo
  2012-08-14 18:10   ` [PATCH v2 4/6] workqueue: introduce system_highpri_wq Joonsoo Kim
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-14 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

We assign a cpu id into the work struct's data field in __queue_delayed_work_on().
In the current implementation, when the work comes in for the first time,
the currently running cpu id is assigned.
If we call __queue_delayed_work_on() for CPU A while running on CPU B,
__queue_work(), invoked from delayed_work_timer_fn(), goes into
the following sub-optimal path in the WQ_NON_REENTRANT case.

	gcwq = get_gcwq(cpu);
	if (wq->flags & WQ_NON_REENTRANT &&
		(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {

Set lcpu to @cpu, and then reset lcpu to the local cpu if lcpu is
WORK_CPU_UNBOUND.
This is sufficient to prevent going into the sub-optimal path.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6a17ab0..f55ac26 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1358,9 +1358,10 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 	if (!(wq->flags & WQ_UNBOUND)) {
 		struct global_cwq *gcwq = get_work_gcwq(work);
 
-		if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
+		lcpu = cpu;
+		if (gcwq)
 			lcpu = gcwq->cpu;
-		else
+		if (lcpu == WORK_CPU_UNBOUND)
 			lcpu = raw_smp_processor_id();
 	} else {
 		lcpu = WORK_CPU_UNBOUND;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 4/6] workqueue: introduce system_highpri_wq
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
                     ` (2 preceding siblings ...)
  2012-08-14 18:10   ` [PATCH v2 3/6] workqueue: change value of lcpu in __queue_delayed_work_on() Joonsoo Kim
@ 2012-08-14 18:10   ` Joonsoo Kim
  2012-08-14 18:10   ` [PATCH v2 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
  2012-08-14 18:10   ` [PATCH v2 6/6] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
  5 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-14 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker pool
for HIGHPRI. When we handle busy workers for a gcwq, each can be a normal
worker or a highpri worker. But we don't consider this difference in
rebind_workers(); we just use system_wq even for highpri workers. This
makes a mismatch between cwq->pool and worker->pool.

It doesn't cause an error in the current implementation, but it could in
the future. So introduce system_highpri_wq in order to use the proper cwq
for highpri workers in rebind_workers(). The following patch fixes this
issue properly.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f55ac26..470b0eb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -269,12 +269,14 @@ struct workqueue_struct {
 };
 
 struct workqueue_struct *system_wq __read_mostly;
+struct workqueue_struct *system_highpri_wq __read_mostly;
 struct workqueue_struct *system_long_wq __read_mostly;
 struct workqueue_struct *system_nrt_wq __read_mostly;
 struct workqueue_struct *system_unbound_wq __read_mostly;
 struct workqueue_struct *system_freezable_wq __read_mostly;
 struct workqueue_struct *system_nrt_freezable_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_wq);
+EXPORT_SYMBOL_GPL(system_highpri_wq);
 EXPORT_SYMBOL_GPL(system_long_wq);
 EXPORT_SYMBOL_GPL(system_nrt_wq);
 EXPORT_SYMBOL_GPL(system_unbound_wq);
@@ -3925,6 +3927,7 @@ static int __init init_workqueues(void)
 	}
 
 	system_wq = alloc_workqueue("events", 0, 0);
+	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
 	system_long_wq = alloc_workqueue("events_long", 0, 0);
 	system_nrt_wq = alloc_workqueue("events_nrt", WQ_NON_REENTRANT, 0);
 	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
@@ -3933,8 +3936,8 @@ static int __init init_workqueues(void)
 					      WQ_FREEZABLE, 0);
 	system_nrt_freezable_wq = alloc_workqueue("events_nrt_freezable",
 			WQ_NON_REENTRANT | WQ_FREEZABLE, 0);
-	BUG_ON(!system_wq || !system_long_wq || !system_nrt_wq ||
-	       !system_unbound_wq || !system_freezable_wq ||
+	BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
+		!system_nrt_wq || !system_unbound_wq || !system_freezable_wq ||
 		!system_nrt_freezable_wq);
 	return 0;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
                     ` (3 preceding siblings ...)
  2012-08-14 18:10   ` [PATCH v2 4/6] workqueue: introduce system_highpri_wq Joonsoo Kim
@ 2012-08-14 18:10   ` Joonsoo Kim
  2012-08-14 18:29     ` Tejun Heo
  2012-08-14 18:10   ` [PATCH v2 6/6] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
  5 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-14 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

In rebind_workers(), we insert a rebind work item for each busy worker.
Currently, we use only system_wq for this. This creates a possible error
situation, as there is a mismatch between cwq->pool and worker->pool.

To prevent this, we should use system_highpri_wq for highpri workers
so that these match. This patch implements it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 470b0eb..4c5733c1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1738,6 +1738,7 @@ retry:
 	/* rebind busy workers */
 	for_each_busy_worker(worker, i, pos, gcwq) {
 		struct work_struct *rebind_work = &worker->rebind_work;
+		struct workqueue_struct *wq;
 
 		/* morph UNBOUND to REBIND */
 		worker->flags &= ~WORKER_UNBOUND;
@@ -1749,9 +1750,12 @@ retry:
 
 		/* wq doesn't matter, use the default one */
 		debug_work_activate(rebind_work);
-		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
-			    worker->scheduled.next,
-			    work_color_to_flags(WORK_NO_COLOR));
+		wq = worker_pool_pri(worker->pool) ? system_highpri_wq :
+								system_wq;
+		insert_work(
+			get_cwq(gcwq->cpu, wq),
+			rebind_work, worker->scheduled.next,
+			work_color_to_flags(WORK_NO_COLOR));
 	}
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 6/6] workqueue: use system_highpri_wq for unbind_work
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
                     ` (4 preceding siblings ...)
  2012-08-14 18:10   ` [PATCH v2 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
@ 2012-08-14 18:10   ` Joonsoo Kim
  5 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-14 18:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

To speed up cpu-down processing, use system_highpri_wq.
As the scheduling priority of its workers is higher than that of system_wq,
and it is not contended by other normal work items on this cpu, work on it
is processed faster than on system_wq.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4c5733c1..fa5ec4c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3671,7 +3671,7 @@ static int __devinit workqueue_cpu_down_callback(struct notifier_block *nfb,
 	case CPU_DOWN_PREPARE:
 		/* unbinding should happen on the local CPU */
 		INIT_WORK_ONSTACK(&unbind_work, gcwq_unbind_fn);
-		schedule_work_on(cpu, &unbind_work);
+		queue_work_on(cpu, system_highpri_wq, &unbind_work);
 		flush_work(&unbind_work);
 		break;
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-14 18:10   ` [PATCH v2 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
@ 2012-08-14 18:29     ` Tejun Heo
  0 siblings, 0 replies; 146+ messages in thread
From: Tejun Heo @ 2012-08-14 18:29 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

On Wed, Aug 15, 2012 at 03:10:15AM +0900, Joonsoo Kim wrote:
> In rebind_workers(), we insert a rebind work item for each busy worker.
> Currently we always use system_wq for this, which opens up a possible
> error situation: a mismatch between cwq->pool and worker->pool.
> 
> To prevent this, use system_highpri_wq for highpri workers so that the
> two match.  This patch implements that.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 470b0eb..4c5733c1 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1738,6 +1738,7 @@ retry:
>  	/* rebind busy workers */
>  	for_each_busy_worker(worker, i, pos, gcwq) {
>  		struct work_struct *rebind_work = &worker->rebind_work;
> +		struct workqueue_struct *wq;
>  
>  		/* morph UNBOUND to REBIND */
>  		worker->flags &= ~WORKER_UNBOUND;
> @@ -1749,9 +1750,12 @@ retry:
>  
>  		/* wq doesn't matter, use the default one */
>  		debug_work_activate(rebind_work);
> -		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
> -			    worker->scheduled.next,
> -			    work_color_to_flags(WORK_NO_COLOR));
> +		wq = worker_pool_pri(worker->pool) ? system_highpri_wq :
> +								system_wq;
> +		insert_work(
> +			get_cwq(gcwq->cpu, wq),
> +			rebind_work, worker->scheduled.next,
> +			work_color_to_flags(WORK_NO_COLOR));
				
Umm... this indentation is ugly.  Please follow the indentation of the
surrounding code.  If ?: gets too long, using if/else might be better.
Please also comment why the above is necessary.  The comment above
says "wq doesn't matter" and then we're choosing workqueue, which is
confusing.
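
For illustration, a minimal sketch of the if/else form suggested above,
reusing the helpers from the quoted patch (worker_pool_pri(), get_cwq()
and the gcwq/worker variables are assumptions carried over from that
context):

	struct workqueue_struct *wq;

	/* pick the wq whose cwq->pool matches this worker's pool */
	if (worker_pool_pri(worker->pool))
		wq = system_highpri_wq;
	else
		wq = system_wq;

	insert_work(get_cwq(gcwq->cpu, wq), rebind_work,
		    worker->scheduled.next,
		    work_color_to_flags(WORK_NO_COLOR));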

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work()
  2012-08-14 18:10   ` [PATCH v2 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work() Joonsoo Kim
@ 2012-08-14 18:34     ` Tejun Heo
  2012-08-14 18:56       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-14 18:34 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

Hello,

On Wed, Aug 15, 2012 at 03:10:12AM +0900, Joonsoo Kim wrote:
> The workqueue_queue_work() tracepoint records the requested cpu.
> But if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, the
> requested cpu is replaced by the local cpu before the tracepoint fires.
> With @wq->flags & WQ_UNBOUND this replacement does not occur, so it is
> reasonable to correct the bound case.
>
> Use a temporary local variable to store the local cpu.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 49d8f4a..6a17ab0 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1198,6 +1198,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
>  	struct cpu_workqueue_struct *cwq;
>  	struct list_head *worklist;
>  	unsigned int work_flags;
> +	unsigned int lcpu;

@lcpu in __queue_delayed_work() stands for "last cpu", which is kinda
weird here.  Maybe just add "unsigned int req_cpu = cpu" at the top of
the function and use it for the TP?
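
A sketch of that suggestion, showing only the two touched spots of
__queue_work() (everything else in the function body is assumed
unchanged):

	static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
				 struct work_struct *work)
	{
		unsigned int req_cpu = cpu;	/* the cpu the caller asked for */

		/* ... existing gcwq selection; may rewrite @cpu ... */

		/* trace the original request, not the resolved cpu */
		trace_workqueue_queue_work(req_cpu, cwq, work);

		/* ... */
	}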

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work()
  2012-08-14 18:34     ` Tejun Heo
@ 2012-08-14 18:56       ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-14 18:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

2012/8/15 Tejun Heo <tj@kernel.org>:
> Hello,
>
> On Wed, Aug 15, 2012 at 03:10:12AM +0900, Joonsoo Kim wrote:
>> The workqueue_queue_work() tracepoint records the requested cpu.
>> But if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, the
>> requested cpu is replaced by the local cpu before the tracepoint fires.
>> With @wq->flags & WQ_UNBOUND this replacement does not occur, so it is
>> reasonable to correct the bound case.
>>
>> Use a temporary local variable to store the local cpu.
>>
>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 49d8f4a..6a17ab0 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -1198,6 +1198,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
>>       struct cpu_workqueue_struct *cwq;
>>       struct list_head *worklist;
>>       unsigned int work_flags;
>> +     unsigned int lcpu;
>
> @lcpu in __queue_delayed_work() stands for "last cpu", which is kinda
> weird here.  Maybe just add "unsigned int req_cpu = cpu" at the top of
> the function and use it for the TP?

Okay! That looks better.
I will re-send a v3 of the two patches you commented on.
Thanks!

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 3/6] workqueue: change value of lcpu in __queue_delayed_work_on()
  2012-08-14 18:10   ` [PATCH v2 3/6] workqueue: change value of lcpu in __queue_delayed_work_on() Joonsoo Kim
@ 2012-08-14 19:00     ` Tejun Heo
  0 siblings, 0 replies; 146+ messages in thread
From: Tejun Heo @ 2012-08-14 19:00 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

On Wed, Aug 15, 2012 at 03:10:13AM +0900, Joonsoo Kim wrote:
> We assign a cpu id to the work struct's data field in
> __queue_delayed_work_on().  In the current implementation, when a work
> item comes in for the first time, the currently running cpu id is
> assigned.  So if we call __queue_delayed_work_on() for CPU A while on
> CPU B, the __queue_work() invoked from delayed_work_timer_fn() goes
> into the following sub-optimal path in the WQ_NON_REENTRANT case.
> 
> 	gcwq = get_gcwq(cpu);
> 	if (wq->flags & WQ_NON_REENTRANT &&
> 		(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
> 
> Set lcpu to @cpu first, then fall back to the local cpu if lcpu is
> still WORK_CPU_UNBOUND.  That is sufficient to avoid the sub-optimal
> path.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 6a17ab0..f55ac26 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1358,9 +1358,10 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
>  	if (!(wq->flags & WQ_UNBOUND)) {
>  		struct global_cwq *gcwq = get_work_gcwq(work);
>  
> -		if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
> +		lcpu = cpu;
> +		if (gcwq)
>  			lcpu = gcwq->cpu;
> -		else
> +		if (lcpu == WORK_CPU_UNBOUND)
>  			lcpu = raw_smp_processor_id();

Can you please add a comment explaining what's going on?

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH v3 0/6] system_highpri_wq
       [not found] <Yes>
                   ` (4 preceding siblings ...)
  2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
@ 2012-08-15 14:25 ` Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
                     ` (6 more replies)
  2012-08-24 16:05 ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Joonsoo Kim
                   ` (6 subsequent siblings)
  12 siblings, 7 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-15 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Change from v2
[1/6] No change
[2/6] Change local variable name and use it directly for TP
[3/6] Add a comment.
[4/6] No change
[5/6] Add a comment. Fix ugly indentation.
[6/6] No change

Change from v1
[ALL] Add cover-letter
[ALL] Rebase on git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.7

[1/6] No change
[2/6] First added into this series.
[3/6] Add a changelog to clarify what the patch does,
      Add logic to handle the case "@cpu is WORK_CPU_UNBOUND"
[4/6] Fix awesome mistake: set WQ_HIGHPRI for system_highpri_wq
[5/6] Adopt Tejun's comment about selecting system_wq
[6/6] No change

This patchset introduces system_highpri_wq
in order to use the proper cwq for highpri workers.

The first 3 patches are not related to that purpose;
they just fix assorted issues.
The last 3 patches serve that purpose.

Joonsoo Kim (6):
  workqueue: use enum value to set array size of pools in gcwq
  workqueue: correct req_cpu in trace_workqueue_queue_work()
  workqueue: change value of lcpu in __queue_delayed_work_on()
  workqueue: introduce system_highpri_wq
  workqueue: use system_highpri_wq for highpri workers in
    rebind_workers()
  workqueue: use system_highpri_wq for unbind_work

 kernel/workqueue.c |   42 +++++++++++++++++++++++++++++++-----------
 1 file changed, 31 insertions(+), 11 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH v3 1/6] workqueue: use enum value to set array size of pools in gcwq
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
@ 2012-08-15 14:25   ` Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work() Joonsoo Kim
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-15 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate
worker_pool for HIGHPRI.  Although the NR_WORKER_POOLS enum value
represents the number of pools, the definition of the pools array in
gcwq doesn't use it.  Using it makes the code robust and prevents
future mistakes, so change the code to use this enum value.
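
The general pattern, as a standalone sketch (the type and field names
here are illustrative, not the kernel's):

	enum {
		POOL_NORMAL,
		POOL_HIGHPRI,
		NR_POOLS,		/* tracks however many entries precede it */
	};

	struct pool { int nice; };	/* stand-in for worker_pool */

	struct gq {
		struct pool pools[NR_POOLS];	/* cannot drift when a pool is added */
	};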

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4fef952..49d8f4a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -183,7 +183,8 @@ struct global_cwq {
 	struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];
 						/* L: hash of busy workers */
 
-	struct worker_pool	pools[2];	/* normal and highpri pools */
+	struct worker_pool	pools[NR_WORKER_POOLS];
+						/* normal and highpri pools */
 
 	wait_queue_head_t	rebind_hold;	/* rebind hold wait */
 } ____cacheline_aligned_in_smp;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v3 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work()
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
@ 2012-08-15 14:25   ` Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 3/6] workqueue: change value of lcpu in __queue_delayed_work_on() Joonsoo Kim
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-15 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

The workqueue_queue_work() tracepoint records the requested cpu.
But if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, the
requested cpu is replaced by the local cpu before the tracepoint fires.
With @wq->flags & WQ_UNBOUND this replacement does not occur, so it is
reasonable to correct the bound case.

Use a temporary local variable to store the requested cpu.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 49d8f4a..c29f2dc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1198,6 +1198,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 	struct cpu_workqueue_struct *cwq;
 	struct list_head *worklist;
 	unsigned int work_flags;
+	unsigned int req_cpu = cpu;
 
 	/*
 	 * While a work item is PENDING && off queue, a task trying to
@@ -1253,7 +1254,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 
 	/* gcwq determined, get cwq and queue */
 	cwq = get_cwq(gcwq->cpu, wq);
-	trace_workqueue_queue_work(cpu, cwq, work);
+	trace_workqueue_queue_work(req_cpu, cwq, work);
 
 	if (WARN_ON(!list_empty(&work->entry))) {
 		spin_unlock(&gcwq->lock);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v3 3/6] workqueue: change value of lcpu in __queue_delayed_work_on()
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work() Joonsoo Kim
@ 2012-08-15 14:25   ` Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 4/6] workqueue: introduce system_highpri_wq Joonsoo Kim
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-15 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

We assign a cpu id to the work struct's data field in
__queue_delayed_work_on().  In the current implementation, when a work
item comes in for the first time, the currently running cpu id is
assigned.  So if we call __queue_delayed_work_on() for CPU A while on
CPU B, the __queue_work() invoked from delayed_work_timer_fn() goes
into the following sub-optimal path in the WQ_NON_REENTRANT case.

	gcwq = get_gcwq(cpu);
	if (wq->flags & WQ_NON_REENTRANT &&
		(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {

Set lcpu to @cpu first, then fall back to the local cpu if lcpu is
still WORK_CPU_UNBOUND.  That is sufficient to avoid the sub-optimal
path.
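
The call pattern that hits the slow path, sketched with illustrative
cpu numbers (dwork and wq are assumed to be an initialized delayed work
item and a WQ_NON_REENTRANT workqueue):

	/* running on cpu 1, ask for execution on cpu 0 */
	queue_delayed_work_on(0, wq, &dwork, HZ);

	/*
	 * Before this patch, __queue_delayed_work() stored cpu 1 (the
	 * local cpu) in the work data.  When the timer fired,
	 * __queue_work() saw last_gcwq (cpu 1) != gcwq (cpu 0) and took
	 * the reentrance-detection path quoted above.
	 */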

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c29f2dc..32c4f79 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1356,9 +1356,16 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 	if (!(wq->flags & WQ_UNBOUND)) {
 		struct global_cwq *gcwq = get_work_gcwq(work);
 
-		if (gcwq && gcwq->cpu != WORK_CPU_UNBOUND)
+		/*
+		 * If we cannot get the gcwq from the work item directly,
+		 * deliberately select the last cpu so that reentrance
+		 * detection for delayed work avoids the sub-optimal path.
+		 * Assign the requested cpu to lcpu, except WORK_CPU_UNBOUND.
+		 */
+		lcpu = cpu;
+		if (gcwq)
 			lcpu = gcwq->cpu;
-		else
+		if (lcpu == WORK_CPU_UNBOUND)
 			lcpu = raw_smp_processor_id();
 	} else {
 		lcpu = WORK_CPU_UNBOUND;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v3 4/6] workqueue: introduce system_highpri_wq
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
                     ` (2 preceding siblings ...)
  2012-08-15 14:25   ` [PATCH v3 3/6] workqueue: change value of lcpu in __queue_delayed_work_on() Joonsoo Kim
@ 2012-08-15 14:25   ` Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-15 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker
pool for HIGHPRI.  When we handle the busy workers of a gcwq, each can
be either a normal or a highpri worker.  But we don't consider this
difference in rebind_workers(): we use just system_wq even for highpri
workers, which makes a mismatch between cwq->pool and worker->pool.

This causes no error in the current implementation, but it could in the
future.  So introduce system_highpri_wq in order to use the proper cwq
for highpri workers in rebind_workers().  The following patch fixes the
issue properly.
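
Queueing to the new workqueue works like any other system workqueue; a
minimal sketch (the work item and handler names are made up for
illustration):

	static void urgent_fn(struct work_struct *work)
	{
		/* runs on a highpri (elevated nice level) worker */
	}
	static DECLARE_WORK(urgent_work, urgent_fn);

	queue_work(system_highpri_wq, &urgent_work);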

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 32c4f79..a768ffd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -269,12 +269,14 @@ struct workqueue_struct {
 };
 
 struct workqueue_struct *system_wq __read_mostly;
+struct workqueue_struct *system_highpri_wq __read_mostly;
 struct workqueue_struct *system_long_wq __read_mostly;
 struct workqueue_struct *system_nrt_wq __read_mostly;
 struct workqueue_struct *system_unbound_wq __read_mostly;
 struct workqueue_struct *system_freezable_wq __read_mostly;
 struct workqueue_struct *system_nrt_freezable_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_wq);
+EXPORT_SYMBOL_GPL(system_highpri_wq);
 EXPORT_SYMBOL_GPL(system_long_wq);
 EXPORT_SYMBOL_GPL(system_nrt_wq);
 EXPORT_SYMBOL_GPL(system_unbound_wq);
@@ -3929,6 +3931,7 @@ static int __init init_workqueues(void)
 	}
 
 	system_wq = alloc_workqueue("events", 0, 0);
+	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
 	system_long_wq = alloc_workqueue("events_long", 0, 0);
 	system_nrt_wq = alloc_workqueue("events_nrt", WQ_NON_REENTRANT, 0);
 	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
@@ -3937,8 +3940,8 @@ static int __init init_workqueues(void)
 					      WQ_FREEZABLE, 0);
 	system_nrt_freezable_wq = alloc_workqueue("events_nrt_freezable",
 			WQ_NON_REENTRANT | WQ_FREEZABLE, 0);
-	BUG_ON(!system_wq || !system_long_wq || !system_nrt_wq ||
-	       !system_unbound_wq || !system_freezable_wq ||
+	BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
+		!system_nrt_wq || !system_unbound_wq || !system_freezable_wq ||
 		!system_nrt_freezable_wq);
 	return 0;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v3 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers()
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
                     ` (3 preceding siblings ...)
  2012-08-15 14:25   ` [PATCH v3 4/6] workqueue: introduce system_highpri_wq Joonsoo Kim
@ 2012-08-15 14:25   ` Joonsoo Kim
  2012-08-15 14:25   ` [PATCH v3 6/6] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
  2012-08-16 21:22   ` [PATCH v3 0/6] system_highpri_wq Tejun Heo
  6 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-15 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

In rebind_workers(), we insert a rebind work item for each busy worker.
Currently we always use system_wq for this, which opens up a possible
error situation: a mismatch between cwq->pool and worker->pool.

To prevent this, use system_highpri_wq for highpri workers so that the
two match.  This patch implements that.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a768ffd..2945734 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1742,6 +1742,7 @@ retry:
 	/* rebind busy workers */
 	for_each_busy_worker(worker, i, pos, gcwq) {
 		struct work_struct *rebind_work = &worker->rebind_work;
+		struct workqueue_struct *wq;
 
 		/* morph UNBOUND to REBIND */
 		worker->flags &= ~WORKER_UNBOUND;
@@ -1751,11 +1752,18 @@ retry:
 				     work_data_bits(rebind_work)))
 			continue;
 
-		/* wq doesn't matter, use the default one */
 		debug_work_activate(rebind_work);
-		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
-			    worker->scheduled.next,
-			    work_color_to_flags(WORK_NO_COLOR));
+		/*
+		 * Now we have two types of workers based on priority.
+		 * Each must pick the matching wq.
+		 */
+		if (worker_pool_pri(worker->pool))
+			wq = system_highpri_wq;
+		else
+			wq = system_wq;
+		insert_work(get_cwq(gcwq->cpu, wq), rebind_work,
+			worker->scheduled.next,
+			work_color_to_flags(WORK_NO_COLOR));
 	}
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v3 6/6] workqueue: use system_highpri_wq for unbind_work
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
                     ` (4 preceding siblings ...)
  2012-08-15 14:25   ` [PATCH v3 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
@ 2012-08-15 14:25   ` Joonsoo Kim
  2012-08-16 21:22   ` [PATCH v3 0/6] system_highpri_wq Tejun Heo
  6 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-15 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

To speed up cpu-down processing, use system_highpri_wq.
Since the scheduling priority of its workers is higher than that of
system_wq, and it is not contended by other normal work on this cpu,
work queued on it is processed faster than on system_wq.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2945734..e17326e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3679,7 +3679,7 @@ static int __devinit workqueue_cpu_down_callback(struct notifier_block *nfb,
 	case CPU_DOWN_PREPARE:
 		/* unbinding should happen on the local CPU */
 		INIT_WORK_ONSTACK(&unbind_work, gcwq_unbind_fn);
-		schedule_work_on(cpu, &unbind_work);
+		queue_work_on(cpu, system_highpri_wq, &unbind_work);
 		flush_work(&unbind_work);
 		break;
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH v3 0/6] system_highpri_wq
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
                     ` (5 preceding siblings ...)
  2012-08-15 14:25   ` [PATCH v3 6/6] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
@ 2012-08-16 21:22   ` Tejun Heo
  2012-08-17 13:38     ` JoonSoo Kim
  6 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-08-16 21:22 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

On Wed, Aug 15, 2012 at 11:25:35PM +0900, Joonsoo Kim wrote:
> Change from v2
> [1/6] No change
> [2/6] Change local variable name and use it directly for TP
> [3/6] Add a comment.
> [4/6] No change
> [5/6] Add a comment. Fix ugly indentation.
> [6/6] No change
> 
> Change from v1
> [ALL] Add cover-letter
> [ALL] Rebase on git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.7
> 
> [1/6] No change
> [2/6] First added into this series.
> [3/6] Add a changelog to clarify what the patch does,
>       Add logic to handle the case "@cpu is WORK_CPU_UNBOUND"
> [4/6] Fix awesome mistake: set WQ_HIGHPRI for system_highpri_wq
> [5/6] Adopt Tejun's comment about selecting system_wq
> [6/6] No change
> 
> This patchset introduces system_highpri_wq
> in order to use the proper cwq for highpri workers.
> 
> The first 3 patches are not related to that purpose;
> they just fix assorted issues.
> The last 3 patches serve that purpose.

Applied to wq/for-3.7 w/ minor updates.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v3 0/6] system_highpri_wq
  2012-08-16 21:22   ` [PATCH v3 0/6] system_highpri_wq Tejun Heo
@ 2012-08-17 13:38     ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-17 13:38 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

2012/8/17 Tejun Heo <tj@kernel.org>:
> On Wed, Aug 15, 2012 at 11:25:35PM +0900, Joonsoo Kim wrote:
>> Change from v2
>> [1/6] No change
>> [2/6] Change local variable name and use it directly for TP
>> [3/6] Add a comment.
>> [4/6] No change
>> [5/6] Add a comment. Fix ugly indentation.
>> [6/6] No change
>>
>> Change from v1
>> [ALL] Add cover-letter
>> [ALL] Rebase on git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.7
>>
>> [1/6] No change
>> [2/6] First added into this series.
>> [3/6] Add a changelog to clarify what the patch does,
>>       Add logic to handle the case "@cpu is WORK_CPU_UNBOUND"
>> [4/6] Fix awesome mistake: set WQ_HIGHPRI for system_highpri_wq
>> [5/6] Adopt Tejun's comment about selecting system_wq
>> [6/6] No change
>>
>> This patchset introduces system_highpri_wq
>> in order to use the proper cwq for highpri workers.
>>
>> The first 3 patches are not related to that purpose;
>> they just fix assorted issues.
>> The last 3 patches serve that purpose.
>
> Applied to wq/for-3.7 w/ minor updates.
>
> Thanks.
>
> --
> tejun

Hello.
I checked your minor updates and they look better.
Sorry for my poor explanation of the patches.

Thank you very much.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 1/2] slub: rename cpu_partial to max_cpu_object
       [not found] <Yes>
                   ` (5 preceding siblings ...)
  2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
@ 2012-08-24 16:05 ` Joonsoo Kim
  2012-08-24 16:05   ` [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node Joonsoo Kim
  2012-08-24 16:12   ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Christoph Lameter
  2012-08-25 14:11 ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Joonsoo Kim
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-24 16:05 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Christoph Lameter

The cpu_partial field of struct kmem_cache is a bit awkwardly named.

It means the maximum number of objects kept in the per cpu slab
and cpu partial lists of a processor.  However, the current name
suggests objects kept in the cpu partial lists only.
So this patch renames it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index df448ad..9130e6b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -84,7 +84,7 @@ struct kmem_cache {
 	int size;		/* The size of an object including meta data */
 	int object_size;	/* The size of an object without meta data */
 	int offset;		/* Free pointer offset. */
-	int cpu_partial;	/* Number of per cpu partial objects to keep around */
+	int max_cpu_object;	/* Number of per cpu objects to keep around */
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
diff --git a/mm/slub.c b/mm/slub.c
index c67bd0a..d597530 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1565,7 +1565,7 @@ static void *get_partial_node(struct kmem_cache *s,
 			available = put_cpu_partial(s, page, 0);
 			stat(s, CPU_PARTIAL_NODE);
 		}
-		if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
+		if (kmem_cache_debug(s) || available > s->max_cpu_object / 2)
 			break;
 
 	}
@@ -1953,7 +1953,7 @@ int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		if (oldpage) {
 			pobjects = oldpage->pobjects;
 			pages = oldpage->pages;
-			if (drain && pobjects > s->cpu_partial) {
+			if (drain && pobjects > s->max_cpu_object) {
 				unsigned long flags;
 				/*
 				 * partial array is full. Move the existing
@@ -3073,8 +3073,8 @@ static int kmem_cache_open(struct kmem_cache *s,
 	set_min_partial(s, ilog2(s->size) / 2);
 
 	/*
-	 * cpu_partial determined the maximum number of objects kept in the
-	 * per cpu partial lists of a processor.
+	 * max_cpu_object determined the maximum number of objects kept in the
+	 * per cpu slab and cpu partial lists of a processor.
 	 *
 	 * Per cpu partial lists mainly contain slabs that just have one
 	 * object freed. If they are used for allocation then they can be
@@ -3085,20 +3085,20 @@ static int kmem_cache_open(struct kmem_cache *s,
 	 *
 	 * A) The number of objects from per cpu partial slabs dumped to the
 	 *    per node list when we reach the limit.
-	 * B) The number of objects in cpu partial slabs to extract from the
-	 *    per node list when we run out of per cpu objects. We only fetch 50%
-	 *    to keep some capacity around for frees.
+	 * B) The number of objects in cpu slab and cpu partial lists to
+	 *    extract from the per node list when we run out of per cpu objects.
+	 *    We only fetch 50% to keep some capacity around for frees.
 	 */
 	if (kmem_cache_debug(s))
-		s->cpu_partial = 0;
+		s->max_cpu_object = 0;
 	else if (s->size >= PAGE_SIZE)
-		s->cpu_partial = 2;
+		s->max_cpu_object = 2;
 	else if (s->size >= 1024)
-		s->cpu_partial = 6;
+		s->max_cpu_object = 6;
 	else if (s->size >= 256)
-		s->cpu_partial = 13;
+		s->max_cpu_object = 13;
 	else
-		s->cpu_partial = 30;
+		s->max_cpu_object = 30;
 
 	s->refcount = 1;
 #ifdef CONFIG_NUMA
@@ -4677,12 +4677,12 @@ static ssize_t min_partial_store(struct kmem_cache *s, const char *buf,
 }
 SLAB_ATTR(min_partial);
 
-static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
+static ssize_t max_cpu_object_show(struct kmem_cache *s, char *buf)
 {
-	return sprintf(buf, "%u\n", s->cpu_partial);
+	return sprintf(buf, "%u\n", s->max_cpu_object);
 }
 
-static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
+static ssize_t max_cpu_object_store(struct kmem_cache *s, const char *buf,
 				 size_t length)
 {
 	unsigned long objects;
@@ -4694,11 +4694,11 @@ static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
 	if (objects && kmem_cache_debug(s))
 		return -EINVAL;
 
-	s->cpu_partial = objects;
+	s->max_cpu_object = objects;
 	flush_all(s);
 	return length;
 }
-SLAB_ATTR(cpu_partial);
+SLAB_ATTR(max_cpu_object);
 
 static ssize_t ctor_show(struct kmem_cache *s, char *buf)
 {
@@ -5103,7 +5103,7 @@ static struct attribute *slab_attrs[] = {
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
 	&min_partial_attr.attr,
-	&cpu_partial_attr.attr,
+	&max_cpu_object_attr.attr,
 	&objects_attr.attr,
 	&objects_partial_attr.attr,
 	&partial_attr.attr,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node
  2012-08-24 16:05 ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Joonsoo Kim
@ 2012-08-24 16:05   ` Joonsoo Kim
  2012-08-24 16:15     ` Christoph Lameter
  2012-08-24 16:12   ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Christoph Lameter
  1 sibling, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-24 16:05 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Christoph Lameter

In get_partial_node(), we want to refill the cpu slab and cpu partial
slabs until the number of objects kept in the per cpu slab and cpu
partial lists of a processor reaches max_cpu_object.

However, the current implementation does not achieve that.
See the following code in get_partial_node().

if (!object) {
	c->page = page;
	stat(s, ALLOC_FROM_PARTIAL);
	object = t;
	available =  page->objects - page->inuse;
} else {
	available = put_cpu_partial(s, page, 0);
	stat(s, CPU_PARTIAL_NODE);
}
if (kmem_cache_debug(s) || available > s->max_cpu_object / 2)
	break;

In the !object case (available = page->objects - page->inuse),
"available" means the number of objects in the cpu slab.  At this point
we don't have any cpu partial slabs yet, so "available" implies the
number of objects available to the cpu without locking.
This is what we want.

But look at the other "available" (available = put_cpu_partial(s, page, 0)).
This "available" doesn't include the number of objects in the cpu slab;
it only counts the objects in the cpu partial slabs.  So it doesn't
imply the number of objects available to the cpu without locking.
This isn't what we want.

Therefore fix it so that both cases mean the same thing, and rename
"available" to "cpu_slab_objects" for readability.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>

diff --git a/mm/slub.c b/mm/slub.c
index d597530..c96e0e4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1538,6 +1538,7 @@ static void *get_partial_node(struct kmem_cache *s,
 {
 	struct page *page, *page2;
 	void *object = NULL;
+	int cpu_slab_objects = 0, pobjects = 0;
 
 	/*
 	 * Racy check. If we mistakenly see no partial slabs then we
@@ -1551,7 +1552,6 @@ static void *get_partial_node(struct kmem_cache *s,
 	spin_lock(&n->list_lock);
 	list_for_each_entry_safe(page, page2, &n->partial, lru) {
 		void *t = acquire_slab(s, n, page, object == NULL);
-		int available;
 
 		if (!t)
 			break;
@@ -1560,12 +1560,13 @@ static void *get_partial_node(struct kmem_cache *s,
 			c->page = page;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
-			available =  page->objects - page->inuse;
+			cpu_slab_objects = page->objects - page->inuse;
 		} else {
-			available = put_cpu_partial(s, page, 0);
+			pobjects = put_cpu_partial(s, page, 0);
 			stat(s, CPU_PARTIAL_NODE);
 		}
-		if (kmem_cache_debug(s) || available > s->max_cpu_object / 2)
+		if (kmem_cache_debug(s)
+			|| cpu_slab_objects + pobjects > s->max_cpu_object / 2)
 			break;
 
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/2] slub: rename cpu_partial to max_cpu_object
  2012-08-24 16:05 ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Joonsoo Kim
  2012-08-24 16:05   ` [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node Joonsoo Kim
@ 2012-08-24 16:12   ` Christoph Lameter
  1 sibling, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-08-24 16:12 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Pekka Enberg, linux-kernel, linux-mm

On Sat, 25 Aug 2012, Joonsoo Kim wrote:

> The cpu_partial field of struct kmem_cache is a bit awkwardly named.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node
  2012-08-24 16:05   ` [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node Joonsoo Kim
@ 2012-08-24 16:15     ` Christoph Lameter
  2012-08-24 16:28       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-08-24 16:15 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Pekka Enberg, linux-kernel, linux-mm

On Sat, 25 Aug 2012, Joonsoo Kim wrote:

> index d597530..c96e0e4 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1538,6 +1538,7 @@ static void *get_partial_node(struct kmem_cache *s,
>  {
>  	struct page *page, *page2;
>  	void *object = NULL;
> +	int cpu_slab_objects = 0, pobjects = 0;

We really need to be clear here.

One counter is for the number of objects in the per cpu slab and the
other for the objects in the per cpu partial lists.

So I think the first name is ok.  The second should be similar:

cpu_partial_objects?



^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node
  2012-08-24 16:15     ` Christoph Lameter
@ 2012-08-24 16:28       ` JoonSoo Kim
  2012-08-24 16:31         ` Christoph Lameter
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-24 16:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-kernel, linux-mm

2012/8/25 Christoph Lameter <cl@linux.com>:
> On Sat, 25 Aug 2012, Joonsoo Kim wrote:
>
>> index d597530..c96e0e4 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1538,6 +1538,7 @@ static void *get_partial_node(struct kmem_cache *s,
>>  {
>>       struct page *page, *page2;
>>       void *object = NULL;
>> +     int cpu_slab_objects = 0, pobjects = 0;
>
> We really need to be clear here.
>
> One counter is for the number of objects in the per cpu slab and the
> other for the objects in the per cpu partial lists.
>
> So I think the first name is ok.  The second should be similar:
>
> cpu_partial_objects?
>

Okay! It looks good.
But when using "cpu_partial_objects", I run into a coding style problem.

                if (kmem_cache_debug(s)
                        || cpu_slab_objects + cpu_partial_objects
                                                > s->max_cpu_object / 2)

Do you have any good ideas?

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node
  2012-08-24 16:28       ` JoonSoo Kim
@ 2012-08-24 16:31         ` Christoph Lameter
  2012-08-24 16:40           ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-08-24 16:31 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: Pekka Enberg, linux-kernel, linux-mm

On Sat, 25 Aug 2012, JoonSoo Kim wrote:

> But when using "cpu_partial_objects", I run into a coding style problem.
>
>                 if (kmem_cache_debug(s)
>                         || cpu_slab_objects + cpu_partial_objects
>                                                 > s->max_cpu_object / 2)
>
> Do you have any good ideas?

Not sure what the problem is? The line wrap?

Reduce the tabs for the third line?


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node
  2012-08-24 16:31         ` Christoph Lameter
@ 2012-08-24 16:40           ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-08-24 16:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-kernel, linux-mm

2012/8/25 Christoph Lameter <cl@linux.com>:
> On Sat, 25 Aug 2012, JoonSoo Kim wrote:
>
>> But when using "cpu_partial_objects", I run into a coding style problem.
>>
>>                 if (kmem_cache_debug(s)
>>                         || cpu_slab_objects + cpu_partial_objects
>>                                                 > s->max_cpu_object / 2)
>>
>> Do you have any good ideas?
>
> Not sure what the problem is? The line wrap?

Yes! The line wrap.


                if (kmem_cache_debug(s)
                || cpu_slab_objects + cpu_partial_objects >
s->max_cpu_object / 2)
                        break;

The example above uses 82 columns, so the line wrapping problem remains.

                if (kmem_cache_debug(s) ||
                cpu_slab_objects + cpu_partial_objects > s->max_cpu_object / 2)
                        break;

This one uses 79 columns, but it is somewhat ugly because the second
line starts at the same column as the line above it.
Is it okay?


                if (kmem_cache_debug(s)
                        || cpu_slab_objects + cpu_partial_objects
                                                > s->max_cpu_object / 2)
                        break;

Is this the best option?  It uses 72 columns.
Let me know what the best method is for this situation.
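
Another option that avoids the wrap entirely would be to fold the sum
into a local variable first, e.g. (a sketch using the names from this
thread):

		int objects = cpu_slab_objects + cpu_partial_objects;

		if (kmem_cache_debug(s) || objects > s->max_cpu_object / 2)
			break;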

Thanks!

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 1/2] slab:  do ClearSlabPfmemalloc() for all pages of slab
       [not found] <Yes>
                   ` (6 preceding siblings ...)
  2012-08-24 16:05 ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Joonsoo Kim
@ 2012-08-25 14:11 ` Joonsoo Kim
  2012-08-25 14:11   ` [PATCH 2/2] slab: fix starting index for finding another object Joonsoo Kim
  2012-09-03 10:08   ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Mel Gorman
  2012-10-18 23:18 ` [PATCH 0/2] clean-up initialization of deferrable timer Joonsoo Kim
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-25 14:11 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman, Christoph Lameter

Currently we do ClearPageSlabPfmemalloc() only for the first page of a
slab when we clear the SlabPfmemalloc flag.  That is a problem, because
in __ac_put_obj() we sometimes test the flag of a page that is not the
first page of the slab.

So add code to do ClearPageSlabPfmemalloc() for all pages of the slab.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
---
This patch is based on Pekka's slab/next tree.

diff --git a/mm/slab.c b/mm/slab.c
index 3b4587b..45cf59a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -992,8 +992,11 @@ static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 		 */
 		l3 = cachep->nodelists[numa_mem_id()];
 		if (!list_empty(&l3->slabs_free) && force_refill) {
-			struct slab *slabp = virt_to_slab(objp);
-			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+			int i, nr_pages = (1 << cachep->gfporder);
+			struct page *page = virt_to_head_page(objp);
+
+			for (i = 0; i < nr_pages; i++)
+				ClearPageSlabPfmemalloc(page + i);
 			clear_obj_pfmemalloc(&objp);
 			recheck_pfmemalloc_active(cachep, ac);
 			return objp;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/2] slab: fix starting index for finding another object
  2012-08-25 14:11 ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Joonsoo Kim
@ 2012-08-25 14:11   ` Joonsoo Kim
  2012-09-03 10:08   ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Mel Gorman
  1 sibling, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-08-25 14:11 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman, Christoph Lameter

In the array cache there is an object at index 0 as well, so the scan
for a non-PFMEMALLOC object should start at index 0, not index 1.
Fix it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>

diff --git a/mm/slab.c b/mm/slab.c
index 45cf59a..eb74bf5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -976,7 +976,7 @@ static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 		}
 
 		/* The caller cannot use PFMEMALLOC objects, find another one */
-		for (i = 1; i < ac->avail; i++) {
+		for (i = 0; i < ac->avail; i++) {
 			/* If a !PFMEMALLOC object is found, swap them */
 			if (!is_obj_pfmemalloc(ac->entry[i])) {
 				objp = ac->entry[i];
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/2] slab:  do ClearSlabPfmemalloc() for all pages of slab
  2012-08-25 14:11 ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Joonsoo Kim
  2012-08-25 14:11   ` [PATCH 2/2] slab: fix starting index for finding another object Joonsoo Kim
@ 2012-09-03 10:08   ` Mel Gorman
  1 sibling, 0 replies; 146+ messages in thread
From: Mel Gorman @ 2012-09-03 10:08 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Pekka Enberg, linux-kernel, linux-mm, Christoph Lameter

It took me a while to get around to reviewing this due to attending
kernel summit.  Sorry about that.

On Sat, Aug 25, 2012 at 11:11:10PM +0900, Joonsoo Kim wrote:
> Currently we do ClearPageSlabPfmemalloc() only for the first page of a
> slab when we clear the SlabPfmemalloc flag.  That is a problem, because
> in __ac_put_obj() we sometimes test the flag of a page that is not the
> first page of the slab.
> 

Well spotted.

The impact is marginal as far as pfmemalloc protection is concerned. I do not
believe that any of the slabs that use high-order allocations are used in
the swap-over-network paths. It would be unfortunate if that ever changed.

> So add code to do ClearPageSlabPfmemalloc() for all pages of the slab.
> 

I would prefer if the pfmemalloc information was kept on the head page.
Would the following patch also address your concerns?

diff --git a/mm/slab.c b/mm/slab.c
index 811af03..d34a903 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1000,7 +1000,7 @@ static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 		l3 = cachep->nodelists[numa_mem_id()];
 		if (!list_empty(&l3->slabs_free) && force_refill) {
 			struct slab *slabp = virt_to_slab(objp);
-			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+			ClearPageSlabPfmemalloc(virt_to_head_page(slabp->s_mem));
 			clear_obj_pfmemalloc(&objp);
 			recheck_pfmemalloc_active(cachep, ac);
 			return objp;
@@ -1032,7 +1032,7 @@ static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
 {
 	if (unlikely(pfmemalloc_active)) {
 		/* Some pfmemalloc slabs exist, check if this is one */
-		struct page *page = virt_to_page(objp);
+		struct page *page = virt_to_head_page(objp);
 		if (PageSlabPfmemalloc(page))
 			set_obj_pfmemalloc(&objp);
 	}

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 0/2] clean-up initialization of deferrable timer
       [not found] <Yes>
                   ` (7 preceding siblings ...)
  2012-08-25 14:11 ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Joonsoo Kim
@ 2012-10-18 23:18 ` Joonsoo Kim
  2012-10-18 23:18   ` [PATCH 1/2] timer: add setup_timer_deferrable() macro Joonsoo Kim
                     ` (2 more replies)
  2012-10-20 15:48 ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Joonsoo Kim
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-18 23:18 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, Joonsoo Kim

This patchset introduces a setup_timer_deferrable() macro.
Using it makes the code simpler and easier to understand.

This patchset doesn't make any functional difference;
it is just a clean-up.

It is based on v3.7-rc1

Joonsoo Kim (2):
  timer: add setup_timer_deferrable() macro
  timer: use new setup_timer_deferrable() macro

 drivers/acpi/apei/ghes.c                     |    5 ++---
 drivers/ata/libata-core.c                    |    5 ++---
 drivers/net/ethernet/nvidia/forcedeth.c      |   13 ++++---------
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |    4 +---
 drivers/net/vxlan.c                          |    5 ++---
 include/linux/timer.h                        |    2 ++
 kernel/workqueue.c                           |    6 ++----
 net/mac80211/agg-rx.c                        |   12 ++++++------
 net/mac80211/agg-tx.c                        |   12 ++++++------
 net/sched/cls_flow.c                         |    5 ++---
 net/sched/sch_sfq.c                          |    5 ++---
 11 files changed, 31 insertions(+), 43 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 1/2] timer: add setup_timer_deferrable() macro
  2012-10-18 23:18 ` [PATCH 0/2] clean-up initialization of deferrable timer Joonsoo Kim
@ 2012-10-18 23:18   ` Joonsoo Kim
  2012-10-18 23:18   ` [PATCH 2/2] timer: use new " Joonsoo Kim
  2012-10-26 14:08   ` [PATCH 0/2] clean-up initialization of deferrable timer JoonSoo Kim
  2 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-18 23:18 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, Joonsoo Kim

There are several users of deferrable timers.  For lack of a handy
initializer, they have to initialize their deferrable timers by hand,
field by field.  We can do better with a new setup_timer_deferrable()
macro, so add it.

The following patch converts some users of init_timer_deferrable() to
this handy macro.
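
With the macro, a typical conversion would look like this (a
before/after sketch; the foo names are illustrative):

	/* before */
	init_timer_deferrable(&foo->timer);
	foo->timer.function = foo_timeout_fn;
	foo->timer.data = (unsigned long)foo;

	/* after */
	setup_timer_deferrable(&foo->timer, foo_timeout_fn, (unsigned long)foo);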

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197..5950276 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -151,6 +151,8 @@ static inline void init_timer_on_stack_key(struct timer_list *timer,
 
 #define setup_timer(timer, fn, data)					\
 	__setup_timer((timer), (fn), (data), 0)
+#define setup_timer_deferrable(timer, fn, data)				\
+	__setup_timer((timer), (fn), (data), TIMER_DEFERRABLE)
 #define setup_timer_on_stack(timer, fn, data)				\
 	__setup_timer_on_stack((timer), (fn), (data), 0)
 #define setup_deferrable_timer_on_stack(timer, fn, data)		\
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/2] timer: use new setup_timer_deferrable() macro
  2012-10-18 23:18 ` [PATCH 0/2] clean-up initialization of deferrable timer Joonsoo Kim
  2012-10-18 23:18   ` [PATCH 1/2] timer: add setup_timer_deferrable() macro Joonsoo Kim
@ 2012-10-18 23:18   ` Joonsoo Kim
  2012-10-26 14:08   ` [PATCH 0/2] clean-up initialization of deferrable timer JoonSoo Kim
  2 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-18 23:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Joonsoo Kim, Len Brown, Jeff Garzik,
	Jitendra Kalsaria, Ron Mercer, Tejun Heo, Johannes Berg,
	John W. Linville, David S. Miller, Jamal Hadi Salim

Now we have a handy macro for initializing deferrable timers.
Using it makes the code clean and easy to understand.

Additionally, in some driver code, use setup_timer() instead of
init_timer().

This patch doesn't make any functional difference.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Cc: Ron Mercer <ron.mercer@qlogic.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1599566..1707f9a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -944,9 +944,8 @@ static int __devinit ghes_probe(struct platform_device *ghes_dev)
 	}
 	switch (generic->notify.type) {
 	case ACPI_HEST_NOTIFY_POLLED:
-		ghes->timer.function = ghes_poll_func;
-		ghes->timer.data = (unsigned long)ghes;
-		init_timer_deferrable(&ghes->timer);
+		setup_timer_deferrable(&ghes->timer, ghes_poll_func,
+							(unsigned long)ghes);
 		ghes_add_timer(ghes);
 		break;
 	case ACPI_HEST_NOTIFY_EXTERNAL:
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 3cc7096..a811e6f 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5616,9 +5616,8 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
 	INIT_LIST_HEAD(&ap->eh_done_q);
 	init_waitqueue_head(&ap->eh_wait_q);
 	init_completion(&ap->park_req_pending);
-	init_timer_deferrable(&ap->fastdrain_timer);
-	ap->fastdrain_timer.function = ata_eh_fastdrain_timerfn;
-	ap->fastdrain_timer.data = (unsigned long)ap;
+	setup_timer_deferrable(&ap->fastdrain_timer, ata_eh_fastdrain_timerfn,
+							(unsigned long)ap);
 
 	ap->cbl = ATA_CBL_NONE;
 
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 876bece..987e2cf 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -5548,15 +5548,10 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_i
 	spin_lock_init(&np->hwstats_lock);
 	SET_NETDEV_DEV(dev, &pci_dev->dev);
 
-	init_timer(&np->oom_kick);
-	np->oom_kick.data = (unsigned long) dev;
-	np->oom_kick.function = nv_do_rx_refill;	/* timer handler */
-	init_timer(&np->nic_poll);
-	np->nic_poll.data = (unsigned long) dev;
-	np->nic_poll.function = nv_do_nic_poll;	/* timer handler */
-	init_timer_deferrable(&np->stats_poll);
-	np->stats_poll.data = (unsigned long) dev;
-	np->stats_poll.function = nv_do_stats_poll;	/* timer handler */
+	setup_timer(&np->oom_kick, nv_do_rx_refill, (unsigned long)dev);
+	setup_timer(&np->nic_poll, nv_do_nic_poll, (unsigned long)dev);
+	setup_timer_deferrable(&np->stats_poll, nv_do_stats_poll,
+						(unsigned long)dev);
 
 	err = pci_enable_device(pci_dev);
 	if (err)
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index b262d61..1687883 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -4707,9 +4707,7 @@ static int __devinit qlge_probe(struct pci_dev *pdev,
 	/* Start up the timer to trigger EEH if
 	 * the bus goes dead
 	 */
-	init_timer_deferrable(&qdev->timer);
-	qdev->timer.data = (unsigned long)qdev;
-	qdev->timer.function = ql_timer;
+	setup_timer_deferrable(&qdev->timer, ql_timer, (unsigned long)qdev);
 	qdev->timer.expires = jiffies + (5*HZ);
 	add_timer(&qdev->timer);
 	ql_link_off(qdev);
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 607976c..743c7d7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -995,9 +995,8 @@ static void vxlan_setup(struct net_device *dev)
 
 	spin_lock_init(&vxlan->hash_lock);
 
-	init_timer_deferrable(&vxlan->age_timer);
-	vxlan->age_timer.function = vxlan_cleanup;
-	vxlan->age_timer.data = (unsigned long) vxlan;
+	setup_timer_deferrable(&vxlan->age_timer, vxlan_cleanup,
+						(unsigned long)vxlan);
 
 	inet_get_local_port_range(&low, &high);
 	vxlan->port_min = low;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d951daa..77d6f62 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3845,10 +3845,8 @@ static int __init init_workqueues(void)
 			INIT_LIST_HEAD(&pool->worklist);
 			INIT_LIST_HEAD(&pool->idle_list);
 
-			init_timer_deferrable(&pool->idle_timer);
-			pool->idle_timer.function = idle_worker_timeout;
-			pool->idle_timer.data = (unsigned long)pool;
-
+			setup_timer_deferrable(&pool->idle_timer,
+				    idle_worker_timeout, (unsigned long)pool);
 			setup_timer(&pool->mayday_timer, gcwq_mayday_timeout,
 				    (unsigned long)pool);
 
diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
index 186d991..c8985c1 100644
--- a/net/mac80211/agg-rx.c
+++ b/net/mac80211/agg-rx.c
@@ -294,14 +294,14 @@ void ieee80211_process_addba_request(struct ieee80211_local *local,
 	spin_lock_init(&tid_agg_rx->reorder_lock);
 
 	/* rx timer */
-	tid_agg_rx->session_timer.function = sta_rx_agg_session_timer_expired;
-	tid_agg_rx->session_timer.data = (unsigned long)&sta->timer_to_tid[tid];
-	init_timer_deferrable(&tid_agg_rx->session_timer);
+	setup_timer_deferrable(&tid_agg_rx->session_timer,
+				sta_rx_agg_session_timer_expired,
+				(unsigned long)&sta->timer_to_tid[tid]);
 
 	/* rx reorder timer */
-	tid_agg_rx->reorder_timer.function = sta_rx_agg_reorder_timer_expired;
-	tid_agg_rx->reorder_timer.data = (unsigned long)&sta->timer_to_tid[tid];
-	init_timer(&tid_agg_rx->reorder_timer);
+	setup_timer(&tid_agg_rx->reorder_timer,
+				sta_rx_agg_reorder_timer_expired,
+				(unsigned long)&sta->timer_to_tid[tid]);
 
 	/* prepare reordering buffer */
 	tid_agg_rx->reorder_buf =
diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
index 3195a63..cacd75d 100644
--- a/net/mac80211/agg-tx.c
+++ b/net/mac80211/agg-tx.c
@@ -535,14 +535,14 @@ int ieee80211_start_tx_ba_session(struct ieee80211_sta *pubsta, u16 tid,
 	tid_tx->timeout = timeout;
 
 	/* response timer */
-	tid_tx->addba_resp_timer.function = sta_addba_resp_timer_expired;
-	tid_tx->addba_resp_timer.data = (unsigned long)&sta->timer_to_tid[tid];
-	init_timer(&tid_tx->addba_resp_timer);
+	setup_timer(&tid_tx->addba_resp_timer,
+				sta_addba_resp_timer_expired,
+				(unsigned long)&sta->timer_to_tid[tid]);
 
 	/* tx timer */
-	tid_tx->session_timer.function = sta_tx_agg_session_timer_expired;
-	tid_tx->session_timer.data = (unsigned long)&sta->timer_to_tid[tid];
-	init_timer_deferrable(&tid_tx->session_timer);
+	setup_timer_deferrable(&tid_tx->session_timer,
+				sta_tx_agg_session_timer_expired,
+				(unsigned long)&sta->timer_to_tid[tid]);
 
 	/* assign a dialog token */
 	sta->ampdu_mlme.dialog_token_allocator++;
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index ce82d0c..ff1afc5 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -457,9 +457,8 @@ static int flow_change(struct sk_buff *in_skb,
 		f->mask	  = ~0U;
 
 		get_random_bytes(&f->hashrnd, 4);
-		f->perturb_timer.function = flow_perturbation;
-		f->perturb_timer.data = (unsigned long)f;
-		init_timer_deferrable(&f->perturb_timer);
+		setup_timer_deferrable(&f->perturb_timer, flow_perturbation,
+							(unsigned long)f);
 	}
 
 	tcf_exts_change(tp, &f->exts, &e);
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index d3a1bc2..93ff569 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -739,9 +739,8 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	int i;
 
-	q->perturb_timer.function = sfq_perturbation;
-	q->perturb_timer.data = (unsigned long)sch;
-	init_timer_deferrable(&q->perturb_timer);
+	setup_timer_deferrable(&q->perturb_timer, sfq_perturbation,
+						(unsigned long)sch);
 
 	for (i = 0; i < SFQ_MAX_DEPTH + 1; i++) {
 		q->dep[i].next = i + SFQ_MAX_FLOWS;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions
       [not found] <Yes>
                   ` (8 preceding siblings ...)
  2012-10-18 23:18 ` [PATCH 0/2] clean-up initialization of deferrable timer Joonsoo Kim
@ 2012-10-20 15:48 ` Joonsoo Kim
  2012-10-20 15:48   ` [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA Joonsoo Kim
  2012-10-24  8:05   ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Pekka Enberg
  2012-10-20 16:30 ` [PATCH 0/3] workqueue: minor cleanup Joonsoo Kim
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-20 15:48 UTC (permalink / raw)
  To: Pekka Enberg, Andrew Morton
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Christoph Lameter

kmalloc() and kmalloc_node() are always inlined into generic code.
However, there is a mistake in the SLUB implementation.

In SLUB's kmalloc() and kmalloc_node() we compare kmalloc_caches[index]
against NULL.  Since that pointer cannot be known at compile time, the
comparison is emitted into every piece of generic code invoking
kmalloc*.  This may decrease system performance, so we should fix it.
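
To see why, compare the two tests (a sketch; kmalloc_index() folds to a
constant for constant sizes, as in the SLUB):

	/* index is a compile-time constant here, so the compiler can
	 * evaluate and usually delete this branch at the call site: */
	int index = kmalloc_index(size);
	if (!index)
		return ZERO_SIZE_PTR;

	/* kmalloc_caches[index] is loaded at run time, so a NULL test
	 * on the pointer must be emitted into every caller: */
	struct kmem_cache *s = kmalloc_caches[index];
	if (!s)
		return ZERO_SIZE_PTR;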

Below is the result of "size vmlinux"
text size is decreased roughly 20KB

Before:
    text     data      bss       dec      hex filename
10044177  1443168  5722112  17209457  1069871 vmlinux
After:
    text     data      bss       dec      hex filename
10022627  1443136  5722112  17187875  1064423 vmlinux

Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
---
With Christoph's patchset (common kmalloc caches:
'[15/15] Common Kmalloc cache determination'), which is not yet merged into mainline,
this issue will be fixed.
As that will take some time, I am sending this patch for v3.7.

This patch is based on v3.7-rc1

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index df448ad..4c75f2b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -271,9 +271,10 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
 			return kmalloc_large(size, flags);
 
 		if (!(flags & SLUB_DMA)) {
-			struct kmem_cache *s = kmalloc_slab(size);
+			int index = kmalloc_index(size);
+			struct kmem_cache *s = kmalloc_caches[index];
 
-			if (!s)
+			if (!index)
 				return ZERO_SIZE_PTR;
 
 			return kmem_cache_alloc_trace(s, flags, size);
@@ -304,9 +305,10 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) &&
 		size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
-			struct kmem_cache *s = kmalloc_slab(size);
+		int index = kmalloc_index(size);
+		struct kmem_cache *s = kmalloc_caches[index];
 
-		if (!s)
+		if (!index)
 			return ZERO_SIZE_PTR;
 
 		return kmem_cache_alloc_node_trace(s, flags, node, size);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
  2012-10-20 15:48 ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Joonsoo Kim
@ 2012-10-20 15:48   ` Joonsoo Kim
  2012-10-22 14:31     ` Christoph Lameter
  2012-10-24  8:05   ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Pekka Enberg
  1 sibling, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-20 15:48 UTC (permalink / raw)
  To: Pekka Enberg, Andrew Morton
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Christoph Lameter

kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags = __GFP_DMA.
This patch optimizes this case,
so when @flags = __GFP_DMA, they will be inlined into generic code.
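
As a hypothetical call-site illustration (the function and size are invented),
both the index and the DMA branch now resolve at compile time:

static void *sketch_dma_alloc(void)
{
	/* constant size and flags: the compiler selects
	 * kmalloc_dma_caches[index] and drops the other branch */
	return kmalloc(64, GFP_KERNEL | GFP_DMA);
}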

Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 4c75f2b..4adf50b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -147,6 +147,7 @@ struct kmem_cache {
  * 2^x bytes of allocations.
  */
 extern struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -266,19 +267,24 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
 
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+	struct kmem_cache *s;
+	int index;
+
 	if (__builtin_constant_p(size)) {
 		if (size > SLUB_MAX_SIZE)
 			return kmalloc_large(size, flags);
 
-		if (!(flags & SLUB_DMA)) {
-			int index = kmalloc_index(size);
-			struct kmem_cache *s = kmalloc_caches[index];
-
-			if (!index)
-				return ZERO_SIZE_PTR;
+		index = kmalloc_index(size);
+		if (!index)
+			return ZERO_SIZE_PTR;
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA)) {
+			s = kmalloc_dma_caches[index];
+		} else
+#endif
+			s = kmalloc_caches[index];
 
-			return kmem_cache_alloc_trace(s, flags, size);
-		}
+		return kmem_cache_alloc_trace(s, flags, size);
 	}
 	return __kmalloc(size, flags);
 }
@@ -303,13 +309,19 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s,
 
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
-	if (__builtin_constant_p(size) &&
-		size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
-		int index = kmalloc_index(size);
-		struct kmem_cache *s = kmalloc_caches[index];
+	struct kmem_cache *s;
+	int index;
 
+	if (__builtin_constant_p(size) && size <= SLUB_MAX_SIZE) {
+		index = kmalloc_index(size);
 		if (!index)
 			return ZERO_SIZE_PTR;
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA)) {
+			s = kmalloc_dma_caches[index];
+		} else
+#endif
+			s = kmalloc_caches[index];
 
 		return kmem_cache_alloc_node_trace(s, flags, node, size);
 	}
diff --git a/mm/slub.c b/mm/slub.c
index a0d6984..a94533c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3222,7 +3222,8 @@ struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
 EXPORT_SYMBOL(kmalloc_caches);
 
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+EXPORT_SYMBOL(kmalloc_dma_caches);
 #endif
 
 static int __init setup_slub_min_order(char *str)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 0/3] workqueue: minor cleanup
       [not found] <Yes>
                   ` (9 preceding siblings ...)
  2012-10-20 15:48 ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Joonsoo Kim
@ 2012-10-20 16:30 ` Joonsoo Kim
  2012-10-20 16:30   ` [PATCH 1/3] workqueue: optimize mod_delayed_work_on() when @delay == 0 Joonsoo Kim
                     ` (2 more replies)
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
  2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
  12 siblings, 3 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-20 16:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

This patchset does minor cleanup of the workqueue code.

The first patch makes a minor behavior change, but it is trivial.
The others don't make any functional difference.
These are based on v3.7-rc1.

Joonsoo Kim (3):
  workqueue: optimize mod_delayed_work_on() when @delay == 0
  workqueue: trivial fix for return statement in work_busy()
  workqueue: remove unused argument of wq_worker_waking_up()

 kernel/sched/core.c      |    2 +-
 kernel/workqueue.c       |   11 +++++++----
 kernel/workqueue_sched.h |    2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 1/3] workqueue: optimize mod_delayed_work_on() when @delay == 0
  2012-10-20 16:30 ` [PATCH 0/3] workqueue: minor cleanup Joonsoo Kim
@ 2012-10-20 16:30   ` Joonsoo Kim
  2012-10-20 16:30   ` [PATCH 2/3] workqueue: trivial fix for return statement in work_busy() Joonsoo Kim
  2012-10-20 16:30   ` [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up() Joonsoo Kim
  2 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-20 16:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

After try_to_grab_pending(), __queue_delayed_work() is invoked
in mod_delayed_work_on(). When @delay == 0, we can call __queue_work()
directly and avoid setting up a useless timer.
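
A hypothetical caller (the work item and handler are invented) that
benefits; with @delay == 0 the work now goes through the normal queueing
path instead of a zero-length timer:

	static DECLARE_DELAYED_WORK(dwork, sketch_work_fn);	/* invented */

	/* queues 'dwork' immediately rather than arming a 0-jiffy timer */
	mod_delayed_work(system_wq, &dwork, 0);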

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d951daa..c57358e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1477,7 +1477,11 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	} while (unlikely(ret == -EAGAIN));
 
 	if (likely(ret >= 0)) {
-		__queue_delayed_work(cpu, wq, dwork, delay);
+		if (!delay)
+			__queue_work(cpu, wq, &dwork->work);
+		else
+			__queue_delayed_work(cpu, wq, dwork, delay);
+
 		local_irq_restore(flags);
 	}
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/3] workqueue: trivial fix for return statement in work_busy()
  2012-10-20 16:30 ` [PATCH 0/3] workqueue: minor cleanup Joonsoo Kim
  2012-10-20 16:30   ` [PATCH 1/3] workqueue: optimize mod_delayed_work_on() when @delay == 0 Joonsoo Kim
@ 2012-10-20 16:30   ` Joonsoo Kim
  2012-10-20 22:53     ` Tejun Heo
  2012-10-20 16:30   ` [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up() Joonsoo Kim
  2 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-20 16:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

The return type of work_busy() is unsigned int,
but there is a return statement returning the boolean value 'false'.
This is not a problem, because 'false' is treated as '0'.
However, fixing it makes the code more robust.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c57358e..27a6dee 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3479,7 +3479,7 @@ unsigned int work_busy(struct work_struct *work)
 	unsigned int ret = 0;
 
 	if (!gcwq)
-		return false;
+		return 0;
 
 	spin_lock_irqsave(&gcwq->lock, flags);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()
  2012-10-20 16:30 ` [PATCH 0/3] workqueue: minor cleanup Joonsoo Kim
  2012-10-20 16:30   ` [PATCH 1/3] workqueue: optimize mod_delayed_work_on() when @delay == 0 Joonsoo Kim
  2012-10-20 16:30   ` [PATCH 2/3] workqueue: trivial fix for return statement in work_busy() Joonsoo Kim
@ 2012-10-20 16:30   ` Joonsoo Kim
  2012-10-20 22:57     ` Tejun Heo
  2 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-20 16:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Joonsoo Kim

Commit 63d95a91 ('workqueue: use @pool instead of @gcwq or @cpu where
applicable') changed the approach to accessing nr_running.
Thus, wq_worker_waking_up() doesn't use @cpu anymore.
Remove it, along with the comment related to it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..30a23d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1243,7 +1243,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 
 	/* if a worker is waking up, notify workqueue */
 	if (p->flags & PF_WQ_WORKER)
-		wq_worker_waking_up(p, cpu_of(rq));
+		wq_worker_waking_up(p);
 }
 
 /*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 27a6dee..daf101c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -727,7 +727,6 @@ static void wake_up_worker(struct worker_pool *pool)
 /**
  * wq_worker_waking_up - a worker is waking up
  * @task: task waking up
- * @cpu: CPU @task is waking up to
  *
  * This function is called during try_to_wake_up() when a worker is
  * being awoken.
@@ -735,7 +734,7 @@ static void wake_up_worker(struct worker_pool *pool)
  * CONTEXT:
  * spin_lock_irq(rq->lock)
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+void wq_worker_waking_up(struct task_struct *task)
 {
 	struct worker *worker = kthread_data(task);
 
diff --git a/kernel/workqueue_sched.h b/kernel/workqueue_sched.h
index 2d10fc9..c1b45a5 100644
--- a/kernel/workqueue_sched.h
+++ b/kernel/workqueue_sched.h
@@ -4,6 +4,6 @@
  * Scheduler hooks for concurrency managed workqueue.  Only to be
  * included from sched.c and workqueue.c.
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
+void wq_worker_waking_up(struct task_struct *task);
 struct task_struct *wq_worker_sleeping(struct task_struct *task,
 				       unsigned int cpu);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/3] workqueue: trivial fix for return statement in work_busy()
  2012-10-20 16:30   ` [PATCH 2/3] workqueue: trivial fix for return statement in work_busy() Joonsoo Kim
@ 2012-10-20 22:53     ` Tejun Heo
  0 siblings, 0 replies; 146+ messages in thread
From: Tejun Heo @ 2012-10-20 22:53 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

On Sun, Oct 21, 2012 at 01:30:06AM +0900, Joonsoo Kim wrote:
> The return type of work_busy() is unsigned int,
> but there is a return statement returning the boolean value 'false'.
> This is not a problem, because 'false' is treated as '0'.
> However, fixing it makes the code more robust.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>

Applied 1-2 to wq/for-3.8.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()
  2012-10-20 16:30   ` [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up() Joonsoo Kim
@ 2012-10-20 22:57     ` Tejun Heo
  2012-10-23  1:44       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Tejun Heo @ 2012-10-20 22:57 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel

On Sun, Oct 21, 2012 at 01:30:07AM +0900, Joonsoo Kim wrote:
> Commit 63d95a91 ('workqueue: use @pool instead of @gcwq or @cpu where
> applicable') changed the approach to accessing nr_running.
> Thus, wq_worker_waking_up() doesn't use @cpu anymore.
> Remove it, along with the comment related to it.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>

I'm not sure whether I wanna remove or add WARN_ON_ONCE() on it.  That
part has gone through some changes and seen some bugs.  Can we please
do the following instead at least for now?

	if (!(worker->flags & WORKER_NOT_RUNNING)) {
		WARN_ON_ONCE(worker->pool->gcwq->cpu != cpu);
		atomic_inc(get_pool_nr_running(worker->pool));
	}

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
  2012-10-20 15:48   ` [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA Joonsoo Kim
@ 2012-10-22 14:31     ` Christoph Lameter
  2012-10-23  2:29       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Christoph Lameter @ 2012-10-22 14:31 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Pekka Enberg, Andrew Morton, linux-kernel, linux-mm

On Sun, 21 Oct 2012, Joonsoo Kim wrote:

> kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags = __GFP_DMA.
> This patch optimizes this case,
> so when @flags = __GFP_DMA, they will be inlined into generic code.

__GFP_DMA is a rarely used flag for kmalloc allocators, and so far it was
not considered worthwhile to support it directly in the inlining
code.



^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()
  2012-10-20 22:57     ` Tejun Heo
@ 2012-10-23  1:44       ` JoonSoo Kim
  2012-10-24 19:14         ` Tejun Heo
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-23  1:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel

2012/10/21 Tejun Heo <tj@kernel.org>:
> On Sun, Oct 21, 2012 at 01:30:07AM +0900, Joonsoo Kim wrote:
>> Commit 63d95a91 ('workqueue: use @pool instead of @gcwq or @cpu where
>> applicable') changed the approach to accessing nr_running.
>> Thus, wq_worker_waking_up() doesn't use @cpu anymore.
>> Remove it, along with the comment related to it.
>>
>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>
> I'm not sure whether I wanna remove or add WARN_ON_ONCE() on it.  That
> part has gone through some changes and seen some bugs.  Can we please
> do the following instead at least for now?
>
>         if (!(worker->flags & WORKER_NOT_RUNNING)) {
>                 WARN_ON_ONCE(worker->pool->gcwq->cpu != cpu);
>                 atomic_inc(get_pool_nr_running(worker->pool));
>         }
>

I have no objection to doing this for now.
Thanks.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
  2012-10-22 14:31     ` Christoph Lameter
@ 2012-10-23  2:29       ` JoonSoo Kim
  2012-10-23  6:16         ` Eric Dumazet
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-23  2:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, Andrew Morton, linux-kernel, linux-mm

2012/10/22 Christoph Lameter <cl@linux.com>:
> On Sun, 21 Oct 2012, Joonsoo Kim wrote:
>
>> kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags = __GFP_DMA.
>> This patch optimizes this case,
>> so when @flags = __GFP_DMA, they will be inlined into generic code.
>
> __GFP_DMA is a rarely used flag for kmalloc allocators, and so far it was
> not considered worthwhile to support it directly in the inlining
> code.
>
>

Hmm... but the SLAB already does that optimization for __GFP_DMA.
Almost every kmalloc() is invoked with a constant flags value,
so I think the overhead from this patch may be negligible.
With this patch, the code size of vmlinux is reduced slightly.
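
For reference, SLAB's inline kmalloc() already contains roughly this shape
(quoted from memory of the v3.7-era slab_def.h, so treat it as a sketch):

#ifdef CONFIG_ZONE_DMA
		if (flags & GFP_DMA)
			cachep = malloc_sizes[i].cs_dmacachep;
		else
#endif
			cachep = malloc_sizes[i].cs_cachep;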

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
  2012-10-23  2:29       ` JoonSoo Kim
@ 2012-10-23  6:16         ` Eric Dumazet
  2012-10-23 16:12           ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Eric Dumazet @ 2012-10-23  6:16 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: Christoph Lameter, Pekka Enberg, Andrew Morton, linux-kernel, linux-mm

On Tue, 2012-10-23 at 11:29 +0900, JoonSoo Kim wrote:
> 2012/10/22 Christoph Lameter <cl@linux.com>:
> > On Sun, 21 Oct 2012, Joonsoo Kim wrote:
> >
> >> kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags = __GFP_DMA.
> >> This patch optimizes this case,
> >> so when @flags = __GFP_DMA, they will be inlined into generic code.
> >
> > __GFP_DMA is a rarely used flag for kmalloc allocators, and so far it was
> > not considered worthwhile to support it directly in the inlining
> > code.
> >
> >
> 
> Hmm... but the SLAB already does that optimization for __GFP_DMA.
> Almost every kmalloc() is invoked with a constant flags value,
> so I think the overhead from this patch may be negligible.
> With this patch, the code size of vmlinux is reduced slightly.

Only because you used an allyesconfig

GFP_DMA is used for less than 0.1 % of kmalloc() calls, for legacy
hardware (from last century)


In fact, if you want to reduce your vmlinux even more, you could test

if (__builtin_constant_p(flags) && (flags & SLUB_DMA))
    return kmem_cache_alloc_trace(s, flags, size);

to force the call to out of line code.





^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
  2012-10-23  6:16         ` Eric Dumazet
@ 2012-10-23 16:12           ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-23 16:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Pekka Enberg, Andrew Morton, linux-kernel, linux-mm

Hi, Eric.

2012/10/23 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tue, 2012-10-23 at 11:29 +0900, JoonSoo Kim wrote:
>> 2012/10/22 Christoph Lameter <cl@linux.com>:
>> > On Sun, 21 Oct 2012, Joonsoo Kim wrote:
>> >
>> >> kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags = __GFP_DMA.
>> >> This patch optimizes this case,
>> >> so when @flags = __GFP_DMA, they will be inlined into generic code.
>> >
>> > __GFP_DMA is a rarely used flag for kmalloc allocators, and so far it was
>> > not considered worthwhile to support it directly in the inlining
>> > code.
>> >
>> >
>>
>> Hmm... but the SLAB already does that optimization for __GFP_DMA.
>> Almost every kmalloc() is invoked with a constant flags value,
>> so I think the overhead from this patch may be negligible.
>> With this patch, the code size of vmlinux is reduced slightly.
>
> Only because you used an allyesconfig
>
> GFP_DMA is used for less than 0.1 % of kmalloc() calls, for legacy
> hardware (from last century)

I'm not using allyesconfig,
but localmodconfig on my Ubuntu desktop system.
On my system, 700 bytes of vmlinux text is reduced,
which means there may be more than 100 callsites with GFP_DMA.

> In fact, if you want to reduce your vmlinux even more, you could test
>
> if (__builtin_constant_p(flags) && (flags & SLUB_DMA))
>     return kmem_cache_alloc_trace(s, flags, size);
>
> to force the call to out of line code.

The reason why I mention code size is that I want to say it may
be good for performance,
although it has just a small impact.
I'm not interested in reducing code size :)

Thanks for comment.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions
  2012-10-20 15:48 ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Joonsoo Kim
  2012-10-20 15:48   ` [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA Joonsoo Kim
@ 2012-10-24  8:05   ` Pekka Enberg
  2012-10-24 13:36     ` Christoph Lameter
  1 sibling, 1 reply; 146+ messages in thread
From: Pekka Enberg @ 2012-10-24  8:05 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Andrew Morton, linux-kernel, linux-mm, Christoph Lameter

On Sat, Oct 20, 2012 at 6:48 PM, Joonsoo Kim <js1304@gmail.com> wrote:
> kmalloc() and kmalloc_node() are always inlined into generic code.
> However, there is a mistake in the SLUB implementation.
>
> In SLUB's kmalloc() and kmalloc_node(),
> we compare kmalloc_caches[index] with NULL.
> As this cannot be known at compile time,
> the comparison is inserted into the generic code invoking kmalloc*.
> This may decrease system performance, so we should fix it.
>
> Below is the result of "size vmlinux".
> The text size is decreased by roughly 20KB.
>
> Before:
>     text     data      bss       dec      hex filename
> 10044177  1443168  5722112  17209457  1069871 vmlinux
> After:
>     text     data      bss       dec      hex filename
> 10022627  1443136  5722112  17187875  1064423 vmlinux
>
> Cc: Christoph Lameter <cl@linux.com>
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> ---
> With Christoph's patchset (common kmalloc caches:
> '[15/15] Common Kmalloc cache determination'), which is not yet merged into mainline,
> this issue will be fixed.
> As that will take some time, I am sending this patch for v3.7.
>
> This patch is based on v3.7-rc1

Looks reasonable to me. Christoph, any objections?

>
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index df448ad..4c75f2b 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -271,9 +271,10 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
>                         return kmalloc_large(size, flags);
>
>                 if (!(flags & SLUB_DMA)) {
> -                       struct kmem_cache *s = kmalloc_slab(size);
> +                       int index = kmalloc_index(size);
> +                       struct kmem_cache *s = kmalloc_caches[index];
>
> -                       if (!s)
> +                       if (!index)
>                                 return ZERO_SIZE_PTR;
>
>                         return kmem_cache_alloc_trace(s, flags, size);
> @@ -304,9 +305,10 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
>  {
>         if (__builtin_constant_p(size) &&
>                 size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
> -                       struct kmem_cache *s = kmalloc_slab(size);
> +               int index = kmalloc_index(size);
> +               struct kmem_cache *s = kmalloc_caches[index];
>
> -               if (!s)
> +               if (!index)
>                         return ZERO_SIZE_PTR;
>
>                 return kmem_cache_alloc_node_trace(s, flags, node, size);
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions
  2012-10-24  8:05   ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Pekka Enberg
@ 2012-10-24 13:36     ` Christoph Lameter
  0 siblings, 0 replies; 146+ messages in thread
From: Christoph Lameter @ 2012-10-24 13:36 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Joonsoo Kim, Andrew Morton, linux-kernel, linux-mm

On Wed, 24 Oct 2012, Pekka Enberg wrote:

> Looks reasonable to me. Christoph, any objections?

I am fine with it. It's going to be short-lived because my latest patchset
will do the same. Can we merge this for 3.7?

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()
  2012-10-23  1:44       ` JoonSoo Kim
@ 2012-10-24 19:14         ` Tejun Heo
  0 siblings, 0 replies; 146+ messages in thread
From: Tejun Heo @ 2012-10-24 19:14 UTC (permalink / raw)
  To: JoonSoo Kim; +Cc: linux-kernel

On Tue, Oct 23, 2012 at 10:44:49AM +0900, JoonSoo Kim wrote:
> >         if (!(worker->flags & WORKER_NOT_RUNNING)) {
> >                 WARN_ON_ONCE(worker->pool->gcwq->cpu != cpu);
> >                 atomic_inc(get_pool_nr_running(worker->pool));
> >         }
> >
> 
> I have no objection to doing this for now.

Care to cook up a patch?

Thanks.

--
tejun


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 0/2] clean-up initialization of deferrable timer
  2012-10-18 23:18 ` [PATCH 0/2] clean-up initialization of deferrable timer Joonsoo Kim
  2012-10-18 23:18   ` [PATCH 1/2] timer: add setup_timer_deferrable() macro Joonsoo Kim
  2012-10-18 23:18   ` [PATCH 2/2] timer: use new " Joonsoo Kim
@ 2012-10-26 14:08   ` JoonSoo Kim
  2 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-26 14:08 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, Joonsoo Kim

2012/10/19 Joonsoo Kim <js1304@gmail.com>:
> This patchset introduces the setup_timer_deferrable() macro.
> Using it makes the code simpler and more understandable.
>
> This patchset doesn't make any functional difference.
> It is just for clean-up.
>
> It is based on v3.7-rc1
>
> Joonsoo Kim (2):
>   timer: add setup_timer_deferrable() macro
>   timer: use new setup_timer_deferrable() macro
>
>  drivers/acpi/apei/ghes.c                     |    5 ++---
>  drivers/ata/libata-core.c                    |    5 ++---
>  drivers/net/ethernet/nvidia/forcedeth.c      |   13 ++++---------
>  drivers/net/ethernet/qlogic/qlge/qlge_main.c |    4 +---
>  drivers/net/vxlan.c                          |    5 ++---
>  include/linux/timer.h                        |    2 ++
>  kernel/workqueue.c                           |    6 ++----
>  net/mac80211/agg-rx.c                        |   12 ++++++------
>  net/mac80211/agg-tx.c                        |   12 ++++++------
>  net/sched/cls_flow.c                         |    5 ++---
>  net/sched/sch_sfq.c                          |    5 ++---
>  11 files changed, 31 insertions(+), 43 deletions(-)
>
> --
> 1.7.9.5

Hello, Thomas.
Will you pick this for your tree?
Or is there anything wrong with it?

Thanks.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 0/5] minor clean-up and optimize highmem related code
       [not found] <Yes>
                   ` (10 preceding siblings ...)
  2012-10-20 16:30 ` [PATCH 0/3] workqueue: minor cleanup Joonsoo Kim
@ 2012-10-28 19:12 ` Joonsoo Kim
  2012-10-28 19:12   ` [PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
                     ` (5 more replies)
  2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
  12 siblings, 6 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-28 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Joonsoo Kim

This patchset cleans up and optimizes highmem related code.

[1] is just clean-up and doesn't introduce any functional change.
[2-3] are for clean-up and optimization.
These eliminate a useless lock operation and list management.
[4-5] are for optimization related to flush_all_zero_pkmaps().

Joonsoo Kim (5):
  mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  mm, highmem: remove useless pool_lock
  mm, highmem: remove page_address_pool list
  mm, highmem: makes flush_all_zero_pkmaps() return index of last
    flushed entry
  mm, highmem: get virtual address of the page using PKMAP_ADDR()

 include/linux/highmem.h |    1 +
 mm/highmem.c            |  102 ++++++++++++++++++++---------------------------
 2 files changed, 45 insertions(+), 58 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
@ 2012-10-28 19:12   ` Joonsoo Kim
  2012-10-29  1:48     ` Minchan Kim
  2012-10-28 19:12   ` [PATCH 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-28 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman

To calculate an index into the pkmap array, using PKMAP_NR() is more
understandable and maintainable, so change the code to use it.
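
For reference, the arch headers define the helper pair roughly as below
(x86's <asm/highmem.h>, for example), so PKMAP_NR() is simply the inverse
of PKMAP_ADDR() and states the intent directly:

#define PKMAP_NR(virt)	((virt - PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr)	(PKMAP_BASE + ((nr) << PAGE_SHIFT))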

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/mm/highmem.c b/mm/highmem.c
index d517cd1..b3b3d68 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -99,7 +99,7 @@ struct page *kmap_to_page(void *vaddr)
 	unsigned long addr = (unsigned long)vaddr;
 
 	if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) {
-		int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT;
+		int i = PKMAP_NR(addr);
 		return pte_page(pkmap_page_table[i]);
 	}
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 2/5] mm, highmem: remove useless pool_lock
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
  2012-10-28 19:12   ` [PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
@ 2012-10-28 19:12   ` Joonsoo Kim
  2012-10-29  1:52     ` Minchan Kim
  2012-10-30 21:31     ` Andrew Morton
  2012-10-28 19:12   ` [PATCH 3/5] mm, highmem: remove page_address_pool list Joonsoo Kim
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-28 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Joonsoo Kim

The pool_lock protects the page_address_pool from concurrent access.
But, access to the page_address_pool is already protected by kmap_lock.
So remove it.
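
A minimal sketch of the nesting that makes pool_lock redundant, assuming
every set_page_address() caller already runs under lock_kmap(), as argued
above:

	lock_kmap();			/* takes kmap_lock */
	set_page_address(page, vaddr);	/* touches page_address_pool */
	unlock_kmap();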

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/mm/highmem.c b/mm/highmem.c
index b3b3d68..017bad1 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -328,7 +328,6 @@ struct page_address_map {
  * page_address_map freelist, allocated from page_address_maps.
  */
 static struct list_head page_address_pool;	/* freelist */
-static spinlock_t pool_lock;			/* protects page_address_pool */
 
 /*
  * Hash table bucket
@@ -395,11 +394,9 @@ void set_page_address(struct page *page, void *virtual)
 	if (virtual) {		/* Add */
 		BUG_ON(list_empty(&page_address_pool));
 
-		spin_lock_irqsave(&pool_lock, flags);
 		pam = list_entry(page_address_pool.next,
 				struct page_address_map, list);
 		list_del(&pam->list);
-		spin_unlock_irqrestore(&pool_lock, flags);
 
 		pam->page = page;
 		pam->virtual = virtual;
@@ -413,9 +410,7 @@ void set_page_address(struct page *page, void *virtual)
 			if (pam->page == page) {
 				list_del(&pam->list);
 				spin_unlock_irqrestore(&pas->lock, flags);
-				spin_lock_irqsave(&pool_lock, flags);
 				list_add_tail(&pam->list, &page_address_pool);
-				spin_unlock_irqrestore(&pool_lock, flags);
 				goto done;
 			}
 		}
@@ -438,7 +433,6 @@ void __init page_address_init(void)
 		INIT_LIST_HEAD(&page_address_htable[i].lh);
 		spin_lock_init(&page_address_htable[i].lock);
 	}
-	spin_lock_init(&pool_lock);
 }
 
 #endif	/* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 3/5] mm, highmem: remove page_address_pool list
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
  2012-10-28 19:12   ` [PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
  2012-10-28 19:12   ` [PATCH 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
@ 2012-10-28 19:12   ` Joonsoo Kim
  2012-10-29  1:57     ` Minchan Kim
  2012-10-28 19:12   ` [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry Joonsoo Kim
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-28 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Joonsoo Kim

We can find a free page_address_map instance without the page_address_pool,
so remove it.
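
The key observation, sketched with the identifiers from this patch, is that
each pkmap slot owns exactly one page_address_map, so the entry can be
derived from the mapped virtual address instead of taken from a freelist:

	/* pkmap slot i  <->  page_address_maps[i] */
	pam = &page_address_maps[PKMAP_NR((unsigned long)virtual)];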

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/mm/highmem.c b/mm/highmem.c
index 017bad1..731cf9a 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -323,11 +323,7 @@ struct page_address_map {
 	void *virtual;
 	struct list_head list;
 };
-
-/*
- * page_address_map freelist, allocated from page_address_maps.
- */
-static struct list_head page_address_pool;	/* freelist */
+static struct page_address_map page_address_maps[LAST_PKMAP];
 
 /*
  * Hash table bucket
@@ -392,12 +388,7 @@ void set_page_address(struct page *page, void *virtual)
 
 	pas = page_slot(page);
 	if (virtual) {		/* Add */
-		BUG_ON(list_empty(&page_address_pool));
-
-		pam = list_entry(page_address_pool.next,
-				struct page_address_map, list);
-		list_del(&pam->list);
-
+		pam = &page_address_maps[PKMAP_NR((unsigned long)virtual)];
 		pam->page = page;
 		pam->virtual = virtual;
 
@@ -410,7 +401,6 @@ void set_page_address(struct page *page, void *virtual)
 			if (pam->page == page) {
 				list_del(&pam->list);
 				spin_unlock_irqrestore(&pas->lock, flags);
-				list_add_tail(&pam->list, &page_address_pool);
 				goto done;
 			}
 		}
@@ -420,15 +410,10 @@ done:
 	return;
 }
 
-static struct page_address_map page_address_maps[LAST_PKMAP];
-
 void __init page_address_init(void)
 {
 	int i;
 
-	INIT_LIST_HEAD(&page_address_pool);
-	for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)
-		list_add(&page_address_maps[i].list, &page_address_pool);
 	for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
 		INIT_LIST_HEAD(&page_address_htable[i].lh);
 		spin_lock_init(&page_address_htable[i].lock);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
                     ` (2 preceding siblings ...)
  2012-10-28 19:12   ` [PATCH 3/5] mm, highmem: remove page_address_pool list Joonsoo Kim
@ 2012-10-28 19:12   ` Joonsoo Kim
  2012-10-29  2:06     ` Minchan Kim
  2012-10-28 19:12   ` [PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR() Joonsoo Kim
  2012-10-29  2:12   ` [PATCH 0/5] minor clean-up and optimize highmem related code Minchan Kim
  5 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-28 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Joonsoo Kim

In the current code, after flush_all_zero_pkmaps() is invoked,
we re-iterate over all pkmaps. This can be optimized if flush_all_zero_pkmaps()
returns the index of a flushed entry. With this index,
we can immediately map a highmem page to the virtual address represented by it.
So change the return type of flush_all_zero_pkmaps()
and return the index of the last flushed entry.
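
In sketch form (using the names introduced by this patch), the caller
changes from "flush, then rescan all pkmaps" to "flush and reuse the
returned slot":

	/* inside the map_new_virtual() scan loop */
	index = flush_all_zero_pkmaps();
	if (index != PKMAP_INDEX_INVAL)
		break;	/* reuse the slot: map the page at PKMAP_ADDR(index) */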

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ef788b5..0683869 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
 
 #ifdef CONFIG_HIGHMEM
 #include <asm/highmem.h>
+#define PKMAP_INDEX_INVAL (-1)
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
diff --git a/mm/highmem.c b/mm/highmem.c
index 731cf9a..65beb9a 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
 	return virt_to_page(addr);
 }
 
-static void flush_all_zero_pkmaps(void)
+static int flush_all_zero_pkmaps(void)
 {
 	int i;
-	int need_flush = 0;
+	int index = PKMAP_INDEX_INVAL;
 
 	flush_cache_kmaps();
 
@@ -141,10 +141,12 @@ static void flush_all_zero_pkmaps(void)
 			  &pkmap_page_table[i]);
 
 		set_page_address(page, NULL);
-		need_flush = 1;
+		index = i;
 	}
-	if (need_flush)
+	if (index != PKMAP_INDEX_INVAL)
 		flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+
+	return index;
 }
 
 /**
@@ -160,6 +162,7 @@ void kmap_flush_unused(void)
 static inline unsigned long map_new_virtual(struct page *page)
 {
 	unsigned long vaddr;
+	int index = PKMAP_INDEX_INVAL;
 	int count;
 
 start:
@@ -168,40 +171,45 @@ start:
 	for (;;) {
 		last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
 		if (!last_pkmap_nr) {
-			flush_all_zero_pkmaps();
-			count = LAST_PKMAP;
+			index = flush_all_zero_pkmaps();
+			if (index != PKMAP_INDEX_INVAL)
+				break; /* Found a usable entry */
 		}
-		if (!pkmap_count[last_pkmap_nr])
+		if (!pkmap_count[last_pkmap_nr]) {
+			index = last_pkmap_nr;
 			break;	/* Found a usable entry */
-		if (--count)
-			continue;
-
-		/*
-		 * Sleep for somebody else to unmap their entries
-		 */
-		{
-			DECLARE_WAITQUEUE(wait, current);
-
-			__set_current_state(TASK_UNINTERRUPTIBLE);
-			add_wait_queue(&pkmap_map_wait, &wait);
-			unlock_kmap();
-			schedule();
-			remove_wait_queue(&pkmap_map_wait, &wait);
-			lock_kmap();
-
-			/* Somebody else might have mapped it while we slept */
-			if (page_address(page))
-				return (unsigned long)page_address(page);
-
-			/* Re-start */
-			goto start;
 		}
+		if (--count == 0)
+			break;
 	}
-	vaddr = PKMAP_ADDR(last_pkmap_nr);
+
+	/*
+	 * Sleep for somebody else to unmap their entries
+	 */
+	if (index == PKMAP_INDEX_INVAL) {
+		DECLARE_WAITQUEUE(wait, current);
+
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		add_wait_queue(&pkmap_map_wait, &wait);
+		unlock_kmap();
+		schedule();
+		remove_wait_queue(&pkmap_map_wait, &wait);
+		lock_kmap();
+
+		/* Somebody else might have mapped it while we slept */
+		vaddr = (unsigned long)page_address(page);
+		if (vaddr)
+			return vaddr;
+
+		/* Re-start */
+		goto start;
+	}
+
+	vaddr = PKMAP_ADDR(index);
 	set_pte_at(&init_mm, vaddr,
-		   &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
+		   &(pkmap_page_table[index]), mk_pte(page, kmap_prot));
 
-	pkmap_count[last_pkmap_nr] = 1;
+	pkmap_count[index] = 1;
 	set_page_address(page, (void *)vaddr);
 
 	return vaddr;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
                     ` (3 preceding siblings ...)
  2012-10-28 19:12   ` [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry Joonsoo Kim
@ 2012-10-28 19:12   ` Joonsoo Kim
  2012-10-29  2:09     ` Minchan Kim
  2012-10-29  2:12   ` [PATCH 0/5] minor clean-up and optimize highmem related code Minchan Kim
  5 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-28 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Joonsoo Kim

In flush_all_zero_pkmaps(), we have the index of the pkmap associated with the page.
Using this index, we can simply get the virtual address of the page.
So change it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/mm/highmem.c b/mm/highmem.c
index 65beb9a..1417f4f 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -137,8 +137,7 @@ static int flush_all_zero_pkmaps(void)
 		 * So no dangers, even with speculative execution.
 		 */
 		page = pte_page(pkmap_page_table[i]);
-		pte_clear(&init_mm, (unsigned long)page_address(page),
-			  &pkmap_page_table[i]);
+		pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
 		set_page_address(page, NULL);
 		index = i;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  2012-10-28 19:12   ` [PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
@ 2012-10-29  1:48     ` Minchan Kim
  0 siblings, 0 replies; 146+ messages in thread
From: Minchan Kim @ 2012-10-29  1:48 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Mon, Oct 29, 2012 at 04:12:52AM +0900, Joonsoo Kim wrote:
> To calculate an index into the pkmap array, using PKMAP_NR() is more
> understandable and maintainable, so change the code to use it.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] mm, highmem: remove useless pool_lock
  2012-10-28 19:12   ` [PATCH 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
@ 2012-10-29  1:52     ` Minchan Kim
  2012-10-30 21:31     ` Andrew Morton
  1 sibling, 0 replies; 146+ messages in thread
From: Minchan Kim @ 2012-10-29  1:52 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

On Mon, Oct 29, 2012 at 04:12:53AM +0900, Joonsoo Kim wrote:
> The pool_lock protects the page_address_pool from concurrent access.
> But, access to the page_address_pool is already protected by kmap_lock.
> So remove it.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Looks good to me.
Just a nitpick.

Please write a comment about the locking rule, like below.

> 
> diff --git a/mm/highmem.c b/mm/highmem.c
> index b3b3d68..017bad1 100644
> --- a/mm/highmem.c
> +++ b/mm/highmem.c
> @@ -328,7 +328,6 @@ struct page_address_map {
>   * page_address_map freelist, allocated from page_address_maps.
>   */

/* page_address_pool is protected by kmap_lock */

>  static struct list_head page_address_pool;	/* freelist */
> -static spinlock_t pool_lock;			/* protects page_address_pool */
>  

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 3/5] mm, highmem: remove page_address_pool list
  2012-10-28 19:12   ` [PATCH 3/5] mm, highmem: remove page_address_pool list Joonsoo Kim
@ 2012-10-29  1:57     ` Minchan Kim
  0 siblings, 0 replies; 146+ messages in thread
From: Minchan Kim @ 2012-10-29  1:57 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

On Mon, Oct 29, 2012 at 04:12:54AM +0900, Joonsoo Kim wrote:
> We can find a free page_address_map instance without the page_address_pool,
> so remove it.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

See the nitpick below. :)

> 
> diff --git a/mm/highmem.c b/mm/highmem.c
> index 017bad1..731cf9a 100644
> --- a/mm/highmem.c
> +++ b/mm/highmem.c
> @@ -323,11 +323,7 @@ struct page_address_map {
>  	void *virtual;
>  	struct list_head list;
>  };
> -

Let's leave a blank line.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry
  2012-10-28 19:12   ` [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry Joonsoo Kim
@ 2012-10-29  2:06     ` Minchan Kim
  2012-10-29 13:12       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Minchan Kim @ 2012-10-29  2:06 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

On Mon, Oct 29, 2012 at 04:12:55AM +0900, Joonsoo Kim wrote:
> In the current code, after flush_all_zero_pkmaps() is invoked,
> we re-iterate over all pkmaps. This can be optimized if flush_all_zero_pkmaps()
> returns the index of a flushed entry. With this index,
> we can immediately map a highmem page to the virtual address represented by it.
> So change the return type of flush_all_zero_pkmaps()
> and return the index of the last flushed entry.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ef788b5..0683869 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>  
>  #ifdef CONFIG_HIGHMEM
>  #include <asm/highmem.h>
> +#define PKMAP_INDEX_INVAL (-1)

How about this?

#define PKMAP_INVALID_INDEX (-1)

>  
>  /* declarations for linux/mm/highmem.c */
>  unsigned int nr_free_highpages(void);
> diff --git a/mm/highmem.c b/mm/highmem.c
> index 731cf9a..65beb9a 100644
> --- a/mm/highmem.c
> +++ b/mm/highmem.c
> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>  	return virt_to_page(addr);
>  }
>  
> -static void flush_all_zero_pkmaps(void)
> +static int flush_all_zero_pkmaps(void)
>  {
>  	int i;
> -	int need_flush = 0;
> +	int index = PKMAP_INDEX_INVAL;
>  
>  	flush_cache_kmaps();
>  
> @@ -141,10 +141,12 @@ static void flush_all_zero_pkmaps(void)
>  			  &pkmap_page_table[i]);
>  
>  		set_page_address(page, NULL);
> -		need_flush = 1;
> +		index = i;

How about returning the first free index instead of the last one,
and updating last_pkmap_nr to it?

>  	}
> -	if (need_flush)
> +	if (index != PKMAP_INDEX_INVAL)
>  		flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
> +
> +	return index;
>  }
>  
>  /**
> @@ -160,6 +162,7 @@ void kmap_flush_unused(void)
>  static inline unsigned long map_new_virtual(struct page *page)
>  {
>  	unsigned long vaddr;
> +	int index = PKMAP_INDEX_INVAL;
>  	int count;
>  
>  start:
> @@ -168,40 +171,45 @@ start:
>  	for (;;) {
>  		last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
>  		if (!last_pkmap_nr) {
> -			flush_all_zero_pkmaps();
> -			count = LAST_PKMAP;
> +			index = flush_all_zero_pkmaps();
> +			if (index != PKMAP_INDEX_INVAL)
> +				break; /* Found a usable entry */
>  		}
> -		if (!pkmap_count[last_pkmap_nr])
> +		if (!pkmap_count[last_pkmap_nr]) {
> +			index = last_pkmap_nr;
>  			break;	/* Found a usable entry */
> -		if (--count)
> -			continue;
> -
> -		/*
> -		 * Sleep for somebody else to unmap their entries
> -		 */
> -		{
> -			DECLARE_WAITQUEUE(wait, current);
> -
> -			__set_current_state(TASK_UNINTERRUPTIBLE);
> -			add_wait_queue(&pkmap_map_wait, &wait);
> -			unlock_kmap();
> -			schedule();
> -			remove_wait_queue(&pkmap_map_wait, &wait);
> -			lock_kmap();
> -
> -			/* Somebody else might have mapped it while we slept */
> -			if (page_address(page))
> -				return (unsigned long)page_address(page);
> -
> -			/* Re-start */
> -			goto start;
>  		}
> +		if (--count == 0)
> +			break;
>  	}
> -	vaddr = PKMAP_ADDR(last_pkmap_nr);
> +
> +	/*
> +	 * Sleep for somebody else to unmap their entries
> +	 */
> +	if (index == PKMAP_INDEX_INVAL) {
> +		DECLARE_WAITQUEUE(wait, current);
> +
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		add_wait_queue(&pkmap_map_wait, &wait);
> +		unlock_kmap();
> +		schedule();
> +		remove_wait_queue(&pkmap_map_wait, &wait);
> +		lock_kmap();
> +
> +		/* Somebody else might have mapped it while we slept */
> +		vaddr = (unsigned long)page_address(page);
> +		if (vaddr)
> +			return vaddr;
> +
> +		/* Re-start */
> +		goto start;
> +	}
> +
> +	vaddr = PKMAP_ADDR(index);
>  	set_pte_at(&init_mm, vaddr,
> -		   &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
> +		   &(pkmap_page_table[index]), mk_pte(page, kmap_prot));
>  
> -	pkmap_count[last_pkmap_nr] = 1;
> +	pkmap_count[index] = 1;
>  	set_page_address(page, (void *)vaddr);
>  
>  	return vaddr;
> -- 
> 1.7.9.5
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()
  2012-10-28 19:12   ` [PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR() Joonsoo Kim
@ 2012-10-29  2:09     ` Minchan Kim
  0 siblings, 0 replies; 146+ messages in thread
From: Minchan Kim @ 2012-10-29  2:09 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

On Mon, Oct 29, 2012 at 04:12:56AM +0900, Joonsoo Kim wrote:
> In flush_all_zero_pkmaps(), we have the index of the pkmap associated with the page.
> Using this index, we can simply get the virtual address of the page.
> So change it.
> 
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 0/5] minor clean-up and optimize highmem related code
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
                     ` (4 preceding siblings ...)
  2012-10-28 19:12   ` [PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR() Joonsoo Kim
@ 2012-10-29  2:12   ` Minchan Kim
  2012-10-29 13:15     ` JoonSoo Kim
  5 siblings, 1 reply; 146+ messages in thread
From: Minchan Kim @ 2012-10-29  2:12 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Andrew Morton, linux-kernel, linux-mm, Peter Zijlstra

Hi Joonsoo,

On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote:
> This patchset cleans up and optimizes highmem related code.
>
> [1] is just clean-up and doesn't introduce any functional change.
> [2-3] are for clean-up and optimization.
> These eliminate a useless lock operation and list management.
> [4-5] are for optimization related to flush_all_zero_pkmaps().
> 
> Joonsoo Kim (5):
>   mm, highmem: use PKMAP_NR() to calculate an index of pkmap
>   mm, highmem: remove useless pool_lock
>   mm, highmem: remove page_address_pool list
>   mm, highmem: makes flush_all_zero_pkmaps() return index of last
>     flushed entry
>   mm, highmem: get virtual address of the page using PKMAP_ADDR()

This patchset looks awesome to me.
If you have a plan to respin, please CC Peter.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry
  2012-10-29  2:06     ` Minchan Kim
@ 2012-10-29 13:12       ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-29 13:12 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

2012/10/29 Minchan Kim <minchan@kernel.org>:
> On Mon, Oct 29, 2012 at 04:12:55AM +0900, Joonsoo Kim wrote:
>> In the current code, after flush_all_zero_pkmaps() is invoked,
>> we re-iterate over all pkmaps. This can be optimized if flush_all_zero_pkmaps()
>> returns the index of a flushed entry. With this index,
>> we can immediately map a highmem page to the virtual address represented by it.
>> So change the return type of flush_all_zero_pkmaps()
>> and return the index of the last flushed entry.
>>
>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>>
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index ef788b5..0683869 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>>
>>  #ifdef CONFIG_HIGHMEM
>>  #include <asm/highmem.h>
>> +#define PKMAP_INDEX_INVAL (-1)
>
> How about this?
>
> #define PKMAP_INVALID_INDEX (-1)

Okay.

>>
>>  /* declarations for linux/mm/highmem.c */
>>  unsigned int nr_free_highpages(void);
>> diff --git a/mm/highmem.c b/mm/highmem.c
>> index 731cf9a..65beb9a 100644
>> --- a/mm/highmem.c
>> +++ b/mm/highmem.c
>> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>>       return virt_to_page(addr);
>>  }
>>
>> -static void flush_all_zero_pkmaps(void)
>> +static int flush_all_zero_pkmaps(void)
>>  {
>>       int i;
>> -     int need_flush = 0;
>> +     int index = PKMAP_INDEX_INVAL;
>>
>>       flush_cache_kmaps();
>>
>> @@ -141,10 +141,12 @@ static void flush_all_zero_pkmaps(void)
>>                         &pkmap_page_table[i]);
>>
>>               set_page_address(page, NULL);
>> -             need_flush = 1;
>> +             index = i;
>
> How about returning the first free index instead of the last one,
> and updating last_pkmap_nr to it?

Okay. That will be better.

Thanks.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 0/5] minor clean-up and optimize highmem related code
  2012-10-29  2:12   ` [PATCH 0/5] minor clean-up and optimize highmem related code Minchan Kim
@ 2012-10-29 13:15     ` JoonSoo Kim
  2012-10-31 17:11       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-29 13:15 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-kernel, linux-mm, Peter Zijlstra

Hi, Minchan.

2012/10/29 Minchan Kim <minchan@kernel.org>:
> Hi Joonsoo,
>
> On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote:
>> This patchset cleans up and optimizes highmem related code.
>>
>> [1] is just clean-up and doesn't introduce any functional change.
>> [2-3] are for clean-up and optimization.
>> These eliminate a useless lock operation and list management.
>> [4-5] are for optimization related to flush_all_zero_pkmaps().
>>
>> Joonsoo Kim (5):
>>   mm, highmem: use PKMAP_NR() to calculate an index of pkmap
>>   mm, highmem: remove useless pool_lock
>>   mm, highmem: remove page_address_pool list
>>   mm, highmem: makes flush_all_zero_pkmaps() return index of last
>>     flushed entry
>>   mm, highmem: get virtual address of the page using PKMAP_ADDR()
>
> This patchset looks awesome to me.
> If you have a plan to respin, please CC Peter.

Thanks for the review.
I will wait for more reviews and respin the day after tomorrow.
Version 2 will include fixes for your comments.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] mm, highmem: remove useless pool_lock
  2012-10-28 19:12   ` [PATCH 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
  2012-10-29  1:52     ` Minchan Kim
@ 2012-10-30 21:31     ` Andrew Morton
  2012-10-31  5:14       ` Minchan Kim
  2012-10-31 15:01       ` JoonSoo Kim
  1 sibling, 2 replies; 146+ messages in thread
From: Andrew Morton @ 2012-10-30 21:31 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-kernel, linux-mm

On Mon, 29 Oct 2012 04:12:53 +0900
Joonsoo Kim <js1304@gmail.com> wrote:

> The pool_lock protects the page_address_pool from concurrent access.
> But, access to the page_address_pool is already protected by kmap_lock.
> So remove it.

Well, there's a set_page_address() call in mm/page_alloc.c which
doesn't have lock_kmap().  It doesn't *need* lock_kmap() because it's
init-time code and we're running single-threaded there.  I hope!

But this exception should be double-checked and mentioned in the
changelog, please.  And it's a reason why we can't add
assert_spin_locked(&kmap_lock) to set_page_address(), which is
unfortunate.


The irq-disabling in this code is odd.  If ARCH_NEEDS_KMAP_HIGH_GET=n,
we didn't need irq-safe locking in set_page_address().  I guess we'll
need to retain it in page_address() - I expect some callers have IRQs
disabled.


ARCH_NEEDS_KMAP_HIGH_GET is a nasty looking thing.  It's ARM:

/*
 * The reason for kmap_high_get() is to ensure that the currently kmap'd
 * page usage count does not decrease to zero while we're using its
 * existing virtual mapping in an atomic context.  With a VIVT cache this
 * is essential to do, but with a VIPT cache this is only an optimization
 * so not to pay the price of establishing a second mapping if an existing
 * one can be used.  However, on platforms without hardware TLB maintenance
 * broadcast, we simply cannot use ARCH_NEEDS_KMAP_HIGH_GET at all since
 * the locking involved must also disable IRQs which is incompatible with
 * the IPI mechanism used by global TLB operations.
 */
#define ARCH_NEEDS_KMAP_HIGH_GET
#if defined(CONFIG_SMP) && defined(CONFIG_CPU_TLB_V6)
#undef ARCH_NEEDS_KMAP_HIGH_GET
#if defined(CONFIG_HIGHMEM) && defined(CONFIG_CPU_CACHE_VIVT)
#error "The sum of features in your kernel config cannot be supported together"
#endif
#endif


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] mm, highmem: remove useless pool_lock
  2012-10-30 21:31     ` Andrew Morton
@ 2012-10-31  5:14       ` Minchan Kim
  2012-10-31 15:01       ` JoonSoo Kim
  1 sibling, 0 replies; 146+ messages in thread
From: Minchan Kim @ 2012-10-31  5:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joonsoo Kim, linux-kernel, linux-mm

Hi Andrew,

On Tue, Oct 30, 2012 at 02:31:07PM -0700, Andrew Morton wrote:
> On Mon, 29 Oct 2012 04:12:53 +0900
> Joonsoo Kim <js1304@gmail.com> wrote:
> 
> > The pool_lock protects the page_address_pool from concurrent access.
> > But, access to the page_address_pool is already protected by kmap_lock.
> > So remove it.
> 
> Well, there's a set_page_address() call in mm/page_alloc.c which
> doesn't have lock_kmap().  it doesn't *need* lock_kmap() because it's
> init-time code and we're running single-threaded there.  I hope!
> 
> But this exception should be double-checked and mentioned in the
> changelog, please.  And it's a reason why we can't add
> assert_spin_locked(&kmap_lock) to set_page_address(), which is
> unfortunate.
> 

The exception is valid only on m68k and sparc, and they use page->virtual
rather than the set_page_address() in highmem.c. So I think we can add
such a lock check to set_page_address() in highmem.c.

But I'm not sure we really need it, because set_page_address() is used in
only a few places; isn't it enough to just add a comment, and avoid the
unnecessary overhead?

/* NOTE : Caller should hold kmap_lock by lock_kmap() */

> 
> The irq-disabling in this code is odd.  If ARCH_NEEDS_KMAP_HIGH_GET=n,
> we didn't need irq-safe locking in set_page_address().  I guess we'll

Which lock do you mean in set_page_address()?
We have two locks in there, pool_lock and pas->lock.
With this patchset, we don't need pool_lock any more.
What remains is pas->lock.

If we make that lock irq-unsafe, it could deadlock with page_address()
if it is called in irq context. Currently, page_address() is used in
lots of places and I'm not sure it's called only from process context.
Was there any rule that page_address() must be used only from
process context?
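
The scenario I have in mind is roughly this (a hypothetical sequence,
assuming pas->lock were taken without disabling IRQs):

  process context:
    set_page_address()
      spin_lock(&pas->lock);
        <interrupt arrives on the same CPU>
          page_address(page)
            spin_lock(&pas->lock);	/* already held -> deadlock */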

> need to retain it in page_address() - I expect some callers have IRQs
> disabled.
> 
> 
> ARCH_NEEDS_KMAP_HIGH_GET is a nasty looking thing.  It's ARM:
> 
> /*
>  * The reason for kmap_high_get() is to ensure that the currently kmap'd
>  * page usage count does not decrease to zero while we're using its
>  * existing virtual mapping in an atomic context.  With a VIVT cache this
>  * is essential to do, but with a VIPT cache this is only an optimization
>  * so not to pay the price of establishing a second mapping if an existing
>  * one can be used.  However, on platforms without hardware TLB maintenance
>  * broadcast, we simply cannot use ARCH_NEEDS_KMAP_HIGH_GET at all since
>  * the locking involved must also disable IRQs which is incompatible with
>  * the IPI mechanism used by global TLB operations.
>  */
> #define ARCH_NEEDS_KMAP_HIGH_GET
> #if defined(CONFIG_SMP) && defined(CONFIG_CPU_TLB_V6)
> #undef ARCH_NEEDS_KMAP_HIGH_GET
> #if defined(CONFIG_HIGHMEM) && defined(CONFIG_CPU_CACHE_VIVT)
> #error "The sum of features in your kernel config cannot be supported together"
> #endif
> #endif
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH 2/5] mm, highmem: remove useless pool_lock
  2012-10-30 21:31     ` Andrew Morton
  2012-10-31  5:14       ` Minchan Kim
@ 2012-10-31 15:01       ` JoonSoo Kim
  1 sibling, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-31 15:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Hello, Andrew.

2012/10/31 Andrew Morton <akpm@linux-foundation.org>:
> On Mon, 29 Oct 2012 04:12:53 +0900
> Joonsoo Kim <js1304@gmail.com> wrote:
>
>> The pool_lock protects the page_address_pool from concurrent access.
>> But, access to the page_address_pool is already protected by kmap_lock.
>> So remove it.
>
> Well, there's a set_page_address() call in mm/page_alloc.c which
> doesn't have lock_kmap().  it doesn't *need* lock_kmap() because it's
> init-time code and we're running single-threaded there.  I hope!
>
> But this exception should be double-checked and mentioned in the
> changelog, please.  And it's a reason why we can't add
> assert_spin_locked(&kmap_lock) to set_page_address(), which is
> unfortunate.

The set_page_address() call in mm/page_alloc.c is invoked only when
WANT_PAGE_VIRTUAL is defined.
And in that case, set_page_address() is defined not in highmem.c,
but in include/linux/mm.h.
So we don't need to worry about the set_page_address() call in mm/page_alloc.c.
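
From memory, the WANT_PAGE_VIRTUAL variant in include/linux/mm.h is
roughly the following - it just caches the address in struct page and
never touches kmap_lock:

#if defined(WANT_PAGE_VIRTUAL)
#define page_address(page) ((page)->virtual)
#define set_page_address(page, address)			\
	do {						\
		(page)->virtual = (address);		\
	} while(0)
#define page_address_init()  do { } while(0)
#endif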

> The irq-disabling in this code is odd.  If ARCH_NEEDS_KMAP_HIGH_GET=n,
> we didn't need irq-safe locking in set_page_address().  I guess we'll
> need to retain it in page_address() - I expect some callers have IRQs
> disabled.

As Minchan described, if we don't disable irqs when taking pas->lock,
it could deadlock with page_address().

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH v2 0/5] minor clean-up and optimize highmem related code
       [not found] <Yes>
                   ` (11 preceding siblings ...)
  2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
@ 2012-10-31 16:56 ` Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
                     ` (4 more replies)
  12 siblings, 5 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-31 16:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Joonsoo Kim

This patchset cleans up and optimizes highmem-related code.

Changes from v1:
Rebased on v3.7-rc3.
[4] Instead of returning the index of the last flushed entry, return the first index.
And update last_pkmap_nr to this index to optimize further.

Summary of v1:
[1] is just clean-up and doesn't introduce any functional change.
[2-3] are for clean-up and optimization.
These eliminate a useless lock operation and list management.
[4-5] are for optimization related to flush_all_zero_pkmaps().

Joonsoo Kim (5):
  mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  mm, highmem: remove useless pool_lock
  mm, highmem: remove page_address_pool list
  mm, highmem: makes flush_all_zero_pkmaps() return index of first
    flushed entry
  mm, highmem: get virtual address of the page using PKMAP_ADDR()

 include/linux/highmem.h |    1 +
 mm/highmem.c            |  108 ++++++++++++++++++++++-------------------------
 2 files changed, 51 insertions(+), 58 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 146+ messages in thread

* [PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
@ 2012-10-31 16:56   ` Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman, Peter Zijlstra

To calculate an index into the pkmap, using PKMAP_NR() is more understandable
and maintainable, so switch to it.
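
For reference, on x86 these macros are defined in
arch/x86/include/asm/highmem.h roughly as:

#define PKMAP_NR(virt)  ((virt-PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))

so the open-coded shift below is exactly PKMAP_NR().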

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

diff --git a/mm/highmem.c b/mm/highmem.c
index d517cd1..b3b3d68 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -99,7 +99,7 @@ struct page *kmap_to_page(void *vaddr)
 	unsigned long addr = (unsigned long)vaddr;
 
 	if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) {
-		int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT;
+		int i = PKMAP_NR(addr);
 		return pte_page(pkmap_page_table[i]);
 	}
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 2/5] mm, highmem: remove useless pool_lock
  2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
@ 2012-10-31 16:56   ` Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 3/5] mm, highmem: remove page_address_pool list Joonsoo Kim
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman, Peter Zijlstra

The pool_lock protects the page_address_pool from concurrent access.
But, access to the page_address_pool is already protected by kmap_lock.
So remove it.

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

diff --git a/mm/highmem.c b/mm/highmem.c
index b3b3d68..017bad1 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -328,7 +328,6 @@ struct page_address_map {
  * page_address_map freelist, allocated from page_address_maps.
  */
 static struct list_head page_address_pool;	/* freelist */
-static spinlock_t pool_lock;			/* protects page_address_pool */
 
 /*
  * Hash table bucket
@@ -395,11 +394,9 @@ void set_page_address(struct page *page, void *virtual)
 	if (virtual) {		/* Add */
 		BUG_ON(list_empty(&page_address_pool));
 
-		spin_lock_irqsave(&pool_lock, flags);
 		pam = list_entry(page_address_pool.next,
 				struct page_address_map, list);
 		list_del(&pam->list);
-		spin_unlock_irqrestore(&pool_lock, flags);
 
 		pam->page = page;
 		pam->virtual = virtual;
@@ -413,9 +410,7 @@ void set_page_address(struct page *page, void *virtual)
 			if (pam->page == page) {
 				list_del(&pam->list);
 				spin_unlock_irqrestore(&pas->lock, flags);
-				spin_lock_irqsave(&pool_lock, flags);
 				list_add_tail(&pam->list, &page_address_pool);
-				spin_unlock_irqrestore(&pool_lock, flags);
 				goto done;
 			}
 		}
@@ -438,7 +433,6 @@ void __init page_address_init(void)
 		INIT_LIST_HEAD(&page_address_htable[i].lh);
 		spin_lock_init(&page_address_htable[i].lock);
 	}
-	spin_lock_init(&pool_lock);
 }
 
 #endif	/* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 3/5] mm, highmem: remove page_address_pool list
  2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
@ 2012-10-31 16:56   ` Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry Joonsoo Kim
  2012-10-31 16:56   ` [PATCH v2 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR() Joonsoo Kim
  4 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman, Peter Zijlstra

We can find a free page_address_map instance without the page_address_pool.
So remove it.

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

diff --git a/mm/highmem.c b/mm/highmem.c
index 017bad1..d98b0a9 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -324,10 +324,7 @@ struct page_address_map {
 	struct list_head list;
 };
 
-/*
- * page_address_map freelist, allocated from page_address_maps.
- */
-static struct list_head page_address_pool;	/* freelist */
+static struct page_address_map page_address_maps[LAST_PKMAP];
 
 /*
  * Hash table bucket
@@ -392,12 +389,7 @@ void set_page_address(struct page *page, void *virtual)
 
 	pas = page_slot(page);
 	if (virtual) {		/* Add */
-		BUG_ON(list_empty(&page_address_pool));
-
-		pam = list_entry(page_address_pool.next,
-				struct page_address_map, list);
-		list_del(&pam->list);
-
+		pam = &page_address_maps[PKMAP_NR((unsigned long)virtual)];
 		pam->page = page;
 		pam->virtual = virtual;
 
@@ -410,7 +402,6 @@ void set_page_address(struct page *page, void *virtual)
 			if (pam->page == page) {
 				list_del(&pam->list);
 				spin_unlock_irqrestore(&pas->lock, flags);
-				list_add_tail(&pam->list, &page_address_pool);
 				goto done;
 			}
 		}
@@ -420,15 +411,10 @@ done:
 	return;
 }
 
-static struct page_address_map page_address_maps[LAST_PKMAP];
-
 void __init page_address_init(void)
 {
 	int i;
 
-	INIT_LIST_HEAD(&page_address_pool);
-	for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)
-		list_add(&page_address_maps[i].list, &page_address_pool);
 	for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
 		INIT_LIST_HEAD(&page_address_htable[i].lh);
 		spin_lock_init(&page_address_htable[i].lock);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
                     ` (2 preceding siblings ...)
  2012-10-31 16:56   ` [PATCH v2 3/5] mm, highmem: remove page_address_pool list Joonsoo Kim
@ 2012-10-31 16:56   ` Joonsoo Kim
  2012-11-01  5:03     ` Minchan Kim
  2012-10-31 16:56   ` [PATCH v2 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR() Joonsoo Kim
  4 siblings, 1 reply; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman, Peter Zijlstra,
	Minchan Kim

In the current code, after flush_all_zero_pkmaps() is invoked,
we re-iterate over all pkmaps. This can be avoided if flush_all_zero_pkmaps()
returns the index of the first flushed entry. With this index,
we can immediately map a highmem page to the virtual address represented
by that index. So change the return type of flush_all_zero_pkmaps()
and return the index of the first flushed entry.

Additionally, update last_pkmap_nr to this index.
It is certain that every entry below this index is occupied by another mapping,
therefore updating last_pkmap_nr to this index is a reasonable optimization.
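
In other words, instead of the caller doing roughly:

	flush_all_zero_pkmaps();
	count = LAST_PKMAP;
	/* ... and then re-scanning pkmap_count[] from scratch ... */

it can do:

	index = flush_all_zero_pkmaps();
	if (index != PKMAP_INVALID_INDEX)
		vaddr = PKMAP_ADDR(index);	/* map here immediately */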

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ef788b5..97ad208 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
 
 #ifdef CONFIG_HIGHMEM
 #include <asm/highmem.h>
+#define PKMAP_INVALID_INDEX (LAST_PKMAP)
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
diff --git a/mm/highmem.c b/mm/highmem.c
index d98b0a9..b365f7b 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
 	return virt_to_page(addr);
 }
 
-static void flush_all_zero_pkmaps(void)
+static unsigned int flush_all_zero_pkmaps(void)
 {
 	int i;
-	int need_flush = 0;
+	unsigned int index = PKMAP_INVALID_INDEX;
 
 	flush_cache_kmaps();
 
@@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
 			  &pkmap_page_table[i]);
 
 		set_page_address(page, NULL);
-		need_flush = 1;
+		if (index == PKMAP_INVALID_INDEX)
+			index = i;
 	}
-	if (need_flush)
+	if (index != PKMAP_INVALID_INDEX)
 		flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+
+	return index;
 }
 
 /**
@@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
  */
 void kmap_flush_unused(void)
 {
+	unsigned int index;
+
 	lock_kmap();
-	flush_all_zero_pkmaps();
+	index = flush_all_zero_pkmaps();
+	if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
+		last_pkmap_nr = index;
 	unlock_kmap();
 }
 
 static inline unsigned long map_new_virtual(struct page *page)
 {
 	unsigned long vaddr;
+	unsigned int index = PKMAP_INVALID_INDEX;
 	int count;
 
 start:
@@ -168,40 +176,45 @@ start:
 	for (;;) {
 		last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
 		if (!last_pkmap_nr) {
-			flush_all_zero_pkmaps();
-			count = LAST_PKMAP;
+			index = flush_all_zero_pkmaps();
+			break;
 		}
-		if (!pkmap_count[last_pkmap_nr])
+		if (!pkmap_count[last_pkmap_nr]) {
+			index = last_pkmap_nr;
 			break;	/* Found a usable entry */
-		if (--count)
-			continue;
-
-		/*
-		 * Sleep for somebody else to unmap their entries
-		 */
-		{
-			DECLARE_WAITQUEUE(wait, current);
-
-			__set_current_state(TASK_UNINTERRUPTIBLE);
-			add_wait_queue(&pkmap_map_wait, &wait);
-			unlock_kmap();
-			schedule();
-			remove_wait_queue(&pkmap_map_wait, &wait);
-			lock_kmap();
-
-			/* Somebody else might have mapped it while we slept */
-			if (page_address(page))
-				return (unsigned long)page_address(page);
-
-			/* Re-start */
-			goto start;
 		}
+		if (--count == 0)
+			break;
 	}
-	vaddr = PKMAP_ADDR(last_pkmap_nr);
+
+	/*
+	 * Sleep for somebody else to unmap their entries
+	 */
+	if (index == PKMAP_INVALID_INDEX) {
+		DECLARE_WAITQUEUE(wait, current);
+
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		add_wait_queue(&pkmap_map_wait, &wait);
+		unlock_kmap();
+		schedule();
+		remove_wait_queue(&pkmap_map_wait, &wait);
+		lock_kmap();
+
+		/* Somebody else might have mapped it while we slept */
+		vaddr = (unsigned long)page_address(page);
+		if (vaddr)
+			return vaddr;
+
+		/* Re-start */
+		goto start;
+	}
+
+	vaddr = PKMAP_ADDR(index);
 	set_pte_at(&init_mm, vaddr,
-		   &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
+		   &(pkmap_page_table[index]), mk_pte(page, kmap_prot));
 
-	pkmap_count[last_pkmap_nr] = 1;
+	pkmap_count[index] = 1;
+	last_pkmap_nr = index;
 	set_page_address(page, (void *)vaddr);
 
 	return vaddr;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()
  2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
                     ` (3 preceding siblings ...)
  2012-10-31 16:56   ` [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry Joonsoo Kim
@ 2012-10-31 16:56   ` Joonsoo Kim
  4 siblings, 0 replies; 146+ messages in thread
From: Joonsoo Kim @ 2012-10-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Mel Gorman, Peter Zijlstra

In flush_all_zero_pkmaps(), we have the index of the pkmap associated with the page.
Using this index, we can simply get the virtual address of the page.
So change it.

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

diff --git a/mm/highmem.c b/mm/highmem.c
index b365f7b..675ec97 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -137,8 +137,7 @@ static unsigned int flush_all_zero_pkmaps(void)
 		 * So no dangers, even with speculative execution.
 		 */
 		page = pte_page(pkmap_page_table[i]);
-		pte_clear(&init_mm, (unsigned long)page_address(page),
-			  &pkmap_page_table[i]);
+		pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
 		set_page_address(page, NULL);
 		if (index == PKMAP_INVALID_INDEX)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH 0/5] minor clean-up and optimize highmem related code
  2012-10-29 13:15     ` JoonSoo Kim
@ 2012-10-31 17:11       ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-10-31 17:11 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-kernel, linux-mm, Peter Zijlstra

Hello, Andrew.

2012/10/29 JoonSoo Kim <js1304@gmail.com>:
> Hi, Minchan.
>
> 2012/10/29 Minchan Kim <minchan@kernel.org>:
>> Hi Joonsoo,
>>
>> On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote:
>>> This patchset clean-up and optimize highmem related code.
>>>
>>> [1] is just clean-up and doesn't introduce any functional change.
>>> [2-3] are for clean-up and optimization.
>>> These eliminate a useless lock operation and list management.
>>> [4-5] is for optimization related to flush_all_zero_pkmaps().
>>>
>>> Joonsoo Kim (5):
>>>   mm, highmem: use PKMAP_NR() to calculate an index of pkmap
>>>   mm, highmem: remove useless pool_lock
>>>   mm, highmem: remove page_address_pool list
>>>   mm, highmem: makes flush_all_zero_pkmaps() return index of last
>>>     flushed entry
>>>   mm, highmem: get virtual address of the page using PKMAP_ADDR()
>>
>> This patchset looks awesome to me.
>> If you have a plan to respin, please CCed Peter.
>
> Thanks for review.
> I will wait more review and respin, the day after tomorrow.
> Version 2 will include fix about your comment.

Could you pick up the second version of this patchset?

[3] is changed to leave one blank line.
[4] is changed to optimize further, according to Minchan's comment:
it returns the first index of a flushed entry, instead of the last index.
The others are unchanged.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-10-31 16:56   ` [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry Joonsoo Kim
@ 2012-11-01  5:03     ` Minchan Kim
  2012-11-02 19:07       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Minchan Kim @ 2012-11-01  5:03 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
> In current code, after flush_all_zero_pkmaps() is invoked,
> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
> return index of first flushed entry. With this index,
> we can immediately map highmem page to virtual address represented by index.
> So change return type of flush_all_zero_pkmaps()
> and return index of first flushed entry.
> 
> Additionally, update last_pkmap_nr to this index.
> It is certain that entry which is below this index is occupied by other mapping,
> therefore updating last_pkmap_nr to this index is reasonable optimization.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ef788b5..97ad208 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>  
>  #ifdef CONFIG_HIGHMEM
>  #include <asm/highmem.h>
> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
>  
>  /* declarations for linux/mm/highmem.c */
>  unsigned int nr_free_highpages(void);
> diff --git a/mm/highmem.c b/mm/highmem.c
> index d98b0a9..b365f7b 100644
> --- a/mm/highmem.c
> +++ b/mm/highmem.c
> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>  	return virt_to_page(addr);
>  }
>  
> -static void flush_all_zero_pkmaps(void)
> +static unsigned int flush_all_zero_pkmaps(void)
>  {
>  	int i;
> -	int need_flush = 0;
> +	unsigned int index = PKMAP_INVALID_INDEX;
>  
>  	flush_cache_kmaps();
>  
> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
>  			  &pkmap_page_table[i]);
>  
>  		set_page_address(page, NULL);
> -		need_flush = 1;
> +		if (index == PKMAP_INVALID_INDEX)
> +			index = i;
>  	}
> -	if (need_flush)
> +	if (index != PKMAP_INVALID_INDEX)
>  		flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
> +
> +	return index;
>  }
>  
>  /**
> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
>   */
>  void kmap_flush_unused(void)
>  {
> +	unsigned int index;
> +
>  	lock_kmap();
> -	flush_all_zero_pkmaps();
> +	index = flush_all_zero_pkmaps();
> +	if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
> +		last_pkmap_nr = index;

I don't know whether kmap_flush_unused is really a fast path, so I'm not
sure how effective my nitpick is. Anyway,
what problem happens if we do the following?

lock()
index = flush_all_zero_pkmaps();
if (index != PKMAP_INVALID_INDEX)
        last_pkmap_nr = index;
unlock();

Normally, last_pkmap_nr is increased while searching for an empty slot in
map_new_virtual. So I expect the return value of flush_all_zero_pkmaps
in kmap_flush_unused to normally be either less than last_pkmap_nr
or last_pkmap_nr + 1.

 
>  	unlock_kmap();
>  }
>  
>  static inline unsigned long map_new_virtual(struct page *page)
>  {
>  	unsigned long vaddr;
> +	unsigned int index = PKMAP_INVALID_INDEX;
>  	int count;
>  
>  start:
> @@ -168,40 +176,45 @@ start:
>  	for (;;) {
>  		last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
>  		if (!last_pkmap_nr) {
> -			flush_all_zero_pkmaps();
> -			count = LAST_PKMAP;
> +			index = flush_all_zero_pkmaps();
> +			break;
>  		}
> -		if (!pkmap_count[last_pkmap_nr])
> +		if (!pkmap_count[last_pkmap_nr]) {
> +			index = last_pkmap_nr;
>  			break;	/* Found a usable entry */
> -		if (--count)
> -			continue;
> -
> -		/*
> -		 * Sleep for somebody else to unmap their entries
> -		 */
> -		{
> -			DECLARE_WAITQUEUE(wait, current);
> -
> -			__set_current_state(TASK_UNINTERRUPTIBLE);
> -			add_wait_queue(&pkmap_map_wait, &wait);
> -			unlock_kmap();
> -			schedule();
> -			remove_wait_queue(&pkmap_map_wait, &wait);
> -			lock_kmap();
> -
> -			/* Somebody else might have mapped it while we slept */
> -			if (page_address(page))
> -				return (unsigned long)page_address(page);
> -
> -			/* Re-start */
> -			goto start;
>  		}
> +		if (--count == 0)
> +			break;
>  	}
> -	vaddr = PKMAP_ADDR(last_pkmap_nr);
> +
> +	/*
> +	 * Sleep for somebody else to unmap their entries
> +	 */
> +	if (index == PKMAP_INVALID_INDEX) {
> +		DECLARE_WAITQUEUE(wait, current);
> +
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		add_wait_queue(&pkmap_map_wait, &wait);
> +		unlock_kmap();
> +		schedule();
> +		remove_wait_queue(&pkmap_map_wait, &wait);
> +		lock_kmap();
> +
> +		/* Somebody else might have mapped it while we slept */
> +		vaddr = (unsigned long)page_address(page);
> +		if (vaddr)
> +			return vaddr;
> +
> +		/* Re-start */
> +		goto start;
> +	}
> +
> +	vaddr = PKMAP_ADDR(index);
>  	set_pte_at(&init_mm, vaddr,
> -		   &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
> +		   &(pkmap_page_table[index]), mk_pte(page, kmap_prot));
>  
> -	pkmap_count[last_pkmap_nr] = 1;
> +	pkmap_count[index] = 1;
> +	last_pkmap_nr = index;
>  	set_page_address(page, (void *)vaddr);
>  
>  	return vaddr;
> -- 
> 1.7.9.5
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-01  5:03     ` Minchan Kim
@ 2012-11-02 19:07       ` JoonSoo Kim
  2012-11-02 22:42         ` Minchan Kim
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-11-02 19:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

Hello, Minchan.

2012/11/1 Minchan Kim <minchan@kernel.org>:
> On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
>> In current code, after flush_all_zero_pkmaps() is invoked,
>> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
>> return index of first flushed entry. With this index,
>> we can immediately map highmem page to virtual address represented by index.
>> So change return type of flush_all_zero_pkmaps()
>> and return index of first flushed entry.
>>
>> Additionally, update last_pkmap_nr to this index.
>> It is certain that entry which is below this index is occupied by other mapping,
>> therefore updating last_pkmap_nr to this index is reasonable optimization.
>>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>>
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index ef788b5..97ad208 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>>
>>  #ifdef CONFIG_HIGHMEM
>>  #include <asm/highmem.h>
>> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
>>
>>  /* declarations for linux/mm/highmem.c */
>>  unsigned int nr_free_highpages(void);
>> diff --git a/mm/highmem.c b/mm/highmem.c
>> index d98b0a9..b365f7b 100644
>> --- a/mm/highmem.c
>> +++ b/mm/highmem.c
>> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>>       return virt_to_page(addr);
>>  }
>>
>> -static void flush_all_zero_pkmaps(void)
>> +static unsigned int flush_all_zero_pkmaps(void)
>>  {
>>       int i;
>> -     int need_flush = 0;
>> +     unsigned int index = PKMAP_INVALID_INDEX;
>>
>>       flush_cache_kmaps();
>>
>> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
>>                         &pkmap_page_table[i]);
>>
>>               set_page_address(page, NULL);
>> -             need_flush = 1;
>> +             if (index == PKMAP_INVALID_INDEX)
>> +                     index = i;
>>       }
>> -     if (need_flush)
>> +     if (index != PKMAP_INVALID_INDEX)
>>               flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
>> +
>> +     return index;
>>  }
>>
>>  /**
>> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
>>   */
>>  void kmap_flush_unused(void)
>>  {
>> +     unsigned int index;
>> +
>>       lock_kmap();
>> -     flush_all_zero_pkmaps();
>> +     index = flush_all_zero_pkmaps();
>> +     if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
>> +             last_pkmap_nr = index;
>
> I don't know how kmap_flush_unused is really fast path so how my nitpick
> is effective. Anyway,
> What problem happens if we do following as?
>
> lock()
> index = flush_all_zero_pkmaps();
> if (index != PKMAP_INVALID_INDEX)
>         last_pkmap_nr = index;
> unlock();
>
> Normally, last_pkmap_nr is increased with searching empty slot in
> map_new_virtual. So I expect return value of flush_all_zero_pkmaps
> in kmap_flush_unused normally become either less than last_pkmap_nr
> or last_pkmap_nr + 1.

There is a case where the return value of flush_all_zero_pkmaps() is larger
than last_pkmap_nr.
Look at the following example.

Assume LAST_PKMAP = 20 and indexes 1-9, 11-19 are kmapped; 10 is kunmapped.

do kmap_flush_unused() => flush index 10 => last_pkmap_nr = 10
do kunmap() with index 17
do kmap_flush_unused() => flush index 17 (flush_all_zero_pkmaps() returns 17)

So, a slightly dirty implementation is needed.

Thanks.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-02 19:07       ` JoonSoo Kim
@ 2012-11-02 22:42         ` Minchan Kim
  2012-11-13  0:30           ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Minchan Kim @ 2012-11-02 22:42 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

Hi Joonsoo,

On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
> Hello, Minchan.
> 
> 2012/11/1 Minchan Kim <minchan@kernel.org>:
> > On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
> >> In current code, after flush_all_zero_pkmaps() is invoked,
> >> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
> >> return index of first flushed entry. With this index,
> >> we can immediately map highmem page to virtual address represented by index.
> >> So change return type of flush_all_zero_pkmaps()
> >> and return index of first flushed entry.
> >>
> >> Additionally, update last_pkmap_nr to this index.
> >> It is certain that entry which is below this index is occupied by other mapping,
> >> therefore updating last_pkmap_nr to this index is reasonable optimization.
> >>
> >> Cc: Mel Gorman <mel@csn.ul.ie>
> >> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >> Cc: Minchan Kim <minchan@kernel.org>
> >> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> >>
> >> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> >> index ef788b5..97ad208 100644
> >> --- a/include/linux/highmem.h
> >> +++ b/include/linux/highmem.h
> >> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
> >>
> >>  #ifdef CONFIG_HIGHMEM
> >>  #include <asm/highmem.h>
> >> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
> >>
> >>  /* declarations for linux/mm/highmem.c */
> >>  unsigned int nr_free_highpages(void);
> >> diff --git a/mm/highmem.c b/mm/highmem.c
> >> index d98b0a9..b365f7b 100644
> >> --- a/mm/highmem.c
> >> +++ b/mm/highmem.c
> >> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
> >>       return virt_to_page(addr);
> >>  }
> >>
> >> -static void flush_all_zero_pkmaps(void)
> >> +static unsigned int flush_all_zero_pkmaps(void)
> >>  {
> >>       int i;
> >> -     int need_flush = 0;
> >> +     unsigned int index = PKMAP_INVALID_INDEX;
> >>
> >>       flush_cache_kmaps();
> >>
> >> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
> >>                         &pkmap_page_table[i]);
> >>
> >>               set_page_address(page, NULL);
> >> -             need_flush = 1;
> >> +             if (index == PKMAP_INVALID_INDEX)
> >> +                     index = i;
> >>       }
> >> -     if (need_flush)
> >> +     if (index != PKMAP_INVALID_INDEX)
> >>               flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
> >> +
> >> +     return index;
> >>  }
> >>
> >>  /**
> >> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
> >>   */
> >>  void kmap_flush_unused(void)
> >>  {
> >> +     unsigned int index;
> >> +
> >>       lock_kmap();
> >> -     flush_all_zero_pkmaps();
> >> +     index = flush_all_zero_pkmaps();
> >> +     if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
> >> +             last_pkmap_nr = index;
> >
> > I don't know how kmap_flush_unused is really fast path so how my nitpick
> > is effective. Anyway,
> > What problem happens if we do following as?
> >
> > lock()
> > index = flush_all_zero_pkmaps();
> > if (index != PKMAP_INVALID_INDEX)
> >         last_pkmap_nr = index;
> > unlock();
> >
> > Normally, last_pkmap_nr is increased with searching empty slot in
> > map_new_virtual. So I expect return value of flush_all_zero_pkmaps
> > in kmap_flush_unused normally become either less than last_pkmap_nr
> > or last_pkmap_nr + 1.
> 
> There is a case that return value of kmap_flush_unused() is larger
> than last_pkmap_nr.

I see, but why is it a problem? kmap_flush_unused returning a larger value
than last_pkmap_nr means that there is no free slot below that value.
So an unconditional last_pkmap_nr update is valid.

> Look at the following example.
> 
> Assume last_pkmap = 20 and index 1-9, 11-19 is kmapped. 10 is kunmapped.
> 
> do kmap_flush_unused() => flush index 10 => last_pkmap = 10;
> do kunmap() with index 17
> do kmap_flush_unused() => flush index 17
> 
> So, little dirty implementation is needed.
> 
> Thanks.

-- 
Kind Regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-02 22:42         ` Minchan Kim
@ 2012-11-13  0:30           ` JoonSoo Kim
  2012-11-13 12:49             ` Minchan Kim
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-11-13  0:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

2012/11/3 Minchan Kim <minchan@kernel.org>:
> Hi Joonsoo,
>
> On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
>> Hello, Minchan.
>>
>> 2012/11/1 Minchan Kim <minchan@kernel.org>:
>> > On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
>> >> In current code, after flush_all_zero_pkmaps() is invoked,
>> >> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
>> >> return index of first flushed entry. With this index,
>> >> we can immediately map highmem page to virtual address represented by index.
>> >> So change return type of flush_all_zero_pkmaps()
>> >> and return index of first flushed entry.
>> >>
>> >> Additionally, update last_pkmap_nr to this index.
>> >> It is certain that entry which is below this index is occupied by other mapping,
>> >> therefore updating last_pkmap_nr to this index is reasonable optimization.
>> >>
>> >> Cc: Mel Gorman <mel@csn.ul.ie>
>> >> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> >> Cc: Minchan Kim <minchan@kernel.org>
>> >> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>> >>
>> >> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> >> index ef788b5..97ad208 100644
>> >> --- a/include/linux/highmem.h
>> >> +++ b/include/linux/highmem.h
>> >> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>> >>
>> >>  #ifdef CONFIG_HIGHMEM
>> >>  #include <asm/highmem.h>
>> >> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
>> >>
>> >>  /* declarations for linux/mm/highmem.c */
>> >>  unsigned int nr_free_highpages(void);
>> >> diff --git a/mm/highmem.c b/mm/highmem.c
>> >> index d98b0a9..b365f7b 100644
>> >> --- a/mm/highmem.c
>> >> +++ b/mm/highmem.c
>> >> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>> >>       return virt_to_page(addr);
>> >>  }
>> >>
>> >> -static void flush_all_zero_pkmaps(void)
>> >> +static unsigned int flush_all_zero_pkmaps(void)
>> >>  {
>> >>       int i;
>> >> -     int need_flush = 0;
>> >> +     unsigned int index = PKMAP_INVALID_INDEX;
>> >>
>> >>       flush_cache_kmaps();
>> >>
>> >> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
>> >>                         &pkmap_page_table[i]);
>> >>
>> >>               set_page_address(page, NULL);
>> >> -             need_flush = 1;
>> >> +             if (index == PKMAP_INVALID_INDEX)
>> >> +                     index = i;
>> >>       }
>> >> -     if (need_flush)
>> >> +     if (index != PKMAP_INVALID_INDEX)
>> >>               flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
>> >> +
>> >> +     return index;
>> >>  }
>> >>
>> >>  /**
>> >> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
>> >>   */
>> >>  void kmap_flush_unused(void)
>> >>  {
>> >> +     unsigned int index;
>> >> +
>> >>       lock_kmap();
>> >> -     flush_all_zero_pkmaps();
>> >> +     index = flush_all_zero_pkmaps();
>> >> +     if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
>> >> +             last_pkmap_nr = index;
>> >
>> > I don't know how kmap_flush_unused is really fast path so how my nitpick
>> > is effective. Anyway,
>> > What problem happens if we do following as?
>> >
>> > lock()
>> > index = flush_all_zero_pkmaps();
>> > if (index != PKMAP_INVALID_INDEX)
>> >         last_pkmap_nr = index;
>> > unlock();
>> >
>> > Normally, last_pkmap_nr is increased with searching empty slot in
>> > map_new_virtual. So I expect return value of flush_all_zero_pkmaps
>> > in kmap_flush_unused normally become either less than last_pkmap_nr
>> > or last_pkmap_nr + 1.
>>
>> There is a case that return value of kmap_flush_unused() is larger
>> than last_pkmap_nr.
>
> I see but why it's problem? kmap_flush_unused returns larger value than
> last_pkmap_nr means that there is no free slot at below the value.
> So unconditional last_pkmap_nr update is vaild.

I think that this is not true.
Look at a slightly different example.

Assume LAST_PKMAP = 20 and indexes 1-9, 12-19 are kmapped; 10, 11 are kunmapped.

do kmap_flush_unused() => flush indexes 10, 11 => last_pkmap_nr = 10
do kunmap() with index 17
do kmap_flush_unused() => flush index 17 => last_pkmap_nr = 17?

In this case, the unconditional last_pkmap_nr update skips over a still-free
index (11 is free but below 17).
So, a conditional update is needed.

>> Look at the following example.
>>
>> Assume last_pkmap = 20 and index 1-9, 11-19 is kmapped. 10 is kunmapped.
>>
>> do kmap_flush_unused() => flush index 10 => last_pkmap = 10;
>> do kunmap() with index 17
>> do kmap_flush_unused() => flush index 17
>>
>> So, little dirty implementation is needed.
>>
>> Thanks.
>
> --
> Kind Regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-13  0:30           ` JoonSoo Kim
@ 2012-11-13 12:49             ` Minchan Kim
  2012-11-13 14:12               ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Minchan Kim @ 2012-11-13 12:49 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote:
> 2012/11/3 Minchan Kim <minchan@kernel.org>:
> > Hi Joonsoo,
> >
> > On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
> >> Hello, Minchan.
> >>
> >> 2012/11/1 Minchan Kim <minchan@kernel.org>:
> >> > On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
> >> >> In current code, after flush_all_zero_pkmaps() is invoked,
> >> >> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
> >> >> return index of first flushed entry. With this index,
> >> >> we can immediately map highmem page to virtual address represented by index.
> >> >> So change return type of flush_all_zero_pkmaps()
> >> >> and return index of first flushed entry.
> >> >>
> >> >> Additionally, update last_pkmap_nr to this index.
> >> >> It is certain that entry which is below this index is occupied by other mapping,
> >> >> therefore updating last_pkmap_nr to this index is reasonable optimization.
> >> >>
> >> >> Cc: Mel Gorman <mel@csn.ul.ie>
> >> >> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >> >> Cc: Minchan Kim <minchan@kernel.org>
> >> >> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> >> >>
> >> >> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> >> >> index ef788b5..97ad208 100644
> >> >> --- a/include/linux/highmem.h
> >> >> +++ b/include/linux/highmem.h
> >> >> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
> >> >>
> >> >>  #ifdef CONFIG_HIGHMEM
> >> >>  #include <asm/highmem.h>
> >> >> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
> >> >>
> >> >>  /* declarations for linux/mm/highmem.c */
> >> >>  unsigned int nr_free_highpages(void);
> >> >> diff --git a/mm/highmem.c b/mm/highmem.c
> >> >> index d98b0a9..b365f7b 100644
> >> >> --- a/mm/highmem.c
> >> >> +++ b/mm/highmem.c
> >> >> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
> >> >>       return virt_to_page(addr);
> >> >>  }
> >> >>
> >> >> -static void flush_all_zero_pkmaps(void)
> >> >> +static unsigned int flush_all_zero_pkmaps(void)
> >> >>  {
> >> >>       int i;
> >> >> -     int need_flush = 0;
> >> >> +     unsigned int index = PKMAP_INVALID_INDEX;
> >> >>
> >> >>       flush_cache_kmaps();
> >> >>
> >> >> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
> >> >>                         &pkmap_page_table[i]);
> >> >>
> >> >>               set_page_address(page, NULL);
> >> >> -             need_flush = 1;
> >> >> +             if (index == PKMAP_INVALID_INDEX)
> >> >> +                     index = i;
> >> >>       }
> >> >> -     if (need_flush)
> >> >> +     if (index != PKMAP_INVALID_INDEX)
> >> >>               flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
> >> >> +
> >> >> +     return index;
> >> >>  }
> >> >>
> >> >>  /**
> >> >> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
> >> >>   */
> >> >>  void kmap_flush_unused(void)
> >> >>  {
> >> >> +     unsigned int index;
> >> >> +
> >> >>       lock_kmap();
> >> >> -     flush_all_zero_pkmaps();
> >> >> +     index = flush_all_zero_pkmaps();
> >> >> +     if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
> >> >> +             last_pkmap_nr = index;
> >> >
> >> > I don't know how kmap_flush_unused is really fast path so how my nitpick
> >> > is effective. Anyway,
> >> > What problem happens if we do following as?
> >> >
> >> > lock()
> >> > index = flush_all_zero_pkmaps();
> >> > if (index != PKMAP_INVALID_INDEX)
> >> >         last_pkmap_nr = index;
> >> > unlock();
> >> >
> >> > Normally, last_pkmap_nr is increased with searching empty slot in
> >> > map_new_virtual. So I expect return value of flush_all_zero_pkmaps
> >> > in kmap_flush_unused normally become either less than last_pkmap_nr
> >> > or last_pkmap_nr + 1.
> >>
> >> There is a case that return value of kmap_flush_unused() is larger
> >> than last_pkmap_nr.
> >
> > I see but why it's problem? kmap_flush_unused returns larger value than
> > last_pkmap_nr means that there is no free slot at below the value.
> > So unconditional last_pkmap_nr update is vaild.
> 
> I think that this is not true.
> Look at the slightly different example.
> 
> Assume last_pkmap = 20 and index 1-9, 12-19 is kmapped. 10, 11 is kunmapped.
> 
> do kmap_flush_unused() => flush index 10,11 => last_pkmap = 10;
> do kunmap() with index 17
> do kmap_flush_unused() => flush index 17 => last_pkmap = 17?
> 
> In this case, unconditional last_pkmap_nr update skip one kunmapped index.
> So, conditional update is needed.

Thanks for pointing that out, Joonsoo.
You're right. I misunderstood your flush_all_zero_pkmaps change.
With your change, flush_all_zero_pkmaps returns the first *flushed* free slot index.
What's the benefit of returning the first flushed free slot index rather than the first free slot index?
I think flush_all_zero_pkmaps should return the first free slot, because the caller of
flush_all_zero_pkmaps doesn't care whether it was just flushed or not.
All it wants to know is whether a slot is free. In that case, we can remove the above check, and it makes
flush_all_zero_pkmaps more intuitive.


-- 
Kind Regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-13 12:49             ` Minchan Kim
@ 2012-11-13 14:12               ` JoonSoo Kim
  2012-11-13 15:01                 ` Minchan Kim
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-11-13 14:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

2012/11/13 Minchan Kim <minchan@kernel.org>:
> On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote:
>> 2012/11/3 Minchan Kim <minchan@kernel.org>:
>> > Hi Joonsoo,
>> >
>> > On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
>> >> Hello, Minchan.
>> >>
>> >> 2012/11/1 Minchan Kim <minchan@kernel.org>:
>> >> > On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
>> >> >> In current code, after flush_all_zero_pkmaps() is invoked,
>> >> >> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
>> >> >> return index of first flushed entry. With this index,
>> >> >> we can immediately map highmem page to virtual address represented by index.
>> >> >> So change return type of flush_all_zero_pkmaps()
>> >> >> and return index of first flushed entry.
>> >> >>
>> >> >> Additionally, update last_pkmap_nr to this index.
>> >> >> It is certain that entry which is below this index is occupied by other mapping,
>> >> >> therefore updating last_pkmap_nr to this index is reasonable optimization.
>> >> >>
>> >> >> Cc: Mel Gorman <mel@csn.ul.ie>
>> >> >> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> >> >> Cc: Minchan Kim <minchan@kernel.org>
>> >> >> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
>> >> >>
>> >> >> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> >> >> index ef788b5..97ad208 100644
>> >> >> --- a/include/linux/highmem.h
>> >> >> +++ b/include/linux/highmem.h
>> >> >> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>> >> >>
>> >> >>  #ifdef CONFIG_HIGHMEM
>> >> >>  #include <asm/highmem.h>
>> >> >> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
>> >> >>
>> >> >>  /* declarations for linux/mm/highmem.c */
>> >> >>  unsigned int nr_free_highpages(void);
>> >> >> diff --git a/mm/highmem.c b/mm/highmem.c
>> >> >> index d98b0a9..b365f7b 100644
>> >> >> --- a/mm/highmem.c
>> >> >> +++ b/mm/highmem.c
>> >> >> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>> >> >>       return virt_to_page(addr);
>> >> >>  }
>> >> >>
>> >> >> -static void flush_all_zero_pkmaps(void)
>> >> >> +static unsigned int flush_all_zero_pkmaps(void)
>> >> >>  {
>> >> >>       int i;
>> >> >> -     int need_flush = 0;
>> >> >> +     unsigned int index = PKMAP_INVALID_INDEX;
>> >> >>
>> >> >>       flush_cache_kmaps();
>> >> >>
>> >> >> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
>> >> >>                         &pkmap_page_table[i]);
>> >> >>
>> >> >>               set_page_address(page, NULL);
>> >> >> -             need_flush = 1;
>> >> >> +             if (index == PKMAP_INVALID_INDEX)
>> >> >> +                     index = i;
>> >> >>       }
>> >> >> -     if (need_flush)
>> >> >> +     if (index != PKMAP_INVALID_INDEX)
>> >> >>               flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
>> >> >> +
>> >> >> +     return index;
>> >> >>  }
>> >> >>
>> >> >>  /**
>> >> >> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
>> >> >>   */
>> >> >>  void kmap_flush_unused(void)
>> >> >>  {
>> >> >> +     unsigned int index;
>> >> >> +
>> >> >>       lock_kmap();
>> >> >> -     flush_all_zero_pkmaps();
>> >> >> +     index = flush_all_zero_pkmaps();
>> >> >> +     if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
>> >> >> +             last_pkmap_nr = index;
>> >> >
>> >> > I don't know how kmap_flush_unused is really fast path so how my nitpick
>> >> > is effective. Anyway,
>> >> > What problem happens if we do following as?
>> >> >
>> >> > lock()
>> >> > index = flush_all_zero_pkmaps();
>> >> > if (index != PKMAP_INVALID_INDEX)
>> >> >         last_pkmap_nr = index;
>> >> > unlock();
>> >> >
>> >> > Normally, last_pkmap_nr is increased with searching empty slot in
>> >> > map_new_virtual. So I expect return value of flush_all_zero_pkmaps
>> >> > in kmap_flush_unused normally become either less than last_pkmap_nr
>> >> > or last_pkmap_nr + 1.
>> >>
>> >> There is a case that return value of kmap_flush_unused() is larger
>> >> than last_pkmap_nr.
>> >
>> > I see but why it's problem? kmap_flush_unused returns larger value than
>> > last_pkmap_nr means that there is no free slot at below the value.
>> > So unconditional last_pkmap_nr update is vaild.
>>
>> I think that this is not true.
>> Look at the slightly different example.
>>
>> Assume last_pkmap = 20 and index 1-9, 12-19 is kmapped. 10, 11 is kunmapped.
>>
>> do kmap_flush_unused() => flush index 10,11 => last_pkmap = 10;
>> do kunmap() with index 17
>> do kmap_flush_unused() => flush index 17 => last_pkmap = 17?
>>
>> In this case, unconditional last_pkmap_nr update skip one kunmapped index.
>> So, conditional update is needed.
>
> Thanks for pouinting out, Joonsoo.
> You're right. I misunderstood your flush_all_zero_pkmaps change.
> As your change, flush_all_zero_pkmaps returns first *flushed* free slot index.
> What's the benefit returning flushed flushed free slot index rather than free slot index?

If flush_all_zero_pkmaps() returned the first free slot index rather than the
first flushed free slot,
we would need another comparison, like 'if (pkmap_count[i] == 0)', and
another local variable to determine whether a flush occurred or not.
I want to minimize this overhead and churn in the code, although
they are negligible.

> I think flush_all_zero_pkmaps should return first free slot because customer of
> flush_all_zero_pkmaps doesn't care whether it's just flushed or not.
> What he want is just free or not. In such case, we can remove above check and it makes
> flusha_all_zero_pkmaps more intuitive.

Yes, it is more intuitive, but as I mentioned above, it needs another comparison,
and with that, the benefit of avoiding re-iteration when there is no
free slot may disappear.
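
Roughly, the variant you suggest would look like this (my sketch):

static unsigned int flush_all_zero_pkmaps(void)
{
	int i;
	int need_flush = 0;		/* the extra local variable */
	unsigned int index = PKMAP_INVALID_INDEX;

	flush_cache_kmaps();

	for (i = 0; i < LAST_PKMAP; i++) {
		struct page *page;

		if (pkmap_count[i] == 1) {
			/* no users left: tear down the mapping */
			pkmap_count[i] = 0;
			page = pte_page(pkmap_page_table[i]);
			pte_clear(&init_mm, PKMAP_ADDR(i),
				  &pkmap_page_table[i]);
			set_page_address(page, NULL);
			need_flush = 1;
		}
		/* the extra comparison, on every iteration */
		if (!pkmap_count[i] && index == PKMAP_INVALID_INDEX)
			index = i;
	}
	if (need_flush)
		flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));

	return index;
}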

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-13 14:12               ` JoonSoo Kim
@ 2012-11-13 15:01                 ` Minchan Kim
  2012-11-14 17:09                   ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Minchan Kim @ 2012-11-13 15:01 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

On Tue, Nov 13, 2012 at 11:12:28PM +0900, JoonSoo Kim wrote:
> 2012/11/13 Minchan Kim <minchan@kernel.org>:
> > On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote:
> >> 2012/11/3 Minchan Kim <minchan@kernel.org>:
> >> > Hi Joonsoo,
> >> >
> >> > On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
> >> >> Hello, Minchan.
> >> >>
> >> >> 2012/11/1 Minchan Kim <minchan@kernel.org>:
> >> >> > On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
> >> >> >> In current code, after flush_all_zero_pkmaps() is invoked,
> >> >> >> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
> >> >> >> return index of first flushed entry. With this index,
> >> >> >> we can immediately map highmem page to virtual address represented by index.
> >> >> >> So change return type of flush_all_zero_pkmaps()
> >> >> >> and return index of first flushed entry.
> >> >> >>
> >> >> >> Additionally, update last_pkmap_nr to this index.
> >> >> >> It is certain that entry which is below this index is occupied by other mapping,
> >> >> >> therefore updating last_pkmap_nr to this index is reasonable optimization.
> >> >> >>
> >> >> >> Cc: Mel Gorman <mel@csn.ul.ie>
> >> >> >> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >> >> >> Cc: Minchan Kim <minchan@kernel.org>
> >> >> >> Signed-off-by: Joonsoo Kim <js1304@gmail.com>
> >> >> >>
> >> >> >> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> >> >> >> index ef788b5..97ad208 100644
> >> >> >> --- a/include/linux/highmem.h
> >> >> >> +++ b/include/linux/highmem.h
> >> >> >> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
> >> >> >>
> >> >> >>  #ifdef CONFIG_HIGHMEM
> >> >> >>  #include <asm/highmem.h>
> >> >> >> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
> >> >> >>
> >> >> >>  /* declarations for linux/mm/highmem.c */
> >> >> >>  unsigned int nr_free_highpages(void);
> >> >> >> diff --git a/mm/highmem.c b/mm/highmem.c
> >> >> >> index d98b0a9..b365f7b 100644
> >> >> >> --- a/mm/highmem.c
> >> >> >> +++ b/mm/highmem.c
> >> >> >> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
> >> >> >>       return virt_to_page(addr);
> >> >> >>  }
> >> >> >>
> >> >> >> -static void flush_all_zero_pkmaps(void)
> >> >> >> +static unsigned int flush_all_zero_pkmaps(void)
> >> >> >>  {
> >> >> >>       int i;
> >> >> >> -     int need_flush = 0;
> >> >> >> +     unsigned int index = PKMAP_INVALID_INDEX;
> >> >> >>
> >> >> >>       flush_cache_kmaps();
> >> >> >>
> >> >> >> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
> >> >> >>                         &pkmap_page_table[i]);
> >> >> >>
> >> >> >>               set_page_address(page, NULL);
> >> >> >> -             need_flush = 1;
> >> >> >> +             if (index == PKMAP_INVALID_INDEX)
> >> >> >> +                     index = i;
> >> >> >>       }
> >> >> >> -     if (need_flush)
> >> >> >> +     if (index != PKMAP_INVALID_INDEX)
> >> >> >>               flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
> >> >> >> +
> >> >> >> +     return index;
> >> >> >>  }
> >> >> >>
> >> >> >>  /**
> >> >> >> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
> >> >> >>   */
> >> >> >>  void kmap_flush_unused(void)
> >> >> >>  {
> >> >> >> +     unsigned int index;
> >> >> >> +
> >> >> >>       lock_kmap();
> >> >> >> -     flush_all_zero_pkmaps();
> >> >> >> +     index = flush_all_zero_pkmaps();
> >> >> >> +     if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
> >> >> >> +             last_pkmap_nr = index;
> >> >> >
> >> >> > I don't know how kmap_flush_unused is really fast path so how my nitpick
> >> >> > is effective. Anyway,
> >> >> > What problem happens if we do following as?
> >> >> >
> >> >> > lock()
> >> >> > index = flush_all_zero_pkmaps();
> >> >> > if (index != PKMAP_INVALID_INDEX)
> >> >> >         last_pkmap_nr = index;
> >> >> > unlock();
> >> >> >
> >> >> > Normally, last_pkmap_nr is increased with searching empty slot in
> >> >> > map_new_virtual. So I expect return value of flush_all_zero_pkmaps
> >> >> > in kmap_flush_unused normally become either less than last_pkmap_nr
> >> >> > or last_pkmap_nr + 1.
> >> >>
> >> >> There is a case where the return value of flush_all_zero_pkmaps() is
> >> >> larger than last_pkmap_nr.
> >> >
> >> > I see, but why is that a problem? flush_all_zero_pkmaps returning a value
> >> > larger than last_pkmap_nr means that there is no free slot below that value.
> >> > So an unconditional last_pkmap_nr update is valid.
> >>
> >> I think that this is not true.
> >> Look at a slightly different example.
> >>
> >> Assume LAST_PKMAP = 20, indices 1-9 and 12-19 are kmapped, and 10, 11 are kunmapped.
> >>
> >> do kmap_flush_unused() => flushes indices 10, 11 => last_pkmap_nr = 10
> >> do kunmap() on index 17
> >> do kmap_flush_unused() => flushes index 17 => last_pkmap_nr = 17?
> >>
> >> In this case, the unconditional last_pkmap_nr update jumps past indices 10 and 11,
> >> which are still free. So the conditional update is needed.
> >
> > Thanks for pointing that out, Joonsoo.
> > You're right. I misunderstood your flush_all_zero_pkmaps change.
> > With your change, flush_all_zero_pkmaps returns the first *flushed* free slot index.
> > What's the benefit of returning the first flushed free slot index rather than the first free slot index?
> 
> If flush_all_zero_pkmaps() returned the first free slot index rather than the
> first flushed free slot,
> we would need another comparison, like 'if (pkmap_count[i] == 0)', and
> another local variable to determine whether a flush occurred or not.
> I want to minimize this overhead and the churning of the code, although
> they are negligible.
> 
> > I think flush_all_zero_pkmaps should return the first free slot because a caller of
> > flush_all_zero_pkmaps doesn't care whether the slot was just flushed or not.
> > All it wants to know is whether the slot is free. In that case, we can remove the
> > above check, and it makes flush_all_zero_pkmaps more intuitive.
> 
> Yes, it is more intuitive, but as I mentioned above, it needs another comparison,
> and with that, the benefit of avoiding a re-iteration when there is no
> free slot may disappear.

If you're that keen on performance, why do you have such code?
You could remove the branch below.

diff --git a/mm/highmem.c b/mm/highmem.c
index c8be376..44a88dd 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -114,7 +114,7 @@ static unsigned int flush_all_zero_pkmaps(void)
 
        flush_cache_kmaps();
 
-       for (i = 0; i < LAST_PKMAP; i++) {
+       for (i = LAST_PKMAP - 1; i >= 0; i--) {
                struct page *page;
 
                /*
@@ -141,8 +141,7 @@ static unsigned int flush_all_zero_pkmaps(void)
                pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
                set_page_address(page, NULL);
-               if (index == PKMAP_INVALID_INDEX)
-                       index = i;
+               index = i;
        }
        if (index != PKMAP_INVALID_INDEX)
                flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
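
(Scanning downward like this, the last assignment to index wins, so index
naturally ends up as the lowest flushed entry without the extra branch in
the loop body.)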


Anyway, if you are concerned about performance, okay, let's give up on making the code
clearer, although I haven't seen any report about kmap performance. Instead, please
consider the above optimization, because you have already broken what you mentioned.
If we can't make the function clear, another option is to add a function comment. Please do.
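
For example, here is an untested sketch of the free-slot-returning variant;
the pkmap_count[i] check and the extra 'flushed' flag are exactly the
additional cost you mention above (sanity checks omitted for brevity):

static unsigned int flush_all_zero_pkmaps(void)
{
	int i;
	/* lowest free slot seen so far, or PKMAP_INVALID_INDEX */
	unsigned int index = PKMAP_INVALID_INDEX;
	int flushed = 0;

	flush_cache_kmaps();

	for (i = LAST_PKMAP - 1; i >= 0; i--) {
		struct page *page;

		/* already-free slot: remember it, nothing to flush */
		if (!pkmap_count[i]) {
			index = i;
			continue;
		}
		if (pkmap_count[i] != 1)
			continue;
		pkmap_count[i] = 0;

		page = pte_page(pkmap_page_table[i]);
		pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
		set_page_address(page, NULL);
		index = i;
		flushed = 1;
	}
	if (flushed)
		flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));

	return index;
}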


-- 
Kind Regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-13 15:01                 ` Minchan Kim
@ 2012-11-14 17:09                   ` JoonSoo Kim
  2012-11-19 23:46                     ` Minchan Kim
  0 siblings, 1 reply; 146+ messages in thread
From: JoonSoo Kim @ 2012-11-14 17:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

Hi, Minchan.

2012/11/14 Minchan Kim <minchan@kernel.org>:
> [... full quote of the earlier discussion and diff snipped; see above ...]
>
> Anyway, if you are concerned about performance, okay, let's give up on making the code
> clearer, although I haven't seen any report about kmap performance. Instead, please
> consider the above optimization, because you have already broken what you mentioned.
> If we can't make the function clear, another option is to add a function comment. Please do.

Yes, I also haven't seen any report about kmap performance.
Through your review comments, I eventually concluded that this patch will not
give any benefit.
So how about dropping it?

Thanks for the review.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-14 17:09                   ` JoonSoo Kim
@ 2012-11-19 23:46                     ` Minchan Kim
  2012-11-27 15:01                       ` JoonSoo Kim
  0 siblings, 1 reply; 146+ messages in thread
From: Minchan Kim @ 2012-11-19 23:46 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Peter Zijlstra

Hi Joonsoo,
Sorry for the delay.

On Thu, Nov 15, 2012 at 02:09:04AM +0900, JoonSoo Kim wrote:
> Hi, Minchan.
> 
> [... full quote of the earlier discussion snipped; see above ...]
> 
> Yes, I also haven't seen any report about kmap performance.
> Through your review comments, I eventually concluded that this patch will not
> give any benefit.
> So how about dropping it?

Personally, I would prefer to proceed, but if you aren't confident about the gain,
there's no problem with dropping it.
Thanks.

> 
> Thanks for the review.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  2012-11-19 23:46                     ` Minchan Kim
@ 2012-11-27 15:01                       ` JoonSoo Kim
  0 siblings, 0 replies; 146+ messages in thread
From: JoonSoo Kim @ 2012-11-27 15:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Mel Gorman, Minchan Kim

Hello, Andrew.

2012/11/20 Minchan Kim <minchan@kernel.org>:
> Hi Joonsoo,
> Sorry for the delay.
>
> [... full quote of the earlier discussion snipped; see above ...]
>
> Personally, I would prefer to proceed, but if you aren't confident about the gain,
> there's no problem with dropping it.
> Thanks.

During the review, I concluded that this patch has no gain,
and it churns the code too much.
So I would like to drop this patch from your tree.

Sorry for the late notification.

^ permalink raw reply	[flat|nested] 146+ messages in thread

end of thread, other threads:[~2012-11-27 15:01 UTC | newest]

Thread overview: 146+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Yes>
2012-07-16 16:14 ` [PATCH 1/3] mm: correct return value of migrate_pages() Joonsoo Kim
2012-07-16 16:14   ` [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
2012-07-16 17:26     ` Christoph Lameter
2012-07-16 17:40     ` Michal Nazarewicz
2012-07-16 17:59       ` JoonSoo Kim
2012-07-17 13:02         ` Michal Nazarewicz
2012-07-16 16:14   ` [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
2012-07-16 17:29     ` Christoph Lameter
2012-07-16 17:40     ` Michal Nazarewicz
2012-07-16 18:40       ` JoonSoo Kim
2012-07-17 13:16         ` Michal Nazarewicz
2012-07-16 17:14   ` [PATCH 4] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
2012-07-16 17:30     ` Christoph Lameter
2012-07-16 17:23   ` [PATCH 1/3] mm: correct return value of migrate_pages() Christoph Lameter
2012-07-16 17:32     ` JoonSoo Kim
2012-07-16 17:37       ` Christoph Lameter
2012-07-16 17:40   ` Michal Nazarewicz
2012-07-16 17:57     ` JoonSoo Kim
2012-07-16 18:05       ` Christoph Lameter
2012-07-17 12:33 ` [PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
2012-07-17 12:33   ` [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
2012-07-17 14:28     ` Christoph Lameter
2012-07-17 15:41       ` JoonSoo Kim
2012-07-17 12:33   ` [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
2012-07-17 13:25     ` Michal Nazarewicz
2012-07-17 15:45       ` JoonSoo Kim
2012-07-17 15:49         ` [PATCH 3/4 v3] " Joonsoo Kim
2012-07-17 12:33   ` [PATCH 4/4 v2] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
2012-07-27 17:55 ` [RESEND PATCH 1/4 v3] mm: correct return value of migrate_pages() and migrate_huge_pages() Joonsoo Kim
2012-07-27 17:55   ` [RESEND PATCH 2/4 v3] mm: fix possible incorrect return value of migrate_pages() syscall Joonsoo Kim
2012-07-27 20:57     ` Christoph Lameter
2012-07-28  6:16       ` JoonSoo Kim
2012-07-30 19:30         ` Christoph Lameter
2012-07-27 17:55   ` [RESEND PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range() Joonsoo Kim
2012-07-27 17:55   ` [RESEND PATCH 4/4 v3] mm: fix possible incorrect return value of move_pages() syscall Joonsoo Kim
2012-07-27 20:54     ` Christoph Lameter
2012-07-28  6:09       ` JoonSoo Kim
2012-07-30 19:29         ` Christoph Lameter
2012-07-31  3:34           ` JoonSoo Kim
2012-07-31 14:04             ` Christoph Lameter
2012-08-01  5:15           ` Michael Kerrisk
2012-08-01 18:00             ` Christoph Lameter
2012-08-02  5:52               ` Michael Kerrisk
2012-08-13 16:17 ` [PATCH 1/5] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
2012-08-13 16:17   ` [PATCH 2/5] workqueue: change value of lcpu in queue_delayed_work_on() Joonsoo Kim
2012-08-13 16:32     ` Tejun Heo
2012-08-13 16:54       ` JoonSoo Kim
2012-08-13 17:03         ` Tejun Heo
2012-08-13 17:43           ` JoonSoo Kim
2012-08-13 18:00             ` Tejun Heo
2012-08-14 18:04               ` JoonSoo Kim
2012-08-13 16:17   ` [PATCH 3/5] workqueue: introduce system_highpri_wq Joonsoo Kim
2012-08-13 16:17   ` [PATCH 4/5] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
2012-08-13 16:34     ` Tejun Heo
2012-08-13 16:57       ` JoonSoo Kim
2012-08-13 17:05         ` Tejun Heo
2012-08-13 17:45           ` JoonSoo Kim
2012-08-13 16:17   ` [PATCH 5/5] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
2012-08-13 16:36     ` Tejun Heo
2012-08-13 17:02       ` JoonSoo Kim
2012-08-13 17:07         ` Tejun Heo
2012-08-13 17:52           ` JoonSoo Kim
2012-08-14 18:10 ` [PATCH v2 0/6] system_highpri_wq Joonsoo Kim
2012-08-14 18:10   ` [PATCH v2 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
2012-08-14 18:10   ` [PATCH v2 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work() Joonsoo Kim
2012-08-14 18:34     ` Tejun Heo
2012-08-14 18:56       ` JoonSoo Kim
2012-08-14 18:10   ` [PATCH v2 3/6] workqueue: change value of lcpu in __queue_delayed_work_on() Joonsoo Kim
2012-08-14 19:00     ` Tejun Heo
2012-08-14 18:10   ` [PATCH v2 4/6] workqueue: introduce system_highpri_wq Joonsoo Kim
2012-08-14 18:10   ` [PATCH v2 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
2012-08-14 18:29     ` Tejun Heo
2012-08-14 18:10   ` [PATCH v2 6/6] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
2012-08-15 14:25 ` [PATCH v3 0/6] system_highpri_wq Joonsoo Kim
2012-08-15 14:25   ` [PATCH v3 1/6] workqueue: use enum value to set array size of pools in gcwq Joonsoo Kim
2012-08-15 14:25   ` [PATCH v3 2/6] workqueue: correct req_cpu in trace_workqueue_queue_work() Joonsoo Kim
2012-08-15 14:25   ` [PATCH v3 3/6] workqueue: change value of lcpu in __queue_delayed_work_on() Joonsoo Kim
2012-08-15 14:25   ` [PATCH v3 4/6] workqueue: introduce system_highpri_wq Joonsoo Kim
2012-08-15 14:25   ` [PATCH v3 5/6] workqueue: use system_highpri_wq for highpri workers in rebind_workers() Joonsoo Kim
2012-08-15 14:25   ` [PATCH v3 6/6] workqueue: use system_highpri_wq for unbind_work Joonsoo Kim
2012-08-16 21:22   ` [PATCH v3 0/6] system_highpri_wq Tejun Heo
2012-08-17 13:38     ` JoonSoo Kim
2012-08-24 16:05 ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Joonsoo Kim
2012-08-24 16:05   ` [PATCH 2/2] slub: correct the calculation of the number of cpu objects in get_partial_node Joonsoo Kim
2012-08-24 16:15     ` Christoph Lameter
2012-08-24 16:28       ` JoonSoo Kim
2012-08-24 16:31         ` Christoph Lameter
2012-08-24 16:40           ` JoonSoo Kim
2012-08-24 16:12   ` [PATCH 1/2] slub: rename cpu_partial to max_cpu_object Christoph Lameter
2012-08-25 14:11 ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Joonsoo Kim
2012-08-25 14:11   ` [PATCH 2/2] slab: fix starting index for finding another object Joonsoo Kim
2012-09-03 10:08   ` [PATCH 1/2] slab: do ClearSlabPfmemalloc() for all pages of slab Mel Gorman
2012-10-18 23:18 ` [PATCH 0/2] clean-up initialization of deferrable timer Joonsoo Kim
2012-10-18 23:18   ` [PATCH 1/2] timer: add setup_timer_deferrable() macro Joonsoo Kim
2012-10-18 23:18   ` [PATCH 2/2] timer: use new " Joonsoo Kim
2012-10-26 14:08   ` [PATCH 0/2] clean-up initialization of deferrable timer JoonSoo Kim
2012-10-20 15:48 ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Joonsoo Kim
2012-10-20 15:48   ` [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA Joonsoo Kim
2012-10-22 14:31     ` Christoph Lameter
2012-10-23  2:29       ` JoonSoo Kim
2012-10-23  6:16         ` Eric Dumazet
2012-10-23 16:12           ` JoonSoo Kim
2012-10-24  8:05   ` [PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions Pekka Enberg
2012-10-24 13:36     ` Christoph Lameter
2012-10-20 16:30 ` [PATCH 0/3] workqueue: minor cleanup Joonsoo Kim
2012-10-20 16:30   ` [PATCH 1/3] workqueue: optimize mod_delayed_work_on() when @delay == 0 Joonsoo Kim
2012-10-20 16:30   ` [PATCH 2/3] workqueue: trivial fix for return statement in work_busy() Joonsoo Kim
2012-10-20 22:53     ` Tejun Heo
2012-10-20 16:30   ` [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up() Joonsoo Kim
2012-10-20 22:57     ` Tejun Heo
2012-10-23  1:44       ` JoonSoo Kim
2012-10-24 19:14         ` Tejun Heo
2012-10-28 19:12 ` [PATCH 0/5] minor clean-up and optimize highmem related code Joonsoo Kim
2012-10-28 19:12   ` [PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
2012-10-29  1:48     ` Minchan Kim
2012-10-28 19:12   ` [PATCH 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
2012-10-29  1:52     ` Minchan Kim
2012-10-30 21:31     ` Andrew Morton
2012-10-31  5:14       ` Minchan Kim
2012-10-31 15:01       ` JoonSoo Kim
2012-10-28 19:12   ` [PATCH 3/5] mm, highmem: remove page_address_pool list Joonsoo Kim
2012-10-29  1:57     ` Minchan Kim
2012-10-28 19:12   ` [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry Joonsoo Kim
2012-10-29  2:06     ` Minchan Kim
2012-10-29 13:12       ` JoonSoo Kim
2012-10-28 19:12   ` [PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR() Joonsoo Kim
2012-10-29  2:09     ` Minchan Kim
2012-10-29  2:12   ` [PATCH 0/5] minor clean-up and optimize highmem related code Minchan Kim
2012-10-29 13:15     ` JoonSoo Kim
2012-10-31 17:11       ` JoonSoo Kim
2012-10-31 16:56 ` [PATCH v2 " Joonsoo Kim
2012-10-31 16:56   ` [PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap Joonsoo Kim
2012-10-31 16:56   ` [PATCH v2 2/5] mm, highmem: remove useless pool_lock Joonsoo Kim
2012-10-31 16:56   ` [PATCH v2 3/5] mm, highmem: remove page_address_pool list Joonsoo Kim
2012-10-31 16:56   ` [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry Joonsoo Kim
2012-11-01  5:03     ` Minchan Kim
2012-11-02 19:07       ` JoonSoo Kim
2012-11-02 22:42         ` Minchan Kim
2012-11-13  0:30           ` JoonSoo Kim
2012-11-13 12:49             ` Minchan Kim
2012-11-13 14:12               ` JoonSoo Kim
2012-11-13 15:01                 ` Minchan Kim
2012-11-14 17:09                   ` JoonSoo Kim
2012-11-19 23:46                     ` Minchan Kim
2012-11-27 15:01                       ` JoonSoo Kim
2012-10-31 16:56   ` [PATCH v2 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR() Joonsoo Kim
