linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
@ 2014-02-28 22:18 Ning Qu
  2014-02-28 22:18 ` [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs Ning Qu
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Ning Qu @ 2014-02-28 22:18 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Ning Qu

This is a follow-up to "mm: map few pages around fault address if they are in page cache".

We use the generic filemap_map_pages as ->map_pages in shmem/tmpfs.

Please consider applying.

=========================================================================
Below are some simple experiment numbers for this patch; let me know if
you would like more:

Tested on a Xeon machine with 64GiB of RAM, using the current default
fault-around order of 4.

Sequential access 8GiB file
			Baseline 	with-patch
1 thread
    minor fault		205		101	
    time, seconds	7.94		7.82

Random access 8GiB file
			Baseline 	with-patch
1 thread
    minor fault		724		623
    time, seconds	9.75		9.84


Ning Qu (1):
  mm: implement ->map_pages for shmem/tmpfs

 mm/shmem.c | 1 +
 1 file changed, 1 insertion(+)

-- 
1.9.0.279.gdc9e3eb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs
  2014-02-28 22:18 [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache Ning Qu
@ 2014-02-28 22:18 ` Ning Qu
  2014-03-01  1:20   ` Hugh Dickins
  2014-02-28 22:34 ` [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Ning Qu @ 2014-02-28 22:18 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Ning Qu

In shmem/tmpfs we also use the generic filemap_map_pages;
it seems the additional checking is not worth a separate
version of ->map_pages for it.

Signed-off-by: Ning Qu <quning@google.com>
---
 mm/shmem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 1f18c9d..2ea4e89 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2783,6 +2783,7 @@ static const struct super_operations shmem_ops = {
 
 static const struct vm_operations_struct shmem_vm_ops = {
 	.fault		= shmem_fault,
+	.map_pages	= filemap_map_pages,
 #ifdef CONFIG_NUMA
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
-- 
1.9.0.279.gdc9e3eb



* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
  2014-02-28 22:18 [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache Ning Qu
  2014-02-28 22:18 ` [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs Ning Qu
@ 2014-02-28 22:34 ` Andrew Morton
  2014-03-01  0:35 ` Ning Qu
       [not found] ` <CACz4_2eYUOkHdOtBJGDGMMwBcQkyPs8BDXQ491Ab_ig4z8q5mQ@mail.gmail.com>
  3 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2014-02-28 22:34 UTC (permalink / raw)
  To: Ning Qu
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Hugh Dickins

On Fri, 28 Feb 2014 14:18:50 -0800 Ning Qu <quning@google.com> wrote:

> This is a follow-up to "mm: map few pages around fault address if they are in page cache".
> 
> We use the generic filemap_map_pages as ->map_pages in shmem/tmpfs.
> 

Please cc Hugh on shmem/tmpfs things

> 
> =========================================================================
> Below are some simple experiment numbers for this patch; let me know if
> you would like more:
> 
> Tested on a Xeon machine with 64GiB of RAM, using the current default
> fault-around order of 4.
> 
> Sequential access 8GiB file
> 			Baseline 	with-patch
> 1 thread
>     minor fault		205		101	

Confused.  Sequential access of an 8G file should generate 2,000,000
minor faults, not 205.  And with FAULT_AROUND_ORDER=4, that should come
down to 2,000,000/16 minor faults when using faultaround?

>     time, seconds	7.94		7.82
> 
> Random access 8GiB file
> 			Baseline 	with-patch
> 1 thread
>     minor fault		724		623
>     time, seconds	9.75		9.84
> 



* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
  2014-02-28 22:18 [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache Ning Qu
  2014-02-28 22:18 ` [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs Ning Qu
  2014-02-28 22:34 ` [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache Andrew Morton
@ 2014-03-01  0:35 ` Ning Qu
  2014-03-01  1:41   ` Andrew Morton
       [not found] ` <CACz4_2eYUOkHdOtBJGDGMMwBcQkyPs8BDXQ491Ab_ig4z8q5mQ@mail.gmail.com>
  3 siblings, 1 reply; 16+ messages in thread
From: Ning Qu @ 2014-03-01  0:35 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Hugh Dickins
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Ning Qu

Sorry about my mistake in the earlier experiments; here are the real numbers.

Btw, there are apparently still some questions about the results, and
I will sync with Kirill about his test command line.

Below are some simple experiment numbers for this patch; let me know if
you would like more:

Tested on a Xeon machine with 64GiB of RAM, using the current default
fault-around order of 4.

Sequential access 8GiB file
                        Baseline        with-patch
1 thread
    minor fault         8,389,052       4,456,530
    time, seconds       9.55            8.31

Random access 8GiB file
                        Baseline        with-patch
1 thread
    minor fault         8,389,315       6,423,386
    time, seconds       11.68           10.51



Best wishes,
-- 
Ning Qu


* Re: [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs
  2014-02-28 22:18 ` [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs Ning Qu
@ 2014-03-01  1:20   ` Hugh Dickins
  2014-03-01  6:36     ` Ning Qu
  2014-03-04 20:02     ` Hugh Dickins
  0 siblings, 2 replies; 16+ messages in thread
From: Hugh Dickins @ 2014-03-01  1:20 UTC (permalink / raw)
  To: Ning Qu
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel

On Fri, 28 Feb 2014, Ning Qu wrote:

> In shmem/tmpfs we also use the generic filemap_map_pages;
> it seems the additional checking is not worth a separate
> version of ->map_pages for it.
> 
> Signed-off-by: Ning Qu <quning@google.com>
> ---
>  mm/shmem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1f18c9d..2ea4e89 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2783,6 +2783,7 @@ static const struct super_operations shmem_ops = {
>  
>  static const struct vm_operations_struct shmem_vm_ops = {
>  	.fault		= shmem_fault,
> +	.map_pages	= filemap_map_pages,
>  #ifdef CONFIG_NUMA
>  	.set_policy     = shmem_set_policy,
>  	.get_policy     = shmem_get_policy,
> -- 

(There's no need for a 0/1, all the info should go into the one patch.)

I expect this will prove to be a very sensible and adequate patch,
thank you: it probably wouldn't be worth more effort to give shmem
anything special of its own, and filemap_map_pages() is already
(almost) coping with exceptional entries.

But I can't Ack it until I've tested it some more, won't be able to
do so until Sunday; and even then some doubt, since this and Kirill's
are built upon mmotm/next, which after a while gives me spinlock
lockups under load these days, yet to be investigated.

"almost" above because, Kirill, even without Ning's extension to
shmem, your filemap_map_pages() soon crashes on an exceptional entry:

Don't try to dereference an exceptional entry.

Signed-off-by: Hugh Dickins <hughd@google.com>

--- mmotm+kirill/mm/filemap.c	2014-02-28 15:17:50.984019060 -0800
+++ linux/mm/filemap.c	2014-02-28 16:38:04.976633308 -0800
@@ -2084,7 +2084,7 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				break;
 			else
-				goto next;
+				continue;
 		}
 
 		if (!page_cache_get_speculative(page))


* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
  2014-03-01  0:35 ` Ning Qu
@ 2014-03-01  1:41   ` Andrew Morton
  2014-03-01  6:10     ` Ning Qu
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2014-03-01  1:41 UTC (permalink / raw)
  To: Ning Qu
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Hugh Dickins, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	linux-kernel, Ning Qu

On Fri, 28 Feb 2014 16:35:16 -0800 Ning Qu <quning@gmail.com> wrote:

> Sorry about my mistake in the earlier experiments; here are the real numbers.
> 
> Btw, there are apparently still some questions about the results, and
> I will sync with Kirill about his test command line.
> 
> Below are some simple experiment numbers for this patch; let me know if
> you would like more:
> 
> Tested on a Xeon machine with 64GiB of RAM, using the current default
> fault-around order of 4.
> 
> Sequential access 8GiB file
>                         Baseline        with-patch
> 1 thread
>     minor fault         8,389,052    4,456,530
>     time, seconds    9.55            8.31

The numbers still seem wrong.  I'd expect to see almost exactly 2M minor
faults with this test.

Looky:

#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define G (1024 * 1024 * 1024)

int main(int argc, char *argv[])
{
	char *p;
	int fd;
	unsigned long idx;
	int sum = 0;

	fd = open("foo", O_RDONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	p = mmap(NULL, 1 * G, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (idx = 0; idx < 1 * G; idx += 4096)
		sum += p[idx];
	printf("%d\n", sum);
	exit(0);
}

z:/home/akpm> /usr/bin/time ./a.out
0
0.05user 0.33system 0:00.38elapsed 99%CPU (0avgtext+0avgdata 4195856maxresident)k
0inputs+0outputs (0major+262264minor)pagefaults 0swaps

z:/home/akpm> dc
16o
262264 4 * p
1001E0

That's close!


* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
  2014-03-01  1:41   ` Andrew Morton
@ 2014-03-01  6:10     ` Ning Qu
  2014-03-01  6:27       ` Ning Qu
  0 siblings, 1 reply; 16+ messages in thread
From: Ning Qu @ 2014-03-01  6:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Hugh Dickins, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	linux-kernel

Yes, I am using iozone with -i 0 -i 1. Let me try the simplest test,
as you mentioned.
Best wishes,
-- 
Ning Qu


On Fri, Feb 28, 2014 at 5:41 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri, 28 Feb 2014 16:35:16 -0800 Ning Qu <quning@gmail.com> wrote:
>
>> Sorry about my fault about the experiments, here is the real one.
>>
>> Btw, apparently, there are still some questions about the results and
>> I will sync with Kirill about his test command line.
>>
>> Below is just some simple experiment numbers from this patch, let me know if
>> you would like more:
>>
>> Tested on Xeon machine with 64GiB of RAM, using the current default fault
>> order 4.
>>
>> Sequential access 8GiB file
>>                         Baseline        with-patch
>> 1 thread
>>     minor fault         8,389,052    4,456,530
>>     time, seconds    9.55            8.31
>
> The numbers still seem wrong.  I'd expect to see almost exactly 2M minor
> faults with this test.
>
> Looky:
>
> #include <sys/mman.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> #define G (1024 * 1024 * 1024)
>
> int main(int argc, char *argv[])
> {
>         char *p;
>         int fd;
>         unsigned long idx;
>         int sum = 0;
>
>         fd = open("foo", O_RDONLY);
>         if (fd < 0) {
>                 perror("open");
>                 exit(1);
>         }
>         p = mmap(NULL, 1 * G, PROT_READ, MAP_PRIVATE, fd, 0);
>         if (p == MAP_FAILED) {
>                 perror("mmap");
>                 exit(1);
>         }
>
>         for (idx = 0; idx < 1 * G; idx += 4096)
>                 sum += p[idx];
>         printf("%d\n", sum);
>         exit(0);
> }
>
> z:/home/akpm> /usr/bin/time ./a.out
> 0
> 0.05user 0.33system 0:00.38elapsed 99%CPU (0avgtext+0avgdata 4195856maxresident)k
> 0inputs+0outputs (0major+262264minor)pagefaults 0swaps
>
> z:/home/akpm> dc
> 16o
> 262264 4 * p
> 1001E0
>
> That's close!


* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
  2014-03-01  6:10     ` Ning Qu
@ 2014-03-01  6:27       ` Ning Qu
       [not found]         ` <20140303143834.90ebe8ec5c6a369e54a599ec@linux-foundation.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Ning Qu @ 2014-03-01  6:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Hugh Dickins, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	linux-kernel

Yes, the simple test does verify that the page fault numbers are
correct with the patch. My previous results are from the command lines
below, which also show some performance improvement from this change
in tmpfs.

sequential access
/usr/bin/time -a ./iozone -B s 8g -i 0 -i 1

random access
/usr/bin/time -a ./iozone -B s 8g -i 0 -i 2
Best wishes,
-- 
Ning Qu


On Fri, Feb 28, 2014 at 10:10 PM, Ning Qu <quning@gmail.com> wrote:
> Yes, I am using the iozone -i 0 -i 1. Let me try the most simple test
> as you mentioned.
> Best wishes,
> --
> Ning Qu
>
>
> On Fri, Feb 28, 2014 at 5:41 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Fri, 28 Feb 2014 16:35:16 -0800 Ning Qu <quning@gmail.com> wrote:
>>
>>> Sorry about my fault about the experiments, here is the real one.
>>>
>>> Btw, apparently, there are still some questions about the results and
>>> I will sync with Kirill about his test command line.
>>>
>>> Below is just some simple experiment numbers from this patch, let me know if
>>> you would like more:
>>>
>>> Tested on Xeon machine with 64GiB of RAM, using the current default fault
>>> order 4.
>>>
>>> Sequential access 8GiB file
>>>                         Baseline        with-patch
>>> 1 thread
>>>     minor fault         8,389,052    4,456,530
>>>     time, seconds    9.55            8.31
>>
>> The numbers still seem wrong.  I'd expect to see almost exactly 2M minor
>> faults with this test.
>>
>> Looky:
>>
>> #include <sys/mman.h>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>>
>> #define G (1024 * 1024 * 1024)
>>
>> int main(int argc, char *argv[])
>> {
>>         char *p;
>>         int fd;
>>         unsigned long idx;
>>         int sum = 0;
>>
>>         fd = open("foo", O_RDONLY);
>>         if (fd < 0) {
>>                 perror("open");
>>                 exit(1);
>>         }
>>         p = mmap(NULL, 1 * G, PROT_READ, MAP_PRIVATE, fd, 0);
>>         if (p == MAP_FAILED) {
>>                 perror("mmap");
>>                 exit(1);
>>         }
>>
>>         for (idx = 0; idx < 1 * G; idx += 4096)
>>                 sum += p[idx];
>>         printf("%d\n", sum);
>>         exit(0);
>> }
>>
>> z:/home/akpm> /usr/bin/time ./a.out
>> 0
>> 0.05user 0.33system 0:00.38elapsed 99%CPU (0avgtext+0avgdata 4195856maxresident)k
>> 0inputs+0outputs (0major+262264minor)pagefaults 0swaps
>>
>> z:/home/akpm> dc
>> 16o
>> 262264 4 * p
>> 1001E0
>>
>> That's close!


* Re: [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs
  2014-03-01  1:20   ` Hugh Dickins
@ 2014-03-01  6:36     ` Ning Qu
  2014-03-03 11:07       ` Kirill A. Shutemov
  2014-03-04 20:02     ` Hugh Dickins
  1 sibling, 1 reply; 16+ messages in thread
From: Ning Qu @ 2014-03-01  6:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	linux-kernel

Btw, should we first check whether the page returned by radix_tree_deref_slot() is NULL?

diff --git a/mm/filemap.c b/mm/filemap.c
index 1bc12a9..c129ee5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1745,6 +1745,8 @@ void filemap_map_pages(struct vm_area_struct
*vma, struct vm_fault *vmf)
                        break;
 repeat:
                page = radix_tree_deref_slot(slot);
+               if (unlikely(!page))
+                       continue;
                if (radix_tree_exception(page)) {
                        if (radix_tree_deref_retry(page))


Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Fri, Feb 28, 2014 at 5:20 PM, Hugh Dickins <hughd@google.com> wrote:
> On Fri, 28 Feb 2014, Ning Qu wrote:
>
>> In shmem/tmpfs, we also use the generic filemap_map_pages,
>> seems the additional checking is not worth a separate version
>> of map_pages for it.
>>
>> Signed-off-by: Ning Qu <quning@google.com>
>> ---
>>  mm/shmem.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 1f18c9d..2ea4e89 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2783,6 +2783,7 @@ static const struct super_operations shmem_ops = {
>>
>>  static const struct vm_operations_struct shmem_vm_ops = {
>>       .fault          = shmem_fault,
>> +     .map_pages      = filemap_map_pages,
>>  #ifdef CONFIG_NUMA
>>       .set_policy     = shmem_set_policy,
>>       .get_policy     = shmem_get_policy,
>> --
>
> (There's no need for a 0/1, all the info should go into the one patch.)
>
> I expect this will prove to be a very sensible and adequate patch,
> thank you: it probably wouldn't be worth more effort to give shmem
> anything special of its own, and filemap_map_pages() is already
> (almost) coping with exceptional entries.
>
> But I can't Ack it until I've tested it some more, won't be able to
> do so until Sunday; and even then some doubt, since this and Kirill's
> are built upon mmotm/next, which after a while gives me spinlock
> lockups under load these days, yet to be investigated.
>
> "almost" above because, Kirill, even without Ning's extension to
> shmem, your filemap_map_pages() soon crashes on an exceptional entry:
>
> Don't try to dereference an exceptional entry.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
>
> --- mmotm+kirill/mm/filemap.c   2014-02-28 15:17:50.984019060 -0800
> +++ linux/mm/filemap.c  2014-02-28 16:38:04.976633308 -0800
> @@ -2084,7 +2084,7 @@ repeat:
>                         if (radix_tree_deref_retry(page))
>                                 break;
>                         else
> -                               goto next;
> +                               continue;
>                 }
>
>                 if (!page_cache_get_speculative(page))


* Re: [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs
  2014-03-01  6:36     ` Ning Qu
@ 2014-03-03 11:07       ` Kirill A. Shutemov
  2014-03-03 18:49         ` Ning Qu
  0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2014-03-03 11:07 UTC (permalink / raw)
  To: Ning Qu, Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	linux-kernel

Ning Qu wrote:
> Btw, should we first check if page returned by radix_tree_deref_slot is NULL?

Yes, we should. I don't know how I missed that. :(

The patch below should address both issues.

From dca24c9a1f31ee1599fe81e9a60d4f87a4eaf0ea Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 3 Mar 2014 12:07:03 +0200
Subject: [PATCH] mm: filemap_map_pages() avoid dereference NULL/exception
 slots

radix_tree_deref_slot() can return NULL: add the missing check.

Do not dereference 'page': we can get here as a result of the
radix_tree_exception(page) check.

Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Ning Qu <quning@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 5f4fe7f0c258..e48624634927 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1745,6 +1745,8 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
 			break;
 repeat:
 		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			goto next;
 		if (radix_tree_exception(page)) {
 			if (radix_tree_deref_retry(page))
 				break;
@@ -1790,7 +1792,7 @@ unlock:
 skip:
 		page_cache_release(page);
 next:
-		if (page->index == vmf->max_pgoff)
+		if (iter.index == vmf->max_pgoff)
 			break;
 	}
 	rcu_read_unlock();
-- 
 Kirill A. Shutemov


* Re: [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs
  2014-03-03 11:07       ` Kirill A. Shutemov
@ 2014-03-03 18:49         ` Ning Qu
  0 siblings, 0 replies; 16+ messages in thread
From: Ning Qu @ 2014-03-03 18:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	linux-kernel

Thanks for the updates!
Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Mar 3, 2014 at 3:07 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Ning Qu wrote:
>> Btw, should we first check if page returned by radix_tree_deref_slot is NULL?
>
> Yes, we should. I don't know how I missed that. :(
>
> The patch below should address both issues.
>
> From dca24c9a1f31ee1599fe81e9a60d4f87a4eaf0ea Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 3 Mar 2014 12:07:03 +0200
> Subject: [PATCH] mm: filemap_map_pages() avoid dereference NULL/exception
>  slots
>
> radix_tree_deref_slot() can return NULL: add the missing check.
>
> Do not dereference 'page': we can get here as a result of the
> radix_tree_exception(page) check.
>
> Reported-by: Hugh Dickins <hughd@google.com>
> Reported-by: Ning Qu <quning@google.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/filemap.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 5f4fe7f0c258..e48624634927 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1745,6 +1745,8 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
>                         break;
>  repeat:
>                 page = radix_tree_deref_slot(slot);
> +               if (unlikely(!page))
> +                       goto next;
>                 if (radix_tree_exception(page)) {
>                         if (radix_tree_deref_retry(page))
>                                 break;
> @@ -1790,7 +1792,7 @@ unlock:
>  skip:
>                 page_cache_release(page);
>  next:
> -               if (page->index == vmf->max_pgoff)
> +               if (iter.index == vmf->max_pgoff)
>                         break;
>         }
>         rcu_read_unlock();
> --
>  Kirill A. Shutemov
>


* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
       [not found]         ` <20140303143834.90ebe8ec5c6a369e54a599ec@linux-foundation.org>
@ 2014-03-03 23:07           ` Ning Qu
  2014-03-03 23:29           ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Ning Qu @ 2014-03-03 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Hugh Dickins, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	linux-kernel

Actually this one is just using the generic functions provided in
Kirill's patch, and Kirill just sent another email with the two fixes
reported by Hugh and me, so nothing changes in this patch itself.

I will update the log and work on the experiment results. I have
already asked Kirill for his command line, so I can make an
apples-to-apples comparison; I will update the patch with new results
soon.

Sorry about the mess-up previously. I should have asked Kirill about
the test in the first place.


Best wishes,
-- 
Ning Qu


On Mon, Mar 3, 2014 at 2:38 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 28 Feb 2014 22:27:04 -0800 Ning Qu <quning@gmail.com> wrote:
>
>> On Fri, Feb 28, 2014 at 10:10 PM, Ning Qu <quning@gmail.com> wrote:
>> > Yes, I am using the iozone -i 0 -i 1. Let me try the most simple test
>> > as you mentioned.
>> > Best wishes,
>> > --
>> > Ning Qu
>> >
>> >
>> > On Fri, Feb 28, 2014 at 5:41 PM, Andrew Morton
>> > <akpm@linux-foundation.org> wrote:
>> >> On Fri, 28 Feb 2014 16:35:16 -0800 Ning Qu <quning@gmail.com> wrote:
>> >>
>> >>
>> >> int main(int argc, char *argv[])
>> >> {
>> >>         char *p;
>> >>         int fd;
>> >>         unsigned long idx;
>> >>         int sum = 0;
>> >>
>> >>         fd = open("foo", O_RDONLY);
>> >>         if (fd < 0) {
>> >>                 perror("open");
>> >>                 exit(1);
>> >>         }
>> >>         p = mmap(NULL, 1 * G, PROT_READ, MAP_PRIVATE, fd, 0);
>> >>         if (p == MAP_FAILED) {
>> >>                 perror("mmap");
>> >>                 exit(1);
>> >>         }
>> >>
>> >>         for (idx = 0; idx < 1 * G; idx += 4096)
>> >>                 sum += p[idx];
>> >>         printf("%d\n", sum);
>> >>         exit(0);
>> >> }
>> >>
>> >> z:/home/akpm> /usr/bin/time ./a.out
>> >> 0
>> >> 0.05user 0.33system 0:00.38elapsed 99%CPU (0avgtext+0avgdata 4195856maxresident)k
>> >> 0inputs+0outputs (0major+262264minor)pagefaults 0swaps
>> >>
>> >> z:/home/akpm> dc
>> >> 16o
>> >> 262264 4 * p
>> >> 1001E0
>> >>
>> >> That's close!
>
> OK, I'm repairing your top-posting here.  It makes it unnecessarily
> hard to conduct a conversation - please just don't do it.
>
>> Yes, the simple test does verify that the page fault number are
>> correct with the patch. So my previous results are from those command
>> lines, which also show some performance improvement with this change
>> in tmpfs.
>>
>> sequential access
>> /usr/bin/time -a ./iozone -B s 8g -i 0 -i 1
>>
>> random access
>> /usr/bin/time -a ./iozone -B s 8g -i 0 -i 2
>
> I don't understand your point here.
>
> Running my simple test app with and without Kirill's
> mm-introduce-vm_ops-map_pages and
> mm-implement-map_pages-for-page-cache, minor faults are reduced 16x
> when the file is cached, as expected:
>
> 0.02user 0.22system 0:00.24elapsed 97%CPU (0avgtext+0avgdata 4198080maxresident)k
> 0inputs+0outputs (0major+16433minor)pagefaults 0swaps
>
>
> When the file is uncached, results are peculiar:
>
> 0.00user 2.84system 0:50.90elapsed 5%CPU (0avgtext+0avgdata 4198096maxresident)k
> 0inputs+0outputs (1major+49666minor)pagefaults 0swaps
>
> That's approximately 3x more minor faults.  I thought it might be due
> to the fact that userspace pagefaults and disk IO completions are both
> working in the same order through the same pages, so the pagefaults
> keep stumbling across not-yet-completed pages.  So I attempted to
> complete the pages in reverse order:
>
> --- a/fs/mpage.c~a
> +++ a/fs/mpage.c
> @@ -41,12 +41,16 @@
>   * status of that page is hard.  See end_buffer_async_read() for the details.
>   * There is no point in duplicating all that complexity.
>   */
> +#define bio_for_each_segment_all_reverse(bvl, bio, i)                  \
> +       for (i = 0, bvl = (bio)->bi_io_vec + (bio)->bi_vcnt - 1;        \
> +       i < (bio)->bi_vcnt; i++, bvl--)
> +
>  static void mpage_end_io(struct bio *bio, int err)
>  {
>         struct bio_vec *bv;
>         int i;
>
> -       bio_for_each_segment_all(bv, bio, i) {
> +       bio_for_each_segment_all_reverse(bv, bio, i) {
>                 struct page *page = bv->bv_page;
>
>                 if (bio_data_dir(bio) == READ) {
>
> But that made no difference.  Maybe I got the wrong BIO completion
> routine, but I don't think so (it's ext3).  Probably my theory is
> wrong.
>
> Anyway, could you please resend your patch with Hugh's fix and with a
> more carefully written and more accurate changelog?


* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
       [not found]         ` <20140303143834.90ebe8ec5c6a369e54a599ec@linux-foundation.org>
  2014-03-03 23:07           ` Ning Qu
@ 2014-03-03 23:29           ` Linus Torvalds
       [not found]             ` <20140303153707.beced5c271179d1b1658a246@linux-foundation.org>
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2014-03-03 23:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ning Qu, Mel Gorman, Rik van Riel, Kirill A. Shutemov,
	Hugh Dickins, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Mon, Mar 3, 2014 at 2:38 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> When the file is uncached, results are peculiar:
>
> 0.00user 2.84system 0:50.90elapsed 5%CPU (0avgtext+0avgdata 4198096maxresident)k
> 0inputs+0outputs (1major+49666minor)pagefaults 0swaps
>
> That's approximately 3x more minor faults.

This is not peculiar.

When the file is uncached, some pages will obviously be under IO due
to readahead etc. And the fault-around code very much on purpose will
*not* try to wait for those pages, so any busy pages will just simply
not be faulted-around.

So you should still have fewer minor faults than faulting on *every*
page (ie the non-fault-around case), but I would very much expect that
fault-around will not see the full "one sixteenth" reduction in minor
faults.

And the order of IO will not matter, since the read-ahead is
asynchronous wrt the page-faults.

             Linus


* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
       [not found]             ` <20140303153707.beced5c271179d1b1658a246@linux-foundation.org>
@ 2014-03-04  0:50               ` Kirill A. Shutemov
  0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2014-03-04  0:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Ning Qu, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Hugh Dickins, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, linux-mm,
	linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 03:37:07PM -0800, Andrew Morton wrote:
> On Mon, 3 Mar 2014 15:29:00 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Mon, Mar 3, 2014 at 2:38 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > When the file is uncached, results are peculiar:
> > >
> > > 0.00user 2.84system 0:50.90elapsed 5%CPU (0avgtext+0avgdata 4198096maxresident)k
> > > 0inputs+0outputs (1major+49666minor)pagefaults 0swaps
> > >
> > > That's approximately 3x more minor faults.
> > 
> > This is not peculiar.
> > 
> > When the file is uncached, some pages will obviously be under IO due
> > to readahead etc. And the fault-around code very much on purpose will
> > *not* try to wait for those pages, so any busy pages will just simply
> > not be faulted-around.
> 
> Of course.
> 
> > So you should still have fewer minor faults than faulting on *every*
> > page (ie the non-fault-around case), but I would very much expect that
> > fault-around will not see the full "one sixteenth" reduction in minor
> > faults.
> > 
> > And the order of IO will not matter, since the read-ahead is
> > asynchronous wrt the page-faults.
> 
> When a pagefault hits a locked, not-uptodate page it is going to block.
> Once it wakes up we'd *like* to find lots of now-uptodate pages in
> that page's vicinity.  Obviously, that is happening, but not to the
> fullest possible extent.  We _could_ still achieve the 16x if readahead
> was cooperating in an ideal fashion.
> 
> I don't know what's going on in there to produce this consistent 3x
> factor.

In my VM numbers are different (fault in 1G):

cold cache: 2097352inputs+0outputs (2major+25048minor)pagefaults 0swaps
hot cache: 0inputs+0outputs (0major+16450minor)pagefaults 0swaps

~1.5x more page faults with cold cache compared to hot cache.

BTW, moving do_fault_around() below __do_fault() doesn't make it much better:

cold cache: 2097200inputs+0outputs (1major+24641minor)pagefaults 0swaps

-- 
 Kirill A. Shutemov


* Re: [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs
  2014-03-01  1:20   ` Hugh Dickins
  2014-03-01  6:36     ` Ning Qu
@ 2014-03-04 20:02     ` Hugh Dickins
  1 sibling, 0 replies; 16+ messages in thread
From: Hugh Dickins @ 2014-03-04 20:02 UTC (permalink / raw)
  To: Ning Qu
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel

On Fri, 28 Feb 2014, Hugh Dickins wrote:
> On Fri, 28 Feb 2014, Ning Qu wrote:
> 
> > In shmem/tmpfs, we also use the generic filemap_map_pages;
> > it seems the additional checking is not worth a separate
> > version of map_pages for it.
> > 
> > Signed-off-by: Ning Qu <quning@google.com>

Acked-by: Hugh Dickins <hughd@google.com>

> > ---
> >  mm/shmem.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 1f18c9d..2ea4e89 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2783,6 +2783,7 @@ static const struct super_operations shmem_ops = {
> >  
> >  static const struct vm_operations_struct shmem_vm_ops = {
> >  	.fault		= shmem_fault,
> > +	.map_pages	= filemap_map_pages,
> >  #ifdef CONFIG_NUMA
> >  	.set_policy     = shmem_set_policy,
> >  	.get_policy     = shmem_get_policy,
> > -- 
> 
> (There's no need for a 0/1, all the info should go into the one patch.)
> 
> I expect this will prove to be a very sensible and adequate patch,
> thank you: it probably wouldn't be worth more effort to give shmem
> anything special of its own, and filemap_map_pages() is already
> (almost) coping with exceptional entries.

I haven't looked at performance at all: I'm just glad that you and
Kirill have it working correctly as on other filesystems, without
any need for special treatment in filemap_map_pages() - thanks.

> 
> But I can't Ack it until I've tested it some more, won't be able to
> do so until Sunday; and even then some doubt, since this and Kirill's
> are built upon mmotm/next, which after a while gives me spinlock
> lockups under load these days, yet to be investigated.

Other test machines didn't give me those lockups at the weekend, and
overnight it looks like yesterday's mmotm has fixed the freezes on my
laptop (PeterZ's "sched: Guarantee task priority in pick_next_task()"
probably fixed them, but it's old history now, so I've not verified).

Hugh


* Re: [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache
       [not found] ` <CACz4_2eYUOkHdOtBJGDGMMwBcQkyPs8BDXQ491Ab_ig4z8q5mQ@mail.gmail.com>
@ 2014-03-13 20:46   ` Ning Qu
  0 siblings, 0 replies; 16+ messages in thread
From: Ning Qu @ 2014-03-13 20:46 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Kirill A. Shutemov
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Ning Qu

Hi, Andrew,

I have updated the results for tmpfs last week, and Hugh already
acked this patchset on Mar 4 in "[PATCH 1/1] mm: implement
->map_pages for shmem/tmpfs":

Acked-by: Hugh Dickins <hughd@google.com>

Please consider applying this patch so that tmpfs gets the same
feature as the other file systems. Thanks a lot!

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


end of thread, other threads:[~2014-03-13 20:51 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-28 22:18 [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache Ning Qu
2014-02-28 22:18 ` [PATCH 1/1] mm: implement ->map_pages for shmem/tmpfs Ning Qu
2014-03-01  1:20   ` Hugh Dickins
2014-03-01  6:36     ` Ning Qu
2014-03-03 11:07       ` Kirill A. Shutemov
2014-03-03 18:49         ` Ning Qu
2014-03-04 20:02     ` Hugh Dickins
2014-02-28 22:34 ` [PATCH 0/1] mm, shmem: map few pages around fault address if they are in page cache Andrew Morton
2014-03-01  0:35 ` Ning Qu
2014-03-01  1:41   ` Andrew Morton
2014-03-01  6:10     ` Ning Qu
2014-03-01  6:27       ` Ning Qu
     [not found]         ` <20140303143834.90ebe8ec5c6a369e54a599ec@linux-foundation.org>
2014-03-03 23:07           ` Ning Qu
2014-03-03 23:29           ` Linus Torvalds
     [not found]             ` <20140303153707.beced5c271179d1b1658a246@linux-foundation.org>
2014-03-04  0:50               ` Kirill A. Shutemov
     [not found] ` <CACz4_2eYUOkHdOtBJGDGMMwBcQkyPs8BDXQ491Ab_ig4z8q5mQ@mail.gmail.com>
2014-03-13 20:46   ` Ning Qu
