* [PATCH] mm/readahead.c: fix regression caused by small readahead limit
From: Roman Gushchin @ 2015-08-20 16:19 UTC
  To: linux-mm
  Cc: linux-kernel, Roman Gushchin, Linus Torvalds, Raghavendra K T,
	Jan Kara, Wu Fengguang, David Rientjes, Andrew Morton

This effectively reverts commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac
("mm/readahead.c: fix readahead failure for memoryless NUMA nodes and
limit readahead pages").

That commit causes a significant I/O performance regression on large
RAID arrays. Capping the maximal readahead size at 2 MB is not suitable
for this case, whereas the previous logic (based on the amount of free
memory on the current NUMA node) worked much better.

To avoid a regression on memoryless NUMA nodes, we still fall back to
the MAX_READAHEAD constant if the current node has no (or too little)
free memory.

before:

$ dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 12.6441 s, 829 MB/s

$ dd if=/dev/md2 of=/dev/null bs=100M count=100 iflag=direct
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 9.49377 s, 1.1 GB/s

after:

$ dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 9.18119 s, 1.1 GB/s

$ dd if=/dev/md2 of=/dev/null bs=100M count=100 iflag=direct
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 9.34751 s, 1.1 GB/s

(The device is an 8-disk RAID 5 array with a 1024 KB chunk size.)

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/readahead.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 60cd846..93a00b3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -239,7 +239,12 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, MAX_READAHEAD);
+	unsigned long max_sane;
+
+	max_sane = max(MAX_READAHEAD,
+		       (node_page_state(numa_node_id(), NR_INACTIVE_FILE) +
+			node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+	return min(nr, max_sane);
 }
 
 /*
-- 
2.4.3


* Re: [PATCH] mm/readahead.c: fix regression caused by small readahead limit
From: Linus Torvalds @ 2015-08-20 19:23 UTC
  To: Roman Gushchin
  Cc: linux-mm, Linux Kernel Mailing List, Raghavendra K T, Jan Kara,
	Wu Fengguang, David Rientjes, Andrew Morton

On Thu, Aug 20, 2015 at 9:19 AM, Roman Gushchin <klamm@yandex-team.ru> wrote:
> +       max_sane = max(MAX_READAHEAD,
> +                      (node_page_state(numa_node_id(), NR_INACTIVE_FILE) +
> +                       node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);

No, we're not re-introducing the idiotic and broken per-node logic.
There was a reason it was killed.

There have been other patches suggested that actually use _valid_
heuristics, this is not one of them.

                  Linus

* [PATCH] mm: use only per-device readahead limit
From: Roman Gushchin @ 2015-08-21 17:12 UTC
  To: linux-mm, Linus Torvalds
  Cc: linux-kernel, Roman Gushchin, Raghavendra K T, Jan Kara,
	Wu Fengguang, David Rientjes, Andrew Morton

The maximal readahead size is currently limited by two values:
1) a global 2 MB constant (MAX_READAHEAD in max_sane_readahead());
2) a configurable per-device value (bdi->ra_pages).

There are devices that require a custom readahead limit. For RAID
arrays, for instance, it is calculated as the number of devices
multiplied by the chunk size, times two.

The readahead size can never be larger than bdi->ra_pages * 2
(POSIX_FADV_SEQUENTIAL doubles the readahead size).
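
For reference, that doubling comes from the POSIX_FADV_SEQUENTIAL case
in mm/fadvise.c, which looks roughly like this (a sketch, not the
verbatim source):

	case POSIX_FADV_SEQUENTIAL:
		/* double the per-device default and clear FMODE_RANDOM */
		file->f_ra.ra_pages = bdi->ra_pages * 2;
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;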

If so, why do we need two limits?
I suggest removing the max_sane_readahead() machinery completely and
using the per-device readahead limit everywhere.

Also, using the right readahead size for RAID disks can significantly
increase I/O performance:

before:
dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s

after:
$ dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s

(It's an 8-disk RAID 5 array.)

This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
behavior introduced by commit
6d2be915e589b58cb11418cbe1f22ff90732b6ac ("mm/readahead.c: fix
readahead failure for memoryless NUMA nodes and limit readahead pages").

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm.h |  2 --
 mm/filemap.c       |  8 +++-----
 mm/readahead.c     | 14 ++------------
 3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..a62abdd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1942,8 +1942,6 @@ void page_cache_async_readahead(struct address_space *mapping,
 				pgoff_t offset,
 				unsigned long size);
 
-unsigned long max_sane_readahead(unsigned long nr);
-
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1283fc8..0e1ebef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1807,7 +1807,6 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 				   struct file *file,
 				   pgoff_t offset)
 {
-	unsigned long ra_pages;
 	struct address_space *mapping = file->f_mapping;
 
 	/* If we don't want any read-ahead, don't bother */
@@ -1836,10 +1835,9 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
-	ra->start = max_t(long, 0, offset - ra_pages / 2);
-	ra->size = ra_pages;
-	ra->async_size = ra_pages / 4;
+	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
+	ra->size = ra->ra_pages;
+	ra->async_size = ra->ra_pages / 4;
 	ra_submit(ra, mapping, file);
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 60cd846..b4937a6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -213,7 +213,7 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
-	nr_to_read = max_sane_readahead(nr_to_read);
+	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
 	while (nr_to_read) {
 		int err;
 
@@ -232,16 +232,6 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	return 0;
 }
 
-#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
-/*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
- * sensible upper limit.
- */
-unsigned long max_sane_readahead(unsigned long nr)
-{
-	return min(nr, MAX_READAHEAD);
-}
-
 /*
  * Set the initial window size, round to next power of 2 and square
  * for small size, x 4 for medium, and x 2 for large
@@ -380,7 +370,7 @@ ondemand_readahead(struct address_space *mapping,
 		   bool hit_readahead_marker, pgoff_t offset,
 		   unsigned long req_size)
 {
-	unsigned long max = max_sane_readahead(ra->ra_pages);
+	unsigned long max = ra->ra_pages;
 	pgoff_t prev_offset;
 
 	/*
-- 
2.4.3


* Re: [PATCH] mm: use only per-device readahead limit
From: Linus Torvalds @ 2015-08-21 18:17 UTC
  To: Roman Gushchin
  Cc: linux-mm, Linux Kernel Mailing List, Raghavendra K T, Jan Kara,
	Wu Fengguang, David Rientjes, Andrew Morton

On Fri, Aug 21, 2015 at 10:12 AM, Roman Gushchin <klamm@yandex-team.ru> wrote:
>
> There are devices, which require custom readahead limit.
> For instance, for RAIDs it's calculated as number of devices
> multiplied by chunk size times 2.

So afaik, the default read-ahead size is 128kB, which is actually
smaller than the old 512-page limit.

Which means that you probably changed "ra_pages" somehow. Is it some
system tool that does that automatically, and if so based on what,
exactly?

I'm also slightly worried about the fact that now the max read-ahead
may actually be zero, and/or basically infinite (there's an ioctl to
set it that only tests that it's not negative). Does everything react
OK to that?
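
For reference, the ioctl in question is BLKRASET, handled in
block/ioctl.c roughly as follows (a sketch from memory, not the
verbatim source):

	case BLKRASET:
	case BLKFRASET:
		if (!capable(CAP_SYS_ADMIN))
			return -EACCES;
		bdi = blk_get_backing_dev_info(bdev);
		/* arg is in 512-byte sectors; no upper bound is applied */
		bdi->ra_pages = (arg * 512) / PAGE_CACHE_SIZE;
		return 0;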

               Linus

* Re: [PATCH] mm: use only per-device readahead limit
From: Roman Gushchin @ 2015-08-21 20:21 UTC
  To: Linus Torvalds
  Cc: linux-mm, Linux Kernel Mailing List, Raghavendra K T, Jan Kara,
	Wu Fengguang, David Rientjes, Andrew Morton

21.08.2015, 21:17, "Linus Torvalds" <torvalds@linux-foundation.org>:
> On Fri, Aug 21, 2015 at 10:12 AM, Roman Gushchin <klamm@yandex-team.ru> wrote:
>>  There are devices, which require custom readahead limit.
>>  For instance, for RAIDs it's calculated as number of devices
>>  multiplied by chunk size times 2.
>
> So afaik, the default read-ahead size is 128kB, which is actually
> smaller than the old 512-page limit.
>
> Which means that you probably changed "ra_pages" somehow. Is it some
> system tool that does that automatically, and if so based on what,
> exactly?

It's just the raid driver. For instance, drivers/md/raid5.c:6898.
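
That code sizes the read-ahead window to cover two whole stripes.
Roughly (a sketch, not the verbatim source):

	/* read-ahead must cover two whole stripes */
	int data_disks = conf->raid_disks - conf->max_degraded;
	int stripe = data_disks * ((mddev->chunk_sectors << 9) / PAGE_SIZE);

	if (mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
		mddev->queue->backing_dev_info.ra_pages = 2 * stripe;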

On my setup I unexpectedly got even a slight performance increase over
the O_DIRECT case and over the old memory-based readahead limit, as you
can see from the numbers in the commit message (1.2 GB/s vs 1.1 GB/s).

So, I like the idea of delegating the readahead limit calculation to the underlying I/O level.

> I'm also slightly worried about the fact that now the max read-ahead
> may actually be zero, 

For "normal" readahead nothing changes. Only the readahead syscall and
madvise(MADV_WILLNEED) cases are affected.
I think it's OK to do nothing if readahead was deliberately disabled.
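
Both of those paths end up in force_page_cache_readahead(); the
readahead(2) side in mm/readahead.c looks roughly like this (a sketch,
not the verbatim source):

	static ssize_t do_readahead(struct address_space *mapping,
				    struct file *filp,
				    pgoff_t index, unsigned long nr)
	{
		if (!mapping || !mapping->a_ops)
			return -EINVAL;

		return force_page_cache_readahead(mapping, filp, index, nr);
	}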

> and/or basically infinite (there's a ioctl to
> set it that only tests that it's not negative). Does everything react
> ok to that?

It's an open question whether we need to add some checks to avoid
misconfiguration. In any case, we can check the limit when it is set
rather than adjusting it dynamically.

--
Roman

* Re: [PATCH] mm: use only per-device readahead limit
From: Linus Torvalds @ 2015-08-21 20:44 UTC
  To: Roman Gushchin
  Cc: linux-mm, Linux Kernel Mailing List, Raghavendra K T, Jan Kara,
	Wu Fengguang, David Rientjes, Andrew Morton

On Fri, Aug 21, 2015 at 1:21 PM, Roman Gushchin <klamm@yandex-team.ru> wrote:
>
> It's just a raid driver. For instance, drivers/md/raid5.c:6898 .

Ok. That makes me a bit less nervous. I was worried there was some
admin program out there that just ups the readahead on people's
devices, which would mean that ra_pages is some random value chosen by
crazy user space people.

> So, I like an idea to delegate the readahead limit calculation to the underlying i/o level.

Yeah, I'm not against it either. It's just that historically we've had
some issues with people overdoing readahead (because it often helps
some made-up microbenchmark), and then we end up with latency issues
when somebody does a multi-gigabyte readahead... IIRC, we had exactly
that problem with the readahead() system call at some point (long
ago).

But if it's just the default ra_pages, then that should be ok. I think
the kernel defaults are generally sane, and I hope there isn't some
crazy distro that ends up mucking with this.

                  Linus

* [PATCH v2] mm: use only per-device readahead limit
From: Roman Gushchin @ 2015-08-24 11:57 UTC
  To: linux-mm
  Cc: linux-kernel, Roman Gushchin, Linus Torvalds, Raghavendra K T,
	Jan Kara, Wu Fengguang, David Rientjes, Andrew Morton,
	Konstantin Khlebnikov

The maximal readahead size is currently limited by two values:
1) a global 2 MB constant (MAX_READAHEAD in max_sane_readahead());
2) a configurable per-device value (bdi->ra_pages).

There are devices that require a custom readahead limit. For RAID
arrays, for instance, it is calculated as the number of devices
multiplied by the chunk size, times two.

The readahead size can never be larger than bdi->ra_pages * 2
(POSIX_FADV_SEQUENTIAL doubles the readahead size).

If so, why do we need two limits?
I suggest removing the max_sane_readahead() machinery completely and
using the per-device readahead limit everywhere.

Also, using the right readahead size for RAID disks can significantly
increase I/O performance:

before:
dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s

after:
$ dd if=/dev/md2 of=/dev/null bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s

(It's an 8-disk RAID 5 array.)

This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
behavior introduced by commit
6d2be915e589b58cb11418cbe1f22ff90732b6ac ("mm/readahead.c: fix
readahead failure for memoryless NUMA nodes and limit readahead pages").

V2:
Konstantin Khlebnikov noticed that if readahead is completely disabled,
force_page_cache_readahead() will not read anything. This function is
used for sync reads (if the FMODE_RANDOM flag is set). So, to guarantee
read progress, it's necessary to read at least one page.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/mm.h |  2 --
 mm/filemap.c       |  8 +++-----
 mm/readahead.c     | 18 ++++++------------
 3 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..a62abdd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1942,8 +1942,6 @@ void page_cache_async_readahead(struct address_space *mapping,
 				pgoff_t offset,
 				unsigned long size);
 
-unsigned long max_sane_readahead(unsigned long nr);
-
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1283fc8..0e1ebef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1807,7 +1807,6 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 				   struct file *file,
 				   pgoff_t offset)
 {
-	unsigned long ra_pages;
 	struct address_space *mapping = file->f_mapping;
 
 	/* If we don't want any read-ahead, don't bother */
@@ -1836,10 +1835,9 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
-	ra->start = max_t(long, 0, offset - ra_pages / 2);
-	ra->size = ra_pages;
-	ra->async_size = ra_pages / 4;
+	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
+	ra->size = ra->ra_pages;
+	ra->async_size = ra->ra_pages / 4;
 	ra_submit(ra, mapping, file);
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 60cd846..7eb844c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -213,7 +213,11 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
-	nr_to_read = max_sane_readahead(nr_to_read);
+	/*
+	 * Read at least 1 page, even if readahead is completely disabled.
+	 */
+	nr_to_read = min(nr_to_read, max(inode_to_bdi(mapping->host)->ra_pages,
+					 1ul));
 	while (nr_to_read) {
 		int err;
 
@@ -232,16 +236,6 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	return 0;
 }
 
-#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
-/*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
- * sensible upper limit.
- */
-unsigned long max_sane_readahead(unsigned long nr)
-{
-	return min(nr, MAX_READAHEAD);
-}
-
 /*
  * Set the initial window size, round to next power of 2 and square
  * for small size, x 4 for medium, and x 2 for large
@@ -380,7 +374,7 @@ ondemand_readahead(struct address_space *mapping,
 		   bool hit_readahead_marker, pgoff_t offset,
 		   unsigned long req_size)
 {
-	unsigned long max = max_sane_readahead(ra->ra_pages);
+	unsigned long max = ra->ra_pages;
 	pgoff_t prev_offset;
 
 	/*
-- 
2.4.3


* Re: [PATCH v2] mm: use only per-device readahead limit
From: Konstantin Khlebnikov @ 2015-08-24 12:41 UTC
  To: Roman Gushchin, linux-mm
  Cc: linux-kernel, Linus Torvalds, Raghavendra K T, Jan Kara,
	Wu Fengguang, David Rientjes, Andrew Morton

On 24.08.2015 14:57, Roman Gushchin wrote:
> The maximal readahead size is currently limited by two values:
> 1) a global 2 MB constant (MAX_READAHEAD in max_sane_readahead());
> 2) a configurable per-device value (bdi->ra_pages).
>
> There are devices that require a custom readahead limit. For RAID
> arrays, for instance, it is calculated as the number of devices
> multiplied by the chunk size, times two.
>
> The readahead size can never be larger than bdi->ra_pages * 2
> (POSIX_FADV_SEQUENTIAL doubles the readahead size).
>
> If so, why do we need two limits?
> I suggest removing the max_sane_readahead() machinery completely and
> using the per-device readahead limit everywhere.
>
> Also, using the right readahead size for RAID disks can significantly
> increase I/O performance:
>
> before:
> dd if=/dev/md2 of=/dev/null bs=100M count=100
> 100+0 records in
> 100+0 records out
> 10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
>
> after:
> $ dd if=/dev/md2 of=/dev/null bs=100M count=100
> 100+0 records in
> 100+0 records out
> 10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
>
> (It's an 8-disk RAID 5 array.)
>
> This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
> behavior introduced by commit
> 6d2be915e589b58cb11418cbe1f22ff90732b6ac ("mm/readahead.c: fix
> readahead failure for memoryless NUMA nodes and limit readahead pages").
>
> V2:
> Konstantin Khlebnikov noticed that if readahead is completely
> disabled, force_page_cache_readahead() will not read anything. This
> function is used for sync reads (if the FMODE_RANDOM flag is set).
> So, to guarantee read progress, it's necessary to read at least one
> page.

On second thought: this isn't important. V1 is fine.

page_cache_sync_readahead() checks "if (!ra->ra_pages)" first and
never calls force_page_cache_readahead() if readahead is disabled.
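
For reference, that check looks roughly like this in mm/readahead.c
(a sketch, not the verbatim source):

	void page_cache_sync_readahead(struct address_space *mapping,
				       struct file_ra_state *ra,
				       struct file *filp, pgoff_t offset,
				       unsigned long req_size)
	{
		/* no read-ahead */
		if (!ra->ra_pages)
			return;

		/* be dumb */
		if (filp && (filp->f_mode & FMODE_RANDOM)) {
			force_page_cache_readahead(mapping, filp,
						   offset, req_size);
			return;
		}

		/* do read-ahead */
		ondemand_readahead(mapping, ra, filp, false,
				   offset, req_size);
	}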

Anyway, this function doesn't return references to pages. All users
must be ready to handle non-present or non-uptodate pages. But this
probably never happened before, so all call sites should be reviewed:
for example, splice always looks pages up again after ->readpage()
(I guess the page can be truncated there), while some other users keep
using the same page reference.

>
> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>   include/linux/mm.h |  2 --
>   mm/filemap.c       |  8 +++-----
>   mm/readahead.c     | 18 ++++++------------
>   3 files changed, 9 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2e872f9..a62abdd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1942,8 +1942,6 @@ void page_cache_async_readahead(struct address_space *mapping,
>   				pgoff_t offset,
>   				unsigned long size);
>
> -unsigned long max_sane_readahead(unsigned long nr);
> -
>   /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
>   extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1283fc8..0e1ebef 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1807,7 +1807,6 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
>   				   struct file *file,
>   				   pgoff_t offset)
>   {
> -	unsigned long ra_pages;
>   	struct address_space *mapping = file->f_mapping;
>
>   	/* If we don't want any read-ahead, don't bother */
> @@ -1836,10 +1835,9 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
>   	/*
>   	 * mmap read-around
>   	 */
> -	ra_pages = max_sane_readahead(ra->ra_pages);
> -	ra->start = max_t(long, 0, offset - ra_pages / 2);
> -	ra->size = ra_pages;
> -	ra->async_size = ra_pages / 4;
> +	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
> +	ra->size = ra->ra_pages;
> +	ra->async_size = ra->ra_pages / 4;
>   	ra_submit(ra, mapping, file);
>   }
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 60cd846..7eb844c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -213,7 +213,11 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>   	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
>   		return -EINVAL;
>
> -	nr_to_read = max_sane_readahead(nr_to_read);
> +	/*
> +	 * Read at least 1 page, even if readahead is completely disabled.
> +	 */
> +	nr_to_read = min(nr_to_read, max(inode_to_bdi(mapping->host)->ra_pages,
> +					 1ul));
>   	while (nr_to_read) {
>   		int err;
>
> @@ -232,16 +236,6 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>   	return 0;
>   }
>
> -#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
> -/*
> - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> - * sensible upper limit.
> - */
> -unsigned long max_sane_readahead(unsigned long nr)
> -{
> -	return min(nr, MAX_READAHEAD);
> -}
> -
>   /*
>    * Set the initial window size, round to next power of 2 and square
>    * for small size, x 4 for medium, and x 2 for large
> @@ -380,7 +374,7 @@ ondemand_readahead(struct address_space *mapping,
>   		   bool hit_readahead_marker, pgoff_t offset,
>   		   unsigned long req_size)
>   {
> -	unsigned long max = max_sane_readahead(ra->ra_pages);
> +	unsigned long max = ra->ra_pages;
>   	pgoff_t prev_offset;
>
>   	/*
>


-- 
Konstantin
