* [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
@ 2010-10-28 19:15 ` Mandeep Singh Baines
  0 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-10-28 19:15 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner
  Cc: linux-kernel, linux-mm, wad, olofj, hughd

On ChromiumOS, we do not use swap. When memory is low, the only way to
free memory is to reclaim pages from the file list. This results in a
lot of thrashing under low memory conditions. We see the system become
unresponsive for minutes before it eventually OOMs. We also see very
slow browser tab switching under low memory. Instead of an unresponsive
system, we'd really like the kernel to OOM as soon as it starts to
thrash. If it can't keep the working set in memory, then OOM.
Losing one of many tabs is a better behaviour for the user than an
unresponsive system.

This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
of file-backed pages when there are less than min_filelist_kbytes worth
of such pages in the cache. This tunable is handy for low memory systems
using solid-state storage where interactive response is more important
than not OOMing.
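
For anyone trying this out: on a kernel carrying this patch the knob shows up
under /proc/sys/vm, so it can be set persistently with an ordinary sysctl
fragment (the value below is just the setup used in the test runs here, not a
recommendation):

```
# Only meaningful on a kernel with this patch applied.
vm.min_filelist_kbytes = 50000
```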

With this patch and min_filelist_kbytes set to 50000, I see very little
block layer activity during low memory. The system stays responsive under
low memory and browser tab switching is fast. Eventually, a process gets
killed by OOM. Without this patch, the system gets wedged for minutes
before it eventually OOMs. Below is the vmstat output from my test runs.

BEFORE (notice the high bi and wa, also how long it takes to OOM):

$ vmstat -a 5 1000
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa
 6  2      0  10212 350464 352276    0    0   780     1 3227 4348 78 11  1 10
 1  2      0   8852 351168 353216    0    0  3154     0 3424 3424 65 16  6 14
 2  1      0  14788 348844 349044    0    0  1620     2 2925 3336 74 10  3 13
 4  1      0  16756 346264 349004    0    0   372     0 2923 2977 76  8  1 15
 1  3      0   8432 357596 347136    0    0  5346     1 3633 4599 57 20  4 19
 1  2      0  10704 350856 351720    0    0  3003     1 3635 3921 57 15  7 20
 2  5      0   8048 352160 352660    0    0  6995     0 4033 4872 47 25  4 24

* unresponsive

 1  6      0   8120 351928 352884    0    0 13402     0 4767 4663 36 37  2 25
 1 14      0   8540 351700 352672    0    0 23932     3 4352 3188 10 54  0 36
 0  6      0   8276 351860 353004    0    0 24741     2 4286 3076 10 55  1 34
 0 18      0   8012 352012 352836    0    0 26684     0 4441 2995  9 54  0 36
 0 27      0   8384 351600 352992    0    0 27056     1 4688 2994  3 54  0 43
 0 20      0   8292 351696 353008    0    0 27410     5 4568 2957  2 55  0 42
 1 16      0   8180 351728 352984    0    0 27199     0 4409 2789  1 56  0 43
 3 14      0   7928 351524 353072    0    0 28060     0 4563 3426  1 57  0 42
 0 21      0   8140 351572 353100    0    0 29664     0 5074 5127  1 59  0 39
 0 21      0   7960 351504 352656    0    0 31719     1 4769 4917  0 64  0 36

* OOM

 1 26      0  99864 351424 261060    0    0 27382     0 5229 6085  1 59  0 40
 0  1      0  58124 388300 266644    0    0  8688     0 3413 5204 35 26 11 29
 0  1      0  69796 369644 273136    0    0   201    11 2266 1622 32  3 29 36
 1  1      0  74560 360908 276976    0    0     0     0 1916 1650 24  3 33 40

AFTER (Notice almost no bi or wa and quick OOM):

$ vmstat -a 5 1000
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa
 0  0      0  35892 387588 289644    0    0     0     0 3616 3983 50 11 40  0
 5  0      0  40108 375328 297996    0    0   291     0 3657 4534 52 12 36  0
 2  0      0  58676 369724 284320    0    0   193     1 2677 3265 54  7 39  0
 3  0      0  61188 366028 285492    0    0     0     0 2639 2756 35  5 60  0
 0  0      0  58716 367132 286996    0    0    13     0 3044 4233 34  7 59  0
 5  0      0  43080 379872 289924    0    0     0     1 3475 4244 62 12 27  0
 0  0      0  42580 372940 297684    0    0   485     0 2794 3253 76 10 13  1
 2  0      0  42160 370292 300864    0    0   202     0 3074 4365 61  9 29  1
 6  0      0  44116 370716 298100    0    0    75     0 3062 5257 75 10 15  0
 3  0      0  30228 383652 298696    0    0     0     1 3244 4858 76 11 12  0
 4  0      0  26752 384272 301844    0    0    18     0 2892 4634 83 10  7  0
 3  0      0  19348 386540 307252    0    0   333     0 2876 3932 84  9  7  0
 1  0      0  30864 378408 304440    0    0   198     2 3024 4167 79  9 12  0
 6  0      0  28540 379684 304848    0    0    14     0 2925 4746 79 11 10  0
 6  2      0  14216 379312 320088    0    0   289     2 3561 3764 77 10  4 10

* OOM

 0  0      0  83880 352600 276612    0    0   853     0 3947 4777 45 13 38  4
 2  1      0  85016 355900 272980    0    0   787     1 3480 4787 71 14 13  2
 1  0      0  67496 358288 286760    0    0   689     0 3211 4056 72 12 15  2
 2  0      0  66504 356896 289528    0    0     0     6 2848 3268 51  6 43  0
 1  0      0  58444 357780 296760    0    0     2     0 2938 3956 39  7 53  0
 2  0      0  58196 356680 297860    0    0     5     0 2606 3204 34  6 60  0

Change-Id: I17d4521a35e2648dda9db5c85aba5334a2d12f50
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
---
 include/linux/mm.h |    2 ++
 kernel/sysctl.c    |   10 ++++++++++
 mm/vmscan.c        |   25 +++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74949fb..40ececc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,8 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+extern int min_filelist_kbytes;
+
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3a45c22..59f898a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1320,6 +1320,16 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "min_filelist_kbytes",
+		.data		= &min_filelist_kbytes,
+		.maxlen		= sizeof(min_filelist_kbytes),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..9c27d9a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -130,6 +130,11 @@ struct scan_control {
 int vm_swappiness = 60;
 long vm_total_pages;	/* The total number of pages which the VM controls */
 
+/*
+ * Low watermark used to prevent fscache thrashing during low memory.
+ */
+int min_filelist_kbytes = 0;
+
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
@@ -1583,11 +1588,31 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 		return inactive_anon_is_low(zone, sc);
 }
 
+/*
+ * Check low watermark used to prevent fscache thrashing during low memory.
+ */
+static int file_is_low(struct zone *zone, struct scan_control *sc)
+{
+	unsigned long pages_min, active, inactive;
+
+	if (!scanning_global_lru(sc))
+		return false;
+
+	pages_min = min_filelist_kbytes >> (PAGE_SHIFT - 10);
+	active = zone_page_state(zone, NR_ACTIVE_FILE);
+	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+
+	return ((active + inactive) < pages_min);
+}
+
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
+	if (file && file_is_low(zone, sc))
+		return 0;
+
 	if (is_active_lru(lru)) {
 		if (inactive_list_is_low(zone, sc, file))
 		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
-- 
1.7.3.1



* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 19:15 ` Mandeep Singh Baines
@ 2010-10-28 20:10   ` Andrew Morton
  -1 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2010-10-28 20:10 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: KOSAKI Motohiro, Rik van Riel, Mel Gorman, Minchan Kim,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Thu, 28 Oct 2010 12:15:23 -0700
Mandeep Singh Baines <msb@chromium.org> wrote:

> On ChromiumOS, we do not use swap.

Well that's bad.  Why not?

> When memory is low, the only way to
> free memory is to reclaim pages from the file list. This results in a
> lot of thrashing under low memory conditions. We see the system become
> unresponsive for minutes before it eventually OOMs. We also see very
> slow browser tab switching under low memory. Instead of an unresponsive
> system, we'd really like the kernel to OOM as soon as it starts to
> thrash. If it can't keep the working set in memory, then OOM.
> Losing one of many tabs is a better behaviour for the user than an
> unresponsive system.
> 
> This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
> of file-backed pages when there are less than min_filelist_kbytes worth
> of such pages in the cache. This tunable is handy for low memory systems
> using solid-state storage where interactive response is more important
> than not OOMing.
> 
> With this patch and min_filelist_kbytes set to 50000, I see very little
> block layer activity during low memory. The system stays responsive under
> low memory and browser tab switching is fast. Eventually, a process gets
> killed by OOM. Without this patch, the system gets wedged for minutes
> before it eventually OOMs. Below is the vmstat output from my test runs.
> 
> BEFORE (notice the high bi and wa, also how long it takes to OOM):

That's an interesting result.

Having the machine "wedged for minutes" thrashing away paging
executable text is pretty bad behaviour.  I wonder how to fix it. 
Perhaps simply declaring oom at an earlier stage.

Your patch is certainly simple enough but a bit sad.  It says "the VM
gets this wrong, so let's just disable it all".  And thereby reduces the
motivation to fix it for real.

But the patch definitely improves the situation in real-world
situations and there's a case to be made that it should be available at
least as an interim thing until the VM gets fixed for real.  Which
means that the /proc tunable might disappear again (or become a no-op)
some time in the future.




* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 19:15 ` Mandeep Singh Baines
@ 2010-10-28 21:30   ` Rik van Riel
  -1 siblings, 0 replies; 55+ messages in thread
From: Rik van Riel @ 2010-10-28 21:30 UTC (permalink / raw)
  To: 20101025094235.9154.A69D9226
  Cc: Mandeep Singh Baines, Andrew Morton, KOSAKI Motohiro, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

On 10/28/2010 03:15 PM, Mandeep Singh Baines wrote:

> +/*
> + * Check low watermark used to prevent fscache thrashing during low memory.
> + */
> +static int file_is_low(struct zone *zone, struct scan_control *sc)
> +{
> +	unsigned long pages_min, active, inactive;
> +
> +	if (!scanning_global_lru(sc))
> +		return false;
> +
> +	pages_min = min_filelist_kbytes >> (PAGE_SHIFT - 10);
> +	active = zone_page_state(zone, NR_ACTIVE_FILE);
> +	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> +
> +	return ((active + inactive) < pages_min);
> +}

This is problematic.

It is quite possible for a NUMA system to have one zone
legitimately low on page cache (because all the binaries
and libraries got paged in on another NUMA node), without
the system being anywhere near out of memory.

This patch looks like it could cause a false OOM kill
in that scenario.

At the very minimum, you'd have to check that the system
is low on page cache globally, not just locally.

You do point out a real problem though, and it would be
nice to find a generic solution to it...



* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 20:10   ` Andrew Morton
@ 2010-10-28 22:03     ` Mandeep Singh Baines
  -1 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-10-28 22:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

Andrew Morton (akpm@linux-foundation.org) wrote:
> On Thu, 28 Oct 2010 12:15:23 -0700
> Mandeep Singh Baines <msb@chromium.org> wrote:
> 
> > On ChromiumOS, we do not use swap.
> 
> Well that's bad.  Why not?
> 

We're using SSDs. We're still in the "make it work" phase, so we wanted
to avoid swap unless/until we learn how to use it effectively with
an SSD.

You'll want to tune swap differently if you're using an SSD. Not sure
if swappiness is the answer. Maybe a new tunable to control how aggressive
swap is, unless such a thing already exists?

> > When memory is low, the only way to
> > free memory is to reclaim pages from the file list. This results in a
> > lot of thrashing under low memory conditions. We see the system become
> > unresponsive for minutes before it eventually OOMs. We also see very
> > slow browser tab switching under low memory. Instead of an unresponsive
> > system, we'd really like the kernel to OOM as soon as it starts to
> > thrash. If it can't keep the working set in memory, then OOM.
> > Losing one of many tabs is a better behaviour for the user than an
> > unresponsive system.
> > 
> > This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
> > of file-backed pages when there are less than min_filelist_kbytes worth
> > of such pages in the cache. This tunable is handy for low memory systems
> > using solid-state storage where interactive response is more important
> > than not OOMing.
> > 
> > With this patch and min_filelist_kbytes set to 50000, I see very little
> > block layer activity during low memory. The system stays responsive under
> > low memory and browser tab switching is fast. Eventually, a process gets
> > killed by OOM. Without this patch, the system gets wedged for minutes
> > before it eventually OOMs. Below is the vmstat output from my test runs.
> > 
> > BEFORE (notice the high bi and wa, also how long it takes to OOM):
> 
> That's an interesting result.
> 
> Having the machine "wedged for minutes" thrashing away paging
> executable text is pretty bad behaviour.  I wonder how to fix it. 
> Perhaps simply declaring oom at an earlier stage.
> 
> Your patch is certainly simple enough but a bit sad.  It says "the VM
> gets this wrong, so lets just disable it all".  And thereby reduces the
> motivation to fix it for real.
> 

Yeah, I used the RFC label because we're thinking this is just a temporary
bandaid until something better comes along.

Couple of other nits I have with our patch:
* Not really sure what to do for the cgroup case. We do something
  reasonable for now.
* One of my colleagues also brought up the point that we might want to do
  something different if swap was enabled.

> But the patch definitely improves the situation in real-world
> situations and there's a case to be made that it should be available at
> least as an interim thing until the VM gets fixed for real.  Which
> means that the /proc tunable might disappear again (or become a no-op)
> some time in the future.
> 
> 


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
@ 2010-10-28 22:03     ` Mandeep Singh Baines
  0 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-10-28 22:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

Andrew Morton (akpm@linux-foundation.org) wrote:
> On Thu, 28 Oct 2010 12:15:23 -0700
> Mandeep Singh Baines <msb@chromium.org> wrote:
> 
> > On ChromiumOS, we do not use swap.
> 
> Well that's bad.  Why not?
> 

We're using SSDs. We're still in the "make it work" phase so wanted
avoid swap unless/until we learn how to use it effectively with
an SSD.

You'll want to tune swap differently if you're using an SSD. Not sure
if swappiness is the answer. Maybe a new tunable to control how aggressive
swap is unless such a thing already exits?

> > When memory is low, the only way to
> > free memory is to reclaim pages from the file list. This results in a
> > lot of thrashing under low memory conditions. We see the system become
> > unresponsive for minutes before it eventually OOMs. We also see very
> > slow browser tab switching under low memory. Instead of an unresponsive
> > system, we'd really like the kernel to OOM as soon as it starts to
> > thrash. If it can't keep the working set in memory, then OOM.
> > Losing one of many tabs is a better behaviour for the user than an
> > unresponsive system.
> > 
> > This patch create a new sysctl, min_filelist_kbytes, which disables reclaim
> > of file-backed pages when when there are less than min_filelist_bytes worth
> > of such pages in the cache. This tunable is handy for low memory systems
> > using solid-state storage where interactive response is more important
> > than not OOMing.
> > 
> > With this patch and min_filelist_kbytes set to 50000, I see very little
> > block layer activity during low memory. The system stays responsive under
> > low memory and browser tab switching is fast. Eventually, a process a gets
> > killed by OOM. Without this patch, the system gets wedged for minutes
> > before it eventually OOMs. Below is the vmstat output from my test runs.
> > 
> > BEFORE (notice the high bi and wa, also how long it takes to OOM):
> 
> That's an interesting result.
> 
> Having the machine "wedged for minutes" thrashing away paging
> executable text is pretty bad behaviour.  I wonder how to fix it. 
> Perhaps simply declaring oom at an earlier stage.
> 
> Your patch is certainly simple enough but a bit sad.  It says "the VM
> gets this wrong, so lets just disable it all".  And thereby reduces the
> motivation to fix it for real.
> 

Yeah, I used the RFC label because we're thinking this is just a temporary
bandaid until something better comes along.

Couple of other nits I have with our patch:
* Not really sure what to do for the cgroup case. We do something
  reasonable for now.
* One of my colleagues also brought up the point that we might want to do
  something different if swap was enabled.

> But the patch definitely improves the situation in real-world
> situations and there's a case to be made that it should be available at
> least as an interim thing until the VM gets fixed for real.  Which
> means that the /proc tunable might disappear again (or become a no-op)
> some time in the future.
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 21:30   ` Rik van Riel
@ 2010-10-28 22:13     ` Mandeep Singh Baines
  -1 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-10-28 22:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mandeep Singh Baines, Andrew Morton, KOSAKI Motohiro, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

Rik van Riel (riel@redhat.com) wrote:
> On 10/28/2010 03:15 PM, Mandeep Singh Baines wrote:
> 
> >+/*
> >+ * Check low watermark used to prevent fscache thrashing during low memory.
> >+ */
> >+static int file_is_low(struct zone *zone, struct scan_control *sc)
> >+{
> >+	unsigned long pages_min, active, inactive;
> >+
> >+	if (!scanning_global_lru(sc))
> >+		return false;
> >+
> >+	pages_min = min_filelist_kbytes >> (PAGE_SHIFT - 10);
> >+	active = zone_page_state(zone, NR_ACTIVE_FILE);
> >+	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> >+
> >+	return ((active + inactive) < pages_min);
> >+}
> 
> This is problematic.
> 

Yeah, just sending this out as an RFC for now in order to draw attention
to the issue. But the patch does solve our problem really well and would
probably help out for similar applications.

> It is quite possible for a NUMA system to have one zone
> legitimately low on page cache (because all the binaries
> and libraries got paged in on another NUMA node), without
> the system being anywhere near out of memory.
> 
> This patch looks like it could cause a false OOM kill
> in that scenario.
> 
> At the very minimum, you'd have to check that the system
> is low on page cache globally, not just locally.
> 
> You do point out a real problem though, and it would be
> nice to find a generic solution to it...
> 

Here's another patch I was playing with that helped but wasn't quite
as bulletproof or as easy to reason about as min_filelist_kbytes.

---

[PATCH] vmscan: add a configurable inactive_file_ratio

This patch adds a new tuning option which controls how aggressively
the working set is protected. By aggressively protecting the working set,
one sees fewer page faults and better responsiveness on low-memory netbook
systems.

In commit 56e49d21 ("vmscan: evict use-once pages first"), Rik van Riel
added an inactive_file_is_low method which protects the working
set by only scanning the file_active_list when there are more active
pages than inactive. This patch makes the ratio configurable via a
sysctl. The ratio controls how aggressively we protect the working
set and indirectly controls the working set time constant: the period
of time over which we examine what's in the working set.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
---
 include/linux/mm.h |    2 ++
 kernel/sysctl.c    |   12 ++++++++++++
 mm/vmscan.c        |    7 ++++++-
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74949fb..6f6db8e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,8 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+extern int inactive_file_ratio;
+
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3a45c22..1fe3a81 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1320,6 +1320,18 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "inactive_file_ratio",
+		.data		= &inactive_file_ratio,
+		.maxlen		= sizeof(inactive_file_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
+
 
 /*
  * NOTE: do not add new entries to this table unless you have read
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0984dee..cdae972 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -130,6 +130,11 @@ struct scan_control {
 int vm_swappiness = 60;
 long vm_total_pages;	/* The total number of pages which the VM controls */
 
+/*
+ * Only start shrinking active file list when inactive is below this percentage.
+ */
+int inactive_file_ratio = 50;
+
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
@@ -1556,7 +1561,7 @@ static int inactive_file_is_low_global(struct zone *zone)
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
 	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
 
-	return (active > inactive);
+	return ((inactive * 100)/(inactive + active) < inactive_file_ratio);
 }
 
 /**
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 22:03     ` Mandeep Singh Baines
@ 2010-10-28 23:28       ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-10-28 23:28 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Fri, Oct 29, 2010 at 7:03 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
> Andrew Morton (akpm@linux-foundation.org) wrote:
>> On Thu, 28 Oct 2010 12:15:23 -0700
>> Mandeep Singh Baines <msb@chromium.org> wrote:
>>
>> > On ChromiumOS, we do not use swap.
>>
>> Well that's bad.  Why not?
>>
>
> We're using SSDs. We're still in the "make it work" phase so wanted
> to avoid swap unless/until we learn how to use it effectively with
> an SSD.
>
> You'll want to tune swap differently if you're using an SSD. Not sure
> if swappiness is the answer. Maybe a new tunable to control how aggressive
> swap is, unless such a thing already exists?
>
>> > When memory is low, the only way to
>> > free memory is to reclaim pages from the file list. This results in a
>> > lot of thrashing under low memory conditions. We see the system become
>> > unresponsive for minutes before it eventually OOMs. We also see very
>> > slow browser tab switching under low memory. Instead of an unresponsive
>> > system, we'd really like the kernel to OOM as soon as it starts to
>> > thrash. If it can't keep the working set in memory, then OOM.
>> > Losing one of many tabs is a better behaviour for the user than an
>> > unresponsive system.
>> >
>> > This patch create a new sysctl, min_filelist_kbytes, which disables reclaim
>> > of file-backed pages when when there are less than min_filelist_bytes worth
>> > of such pages in the cache. This tunable is handy for low memory systems
>> > using solid-state storage where interactive response is more important
>> > than not OOMing.
>> >
>> > With this patch and min_filelist_kbytes set to 50000, I see very little
>> > block layer activity during low memory. The system stays responsive under
>> > low memory and browser tab switching is fast. Eventually, a process a gets
>> > killed by OOM. Without this patch, the system gets wedged for minutes
>> > before it eventually OOMs. Below is the vmstat output from my test runs.
>> >
>> > BEFORE (notice the high bi and wa, also how long it takes to OOM):
>>
>> That's an interesting result.
>>
>> Having the machine "wedged for minutes" thrashing away paging
>> executable text is pretty bad behaviour.  I wonder how to fix it.
>> Perhaps simply declaring oom at an earlier stage.
>>
>> Your patch is certainly simple enough but a bit sad.  It says "the VM
>> gets this wrong, so lets just disable it all".  And thereby reduces the
>> motivation to fix it for real.
>>
>
> Yeah, I used the RFC label because we're thinking this is just a temporary
> bandaid until something better comes along.
>
> Couple of other nits I have with our patch:
> * Not really sure what to do for the cgroup case. We do something
>  reasonable for now.
> * One of my colleagues also brought up the point that we might want to do
>  something different if swap was enabled.
>
>> But the patch definitely improves the situation in real-world
>> situations and there's a case to be made that it should be available at
>> least as an interim thing until the VM gets fixed for real.  Which
>> means that the /proc tunable might disappear again (or become a no-op)
>> some time in the future.

I think this feature says "system response time must not degrade, but OOM
is allowed".
While we can control which processes are killed by the OOM killer using
oom_score_adj, we can't control response time directly.
But on mobile systems, we have to control response time; one of the
reasons to avoid swap is response time.

How about using memcg?
Isolate the processes related to system response (e.g., the rendering
engine, IPC engine, and so on) into a separate group.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 23:28       ` Minchan Kim
@ 2010-10-28 23:29         ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-10-28 23:29 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Andrew Morton, KOSAKI Motohiro, Rik van Riel, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Fri, Oct 29, 2010 at 8:28 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> I think this feature says "system response time must not degrade, but OOM
> is allowed".
I meant to say: I think we _need_ this feature, i.e. "system response time
must not degrade, but OOM is allowed".

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 23:28       ` Minchan Kim
@ 2010-10-29  0:04         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 55+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-29  0:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mandeep Singh Baines, Andrew Morton, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, Johannes Weiner, linux-kernel,
	linux-mm, wad, olofj, hughd

On Fri, 29 Oct 2010 08:28:23 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Fri, Oct 29, 2010 at 7:03 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
> > Andrew Morton (akpm@linux-foundation.org) wrote:
> >> On Thu, 28 Oct 2010 12:15:23 -0700
> >> Mandeep Singh Baines <msb@chromium.org> wrote:
> >>
> >> > On ChromiumOS, we do not use swap.
> >>
> >> Well that's bad.  Why not?
> >>
> >
> > We're using SSDs. We're still in the "make it work" phase so wanted
> > avoid swap unless/until we learn how to use it effectively with
> > an SSD.
> >
> > You'll want to tune swap differently if you're using an SSD. Not sure
> > if swappiness is the answer. Maybe a new tunable to control how aggressive
> > swap is unless such a thing already exits?
> >
> >> > When memory is low, the only way to
> >> > free memory is to reclaim pages from the file list. This results in a
> >> > lot of thrashing under low memory conditions. We see the system become
> >> > unresponsive for minutes before it eventually OOMs. We also see very
> >> > slow browser tab switching under low memory. Instead of an unresponsive
> >> > system, we'd really like the kernel to OOM as soon as it starts to
> >> > thrash. If it can't keep the working set in memory, then OOM.
> >> > Losing one of many tabs is a better behaviour for the user than an
> >> > unresponsive system.
> >> >
> >> > This patch create a new sysctl, min_filelist_kbytes, which disables reclaim
> >> > of file-backed pages when when there are less than min_filelist_bytes worth
> >> > of such pages in the cache. This tunable is handy for low memory systems
> >> > using solid-state storage where interactive response is more important
> >> > than not OOMing.
> >> >
> >> > With this patch and min_filelist_kbytes set to 50000, I see very little
> >> > block layer activity during low memory. The system stays responsive under
> >> > low memory and browser tab switching is fast. Eventually, a process a gets
> >> > killed by OOM. Without this patch, the system gets wedged for minutes
> >> > before it eventually OOMs. Below is the vmstat output from my test runs.
> >> >
> >> > BEFORE (notice the high bi and wa, also how long it takes to OOM):
> >>
> >> That's an interesting result.
> >>
> >> Having the machine "wedged for minutes" thrashing away paging
> >> executable text is pretty bad behaviour.  I wonder how to fix it.
> >> Perhaps simply declaring oom at an earlier stage.
> >>
> >> Your patch is certainly simple enough but a bit sad.  It says "the VM
> >> gets this wrong, so lets just disable it all".  And thereby reduces the
> >> motivation to fix it for real.
> >>
> >
> > Yeah, I used the RFC label because we're thinking this is just a temporary
> > bandaid until something better comes along.
> >
> > Couple of other nits I have with our patch:
> > * Not really sure what to do for the cgroup case. We do something
> >  reasonable for now.
> > * One of my colleagues also brought up the point that we might want to do
> >  something different if swap was enabled.
> >
> >> But the patch definitely improves the situation in real-world
> >> situations and there's a case to be made that it should be available at
> >> least as an interim thing until the VM gets fixed for real.  Which
> >> means that the /proc tunable might disappear again (or become a no-op)
> >> some time in the future.
> 
> I think this feature that "System response time doesn't allow but OOM allow".
> While we can control process to not killed by OOM using
> /oom_score_adj, we can't control response time directly.
> But in mobile system, we have to control response time. One of cause
> to avoid swap is due to response time.
> 
> How about using memcg?
> Isolate processes related to system response(ex, rendering engine, IPC
> engine and so no)  to another group.
> 
Yes, this seems like an interesting topic for memcg.

maybe configure cgroups as..

/system       ....... limit to X % of the system.
/application  ....... limit to 100-X % of the system.

and put the management software in /system. Then the system software can check
the behavior of applications and measure CPU time and I/O performance in /application.
(And yes, it can watch memory usage.)

Also, the memory cgroup has an oom-notifier, so the system may be able to do
something other than invoke the OOM killer. If this patch is applied to the
global VM, I'll check whether memcg can support it or not.
Hmm... maybe checking the anon/file ratio in /application would be enough?

Or, as a Google engineer proposed, we may have to add a "file-cache-only" memcg.
For example, configure the system as:

/system
/application-anon
/application-file-cache

(But balancing file/anon must then be done by the user... this is difficult.)

BTW, can we compute a "recently paged-out file cache comes back immediately"
score?


Thanks,
-Kame
^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-29  0:04         ` KAMEZAWA Hiroyuki
@ 2010-10-29  0:28           ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-10-29  0:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mandeep Singh Baines, Andrew Morton, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, Johannes Weiner, linux-kernel,
	linux-mm, wad, olofj, hughd

On Fri, Oct 29, 2010 at 9:04 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 29 Oct 2010 08:28:23 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Fri, Oct 29, 2010 at 7:03 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
>> > Andrew Morton (akpm@linux-foundation.org) wrote:
>> >> On Thu, 28 Oct 2010 12:15:23 -0700
>> >> Mandeep Singh Baines <msb@chromium.org> wrote:
>> >>
>> >> > On ChromiumOS, we do not use swap.
>> >>
>> >> Well that's bad.  Why not?
>> >>
>> >
>> > We're using SSDs. We're still in the "make it work" phase so wanted
>> > avoid swap unless/until we learn how to use it effectively with
>> > an SSD.
>> >
>> > You'll want to tune swap differently if you're using an SSD. Not sure
>> > if swappiness is the answer. Maybe a new tunable to control how aggressive
>> > swap is unless such a thing already exits?
>> >
>> >> > When memory is low, the only way to
>> >> > free memory is to reclaim pages from the file list. This results in a
>> >> > lot of thrashing under low memory conditions. We see the system become
>> >> > unresponsive for minutes before it eventually OOMs. We also see very
>> >> > slow browser tab switching under low memory. Instead of an unresponsive
>> >> > system, we'd really like the kernel to OOM as soon as it starts to
>> >> > thrash. If it can't keep the working set in memory, then OOM.
>> >> > Losing one of many tabs is a better behaviour for the user than an
>> >> > unresponsive system.
>> >> >
>> >> > This patch create a new sysctl, min_filelist_kbytes, which disables reclaim
>> >> > of file-backed pages when when there are less than min_filelist_bytes worth
>> >> > of such pages in the cache. This tunable is handy for low memory systems
>> >> > using solid-state storage where interactive response is more important
>> >> > than not OOMing.
>> >> >
>> >> > With this patch and min_filelist_kbytes set to 50000, I see very little
>> >> > block layer activity during low memory. The system stays responsive under
>> >> > low memory and browser tab switching is fast. Eventually, a process a gets
>> >> > killed by OOM. Without this patch, the system gets wedged for minutes
>> >> > before it eventually OOMs. Below is the vmstat output from my test runs.
>> >> >
>> >> > BEFORE (notice the high bi and wa, also how long it takes to OOM):
>> >>
>> >> That's an interesting result.
>> >>
>> >> Having the machine "wedged for minutes" thrashing away paging
>> >> executable text is pretty bad behaviour.  I wonder how to fix it.
>> >> Perhaps simply declaring oom at an earlier stage.
>> >>
>> >> Your patch is certainly simple enough but a bit sad.  It says "the VM
>> >> gets this wrong, so lets just disable it all".  And thereby reduces the
>> >> motivation to fix it for real.
>> >>
>> >
>> > Yeah, I used the RFC label because we're thinking this is just a temporary
>> > bandaid until something better comes along.
>> >
>> > Couple of other nits I have with our patch:
>> > * Not really sure what to do for the cgroup case. We do something
>> >  reasonable for now.
>> > * One of my colleagues also brought up the point that we might want to do
>> >  something different if swap was enabled.
>> >
>> >> But the patch definitely improves the situation in real-world
>> >> situations and there's a case to be made that it should be available at
>> >> least as an interim thing until the VM gets fixed for real.  Which
>> >> means that the /proc tunable might disappear again (or become a no-op)
>> >> some time in the future.
>>
>> I think this feature that "System response time doesn't allow but OOM allow".
>> While we can control process to not killed by OOM using
>> /oom_score_adj, we can't control response time directly.
>> But in mobile system, we have to control response time. One of cause
>> to avoid swap is due to response time.
>>
>> How about using memcg?
>> Isolate processes related to system response(ex, rendering engine, IPC
>> engine and so no)  to another group.
>>
> Yes, this seems interesting topic on memcg.
>
> maybe configure cgroups as..
>
> /system       ....... limit to X % of the system.
> /application  ....... limit to 100-X % of the system.
>
> and put management software to /system. Then, the system software can check
> behavior of the application and measure CPU time and I/O performance in /application.
> (And yes, it can watch memory usage.)
>
> Here, the memory cgroup has an oom-notifier; you may be able to do something other than
> the oom-killer from the system side. If this patch is applied to the global VM, I'll check
> whether memcg can support it or not.
> Hmm... checking the anon/file ratio in /application may be enough?

I think the anon/file/mapped_file stats are enough to do that.

>
> Or, as a Google engineer proposed, we may have to add a "file-cache-only" memcg.
> For example, configure system as
>
> /system
> /application-anon
> /application-file-cache
>
> (But balancing file/anon must be done by the user... this is difficult.)

Yes. I believe such fine-grained control would make system administration more annoying.

>
> BTW, can we get a "recently paged-out file cache comes back immediately!"
> score?

Not easy. If we could get it easily, we could enhance the victim selection algorithm.
AFAIR, Rik tried it.
http://lwn.net/Articles/147879/


-- 
Kind regards,
Minchan Kim


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-10-28 19:15 ` Mandeep Singh Baines
@ 2010-11-01  7:05   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2010-11-01  7:05 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

Hi

> On ChromiumOS, we do not use swap. When memory is low, the only way to
> free memory is to reclaim pages from the file list. This results in a
> lot of thrashing under low memory conditions. We see the system become
> unresponsive for minutes before it eventually OOMs. We also see very
> slow browser tab switching under low memory. Instead of an unresponsive
> system, we'd really like the kernel to OOM as soon as it starts to
> thrash. If it can't keep the working set in memory, then OOM.
> Losing one of many tabs is a better behaviour for the user than an
> unresponsive system.
> 
> This patch create a new sysctl, min_filelist_kbytes, which disables reclaim
> of file-backed pages when when there are less than min_filelist_bytes worth
> of such pages in the cache. This tunable is handy for low memory systems
> using solid-state storage where interactive response is more important
> than not OOMing.
> 
> With this patch and min_filelist_kbytes set to 50000, I see very little
> block layer activity during low memory. The system stays responsive under
> low memory and browser tab switching is fast. Eventually, a process a gets
> killed by OOM. Without this patch, the system gets wedged for minutes
> before it eventually OOMs. Below is the vmstat output from my test runs.

I've heard similar requirements from embedded people; they also don't use swap.
So I don't think this is a hopeless idea, but I'd like to clarify some things
first.

Yes, a system often has file caches that should not be evicted. Typically they
are libc, libX11 and some GUI libraries. Traditionally, we would make a tiny
application that linked the important libraries and called mlockall() at
startup; such a technique prevents reclaim. So, Q1: why do you think the
traditional way is insufficient?

Q2: In the above you used min_filelist_kbytes=50000. How did you decide
on that value? Can other users calculate a proper value?

In addition, I have two requests. R1: I think a Chromium-specific feature is
harder to accept because it's harder to maintain, but we have a good chance to
solve a generic embedded issue. Please discuss it with Minchan and/or other
embedded developers. R2: If you want to deal with OOM, please consider combining
this with the memcg OOM notifier too. It is the most flexible and powerful OOM
mechanism. Desktop and server people probably never use the bare OOM killer
intentionally.

Thanks.





* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-01  7:05   ` KOSAKI Motohiro
@ 2010-11-01 18:24     ` Mandeep Singh Baines
  -1 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-11-01 18:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mandeep Singh Baines, Andrew Morton, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> Hi
> 
> > On ChromiumOS, we do not use swap. When memory is low, the only way to
> > free memory is to reclaim pages from the file list. This results in a
> > lot of thrashing under low memory conditions. We see the system become
> > unresponsive for minutes before it eventually OOMs. We also see very
> > slow browser tab switching under low memory. Instead of an unresponsive
> > system, we'd really like the kernel to OOM as soon as it starts to
> > thrash. If it can't keep the working set in memory, then OOM.
> > Losing one of many tabs is a better behaviour for the user than an
> > unresponsive system.
> > 
> > This patch create a new sysctl, min_filelist_kbytes, which disables reclaim
> > of file-backed pages when when there are less than min_filelist_bytes worth
> > of such pages in the cache. This tunable is handy for low memory systems
> > using solid-state storage where interactive response is more important
> > than not OOMing.
> > 
> > With this patch and min_filelist_kbytes set to 50000, I see very little
> > block layer activity during low memory. The system stays responsive under
> > low memory and browser tab switching is fast. Eventually, a process a gets
> > killed by OOM. Without this patch, the system gets wedged for minutes
> > before it eventually OOMs. Below is the vmstat output from my test runs.
> 
> I've heard similar requirements from embedded people; they also don't use
> swap. So I don't think this is a hopeless idea, but I'd like to clarify some
> things first.
> 

Swap would be interesting if we could somehow control swap thrashing. Maybe
we could add min_anonlist_kbytes. Just kidding :)

> Yes, a system often has file caches that should not be evicted. Typically they
> are libc, libX11 and some GUI libraries. Traditionally, we would make a tiny
> application that linked the important libraries and called mlockall() at
> startup; such a technique prevents reclaim. So, Q1: why do you think the
> traditional way is insufficient?
> 

mlock is too coarse-grained: it requires locking the whole file in memory.
The chrome and X binaries are quite large, so locking them would waste a lot
of memory. We could lock just the pages that are part of the working set, but
that is difficult to do in practice. It's unmaintainable if you do it
statically. If you do it at runtime by mlocking the working set, you're
sort of giving up on the mm's active list.

Like akpm, I'm sad that we need this patch. I'd rather the kernel did a better
job of identifying the working set. We did look at ways to do a better
job of keeping the working set in the active list, but those were trickier
patches and never quite worked out. This patch is simple and works great.

Under memory pressure, I see the active list get smaller and smaller. It's
getting smaller because we're scanning it faster and faster, causing more
and more page faults, which slows forward progress, resulting in the active
list getting smaller still. One way to approach this might be to make the
scan rate constant and configurable. It doesn't seem right that we scan
memory faster and faster under low memory. For us, we'd rather OOM than
evict pages that are likely to be accessed again, so we'd prefer to make
a conservative estimate as to what belongs in the working set. Other
folks (long computations) might want to reclaim more aggressively.

> Q2: In the above you used min_filelist_kbytes=50000. How did you decide
> on that value? Can other users calculate a proper value?
> 

50M was small enough that we were comfortable with keeping 50M of file pages
in memory and large enough that it is bigger than the working set. I tested
by loading up a bunch of popular web sites in chrome and then observing what
happened when I ran out of memory. With 50M, I saw almost no thrashing and
the system stayed responsive even under low memory. But I wanted to be
conservative since I'm really just guessing.

Other users could calculate their value by doing something similar. Load
up the system (exhaust free memory) with a typical load and then observe
file I/O via vmstat. They can then set min_filelist_kbytes to the value
where they see a tolerable amount of thrashing (page faults, block I/O).

> In addition, I have two requests. R1: I think a Chromium-specific feature is
> harder to accept because it's harder to maintain, but we have a good chance
> to solve a generic embedded issue. Please discuss it with Minchan and/or
> other embedded

I think this feature should be useful to a lot of embedded applications where
OOM is OK, especially web-browsing applications where the user is OK with
losing one of many tabs they have open. However, I consider this patch a
stop-gap. I think the real solution is to do a better job of protecting
the active list.

> developers. R2: If you want to deal with OOM, please consider combining this
> with the memcg OOM notifier too. It is the most flexible and powerful OOM
> mechanism. Desktop and server people probably never use the bare OOM killer
> intentionally.
> 

Yes, we will definitely look at the OOM notifier. Currently we're trying to see
if we can get by with oom_adj. With an OOM notifier you'd have to respond
earlier, so you might OOM more. However, with a notifier you might be able to
take action that prevents OOM altogether.

I see memcg more as an isolation mechanism but I guess you could use it to
isolate the working set from anon browser tab data as Kamezawa suggests.

Regards,
Mandeep

> Thanks.
> 
> 
> 


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-01 18:24     ` Mandeep Singh Baines
@ 2010-11-01 18:50       ` Rik van Riel
  -1 siblings, 0 replies; 55+ messages in thread
From: Rik van Riel @ 2010-11-01 18:50 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: KOSAKI Motohiro, Andrew Morton, Mel Gorman, Minchan Kim,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On 11/01/2010 02:24 PM, Mandeep Singh Baines wrote:

> Under memory pressure, I see the active list get smaller and smaller. It's
> getting smaller because we're scanning it faster and faster, causing more
> and more page faults, which slows forward progress, resulting in the active
> list getting smaller still. One way to approach this might be to make the
> scan rate constant and configurable. It doesn't seem right that we scan
> memory faster and faster under low memory. For us, we'd rather OOM than
> evict pages that are likely to be accessed again, so we'd prefer to make
> a conservative estimate as to what belongs in the working set. Other
> folks (long computations) might want to reclaim more aggressively.

Have you actually read the code?

The active file list is only ever scanned when it is larger
than the inactive file list.

>> Q2: In the above you used min_filelist_kbytes=50000. How did you decide
>> on that value? Can other users calculate a proper value?
>>
>
> 50M was small enough that we were comfortable with keeping 50M of file pages
> in memory and large enough that it is bigger than the working set. I tested
> by loading up a bunch of popular web sites in chrome and then observing what
> happened when I ran out of memory. With 50M, I saw almost no thrashing and
> the system stayed responsive even under low memory. But I wanted to be
> conservative since I'm really just guessing.
>
> Other users could calculate their value by doing something similar.

Maybe we can scale this by memory amount?

Say, make sure the total amount of page cache in the system
is at least 2* as much as the sum of all the zone->pages_high
watermarks, and refuse to evict page cache if we have less
than that?

This may need to be tunable for a few special use cases,
like HPC and virtual machine hosting nodes, but it may just
do the right thing for everybody else.

Another alternative could be to really slow down the
reclaiming of page cache once we hit this level, so virt
hosts and HPC nodes can still decrease the page cache to
something really small ... but only if it is not being
used.

Andrew, could a hack like the above be "good enough"?

Anybody - does the above hack inspire you to come up with
an even better idea?

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-01 18:50       ` Rik van Riel
@ 2010-11-01 19:43         ` Mandeep Singh Baines
  -1 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-11-01 19:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Andrew Morton, Mel Gorman, Minchan Kim,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Mon, Nov 1, 2010 at 11:50 AM, Rik van Riel <riel@redhat.com> wrote:
> On 11/01/2010 02:24 PM, Mandeep Singh Baines wrote:
>
>> Under memory pressure, I see the active list get smaller and smaller. It's
>> getting smaller because we're scanning it faster and faster, causing more
>> and more page faults, which slows forward progress, resulting in the active
>> list getting smaller still. One way to approach this might be to make the
>> scan rate constant and configurable. It doesn't seem right that we scan
>> memory faster and faster under low memory. For us, we'd rather OOM than
>> evict pages that are likely to be accessed again so we'd prefer to make
>> a conservative estimate as to what belongs in the working set. Other
>> folks (long computations) might want to reclaim more aggressively.
>
> Have you actually read the code?
>

I have but really just recently. I consider myself an mm newb so take any
conclusion I make with a grain of salt.

> The active file list is only ever scanned when it is larger
> than the inactive file list.
>

Yes, this prevents you from reclaiming the active list all at once. But if the
memory pressure doesn't go away, you'll start to reclaim the active list
little by little. First you'll empty the inactive list, and then you'll start
scanning the active list, moving pages from the active list to the inactive
list. The problem is that there is no minimum time limit on how long a page
will sit in the inactive list before it is reclaimed. It just depends on the
scan rate, which does not depend on time.

In my experiments, I saw the active list get smaller and smaller
over time until eventually it was only a few MB at which point the system came
grinding to a halt due to thrashing.

I played around with making the active/inactive ratio configurable, and
sent a patch out for an inactive_file_ratio tunable. So instead of the
default 50%, you'd make the ratio configurable:

inactive_file_ratio = (inactive * 100) / (inactive + active)

I saw less thrashing at 10%, but this patch wasn't nearly as effective as
min_filelist_kbytes. I can resend the patch if you think it's interesting.

>>> Q2: In the above you used min_filelist_kbytes=50000. How did you decide
>>> on such a value? Can other users calculate a proper value?
>>>
>>
>> 50M was small enough that we were comfortable with keeping 50M of file
>> pages
>> in memory and large enough that it is bigger than the working set. I
>> tested
>> by loading up a bunch of popular web sites in chrome and then observing
>> what
>> happened when I ran out of memory. With 50M, I saw almost no thrashing and
>> the system stayed responsive even under low memory. But I wanted to be
>> conservative since I'm really just guessing.
>>
>> Other users could calculate their value by doing something similar.
>
> Maybe we can scale this by memory amount?
>
> Say, make sure the total amount of page cache in the system
> is at least 2* as much as the sum of all the zone->pages_high
> watermarks, and refuse to evict page cache if we have less
> than that?
>
> This may need to be tunable for a few special use cases,
> like HPC and virtual machine hosting nodes, but it may just
> do the right thing for everybody else.
>
> Another alternative could be to really slow down the
> reclaiming of page cache once we hit this level, so virt
> hosts and HPC nodes can still decrease the page cache to
> something really small ... but only if it is not being
> used.
>
> Andrew, could a hack like the above be "good enough"?
>
> Anybody - does the above hack inspire you to come up with
> an even better idea?
>

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-01 18:24     ` Mandeep Singh Baines
@ 2010-11-01 23:46       ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-11-01 23:46 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Tue, Nov 2, 2010 at 3:24 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
> KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
>> Hi
>>
>> > On ChromiumOS, we do not use swap. When memory is low, the only way to
>> > free memory is to reclaim pages from the file list. This results in a
>> > lot of thrashing under low memory conditions. We see the system become
>> > unresponsive for minutes before it eventually OOMs. We also see very
>> > slow browser tab switching under low memory. Instead of an unresponsive
>> > system, we'd really like the kernel to OOM as soon as it starts to
>> > thrash. If it can't keep the working set in memory, then OOM.
>> > Losing one of many tabs is a better behaviour for the user than an
>> > unresponsive system.
>> >
>> > This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
>> > of file-backed pages when there are less than min_filelist_kbytes worth
>> > of such pages in the cache. This tunable is handy for low memory systems
>> > using solid-state storage where interactive response is more important
>> > than not OOMing.
>> >
>> > With this patch and min_filelist_kbytes set to 50000, I see very little
>> > block layer activity during low memory. The system stays responsive under
>> > low memory and browser tab switching is fast. Eventually, a process gets
>> > killed by OOM. Without this patch, the system gets wedged for minutes
>> > before it eventually OOMs. Below is the vmstat output from my test runs.
>>
>> I've heard similar requirements from embedded people; they also don't use
>> swap. So I don't think this is a hopeless idea, but I hope to clarify
>> some things first.
>>
>
> swap would be interesting if we could somehow control swap thrashing. Maybe
> we could add min_anonlist_kbytes. Just kidding:)
>
>> Yes, a system often has should-not-be-evicted file caches. Typically, they
>> are libc, libX11 and some GUI libraries. Traditionally, we made a tiny
>> application which linked the important libs and called mlockall() at startup;
>> such a technique prevents reclaim. So, Q1: Why do you think the traditional
>> way is insufficient?
>>
>
> mlock is too coarse-grained. It requires locking the whole file in memory.
> The chrome and X binaries are quite large so locking them would waste a lot
> of memory. We could lock just the pages that are part of the working set but
> that is difficult to do in practice. It's unmaintainable if you do it
> statically. If you do it at runtime by mlocking the working set, you're
> sort of giving up on mm's active list.
>
> Like akpm, I'm sad that we need this patch. I'd rather the kernel did a better
> job of identifying the working set. We did look at ways to do a better
> job of keeping the working set in the active list but these were trickier
> patches and never quite worked out. This patch is simple and works great.
>
> Under memory pressure, I see the active list get smaller and smaller. It's
> getting smaller because we're scanning it faster and faster, causing more
> and more page faults, which slows forward progress, resulting in the active
> list getting smaller still. One way to approach this might be to make the
> scan rate constant and configurable. It doesn't seem right that we scan
> memory faster and faster under low memory. For us, we'd rather OOM than
> evict pages that are likely to be accessed again so we'd prefer to make
> a conservative estimate as to what belongs in the working set. Other
> folks (long computations) might want to reclaim more aggressively.
>
>> Q2: In the above you used min_filelist_kbytes=50000. How did you decide
>> on such a value? Can other users calculate a proper value?
>>
>
> 50M was small enough that we were comfortable with keeping 50M of file pages
> in memory and large enough that it is bigger than the working set. I tested
> by loading up a bunch of popular web sites in chrome and then observing what
> happened when I ran out of memory. With 50M, I saw almost no thrashing and
> the system stayed responsive even under low memory. But I wanted to be
> conservative since I'm really just guessing.
>
> Other users could calculate their value by doing something similar. Load
> up the system (exhaust free memory) with a typical load and then observe
> file io via vmstat. They can then set min_filelist_kbytes to the value
> where they see a tolerable amount of thrashing (page faults, block io).
>
>> In addition, I have two requests. R1: I think a Chromium-specific feature is
>> harder to accept because it's harder to maintain, but we have a good chance
>> to solve a generic embedded issue. Please discuss with Minchan and/or other embedded
>
> I think this feature should be useful to a lot of embedded applications where
> OOM is OK, especially web browsing applications where the user is OK with
> losing 1 of many tabs they have open. However, I consider this patch a
> stop-gap. I think the real solution is to do a better job of protecting
> the active list.
>
>> developers. R2: If you want to deal with OOM, please also consider combining
>> this with the memcg OOM notifier. It is the most flexible and powerful OOM
>> mechanism. Desktop and server people probably never use the bare OOM killer intentionally.
>>
>
> Yes, will definitely look at OOM notifier. Currently trying to see if we can
> get by with oomadj. With OOM notifier you'd have to respond earlier so you
> might OOM more. However, with a notifier you might be able to take action that
> might prevent OOM altogether.
>
> I see memcg more as an isolation mechanism but I guess you could use it to
> isolate the working set from anon browser tab data as Kamezawa suggests.


I don't think the current VM behavior has a problem.
The real problem is that you are using more memory than the machine has.
On a system without swap, under low memory the VM doesn't have many choices.
It ends up evicting your working set to meet user requests. That's a very
natural result for a greedy user.

Rather than an OOM notifier, what we need is a memory notifier.
AFAIR, some years ago KOSAKI tried a similar thing:
http://lwn.net/Articles/268732/
(I can't remember exactly why KOSAKI gave it up. AFAIR, the signal timing
couldn't meet some requirements: by the time the user receives the
low-memory signal, it's already too late. Maybe there were other reasons
for KOSAKI to give it up.)
Anyway, if system memory is low, your intelligent middleware can
control the situation much better than the VM can.
How about improving that approach instead?
Mandeep, do you feel a need for this feature?



> Regards,
> Mandeep
>
>> Thanks.
>>
>>
>>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-01 19:43         ` Mandeep Singh Baines
@ 2010-11-02  3:11           ` Rik van Riel
  -1 siblings, 0 replies; 55+ messages in thread
From: Rik van Riel @ 2010-11-02  3:11 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: KOSAKI Motohiro, Andrew Morton, Mel Gorman, Minchan Kim,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:

> Yes, this prevents you from reclaiming the active list all at once. But if the
> memory pressure doesn't go away, you'll start to reclaim the active list
> little by little. First you'll empty the inactive list, and then you'll start
> scanning the active list, moving pages from the active list to the inactive
> list. The problem is that there is no minimum time limit on how long a page
> will sit in the inactive list before it is reclaimed. It just depends on the
> scan rate, which does not depend on time.
>
> In my experiments, I saw the active list get smaller and smaller
> over time until eventually it was only a few MB at which point the system came
> grinding to a halt due to thrashing.

I believe that changing the active/inactive ratio has other
potential thrashing issues.  Specifically, when the inactive
list is too small, pages may not stick around long enough to
be accessed multiple times and get promoted to the active
list, even when they are in active use.

I prefer a more flexible solution, that automatically does
the right thing.

The problem you see is that the file list gets reclaimed
very quickly, even when it is already very small.

I wonder if a possible solution would be to limit how fast
file pages get reclaimed, when the page cache is very small.
Say, inactive_file * active_file < 2 * zone->pages_high ?

At that point, maybe we could slow down the reclaiming of
page cache pages to be significantly slower than they can
be refilled by the disk.  Maybe 100 pages a second - that
can be refilled even by an actual spinning metal disk
without even the use of readahead.

That can be rounded up to one batch of SWAP_CLUSTER_MAX
file pages every 1/4 second, when the number of page cache
pages is very low.

This way HPC and virtual machine hosting nodes can still
get rid of totally unused page cache, but on any system
that actually uses page cache, some minimal amount of
cache will be protected under heavy memory pressure.

Does this sound like a reasonable approach?

I realize the threshold may have to be tweaked...

The big question is, how do we integrate this with the
OOM killer?  Do we pretend we are out of memory when
we've hit our file cache eviction quota and kill something?

Would there be any downsides to this approach?

Are there any volunteers for implementing this idea?
(Maybe someone who needs the feature?)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-02  3:11           ` Rik van Riel
  (?)
@ 2010-11-03  0:48           ` Minchan Kim
  2010-11-03  2:00               ` Rik van Riel
  -1 siblings, 1 reply; 55+ messages in thread
From: Minchan Kim @ 2010-11-03  0:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

Hi Rik,

On Tue, Nov 2, 2010 at 12:11 PM, Rik van Riel <riel@redhat.com> wrote:
> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
>
>> Yes, this prevents you from reclaiming the active list all at once.
>> But if the memory pressure doesn't go away, you'll start to reclaim
>> the active list little by little. First you'll empty the inactive
>> list, and then you'll start scanning the active list and pulling
>> pages from inactive to active. The problem is that there is no
>> minimum time limit to how long a page will sit in the inactive list
>> before it is reclaimed. Just depends on scan rate which does not
>> depend on time.
>>
>> In my experiments, I saw the active list get smaller and smaller
>> over time until eventually it was only a few MB at which point the
>> system came grinding to a halt due to thrashing.
>
> I believe that changing the active/inactive ratio has other
> potential thrashing issues.  Specifically, when the inactive
> list is too small, pages may not stick around long enough to
> be accessed multiple times and get promoted to the active
> list, even when they are in active use.
>
> I prefer a more flexible solution, that automatically does
> the right thing.

I agree. Ideally, it would be best if we handled this inside the kernel.

>
> The problem you see is that the file list gets reclaimed
> very quickly, even when it is already very small.
>
> I wonder if a possible solution would be to limit how fast
> file pages get reclaimed, when the page cache is very small.
> Say, inactive_file * active_file < 2 * zone->pages_high ?

Why do you multiply inactive_file and active_file?
What does that mean?

I think it's very difficult to fix _a_ threshold.
At the least, users will have to set it to a proper value to use the feature.
Either way, we need a default value, and that will take some
experimentation on desktop and embedded systems.

>
> At that point, maybe we could slow down the reclaiming of
> page cache pages to be significantly slower than they can
> be refilled by the disk.  Maybe 100 pages a second - that
> can be refilled even by an actual spinning metal disk
> without even the use of readahead.
>
> That can be rounded up to one batch of SWAP_CLUSTER_MAX
> file pages every 1/4 second, when the number of page cache
> pages is very low.

How about reducing the scanning window size?
I think that could approximate the idea.

>
> This way HPC and virtual machine hosting nodes can still
> get rid of totally unused page cache, but on any system
> that actually uses page cache, some minimal amount of
> cache will be protected under heavy memory pressure.
>
> Does this sound like a reasonable approach?
>
> I realize the threshold may have to be tweaked...

Absolutely.

>
> The big question is, how do we integrate this with the
> OOM killer?  Do we pretend we are out of memory when
> we've hit our file cache eviction quota and kill something?

I think "yes".
But killing isn't the best option if oom_badness can't select a proper victim.
Normally, embedded systems don't have swap, and they may try to keep
many tasks in memory to hide application startup latency.
That means some tasks are never executed for a long time and just sit
in memory, consuming it.
OOM would have to kill those. Anyway, that's off topic.

>
> Would there be any downsides to this approach?

My first concern is unbalanced aging of anon vs. file pages.
But I think it's no problem; it's the result the user wants. The user
wants to protect file-backed pages (e.g., code pages), so heavy anon
swapout is the natural consequence for the system. If the system has
no swap, we have no choice except OOM.

>
> Are there any volunteers for implementing this idea?
> (Maybe someone who needs the feature?)

I made a quick patch for discussion, combining your idea and Mandeep's.
(It only passes the compile test.)


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7687228..98380ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,7 @@ extern unsigned long num_physpages;
 extern unsigned long totalram_pages;
 extern void * high_memory;
 extern int page_cluster;
+extern int min_filelist_kbytes;

 #ifdef CONFIG_SYSCTL
 extern int sysctl_legacy_va_layout;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3a45c22..c61f0c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1320,6 +1320,14 @@ static struct ctl_table vm_table[] = {
                .extra2         = &one,
        },
 #endif
+       {
+               .procname       = "min_filelist_kbytes",
+               .data           = &min_filelist_kbytes,
+               .maxlen         = sizeof(min_filelist_kbytes),
+               .mode           = 0644,
+               .proc_handler   = &proc_dointvec,
+               .extra1         = &zero,
+       },

 /*
  * NOTE: do not add new entries to this table unless you have read
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..3b0e95d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -130,6 +130,11 @@ struct scan_control {
 int vm_swappiness = 60;
 long vm_total_pages;   /* The total number of pages which the VM controls */

+/*
+ * Low watermark used to prevent file cache thrashing during low memory.
+ * 20M is an arbitrary value. We need more discussion.
+ */
+int min_filelist_kbytes = 1024 * 20;
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);

@@ -1635,6 +1640,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
        u64 fraction[2], denominator;
        enum lru_list l;
        int noswap = 0;
+       int low_pagecache = 0;

        /* If we have no swap space, do not bother scanning anon pages. */
        if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1651,6 +1657,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
                zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);

        if (scanning_global_lru(sc)) {
+               unsigned long pagecache_threshold;
                free  = zone_page_state(zone, NR_FREE_PAGES);
                /* If we have very few page cache pages,
                   force-scan anon pages. */
@@ -1660,6 +1667,10 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
                        denominator = 1;
                        goto out;
                }
+
+               pagecache_threshold = min_filelist_kbytes >> (PAGE_SHIFT - 10);
+               if (file < pagecache_threshold)
+                       low_pagecache = 1;
        }

        /*
@@ -1715,6 +1726,12 @@ out:
                if (priority || noswap) {
                        scan >>= priority;
                        scan = div64_u64(scan * fraction[file], denominator);
+                       /*
+                        * If the system is low on page cache, slow the
+                        * scan rate to 1/8 to protect the working set.
+                        */
+                       if (low_pagecache)
+                               scan >>= 3;
                }
                nr[l] = nr_scan_try_batch(scan,
                                          &reclaim_stat->nr_saved_scan[l]);



> --
> All rights reversed
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-03  0:48           ` Minchan Kim
@ 2010-11-03  2:00               ` Rik van Riel
  0 siblings, 0 replies; 55+ messages in thread
From: Rik van Riel @ 2010-11-03  2:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On 11/02/2010 08:48 PM, Minchan Kim wrote:

>> I wonder if a possible solution would be to limit how fast
>> file pages get reclaimed, when the page cache is very small.
>> Say, inactive_file * active_file < 2 * zone->pages_high ?
>
> Why do you multiply inactive_file and active_file?
> What does that mean?

That was a stupid typo, it should have been a + :)

> I think it's very difficult to fix _a_ threshold.
> At the least, users will have to set it to a proper value to use the feature.
> Either way, we need a default value, and that will take some
> experimentation on desktop and embedded systems.

Yes, setting a threshold will be difficult.  However,
if the behaviour below that threshold is harmless to
pretty much any workload, it doesn't matter a whole
lot where we set it...

>> At that point, maybe we could slow down the reclaiming of
>> page cache pages to be significantly slower than they can
>> be refilled by the disk.  Maybe 100 pages a second - that
>> can be refilled even by an actual spinning metal disk
>> without even the use of readahead.
>>
>> That can be rounded up to one batch of SWAP_CLUSTER_MAX
>> file pages every 1/4 second, when the number of page cache
>> pages is very low.
>
> How about reducing the scanning window size?
> I think that could approximate the idea.

A good idea in principle, but if it results in the VM
simply calling the pageout code more often, I suspect
it will not have any effect.

Your patch looks like it would have that effect.

I suspect we will need a time-based approach to really
protect the last bits of page cache in a near-OOM
situation.

>> Would there be any downsides to this approach?
>
> My first concern is unbalanced aging of anon vs. file pages.
> But I think it's no problem; it's the result the user wants. The user
> wants to protect file-backed pages (e.g., code pages), so heavy anon
> swapout is the natural consequence for the system. If the system has
> no swap, we have no choice except OOM.

We already have an unbalance in aging anon and file
pages, several of which are introduced on purpose.

In this proposal, there would only be an imbalance
if the number of file pages is really low.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-03  2:00               ` Rik van Riel
@ 2010-11-03  3:03                 ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-11-03  3:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Wed, Nov 3, 2010 at 11:00 AM, Rik van Riel <riel@redhat.com> wrote:
> On 11/02/2010 08:48 PM, Minchan Kim wrote:
>
>>> I wonder if a possible solution would be to limit how fast
>>> file pages get reclaimed, when the page cache is very small.
>>> Say, inactive_file * active_file < 2 * zone->pages_high ?
>>
>> Why do you multiply inactive_file and active_file?
>> What does that mean?
>
> That was a stupid typo, it should have been a + :)
>
>> I think it's very difficult to fix _a_ threshold.
>> At the least, users will have to set it to a proper value to use the feature.
>> Either way, we need a default value, and that will take some
>> experimentation on desktop and embedded systems.
>
> Yes, setting a threshold will be difficult.  However,
> if the behaviour below that threshold is harmless to
> pretty much any workload, it doesn't matter a whole
> lot where we set it...

Okay. But I doubt we can pick a default value that is effective in the
cases where the feature is really needed.
Users will probably have to tweak the knob whenever they use it.

>
>>> At that point, maybe we could slow down the reclaiming of
>>> page cache pages to be significantly slower than they can
>>> be refilled by the disk.  Maybe 100 pages a second - that
>>> can be refilled even by an actual spinning metal disk
>>> without even the use of readahead.
>>>
>>> That can be rounded up to one batch of SWAP_CLUSTER_MAX
>>> file pages every 1/4 second, when the number of page cache
>>> pages is very low.
>>
>> How about reducing the scanning window size?
>> I think that could approximate the idea.
>
> A good idea in principle, but if it results in the VM
> simply calling the pageout code more often, I suspect
> it will not have any effect.
>
> Your patch looks like it would have that effect.


It could.
But a time-based approach would be the same, IMHO.
First of all, I don't want long latencies in the direct reclaim path,
since that directly affects the responsiveness of foreground processes.

If the VM limits the number of pages reclaimed per second, direct
reclaim latency will suffer, so we should avoid throttling in the
direct reclaim path. Agree?

If we slow down page reclaim in kswapd, more processes will enter
direct reclaim, and that results in the VM simply calling the
pageout code more often.

If I've misunderstood how you'd implement your idea, please let me know.

>
> I suspect we will need a time-based approach to really
> protect the last bits of page cache in a near-OOM
> situation.
>
>>> Would there be any downsides to this approach?
>>
>> My first concern is unbalanced aging of anon vs. file pages.
>> But I think it's no problem; it's the result the user wants. The user
>> wants to protect file-backed pages (e.g., code pages), so heavy anon
>> swapout is the natural consequence for the system. If the system has
>> no swap, we have no choice except OOM.
>
> We already have an unbalance in aging anon and file
> pages, several of which are introduced on purpose.
>
> In this proposal, there would only be an imbalance
> if the number of file pages is really low.

Right.

>
> --
> All rights reversed
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-03  3:03                 ` Minchan Kim
@ 2010-11-03 11:41                   ` Rik van Riel
  -1 siblings, 0 replies; 55+ messages in thread
From: Rik van Riel @ 2010-11-03 11:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On 11/02/2010 11:03 PM, Minchan Kim wrote:

> It could.
> But a time-based approach would be the same, IMHO.
> First of all, I don't want long latencies in the direct reclaim path,
> since that directly affects the responsiveness of foreground processes.
>
> If the VM limits the number of pages reclaimed per second, direct
> reclaim latency will suffer, so we should avoid throttling in the
> direct reclaim path. Agree?

The idea would be to not throttle the processes trying to
reclaim page cache pages, but to only reclaim anonymous
pages when the page cache pages are low (and occasionally
a few page cache pages, say 128 a second).

If too many reclaimers come in when the page cache is
low and no swap is available, we will OOM kill instead
of stalling.

After all, the entire point of this patch would be to
avoid minutes-long latencies in triggering the OOM
killer.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-03 11:41                   ` Rik van Riel
@ 2010-11-03 15:42                     ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-11-03 15:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Wed, Nov 03, 2010 at 07:41:35AM -0400, Rik van Riel wrote:
> On 11/02/2010 11:03 PM, Minchan Kim wrote:
> 
> >It could.
> >But a time-based approach would be the same, IMHO.
> >First of all, I don't want long latencies in the direct reclaim path,
> >since that directly affects the responsiveness of foreground processes.
> >
> >If the VM limits the number of pages reclaimed per second, direct
> >reclaim latency will suffer, so we should avoid throttling in the
> >direct reclaim path. Agree?
> 
> The idea would be to not throttle the processes trying to
> reclaim page cache pages, but to only reclaim anonymous
> pages when the page cache pages are low (and occasionally
> a few page cache pages, say 128 a second).

Fair enough. Anon-only reclaim is better than thrashing code pages.

> 
> If too many reclaimers come in when the page cache is
> low and no swap is available, we will OOM kill instead
> of stalling.

I understand why you use (file < pages_min).
We should keep the threshold a small value. Otherwise, we will see many
OOM questions: "Why did OOM happen although my system had plenty of
file LRU pages?"

> 
> After all, the entire point of this patch would be to
> avoid minutes-long latencies in triggering the OOM
> killer.

I got your point. The patch's goal is not to protect the working set
fully, but to prevent page cache thrashing when the file LRU is low;
that thrashing causes minutes-long latencies before the OOM is reached.

Okay. I will look into this idea.
Thanks for the good suggestion, Rik. 

> 
> -- 
> All rights reversed

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-02  3:11           ` Rik van Riel
@ 2010-11-03 22:40             ` Mandeep Singh Baines
  -1 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-11-03 22:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

Rik van Riel (riel@redhat.com) wrote:
> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
> 
> >Yes, this prevents you from reclaiming the active list all at once. But if the
> >memory pressure doesn't go away, you'll start to reclaim the active list
> >little by little. First you'll empty the inactive list, and then
> >you'll start scanning
> >the active list and pulling pages from inactive to active. The problem is that
> >there is no minimum time limit to how long a page will sit in the inactive list
> >before it is reclaimed. Just depends on scan rate which does not depend
> >on time.
> >
> >In my experiments, I saw the active list get smaller and smaller
> >over time until eventually it was only a few MB at which point the system came
> >grinding to a halt due to thrashing.
> 
> I believe that changing the active/inactive ratio has other
> potential thrashing issues.  Specifically, when the inactive
> list is too small, pages may not stick around long enough to
> be accessed multiple times and get promoted to the active
> list, even when they are in active use.
> 
> I prefer a more flexible solution, that automatically does
> the right thing.
> 
> The problem you see is that the file list gets reclaimed
> very quickly, even when it is already very small.
> 
> I wonder if a possible solution would be to limit how fast
> file pages get reclaimed, when the page cache is very small.
> Say, inactive_file * active_file < 2 * zone->pages_high ?
> 
> At that point, maybe we could slow down the reclaiming of
> page cache pages to be significantly slower than they can
> be refilled by the disk.  Maybe 100 pages a second - that
> can be refilled even by an actual spinning metal disk
> without even the use of readahead.
> 
> That can be rounded up to one batch of SWAP_CLUSTER_MAX
> file pages every 1/4 second, when the number of page cache
> pages is very low.
> 
> This way HPC and virtual machine hosting nodes can still
> get rid of totally unused page cache, but on any system
> that actually uses page cache, some minimal amount of
> cache will be protected under heavy memory pressure.
> 
> Does this sound like a reasonable approach?
> 
> I realize the threshold may have to be tweaked...
> 
> The big question is, how do we integrate this with the
> OOM killer?  Do we pretend we are out of memory when
> we've hit our file cache eviction quota and kill something?
> 
> Would there be any downsides to this approach?
> 
> Are there any volunteers for implementing this idea?
> (Maybe someone who needs the feature?)
> 
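
[A toy sketch of the throttle described above — at most one SWAP_CLUSTER_MAX batch per quarter second — in plain C. This is a userspace model only; the millisecond clock and the function name are illustrative, not kernel code:]

```c
#include <assert.h>	/* for the sanity checks below */

#define SWAP_CLUSTER_MAX 32	/* one reclaim batch, as in the kernel */
#define THROTTLE_MS      250	/* at most one batch per 1/4 second */

/* Toy model of the proposed throttle: now_ms stands in for jiffies. */
static unsigned long next_allowed_ms;

/* Returns how many pages may be reclaimed right now: a full batch if
 * the quarter-second window has elapsed, otherwise nothing. */
static int may_reclaim_batch(unsigned long now_ms)
{
	if (now_ms < next_allowed_ms)
		return 0;
	next_allowed_ms = now_ms + THROTTLE_MS;
	return SWAP_CLUSTER_MAX;
}
```

[At 32 pages per 250 ms this caps reclaim at 128 pages (512 KB with 4 KB pages) per second, the same ballpark as the 100 pages/second figure above.]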

I've created a patch which takes a slightly different approach.
Instead of limiting how fast pages get reclaimed, the patch limits
how fast the active list gets scanned. This should result in the
active list being a better measure of the working set. I've seen
fairly good results with this patch and a scan interval of 1
centisecond. I see no thrashing when the scan interval is non-zero.

I've made it a tunable because I don't know what to set the scan
interval to. The final patch could derive the value from HZ and other
system parameters. Maybe relate it to sched_period?

---

[PATCH] vmscan: add a configurable scan interval

On ChromiumOS, we see a lot of thrashing under low memory. We do not
use swap, so the mm system can only free file-backed pages. Eventually,
we are left with few file-backed pages (a few MB) and the
system becomes unresponsive due to thrashing.

Our preference is for the system to OOM instead of becoming unresponsive.

This patch creates a tunable, vmscan_interval_centisecs, for controlling
the minimum interval between active list scans. At 0, I see the same
thrashing. At 1, I see no thrashing. The mm system does a good job
of protecting the working set. If a page has been referenced in the
last vmscan_interval_centisecs, it is kept in memory.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
---
 include/linux/mm.h     |    2 ++
 include/linux/mmzone.h |    9 +++++++++
 kernel/sysctl.c        |    7 +++++++
 mm/page_alloc.c        |    2 ++
 mm/vmscan.c            |   21 +++++++++++++++++++--
 5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 721f451..af058f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,8 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+extern unsigned int vmscan_interval;
+
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..6c4b6e1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -415,6 +415,15 @@ struct zone {
 	unsigned long		present_pages;	/* amount of memory (excluding holes) */
 
 	/*
+	 * To avoid over-scanning, we store the time of the last
+	 * scan (in jiffies).
+	 *
+	 * The anon LRU stats live in [0], file LRU stats in [1]
+	 */
+
+	unsigned long		last_scan[2];
+
+	/*
 	 * rarely used fields:
 	 */
 	const char		*name;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c33a1ed..c34251d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1318,6 +1318,13 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "scan_interval_centisecs",
+		.data		= &vmscan_interval,
+		.maxlen		= sizeof(vmscan_interval),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..46991d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -51,6 +51,7 @@
 #include <linux/kmemleak.h>
 #include <linux/memory.h>
 #include <linux/compaction.h>
+#include <linux/jiffies.h>
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 
@@ -4150,6 +4151,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		BUG_ON(ret);
 		memmap_init(size, nid, j, zone_start_pfn);
 		zone_start_pfn += size;
+		zone->last_scan[0] = zone->last_scan[1] = jiffies;
 	}
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8a6fdc..be45b91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/jiffies.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -136,6 +137,11 @@ struct scan_control {
 int vm_swappiness = 60;
 long vm_total_pages;	/* The total number of pages which the VM controls */
 
+/*
+ * Minimum interval between active list scans.
+ */
+unsigned int vmscan_interval = 0;
+
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
@@ -1659,14 +1665,25 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 		return inactive_anon_is_low(zone, sc);
 }
 
+static int list_scanned_recently(struct zone *zone, int file)
+{
+	unsigned long now = jiffies;
+	unsigned long delta = vmscan_interval * HZ / 100;
+
+	return time_after(zone->last_scan[file] + delta, now);
+}
+
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(zone, sc, file))
-		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
+		if (inactive_list_is_low(zone, sc, file) &&
+		    !list_scanned_recently(zone, file)) {
+			shrink_active_list(nr_to_scan, zone, sc, priority, file);
+			zone->last_scan[file] = jiffies;
+		}
 		return 0;
 	}
 
-- 
1.7.3.1
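
[For reference, the list_scanned_recently() check in the patch relies on the kernel's wrap-safe time_after() macro; its behaviour can be exercised in a standalone C model, with jiffies replaced by plain counters:]

```c
#include <assert.h>	/* for the sanity checks below */

/* Wrap-safe "a is after b", as in include/linux/jiffies.h
 * (simplified here to unsigned long timestamps only). */
#define time_after(a, b)  ((long)((b) - (a)) < 0)

/* Mirrors list_scanned_recently(): true while `now` is still within
 * `interval` ticks of the last active-list scan. */
static int scanned_recently(unsigned long last_scan, unsigned long now,
			    unsigned long interval)
{
	return time_after(last_scan + interval, now);
}
```

[The signed subtraction is what makes the comparison survive jiffies wrap-around: a timestamp just past ULONG_MAX still compares as "after" one just before it.]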

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-03 22:40             ` Mandeep Singh Baines
@ 2010-11-03 23:49               ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-11-03 23:49 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: Rik van Riel, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

Hello.

On Thu, Nov 4, 2010 at 7:40 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
> Rik van Riel (riel@redhat.com) wrote:
>> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
>>
>> >Yes, this prevents you from reclaiming the active list all at once. But if the
>> >memory pressure doesn't go away, you'll start to reclaim the active list
>> >little by little. First you'll empty the inactive list, and then
>> >you'll start scanning
>> >the active list and pulling pages from inactive to active. The problem is that
>> >there is no minimum time limit to how long a page will sit in the inactive list
>> >before it is reclaimed. Just depends on scan rate which does not depend
>> >on time.
>> >
>> >In my experiments, I saw the active list get smaller and smaller
>> >over time until eventually it was only a few MB at which point the system came
>> >grinding to a halt due to thrashing.
>>
>> I believe that changing the active/inactive ratio has other
>> potential thrashing issues.  Specifically, when the inactive
>> list is too small, pages may not stick around long enough to
>> be accessed multiple times and get promoted to the active
>> list, even when they are in active use.
>>
>> I prefer a more flexible solution, that automatically does
>> the right thing.
>>
>> The problem you see is that the file list gets reclaimed
>> very quickly, even when it is already very small.
>>
>> I wonder if a possible solution would be to limit how fast
>> file pages get reclaimed, when the page cache is very small.
>> Say, inactive_file * active_file < 2 * zone->pages_high ?
>>
>> At that point, maybe we could slow down the reclaiming of
>> page cache pages to be significantly slower than they can
>> be refilled by the disk.  Maybe 100 pages a second - that
>> can be refilled even by an actual spinning metal disk
>> without even the use of readahead.
>>
>> That can be rounded up to one batch of SWAP_CLUSTER_MAX
>> file pages every 1/4 second, when the number of page cache
>> pages is very low.
>>
>> This way HPC and virtual machine hosting nodes can still
>> get rid of totally unused page cache, but on any system
>> that actually uses page cache, some minimal amount of
>> cache will be protected under heavy memory pressure.
>>
>> Does this sound like a reasonable approach?
>>
>> I realize the threshold may have to be tweaked...
>>
>> The big question is, how do we integrate this with the
>> OOM killer?  Do we pretend we are out of memory when
>> we've hit our file cache eviction quota and kill something?
>>
>> Would there be any downsides to this approach?
>>
>> Are there any volunteers for implementing this idea?
>> (Maybe someone who needs the feature?)
>>
>
> I've created a patch which takes a slightly different approach.
> Instead of limiting how fast pages get reclaimed, the patch limits
> how fast the active list gets scanned. This should result in the
> active list being a better measure of the working set. I've seen
> fairly good results with this patch and a scan inteval of 1
> centisecond. I see no thrashing when the scan interval is non-zero.
>
> I've made it a tunable because I don't know what to set the scan
> interval. The final patch could set the value based on HZ and some
> other system parameters. Maybe relate it to sched_period?
>
> ---
>
> [PATCH] vmscan: add a configurable scan interval
>
> On ChromiumOS, we see a lot of thrashing under low memory. We do not
> use swap, so the mm system can only free file-backed pages. Eventually,
> we are left with little file back pages remaining (a few MB) and the
> system becomes unresponsive due to thrashing.
>
> Our preference is for the system to OOM instead of becoming unresponsive.
>
> This patch create a tunable, vmscan_interval_centisecs, for controlling
> the minimum interval between active list scans. At 0, I see the same
> thrashing. At 1, I see no thrashing. The mm system does a good job
> of protecting the working set. If a page has been referenced in the
> last vmscan_interval_centisecs it is kept in memory.
>
> Signed-off-by: Mandeep Singh Baines <msb@chromium.org>

vmscan already uses HZ/10 to calm down writeback congestion or
something similar.
(But I don't know why the VM uses that value or what rationale
determined it. It might have been chosen by experiment.)
If there isn't any good math behind it, we will have to depend on experiment this time, too.

Anyway, if the interval is long, it could shrink the inactive list very
quickly under many reclaim workloads and then cause unnecessary OOM kills.
So I hope that if the inactive list is very small compared to the active
list, we quit the check and refill the inactive list.
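
[One possible reading of that suggestion, as a standalone C sketch. The helper name and the 1:4 ratio are made up for illustration; this is not part of the posted patch:]

```c
#include <assert.h>	/* for the sanity checks below */

/* Hypothetical helper (not from the patch): skip the scan-interval
 * throttle when the inactive list has shrunk far below the active
 * list, so the inactive list gets refilled instead of starving. */
static int must_refill_inactive(unsigned long nr_inactive,
				unsigned long nr_active)
{
	/* 1:4 ratio picked purely for illustration */
	return nr_inactive * 4 < nr_active;
}
```

[The real check would presumably reuse the existing inactive_list_is_low() logic with a stricter ratio rather than introduce a new helper.]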

Anyway, the approach makes sense to me,
but we need other people's opinions.

Nitpick:
I expect you will include a description of the knob in
Documentation/sysctl/vm.txt in your formal patch.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-01 23:46       ` Minchan Kim
@ 2010-11-04  1:52         ` Mandeep Singh Baines
  -1 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-11-04  1:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton,
	Rik van Riel, Mel Gorman, Johannes Weiner, linux-kernel,
	linux-mm, wad, olofj, hughd

Minchan Kim (minchan.kim@gmail.com) wrote:
> On Tue, Nov 2, 2010 at 3:24 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
> > KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> >> Hi
> >>
> >> > On ChromiumOS, we do not use swap. When memory is low, the only way to
> >> > free memory is to reclaim pages from the file list. This results in a
> >> > lot of thrashing under low memory conditions. We see the system become
> >> > unresponsive for minutes before it eventually OOMs. We also see very
> >> > slow browser tab switching under low memory. Instead of an unresponsive
> >> > system, we'd really like the kernel to OOM as soon as it starts to
> >> > thrash. If it can't keep the working set in memory, then OOM.
> >> > Losing one of many tabs is a better behaviour for the user than an
> >> > unresponsive system.
> >> >
> >> > This patch creates a new sysctl, min_filelist_kbytes, which disables reclaim
> >> > of file-backed pages when there are less than min_filelist_kbytes worth
> >> > of such pages in the cache. This tunable is handy for low memory systems
> >> > using solid-state storage where interactive response is more important
> >> > than not OOMing.
> >> >
> >> > With this patch and min_filelist_kbytes set to 50000, I see very little
> >> > block layer activity during low memory. The system stays responsive under
> >> > low memory and browser tab switching is fast. Eventually, a process gets
> >> > killed by OOM. Without this patch, the system gets wedged for minutes
> >> > before it eventually OOMs. Below is the vmstat output from my test runs.
> >>
> >> I've heared similar requirement sometimes from embedded people. then also
> >> don't use swap. then, I don't think this is hopeless idea. but I hope to
> >> clarify some thing at first.
> >>
> >
> > Swap would be interesting if we could somehow control swap thrashing. Maybe
> > we could add min_anonlist_kbytes. Just kidding :)
> >
> >> Yes, a system often has file caches that should not be evicted. Typically, they
> >> are libc, libX11 and some GUI libraries. Traditionally, we made a tiny
> >> application that linked the important libs and called mlockall() at startup;
> >> such a technique prevents reclaim. So, Q1: why do you think the traditional way
> >> is insufficient?
> >>
> >
> > mlock is too coarse-grained. It requires locking the whole file in memory.
> > The chrome and X binaries are quite large so locking them would waste a lot
> > of memory. We could lock just the pages that are part of the working set but
> > that is difficult to do in practice. It's unmaintainable if you do it
> > statically. If you do it at runtime by mlocking the working set, you're
> > sort of giving up on mm's active list.
> >
> > Like akpm, I'm sad that we need this patch. I'd rather the kernel did a better
> > job of identifying the working set. We did look at ways to do a better
> > job of keeping the working set in the active list but these were trickier
> > patches and never quite worked out. This patch is simple and works great.
> >
> > Under memory pressure, I see the active list get smaller and smaller. Its
> > getting smaller because we're scanning it faster and faster, causing more
> > and more page faults which slows forward progress resulting in the active
> > list getting smaller still. One way to approach this might be to make the
> > scan rate constant and configurable. It doesn't seem right that we scan
> > memory faster and faster under low memory. For us, we'd rather OOM than
> > evict pages that are likely to be accessed again so we'd prefer to make
> > a conservative estimate as to what belongs in the working set. Other
> > folks (long computations) might want to reclaim more aggressively.
> >
> >> Q2: Above, you used min_filelist_kbytes=50000. How did you decide on
> >> that value? Can other users calculate a proper value?
> >>
> >
> > 50M was small enough that we were comfortable with keeping 50M of file pages
> > in memory and large enough that it is bigger than the working set. I tested
> > by loading up a bunch of popular web sites in chrome and then observing what
> > happend when I ran out of memory. With 50M, I saw almost no thrashing and
> > the system stayed responsive even under low memory. but I wanted to be
> > conservative since I'm really just guessing.
> >
> > Other users could calculate their value by doing something similar. Load
> > up the system (exhaust free memory) with a typical load and then observe
> > file I/O via vmstat. They can then set min_filelist_kbytes to the value
> > where they see a tolerable amount of thrashing (page faults, block I/O).
> >
> >> In addition, I have two requests. R1: I think a Chromium-specific feature is
> >> harder to accept because it's harder to maintain. But we have a good chance to
> >> solve a generic embedded issue. Please discuss with Minchan and/or other embedded
> >
> > I think this feature should be useful to a lot of embedded applications where
> > OOM is OK, especially web browsing applications where the user is OK with
> > losing 1 of many tabs they have open. However, I consider this patch a
> > stop-gap. I think the real solution is to do a better job of protecting
> > the active list.
> >
> >> developers. R2: If you want to deal with OOM, please also consider combining
> >> this with the memcg OOM notifier. It is the most flexible and powerful OOM
> >> mechanism. Probably desktop and server people never use the bare OOM killer intentionally.
> >>
> >
> > Yes, will definitely look at OOM notifier. Currently trying to see if we can
> > get by with oomadj. With OOM notifier you'd have to respond earlier so you
> > might OOM more. However, with a notifier you might be able to take action that
> > might prevent OOM altogether.
> >
> > I see memcg more as an isolation mechanism but I guess you could use it to
> > isolate the working set from anon browser tab data as Kamezawa suggests.
> 
> 
> I don't think current VM behavior has a problem.
> The current problem is that you use up more memory than the real memory.
> As system memory without swap is low, the VM doesn't have many choices.
> It ends up evicting your working set to meet the user's request. That's a very
> natural result for a greedy user.
> 
> Rather than an OOM notifier, what we need is a memory notifier.
> AFAIR, some years ago KOSAKI tried a similar thing:
> http://lwn.net/Articles/268732/

Thanks! This is perfect. I wonder why it's not merged. Was a different
solution eventually implemented? Is there another way of doing the
same thing?

> (I can't remember exactly why KOSAKI dropped it. AFAIR, the signal timing
> couldn't meet your requirement. I mean, by the time the user receives the
> low-memory signal, it's too late. Maybe there were other reasons for KOSAKI
> to drop it.)
> Anyway, if system memory is low, your intelligent middleware can
> control it much better than the VM.

Agree.

> While we have the chance, how about improving it?
> Mandeep, do you feel you need this feature?
> 

mem_notify seems perfect.

> 
> 
> > Regards,
> > Mandeep
> >
> >> Thanks.
> >>
> >>
> >>
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-03 22:40             ` Mandeep Singh Baines
@ 2010-11-04 15:30               ` Rik van Riel
  -1 siblings, 0 replies; 55+ messages in thread
From: Rik van Riel @ 2010-11-04 15:30 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: KOSAKI Motohiro, Andrew Morton, Mel Gorman, Minchan Kim,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On 11/03/2010 06:40 PM, Mandeep Singh Baines wrote:

> I've created a patch which takes a slightly different approach.
> Instead of limiting how fast pages get reclaimed, the patch limits
> how fast the active list gets scanned. This should result in the
> active list being a better measure of the working set. I've seen
> fairly good results with this patch and a scan interval of 1
> centisecond. I see no thrashing when the scan interval is non-zero.
>
> I've made it a tunable because I don't know what to set the scan
> interval. The final patch could set the value based on HZ and some
> other system parameters. Maybe relate it to sched_period?

I like your approach. For file pages it looks like it
could work fine, since new pages always start on the
inactive file list.

However, for anonymous pages I could see your patch
leading to problems, because all anonymous pages start
on the active list.  With a scan interval of 1
centiseconds, that means there would be a limit of 3200
pages, or 12MB of anonymous memory that can be moved to
the inactive list a second.

I have seen systems with single SATA disks push out
several times that to swap per second, which matters
when someone starts up a program that is just too big
to fit in memory and requires that something is pushed
out.

That would reduce the size of the inactive list to
zero, reducing our page replacement to a slow FIFO
at best, causing false OOM kills at worst.

Staying with a default of 0 would of course not do
anything, which would make merging the code not too
useful.

I believe we absolutely need to preserve the ability
to evict pages quickly, when new pages are brought
into memory or allocated quickly.

However, speed limits are probably a very good idea
once a cache has been reduced to a smaller size, or
when most IO bypasses the reclaim-speed-limited cache.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-04  1:52         ` Mandeep Singh Baines
@ 2010-11-05  2:36           ` Minchan Kim
  -1 siblings, 0 replies; 55+ messages in thread
From: Minchan Kim @ 2010-11-05  2:36 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Mel Gorman,
	Johannes Weiner, linux-kernel, linux-mm, wad, olofj, hughd

On Thu, Nov 4, 2010 at 10:52 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
> Minchan Kim (minchan.kim@gmail.com) wrote:
>> On Tue, Nov 2, 2010 at 3:24 AM, Mandeep Singh Baines <msb@chromium.org> wrote:
>> > I see memcg more as an isolation mechanism but I guess you could use it to
>> > isolate the working set from anon browser tab data as Kamezawa suggests.
>>
>>
>> I don't think current VM behavior has a problem.
>> The current problem is that you use up more memory than the real memory.
>> As system memory without swap is low, the VM doesn't have many choices.
>> It ends up evicting your working set to meet the user's request. That's a very
>> natural result for a greedy user.
>>
>> Rather than an OOM notifier, what we need is a memory notifier.
>> AFAIR, some years ago KOSAKI tried a similar thing:
>> http://lwn.net/Articles/268732/
>
> Thanks! This is perfect. I wonder why it's not merged. Was a different
> solution eventually implemented? Is there another way of doing the
> same thing?

If I remember right, there was a timing issue.
When the application was notified, it was too late to handle it.
Maybe KOSAKI can explain the problem in more detail.

I think we need some leveling mechanism.
For example, the user can set the limits 30M, 20M, 10M, 5M.

If free memory falls below 30M, the master application can request
freeing of extra memory from background sleeping applications.
If free memory falls below 20M, the master application can request
the exit of background sleeping applications.
If free memory falls below 10M, the master application can kill
non-critical applications.
If free memory falls below 5M, the master application can request freeing
of memory from critical applications.

I think this mechanism would be useful for memcg, too.

>
>> (I can't remember exactly why KOSAKI dropped it. AFAIR, the signal timing
>> couldn't meet your requirement. I mean, by the time the user receives the
>> low-memory signal, it's too late. Maybe there were other reasons for KOSAKI
>> to drop it.)
>> Anyway, if system memory is low, your intelligent middleware can
>> control it much better than the VM.
>
> Agree.
>
>> While we have the chance, how about improving it?
>> Mandeep, do you feel you need this feature?
>>
>
> mem_notify seems perfect.

BTW, regardless of mem_notify, I think this patch is useful for general
systems, too.
We should keep making progress on this patch.

>
>>
>>
>> > Regards,
>> > Mandeep
>> >
>> >> Thanks.
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-04 15:30               ` Rik van Riel
@ 2010-11-08 21:55                 ` Mandeep Singh Baines
  -1 siblings, 0 replies; 55+ messages in thread
From: Mandeep Singh Baines @ 2010-11-08 21:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mandeep Singh Baines, KOSAKI Motohiro, Andrew Morton, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

Rik van Riel (riel@redhat.com) wrote:
> On 11/03/2010 06:40 PM, Mandeep Singh Baines wrote:
> 
> >I've created a patch which takes a slightly different approach.
> >Instead of limiting how fast pages get reclaimed, the patch limits
> >how fast the active list gets scanned. This should result in the
> >active list being a better measure of the working set. I've seen
> >fairly good results with this patch and a scan interval of 1
> >centisecond. I see no thrashing when the scan interval is non-zero.
> >
> >I've made it a tunable because I don't know what to set the scan
> >interval. The final patch could set the value based on HZ and some
> >other system parameters. Maybe relate it to sched_period?
> 
> I like your approach. For file pages it looks like it
> could work fine, since new pages always start on the
> inactive file list.
> 
> However, for anonymous pages I could see your patch
> leading to problems, because all anonymous pages start
> on the active list.  With a scan interval of 1
> centiseconds, that means there would be a limit of 3200
> pages, or 12MB of anonymous memory that can be moved to
> the inactive list a second.
> 

Good point.

> I have seen systems with single SATA disks push out
> several times that to swap per second, which matters
> when someone starts up a program that is just too big
> to fit in memory and requires that something is pushed
> out.
> 
> That would reduce the size of the inactive list to
> zero, reducing our page replacement to a slow FIFO
> at best, causing false OOM kills at worst.
> 
> Staying with a default of 0 would of course not do
> anything, which would make merging the code not too
> useful.
> 
> I believe we absolutely need to preserve the ability
> to evict pages quickly, when new pages are brought
> into memory or allocated quickly.
> 

Agree.

Instead of doing one scan of SWAP_CLUSTER_MAX pages per vmscan_interval,
we could do one "full" scan per vmscan_interval. You could do the full scan
all at once, or scan SWAP_CLUSTER_MAX pages per scan until you've scanned
the whole list.

Pseudo code:

if (zone->to_scan[file] == 0 && !list_scanned_recently(zone, file))
	zone->to_scan[file] = list_get_size(zone, file);
if (zone->to_scan[file]) {
	shrink_active_list(nr_to_scan, zone, sc, priority, file);
	zone->to_scan[file] -= min(zone->to_scan[file], nr_to_scan);
}

> However, speed limits are probably a very good idea
> once a cache has been reduced to a smaller size, or
> when most IO bypasses the reclaim-speed-limited cache.
> 
> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-04 15:30               ` Rik van Riel
@ 2010-11-09  2:49                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  2:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Mandeep Singh Baines, Andrew Morton, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

> On 11/03/2010 06:40 PM, Mandeep Singh Baines wrote:
> 
> > I've created a patch which takes a slightly different approach.
> > Instead of limiting how fast pages get reclaimed, the patch limits
> > how fast the active list gets scanned. This should result in the
> > active list being a better measure of the working set. I've seen
> > fairly good results with this patch and a scan interval of 1
> > centisecond. I see no thrashing when the scan interval is non-zero.
> >
> > I've made it a tunable because I don't know what to set the scan
> > interval. The final patch could set the value based on HZ and some
> > other system parameters. Maybe relate it to sched_period?
> 
> I like your approach. For file pages it looks like it
> could work fine, since new pages always start on the
> inactive file list.
> 
> However, for anonymous pages I could see your patch
> leading to problems, because all anonymous pages start
> on the active list.  With a scan interval of 1
> centiseconds, that means there would be a limit of 3200
> pages, or 12MB of anonymous memory that can be moved to
> the inactive list a second.
> 
> I have seen systems with single SATA disks push out
> several times that to swap per second, which matters
> when someone starts up a program that is just too big
> to fit in memory and requires that something is pushed
> out.
> 
> That would reduce the size of the inactive list to
> zero, reducing our page replacement to a slow FIFO
> at best, causing false OOM kills at worst.
> 
> Staying with a default of 0 would of course not do
> anything, which would make merging the code not too
> useful.
> 
> I believe we absolutely need to preserve the ability
> to evict pages quickly, when new pages are brought
> into memory or allocated quickly.
> 
> However, speed limits are probably a very good idea
> once a cache has been reduced to a smaller size, or
> when most IO bypasses the reclaim-speed-limited cache.

Yeah.

But I doubt a fixed rate limit is a good thing. In the movie-playing case
(aka the streaming I/O case), we don't want any throttling, I think.
Also, I don't like the jiffies dependency: CPU hardware improvements will
naturally break such heuristics.


btw, congestion_wait() already has a jiffies dependency, but we should
kill such strange timeouts eventually, I think.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
@ 2010-11-09  2:49                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  2:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Mandeep Singh Baines, Andrew Morton, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

> On 11/03/2010 06:40 PM, Mandeep Singh Baines wrote:
> 
> > I've created a patch which takes a slightly different approach.
> > Instead of limiting how fast pages get reclaimed, the patch limits
> > how fast the active list gets scanned. This should result in the
> > active list being a better measure of the working set. I've seen
> > fairly good results with this patch and a scan inteval of 1
> > centisecond. I see no thrashing when the scan interval is non-zero.
> >
> > I've made it a tunable because I don't know what to set the scan
> > interval. The final patch could set the value based on HZ and some
> > other system parameters. Maybe relate it to sched_period?
> 
> I like your approach. For file pages it looks like it
> could work fine, since new pages always start on the
> inactive file list.
> 
> However, for anonymous pages I could see your patch
> leading to problems, because all anonymous pages start
> on the active list.  With a scan interval of 1
> centiseconds, that means there would be a limit of 3200
> pages, or 12MB of anonymous memory that can be moved to
> the inactive list a second.
> 
> I have seen systems with single SATA disks push out
> several times that to swap per second, which matters
> when someone starts up a program that is just too big
> to fit in memory and requires that something is pushed
> out.
> 
> That would reduce the size of the inactive list to
> zero, reducing our page replacement to a slow FIFO
> at best, causing false OOM kills at worst.
> 
> Staying with a default of 0 would of course not do
> anything, which would make merging the code not too
> useful.
> 
> I believe we absolutely need to preserve the ability
> to evict pages quickly, when new pages are brought
> into memory or allocated quickly.
> 
> However, speed limits are probably a very good idea
> once a cache has been reduced to a smaller size, or
> when most IO bypasses the reclaim-speed-limited cache.

Yeah.

But I doubt a fixed rate limit is a good thing. In the movie playback
case (aka the streaming I/O case), we don't want any throttling, I think.
Also, I don't like the jiffies dependency; CPU hardware improvements
will naturally break such heuristics.


btw, congestion_wait() already has a jiffies dependency, but we should
kill such strange timeouts eventually, I think.
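
The 3200-pages / 12MB-per-second limit Rik quotes above can be reproduced
with a quick back-of-envelope calculation. The sketch below assumes a scan
batch of 32 pages (SWAP_CLUSTER_MAX) and HZ = 100, so a 1-centisecond
interval allows 100 scans per second, with 4 KiB pages; these are typical
x86 defaults, not values stated explicitly in the thread:

```python
# Back-of-envelope: anonymous memory movable from the active to the
# inactive list per second under a 1-centisecond scan interval.

SCAN_CLUSTER = 32      # pages moved per scan (SWAP_CLUSTER_MAX, assumed)
SCANS_PER_SEC = 100    # one scan per centisecond (HZ = 100, assumed)
PAGE_SIZE_KIB = 4      # 4 KiB pages, typical on x86

pages_per_sec = SCAN_CLUSTER * SCANS_PER_SEC        # 3200 pages/s
mib_per_sec = pages_per_sec * PAGE_SIZE_KIB / 1024  # ~12.5 MiB/s

print(pages_per_sec, round(mib_per_sec, 1))  # 3200 12.5
```

This matches Rik's figure of "3200 pages, or 12MB" per second, and shows
why the cap is far below what a single SATA disk can push to swap.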


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set
  2010-11-04  1:52         ` Mandeep Singh Baines
@ 2010-11-09  2:53           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  2:53 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: kosaki.motohiro, Minchan Kim, Andrew Morton, Rik van Riel,
	Mel Gorman, Johannes Weiner, linux-kernel, linux-mm, wad, olofj,
	hughd

> > I don't think the current VM behavior has a problem.
> > The real problem is that you use more memory than there is physical
> > memory. Since the system has no swap and memory is low, the VM doesn't
> > have many choices. It ends up evicting your working set to satisfy user
> > requests. That's a very natural result for a greedy user.
> > 
> > Rather than OOM notifier, what we need is memory notifier.
> > AFAIR, before some years ago, KOSAKI tried similar thing .
> > http://lwn.net/Articles/268732/
> 
> Thanks! This is perfect. I wonder why it's not merged. Was a different
> solution eventually implemented? Is there another way of doing the
> same thing?

Now memcg has a memory threshold notification feature and most people
are using it. If you think notification fits your case, could you please
try that feature first? If it doesn't fit and we get feedback from you,
we can probably extend it.

Thanks.
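
For reference, the memcg threshold notification mentioned above works, in
cgroup v1, by writing "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
to cgroup.event_control; the eventfd then becomes readable when usage
crosses the threshold. A minimal sketch, assuming a mounted v1 memory
controller, appropriate privileges, and Python 3.10+ for os.eventfd (the
cgroup path in the usage comment is hypothetical):

```python
import os

def event_control_line(event_fd: int, usage_fd: int, threshold_bytes: int) -> str:
    """Build the line written to cgroup.event_control to arm a
    memory threshold notification: "<event_fd> <usage_fd> <threshold>"."""
    return f"{event_fd} {usage_fd} {threshold_bytes}"

def register_threshold(cgroup_dir: str, threshold_bytes: int) -> int:
    """Arm a notification that fires when memory usage in cgroup_dir
    crosses threshold_bytes; returns the eventfd to read/poll."""
    efd = os.eventfd(0)
    usage_fd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"),
                       os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
            f.write(event_control_line(efd, usage_fd, threshold_bytes))
    finally:
        os.close(usage_fd)  # the kernel keeps its own reference
    return efd

# Usage (hypothetical cgroup path; the read blocks until usage
# crosses the 50 MB threshold):
#   efd = register_threshold("/sys/fs/cgroup/memory/mygroup", 50 * 1024 * 1024)
#   os.read(efd, 8)
```

This is purely a userspace notification, so unlike min_filelist_kbytes it
lets a daemon react (e.g. drop caches or kill a tab) before reclaim starts
thrashing.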





end of thread, other threads:[~2010-11-09  2:53 UTC | newest]

Thread overview: 55+ messages
2010-10-28 19:15 [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set Mandeep Singh Baines
2010-10-28 20:10 ` Andrew Morton
2010-10-28 22:03   ` Mandeep Singh Baines
2010-10-28 23:28     ` Minchan Kim
2010-10-28 23:29       ` Minchan Kim
2010-10-29  0:04       ` KAMEZAWA Hiroyuki
2010-10-29  0:28         ` Minchan Kim
2010-10-28 21:30 ` Rik van Riel
2010-10-28 22:13   ` Mandeep Singh Baines
2010-11-01  7:05 ` KOSAKI Motohiro
2010-11-01 18:24   ` Mandeep Singh Baines
2010-11-01 18:50     ` Rik van Riel
2010-11-01 19:43       ` Mandeep Singh Baines
2010-11-02  3:11         ` Rik van Riel
2010-11-03  0:48           ` Minchan Kim
2010-11-03  2:00             ` Rik van Riel
2010-11-03  3:03               ` Minchan Kim
2010-11-03 11:41                 ` Rik van Riel
2010-11-03 15:42                   ` Minchan Kim
2010-11-03 22:40           ` Mandeep Singh Baines
2010-11-03 23:49             ` Minchan Kim
2010-11-04 15:30             ` Rik van Riel
2010-11-08 21:55               ` Mandeep Singh Baines
2010-11-09  2:49               ` KOSAKI Motohiro
2010-11-01 23:46     ` Minchan Kim
2010-11-04  1:52       ` Mandeep Singh Baines
2010-11-05  2:36         ` Minchan Kim
2010-11-09  2:53         ` KOSAKI Motohiro
