* [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
From: Balbir Singh @ 2010-03-15  7:22 UTC
  To: KVM development list
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh <balbir@linux.vnet.ibm.com>

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested unmapped page control.
This is useful in the following scenario:

- In a virtualized environment with cache!=none, we see
  double caching - one copy in the host and one in the guest. As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest, and to get the host to hold and manage the page cache.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The feature is controlled via a boot parameter, so the administrator
  can selectively turn it on, on an as-needed basis (see the example
  after this list).
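
For illustration, enabling the feature only requires appending the
parameter to the guest kernel command line. The exact bootloader syntax
depends on the setup; the GRUB-style entry below is just an assumed
example:

    kernel /boot/vmlinuz root=/dev/vda1 ro unmapped_page_control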

A lot of the code is borrowed from the zone_reclaim_mode logic in
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs, and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications, used to start additional guests, or used to balance
memory in the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while still providing control.

The min_unmapped_ratio sysctl provides further control from within the
guest over how much unmapped page cache is reclaimed; the worked example
below shows how the two thresholds interact.
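
To make the thresholds concrete, here is a small illustrative program
(not part of the patch) that mirrors the arithmetic in the code; it
assumes 4KB pages, a 512MB zone as in the tests below, and the default
min_unmapped_ratio of 1:

#include <stdio.h>

int main(void)
{
	unsigned long zone_pages = (512UL << 20) / 4096;  /* 131072 pages in a 512MB zone */
	unsigned long min_unmapped_ratio = 1;    /* default vm.min_unmapped_ratio, in percent */
	unsigned long unmapped_page_ratio = 16;  /* UNMAPPED_PAGE_RATIO from the patch */

	/* zone->min_unmapped_pages, as set up in free_area_init_core() */
	unsigned long min_unmapped_pages = zone_pages * min_unmapped_ratio / 100;

	/* should_balance_unmapped_pages() wakes kswapd above this many unmapped file pages */
	unsigned long wakeup_threshold = unmapped_page_ratio * min_unmapped_pages;

	/* balance_unmapped_pages() then targets 1/8th of the excess per pass */
	unsigned long reclaim_target = (wakeup_threshold - min_unmapped_pages) >> 3;

	printf("min_unmapped_pages:  %lu pages (~%lu MB)\n",
	       min_unmapped_pages, min_unmapped_pages * 4096 >> 20);
	printf("kswapd wakeup above: %lu pages (~%lu MB)\n",
	       wakeup_threshold, wakeup_threshold * 4096 >> 20);
	printf("reclaim target then: %lu pages (~%lu MB)\n",
	       reclaim_target, reclaim_target * 4096 >> 20);
	return 0;
}

With these defaults, kswapd is only nudged once roughly 80MB of a 512MB
guest is unmapped page cache, and each pass then tries to trim about
10MB of it.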

The patch is applied against mmotm feb-11-2010.

Tests
-----
Ran 4 VMs in parallel, running kernbench using kvm autotest. Each
guest had 2 CPUs with 512M of memory.

Guest usage without boot parameter (memory in KB)
--------------------------------------------------
MemFree Cached  Time
19900   292912 137
17540   296196 139
17900   296124 141
19356   296660 141

Host usage:  (memory in KB)

RSS     Cache   mapped  swap
2788664 781884  3780    359536

Guest usage with boot parameter (memory in KB)
-----------------------------------------------
MemFree Cached  Time
244824  74828   144
237840  81764   143
235880  83044   138
239312  80092   148

Host usage: (memory in KB)

RSS     Cache   mapped  swap
2700184 958012  334848  398412

The key thing to observe is the jump in guest free memory (and the
corresponding growth of the host cache) when the boot parameter is
enabled, at the cost of a small increase in kernbench time.

TODOs
-----
1. Balance slab cache as well
2. Invoke the balance routines from the balloon driver (a rough sketch
   follows this list)
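
For TODO item 2, a very rough sketch of what the balloon driver hook
could look like (illustration only, not part of this patch; it assumes
should_balance_unmapped_pages() and wakeup_kswapd() are visible to the
driver):

#include <linux/mmzone.h>
#include <linux/swap.h>

/*
 * Sketch: before inflating the balloon, nudge kswapd on any zone that
 * still holds excess unmapped page cache, so that cache is dropped in
 * preference to ballooning out otherwise useful guest memory.
 */
static void balloon_balance_unmapped_cache(void)
{
	struct zone *zone;

	for_each_populated_zone(zone) {
		if (should_balance_unmapped_pages(zone))
			wakeup_kswapd(zone, 0);
	}
}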

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |    2 -
 include/linux/swap.h   |    3 +
 mm/page_alloc.c        |    9 ++-
 mm/vmscan.c            |  165 ++++++++++++++++++++++++++++++++++++------------
 4 files changed, 134 insertions(+), 45 deletions(-)


diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ad5abcf..f0b245f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -293,12 +293,12 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
 
+	unsigned long		min_unmapped_pages;
 #ifdef CONFIG_NUMA
 	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
-	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2a4295..d0c8176 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,10 +254,11 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
 
+extern int sysctl_min_unmapped_ratio;
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 416b056..1cc5c75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1578,6 +1578,9 @@ zonelist_scan:
 			unsigned long mark;
 			int ret;
 
+			if (should_balance_unmapped_pages(zone))
+				wakeup_kswapd(zone, order);
+
 			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags))
@@ -3816,10 +3819,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-		zone->node = nid;
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
 						/ 100;
+#ifdef CONFIG_NUMA
+		zone->node = nid;
 		zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
@@ -4727,7 +4730,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }
 
-#ifdef CONFIG_NUMA
 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
@@ -4744,6 +4746,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5cbf64d..46026e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,6 +138,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define scanning_global_lru(sc)	(1)
 #endif
 
+static int unmapped_page_control __read_mostly;
+
+static int __init unmapped_page_control_parm(char *str)
+{
+	unmapped_page_control = 1;
+	/*
+	 * XXX: Should we tweak swappiness here?
+	 */
+	return 1;
+}
+__setup("unmapped_page_control", unmapped_page_control_parm);
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
@@ -1938,6 +1950,103 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 }
 
 /*
+ * Percentage of pages in a zone that must be unmapped for zone_reclaim to
+ * occur.
+ */
+int sysctl_min_unmapped_ratio = 1;
+/*
+ * Priority for ZONE_RECLAIM. This determines the fraction of pages
+ * of a node considered for each zone_reclaim. 4 scans 1/16th of
+ * a zone.
+ */
+#define ZONE_RECLAIM_PRIORITY 4
+
+
+#define RECLAIM_OFF 0
+#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
+
+static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+{
+	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
+		zone_page_state(zone, NR_ACTIVE_FILE);
+
+	/*
+	 * It's possible for there to be more file mapped pages than
+	 * accounted for by the pages on the file LRU lists because
+	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED
+	 */
+	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
+}
+
+/*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_pages)
+{
+	int priority;
+	/*
+	 * Free memory by calling shrink zone with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	priority = ZONE_RECLAIM_PRIORITY;
+	do {
+		note_zone_scanning_priority(zone, priority);
+		shrink_zone(priority, zone, sc);
+		priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
+ * Routine to balance unmapped pages, inspired from the code under
+ * CONFIG_NUMA that does unmapped page and slab page control by keeping
+ * min_unmapped_pages in the zone. We currently reclaim just unmapped
+ * pages, slab control will come in soon, at which point this routine
+ * should be called balance cached pages
+ */
+static unsigned long balance_unmapped_pages(int priority, struct zone *zone,
+						struct scan_control *sc)
+{
+	if (unmapped_page_control &&
+		(zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) {
+		struct scan_control nsc;
+		unsigned long nr_pages;
+
+		nsc = *sc;
+
+		nsc.swappiness = 0;
+		nsc.may_writepage = 0;
+		nsc.may_unmap = 0;
+		nsc.nr_reclaimed = 0;
+
+		nr_pages = zone_unmapped_file_pages(zone) -
+				zone->min_unmapped_pages;
+		/* Magically try to reclaim eighth the unmapped cache pages */
+		nr_pages >>= 3;
+
+		zone_reclaim_unmapped_pages(zone, &nsc, nr_pages);
+		return nsc.nr_reclaimed;
+	}
+	return 0;
+}
+
+#define UNMAPPED_PAGE_RATIO 16
+bool should_balance_unmapped_pages(struct zone *zone)
+{
+	if (unmapped_page_control &&
+		(zone_unmapped_file_pages(zone) >
+			UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages))
+		return true;
+	return false;
+}
+
+/*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).
  *
@@ -2027,6 +2136,12 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
+			/*
+			 * We do unmapped page balancing once here and once
+			 * below, so that we don't lose out
+			 */
+			balance_unmapped_pages(priority, zone, &sc);
+
 			if (!zone_watermark_ok(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
@@ -2068,6 +2183,13 @@ loop_again:
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
+
+			/*
+			 * Balance unmapped pages upfront, this should be
+			 * really cheap
+			 */
+			balance_unmapped_pages(priority, zone, &sc);
+
 			/*
 			 * Call soft limit reclaim before calling shrink_zone.
 			 * For now we ignore the return value
@@ -2289,7 +2411,8 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 
 	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
+	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0) &&
+		!should_balance_unmapped_pages(zone))
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
@@ -2456,44 +2579,12 @@ module_init(kswapd_init)
  */
 int zone_reclaim_mode __read_mostly;
 
-#define RECLAIM_OFF 0
-#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
-
-/*
- * Priority for ZONE_RECLAIM. This determines the fraction of pages
- * of a node considered for each zone_reclaim. 4 scans 1/16th of
- * a zone.
- */
-#define ZONE_RECLAIM_PRIORITY 4
-
-/*
- * Percentage of pages in a zone that must be unmapped for zone_reclaim to
- * occur.
- */
-int sysctl_min_unmapped_ratio = 1;
-
 /*
  * If the number of slab pages in a zone grows beyond this percentage then
  * slab reclaim needs to occur.
  */
 int sysctl_min_slab_ratio = 5;
 
-static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
-{
-	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
-	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
-		zone_page_state(zone, NR_ACTIVE_FILE);
-
-	/*
-	 * It's possible for there to be more file mapped pages than
-	 * accounted for by the pages on the file LRU lists because
-	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED
-	 */
-	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
-}
-
 /* Work out how many page cache pages we can reclaim in this reclaim_mode */
 static long zone_pagecache_reclaimable(struct zone *zone)
 {
@@ -2531,7 +2622,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2562,12 +2652,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
 		 */
-		priority = ZONE_RECLAIM_PRIORITY;
-		do {
-			note_zone_scanning_priority(zone, priority);
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
+		zone_reclaim_unmapped_pages(zone, &sc, nr_pages);
 	}
 
 	slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);

-- 
	Three Cheers,
	Balbir

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
From: Avi Kivity @ 2010-03-15  7:48 UTC
  To: balbir
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

On 03/15/2010 09:22 AM, Balbir Singh wrote:
> Selectively control Unmapped Page Cache (nospam version)
>
> From: Balbir Singh<balbir@linux.vnet.ibm.com>
>
> This patch implements unmapped page cache control via preferred
> page cache reclaim. The current patch hooks into kswapd and reclaims
> page cache if the user has requested for unmapped page control.
> This is useful in the following scenario
>
> - In a virtualized environment with cache!=none, we see
>    double caching - (one in the host and one in the guest). As
>    we try to scale guests, cache usage across the system grows.
>    The goal of this patch is to reclaim page cache when Linux is running
>    as a guest and get the host to hold the page cache and manage it.
>    There might be temporary duplication, but in the long run, memory
>    in the guests would be used for mapped pages.
>    

Well, for a guest, host page cache is a lot slower than guest page cache.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
From: Balbir Singh @ 2010-03-15  8:07 UTC
  To: Avi Kivity
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

* Avi Kivity <avi@redhat.com> [2010-03-15 09:48:05]:

> On 03/15/2010 09:22 AM, Balbir Singh wrote:
> >Selectively control Unmapped Page Cache (nospam version)
> >
> >From: Balbir Singh<balbir@linux.vnet.ibm.com>
> >
> >This patch implements unmapped page cache control via preferred
> >page cache reclaim. The current patch hooks into kswapd and reclaims
> >page cache if the user has requested for unmapped page control.
> >This is useful in the following scenario
> >
> >- In a virtualized environment with cache!=none, we see
> >   double caching - (one in the host and one in the guest). As
> >   we try to scale guests, cache usage across the system grows.
> >   The goal of this patch is to reclaim page cache when Linux is running
> >   as a guest and get the host to hold the page cache and manage it.
> >   There might be temporary duplication, but in the long run, memory
> >   in the guests would be used for mapped pages.
> 
> Well, for a guest, host page cache is a lot slower than guest page cache.
>

Yes, it is a virtio call away, but is the cost of paying twice in
terms of memory acceptable? One of the reasons I created a boot
parameter was to deal with selective enablement for cases where
memory is the most important resource being managed.

I do see a hit in performance with my results (please see the data
below), but the savings are quite large. The other solution mentioned
in the TODOs is to have the balloon driver invoke this path. The
sysctl also allows the guest to tune the amount of unmapped page cache
if needed.

The knobs are for

1. Selective enablement
2. Selective control of the % of unmapped pages

-- 
	Three Cheers,
	Balbir

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
From: Avi Kivity @ 2010-03-15  8:27 UTC
  To: balbir
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

On 03/15/2010 10:07 AM, Balbir Singh wrote:
> * Avi Kivity<avi@redhat.com>  [2010-03-15 09:48:05]:
>
>    
>> On 03/15/2010 09:22 AM, Balbir Singh wrote:
>>      
>>> Selectively control Unmapped Page Cache (nospam version)
>>>
>>> From: Balbir Singh<balbir@linux.vnet.ibm.com>
>>>
>>> This patch implements unmapped page cache control via preferred
>>> page cache reclaim. The current patch hooks into kswapd and reclaims
>>> page cache if the user has requested for unmapped page control.
>>> This is useful in the following scenario
>>>
>>> - In a virtualized environment with cache!=none, we see
>>>    double caching - (one in the host and one in the guest). As
>>>    we try to scale guests, cache usage across the system grows.
>>>    The goal of this patch is to reclaim page cache when Linux is running
>>>    as a guest and get the host to hold the page cache and manage it.
>>>    There might be temporary duplication, but in the long run, memory
>>>    in the guests would be used for mapped pages.
>>>        
>> Well, for a guest, host page cache is a lot slower than guest page cache.
>>
>>      
> Yes, it is a virtio call away, but is the cost of paying twice in
> terms of memory acceptable?

Usually, it isn't, which is why I recommend cache=off.

> One of the reasons I created a boot
> parameter was to deal with selective enablement for cases where
> memory is the most important resource being managed.
>
> I do see a hit in performance with my results (please see the data
> below), but the savings are quite large. The other solution mentioned
> in the TODOs is to have the balloon driver invoke this path. The
> sysctl also allows the guest to tune the amount of unmapped page cache
> if needed.
>
> The knobs are for
>
> 1. Selective enablement
> 2. Selective control of the % of unmapped pages
>    

An alternative path is to enable KSM for page cache.  Then we have 
direct read-only guest access to host page cache, without any guest 
modifications required.  That will be pretty difficult to achieve though 
- will need a readonly bit in the page cache radix tree, and teach all 
paths to honour it.
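
Very roughly, the idea might look something like the sketch below; the
PAGECACHE_TAG_RO tag is hypothetical (it does not exist, and adding it
would also mean growing RADIX_TREE_MAX_TAGS), and the hard part is still
teaching every write path to break the sharing first:

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

/* Hypothetical new radix tree tag marking a shared, read-only slot. */
#define PAGECACHE_TAG_RO	3

static bool mapping_slot_readonly(struct address_space *mapping, pgoff_t index)
{
	return radix_tree_tag_get(&mapping->page_tree, index, PAGECACHE_TAG_RO);
}

/*
 * Any path about to dirty the page would first have to break the
 * sharing (copy the page and clear the tag) - this is the hard part.
 */
static int break_readonly_sharing(struct address_space *mapping, pgoff_t index)
{
	if (!mapping_slot_readonly(mapping, index))
		return 0;
	/* copy-on-write of the shared page cache page would go here */
	return -EAGAIN;
}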

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
From: Balbir Singh @ 2010-03-15  9:17 UTC
  To: Avi Kivity
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

* Avi Kivity <avi@redhat.com> [2010-03-15 10:27:45]:

> On 03/15/2010 10:07 AM, Balbir Singh wrote:
> >* Avi Kivity<avi@redhat.com>  [2010-03-15 09:48:05]:
> >
> >>On 03/15/2010 09:22 AM, Balbir Singh wrote:
> >>>Selectively control Unmapped Page Cache (nospam version)
> >>>
> >>>From: Balbir Singh<balbir@linux.vnet.ibm.com>
> >>>
> >>>This patch implements unmapped page cache control via preferred
> >>>page cache reclaim. The current patch hooks into kswapd and reclaims
> >>>page cache if the user has requested for unmapped page control.
> >>>This is useful in the following scenario
> >>>
> >>>- In a virtualized environment with cache!=none, we see
> >>>   double caching - (one in the host and one in the guest). As
> >>>   we try to scale guests, cache usage across the system grows.
> >>>   The goal of this patch is to reclaim page cache when Linux is running
> >>>   as a guest and get the host to hold the page cache and manage it.
> >>>   There might be temporary duplication, but in the long run, memory
> >>>   in the guests would be used for mapped pages.
> >>Well, for a guest, host page cache is a lot slower than guest page cache.
> >>
> >Yes, it is a virtio call away, but is the cost of paying twice in
> >terms of memory acceptable?
> 
> Usually, it isn't, which is why I recommend cache=off.
>

cache=off works only for filesystems that support *direct I/O*, and my
concern is that one of its side effects is that idle VMs can consume a
lot of memory (assuming all the memory is available to them). As the
number of VMs grows, they could cache a whole lot of memory. In my
experiments I found that the total amount of memory cached far exceeded
the mapped ratio by a large amount when we had idle VMs. The philosophy
of this patch is to move the caching to the _host_ and let the host
maintain the cache instead of the guest.
 
> >One of the reasons I created a boot
> >parameter was to deal with selective enablement for cases where
> >memory is the most important resource being managed.
> >
> >I do see a hit in performance with my results (please see the data
> >below), but the savings are quite large. The other solution mentioned
> >in the TODOs is to have the balloon driver invoke this path. The
> >sysctl also allows the guest to tune the amount of unmapped page cache
> >if needed.
> >
> >The knobs are for
> >
> >1. Selective enablement
> >2. Selective control of the % of unmapped pages
> 
> An alternative path is to enable KSM for page cache.  Then we have
> direct read-only guest access to host page cache, without any guest
> modifications required.  That will be pretty difficult to achieve
> though - will need a readonly bit in the page cache radix tree, and
> teach all paths to honour it.
> 

Yes, it is, I've taken a quick look. I am not sure de-duplication
would be the best approach; maybe dropping the page from the host page
cache would be a good first step (a rough sketch follows). Data
consistency would be much easier to maintain that way: as long as the
guest is not writing frequently to that page, we don't need the page
cache in the host.
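
For the dropping part, the host side may not even need kernel changes to
experiment with this; a qemu-like userspace could hint the host kernel
after serving a guest read, along the lines of the sketch below
(illustration only, error handling and readahead interactions ignored):

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/*
 * Sketch: once the data for a guest read has been copied out, tell the
 * host kernel we do not expect to need the backing pages again, so the
 * host's copy of the cache is dropped early instead of lingering.
 */
static void drop_host_cache_range(int fd, off_t offset, off_t len)
{
	/* best effort - this is only a hint, so ignore errors */
	(void)posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}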

> -- 
> Do not meddle in the internals of kernels, for they are subtle and quick to panic.
> 

-- 
	Three Cheers,
	Balbir

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
From: Avi Kivity @ 2010-03-15  9:27 UTC
  To: balbir
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

On 03/15/2010 11:17 AM, Balbir Singh wrote:
> * Avi Kivity<avi@redhat.com>  [2010-03-15 10:27:45]:
>
>    
>> On 03/15/2010 10:07 AM, Balbir Singh wrote:
>>      
>>> * Avi Kivity<avi@redhat.com>   [2010-03-15 09:48:05]:
>>>
>>>        
>>>> On 03/15/2010 09:22 AM, Balbir Singh wrote:
>>>>          
>>>>> Selectively control Unmapped Page Cache (nospam version)
>>>>>
>>>>> From: Balbir Singh<balbir@linux.vnet.ibm.com>
>>>>>
>>>>> This patch implements unmapped page cache control via preferred
>>>>> page cache reclaim. The current patch hooks into kswapd and reclaims
>>>>> page cache if the user has requested for unmapped page control.
>>>>> This is useful in the following scenario
>>>>>
>>>>> - In a virtualized environment with cache!=none, we see
>>>>>    double caching - (one in the host and one in the guest). As
>>>>>    we try to scale guests, cache usage across the system grows.
>>>>>    The goal of this patch is to reclaim page cache when Linux is running
>>>>>    as a guest and get the host to hold the page cache and manage it.
>>>>>    There might be temporary duplication, but in the long run, memory
>>>>>    in the guests would be used for mapped pages.
>>>>>            
>>>> Well, for a guest, host page cache is a lot slower than guest page cache.
>>>>
>>>>          
>>> Yes, it is a virtio call away, but is the cost of paying twice in
>>> terms of memory acceptable?
>>>        
>> Usually, it isn't, which is why I recommend cache=off.
>>
>>      
> cache=off works for *direct I/O* supported filesystems and my concern is that
> one of the side-effects is that idle VM's can consume a lot of memory
> (assuming all the memory is available to them). As the number of VM's
> grow, they could cache a whole lot of memory. In my experiments I
> found that the total amount of memory cached far exceeded the mapped
> ratio by a large amount when we had idle VM's. The philosophy of this
> patch is to move the caching to the _host_ and let the host maintain
> the cache instead of the guest.
>    

That's only beneficial if the cache is shared.  Otherwise, you could use 
the balloon to evict cache when memory is tight.

Shared cache is mostly a desktop thing, where users run similar 
workloads.  For servers, it's much less likely, so a modified guest 
doesn't help a lot here.

>>> One of the reasons I created a boot
>>> parameter was to deal with selective enablement for cases where
>>> memory is the most important resource being managed.
>>>
>>> I do see a hit in performance with my results (please see the data
>>> below), but the savings are quite large. The other solution mentioned
>>> in the TODOs is to have the balloon driver invoke this path. The
>>> sysctl also allows the guest to tune the amount of unmapped page cache
>>> if needed.
>>>
>>> The knobs are for
>>>
>>> 1. Selective enablement
>>> 2. Selective control of the % of unmapped pages
>>>        
>> An alternative path is to enable KSM for page cache.  Then we have
>> direct read-only guest access to host page cache, without any guest
>> modifications required.  That will be pretty difficult to achieve
>> though - will need a readonly bit in the page cache radix tree, and
>> teach all paths to honour it.
>>
>>      
> Yes, it is, I've taken a quick look. I am not sure if de-duplication
> would be the best approach, may be dropping the page in the page cache
> might be a good first step. Data consistency would be much easier to
> maintain that way, as long as the guest is not writing frequently to
> that page, we don't need the page cache in the host.
>    

Trimming the host page cache should happen automatically under 
pressure.  Since the page is cached by the guest, it won't be re-read, 
so the host page is not frequently used and is eventually dropped.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
From: Balbir Singh @ 2010-03-15 10:45 UTC
  To: Avi Kivity
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

* Avi Kivity <avi@redhat.com> [2010-03-15 11:27:56]:

> >>>The knobs are for
> >>>
> >>>1. Selective enablement
> >>>2. Selective control of the % of unmapped pages
> >>An alternative path is to enable KSM for page cache.  Then we have
> >>direct read-only guest access to host page cache, without any guest
> >>modifications required.  That will be pretty difficult to achieve
> >>though - will need a readonly bit in the page cache radix tree, and
> >>teach all paths to honour it.
> >>
> >Yes, it is, I've taken a quick look. I am not sure if de-duplication
> >would be the best approach, may be dropping the page in the page cache
> >might be a good first step. Data consistency would be much easier to
> >maintain that way, as long as the guest is not writing frequently to
> >that page, we don't need the page cache in the host.
> 
> Trimming the host page cache should happen automatically under
> pressure.  Since the page is cached by the guest, it won't be
> re-read, so the host page is not frequently used and then dropped.
>

Yes, agreed, but dropping is easier than tagging cache as read-only
and getting everybody to understand read-only cached pages. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15  7:22 ` Balbir Singh
@ 2010-03-15 15:46   ` Randy Dunlap
  -1 siblings, 0 replies; 98+ messages in thread
From: Randy Dunlap @ 2010-03-15 15:46 UTC (permalink / raw)
  To: balbir
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

On Mon, 15 Mar 2010 12:52:15 +0530 Balbir Singh wrote:

> Selectively control Unmapped Page Cache (nospam version)
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> This patch implements unmapped page cache control via preferred
> page cache reclaim. The current patch hooks into kswapd and reclaims
> page cache if the user has requested for unmapped page control.
> This is useful in the following scenario
> 
> - In a virtualized environment with cache!=none, we see
>   double caching - (one in the host and one in the guest). As
>   we try to scale guests, cache usage across the system grows.
>   The goal of this patch is to reclaim page cache when Linux is running
>   as a guest and get the host to hold the page cache and manage it.
>   There might be temporary duplication, but in the long run, memory
>   in the guests would be used for mapped pages.
> - The option is controlled via a boot option and the administrator
>   can selectively turn it on, on a need to use basis.
> 
> A lot of the code is borrowed from zone_reclaim_mode logic for
> __zone_reclaim(). One might argue that the with ballooning and
> KSM this feature is not very useful, but even with ballooning,
> we need extra logic to balloon multiple VM machines and it is hard
> to figure out the correct amount of memory to balloon. With these
> patches applied, each guest has a sufficient amount of free memory
> available, that can be easily seen and reclaimed by the balloon driver.
> The additional memory in the guest can be reused for additional
> applications or used to start additional guests/balance memory in
> the host.
> 
> KSM currently does not de-duplicate host and guest page cache. The goal
> of this patch is to help automatically balance unmapped page cache when
> instructed to do so.
> 
> There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
> and the number of pages to reclaim when unmapped_page_control argument
> is supplied. These numbers were chosen to avoid aggressiveness in
> reaping page cache ever so frequently, at the same time providing control.
> 
> The sysctl for min_unmapped_ratio provides further control from
> within the guest on the amount of unmapped pages to reclaim.
> 
> The patch is applied against mmotm feb-11-2010.

Hi,
If you go ahead with this, please add the boot parameter & its description
to Documentation/kernel-parameters.txt.
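
A minimal sketch of such an entry, assuming the parameter keeps the name
unmapped_page_control used in the patch description (wording is
illustrative only):

    unmapped_page_control
                    [KNL] Reclaim unmapped page cache via kswapd,
                    bounded by the vm.min_unmapped_ratio sysctl.
                    Intended for guests whose page cache is expected
                    to live in the host.  Disabled by default.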


> TODOS
> -----
> 1. Balance slab cache as well
> 2. Invoke the balance routines from the balloon driver

---
~Randy

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15  9:27           ` Avi Kivity
@ 2010-03-15 18:48             ` Anthony Liguori
  -1 siblings, 0 replies; 98+ messages in thread
From: Anthony Liguori @ 2010-03-15 18:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel

On 03/15/2010 04:27 AM, Avi Kivity wrote:
>
> That's only beneficial if the cache is shared.  Otherwise, you could 
> use the balloon to evict cache when memory is tight.
>
> Shared cache is mostly a desktop thing where users run similar 
> workloads.  For servers, it's much less likely.  So a modified-guest 
> doesn't help a lot here.

Not really.  In many cloud environments, there's a set of common images 
that are instantiated on each node.  Usually this is because you're 
running a horizontally scalable application or because you're supporting 
an ephemeral storage model.

In fact, with ephemeral storage, you typically want to use 
cache=writeback since you aren't providing data guarantees across 
shutdown/failure.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15  8:27       ` Avi Kivity
@ 2010-03-15 20:23         ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-15 20:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel

Avi Kivity <avi@redhat.com> writes:

> On 03/15/2010 10:07 AM, Balbir Singh wrote:
>
> >Yes, it is a virtio call away, but is the cost of paying twice in
> >terms of memory acceptable?
> 
> Usually, it isn't, which is why I recommend cache=off.

Hi Avi. One observation about your recommendation for cache=none:

We run hosts of VMs accessing drives backed by logical volumes carved out
from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
twenty virtual machines, which pretty much fill the available memory on the
host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
caching turned on get advertised to the guest as having a write-cache, and
FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
isn't acting as cache=neverflush like it would have done a year ago. I know
that comparing performance for cache=none against that unsafe behaviour
would be somewhat unfair!)

Wasteful duplication of page cache between guest and host notwithstanding,
turning on cache=writeback is a spectacular performance win for our guests.
For example, even IDE with cache=writeback easily beats virtio with
cache=none in most of the guest filesystem performance tests I've tried. The
anecdotal feedback from clients is also very strongly in favour of
cache=writeback.

With a host full of cache=none guests, IO contention between guests is
hugely problematic with non-stop seek from the disks to service tiny
O_DIRECT writes (especially without virtio), many of which needn't have been
synchronous if only there had been some way for the guest OS to tell qemu
that. Running with cache=writeback seems to reduce the frequency of disk
flush per guest to a much more manageable level, and to allow the host's
elevator to optimise writing out across the guests in between these flushes.
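
To make the comparison concrete, here is a rough sketch of the two
host-side write patterns involved.  It is an illustration only, not qemu
source; error handling is omitted and the path is a placeholder for the
backing volume:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* cache=writeback: guest writes land in the host page cache and are
     * only forced out when the guest issues FLUSH, which qemu turns
     * into fsync()/fdatasync() on the backing file. */
    void writeback_style(const char *path)
    {
        char buf[4096] = { 0 };
        int fd = open(path, O_RDWR);
        pwrite(fd, buf, sizeof(buf), 0);   /* buffered in host RAM */
        fdatasync(fd);                     /* happens only on guest FLUSH */
        close(fd);
    }

    /* cache=none: O_DIRECT bypasses the host page cache, so every guest
     * write becomes an immediate request on the underlying device.
     * Buffers, offsets and lengths must be sector-aligned. */
    void direct_style(const char *path)
    {
        void *buf;
        int fd = open(path, O_RDWR | O_DIRECT);
        posix_memalign(&buf, 4096, 4096);
        memset(buf, 0, 4096);
        pwrite(fd, buf, 4096, 0);          /* goes straight to the device */
        free(buf);
        close(fd);
    }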

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15 20:23         ` Chris Webb
@ 2010-03-15 23:43           ` Anthony Liguori
  -1 siblings, 0 replies; 98+ messages in thread
From: Anthony Liguori @ 2010-03-15 23:43 UTC (permalink / raw)
  To: Chris Webb
  Cc: Avi Kivity, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On 03/15/2010 03:23 PM, Chris Webb wrote:
> Avi Kivity<avi@redhat.com>  writes:
>
>    
>> On 03/15/2010 10:07 AM, Balbir Singh wrote:
>>
>>      
>>> Yes, it is a virtio call away, but is the cost of paying twice in
>>> terms of memory acceptable?
>>>        
>> Usually, it isn't, which is why I recommend cache=off.
>>      
> Hi Avi. One observation about your recommendation for cache=none:
>
> We run hosts of VMs accessing drives backed by logical volumes carved out
> from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
> twenty virtual machines, which pretty much fill the available memory on the
> host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
> caching turned on get advertised to the guest as having a write-cache, and
> FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
> isn't acting as cache=neverflush like it would have done a year ago. I know
> that comparing performance for cache=none against that unsafe behaviour
> would be somewhat unfair!)
>    

I knew someone would do this...

This really gets down to your definition of "safe" behaviour.  As it 
stands, if you suffer a power outage, it may lead to guest corruption.

While we are correct in advertising a write-cache, write-caches are 
volatile and should a drive lose power, it could lead to data 
corruption.  Enterprise disks tend to have battery backed write caches 
to prevent this.

In the set up you're emulating, the host is acting as a giant write 
cache.  Should your host fail, you can get data corruption.

cache=writethrough provides a much stronger data guarantee.  Even in the 
event of a host failure, data integrity will be preserved.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15 23:43           ` Anthony Liguori
@ 2010-03-16  0:43             ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-16  0:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chris Webb, Avi Kivity, balbir, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote:
> I knew someone would do this...
>
> This really gets down to your definition of "safe" behaviour.  As it  
> stands, if you suffer a power outage, it may lead to guest corruption.
>
> While we are correct in advertising a write-cache, write-caches are  
> volatile and should a drive lose power, it could lead to data  
> corruption.  Enterprise disks tend to have battery backed write caches  
> to prevent this.
>
> In the set up you're emulating, the host is acting as a giant write  
> cache.  Should your host fail, you can get data corruption.
>
> cache=writethrough provides a much stronger data guarantee.  Even in the  
> event of a host failure, data integrity will be preserved.

Actually cache=writeback is as safe as any normal host is with a
volatile disk cache, except that in this case the disk cache is
actually a lot larger.  With a properly implemented filesystem this
will never cause corruption.  You will lose recent updates after
the last sync/fsync/etc up to the size of the cache, but filesystem
metadata should never be corrupted, and data that has been forced to
disk using fsync/O_SYNC should never be lost either.  If it is that's
a bug somewhere in the stack, but in my powerfail testing we never did
so using xfs or ext3/4 after I fixed up the fsync code in the latter
two.
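
Expressed as application code, that contract is roughly the following
(a sketch only; error handling omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Data is only guaranteed to survive a power failure once fsync()
     * (or a write through an O_SYNC descriptor) has returned. */
    void durable_write(const char *path, const char *data)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, data, strlen(data));
        fsync(fd);              /* force data and inode to stable storage */
        close(fd);

        /* a newly created file also needs its directory entry flushed */
        int dfd = open(".", O_RDONLY | O_DIRECTORY);
        fsync(dfd);
        close(dfd);
    }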


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16  0:43             ` Christoph Hellwig
@ 2010-03-16  1:27               ` Anthony Liguori
  -1 siblings, 0 replies; 98+ messages in thread
From: Anthony Liguori @ 2010-03-16  1:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Webb, Avi Kivity, balbir, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On 03/15/2010 07:43 PM, Christoph Hellwig wrote:
> On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote:
>    
>> I knew someone would do this...
>>
>> This really gets down to your definition of "safe" behaviour.  As it
>> stands, if you suffer a power outage, it may lead to guest corruption.
>>
>> While we are correct in advertising a write-cache, write-caches are
>> volatile and should a drive lose power, it could lead to data
>> corruption.  Enterprise disks tend to have battery backed write caches
>> to prevent this.
>>
>> In the set up you're emulating, the host is acting as a giant write
>> cache.  Should your host fail, you can get data corruption.
>>
>> cache=writethrough provides a much stronger data guarantee.  Even in the
>> event of a host failure, data integrity will be preserved.
>>      
> Actually cache=writeback is as safe as any normal host is with a
> volatile disk cache, except that in this case the disk cache is
> actually a lot larger.  With a properly implemented filesystem this
> will never cause corruption.

Metadata corruption, not necessarily corruption of data stored in a file.

>    You will lose recent updates after
> the last sync/fsync/etc up to the size of the cache, but filesystem
> metadata should never be corrupted, and data that has been forced to
> disk using fsync/O_SYNC should never be lost either.

Not all software uses fsync as much as they should.  And often times, 
it's for good reason (like ext3).  This is mitigated by the fact that 
there's usually a short window of time before metadata is flushed to 
disk.  Adding another layer increases that delay.

IIUC, an O_DIRECT write using cache=writeback is not actually on the 
spindle when the write() completes.  Rather, an explicit fsync() would 
be required.  That will cause data corruption in many applications (like 
databases) regardless of whether the fs gets metadata corruption.

You could argue that the software should disable writeback caching on 
the virtual disk, but we don't currently support that so even if the 
application did, it's not going to help.

Regards,

Anthony Liguori

>    If it is that's
> a bug somewhere in the stack, but in my powerfail testing we never did
> so using xfs or ext3/4 after I fixed up the fsync code in the latter
> two.
>
>    


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15 20:23         ` Chris Webb
@ 2010-03-16  3:16           ` Balbir Singh
  -1 siblings, 0 replies; 98+ messages in thread
From: Balbir Singh @ 2010-03-16  3:16 UTC (permalink / raw)
  To: Chris Webb
  Cc: Avi Kivity, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

* Chris Webb <chris@arachsys.com> [2010-03-15 20:23:54]:

> Avi Kivity <avi@redhat.com> writes:
> 
> > On 03/15/2010 10:07 AM, Balbir Singh wrote:
> >
> > >Yes, it is a virtio call away, but is the cost of paying twice in
> > >terms of memory acceptable?
> > 
> > Usually, it isn't, which is why I recommend cache=off.
> 
> Hi Avi. One observation about your recommendation for cache=none:
> 
> We run hosts of VMs accessing drives backed by logical volumes carved out
> from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
> twenty virtual machines, which pretty much fill the available memory on the
> host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
> caching turned on get advertised to the guest as having a write-cache, and
> FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
> isn't acting as cache=neverflush like it would have done a year ago. I know
> that comparing performance for cache=none against that unsafe behaviour
> would be somewhat unfair!)
> 
> Wasteful duplication of page cache between guest and host notwithstanding,
> turning on cache=writeback is a spectacular performance win for our guests.
> For example, even IDE with cache=writeback easily beats virtio with
> cache=none in most of the guest filesystem performance tests I've tried. The
> anecdotal feedback from clients is also very strongly in favour of
> cache=writeback.
> 
> With a host full of cache=none guests, IO contention between guests is
> hugely problematic with non-stop seek from the disks to service tiny
> O_DIRECT writes (especially without virtio), many of which needn't have been
> synchronous if only there had been some way for the guest OS to tell qemu
> that. Running with cache=writeback seems to reduce the frequency of disk
> flush per guest to a much more manageable level, and to allow the host's
> elevator to optimise writing out across the guests in between these flushes.

Thanks for the input above, it is extremely useful. The goal of these
patches is to allow double caching with cache != none when it is needed,
and then to slowly take away unmapped pages, pushing the caching to the
host. There are knobs to control how much is reclaimed, and the whole
feature is enabled via a boot parameter.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15 15:46   ` Randy Dunlap
@ 2010-03-16  3:21     ` Balbir Singh
  -1 siblings, 0 replies; 98+ messages in thread
From: Balbir Singh @ 2010-03-16  3:21 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

* Randy Dunlap <randy.dunlap@oracle.com> [2010-03-15 08:46:31]:

> On Mon, 15 Mar 2010 12:52:15 +0530 Balbir Singh wrote:
> 
> Hi,
> If you go ahead with this, please add the boot parameter & its description
> to Documentation/kernel-parameters.txt.
>

I certainly will, thanks for keeping a watch. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16  1:27               ` Anthony Liguori
@ 2010-03-16  8:19                 ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-16  8:19 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Christoph Hellwig, Chris Webb, Avi Kivity, balbir,
	KVM development list, Rik van Riel, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel

On Mon, Mar 15, 2010 at 08:27:25PM -0500, Anthony Liguori wrote:
>> Actually cache=writeback is as safe as any normal host is with a
>> volatile disk cache, except that in this case the disk cache is
>> actually a lot larger.  With a properly implemented filesystem this
>> will never cause corruption.
>
> Metadata corruption, not necessarily corruption of data stored in a file.

Again, this will not cause metadata corruption either, provided the
filesystem does not lose barriers, although we may lose up to the cache
size of new data or metadata operations.  The consistency of the
filesystem is still guaranteed.

> Not all software uses fsync as much as they should.  And often times,  
> it's for good reason (like ext3).

If an application needs data on disk it must call fsync, or there
is no guarantee at all, even on ext3.  And with growing disk caches
these issues show up on normal disks often enough that people have
realized it by now.


> IIUC, an O_DIRECT write using cache=writeback is not actually on the  
> spindle when the write() completes.  Rather, an explicit fsync() would  
> be required.  That will cause data corruption in many applications (like  
> databases) regardless of whether the fs gets metadata corruption.

It isn't on the spindle for O_DIRECT without qemu involved either.  The
O_DIRECT write goes through the disk cache and requires an explicit
fsync or the O_SYNC open flag to make sure it goes to disk.
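
A small sketch of those two variants, an explicit flush after the
O_DIRECT write or the O_SYNC open flag (illustration only; alignment
requirements simplified, error handling omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Two ways to make an O_DIRECT write survive a volatile disk (or
     * host) write cache.  O_DIRECT buffers, offsets and lengths must be
     * sector-aligned. */
    void direct_durable(const char *path)
    {
        void *buf;
        posix_memalign(&buf, 4096, 4096);
        memset(buf, 0, 4096);

        /* variant 1: plain O_DIRECT, then an explicit flush */
        int fd = open(path, O_WRONLY | O_DIRECT);
        pwrite(fd, buf, 4096, 0);
        fdatasync(fd);              /* drains the drive/host write cache */
        close(fd);

        /* variant 2: O_SYNC makes each write durable before it returns */
        fd = open(path, O_WRONLY | O_DIRECT | O_SYNC);
        pwrite(fd, buf, 4096, 0);   /* completes only once on stable storage */
        close(fd);

        free(buf);
    }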


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15 18:48             ` Anthony Liguori
@ 2010-03-16  9:05               ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-16  9:05 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel

On 03/15/2010 08:48 PM, Anthony Liguori wrote:
> On 03/15/2010 04:27 AM, Avi Kivity wrote:
>>
>> That's only beneficial if the cache is shared.  Otherwise, you could 
>> use the balloon to evict cache when memory is tight.
>>
>> Shared cache is mostly a desktop thing where users run similar 
>> workloads.  For servers, it's much less likely.  So a modified-guest 
>> doesn't help a lot here.
>
> Not really.  In many cloud environments, there's a set of common 
> images that are instantiated on each node.  Usually this is because 
> you're running a horizontally scalable application or because you're 
> supporting an ephemeral storage model.

But will these servers actually benefit from shared cache?  So the 
images are shared, they boot up, what then?

- apache really won't like serving static files from the host pagecache
- dynamic content (java, cgi) will be mostly in anonymous memory, not 
pagecache
- ditto for application servers
- what else are people doing?

> In fact, with ephemeral storage, you typically want to use 
> cache=writeback since you aren't providing data guarantees across 
> shutdown/failure.

Interesting point.

We'd need a cache=volatile for this use case to avoid the fdatasync()s 
we do now.  Also useful for -snapshot.  In fact I have a patch for this 
somewhere I can dig out.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15 20:23         ` Chris Webb
@ 2010-03-16  9:17           ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-16  9:17 UTC (permalink / raw)
  To: Chris Webb
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, Christoph Hellwig, Kevin Wolf

On 03/15/2010 10:23 PM, Chris Webb wrote:
> Avi Kivity<avi@redhat.com>  writes:
>
>    
>> On 03/15/2010 10:07 AM, Balbir Singh wrote:
>>
>>      
>>> Yes, it is a virtio call away, but is the cost of paying twice in
>>> terms of memory acceptable?
>>>        
>> Usually, it isn't, which is why I recommend cache=off.
>>      
> Hi Avi. One observation about your recommendation for cache=none:
>
> We run hosts of VMs accessing drives backed by logical volumes carved out
> from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
> twenty virtual machines, which pretty much fill the available memory on the
> host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
> caching turned on get advertised to the guest as having a write-cache, and
> FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
> isn't acting as cache=neverflush like it would have done a year ago. I know
> that comparing performance for cache=none against that unsafe behaviour
> would be somewhat unfair!)
>
> Wasteful duplication of page cache between guest and host notwithstanding,
> turning on cache=writeback is a spectacular performance win for our guests.
> For example, even IDE with cache=writeback easily beats virtio with
> cache=none in most of the guest filesystem performance tests I've tried. The
> anecdotal feedback from clients is also very strongly in favour of
> cache=writeback.
>    

Is this with qcow2, raw file, or direct volume access?

I can understand it for qcow2, but for direct volume access this 
shouldn't happen.  The guest schedules as many writes as it can, 
followed by a sync.  The host (and disk) can then reschedule them 
whether they are in the writeback cache or in the block layer, and must 
sync in the same way once completed.

Perhaps what we need is bdrv_aio_submit() which can take a number of 
requests.  For direct volume access, this allows easier reordering 
(io_submit() should plug the queues before it starts processing and 
unplug them when done, though I don't see the code for this?).  For 
qcow2, we can coalesce metadata updates for multiple requests into one 
RMW (for example, a sequential write split into multiple 64K-256K write 
requests).

Christoph/Kevin?
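
As a rough illustration of the batching idea (not the proposed qemu
interface), this is what handing several requests to the kernel in a
single io_submit() call looks like with libaio; the fd is assumed to be
opened with O_DIRECT, buffers are left zero-filled, error handling is
omitted, and the program links with -laio:

    #include <libaio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NR_REQS 4

    void submit_batch(int fd)
    {
        io_context_t ctx = 0;
        struct iocb cbs[NR_REQS], *cbp[NR_REQS];
        struct io_event events[NR_REQS];
        void *buf[NR_REQS];
        int i;

        io_setup(NR_REQS, &ctx);
        for (i = 0; i < NR_REQS; i++) {
            posix_memalign(&buf[i], 4096, 4096);
            memset(buf[i], 0, 4096);
            io_prep_pwrite(&cbs[i], fd, buf[i], 4096, (long long)i * 4096);
            cbp[i] = &cbs[i];
        }
        /* one system call queues all four writes, so the block layer
         * sees them together and can merge or reorder them */
        io_submit(ctx, NR_REQS, cbp);
        io_getevents(ctx, NR_REQS, NR_REQS, events, NULL);
        for (i = 0; i < NR_REQS; i++)
            free(buf[i]);
        io_destroy(ctx);
    }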

> With a host full of cache=none guests, IO contention between guests is
> hugely problematic with non-stop seek from the disks to service tiny
> O_DIRECT writes (especially without virtio), many of which needn't have been
> synchronous if only there had been some way for the guest OS to tell qemu
> that. Running with cache=writeback seems to reduce the frequency of disk
> flush per guest to a much more manageable level, and to allow the host's
> elevator to optimise writing out across the guests in between these flushes.
>    

The host eventually has to turn the writes into synchronous writes, no 
way around that.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16  9:17           ` Avi Kivity
@ 2010-03-16  9:54             ` Kevin Wolf
  -1 siblings, 0 replies; 98+ messages in thread
From: Kevin Wolf @ 2010-03-16  9:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Christoph Hellwig

Am 16.03.2010 10:17, schrieb Avi Kivity:
> On 03/15/2010 10:23 PM, Chris Webb wrote:
>> Avi Kivity<avi@redhat.com>  writes:
>>
>>    
>>> On 03/15/2010 10:07 AM, Balbir Singh wrote:
>>>
>>>      
>>>> Yes, it is a virtio call away, but is the cost of paying twice in
>>>> terms of memory acceptable?
>>>>        
>>> Usually, it isn't, which is why I recommend cache=off.
>>>      
>> Hi Avi. One observation about your recommendation for cache=none:
>>
>> We run hosts of VMs accessing drives backed by logical volumes carved out
>> from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
>> twenty virtual machines, which pretty much fill the available memory on the
>> host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
>> caching turned on get advertised to the guest as having a write-cache, and
>> FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
>> isn't acting as cache=neverflush like it would have done a year ago. I know
>> that comparing performance for cache=none against that unsafe behaviour
>> would be somewhat unfair!)
>>
>> Wasteful duplication of page cache between guest and host notwithstanding,
>> turning on cache=writeback is a spectacular performance win for our guests.
>> For example, even IDE with cache=writeback easily beats virtio with
>> cache=none in most of the guest filesystem performance tests I've tried. The
>> anecdotal feedback from clients is also very strongly in favour of
>> cache=writeback.
>>    
> 
> Is this with qcow2, raw file, or direct volume access?
> 
> I can understand it for qcow2, but for direct volume access this 
> shouldn't happen.  The guest schedules as many writes as it can, 
> followed by a sync.  The host (and disk) can then reschedule them 
> whether they are in the writeback cache or in the block layer, and must 
> sync in the same way once completed.
> 
> Perhaps what we need is bdrv_aio_submit() which can take a number of 
> requests.  For direct volume access, this allows easier reordering 
> (io_submit() should plug the queues before it starts processing and 
> unplug them when done, though I don't see the code for this?).  For 
> qcow2, we can coalesce metadata updates for multiple requests into one 
> RMW (for example, a sequential write split into multiple 64K-256K write 
> requests).

We already do merge sequential writes back into one larger request. So
this is in fact a case that wouldn't benefit from such changes. It may
help for other cases. But even if it did, coalescing metadata writes in
qcow2 sounds like a good way to mess up, so I'd stay with doing it only
for the data itself.

Apart from that, wouldn't your points apply to writeback as well?

Kevin

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16  9:54             ` Kevin Wolf
@ 2010-03-16 10:16               ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-16 10:16 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Christoph Hellwig

On 03/16/2010 11:54 AM, Kevin Wolf wrote:
>
>> Is this with qcow2, raw file, or direct volume access?
>>
>> I can understand it for qcow2, but for direct volume access this
>> shouldn't happen.  The guest schedules as many writes as it can,
>> followed by a sync.  The host (and disk) can then reschedule them
>> whether they are in the writeback cache or in the block layer, and must
>> sync in the same way once completed.
>>
>> Perhaps what we need is bdrv_aio_submit() which can take a number of
>> requests.  For direct volume access, this allows easier reordering
>> (io_submit() should plug the queues before it starts processing and
>> unplug them when done, though I don't see the code for this?).  For
>> qcow2, we can coalesce metadata updates for multiple requests into one
>> RMW (for example, a sequential write split into multiple 64K-256K write
>> requests).
>>      
> We already do merge sequential writes back into one larger request. So
> this is in fact a case that wouldn't benefit from such changes.

I'm not happy with that.  It increases overall latency.  With qcow2 it's 
fine, but I'd let requests to raw volumes flow unaltered.

> It may
> help for other cases. But even if it did, coalescing metadata writes in
> qcow2 sounds like a good way to mess up, so I'd stay with doing it only
> for the data itself.
>    

I don't see why.

> Apart from that, wouldn't your points apply to writeback as well?
>    

They do, but for writeback the host kernel already does all the 
coalescing/merging/blah for us.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16  9:17           ` Avi Kivity
@ 2010-03-16 10:26             ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-16 10:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Christoph Hellwig,
	Kevin Wolf

Avi,

cache=writeback can be faster than cache=none for the same reasons
a disk cache speeds up access.  As long as the I/O mix contains more
asynchronous than synchronous writes it allows the host to do much
more reordering, only limited by the cache size (which can be quite
huge when using the host pagecache) and the amount of cache flushes
coming from the host.  If you have an fsync-heavy workload or metadata
operations with a filesystem like the current XFS you will get lots
of cache flushes that limit the usefulness of the additional cache.

If you don't have a lot of cache flushes, e.g. due to dumb
applications that do not issue fsync, or because ext3 in its default
mode never issues cache flushes, the benefit will be enormous, but so
will the potential data loss and corruption.

But even for something like btrfs, which does provide data integrity
and issues cache flushes fairly efficiently, cache=writeback may
provide quite a nice speedup, especially with multiple guests
accessing the same spindle(s).

But I wouldn't be surprised if IBM's extreme differences are indeed due
to the extremely unsafe ext3 default behaviour.
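
A minimal host-side sketch of the difference, using nothing qemu-specific
(plain POSIX calls; the file names and sizes are made up, error handling
omitted):

  /*
   * cache=none path: O_DIRECT, each write is sent to the device before
   * the call returns.  cache=writeback path: buffered writes sit in the
   * host pagecache and only reach the device on fdatasync() or background
   * writeout, so the host can merge and reorder them in the meantime.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      void *buf;
      int direct, buffered, i;

      posix_memalign(&buf, 4096, 4096);        /* O_DIRECT needs alignment */
      memset(buf, 0, 4096);

      direct = open("direct.img", O_RDWR | O_CREAT | O_DIRECT, 0600);
      pwrite(direct, buf, 4096, 0);            /* synchronous on the host  */

      buffered = open("buffered.img", O_RDWR | O_CREAT, 0600);
      for (i = 0; i < 256; i++)                /* lands in the pagecache   */
          pwrite(buffered, buf, 4096, (off_t)i * 4096);
      fdatasync(buffered);                     /* only now hits the disk   */

      close(direct);
      close(buffered);
      free(buf);
      return 0;
  }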

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16 10:26             ` Christoph Hellwig
@ 2010-03-16 10:36               ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-16 10:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Kevin Wolf

On 03/16/2010 12:26 PM, Christoph Hellwig wrote:
> Avi,
>
> cache=writeback can be faster than cache=none for the same reasons
> a disk cache speeds up access.  As long as the I/O mix contains more
> asynchronous than synchronous writes it allows the host to do much
> more reordering, only limited by the cache size (which can be quite
> huge when using the host pagecache) and the amount of cache flushes
> coming from the host.  If you have an fsync-heavy workload or metadata
> operations with a filesystem like the current XFS you will get lots
> of cache flushes that limit the usefulness of the additional cache.
>    

Are you talking about direct volume access or qcow2?

For direct volume access, I still don't get it.  The number of barriers 
issued by the host must equal (or exceed, but that's pointless) the 
number of barriers issued by the guest.  cache=writeback allows the host 
to reorder writes, but so does cache=none.  Where does the difference 
come from?

Put it another way.  In an unvirtualized environment, if you implement a 
write cache in a storage driver (not device), and sync it on a barrier 
request, would you expect to see a performance improvement?


> If you don't have a lot of cache flushes, e.g. due to dumb
> applications that do not issue fsync, or because ext3 in its default
> mode never issues cache flushes, the benefit will be enormous, but so
> will the potential data loss and corruption.
>    

Shouldn't the host never issue cache flushes in this case? (for direct 
volume access; qcow2 still needs flushes for metadata integrity).

> But even for something like btrfs, which does provide data integrity
> and issues cache flushes fairly efficiently, cache=writeback may
> provide quite a nice speedup, especially with multiple guests
> accessing the same spindle(s).
>
> But I wouldn't be surprised if IBM's extreme differences are indeed due
> to the extremely unsafe ext3 default behaviour.
>    


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16 10:36               ` Avi Kivity
@ 2010-03-16 10:44                 ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-16 10:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Hellwig, Chris Webb, balbir, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Kevin Wolf

On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
> Are you talking about direct volume access or qcow2?

Doesn't matter.

> For direct volume access, I still don't get it.  The number of barriers 
> issued by the host must equal (or exceed, but that's pointless) the 
> number of barriers issued by the guest.  cache=writeback allows the host 
> to reorder writes, but so does cache=none.  Where does the difference 
> come from?
> 
> Put it another way.  In an unvirtualized environment, if you implement a 
> write cache in a storage driver (not device), and sync it on a barrier 
> request, would you expect to see a performance improvement?

cache=none only allows very limited reordering in the host.  O_DIRECT
is synchronous on the host, so there's just some very limited reordering
going on in the elevator if we have other I/O going on in parallel.
In addition to that the disk writecache can perform limited reordering
and caching, but the disk cache has a rather limited size.  The host
pagecache gives a much wider opportunity to reorder, especially if
the guest workload is not cache flush heavy.  If the guest workload
is extremely cache flush heavy the usefulness of the pagecache is rather
limited, as we'll only use very little of it, but pay by having to do
a data copy.  If the workload is not cache flush heavy, and we have
multiple guests doing I/O to the same spindles it will allow the host
to do much more efficient data writeout by being able to do better
ordered (less seeky) and bigger I/O (especially if the host has real
storage compared to ide for the guest).

> >If you don't have a lot of cache flushes, e.g. due to dumb
> >applications that do not issue fsync, or because ext3 in its default
> >mode never issues cache flushes, the benefit will be enormous, but so
> >will the potential data loss and corruption.
> >   
> 
> Shouldn't the host never issue cache flushes in this case? (for direct 
> volume access; qcow2 still needs flushes for metadata integrity).

If the guest never issues a flush the host will neither, indeed.  Data
will only go to disk by background writeout or memory pressure.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16 10:44                 ` Christoph Hellwig
@ 2010-03-16 11:08                   ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-16 11:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Kevin Wolf

On 03/16/2010 12:44 PM, Christoph Hellwig wrote:
> On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
>    
>> Are you talking about direct volume access or qcow2?
>>      
> Doesn't matter.
>
>    
>> For direct volume access, I still don't get it.  The number of barriers
>> issued by the host must equal (or exceed, but that's pointless) the
>> number of barriers issued by the guest.  cache=writeback allows the host
>> to reorder writes, but so does cache=none.  Where does the difference
>> come from?
>>
>> Put it another way.  In an unvirtualized environment, if you implement a
>> write cache in a storage driver (not device), and sync it on a barrier
>> request, would you expect to see a performance improvement?
>>      
> cache=none only allows very limited reordering in the host.  O_DIRECT
> is synchronous on the host, so there's just some very limited reordering
> going on in the elevator if we have other I/O going on in parallel.
>    

Presumably there is lots of I/O going on, or we wouldn't be having this 
conversation.

> In addition to that the disk writecache can perform limited reordering
> and caching, but the disk cache has a rather limited size.  The host
> pagecache gives a much wider opportunity to reorder, especially if
> the guest workload is not cache flush heavy.  If the guest workload
> is extremely cache flush heavy the usefulness of the pagecache is rather
> limited, as we'll only use very little of it, but pay by having to do
> a data copy.  If the workload is not cache flush heavy, and we have
> multiple guests doing I/O to the same spindles it will allow the host
> to do much more efficient data writeout by being able to do better
> ordered (less seeky) and bigger I/O (especially if the host has real
> storage compared to ide for the guest).
>    

Let's assume the guest has virtio (I agree with IDE we need reordering 
on the host).  The guest sends batches of I/O separated by cache 
flushes.  If the batches are smaller than the virtio queue length, 
ideally things look like:

  io_submit(..., batch_size_1);
  io_getevents(..., batch_size_1);
  fdatasync();
  io_submit(..., batch_size_2);
   io_getevents(..., batch_size_2);
   fdatasync();
   io_submit(..., batch_size_3);
   io_getevents(..., batch_size_3);
   fdatasync();

(certainly that won't happen today, but it could in principle).
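
A concrete version of that pattern with Linux AIO could look like the
sketch below (libaio, link with -laio; the device path, batch count and
I/O size are arbitrary placeholders, error handling omitted):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <libaio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define BATCH 64
  #define IOSZ  4096

  int main(void)
  {
      io_context_t ctx = 0;
      struct iocb iocbs[BATCH], *ptrs[BATCH];
      struct io_event events[BATCH];
      void *buf;
      int fd, batch, i;

      fd = open("/dev/vg0/guest-disk", O_RDWR | O_DIRECT);
      posix_memalign(&buf, IOSZ, IOSZ);
      memset(buf, 0, IOSZ);
      io_setup(BATCH, &ctx);

      for (batch = 0; batch < 3; batch++) {
          /* io_submit(..., batch_size_n): hand the whole batch over */
          for (i = 0; i < BATCH; i++) {
              io_prep_pwrite(&iocbs[i], fd, buf, IOSZ,
                             (long long)(batch * BATCH + i) * IOSZ);
              ptrs[i] = &iocbs[i];
          }
          io_submit(ctx, BATCH, ptrs);

          /* io_getevents(..., batch_size_n): wait for the whole batch */
          io_getevents(ctx, BATCH, BATCH, events, NULL);

          /* fdatasync(): the cache flush separating the batches */
          fdatasync(fd);
      }

      io_destroy(ctx);
      close(fd);
      free(buf);
      return 0;
  }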

How does a write cache give any advantage?  The host kernel sees 
_exactly_ the same information as it would from a bunch of threaded 
pwritev()s followed by fdatasync().

(wish: IO_CMD_ORDERED_FDATASYNC)

If the batch size is larger than the virtio queue size, or if there are 
no flushes at all, then yes the huge write cache gives more opportunity 
for reordering.  But we're already talking hundreds of requests here.

Let's say the virtio queue size was unlimited.  What merging/reordering 
opportunity are we missing on the host?  Again we have exactly the same 
information: either the pagecache lru + radix tree that identifies all 
dirty pages in disk order, or the block queue with pending requests that 
contains exactly the same information.

Something is wrong.  Maybe it's my understanding, but on the other hand 
it may be a piece of kernel code.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16 11:08                   ` Avi Kivity
@ 2010-03-16 14:27                     ` Balbir Singh
  -1 siblings, 0 replies; 98+ messages in thread
From: Balbir Singh @ 2010-03-16 14:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Hellwig, Chris Webb, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Kevin Wolf

* Avi Kivity <avi@redhat.com> [2010-03-16 13:08:28]:

> On 03/16/2010 12:44 PM, Christoph Hellwig wrote:
> >On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
> >>Are you talking about direct volume access or qcow2?
> >Doesn't matter.
> >
> >>For direct volume access, I still don't get it.  The number of barriers
> >>issued by the host must equal (or exceed, but that's pointless) the
> >>number of barriers issued by the guest.  cache=writeback allows the host
> >>to reorder writes, but so does cache=none.  Where does the difference
> >>come from?
> >>
> >>Put it another way.  In an unvirtualized environment, if you implement a
> >>write cache in a storage driver (not device), and sync it on a barrier
> >>request, would you expect to see a performance improvement?
> >cache=none only allows very limited reordering in the host.  O_DIRECT
> >is synchronous on the host, so there's just some very limited reordering
> >going on in the elevator if we have other I/O going on in parallel.
> 
> Presumably there is lots of I/O going on, or we wouldn't be having
> this conversation.
>

We are speaking of multiple VMs doing I/O in parallel.
 
> >In addition to that the disk writecache can perform limited reordering
> >and caching, but the disk cache has a rather limited size.  The host
> >pagecache gives a much wider opportunity to reorder, especially if
> >the guest workload is not cache flush heavy.  If the guest workload
> >is extremely cache flush heavy the usefulness of the pagecache is rather
> >limited, as we'll only use very little of it, but pay by having to do
> >a data copy.  If the workload is not cache flush heavy, and we have
> >multiple guests doing I/O to the same spindles it will allow the host
> >to do much more efficient data writeout by being able to do better
> >ordered (less seeky) and bigger I/O (especially if the host has real
> >storage compared to ide for the guest).
> 
> Let's assume the guest has virtio (I agree with IDE we need
> reordering on the host).  The guest sends batches of I/O separated
> by cache flushes.  If the batches are smaller than the virtio queue
> length, ideally things look like:
> 
>  io_submit(..., batch_size_1);
>  io_getevents(..., batch_size_1);
>  fdatasync();
>  io_submit(..., batch_size_2);
>   io_getevents(..., batch_size_2);
>   fdatasync();
>   io_submit(..., batch_size_3);
>   io_getevents(..., batch_size_3);
>   fdatasync();
> 
> (certainly that won't happen today, but it could in principle).
>
> How does a write cache give any advantage?  The host kernel sees
> _exactly_ the same information as it would from a bunch of threaded
> pwritev()s followed by fdatasync().
>

Are you suggesting that the model with cache=writeback gives us the
same I/O pattern as cache=none, so there are no opportunities for
optimization?
 
> (wish: IO_CMD_ORDERED_FDATASYNC)
> 
> If the batch size is larger than the virtio queue size, or if there
> are no flushes at all, then yes the huge write cache gives more
> opportunity for reordering.  But we're already talking hundreds of
> requests here.
> 
> Let's say the virtio queue size was unlimited.  What
> merging/reordering opportunity are we missing on the host?  Again we
> have exactly the same information: either the pagecache lru + radix
> tree that identifies all dirty pages in disk order, or the block
> queue with pending requests that contains exactly the same
> information.
> 
> Something is wrong.  Maybe it's my understanding, but on the other
> hand it may be a piece of kernel code.
> 

I assume you are talking of dedicated disk partitions and not
individual disk images residing on the same partition.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16 14:27                     ` Balbir Singh
@ 2010-03-16 15:59                       ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-16 15:59 UTC (permalink / raw)
  To: balbir
  Cc: Christoph Hellwig, Chris Webb, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Kevin Wolf

On 03/16/2010 04:27 PM, Balbir Singh wrote:
>
>> Let's assume the guest has virtio (I agree with IDE we need
>> reordering on the host).  The guest sends batches of I/O separated
>> by cache flushes.  If the batches are smaller than the virtio queue
>> length, ideally things look like:
>>
>>   io_submit(..., batch_size_1);
>>   io_getevents(..., batch_size_1);
>>   fdatasync();
>>   io_submit(..., batch_size_2);
>>    io_getevents(..., batch_size_2);
>>    fdatasync();
>>    io_submit(..., batch_size_3);
>>    io_getevents(..., batch_size_3);
>>    fdatasync();
>>
>> (certainly that won't happen today, but it could in principle).
>>
>> How does a write cache give any advantage?  The host kernel sees
>> _exactly_ the same information as it would from a bunch of threaded
>> pwritev()s followed by fdatasync().
>>
>>      
> Are you suggesting that the model with cache=writeback gives us the
> same I/O pattern as cache=none, so there are no opportunities for
> optimization?
>    

Yes.  The guest also has a large cache with the same optimization algorithm.

>
>    
>> (wish: IO_CMD_ORDERED_FDATASYNC)
>>
>> If the batch size is larger than the virtio queue size, or if there
>> are no flushes at all, then yes the huge write cache gives more
>> opportunity for reordering.  But we're already talking hundreds of
>> requests here.
>>
>> Let's say the virtio queue size was unlimited.  What
>> merging/reordering opportunity are we missing on the host?  Again we
>> have exactly the same information: either the pagecache lru + radix
>> tree that identifies all dirty pages in disk order, or the block
>> queue with pending requests that contains exactly the same
>> information.
>>
>> Something is wrong.  Maybe it's my understanding, but on the other
>> hand it may be a piece of kernel code.
>>
>>      
> I assume you are talking of dedicated disk partitions and not
> individual disk images residing on the same partition.
>    

Correct. Images in files introduce new writes which can be optimized.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16 11:08                   ` Avi Kivity
@ 2010-03-17  8:49                     ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-17  8:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Hellwig, Chris Webb, balbir, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Kevin Wolf

On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote:
> If the batch size is larger than the virtio queue size, or if there are 
> no flushes at all, then yes the huge write cache gives more opportunity 
> for reordering.  But we're already talking hundreds of requests here.

Yes.  And remember those don't have to come from the same host.  Also
remember that we rather limit excessive reordering of O_DIRECT requests
in the I/O scheduler because they are "synchronous" type I/O while
we don't do that for pagecache writeback.

And we don't have unlimited virtio queue size, in fact it's quite
limited.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17  8:49                     ` Christoph Hellwig
@ 2010-03-17  9:10                       ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-17  9:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Kevin Wolf

On 03/17/2010 10:49 AM, Christoph Hellwig wrote:
> On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote:
>    
>> If the batch size is larger than the virtio queue size, or if there are
>> no flushes at all, then yes the huge write cache gives more opportunity
>> for reordering.  But we're already talking hundreds of requests here.
>>      
> Yes.  And remember those don't have to come from the same host.  Also
> remember that we rather limit excessive reordering of O_DIRECT requests
> in the I/O scheduler because they are "synchronous" type I/O while
> we don't do that for pagecache writeback.
>    

Maybe we should relax that for kvm.  Perhaps some of the problem comes 
from the fact that we call io_submit() once per request.
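
In other words, roughly the difference between the two submission patterns
below (sketch only; ctx, iocbs and nr stand for whatever the caller has
already prepared):

  /* today: one system call per request, little for the host to batch */
  for (i = 0; i < nr; i++)
      io_submit(ctx, 1, &iocbs[i]);

  /* instead: hand the whole set to the kernel at once */
  io_submit(ctx, nr, iocbs);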

> And we don't have unlimited virtio queue size, in fact it's quite
> limited.
>    

That can be extended easily if it fixes the problem.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-15 23:43           ` Anthony Liguori
@ 2010-03-17 15:14             ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-17 15:14 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Anthony Liguori <anthony@codemonkey.ws> writes:

> This really gets down to your definition of "safe" behaviour.  As it
> stands, if you suffer a power outage, it may lead to guest
> corruption.
> 
> While we are correct in advertising a write-cache, write-caches are
> volatile and should a drive lose power, it could lead to data
> corruption.  Enterprise disks tend to have battery backed write
> caches to prevent this.
> 
> In the set up you're emulating, the host is acting as a giant write
> cache.  Should your host fail, you can get data corruption.

Hi Anthony. I suspected my post might spark an interesting discussion!

Before considering anything like this, we did quite a bit of testing with
OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
NTFS filesystems despite these efforts.

Is your claim here that:-

  (a) qemu doesn't emulate a disk write cache correctly; or

  (b) operating systems are inherently unsafe running on top of a disk with
      a write-cache; or

  (c) installations that are already broken and lose data with a physical
      drive with a write-cache can lose much more in this case because the
      write cache is much bigger?

Following Christoph Hellwig's patch series from last September, I'm pretty
convinced that (a) isn't true apart from the inability to disable the
write-cache at run-time, which is something that neither recent linux nor
windows seem to want to do out of the box.

Given that modern SATA drives come with fairly substantial write-caches
nowadays which operating systems leave on without widespread disaster, I
don't really believe in (b) either, at least for the ide and scsi case.
Filesystems know they have to flush the disk cache to avoid corruption.
(Virtio makes the write cache invisible to the OS except in linux 2.6.32+ so
I know virtio-blk has to be avoided for current windows and obsolete linux
when writeback caching is on.)

I can certainly imagine (c) might be the case, although when I use strace to
watch the IO to the block device, I see pretty regular fdatasyncs being
issued by the guests, interleaved with the writes, so I'm not sure how
likely the problem would be in practice. Perhaps my test guests were
unrepresentatively well-behaved.

However, the potentially unlimited time-window for loss of incorrectly
unsynced data is also something one could imagine fixing at the qemu level.
Perhaps I should be implementing something like
cache=writeback,flushtimeout=N which, upon a write being issued to the block
device, starts an N second timer if it isn't already running. The timer is
destroyed on flush, and if it expires before it's destroyed, a gratuitous
flush is sent. Do you think this is worth doing? Just a simple 'while sleep
10; do sync; done' on the host even!
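
A crude host-side stand-in for that idea, pending anything in qemu itself,
would be to bound the unsynced-data window per backing device; the device
path and interval below are placeholders:

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      /* backing device of the guest in question (placeholder path) */
      int fd = open("/dev/vg0/guest-disk", O_RDWR);

      for (;;) {
          sleep(10);        /* roughly "flushtimeout=10"              */
          fdatasync(fd);    /* push out whatever the guest left dirty */
      }
  }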

We've used cache=none and cache=writethrough, and whilst performance is fine
with a single guest accessing a disk, when we chop the disks up with LVM and
run even a small handful of guests, the constant seeking to serve tiny
synchronous IOs leads to truly abysmal throughput---we've seen less than
700kB/s streaming write rates within guests when the backing store is
capable of 100MB/s.

With cache=writeback, there's still IO contention between guests, but the
write granularity is a bit coarser, so the host's elevator seems to get a
bit more of a chance to help us out and we can at least squeeze out 5-10MB/s
from two or three concurrently running guests, getting a total of 20-30% of
the performance of the underlying block device rather than a total of around
5%.

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16  9:17           ` Avi Kivity
@ 2010-03-17 15:24             ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-17 15:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, Christoph Hellwig, Kevin Wolf

Avi Kivity <avi@redhat.com> writes:

> On 03/15/2010 10:23 PM, Chris Webb wrote:
>
> >Wasteful duplication of page cache between guest and host notwithstanding,
> >turning on cache=writeback is a spectacular performance win for our guests.
> 
> Is this with qcow2, raw file, or direct volume access?

This is with direct access to logical volumes. No file systems or qcow2 in
the stack. Our typical host has a couple of SATA disks, combined in md
RAID1, chopped up into volumes with LVM2 (really just dm linear targets).
The performance measured outside qemu is excellent, inside qemu-kvm is fine
too until multiple guests are trying to access their drives at once, but
then everything starts to grind badly.

> I can understand it for qcow2, but for direct volume access this
> shouldn't happen.  The guest schedules as many writes as it can,
> followed by a sync.  The host (and disk) can then reschedule them
> whether they are in the writeback cache or in the block layer, and
> must sync in the same way once completed.

I don't really understand what's going on here, but I wonder if the
underlying problem might be that all the O_DIRECT/O_SYNC writes from the
guests go down into the same block device at the bottom of the device mapper
stack, and thus can't be reordered with respect to one another. For our
purposes,

  Guest AA   Guest BB       Guest AA   Guest BB       Guest AA   Guest BB
  write A1                  write A1                             write B1
             write B1       write A2                  write A1
  write A2                             write B1       write A2

are all equivalent, but the system isn't allowed to reorder in this way
because there isn't a separate request queue for each logical volume, just
the one at the bottom. (I don't know whether nested request queues would
behave remotely reasonably either, though!)

Also, if my guest kernel issues (say) three small writes, one at the start
of the disk, one in the middle, one at the end, and then does a flush, can
virtio really express this as one non-contiguous O_DIRECT write (the three
components of which can be reordered by the elevator with respect to one
another) rather than three distinct O_DIRECT writes which can't be permuted?
Can qemu issue a write like that? cache=writeback + flush allows this to be
optimised by the block layer in the normal way.

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 15:14             ` Chris Webb
@ 2010-03-17 15:55               ` Anthony Liguori
  -1 siblings, 0 replies; 98+ messages in thread
From: Anthony Liguori @ 2010-03-17 15:55 UTC (permalink / raw)
  To: Chris Webb
  Cc: Avi Kivity, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On 03/17/2010 10:14 AM, Chris Webb wrote:
> Anthony Liguori<anthony@codemonkey.ws>  writes:
>
>    
>> This really gets down to your definition of "safe" behaviour.  As it
>> stands, if you suffer a power outage, it may lead to guest
>> corruption.
>>
>> While we are correct in advertising a write-cache, write-caches are
>> volatile and should a drive lose power, it could lead to data
>> corruption.  Enterprise disks tend to have battery backed write
>> caches to prevent this.
>>
>> In the set up you're emulating, the host is acting as a giant write
>> cache.  Should your host fail, you can get data corruption.
>>      
> Hi Anthony. I suspected my post might spark an interesting discussion!
>
> Before considering anything like this, we did quite a bit of testing with
> OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
> power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
> NTFS filesystems despite these efforts.
>
> Is your claim here that:-
>
>    (a) qemu doesn't emulate a disk write cache correctly; or
>
>    (b) operating systems are inherently unsafe running on top of a disk with
>        a write-cache; or
>
>    (c) installations that are already broken and lose data with a physical
>        drive with a write-cache can lose much more in this case because the
>        write cache is much bigger?
>    

This is the closest to being accurate.

It basically boils down to this: most enterprises use disks with 
battery backed write caches.  Having the host act as a giant write cache 
means that you can lose data.

I agree that a well behaved file system will not become corrupt, but my 
contention is that for many types of applications, data loss == 
corruption and not all file systems are well behaved.  And it's 
certainly valid to argue about whether common filesystems are "broken" 
but from a purely pragmatic perspective, this is going to be the case.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 15:24             ` Chris Webb
@ 2010-03-17 16:22               ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-17 16:22 UTC (permalink / raw)
  To: Chris Webb
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, Christoph Hellwig, Kevin Wolf

On 03/17/2010 05:24 PM, Chris Webb wrote:
> Avi Kivity<avi@redhat.com>  writes:
>
>    
>> On 03/15/2010 10:23 PM, Chris Webb wrote:
>>
>>      
>>> Wasteful duplication of page cache between guest and host notwithstanding,
>>> turning on cache=writeback is a spectacular performance win for our guests.
>>>        
>> Is this with qcow2, raw file, or direct volume access?
>>      
> This is with direct access to logical volumes. No file systems or qcow2 in
> the stack. Our typical host has a couple of SATA disks, combined in md
> RAID1, chopped up into volumes with LVM2 (really just dm linear targets).
> The performance measured outside qemu is excellent, inside qemu-kvm is fine
> too until multiple guests are trying to access their drives at once, but
> then everything starts to grind badly.
>
>    

OK.

>> I can understand it for qcow2, but for direct volume access this
>> shouldn't happen.  The guest schedules as many writes as it can,
>> followed by a sync.  The host (and disk) can then reschedule them
>> whether they are in the writeback cache or in the block layer, and
>> must sync in the same way once completed.
>>      
> I don't really understand what's going on here, but I wonder if the
> underlying problem might be that all the O_DIRECT/O_SYNC writes from the
> guests go down into the same block device at the bottom of the device mapper
> stack, and thus can't be reordered with respect to one another.

They should be reorderable.  Otherwise host filesystems on several 
volumes would suffer the same problems.

Whether the filesystem is in the host or guest shouldn't matter.

> For our
> purposes,
>
>    Guest AA   Guest BB       Guest AA   Guest BB       Guest AA   Guest BB
>    write A1                  write A1                             write B1
>               write B1       write A2                  write A1
>    write A2                             write B1       write A2
>
> are all equivalent, but the system isn't allowed to reorder in this way
> because there isn't a separate request queue for each logical volume, just
> the one at the bottom. (I don't know whether nested request queues would
> behave remotely reasonably either, though!)
>
> Also, if my guest kernel issues (say) three small writes, one at the start
> of the disk, one in the middle, one at the end, and then does a flush, can
> virtio really express this as one non-contiguous O_DIRECT write (the three
> components of which can be reordered by the elevator with respect to one
> another) rather than three distinct O_DIRECT writes which can't be permuted?
> Can qemu issue a write like that? cache=writeback + flush allows this to be
> optimised by the block layer in the normal way.
>    

Guest side virtio will send this as three requests followed by a flush.  
Qemu will issue these as three distinct requests and then flush.  The 
requests are marked, as Christoph says, in a way that limits their 
reorderability, and perhaps if we fix these two problems performance 
will improve.
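
As a purely illustrative sketch (not qemu's actual submission path), the
"three distinct requests, then a flush" pattern looks roughly like this
with Linux AIO against an O_DIRECT file descriptor; the scratch path,
offsets and sizes are made up, and it needs -laio to build:

/* three_writes_one_flush.c - illustrative only, not qemu code */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(void)
{
    const char *path = "scratch.img";   /* hypothetical scratch file */
    long long off[3] = { 0, 512LL << 20, 1024LL << 20 };
    struct iocb cb[3], *cbs[3];
    struct io_event ev[3];
    io_context_t ctx = 0;
    void *buf[3];
    int fd, i, ret;

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    ret = io_setup(8, &ctx);
    if (ret < 0) { fprintf(stderr, "io_setup: %d\n", ret); return 1; }

    for (i = 0; i < 3; i++) {
        if (posix_memalign(&buf[i], BLK, BLK)) return 1;
        memset(buf[i], 'A' + i, BLK);
        io_prep_pwrite(&cb[i], fd, buf[i], BLK, off[i]);
        cbs[i] = &cb[i];
    }

    /* three distinct write requests... */
    ret = io_submit(ctx, 3, cbs);
    if (ret != 3) { fprintf(stderr, "io_submit: %d\n", ret); return 1; }
    ret = io_getevents(ctx, 3, 3, ev, NULL);
    if (ret != 3) { fprintf(stderr, "io_getevents: %d\n", ret); return 1; }

    /* ...then a single cache flush covering all of them */
    if (fdatasync(fd))
        perror("fdatasync");

    io_destroy(ctx);
    close(fd);
    return 0;
}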

Something that comes to mind is merging of flush requests.  If N guests 
issue one write and one flush each, we should issue N writes and just 
one flush - a flush for the disk applies to all volumes on that disk.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 15:55               ` Anthony Liguori
@ 2010-03-17 16:27                 ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-17 16:27 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Anthony Liguori <anthony@codemonkey.ws> writes:

> On 03/17/2010 10:14 AM, Chris Webb wrote:
> >   (c) installations that are already broken and lose data with a physical
> >       drive with a write-cache can lose much more in this case because the
> >       write cache is much bigger?
> 
> This is the closest to being accurate.
> 
> It basically boils down to this: most enterprises use disks with
> battery backed write caches.  Having the host act as a giant write
> cache means that you can lose data.
> 
> I agree that a well behaved file system will not become corrupt, but
> my contention is that for many types of applications, data loss ==
> corruption and not all file systems are well behaved.  And it's
> certainly valid to argue about whether common filesystems are
> "broken" but from a purely pragmatic perspective, this is going to
> be the case.

Okay. What I was driving at in describing these systems as 'already broken'
is that they will already lose data (in this sense) if they're run on bare
metal with normal commodity SATA disks with their 32MB write caches on. That
configuration surely describes the vast majority of PC-class desktops and
servers!

If I understand correctly, your point here is that the small cache on a real
SATA drive gives a relatively small time window for data loss, whereas the
worry with cache=writeback is that the host page cache can be gigabytes, so
the time window for unsynced data to be lost is potentially enormous.

Isn't the fix for that just forcing a periodic sync on the host, to put an
upper bound on the time window for unsynced data loss in the guest?

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 15:55               ` Anthony Liguori
@ 2010-03-17 16:27                 ` Balbir Singh
  -1 siblings, 0 replies; 98+ messages in thread
From: Balbir Singh @ 2010-03-17 16:27 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Chris Webb, Avi Kivity, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

* Anthony Liguori <anthony@codemonkey.ws> [2010-03-17 10:55:47]:

> On 03/17/2010 10:14 AM, Chris Webb wrote:
> >Anthony Liguori<anthony@codemonkey.ws>  writes:
> >
> >>This really gets down to your definition of "safe" behaviour.  As it
> >>stands, if you suffer a power outage, it may lead to guest
> >>corruption.
> >>
> >>While we are correct in advertising a write-cache, write-caches are
> >>volatile and should a drive lose power, it could lead to data
> >>corruption.  Enterprise disks tend to have battery backed write
> >>caches to prevent this.
> >>
> >>In the set up you're emulating, the host is acting as a giant write
> >>cache.  Should your host fail, you can get data corruption.
> >Hi Anthony. I suspected my post might spark an interesting discussion!
> >
> >Before considering anything like this, we did quite a bit of testing with
> >OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
> >power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
> >NTFS filesystems despite these efforts.
> >
> >Is your claim here that:-
> >
> >   (a) qemu doesn't emulate a disk write cache correctly; or
> >
> >   (b) operating systems are inherently unsafe running on top of a disk with
> >       a write-cache; or
> >
> >   (c) installations that are already broken and lose data with a physical
> >       drive with a write-cache can lose much more in this case because the
> >       write cache is much bigger?
> 
> This is the closest to being accurate.
> 
> It basically boils down to this: most enterprises use disks with
> battery backed write caches.  Having the host act as a giant write
> cache means that you can lose data.
> 

Dirty limits can help control how much we lose, but also affect how
much we write out.
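
For example, something along these lines on the host (the values are
purely illustrative):

  # /etc/sysctl.conf
  vm.dirty_background_ratio = 5       # start background writeback earlier
  vm.dirty_ratio = 10                 # cap dirty memory before writers must write back
  vm.dirty_expire_centisecs = 1000    # write back data that has been dirty >10s
  vm.dirty_writeback_centisecs = 500  # wake the flusher threads every 5s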

> I agree that a well behaved file system will not become corrupt, but
> my contention is that for many types of applications, data loss ==
> corruption and not all file systems are well behaved.  And it's
> certainly valid to argue about whether common filesystems are
> "broken" but from a purely pragmatic perspective, this is going to
> be the case.
>

I think it is a trade-off for end users to decide on. cache=writeback
does provide performance benefits, but can cause data loss.


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:22               ` Avi Kivity
@ 2010-03-17 16:40                 ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-17 16:40 UTC (permalink / raw)
  To: Chris Webb
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, Christoph Hellwig, Kevin Wolf

On 03/17/2010 06:22 PM, Avi Kivity wrote:
>> Also, if my guest kernel issues (say) three small writes, one at the 
>> start
>> of the disk, one in the middle, one at the end, and then does a 
>> flush, can
>> virtio really express this as one non-contiguous O_DIRECT write (the 
>> three
>> components of which can be reordered by the elevator with respect to one
>> another) rather than three distinct O_DIRECT writes which can't be 
>> permuted?
>> Can qemu issue a write like that? cache=writeback + flush allows this 
>> to be
>> optimised by the block layer in the normal way.
>
>
> Guest side virtio will send this as three requests followed by a 
> flush.  Qemu will issue these as three distinct requests and then 
> flush.  The requests are marked, as Christoph says, in a way that 
> limits their reorderability, and perhaps if we fix these two problems 
> performance will improve.
>
> Something that comes to mind is merging of flush requests.  If N 
> guests issue one write and one flush each, we should issue N writes 
> and just one flush - a flush for the disk applies to all volumes on 
> that disk.
>

Chris, can you carry out an experiment?  Write a program that pwrite()s 
a byte to a file at the same location repeatedly, with the file opened 
using O_SYNC.  Measure the write rate, and run blktrace on the host to 
see what the disk (/dev/sda, not the volume) sees.  Should be a (write, 
flush, write, flush) per pwrite pattern or similar (for writing the data 
and a journal block, perhaps even three writes will be needed).

Then scale this across multiple guests, measure and trace again.  If 
we're lucky, the flushes will be coalesced, if not, we need to work on it.
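
Something like this minimal sketch should do (path and iteration count are
just placeholders):

/* o_sync_rewrite.c - rewrite one byte at the same offset with O_SYNC */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "testfile";   /* placeholder */
    int iterations = (argc > 2) ? atoi(argv[2]) : 1000;
    struct timespec t0, t1;
    double secs;
    int fd, i;

    fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++) {
        if (pwrite(fd, "x", 1, 0) != 1) { perror("pwrite"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d O_SYNC pwrites in %.2fs (%.1f/s)\n",
           iterations, secs, iterations / secs);
    close(fd);
    return 0;
}

On the host, something like "blktrace -d /dev/sda -o - | blkparse -i -"
while this runs should show the resulting write/flush pattern.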

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:40                 ` Avi Kivity
@ 2010-03-17 16:47                   ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-17 16:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, Christoph Hellwig, Kevin Wolf

Avi Kivity <avi@redhat.com> writes:

> Chris, can you carry out an experiment?  Write a program that
> pwrite()s a byte to a file at the same location repeatedly, with the
> file opened using O_SYNC.  Measure the write rate, and run blktrace
> on the host to see what the disk (/dev/sda, not the volume) sees.
> Should be a (write, flush, write, flush) per pwrite pattern or
> similar (for writing the data and a journal block, perhaps even
> three writes will be needed).
> 
> Then scale this across multiple guests, measure and trace again.  If
> we're lucky, the flushes will be coalesced, if not, we need to work
> on it.

Sure, sounds like an excellent plan. I don't have a test machine at the
moment as the last host I was using for this has gone into production, but
I'm due to get another one to install later today or first thing tomorrow
which would be ideal for doing this. I'll follow up with the results once I
have them.

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:22               ` Avi Kivity
@ 2010-03-17 16:52                 ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-17 16:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Christoph Hellwig,
	Kevin Wolf

On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote:
> They should be reorderable.  Otherwise host filesystems on several 
> volumes would suffer the same problems.

They are reorderable, just not as extremely as the page cache.
Remember that the request queue really is just a relatively small queue
of outstanding I/O, and that is absolutely intentional.  Large scale
_caching_ is done by the VM in the pagecache, with all the usual aging,
pressure, etc. algorithms applied to it.  Each block device has a
relatively small fixed-size request queue associated with it, to
facilitate request merging, limited reordering, and having fully
set up I/O requests ready for the device.
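
For reference, that per-device queue depth is visible and tunable from
userspace on the host; the device name here is just an example:

  cat /sys/block/sda/queue/nr_requests
  echo 512 > /sys/block/sda/queue/nr_requests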


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:47                   ` Chris Webb
@ 2010-03-17 16:53                     ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-17 16:53 UTC (permalink / raw)
  To: Chris Webb
  Cc: balbir, KVM development list, Rik van Riel, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, Christoph Hellwig, Kevin Wolf

On 03/17/2010 06:47 PM, Chris Webb wrote:
> Avi Kivity<avi@redhat.com>  writes:
>
>    
>> Chris, can you carry out an experiment?  Write a program that
>> pwrite()s a byte to a file at the same location repeatedly, with the
>> file opened using O_SYNC.  Measure the write rate, and run blktrace
>> on the host to see what the disk (/dev/sda, not the volume) sees.
>> Should be a (write, flush, write, flush) per pwrite pattern or
>> similar (for writing the data and a journal block, perhaps even
>> three writes will be needed).
>>
>> Then scale this across multiple guests, measure and trace again.  If
>> we're lucky, the flushes will be coalesced, if not, we need to work
>> on it.
>>      
> Sure, sounds like an excellent plan. I don't have a test machine at the
> moment as the last host I was using for this has gone into production, but
> I'm due to get another one to install later today or first thing tomorrow
> which would be ideal for doing this. I'll follow up with the results once I
> have them.
>    

Meanwhile I looked at the code, and it looks bad.  There is an 
IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before 
issuing it.  In any case, qemu doesn't use it as far as I could tell, 
and even if it did, device-mapper doesn't implement the needed 
->aio_fsync() operation.

So, there's a lot of plumbing needed before we can get cache flushes 
merged into each other.  Given that cache=writeback does allow merging, I 
think we've explained at least part of the problem.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:40                 ` Avi Kivity
@ 2010-03-17 16:57                   ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-17 16:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Christoph Hellwig,
	Kevin Wolf

On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote:
> Chris, can you carry out an experiment?  Write a program that pwrite()s 
> a byte to a file at the same location repeatedly, with the file opened 
> using O_SYNC.  Measure the write rate, and run blktrace on the host to 
> see what the disk (/dev/sda, not the volume) sees.  Should be a (write, 
> flush, write, flush) per pwrite pattern or similar (for writing the data 
> and a journal block, perhaps even three writes will be needed).
> 
> Then scale this across multiple guests, measure and trace again.  If 
> we're lucky, the flushes will be coalesced, if not, we need to work on it.

As the person who has written quite a bit of the current O_SYNC
implementation and also reviewed the rest of it, I can tell you that
those flushes won't be coalesced.  If we always rewrite the same block
we do the cache flush from the fsync method and there is nothing
to coalesce there.  If you actually do modify metadata (e.g. by
using the new real O_SYNC that I introduced in 2.6.33, instead of the
old flag that was always really O_DSYNC; the new flag isn't picked up
by userspace yet) you might hit a very limited transaction merging
window in some filesystems, but it's generally very small for a good
reason.  If it were too large we'd make one process wait for I/O in
another just because we might expect transactions to coalesce later.
There was a long discussion about tuning that fsync transaction
batching for ext3 a while ago.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:53                     ` Avi Kivity
@ 2010-03-17 16:58                       ` Christoph Hellwig
  -1 siblings, 0 replies; 98+ messages in thread
From: Christoph Hellwig @ 2010-03-17 16:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Christoph Hellwig,
	Kevin Wolf

On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote:
> Meanwhile I looked at the code, and it looks bad.  There is an 
> IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before 
> issuing it.  In any case, qemu doesn't use it as far as I could tell, 
>> and even if it did, device-mapper doesn't implement the needed
> ->aio_fsync() operation.

No one implements it, and all surrounding code is dead wood.  It would
require us to do asynchronous pagecache operations, which would involve
major surgery on the VM code.  Patches to do this were rejected multiple
times.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:52                 ` Christoph Hellwig
@ 2010-03-17 17:02                   ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-17 17:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Kevin Wolf

On 03/17/2010 06:52 PM, Christoph Hellwig wrote:
> On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote:
>    
>> They should be reorderable.  Otherwise host filesystems on several
>> volumes would suffer the same problems.
>>      
> They are reordable, just not as extremly as the the page cache.
> Remember that the request queue really is just a relatively small queue
> of outstanding I/O, and that is absolutely intentional.  Large scale
> _caching_ is done by the VM in the pagecache, with all the usual aging,
> pressure, etc algorithms applied to it.

We already have the large scale caching and stuff running in the guest.  
We have a stream of optimized requests coming out of the guests; running the 
same algorithm again shouldn't improve things.  The host has an 
opportunity to do inter-guest optimization, but given that each guest has its 
own disk area, I don't see how any reordering or merging could help here 
(beyond sorting guests according to disk order).

> The block devices have a
> relatively small fixed size request queue associated with it to
> facilitate request merging and limited reordering and having fully
> set up I/O requests for the device.
>    

We should enlarge the queues, increase request reorderability, and merge 
flushes (delay flushes until after unrelated writes, then adjacent 
flushes can be collapsed).

Collapsing flushes should get us better than linear scaling (since we 
collapse N writes + M flushes into N writes and 1 flush).  However, the 
writes themselves scale worse than linearly, since they now span a 
larger disk space and cause higher seek penalties.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:58                       ` Christoph Hellwig
@ 2010-03-17 17:03                         ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-17 17:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Kevin Wolf

On 03/17/2010 06:58 PM, Christoph Hellwig wrote:
> On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote:
>    
>> Meanwhile I looked at the code, and it looks bad.  There is an
>> IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before
>> issuing it.  In any case, qemu doesn't use it as far as I could tell,
>> and even if it did, device-mapper doesn't implement the needed
>> ->aio_fsync() operation.
>>      
> No one implements it, and all surrounding code is dead wood.  It would
> require us to do asynchronous pagecache operations, which involve
> major surgery of the VM code.  Patches to do this were rejected multiple
> times.
>    

Pity.  What about the O_DIRECT aio case?  It's ridiculous that you can 
submit async write requests but have to wait synchronously for them to 
actually hit the disk if you have a write cache.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 15:14             ` Chris Webb
@ 2010-03-17 17:05               ` Vivek Goyal
  -1 siblings, 0 replies; 98+ messages in thread
From: Vivek Goyal @ 2010-03-17 17:05 UTC (permalink / raw)
  To: Chris Webb
  Cc: Anthony Liguori, Avi Kivity, balbir, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Wed, Mar 17, 2010 at 03:14:10PM +0000, Chris Webb wrote:
> Anthony Liguori <anthony@codemonkey.ws> writes:
> 
> > This really gets down to your definition of "safe" behaviour.  As it
> > stands, if you suffer a power outage, it may lead to guest
> > corruption.
> > 
> > While we are correct in advertising a write-cache, write-caches are
> > volatile and should a drive lose power, it could lead to data
> > corruption.  Enterprise disks tend to have battery backed write
> > caches to prevent this.
> > 
> > In the setup you're emulating, the host is acting as a giant write
> > cache.  Should your host fail, you can get data corruption.
> 
> Hi Anthony. I suspected my post might spark an interesting discussion!
> 
> Before considering anything like this, we did quite a bit of testing with
> OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
> power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
> NTFS filesystems despite these efforts.
> 
> Is your claim here that:-
> 
>   (a) qemu doesn't emulate a disk write cache correctly; or
> 
>   (b) operating systems are inherently unsafe running on top of a disk with
>       a write-cache; or
> 
>   (c) installations that are already broken and lose data with a physical
>       drive with a write-cache can lose much more in this case because the
>       write cache is much bigger?
> 
> Following Christoph Hellwig's patch series from last September, I'm pretty
> convinced that (a) isn't true apart from the inability to disable the
> write-cache at run-time, which is something that neither recent linux nor
> windows seem to want to do out of the box.
> 
> Given that modern SATA drives come with fairly substantial write-caches
> nowadays which operating systems leave on without widespread disaster, I
> don't really believe in (b) either, at least for the ide and scsi case.
> Filesystems know they have to flush the disk cache to avoid corruption.
> (Virtio makes the write cache invisible to the OS except in linux 2.6.32+ so
> I know virtio-blk has to be avoided for current windows and obsolete linux
> when writeback caching is on.)
> 
> I can certainly imagine (c) might be the case, although when I use strace to
> watch the IO to the block device, I see pretty regular fdatasyncs being
> issued by the guests, interleaved with the writes, so I'm not sure how
> likely the problem would be in practice. Perhaps my test guests were
> unrepresentatively well-behaved.
> 
> However, the potentially unlimited time-window for loss of incorrectly
> unsynced data is also something one could imagine fixing at the qemu level.
> Perhaps I should be implementing something like
> cache=writeback,flushtimeout=N which, upon a write being issued to the block
> device, starts an N second timer if it isn't already running. The timer is
> destroyed on flush, and if it expires before it's destroyed, a gratuitous
> flush is sent. Do you think this is worth doing? Just a simple 'while sleep
> 10; do sync; done' on the host even!
> 
> We've used cache=none and cache=writethrough, and whilst performance is fine
> with a single guest accessing a disk, when we chop the disks up with LVM and
> run even a small handful of guests, the constant seeking to serve tiny
> synchronous IOs leads to truly abysmal throughput---we've seen less than
> 700kB/s streaming write rates within guests when the backing store is
> capable of 100MB/s.
> 
> With cache=writeback, there's still IO contention between guests, but the
> write granularity is a bit coarser, so the host's elevator seems to get a
> bit more of a chance to help us out and we can at least squeeze out 5-10MB/s
> from two or three concurrently running guests, getting a total of 20-30% of
> the performance of the underlying block device rather than a total of around
> 5%.

Hi Chris,

Are you using CFQ in the host? What is the host kernel version? I am not sure
what the problem is here, but you might want to play with the IO controller and
put these guests in individual cgroups to see if you get better throughput even
with cache=writethrough.

If the problem is that sync writes from different guests get intermixed,
resulting in more seeks, the IO controller might help: these writes will now
go on different group service trees, and in CFQ we try to service requests
from one service tree at a time for a period before switching to another.

The issue will be that all the logic is in CFQ, and it works at the leaf nodes
of the storage stack, not at the LVM nodes. So first you might want to try it
with a single partitioned disk. If that helps, then it might also help with the
LVM configuration (with IO control working at the leaf nodes).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 98+ messages in thread
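
As a rough illustration of the bounded-flush idea Chris floats in the quoted
text above (his cache=writeback,flushtimeout=N suggestion, or the cruder
'while sleep 10; do sync; done'), a minimal host-side watchdog might look like
the sketch below.  The 10-second interval is an arbitrary example, and sync()
flushes every filesystem on the host; a real flushtimeout option would arm the
timer per image on write and fdatasync() only the affected file.

  #include <unistd.h>

  int main(void)
  {
          const unsigned int interval = 10;   /* seconds; arbitrary example */

          for (;;) {
                  sleep(interval);
                  sync();                     /* push all dirty page cache out */
          }
          return 0;
  }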

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:57                   ` Christoph Hellwig
@ 2010-03-17 17:06                     ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-17 17:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Webb, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Kevin Wolf

On 03/17/2010 06:57 PM, Christoph Hellwig wrote:
> On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote:
>    
>> Chris, can you carry out an experiment?  Write a program that pwrite()s
>> a byte to a file at the same location repeatedly, with the file opened
>> using O_SYNC.  Measure the write rate, and run blktrace on the host to
>> see what the disk (/dev/sda, not the volume) sees.  Should be a (write,
>> flush, write, flush) per pwrite pattern or similar (for writing the data
>> and a journal block, perhaps even three writes will be needed).
>>
>> Then scale this across multiple guests, measure and trace again.  If
>> we're lucky, the flushes will be coalesced, if not, we need to work on it.
>>      
> As the person who has written quite a bit of the current O_SYNC
> implementation and also reviewed the rest of it I can tell you that
> those flushes won't be coalesced.  If we always rewrite the same block
> we do the cache flush from the fsync method and there is nothing
> to coalesce it with there.  If you actually do modify metadata (e.g. by
> using the new real O_SYNC instead of the old one that always was O_DSYNC
> that I introduced in 2.6.33 but that isn't picked up by userspace yet)
> you might hit a very limited transaction merging window in some
> filesystems, but it's generally very small for a good reason.  If it
> were too large we'd make the one in progress wait for I/O in another just
> because we might expect transactions to coalesce later.  There's been
> some long discussion about that fsync transaction batching tuning
> for ext3 a while ago.
>    

I definitely don't expect flush merging for a single guest, but for 
multiple guests there is certainly an opportunity for merging.  Most 
likely we don't take advantage of it and that's one of the problems.  
Copying data into pagecache so that we can merge the flushes seems like 
a very unsatisfactory implementation.




-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 98+ messages in thread
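
A minimal version of the rewrite test Avi describes in the quoted text might
look like the sketch below: it pwrite()s one byte at the same offset in a loop
on a file opened with O_SYNC and reports the rate, leaving blktrace to be run
separately against the host disk.  The file name, iteration count, and the
choice of O_SYNC rather than O_DSYNC are arbitrary; on older glibc, link with
-lrt for clock_gettime().

  #include <fcntl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          const int iterations = 1000;          /* arbitrary */
          struct timespec start, end;
          char byte = 0;
          double secs;
          int fd, i;

          if (argc < 2)
                  return 1;
          fd = open(argv[1], O_WRONLY | O_CREAT | O_SYNC, 0644);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          clock_gettime(CLOCK_MONOTONIC, &start);
          for (i = 0; i < iterations; i++)
                  if (pwrite(fd, &byte, 1, 0) != 1) {   /* same offset each time */
                          perror("pwrite");
                          return 1;
                  }
          clock_gettime(CLOCK_MONOTONIC, &end);
          secs = (end.tv_sec - start.tv_sec) +
                 (end.tv_nsec - start.tv_nsec) / 1e9;
          printf("%d O_SYNC rewrites in %.2fs (%.1f/s)\n",
                 iterations, secs, iterations / secs);
          close(fd);
          return 0;
  }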

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 17:05               ` Vivek Goyal
@ 2010-03-17 19:11                 ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-17 19:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Anthony Liguori, Avi Kivity, balbir, KVM development list,
	Rik van Riel, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> Are you using CFQ in the host? What is the host kernel version? I am not sure
> what the problem is here, but you might want to play with the IO controller and
> put these guests in individual cgroups to see if you get better throughput even
> with cache=writethrough.

Hi. We're using the deadline IO scheduler on 2.6.32.7. We got better
performance from deadline than from cfq when we last tested, which was
admittedly around the 2.6.30 timescale so is now a rather outdated
measurement.

> If the problem is that sync writes from different guests get intermixed,
> resulting in more seeks, the IO controller might help: these writes will now
> go on different group service trees, and in CFQ we try to service requests
> from one service tree at a time for a period before switching to another.

Thanks for the suggestion: I'll have a play with this. I currently use
/sys/kernel/uids/N/cpu_share with one UID per guest to divide up the CPU
between guests, but this could just as easily be done with a cgroup per
guest if a side-effect is to provide a hint about IO independence to CFQ.

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread
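
A rough sketch of the per-guest grouping Vivek suggests is below.  It assumes
a cgroup-v1 blkio hierarchy is already mounted; the /cgroup/blkio mount point,
the group name, and the blkio.weight value are example assumptions, and CFQ
group scheduling has to be available in the host kernel for the weight to mean
anything.

  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/types.h>

  /* Write a small string into a cgroup control file. */
  static int write_str(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fprintf(f, "%s\n", val);
          return fclose(f);
  }

  int main(int argc, char **argv)
  {
          char dir[256], path[320];

          if (argc < 3)
                  return 1;
          /* argv[1]: group name, e.g. "guest42"; argv[2]: qemu-kvm pid */
          snprintf(dir, sizeof(dir), "/cgroup/blkio/%s", argv[1]);
          mkdir(dir, 0755);                            /* create the group */
          snprintf(path, sizeof(path), "%s/blkio.weight", dir);
          write_str(path, "500");                      /* example weight */
          snprintf(path, sizeof(path), "%s/tasks", dir);
          return write_str(path, argv[2]) ? 1 : 0;     /* move the guest in */
  }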

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-16  9:05               ` Avi Kivity
@ 2010-03-19  7:23                 ` Dave Hansen
  -1 siblings, 0 replies; 98+ messages in thread
From: Dave Hansen @ 2010-03-19  7:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Tue, 2010-03-16 at 11:05 +0200, Avi Kivity wrote:
> > Not really.  In many cloud environments, there's a set of common 
> > images that are instantiated on each node.  Usually this is because 
> > you're running a horizontally scalable application or because you're 
> > supporting an ephemeral storage model.
> 
> But will these servers actually benefit from shared cache?  So the 
> images are shared, they boot up, what then?
> 
> - apache really won't like serving static files from the host pagecache
> - dynamic content (java, cgi) will be mostly in anonymous memory, not 
> pagecache
> - ditto for application servers
> - what else are people doing?

Think of an OpenVZ-style model where you're renting out a bunch of
relatively tiny VMs and they're getting used pretty sporadically.  They
either have relatively little memory, or they've been ballooned down to
a pretty small footprint.

The more you shrink them down, the more similar they become.  You'll end
up having things like init, cron, apache, bash and libc start to
dominate the memory footprint in the VM.

That's *certainly* a case where this makes a lot of sense.

-- Dave


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-17 16:27                 ` Chris Webb
@ 2010-03-22 21:04                   ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-22 21:04 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Chris Webb <chris@arachsys.com> writes:

> Okay. What I was driving at in describing these systems as 'already broken'
> is that they will already lose data (in this sense) if they're run on bare
> metal with normal commodity SATA disks with their 32MB write caches on. That
> configuration surely describes the vast majority of PC-class desktops and
> servers!
> 
> If I understand correctly, your point here is that the small cache on a real
> SATA drive gives a relatively small time window for data loss, whereas the
> worry with cache=writeback is that the host page cache can be gigabytes, so
> the time window for unsynced data to be lost is potentially enormous.
> 
> Isn't the fix for that just forcing periodic sync on the host to bound-above
> the time window for unsynced data loss in the guest?

For the benefit of the archives, it turns out the simplest fix for this is
already implemented as a vm sysctl in linux. Set vm.dirty_bytes to 32<<20,
and the size of dirty page cache is bounded above by 32MB, so we are
simulating exactly the case of a SATA drive with a 32MB writeback-cache.

Unless I'm missing something, the risk to guest OSes in this configuration
should therefore be exactly the same as the risk from running on normal
commodity hardware with such drives and no expensive battery-backed RAM.

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread
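
For completeness, the bound Chris describes can be applied with
"sysctl -w vm.dirty_bytes=33554432" (or an /etc/sysctl.conf entry), or by
writing /proc/sys/vm/dirty_bytes directly as in the sketch below.  The 32MB
figure just mirrors the drive-cache analogy above; vm.dirty_bytes is the
byte-valued counterpart of vm.dirty_ratio, and whichever of the two is written
last is the one in effect.

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/proc/sys/vm/dirty_bytes", "w");

          if (!f) {
                  perror("dirty_bytes");
                  return 1;
          }
          fprintf(f, "%ld\n", 32L << 20);   /* cap dirty page cache at 32MB */
          return fclose(f) ? 1 : 0;
  }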

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-22 21:04                   ` Chris Webb
@ 2010-03-22 21:07                     ` Avi Kivity
  -1 siblings, 0 replies; 98+ messages in thread
From: Avi Kivity @ 2010-03-22 21:07 UTC (permalink / raw)
  To: Chris Webb
  Cc: Anthony Liguori, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On 03/22/2010 11:04 PM, Chris Webb wrote:
> Chris Webb<chris@arachsys.com>  writes:
>
>    
>> Okay. What I was driving at in describing these systems as 'already broken'
>> is that they will already lose data (in this sense) if they're run on bare
>> metal with normal commodity SATA disks with their 32MB write caches on. That
>> configuration surely describes the vast majority of PC-class desktops and
>> servers!
>>
>> If I understand correctly, your point here is that the small cache on a real
>> SATA drive gives a relatively small time window for data loss, whereas the
>> worry with cache=writeback is that the host page cache can be gigabytes, so
>> the time window for unsynced data to be lost is potentially enormous.
>>
>> Isn't the fix for that just forcing periodic sync on the host to bound-above
>> the time window for unsynced data loss in the guest?
>>      
> For the benefit of the archives, it turns out the simplest fix for this is
> already implemented as a vm sysctl in linux. Set vm.dirty_bytes to 32<<20,
> and the size of dirty page cache is bounded above by 32MB, so we are
> simulating exactly the case of a SATA drive with a 32MB writeback-cache.
>
> Unless I'm missing something, the risk to guest OSes in this configuration
> should therefore be exactly the same as the risk from running on normal
> commodity hardware with such drives and no expensive battery-backed RAM.
>    

A host crash will destroy your data.  If  your machine is connected to a 
UPS, only a firmware crash can destroy your data.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
  2010-03-22 21:07                     ` Avi Kivity
@ 2010-03-22 21:10                       ` Chris Webb
  -1 siblings, 0 replies; 98+ messages in thread
From: Chris Webb @ 2010-03-22 21:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, balbir, KVM development list, Rik van Riel,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Avi Kivity <avi@redhat.com> writes:

> On 03/22/2010 11:04 PM, Chris Webb wrote:
>
> >Unless I'm missing something, the risk to guest OSes in this configuration
> >should therefore be exactly the same as the risk from running on normal
> >commodity hardware with such drives and no expensive battery-backed RAM.
> 
> A host crash will destroy your data.  If  your machine is connected
> to a UPS, only a firmware crash can destroy your data.

Yes, that's a good point: in this configuration a host crash is equivalent
to a power failure rather than an OS crash in terms of data loss.

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2010-03-22 21:10 UTC | newest]

Thread overview: 98+ messages
2010-03-15  7:22 [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter Balbir Singh
2010-03-15  7:22 ` Balbir Singh
2010-03-15  7:48 ` Avi Kivity
2010-03-15  7:48   ` Avi Kivity
2010-03-15  8:07   ` Balbir Singh
2010-03-15  8:07     ` Balbir Singh
2010-03-15  8:27     ` Avi Kivity
2010-03-15  8:27       ` Avi Kivity
2010-03-15  9:17       ` Balbir Singh
2010-03-15  9:17         ` Balbir Singh
2010-03-15  9:27         ` Avi Kivity
2010-03-15  9:27           ` Avi Kivity
2010-03-15 10:45           ` Balbir Singh
2010-03-15 10:45             ` Balbir Singh
2010-03-15 18:48           ` Anthony Liguori
2010-03-15 18:48             ` Anthony Liguori
2010-03-16  9:05             ` Avi Kivity
2010-03-16  9:05               ` Avi Kivity
2010-03-19  7:23               ` Dave Hansen
2010-03-19  7:23                 ` Dave Hansen
2010-03-15 20:23       ` Chris Webb
2010-03-15 20:23         ` Chris Webb
2010-03-15 23:43         ` Anthony Liguori
2010-03-15 23:43           ` Anthony Liguori
2010-03-16  0:43           ` Christoph Hellwig
2010-03-16  0:43             ` Christoph Hellwig
2010-03-16  1:27             ` Anthony Liguori
2010-03-16  1:27               ` Anthony Liguori
2010-03-16  8:19               ` Christoph Hellwig
2010-03-16  8:19                 ` Christoph Hellwig
2010-03-17 15:14           ` Chris Webb
2010-03-17 15:14             ` Chris Webb
2010-03-17 15:55             ` Anthony Liguori
2010-03-17 15:55               ` Anthony Liguori
2010-03-17 16:27               ` Chris Webb
2010-03-17 16:27                 ` Chris Webb
2010-03-22 21:04                 ` Chris Webb
2010-03-22 21:04                   ` Chris Webb
2010-03-22 21:07                   ` Avi Kivity
2010-03-22 21:07                     ` Avi Kivity
2010-03-22 21:10                     ` Chris Webb
2010-03-22 21:10                       ` Chris Webb
2010-03-17 16:27               ` Balbir Singh
2010-03-17 16:27                 ` Balbir Singh
2010-03-17 17:05             ` Vivek Goyal
2010-03-17 17:05               ` Vivek Goyal
2010-03-17 19:11               ` Chris Webb
2010-03-17 19:11                 ` Chris Webb
2010-03-16  3:16         ` Balbir Singh
2010-03-16  3:16           ` Balbir Singh
2010-03-16  9:17         ` Avi Kivity
2010-03-16  9:17           ` Avi Kivity
2010-03-16  9:54           ` Kevin Wolf
2010-03-16  9:54             ` Kevin Wolf
2010-03-16 10:16             ` Avi Kivity
2010-03-16 10:16               ` Avi Kivity
2010-03-16 10:26           ` Christoph Hellwig
2010-03-16 10:26             ` Christoph Hellwig
2010-03-16 10:36             ` Avi Kivity
2010-03-16 10:36               ` Avi Kivity
2010-03-16 10:44               ` Christoph Hellwig
2010-03-16 10:44                 ` Christoph Hellwig
2010-03-16 11:08                 ` Avi Kivity
2010-03-16 11:08                   ` Avi Kivity
2010-03-16 14:27                   ` Balbir Singh
2010-03-16 14:27                     ` Balbir Singh
2010-03-16 15:59                     ` Avi Kivity
2010-03-16 15:59                       ` Avi Kivity
2010-03-17  8:49                   ` Christoph Hellwig
2010-03-17  8:49                     ` Christoph Hellwig
2010-03-17  9:10                     ` Avi Kivity
2010-03-17  9:10                       ` Avi Kivity
2010-03-17 15:24           ` Chris Webb
2010-03-17 15:24             ` Chris Webb
2010-03-17 16:22             ` Avi Kivity
2010-03-17 16:22               ` Avi Kivity
2010-03-17 16:40               ` Avi Kivity
2010-03-17 16:40                 ` Avi Kivity
2010-03-17 16:47                 ` Chris Webb
2010-03-17 16:47                   ` Chris Webb
2010-03-17 16:53                   ` Avi Kivity
2010-03-17 16:53                     ` Avi Kivity
2010-03-17 16:58                     ` Christoph Hellwig
2010-03-17 16:58                       ` Christoph Hellwig
2010-03-17 17:03                       ` Avi Kivity
2010-03-17 17:03                         ` Avi Kivity
2010-03-17 16:57                 ` Christoph Hellwig
2010-03-17 16:57                   ` Christoph Hellwig
2010-03-17 17:06                   ` Avi Kivity
2010-03-17 17:06                     ` Avi Kivity
2010-03-17 16:52               ` Christoph Hellwig
2010-03-17 16:52                 ` Christoph Hellwig
2010-03-17 17:02                 ` Avi Kivity
2010-03-17 17:02                   ` Avi Kivity
2010-03-15 15:46 ` Randy Dunlap
2010-03-15 15:46   ` Randy Dunlap
2010-03-16  3:21   ` Balbir Singh
2010-03-16  3:21     ` Balbir Singh
